Anti-Spam Techniques In PHP - Part 2
Provided by Quentin Zervaas
Introduction
This short series of articles provides a few simple techniques for protecting yourself and your web site from spammers.
It does this from two perspectives:
- Protecting people (including yourself) who post to your web site from being spammed (Part 1)
- Protecting your web site from being spammed (this article)
This is the second article in the series, and covers the issue of protecting your web site from spammers.
In this article we cover a number of ways of overcoming spammers, including:
1. CAPTCHA
2. Manual approval (moderation)
3. Text filters (blocking URLs)
4. Email validation
Protecting your web site
Introduction To CAPTCHA
This is the main method we're going to look at in this article for preventing spam. CAPTCHA is an acronym which stands for Completely Automated Public Turing test to tell Computers and Humans Apart.
You've probably seen one of these tests before - it's basically an image that contains distorted text (and potentially a background image) to prevent automated tools (such as spam bots) from registering at a web site or posting comments. Users are required to enter the text that appears in these images in order to proceed past the form they are completing.
How you can use CAPTCHA
You can use CAPTCHA in your web site in exactly the same way as I described above. All you need to do is add a CAPTCHA image and a text input to the form you're trying to protect. This may be used in the following situations:
- Protecting a registration form to stop automated registrations on your site
- Protecting your publicly-editable Wiki from being overrun by automated posts
- Protecting your blog from comments posted by spam bots
How this is implemented
Implementing CAPTCHA using PHP is fairly straightforward. We will be doing this using PEAR's Text_CAPTCHA class and PHP's sessions. The basic algorithm goes something like this:
- In the form:
  - Generate a phrase for the user to enter, and write it to the session
  - Place an HTML img tag to call the CAPTCHA script (which we will create)
  - Place a text input box where the user can input the phrase
- Create the script referenced in the image tag, which:
  - Creates the Text_CAPTCHA object
  - Fetches the phrase from the user's session
  - Outputs the image
- In the form processor:
  - Check that the submitted value matches the value in the session
Implementing A PHP CAPTCHA Solution
We will now implement the actual PHP code to do this.
The first thing that needs to be done is to install the Text_CAPTCHA class. At time of writing, this class also depends on Text_Password and Image_Text, so if you don't already have these they must also be installed.
In order to install these via the Linux command line, you should use commands similar to the following:
# pear install -f Text_CAPTCHA
# pear install -f Image_Text
The captcha.php file
This is the script that generates the CAPTCHA image. This script assumes that a phrase has already been set in the session.
Additionally, you must have a TrueType font present in the same directory as captcha.php. This is the font used to write the secret phrase. If you use a Windows computer, you can find a bunch of these in your C:\Windows\Fonts directory.
The captcha.php file looks like this, from start to end:
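<?php
require_once('Text/CAPTCHA.php');
session_start();
// Use the phrase stored by the form; fall back to 'Error' if none was set
$phrase = isset($_SESSION['captcha']) ? $_SESSION['captcha'] : 'Error';
// Font size and TrueType font file used to draw the phrase
$options = array('font_size' => 24, 'font_file' => 'georgia.ttf');
// Create the image CAPTCHA and initialise it with the image size, phrase and options
$cap = Text_CAPTCHA::factory('Image');
$cap->init(120, 60, $phrase, $options);
// Send the image to the browser as a PNG
header('Content-type: image/png');
echo $cap->getCAPTCHAAsPNG();
?>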
Note that this doesn't do any error checking, so you may want to improve this. Additionally, the text used if no phrase was found is "Error", so if this is the case, then this text will appear in the image, and the user will probably never be able to complete submission of the form.
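As a rough idea of the kind of error checking captcha.php could do before it draws anything, here is a minimal sketch. The function name captchaPreflightError() is my own invention (it is not part of Text_CAPTCHA), and it assumes you pass in the session array and the full path to your font file.

```php
<?php
// Hypothetical pre-flight check for captcha.php (not part of Text_CAPTCHA).
// Returns an error message, or null when it is safe to go on and render the image.
function captchaPreflightError(array $session, string $fontFile): ?string
{
    // The form page should have stored a phrase before the image is requested
    if (!isset($session['captcha']) || $session['captcha'] === '') {
        return 'No CAPTCHA phrase in session';
    }
    // The TrueType font has to exist and be readable by the web server
    if (!is_readable($fontFile)) {
        return 'Font file not readable: ' . basename($fontFile);
    }
    return null;
}

// Possible use at the top of captcha.php, after session_start():
//   $error = captchaPreflightError($_SESSION, __DIR__ . '/georgia.ttf');
//   if ($error !== null) { http_response_code(500); error_log($error); exit; }
```

Failing with an error status like this means the user sees a broken image rather than the word "Error" as a phrase they can never match.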
Now, here's the PHP code for the form and form processor. Note that the whole thing has been over-simplified, and real form-processing would have much more to it than this.
<?php
session_start();
if (isset($_POST['process'])) {
    // The phrase must already be in the session from the initial page view
    if (!isset($_SESSION['captcha']))
        die('Form accessed incorrectly');
    // Use a strict comparison so that numeric-looking strings are not compared as numbers
    if (isset($_POST['captcha']) && $_POST['captcha'] === $_SESSION['captcha']) {
        die('CAPTCHA text matched! Phrase was '.$_SESSION['captcha']);
    } else {
        // Escape the submitted value before echoing it back
        $entered = isset($_POST['captcha']) ? htmlspecialchars($_POST['captcha']) : '';
        die('CAPTCHA text did not match. Phrase was '.$_SESSION['captcha'].
            ', you entered '.$entered);
    }
} else {
    // generate a new CAPTCHA phrase
    $_SESSION['captcha'] = substr(md5(uniqid()), 0, 4);
}
?>
<html>
<head>
<title>CAPTCHA Demo</title>
</head>
<body>
<form method="post" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
<img src="captcha.php" /><br />
Enter phrase: <input type="text" name="captcha" /><br />
<input type="submit" name="process" value="Submit" />
</form>
</body>
</html>
In this code, we're just generating a random string of text using MD5 and uniqid() for our phrase. Since uniqid() is based on the current time, the phrase is not strongly random. There is plenty of scope for improvement or change here, although this will do the trick.
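One possible improvement is sketched below. It assumes PHP 7 or later (for random_int()), and the function name generateCaptchaPhrase() and the choice of alphabet are my own, not something Text_CAPTCHA provides.

```php
<?php
// Hypothetical helper (not part of Text_CAPTCHA): builds a random phrase
// from an alphabet that leaves out easily confused characters.
function generateCaptchaPhrase(int $length = 5): string
{
    // No 0/O or 1/I/L, to reduce misreads in a distorted image
    $alphabet = 'ABCDEFGHJKMNPQRSTUVWXYZ23456789';
    $max = strlen($alphabet) - 1;
    $phrase = '';
    for ($i = 0; $i < $length; $i++) {
        // random_int() draws from a cryptographically secure source (PHP 7+)
        $phrase .= $alphabet[random_int(0, $max)];
    }
    return $phrase;
}
```

In the form code you would then store generateCaptchaPhrase() in $_SESSION['captcha'] instead of the MD5 substring. Because the phrase is upper case, you may want to compare it case-insensitively so that users can type it either way.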
That's all there is to it! The images generated by Text_CAPTCHA are somewhat straightforward. There are other implementations on the Internet for creating CAPTCHA images, but essentially they all do the same thing and the algorithm I've provided above will work with all of them.
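One thing the demo form processor above does not do is throw the phrase away once it has been checked, which means a bot that solves one image could keep re-submitting the same answer. Here is a minimal sketch of a one-time check, assuming PHP 5.6 or later (for hash_equals()). The function name verifyCaptcha() is hypothetical.

```php
<?php
// Hypothetical helper: checks a submitted phrase against the one stored
// in the session array, then discards the stored phrase.
function verifyCaptcha(array &$session, $submitted): bool
{
    if (!isset($session['captcha']) || !is_string($submitted)) {
        return false;
    }
    $expected = (string) $session['captcha'];
    // Discard the phrase whether or not it matched, so each image is good for one attempt
    unset($session['captcha']);
    // Case-insensitive comparison that ignores stray whitespace around the input
    return hash_equals(strtolower($expected), strtolower(trim($submitted)));
}
```

You could call this as verifyCaptcha($_SESSION, isset($_POST['captcha']) ? $_POST['captcha'] : null). After a failed attempt the form needs to be shown again with a freshly generated phrase.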
Drawbacks Of CAPTCHA
While CAPTCHA is very useful and widely adopted across the Internet, there are some drawbacks to using it.
Probably the biggest issue is accessibility. People who are vision impaired may have great difficulty using your CAPTCHA forms, so you should provide alternatives. At minimum, describe the CAPTCHA and how it works, and offer a contact form so that people can ask you to complete the form for them.
In fact, even people who are not vision impaired may have difficulty using CAPTCHA. Sometimes the generated images are just really hard to read. So make sure the images vary between requests (e.g. even if the phrase stays the same, the noise image and/or text placement changes). Also tell the user that they can refresh the page so the image is recreated and possibly easier to read.
One last thing to be aware of is that CAPTCHA is not totally foolproof. People have written bots that do OCR (Optical Character Recognition) in order to foil these tests. Obviously the more complex the CAPTCHA image becomes, the harder it is to do text recognition on it. There's a bit of information on this in Breaking a Visual CAPTCHA at the UC Berkeley Computer Vision Group.
Other Methods Of Spam Prevention
We've had a look at CAPTCHA to prevent spam, but it's not the only way. There are some other ways to help combat this, but be aware that this list is by no means exhaustive. In fact, it may be a hybrid of these methods that works best for you.
Manual Approval / Moderation
The way this works is that for every person who signs up (or for every comment that is posted, etc.), a trusted person (e.g. the site administrator) checks the new account (or comment) and manually approves it.
The biggest drawback with this method is that there's a lot more work involved. It's important to spend a bit more time in the development of the approval system in order to make your life easier down the track:
- Have a script in your administration section which lists all the items awaiting approval. Each item could have an Approve/Delete checkbox next to it so that the items can be processed in bulk.
- Email each new item to yourself when it is submitted, with a link you can click to approve or delete that item.
From an implementation point-of-view, all this really requires is a single extra field in your SQL table which indicates whether or not an item is approved. You may also want to track the date/time items were approved, but this depends on your requirements.
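To make the extra field concrete, here is a minimal sketch using PDO with an in-memory SQLite database. It assumes the pdo_sqlite extension is available. The table name, column names and function names are purely illustrative, so adapt them to your own schema and database.

```php
<?php
// Assumes the pdo_sqlite extension; table and column names are illustrative only
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    body TEXT NOT NULL,
    approved INTEGER NOT NULL DEFAULT 0,
    approved_at TEXT
)');

// New submissions start out unapproved
$insert = $db->prepare('INSERT INTO comments (body) VALUES (?)');
$insert->execute(['First comment']);
$insert->execute(['Second comment']);

// Bulk approval, as a checkbox list in the admin section might submit it
function approveComments(PDO $db, array $ids): void
{
    $stmt = $db->prepare(
        "UPDATE comments SET approved = 1, approved_at = datetime('now') WHERE id = ?"
    );
    foreach ($ids as $id) {
        $stmt->execute([(int) $id]);
    }
}

// Public pages only ever select approved rows
function fetchApproved(PDO $db): array
{
    return $db->query('SELECT id, body FROM comments WHERE approved = 1 ORDER BY id')
              ->fetchAll(PDO::FETCH_ASSOC);
}

approveComments($db, [2]);
```

After this runs, fetchApproved($db) returns only the second comment, while the first stays hidden until somebody approves it.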
Text Filters
If you're trying to protect your blog or Wiki from unwanted spam, you can use a text filter. This basically checks for "spammy" words and blocks the post if they are found. WordPress has built-in functionality for this, and as such they have a nice list of common spam words that you can check for.
Just be aware that people may create posts that legitimately use these words - so make sure you know your topic.
Blog comment spam is useless unless it has links in it, since the spammers are aiming to improve their backlinks for Search Engine Optimization purposes. As such, you could also automatically approve posts that have no links in them, but put posts that have links in an approval queue (this is a hybrid of this method and the previous method).
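The hybrid approach could look something like the sketch below. The function name classifyPost() is hypothetical, the link detection is deliberately simple, and the word list you pass in is up to you (the words used in the usage note are placeholders, not the WordPress list).

```php
<?php
// Hypothetical classifier: returns 'block', 'queue' or 'approve'
function classifyPost(string $text, array $spamWords): string
{
    foreach ($spamWords as $word) {
        // Whole-word, case-insensitive match against each spam word
        if (preg_match('/\b' . preg_quote($word, '/') . '\b/i', $text)) {
            return 'block';
        }
    }
    // Anything that looks like a link goes to the moderation queue
    if (preg_match('~(https?://|www\.|<a\s)~i', $text)) {
        return 'queue';
    }
    // No spam words and no links, so publish straight away
    return 'approve';
}
```

For example, with a word list of array('viagra', 'casino'), a comment containing a URL is queued for approval, a comment mentioning "casino" is blocked, and a plain "Nice article, thanks!" is approved straight away.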
Email Validation
The final method I'll briefly cover is to use email validation. This is a similar idea to CAPTCHA but works somewhat differently. This method involves the user entering their email address with their comment submission or account registration. Then, in order for the account to be approved, they must click a link that is sent to their email address. This ensures they are using a valid email address.
Note that this method is probably very easy for spammers to defeat with a bot, but at least they'll need a working email address to do so. Just make sure you then make them validate their new email address if they ever want to change it.
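The link you send needs to contain a value that can't be guessed. A minimal sketch of generating and checking such a token follows, assuming PHP 7 or later (for random_bytes()). The function names are hypothetical, and storing the hash and sending the email are left up to you.

```php
<?php
// Hypothetical helpers. The idea is to email the raw token in the link
// and store only its hash alongside the pending account or comment.
function createValidationToken(): array
{
    $token = bin2hex(random_bytes(16)); // 32 hex characters
    return ['token' => $token, 'hash' => hash('sha256', $token)];
}

// Returns true if the token from the clicked link matches the stored hash
function tokenMatches(string $storedHash, string $submittedToken): bool
{
    return hash_equals($storedHash, hash('sha256', $submittedToken));
}
```

The emailed link would then carry the token as a query parameter (for example validate.php?token=..., where validate.php is a made-up script name), and the script behind it would call tokenMatches() before marking the item as approved.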
Conclusion
In this article, we covered a few ways of protecting your web site from spammers. The methods were covered fairly briefly, but hopefully give you a few ideas for stopping (or at least minimizing) spam on your web site.
CAPTCHA is a very effective method that is widely used, and you should strongly consider it as a solution to the spamming problem - just make sure you're aware of its drawbacks.
Resources
- The CAPTCHA project
- Securimage (captcha script)
- Wikipedia entry for CAPTCHA
- PEAR's Text_CAPTCHA
- WordPress list of common spam words