Fighting and Avoiding ‘Spam,’ or Unsolicited Commercial E-mail

What is spam?
Types of spam
Keeping Spam Out of Your Mailbox
A few good tips to avoid having your address added to a Spam list
Help! I'm receiving massive amounts of Spam, what can I do?
Protecting internet forums, guestbooks, and blogs against spambots
Using a honey pot

What is spam?

If you don't know what spam is (aside from a brand of canned meat when spelled “SPAM”), it is a general name for those irritating advertising mails with subjects like:
E-mail marketing works!, Credit card problems? The solution is RIGHT HERE, Generic Viagra!, UNIVERSITY DIPLOMAS, or Are You Getting the Best Rate on Your Mortgage?
Synonyms are ‘junk mail’ and ‘UCE’ (Unsolicited Commercial E-mail). If you still don't know what I am talking about, count yourself lucky and hope that you'll never receive any, because once you have received one, you can be pretty sure that thousands will follow soon…

The term “spam” originates from a sketch in Monty Python's Flying Circus in which the word is repeated countless times. The first major case of ‘spamming’ was in April 1994, when the same advertising message was sent to thousands of Usenet newsgroups. After this incident, the term became more and more commonly used for unwanted commercial mail. When I first started using the internet back in 1996, it was mostly spam-free. I have seen the rise of spam over the years and it was not a pretty sight.

The things advertised in spam mails range from mortgages to medical products. However, a vast number of these products are either cheap imitations of the thing they are supposed to be, or they don't even exist. So if you do pay, you are likely to just lose money and get either nothing or total rubbish in return. Taking a drug ordered via a spam message is gambling with your health. If there's one thing you should remember from this page, it is this: consider all spam mails total garbage that deserves no more attention than is required to get rid of it.

Aside from all this, and of course from being anywhere from slightly to extremely irritating, the largest problem with spam is that it causes an unbelievable amount of useless network traffic. Spammers send their garbage to millions of addresses, in the hope that at least a few of those belong to people dumb enough to buy their product. The rest delete the mail, bounce it back, or whatever. All of that is wasted network bandwidth, and network bandwidth is not free.

The problem with spam is that it turns a net profit even when just one single person responds to it in a positive way. The reason is that sending spam doesn't cost a thing, so the worst-case scenario is breaking even. A relatively simple proposal to solve this would be to charge people for sending e-mail. Even a puny 1 (Euro-/Dollar-)cent per e-mail would already discourage sending a million mails: that run would cost 10,000 Euros or Dollars, while only a few percent of those mails will generate any revenue. Normal people send a few dozen mails per day at most, so the costs for them would be negligible. Although ideas like this are very interesting, it is unfortunately very hard to implement them in a watertight way. The only thing that can be done is to make the sending of spam illegal. However, this will only work if there is strict, uniform regulation across the entire world.

Remember, spam only exists because there are people who respond to it. Spread the news and tell everyone you know to ignore spam mails. Replying to spam in any way is asking for more. Ignoring spam is what makes it die.

Types of spam

Nowadays, there are multiple kinds of unwanted e-mail. While originally they were mostly just attempts to lure people to a site to buy things, soon other, more malicious types emerged. In the rest of this text, I'll use the word ‘spam’ for all types, but here is an overview of the correct names for each type.

Spam: the original, mostly just containing text and/or images and links, in an attempt to make you buy something. Nowadays the ‘text’ in these mails often contains lots of garbage to try to thwart spam-filtering software, sometimes making the mail itself unreadable…
Scam: these are attempts to lure you into depositing money into someone's bank account. These mails always start with some story about a bankruptcy, a military coup, or whatever, and sometimes are written in ALL CAPS. The author claims that he needs to transfer very large amounts of money and needs multiple foreign accounts to do this. As a reward, he offers you a percentage of the sum. But the catch is that you first need to deposit an amount yourself to cover some of his costs. Of course, if you're naive enough to do this, you'll never hear anything from him again! In the first versions of scam mails, the supposed author was always someone from Nigeria, which is why this kind of spam is often called “Nigerian scam” or “419 scam,” 419 being the section of the Nigerian Criminal Code that deals with fraud. One of the newer forms of scam is a threat in which the sender pretends to be a hired assassin whose job is to kill you unless you pay him more than his imaginary contractor. To avoid receiving crap like this, do not spread personal data around on the internet.
Phishing: this is even worse than the previous type, because again the objective is to steal money from you, but this time without you knowing it. Technically, this is a type of fraud, but there are currently few places in the world where phishing is actually punishable by law. I hope this changes soon.
These mails are made to look like official mails from eBay, PayPal, online banks, or anything else involving money. They ask you to confirm the login details and/or credit card data of your online account. A link is provided to a site which has been made to look exactly like the real eBay/PayPal/… site, except that the URL will be different. Of course, if you're naive enough to fill in your data, it is sent to the spammer, who will pillage your account as soon as possible.
It is easy to recognise these mails. Not visually, but conceptually: they are all fake. PayPal, eBay, or online banks will never ask you to verify any of your account details by e-mail, and will never put links in their mails which lead to pages asking you to type your password or credit card number! Also, any real mail sent by these companies will contain your real name, as registered in your account, and not just Dear PayPal/eBay member or Dear <(first part of) your e-mail address>. Anything that does not contain your real name is to be obliterated without hesitation! If you're still in doubt whether such a mail is real or not, there is a safe way to find out: just go to the company's website manually, i.e. by typing its URL yourself or by using your bookmarks. Then, log in. If an update of your data is required, the site will ask you.
Targeted scam: technically this is just a subtype of the more general scam concept, but it deserves its own category. These mails are specifically crafted for and targeted at a certain e-mail address and may even address you by your real name. The idea is to lure the recipient into buying a service that doesn't actually exist. Domain names are a good example. WHOIS information generally contains a contact e-mail address, which can easily be retrieved by scammers. Suppose you are the owner of the domain “example.com”. The scammer sends a mail to your contact address that starts like We are a domain name registration center… and claims that someone else wants to register the same domain name under different suffixes, e.g. “example.hk”. The idea is that you can buy those domains instead, to prevent them from being taken by someone else. Of course, the sender of those mails is probably no registrar at all; he'll just take your money and disappear. Remember, this is just one example. The same can be applied to other scenarios, so never underestimate the creativity of the average scammer. The trend of spreading one's personal information around on ‘social networking’ sites offers many new opportunities to create this kind of ‘targeted scam’.
Fake stock alerts: this is one of the most puzzling types of spam, because these mails almost never have any links or addresses inside them. They only contain some pseudo investor information, often ‘penny stock alerts,’ but it can be any type of stock report. There seem to be two possible goals for this type of mail: 1. ‘pump and dump,’ i.e. make you buy stock in the advertised company, so that its value rises and the spammer can sell his own shares at a higher price (leaving you with a loss); 2. harass the company which is mentioned in the stock report (as a respectable company, you don't want to be associated with spam). An important message: there's no reason to believe that the stock report was sent by someone who has anything to do with the company mentioned in it. Most likely, it wasn't, so do not attack a company unless you can prove that the spam originates from it. As with all types of spam, ignoring it is the right reaction.

Keeping Spam Out of Your Mailbox

When I first set up this page, its main purpose was to provide statistics about spam subjects and senders. The idea was to allow people to use these resources to set up mail filters in an optimal way. Nowadays, however, simple mail filters won't do because most spammers use random subjects. Moreover, collecting the statistics became intractable due to the sheer volume of spam I started receiving, so they are no longer available. What I recommend is reading the advice below and, if necessary, installing a specialised spam filter like SpamAssassin or a Bayesian filter.

A few good tips to avoid having your address added to a Spam list

The short list:

Create a dedicated freeMail address.
Use your freeMail address at risky sites.
Never reply to spam!
Use Bcc: and remove addresses when forwarding mails.
Do not litter your address around on newsgroups etc.
Do not make mailto: links when building websites.
Install an antivirus and anti-spyware program if you use Windows.
Do not create easy-to-guess e-mail addresses.

The long list:

Start by making a freeMail address (Hotmail, Yahoo, …) which you won't use for serious purposes. The sole purpose of this address is to deflect spam, while still letting you receive activation e-mails for certain services. Besides, most of these freeMail providers have quite an effective spam blocker too: turn it on immediately!
Don't enter your ‘good’ e-mail address on sites that don't belong to a trustworthy company, even if they claim that the address won't be shared with other companies. If you really have to fill in an address somewhere, use your freeMail address and check if there's an option like “allow us to share your address with selected partners…” If there is, UNCHECK it. Beware: these check boxes are often active by default!
If you receive a spam mail, never reply to it, even if they claim that you can unsubscribe. Replying is the worst thing you can do, especially if you show some kind of interest in the advertised product. Buying something through spam is the online equivalent of suicide. Any sign that your address is active is an invitation for spammers to quintuple the amount of junk sent to you. Only if the mail is not ‘pure’ spam, i.e. it originates from a respected company or a service you subscribed to yourself, is it generally safe to unsubscribe.
If you forward mails (with funnies, slideshows etc.) to friends, make sure to remove all previous addresses from the mail. Also, if you send the mail to many people at once, use the ‘Bcc:’ (Blind Carbon Copy) field. This will ensure that the recipients of the mail can't see who else it was sent to. Otherwise, the mail will accumulate addresses and become a goldmine for spammers. If it gets forwarded to someone whose computer is infected with a virus, or to an untrustworthy person, all people who ever received the mail risk being added to a spam list. So, ask all your correspondents to erase addresses and use Bcc.
Don't use your personal e-mail address in public newsgroups, and certainly not on message boards on websites! Use your freeMail address, use none, or ‘mutilate’ your real address in an obvious way, like “yourName_removeThisWord@yourDomain.com”. This also applies to all other kinds of webpages where you can enter an e-mail address: guestbooks, forums… Some people won't be smart enough to see and remove the extra words, but you mostly don't want to receive mail from such people anyway :-)
If you're a webmaster (i.e. you make webpages), don't use direct mailto: links, unless they point, again, to your freeMail address. I did an experiment with this by putting a (hidden!) link to a dedicated ‘bait’ address on a single page. The address received its first spam 2 weeks later and massive amounts of junk soon followed. This is due to spammers using web crawlers to harvest mailto: links.
If you do want to use a mailto: link, you can avoid some trouble by concentrating all your contact information onto one single page. Then put this page in your robots.txt file, which will prevent most search engines from indexing your e-mail address. Of course this won't protect you from filthy spambots (spam robots) which ignore the robots.txt file, unless you create a ‘honey pot’.
There are good alternatives to dumb mailto: links, like an HTML form with a CGI script which sends the form's data to your address. Because your e-mail address only appears inside the CGI script, there's no way for an outsider to detect what your address is until you reply. You can find an example of such a script here.
Another alternative is to use a piece of JavaScript which constructs the ‘mailto:’ link on-the-fly. To make this effective, the script should write your address in multiple steps so that it is fragmented in the page's source. This thwarts spambots looking for addresses in webpages. To make sure that people who have turned off JavaScript can still contact you, put something like My address is blahblah AT yahoo DOT com (replace ‘AT’ and ‘DOT’ by ‘@’ and ‘.’ respectively) in the NOSCRIPT tag. Note that this is not 100% secure: the most advanced bots might be able to interpret JavaScript.
If you just want to mention your e-mail address without a link on a page, a clever way to make it invisible for spambots is to convert it to a GIF image. But of course this is not the handiest solution from a user's point-of-view.
Get yourself a good virus scanner and spyware removal program, and make sure they are always up-to-date. Most modern viruses spread e-mail addresses around the internet as a side-effect of the way they work. Spammers will of course collect every address they receive. It is believed that some viruses are even designed specifically to gather addresses and send them to spammers. Unfortunately your address will also be distributed if it is in the address book of any infected PC, so this is not watertight protection. Of course this advice is only applicable if you are working with Windows. If you're fed up with viruses and spyware, install Linux or buy a Mac.
And now a piece of advice which you probably have never heard before: if you are creating a new mail account and you can choose the account name, do not use an ‘obvious’ name! By “obvious” I mean commonly occurring names or words (like john_smith@domain.com or superman@domain.com). To make the address less obvious, use multiple words or add extra characters like numbers or underscores. If you do use a common word for your account, you will start receiving spam even before having used your address! The reason is that spammers often add ‘likely’ addresses to their list by taking commonly used words and pasting domain names after them. If the address doesn't bounce, they are happy to have found a new potential buyer of their crappy product… So the only thing you can do if you have an obvious address and you can't change it into something less obvious, is to move on to the next section:

Help! I'm receiving massive amounts of Spam, what can I do?

Unfortunately, you will have to learn to live with the fact that you will always keep on receiving junk mail, unless you completely destroy your e-mail account and create a new one with a new address which cannot be easily guessed. There is no remedy against spam, except putting your poor mail account out of its misery by killing it.
I already said this, but I'll repeat it because it is so important: Rule N°1 when receiving spam: never reply to it! In most cases your reply will never arrive because the sender's address doesn't exist anyway. In some other cases, another innocent victim will receive your reply. In the remaining cases, where the sender does receive your reply, (s)he will be happy with the attention you have given him/her and will probably be stimulated to send more junk! It really doesn't matter what's in your reply. Those people see every incoming mail, especially insults, as a plea for more.
Also, don't bounce messages. Some programs allow you to send fake “unknown account” messages back to the sender, in the hope that the spammer will think that your address doesn't exist. Don't do this, because in most cases you are only doubling the amount of useless network traffic. Spammers will likely be able to recognise these fake bounces after a while, and then you're screwed.

If you can't afford to destroy your mail account, there are a few things you can do to recognise the inevitable junk mails, so you can delete or even filter them without wasting your time.

If you can, install a spam filter. These filters are able to recognise spam by looking at certain characteristics. There are filters based on fixed rules, and self-learning filters based on statistical principles like the Bayesian filter. The latter has to be trained by indicating which of your incoming messages are spam, and after a while it is able to sort the good from the bad by itself. It is usually also possible to use a ‘white list’ of addresses (e.g. all addresses in your address book) whose mail the filter will let through automatically. Of course, when using a filter, there's always a small risk of labelling real mail as spam, so you should let the filter redirect all ‘spam’ to a special folder, and check this folder before emptying it every week. This still involves some work, but in total it is far less effort than having to sort your mailbox yourself, because legitimate mails are mostly easy to recognise among the junk.
A few examples of such filters are SpamAssassin and the Mail program included in Mac OS X, but nowadays lots of other mail software like Mozilla Thunderbird and online services like GMail feature similar filters with quite good performance. The only thing you should definitely avoid is software advertised in spam mails!
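To give an idea of what ‘self-learning’ means here, below is a toy sketch of a Bayesian-style filter in Python. It is not how SpamAssassin or any particular mail program works internally; the class name, the tokenizer and the add-one smoothing are my own simplifying assumptions, and a real filter uses far more training data and more refined statistics.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesFilter:
    """Toy self-learning filter (hypothetical class, for illustration only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Tell the filter that this mail is 'spam' or 'ham' (legitimate)."""
        self.word_counts[label].update(tokenize(text))
        self.mail_counts[label] += 1

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocabulary = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Log prior: how common this class was among the training mails.
        score = math.log(self.mail_counts[label] / sum(self.mail_counts.values()))
        for token in tokens:
            # Add-one smoothing, so a word never seen before cannot give log(0).
            score += math.log((counts[token] + 1) / (total + vocabulary))
        return score

    def is_spam(self, text):
        """Classify a mail; the filter must have seen both classes first."""
        tokens = tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")


# Training on a handful of made-up examples, then classifying a new subject.
spam_filter = NaiveBayesFilter()
spam_filter.train("Generic Viagra! Best rate on your mortgage", "spam")
spam_filter.train("UNIVERSITY DIPLOMAS, credit card problems? The solution is right here", "spam")
spam_filter.train("Meeting at 10AM tomorrow, agenda attached", "ham")
spam_filter.train("Can you send me the holiday pictures", "ham")
print(spam_filter.is_spam("cheap viagra and mortgage rate"))
```

With only four training mails the verdicts are of course fragile; the point is merely that the filter learns word frequencies from your own labelling instead of relying on fixed rules.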
Another technique that is very effective against spam at the time of this writing is greylisting. As the name implies, ‘greylisting’ is a kind of middle ground between ‘whitelisting’ and ‘blacklisting,’ and it works as follows. When the mail server that handles your e-mail account receives a message, it will check if it has already received a mail to you, sent by the same combination of originating server and sender address. If not, it will pretend to be busy and tell the sending server to re-try in a short time. Any legitimate mail server will indeed re-send the mail after a certain time, but most quick & dirty spam servers won't bother, so the spam will never arrive. There are two disadvantages to this system, however. First, it needs to be installed on the mail server itself (you can't just install it on your own PC). Second, the first mail sent by someone ‘unknown’ to you will be delayed. This delay can be between 10 minutes and several hours. This can be especially annoying with ‘activation’ mails for forums etc.
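The decision a greylisting server makes can be sketched in a few lines of Python. This is only an illustration of the principle, not the code of any real mail server: the class name, the 300-second minimum wait and the reply texts are assumptions of mine (the reply codes merely follow the SMTP convention that 2xx means accepted and 4xx means ‘try again later’).

```python
import time


class Greylist:
    """Toy greylist keyed on (sending server, sender address, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay   # assumed minimum wait in seconds before a retry counts
        self.clock = clock           # injectable clock, which makes the class easy to test
        self.first_seen = {}         # triplet -> time of the first delivery attempt

    def check(self, server_ip, sender, recipient):
        """Return an SMTP-style reply line for this delivery attempt."""
        triplet = (server_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Remember when this combination was first seen; later attempts reuse that time.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "250 OK"
        # Unknown (or too hasty) combination: pretend to be busy.
        return "451 Temporarily unavailable, please try again later"
```

A legitimate server that retries after the waiting period gets “250 OK”, and so does every later mail from the same combination. A spam server that never retries simply never gets through.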
Avoid opening spam mails! Lots of them contain ‘bugs,’ i.e. images with a unique identifier in their URLs, enabling spammers to detect that you have opened the message. Some mail programs are eager to open messages even when you drag them to the trash, so you may need to use some tricks to prevent the mails from being displayed, like selecting multiple messages at once or simply unplugging your network cable temporarily. Newer mail programs have the option to only load images when you want to. I highly recommend turning this on.
If you do not have a filter, or for those spams that get past your filter: most spams are easy to recognise. The senders' addresses often have a similar structure. Any mail originating from an address of the form [random collection of characters]@[popular freemail domain like hotmail or yahoo].com can be deleted on sight. It would be great if mail servers would simply trash such mails automatically. If the SMTP protocol is ever updated, I would strongly suggest including a mandatory check for the existence of the sender's e-mail address at mail server level! Of course this could also help spammers in searching for valid addresses, but this hazard can be minimised, and at least it would avoid a lot of useless network traffic.
Also the subjects follow a distinctive pattern. Typical characteristics of spam subjects are: SHOUTING (uppercase letters); letters replaced by numbers, like “V1AGRA”; s p a c e s or d.o.t.s between letters; and random junk or nonsense appended to the subject, like requested website address vu xdcb or integrate warhammer rosebud multiply (often the junk is preceded by lots of spaces in an attempt to hide it). An example with all these atrocities combined:
```
What is G.E.N.ER1C V1.A.G.R.A? attentive and at the          zrku
```
Nowadays, using exotic Unicode characters that look similar to normal letters is also a popular practice. For instance, this word is written entirely in mathematical italic symbols: “ℎ𝑒𝑟𝑟𝑖𝑛𝑔.” Anything showing at least one of these characteristics is to be deleted without reading it!
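For those who prefer to let a script do the spotting, the characteristics above can be approximated with a few crude checks. The sketch below is just that, a sketch: the function names, the thresholds (70% capitals, runs of four spaces, three separated letters) and the signal labels are arbitrary choices of mine, and each check will produce false positives on some legitimate subjects.

```python
import re
import unicodedata


def has_lookalike_letters(text):
    """True if a non-ASCII letter folds to a plain ASCII letter under NFKC."""
    for char in text:
        if char.isalpha() and not char.isascii():
            folded = unicodedata.normalize("NFKC", char)
            if folded.isascii() and folded.isalpha():
                return True
    return False


def spam_subject_signals(subject):
    """Return the list of suspicious traits found in a subject line."""
    signals = []
    if has_lookalike_letters(subject):
        signals.append("lookalike-unicode")

    # Run the remaining checks on the folded text, so disguised letters count too.
    folded = unicodedata.normalize("NFKC", subject)
    letters = [c for c in folded if c.isalpha()]

    # SHOUTING: most of the letters are capitals.
    if len(letters) >= 8 and sum(c.isupper() for c in letters) / len(letters) > 0.7:
        signals.append("shouting")

    # A digit wedged between two letters, as in "V1AGRA".
    if re.search(r"[A-Za-z][0-9][A-Za-z]", folded):
        signals.append("digits-in-word")

    # Single characters separated by dots or spaces: "A.G.R.A" or "s p a c e s".
    if re.search(r"(?:\b\w[. ]){3,}\w\b", folded):
        signals.append("separated-letters")

    # A long run of spaces, typically used to push trailing junk out of sight.
    if re.search(r" {4,}\S", folded):
        signals.append("hidden-trailing-junk")

    return signals
```

Applied to the example subject shown earlier, this reports the digit substitution, the separated letters and the hidden trailing junk; a plain subject like “Meeting at 10AM” triggers nothing.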
In the days when spam was not yet a major problem, simple mail filters could be quite effective: by simply checking for certain words, a decent amount of spams could be filtered. Unfortunately the amount and diversity of junk mail has grown to such a degree that nowadays this won't do.
Even if the spammer managed to make the address and subject look inconspicuous, the content of the message will almost always reveal that it is spam. (This puzzles me: if one gets a letter titled meeting at 10AM and the contents are about something completely different, why would a sane person ever consider taking the message seriously?) The texts in spam messages have the same characteristics as the subjects described above. As soon as you recognise this, hit the delete button! Make it a sport to delete spams as quickly as possible! This way, you might be able to get some entertainment from it, instead of just being irritated!
Try to convince politicians in every possible country of the urgent need for a world-wide law which forbids spamming. A good example is article 13 of Directive 2002/58/EC of the European Parliament. Only opt-in mailing lists should be allowed, and only under strict conditions. Opt-out is total nonsense!
Last but not least: even though you are being spammed, it remains very useful to apply the tips from the previous section. If you make sure that your address is nowhere immediately accessible on the internet and it only circulates where it should, then eventually the amount of spam will decrease. It is crucial that you give no sign whatsoever to spammers that your address is still being used.

Protecting internet forums, guestbooks, and blogs against spambots

This section is intended for people who run a website which contains a forum, guestbook or blog (in other words, for ‘webmasters’). In the early days of the internet, a guestbook was as simple as a CGI script which appended the input from a web form to a webpage. If you did this today, your guestbook would be stuffed with garbage within a few months. Moreover, the few real persons who signed the guestbook or left a message on the forum together with their e-mail address would be spammed to death after a few weeks. These two phenomena are due to two types of ‘spambots’ that roam the internet today:

The first type is the ‘harvesting’ spambot. It is similar to a web crawler or spider, as used by search engines like Google. However, these bots do not just index webpages, they collect (‘harvest’) everything that looks like an e-mail address. Lots of these bots have a preference for guestbooks and forums, and most of them don't even do the crawling themselves: they just search Google for combinations of the words “guestbook”, “forum”, and common words or brand names. Believe it or not, there are even real people whose life is so pathetic that they have nothing better to do than mimic these bots all day long. They're called ‘mugu’ or ‘guymen’, and often drop a message like mugu keep offff on the guestbooks, to indicate up to which point they have harvested.

The most effective solution by far is not to display any e-mail addresses on your guestbook. You can either omit any ‘address’ input fields and discourage people from putting addresses in their messages, or only keep the addresses for your personal records — although I've never mailed anyone who signed my guestbook. Remember, no addresses on your guestbook means nothing to harvest!
If you do want to display e-mail addresses for some reason, a simple and effective way to protect them from these bots and pathetic people, is to make sure that your guestbook cannot be found through search engines. In other words, you should add all guestbook pages to your robots.txt file. For guestbooks this makes sense, because you'll only want people to sign your guestbook if they came through your site. But for forums (and blogs) this is not a good solution, because the main idea of a forum is that people can find answers in it without having to ask the same questions all over again.
Another solution is to make the access to your guestbook or forum a little different from what a bot would expect. For instance, do not simply put your guestbook in a ‘/guestbook’ directory, use some random path name instead! You could also avoid using the word ‘guestbook’ or ‘forum’, or replace all instances of these words by images. Be creative!

To protect e-mail addresses even if a bot finds your guestbook/forum, you should make sure that they aren't readily available. In a guestbook, you could automatically deform addresses so they don't look like e-mail addresses. In most forums, you have to register and log in to see addresses of other members. But often, e-mail addresses are always hidden, and people can only mail other forum members through a web form. This is the best solution as it also protects against human harvesters. Of course, the page where users have to register needs to be protected too, otherwise a bot could register and log in. A common technique is to show a ‘challenge’, like an image with a distorted word (often called a ‘captcha’), and require the submitter to type this word. But other inventive methods are possible, like simple mathematical sums. The only requirement is that the challenge be different each time, otherwise it's useless as soon as the designer of the spambot discovers it.
The second type of bot is the literal ‘spam’-bot. All it does is spam. These robots also crawl or search for forums, guestbooks and blogs, and then try to add their own messages, filled with links to all kinds of stupid sites. The aim is to trick people into clicking those links, and/or to boost the Google PageRank of those sites. Often they try to hide the commercial nature of their messages by using ‘innocent’ phrases like nice site!, liked it!, …
Again, adding your guestbook (especially the page with the submission form) to robots.txt will make it harder for bots and spammers to find your page. This won't suffice, however, so you need protection against those bots that do reach your page. For a guestbook or blog, requiring people to register is not an option: you'll just scare away people who would otherwise quickly have typed a nice comment. So you need to use other ways to detect whether the submitter is a real person or a robot. You could again use a captcha, but I consider this a nuisance to the user (captchas that are sufficiently bot-proof are often also hard for humans to read). There are more subtle ways to detect bots.

For starters, bots will often try to put ridiculously long strings in input fields, so set a (sensible) ‘maxlength’ attribute for all input fields, and reject messages if a maxlength is violated. This is a good test, but it won't suffice. To make it really tough for bots, do the following.
Start by giving your input fields non-obvious names in the HTML source. Use random strings or nonsensical words. Next, replace the textual labels next to the input boxes by images. (To keep the form usable for visually impaired people, you can still put the label in the ‘alt’ attribute of the image.) This will prevent bots from filling in names and links, but you will still get garbage in your main textarea. Rejecting anonymous messages will block most of this junk, but some bots fill in random nonsense in fields they don't know.
To detect this too, there's a simple trick: add one ‘dummy’ field to your form, and request users to leave this field empty. For this field, you should use an obvious name, like “url” or “name”. Do not hide it away at the bottom of your form, put it between the real input fields. Spambots will most likely want to put something in this ‘mouse trap’, so configure your CGI script to reject any message in which this field is not empty.
Note that this trick will only keep on working well if it doesn't become a ‘standard’. The problem with captchas is that they are standard, so once one type of captcha has been cracked, all sites using it are vulnerable. The best protections against bots are custom things that only exist on your site, like a silly question where the user has to select the right answer (e.g. Select the name of the largest object: house, planet, car.) Note how this also protects against real humans that are so dumb that you probably don't want to receive messages from them…
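The server-side part of these checks could look roughly like the Python sketch below. The field names “q7_who” and “q7_text” and the length limits are made-up examples of ‘non-obvious’ names, the decoy field is called “url” as suggested above, and `form` is assumed to be a plain dictionary of submitted values, however your own script obtains it.

```python
# Hypothetical field names and limits; pick your own, and keep them unusual.
MAX_LENGTHS = {"q7_who": 60, "q7_text": 2000}
TRAP_FIELD = "url"        # decoy field that human visitors are asked to leave empty
NAME_FIELD = "q7_who"     # the real 'name' field; empty means an anonymous message


def rejection_reason(form):
    """Return why a submission looks automated, or None if it passes."""
    # The 'mouse trap': anything typed into the decoy field gives the bot away.
    if form.get(TRAP_FIELD, "").strip():
        return "decoy field was filled in"
    # Enforce the same limits as the 'maxlength' attributes in the HTML form.
    for name, limit in MAX_LENGTHS.items():
        if len(form.get(name, "")) > limit:
            return f"field {name} is longer than {limit} characters"
    # Reject anonymous messages.
    if not form.get(NAME_FIELD, "").strip():
        return "anonymous message"
    return None
```

The length check is repeated on the server because ‘maxlength’ in the HTML only restrains a browser; a bot posts whatever it likes.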

If your guestbook receives an increasing amount of spam (attempts), you should move it to a different location in your site, but leave a fake decoy page in the original location with a form that does nothing. This works well because many bots store the location of guestbooks they have found, and keep on spamming them until they get errors. If they get no errors on your fake page, they won't be re-crawling your site for the real guestbook.

Using a honey pot

The robots.txt file is not a miracle solution against crawlers that are specifically designed to gather mail addresses or other data from websites. Those crawlers will simply ignore the robots.txt file, or worse: use it to figure out what URLs are forbidden hence potentially interesting. Even if your webpages do not contain sensitive information, such crawlers can still be a major nuisance because they will eat up your bandwidth. They are often written in such a primitive manner that they will download everything, including large data files.

What you need in such a case is some way to detect that a crawler is misbehaving, and stop it in its tracks immediately. The trick is to create one or more special URLs that will never be visited by normal visitors. As soon as such a ‘honey pot’ page is requested, the visitor's address is added to a blacklist, and the visitor is instantly served error pages for every subsequent request. There is a rather simple but effective way to implement this if you can run scripts and configure .htaccess files for your website.

Create a few special webpages in the root of your site and put ‘Disallow’ directives in robots.txt for all these pages. The root is a good place because this is where most crawlers will scan first. Put invisible links to these URLs on a few normal pages. I recommend two places to put these links: your main page, and any page that contains links to large download files. Put the links at both the start and end of the page. You can simply open and close an <A> tag without any content in between, or only a space or dot. To make sure that no human visitor will see the link, put it inside a DIV that has style display:none. To really make sure that the links will not be visited by a human (for instance by someone who uses a screen reader that ignores the style), you can give the links obvious names like “do_not_follow_this_link.html”. Be a little creative and add some variation. If everyone used the same URLs, then eventually the bots would be programmed to avoid those.

If you really want zero risk that a normal visitor will ever accidentally visit one of these ‘honey pot’ URLs, you can drop the links altogether and only put a ‘Disallow’ directive in robots.txt with the URLs. This will not be as effective, but it will still catch the worst crawlers of them all: the ones that intentionally abuse robots.txt in the hopes of finding the most crawl-worthy pages.

Next, create a script, in Perl, Python, or whatever you prefer, that takes the visitor's IP address when invoked and creates an empty marker file named after that address inside a directory. Then configure the .htaccess file in your website's root as follows:

RewriteRule do_not_follow_this_link.html$ /cgi-bin/blockmyip.pl [L]
RewriteCond /blocked_ips/%{REMOTE_ADDR} -f
RewriteRule .* - [F]

This does two things: first, it sends anyone trying to open the forbidden file to the script. Second, for every other request, it checks whether a marker file named after the visitor's IP address exists inside the directory ‘blocked_ips’, and if so, serves a 403 Forbidden error page. Mind that the ‘-f’ test looks at a path on the server's file system, not at a URL, so adjust the path to wherever your marker directory really lives. You can precede this with other RewriteRules to allow a few pages to be visited even by blocked users, etc. I will not go into the details of .htaccess here.
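The script that the first rule invokes could look roughly like the following Python sketch. It is only an illustration under a few assumptions: the rule above points to a Perl script, so you would have to adapt the file name; BLOCK_DIR must be the same file system directory that the RewriteCond tests; and the function name block_ip is made up here.

```python
#!/usr/bin/env python3
import ipaddress
import os

# Assumption: the same directory that the RewriteCond line checks.
BLOCK_DIR = "/blocked_ips"


def block_ip(addr, block_dir=BLOCK_DIR):
    """Create an empty marker file named after addr; return its path or None."""
    try:
        # Accept only a plain IP address, so that the file name can never
        # contain a path separator.
        ipaddress.ip_address(addr)
    except ValueError:
        return None
    os.makedirs(block_dir, exist_ok=True)
    path = os.path.join(block_dir, addr)
    open(path, "a").close()  # create the file if it does not exist yet
    return path


def main():
    # The web server passes the visitor's address to CGI scripts
    # in the REMOTE_ADDR environment variable.
    block_ip(os.environ.get("REMOTE_ADDR", ""))
    # CGI response: headers, blank line, body.
    print("Status: 403 Forbidden")
    print("Content-Type: text/plain")
    print()
    print("Your address has been blocked because you followed a forbidden link.")


if __name__ == "__main__":
    main()
```

The marker file stays empty on purpose: its name is all the RewriteCond needs, and its modification time tells you how long ago the address was blocked.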

I will however give a slightly more advanced variation on the above, which lets you serve a custom 403 page specifically to blocked users. This is a good idea because you can explain why the user is blocked and how they can contact you, in case a human visitor still managed to set off the honey pot trap despite all precautions. You could even do something fancy like allowing the visitor to unblock themselves through a captcha of some kind.

RewriteRule do_not_follow_this_link.html$ /cgi-bin/blockmyip.pl [L]
RewriteRule ^(forbidden-Crawl.html|robots\.txt)$ - [L]

<Files blocked403>
ErrorDocument 403 /forbidden-Crawl.html
</Files>
RewriteCond /blocked_ips/%{REMOTE_ADDR} -f
RewriteRule !blocked403 /blocked403 [PT]
RewriteRule blocked403 - [F]

This is a little hack that passes the request through to a non-existent path, and defines a custom 403 page for that path. Of course we must whitelist that error page, and we also whitelist robots.txt to remind the bot why it has been naughty. The line with ‘!’ is needed to avoid an infinite loop.

It may be wise to also take the X-Forwarded-For header into account, to avoid blocking an entire network behind a NAT if only one idiot behind that NAT runs a bad crawler. For instance, update the ‘-f’ line as follows, and let the blockmyip script create marker files whose name is the same underscore-joined concatenation of REMOTE_ADDR and HTTP_X_FORWARDED_FOR (or an empty string if the latter is not defined).

RewriteCond /blocked_ips/%{REMOTE_ADDR}_%{HTTP:X-Forwarded-For} -f
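On the script side, the marker name could be built along these lines. This is only a sketch: marker_name is a hypothetical helper, and the check on the header value is my own precaution, added because the header is supplied by the client and ends up in a file name.

```python
import re

# Characters that can occur in a comma-separated list of IPv4/IPv6 addresses.
_ADDRESS_LIST = re.compile(r"[0-9A-Fa-f.:, ]*")


def marker_name(remote_addr, forwarded_for=None):
    """Join REMOTE_ADDR and X-Forwarded-For with an underscore.

    Returns None when the header holds characters that do not belong in
    an address list, so that it can never smuggle in a path separator.
    """
    forwarded_for = forwarded_for or ""
    if not _ADDRESS_LIST.fullmatch(forwarded_for):
        return None
    return remote_addr + "_" + forwarded_for
```

In a CGI script the two values would come from the REMOTE_ADDR and HTTP_X_FORWARDED_FOR environment variables. A visitor without the header gets a name that ends in the underscore, which is also what the RewriteCond line expands to in that case.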

You should do an occasional clean-up of the marker files, for instance by deleting all files older than a month. Spam crawlers may move between different IP addresses, and if you never cleaned up, you would end up blocking access to an ever-increasing part of the internet.
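Such a clean-up could be sketched in Python as follows. The function name remove_old_markers and the default of 30 days are assumptions for the sake of the example; the age of a marker is taken from its modification time.

```python
import os
import time


def remove_old_markers(block_dir, max_age_days=30, now=None):
    """Delete marker files older than max_age_days; return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds in a day
    # Read the whole listing first, then delete.
    with os.scandir(block_dir) as listing:
        entries = list(listing)
    removed = []
    for entry in entries:
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return removed
```

Running something like this once a day from a scheduled job should be plenty.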

You can also implement this same robot trap without relying on robots.txt. This requires adding an extra layer of indirection between your real webpages and the spider trap links. Instead of sprinkling your real webpage with hidden links that point immediately to the dangerous trap pages, make the invisible links point to simple static pages instead. These pages must have a ‘robots’ META tag with content="noindex, nofollow". This has several advantages:

you can put a clear warning in the page to reduce the risk of humans accidentally triggering the trap, which is more difficult with direct links;
you catch crawlers that do not respect the META tag, which is arguably a much worse offense than not respecting robots.txt;
there are no direct links to forbidden URLs on your real pages, which means less risk that search engines will rank your pages lower if they penalise such practices;
you can (and probably should) still put the trap URLs in robots.txt for extra protection, and to catch the really bad crawlers that use robots.txt as a guide to what to look for.

If you want to go even further, you can use Project Honey Pot to block visitors based on globally gathered data about misbehaving crawlers and spammers, but it may require a bit more effort to set it up than a few small scripts and a .htaccess file.

This text is licensed under a Creative Commons Attribution 4.0 International License.