I recently woke up to the staggering level of abuse occurring on this web site. This is old news to some, but we all wake up at different times. I’m talking, of course, about automated robots and spiders. They come at all hours, they take as much as they can, and they leave me with the (bandwidth) bill. They do so without respecting the Robot Exclusion Standard, now almost 10 years old. Some come to gather email addresses, which are then sold to spammers; some come to steal images or other content, and republish it without my consent; some come to spy on me and sell information to their clients about perceived violations of copyright, trademark, or some nebulous concept of brand identity.
None of them act in my best interest, or in the interests of my readers.
Some will say that the Internet is a public place, and if I don’t want something abused, I shouldn’t put it on the Internet. Well, that’s true. It is also true that if I don’t want to get mugged, I shouldn’t leave my house, and if I don’t want calls from telemarketers, I shouldn’t have a phone. But I like leaving my house, I like having a phone, and I like having this web site. I fight back against telemarketers who abuse my phone (you can too), and now I’m fighting back against robots who abuse my web site.
I started this site less than two years ago. Last month, it totaled about 1600 pages. (That number is actually much larger now, due to some site-wide structural changes.) Keep that number in mind as you read on.
All of these techniques require that you have mod_rewrite installed, and that you have privileges to create your own rewrite rules in your own .htaccess file. If you’re not sure what this means or whether you have those privileges, check with your system administrator. The “further reading” list at the end also suggests some alternate techniques that do not require mod_rewrite.
First of all, turn on mod_rewrite and set up some initial parameters by adding these lines to your .htaccess file:
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
Now, let’s familiarize ourselves with a few of our enemies:
A program that shows up in access logs with the name EmailSiphon, whose primary use is to rip through as many web sites as possible looking for email addresses, and sell them to spammers. Various people running this spambot hit me 143 times in January. I have banned it by User-Agent.
Important note: I originally thought that the EmailSiphon User-Agent belonged to a retail product called PowerSiphon. I was incorrect. Here is correct information, straight from the company:
By default, Power Siphon uses the same user agent as the installed version of IE that the user has. The reason being is that the downloaded content is intended to be used for off-line browsing on the user’s machine. Since many Web sites tailor their content based on the user agent, it is imperative that the user agent reflect the target browser. The Power Siphon product includes a custom browser based on the IE engine. The user can modify the user agent if they wish to target their content to a different browser.
While the company admitted in private email that its PowerSiphon product could be used for harvesting email addresses, it’s not the culprit here, and I apologize for mixing up the two.
OK, so back to EmailSiphon. Whatever it is (I can’t find a home page), it is the culprit here, and here’s how to block it:
RewriteCond %{HTTP_USER_AGENT} EmailSiphon
RewriteRule .* - [F,L]
Cyveillance is a spybot that scours the web for copyright violations and damaging information on behalf of clients such as the RIAA and MPAA. Their robot spoofs its User-Agent to look like Internet Explorer, and it completely ignores robots.txt. This spybot hit me 448 times in January. I have banned it by IP address.
RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$"
RewriteRule .* - [F,L]
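IP-range patterns like this are easy to get subtly wrong, so it can help to enumerate what a pattern actually matches before deploying it. The following is a minimal sketch in Python, assuming the pattern behaves the same way in Python's `re` module as it does in Apache's regex engine (true for this simple expression, but not guaranteed in general). The helper name `blocked_last_octets` is made up for illustration.

```python
import re

# Pattern copied from the RewriteCond above, without the surrounding quotes.
CYVEILLANCE = re.compile(r"^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$")

def blocked_last_octets(pattern, prefix="63.148.99."):
    """Return every last octet (0-255) for which prefix + octet matches."""
    return [n for n in range(256) if pattern.match(f"{prefix}{n}")]

octets = blocked_last_octets(CYVEILLANCE)
print(octets[0], octets[-1], len(octets))  # prints: 224 255 32
```

In other words, the rule covers 63.148.99.224 through 63.148.99.255 and nothing else.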
There is another email harvester which always claims to be referred from http://www.iaea.org/. You may have seen this in your own referrer pages. This spambot hit me 477 times in January. I have banned it by referrer.
RewriteCond %{HTTP_REFERER} iaea\.org
RewriteRule .* - [F,L]
There is another email harvester which shows up in access logs with a User-Agent that includes the phrase “Microsoft URL Control” (the name of the underlying URL library, MSINET.OCX). It fetches consecutive pages as quickly as possible and completely ignores robots.txt. Various people running this spambot hit me 1341 times in January. I have banned it by User-Agent.
RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control"
RewriteRule .* - [F,L]
NameProtect peddles their “online brand monitoring” to unsuspecting and gullible companies looking for people to sue. Despite the claims on their robot information page, they do not respect robots.txt; in fact, they spoof their User-Agent in multiple ways to avoid detection. They hit me 2085 times in January. I have banned them by User-Agent and IP address.
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{HTTP_USER_AGENT} NPBot
RewriteRule .* - [F,L]
Turnitin peddles their “plagiarism prevention system” to draconian universities. Their TurnitinBot hit me a whopping 19,575 times in January. I have not checked whether they respect robots.txt, and with these numbers, I really don’t care. They’re sucking down over a gigabyte of data a month for no purpose that even remotely benefits me or any of my readers. I have banned them by User-Agent and IP address.
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR]
RewriteCond %{HTTP_USER_AGENT} TurnitinBot
RewriteRule .* - [F,L]
This is only a small sampling of what goes on on the public Internet every day. In the course of researching this problem on Webmaster World and elsewhere, and examining my own access logs, I have identified 87 different spambots, spybots, and offline downloaders that treat my site like a five-dollar whore. They are all unilaterally banned.
# User-Agents with no privileges (mostly spambots/spybots/offline downloaders that ignore robots.txt)
RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$" [OR] # Cyveillance spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR] # Turnitin spybot
RewriteCond %{HTTP_REFERER} iaea\.org [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ^[A-Z]+$ [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} anarchie [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} Atomz [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} cherry.?picker [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "compatible ; MSIE 6.0" [OR] # spambot (note extra space before semicolon)
RewriteCond %{HTTP_USER_AGENT} crescent [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^DA \d\.\d+" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "DTS Agent" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^Download" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} EasyDL/\d\.\d+ [OR] # OD
RewriteCond %{HTTP_USER_AGENT} e?mail.?(collector|magnet|reaper|siphon|sweeper|harvest|collect|wolf) [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} express [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} extractor [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Fetch API Request" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} flashget [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} FlickBot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} FrontPage [OR] # stupid user trying to edit my site
RewriteCond %{HTTP_USER_AGENT} getright [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} go.?zilla [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "efp@gmx\.net" [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} grabber [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} imagefetch [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} httrack [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Indy Library" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "^Internet Explore" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ^IE\ \d\.\d\ Compatible.*Browser$ [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "LINKS ARoMATIZED" [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "mister pix" [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4.0$" [OR] # dumb bot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/\?\?$" [OR] # formmail attacker
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [OR] # IE’s “make available offline” mode
RewriteCond %{HTTP_USER_AGENT} ^NG [OR] # unknown bot
RewriteCond %{HTTP_USER_AGENT} offline [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} net.?(ants|mechanic|spider|vampire|zip) [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} nicerspro [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ninja [NC,OR] # Download Ninja OD
RewriteCond %{HTTP_USER_AGENT} NPBot [OR] # NameProtect spybot
RewriteCond %{HTTP_USER_AGENT} PersonaPilot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} snagger [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} Sqworm [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} SurveyBot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} tele(port|soft) [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} TurnitinBot [OR] # Turnitin spybot
RewriteCond %{HTTP_USER_AGENT} web.?(auto|bandit|collector|copier|devil|downloader|fetch|hook|mole|miner|mirror|reaper|sauger|sucker|site|snake|stripper|weasel|zip) [NC,OR] # ODs
RewriteCond %{HTTP_USER_AGENT} vayala [OR] # dumb bot, doesn’t know how to follow links, generates lots of 404s
RewriteCond %{HTTP_USER_AGENT} zeus [NC]
RewriteRule .* - [F,L]
I must warn you not to copy-and-paste this list onto your own site. The decision of which robots to block is a very personal matter. For example, I block many different “offline downloaders”, programs designed to spider a site and download various files on behalf of real people (in other words, they are not completely automated; someone has to point them at my site and click a button). Some specialize in stealing images, others are just general-purpose spiders. I even block MSIECrawler, which is Microsoft Internet Explorer’s “make available offline” synchronizer, because (according to my access logs) it is rarely used as intended. On several occasions I have seen it abused to recursively download large sections of my site (like my Safari information pages, which contain many large screenshots of test cases). So I block it, even though this could conceivably inconvenience actual readers who would like to use it for legitimate purposes. There will always be edge cases like this, which is why you need to do your own research, examine your own access logs, learn how to identify abuse, and decide what kinds of abuse you care enough about to deal with.
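As a starting point for examining your own access logs, here is a minimal Python sketch that tallies hits per User-Agent. It assumes your server writes Apache's "combined" log format; the function name `count_user_agents` and the sample lines are made up for illustration, and a real log will contain lines this simple pattern does not parse (those are skipped).

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, where the last two quoted
# fields on each line are the referrer and the User-Agent.
COMBINED = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "([^"]*)" "([^"]*)"$'
)

def count_user_agents(lines):
    """Count hits per User-Agent; lines that don't parse are skipped."""
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts

# Made-up sample lines for illustration (not taken from real logs).
SAMPLE = [
    '192.0.2.10 - - [01/Feb/2003:00:00:01 -0500] "GET / HTTP/1.0" 200 5120 "-" "EmailSiphon"',
    '192.0.2.10 - - [01/Feb/2003:00:00:02 -0500] "GET /about/ HTTP/1.0" 200 2048 "-" "EmailSiphon"',
    '192.0.2.77 - - [01/Feb/2003:00:05:00 -0500] "GET / HTTP/1.1" 200 5120 '
    '"http://www.iaea.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

if __name__ == "__main__":
    # Busiest User-Agents first.
    for agent, hits in count_user_agents(SAMPLE).most_common():
        print(hits, agent)
```

To run it against a real log, pass an open file instead of `SAMPLE`; the same approach works for referrers by counting `match.group(2)`.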
Since I started fighting back several days ago, I have been blocking an average of 700 hits a day from spambots, spybots, and other rude programs. (Keep in mind that this number is kept low because the robots are blocked from any access, and therefore have no links to follow.) Like fighting spam in the email world, fighting robots is a never-ending battle with no winners, only casualties. You will never stop all abusive behavior from all automated robots and rude programs, but you can minimize their effects and reduce the abuse to acceptable levels.
Further reading:
- If mod_rewrite is not enabled on your system, you can use mod_setenvif, as described in Using Apache to stop bad robots.
- Search for “EmailSiphon htaccess” or “htaccess block robots” to find additional examples and lists of User-Agents to block.
- If you want to use mod_rewrite for other uses, Ralf Engelschall’s URL Rewriting Guide is the definitive work. Here be dragons…
There are two commercial products for IIS that do a similar job to Mod Rewrite.
ISAPI Rewrite (http://www.isapirewrite.com/) and IIS Rewrite (http://www.qwerksoft.com/products/iisrewrite/)
So there are some possible solutions for those running IIS.
Coincidentally, just yesterday I wrote up a small bit of analysis I did on spambots which have hit my own sites (http://www.sharding.org/j/archives/000018.html).
So far, very few of the hits on that website have actually resulted in spam (this experiment has been in place since some time in May 2002 and I’ve had fewer than 20 spam messages sent to addresses from that website). Still, I’ve been working on my own set of mod_rewrite rules to try to keep out some of the identifiable bots. Unfortunately, many of them seem to be nearly impossible to identify programmatically.
— Sean
Sean: a static, manually maintained list of robots is about as useful as a static list of spammer email addresses. That’s why you need to set traps for them to fall into. Mine has caught 11 in two days.
— Mark
I’m wondering if you feel the actions suggested above run at all counter to your “copyleft” ideas (as stated at least in the terms of use of Dive Into Accessibility). I’m not on the side of those who choose to ignore robots.txt (after all you are asking them nicely to not go through your content), but I am forced to ask what’s more important - “what you’re saying,” or “how users get there?”
— Nick
Nick: First of all, this web site is not licensed under a copyleft license. Several of my other sites are, but this one is not.
Second of all, all the GFDL requires is to ensure that “the general network-using public has access to download [the work] anonymously at no charge using public-standard network protocols.” (GFDL, section 3) A link to a .zip file on the home page of a web site, with no registration, no passwords, not even any cookies, freely downloadable by any person with any web browser on any platform, certainly satisfies that requirement.
On my “Dive Into Accessibility” and “Dive Into Python” sites (both of which are licensed under the GNU Free Documentation License), I offer the entire web site for download as a .zip file — in multiple formats and multiple languages. And what do the spiders do? Download all of the .zip files, and then spider the rest of the site anyway! Nothing in the GNU license says I have to put up with that kind of crap.
— Mark
While it’s clear that turnitin.com is beastly and unreasonable in its consumption patterns, I can at least testify that it serves one of your readers–me.
If you were receiving as many blatantly plagiarized term papers and thesis proposals as me, you might not see it as “draconian” to use turnitin.com as a defense. It’s saved me countless hours in manually tracking down the unattributed sources of my students’ papers.
— Liz
Since I have the luxury of not paying for my bandwidth, I like to redirect spambots to a CGI script which autogenerates pages full of fake email addresses to harvest.
The point is not to suck up their bandwidth, but (hopefully) to pollute their lists with bogus email addresses.
I never thought of doing the same to Cyveillance (and its ilk).
Thanks for the tip!
Jacques, the problem with giving spammers bogus addresses is that most never notice. I’ve always heard the reasoning that “it’ll pollute their name lists”, but most spammers bounce their messages through open relays and forge their headers anyway. There’s no one listening to get the bounce messages. Or worse, there is someone listening, but it’s not the spammer.
Shelley was hit hard by this just recently. http://weblog.burningbird.net/fires/000932.htm
— Mark
I’ve long hesitated to implement mod_rewrite-based spam-ban techniques because I’m concerned about performance. So I’m especially interested in your thought process here.
You’re using more than 50 RewriteConds. That’s making the server go through more than 50 regular expressions *per page view*. This is frying (http://www.aaronsw.com/weblog/000404) to an extreme. Which isn’t to say it’s not worth it — that’s a personal decision — but I’m wondering what you think about this.
My questions, then: Did you consider this performance loss in your decision? Did the fact that your blog pages are static HTML (with lean, CSS-based layout) play a part? Would you have done the same had your content been generated on the fly? And have you noticed a performance loss on this site?
— Adrian
A fair point.
s#RewriteRule .* name/address.pork [L]#RewriteRule .* - [F,L]#
Actually, it occurs to me that this whole discussion is so very 1997.
We’re talking about individually-maintained static bot-blocking lists.
I used to have a long list of IP addresses of spammers to block in Sendmail. So did lots of people.
Then Paul Vixie had the bright idea that we could pool our efforts, and distribute the resulting blocking list via DNS.
Much less work per individual, and much more maintainable.
Thus the concept of the DNSBL was born.
There are now dozens of DNS zones http://www.declude.com/JunkMail/Support/ip4r.htm , and you can pick and choose what to block.
Why not something similar for web ‘bots?
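For readers unfamiliar with how a DNSBL is queried, the lookup itself is small: reverse the octets of the IP address, append the list's zone, and check whether that name resolves. A hedged Python sketch of the same idea applied to bots follows. The zone `bots.example.org` is purely hypothetical (no such bot blacklist is named in this thread), and the function names are made up for illustration.

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the name a DNSBL lookup resolves: reversed octets + zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone):
    """Return True if the DNSBL zone has an address record for this IP.

    By DNSBL convention a listed address resolves (usually to 127.0.0.x)
    and an unlisted one does not resolve at all.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False

# Hypothetical zone, for illustration only.
print(dnsbl_query_name("63.148.99.224", "bots.example.org"))
# prints: 224.99.148.63.bots.example.org
```

A web server would have to do this lookup (and cache the answer) per client, which is one reason the idea is harder to bolt on than a static list.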
As a long-time user of GetRight, I (quite predictably :-) ) consider your decision to ban GetRight altogether as incorrect.
First and foremost, GetRight is primarily a download manager, and not a spider (offline downloader). The spidering functionality has been added only in version 5.0, which is still in beta. So, assuming GetRight does the right thing and specifies its version number in the User-Agent field, you can at least restrict your banning to GetRight/5.0.
Second, when GetRight is installed, all downloads usually go through it. Thus, users wanting to download some ZIP file from diveintomark.org will see weird error messages with no apparent reason, when they actually don’t try doing anything evil.
What’s more important for you - your bandwidth or your users?
By the way, hosting with unlimited bandwidth is commonly available here in Russia. Isn’t it so in the USA?
Talking of anti-telemarketing, have you seen the Counterscript (http://www.xs4all.nl/~egbg/counterscript.html )? Puts an amusing spin on it, I think you’ll agree.
— Tim
The iaea.org spambot hit me last night. About once every second, and it didn’t stop until it had spidered absolutely *every* page on my site.
However, if mod_rewrite isn’t available, you could block it by using SetEnvIfNoCase in your .htaccess:
SetEnvIfNoCase Referer "^http://www.iaea.org" BadBot
Deny from env=BadBot
— Arve
Mark, like Adrian, I’m very interested to hear your take on the effect all these Regular Expressions must make in your server performance (see comment #10 above).
P.S. I’m running Opera 7.01 and if I submit/preview a comment with the “Remember my name” option selected, I get an Illegal Cookie warning from the browser telling me that the cookie was promptly rejected.
Well… it seems that you are not excluding ZOE :-(
Its user-agent looks like the following:
Mozilla/5.0 (Mac OS X) ZOE/0.4 Java/1.4.1
With the operating system and the java version varying depending on the platform running the software.
I don’t know if it qualifies as a robot but nonetheless it’s most likely abusing your site by shamelessly retrieving information from it (and completely ignoring robots.txt). Which is totally unacceptable. No matter how you look at it.
To help you stop this shocking abuse of your bandwidth, “diveintomark.org” and associated sites will be totally ignored by the application as from next release… no matter if the user want it or not…. a line has to be drawn.
Regards,
Z.
— Zoe
Z, I assume you’re kidding, but your ignorance is disturbing nonetheless. I have never seen ZOE request anything but my RSS feed — it even supports ETags! That’s great. So it’s not a spider.
It is, however, a bit overzealous. It appears to be requesting my feed every 5 minutes. Perhaps this could be cut back a little? The de facto standard for news aggregators seems to be no more than one request per hour, per feed.
— Mark
Thanks Mark. This is very interesting and informative - good to know beforehand, though there have been some good points made regarding performance.
What think ye?
— Gina
Mark, I’m curious. What was the ratio of bandwidth used by spambots and cache-engines to bandwidth used by “normal” users before you started your blocking efforts?
Mark: Just a suggestion, since you’ve implemented a trap based system of detecting bad robots, you could conceivably produce a list of well-behaved robots. This may prove a point that it is possible to build a well behaved robot and not suffer.
— Isofarro
All the major search engines use well-behaved robots. Some of them actually make real money.
See also this comment of mine, on a previous thread, regarding spiders that build offline caches for mobile/handheld users:
http://diveintomark.org/archives/2003/02/21/newsmonster_day_2.html#c000413
— Mark
Re: performance. Unfortunately (for the sake of this discussion at least) I just moved to a new faster server a few weeks ago, so I have little comparative data on the performance hit of these rewrite rules. However, I researched that exact question before starting to implement this solution, and the answer appears to be that my solution — based on mod_rewrite, multiple similar conditions combined on one line where possible, everything in a single rule — is about as efficient as you can get.
It also depends on how much traffic you get, and how dynamic your pages are. This thread on Webmaster World — http://www.webmasterworld.com/forum10/1297.htm (free registration required) — says “Try to make the ban files as efficient as possible, combining multiple bans on one line, for example. How much this overhead affects your server’s performance depends heavily on how much traffic you get. With a few thousand hits per day, you’ll never notice it. Raise that to a few thousand hits per hour, and you might see a difference.”
My two biggest problems are (1) spiders downloading large images from my photo galleries, and (2) spiders getting caught in my “Recommended Reading” tool ( http://diveintomark.org/newdoor/ ), which generates a near-infinite number of unique links, each of which requires several database queries. Pretty much any amount of effort is worth it to avoid this. (I’m also implementing throttles and sanity checks of requests-per-IP-per-minute in the script itself, as a secondary defense.)
— Mark
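The requests-per-IP-per-minute sanity check mentioned above could look something like the following Python sketch. This is not the code running on this site; it is a minimal sliding-window counter under assumed numbers (the class name `PerMinuteThrottle` and the limit of 30 requests per 60 seconds are made up for illustration).

```python
import time
from collections import defaultdict, deque

class PerMinuteThrottle:
    """Allow at most `limit` requests per IP within a sliding time window."""

    def __init__(self, limit=30, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Record a request and return True, or return False if over the limit."""
        now = time.time() if now is None else now
        recent = self.hits[ip]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True

throttle = PerMinuteThrottle(limit=3)
results = [throttle.allow("192.0.2.10", now=t) for t in (0, 1, 2, 3, 60.5)]
print(results)  # prints: [True, True, True, False, True]
```

A script would call `allow()` at the top of each request and return an error page when it comes back False. Note that state kept in memory like this only works if the script runs as a long-lived process; a plain CGI script would need to keep the timestamps in a file or database.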
Re: you’re kidding
No. This is frowned upon where I come from.
http://www.corsica-nazione.com/
It is a sign of ignorance of the seriousness of the matter at hand.
ZOE is now MP rated.
http://guests.evectors.it/zoe/contents/MP.jpg
Hope this helps.
Regards,
Z.
— Zoe
Z, your behavior is puzzling, and your reasoning is downright bizarre. I’ve never blocked your program, nor do I see any reason to. It’s not a spider, so it has no relevance to this conversation. Only after you brought it up did I even notice that it seems to be requesting my RSS feed every 5 minutes (which seems unnecessary).
Whatever. I got the same kind of flak when I started my web accessibility series last summer. I take this as a sign that this is some of my best technical writing; in my experience, people don’t bother trying to tear down the mediocre stuff.
— Mark
This is an interesting problem. I think a better approach would be to implement rate limiting for clients who download too rapidly.
This could be done using a priority queue of IP addresses, that holds an exponentially decaying average of hits per second (the same technique used to track Unix load average, which requires only one float and one timestamp to implement). Well-behaved clients who access robots.txt could have their cap raised.
Some disadvantages in this approach:
- we would have to maintain a list of high-traffic proxies like the AOL proxies and raise their cap
- spambots could start using robots.txt (and ignore what it says) to get around this scheme
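The decaying-average idea above can be sketched in a few lines of Python. This is an illustration under assumptions, not a tested server module: the class name `DecayingRate`, the half-life of 60 seconds, and the use of a plain per-IP dictionary instead of a priority queue are all choices made here for brevity. The score is a decayed hit count rather than a true hits-per-second figure, but it rises and falls the same way.

```python
import math
import time

class DecayingRate:
    """Exponentially decaying hit count: one float and one timestamp per client."""

    def __init__(self, half_life=60.0):
        self.decay = math.log(2) / half_life  # per-second decay constant
        self.score = 0.0
        self.last = None

    def hit(self, now=None):
        """Register one request and return the updated score."""
        now = time.time() if now is None else now
        if self.last is not None:
            # Age the old score by the time elapsed since the previous hit.
            self.score *= math.exp(-self.decay * (now - self.last))
        self.last = now
        self.score += 1.0
        return self.score

clients = {}  # ip -> DecayingRate

def over_cap(ip, cap, now=None):
    """Return True if this request pushes the client's score above its cap."""
    rate = clients.setdefault(ip, DecayingRate())
    return rate.hit(now) > cap
```

A client that fetches one page a minute settles at a score of about 2 with these numbers, while one fetching a page per second climbs past 50 within a minute, so a cap somewhere in between separates the two. Raising the cap for clients that fetched robots.txt, as suggested above, would just mean passing a larger `cap` for those IPs.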
Re: distributed ban lists. This has been oft-discussed, but as far as I know, never implemented. (If someone knows of a working large-scale implementation, please post a link.) It is undoubtedly the wave of the future.
On a related note, while researching the spambot issue, I stumbled across http://www.openproxies.com/ , which sells its list of open HTTP proxies to webmasters as a ready-to-use .htaccess file. Of course, they sell the same list in a different form to clients who want to surf anonymously. :)
Presumably dedicated crawler companies would subscribe to the distributed ban rules to make sure they could bypass them, in much the same way that dedicated spammers have altered their tactics to bypass SpamAssassin rules.
As I said, it’s a war with no winners, only casualties. It just depends how far you’re willing to go.
— Mark
Re: throttling. There is a mod_throttle module for Apache. I’ve never used it. It does have a “ThrottleClientIP” parameter; I don’t know if this would cause problems with major proxies. And it’s not installed on my server, so I can’t try it. :(
http://www.snert.com/Software/mod_throttle/
Has anyone had any experience with this?
— Mark
I’ve played with mod_throttle some in the past; it seems to work fairly well, provided you’re not serving really big files (it coped reasonably well with 200 MB ISO images, but it’s better suited to much smaller stuff). Of course, you can also do traffic shaping at the kernel level for more fine-grained control.
Hey, Zoe’s nuts! Say some more wacky stuff, Zoe, I’m trying to figure out where exactly you’ve misunderstood Mark.
— Anil
Oops, sorry for the double-trackback there…
Uh-oh. You’ve pissed off a Corsican. Better invest in some heavy weaponry, the rabid ones can survive multiple shots to the head before they drop.
— d
Is there any way to tell if mod_rewrite is installed, like the check script (mt-check.cgi) that is used with Movable Type?
‘httpd -l’ will list the modules compiled into that httpd binary.
— sean
Is there an Opera OS X Award for excellence in convincing potential customers and users that you’re a spoiled brat? I think Zoe would make a perfect first nominee. (The winner will receive a small statue shaped like a Safari brushed-metal snapback button.)
For spambots, I include a bogus email address of spidertrap@. The text around the bogus email address, at the bottom of a page, reads in very small, very grey print: If you’re a spammer and want to be banned, send an email to
Then I have scripts that look at every message sent to the spider trap, and delete that message from my real inboxes.
It doesn’t save bandwidth, but I never see the email caused by spambots.
For the heck of it I DID copy and paste your .htaccess file into my server to test and found that…
RewriteCond %{HTTP_USER_AGENT} ^ IE \ \d\.\d\ Compatible.*Browser$ [OR]
…kept my browsers Mac Explorer 5.2 and Safari b1.0 v60 from viewing my site.
Eric, that can’t be right. I have that rule defined here and I’ve been browsing my own site with IE5/Mac and Safari all day. My access logs show no problems for anyone else either.
Safari’s User-Agent is “Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/60 (like Gecko) Safari/60”.
IE5/Mac’s User-Agent is “Mozilla/4.0 (compatible; MSIE 5.22; Mac_PowerPC)”.
Neither comes close to matching that rule.
— Mark
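A quick way to settle a question like this is to test the pattern against the User-Agent strings directly. The Python sketch below does that for the rule as it appears in the main list. Two caveats: Python's `re` module is not the same regex engine Apache uses, so treat the result as a sanity check rather than proof, and the `spoof` string is a made-up example of what the rule is meant to catch, not something copied from a log.

```python
import re

# The rule from the main list, with the backslash-escaped spaces written plainly.
RULE = re.compile(r"^IE \d\.\d Compatible.*Browser$")

safari = ("Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) "
          "AppleWebKit/60 (like Gecko) Safari/60")
ie5mac = "Mozilla/4.0 (compatible; MSIE 5.22; Mac_PowerPC)"
spoof = "IE 5.5 Compatible Browser"  # made-up example of a matching User-Agent

for name, agent in (("Safari", safari), ("IE5/Mac", ie5mac), ("spoof", spoof)):
    verdict = "blocked" if RULE.search(agent) else "allowed"
    print(f"{name}: {verdict}")
```

With the pattern written this way, both real browsers are allowed and only the made-up string is blocked.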
Hi Mark!
Job well done. Actually this problem has become increasingly severe for my Domino site during the past 3 months.
Unfortunately my access to the server is restricted so I can’t use htaccess and servlets. Currently I’m examining using a hide-when based solution combined with a spider trap link (web bug).
However some of these bots have become so aggressive, some are not even parsing HTML but will follow everything that looks like a link even in source HTML, JS sections and so on.
I agree that a product that compiled lists of “bad bots” could be very useful. Something like the lists of known spammers or like AdAware. What if a user could subscribe to an RSS feed or two of known “bad bots” and have that feed automagically be included in the .htaccess file of the user? Way too complicated for me to even begin to implement, but would this be possible?
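The last step of that idea, turning a shared list of patterns into rules, is the easy part. Here is a rough Python sketch, assuming the list has already been fetched and parsed into plain User-Agent patterns; fetching a feed and safely rewriting a live .htaccess file are left out, and the function name `rewrite_block` is made up for illustration.

```python
def rewrite_block(patterns):
    """Render User-Agent patterns as chained RewriteConds plus one RewriteRule.

    Returns an empty string for an empty list, because a RewriteRule with
    no conditions in front of it would forbid every request.
    Patterns containing double quotes are not handled by this sketch.
    """
    if not patterns:
        return ""
    lines = []
    for index, pattern in enumerate(patterns):
        # Every condition except the last must be OR-ed with the next one.
        flags = "[NC]" if index == len(patterns) - 1 else "[NC,OR]"
        lines.append(f'RewriteCond %{{HTTP_USER_AGENT}} "{pattern}" {flags}')
    lines.append("RewriteRule .* - [F,L]")
    return "\n".join(lines)

print(rewrite_block(["EmailSiphon", "NPBot", "TurnitinBot"]))
```

The harder questions are the ones raised elsewhere in this thread: who maintains the list, and whether you trust someone else's idea of a "bad bot" enough to apply it unseen.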
I may be being a bit naive, but this strikes me as a case of technology overkill. Wouldn’t it work just as well to make the site password-protected (perhaps using .htaccess, either with redirect to a script or just BA), then provide the password openly on a “front-door” page using a method inaccessible to robots - such as an image?
Some paid subscription sites use this technique (except obviously they don’t publish the password!), redirecting you to a login page then after login going to the url you provided originally.
This would be a minor inconvenience to legitimate visitors, but would stop fully automated systems. It wouldn’t stop anyone prepared to manually guide their robot, but it wouldn’t be hard to change the password automatically once a day - and anyhow I’m guessing that the fully automated hoovers are most of the problem.
As for Z, I think you’re being sent up. It looks to me like Raphael finds the whole idea of a large and growing script to block spambots rather amusing and thinks it will be ultimately ineffective. You may not like it, but this is legitimate criticism in the form of satire.
For a recent example from elsewhere, see http://enquirer.com/editions/2003/02/21/tem_0221daaligshow.html (Cincinnati) but also http://www.freep.com/entertainment/tvandradio/mside21_20030221.htm (Detroit).
well a thing that I don’t understand here, if someone that was doing these nefarious things wanted to be really evil couldn’t they just write a program that sent something in the header that was not a spambot, spider etc. but was in fact the info for a normal browser. For example I use cURL a lot, with it you can set the headers to be for example, ie5 if that’s what you want (useful if you have solutions that output variant content for different browsers). If I then wanted to use a lot of my time and build a spider with cURL that spoofed ie5’s header, well it wouldn’t be impossible. Another thing I use a lot (for automated gets etc.) is Rebol, if instead of using rebol’s built in http understanding I rolled my own I could then take control of the headers and output what I want.
Finally I sometimes use wscript to start IE hidden and download individual pages, if I wanted to be a real pig I could build a spider using the same technique. This is not to troll or anything I’m just not understanding why these spiders are going ahead and telling you upfront what they are, is there a legal requirement that they must do this, excuse me but this is not an area that I have any knowledge of in a professional way as I only do these things for my own usage.
— bryan
oops, sorry, missed the part about Cyveillance spoofing IE. So I guess that answers my question about legality, but it still doesn’t answer my question as to why most firms etc. don’t spoof the headers, because if they did you would be stuck banning by IP and referrer and such like and for the ones with deep pockets it should be possible to maneuver against that tactic.
— bryan
“I may be being a bit naive, but this strikes me as a case of technology overkill. Wouldn’t it work just as well to make the site password-protected (perhaps using .htaccess, either with redirect to a script or just BA), then provide the password openly on a “front-door” page using a method inaccessible to robots - such as an image?”
And then how do blind people access his website? Anything that is able to be hidden from spam bots will by necessity be hidden from some people. I assume Mark does not wish to discriminate against blind people, just like I’d hope no-one would.
— Lach
Bryan, there is no legal requirement for bots to identify themselves. In many cases, the person who wrote the bot is not the person using the bot. Especially in the case of commercial software, the developers have some perverse incentive to advertise themselves in logfiles. Even if their software provides a means to change the User-Agent, most users won’t bother, because they don’t understand how the web works or why it would matter. And not a lot of people block by User-Agent, so it really doesn’t matter all that much.
If somebody writes a mod_blacklist module and sets up a distributed auto-updating bot blacklist, then it will matter more, and things will escalate.
Meanwhile, the deep-pocket companies running spybots 24-7 (like Cyveillance) generally operate out of a specific IP range. They have a bunch of servers, specially configured, their own data center, an upstream provider that doesn’t care what they’re doing, and so forth. I don’t really think they’re all that portable, but that’s just an educated guess.
— Mark
Re: roping off sites entirely with passwords. God, I hope it never gets this bad. At that point, you may as well just give up and get off the Net. I mean really. Also, that would totally cut off all robots, including legitimate search engines that send me tons of good traffic every day. (I’m trying to cut down on abuse, not use.)
— Mark
“there is no legal requirement for bots to identify themselves”: Yeah, I figured that after reading about that RIAA bot you talked about.
“In many cases, the person who wrote the bot is not the person using the bot”: this is no reason why a law could not be written requiring bots to identify themselves; laws are a very complicated type of code in themselves. The difficulty is in getting someone to phrase succinctly what a ‘bot’ is, which would require abstracting quite a bit. Such a law would in the end most likely be, like many laws, unenforceable in logic but quite enforceable in practice, as those doing the enforcing would choose whom to apply it against. Anyway, that’s just an aside.
“Re: roping off sites entirely with passwords. God, I hope it never gets this bad.”
Well, so do I. But the web is already divided this way, to some degree. A lot of commercial and free-but-must-register (i.e., they want your email address) sites run on just this basis. Usually they have a free front-door site which contains a full-ish index for the benefit of search engines.
Following 9/11 Gartner Group made their whole research site free access “for a limited time” as a pro bono gesture to help the rescue/recovery effort. But after a few hours they had to drop back to “email us for a pro bono password”, because their gesture was being abused by people sucking up everything - sort of cyber-looting, I suppose.
“And then how do blind people access his website?”
A good point. There must be a way - an optional voice prompt (equally useless to bots) perhaps.
Great info, Michael. I noticed this bad bot behavior the other day when I was setting up a logstat thingy. Saw really nasty behavior out of FAST and Mercator bots. FAST seems to have settled down, and I told Mercator bot to go to hell after the webrape it did on my server a few months ago.
How bad was it? Here’s an excerpt from my blog story:
[Mercator bot] has been going on an absolute feeding frenzy - where Googlebot crawled 987 pages totaling 7.8 MB from 28 Feb 2002 to 20 Oct 2002, Mercator Bot crawled 19856 pages totalling 479 MB. Um, that is a little excessive. Fast-Webcrawler (AllTheWeb) was not any nicer to this site. It had 18084 hits totaling 356 MB…. Oh, and guess how many hits those 18084 hits from Fast (AllTheWeb) generated? 29. Yep. For all that bandwidth that got sucked on by Fast, I got a whole 29 hits. Google gave me 1482 hits from the 987 pages it spidered. You do the math.
Now, I could handle Mercator going nuts if I was eBay.. but I ain’t. I have maybe 200 pages on this site. Most of those are story archives of my blog. However, back during the time period above, I had maybe 20 or 30 static pages - not near enough to warrant 19856 hits. Hell, that is more hits than I, the webauthor, put on my own site!
— Randy
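Taking Randy at his word on “do the math”, here is a small Python sketch that does it, using only the figures quoted in his excerpt. The function name is made up for illustration, and “hits generated” is read here as visitors referred back to the site, which is an assumption about what his stats were counting. Mercator is left out because no referral figure is given for it.

```python
def crawl_payoff(pages_crawled, megabytes, referrals):
    """Return (referrals per page crawled, MB of bandwidth per referral)."""
    return referrals / pages_crawled, megabytes / referrals


# Figures quoted in the excerpt (28 Feb 2002 to 20 Oct 2002).
google = crawl_payoff(pages_crawled=987, megabytes=7.8, referrals=1482)
fast = crawl_payoff(pages_crawled=18084, megabytes=356, referrals=29)

print("Googlebot: %.2f referrals per page, %.4f MB per referral" % google)
print("Fast:      %.4f referrals per page, %.2f MB per referral" % fast)
```

On those numbers Googlebot sent back about one and a half visitors for every page it crawled, while each visitor from Fast cost roughly 12 MB of crawling.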
RewriteCond %{HTTP_USER_AGENT} “Enter new UA String or choose a common one. Then press ENTER”
This string is present in a dropdown menu of the UAbar, an add-on for the Mozilla and Phoenix browsers that lets you change the user-agent string “on the fly”.
(http://uabar.mozdev.org)
Ha! I saw it and assumed it was from some stupid offline downloader. I’ll remove it.
— Mark
This is amusing.
http://diveintomark.org/public/go_to_hell.txt
— Mark
eh, do what you will, but realize that sometimes bots are good, that is, when they are used by Blogdex or Memeufacture (my site) to aggregate for public consumption.
Check out http://neilgunton.com/spambot_trap/ it has the advantage of successfully dealing with spiders that change their user-agent field.
John, I agree that not all bots are bad. Not all bots are spiders either. Bots like Blogdex/Popdex/Daypop/whatever provide clear benefits to myself and my readers. Also, they don’t follow links and spider my whole site; they only ever request my home page. I have no problem with them and they are not blocked.
— Mark
Sorry for the ping flood… MT seems to have a thing.
— Peter J.
Very interesting story. I have made some experiments with mod_rewrite in the past and would recommend using the RewriteMap feature. That way you can maintain the user-agents and IP addresses very simply in a text file.
— Gerald
Mark,
After perusing your 26-Feb-03 article on blocking bad visitors, and visiting WebMasterWorld and some usenet chats, I came up with a list of blocks of my own. Including NameProtect, which I didn’t realize was hammering my blog sites.
That said, here is the killer question. It wouldn’t take much for me to write some Perl that I could call by Cron and keep my blocking list up-to-date, provided there was a place, specifically a database, that I could query for user agents and/or domains by name, and receive in return either their IP or, even better, an argument I could plug directly into my .htaccess automagically.
Sort of like the RTBH, only I would see myself creating an XML-RPC query of targets I want to update, and receiving back a list of arguments to feed into my .htaccess. I’d probably also add code to email me about changes.
Question is, does any such database, or mechanism exist?
Thanks for all you do.
Dean
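Nobody in this thread names such a database, so the lookup half of Dean's idea stays open. The local half, turning a list of user-agent names into lines for .htaccess, is easy to sketch. The following is Python rather than Perl, the function name and the agent names are placeholders, and it assumes the same RewriteCond/RewriteRule shape used in the post.

```python
import re


def build_block_rules(user_agents):
    """Return mod_rewrite lines that answer 403 Forbidden to the given agents."""
    lines = []
    for index, agent in enumerate(user_agents):
        if '"' in agent:
            # The pattern is wrapped in double quotes, so embedded quotes
            # would need extra escaping that this sketch does not attempt.
            raise ValueError("quotes are not handled in this sketch: %r" % agent)
        # Every condition except the last is chained with [OR].
        flag = " [OR]" if index < len(user_agents) - 1 else ""
        lines.append('RewriteCond %%{HTTP_USER_AGENT} "^%s"%s'
                     % (re.escape(agent), flag))
    if lines:
        lines.append("RewriteRule ^.* - [F,L]")
    return "\n".join(lines)


# Placeholder names, not a recommended blocklist.
print(build_block_rules(["ExampleBot", "OtherBot"]))
```

A cron job could write this output between two marker comments in .htaccess, but whatever feeds it the list of names is still the missing piece.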
Are we due for a robots.xml standard? It would be nice if we could specify more than what we are allowed with robots.txt.
http://www.csandoval.addr.com/main.jsp?blosdate=/2003/03/02
— Cesar
Cesar: Tim Bray has something to say about this over in his weblog: http://www.tbray.org/ongoing/When/200x/2003/02/27/Websites
One of the things in his (linked) w3c-tag post is “The kinds of stuff that could go there could include robots info, language info, favicon.ico equivalent, RSS info, p3p info, etc etc etc.”
— Peter J.
Why not put a link to a page that blocks that user’s [IP address|User-agent|X-forwarded-for] values as presented in the headers, say for 1 hour, with the clear text “Click here to block your IP address”? It would not be that hard.
The use of all headers present would help to avoid blocking all users of a proxy, so there should be no issue there…
Then, to allow legitimate robots, just tell them to not go there in the robots.txt!
Problem solved, no?
— JNaWK
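For what it's worth, the bookkeeping behind JNaWK's trap page is small. A minimal Python sketch, assuming the trap page calls `block()` with the three header values and every other page calls `is_blocked()` first; the class and method names are invented for illustration, and the list lives in memory only.

```python
import time

BLOCK_SECONDS = 60 * 60  # "say for 1 hour"


class TemporaryBlocklist:
    """Remember header fingerprints that followed the trap link."""

    def __init__(self, block_seconds=BLOCK_SECONDS):
        self.block_seconds = block_seconds
        self._expires = {}  # fingerprint -> time at which the block lapses

    @staticmethod
    def fingerprint(ip, user_agent, forwarded_for=""):
        # Using all three values means other users behind the same proxy
        # are not blocked along with the robot.
        return (ip, user_agent, forwarded_for)

    def block(self, key, now=None):
        now = time.time() if now is None else now
        self._expires[key] = now + self.block_seconds

    def is_blocked(self, key, now=None):
        now = time.time() if now is None else now
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if now >= expiry:
            del self._expires[key]  # the block has lapsed, forget it
            return False
        return True
```

Well-behaved robots never see the trap as long as its URL is disallowed in robots.txt, which is the other half of the suggestion.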
I see the NameProtect outfit has a 12. address. All addresses starting with 12. are AT&T. You might be able to send mail to AT&T abuse address and stop them. Since they are a “business” it may not work but who knows….
In each page I serve, I include a bogus email address, encoded with the date of access as well as the host IP address and embedded in a comment. [Apache's server-side includes are great!] This has allowed me to trace spam back to specific hosts and/or robots.
One of the first I caught with this technique was the robot with the user agent “Mozilla/4.0 efp@gmx.net”, which always seems to come from argon.oxeo.com - it’s identified above as simply rude.
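The commenter does this with Apache's server-side includes and doesn't show the encoding, so the following is only one way it might look, written in Python instead. The `trap-` prefix, the address layout, and the `example.com` domain are all assumptions.

```python
from datetime import date


def tagged_address(ip, day, domain="example.com"):
    """Build a trap address such as trap-20030302-c0000207@example.com."""
    # Pack the dotted IPv4 address into eight hex digits.
    ip_hex = "".join("%02x" % int(octet) for octet in ip.split("."))
    return "trap-%s-%s@%s" % (day.strftime("%Y%m%d"), ip_hex, domain)


def decode_address(address):
    """Recover (ip, date) from an address made by tagged_address()."""
    local = address.split("@", 1)[0]
    _, day_text, ip_hex = local.split("-")
    day = date(int(day_text[:4]), int(day_text[4:6]), int(day_text[6:8]))
    ip = ".".join(str(int(ip_hex[i:i + 2], 16)) for i in range(0, 8, 2))
    return ip, day
```

When spam arrives at one of these addresses, decoding it gives the host and day of the request that harvested it, which can then be matched against the server logs.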
The all-inclusive solution is, of course, in the .htaccess file. However…
If you’re already using $scripting_language to publish your site’s content, you can also place an include at or near the top of each file that checks the requesting UA against an array.
I have such a file, though its original purpose was to parse the requested URI and assign stylesheets accordingly.
Then someone decided to stick me amidst drama from which I explicitly explained I should be excluded, and I decided to “block” my site against the subnet to which the instigator connected at home (along with a couple of others that were rotten); that was easy enough.
Then, as I was going through serverlogs, I found the same grief that Mark did. Enter a second array.
If there’s a match and alexa.com isn’t a requesting domain, the script closes all of the open tags and exit()s. That I can withstand with greater ease than seeing the entire page slurped up… and while it doesn’t protect the images, I’ve got some art direction tricks up my sleeve for the better part of those.
The overhead’s higher - when I asked IOCOM’s admins about it, they shrugged - but the approach has certain advantages (especially if you have a hosting plan that locks down .htaccess).
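The commenter doesn't name the scripting language, so here is a rough Python sketch of the check such an include would run. The agent fragments are placeholders, the function name is made up, and the alexa.com exception is left out.

```python
# Placeholder fragments; a real list would come from your own server logs.
BLOCKED_AGENT_FRAGMENTS = ["EmailSiphon", "WebZIP"]


def is_blocked_agent(user_agent, fragments=BLOCKED_AGENT_FRAGMENTS):
    """Return True if the User-Agent header contains any blocked fragment."""
    agent = (user_agent or "").lower()  # a missing header counts as empty
    return any(fragment.lower() in agent for fragment in fragments)
```

The page script would call this before producing any output and stop early on a match. As with the .htaccess rules, it only catches robots that announce themselves honestly.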
Zoe seems pretty out of it unless one remembers there’s no proof it is Zoe. Perhaps some Zoe hater is out there giving us all reason to not want to use Zoe’s product by pretending to be Zoe. I hate identity theft.
Of course, Zoe could be that stupid. *shrug*
People crafting web pages should examine the following method of hashing email addresses to further frustrate harvesters.
http://www.hiveware.com/enkoder_form.php
— shades2
found this page off /.
look at the sugarplum package:
http://www.devin.com/sugarplum/
it does a combo of rewrite and other tactics (like teergrubing) to act as a honeypot for spam robots.
I have installed a script named Robot Control Pro
and it now bans all evil robots automatically ;-)
Oops …
here it is:
http://webcomposing.com/webcomposing/HTML/cgi-programs.htm
Is it now a cool script, advertisement, or spam?
— nospam
While nothing beats detecting the user agent for known spambots, other methods are more universally effective and easier to implement, such as using JavaScript to generate mailto links or just substituting ASCII character codes for text. Many of these methods are described here:
http://www.webmasterworld.com/forum21/3451.htm
Ultimately, a combination of .htaccess & on-page methods is the best strategy, since known spambots are blocked and future spambots should be fooled.
— john
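One reading of “substituting ascii characters for text” is to write each character of the address as an HTML numeric character reference, which browsers render normally but which a naive harvester scanning for a literal “@” will miss. A minimal Python sketch under that assumption, with function names invented for illustration:

```python
def entity_encode(text):
    """Write every character as a decimal HTML character reference."""
    return "".join("&#%d;" % ord(ch) for ch in text)


def obfuscated_mailto(address):
    """Return a mailto link with the address encoded in both places."""
    encoded = entity_encode(address)
    return '<a href="mailto:%s">%s</a>' % (encoded, encoded)


print(obfuscated_mailto("me@example.com"))
```

A harvester that decodes entities before searching will still find the address, which is why john suggests pairing on-page tricks with the .htaccess rules.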
© 2001–8 Mark Pilgrim