Blocking Bots By HTAccess

While doing a little research for a client I ran into a bot that was spidering in a bad way. Within a few pages of search results I found my way to a blog entry by BrontoBytes about blocking spiders with .htaccess. This is a pretty interesting proactive approach to stopping request-level attacks, and something commonly done with mod_security, for instance. Check out the blog entry, which shows how to set up an .htaccess file to block a number of modern robots.

A word of caution, however: some of these aren’t “bad” per se, but they may be undesirable. Baidu, for example, is simply a Chinese robot that doesn’t obey the robots.txt file. Some might find that terrible, but others might be okay with it. Wget and libwww, on the other hand, usually just mean someone is manually interested in your site. If you consider that bad (system resource exhaustion, perhaps?), then there are lots of other things you should probably be blocking too. Anyway, it’s a pretty good starter list.
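
To give a flavor of the technique (the full list is in the linked post), a minimal .htaccess sketch using mod_rewrite might look like the following; the user agents below are just illustrative examples, not a vetted blocklist:

    # Refuse requests whose User-Agent matches known bad spiders.
    # Requires mod_rewrite.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (Baiduspider|libwww-perl|Wget) [NC,OR]
    # Also catch clients that send no User-Agent header at all.
    RewriteCond %{HTTP_USER_AGENT} ^$
    RewriteRule .* - [F]

The [F] flag sends back a 403 Forbidden, so the spider gets nothing useful. Keep in mind this matches on the User-Agent string, which (as the comments below note) a bot can trivially spoof.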

10 Responses to “Blocking Bots By HTAccess”

  1. Beelzebub Says:

    Hmmm. Wouldn’t bots not respecting robots.txt use a spoofed user agent?

  2. hackathology Says:

    Cool stuff. That article is well written.

  3. Ronald van den Heetkamp Says:

    I have a similar list, obtained by staring at my logs and spotting bad bots over the years; it’s about three times as large as that one. The Wayback Machine bot should be on it also, or else you’re facing eternity in the Internet Archive. I figure no one wants that. Imagine that… :)

  4. ChosenOne Says:

    @Beelzebub: That’s just what I thought… nice list, though.

  5. eKstreme Says:

    You’d be surprised how many bad bots do not spoof their UA.

    I keep my own list too, but it’s nice to find some new ones.

    Pierre

  6. hackathology Says:

    Ronald, care to share your list? Maybe post it in the forum? I would love to have a glimpse.

  7. Johann Says:

    I don’t recommend using these ages-old lists.

    There are bad bots missing (like “” or “Mozilla”) and most of the list could be compressed a lot (“Missigua Locator”, “Missigua Locate”, etc.). It’s also lacking some of the newer scripts.

    Setting up bot traps can be effective; however, you probably need more than just one.

    Blocking entire nets is very effective, too (see the sketch below this comment for both techniques).

    I posted some of the ones I see most in “The top 10 spam bot user agents you MUST block. NOW.”, but I still have lots more blocked.
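
For illustration, here is a minimal .htaccess sketch of the two techniques Johann describes. The trap path and the address range are made up for the example, and a real bot trap would usually also log or auto-ban the offending IP rather than just refuse the request:

    # Bot trap: Disallow /bot-trap/ in robots.txt and never link to it
    # visibly. Anything requesting it is ignoring robots.txt.
    # Requires mod_rewrite.
    RewriteEngine On
    RewriteRule ^bot-trap/ - [F]

    # Blocking an entire net (Apache 2.2 mod_authz_host syntax; the
    # range below is just a placeholder).
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24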

  8. Ronald van den Heetkamp Says:

    I will post my .htaccess soon. I’ve thought about making it an article someday, because it contains a ton of other security measures.

  9. Ronald van den Heetkamp Says:

    @hackathology

    I’ve uploaded my .htaccess as a txt file here. I don’t know how much it overlaps with the list from the above blog post, but I know it’s a ton more: 284 bots, harvesters, and web scripts.

    See: http://www.0x000000.com/ht.txt

  10. Johann Says:

    Ronald,

    that’s quite an extensive list; however, how many of those do you actually see on your site?

    Also, why one would block the W3C validator is beyond me (yes, you can use it as a proxy, but there are easier ones out there).

    And btw it’s “heritrix,” not “Heretrix.”