While doing a little research into some random stuff for a client, I ran into a bot that was spidering in a bad way. Within a few pages of search results I found my way to a blog entry by BrontoBytes about blocking spiders with .htaccess. It's a pretty interesting proactive approach to stopping request-level attacks, and similar to what mod_security is commonly used for. You can check out the blog entry, which shows how to set up an .htaccess file to block a number of modern robots.
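For reference, user-agent blocking in .htaccess usually looks something like the minimal sketch below. This assumes mod_rewrite is enabled, and the agent names here (Baiduspider, Wget, libwww-perl) are just stand-in examples, not the actual list from the BrontoBytes post. The [F] flag returns a 403 Forbidden instead of serving the page.

```apache
# Minimal sketch: deny requests whose User-Agent matches any of these patterns.
# Agent names are illustrative examples, not a recommended blocklist.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|Wget|libwww-perl) [NC]
RewriteRule .* - [F,L]
```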
A word of caution, however: some of these aren't “bad” per se, just undesirable to some people. Baidu, for instance, is simply a Chinese search robot that doesn't obey the robots.txt file. Some might find that terrible, but others might be okay with it. wget and libwww usually just mean someone is manually fetching your content. If you consider that bad (system resource exhaustion, perhaps?), then there are lots of other things you should probably be blocking too. Anyway, it's a pretty good starter list.