Robots.txt Just Isn’t Working For Me
Dear Search Engines,
I’ve worked for huge companies for many years. Each has its own unique issues. One issue they all have in common is you. You crawl our sites and expect us to know better and be able to react to that in real time. You expect us to know what we don’t want crawled, and you expect us to be able to conjure up a robots.txt file, add rel=nofollow, meta noindex tags, or whatever to satisfy that need.
Another issue that all the companies I’ve ever worked for have is slow time to develop and release anything to the website. If I know there is an issue, I have to a) explain it, b) get buy-off from engineering/business units/execs, c) get engineering to build the document, d) merge the code, e) QA it, and f) wait until the next build/release, and then poof, just like magic, we may or may not have fixed the issue, depending on whether QA and engineering did their jobs right. If not, the cycle continues.
Here’s a crazy thought. Why don’t you let us upload our robots.txt to you guys? Make us put some crazy hash in a file somewhere to prove we are who we say we are, but let us tell you immediately what not to index, what not to follow, or what else you shouldn’t waste our bandwidth doing. I ran into a situation today where any reasonable person could have immediately told you how to fix the issue, yet it may take weeks or months to fix the problem instead of one guy in a few minutes uploading one file to tell you not to do XYZ. You allowed us to do things like upload sitemaps, so why not let us tell you what we DON’T want you indexing? I know, where do I come up with all this crazy talk?
Tell you what, search engines, I’m going to let you think on it while I grind my development resources to a halt trying to keep you off of certain areas of my company’s site. Let the patent wars begin.
-Love
RSnake
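The "crazy hash in a file" ownership check from the letter could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the use of HMAC-SHA256, and the idea that the engine hands out a token the site owner publishes in a file. No search engine is known to work exactly this way, and fetching the file over HTTP is left out; the fetched body is simply passed in as a string.

```python
import hashlib
import hmac


def issue_token(engine_secret: bytes, site: str) -> str:
    """Hypothetical: the engine derives a per-site token from its own secret."""
    return hmac.new(engine_secret, site.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_site(engine_secret: bytes, site: str, fetched_body: str) -> bool:
    """Hypothetical: the engine fetches the proof file from the site and
    checks that its contents match the token it issued for that site."""
    expected = issue_token(engine_secret, site)
    # Compare as bytes in constant time; strip the trailing newline a file usually has.
    return hmac.compare_digest(
        expected.encode("utf-8"), fetched_body.strip().encode("utf-8")
    )
```

Only after `verify_site` succeeds would the engine accept an uploaded exclusion file for that site, so one person could publish the proof file once and push rule changes afterwards without touching the site's release cycle.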



December 13th, 2006 at 4:07 pm
While opting in as opposed to opting out is way more friendly and prevents crawlers from finding things they shouldn’t, that won’t happen as, off the top of my head, 80% of sites don’t even have a robots.txt file. If the major search engines suddenly decide that users have to write what they do want searched into their robots file, then growth of the engines’ DBs will seriously halt.
December 13th, 2006 at 4:28 pm
also.. there’s a lot of search engines, not just google/yahoo.. so every change you want to make.. you’d have to upload to a dozen search engines?
Unless they got together and created a central robots registry..which is unlikely. So i think web admins will have better luck just convincing their bosses to allow out-of-cycle robots.txt updating.
December 13th, 2006 at 4:43 pm
WhiteAcid, I’m really not talking about 100% of sites. I’m talking about mega huge sites. Also, I think you mis-read what I wrote. I’m not talking about re-building sitemaps, I’m talking about writing a file to opt OUT. The inverse of sitemaps.
Maluc, yes, there are tons and tons and tons more search engines. I wasn’t writing this to Google or Yahoo specifically. I was writing it to all search engines. I’d happily get an intern to log into 100 places rather than take 10 developers out of development for a week so they can fix something. But mainly, as you said, those two (plus MSNbot and AskJeeves) account for the vast majority of sites I would want to block anyway, based on traffic.
If you had to work on huge websites all day, I think you would share my pain.
December 13th, 2006 at 4:54 pm
Also, I’m not just talking about robots.txt. Robots.txt only tells the spider where not to crawl. Sometimes I want to allow it to crawl but I don’t want it to index. Or sometimes I just don’t want to pass link love from my pages to other pages. None of that can be controlled through a single file the way it is built today. So it would have to be a modified version of robots.txt.
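To make the crawl-only limitation concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleBot`, and the example.com URLs are made up for illustration. The only question the parser can answer is "may this agent fetch this URL?"; there is nothing in the file format it reads that says "crawl but don't index" or "don't pass link credit".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# robots.txt answers exactly one question: is fetching this URL allowed?
crawl_private = parser.can_fetch("ExampleBot", "http://example.com/private/report.html")
crawl_public = parser.can_fetch("ExampleBot", "http://example.com/public/index.html")
```

Index and follow decisions have to travel some other way, such as meta tags or rel attributes in each page, which is why a single uploaded file would need a richer format than today's robots.txt.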
December 13th, 2006 at 5:02 pm
heh, probably so.. but it seems backwards to get upset at spiders for finding data that’s publicly accessible. If a spider can find it.. so can evil overlords.
but alas, i don’t have to mess with robots files personally .. so i’m just on the outside looking in - take it with a grain of salt. ^^”
December 13th, 2006 at 5:56 pm
Ah! No, this isn’t about security. It’s about search engines not taking copyrighted content. It hurts brand recognition. Forget security. Think search engines stealing/storing/caching content. But beyond that sometimes you don’t want search engines spidering because they ruin things (modify server settings, change what the logs look like, etc). It would be nice to be able to shut them off at the source rapidly instead of waiting weeks to be able to deploy something to stop them.
December 13th, 2006 at 6:55 pm
ah, i completely misunderstood then
..and not something i mess with so i guess it’s more of a problem than i assumed.
another solution might be to make those local robots.txt files fully functional for all the search engines’ options. (would still rely on out-of-cycle updating)
December 13th, 2006 at 7:45 pm
that face was meant to be : X
not angryman =.=
December 13th, 2006 at 8:22 pm
SE’s are really automated bindshells on steroids.
i mean how the hell do they find all those files and maps?!1!1
They clearly do: when i go dorking on Google, i find links to /etc/passwd, etc. From my viewpoint they are snooping around on servers, which should be forbidden.
hence, i don’t need Google.
December 14th, 2006 at 2:15 pm
RSnake, if you can’t get approval to upload a robots.txt file what makes you think you’ll get permission to send it to search engines? The problem here isn’t with the search engine companies, it’s with your internal bureaucracy. I agree with Maluc that it seems more efficient to copy the file to one webserver rather than to 100 search engines. I can always find better ways to torture an intern. Better to try to shortcut the process. If you don’t do any harm the suits will probably never notice.
Jungsonn - A spider should be able to go anywhere links take it. If people don’t get their security right the bad guys will eventually find that out themselves (they have their own spiders). I’ve caught people trying to exclude sensitive data through a robots.txt file. That’s just dumb. Ok maybe Google won’t broadcast it to the world but you’ve just told everyone looking at your (freely available) robots.txt file exactly where to find the good stuff. That is better than a Google search (for the bad guy) since you don’t even have to know the content to find it. No one should be using a robots.txt file for any kind of security.
My problem with robots.txt files is with the search engines not agreeing on standards. One allows wildcards, one doesn’t (etc.). It becomes a real mess trying to customize it for each one. But if that is the biggest problem I face today it’ll be a fine day indeed.
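The wildcard disagreement can be illustrated with two toy matchers. This is only a sketch: `prefix_match` and `wildcard_match` are hypothetical helpers, not any engine's real code, and the wildcard syntax assumed here (`*` for any run of characters, a trailing `$` to anchor the end of the path) is one common extension, not something every crawler honors.

```python
import re


def prefix_match(rule: str, path: str) -> bool:
    """Crawler A (hypothetical): rules are literal prefixes, '*' means nothing special."""
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Crawler B (hypothetical): '*' matches any characters, trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with '.*' where the rule had '*'.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        pattern += r"\Z"
    return re.match(pattern, path) is not None


# The same Disallow rule blocks a PDF for one crawler and not the other.
rule = "/*.pdf"
path = "/docs/report.pdf"
blocked_by_a = prefix_match(rule, path)    # literal prefix "/*.pdf" does not occur
blocked_by_b = wildcard_match(rule, path)  # "*" expands to "docs/report"
```

A plain rule such as `/private/` behaves the same under both matchers, which is why a lowest-common-denominator file tends to stick to simple prefixes.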
December 14th, 2006 at 5:16 pm
The approval part is easy. It’s the weeks of dev/QA that just aren’t working for me. Believe me, it’s nowhere NEAR as efficient because I’m not just talking about robots.txt. I’m also talking about changing meta tags (nofollow/noindex/noodp). Adding rel=nofollow. And whatever other foolishness the search engines ask us to hack into the HTML of our pages.
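For a sense of what those in-page directives look like from the crawler's side, here is a rough sketch that pulls the robots meta tag values and rel=nofollow links out of a page with Python's standard `html.parser`. The class and function names are made up for illustration, and real crawlers surely handle many more cases; the point is only that each of these hints lives inside the HTML of every affected page, so changing one means a code change and a release.

```python
from html.parser import HTMLParser


class RobotsDirectiveParser(HTMLParser):
    """Collects <meta name="robots"> directives and rel=nofollow link targets."""

    def __init__(self):
        super().__init__()
        self.meta_directives = set()
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            for token in content.split(","):
                if token.strip():
                    self.meta_directives.add(token.strip().lower())
        elif tag == "a":
            rel_values = (attr.get("rel") or "").lower().split()
            if "nofollow" in rel_values and attr.get("href"):
                self.nofollow_links.append(attr["href"])


def page_directives(html: str):
    """Return (set of meta robots directives, list of nofollow hrefs)."""
    parser = RobotsDirectiveParser()
    parser.feed(html)
    parser.close()
    return parser.meta_directives, parser.nofollow_links
```

For example, a page with `<meta name="robots" content="noindex, noodp">` and one `<a rel="nofollow" href="...">` would yield those two directives and that one link.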