SES SEO News
I have an insider at SES who has been reporting back some interesting things that came up during the conference. Of particular note were some of the spider topics, which are relevant to the search engine spider mapping I’ve been doing (I haven’t talked about those projects on this blog, so most of you won’t know what I’m talking about, but bear with me). This has a lot of relevance for search engine optimization (SEO), especially for the blackhats.
One point of particular interest was that the search engines are now considering adding some sort of certificate to their engines so you will know which engine is real and which one is fake. People often fake browsers to see what competitors are doing (no, I don’t do any of that on my sites, so don’t waste your time).
This matters for detecting which bots are real and which ones are fake. It could have a major impact on fingerprinting valid browsers, replacing current techniques such as reverse DNS lookups on IPs to see if the result matches the host domain, or User-Agent detection (neither of which I’ve ever felt is particularly great at catching everything with no false positives). It’ll be interesting to see which companies do what. I think it’s a ways off before we see this implemented in any practical way, but it sure will make spamming robots more reliable.
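For anyone who hasn't seen the reverse DNS technique, here is a minimal sketch in Python, assuming the common variant where the reverse lookup is followed by a forward lookup to confirm it. The function name is_verified_spider and the TRUSTED_SUFFIXES list are my own illustrations, not anything announced at SES, and the suffixes are placeholders you would need to check against each engine's own documentation.

```python
import socket

# Assumption: illustrative hostname suffixes only; confirm the real ones
# with each search engine before relying on them.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_spider(ip, suffixes=TRUSTED_SUFFIXES,
                       reverse_lookup=socket.gethostbyaddr,
                       forward_lookup=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a trusted host that resolves back to ip."""
    try:
        # Step 1: reverse DNS, IP address to hostname.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False

    # Step 2: the hostname must end with a trusted suffix. Matching the end
    # (not a substring) rejects names like googlebot.com.evil.example.
    if not hostname.lower().rstrip(".").endswith(tuple(suffixes)):
        return False

    try:
        # Step 3: forward DNS, hostname back to its list of IP addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False

    # Step 4: the original IP must appear in the forward result.
    return ip in addresses
```

The lookup functions are parameters only so the logic can be exercised without touching the network. Even this version inherits the weaknesses above: it says nothing about a spider that crawls from an address with no reverse record, which is one source of false negatives.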
Another interesting thing that came up was that one way attackers break into websites is by reading robots.txt files to see if anything there points to a more useful location to attack. A concept of using IP delivery came up, where you deliver a robots.txt file only to robots from IP addresses that you want stopped. It sorta feels like a chicken-and-egg thing, where you have to know they are a robot before you can tell the robot that you don’t want the robot to do stuff. It also feels pretty exploitable, depending on how it is delivered. For Google, you can use Google’s translation service, or better yet, here is a Google cache of Microsoft’s robots.txt file. Nice try.
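To make the IP delivery idea concrete, here is a rough sketch of how the server side might choose which robots.txt to hand out. Everything named here is hypothetical: SPIDER_NETWORKS uses a documentation-only address range as a stand-in for a real spider IP list, and the /private/ path is made up.

```python
import ipaddress

# Assumption: placeholder range (reserved for documentation), standing in
# for whatever list of spider networks you actually trust.
SPIDER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

# The detailed rules go only to known spiders (hypothetical path).
FULL_ROBOTS = "User-agent: *\nDisallow: /private/\n"

# Everyone else gets a file that reveals nothing about the site layout.
PUBLIC_ROBOTS = "User-agent: *\nDisallow:\n"


def robots_txt_for(client_ip, networks=SPIDER_NETWORKS):
    """Pick the robots.txt body to serve based on the requesting IP."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in networks):
        return FULL_ROBOTS
    return PUBLIC_ROBOTS
```

Note that this does nothing about the cache and translation problem: any request that arrives from inside a trusted range gets the full file, no matter who asked for it.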
Then there was mention of a way to do IP delivery to the spider and give the user a “nocache,noindex” version, so users won’t see what the robots see for the meta descriptions and can’t rank as high, even if they steal every word on the page. Again, exploitable, and obviously so via cache. Google then apparently said it doesn’t penalize people for having “noindex” and “nocache” on their pages. It just happens that both super good guys and super bad guys use them. So it might hurt you in terms of heuristics, but it sure won’t kill you. Sounds like music to spammers’ ears.


