web application security lab

Anti-Splog Evasion

I know I’m really going to kick myself for this one, as it will no doubt come back to haunt me, but I’ve been thinking about it for a long time. One of the things blackhat SEO types do is scrape other people’s sites that have original content (such as mine) and then post that content on their own sites as their own, attempting to raise their page rank. Because the search engines aren’t smart enough to know who the original author is, the sploggers end up higher in the rankings.

One of the tactics to evade them is to deliver unique content to them (a one-time token or something of the sort) that allows them to see the content, but if they attempt to replay it, the webmaster can go to their lookup table and see who scraped them. Often you can shut them off at the source, or do something more evil, like I did. But there’s a way around it.
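
To make that concrete, here’s a minimal sketch of the kind of token scheme I’m talking about - the storage, the hidden span, and the function names are purely illustrative, not what I actually run. Every response gets a one-time marker recorded against the requesting IP, so if the text later shows up on a splog, the marker points straight back at whoever scraped it.

    # Minimal sketch of the token idea (illustrative only - the real thing
    # could hide the marker anywhere a scraper is likely to copy along).
    import sqlite3
    import time
    import uuid

    db = sqlite3.connect("scrape_tokens.db")
    db.execute("CREATE TABLE IF NOT EXISTS tokens (token TEXT, ip TEXT, ts REAL)")

    def tag_content(html, client_ip):
        """Serve a per-request copy of the page with a hidden one-time token."""
        token = uuid.uuid4().hex
        db.execute("INSERT INTO tokens VALUES (?, ?, ?)",
                   (token, client_ip, time.time()))
        db.commit()
        # Tuck the token into the markup so a lazy scraper carries it along.
        return html.replace(
            "</p>", '<span style="display:none">%s</span></p>' % token, 1)

    def identify_scraper(stolen_html):
        """Given content found on a splog, look up who it was originally served to."""
        for token, ip, ts in db.execute("SELECT token, ip, ts FROM tokens"):
            if token in stolen_html:
                return ip
        return None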

If you click on the image you can get an idea of the concept. It revolves around using more than one scraper (which is not a new concept - see splog hubs for more details - but in the past that has only been used to hide the real IP address). The difference with this method is that you scrape the same page with more than one scraper and then validate that the responses are the same. If they are, you’re good; if they aren’t (because there is a unique token in the content), the content can either be thrown away or the splogger can attempt to clean it up.
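
The splogger’s side of it might look roughly like this - the proxy addresses are made-up placeholders and the diffing is deliberately naive. Pull the page through two different scrapers, compare the copies, and keep only the lines they agree on, which quietly strips out any per-visitor token.

    # Sketch of the evasion: fetch the same URL through two scrapers and keep
    # only what both copies agree on. Proxy addresses are made-up placeholders.
    import difflib
    import urllib.request

    PROXIES = ["http://203.0.113.10:8080", "http://198.51.100.7:8080"]

    def fetch_via(proxy, url):
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
        return opener.open(url, timeout=30).read().decode("utf-8", "replace")

    def scrape_clean(url):
        a, b = [fetch_via(p, url) for p in PROXIES]
        if a == b:
            return a  # identical responses: no per-visitor content to worry about
        # The copies differ, so keep only the lines both agree on; whatever is
        # left over is where a unique token would be hiding.
        lines_a, lines_b = a.splitlines(), b.splitlines()
        matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
        common = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                common.extend(lines_a[i1:i2])
        return "\n".join(common)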

This would make it much harder for sites to protect themselves from sploggers attempting to steal copyrighted materials. So why am I writing this? Because I still have a few tricks up my sleeve to stop sploggers, but I thought it should at least be known that there are ways around some of the more obvious protection mechanisms.

8 Responses to “Anti-Splog Evasion”

  1. jasonk Says:

    Pagerank is a function of links pointing to a page, not of the content itself. But of course if you can get new content indexed first, chances are you’ll get the links first. Sometimes.

  2. Jordan Says:

    Sounds remarkably similar to the issues with the Sveasoft firmware images. In that case, though, the number of places where changes could be introduced was fairly large. He made normalizing the images much more difficult (though obviously still doable). There are a lot of different ways you could do it, including using snow (http://www.darkside.com.au/snow/), or creating a synonym function that maps the incoming IP to a scheme for swapping different words in the article with different synonyms. Just some random thoughts, I’m sure there are many others.

    Also, it seems like the easiest way around that sort of thing would be to use a literal screen scraper that used OCR to snapshot the page content and then rescan it. Presumably you’re getting what the user actually looked at, which would allow you to evade most of the behind-the-scenes fingerprinting mechanisms.

    Incidentally, I’ve been thinking lately about combining Mr. T with a fuzzy-fingerprinting algorithm to uniquely identify visitors. Of course, I’ve got no time to actually implement it, so I’ll just throw the idea out there and see if anybody runs with it. It’s a little overkill compared to some of the other, simpler mechanisms for tracking users without cookies, but it would potentially be much more robust. Then again, who knows, maybe commercial website tracking companies already do something like that.
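
A toy version of the synonym idea in the comment above might look like the following - the word list and the hashing scheme are invented purely for illustration. The visitor’s IP deterministically picks which synonym each marked word gets, so every copy reads fine, but the pattern of choices acts as a watermark.

    # Toy synonym fingerprint: the client IP picks the synonyms, so the
    # sequence of choices watermarks the copy served to that visitor.
    import hashlib

    SYNONYMS = {
        "big": ["big", "large", "sizable"],
        "fast": ["fast", "quick", "rapid"],
        "evil": ["evil", "nefarious", "malicious"],
    }

    def fingerprint_text(text, client_ip):
        digest = hashlib.sha256(client_ip.encode()).digest()
        out, i = [], 0
        for word in text.split(" "):
            choices = SYNONYMS.get(word.lower())
            if choices:
                out.append(choices[digest[i % len(digest)] % len(choices)])
                i += 1
            else:
                out.append(word)
        return " ".join(out)

    # Comparing a stolen copy against the variant generated for each logged IP
    # reveals which visitor the content was originally served to.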

  3. phaithful Says:

    I guess there would have to be a caveat: if the spammer is scraping from an RSS feed, the victim’s feed would have to be set to “summary or snippet” as opposed to “full text”. Otherwise, if I were a spammer, I’d just scrape from my reader and bypass having to run 2 servers and a diff.

  4. RSnake Says:

    Actually, that doesn’t really matter much. If I know Google is the source of someone scraping me, I can just turn it off for Google. Pretty simple actually, and I’d have no problem doing that. But I probably care less about Google users than most people. :)

  5. phaithful Says:

    Ha! I like how you assumed the “reader” I was referring to was Google Reader, as opposed to Bloglines or NewsGator.

    Hmm, maybe not an assumption, but probably a detection script? ;) Haha, in either case, I am using Google Reader. But for all practical purposes, if you were to block a popular reader such as Google’s (which I know you have no problem doing), then I guess you wouldn’t be maintaining an RSS feed anyhow. heh!

    In either case, I like the deviant thought of using 2 or more servers and also the scraper proxy. Dead sexy!

  6. RSnake Says:

    Thank you, sir! :) My RSS feed is only there for other people’s convenience. Honestly, I get less traffic as a result of having it, as people can just use their RSS reader instead of visiting my site. I know that’s a tad short sighted, but I have faith that people could rebound in the face of such adversity.

    And if said RSS reader comes to be seen as a spam gateway and people get annoyed by the fact that they can’t get their feeds through it, perhaps the service being used for scraping will be proactive and solve that problem for me. Hey, a guy can dream, can’t he?

  7. Randy Charles Morin Says:

    I think you underestimate what sploggers do.

  8. RSnake Says:

    In what way, Randy?