
Google Spamming Us

You know, we get some really odd traffic. Some of it good, some of it not so much. Let’s take a look at some of Google’s traffic since it’s a slow day. If nothing else it’s good for a laugh. First let’s look at Google trying to hack us - XSS style:

66.249.73.40 - - [26/Nov/2007:01:53:58 +0000] "GET /blog/?%22%3E%3Cscript%3Ealert(1)%3C/script%3E HTTP/1.1" 200 55053 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
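
Decoded, that query string is the classic XSS probe - a quick check with Python’s standard library:

$ python3 -c "from urllib.parse import unquote; print(unquote('%22%3E%3Cscript%3Ealert(1)%3C/script%3E'))"
"><script>alert(1)</script>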

Not too bad for a robot. How about some totally inane Apache directory structure stuff that couldn’t possibly work?

66.249.73.40 - - [26/Nov/2007:00:46:03 +0000] "GET /bluehat-spring-2007/?C=S;O=A HTTP/1.1" 200 3681 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Someone needs to figure out how UTF-7 works:

66.249.73.40 - - [26/Nov/2007:02:25:19 +0000] "GET /s.js+ACIAPgA8-/script+AD4-x HTTP/1.1" 302 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
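
Those +ACIAPgA8- runs are UTF-7 - base64-encoded UTF-16 code units between “+” and “-”. Decoding the path shows the usual script-injection probe:

$ python3 -c "print(b'/s.js+ACIAPgA8-/script+AD4-x'.decode('utf-7'))"
/s.js"></script>x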

Oh don’t we love the Google spam? I really am disheartened that it’s this easy to con Google into spamming websites. As if I don’t get enough referrer spam, Google does one better. *sigh*

66.249.73.40 - - [23/Nov/2007:19:11:23 +0000] "GET /weird/popup.html/Buy-NET.html HTTP/1.1" 302 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.40 - - [09/Dec/2007:07:21:51 +0000] "GET /weird/popup.html/Buy-COM.html HTTP/1.1" 302 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.40 - - [11/Dec/2007:05:24:19 +0000] "GET /weird/popup.html/Buy-MEUK.html HTTP/1.1" 302 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.40 - - [14/Dec/2007:17:48:58 +0000] "GET /weird/popup.html/Buy-INFO.html HTTP/1.1" 302 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Google has a lust for the goatse! Cannot get enough of it!!!!! Seriously, Google, I just don’t have goatse on my machine. I promise! Granted, I 302-redirect all 404s to the homepage instead of using a 301, so that’s my bad - but seriously, there is a reason I might want to do that and still not have goatse on my site. I don’t remember ever having it anyway. Time to give up the obsession, Google!
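
For anyone wondering how that happens: one common way to get exactly this behavior - a sketch, not necessarily what this site does - is an ErrorDocument pointing at an absolute URL, which makes Apache answer 404s with a 302 redirect instead of an error page:

ErrorDocument 404 http://ha.ckers.org/

A 301 would tell crawlers to drop the bogus URL for good; a 302 invites them to keep re-checking it, which is exactly the repeat traffic in the log lines below.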

66.249.73.40 - - [30/Nov/2007:01:04:10 +0000] "GET /goatse.html HTTP/1.1" 302 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.40 - - [07/Dec/2007:19:36:57 +0000] "GET /goatse.html HTTP/1.1" 302 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.40 - - [10/Dec/2007:20:17:00 +0000] "GET /goatse.html HTTP/1.1" 302 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.40 - - [19/Dec/2007:22:58:31 +0000] "GET /goatse.html HTTP/1.1" 302 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

More spam, anyone? Let’s see here… Google likes Viagra and goatse. I’m seeing a theme here!

66.249.73.40 - - [26/Nov/2007:04:47:00 +0000] "GET /fierce/?ref=SaglikAlani.Com HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

And the trackbacks… oh Google, please figure out what a trackback is and stop spidering them. I swear, no matter how many bazillion times you look at the trackback pages, you’re still not going to find anything useful there. I double cross my heart and swear to die. This is from Nov 18th-Dec 20th (just over one month):

$ grep 66.249.73.40 error_log |grep -c wp-trackback
938
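
For the record, a robots.txt along these lines would keep compliant crawlers off the trackback URLs (a sketch - it assumes the default WordPress paths, and Googlebot does honor the * wildcard):

User-agent: Googlebot
Disallow: /blog/wp-trackback.php
Disallow: /blog/*/trackback/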

Think how much of the bandwidth Google uses is just completely unnecessary - countless, senseless wastage. I started using Google because it was light on my personal bandwidth; so much for that idea.

26 Responses to “Google Spamming Us”

  1. FiSh Says:

    It’s not nearly as bad as Yahoo; at least Google sends traffic sometimes.

  2. vaanx Says:

    Still struggling with that complicated robots.txt format I see.

  3. 10ha10ha Says:

    Could it be a UA spoofer?

  4. Wladimir Palant Says:

    Not a UA spoofer - this IP address is really Google.
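
    For reference, the way to verify this: Google’s documented method is a reverse DNS lookup on the IP followed by a forward lookup confirming the result. A Python sketch:

    import socket

    ip = "66.249.73.40"
    host = socket.gethostbyaddr(ip)[0]                 # reverse DNS (PTR)
    print(host.endswith(".googlebot.com")              # name in Google's domain...
          and ip in socket.gethostbyname_ex(host)[2])  # ...and it resolves back to the IP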

    Actually, /bluehat-spring-2007/?C=S;O=A is a valid request - it sorts the directory listing by size (mod_autoindex’s column/order query arguments).

  5. Alphane Moon Says:

    I think that somebody links to these URLs. Googlebot will follow inbound links to nonexistent pages and see a 404 or a redirect.
    In the case of a 302 redirect, Google may keep the wrong URL instead of the redirect’s target URL (because the document has only been temporarily moved, not permanently). “302 hijacking” has been a problem for search engines.

    However, you have thousands of inbound links; finding the sites carrying these “hacking links” is like looking for a needle in a haystack.

    That the URLs look like hacking attempts is probably connected to the topic of your blog.

  6. Marco Ramilli Says:

    “Think how much of the bandwidth Google uses is just completely unnecessary - countless, senseless wastage. I started using Google because it was light on my personal bandwidth; so much for that idea.”

    I totally agree with you. We often hear:
    1- there is not enough bandwidth
    2- the current protocols carry too much overhead
    3- computational speed keeps increasing, but network speed not so much
    4- etc., etc.
    And Google, one of the most important web companies - one which used to be really light - wastes resources like that?

    It’s amazing. Thank you very much for the interesting post.

  7. Awesome AnDrEw Says:

    I’ve noticed Google’s bot has been acting very suspiciously lately. I was going to make a topic about it on sla.ckers, but never got around to it. Below are 3 instances where Googlebot filled out an INPUT field with odd text that was not even found within the page.

    66.249.70.196 - - [28/Nov/2007:05:12:11 -0800] "GET /index2.php?content=fusker&txtURL=reverted HTTP/1.1" 200 8530 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    66.249.70.196 - - [28/Nov/2007:05:38:09 -0800] "GET /index2.php?content=fusker&txtURL=dancer HTTP/1.1" 200 8531 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    66.249.70.196 - - [03/Dec/2007:04:12:15 -0800] "GET /index2.php?content=fusker&txtURL=unleash HTTP/1.1" 200 8531 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

  8. anyone Says:

    So it could be a Google employee with a UA spoofer? Some kind of lunch-break/free-time hacker?

  9. Adam Moro Says:

    Never heard of goatse before…yeah, thanks for that right before the holiday feasts! jk obviously. Anyway, honest question. Is this a joke?

  10. Ronald van den Heetkamp Says:

    Bandwidth does not exist. ;)

    If so, tell me what it is and where I can get it.

  11. Faruk Says:

    66.249.73.40 - - [26/Nov/2007:04:47:00 +0000] "GET /fierce/?ref=SaglikAlani.Com HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    SaglikAlani is a Turkish term (“saglik alani” roughly means “health area”).
    But Turkish lamers can’t do it.

    66.249.73.40 - - [26/Nov/2007:01:53:58 +0000] "GET /blog/?%22%3E%3Cscript%3Ealert(1)%3C/script%3E HTTP/1.1" 200 55053 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    LoL. I think Googlebot will become a very big hacker. I hope it will hack microsoft.com.

  12. RSnake Says:

    @vaanx - Ooh! It is the holiday season. I haven’t heard from a Google zealot in a while. It’s a little hard to retroactively write robots.txt files for things that have already happened, but thanks for the witty retort. But even so, why would I choose to miss out on Google issues - it is, after all, part of what this site is all about - web issues. And feel free to explain to me why every webmaster on earth has to code their site to protect it from Google’s abusive behavior instead of them fixing it. I’m alllll ears on that one. Also, in case the point of a “slow day” post was missed completely - it makes for a funny blog post. Feel free to cat Google.post > /dev/null if you lack a sense of humor about Google’s failings. :) There will be more in the New Year, so get your blindfold and earplugs ready, Google fanboys!

    @Wladimir - That would be great, if it were an Apache directory and not a PHP script.

    @Adam - Sorry about that, I probably should have warned you. I thought everyone had seen it by now. And no, this wasn’t a joke - funny, but not a joke.

  13. Ronald van den Heetkamp Says:

    @Awesome AnDrEw

    Yes, Google also tries to utilize search fields on websites to deepen its search. What’s strange is the keywords it uses for this. One time I found my name in a Google-cached version of a page it had crawled via a search query in a search field on another website. I thought about a way to exploit this for ranking purposes, but I stopped investigating it.

  14. bobi Says:

    who has invent netmeeting or net send or net bios java …javascript……html …..c++ c# or…..vb or…!!!!!
    and put them(ms-dos) and put them under windows or
    can easly see any ones works on these earth we are watching

  15. Spyware Says:

    Erh, it’s not -just- Google spamming you; thank god for spambots.

  16. Ha Says:

    Light on personal bandwidth, yet totally the opposite on personal data!

    I’m so sick of Gmail’s invasion-of-privacy ways that I’d be willing to shell out money for a Gmail pay service.

    Damn thing knows me up and down… Fuckers

  17. mario Says:

    Google is the best anyway

  18. Mark Says:

    Hello,

    Read your article and the comments here and I have a question:

    How can you tell with certainty that it is Google behind this? If you are referring to the IP and Referer fields - and headers in general - you know these can be manipulated in many different ways.

    To better explain: assume you are the owner of a site. Then, once a visitor (say John Doe) clicks a button or performs an action that reloads a page of this site, you can set up a server-side script such that - before headers are sent - it will open another page on another site, post some info, or do something. The IP that shows up will be that of John Doe (unless you set up some anonymous proxy server). There will be no trace of the server in between. And you can specify whatever headers you like. At the end you serve the regular page to John Doe.

    Now add some filtering to it: you can detect when a bot like Google visits your site (the intermediate site) before you execute this part of the code for the extra open/post action. So the target sees this info in its server logs and thinks it came from Googlebot.

    I may be missing something; if so, please clarify.
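
    A sketch of the forged-headers half of this (Python; the target URL is hypothetical) - the User-Agent is free text, so any client can claim to be Googlebot, though the source IP the target logs is whoever actually opened the connection:

    import urllib.request

    req = urllib.request.Request(
        "http://example.com/blog/?test",   # hypothetical target
        headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                               "+http://www.google.com/bot.html)"})
    print(urllib.request.urlopen(req).status)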

  19. RSnake Says:

    @Mark - I’m assuming that when you said “you can set up a server-side script such that - before headers are sent - it will open another page on another site” you were talking about header redirection (301/302, etc.). Unfortunately that doesn’t work across domains for Googlebot. Trust me, many have tried. On the same domain, sure, but not from one domain to another.

    There ARE ways to get Googlebot to spam cross domains, but that’s not one of them.

  20. Mark Says:

    Not 301/302 redirects. OK, there are two steps. Doing it manually, say, for simplicity:

    1. Ensure the URL with the query string can be opened with a 200. So if I type:
    www.yoursite321.com/index.php?some_garbage_here=more_garbage
    the page opens normally with a 200 OK header. (Although just making sure you are not getting a 404 is good enough.) So in this case let’s take one of the attempts you posted earlier:

    h**p://ha.ckers.org//bluehat-spring-2007/?C=S;O=A

    2. OK, now the other end has a URL that, although invalid (the site owner thinks so), still gets a good response back from the server. So if someone posts this particular link on a high-PR page (in other words, a page that is regularly indexed by Google), then Google is going to pick up the URL from there and try it. And you will get an IP match for Google.

    So now your logs will show what you mentioned above, but it is not really the search engine’s doing. It is being manipulated.

    In my earlier post I was talking about automation, where someone can make the attempt from the server end: first making sure the page with the GET parameters opens properly, then posting it. Someone could also filter it, as I mentioned earlier, so the link only appears to the search engines and not to regular visitors (and/or not all the time).

    Otherwise it is very hard to understand why a search engine would ever try to get in with such parameters. In my opinion such methods are utilized by various parties to test the weaknesses of sites without risking visibility.

  21. RSnake Says:

    Yes, I agree that your scenario is 99% probable, and therefore Google actually _is_ spamming me, just because they think I have URLs like http://ha.ckers.org/images/kcpimp.jpg?somepornsite.com. Way to go, Google.

    But I’m still not sure I have a clear picture of your comment about “a server-side script such that - before headers are sent - it will open another page on another site.” What you just described is cloaking, but that has nothing to do with Google automatically coming to my site; it’s still the same scenario (no redirection required).

  22. Mark Says:

    Well, practically, a search engine has no way of pre-verifying a link without actually using it. And when it uses it, you will see it in the server logs along with the incorrect parameters.

    I am not sure how else this could be done. If Google, say, attempted to strip the parameters and reach the site by the domain alone (as we would all prefer - indexing secondary pages via internal links), it might not be able to pick up the page content (in a reasonable time anyway, or for other reasons such as some sort of intermediate hosting).

    Sorry I wasn’t clear earlier: the cloaking you mentioned can be used to verify that the other party’s page returns a 200 OK response before automatically setting up the link for public access. In such cases you could check the server logs for the GET parameters to see if a match exists from another IP at an earlier time (apart from Google).

  23. RSnake Says:

    When you think about it, this is a really nice way to proxy RFI attacks - or SQL injection (if the results get cached) - through Google. All you need to do is create the link on a page that Google will spider and poof, Google is your conduit for attacking the third party. You just need to wait long enough for Google to hack on your behalf.

    But the spammer does not have to visit my domain to create those links. In fact, the URLs they picked don’t actually exist; they 302 back to the homepage. So they may or may not have visited the domain before, but it absolutely wasn’t a requirement for them to construct that link for Google to find.

    I still have no idea what any of this has to do with cloaking since it’s not required for the spammer at all and I’m certainly not SEO cloaking.

  24. Mark Says:

    Yes, it is a stealthy method, but it can also cause other side effects depending on how well such a link is crafted and what the real intention is. He may not have to visit the site at all (if the web engine one uses is known and the spammer is certain of getting the 200 response).

    For the particular link mentioned earlier, when I try it, it returns 200, not 302 or 307. Now this is bad for a number of reasons, among them duplicated content, which search engines count on to list sites in a certain order. If, say, two merchants sell similar products via their websites, a search engine will, among other things, promote the one with less duplicate content. So if the first merchant places links that cause duplicate content on the other merchant’s site, he can certainly gain a marketing advantage with search engines. This is another angle to consider when seeing such links.

    With open source blog, CMS, e-commerce, and similar web packages, this is fairly easy to exploit and experiment with (closed source has the same problems; it’s just harder to demonstrate). In many cases an external link like this can propagate its parameters, thanks to the engine’s core code, into other links and into the database. It doesn’t have to be an XSS attempt at all. And in some cases this may not even be intentional.

    Programming languages like PHP, Java, etc. use sessions to distinguish among visitors; the session can be stored in a cookie or passed in the GET parameters. Now, if the code underneath picks up the passed GET parameters and replicates them without validation, you end up with duplicate content. In addition, one may be able to see someone else’s personal details because of the sessions. To do that, one may post a link to a site that includes some session info passed via GET. Depending on how a site is set up, it may accept sessions via GET. So one publishes such a link in forums, blogs, etc. where search engines can pick it up, and later recalls the URL or uses the search engines’ cache to retrieve private information.

    Other variants of such links generate programming errors and warnings from a site that may reveal important information. For instance, passing arrays via GET where scalar parameters are expected is likely to cause an exception in a script which, when not handled, will reveal the site’s location on the server.

    Ideally the site owner should process, at the PHP/JSP/ASP level, only known parameters, after first validating their integrity against the expected data type and content. Unknown ones should be ignored completely (and obviously not replicated into other links on the site). That goes for anything that comes from outside - POST, GET, COOKIE, etc. If the known parameters do not pass the validation phase, a 301 permanent redirect should be issued; 302/307 responses aren’t ignored by spiders, so use those only when a valid link exists. Also, in extreme cases a site may be removed from the search engine’s index just because a link is crafted to include some malicious code and propagates through other links.
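
    In code, the whitelist idea looks something like this (a Python sketch with hypothetical parameter names):

    import re

    ALLOWED = {"p": r"\d+", "page": r"\d+"}   # known parameter -> expected format

    def clean_query(params):
        ok = {}
        for name, pattern in ALLOWED.items():
            value = params.get(name, "")
            if re.fullmatch(pattern, value):
                ok[name] = value
        return ok   # unknown or malformed parameters never reach the application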

    As for cloaking, it may be used to do an initial validation of the incoming attempt. With shared hosting you have a great number of sites operating from the same IP, so such an attempt is very hard to trace back. The same could be done with a cron job, but an asynchronous method driven by users’ clicks is safer against detection. Client-side code is not a good thing for the server to emit for this, as it can be easily detected even with basic browser functionality.

    As for the search engines, I believe they will not store an external link as part of their index of one’s site - that is the logical thing to do. The problem comes when the site’s code does not ignore the passed parameters but replicates them. In that case a search engine will see, and verify on the next page load, duplicate/bad content. Having no knowledge of the engine’s internals, it will record and store these links in the index, and others can then see the content. Depending on the link, the artificial URL may gain enough exposure to be used by visitors via the spider’s cache - in which case a session (although sessions are typically filtered/ignored for spiders, to prevent personalized info being used) can be duplicated, so everyone who uses that particular link from the spider’s cache activates the same session.

    Here is proof of simple duplicated-content generation for this particular page:

    original:
    hxxp://ha.ckers.org/blog/20071220/google-spamming-us/

    assumed external duplicate:
    hxxp://ha.ckers.org/blog/20071220/google-spamming-us/?C=S;O=A

    auto-generated internal:
    hxxp://ha.ckers.org/blog/20071220/google-spamming-us/?C=S;O=A#comment-64068
    or pick up any other link under the date field here to see it.

    And I take this to be the blog engine’s fault for not generating the links properly.

  25. RSnake Says:

    Just an update, we are now getting spammed through Google:

    209.85.138.136 - - [16/Jun/2008:00:41:49 -0500] "POST /blog/wp-comments-post.php HTTP/1.0" 302 - "http://ha.ckers.org/blog/20061207/orkut-email-address-disclosure/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Google Wireless Transcoder;)"

    Spam blog comments coming in through Google now.

  26. Mark Says:

    As far as I know, IPs can be faked. One could change the source IP address of a TCP packet, and you might think a specific party caused this, no?

    Therefore he could send a request of his choice, and if the site/framework does not validate the request properly, it could go through. Like the blog links I mentioned earlier.

    I have one site/store in particular that used to get lots of spam attacks in various forms (contact us, login, etc.), with all kinds of IPs and various headers (like the agent one). I had to change the site’s framework and validate the incoming requests, as I could not trust the server vars - they can be manipulated. So I strengthened the handshaking between server and client so that only those who presented an identifier (e.g. a session via _GET or _COOKIE) stood a chance of using the forms, and only if that information re-validated against internal records stored in the database (in other words, only if the other end had already received a record from the server). The server then checks the client’s request against the database. If no match is found, the system generates a new record and treats the other end as a new visitor (regardless of IP).

    This way blind IP spoofing can be avoided, regardless of the countermeasures the host already has in place and regardless of whether an attacker manages to predict the TCP sequence numbers (which are the key - but let’s assume they’re compromised somehow). He still needs to receive the full responses from the server.

    Basically, if someone tries to post something, I expect a valid record to pre-exist. Form submissions should not be accepted unless a database record already exists.
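
    A minimal sketch of that last rule (Python; an in-memory set stands in for the database):

    import secrets

    issued = set()                        # identifiers the server has handed out

    def new_visitor():
        token = secrets.token_hex(16)     # issue a record/identifier first...
        issued.add(token)
        return token

    def accept_post(token):
        return token in issued            # ...and accept posts only against it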