Wouldn’t it be great to have a mapping of virtually the entire internet, where you could see every hostname -> IP address pairing? Granted, it would have false positives like virtual hosting services, as he says, but come on! Talk about predictive! Sure, a few dozen domains on one IP address may be legitimate, especially for hosting providers, but if I have hundreds of domains that look even vaguely shady, that’s a huge indicator. Even if they aren’t on the same IP but sit within the same class C network (a /24), that could still be highly predictive. IP addresses have come back to haunt us! Everything has to be routable, and if Google has to know where you are to index you, and they have any interest in detecting spamming, of course they’ll do a mapping like this.
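To make the idea concrete, here is a minimal Python sketch of what you could do with such a mapping if you had one. It is only an illustration: the hostnames and addresses are made up (reserved documentation ranges), the function names and the threshold are hypothetical, and "class C" is treated simply as a /24 block.

```python
import ipaddress
from collections import defaultdict

def group_by_slash24(host_to_ip):
    """Group hostnames by the /24 block their IPv4 address falls in."""
    groups = defaultdict(set)
    for host, ip in host_to_ip.items():
        # strict=False lets us pass a host address and get its enclosing /24.
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        groups[network].add(host)
    return groups

def crowded_networks(host_to_ip, threshold):
    """Return only the /24 blocks holding at least `threshold` hostnames."""
    return {
        str(network): sorted(hosts)
        for network, hosts in group_by_slash24(host_to_ip).items()
        if len(hosts) >= threshold
    }

# Made-up sample data, not real measurements.
sample = {
    "cheap-pills.example": "192.0.2.10",
    "cheap-pillz.example": "192.0.2.77",
    "free-ringtones.example": "192.0.2.200",
    "blog.example": "198.51.100.5",
}

# The threshold of 3 is arbitrary; in practice it would be much higher.
suspicious = crowded_networks(sample, threshold=3)
print(suspicious)
```

A crowded block is only a hint, not proof: shared hosting produces exactly the kind of false positive mentioned above.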
I had always wanted to build something like this myself, but a spider like that would take far more horsepower than I’ve got in my rack at home, and a database with some serious space. We’re talking about millions of hostname-to-IP-address mappings. It gets harder because the data has to stay up to date. Six-month-old data is practically worthless when you are talking about spamming domains, which may only stay up for a week or less in some cases.
Then I suddenly remembered a conversation I had a few weeks back with one of my readers, who shall remain nameless for the time being. He asked me a simple question: “How do you find all the cnames on a host?” Cname (or subdomain) spam has its ups and downs in the SEO world, depending, it seems, on the day of the week and on which search engine you’re talking about, but it’s a pain to correlate it all together, no matter how you slice it. Finding cnames is also useful for auditing websites for vulnerabilities, since cnames almost always reside on the same host, or at minimum use the same backend. I thought for a few seconds and came up with a solution. Use the search engine itself! Let’s say I want to find all the cnames on Google. Let’s start with a simple query:
site:google.com -www
That gives us a list of links back, none of which contain “www”. So now I see things like sketchup.google.com and finance.google.com and eval.google.com. So let’s make a note of those and query again:
site:google.com -www -eval -sketchup -finance
Then take what is left from that (which may include things like subdirectories, which you can exclude as well) and exclude those too:
site:google.com -www -eval -sketchup -finance -google.com/answers -google.com/trends -browsersync -desktop -toolbar -earth -picasa -toolbarqueries
And so on… until there is nothing left to search. In this way, you can get all of the cnames of a server with relatively few queries. Of course, Google is a huge site with lots of cnames, so this technique is pretty tedious with them, but with smaller sites you can go through it pretty quickly. This still won’t help you do an IP address to domain name lookup, like what Google has access to, but it does help you do your own investigation of cname-based spam. This technique came in handy for finding some of the other domains on one spammer site, which you may remember from one of my previous posts.
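The bookkeeping for this loop is easy to script. The following is a rough Python sketch, not a finished tool: it assumes you paste in the result URLs from each page by hand (it does not query any search engine), and the function names are my own invention for illustration.

```python
from urllib.parse import urlparse

def exclusions_from_results(domain, result_urls):
    """Collect the terms to exclude next, based on one page of result URLs."""
    terms = []
    for url in result_urls:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        if host.endswith("." + domain):
            # Subdomain hit: exclude its leftmost label, e.g. "finance".
            term = host[: -len(domain) - 1].split(".")[0]
        elif host == domain:
            # Bare-domain hit: exclude the first path segment instead,
            # e.g. "google.com/answers".
            segment = parsed.path.strip("/").split("/")[0]
            term = f"{domain}/{segment}" if segment else ""
        else:
            term = ""
        if term and term not in terms:
            terms.append(term)
    return terms

def build_query(domain, excluded):
    """Build the site: query with one -term per exclusion."""
    return " ".join([f"site:{domain}"] + [f"-{term}" for term in excluded])

# One round of the loop, using the hostnames noted earlier.
excluded = ["www"]
page = [
    "http://eval.google.com/",
    "http://sketchup.google.com/",
    "http://finance.google.com/finance",
]
for term in exclusions_from_results("google.com", page):
    if term not in excluded:
        excluded.append(term)

next_query = build_query("google.com", excluded)
print(next_query)  # site:google.com -www -eval -sketchup -finance
```

Run the printed query, paste the new results back in as the next `page`, and repeat until a round adds no new terms. The final `excluded` list is your list of cnames, plus any subdirectories you filtered out along the way.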
Finding cnames can help isolate spammers, but wouldn’t it be nice if we could somehow get access to all the IP address to hostname maps? There’s got to be a way somehow. Hmmm… I’ll have to think about that one.