web application security lab

Email Obfuscation and Spam Robots

I’ve long been interested in spam and robots that scrape for email addresses. I’ve done tons of work in the space, although I’ve never published any of it. Call it more of a side hobby than anything I really want to go public with - as it is with a lot of my research. But anyway, today I was messing around with search engines and I found myself typing “at gmail dot com” into them for no apparent reason and poof, out popped a ton of valid, although obfuscated, email addresses. Aside from the raw text, here’s a sampling of the different types:

…<at>gmail<dot>com
…(at) gmail (dot) com
… at-gmail-dot-com
… {at} {gmail} {dot} {com}
… [at] gmail [dot] com
… “at” gmail “dot” com
… at-gmail-dot-com-for.info
etc…

I think it would be interesting to create a generic algorithm for de-obfuscating email addresses of this nature. I’m sure it can be done to some degree, but some get more complicated, and I’m sure once you add in the variants of the username it gets even more complex. Even if you could get only 80% that would still be quite a feat. Still though, I have a feeling it wouldn’t take much effort to create a robot that made quick work of all those obfuscated email addresses. Of course, the benefit to a spammer in spamming people who proactively try to protect themselves from spam is questionable, but it’s still interesting.
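
As a rough illustration of how little effort the simple cases take, here is a minimal Python sketch that normalizes the sample patterns listed above. The names `deobfuscate`, `AT`, `DOT`, `UNWRAP`, and `EMAIL` are hypothetical, the list of wrapper characters is an assumption drawn only from those samples, and usernames with their own obfuscation are not handled.

```python
import re

# Characters that wrap "at"/"dot" in the samples above (an assumed, incomplete list).
_OPEN = r'[\[({<"“]'
_CLOSE = r'[\])}>"”]'

def _token(word):
    # Matches the word either wrapped in delimiters ("[at]", " (dot) ")
    # or bare but set off by whitespace/hyphens (" at ", "-dot-").
    return re.compile(
        rf"\s*{_OPEN}\s*{word}\s*{_CLOSE}\s*|[\s-]+{word}[\s-]+",
        re.IGNORECASE,
    )

AT = _token("at")
DOT = _token("dot")
# Strips leftover wrappers around single words, e.g. "{gmail}" -> "gmail".
UNWRAP = re.compile(r"[\[({<]\s*([\w-]+)\s*[\])}>]")
# Deliberately loose address pattern; not a full RFC 5322 matcher.
EMAIL = re.compile(r"[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}")

def deobfuscate(text):
    """Return candidate addresses found after normalizing "at"/"dot" tokens."""
    text = DOT.sub(".", text)
    text = AT.sub("@", text)
    text = UNWRAP.sub(r"\1", text)
    return EMAIL.findall(text)

if __name__ == "__main__":
    for sample in ("joe<at>gmail<dot>com",
                   "joe (at) gmail (dot) com",
                   "joe at-gmail-dot-com",
                   "joe {at} {gmail} {dot} {com}"):
        print(sample, "->", deobfuscate(sample))
```

A sketch like this trades precision for recall: ordinary prose such as “look at this dot matrix” also comes out as an address (look@this.matrix), so a real scraper would presumably filter candidates against known domains.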

22 Responses to “Email Obfuscation and Spam Robots”

  1. LTaub Says:

    See here: http://xkcd.com/208/

  2. fuzion Says:

    about 222,000,000 results for “* at * dot com”

    Here’s an effective js decoder:
    http://jasonpriem.com/obfuscation-decoder/

  3. RSnake Says:

    LTaub - funny! I have a feeling it’s more than just a single regex, but I could be wrong. I bet there are some PRE experts out there who could prove me wrong.

  4. Acidus Says:

    Mark Pilgrim has a great essay on this from 2002. In fact, the whole “The Club” security solutions vs. “Lo-jack solutions” is something that has stuck with me for years since reading it. Gold star without a doubt.

    http://diveintomark.org/archives/2002/10/29/club_vs_lojack_solutions

  5. duryodhan Says:

    I pretty much stopped doing that. It doesn’t really protect you I realised. Now I either write my actual email address (assuming GMail is smart enough) or keep an image.

    Using the stupid obfuscations just seems disrespectful to the spammer’s intelligence.

  6. Trophaeum Says:

    http://mailhide.recaptcha.net/ ftw

  7. Aung Khant Says:

    No matter what patterns we use, spammers are always smart enough to defeat them. From time to time they add the patterns we use to evade them, such as (@), etc.

    Even if a JavaScript obfuscator is a good option, they can simply pick up on such JavaScript patterns.

    Again, to agree with duryodhan, even if we use a spam trap like http://www.owasp.org/index.php/PHP_My_Spam_Fighter, spammers can simply avoid spidering such trap URLs.

    Is there no way, then? I prefer the image solution, in which a server-side script dynamically renders the email text as an image. Is that safe? Not at all. It can also be decoded by OCR or CAPTCHA-defeating tools, since we can’t make the image as complex as a CAPTCHA.

  8. Jabra Says:

    There are obviously many ways to approach this problem. One method if you know the domain you are looking for is to just search using a regex for that domain. (ex: domain.com). From there, you can get the username. A good way to approach finding the username is to look for the first special character before the “@” or “at” . ex:

  9. piR Says:

    Thanks for the club vs lojack article.

    There is also the joeBANANA@example.EATaFRUIT.com.
    On forums, you can use [my nick]@example.com
    On a website, joe@[this domain], but people have to know what a domain is.

  10. Declare.James Says:

    I would love to see this discussion include actual DOM-based deobfuscation of encrypted functions that contain the user’s or company’s email address. I see on a daily basis the move towards providing a website’s email address via the eval(unescape( function.
    It would be nice to have some type of method to provide deobfuscation of email addresses of this nature also.
    It would also be interesting to see if further obfuscation like gmailcom is being used inside these eval(unescape( for enhanced protection.
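
For the simplest form of the eval(unescape( pattern mentioned above, the escaped string can be decoded statically without running any JavaScript. The Python sketch below is only an illustration under stated assumptions: the payload is a single quoted string literal passed directly to unescape, it contains a mailto: link once decoded, and the helper names `js_unescape` and `emails_from_script` are hypothetical. Scripts that build the string in pieces or nest further obfuscation would need a real JavaScript engine.

```python
import re
from urllib.parse import unquote

# Captures the string literal inside unescape('...') or unescape("...").
# Assumes the literal contains no unescaped quote of the same kind.
UNESCAPE_CALL = re.compile(r"""unescape\(\s*(['"])(.*?)\1\s*\)""", re.DOTALL)
# Loose pattern for an address following "mailto:" in the decoded markup.
MAILTO = re.compile(r"mailto:([\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,})")

def js_unescape(s):
    """Approximate JavaScript's unescape(): decode %uXXXX, then %XX escapes."""
    s = re.sub(r"%u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), s)
    # unescape() maps %XX to a single code point, which matches Latin-1 decoding.
    return unquote(s, encoding="latin-1")

def emails_from_script(script):
    """Return addresses found in mailto: links hidden inside unescape() payloads."""
    found = []
    for _quote, payload in UNESCAPE_CALL.findall(script):
        found.extend(MAILTO.findall(js_unescape(payload)))
    return found

if __name__ == "__main__":
    script = ("eval(unescape('%3Ca href=%22mailto:%6A%6F%65%40example"
              "%2Ecom%22%3Email%3C/a%3E'))")
    print(emails_from_script(script))
```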

  11. mRt Says:

    Both sides (spammers & anti-spam*boys) are smart enough to hack some script, and it is like virus/anti-virus, etc. It never ends.
    If you publish (or put on internet) information, then, there is a way to read it.

  12. Wornstrom Says:

    I would be surprised by this point to see a scraper that doesn’t pick up “foo at gmail dot com” or “fooUPPERCASE@gmailMOREUPPERCASE.com” correctly.

    My method is to use Javascript to generate a mailto link, falling back to an image. The image is replaced by the link, so without scripts you still get an address. Until scrapers start downloading and OCRing every image on a page, or executing Javascript, this should be enough - and I can’t see them doing either any time soon.

    On forums or other places where scripts aren’t convenient, I generally just say along the lines of “domain ‘gmail.com’, username same as it is here”.

  13. piR Says:

    Spambots are able to break more and more captchas, so they will be able to read a picture.
    Next step: giving our address in a captcha.

  14. bobthebuilder Says:

    Finally a bit of light-hearted fun on the Rs blog!

    How about putting your email address inside a flash movie with clouds and rain passing over the email address at timely intervals to mess with the OCR and recpature?

  15. bobresp Says:

    not with google sniffing around
    http://www.google.co.uk/#q=gmail.com%20filetype:swf

  16. Picci Says:

    bobthebuilder -> that is quite funny…

    …but with some flash decompiler you could still regex the email out of the generated code… (and the same goes for an image inside the flash animation, which can be grabbed and then OCR’d)
    In the end… the only way is to hide the email behind a captcha like in the link Trophaeum posted.

    A bot will read whatever a human doesn’t have a really hard time reading.

  17. mickeymoose Says:

    Regex to start everybody off…..

    ([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4})

    This will find a standard (properly formatted) address. I will play around to see if I can expand this to detect the obfuscation technique using positive and negative lookahead and lookbehind.

    *scratches head*
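
For anyone who wants to try the pattern above in Python, here is a minimal sketch. It is a variant, not the exact regex as posted: Python's re module rejects `[\w-\.]` as a bad character range, so the hyphen is moved to the end of the class, and the upper bound on the last group is dropped as an assumption so that longer TLDs also match. The function names `find_addresses` and `split_address` are hypothetical.

```python
import re

# Variant of the pattern above: hyphen moved to the end of the first class
# (Python's re treats "\w-\." as an invalid range) and no upper bound on
# the length of the final label (an assumption, to admit longer TLDs).
EMAIL = re.compile(r"([\w.-]+)@((?:\w+\.)+)([a-zA-Z]{2,})")

def find_addresses(text):
    """Return every substring of text that matches the address pattern."""
    return [m.group(0) for m in EMAIL.finditer(text)]

def split_address(addr):
    """Return (username, domain) if addr is a plain address, else None."""
    m = EMAIL.fullmatch(addr)
    return (m.group(1), m.group(2) + m.group(3)) if m else None

if __name__ == "__main__":
    print(find_addresses("contact joe.bloggs@mail.example.museum today"))
    print(split_address("joe@example.com"))
    print(split_address("joe at example dot com"))
```

This only covers properly formatted addresses; the obfuscated forms would still need to be normalized first or matched with extra alternations.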

  18. Anonym Says:

    “Using the stupid obfuscations just seems disrespectful to the spammer’s intelligence.”

    Yeah….. we wouldn’t want to disrespect spammers would we?

  19. Unmensch Says:

    @mickeymoose
    Your regex wouldn’t match mails from the .museum TLD and probably several others, especially in the future.

  20. Herberth Amaral Says:

    Maybe a little bit of Artificial intelligence (like neural networks and fuzzy logic) can do miracles.

  21. Carter Cole Says:

    I don’t know if a neural net would solve the classification problem of whether something is an email or not, but I do think some pretty regex (like everyone else said) would be the right tool for the job.

  22. Herberth Amaral Says:

    There are several ways to build a neural network to identify emails; however, it’s a non-deterministic method and it’s not easy to do some of that.

    For simplicity, I choose regexp :)