web application security lab

Spam clustering

A few years back I was having a conversation with Ambient Empire (aempirei) about ways to detect interesting information in natural-language text. He started by creating a tool to measure relative intelligence by word length, density and so on. It was an amusing tool, but that’s about it. Later I asked him to write a tool to detect when someone is mad at me, so I can respond quicker (it was intended for disgruntled girlfriends) - still waiting on that one. But then aempirei came up with a way to do spam recognition by clustering messages according to their relative signatures.

It’s an interesting theory that has a lot of practical worth. Humans often attempt to classify information into buckets, so this is a way to visually sort spam variants into buckets that people find easier to digest. But if you were to take this one step further, you could classify any kind of malicious behavior and correlate it to a certain type of user, or even to one particular user.
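To make the bucket idea concrete, here is a minimal sketch in Python. It is not aempirei's tool, and I don't know what signature he actually uses; this assumes a signature is just the set of character 4-grams in a message and that two messages belong together when their Jaccard similarity passes a threshold. The names `shingles`, `jaccard` and `cluster`, the 0.5 threshold and the sample messages are all made up for illustration.

```python
def shingles(text, n=4):
    """Return the set of lowercase character n-grams (the assumed 'signature')."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(messages, threshold=0.5):
    """Greedy single-pass clustering: a message joins the first bucket whose
    first member is similar enough, otherwise it starts a new bucket."""
    buckets = []  # list of (signature of first member, [messages])
    for msg in messages:
        sig = shingles(msg)
        for first_sig, members in buckets:
            if jaccard(sig, first_sig) >= threshold:
                members.append(msg)
                break
        else:
            buckets.append((sig, [msg]))
    return [members for _, members in buckets]


# Made-up sample messages: two variants each of two spam campaigns.
spam = [
    "Cheap meds online, no prescription needed!",
    "Cheap meds online - no prescription needed!!",
    "You have won the lottery, claim your prize now",
    "You have won the lottery! Claim your prize today",
]

for bucket in cluster(spam):
    print(len(bucket), "variant(s) like:", bucket[0])
```

A greedy pass like this is order dependent and crude, but it is enough to show variants of the same campaign landing in the same bucket without anyone writing a rule for them.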

Several years back I was working on event correlation systems (or, as Gartner likes to call it, security information management). One of the interesting things we could do was detect two disparate events, like a change to a file on a system reported by Tripwire or a HIDS, and tie that to a normal router event. The Tripwire event might be able to tell you who changed the file (probably the administrator account, and they probably cleaned up the logs to remove their IP address, so that’s not particularly helpful), but the router can tell you that an IP address touched the system at the exact second that the file changed. That way you get a lot more value out of both of those tools than you would from either one in a standalone environment.
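The join itself is simple once both event streams are parsed. Here is a rough sketch in Python, assuming the events have already been turned into dictionaries with a `time` field; the field names, the `correlate` function, the one-second window and the sample events are all hypothetical, and real Tripwire/HIDS and router logs would each need their own parser.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed events (made-up values for illustration).
file_events = [
    {"time": datetime(2007, 3, 1, 2, 14, 7), "path": "/etc/passwd", "user": "root"},
]
router_events = [
    {"time": datetime(2007, 3, 1, 2, 14, 7), "src_ip": "203.0.113.45"},
    {"time": datetime(2007, 3, 1, 9, 30, 0), "src_ip": "198.51.100.7"},
]


def correlate(file_events, router_events, window=timedelta(seconds=1)):
    """Pair each file change with every router event seen within `window` of it."""
    matches = []
    for change in file_events:
        for conn in router_events:
            if abs(change["time"] - conn["time"]) <= window:
                matches.append((change["path"], change["user"], conn["src_ip"]))
    return matches


print(correlate(file_events, router_events))
# [('/etc/passwd', 'root', '203.0.113.45')]
```

The file integrity event only knows "root changed /etc/passwd"; the router event only knows "203.0.113.45 connected". Joined on time, you get a candidate source for the change. On a busy network a one-second window will return plenty of innocent connections too, so treat the result as a lead and not as proof.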

In the same way, this sort of spam clustering could have all sorts of value. Think about an environment where you have a clustering technology that attempts to classify events into buckets for eventual digestion by an event correlation tool. You could get a lot more value out of the information because it would be directly relevant - “This XSS attack looks a lot like this other connection that we’ve never seen before - although we have no signature for it”. Interesting concept, anyway. It’s more along the lines of advanced anomaly detection, and it’s already being worked on, but I like the way that aempirei digests the information.
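As a rough illustration of "looks a lot like", here is a small Python sketch that scores an unseen request against a few previously bucketed samples using `difflib.SequenceMatcher` from the standard library. The buckets, the samples, the `closest_bucket` function and the 0.6 cutoff are all assumptions for the example; a real system would need far better features than raw string similarity.

```python
from difflib import SequenceMatcher

# Hypothetical buckets of previously seen payloads (made up for illustration).
known = {
    "xss": ["<script>alert(1)</script>", "<img src=x onerror=alert(1)>"],
    "sqli": ["' OR 1=1 --", "1 UNION SELECT username, password FROM users"],
}


def closest_bucket(event, known, min_score=0.6):
    """Return (label, score) for the most similar known sample, or
    (None, score) when nothing is similar enough to be worth flagging."""
    best_label, best_score = None, 0.0
    for label, samples in known.items():
        for sample in samples:
            score = SequenceMatcher(None, event.lower(), sample.lower()).ratio()
            if score > best_score:
                best_label, best_score = label, score
    if best_score < min_score:
        return None, best_score
    return best_label, best_score


# No exact signature matches this payload, but it resembles a known XSS sample.
label, score = closest_bucket("<ScRiPt>alert(document.cookie)</sCrIpT>", known)
print(label, round(score, 2))
```

The point is that the new payload gets filed next to the XSS bucket by resemblance alone, which is the kind of hint an event correlation tool could then weigh against everything else it knows.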
