web application security lab

Web Server Log Forensics App Wanted

I can’t tell you how many times over the last several years I’ve needed an application that can properly parse and help me inspect web server log files. I’ve searched around, asked friends and colleagues, and found nothing. The best I’ve come up with is a bunch of crappy shell scripts, grep, Splunk, and assorted libraries; a few people mentioned that event correlation systems come close to doing what I want. In the end I just end up manually grepping through files and writing my own custom scripts that I have to re-write over and over again, depending on the situation and the log files themselves. With a million other things on my plate, I’ve never had the time to sit down and code it up properly. Here are the requirements for what I need:


  • Must be able to parse log files in different formats. Lots of web server logs don’t look like other web server logs - even logs from the same web server can differ, depending on how they are formatted and the order in which the variables get logged. IIS logs may be configured to include cookie parameters, for instance. Some logs may not use the same delimiters, and so on. A generic parser that can deal with any log in any format is what needs to be built. I know companies have built these things before, so it’s possible. Yeah, this bullet alone is a bit of a nightmare. (There’s a rough sketch of what I mean right after this list.)
  • The system must be able to take two independent and differently formatted logs and combine them. Often in a forensics case the attacker hit more than one web server in the process of attacking the site. This happens a lot when you’re talking about static content hosted on other sites, or a separate single sign-on authentication server, or whatever. One server might be IIS and the other Apache - so the system would have to be able to combine different log formats and take into account that some logs may not have the same parameters in them; one might be missing query string information or host name or whatever.
  • The system must be able to normalize by time. I can’t tell you how many times I’ve found that one of the sites involved in the forensics case isn’t using NTP and the log file is off by some arbitrary amount of time. This is a huge pain when you’re doing it by hand, let me tell you. Timezones also must be accounted for, where one server is hosted in one timezone and another server in a different one.
  • Log files are big - they can be many gigs per day, and a forensics case can span a month or more. This is where grep gets a lot less convenient and where a database would be a better choice. So the system should be able to handle just about any size of log file data, up to and including a terabyte.
  • It should allow for regular expressions and boolean logic on any parameter. Sometimes I want to check to see if something is a “POST” followed by a “5xx” error as a response code against any of the logs over N days. Or maybe I want to check for anyone who hit any file and got a different size back than everyone else who hit that same file. Or maybe I want to ignore things in certain directories or with certain file extensions, because I know they contain only static content.
  • The system should be able to narrow down to a subset of logical culprits. That is, remove any IP addresses that never submitted a “POST” request or a GET request with a query string.
  • The system should allow for white-lists, to remove things like internal IP addresses, or known robots that weren’t involved but make a lot of suspicious requests (third party scanners and such).
  • The system should also build a probable culprits list that you can pivot against. If you know N IP addresses are suspicious, you should be able to run commands against just those IP addresses, without re-searching all the non-suspicious IP addresses. That way you can gradually narrow down the list further and further so you are only looking at things that interest you.
  • The system should be able to maintain a list of suspicious requests that indicate a potential compromise, like “../” and “=http://” and so on, to quickly narrow down a list of culprits, without having to do a lot of manual searching.
  • The system should decode URL data so that it can be searched more easily. This could be really tricky given how many encoding methods there are out there, but even just URL decoding would be a huge time saver.
  • The software must use the BSD license - so it can be used in any capacity, and modified as necessary. Because the GPL just won’t cut it. :)
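
To make the parsing, time-normalization and decoding bullets concrete, here’s a rough sketch of the kind of thing I mean (the regex, field names and sample line are purely illustrative - this isn’t any existing tool): per log source you’d supply a pattern, a timestamp format and a clock correction, and every line gets flattened into the same kind of record on one UTC timeline.

    import re
    from datetime import datetime, timedelta, timezone
    from urllib.parse import unquote

    # One named-group regex per log source; the group names become the
    # normalized field names. This one handles Apache combined-style lines.
    APACHE_COMBINED = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
        r'(?P<status>\d{3}) (?P<size>\S+)'
    )

    # A starter list of "suspicious request" patterns, checked after decoding.
    SUSPICIOUS = [re.compile(p, re.I) for p in (r"\.\./", r"=https?://", r"union\s+select")]

    def parse_line(line, pattern=APACHE_COMBINED, ts_format="%d/%b/%Y:%H:%M:%S %z",
                   skew=timedelta(0)):
        """Parse one raw line into a dict, normalizing the timestamp to UTC."""
        m = pattern.match(line)
        if not m:
            return None
        entry = m.groupdict()
        entry["time"] = (datetime.strptime(entry["time"], ts_format) + skew
                         ).astimezone(timezone.utc)
        entry["decoded_path"] = unquote(entry["path"])  # catch %2e%2e%2f and friends
        return entry

    def merge(*sources):
        """Combine parsed entries from several differently-formatted logs onto one timeline."""
        return sorted((e for src in sources for e in src if e), key=lambda e: e["time"])

    def looks_suspicious(entry):
        return any(p.search(entry["decoded_path"]) for p in SUSPICIOUS)

    if __name__ == "__main__":
        sample = ('203.0.113.7 - - [01/Feb/2010:13:55:36 -0600] '
                  '"GET /app/..%2f..%2fetc/passwd HTTP/1.1" 404 312')
        entry = parse_line(sample, skew=timedelta(minutes=3))  # this server's clock runs 3 min slow
        print(entry["time"], entry["method"], entry["decoded_path"], looks_suspicious(entry))

Everything else - whitelists, pivoting, culprit lists - would basically be set operations over records like these.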

So yeah, if anyone is just looking to build something extremely useful to guys like me, and feels like making it open source so anyone else can use it, please do! The forensics community could really use something like this. I sure know I’d use it!

42 Responses to “Web Server Log Forensics App Wanted”

  1. Andre Gironda Says:

    apache-scalp.googlecode.com

  2. RSnake Says:

    @Andre - doesn’t meet a lot of the requirements (e.g. it needs to handle more than just Apache logs). But it looks like an interesting tool anyway.

  3. RSnake Says:

    And for those of you who mentioned splunk, it doesn’t pivot, doesn’t have whitelists, etc… and yes, it doesn’t use the BSD license. Tried it, it’s okay, but not quite there.

  4. D0mi Says:

    You have described a tool I know called Splunk: open source based, closed source engine… ugh!

  5. mial Says:

    Hi there,
    Why not OSSEC?

  6. RSnake Says:

    @D0mi - See the message above yours.

    @mial - I need a forensics tool. Typically by the time I get there, the attack has already happened. IDS tools run in real time, not after the fact. I need something that runs retroactively, on whatever setup the company has already built.

  7. Andrew Barnes Says:

    G’day,

    Sawmill might be of use –> http://www.sawmill.net - it handles 800 log formats.

    A known limitation (which I wish they would address) is the inability to overlay multiple log formats. However, I’ve successfully traced performance and other issues by tracking usage/requests between profiles.

    I hope that this helps - it certainly works for me

    Regards,
    Andy

    P.S. No, I don’t work for them, just a happy customer!

  8. Andrew Barnes Says:

    That’ll teach me - lost a paragraph due to use of a “

  9. Archiloque Says:

    Hi, I’m interested in working on such a thing. I’ve already done some log analysis and a lot of HTTP-related stuff, and the tool might be handy for me in some situations.

    Contact me if you’re interested

  10. jwes Says:

    Splunk does allow pivoting and white-lists the way you describe by allowing you to generate lookup tables of suspicious IPs at search time which you can then easily use as a base for your continued searches.

  11. Dominic Cronin Says:

    Microsoft LogParser comes close on a lot of those criteria. You can do sql-like queries against data from multiple sources, with a plug-in model for both input and output. So assuming you can code a pivot in SQL, and you can put your whitelist in a (text file | xml file | database | whatever ) then you’re rolling. Nicest feature is that you can query for data that happened since the last time you ran your query.

  12. Xavier Says:

    A lot of the features you are mentioning are part of a SIEM… In your case, you don’t need correlation & alerting but normalization and retention
    (normalizing the formats, timestamps, etc…), plus the ability to perform powerful searches.

    Of course, if you need to perform forensics investigations on external sites, you’ll first need to process the logs!

  13. RSnake Says:

    @Andrew Barnes - That’s good for the log format piece, probably, but not much use for the other part. We have one customer that swears by it as well, so I’ll take your word for it. But it doesn’t meet most of the other requirements. It could probably be modified to though…

    @Archiloque - yep - I’m interested. Feel free to talk!

    @jwes - possible… the only problem is I really don’t want to pay to use it. I’d rather it be something I can use on any size logs for any amount of time under any circumstance and modify in any way I want. Splunk is pretty restrictive and I’ve had mixed results with it actually finding what I want. It feels like a colorized grep to me, but I’m definitely not a splunk expert. Either way, perhaps someone can modify splunk to make it work in this way…

    @Dominic Cronin - I haven’t played with that. Does it work with any log format for any web server? The last nice feature you mentioned isn’t actually much use for what I’m talking about, but I agree, most of the time that would be useful.

    @Xavier - Yeah, I’ve looked at a lot of SIEM solutions; most of them aren’t particularly good at the randomness of logs and the slicing and dicing of data in quite the way I have in mind. But I haven’t looked at all of them. If you know of one in particular that’s especially good (and open source/free) I’m extremely interested.

    @all - These are all good ideas… I’m not opposed to using something existing, if it meets the requirements. But preferably it would be free, and most of these things require licenses or don’t meet the requirements, or both.

  14. risk Says:

    i assume you’re charging people for your forensics work.. i’m not sure why you’re opposed to paying for something that would be a key part of your workflow, assuming it was suitably licensed?

  15. X Says:

    hbase, hadoop, thrift and some syslog-ng too.

    Add a simple web interface in a language of your choosing for grep-style queries (python or nodejs). Run it against hadoop, add scripts for custom stuff like grouping etc., and an interface to add and run them. Thrift if you wanna push web logs with full headers for serious analysis (e.g. you need massive auditing all the time; also handy for 0days).

  16. RSnake Says:

    @risk - depends a lot on the cost. SIEMs for instance can be extraordinarily expensive, especially when you’re adding in big databases, etc…

    @X - awesome, now find someone to write it! ;)

  17. mike Says:

    You might want to check out SNARE: it’s GPL’ed and hosted on SourceForge.
    http://www.intersectalliance.com/projects/SnareBackLog/index.html

  18. Joe Says:

    I second the LogParser recommendation.

  19. nathan watson Says:

    It will cost you, but SenSage is a great tool for aggregating, querying, filtering, correcting log data. Some customers have petabyte log stores, hundreds of gigs of log data stored daily. SQL + perl queries. Check it out.

  20. nathan watson Says:

    Prior comment should read ‘correlating’, not ‘correcting’.

  21. Tom T. Says:

    “The system must be able to normalize by time. [snip] …, timezones also must be accounted for, where one server is hosted in one timezone and a different log is hosted in another.”

    Just out of academic curiosity, why on earth aren’t all server logs on UTC (GMT), even if they might also show local time for the convenience of the owner? Just asking …

  22. RSnake Says:

    @mike - It says it doesn’t do any analysis. Maybe that’s an overstatement, but that’s kind of a big part of what I need.

    @nathan watson - yeah, I have a good friend who works over there. Not exactly practical for one-off engagements from a price standpoint. Usually we only need it for a day or two at a time. Also, it’s not really custom suited to this type of analysis. It would take some work to get it set up to do this. Not saying it couldn’t, but we’d have to do some work to customize it.

    @Tom T - Who knows! We have lots of customers who still think Google Analytics or Omniture is a good logging solution and are surprised to find it’s virtually useless for tracking bad guys. In fact, that may be an interesting way to detect sites that are better to attack - find the ones who use client side logging. Generally their logging infrastructures on the back end are the weakest. As far as time, the only thing I can guess is that lots of sites log in the local time zone, and lots of admins just set it to whatever their watch says at that moment. I’ve found servers that are minutes off before.

  23. Archiloque Says:

    Can you send me a mail with some examples for the requirements so I have some data to wrap my design ideas around?

  24. Martin Says:

    For one-offs, LogParser is venerable. Also note that with a tiny amount of scripting you can use it to do SQL-style queries across many servers. However, it is fairly slow, and you will wait forever on a terabyte. Splunk will absolutely do exactly what you want it to do, and up to 500 MB/day is free. If you need more than that and can’t pay (my situation) then read on.

    I wrote something along the lines of what you describe. I am a SIRT lead, so I spend much of my day doing what you might do on forensic engagements. What I wrote/am writing is pretty much what @X described, but hadoop is too slow at indexing for large installations. I wrote a system that has this flow:

    Snare | Syslog-NG (you would skip this part and do a manual import if you’re showing up on an engagement) | Perl | MySQL | Sphinx | web frontend (Perl/Apache | YUI)

    Which yields a Google-inspired interface which allows for searches like this:

    method:GET site:example.com -referer:uid -knowngoodparam -otherparam size>600 srcip>85.255.0.0 srcip

  25. Martin Says:

    …continued

    The output is in a YUI DataTable with options to save/export to PDF/Excel/HTML/CSV and comes complete with a GROUP BY function (provided by Sphinx) so you can pick any numeric attribute to do the group-by on. (Hopefully Shodan over at SphinxSearch.com will add string attribute group-by’s someday). That is critical for searches like “baduriparam” groupby:srcip.

    Right now this system is used for indexing all of the logs for a very large org, to the tune of 15k-20k events per second. It is sharded across a couple of boxes, and searches for arbitrary terms complete in a few milliseconds. What’s missing from your perspective is the last mile of importing files, as the system was designed for live recording. However, that’s a pretty small step. The only real trick there would be in the parsing, which really isn’t that bad once you get into it because you can reuse so much of your parsing code.

    I’d be willing to release the code under GPL if there was sufficient interest. However, I don’t have a lot of time for writing install scripts, etc., so installation would not be for the faint of heart. Very few orgs are big enough to need this much searching horsepower, so I don’t think it would get a lot of Sourceforge downloads. Email me if this sounds interesting.
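
    (Not the actual code behind the system described above - just a rough sketch, with made-up entry fields, of how that kind of search string could be compiled into filters over already-normalized log entries. Numeric comparisons are handled naively; IP-range terms like srcip>85.255.0.0 would need real address handling.)

        import re

        TERM = re.compile(r'^(?P<neg>-)?(?:(?P<field>\w+)(?P<op>[:><])|)(?P<value>.+)$')
        OPS = {":": lambda a, b: b.lower() in str(a).lower(),   # substring match
               ">": lambda a, b: float(a) > float(b),
               "<": lambda a, b: float(a) < float(b)}

        def compile_query(query):
            """Turn e.g. 'method:GET size>600 -index' into a list of predicates."""
            preds = []
            for token in query.split():
                m = TERM.match(token)
                neg, field, op, value = m.group("neg", "field", "op", "value")
                if field and op:
                    def pred(entry, f=field, o=op, v=value):
                        return f in entry and OPS[o](entry[f], v)
                else:
                    def pred(entry, v=value):   # bare term: match against any field
                        return any(v.lower() in str(x).lower() for x in entry.values())
                preds.append((lambda e, p=pred: not p(e)) if neg else pred)
            return preds

        def search(entries, query):
            preds = compile_query(query)
            return [e for e in entries if all(p(e) for p in preds)]

        if __name__ == "__main__":
            entries = [{"method": "GET", "size": "823", "path": "/login?uid=1"},
                       {"method": "POST", "size": "120", "path": "/index.html"}]
            print(search(entries, "method:GET size>600 -index"))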

  26. anonymous Says:

    What about OSSIM?

    http://www.ossim.net

  27. Jérôme Radix Says:

    @tom: not to mention Daylight Saving Time, which depends on the country, and the fact that certain countries like India, or central Australia, aren’t aligned on whole hours like other countries - they have a half-hour offset:

    Logs don’t always show the offset from GMT.

    http://upload.wikimedia.org/wikipedia/commons/0/01/2007-02-20_time_zones.svg

    We can also have the case where the sysadmin, for some obscure reason, has changed the date and time by hand. The tool should be able to adjust the synchronization of logs differently at different points in time.
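
    (A minimal sketch of that kind of correction, not taken from any existing tool: a per-server table of adjustments, each valid from a given point onward, so a log whose clock was changed mid-way can still be mapped onto UTC. The zone and offsets are just examples.)

        from datetime import datetime, timedelta, timezone
        from zoneinfo import ZoneInfo  # Python 3.9+

        # (effective_from_local, source_timezone, manual_skew) - the latest matching row wins.
        CORRECTIONS = [
            (datetime.min, ZoneInfo("Asia/Kolkata"), timedelta(0)),                         # UTC+5:30
            (datetime(2010, 2, 3, 0, 0), ZoneInfo("Asia/Kolkata"), timedelta(minutes=-7)),  # clock reset by hand
        ]

        def to_utc(naive_local):
            """Map a naive local log timestamp onto UTC using the correction table."""
            tz, skew = None, timedelta(0)
            for effective_from, zone, adjustment in CORRECTIONS:
                if naive_local >= effective_from:
                    tz, skew = zone, adjustment
            return (naive_local + skew).replace(tzinfo=tz).astimezone(timezone.utc)

        print(to_utc(datetime(2010, 2, 2, 9, 30)))  # before the manual reset
        print(to_utc(datetime(2010, 2, 4, 9, 30)))  # after it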

  28. AppSec Says:

    @RSnake/@Archiloque

    I’d be interested in providing a helping hand. I’ve been doing security work for the past 6 years, but prior to that spent about 5 years doing web dev and a decade doing development of some kind or another.

    If you don’t mind someone else taking part, then feel free to e-mail.

  29. Andre Gironda Says:

    Why not build this into existing apps using OWASP AppSensor?

    Or OWASP ESAPI?

    If I couldn’t use one or both of the above (or build a similar system), I’d go with mod-security combined with OSSEC.

  30. RSnake Says:

    @Appsec - be my guest! The more the merrier.

    I also got an email from one of the jwall guys - interesting take on re-purposing the mod_security log analyzer by changing the parser.

    @Andre Gironda - Not to say that isn’t needed. But the problem is that it’s typically too late to add it in by the time they call the forensics guys in. By then the attack has already happened. This is for forensics, not for IDS/IPS, which means we’re given whatever the customer already has. Ivan Ristic even tweeted that he wanted something like this.

  31. SimonSays Says:

    Splunk will do exactly what you’re describing very easily. There is a free version that will let you index up to 500MB/day.

    Beyond that, you’d need to buy an enterprise license and I’m pretty sure it’ll be cheaper than building this yourself.

  32. Rebus Says:

    Have you ever tried PyFLAG?
    A little tangled, but powerful.

  33. Ory Segal Says:

    Back in 2002, I wrote an article about web application forensics -

    http://www.cgisecurity.com/lib/WhitePaper_Forensics.pdf

  34. RSnake Says:

    @SimonSays - 500MB won’t even get me one full day’s worth of logs on some sites. It doesn’t really make sense to pay for something I’m going to use for a day or two at a time and then not use until the next random forensics case comes up. I’d much rather whatever it is be completely open/free. And I’ve used splunk a few times and it cannot do what I’m saying without a ton of ground work, and remember each time I get logs they could theoretically look totally different. Configuring splunk each time seems like a big pain. I’d rather use grep at that point, at least then I know what I’m typing isn’t going to go through a pre-processor - potentially messing something up.

    @Rebus - Never even heard of it. I’ll look at it.

    @Ory - thanks, I’ll look at it.

  35. Tom T. Says:

    @ RSnake: “It doesn’t really make sense to pay for something I’m going to use for a day or two at a time and then not use until the next random forensics case comes up.”

    OK, get ready for the next Tom T. brainstorm:

    We already have the concept of “sw as a service”. How about (drum roll)

    “Software as a rental”.

    Think about it. I might need a pressure-sprayer once or twice a year, to clean my driveway. I’m not about to spend hundreds or thousands of dollars to buy one, when I can rent one for $50/day.

    So, let’s say you find your perfect solution, but as said, it’s too expensive to buy for only random days of use. So you rent it from them, either at X dollars/day, or at a percentage of fee earned. Say you earned $10,000.00, and paid them $1,000.00 for providing the sw that allowed you to earn the other $9,000.00 with a lot less time and effort than you would have spent without this tool.

    Of course, I picked those numbers and percentages out of thin air, as I have no idea what the fee structure is, etc. But you get the picture. Does this concept not solve that particular problem (once you find your perfect, but expensive, tool)? … For the percentage deal, you’d have to send them a copy of your contract with your customer, including proof of how much fee paid, but with a full NDA on the part of the sw rental company, of course. There are a lot of other details to work out; I just wanted to put the concept “out there”.

    For the nitpickers who will point out that no one ever actually “sells” or “buys” sw, you only get a “license” to use it: Yes, but that license is generally good for life (yours or the puter’s, LOL.) So a license for a few days’ use, knowing that RSnake and others will be back on numerous, if random, occasions to pay additional fees, lets the company offer this rental license much more cheaply.

    Oh, and in keeping with the concept of silly patents, I hereby patent the idea of renting software. So when you pay the company for using their tool, don’t forget my royalty check as well. ;)

  36. Tom T. Says:

    @ Jérôme Radix: I’m not sure of your intent, but you’re proving my point that all servers should run on UTC, then the admin can also program it to show whatever crazy adjustments his locale or his whim desire.

    Cripes, even my dinky little Win XP Home Edition has auto time-sync, with either MS time server or nist.gov. How hard could that be to add to a server, where it could be far more crucial?

  37. RSnake Says:

    @Tom T - that’s a model that could definitely work. And you know, some shops probably do forensics a lot more than we do, so they’d want an all-you-can-eat sort of license or even a site license. But for companies like us, tools either have to be free, very very low cost, or, like you suggest, rent-able. I don’t hate that idea. Of course, free is better. ;)

  38. Jason Says:

    LogParser can read a list of file formats natively (the ones you care about are probably IIS, IISODBC, IISW3C, NCSA, URLSCAN, and W3C). If you can get it into a CSV or XML file and define the fields, you can parse anything.

    You can create .sql files with complicated transformations on data. You can do sql joins with data.

    You can output your results into a number of useful formats. For example, you can clean, normalize, organize your results into a CSV and pivot to your heart’s content in Excel.
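
    (Not LogParser itself, which is Windows-only and driven by its own SQL dialect - just a rough stand-in for the same idea in Python/SQLite, with invented table and column names: load the normalized entries into a database and do the narrowing-down and pivoting with SQL rather than Excel.)

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("""CREATE TABLE hits (
            ts TEXT, ip TEXT, method TEXT, path TEXT, status INTEGER, size INTEGER)""")
        conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)", [
            ("2010-02-01T13:55:36Z", "203.0.113.7",  "POST", "/admin.php",  500, 312),
            ("2010-02-01T13:55:40Z", "203.0.113.7",  "POST", "/admin.php",  500, 312),
            ("2010-02-01T14:02:11Z", "198.51.100.3", "GET",  "/index.html", 200, 4096),
        ])

        # "Which IPs ever got a 5xx back from a POST, and how often?" - the sort of
        # narrowing-down query that is painful with grep but trivial once it's in SQL.
        for row in conn.execute("""
                SELECT ip, path, COUNT(*) AS hits
                FROM hits
                WHERE method = 'POST' AND status BETWEEN 500 AND 599
                GROUP BY ip, path
                ORDER BY hits DESC"""):
            print(row)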

  39. RSnake Says:

    @Jason - Converting logs into .csv or XML may or may not be a trivial task, depending on the format of the logs. Typically they are Apache or IIS logs, but they can contain any variants, and can switch columns around pretty frequently. Also, it’s not just a matter of parsing - please read the section above about time. Lastly, Excel is great, but not for this. Over a few gigs I’ve found Excel to be pretty unusable; I wouldn’t even bother using it with logs up to 1TB.

  40. Thiago Siqueira Says:

    I am Brazilian and I have responded to hundreds of incidents in shared Linux environments.

    I have a script to programmatically analyze injection, SQL injection and FTP.

    You want the script? If yes, call me on Gtalk.

  41. bronc Says:

    SenSage can do that :)

  42. Anonymous Says:

    I don’t want to sound too harsh, but it looks like you’re searching for something that simply isn’t there. I’m not quite sure why you keep looking for an existing solution.

    Write something yourself - that’s all you’ll find in the end.