I can’t tell you how many times over the last several years I’ve needed an application that can properly parse and help me inspect web server log files. I’ve searched around, asked friends and colleagues, and found nothing. The best I’ve come up with is a bunch of crappy shell scripts, grep, Splunk, and libraries, and a few people mentioned that event correlation systems come close to doing what I want. In the end I just end up manually grepping through files and writing my own custom scripts that I have to re-write over and over again, depending on the situation and the log files themselves. Without the time to dedicate to it, and with a million other things on my plate, I’ve never had the opportunity to sit down and code it up. Here are the requirements for what I need:
- Must be able to parse log files in different formats. Lots of web server logs don’t look like other web server logs - even from the same web server, depending on how they are formatted and the order that the variables get logged. IIS logs may intentionally add in cookie parameters. Some logs may not use the same delimiters and so on. A generic parser that can deal with any log in any format is what needs to be built. I know companies have built these things before, so it’s possible. Yeah, this bullet alone is a bit of a nightmare.
- The system must be able to take two independent and differently formatted logs and combine them. Oftentimes in a forensics case the attacker hit more than one web server in the process of attacking the site. This happens a lot when you’re talking about static content hosted on other sites, or a separate single sign-on authentication server, or whatever. One server might be IIS and the other Apache - so the system would have to be able to combine different log formats and take into account that some logs may not have the same parameters in them; one might be missing query string information or host name or whatever.
- The system must be able to normalize by time. I can’t tell you how many times I’ve found that one of the sites involved in the forensics case isn’t using NTP and the log file is off by some arbitrary amount of time. This is a huge pain when you’re doing this by hand, let me tell you. Anyway, timezones also must be accounted for, where one server is hosted in one timezone and a different log comes from another.
- Log files are big - they can be many gigs per day, and a forensics case can span a month or more. This is where grep gets a lot less convenient and where a database would be a better choice. So the system should be able to handle just about any size of log file data, up to and including a terabyte.
- It should allow for regular expressions and boolean logic on any parameter. Sometimes I want to check to see if something is a “POST” followed by a “5xx” error as a response code against any of the logs over N days. Or maybe I want to check for anyone who hit any file and got a different size back than everyone else who hit that same file. Or maybe I want to ignore things in certain directories or with certain file extensions, because I know that contains only static content.
- The system should be able to narrow down to a subset of logical culprits. That is, remove any IP addresses that never submitted a “POST” request or a GET request with a query string.
- The system should allow for white-lists, to remove things like internal IP addresses, or known robots that weren’t involved but make a lot of suspicious requests (third party scanners and such).
- The system should also build a probable culprits list that you can pivot against. If you know N IP addresses are suspicious, you should be able to run commands against just those IP addresses, without re-searching all the non-suspicious IP addresses. That way you can gradually narrow down the list further and further so you are only looking at things that interest you.
- The system should be able to maintain a list of suspicious requests that indicate a potential compromise, like “../” and “=http://” and so on, to quickly narrow down a list of culprits, without having to do a lot of manual searching.
- The system should decode URL data so that it can be searched easier. This could be really tricky given how many encoding methods there are out there, but even just URL decoding would be a huge time saver.
- The software must use the BSD license - so it can be used in any capacity, and modified as necessary. Because the GPL just won’t cut it.
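To make the first two requirements concrete, here’s a minimal sketch of what a generic, format-driven parser plus a log merger might look like in Python. The format specs, field names, and sample lines are all invented for illustration - a real tool would need quoting rules, header detection, and much more:

```python
import csv
from io import StringIO

# A "format spec" names the fields in the order they appear, plus the delimiter.
# Any log whose lines are delimiter-separated values can be described this way,
# so one parser handles Apache, IIS, or anything else with a new spec.
APACHE_SPEC = {"delimiter": " ", "fields": ["ip", "timestamp", "method", "path", "status"]}
IIS_SPEC    = {"delimiter": ",", "fields": ["timestamp", "ip", "method", "path", "status", "cookie"]}

def parse_log(text, spec):
    """Parse delimiter-separated lines into dicts keyed by field name."""
    reader = csv.reader(StringIO(text), delimiter=spec["delimiter"])
    return [dict(zip(spec["fields"], row)) for row in reader if row]

def merge_logs(*parsed_logs):
    """Combine differently shaped records; fields one log lacks become None."""
    all_fields = sorted({f for log in parsed_logs for rec in log for f in rec})
    return [{f: rec.get(f) for f in all_fields} for log in parsed_logs for rec in log]

apache = parse_log("10.0.0.1 2024-01-01T00:00:00 GET /index.html 200", APACHE_SPEC)
iis = parse_log("2024-01-01T00:00:05,10.0.0.2,POST,/login.asp,500,session=abc", IIS_SPEC)
combined = merge_logs(apache, iis)  # Apache records simply carry cookie=None
```

The key design point is that the parser is data-driven: adding a new log format means writing a new spec, not new code.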
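For the time-normalization and scale requirements, a rough sketch of the approach, assuming ISO-style timestamps and using SQLite as the backing store (the table layout and sample values are made up):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def normalize_time(ts_str, clock_skew=timedelta(0), tz=timezone.utc):
    """Parse a local timestamp, subtract a known clock skew, convert to UTC."""
    ts = datetime.fromisoformat(ts_str).replace(tzinfo=tz)
    return (ts - clock_skew).astimezone(timezone.utc)

# Example: this server's clock runs 90 seconds fast and it logs in UTC-5,
# so its entries get corrected onto the same UTC timeline as everything else.
t = normalize_time("2024-01-01T00:01:30",
                   clock_skew=timedelta(seconds=90),
                   tz=timezone(timedelta(hours=-5)))

# Once logs are normalized, load them into a database instead of grepping flat
# files: indexed queries stay fast even when the raw logs run to many gigabytes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs (ts TEXT, ip TEXT, method TEXT, path TEXT, status INTEGER)")
db.executemany("INSERT INTO logs VALUES (?,?,?,?,?)",
               [(t.isoformat(), "10.0.0.1", "POST", "/admin.php", 500)])
rows = db.execute("SELECT ip FROM logs WHERE method='POST' AND status >= 500").fetchall()
```

For terabyte-scale data you’d want an on-disk database with indexes on timestamp and IP rather than `:memory:`, but the load-once, query-many pattern is the same.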
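The filtering requirements - boolean logic over fields, whitelists, a suspicious-pattern list, URL decoding, and pivoting on a culprit list - compose naturally once records are dicts. A toy sketch with invented records and patterns:

```python
import re
from urllib.parse import unquote

# Toy records; the field names are invented for illustration.
records = [
    {"ip": "10.0.0.1", "method": "POST", "path": "/login.php?user=%2e%2e%2f", "status": 500},
    {"ip": "10.0.0.2", "method": "GET",  "path": "/static/logo.png",          "status": 200},
    {"ip": "192.168.1.5", "method": "POST", "path": "/admin.php",             "status": 200},
]

WHITELIST = {"192.168.1.5"}  # internal addresses, known scanners, etc.
SUSPICIOUS = [re.compile(p) for p in (r"\.\./", r"=http://")]  # compromise indicators

def decode(rec):
    """URL-decode the path so encoded attacks ('%2e%2e%2f' -> '../') are searchable."""
    return {**rec, "path": unquote(rec["path"])}

def suspicious(rec):
    return any(p.search(rec["path"]) for p in SUSPICIOUS)

decoded = [decode(r) for r in records]

# Boolean logic over any field: a POST that drew a 5xx response, or a hit on a
# suspicious pattern - minus anything on the whitelist.
culprits = {r["ip"] for r in decoded
            if r["ip"] not in WHITELIST
            and ((r["method"] == "POST" and 500 <= r["status"] < 600) or suspicious(r))}

# Pivot: subsequent queries run against just the culprit IPs, so each pass
# narrows the working set instead of re-searching everything.
pivot = [r for r in decoded if r["ip"] in culprits]
```

Decoding before matching matters: the `../` in the first record only exists after `unquote`, so pattern matching on the raw log line would have missed it.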
So yeah, if anyone is just looking to build something extremely useful to guys like me, and feels like making it open source so anyone else can use it, please do! The forensics community could really use something like this. I sure know I’d use it!