XSS Annihilation
On more than one occasion people have asked me how to stop XSS. What would the logic actually look like to accurately stop cross-site scripting attacks but still allow some HTML? Well, here’s a small how-to that should help (or hurt, depending on how technical you are):
- Make a temp variable (or array of characters, or stack or however you end up doing it) with the data in it so you don’t modify the original text.
- Take the temp copy and normalize anything starting with &# to the ASCII equivalent characters. That includes both hex and decimal, with and without padding, and with and without semicolons. I’d also recommend normalizing HTML entities (that can be a pain in the ass as you’ll need to do a hash mapping).
- I remove nulls, newlines, carriage returns and tabs completely. (However, I preserve spaces, since a space breaks up JavaScript so that it won’t render, but I do compress multiple spaces into a single space.)
- Then scan it using a Boyer-Moore algorithm or a pre-compiled directed acyclic word graph (for speed, if that’s an issue) for all HTML tags that are allowed or forbidden (depending on which route you want to go).
- If something is forbidden, I’d stop here and reject it. If you want to try to sanitize it, which I completely recommend against, make sure you do a while loop and continue to sanitize until there is nothing left to sanitize, lest you get caught doing something like this. In your while loop, whenever you find something worth stripping, start over at step 2 (normalization) and keep repeating until you have removed all instances of any offending HTML caused by your own filtering. In this case you can dump the temporary array of characters to save memory and use the modified array instead as your output. If it’s unmodified from the original it’s probably not worth keeping around, but if it is modified, the original can be interesting to save for forensics or IDS (intrusion detection system) purposes.
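For the code-minded, here is a rough sketch of the steps above in Python. It is only an illustration under some loud assumptions: the names (ALLOWED_TAGS, normalize, is_acceptable, strip_until_stable) are made up for this example, a plain regex stands in for the Boyer-Moore or word-graph scan, allowed tags may not carry any attributes, and any leftover "<" counts as forbidden, so it errs on the side of rejecting.

```python
import html
import re

# Hypothetical allow-list for this sketch; choose your own.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br"}

# &# references: decimal or hex, any zero padding, semicolon optional.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?")
# Exactly an allowed tag, opening or closing, with no attributes (assumption).
ALLOWED_TAG = re.compile(
    r"</?(?:%s)\s*/?>" % "|".join(sorted(ALLOWED_TAGS)), re.IGNORECASE
)
SCRIPT_TAG = re.compile(r"</?script[^>]*>", re.IGNORECASE)


def _decode_numeric(match):
    hex_digits, dec_digits = match.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        return ""  # not a valid code point, drop it


def normalize(text):
    """Decode references and strip control characters until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        text = NUMERIC_REF.sub(_decode_numeric, text)  # &#60; &#x3C &#0000060
        text = html.unescape(text)                     # named entities (&lt; etc.)
        text = re.sub(r"[\x00\r\n\t]", "", text)       # nulls, CR, LF, tabs
        text = re.sub(r" {2,}", " ", text)             # keep spaces, compress runs
    return text


def is_acceptable(original):
    """Reject (return False) if anything other than an allowed tag remains."""
    candidate = normalize(original)  # a new string; the original is untouched
    remainder = ALLOWED_TAG.sub("", candidate)
    return "<" not in remainder


def strip_until_stable(text):
    """Sanitizing route (not recommended): re-normalize and strip until stable."""
    while True:
        cleaned = SCRIPT_TAG.sub("", normalize(text))
        if cleaned == text:
            return cleaned
        text = cleaned
```

The reject route is just is_acceptable(user_input). The last function is there to show why the while loop matters: a single stripping pass over "<scr<script>ipt>" hands you back a working "<script>", while the loop keeps going until nothing is left to strip.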
The other things to remember that can help are using things like HTTPOnly, and remembering to properly set your encoding to UTF-8 (or whatever encoding method you choose to use) to avoid issues with UTF-7 vectors. I personally don’t like things like phpBB script as that really only obfuscates the issue. Easy as cake, eh?
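As a small illustration of those two extras, here is a sketch using only the Python standard library. The function name response_headers and the cookie name "session" are assumptions for the example, and how you attach the headers depends on whatever server or framework you are using.

```python
from http.cookies import SimpleCookie


def response_headers(session_id):
    """Build headers that set an HttpOnly cookie and declare the charset."""
    cookie = SimpleCookie()
    cookie["session"] = session_id        # hypothetical cookie name
    cookie["session"]["httponly"] = True  # hide the cookie from page scripts
    return [
        # Declaring the encoding explicitly keeps the browser from guessing UTF-7.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Set-Cookie", cookie["session"].OutputString()),
    ]
```

HttpOnly does not stop the injection itself; it only makes the session cookie harder to steal if something slips through.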



June 2nd, 2006 at 12:22 pm
I personally don’t like things like phpBB script as that really only obfuscates the issue.
I assume you mean bbcode? …because I’m not sure what else phpBB does that might be considered obfuscation…
June 2nd, 2006 at 3:57 pm
Yes, thanks, yawnmouth, actually that’s what I meant… sorry, typing ahead of my brain. And by obfuscation I meant simply that it is removing HTML by changing it to something exactly like HTML. It’s obfuscating the problem rather than fixing the real underlying problem outright. People already know HTML, so why force them to learn a new language that needs to be interpreted back into the language that they already know? Silly if you ask me. Silly, and it doesn’t protect you completely anyway.