web application security lab

Stopping XSS but allowing HTML is Hard

Allowing HTML is the mantra of consumer-friendly web applications. The richer the experience, the more engaged consumers get and the more the site feels like home to them. Over the last few days a really interesting thread has been evolving around an open source HTML cleanser. SirNotAppearingOnThisForum posted a sample script he has been working on for the last few days that attempts to allow HTML but disallow XSS.

After some of the more obvious holes were found, it got more and more interesting (especially once it reached the second page of the thread). The latest hole is probably one of the best examples of why regex, and blacklisting in general, is hard to get right. Not just because HTML is flexible, but because rendering engines are ultra complex:

<IMG src="http://ha.ckers.org/" style"="style="a
/onerror=alert(String.fromCharCode(88,83,83))//" &gt;`&gt

At first blush this doesn’t look like it should work. Firstly, there is a character immediately before the onerror event handler (although this is ignored). But most importantly, the onerror event handler is technically encapsulated by quotes. Yet in Internet Explorer this works. The flexibility of the rendering engines creates a uniquely complex problem that I’ve been talking about for over a year. This is a particularly good example of how something that should clearly be encapsulated still manages to become an XSS vector.
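
The assumption that breaks here can be sketched in a few lines of Python. This is a minimal, hypothetical filter (the name looks_safe and both regexes are illustrative, not the script from the thread): it treats anything between a pair of double quotes as an inert attribute value and only looks for event handlers outside of them. It says nothing about how any browser actually parses the markup; it only shows why a quote-pairing filter waves the vector through.

```python
import re

# The vector from the post, with the line break kept as "\n".
VECTOR = ('<IMG src="http://ha.ckers.org/" style"="style="a\n'
          '/onerror=alert(String.fromCharCode(88,83,83))//" &gt;`&gt')

# Hypothetical blacklist rule: an on* event handler followed by "=".
EVENT_HANDLER = re.compile(r'\bon\w+\s*=', re.IGNORECASE)

def looks_safe(markup):
    """Naive check: blank out every double-quoted span, then look for
    event handlers in whatever is left."""
    outside_quotes = re.sub(r'"[^"]*"', '""', markup)
    return EVENT_HANDLER.search(outside_quotes) is None

print(looks_safe('<img src=x onerror=alert(1)>'))  # False: handler is caught
print(looks_safe(VECTOR))                          # True: handler is "inside quotes"
```

The filter pairs the quotes as "http://ha.ckers.org/", "=" and "a ... //", so the onerror handler lands inside the third pair and is never inspected. A rendering engine that pairs the quotes differently will see it as a live attribute.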

Thanks again to our good old friend Turing and his halting problem, stripping XSS out of HTML is hard. Sure, you could block everything malicious pretty easily, as others have shown, but that’s a very different problem and a much easier one to solve.
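
For contrast, here is a rough sketch of the easier problem, assuming "block everything" is read as escaping all markup rather than trying to keep some of it. It uses Python's standard html.escape; the function name block_all_markup is just an illustrative label.

```python
import html

def block_all_markup(user_input):
    """Escape &, <, > and both quote characters so that nothing the
    user typed can open a tag or close an attribute value."""
    return html.escape(user_input, quote=True)

print(block_all_markup('<b onmouseover="alert(1)">hi</b>'))
# &lt;b onmouseover=&quot;alert(1)&quot;&gt;hi&lt;/b&gt;
```

Nothing survives as HTML, which is exactly why this is easy: there is no decision to make about which tags and attributes are harmless.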

12 Responses to “Stopping XSS but allowing HTML is Hard”

  1. Edward Z. Yang Says:

    Well, the code is definitely far away from the HTML specification: having an attribute name and then a quote directly adjacent to it is expressly forbidden by SGML. When a parser encounters a construct like that, you cannot expect any well-defined behavior: the computer may blow up for all you know.

    No browser engineer thought that it would be cool to support something like that, it’s because they didn’t properly consider all the edge cases. It’s futile to try to guard against these oversights: just stay in known territory.

    This is why XML’s syntax is strict: it’s to prevent the proliferation of HTML that actually *needs* the above-mentioned parser quirks.

  2. Rodney Says:

    What would be the downside of introducing a new tag that prevents any JavaScript from executing between the outermost start and end tags? It seems to me that this sort of thing should be pretty easy for the browsers to implement but would go a long way towards preventing XSS vectors if used properly by web software developers.

  3. Spikeman Says:

    In my opinion this is why you should use a whitelist. On my website I use a sort of bbcode, [b]bold[/b] for example, and escape angle brackets. I’m sure you could use regex the same way to escape the angle brackets and only turn the ones that you know to be strict into actual HTML.

  4. zeno Says:

    BBCode can still possibly have issues. Converting ALL non alphanumeric chars to html entities should stop ‘useful’ JS from being used on US-ASCII.
    I have yet to see an example of XSS without ();’:&# except possibly involving unicode and variable width stuff.

    - zeno
    http://www.cgisecurity.com/

  5. Edward Z. Yang Says:

    Spikeman: If you’re going to whitelist, you have to actually parse the HTML. It doesn’t really work any other way. I should know, I’ve done it before. :-)

  6. RSnake Says:

    @Rodney, Content restrictions was actually designed to do something very similar to what you were talking about. Mozilla for some reason felt that this was incredibly difficult to do. Also, you need to make sure that the bad guys can’t enter the “end” tag that would end that code. But yes, that’s very much part of the original idea.

  7. Dr. Strangelove Says:

    Yes, stopping XSS and HTML injection is hard, but it is in good part the result of braindead design: JavaScript. Ugly, unmanageable and stuffed with dangerously implemented features. The consequence of chaotic development in the early era of “inventing the browser”. I wonder if anyone, in those times, sat down and thought a bit before rushing to implement features whose detrimental impacts we are suffering now. Probably not many did that…

  8. Spyware Says:

    Why not blacklist all things that need to be filtered? Sure, it would take some weeks (or months?) to build a list that covers all (known) XSS, but it is doable.

  9. RSnake Says:

    @Dr. Strangelove - what I am showing is not JavaScript obfuscation, it’s HTML obfuscation. Yes, ultimately a scripting language is to blame for these issues, but remember, XSS and CSRF are definitely not limited to JavaScript.

    @Spyware - I think that’s exactly what we’re showing doesn’t work particularly well. A straight blacklist approach suffers from the Turing halting problem.

  10. Spyware Says:

    What do all xss vectors have in common? You need to filter on that.
    Dunno if something like that exists anyway.

  11. RSnake Says:

    They don’t all have anything in common other than the fact that they are somehow instantiated. That instantiation can be as simple as a + character or as complex as a full HTML tag. Surprisingly XSS takes many many different forms.

  12. Trevor Jim Says:

    I’d like to mention my own solution to this again:

    http://www.research.att.com/~trevor/beep.html

    Among other things, we point out a way to get the effect of Rodney’s suggestion about telling the browser not to execute scripts in a section of a web page, in current browsers, in a way that prevents attackers from writing the end tag. The downside is that it requires encoding the section as a string literal, which is less readable than straight html.
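
The whitelist idea from Spikeman's comment above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: the tag list ALLOWED_TAGS and the function render_bbcode are hypothetical, and this is not Spikeman's actual code. Everything is escaped first, and only a fixed set of attribute-free BBCode tags is turned back into HTML.

```python
import html
import re

# Hypothetical whitelist: simple formatting tags that take no attributes.
ALLOWED_TAGS = ("b", "i", "u")

def render_bbcode(text):
    """Escape all markup, then convert whitelisted [tag]...[/tag] pairs."""
    rendered = html.escape(text, quote=True)
    for tag in ALLOWED_TAGS:
        pattern = re.compile(r"\[%s\](.*?)\[/%s\]" % (tag, tag),
                             re.IGNORECASE | re.DOTALL)
        rendered = pattern.sub(r"<%s>\1</%s>" % (tag, tag), rendered)
    return rendered

print(render_bbcode("[b]bold[/b] <script>alert(1)</script>"))
# <b>bold</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

The only HTML that reaches the page is text the server generated itself, so there is nothing user-supplied for a parser quirk to reinterpret. As zeno notes, BBCode can still have issues: this sketch does not fix mis-nested tags, and anything that takes a user-supplied value (a URL, for example) would need its own validation.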
