Charset Map Of Top 100
After releasing my fuzzer I realized I had never really talked about the breakdown of the different charsets used by the top 100 websites. I did a quick poll of the top 100 Alexa listings (granted, there’s some overlap, and Alexa really isn’t a good measure of a lot of things, but I just wanted a quick test). I took the top 100, parsed out the charsets and graphed them:
Variable-width encoding gets more interesting when you look at how many different encoding methods are actually in use. As more and more encoding methods come into use, the need for the fuzzer grows, because it’s pretty complex to do this sort of testing any other way. Some of these encoding methods are very common, others are not. It’s interesting that the mix is so diverse.
The “Unknown” section covers the websites that either didn’t declare a charset or failed because they didn’t allow the request (it was just a quick poll). This could become more and more interesting as we uncover more issues with the various encoding methods, all of which will be much harder to stop because they will require more complex filtering (restricting the ASCII ranges, for instance).
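For anyone who wants to repeat this kind of poll, here is a minimal Python sketch of the tallying step. It is not the script behind the numbers in this post. The names extract_charset and tally_charsets, the regular expression, and the sample data are all assumptions made for illustration. It assumes you have already fetched each site's Content-Type header and the start of its HTML, and it counts a failed request (None) as "unknown". Charset names are lowercased so that variants such as UTF-8 and utf-8 land in the same bucket.

```python
import re
from collections import Counter

# Hypothetical pattern: finds charset=... in a Content-Type header or a <meta> tag.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def extract_charset(content_type_header, html_head=""):
    """Return the declared charset in lowercase, or "unknown" if none is declared.

    The HTTP header is checked first, then the HTML itself.
    """
    for source in (content_type_header or "", html_head or ""):
        match = CHARSET_RE.search(source)
        if match:
            return match.group(1).lower()
    return "unknown"


def tally_charsets(responses):
    """Count charsets over (header, html) pairs; None marks a failed request."""
    counts = Counter()
    for response in responses:
        if response is None:
            counts["unknown"] += 1
        else:
            header, html = response
            counts[extract_charset(header, html)] += 1
    return counts


# Made-up sample data, not real poll results.
sample = [
    ("text/html; charset=UTF-8", ""),
    ("text/html", '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'),
    ("text/html; charset=ISO-8859-1", ""),
    ("text/html", "<html><head></head>"),
    None,
]

for charset, count in tally_charsets(sample).most_common():
    print(charset, count)
```

A real poll would also need the fetching step and some care with redirects and sites that block automated requests, which is where most of the “Unknown” entries are likely to come from.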




September 22nd, 2006 at 3:34 pm
Yep, definitely a quick poll. You should have lowercased all the character encodings and merged UTF-8 and utf-8 (and the like).
It’s interesting, however, how many websites don’t specify a character encoding at all. Trying to fight against variable-width attacks without specifying a charset is a losing battle: you are totally at the mercy of browser autodetection methods.