web application security lab

US-ASCII Issues Redux

As I’m nearing completion of my XSS fuzzer for people, I’m finding more and more interesting issues. Just so you know I’m not keeping everything from you all, here’s another interesting problem I uncovered. Surely you remember the original problem with US-ASCII encoding, where a character could be modified so that, in US-ASCII, it would render as an open angle bracket (or any other character) if you encoded it correctly. Wellllll, it turns out that is only one very small piece of the problem. Sure, you can look for everything higher than 7F (127 in decimal) and up through FF (255 in decimal) and kill it, but that won’t solve your problem. One of the tests I ran was:

[CHAR]IMG SRC="" onerror="XSS_ME([DECIMAL-CHAR])">

Where [CHAR] enumerated over a list of characters and [DECIMAL-CHAR] was the decimal representation of that character. I expected to find only 60 (the decimal representation of the open angle bracket) and the additional character 188 (the US-ASCII issue that Kurt Huwig found). Alas, there were far, far more vulnerable characters. Here’s the list:

188, 316, 380, 444, 508, 572, 636, 700, 764, 828, 892, 956, 1020, 1084, 1148, 1212, 1276, 1340, 1404, 1468, 1532, 1596, 1660, 1724, 1788, 1852, 1916, 1980, 2044, 2108, 2172, 2236, 2300, 2364, 2428, 2492, 2556, 2620, 2684, 2748, 2812, 2876, 2940, 3004, 3068, 3132, 3196, 3260, 3324, 3388, 3452, 3516, 3580, 3644, 3708, 3772, 3836, 3840, 6588, 6652, 6716, 6780, 6844, 6908, 6972, 7036, 7100, 7164, 7228, 7292, 7356, 7420, 7484, 7548, 7612, 7676, 7740, 7804, 7868, 7932, 7936, 10684, 10748, 10812, 10876, 10940, 11004, 11068, 11132, 11196, 11260, 11324, 11388, 11452, 11516, 11580, 11644, 11708, 11772, 11836, 11900, 11964, 12028, 12032, 14780, 14844, 14908, 14972, 15036, 15100, 15164, 15228, 15292, 15356, 15420, 15484, 15548, 15612, 15676, 15740, 15804, 15868, 15932, 15996, 16060, 16124, 16128, 18876, 18940, 19004, 19068, 19132, 19196, 19260, 19324, 19388, 19452, 19516, 19580, 19644, 19708, 19772, 19836, 19900, 19964, 20028, 20092, 20156, 20220, 20224, 22972, 23036, 23100, 23164, 23228, 23292, 23356, 23420, 23484, 23548, 23612, 23676, 23740, 23804, 23868, 23932, 23996, 24060, 24124, 24188, 24252, 24316, 24320, 27068, 27132, 27196, 27260, 27324, 27388, 27452, 27516, 27580, 27644, 27708, 27772, 27836, 27900, 27964, 28028, 28092, 28156, 28220, 28284, 28348, 28412, 28416, 31164, 31228, 31292, 31356, 31420, 31484, 31548, 31612, 31676, 31740, 31804, 31868, 31932, 31996, 32060, 32124, 32188, 32252, 32316, 32380, 32444, 32508, 32512, 35260, 35324, 35388, 35452, 35516, 35580, 35644, 35708, 35772, 35836, 35900, 35964, 36028, 36092, 36156, 36220, 36284, 36348, 36412, 36476, 36540, 36604, 36608, 39356, 39420, 39484, 39548, 39612, 39676, 39740, 39804, 39868, 39932, 39996, 40060, 40124, 40188, 40252, 40316, 40380, 40444, 40508, 40572, 40636, 40700, 40704, 43452, 43516, 43580, 43644, 43708, 43772, 43836, 43900, 43964, 44028, 44092, 44156, 44220, 44284, 44348, 44412, 44476, 44540, 44604, 44668, 44732, 44796, 44800, 47548, 47612, 47676, 47740, 47804, 47868, 47932, 47996, 48060, 48124, 48188, 
48252, 48316, 48380, 48444, 48508, 48572, 48636, 48700, 48764, 48828, 48892, 48896, 51644, 51708, 51772, 51836, 51900, 51964, 52028, 52092, 52156, 52220, 52284, 52348, 52412, 52476, 52540, 52604, 52668, 52732, 52796, 52860, 52924, 52988, 52992, 55740, 55804, 55868, 55932, 55996, 56060, 56124, 56188, 56252, 56316, 56380, 56444, 56508, 56572, 56636, 56700, 56764, 56828, 56892, 56956, 57020, 57084, 57088, 59836, 59900, 59964, 60028, 60092, 60156, 60220, 60284, 60348, 60412, 60476, 60540, 60604, 60668, 60732, 60796, 60860, 60924, 60988, 61052, 61116, 61180, 61184, 63932, 63996, 64060, 64124, 64188, 64252, 64316, 64380, 64444, 64508, 64572, 64636, 64700, 64764, 64828, 64892, 64956, 65020, 65084, 65148, 65212, 65276, 65280, 65340, 65404, 65468, 65532

Forgive the mess, but yes, all those characters can substitute for an open angle bracket, run HTML, and run your cross site scripting vectors. Looks like there are tons of other problems to look for. Thankfully, US-ASCII encoding is not that prevalent (about 1% of the Fortune 500 by our estimates); however, I’ve only just begun my testing. Almost everything I’m trying works, which is pretty scary. Lots more to come…
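For what it’s worth, the list isn’t random. A quick sanity check (my own sketch, not part of the fuzzer) shows what these values have in common: each one UTF-8-encodes to a byte sequence containing either 0x3C, a literal open angle bracket, or 0xBC, which becomes 0x3C once the high bit is dropped.

```python
# A sanity check (mine, not the original fuzzer): each character in the
# list above has a UTF-8 encoding containing the byte 0x3C ('<') or 0xBC
# (rendered as '<' when each byte is interpreted as US-ASCII and the high
# bit is ignored).
def has_bracket_byte(codepoint: int) -> bool:
    encoded = chr(codepoint).encode("utf-8")
    return 0x3C in encoded or 0xBC in encoded

for cp in (188, 316, 380, 3840, 65532):  # a few entries from the list
    print(hex(cp), chr(cp).encode("utf-8").hex(), has_bracket_byte(cp))
```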

11 Responses to “US-ASCII Issues Redux”

  1. yawnmoth Says:

    Correct me if I’m wrong, but there are only 256 US-ASCII characters, aren’t there? As such, 316 isn’t its own character - it’s two. chr(1) and chr(60). 380 is chr(1) and chr(124), 444 is chr(1).chr(188), etc.

    Of those, 316 and 444 aren’t at all surprising. 380 kinda is, though.

  2. RSnake Says:

    That’s probably what the intention was, but when you output a multi-byte character in that encoding method, it works. Like so:

    IMG src="" onerror=alert(65532)>

  3. Dean Brettle Says:

    But a filter that prevents these attacks according to Kurt Huwig’s suggested fix should be clearing the high bit of each *byte* before looking for vectors. It wouldn’t make any sense to look at each long character, right?

    That said, I agree with yawnmoth that chr(1) chr(124) getting interpreted as an angle bracket is definitely interesting since the suggested fix would not have blocked it. Are there any other sequences of 7-bit chars that have that property?

  4. RSnake Says:

    Dean, I hadn’t heard that fix… but what do you mean look at the high bit, exactly? Do you mean if the first bit of the char is greater than 7 (as in 7f) then ignore it? Well what if I have 32060 (7D3C in hex) the first bit is not greater than 7. Or am I misunderstanding?

  5. yawnmoth Says:

    The idea behind the fix, I think, is that if you do an ‘and’ with 0x7f on every character that characters like chr(ord(’

  6. yawnmoth Says:

    Hmmm… WordPress’s filtering out what it thinks are tags broke my post…

    I’ll just use ‘i’, instead.

    If you do an ‘and’ with 0x7f on every character like chr(ord('i') | 0x80), you won’t have two characters that represent an angle bracket - just one. ie. chr(ord('i') | 0x80) & chr(0x7f) == 'i', etc.
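    In code, the masking idea looks roughly like this (a sketch, assuming the filter operates on the raw bytes of the page):

```python
# Sketch of the suggested fix: clear the high bit of every byte before
# looking for dangerous characters such as '<' (0x3C).
def mask_and_filter(data: bytes) -> bool:
    masked = bytes(b & 0x7F for b in data)
    return 0x3C in masked  # True means the filter would flag the input

print(mask_and_filter(b"\xbc"))   # 0xBC & 0x7F == 0x3C, so this gets flagged
print(mask_and_filter(b"hello"))  # nothing bracket-like here
```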

  7. Dean Brettle Says:

    RSnake, the suggested fix is to clear the high bit of each *byte* (not each character) before filtering. For 7D3C, no bits would be cleared but a filter looking for “

  8. Dean Brettle Says:

    [Trying again…]

    RSnake, the suggested fix is to clear the high bit of each *byte* (not each character) before filtering. For 7D3C, no bits would be cleared, but a filter looking for a left angle bracket would catch it because 3C is a left angle bracket. As a result, I’m saying that 7D3C isn’t really any more interesting than, say, 203C (space followed by left angle bracket). Both would be caught by a filter operating on bytes, and any filter operating on text in the US-ASCII encoding should clearly be operating on bytes.

    017C (decimal 380) is interesting because it doesn’t contain a left angle bracket. As a result, a filter would probably miss it.
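    To make the distinction concrete, here is a rough comparison (my own illustration) of the two byte sequences discussed above, run through such a byte-level filter:

```python
# Byte-level filter from the suggested fix: mask the high bit of each
# byte, then look for 0x3C ('<').
def byte_filter_catches(data: bytes) -> bool:
    return 0x3C in bytes(b & 0x7F for b in data)

print(byte_filter_catches(bytes([0x7D, 0x3C])))  # 7D3C (32060): caught, 0x3C present
print(byte_filter_catches(bytes([0x01, 0x7C])))  # 017C (380): missed, no 0x3C byte
```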

  9. cea Says:

    Well, all these characters, when represented in UTF-8, contain either 0xbc or 0x3c which are equivalent to ‘

  10. cea Says:

    Well, the previous posting is broken by the use of ‘<’ but the conclusion is that there are no hidden surprises.

  11. RSnake Says:

    Amit and I discussed this as well, and I think I was falsely finding issues that weren’t there. Here is his diagnosis:

    I’m afraid the picture is more complicated. Your test
    ended up (I believe) in writing the Unicode symbol
    (e.g. 65532) as DOUBLE UTF-8 encoded. That is, the
    symbol was first encoded in UTF-8 into the 3 bytes EF
    BF BC, and then was AGAIN UTF-8 encoded (each byte
    separately) into C3 AF C2 BF C2 BC, which is what I
    see on the page.

    Now, the browser is instructed to interpret this as
    US-ASCII, meaning, each byte to its own. And the
    terminating byte BC is indeed interpreted as a left
    bracket, hence the desired effect.

    But I claim that:

    1. What you have here is a pretty non-standard
    situation, wherein data is DOUBLE encoded. I think in
    typical situations, Unicode data is serialized (e.g.
    by Java) to UTF-8 once.

    2. The issue here is not about “characters”, but
    rather, with their wire representation. As it happens,
    UTF-8 is typically used to represent Unicode
    characters, so the question is simply - which Unicode
    characters have UTF-8 representation whose last byte
    is BC (it’s easy to see that no Unicode symbol has
    UTF-8 representation whose last byte is 3C except
    U+003C - the true open bracket). So for example, 188,
    which is U+00BC, is UTF-8 encoded into C2 BC (bingo).

    Likewise, 316, which is U+013C, is UTF-8 encoded into
    C4 BC (bingo, again…). What you need is to study the
    UTF-8 conversion algorithm, and easily derive the list
    of characters (again, up to issue #1 above, which is
    the strange double encoding you have). I think the net
    effect of what you’ve demonstrated is that it’s possible
    to generate the byte BC in various ways in the HTML
    output.

    -Amit
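    Amit’s derivation is easy to check: in a multi-byte UTF-8 sequence, the final continuation byte is 0x80 | (cp & 0x3F), so it equals 0xBC exactly when the code point’s low six bits are 0x3C. A quick sketch, assuming standard UTF-8:

```python
# Code points whose UTF-8 encoding ends in the byte 0xBC: for multi-byte
# sequences the last byte is 0x80 | (cp & 0x3F), which equals 0xBC when
# cp & 0x3F == 0x3C (i.e. cp % 64 == 60) and cp >= 0x80.
def ends_in_bc(cp: int) -> bool:
    return chr(cp).encode("utf-8")[-1] == 0xBC

print(chr(188).encode("utf-8").hex())    # c2bc   (U+00BC)
print(chr(316).encode("utf-8").hex())    # c4bc   (U+013C)
print(chr(65532).encode("utf-8").hex())  # efbfbc (U+FFFC)
```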

    What’s confusing is that there actually are variable widths that modify the next character (I see the effect quite a lot now that I’m working with Cheng Peng Su and modifying the fuzzer to use that information). Using bvi, I was able to modify the tests in memory, and sure enough I can see the effect with anything I type:

    55 02 BC -> works
    09 BC -> works

    Etc… As long as that last byte is BC, it functions. So the serialization of the various characters strung together is having no effect on the last byte. Anyway, as a result this has uncovered a pretty major flaw in my fuzzer design. Don’t expect it anytime soon… I have to re-think some key issues.
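    The double encoding Amit describes is easy to reproduce (a sketch of mine; Latin-1 stands in for the byte-to-character step that triggers the second encoding pass):

```python
# Double UTF-8 encoding of U+FFFC (decimal 65532), per the diagnosis above:
# the first pass yields EF BF BC; treating those bytes as characters and
# encoding again yields C3 AF C2 BF C2 BC.
once = chr(65532).encode("utf-8")               # b'\xef\xbf\xbc'
twice = once.decode("latin-1").encode("utf-8")  # each byte re-encoded separately
print(once.hex())   # efbfbc
print(twice.hex())  # c3afc2bfc2bc
```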