web application security lab

HTMLSpecialChars Strikes Again

I found this post today by Miles Baker about how to create custom landing pages using PHP. At the end of the article he suggests using HTMLSpecialChars to protect yourself. While that’s generally correct if you know that every place you output user input is within the HTML constructs, it’s really not foolproof. I’m really not sure why this function even exists, since it doesn’t do what people want it to do. It only works in some very specific circumstances.

In the case of parameter injection it only works if the user input is enclosed within double quotes (not single quotes), and even then only if the page itself isn’t vulnerable to variable-width encoding issues or other character set issues. Maybe it’s me, but can anyone tell me why this function exists, or at minimum why it doesn’t escape single quotes and grave accents, and preferably why it hasn’t been updated to take charsets into account? I think PHP would be safer, and the amount of code it would break would be minimal, compared to how many sites are vulnerable due to ignorance of what the function actually does and doesn’t protect against.
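
To make the single-quote case concrete, here is a minimal PHP sketch. It passes ENT_COMPAT explicitly, since that was the default flag when this was written (PHP 8.1 later changed the default to include ENT_QUOTES). The render_link() helper and the payload are hypothetical, made up purely for illustration.

```php
<?php
// Hypothetical helper: places user input inside a single-quoted attribute.
function render_link(string $input, int $flags): string
{
    $escaped = htmlspecialchars($input, $flags, 'ISO-8859-1');
    return "<a href='" . $escaped . "'>link</a>";
}

// Hypothetical attacker-supplied value that tries to close the attribute.
$payload = "x' onmouseover='alert(1)";

// ENT_COMPAT converts double quotes only, so the single quote survives
// and the attribute can be closed early.
$unsafe = render_link($payload, ENT_COMPAT);

// ENT_QUOTES converts both quote styles; the single quote becomes &#039;.
$safe = render_link($payload, ENT_QUOTES);

echo $unsafe, "\n";
echo $safe, "\n";
```

With ENT_COMPAT the injected onmouseover ends up as a separate attribute of the tag; with ENT_QUOTES it stays inside the href value.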

22 Responses to “HTMLSpecialChars Strikes Again”

  1. david Says:

    It can escape quotes and account for charsets, but not by default - you have to pass 2 optional parameters. I agree, though, that the default behavior should be to do both of these things. I figure the best you can do in PHP is: htmlentities($var, ENT_QUOTES, 'ISO-8859-1')

  2. Wladimir Palant Says:

    I generally find it hard to understand why PHP has a dozen different escaping functions without clearly describing the differences between them and when you should use one over the other. This has the predictable result that many web developers seem to choose one function randomly and use it consistently (addslashes is pretty popular in particular), which doesn’t protect them from anything but only corrupts the user’s input.

  3. Jungsonn Says:

    @ david;

    you mean UTF-8

  4. Jungsonn Says:

    addslashes is only intended to prevent corrupt database tables, it isn’t designed to protect. mysql_real_escape_string() does this job in queries. Anything escaped with addslashes is at risk of SQL injection.

    This is the real solution for the html entities:
    htmlentities($var, ENT_QUOTES, 'UTF-8');

    It is pretty standard. ENT_QUOTES makes sure all quotes are represented as their html entities, and there is no way of breaking out of this with XSS or SQL injection, except for some exotic encoding issues with M_R_E_S (mysql_real_escape_string()), which are tough to deploy in the right context.

  5. Edward Z. Yang Says:


    Well, it depends on the context. While it would be nice if the entire world used UTF-8, it does not, and thus 'ISO-8859-1' is a reasonable default recommendation, although one should always explain what the apparently nonsensical string of letters and numbers means.

    I’m going to reiterate something I’ve said earlier on the forums: clean your UTF-8 using iconv or a utf8 library! If you follow this simple rule (and set your page’s encodings properly), variable width encoding attacks become impossible and you can get by with just htmlspecialchars().

    HTMLSpecialChars() works in these places:

    1. Between tags, always
    2. Inside double-quoted attributes, by default (you can turn it off but I don’t see why you’d want to)

    With ENT_QUOTES, it also works:

    3. Inside single-quoted attributes

    Which is good enough for most people.

  6. RSnake Says:

    Why would you prefer UTF-8 over ISO-8859-1 when UTF-8 has known issues? Why not stick with something that has no known charset based security holes? Am I missing something? Sure it can be cleaned but it seems like you are just adding one additional potential problem that you could at some point forget.

  7. Chris Shiflett Says:

    Maybe I’m wrong, but this just sounds like another case of not escaping for the right context.

    Can someone demonstrate the problem(s) with UTF-8? I’m assuming it’s a reference to things like this:

    If so, I don’t really see the dilemma. If not, I’m sure I’ve missed something.

  8. Dude Says:

    PHP blows to begin with (and this is why)

  9. lpilorz Says:

    The magic wand is:

    $str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    $str = htmlentities($str, ENT_QUOTES, 'UTF-8');

  10. lpilorz Says:

    Uh, I meant:

    $str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    $str = htmlspecialchars($str, ENT_QUOTES, 'UTF-8');

    (substitute your own encoding for UTF-8)
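
As a rough sketch, the two calls can be wrapped in one function so they are never used separately. The name scrub_and_escape() and its default argument are assumptions for illustration, not anything PHP ships, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical wrapper around the two calls quoted above.
function scrub_and_escape(string $str, string $encoding = 'UTF-8'): string
{
    // Converting a string to its own encoding forces mbstring to decode it,
    // so malformed byte sequences are not passed through unchanged.
    $str = mb_convert_encoding($str, $encoding, $encoding);

    // Escape &, <, > and both quote styles for HTML output.
    return htmlspecialchars($str, ENT_QUOTES, $encoding);
}

echo scrub_and_escape('Tom & "Jerry"'), "\n";
```

The point of the wrapper is only that the clean-up step cannot be forgotten; how mbstring replaces a malformed sequence depends on its substitute-character setting.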

    If you are processing data in a known, constant encoding, htmlspecialchars() could also be replaced by some home-made function, which does not have to convert &, but may also strip

  11. lpilorz Says:

    …but may also strip \ 0 (without the space it is removed by your Wordpress)

  12. RSnake Says:

    @Chris - there is a problem in IE6.0 with UTF-8. IE7.0 has fixed the issue.

  13. Matt Says:

    Wait, are you saying I should use echo $_GET['id']?


  14. Edward Z. Yang Says:

    @RSnake: the problem you’re citing refers to Internet Explorer’s buggy behavior when dealing with malformed UTF-8 documents. UTF-8’s design ensures that you will never find a valid character byte sequence within the byte sequence of another character, so the [Multibyte character][Quote] sequence will never be well-formed and will always be caught by iconv, mb_convert_encoding, PCRE’s u flag, or any other UTF-8 well-formedness checker.

    Chris is sort of correct, but in this case, it’s not escaping for the wrong context, but rather not doing enough escaping for the context. Htmlspecialchars() deals with the HTML escaping issues, but you need to also deal with the character encoding issues.

    At this point, I have to knock the PHP developers for making it so difficult to escape data for the most common output situation.

  15. RSnake Says:

    That’s exactly what I was referring to, yes. And it would seem to me that for the time being, until IE6.0 drops significantly in use, it would be safer to stick to ISO-8859-1. At the point at which IE6.0 drops to nothing, it is probably better to move to UTF-8, but for the short term, why take any additional risks? Especially when not enough people know the details of it.

  16. Edward Z. Yang Says:

    Hmm… how to explain this…

    UTF-8 has a clear and unambiguous way of indicating when there is a multibyte character sequence. In a nutshell, when the first bit of a byte is zero, we’re dealing with an ASCII-compatible character, and when the first bit is one, we’re dealing with a multiple byte character. From there, the next bits indicate how many bytes are in the character (for every 1 after the first bit, add one more byte to the character). By the time the software is finished parsing the first byte, it knows how long the character is.

    From there, it must parse the next bytes. A byte that is within a multibyte character will always start with the bit sequence 10, making it impossible to confuse with an ASCII character (which starts with a 0 bit) or the start of another multibyte character (a two byte character starts with 110, a three byte character starts with 1110, etc). Anything else is illegal.
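
The bit patterns described above can be sketched in PHP as follows. The function name classify_utf8_byte() and its labels are hypothetical, and it checks only the leading bits of a single byte; a real validator also rejects overlong forms and out-of-range values.

```php
<?php
// Hypothetical helper: classify one byte (0-255) by its leading bits,
// following the rules described above.
function classify_utf8_byte(int $byte): string
{
    if (($byte & 0x80) === 0x00) { return 'ascii'; }         // 0xxxxxxx
    if (($byte & 0xC0) === 0x80) { return 'continuation'; }  // 10xxxxxx
    if (($byte & 0xE0) === 0xC0) { return 'lead-2'; }        // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 'lead-3'; }        // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 'lead-4'; }        // 11110xxx
    return 'invalid';
}

// "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9; 0x22 is a double quote.
foreach ([0x41, 0xC3, 0xA9, 0x22] as $byte) {
    printf("0x%02X %s\n", $byte, classify_utf8_byte($byte));
}
```

A double quote (0x22) starts with a 0 bit, so it can only ever be classified as ASCII and never as the continuation of a multibyte character.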

    The Internet Explorer 6.0 behavior stems from the fact that after shifting into multibyte mode, it disregards UTF-8’s stipulation that later bytes in a multibyte character must begin with “10” and attempts to turn them into characters, thus allowing the quote character to be assimilated into the multibyte character and letting you escape the attribute.

    However, this situation only presents itself when the UTF-8 data is malformed, i.e. the problem is caused by IE’s lax error-checking. If you were to run it through a more rigorous character encoding parser such as iconv, the processor would notice the invalid byte-sequence, and either fatally error out or ignore the character. You do the strict checking, so that you don’t have to worry about Internet Explorer messing it up.
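
A minimal sketch of the "error out" variant of that strict check, using mb_check_encoding() in place of iconv. The helper name escape_strict_utf8() is an assumption for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: refuse malformed UTF-8 instead of escaping it.
function escape_strict_utf8(string $str): string
{
    // mb_check_encoding() returns false for byte sequences that are not
    // well-formed in the given encoding.
    if (!mb_check_encoding($str, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }
    return htmlspecialchars($str, ENT_QUOTES, 'UTF-8');
}

// 0xE0 announces a three-byte character, but the next byte is a plain
// double quote (0x22) instead of a 10xxxxxx continuation byte.
$malformed = "\xE0\"";

try {
    escape_strict_utf8($malformed);
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}

echo $rejected ? "rejected\n" : "accepted\n";
```

Because the malformed input never reaches the browser, it no longer matters how leniently the browser would have decoded it.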

    Why am I so adamant about this? Because trying to build international applications on ISO-8859-1 or a similarly fixed-width encoding is an exercise in futility.

    I apologize if you already knew all this.

  17. RSnake Says:

    @Edward - that actually is a very good explanation. However, you are saying that the advantage to writing something in UTF-8 is in ease of programming international applications, not in security of it. You and probably 10 other programmers know to use iconv or similar functions in other languages. Would you rather have a huge chunk of the population who still use IE6.0 be vulnerable or would you rather make it harder on developers who have to retrofit fixed-width character sets to work with multi-byte chars (or visual representations of those chars)?

    It’s a tough choice in my mind, but ultimately laziness and/or ignorance of the issue will probably win out here - they will do neither. They won’t use iconv and they will use UTF-8. Meaning from a security perspective consumers lose.

  18. Edward Z. Yang Says:

    You run into a lot of problems when you try to process UTF-8 with functions that were originally designed for fixed-width character sets, even when you’re not thinking from a security perspective. I agree, it is completely unreasonable to expect every developer to develop the intuition to know when to use plain or UTF-8 aware functions. PHP’s treatment of strings as pure binary data makes it very flexible, but is also disappointing because most of the data we deal with in web applications is not binary. For legacy reasons, this will not ever change.

    The best we can do is provide native string processing functions for UTF-8 in our programming languages (For example, PHP6). And considering how slow migration to PHP5 has been, the average Joe developer is probably not going to have safe behavior turned on for them, by default, any time soon.

    My hope is that the gymnastics required to successfully exploit this vulnerability (user data inside an attribute, un-escaped quote outside (a surprisingly unlikely situation if you’re not trying to filter HTML)) are high enough that a prospective hacker will be more likely to target some other low-hanging fruit to compromise the application.

    Starting with a fixed-width encoding will take you far, but the headaches (database conversion scripts, text processing function retrofitting, possibly legacy multiple-charset support) you’ll get when you discover you need to migrate to UTF-8 will far surpass any headaches you’ll get from variable width encoding attacks, which are one-line fixes.

  19. RSnake Says:

    Thank you for writing such a well thought through response. I completely agree with that statement. For non-international applications I am sticking with ISO-8859-1. For international applications I will be using UTF-8 but I’ll be very strict about how I use/process that data. I guess the best we can hope for is that this can’t be somehow leveraged in other attacks (e.g. some crazy header injection flaw, or a universal XSS or an XSS in a tool that is widely used). Anything that raised the stakes or made this attack more universal than the ultra-rare circumstances that traditionally surround variable width encoding flaws would change my opinion though.

  20. lpilorz Says:

    “You and probably 10 other programmers know to use iconv or similar functions in other languages.”

    In some non-English-speaking countries it’s one of the things many programmers learn very fast ;) But I agree that it is not for security reasons, but because they have to use ten different encodings at the same time.

  21. Jim Manico Says:

    RSnake and Edward, what a fantastic conversation. Do the same problems with ISO-8859-1 vs. UTF-8 for International applications still apply to the Java programming language?

  22. Erik Says:

    I would like to use htmlspecialchars without converting . Any way to do this?