web application security lab

Is HTML a Cludge?

I’ve got mixed feelings about writing this, so I’ll try to stay as objective as possible since I know this is a religious issue amongst some developers. I ran across an email to the www-html list by Shane McCarron talking about how HTML is well… maybe you should just read it for yourself:

Okay, okay… I give up. You are right, I am wrong. IE is broken and everyone uses it, so we are screwed. There’s a shock. Let’s all roll over and keep using 1997 technology and hacking around using weird-ass abstraction libraries to implement “Web 2.0” (gag-me) on top of incompatible underlying implementations rather than attempting to help the Internet evolve toward something light-weight, fast, and extensible like XML/XHTML.

Tag soup is sooo much better.

Honestly, people. You all disappoint me. But you are right - the HTTP spec does permit this broken behavior and I did not know that. In my world I always personally ignore */* in the accept header. Groups like the OMA have declared that you cannot use it that way for this very reason. Its silly. Oh well.

I will continue to use XHTML ’cause it works well, really it does. Or rather, it works no worse than anything else and it is forward looking. You all do whatever you want. I can sleep at night.

Despite being a little mouthy, there are some good points in here. HTML is really one of the most complex languages out there. It’s nearly impossible for a human to read something and know what it says without the aid of a rendering engine (and often I find people are amazed at what HTML is capable of - in a bad way). I don’t care what anyone says, HTML is not an easy language. Click here and view source to see what I mean (and this isn’t even that complex of an example).

From a web application security perspective it’s just as complex. Knowing which HTML has JavaScript in it is tough, but try finding text that has a bad word in it. Forget it. Maybe XHTML is the answer. Maybe a new version of the entire protocol is worth thinking about. I know it would mess a lot of things up in the short term, but from an information security perspective it would be a lot easier to know what the user is submitting and how the page renders if we started talking about a real standard instead of the makeshift proprietary rendering engines that we have come to know and love.
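To make the "which HTML has JavaScript in it" problem concrete, here is a minimal sketch in Python. It compares a naive substring check against a pass with the standard library's html.parser. The names ScriptFinder, naive_has_script, find_script, and URL_ATTRS are made up for illustration, and the checks are deliberately incomplete. Real browsers parse markup differently from html.parser, so this shows the shape of the problem and should not be used as a filter.

```python
from html.parser import HTMLParser

# Hypothetical, incomplete list of attributes that can carry a URL.
URL_ATTRS = {"href", "src", "action", "formaction"}


def naive_has_script(markup):
    """The kind of check that looks sufficient and is not."""
    return "<script" in markup.lower()


class ScriptFinder(HTMLParser):
    """Collects a few of the places where script can hide in markup."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "script":
            self.findings.append("script element")
        for name, value in attrs:
            if name.startswith("on"):
                # onerror, onload, onclick, ...
                self.findings.append(f"event handler: {name}")
            elif name in URL_ATTRS and value:
                # Drop whitespace and control characters before comparing,
                # since "java<TAB>script:" style tricks rely on them.
                compact = "".join(ch for ch in value if ch > " ").lower()
                if compact.startswith("javascript:"):
                    self.findings.append(f"javascript: URL in {name}")


def find_script(markup):
    finder = ScriptFinder()
    finder.feed(markup)
    finder.close()
    return finder.findings


if __name__ == "__main__":
    samples = [
        '<p>hello</p>',
        '<img src="x.png" onerror="alert(1)">',
        '<a href="java\tscript:alert(1)">hi</a>',
    ]
    for sample in samples:
        print(repr(sample), naive_has_script(sample), find_script(sample))
```

Neither of the last two samples contains the string "<script", so the naive check passes them, while the parser-based pass flags both. Even so, the second approach only knows about the cases someone thought to write down, which is the core of the complaint above.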

4 Responses to “Is HTML a Cludge?”

  1. WhiteAcid Says:

    This is something I’ve often had long conversations over on IRC.

    There are several things about how rendering engines are made that I hate: they are extremely forgiving, way more than any other computer language. Worse yet, there’s no good reason for this. I’d love it if a browser reported an error instead of trying to do its best to render the page, like some do if you try to view a badly formed XML document.

    XHTML does try to fix some of these issues, something I embrace. XHTML 2 has some more major changes, all of which I agree with: things like using section elements to go with an h tag, allowing the href attribute on any element (leaving the a tag supported only for backwards compatibility), and a similar change to the src attribute (and the img element).

    With the change in structure that XHTML 2 brings, it can greatly simplify understanding a document, whether you’re looking at the code or programmatically trying to separate sections.

    There is no reason for anyone to learn HTML; it is obsolete. I fully agree with Shane that HTML is a cludge, that the problem is being addressed, and that people should embrace XHTML.

  2. MERLiiN Says:

    I suspect HTML, like SMTP, will continue to be used and will stay broken, if not become worse. Security never seems to drive these things forward. “We” should have learned from SMTP, but I think it will be a long time before enough people learn, or someone puts it into a perspective that can force a changeover.

  3. Peter Says:

    I don’t like your example - it would be much better if you could say, “Look, this is _valid_ HTML, and it’s this bad.” validator.w3.org reports 16 or so errors, depending on the doctype you pick. Although I could write volumes on valid code the validator rejects…

    HTML will stick around for a while, though - people don’t want to write their own DTDs usually, and HTML’s “works”, so it won’t get fixed. Revised, maybe, updated surely, but not fixed.

  4. RSnake Says:

    Why does it matter if it’s valid or not? If rendering engines render it, that’s all that matters in application security. I don’t care if some person in a lofty ivory tower deems it good or not. I only care if a browser in all its craziness does what I want it to do. How people think it should or shouldn’t work is irrelevant. That’s the exact nature of most of the XSS exploits on the XSS Cheat Sheet http://ha.ckers.org/xss.html

    Case in point is the recent exploit against MySpace http://ha.ckers.org/blog/20061205/myspace-xss-for-firefox-0day/ It uses the non-alpha-non-digit vector, which really should not work, and yet the vector fires. It doesn’t really matter that it doesn’t work in IE. The point is that it does work, making HTML unruly and a nightmare to secure.

    But point taken about it not being fixed. Things are too far gone. The only thing we may see happen is that the browsers decide to be less crazy (like IE fixing its handling of the javascript directive inside image tags, for example). That goes a long way towards fixing the issues, even if it does break a few things.