Identifying Information Structures

One of my very first programming assignments at one of my very first real jobs was to parse address information apart so it could be stored in a database. Previously the address field had been a textarea where users could type whatever they felt like - there was no error checking of any kind. That sounds bad for us, but it was actually worse for the users, because they wouldn’t get paid if they didn’t enter a valid address. That incentive kept the problems to a minimum. But by minimum, I mean that of the 10,000 addresses we had on file, about 10% (1,000) had issues of some kind.

It might sound like a trivial task, but it wasn’t. First of all, international addresses look nothing like US addresses, and for many countries (especially at that time) there was no online service to check that zip codes (if the country even used them) were valid. Further, you had misspellings, incorrect formatting, plain old typos, and fields where people entered nothing or completely erroneous data.

Think about an email address. It seems simple enough, but there are only two good ways to validate it - a regex to check that it’s in the correct format, and emailing the user to see whether they get it or not. Let’s look at what something as simple as an email address regex might look like:

/^\s*\d*\s+(([A-Z0-9]+[._]?){1,}[A-Z0-9]+\@(([A-Z0-9]+[-]?){1,}[A-Z0-9]+\.){1,}[A-Z]{2,4})\s*$/i

If something as complex as that regex is required to validate the structure of something as simple as an email address, think about what it would take to handle the unstructured data of an address field, or something even more complex such as a chunk of HTML. In the case of HTML it’s not just an exercise in identifying valid HTML but in identifying its intent. It struck me, looking at the above regex, that people who try to protect against JavaScript injection with regex while still allowing HTML (especially people who can’t effectively read/write regex) are pretty doomed.
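To get a feel for how the email regex above behaves, here is a minimal Python sketch. The pattern is transcribed unchanged (the trailing /i becomes re.IGNORECASE); the helper name looks_like_email and the sample addresses are made up for illustration. Note that, as written, the pattern expects at least one whitespace character before the address, and its {2,4} suffix rejects longer top level domains.

```python
import re

# The pattern from the post, transcribed unchanged.
EMAIL_RE = re.compile(
    r"^\s*\d*\s+(([A-Z0-9]+[._]?){1,}[A-Z0-9]+\@"
    r"(([A-Z0-9]+[-]?){1,}[A-Z0-9]+\.){1,}[A-Z]{2,4})\s*$",
    re.IGNORECASE,
)


def looks_like_email(text):
    """Hypothetical helper: True if the pattern accepts the input."""
    return EMAIL_RE.match(text) is not None


samples = [
    " user@example.com",             # accepted
    " first.last@mail.example.org",  # accepted
    "user@example.com",              # rejected: no leading whitespace
    " curator@example.museum",       # rejected: TLD longer than 4 letters
]
for sample in samples:
    print(repr(sample), looks_like_email(sample))
```

Even on a handful of inputs, the pattern's behavior is hard to predict by reading it, which is the point of the paragraph above.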

4 Responses to “Identifying Information Structures”

  1. Daniel Papasian Says:

    It’s also worth pointing out that the regular expression provided will miss some valid email addresses - there are top level domains such as .museum (and perhaps that’s the only one) that are greater than 4 characters.

    I don’t think it’s worth evaluating hostnames with regular expressions - it makes much more sense to evaluate them with a DNS query (in this case, checking for an MX or A record), as that, ultimately, is what a hostname is used for. And I think that the approach behind that technique is similar to what folks should adopt for security filtering - you can’t just blacklist or whitelist certain HTML elements, you need to interpret/parse the HTML to get what the information is that’s being sent, and then build your own representation of it (in HTML, XML, whatever) using elements that you’ve decided are safe.

    Which is not an easy task given the crap that you must put up with if you’re parsing HTML - it’s not all that different from writing a browser that can render every page that browsers have been able to view since, say, Netscape 2.
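A rough sketch of the DNS check described in the comment above, assuming only Python's standard library. The standard library has no MX lookup, so this version only checks that the domain resolves to an address; an MX query would need a separate DNS library. The function name domain_resolves and the injectable resolve parameter are assumptions made for illustration and testing.

```python
import socket


def domain_resolves(email, resolve=socket.getaddrinfo):
    """Return True if the domain part of `email` has an address record.

    `resolve` is injectable (a hypothetical convenience) so the check
    can be exercised without touching the network.
    """
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return False
    try:
        # getaddrinfo returns a list of address tuples, or raises on failure.
        return len(resolve(domain, None)) > 0
    except (socket.gaierror, UnicodeError):
        return False
```

A passing lookup only says the hostname exists, not that the mailbox does, so it complements a confirmation mail rather than replacing it.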

  2. Edward Z. Yang Says:

    Well, it depends on how deep of an understanding you want. Do you want AI-like intelligence, where it can look at a pair of U tags and think, “Oh, the user wanted to demarcate the title of a book in MLA format!” That would be cool, but we’re not going to see it any time soon.

    More appropriate terminology, then, would be: parse the HTML into a DOM (document object model), and then write that back out as HTML. This removes any parsing discrepancies.
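A minimal sketch of the parse-and-rebuild idea from the two comments above, in Python. It uses the standard library's event-based html.parser rather than a full DOM, so it does not balance unclosed tags; it only shows the principle of emitting your own output from parsed input instead of pattern-matching the raw text. The whitelist ALLOWED_TAGS and the names Rebuilder and rebuild are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration; a real policy needs more thought.
ALLOWED_TAGS = {"b", "i", "u", "em", "strong", "p"}


class Rebuilder(HTMLParser):
    """Re-emit only whitelisted tags, without attributes, plus escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append("<%s>" % tag)  # attributes are dropped entirely

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))


def rebuild(html_text):
    parser = Rebuilder()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(rebuild('<b onclick="evil()">hi</b><script>alert(1)</script>'))
# prints: <b>hi</b>
```

Because the output is written from scratch, anything the parser did not recognize as an allowed tag or plain text never reaches the page.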

  3. rdivilbiss Says:

    “That would be cool, but we’re not going to see it any time soon.”

    You could very well be prescient and your statement could be correct. But look at the last 40 years. Soon could be much quicker than you think.

    But I think that misses the point of the article. Determining the intent of code is inherently complex, beyond the capabilities of the tools at hand. Woe be unto the web designer who thinks he or she is capable of writing code to determine evil intent so as to allow only the “safe” to pass.

  4. Jungsonn Says:

    I’ve never relied on a RegEx for an email address, though I’ve written a few. Because of the international issues and long TLDs like “.museum”, I mainly use this method:

    1.) let the user type it twice (and disable copy/paste)
    2.) then send a mail for confirmation.
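The server side of the two steps above might look roughly like this Python sketch. Everything here is an assumption for illustration: the class name PendingConfirmations, the in-memory dictionary standing in for real storage, and the idea that the returned token is placed in a link in the confirmation mail. Actually sending the mail is left out.

```python
import secrets


class PendingConfirmations:
    """Hypothetical in-memory store of addresses awaiting confirmation."""

    def __init__(self):
        self._pending = {}  # token -> address

    def start(self, first_entry, second_entry):
        """Step 1: both typed entries must agree; returns a token to mail."""
        address = first_entry.strip()
        if address != second_entry.strip() or "@" not in address:
            raise ValueError("entries differ or do not contain '@'")
        token = secrets.token_urlsafe(32)  # unguessable, safe to put in a URL
        self._pending[token] = address
        return token

    def confirm(self, token):
        """Step 2: return the confirmed address once, or None if unknown."""
        return self._pending.pop(token, None)
```

The only format check is the presence of an "@"; whether the address works is decided by the user receiving the mail and following the link.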
