A pedant that hangs out in the dark corner-cases of the web.

Wednesday, July 06, 2005

Entify Your HTML!

To the embarrassingly uninformed third-party vendors of web-based applications, I present a quick look at HTML entities. This is Chapter One stuff in even the most basic HTML book, but I still get puzzled, dismissive, and even indignant replies when I request fixes for simple HTML bugs.

Three important characters: < > &

These characters are special to HTML: parsers treat them as markup rather than as text. In the text or attribute values of a page, you must use the entities that stand for them: &lt; &gt; &amp; (respectively). In attribute values, " should also be replaced with &quot; (actually, you can use &quot; anywhere, but it isn't required outside attribute values).
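To make the text-versus-attribute distinction concrete, here is a small sketch in Python using the standard library's html.escape(). The sample strings and variable names are invented for illustration; nothing here comes from a particular vendor's code.

```python
import html

# Hypothetical values that contain HTML-special characters.
title = 'Q&A: is 1 < 2?'
tooltip = 'Click "Save" & continue'

# Element text: & < > must become entities; quotes may be left alone.
text_markup = "<h1>" + html.escape(title, quote=False) + "</h1>"

# Attribute value: the double quote must be entified as well.
attr_markup = '<a title="' + html.escape(tooltip, quote=True) + '">save</a>'

print(text_markup)  # <h1>Q&amp;A: is 1 &lt; 2?</h1>
print(attr_markup)  # <a title="Click &quot;Save&quot; &amp; continue">save</a>
```

With quote=True, html.escape() also entifies the single quote, which does no harm and covers attributes delimited with ' instead of ".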

The Web Is A Big Place

If you forget to entify your special characters, some browsers will sometimes let you get away with it. If you intend to produce code for the widest possible audience (which is the whole point of the Internet, after all), it is best not to assume your indiscretions will always go unnoticed. Better to do it right from the start, so you won't have to double-check every support call ($$$) to see if unentified HTML is part of the problem.

Unentified HTML Is Insecure HTML

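Most Cross-Site Scripting (XSS) attacks exploit unentified HTML: text supplied by a user is written into a page as-is, and the browser runs whatever markup it contains. Entifying text and attribute values closes that hole, though data written into script blocks or URLs needs escaping suited to that context as well. The liability of such an attack, though potentially considerable, is nothing compared to the loss of client trust.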
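As an illustration of the difference entifying makes, here is a minimal Python sketch. The render_comment() helper and the hostile comment are hypothetical, made up for this example; html.escape() is the standard library function.

```python
import html

def render_comment(comment: str) -> str:
    """Wrap a user-supplied comment in a paragraph, entifying it first."""
    return "<p>" + html.escape(comment) + "</p>"

# A hostile "comment" that tries to smuggle a script into the page.
hostile = '<script>alert("xss")</script>'

# Unentified: the browser would see a real script element and run it.
unsafe = "<p>" + hostile + "</p>"

# Entified: the browser displays the text and runs nothing.
safe = render_comment(hostile)

print(safe)  # <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>
```

The only difference between the two outputs is one function call at the point where the string meets the page.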

It's Easy

Nearly every web development language has a single function you can call to entify the contents of string or text variables (numeric and date/time variables do not typically require escaping), e.g. Server.HTMLEncode() in Active Server Pages or htmlentities() in PHP. In cases where the language does not provide such a function, writing one is trivial: four search-and-replace calls. Do the ampersand first, or the ampersands in the entities you have just inserted will themselves be replaced.
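For a language without a built-in, the four search-and-replace calls might look like the following sketch. It is written in Python purely for illustration (Python already ships html.escape()), and the name entify() is my own invention.

```python
def entify(text: str) -> str:
    """Replace the four HTML-special characters with their entities."""
    return (
        text.replace("&", "&amp;")    # ampersand first (see below)
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
    )

def entify_wrong_order(text: str) -> str:
    """Same replacements with the ampersand last, to show the bug."""
    return (
        text.replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
            .replace("&", "&amp;")    # re-escapes the entities just added
    )

print(entify('a < b & "c"'))    # a &lt; b &amp; &quot;c&quot;
print(entify_wrong_order("<"))  # &amp;lt;
```

The second function shows why the order matters: with the ampersand done last, a lone < ends up displayed on the page as the literal text &lt; instead of as a less-than sign.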
