A pedant that hangs out in the dark corner-cases of the web.

Wednesday, August 06, 2008

Efficacy of .NET StreamReader's detectEncodingFromByteOrderMarks

The .NET System.IO.StreamReader class has several constructor overloads that accept a boolean detectEncodingFromByteOrderMarks parameter, which tells the reader to look for a byte-order mark (BOM), or encoding signature, when the file is first read.

When enabled, this feature updates the CurrentEncoding property after the first read from the file (a simple call to Peek() is enough).
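As a minimal sketch of that behavior, the C# below feeds in-memory bytes to a StreamReader with detection switched on, calls Peek(), and returns CurrentEncoding. The class name BomDetectionSketch and its two helper methods are made up for illustration; only StreamReader, MemoryStream, and Encoding come from the framework.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionSketch
{
    // Returns the encoding StreamReader reports after its first read.
    // UTF-8 is passed as the fallback used when no BOM is found.
    public static Encoding DetectEncoding(byte[] content)
    {
        using (var stream = new MemoryStream(content))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            reader.Peek(); // forces the first read, which triggers BOM detection
            return reader.CurrentEncoding;
        }
    }

    // Hypothetical helper: encodes text and prepends the encoding's BOM, if it has one.
    public static byte[] WithPreamble(Encoding encoding, string text)
    {
        byte[] preamble = encoding.GetPreamble();
        byte[] body = encoding.GetBytes(text);
        byte[] result = new byte[preamble.Length + body.Length];
        Buffer.BlockCopy(preamble, 0, result, 0, preamble.Length);
        Buffer.BlockCopy(body, 0, result, preamble.Length, body.Length);
        return result;
    }
}
```

For example, DetectEncoding(WithPreamble(Encoding.Unicode, "hi")) should report a UTF-16 encoding, while bytes with no BOM leave CurrentEncoding at the UTF-8 fallback. Comparing CodePage values is a safer check than comparing names, since the names reported for some encodings differ between runtime versions.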

Detection only works reliably for encodings that supply a BOM. When no BOM is found, the reader falls back to its default encoding, UTF-8, so content in several single-byte encodings still reads correctly as long as it stays within the 7-bit ASCII range.
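The ASCII overlap can be seen in a small C# sketch that reads BOM-less bytes with the reader's defaults. AsciiOverlapSketch and ReadWithDefaults are hypothetical names, and the sample bytes are hand-written ISO-8859-1 chosen for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class AsciiOverlapSketch
{
    // Reads bytes using the default UTF-8 encoding with BOM detection enabled.
    public static string ReadWithDefaults(byte[] content)
    {
        using (var stream = new MemoryStream(content))
        using (var reader = new StreamReader(stream, true))
        {
            return reader.ReadToEnd();
        }
    }

    // "plain text": every byte is below 0x80, so it is also valid UTF-8.
    public static readonly byte[] AsciiOnly =
        { 0x70, 0x6C, 0x61, 0x69, 0x6E, 0x20, 0x74, 0x65, 0x78, 0x74 };

    // "café au lait" in ISO-8859-1: the 0xE9 byte is not valid UTF-8 here.
    public static readonly byte[] Latin1WithAccent =
        { 0x63, 0x61, 0x66, 0xE9, 0x20, 0x61, 0x75, 0x20, 0x6C, 0x61, 0x69, 0x74 };
}
```

ReadWithDefaults(AsciiOnly) comes back unchanged as "plain text", whereas ReadWithDefaults(Latin1WithAccent) loses the accented letter and contains the U+FFFD replacement character instead.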

Here is a sample of how well this feature works with content written in various encodings:

Not detected, but works fine with the default UTF-8 since ASCII is a subset of UTF-8.
Not detected, not UTF-8 compatible.
Detected correctly. Default encoding anyway.
Detected correctly (as utf-16).
Detected correctly.
Detected correctly, but still reads incorrectly in my testing!
Detected correctly.
Windows-1252, iso-8859-1, iso-8859-15, macintosh
Not detected, but shares a significant character overlap with the UTF-8 default (7-bit ASCII).
Various EBCDIC encodings: IBM037, IBM500, IBM870
Not detected, and not read correctly in tests.
UTF-EBCDIC, SCSU, BOCU-1, Punycode, CESU-8, UCS-4*, UTF-1, UTF-9†, UTF-18
Not supported by the .NET Framework.

For the most part, content in a Unicode encoding that includes a BOM has the greatest chance of success, and encodings not listed above aren't likely to work. EBCDIC and international encodings, among others, must be opened with their explicit encoding if they are to be read successfully, which means the reader has to know the encoding in advance. That is why you should only produce UTF-8/16/32 content.
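Both halves of that advice can be sketched in C#: opening content with an explicitly named encoding, and writing new content as UTF-8 with a BOM so that detection can succeed later. ExplicitEncodingSketch, ReadAs, and WriteUtf8WithBom are hypothetical names. The sketch uses "iso-8859-1" because it is available everywhere; on newer .NET runtimes, code-page encodings such as IBM037 may first need an additional encoding provider to be registered.

```csharp
using System;
using System.IO;
using System.Text;

public static class ExplicitEncodingSketch
{
    // Reads bytes with a named encoding; BOM detection is off because
    // the caller is asserting that the encoding is already known.
    public static string ReadAs(byte[] content, string encodingName)
    {
        Encoding encoding = Encoding.GetEncoding(encodingName);
        using (var stream = new MemoryStream(content))
        using (var reader = new StreamReader(stream, encoding, false))
        {
            return reader.ReadToEnd();
        }
    }

    // Writes text as UTF-8 with a BOM so a later reader can detect it.
    public static byte[] WriteUtf8WithBom(string text)
    {
        using (var stream = new MemoryStream())
        {
            // UTF8Encoding(true) asks the writer to emit the EF BB BF signature.
            using (var writer = new StreamWriter(stream, new UTF8Encoding(true)))
            {
                writer.Write(text);
            }
            return stream.ToArray(); // still valid after the stream is closed
        }
    }
}
```

Given the ISO-8859-1 bytes for "café au lait", ReadAs(bytes, "iso-8859-1") returns the text intact, which the default UTF-8 reader cannot do.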

* Not recognized as an alias for UTF-32.
† To be fair, UTF-9 and UTF-18 are a joke (they come from an April Fools' Day RFC).

