As part of my day job, I had to convert some XML data provided by one of our Danish customers. During this process I learned a little about character encoding. I had read Joel Spolsky's article, but it's not until you encounter the problems of character encoding face-to-face that they really sink in.
There were two files: the first a complete set of XML, and the second a subset of the first, containing the data of particular interest.
To process the files, I had already written some Clojure code that converted XML to
maps for easier manipulation. To read the file contents, the code
used the built-in slurp
function. For the first file, this
resulted in a set of strings containing all those funny Danish
characters. However, reading the second file the same way gave me
strings with clearly incorrect Danish characters. What the hell?
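For context, the conversion looked roughly like the sketch below. This is my own reconstruction, not the original code: the helper name xml-string->map and the sample file name are made up, and it simply hands the slurped string to clojure.xml/parse.

(require '[clojure.xml :as xml])

;; Hypothetical helper (not the original code): parse an XML string, as
;; returned by slurp, into the nested {:tag ... :attrs ... :content ...}
;; maps produced by clojure.xml/parse.
(defn xml-string->map
  [s]
  (-> s
      (.getBytes "UTF-8")
      (java.io.ByteArrayInputStream.)
      xml/parse))

;; Usage (file name is illustrative only):
(def data (xml-string->map (slurp "customers.xml")))

Of course, this only behaves as expected if slurp decoded the bytes correctly in the first place, which is exactly where the trouble started.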
I inspected the files with od -t cx4. A key excerpt from each file is shown below:
0001720 3 > F e j l p å ** l i n i e 65463e33 70206c6a 6c20a5c3 65696e69
0002060 e s l 345 a t t e k n i k e r e56c7365 20746120 6e6b6574 72656b69
After re-reading Mr. Spolsky's article, it became clear what was going on. The first file is in UTF-8 format; the clue is the two-byte sequence x'c3a5' (the bytes appear swapped in the dump because it was taken on a little-endian machine). This is a UTF-8 sequence which, when decoded, yields the single character x'e5' (U+00E5) - a lower-case 'a' with a ring on top, that is "å".
The excerpt from the second file shows the same character (displayed as octal '345' in the character column), but here the x'e5' appears as a byte by itself. Hence, the file encoding is not UTF-8 but ISO-8859-1 (aka Latin-1).
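A quick REPL check (my own illustration, not part of the original investigation) confirms the two encodings of 'å':

(map #(format "%02x" (bit-and % 0xff)) (.getBytes "å" "UTF-8"))
;=> ("c3" "a5")
(map #(format "%02x" (bit-and % 0xff)) (.getBytes "å" "ISO-8859-1"))
;=> ("e5")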
Looks like whatever process extracted the second file from the contents of the first also converted the output to ISO-8859-1. Now that I had figured this out, I needed a version of slurp that decoded the second file as ISO-8859-1. Of course, first I just wrote it from scratch:
(defn slurp-iso
  "Read in file with ISO-8859-1 encoding (e.g. Danish). Returns file contents as string."
  [filename]
  (with-open [rdr (-> filename
                      (java.io.FileInputStream.)
                      (java.io.InputStreamReader. "ISO-8859-1")
                      (java.io.BufferedReader.))]
    ;; Note: joins the lines into one string, dropping the newlines.
    (apply str (line-seq rdr))))
Afterwards (naturally), I read the documentation and found that the same effect can be achieved with:
(slurp "filename" :encoding "ISO-8859-1")
There seems to be no easy way to determine how a file is encoded without examining its contents. Well, at least I know some of the pitfalls now.
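For next time, one workable heuristic (my own sketch, not something the post or the standard library provides directly; the name guess-encoding is made up) is to attempt a strict UTF-8 decode and fall back to ISO-8859-1 when the bytes don't validate:

(import '(java.nio ByteBuffer)
        '(java.nio.charset Charset CodingErrorAction CharacterCodingException)
        '(java.nio.file Files))

(defn guess-encoding
  "Return \"UTF-8\" if the file's bytes are valid UTF-8, else \"ISO-8859-1\"."
  [filename]
  (let [bytes   (Files/readAllBytes (.toPath (java.io.File. filename)))
        decoder (doto (.newDecoder (Charset/forName "UTF-8"))
                  (.onMalformedInput CodingErrorAction/REPORT)
                  (.onUnmappableCharacter CodingErrorAction/REPORT))]
    (try
      (.decode decoder (ByteBuffer/wrap bytes))
      "UTF-8"
      (catch CharacterCodingException _
        "ISO-8859-1"))))

;; e.g. (slurp "filename" :encoding (guess-encoding "filename"))

This only works one way round: every byte sequence is valid ISO-8859-1, so a file can never be rejected as Latin-1, and pure-ASCII content decodes identically under both encodings anyway.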