Funny foreign characters

As part of my day job, I had to convert some XML data provided by one of our Danish customers. During this process I learned a little about character encoding. I have read Joel Spolsky's article, but it's not until you encounter the problems of character encoding face-to-face that it really hits home.

There were two files: the first a complete set of XML, and the second a subset of the first containing the data of particular interest. To process the files, I had already written some Clojure code that converted XML to maps for easier manipulation. To read the file contents, the code used the built-in slurp function. For the first file, this resulted in a set of strings containing all those funny Danish characters. However, reading the second file the same way gave me strings with clearly incorrect Danish characters. What the hell?
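As an aside, the XML-to-map code itself was nothing exotic. A minimal sketch of that kind of conversion, built on clojure.xml/parse (which returns nested {:tag :attrs :content} maps), might look like the following; the file name and element layout in the usage comment are invented purely for illustration:

(require '[clojure.xml :as xml])

(defn element->map
  "Turn one parsed element's child elements into a map of tag -> text content."
  [element]
  (into {}
        (for [child (:content element)
              :when (map? child)]          ; skip bare text nodes
          [(:tag child) (apply str (:content child))])))

;; Usage, assuming a flat <record><name>...</name>...</record> layout:
;; (map element->map (:content (xml/parse (java.io.File. "customers.xml"))))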

I inspected the files with od -t cx4. A key excerpt from each file is shown below:

First file

0001720    3   >   F   e   j   l       p   å  **       l   i   n   i   e
                 65463e33        70206c6a        6c20a5c3        65696e69

Second file

0002060    e   s   l 345       a   t       t   e   k   n   i   k   e   r
                 e56c7365        20746120        6e6b6574        72656b69

After re-reading Mr. Spolsky's article, it became clear what was going on. The first file is in UTF-8 format; the clue is the two-byte sequence x'c3a5' (the bytes appear swapped because the dump was taken on a little-endian machine). This is a UTF-8 sequence which, when decoded, results in the single character x'e5': in binary, 0xC3 0xA5 is 110 00011 10 100101, and stripping the 110/10 marker bits leaves 000 1110 0101, i.e. 0xE5 - a lower-case 'a' with a ring on top, that is "å".

The excerpt from the second file shows the same character (displayed as octal '345' in the character column), but here the x'e5' byte appears as itself. Hence the file encoding is not UTF-8, but ISO-8859-1 (aka ISO-LATIN-1).
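You can see the difference straight from a Clojure REPL, using the byte values from the dumps above:

;; The two-byte UTF-8 form and the one-byte ISO-8859-1 form decode
;; to the same character under their respective charsets.
(String. (byte-array [(unchecked-byte 0xc3) (unchecked-byte 0xa5)]) "UTF-8")
;; => "å"

(String. (byte-array [(unchecked-byte 0xe5)]) "ISO-8859-1")
;; => "å"

;; Decoding the lone ISO-8859-1 byte as if it were UTF-8 fails, yielding
;; the Unicode replacement character instead of å.
(String. (byte-array [(unchecked-byte 0xe5)]) "UTF-8")
;; => "�"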

It looks like whatever process extracted the second file from the contents of the first also converted the output to ISO-8859-1. Now that I had figured this out, I needed a version of slurp which decoded the second file as ISO-8859-1. Of course, first I just wrote it from scratch:

(defn slurp-iso
  "Read in file with ISO-8859-1 encoding (e.g. Danish). Returns
  file contents as string."
  [filename]
  (with-open [rdr (-> filename
                      (java.io.FileInputStream.)
                      (java.io.InputStreamReader. "ISO-8859-1")
                      (java.io.BufferedReader.))]
    (apply str (line-seq rdr))))

Afterwards (naturally), I read the documentation and found that the same effect can be achieved with:

  (slurp "filename" :encoding "ISO-8859-1")

There seems to be no easy way to determine how a file is encoded without examining its bytes. Well, at least I know some of the pitfalls now.
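If you do want to examine a file programmatically, one rough heuristic (a sketch only, not a general-purpose detector; the function name is my own) is to attempt a strict UTF-8 decode and fall back to ISO-8859-1 when it fails:

(import '[java.nio ByteBuffer]
        '[java.nio.charset Charset CodingErrorAction CharacterCodingException])

(defn probably-utf-8?
  "Rough check: true when the file's bytes decode cleanly as strict UTF-8.
  Note that pure-ASCII files also pass, since ASCII is valid UTF-8."
  [filename]
  (let [data    (java.nio.file.Files/readAllBytes
                  (.toPath (java.io.File. filename)))
        decoder (doto (.newDecoder (Charset/forName "UTF-8"))
                  (.onMalformedInput CodingErrorAction/REPORT)
                  (.onUnmappableCharacter CodingErrorAction/REPORT))]
    (try
      (.decode decoder (ByteBuffer/wrap data))
      true
      (catch CharacterCodingException _ false))))

;; e.g. pick the encoding before slurping (file name is hypothetical):
;; (slurp "subset.xml" :encoding (if (probably-utf-8? "subset.xml")
;;                                 "UTF-8"
;;                                 "ISO-8859-1"))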