Dealing with Unicode in filenames

I've recently acquired a portable WD 250GB USB external disk drive, based on a 2.5" drive, very small and light. Since I could carry it around with me, it seemed like a good way to start having an 'off-site' backup. Even copying stuff from Debian was easy, but I had to install the ntfs-3g package as the disk is NTFS formatted.

However, I encountered a problem when trying to copy my mp3 collection on to it via thunar in xfce4. The copy failed, complaining that the file system did not support the characters in some of the filenames: "Invalid or incomplete multibyte or wide character". On inspection, these turned out to be characters from the Unicode Latin-1 Supplement set in UTF-8, as my Debian locale is set to en_GB.utf8. Trawling Google indicated that the version of ntfs-3g in Debian lenny might have limited Unicode support. Hmm, looks like I needed to figure out how to rename the files to eliminate the funny European characters. Let's face it, 26 characters and a bit of punctuation should be enough for any language.

I'd never had to deal with non-ascii characters in earnest before, so I had to perform a little research on UTF-8 and the Python capabilities. There are a lot of resources (e.g. an article by Markus Kuhn and by anonymous).

Armed with this info, I was ready. First, how many files did I have to deal with? I produced a list of all the files and directories under the two directories containing mp3 files using:

  find classical mp3 -print >flist

Next, into the python code. I made use of the codecs module, which provides a host of capabilities for dealing with Unicode data. To process the list of mp3 pathnames, I used codecs.open, which returns a file-like object that can handle Unicode strings. The following code (contained in a file called xlate.py, hence module xlate) will read in the list of pathnames from flist and return a list of those pathnames with non-ascii characters.

  import codecs
  def collect(fn):
      """return list of lines in file fn that contain non-ascii characters."""
      tlist = list()
      flist = codecs.open(fn,'rb','utf-8','replace').readlines()
      for f in flist:
          for c in f:
              if ord(c) > 127:
                  tlist.append(f.strip()) # remove trailing newline
                  break
      return tlist
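
For readers on Python 3, where the built-in open() decodes text itself, the same check can be sketched as below. This is an illustrative variant rather than the code used in this post, and the name collect_py3 is made up for the example.

```python
def collect_py3(fn):
    """Return the lines of file fn that contain non-ASCII characters."""
    found = []
    # errors='replace' turns undecodable bytes into U+FFFD instead of raising
    with open(fn, encoding='utf-8', errors='replace') as f:
        for line in f:
            if any(ord(c) > 127 for c in line):
                found.append(line.rstrip('\n'))  # remove trailing newline
    return found
```

As with the original, a line that contained invalid UTF-8 is reported too, because the replacement character U+FFFD is itself non-ASCII.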

Let's run this in the interpreter:

  [mark@amber:/rep/music] python 
  Python 2.5.2 (r252:60911, Jan 24 2010, 14:53:14) 
  [GCC 4.3.2] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import xlate
  >>> tlist = xlate.collect("flist")
  >>> tlist[:2]
  [u'mp3/arvo_p\xe4rt_-_tabula_rasa',
   u'mp3/arvo_p\xe4rt_-_tabula_rasa/06_adagio_in_e_flat,_d.897_(\xbbnotturno\xab).mp3']

OK, a good start. Now, how many different characters are we dealing with? Let's add another function to xlate.py that will return a list of unique characters found:

  def ident_chars(ls):
      """return list of non-ascii chars in unicode strings found in list ls."""
      uchars = list()
      for f in ls:
          for c in f:
              if ord(c) > 127 and uchars.count(c) == 0:
                  uchars.append(c)
      return sorted(uchars)
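
The same idea can be written more compactly with a set, which takes care of the uniqueness test. A minimal Python 3 sketch (set comprehensions are not available in Python 2.5, and ident_chars_set is a name invented here):

```python
def ident_chars_set(lines):
    """Return a sorted list of the distinct non-ASCII characters in lines."""
    # The set discards duplicates; sorted() then orders by code point
    return sorted({c for line in lines for c in line if ord(c) > 127})
```

For a few hundred filenames the difference is cosmetic, but the set avoids rescanning the result list for every character.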

Let's test it:

  >>> reload(xlate)
  <module 'xlate' from 'xlate.pyc'>
  >>> clist = xlate.ident_chars(tlist)
  >>> clist
  [u'\xab', u'\xbb', u'\xc9', u'\xe0', u'\xe1', u'\xe2', u'\xe4', u'\xe8', 
   u'\xe9', u'\xef', u'\xf1', u'\xf3', u'\xf4', u'\xf7', u'\xfa', u'\xfc', 
   u'\ufffd']
  >>>

Not too many. Still, it would be useful to have a function that automatically writes a dictionary skeleton, making it easy to define the mappings from "weird" European characters to standard Anglo-Saxon ones. Such a function, which writes the Unicode code point, an empty replacement and the original character glyph for each entry, might look something like the following:

  def create_mapping(clist,fn): 
      """Write python dictionary definition code to filename fn, using
      unicode characters in list clist as keys."""

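      # Python 2.5 needs 'from __future__ import with_statement' at the
      # top of xlate.py for the with block to compile.
      with codecs.open(fn,'wb','utf-8','replace') as f:
          f.write("mapping = {\n")
          for c in clist:
              # NB: \x%x is only a valid escape for code points up to
              # 0xff; anything larger (e.g. U+FFFD) needs \uNNNN instead.
              f.write(u'\tu\'\\x%x\' : \'\', # %s\n '%(ord(c),c))
          f.write('\tu\'\\x00\' : \'\' }\n')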
      return

What does the output of running this function look like?

  >>> create_mapping(clist,"inc_map.py")
  >>> with open("inc_map.py") as f:
  ...     print f.read()
  ... 
  mapping = {
      u'\xab' : '', # «
      u'\xbb' : '', # »
      u'\xc9' : '', # É
      u'\xe0' : '', # à
      u'\xe1' : '', # á
      u'\xe2' : '', # â
      u'\xe4' : '', # ä
      u'\xe8' : '', # è
      u'\xe9' : '', # é
      u'\xef' : '', # ï
      u'\xf1' : '', # ñ
      u'\xf3' : '', # ó
      u'\xf4' : '', # ô
      u'\xf7' : '', # ÷
      u'\xfa' : '', # ú
      u'\xfc' : '', # ü
      u'\xfffd' : '', # �
      u'\x00' : '' }
  >>>
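
In Python 3 the escaping can be left to the built-in ascii(), which chooses between \xNN and \uNNNN itself and adds the quotes. The following is only a sketch of that idea; mapping_source is a hypothetical name, and it returns the text rather than writing a file.

```python
def mapping_source(chars):
    """Return Python source text defining a dict keyed by chars."""
    lines = ["mapping = {"]
    for c in chars:
        # ascii() yields e.g. '\xe4' or '\ufffd', quotes included
        lines.append("    %s: '',  # %s" % (ascii(c), c))
    lines.append("}")
    return "\n".join(lines) + "\n"
```

The returned text can be written out with open(fn, 'w', encoding='utf-8') and edited by hand in the same way as the generated file above.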

That makes it easy to set up my preferred non-ascii to ascii character mappings: I just insert the ascii characters into the appropriate dictionary entries. By default each character maps to the empty string, so it is simply removed. One generated key also needs correcting by hand: \x takes exactly two hex digits, so u'\xfffd' has to become u'\ufffd' to match the replacement character.

  mapping = {
      u'\xab' : '', # «
      u'\xbb' : '', # »
      u'\xc9' : 'E', # É
      u'\xe0' : 'a', # à
      u'\xe1' : 'a', # á
      u'\xe2' : 'a', # â
      u'\xe4' : 'a', # ä
      u'\xe8' : 'e', # è
      u'\xe9' : 'e', # é
      u'\xef' : 'i', # ï
      u'\xf1' : 'n', # ñ
      u'\xf3' : 'o', # ó
      u'\xf4' : 'o', # ô
      u'\xf7' : '', # ÷
      u'\xfa' : 'u', # ú
      u'\xfc' : 'u', # ü
      u'\ufffd' : '', # �
      u'\x00' : '' }
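
As an aside, a table of this shape can be applied to a string in a single pass in Python 3 using str.maketrans and str.translate. The sketch below uses a shortened copy of the table, and xlate_once is a name made up for the illustration; it is not the approach taken in the rest of this post.

```python
# A shortened copy of the table above, for illustration only.
mapping = {
    '\xab': '', '\xbb': '',
    '\xc9': 'E', '\xe4': 'a', '\xe9': 'e', '\xfc': 'u',
}

# str.maketrans accepts a dict whose keys are single characters;
# an empty-string value deletes the character.
table = str.maketrans(mapping)

def xlate_once(s):
    """Return s with every character in mapping replaced, in one pass."""
    return s.translate(table)
```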

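This file can then be copied wholesale into xlate.py and used as the basis for the next function we have to write: one that actually translates the filenames. (Because the comments contain non-ascii characters, xlate.py then needs a "# -*- coding: utf-8 -*-" line at the top, or Python 2 will refuse to load it.) Gird your loins. We are going to take a peek into the ugly world of hacking at the first thing you thought of. Those of a nervous disposition should look away now...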

OK, that first thought was: I have a list of pathnames to translate, so I just need to run through them, deriving the plain anglo-saxon name for each path, then rename the file. Sounds simple enough, let's write some code:

  def xlate(ls):
      newls = list()
      for f in ls:
          s = f
          for m in mapping:
              s = s.replace(m,mapping[m])
          newls.append(s)
      return newls

  def rename_files(old,new):
      map(os.rename, old, new)
      return

So, the function xlate takes a list of pathnames with non-ascii characters, translates the odd characters, then returns a list of the translated pathnames. The rename_files function will actually perform the renaming. Hang on a moment, that's not going to work. If you look again at the snippet from tlist (above), the parent directory comes before the contained files. Once the containing directory is renamed, all the other renames in that directory are going to fail because the original (non-ascii) pathname no longer exists. Bugger, now what?
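
One way to rescue the list-based approach, sketched here in Python 3 under the assumption that the list holds every affected path, is to process the deepest paths first and to change only the final component of each one. A parent directory then still has its old name while its children are being renamed. The helper names are invented for this example; the post itself takes a different route below.

```python
import os

def deepest_first(paths):
    """Order paths so that children always come before their parents."""
    return sorted(paths, key=lambda p: p.count(os.sep), reverse=True)

def rename_last_component(path, translate):
    """Rename only the final component of path; return the new path."""
    head, tail = os.path.split(path)
    new = os.path.join(head, translate(tail))
    if new != path:
        os.rename(path, new)
    return new
```

Here translate stands for any function mapping one name to its ascii form; each path would be passed through rename_last_component in the order given by deepest_first.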

OK, I need a process that visits the files before the containing directory. One easy solution is find with the -depth option. Combined with -execdir, a simple python script would do the job. Something like this, where the code resides in xlate.py:

  import os
  import sys

  def xlate_str(s):
    """Return copy of s with utf-8 characters, as defined in dictionary 
    mapping, mapped to ascii."""
    n = s
    for m in mapping:
        n = n.replace(m,mapping[m])
    return n

  # main code
  if __name__ == '__main__':
      pname = sys.argv[0]
      if len(sys.argv) == 2:
          old = unicode(sys.argv[1],'utf-8','replace')
          new = xlate_str(old)
          if old != new:
              try:
                  os.rename(old,new)
                  print "> %s %s"% (old,new)
              except OSError,e:
                  print >>sys.stderr,"%s: error in %s at %s: %s"%\
                        (pname,os.getcwd(),old,e[1])
                  sys.exit(1)
      else:
          print >>sys.stderr, "%s: one argument expected."%(pname,)
          sys.exit(1)

One could run this with a command such as:

  find mp3 classical -depth -execdir python /rep/music/xlate.py {} \;

However, this has an overhead in that it invokes python for every file. A more efficient way would be to perform the filesystem walk (à la find) within python itself, if there is a simple method. And there is: the os module offers a ready solution, os.walk. The key element is the topdown option: with topdown=False the walk is bottom-up, so each directory is reported only after everything beneath it, and we can deal with the filenames before the containing directory. With this function, we can perform everything within python.

  import getopt
  import os
  import sys

  def walk(mod_name,top,do_move): 
      """Walk directory tree (depth first) with root at top, renaming
      contained files/directories that contain non-ascii characters to
      use only ascii characters.

      Error messages are prefixed with the string mod_name.

      If do_move is false, the renames are not actually performed, but
      a message indicating what would have been done is issued.

      """
      for root, dirs, files in os.walk(top,topdown=False):
          eroot = unicode(root,'utf-8','replace')
          for fd in files+dirs:
              old = unicode(fd,'utf-8','replace')
              new = xlate_str(old)
              if old != new:
                  if os.path.exists(os.path.join(eroot,new)):
                      print >>sys.stderr,\
                      "%s: at %s: xlated file %s already exists."%\
                          (mod_name,eroot,new)
                  else:
                      try:
                          if do_move: os.rename(os.path.join(eroot,old),
                                                os.path.join(eroot,new))
                          print "%s: > %s %s"%(eroot,old,new)
                      except OSError,e:
                          print >>sys.stderr,"%s: error in %s at %s: %s"%\
                                (mod_name,eroot,old,e[1])
      return

  # main code
  if __name__ == '__main__':
      mod_name = os.path.basename(sys.argv[0])
      top = os.getcwd()
      do_move = True
      try:
          opts,args = getopt.getopt(sys.argv[1:],'np:')
          for o,v in opts:
              if o == '-n': do_move = False
              elif o == '-p': top = v
      except getopt.GetoptError,e:
          print >>sys.stderr,"%s: unknown argument: -%s"%(mod_name,e.opt)
          sys.exit(1)

      walk(mod_name,top,do_move)

I've added a couple of command-line arguments to the main code whose values are passed directly to the xlate.walk function: -p sets the starting directory, and -n is a flag indicating that the moves should only be reported, not actually performed.
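
For comparison, here is a condensed Python 3 sketch of the same bottom-up walk. It is not a drop-in replacement for the code above: walk_rename and its translate parameter are assumptions made for the illustration, and it returns the renames instead of printing them.

```python
import os

def walk_rename(top, translate, do_move=True):
    """Rename entries under top bottom-up; return the (old, new) path pairs."""
    done = []
    # topdown=False yields each directory after all of its subdirectories
    for root, dirs, files in os.walk(top, topdown=False):
        for name in files + dirs:
            new = translate(name)
            if new == name:
                continue
            src = os.path.join(root, name)
            dst = os.path.join(root, new)
            if os.path.exists(dst):
                continue  # never overwrite an existing entry
            if do_move:
                os.rename(src, dst)
            done.append((src, dst))
    return done
```

Calling it with do_move=False corresponds to the -n option: the list shows what would be renamed while the tree is left untouched.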

Once this rename process had been run, I was able to copy all my mp3 files to the WD external hard drive. On the other hand, if I were a true European, perhaps I should have just waited until Debian was updated with a version of ntfs-3g which had better unicode support.

Addendum 7th May, 2010

I discovered, in a post on the comp.lang.python newsgroup by Peter Otten (in response to a question from coldpizza), that I had wasted my time. The python unicodedata module already includes a means of eliminating non-ascii characters: unicodedata.normalize. Here's an example of the usage:

  astr = unicodedata.normalize('NFD',ustr).encode('ascii','ignore')

which essentially replaces the xlate_str function. So, no lookup table needs to be created at all. I hate reinventing the wheel...
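
To make the behaviour concrete, here is a small Python 3 sketch of the same call (to_ascii is a name chosen for the example). NFD splits an accented letter into its base letter plus a combining mark, and encoding to ascii with 'ignore' then discards the mark. Characters with no decomposition, such as «, », ÷ or ø, are dropped entirely, so a lookup table can still be useful if such characters should map to something.

```python
import unicodedata

def to_ascii(s):
    """Decompose accented characters, then drop everything non-ASCII."""
    decomposed = unicodedata.normalize('NFD', s)
    # encode() gives bytes in Python 3, so decode back to str
    return decomposed.encode('ascii', 'ignore').decode('ascii')
```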