I've recently acquired a portable WD 250GB USB external disk drive,
based on a 2.5" drive, very small and light. Since I could carry it
around with me, it seemed like a good way to start having an
'off-site' backup. Even copying stuff from Debian was easy, though I
had to install the ntfs-3g package, as the disk is NTFS formatted.
However, I encountered a problem when trying to copy my mp3 collection
on to it via thunar in xfce4. The copy failed, complaining that the
file system did not support the characters in some of the filenames:
"Invalid or incomplete multibyte or wide character". On inspection,
these turned out to be characters from the Unicode Latin-1 Supplement
set in UTF-8, as my Debian locale is set to en_GB.utf8. Trawling
Google indicated that the version of ntfs-3g in Debian lenny might
have limited Unicode support.
Hmm, looks like I needed to figure out how to rename the files to
eliminate the funny European characters. Let's face it, 26
characters and a bit of punctuation should be enough for any
language.
I'd never had to deal with non-ascii characters in earnest before, so I had to perform a little research on UTF-8 and the Python capabilities. There are a lot of resources (e.g. an article by Markus Kuhn and by anonymous).
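The essential point from that reading is the distinction between code points and the bytes that encode them. A small illustration (in modern Python 3, where the distinction is explicit; not part of the original script):

```python
# 'ä' is a single code point (U+00E4) but two bytes in UTF-8 (0xC3 0xA4).
s = 'pärt'                # four code points
b = s.encode('utf-8')     # five bytes
print(len(s), len(b))     # 4 5
print(hex(ord('ä')))      # 0xe4
```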
Armed with this info, I was ready. First, how many files did I have to deal with? I produced a list of all the files and directories under the two directories containing mp3 files using:
find classical mp3 -print >flist
Next, into the python code. I made use of the codecs module, which
provides a host of capabilities for dealing with Unicode data. To
process the list of mp3 pathnames, I used codecs.open, which returns
a file-like object that can handle Unicode strings. The following
code (contained in a file called xlate.py, hence module xlate) will
read in the list of pathnames from flist and return a list of those
pathnames with non-ascii characters.
import codecs

def collect(fn):
    """return list of lines in file fn that contain non-ascii characters."""
    tlist = list()
    flist = codecs.open(fn,'rb','utf-8','replace').readlines()
    for f in flist:
        for c in f:
            if ord(c) > 127:
                tlist.append(f.strip())  # remove trailing newline
                break
    return tlist
Let's run this in the interpreter:
[mark@amber:/rep/music] python
Python 2.5.2 (r252:60911, Jan 24 2010, 14:53:14)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xlate
>>> tlist = xlate.collect("flist")
>>> tlist[:2]
[u'mp3/arvo_p\xe4rt_-_tabula_rasa', u'mp3/arvo_p\xe4rt_-_tabula_rasa/06_adagio_in_e_flat,_d.897_(\xbbnotturno\xab).mp3']
OK, a good start. Now, how many different characters are we dealing
with? Let's add another function to xlate.py that will return a list
of unique characters found:
def ident_chars(ls):
    """return list of non-ascii chars in unicode strings found in list ls."""
    uchars = list()
    for f in ls:
        for c in f:
            if ord(c) > 127 and uchars.count(c) == 0:
                uchars.append(c)
    return sorted(uchars)
Let's test it:
>>> reload(xlate)
<module 'xlate' from 'xlate.pyc'>
>>> clist = xlate.ident_chars(tlist)
>>> clist
[u'\xab', u'\xbb', u'\xc9', u'\xe0', u'\xe1', u'\xe2', u'\xe4', u'\xe8', u'\xe9', u'\xef', u'\xf1', u'\xf3', u'\xf4', u'\xf7', u'\xfa', u'\xfc', u'\ufffd']
>>>
Not too many. Still, it would be useful to have a function that automatically writes out a dictionary definition, to make it easy to define the mappings of "weird" European characters to standard Anglo-Saxon ones. Such a function, writing the unicode code point, an empty replacement and the original character glyph as a comment, might look something like the following:
from __future__ import with_statement  # needed for 'with' on python 2.5

def create_mapping(clist,fn):
    """Write python dictionary definition code to filename fn,
    using unicode characters in list clist as keys."""
    with codecs.open(fn,'wb','utf-8','replace') as f:
        f.write("mapping = {\n")
        for c in clist:
            f.write(u'\tu\'\\x%x\' : \'\', # %s\n'%(ord(c),c))
        f.write('\tu\'\\x00\' : \'\' }\n')
    return
What does the output of running this function look like?
>>> create_mapping(clist,"inc_map.py")
>>> with open("inc_map.py") as f:
...     print f.read()
...
mapping = {
    u'\xab' : '', # «
    u'\xbb' : '', # »
    u'\xc9' : '', # É
    u'\xe0' : '', # à
    u'\xe1' : '', # á
    u'\xe2' : '', # â
    u'\xe4' : '', # ä
    u'\xe8' : '', # è
    u'\xe9' : '', # é
    u'\xef' : '', # ï
    u'\xf1' : '', # ñ
    u'\xf3' : '', # ó
    u'\xf4' : '', # ô
    u'\xf7' : '', # ÷
    u'\xfa' : '', # ú
    u'\xfc' : '', # ü
    u'\xfffd' : '', # �
    u'\x00' : '' }
>>>
That makes it easy to set up my preferred non-ascii to ascii character mappings: I just insert the ascii characters into the appropriate dictionary entries. Any character left untouched is mapped to nothing (the empty string).
mapping = {
    u'\xab' : '',  # «
    u'\xbb' : '',  # »
    u'\xc9' : 'E', # É
    u'\xe0' : 'a', # à
    u'\xe1' : 'a', # á
    u'\xe2' : 'a', # â
    u'\xe4' : 'a', # ä
    u'\xe8' : 'e', # è
    u'\xe9' : 'e', # é
    u'\xef' : 'i', # ï
    u'\xf1' : 'n', # ñ
    u'\xf3' : 'o', # ó
    u'\xf4' : 'o', # ô
    u'\xf7' : '',  # ÷
    u'\xfa' : 'u', # ú
    u'\xfc' : 'u', # ü
    u'\xfffd' : '', # �
    u'\x00' : '' }
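As a hypothetical aside: in modern Python 3, str.translate does this kind of lookup natively, taking a table keyed by code point rather than a dict of one-character strings; this sketch is not part of the original script.

```python
# A translate table maps code points to replacement strings;
# None deletes the character outright.
table = {
    0xe4: 'a',   # ä
    0xe9: 'e',   # é
    0xab: None,  # «
    0xbb: None,  # »
}
print('arvo_pärt_«live»'.translate(table))  # arvo_part_live
```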
This file can then be copied wholesale into xlate.py and used as the
basis for the next function we have to write: actually translating
the filenames. Gird your loins. We are going to take a peek into the
ugly world of hacking at the first thing you thought of. Those of a
nervous disposition should look away now...
OK, that first thought was: I have a list of pathnames to translate, so I just need to run through them, deriving the plain anglo-saxon name for each path, then rename the file. Sounds simple enough, let's write some code:
import os

def xlate(ls):
    newls = list()
    for f in ls:
        s = f
        for m in mapping:
            s = s.replace(m,mapping[m])
        newls.append(s)
    return newls

def rename_files(old,new):
    map(os.rename, old, new)
    return
So, the function xlate takes a list of pathnames with non-ascii
characters, translates the odd characters, then returns a list of
the translated pathnames. The rename_files function will actually
perform the renaming. Hang on a moment, that's not going to work. If
you look again at the snippet from tlist (above), the parent
directory comes before the contained files. Once the containing
directory is renamed, all the other renames in that directory are
going to fail because the original (non-ascii) pathname no longer
exists. Bugger, now what?
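As an aside, the ordering problem can also be sidestepped by sorting the list deepest-path-first, so children are renamed before their parents. A hypothetical sketch (Python 3, not the route taken below):

```python
# Sort by path depth, deepest first, so a file is renamed
# before the directory that contains it.
paths = ['mp3/arvo_pärt', 'mp3/arvo_pärt/06_adagio.mp3']
deepest_first = sorted(paths, key=lambda p: p.count('/'), reverse=True)
print(deepest_first[0])  # mp3/arvo_pärt/06_adagio.mp3
```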
OK, I need a process that visits the files before the containing
directory. One easy solution is find with the -depth option.
Combined with -execdir, a simple python script would do the job, say
something like this, where the code resides in xlate.py:
import os
import sys

def xlate_str(s):
    """Return copy of s with utf-8 characters, as defined in
    dictionary mapping, mapped to ascii."""
    n = s
    for m in mapping:
        n = n.replace(m,mapping[m])
    return n

# main code
if __name__ == '__main__':
    pname = sys.argv[0]
    if len(sys.argv) == 2:
        old = unicode(sys.argv[1],'utf-8','replace')
        new = xlate_str(old)
        if old != new:
            try:
                os.rename(old,new)
                print "> %s %s"% (old,new)
            except OSError,e:
                print >>sys.stderr,"%s: error in %s at %s: %s"%\
                    (pname,os.getcwd(),old,e[1])
                sys.exit(1)
    else:
        print >>sys.stderr, "%s: one argument expected."%(pname,)
        sys.exit(1)
One could run this with a command such as:
find mp3 classical -depth -execdir python /rep/music/xlate.py {} \;
However, this has an overhead in that it invokes python for every
file. A more efficient way would be to perform the filesystem walk
(a la find) within python itself, if there is a simple method. And
there is: the os module offers a ready solution, os.walk. The key
element is the topdown option, which allows us to invoke a
depth-first directory walk, that is, we can deal with the filenames
before the containing directory. With this function, we can perform
everything within python.
import getopt
import os
import sys

def walk(mod_name,top,do_move):
    """Walk directory tree (depth first) with root at top, renaming
    contained files/directories that contain non-ascii characters to
    use only ascii characters. Error messages are prefixed with the
    string mod_name. If do_move is false, the renames are not
    actually performed, but a message indicating what would have been
    done is issued.
    """
    for root, dirs, files in os.walk(top,topdown=False):
        eroot = unicode(root,'utf-8','replace')
        for fd in files+dirs:
            old = unicode(fd,'utf-8','replace')
            new = xlate_str(old)
            if old != new:
                if os.path.exists(os.path.join(eroot,new)):
                    print >>sys.stderr,\
                        "%s: at %s: xlated file %s already exists."%\
                        (mod_name,eroot,new)
                else:
                    try:
                        if do_move:
                            os.rename(os.path.join(eroot,old),
                                      os.path.join(eroot,new))
                        print "%s: > %s %s"%(eroot,old,new)
                    except OSError,e:
                        print >>sys.stderr,"%s: error in %s at %s: %s"%\
                            (mod_name,eroot,old,e[1])
    return

# main code
if __name__ == '__main__':
    mod_name = os.path.basename(sys.argv[0])
    top = os.getcwd()
    do_move = True
    try:
        opts,args = getopt.getopt(sys.argv[1:],'np:')
        for o,v in opts:
            if o == '-n':
                do_move = False
            elif o == '-p':
                top = v
    except getopt.GetoptError,e:
        print >>sys.stderr,"%s: unknown argument: -%s"%(mod_name,e.opt)
        sys.exit(1)
    walk(mod_name,top,do_move)
I've added a couple of command arguments to the main code that are
passed directly to the xlate.walk function: the starting directory
(-p) and a flag (-n) to indicate that the renames should not
actually be performed, e.g. python xlate.py -n -p mp3 for a dry run.
Once this rename process had been run, I was able to copy all my mp3
files to the WD external hard drive. On the other hand, if I were a
true European, perhaps I should have just waited until Debian was
updated with a version of ntfs-3g which had better unicode support.
I discovered, in a post on the comp.lang.python newsgroup by Peter
Otten (in response to a question from coldpizza), that I had wasted
my time. The python unicodedata module already includes a means of
eliminating non-ascii characters, using unicodedata.normalize.
Here's an example of the usage:
astr = unicodedata.normalize('NFD',ustr).encode('ascii','ignore')
which essentially replaces the xlate_str function, so no lookup table needs to be created at all. I hate reinventing the wheel...
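For the record, here's the same trick sketched in modern Python 3 (the code above is Python 2; the extra decode back to str is the only real difference). NFD decomposes an accented letter into its base letter plus a combining mark, and encoding to ascii with 'ignore' then drops the marks, along with anything that has no decomposition, such as « or ÷:

```python
import unicodedata

def strip_accents(s):
    """Map accented characters to their base ascii letters."""
    return unicodedata.normalize('NFD', s).encode('ascii', 'ignore').decode('ascii')

print(strip_accents('arvo_pärt'))   # arvo_part
print(strip_accents('«notturno»'))  # notturno
```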