With my web site now totally in XHTML format, I was in a position to deploy some python-based tools to check the web site for correctness and consistency.
There are three scripts. The first two are dlc.py, a dead link checker, and chk_xhtml.py, which checks that the XHTML web pages are well formed. dlc.py was fairly easy to write, although it still needs to be extended to follow ftp links. Checking XHTML correctness was slightly more difficult since, for reasons of speed, I had to enable the checker to use local copies of the DTD and supporting files. Otherwise, since I referenced the web copies in the DOCTYPE declaration, these web versions were fetched each time, doubling or trebling the time taken to validate a page.
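As an illustration of the dead link checking idea, here is a minimal sketch in Python 3. It is not the source of dlc.py: the names LinkCollector, extract_links and is_dead are hypothetical, and the sketch assumes that only http and https links are fetched, with ftp and other schemes skipped.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.error
import urllib.request


class LinkCollector(HTMLParser):
    """Collect the targets of href and src attributes (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for XHTML empty elements such as <img ... />.
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(page_text, base_url):
    """Return every link found in the page, as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(page_text)
    collector.close()
    return collector.links


def is_dead(url, timeout=10):
    """Return True if the URL cannot be fetched, False if it can,
    and None for schemes this sketch does not check (ftp, mailto, ...)."""
    if urlparse(url).scheme not in ("http", "https"):
        return None
    # Assumption: a HEAD request is enough. Servers that reject HEAD
    # would be reported as dead by this sketch.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return False
    except (urllib.error.URLError, OSError, ValueError):
        return True
```

A checker built on this would call extract_links on each page and report every link for which is_dead returns True.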
To this end, I developed a class, catalog, that builds an in-memory dictionary mapping public XML entity names to their local file definitions. I kept the parsing simple by using the shlex lexer provided with Python. How robust catalog is remains open to question, since the design is based on eyeballing the structure of the catalog files on the local system.
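The mapping idea can be sketched as follows. This is not the catalog class from the tar file: the class name Catalog, its methods, and the assumption that the catalog files use SGML-style "PUBLIC public-id file" entries with "-- ... --" comments are all illustrative guesses, written for Python 3.

```python
import os
import shlex


class Catalog:
    """Map PUBLIC identifiers to local files (hypothetical sketch)."""

    def __init__(self):
        self.public = {}

    def load(self, text, base_dir="."):
        """Read catalog text and record each PUBLIC entry it contains."""
        lexer = shlex.shlex(text, posix=True)
        lexer.whitespace_split = True  # split on whitespace only
        lexer.commenters = ""          # '#' has no special meaning here
        tokens = list(lexer)           # quoted identifiers become one token

        i = 0
        in_comment = False
        while i < len(tokens):
            token = tokens[i]
            if token == "--":
                # Assumed comment syntax: -- text --
                in_comment = not in_comment
                i += 1
            elif (not in_comment and token.upper() == "PUBLIC"
                    and i + 2 < len(tokens)):
                public_id, path = tokens[i + 1], tokens[i + 2]
                self.public[public_id] = os.path.join(base_dir, path)
                i += 3
            else:
                # Other keywords (OVERRIDE, SYSTEM, ...) are ignored.
                i += 1

    def resolve(self, public_id):
        """Return the local file for a public identifier, or None."""
        return self.public.get(public_id)
```

Because shlex treats apostrophes as quotes, a comment containing one would confuse this parser, which is the kind of fragility the robustness caveat above refers to.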
The limitation of chk_xhtml.py is that it can only check that an XHTML file is well formed, not that it is valid, i.e. that it conforms to the DTD specified in the DOCTYPE declaration. I needed something that could validate against a DTD.
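The difference between the two checks is easy to show with a small sketch. The function is_well_formed is a hypothetical name, not taken from chk_xhtml.py, and it uses the expat parser from the Python 3 standard library, which checks syntax only and never consults a DTD.

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks the end of the input
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Properly nested tags: well formed.
print(is_well_formed("<p><b>bold</b></p>"))

# Overlapping tags: not well formed.
print(is_well_formed("<p><b>bold</p></b>"))

# <bogus/> is not an XHTML element, so this document is not valid XHTML,
# yet a well-formedness check still accepts it.
print(is_well_formed("<html><body><bogus/></body></html>"))
```

The last case is the gap described above: only a validating parser that reads the DTD can reject it.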
After a little more research, I decided to use the Python bindings for libxml2, which can be installed via the port textproc/py-libxml2 on FreeBSD (under Debian, use python-libxml2). Based on chk_xhtml.py, I wrote a simple wrapper around the validation code given as an example on the xmlsoft web site.
This results in the third Python script, chk_xml.py.
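A wrapper of this kind might look roughly like the sketch below. It is not the source of chk_xml.py: the function name validate_file is hypothetical, and the calls follow the pattern of the validation example shipped with the libxml2 Python bindings, which must be installed for the function to run.

```python
def validate_file(path):
    """Return True if the file at path is valid against the DTD named
    in its own DOCTYPE declaration, False otherwise."""
    # Imported here so this module still loads where the third-party
    # libxml2 bindings are not installed.
    import libxml2

    ctxt = libxml2.createFileParserCtxt(path)
    ctxt.validate(1)          # switch DTD validation on
    ctxt.parseDocument()
    doc = ctxt.doc()          # may raise if nothing could be parsed
    valid = ctxt.isValid() == 1
    doc.freeDoc()             # the bindings need explicit cleanup
    return valid
```

A command-line script could call validate_file on each argument and print the names of the files that fail.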
A tar file containing the Python source for dlc.py, chk_xml.py, chk_xhtml.py and catalog.py may be downloaded. It is also available via the download page.