Website Validation

With my web site now entirely in XHTML, I was in a position to deploy some Python-based tools to check the site for correctness and consistency.

There are three scripts. The first two are dlc.py, a dead link checker, and chk_xhtml.py, which checks that the XHTML pages are well formed. dlc.py was fairly easy to write, although it still needs to be extended to follow ftp links. Checking the XHTML was slightly more difficult: for reasons of speed, I had to enable the checker to use local copies of the DTD and its supporting files. Otherwise, since the DOCTYPE declaration references the web copies, these were fetched each time, doubling or trebling the time taken to check a page.
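The general shape of a dead link check can be sketched as follows. This is not the code in dlc.py; the names LinkCollector, extract_links, http_probe and dead_links are made up for illustration, and the sketch assumes that a link counts as dead when a HEAD request fails. Like dlc.py as described above, it does not follow ftp links.

```python
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect the href and src targets found in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(page_text):
    """Return every link target in the page, in document order."""
    collector = LinkCollector()
    collector.feed(page_text)
    return collector.links


def http_probe(url, timeout=10):
    """Return True if the URL answers a HEAD request without an error."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout):
            return True
    except (URLError, ValueError, OSError):
        return False


def dead_links(page_text, probe=http_probe):
    """Return the http(s) links for which probe() reports failure.

    Relative and ftp links are skipped in this sketch.
    """
    return [link for link in extract_links(page_text)
            if link.startswith(("http://", "https://")) and not probe(link)]
```

Passing the probe in as a parameter keeps the link extraction testable without network access; a real checker would also want to resolve relative links against the page's own URL.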

To this end, I developed a class, catalog, that builds an in-memory dictionary mapping public XML entity names to their local file definitions. I kept the parsing simple by using the shlex lexer provided with Python. It is open to question how robust catalog is, since the design is based on eyeballing the structure of the catalog files on the local system.
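To give an idea of the approach, here is a minimal sketch of reading PUBLIC entries from a catalog file with shlex. It is not the catalog class itself; the function name parse_catalog and the base_dir parameter are invented for this example. It assumes entries of the form PUBLIC "public id" "file name", treats "--" as a comment delimiter only when it stands alone as a token, and ignores every other keyword.

```python
import os
import shlex


def parse_catalog(text, base_dir="."):
    """Map public identifiers to local file paths (illustrative sketch).

    Assumes whitespace-separated tokens, quoted identifiers, and
    comments delimited by stand-alone "--" tokens.
    """
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace_split = True   # split on whitespace only
    lexer.commenters = ""           # '#' is not a comment character here
    tokens = iter(lexer)

    mapping = {}
    for token in tokens:
        if token == "--":
            # Skip everything up to the closing "--".
            for token in tokens:
                if token == "--":
                    break
        elif token.upper() == "PUBLIC":
            public_id = next(tokens, None)
            system_id = next(tokens, None)
            if public_id is not None and system_id is not None:
                # Relative file names are taken relative to base_dir.
                mapping[public_id] = os.path.join(base_dir, system_id)
    return mapping
```

A caller would typically pass the directory holding the catalog file as base_dir, then look up the public identifier from a page's DOCTYPE declaration in the returned dictionary. An apostrophe inside a comment would confuse the lexer's quote handling, which is one of the ways a parser this simple can fall short.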

The limitation of chk_xhtml.py is that it can only check that an XHTML file is well formed, not that it is valid, i.e. that it conforms to the DTD named in its DOCTYPE declaration. I needed something that could validate against a DTD.
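The difference is easy to demonstrate with a non-validating parser. The sketch below uses the expat parser from the Python standard library, which may not be what chk_xhtml.py uses; the function name well_formed is made up for this example.

```python
from xml.parsers import expat


def well_formed(text):
    """Return (True, "") if text parses as XML, else (False, error message)."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(text, True)   # True marks this as the final chunk
    except expat.ExpatError as err:
        return False, str(err)
    return True, ""


# Properly nested: well formed.
good = "<html><body><p>ok</p></body></html>"

# The <p> element is never closed: not well formed.
broken = "<html><body><p>oops</body></html>"

# Properly nested, but a body inside a title is not allowed by the
# XHTML DTDs: well formed, yet not valid.
invalid = "<html><head><title><body/></title></head></html>"
```

Here well_formed(good) and well_formed(invalid) both succeed, while well_formed(broken) fails. Only a validating parser that reads the DTD can reject the third document.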

After a little more research, I decided to use the Python bindings for libxml2, which can be installed via the port textproc/py-libxml2 on FreeBSD (under Debian, use python-libxml2). Starting from chk_xhtml.py, I wrote a simple wrapper around the validation code given as an example on the xmlsoft web site. The result is the third Python script, chk_xml.py.

A tar file containing the python source for dlc.py, chk_xml.py, chk_xhtml.py and catalog.py may be downloaded. It is also available via the download page.