XHTML Validation

When I first started developing this web site, I used (and I'm ashamed to admit it) Microsoft Frontpage. Once I'd moved the site to its current FreeBSD host, perforce I had to learn HTML. Later additions, including most of the journal entries, were written using emacs with psgml. However, the genesis of the site had left a lot of cruft in the HTML files: I had a mixture of file naming conventions (.htm for the original Frontpage-created files, and .html for those created by hand), and I had no real clue as to how correct the HTML on the site was.

The first thing to tackle was the initial page. This was a Frontpage-generated frameset, defining a title bar, a navigation bar and a main window. I decided the title bar was a waste of space, so I deleted it. To declare which HTML standard a page adheres to, a DOCTYPE declaration has to be included at the head of the HTML source. I decided to go for XHTML, which meant that the index.html page required XHTML 1.0 Frameset, the variant that allows the definition of frames. In addition, since I was now producing XML documents, an XML declaration was required. So, the beginning of the index.html file now looked like this:

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
    

For the rest of the pages, I used the XHTML 1.0 Transitional specification:

      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    

In order to perform most of the page edits automatically, I wrote a Python script (the official scripting language of hydrus.org.uk; what else could I use?) to insert the XML and DOCTYPE headers, and to make the change to the <style> tag (see common validation problems below).
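
The header-insertion step might look roughly like the following Python sketch. This is not the original script: the names add_headers, XML_DECL and DOCTYPE_TRANSITIONAL are invented for illustration, it assumes any existing declarations sit at the very top of the file, and it leaves the <style> change out entirely.

```python
import re

XML_DECL = '<?xml version="1.0" encoding="UTF-8"?>\n'
DOCTYPE_TRANSITIONAL = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    ' "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
)

# An XML declaration or DOCTYPE at the start of the text (hypothetical helper).
# Assumes the DOCTYPE has no internal subset containing '>'.
_OLD_HEADER = re.compile(r'\s*(<\?xml[^>]*\?>|<!DOCTYPE[^>]*>)', re.IGNORECASE)


def add_headers(text, doctype=DOCTYPE_TRANSITIONAL):
    """Return text with leading XML/DOCTYPE declarations replaced by new ones."""
    # Strip every declaration already present at the top of the page.
    while True:
        match = _OLD_HEADER.match(text)
        if match is None:
            break
        text = text[match.end():]
    # Prepend the XML declaration and the chosen DOCTYPE.
    return XML_DECL + doctype + text.lstrip()
```

Because old declarations are removed before the new ones are added, running the function twice over the same page gives the same result as running it once. For the frameset page, the Frameset DOCTYPE would be passed as the second argument.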

It's then possible to validate web pages using the W3C service or the WDG service. The good thing about the W3C service is that it offers a logo to put on the page, which confirms the page's compliance with the W3C standards. In a nice touch, clicking the logo causes an automatic verification of the page. On the other hand, the WDG service offers a batch mode, to check many files at once.

When validating the web site, the most common problems I encountered were:

The final change I made was to place the various DTDs I was now using on local disk, so that psgml in emacs could perform verification while editing. This should eliminate a lot of the errors I had made previously when using the HTML mode in emacs.

The psgml default catalog file ~/dtd/CATALOG now contains:

      CATALOG "docbook-xml/docbook.cat"
      CATALOG "xhtml/xhtml.cat"
    

The xhtml.cat file contains:

      PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "xhtml1-transitional.dtd"
      PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
      "xhtml1-frameset.dtd"
    
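To make the role of the catalog concrete: each PUBLIC entry maps a public identifier (the quoted string in the DOCTYPE) to a local DTD file. The Python sketch below reads entries of that shape into a dictionary. It is a simplified illustration only, not how psgml itself works; parse_catalog is a made-up name, and real catalog files allow comments, other keywords and other quoting that this ignores.

```python
import re

# One PUBLIC entry: a quoted public identifier followed by a quoted file name.
_PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')


def parse_catalog(text):
    """Map each public identifier to the system identifier that follows it."""
    return dict(_PUBLIC_ENTRY.findall(text))


XHTML_CAT = '''
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"xhtml1-transitional.dtd"
PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"xhtml1-frameset.dtd"
'''

for public_id, dtd_file in parse_catalog(XHTML_CAT).items():
    print(public_id, "->", dtd_file)
```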

The only problem I encountered during this setup was a refusal by psgml to parse the DTD. It died with the error message:

      Name expected; at :lang NMT
    

The solution to this problem was offered by http://www.culturematic.net. [N.B. Original link is broken, and I can't find the new location (if it exists)]. The mode used by psgml for XHTML has to be XML-MODE, not the default SGML-MODE I had set for .html files in my psgml initialisation in the .emacs file. The section which defines modes for psgml now looks like this:

       (add-to-list 'auto-mode-alist '("\\.html" . xml-mode))
       (add-to-list 'auto-mode-alist '("\\.adp" . xml-mode))
       (add-to-list 'auto-mode-alist '("\\.xml" . xml-mode))
       (add-to-list 'auto-mode-alist '("\\.xsl" . xml-mode))
    

In order to ensure a consistent naming convention for the HTML files, I used a couple of scripts I've had hanging around for a while: msub, which performs regexp replacements in the files listed on the command line (via sed), and rnam, which performs file renaming using wildcards.
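
For anyone without msub and rnam to hand, the same two steps can be approximated in Python. The sketch below is a stand-in, not a port of those scripts: fix_links and rename_htm are hypothetical names, it assumes links are written as double-quoted href or src attributes, and it deliberately leaves absolute URLs (anything with a scheme such as http:) alone.

```python
import re
from pathlib import Path

# A double-quoted href/src value ending in .htm, optionally followed by a
# #fragment. Values that start with a URL scheme (e.g. http:) are skipped.
_HTM_LINK = re.compile(
    r'((?:href|src)\s*=\s*"(?![a-z][a-z0-9+.-]*:)[^"#]*\.htm)(?=["#])',
    re.IGNORECASE,
)


def fix_links(text):
    """Rewrite local links ending in .htm so that they end in .html."""
    return _HTM_LINK.sub(r'\g<1>l', text)


def rename_htm(directory):
    """Rename every *.htm file in directory to *.html; return the new paths."""
    renamed = []
    for old in sorted(Path(directory).glob('*.htm')):
        new = old.with_suffix('.html')
        if new.exists():
            continue  # never overwrite an existing .html file
        old.rename(new)
        renamed.append(new)
    return renamed
```

A typical run would call rename_htm on the site directory first, then read each .html file, pass its contents through fix_links, and write the result back.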