Web Publishing using make

When I originally built the hydrus web site, I used frames to present a static navigation bar, with another frame for the page content. This made it easy to define the page contents, without the overhead of repeating the standard images and navigation on each page. However, for visitors linking directly to a page, frames made it difficult to view other sections of the site (and even obscured the site's identity).

In order to remove frames, I needed some mechanism to support common content in each page - a template. As I envisaged that it would be necessary to change the page template to cater for new menu items, changes in style and so on, I had to develop an automated process to create the website from a template and specific page contents. In addition, I didn't want to be re-creating the whole website every time I made a small content change, so I needed something that would only make the minimal set of updates needed. This sounded like a job for make (or specifically GNU make, which is what I used).

After a period of experimentation, I built a system which relies on three makefiles and a couple of Python scripts. The master Makefile invokes two subordinate makefiles, compile.mk and publish.mk, which are responsible for building the final pages and installing them in the public html directory, respectively.

To perform the file processing required to merge the page contents into a template, the Python script create-html.py is used. File contents also need to be modified during processing; this capability is provided by munge.py.

The system requires two directory structures under a root directory containing the page templates, makefiles, and Python scripts. The src subdirectory contains the basic page content, images and other support files (e.g. the index file for the technical journal). The obj subdirectory contains the results of combining the page contents with the page template, and copies of the non-page files. At this point, the web site files can be examined to ensure correctness before the final stage of publishing. The process is invoked by the following command, issued in the root directory:

  make compile
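
As a rough illustration of the src-to-obj layout described above, the following Python sketch lists the obj path that would correspond to each file under src. The function name planned_outputs is invented for this example; in the real system the mapping is done by compile.mk.

```python
import os

def planned_outputs(root):
    """Return (source, target) pairs mirroring root/src into root/obj.

    Illustrative only: the name and return shape are assumptions,
    not part of the original scripts.
    """
    src_root = os.path.join(root, "src")
    obj_root = os.path.join(root, "obj")
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Directory relative to src, e.g. "journal" ("." for src itself)
        rel = os.path.relpath(dirpath, src_root)
        for name in sorted(filenames):
            source = os.path.join(dirpath, name)
            # normpath removes the "." component for top-level files
            target = os.path.normpath(os.path.join(obj_root, rel, name))
            pairs.append((source, target))
    return pairs
```

Every file keeps its relative position, so src/journal/index.html would end up as obj/journal/index.html.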

The publishing process merely copies all newly modified files from the obj directory to the web site location:

  make publish
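
The "newly modified files only" behaviour comes from make's timestamp comparison: a target is rebuilt when it is missing or older than its source. A minimal Python sketch of the same idea follows; the function names is_stale and publish_tree are hypothetical, and the real work is done by publish.mk.

```python
import os
import shutil

def is_stale(source, target):
    """True if target is missing or older than source (make's basic test)."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)

def publish_tree(obj_root, pub_root):
    """Copy files from obj_root to pub_root, skipping up-to-date ones.

    Returns the list of target paths that were actually copied.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(obj_root):
        rel = os.path.relpath(dirpath, obj_root)
        dest_dir = os.path.normpath(os.path.join(pub_root, rel))
        os.makedirs(dest_dir, exist_ok=True)
        for name in filenames:
            source = os.path.join(dirpath, name)
            target = os.path.join(dest_dir, name)
            if is_stale(source, target):
                # Like cp, this gives the target a fresh modification time
                shutil.copy(source, target)
                copied.append(target)
    return copied
```

Running publish_tree twice in a row should copy nothing the second time, which is the property that makes small content changes cheap to publish.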

The process was complicated by the presence of HTML files generated from Docbook markup. The jade process I used to convert Docbook to HTML produces HTML 4.01, whereas the rest of the web site conforms to XHTML 1.0. I therefore needed a slightly different template for the Docbook HTML, as well as the ability to remove the headers and footers from the generated HTML on the fly. This latter requirement stemmed from the fact that the HTML files would be re-generated should I have to revise the original Docbook XML file.

Addendum - 2nd October, 2005

The makefiles were modified to handle two new factors: the addition of a configuration control system (CVS) and the conversion to complete XHTML compliance (see Docbook and XHTML for details).

This made the processing considerably simpler. See Improved Web Publishing.

Makefile

# Makefile to process and publish hydrus web contents
#
#
# make compile - create pages by placing page contents into the standard
#                page template. Docbook pages need a different template
#                since they are in HTML 4.01.
#
# make publish - copy processed pages to web directory
#
#
# MODIFICATION HISTORY
# Mnemonic      Date    Rel Who
# www-publish   040615  1.0 mpw
#   Written.
#

SRC := ./src
OBJ := ./obj
JOURNAL-SRC := ${SRC}/journal
PUB-DIR := /usr/local/www/data
TEMPLATE-NORMAL := page-template.html
TEMPLATE-DOCBOOK := page-template-docbook.html

.PHONY: compile publish clean template link

compile:    template link
    find ${SRC} -type d -exec gmake -C {} -f ${CURDIR}/compile.mk \
        TEMPLATE-NORMAL=${CURDIR}/${TEMPLATE-NORMAL} \
        TEMPLATE-DOCBOOK=${CURDIR}/${TEMPLATE-DOCBOOK} ROOT=${CURDIR} \;
    cp ${TEMPLATE-NORMAL} ${OBJ}    
    chmod 644 ${OBJ}/${TEMPLATE-NORMAL}

publish:
    find ${OBJ} -type d -exec gmake -C {} -f ${CURDIR}/publish.mk \
        ROOT=${CURDIR} PUB-DIR=${PUB-DIR} \;


# update internal links in journal if necessary
link:
    cd ${JOURNAL-SRC} && \
    newlink.py

clean:
    rm -rf ${OBJ}/*


# Update docbook template if standard template has changed
template: ${TEMPLATE-DOCBOOK}


${TEMPLATE-DOCBOOK}: ${TEMPLATE-NORMAL}
    cp ${TEMPLATE-NORMAL} ${TEMPLATE-DOCBOOK}
    munge.py -f create-docbook-template ${TEMPLATE-DOCBOOK}

compile.mk

# Makefile for constructing publishable html file from source html
# files and a template.  The variables TEMPLATE-NORMAL and
# TEMPLATE-DOCBOOK should be passed as arguments to the make
# directive.  One or other of them is used to create each target
# html page, and therefore all target files are dependent on them.
#
# MODIFICATION HISTORY
# Mnemonic      Date    Rel Who
# www-publish   040615  1.0 mpw
#   Written.
#

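# set target directory
# note, ROOT is passed on invocation line; anchoring the substitution on
# ${ROOT} avoids rewriting any other "src" that appears in the path
TD := ${subst ${ROOT}/src,${ROOT}/obj,${CURDIR}}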

# define pattern rule for producing .html files in the target directory
# Docbook HTML files may have a body in them, so we remove the <BODY> tags
# and replace with a comment indicating this file is docbook html.
# N.B. This relies on using docbook2html (i.e. jade) to produce the HTML 
# files; xmlto produces different signatures.
# This comment is used to determine if the docbook template should be used.
# If not, the normal template is applied.

${TD}/%.html : %.html
    grep "<BODY" $< >/dev/null ; \
    if [ $$? -eq 0 ]; then  \
        munge.py -f ${ROOT}/remove-body $< ; \
    fi
    grep "<!-- DOCBOOK -->" $< >/dev/null ; \
    if [ $$? -eq 0 ]; then \
        ${ROOT}/create-html.py ${TEMPLATE-DOCBOOK} $< $@; \
    else \
        ${ROOT}/create-html.py ${TEMPLATE-NORMAL} $< $@ ; \
    fi

# pattern rule to make non-html targets (images, support files, etc)
# note we ignore directories
${TD}/% : %
    if [ ! -d $< ]; then \
        cp $< $@; \
    fi

# define list of targets (based on list of .html files in current directory)
OBJS := ${patsubst %,${TD}/%,${wildcard *.html}}

# define list of non-html targets
OTHER := ${patsubst %,${TD}/%,${filter-out %.html,${wildcard *}}}

all: ${TD} ${OBJS} ${OTHER}

${OBJS}:    ${TEMPLATE-NORMAL} ${TEMPLATE-DOCBOOK}

# make target directory if necessary
${TD}:
    mkdir -p ${TD}
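
The two grep tests in the .html pattern rule amount to "tag Docbook output, then pick a template by the tag". The Python sketch below shows that logic in one place. The contents of the remove-body command file are not shown in this article, so the behaviour of mark_docbook (strip everything outside the BODY element and leave the marker comment) is an assumption, and both function names are invented for illustration.

```python
import re

DOCBOOK_MARKER = "<!-- DOCBOOK -->"

def mark_docbook(page):
    """Replace the <BODY ...> wrapper with the marker comment, if present.

    Assumed behaviour of the remove-body munge commands: drop everything
    up to and including the opening <BODY ...> tag, and everything from
    </BODY> onwards.
    """
    if "<BODY" not in page:
        return page
    page = re.sub(r"\A.*?<BODY[^>]*>", DOCBOOK_MARKER + "\n", page,
                  flags=re.DOTALL)
    page = re.sub(r"</BODY>.*\Z", "", page, flags=re.DOTALL)
    return page

def choose_template(page, normal, docbook):
    """Pick the Docbook template only for pages carrying the marker."""
    return docbook if DOCBOOK_MARKER in page else normal
```

Because the marker is written into the source file itself, a page only needs to be stripped once; later runs see the marker and go straight to the Docbook template.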

publish.mk

# Makefile for publishing html files from processed html pages
#
# MODIFICATION HISTORY
# Mnemonic      Date    Rel Who
# www-publish   040615  1.0 mpw
#   Written.
#

# set target directory
# note, PUB-DIR and ROOT are passed on invocation line
TD := ${subst ${ROOT}/obj,${PUB-DIR},${CURDIR}}

# pattern rule to make all targets (directories are ignored)
${TD}/% : %
    if [ ! -d $< ]; then \
        cp $< $@ ; \
    fi

# define list of targets (that's everything)
OBJS := ${patsubst %,${TD}/%,${wildcard *}}

all: ${TD} ${OBJS} 

# make target directory if necessary
${TD}:
    mkdir -p ${TD}

create-html.py

#!/usr/local/bin/python
"""
NAME
    create-html.py - wraps HTML page contents with HTML page template

SYNOPSIS
    create-html.py template_file source_page_contents output_page

DESCRIPTION
    create-html.py will insert the contents of an HTML page into a supplied
    page template, outputting the results as a final HTML page.

    The title of the resulting page is determined by the first <h1>
    or <h2> header encountered in the page contents.

MODIFICATION HISTORY
Mnemonic       Rel    Date   Who
create-html    1.0    040614 mpw
    Written.
    
"""
import sys
import re

default_title = "hydrus.org.uk"
template_file = sys.argv[1]
html_in_file = sys.argv[2]
html_out_file = sys.argv[3]

template = open(template_file).read()
html_in = open(html_in_file).read()
html_out = open(html_out_file,mode="w")

# attempt to modify title to reflect page contents
re_title = re.compile(r'<title>.*?</title>')
# non-greedy, so two headers on one line do not merge into one title
re_header = re.compile(r'<h[12]>(.*?)</h[12]>')

match = re_header.search(html_in)
if match != None:
    header = match.group(1)
    page_title = "<title>"+default_title+" - "+header+"</title>"
else:
    page_title = "<title>"+default_title+"</title>"
    
if re_title.search(template):
    template = re_title.sub(page_title,template)

content = template.replace("<!-- page contents go here -->",html_in)

html_out.write(content)
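
For experimenting without files, the same steps can be sketched as a single function operating on strings. This is a restatement for illustration rather than a replacement for the script: the function name wrap_page is invented here, and passing the title through a lambda is my own choice, made so that a backslash in a heading is not treated as a regular expression escape.

```python
import re

PLACEHOLDER = "<!-- page contents go here -->"

def wrap_page(template, body, default_title="hydrus.org.uk"):
    """Insert body into template and derive <title> from the first h1/h2."""
    match = re.search(r"<h[12]>(.*?)</h[12]>", body)
    if match:
        title = "<title>%s - %s</title>" % (default_title, match.group(1))
    else:
        title = "<title>%s</title>" % default_title
    # A function replacement is used so the title text is inserted verbatim
    template = re.sub(r"<title>.*?</title>", lambda m: title, template)
    return template.replace(PLACEHOLDER, body)

template = ("<html><head><title>x</title></head>"
            "<body><!-- page contents go here --></body></html>")
page = wrap_page(template, "<h1>Web Publishing</h1><p>text</p>")
```

Here page carries the title "hydrus.org.uk - Web Publishing" and the body in place of the placeholder comment; a body with no h1 or h2 falls back to the default title.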

munge.py

#!/usr/local/bin/python
"""
 NAME
    munge.py

 SYNOPSIS
    python munge.py [-f cmd_file] [-n] file [...]

 DESCRIPTION
    Performs editing functions on files specified on command line.
    Munge differs from sed and awk, in that it allows (nay, insists on)
    multi-line substitutions.  Munge accepts the following commands
    from stdin (or the cmd_file if the -f option is given):

    %prefix
    .text.
    %end

    Prefixes the contents of the file with the .text. specified between
    %prefix and %end.  

    %append
    .text.
    %end

    Appends the .text. specified between %append and %end to the
    contents of the file.

    %sub
    .regexp.
    %new
    .text.
    %end

    Substitutes the .regexp. with .text. in the file.  Note that any
    trailing newline character in the regexp and text is removed.

    By default, regexps will match the . metacharacter to everything
    (including newline).  Specifying -n on the command line will
    suppress this default.  Since the regexps are passed to python
    unchanged, it is possible to specify alternate matching
    instructions via the regexp string itself (see the python
    documentation on how to do this).

 MODIFICATION HISTORY
 Mnemonic        Rel  Date     Who
 munge.py        1.0  20040607 mpw
    Created
 munge.py        1.1  20040609 mpw
    Added -n option
"""

import os
import re
import sys
import getopt

#### munge command class - used to hold editing commands and string arguments
class mcmd:
    def __init__ (self,m,o,n,re_opts):
        self.cmd = m
        self.old = re.compile(o,re_opts)
        self.new = n
        self.next = None
    def set_next (self,n):
        self.next = n
    def execute (self,current):
        return self.cmd(current,self.old,self.new)

#### munge edit operations
def mappend(current,old,new):
    return current+new

def mprefix(current,old,new):
    return new+current

def msub(current,old,new):
    if old.search(current):
        return old.sub(new,current)
    else:
        return current

#### read lines from stream until block terminator
def getlines(instream,term):
    buf = ""
    l = instream.readline()
    # stop at the terminator line, or at end of file (readline returns "")
    while l != "" and not l.startswith(term):
        buf = buf+l
        l = instream.readline()
    return buf

#### process munge commands
def get_commands(instream,re_opts):
    cmd = None
    new = ""
    old = ""
    head = None
    tail = None
    try:
        while True:
            l = instream.readline().rstrip("\n")
            if l == "%prefix":
                cmd = mprefix
                new = getlines(instream,"%end")
            elif l == "%append":
                cmd = mappend
                new = getlines(instream,"%end")
            elif l == "%sub":
                cmd = msub
                old = getlines(instream,"%new").rstrip("\n")
                new = getlines(instream,"%end").rstrip("\n")
            elif l == "":
                return head
            else:
                sys.stderr.write("munge: unrecognised command - quitting\n")
                sys.exit(1)
            if tail != None:
                tail.set_next(mcmd(cmd,old,new,re_opts))
                tail = tail.next
            else:
                tail = mcmd(cmd,old,new,re_opts)
                head = tail
                    
    except IOError:
        sys.stderr.write("File read error\n")
        raise

#+++++++++++++++++++++++++++++++++++++++++++++++++
# start of program
#+++++++++++++++++++++++++++++++++++++++++++++++++

#default is to read munge commands from stdin
cmdfile = sys.stdin

# default for regular expressions is . matches everything, including newline
re_opts = re.DOTALL

# read command line arguments, if any
try:
    opts,args = getopt.getopt(sys.argv[1:],'f:n')
    for o,v in opts:
        if o == '-f': cmdfile = open(v)
        elif o == '-n': re_opts = 0
except getopt.GetoptError:
    sys.stderr.write("illegal argument\n")
    sys.exit(2)

# read cmdfile for the munge commands; returns cmd chain
head = get_commands(cmdfile,re_opts)

# apply munge cmd chain to each file on command line
for file in args:
    content = open(file).read()
    this = head
    while this != None:
        content = this.execute(content)
        this = this.next
        
    h = open(file,mode="w")
    h.write(content)
    h.close()
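
To show why munge.py compiles its patterns with re.DOTALL by default, here is a small sketch using only the re module. The sample page and pattern are invented for the example.

```python
import re

page = ('<p>keep</p>\n'
        '<div class="nav">\nline one\nline two\n</div>'
        '\n<p>keep too</p>\n')
pattern = r'<div class="nav">.*?</div>'

# Default munge behaviour: "." also matches newlines, so the whole
# multi-line block is matched and removed.
multi = re.compile(pattern, re.DOTALL).sub("", page)

# With -n (flags of 0): "." stops at a newline, so nothing matches
# and the page is left unchanged.
single = re.compile(pattern, 0).sub("", page)
```

Going by the command parsing above, a command file along these lines should request the same substitution:

  %sub
  <div class="nav">.*?</div>
  %new
  %end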

Defects

You may have noticed that programs in cgi-bin are not included in this process. This is a problem because these programs generate pages on the fly, and therefore need to use the current page template. The name of the template is currently wired into the source code, rather than being discovered from some environment setting or such. This needs to be fixed.
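
One possible shape for the fix is sketched below, assuming the cgi-bin programs are Python and that the web server passes environment variables through to them. The variable name HYDRUS_PAGE_TEMPLATE and the function template_path are hypothetical; nothing in the current system defines them.

```python
import os

# Fallback matching the template file name used by the Makefile
DEFAULT_TEMPLATE = "page-template.html"

def template_path(environ=None):
    """Return the template location, preferring an environment setting.

    HYDRUS_PAGE_TEMPLATE is an invented variable name for this sketch.
    """
    if environ is None:
        environ = os.environ
    return environ.get("HYDRUS_PAGE_TEMPLATE", DEFAULT_TEMPLATE)
```

A cgi-bin program would then call template_path() rather than naming the file directly, so a change of template location needs only a server configuration change.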

$Id: webpublish.html,v 1.3 2023/03/27 08:07:33 mark Exp $