Mon Sep 8 22:54:57 PDT 2008

How The Molecular Universe Site Is Constructed

I started this blog to document developments to The Molecular Universe site. I thought that it might be useful to provide background to the main site: overviews of some of the articles on the main site, topical information on Molecules and Materials, and also information on the way that the whole site is constructed, that sort of thing.

The reason I was interested in documenting the construction of the site is that when you set out to create a web site, there is not much information around on how to go about the task. Or perhaps it would be more accurate to say that there is relatively little focused information. There is plenty of information, but it is scattered in many places and is generally not collected near the site that it describes. There are many software products, environments, productivity tools, and so on. While these are, no doubt, wonderful, they are also typically proprietary or constraining. They make everything straightforward as long as you can stick to a single user interface and way of doing things. If there are bugs or issues to be fixed, you may be left waiting for the next release. If you want to experiment with PHP, AJAX, or a similar technology, and your chosen environment does not support that capability, then you will need to upgrade or migrate to a new technology.

Instead, I wanted an ultra-simple site. I wanted to be able to focus on the wording and images and incrementally improve the quality of the text over time. As the pages and images are interlinked, I wanted the busywork overhead to be as low as possible. I wanted PDF files to be generated for any page on the site. The reason for the PDF files was to make life simple (and predictable) for anyone who wanted to read a section of the site offline.

So, I decided to make use of the following basic technologies, and nothing more, in the construction of The Molecular Universe site.

HTML is used to author the articles. This presents a substantial danger in that content may be inextricably mixed with 'style' information throughout the text. In order to minimize this risk, I ensure that the HTML is well formed using 'tidy'. Because the markup is well formed, I can easily drop a subset of HTML tags if I want to. If necessary, I can slim the pages down to raw text and minimal formatting information covering subscripts and superscripts, for example.
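
As an illustration of the tag-dropping idea, here is a minimal sketch. It assumes the page has already been made well formed and that each tag sits on a single line. The function name drop_tags and the choice of tags are hypothetical, not the scripts actually used on the site.

```shell
#!/bin/bash
# drop_tags TAG...
# Remove the opening and closing forms of the named tags from HTML read on
# standard input, keeping the text between them.
# Hypothetical helper: a regular expression like this is only reasonable
# when the input is well formed and no tag is split across lines.
drop_tags() {
    script=''
    for tag in "$@"; do
        # Matches <tag>, <tag attr="...">, and </tag>
        script="$script s#</\{0,1\}$tag\( [^>]*\)\{0,1\}>##g;"
    done
    sed "$script"
}

# Example: strip presentational tags but keep the subscript.
echo '<p><font color="red">H<sub>2</sub>O</font> is <b>water</b>.</p>' |
    drop_tags font b
# prints: <p>H<sub>2</sub>O is water.</p>
```

Because the pattern requires the tag name to follow the opening bracket directly, dropping b leaves sub and br alone.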

PDF The PDF files themselves are generated from the HTML. There is no need to use a manual process to create PDF files, going through a word processor, for example.

MAKE The Unix/Linux tool make is used to control what needs to be done whenever a source file is updated. The rules about what depends on what are encoded in Makefiles. This sounds complex, but in practice it is extremely simple, because wildcard rules are used wherever possible. For example, each directory on the site has a Makefile which (through a single included template) says that filename.pdf depends on filename.html. When an HTML file in that directory is updated, its PDF file is regenerated automatically. If a file is not touched, there is no need to update its PDF file. A master Makefile for the entire site controls how the site is built, simply by listing the set of subdirectories which are included in the site.
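
The decision that make takes for each file can be pictured in a few lines of shell. This is only a sketch of the timestamp test behind a rule such as "%.pdf: %.html", not the site's Makefile template. The function names needs_rebuild and stale_pdfs are made up for the illustration.

```shell
#!/bin/bash
# needs_rebuild TARGET SOURCE
# Succeeds when TARGET is missing or older than SOURCE, which is the test
# make applies before running a rule's commands.
needs_rebuild() {
    [ ! -e "$1" ] || [ "$2" -nt "$1" ]
}

# stale_pdfs DIR
# Print the PDF files in DIR that would have to be regenerated.
stale_pdfs() {
    for html in "$1"/*.html; do
        [ -e "$html" ] || continue      # directory has no HTML files
        pdf="${html%.html}.pdf"
        if needs_rebuild "$pdf" "$html"; then
            echo "$pdf"
        fi
    done
}
```

Running stale_pdfs on a directory lists only the PDFs whose HTML source is newer or which do not exist yet. Everything else is left alone, which is what keeps a rebuild of the whole site cheap.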

ASPELL Spell checking is performed whenever PDF files are generated. At this stage it is just a simple invocation of the aspell command, so grammatical and other errors are not caught. My aim is to improve the 'basic' quality of the writing on the site as much as possible by making use of simple text processing tools like aspell, diction, and style. These should also help keep the writing clear (that is my hope at least; please let me know if we are not meeting these goals).
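
A simple aspell invocation of this kind might look like the sketch below. The wrapper name misspelled_words is hypothetical, and the options actually used on the site may differ. It assumes aspell and a dictionary for the page's language are installed.

```shell
#!/bin/bash
# misspelled_words FILE
# Print each word in an HTML file that aspell does not recognise, once,
# in sorted order.
# "aspell list" reads text on standard input and writes the unrecognised
# words; --mode=html tells it to skip the markup.
misspelled_words() {
    aspell --mode=html list < "$1" | sort -u
}

# Hypothetical usage:
# misspelled_words index.html
```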

RSYNC Rsync is used to keep the local copy of the site that we work with synchronized with the site on the server. Rsync is a smart synchronization tool: only files that have changed are transferred to the server. When the site needs to be updated, 'make upload' synchronizes the local site with that on the external server.
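
The command behind an upload target could be as short as the sketch below. The function name upload_site, the choice of options, and the host and path in the usage line are assumptions for illustration, not the site's actual settings.

```shell
#!/bin/bash
# upload_site LOCAL_DIR DESTINATION
# Mirror LOCAL_DIR to DESTINATION, sending only files that differ.
#   -a        recurse and preserve times and permissions
#   -z        compress data in transit
#   --delete  remove files at the destination that no longer exist locally
# The trailing slash added to the source copies the directory's contents
# rather than the directory itself.
upload_site() {
    rsync -az --delete "${1%/}/" "$2"
}

# Hypothetical usage; the host and path are placeholders:
# upload_site ./site user@example.org:/var/www/site/
```

The --delete option is a design choice: it keeps the server an exact mirror of the local copy, at the price of removing anything that exists only on the server.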

BASH In some areas of the site (e.g. in creating the list of images which are included in a table linked to the home page) we have used bash scripting to make the process of creating the index as automatic as possible.
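
To give a flavour of this kind of scripting, here is a minimal sketch that turns a directory of images into an HTML table. The function name image_table, the file extensions, and the table layout are assumptions. The real index on the site is not reproduced here.

```shell
#!/bin/bash
# image_table DIR
# Write an HTML table with one row per PNG or JPEG file in DIR.
# File names are assumed to contain no characters that need HTML escaping.
image_table() {
    echo '<table>'
    for img in "$1"/*.png "$1"/*.jpg; do
        [ -e "$img" ] || continue       # skip patterns that matched nothing
        name=$(basename "$img")
        echo "  <tr><td><a href=\"$name\">$name</a></td></tr>"
    done
    echo '</table>'
}

# Hypothetical usage:
# image_table images > images/index.html
```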

NanoBlogger This blog site is created by NanoBlogger and rsync-ed to the main server. NanoBlogger makes use of bash and simple Linux utilities and leaves the source of the articles easily accessible. If there is ever a need to move to a different blogging tool, then this should be a straightforward migration. However, for now, NanoBlogger is performing just fine.

CVS The content for the site is stored in CVS. This enables me to happily work away with my own local copies and, when necessary, commit changes to the repository. This sounds grandiose, but in practice it is extremely simple. There is no CVS server involved; the CVS repository is simply a set of files on disk.
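
Setting up a server-less repository of this kind takes one command. The sketch below assumes cvs is installed. The function name create_local_repo, the module name "site", and the paths are placeholders.

```shell
#!/bin/bash
# create_local_repo REPO_DIR
# Initialise a CVS repository that lives in an ordinary directory.
# No server is needed: -d points cvs at a path on the local disk.
# REPO_DIR must be an absolute path.
create_local_repo() {
    cvs -d "$1" init
}

# Typical cycle afterwards (module name "site" is a placeholder):
#   cvs -d /path/to/repo checkout site    # make a working copy
#   cvs update                            # in the working copy: pick up changes
#   cvs commit -m "Reword introduction"   # store local edits in the repository
```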

As is true for all web sites, The Molecular Universe has grown organically. Pages and technologies have been bolted onto the site from time to time. The openness of the basic construction of the site makes this possible and indeed straightforward. The basic strategy throughout has been to use Linux (or Cygwin) tools to carry out transformations and control operations, to be conscious of the perils of mixing style and content, and to avoid the duplication of information.

I hope that this brief overview of The Molecular Universe gives you a sense of how the site is constructed. I know that such information would have been useful when we first started constructing the site. I am sure that few people will want to duplicate the way that The Molecular Universe is constructed exactly. However, having a sense of the tools and thinking behind the site may be useful to you as you think about developing your own site or sites.


Posted by ZFS | File under: general