ChemDig: New approaches to Chemically Significant Indexing and Searching of Distributed Web Collections

Georgios V. Gkoutos, Christopher Leach and Henry S. Rzepa*

Department of Chemistry, Imperial College of Science, Technology and Medicine, London, SW7 2AY.

Summary: We describe an extension of the ht://Dig robot-based Internet indexing and search engine to include the retrieval of information contained in a variety of molecular data formats as defined by chemical MIME types. This is achieved by invoking chemical metaparsers, software agents designed to provide key metadata about the content of the external chemical files. This metadata can include, e.g., a derived molecular formula, molecular mass and atom connection table (SMILES) where the content of the file allows this, together with other types of content such as author information and supplied keywords. These terms can be automatically added to the searchable terms, and the search outputs can be automatically linked via database requests to other external databases containing chemical information. We report our experience in applying this robot to indexing five different remote sites. We discuss different mechanisms for storing and searching the chemical content, ranging from simple keyword-based searches qualified by chemically significant boolean terms, through chemical similarity searches, to our experiments in creating more highly structured content which expresses the chemical data using XML-based markup, where XSLT transforms are used to filter, search and render the information.

1. Introduction and Background

The widespread adoption of the World-Wide Web system has resulted in the creation of a substantive data, information and resource collection in molecular sciences,1 albeit one which has been described as a library in which the books are strewn on the floor rather than classified by shelf. Such document collections are frequently expressed in Hypertext Markup Language (HTML), the salient features of which can include links to other such documents and to chemical data files, together with links to server-based resources such as programs, databases and scripts. During the evolution of HTML through various versions to reach the current XHTML specification, the focus was on developing hypertext markup syntax predominantly as a carrier of bibliographic content for browser-based display. Because of this, early inclusion of explicit chemical content was invariably associated with markup-free legacy file formats rather than formats designed to take advantage of the (then) rather limited features of HTML. A more extensible and structured formalism for carrying data, known as XML (Extensible Markup Language), has been developed in response to this need. The molecular implementation of XML has focused on CML (Chemical Markup Language).2 We have demonstrated how molecular data expressed using XML and CML, together with associated formalisms for linking and transforming the data, can be applied chemically.3 We recognised, however, the need to develop procedures for migrating the substantial inherited legacy of HTML files and associated chemical content into this more structured, data-centric and inter-operable XML environment. We describe here the background to the procedures we have developed to harvest Internet-based molecular resources, together with a discussion of various approaches which can be used to retrieve this content.

1.2 Association of Chemical Content with HTML Documents.

Prior to the development of the extensible XML formalisms, we had proposed an infrastructure for using HTML as a linking mechanism for data file types directly associated with chemical information. This was known as the chemical MIME definitions, and these have been rapidly and widely adopted.4 This mechanism attempts only to identify data associated with a limited set of accepted molecular file types, but it has always been recognised that as usage of these types has evolved, their internal data structures have often developed undocumented or incompletely defined variations. Moreover, the internal data structures were often designed for syntactical compactness rather than clarity and human or machine readability. Typical examples of such files include the Protein Databank coordinate format (PDB), the JCAMP data exchange formats for chemical analytical information, and a variety of files for expressing molecular connectivity, coordinates and reactions. There are also many data files associated with widely used computational chemistry modelling programs such as MOPAC or Gaussian. We believe that links to around 40 such file types are now to be found globally within HTML pages on chemical servers. However, we also recognised that hitherto there have been no easy mechanisms or tools for identifying, indexing, searching and comparing this content in an automatic and low-cost manner.5 It is also true that the early focus on associating such content with HTML pages emphasised the presentational aspects rather than retrieval mechanisms. As a result, a variety of syntactical formalisms were used to create this association within the HTML syntax. Our first task therefore was to identify those formalisms for which automatic procedures could (or could not) be developed, and we list these below.

  1. The anchor (also known as the hyperlink), invoked as
    <a href="molecule_URI" title="Benzene, C6H6">
    Prose description of the chemical object</a>
    where "molecule_URI" (Uniform Resource Identifier) can be a relative file declaration or an absolute path to a file on a remote server.
  2. An in-lined version of the anchor, invoked as
    <embed src="molecule_URI" width="50" height="50" >
    and requiring the user to install a browser plugin.
  3. An alternative in-lined version of the anchor, invoked as
    <applet codebase="location of code for displaying the data" width="50" height="50" >
    <param name="data_source" value="molecule_URI">
    but which does not require any local software to be installed, the components instead being downloaded from the remote server.
  4. Modes 2 and 3 unfortunately are mutually exclusive, with differing actions required from both the author of the content and its reader for each type. With the release of HTML 4.0 and its successor XHTML 1.0, rationalisation of modes 2 and 3 into one, properly cascading syntax takes the form;
    <object data="molecule_URI" type="chemical_MIME_type" 
    classid="specification of how to implement an object" 
    title="Simple meta-information about the chemical object" > 
    <param name="run-time" value="initialisation data" /> 
        <object data="molecule_URI"
        classid="alternative object specification" type="chemical MIME type"
        title="Meta-information about the chemical object" >
        </object>
    </object>
    Mode 4 has the advantage of conforming to a well-formed and valid XML document, and makes no exclusive assumptions about the state of the user's browser. However, it is often inadequately supported by the current generation of browsers6 and is relatively rarely used for this reason. We include it here as a "legacy" format, particularly because it does not expose any chemical data structures other than the MIME type, and will therefore always require special processing to extract this information.

  5. Several HTML constructs are often used by authors to control the style or presentation of a page to the user in a compact and attractive manner. One such is the FORM, within which a menu of molecules can be offered;
         <select name="molecule_menu">
         <option value="molecule_URI" >
         Prose description of the chemical object
         </option>
         </select>
  6. Image maps are widely used to link bitmap representations of molecules to explicit coordinate files, such as;
    <map name="chemical_image">
    <area shape="rect" coords="51,187,165,267"  href="molecule_URI">
    <img usemap="#chemical_image" src="ImageURI">
  7. Also common is the use of a scripting procedure to create user-selectable browser presentation. Many variations are possible; a common one is associating the anchor in mode 1 above with an event or action such as;
    <a href="#" onclick="ShowMolecule();">Prose description of the chemical object</a>
    function ShowMolecule() { window.open('molecule_URI', '1', 'width=308,height=300'); }
    This category is a particularly difficult one to handle, since around ten different types of event could be trapped, and hence a general solution would involve parsing the logic in the script to re-assemble the intended content invocation. A partial solution to this problem would be for the author of such code to also declare a link element for each molecule_URI as shown below.
  8. Document links are part of the header and are intended to normalise all the other forms of document linking into a consistent meta-representation. We have shown6 that it is possible to capture many of the common link invocations and to declare them as link objects;
    <link type="chemical/*" rel="alt" href="molecule_URI"  title="description" />
    This assembling of all document links also has the advantage of having the formal primary chemical MIME type declared, which makes it potentially very easy to identify presumed chemical content in the document.
  9. The least exposed way of invoking molecular content would be to reference a remote database using a so-called CGI request, which might take the form;
    <form action="http://remote-site/cgi-bin/database-interface" title="Description of action"></form>
    The molecular resource accessed via the database interface is specified by one or more variables collected from the <form> and passed to the database-interface program or script. These variables only have significance in the context of the particular database interface referenced and their meaning is not available within the document containing the <form>. In general no conclusions can be drawn about what type of content is referred to. The title attribute, which could loosely carry this information, is very rarely declared. The existing mechanism therefore carries little if any meta-information about the remote CGI resources and hence this type of molecular resource linking is unlikely to be captured by any automatic robot mechanism. The proposed successor to the <form> content model is XFORMS, which is a powerful XML-based method which clearly separates the purpose of a request (the data collection semantics), from the manner of its presentation within the browser, and from the data (as name/value pairs) defining the request. Use of this more powerful model will in the future allow more significant identification of the purpose and likely content of remote database resources referenced via documents.
  10. Finally, and arguably the most difficult category to identify, is chemical content expressed as
    <img src="Image_URI"  alt="description" longdesc="Long Description" /> 
    Possible methods for identifying chemical content in such raster images are discussed in more detail below.
It becomes apparent from the above diversity of syntactical forms for including chemical information in HTML documents that methods for aggregating and transforming this content into a more systematically structured format are desirable. The next section outlines our approach for achieving this.
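To illustrate how some of these formalisms can be recognised automatically, the following sketch collects candidate chemical links from an HTML fragment by inspecting modes 1, 2, 4 and 8 above. It is written in Python for brevity (our implementation used Java), and the extension-to-MIME-type table shown is deliberately abbreviated;

```python
from html.parser import HTMLParser

# Abbreviated mapping from file extensions to chemical MIME types.
CHEMICAL_TYPES = {
    ".pdb": "chemical/x-pdb",
    ".mol": "chemical/x-mdl-molfile",
    ".xyz": "chemical/x-xyz",
    ".jdx": "chemical/x-jcamp-dx",
}

class ChemicalLinkFinder(HTMLParser):
    """Collect (tag, URI, MIME type) triples for likely chemical content."""
    URI_ATTRS = {"a": "href", "embed": "src", "object": "data", "link": "href"}

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        uri = attrs.get(self.URI_ATTRS.get(tag, ""))
        if uri is None:
            return
        # Prefer an explicit chemical/* type attribute (modes 4 and 8) ...
        mime = attrs.get("type", "")
        if not mime.startswith("chemical/"):
            # ... otherwise fall back to guessing from the file extension.
            ext = "." + uri.rsplit(".", 1)[-1].lower() if "." in uri else ""
            mime = CHEMICAL_TYPES.get(ext, "")
        if mime:
            self.found.append((tag, uri, mime))

finder = ChemicalLinkFinder()
finder.feed('<a href="mauveine.pdb" title="Mauveine">structure</a>'
            '<link type="chemical/x-mdl-molfile" rel="alt" href="m.mol" />')
print(finder.found)
```

A full implementation would also need to resolve relative URIs against the document base, and to handle the scripted and image-map modes, which as noted above resist simple static analysis.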

1.3 Web-based Indexing and Search Engines.

In this section we describe the characteristics of a typical indexing robot known as ht://Dig, and then describe its enhancement with a set of external chemical metaparsers, the collective name for which we refer to as ChemDig.

The first implementations of so-called robot-based indexing (traversing) of an interlinked collection of HTML-based documents were achieved in 1994 with software such as Lycos or WebCrawler. Such software was soon commercialised, and was followed by a large number of similar robot-based systems for generalised indexing of the Internet. One widely used modern system is the UltraSeek server from Infoseek, which, in addition to indexing HTML content, employs more advanced features such as meta-tag and image searching, and the ability to crawl client-side image maps. However, such general-purpose robots do not in general attempt to identify and index chemical data. We report here our experiments in addressing this particular issue, using open-source software wherever possible, and in particular a document indexing and traversing system known as ht://Dig7 (a reference to "digging" the content of documents using Hypertext Transport protocols).

The essential operation of ht://Dig commences via the manual specification of one or more root documents in a configuration file, which serve as the starting point for traversing a document collection via the hyperlinks found. Using the HTTP protocols, the HTTP header is retrieved from the remote server, and if the MIME type declaration contained in this header is found to correspond to HTML, then the content of the document is transferred to the internal syntax parser and the marked-up HTML content analysed appropriately. Initially, those components of the document marked as the <head>...</head> are separated from the <body>...</body> of the document. For example, if an HTML markup declaration in the <head> component of the type <title>The Three Dimensional Structure of Mauveine</title> is identified, this title text string can be passed to the indexer, and if considered appropriate, the significant words assigned a high weighting factor. Other forms of declaration in the <head> such as <meta name="DC.Description" content="Mauveine coordinates"> can be identified and associated with pre-defined weighting factors. Such meta-data declarations are particularly important in identifying key properties of the documents that might not otherwise be easily identified, such as the author of the document, any date or ownership associated with it, or even more specifically whether the document has any chemical information.8 Thereafter, the text content of the <body> of the document is indexed according to well-defined algorithms, and appropriate weighting factors for specific elements of the body such as <h1>...</h1> headings assigned.

If during the course of parsing the HTML document, any link to another HTML document is identified, then an attempt will be made to also retrieve this document and perform the same indexing. The configuration of the robot search can specify whether the traversing of the hyperlinks is restricted purely to any document at the same hierarchical directory level as the root level or below it, or whether links above or out of this directory structure are also allowed. The robot will also obey the so-called robot-exclusion rules, a mechanism whereby the administrator of a remote collection can specify whether any directories of the collection are out of bounds to the robot. Typically, traversing a remote document collection of tens of thousands of HTML documents can be completed in just a few hours, depending of course on the bandwidth of the network connecting the remote server and the indexing computer.
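The robot-exclusion check mentioned above can be sketched with the Python standard library; the rules and paths shown here are invented for illustration, and in practice would be fetched from the remote site's robots.txt file;

```python
from urllib.robotparser import RobotFileParser

# Parse an (invented) set of robot-exclusion rules, as a conforming robot
# would before fetching any page from the remote collection.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /private/",
])

print(rules.can_fetch("ChemDig", "http://remote-site/mols/benzene.html"))   # True
print(rules.can_fetch("ChemDig", "http://remote-site/private/notes.html"))  # False
```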

Most robot-based indexing software supports not only the parsing of simple HTML documents, but also of other types of document specified by their MIME types, such as text/plain, and of more complex documents such as application/word or application/pdf, corresponding to the Microsoft Word and Adobe Acrobat formats. In general, however, subject-specific MIME types are not by default included in the indexing. Our interest in the ht://Dig indexing software arose because this code does allow the specification of external parsers for such file types, and we also noted that availability of the source code would give us the option of making specific modifications if necessary. Other features of ht://Dig which attracted our attention included an extensible method for parsing meta-data declarations in document headers, which we felt might be useful in a chemical context. Finally, we noted that the system has been shown to be highly efficient for traversing collections of up to 1 million documents in a reasonable time, and so could be considered suitable for handling small and medium-sized Web sites (e.g. Intranets) containing less than this maximum number of documents.

2. The ChemDig Implementation.

In specifying the functionality of ChemDig, we had four objectives in mind;

  1. to identify the existence of chemical content from distributed document collections
  2. to convert and store structurally homogeneous chemical content in appropriate databases
  3. to add content (metadata) to the databases where applicable to aid in resource discovery
  4. to explore several novel mechanisms for retrieval of the chemical data from the databases.
Figure 1a shows an overview of the core operation of ChemDig. Some of these components have been described in preliminary detail elsewhere.6,9 The overall operation involved three discrete phases.

  1. The ht://Dig software7 contains a parser which corresponds to version 2.0 of the HTML standard, but which is capable of handling many exceptions to this standard. Some exceptions however are not handled explicitly, such as documents which make use of in-lined RasMol scripts to control the function and appearance of linked molecule coordinate files intended for a Chime display. The use of < and > script operators conflicts with the same operators used for containing HTML elements, and since the resulting HTML document is not well formed, the HTML parser in ht://Dig will fail. We considered it essential to solve this problem by pre-processing any document handed to ht://Dig to ensure that it was well formed. This was achieved using JChemTidy,9 which can be used to search for occurrences of RasMol scripts and declare the < and > operators as entities. JChemTidy also converts older HTML markup to the XHTML standard, and an accompanying module called JChemMeta normalises hyperlinks of the type described in the link modes above by inserting the corresponding link declarations, which the ht://Dig software does honour. The resulting file collection can be re-created on a local hard disk with the directory structures retained in readiness for indexing using ChemDig, and can also be used to replace the original collection.
  2. The second phase involves invoking the ht://Dig robot. Some changes to the ht://Dig source code were required (version 3.1.1 when the project was started, currently version 3.2). One significant limitation of ht://Dig is that parsing of the common attributes of these elements such as title="...." or alt="..." is not yet implemented; this was overcome by using JChemMeta9 to insert a title attribute derived from chemical files into the metadata declarations of the XHTML file, where it is accessible to the ht://Dig robot. The interface for calls to an external parser was written in the Java programming language to facilitate future implementation on a variety of different operating systems, and also to allow modular deployment in other indexing software.
  3. The final stage involves meta-parsing of chemical files linked to the documents trawled by ht://Dig, and adding any derived fields to the document header of the original HTML file (Figure 1a). By metaparser, we mean a declared procedure for identifying meta-information about the internal content of specific files, such as any title, author or date, or chemically relevant information such as molecular formula. Our objectives were not only to identify the performance characteristics of such an operation, but also to gather statistics on the extent to which external sites have adopted the various hyperlinking mechanisms described above to include specifically defined chemical data. We also wished to establish how links or pointers to other databases of chemical information sources might be automatically added, and perhaps most importantly to create an automatic mechanism for converting the information so collected into an XML-based document collection. We note that this robot-based method for traversing a distributed collection of chemical information has some significant differences from the more traditional methods for registering compound information in specifically designed chemical databases. Although the current version of ht://Dig does not support fielded searches on the content of added meta tags, other search engines such as InfoSeek do. Details of the chemical metaparsing are given in the next section.
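The well-formedness repair of phase 1 can be sketched as follows. This is an illustration in Python, not JChemTidy's actual code, and the "script" attribute name is an assumption; the idea is simply to escape the < and > operators found inside embedded script parameters as entities;

```python
import re

# Escape < and > inside embedded RasMol script parameter values so that the
# surrounding document becomes well-formed HTML. The attribute name "script"
# and the regular expression are illustrative assumptions.
def escape_script_params(html):
    def fix(match):
        script = match.group(2).replace("<", "&lt;").replace(">", "&gt;")
        return match.group(1) + script + match.group(3)
    return re.sub(r'(<param name="script" value=")([^"]*)(")', fix, html)

page = '<param name="script" value="select *; restrict temperature < 30" />'
print(escape_script_params(page))
```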

2.1 Chemical Metaparsers

The most frequently used types for which metaparsers were written are shown in Table 1. A list of less common types supported is available via the supplemental information associated with this article at

The metaparsers were not intended to act as precise validators or complete parsers of the specific content of each file type. The majority of chemical file types are defined as 7-bit ASCII file types, where the identification of information components is defined either by strings of text terminated with an end-of-line character, or by strings of text with a characteristic prefix, and no other structured identifiers. Typically, these types of delimiters are used within the file structure to identify titles, comments, authors, dates or keywords used to specify details of the input for a computational chemistry calculation or instrumental output, and for many of the file types, definitions of these fields are documented to a greater or lesser extent. Such files are often found in, e.g., the supplemental information sections of electronic journals. A more challenging type of file to parse is the much more finely grained output file from, e.g., an instrument or modelling calculation, and these we have deferred for processing using appropriately marked-up formats based on XML and CML.
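As an illustration of such line-oriented metaparsing, the following sketch (Python for brevity; our metaparsers were written in Java) extracts a title and derives a molecular formula from a chemical/x-xyz file, whose structure is simply an atom count, a free-text comment line and one atom record per line;

```python
from collections import Counter

# Extract metadata from a chemical/x-xyz file: line 1 is the atom count,
# line 2 a free comment (often a usable title), then "Element x y z" lines.
def parse_xyz_metadata(text):
    lines = text.strip().splitlines()
    natoms = int(lines[0])
    counts = Counter(line.split()[0] for line in lines[2:2 + natoms])
    # Hill order: C first, H second, then the remaining elements alphabetically.
    order = sorted(counts, key=lambda el: (el != "C", el != "H", el))
    formula = "".join(el + (str(counts[el]) if counts[el] > 1 else "")
                      for el in order)
    return {"title": lines[1].strip(), "formula": formula}

sample = """5
methane, optimised geometry
C  0.000  0.000  0.000
H  0.629  0.629  0.629
H -0.629 -0.629  0.629
H -0.629  0.629 -0.629
H  0.629 -0.629 -0.629
"""
print(parse_xyz_metadata(sample))
```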

A smaller number of chemical file types are expressed as 8-bit binary files, where the structure is not defined by end-of-line markers, and for which byte-code structure definitions are not readily available, being considered proprietary. When these types are encountered, we chose merely to pass back to the index engine the existence of the file and its formal filename. There was one exception to this, when the 8-bit binary file was detected as a "gzipped" compressed format, typically employed in chemical/x-pdb and model/vrml formats to reduce the file size. Here it is trivial to internally decompress the format and parse it as a 7-bit ASCII file. In general, however, the binary compression mode is proprietary, and no attempt was made to read the contents of such files.
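The gzip case can be sketched as follows; the two magic bytes 0x1f 0x8b identify the format, after which standard library routines recover the 7-bit ASCII content (Python shown for illustration);

```python
import gzip

# Detect the gzip magic bytes and transparently decompress, so that the
# recovered content can be handed to an ordinary 7-bit ASCII metaparser.
def maybe_decompress(raw: bytes) -> bytes:
    if raw[:2] == b"\x1f\x8b":
        return gzip.decompress(raw)
    return raw

compressed = gzip.compress(b"HEADER    mauveine coordinates\n")
print(maybe_decompress(compressed))
```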

2.2 Handling Chemical Content Expressed in Bitmap images.

Largely for historical reasons, much Web-based document content is carried in the form of in-lined raster images, encoded in formats such as GIF, JPEG or PNG (and animated versions, including video animations). These represent a notoriously difficult problem in what has been described as "machine vision", or software image recognition. Whilst this is a major activity outside the chemical arena, surprisingly little work has been done on chemical recognition in such images.10 This work has centered on deriving atom and bond connection tables from line diagrams, which of necessity has to be a process with effectively zero error rate; human intervention is essential to achieve this. Less attention has been given to identifying simple chemical meta-information from such images, or indeed to answering questions such as "does this image contain any representations of chemistry?". Recent developments in "machine vision" hold the prospect of automatically generating such meta-information during the aggregation process,11 but currently only some simple methods can be applied to identify potential chemical content in images. These include;
  1. Image files can carry associated meta-fields in the form of HTML "alt" and "longdesc" descriptors, which can be added to the indexing process. Unfortunately, few image files carry any sensible description, and when they do, there is no guarantee that it is appropriately associated with the image (i.e. such descriptions are often inherited from other images during the authoring process).
  2. GIF and PNG files can contain invisible text-based fields with useful chemical information such as atom coordinates and Molfile connection tables.12 We wrote a parser for these files which can detect the presence of such information, and flag it as indicating the likely presence of a chemical structure file (chemical/x-mdl-molfile).
  3. It is also possible in a general sense to automatically convert an arbitrary raster image to a vector description such as SVG.13 With chemical structure diagrams, of course, such a description is far more concise. It may be possible from the resulting number and connectivity of the vectors to recognize patterns typical of chemical structures, and hence add corresponding meta-information. This area too is under-developed, and one where we anticipate rapid progress in the future.
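Method 2 above can be sketched for the PNG case as follows; the parser walks the chunk stream and returns any tEXt chunks, whose keyword/value pairs may carry an embedded connection table. The sketch is in Python, and the "Molfile" keyword is illustrative rather than a convention our parser relied on;

```python
import struct, zlib

# Walk the PNG chunk stream after the 8-byte signature and collect any
# tEXt chunks (keyword NUL text), which may hide chemical information.
def png_text_chunks(data):
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG file"
    chunks, pos = [], 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            keyword, _, text = body.partition(b"\x00")
            chunks.append((keyword.decode(), text.decode("latin-1")))
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return chunks

def text_chunk(keyword, text):
    """Build a valid tEXt chunk (length, type, body, CRC) for testing."""
    body = keyword.encode() + b"\x00" + text.encode("latin-1")
    return (struct.pack(">I", len(body)) + b"tEXt" + body
            + struct.pack(">I", zlib.crc32(b"tEXt" + body)))

# A minimal, hand-built PNG fragment carrying an embedded structure hint.
fake = b"\x89PNG\r\n\x1a\n" + text_chunk("Molfile", "benzene.mol")
print(png_text_chunks(fake))
```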

2.3 Defining External Metaparsers

The external metaparsers are specified via the ht://Dig configuration file, an abbreviated example of which is illustrated in Scheme 1. The configuration includes specifications of particular meta data declarations to be indexed. We have included examples of the Dublin Core set together with some chemical extensions we have previously proposed.8

Scheme 1. Example Configuration file for ht://Dig
     database_dir: /disk1/www/htdig/new/tests
     allow_in_form: search_algorithm
     search_algorithm: exact:1 synonyms:0.5 endings:0.1
     external_parsers: chemical/x-pdb "/usr/java/bin/java chemical.Htdigfront" \
     chemical/x-jcamp-dx "/usr/java/bin/java chemical.Htdigfront" \
     chemical/x-mopac-input "/usr/java/bin/java chemical.Htdigfront" \
     chemical/x-mdl-molfile "/usr/java/bin/java chemical.Htdigfront" \
     chemical/x-xyz "/usr/java/bin/java chemical.Htdigfront" \
     model/vrml "/usr/java/bin/java chemical.Htdigfront"
     use_meta_description: true
     #the defined meta tags
     keywords_meta_tag_names: DC.chem.coordinates DC.chem.substance.smiles \
     DC.chem.substance.formula DC.Title DC.Publisher \
     DC.Date DC.subject DC.Format DC.Coverage DC.Type DC.description DC.Creator
     # weighting given to metadata elements
     meta_description_factor: 10
     title_factor: 10
     keyword_factor: 10
     heading_factor_1: 10
     heading_factor_2: 9
     text_factor: 4
     template_map: Long ${common_dir}/chemical.html

The relevancy ranking is evaluated on the basis of the weighting factors associated with the original index entries and the frequency of occurrence of the search string in the document itself. In indexing the content, we have chosen a default weight of 1 for any string located in the body of an HTML document, and factors of 10 for strings originating in the document header, including meta-data declarations, and for strings occurring within the body of an external chemical document.
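A minimal sketch of this relevancy evaluation follows (illustrative Python, with the factors taken from Scheme 1's weighting of 10 for header-derived strings against the default of 1 for body text);

```python
# Each occurrence of the search term contributes its source's weighting
# factor, so a term found in a title or meta tag outranks body-text hits.
WEIGHTS = {"title": 10, "meta": 10, "chemical_file": 10, "body": 1}

def relevancy(term, indexed_terms):
    """indexed_terms: list of (word, source) pairs from the index."""
    return sum(WEIGHTS[source]
               for word, source in indexed_terms if word == term)

index = [("mauveine", "title"), ("mauveine", "body"),
         ("mauveine", "body"), ("structure", "body")]
print(relevancy("mauveine", index))  # 10 + 1 + 1 = 12
```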

2.4 Chemical Validation and Derived Information

With the most common chemical legacy file formats, we chose to implement some chemical parsing, validation and derived molecular information. An overview of these and other post-processing operations is given in Figure 1b.

For the file types chemical/x-mdl-molfile, chemical/x-mdl-rdfile, chemical/x-mdl-rxnfile, chemical/x-mdl-sdfile and certain flavours of chemical/x-pdb, it is possible to straightforwardly derive a molecular formula and molecular weight for the substance. For several of the formats containing molecule atom and bond information, such as chemical/x-mdl-molfile and chemical/x-pdb, the parser was also modified to pass a request to an external program to validate the molecular content and to derive a unique SMILES string corresponding to the molecule. If the chemical validation was successful, these strings were returned to the ht://Dig index engine as keywords to enable users to search for the unique SMILES string of a molecule.14 We employed two external programs to validate the file and to derive the SMILES identifier: a CGI-type request can be issued to the Daylight toolkit running on a remote machine, and a similar operation is also possible using the JME (Java Molecular Editor).15 If the validation process fails, an error message is returned instead. We do note that the Daylight and JME canonicalisation routines do not always produce identical unique SMILES strings for the same molecule. Normalisation of this string must be done at the generation stage, since it cannot be achieved at the searching stage (the Daylight system, for example, will always renormalise a SMILES string as part of its own search sequence).
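The derivation of a molecular formula and molecular weight from a connection table can be sketched as follows; this is an illustration in Python rather than our actual parser, and the atomic mass table is truncated to the elements used in the example;

```python
from collections import Counter

# Truncated atomic mass table (illustrative; a real parser needs all elements).
MASSES = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def molfile_formula(text):
    """Derive (formula, molecular weight) from a V2000 molfile string."""
    lines = text.splitlines()
    natoms = int(lines[3][:3])                  # atom count from the counts line
    symbols = [line.split()[3] for line in lines[4:4 + natoms]]
    counts = Counter(symbols)
    # Hill order: C, then H, then the remaining elements alphabetically.
    order = sorted(counts, key=lambda el: (el != "C", el != "H", el))
    formula = "".join(el + (str(counts[el]) if counts[el] > 1 else "")
                      for el in order)
    mass = sum(MASSES[el] * n for el, n in counts.items())
    return formula, round(mass, 3)

methanol = """methanol
  ChemDig sketch

  6  5  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0
    1.4000    0.0000    0.0000 O   0  0
   -0.5000    0.9000    0.0000 H   0  0
   -0.5000   -0.9000    0.0000 H   0  0
   -0.3000    0.0000    1.0000 H   0  0
    1.7000    0.8000    0.0000 H   0  0
M  END
"""
print(molfile_formula(methanol))
```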

These variables are all passed back in standard form for inclusion in the ht://Dig database, with the option of assigning discrete weighting factors to these entries. The derived metadata is also added to the original HTML document in the form of specific meta declarations, such as DC.CHEM.coordinates, DC.CHEM.substance.smiles etc. (Scheme 1). This would allow subsequent searches to be modified by such qualifiers, currently not an option with ht://Dig but available with some other search tools.

3. Results and Discussion.

The initial phase involved selecting a set of distributed sites as the test basis (Table 2). We selected a number of sites from a standard collection of chemical resources,16 using criteria such as geographical distribution and the indicated presence of possible structured chemical content. Initially, a single configuration file containing the start_URL for the root document of the project at each site was used. Retrieval of the chemical content of these sites served to highlight several problems which needed specific solution.

From these experiences, we conclude that the relatively loose standards and compliance of many Web sites means that a complete level of automation for the operation of a robot based on document traversals may not always be possible. However, the degree of human intervention required in the process was manageable, and would be particularly effective in e.g. an Intranet environment.

3.1 Creating Chemically Enhanced Databases

At this stage, we were ready to produce searchable databases from the resulting molecular harvesting. We have evaluated three options for this procedure.
  1. Creating ht://Dig based ChemDig databases taking two forms, one for each site and a consolidated one integrating all the sites. These databases would contain an index of all the prose keywords found in the HTML documents together with the metadata obtained from the metaparsing of the linked chemical files.
  2. Creating a chemical structure database containing all molecular connection tables (e.g. Molfiles and PDB files) located by ChemDig. In principle, the database could also contain, e.g., spectroscopic information gleaned from JCAMP files, but we did not investigate this option at this stage.
  3. Creating a parallel XML-based repository, comprising a seamless integration of HTML as converted to XHTML with chemical files automatically converted to an XML representation using CML. For this process, an XML file containing transcluded molecule coordinate files was created for each original HTML document, and each document was inserted into an XML-based object store. This store can be separately searched.
We have previously described how separate ht://Dig databases can be aggregated with other search engines into a Chemical Search Channel.18
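Option 3 above can be illustrated by the following sketch, which re-expresses a harvested coordinate list as CML-like XML. The element and attribute names follow CML conventions, but the snippet is an illustration in Python, not validated CML output from our converter;

```python
import xml.etree.ElementTree as ET

# Re-express a list of atomic coordinates as CML-like XML, ready for
# transclusion into an XHTML document and insertion into an object store.
def atoms_to_cml(title, atoms):
    """atoms: list of (id, element, x, y, z) tuples."""
    mol = ET.Element("molecule", title=title)
    array = ET.SubElement(mol, "atomArray")
    for aid, el, x, y, z in atoms:
        ET.SubElement(array, "atom", id=aid, elementType=el,
                      x3=str(x), y3=str(y), z3=str(z))
    return ET.tostring(mol, encoding="unicode")

cml = atoms_to_cml("carbon monoxide",
                   [("a1", "C", 0.0, 0.0, 0.0), ("a2", "O", 1.128, 0.0, 0.0)])
print(cml)
```

Once in this form, the molecular content can be filtered, searched and rendered with standard XSLT transforms rather than format-specific parsers.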

3.1.1 Searches using the ChemDig Databases

The ht://Dig database comprises all the bibliographic keywords (other than the usual default stop words, which are not indexed) contained on a central server. The presence of added chemical metadata deriving from the operation of JChemMeta allowed the resulting database to be searched in two ways;

  1. A simple keyword search, invoked using the following embedded XHTML entries from a browser page.

    <form method="post" action="">
    <input type="hidden" name="config" value="database_name" />
    <input type="text" size="9" name="words" value="" /> <br />
    <input type="submit" value="Search" />
    </form>

    Here htsearch is the search program component of ht://Dig, whilst the variables config and words carry, respectively, the database to be searched and the user-specified search string to be passed to it; submitting the form sends these variables in a request to the server.

    The search string can derive from occurrences either in the <head> component of the HTML document or in the text of the body. It can also include the alt attribute of the <img> image element, although these are rarely defined with chemically significant values. Text strings gathered by or derived using the external chemical metaparsers can also be searched for, and all searches can be performed with the usual boolean operators (AND, NOT and OR) and with regular expressions of the type alizarin*. Three derived text expressions can also be specified as a search string: the molecular formula (expressed in the form CnHlElm, i.e. carbon and hydrogen counts followed by the other elements), the molecular mass specified as a numeric string, and the unique SMILES representation of a molecular connection table. These search terms have to be "quoted" to ensure they are treated as a single string.

  2. The second mode makes use of the automatic insertion by the JChemTidy tool of metadata headers of the type:
        <meta name="DC.Subject" content="Molecule-href, Molfile-format" />
    This string is assumed to be unique to JChemTidy, and indicates that the document contains a formal link to a molecular connection table or coordinates, here expressed in the Molfile format. Similar entries can be inserted for other formats. This allows a more restricted search to be conducted over only those XHTML documents which contain links to such chemical files. The code to achieve this is:
    <select name="words" multiple="multiple">
    <option value="Molfile-format and">
    Search only HTML pages with links to Molfiles
    </option>
    <option value="PDB-format and">
    Search only HTML pages with links to PDB files
    </option>
    </select>
    In effect, this forces a boolean operation specifying a search for the user-provided string AND the JChemTidy-unique string. A similar technique can be used to search only for keywords explicitly extracted from the contents of e.g. Molfiles and PDB files (and in principle any chemical file) by inserting into such files a unique string such as "molfile_molecule" or "pdb_molecule".
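By way of illustration, the two search modes come together in the query string that the form ultimately sends to htsearch. The sketch below only assembles such a request; the CGI path and database name are placeholders, not the actual ChemDig configuration:

```python
from urllib.parse import urlencode

# Sketch of an htsearch request. The words value combines a quoted
# molecular-formula term with the JChemTidy marker string, forcing
# the boolean AND described above. Path and database name are invented.
params = {
    "config": "chemdig_db",                  # placeholder database name
    "words": '"C9H8O4" AND Molfile-format',  # quoted term + marker string
}
request = "/cgi-bin/htsearch?" + urlencode(params)  # assumed CGI path
```

Quoting the formula keeps it as a single search token once urlencode has escaped it, so htsearch matches the derived metadata term exactly rather than its fragments.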

Output Templates for ChemDig

Customizable HTML-based output templates for presenting the results of a search allow inclusion of the title of any HTML or external chemical file which contains at least one occurrence of the search terms. In the absence of a title attribute, the name of the file is displayed. The computed relevance ranking of the document is displayed in the form of a star rating, together with a metadata term specifying the document description if available. For a chemical file in which no title field is specified, the title is by default taken as the first comment line found in the file. Next, the fully specified URL of the document is given, with an anchor-based (<a> </a>) hyperlink to that document. If the document comprises one of the chemical MIME types, the user will of course need a viewer for that document type; typically, if the document is e.g. of type chemical/x-mdl-molfile, an external program such as RasMol or a browser plug-in such as Chime or Chem3D would be appropriate.

The output template was also modified to automatically include several additional terms. Firstly, for each connection table located, an entry was inserted to pass this specific query to the Daylight database to search for similar molecules, and a further entry to start a conversion process to CML.19 Secondly, we include CGI requests to several chemically relevant databases considered capable of holding information for further chemically related queries that the user can invoke. These were presented as active links in the results page of the query, independently of the number of hits. The output template (available in the supplemental information) gives search output as shown in Figure 2a,b.

3.1.2 Molecular Substructure Searches

The ht://Dig search interface is essentially bibliographic in nature. It can be used to search for molecules via simple descriptors such as the molecular formula, or via a more nearly unique atom-connectivity descriptor such as SMILES; such fields have been added by our tools to the meta information of any HTML file that invokes a molecular coordinate file. Such searches, however, can return only precise matches, and cannot be used for substructure searches. A clearly identified need is to be able to search for molecules similar to one identified, perhaps, by an ht://Dig search.
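As a sketch of how one such exact-match descriptor can be derived, the molecular formula follows mechanically from an atom list once Hill order (carbon first, hydrogen second, remaining elements alphabetically) is applied. The helper below is illustrative only, not part of the ChemDig metaparsers:

```python
from collections import Counter

def hill_formula(symbols):
    """Molecular formula in Hill order: C first, H second, then the
    remaining elements alphabetically. Counts of 1 are omitted."""
    counts = Counter(symbols)
    ordered = [el for el in ("C", "H") if el in counts]
    ordered += sorted(el for el in counts if el not in ("C", "H"))
    return "".join(
        el + (str(counts[el]) if counts[el] > 1 else "") for el in ordered
    )

# e.g. the atoms of aspirin yield "C9H8O4"
formula = hill_formula(["C"] * 9 + ["H"] * 8 + ["O"] * 4)
```

Because the ordering is canonical, two files describing the same molecule produce the same formula string, which is what makes the quoted-term search above reliable; it remains an exact-match descriptor, not a substructure one.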

Because ChemDig handles each Molfile or PDB file explicitly, it is straightforward to pass these files directly to a specially constructed molecular database. We chose the Daylight THOR/MERLIN system to demonstrate this in operation. The captured fields include not only the molecular connection table, but also the absolute URI indicating the location of the original file. A search page can be defined in HTML and the output results can include not only the list of molecules, but a link to the original site (Table 3).
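THOR/MERLIN itself is a commercial system, but the fields ChemDig captures can be pictured as a simple mapping from a canonical SMILES string to the URIs of the pages that supplied it, which is all an exact-match lookup back to the original site needs. The sketch below is illustrative only; the URL is invented:

```python
# Illustrative record store: canonical SMILES -> list of source URIs.
# A real THOR/MERLIN datatree captures far more, including the full
# connection table; this shows only the URI back-link idea.
records: dict[str, list[str]] = {}

def deposit(smiles: str, source_uri: str) -> None:
    """Record where a molecule's connection table was harvested."""
    records.setdefault(smiles, []).append(source_uri)

deposit("c1ccccc1", "http://www.example.org/mols/benzene.mol")
```

Keeping the absolute URI alongside the structure is what lets a substructure hit in the chemical database resolve back to the originating Web document.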

3.1.3 Searches Based on XML Documents

JChemTidy normalises HTML documents located using ht://Dig into an XML-conforming XHTML representation. Similarly, some chemical coordinate files, such as the MDL Molfile (versions V2000 and V3000), can be passed to a script which converts them to an XML-conforming CML representation.19 The two resulting components (XHTML and CML) can be coalesced into a single XML document, the identification of each component being achieved using the appropriate namespaces. In effect, the first-generation chemical/MIME mechanism has been replaced by a more structured and extensible approach, enabling more finely grained information to be expressed within the document. This now allows the documents to be searched for patterns using XSLT stylesheet transformations, on the premise that all the significant content is marked up in a well-defined and syntactically correct manner, with a resolvable structure to the data. The desired search pattern, which can be a mixture of bibliographic content carried in XHTML and of molecular, atomic and bond information carried in CML, is then specified in an XSLT stylesheet declaration. Application of such a stylesheet to an XML document or document collection can be performed by generic software such as the browser itself (for example Internet Explorer 6), or by software designed to run under batch conditions (such as Saxon20). Unlike e.g. the Daylight searches described above, these functions can be accomplished using OpenSource software.
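The shape of such a coalesced document, and how the namespaces keep the two vocabularies separable, can be sketched with a toy example; the CML namespace URI used here is an assumption for illustration:

```python
import xml.etree.ElementTree as ET

# Toy coalesced document: XHTML prose plus transcluded CML molecules,
# the two vocabularies distinguished purely by namespace.
XHTML = "http://www.w3.org/1999/xhtml"
CML = "http://www.xml-cml.org/schema"  # assumed namespace URI

doc = f"""<html xmlns="{XHTML}" xmlns:cml="{CML}">
  <body>
    <p>Two molecules are transcluded below.</p>
    <cml:molecule id="m1"/>
    <cml:molecule id="m2"/>
  </body>
</html>"""

root = ET.fromstring(doc)
# Selecting on the CML namespace alone retrieves the molecular
# content without touching the bibliographic XHTML.
n_molecules = len(root.findall(f".//{{{CML}}}molecule"))
```

The same namespace-qualified selection is what an XSLT pattern such as count(//cml:molecule) performs declaratively.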

To illustrate the concept, we include two simple pattern searches (Table 4) which involve operations such as counting the number of molecules contained in an XML document, and identifying e.g. any molecule containing sulfur atoms. The stylesheet also includes processing instructions for transforming any located molecules to e.g. Molfile format for display in a browser window.19 This concept is readily extensible, since new XSLT search queries can be either written directly using XSLT grammars, or assembled from stylesheet library components, or potentially generated dynamically using an appropriate chemical query tool. The stylesheet can also include filters to display any specified property and meta-data fields detected in the XML document.
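The sulfur query of Table 4 can equally be expressed procedurally; the fragment below mimics the XSLT select pattern against a toy CML fragment whose element layout follows the cml:atomArray/cml:atom/cml:string path used in the table (the namespace URI is again assumed):

```python
import xml.etree.ElementTree as ET

CML = "http://www.xml-cml.org/schema"  # assumed namespace URI

doc = f"""<doc xmlns:cml="{CML}">
  <cml:molecule id="thiophene">
    <cml:atomArray>
      <cml:atom><cml:string>S</cml:string></cml:atom>
      <cml:atom><cml:string>C</cml:string></cml:atom>
    </cml:atomArray>
  </cml:molecule>
  <cml:molecule id="benzene">
    <cml:atomArray>
      <cml:atom><cml:string>C</cml:string></cml:atom>
    </cml:atomArray>
  </cml:molecule>
</doc>"""

root = ET.fromstring(doc)
# Equivalent of the XSLT pattern
# cml:molecule[cml:atomArray/cml:atom/cml:string='S']
sulfur_molecules = [
    mol.get("id")
    for mol in root.findall(f".//{{{CML}}}molecule")
    if any(
        s.text == "S"
        for s in mol.findall(
            f"{{{CML}}}atomArray/{{{CML}}}atom/{{{CML}}}string"
        )
    )
]
```

Only the thiophene entry survives the filter, just as only sulfur-containing molecules are passed on for conversion or display by the stylesheets in Table 4.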

A search based purely on XSLT patterns is not very efficient when dealing with a large collection of such documents, where pre-indexing is required for speed. To evaluate this mode, we used another OpenSource program called eXist21 to create a testbed CML repository. CML documents can be stored either internally within eXist, using a native XML database written in Java that stores the data to disk and indexes it, or externally, using a relational backend such as MySQL, PostgreSQL or Oracle. Documents can be retrieved, stored, viewed and edited dynamically (Figure 3), since the eXist server can be accessed via HTTP. The built-in search engine provides fast XPath queries, using indexes for all the element, text and attribute nodes present in the original document collection, and is designed to cope with large collections. The eXist database can also be configured to associate any retrieved documents with declared stylesheet libraries, allowing a document to be appropriately transformed prior to presentation to the user. We also note that since these stylesheets are themselves XML documents, they too can be deposited and stored in eXist, and if necessary searched for; for this to be effective, a stylesheet document should be annotated with metadata describing its operation to allow search retrieval.
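Since eXist exposes its collections over HTTP, an XPath query can be carried in a URL; the sketch below only constructs such a request (the _query parameter shape follows eXist's REST interface, but the host name and collection path are placeholders):

```python
from urllib.parse import quote

# Build (but do not send) an HTTP GET against the eXist REST
# interface. Host and collection path are invented placeholders.
host = "http://localhost:8080"
collection = "/exist/rest/db/chemdig"
xpath = "//cml:molecule[.//cml:string = 'S']"
url = f"{host}{collection}?_query={quote(xpath)}"
```

Issuing such a GET returns the matching nodes wrapped in an eXist result document, so the same sulfur query shown earlier runs against the pre-built indexes rather than by walking each document.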

4. Conclusions

When the Web started being populated with molecular content around 1994, it received often justified criticism for having none of the formal rigour of a database system, and for being particularly difficult to search in the specific context of molecular information. Amongst the significant problems identified were: badly formed HTML files which were not syntactically valid; the absence of semantic markup for explicit molecular information (even the molecular formula was not really parsable chemically); the variety of often inconsistent ways in which imprecisely defined chemical format files were declared, linked and displayed; the almost complete lack of metadata describing any chemical data present; and the opaque mechanisms for linking documents to server-based databases whose internal data structures were not exposed. The creation of information and data grid based projects such as Globus22 emphasizes the importance of establishing scalable procedures to address such issues.

We have presented here a number of approaches based on a traversing robot which can be used to rectify some of these deficiencies. Although these still require some manual intervention, it has proved possible to construct various forms of structured chemical database by traversing a typical site containing molecular information. These solutions range from the traditional customised server-based chemical database to distributed "self-defining" document collections searchable at the client using stylesheets. Some issues still remain. The ubiquitous use of bitmap images to describe chemical structures means that not only is molecular constitution and connectivity irretrievably lost from such documents, but that even meta-information signifying the presence of a molecule is not captured. We believe the development of appropriate "machine vision" mechanisms for recognising meta-content in images is desirable. A second issue relates to the current non-availability of on-line resources for identifying chemical uniqueness from identified molecular connectivity; there is probably much undetected duplication. The resolution of such problems must lie in persuading the community to focus on capturing molecular content into a self-describing document structure, comprising finely grained molecular content marked up with globally unique associated identifiers.23 Such a highly distributed global information and knowledge base would represent a novel and extensible mechanism for scientific dissemination, which is seen by many as the ultimate vision of a semantic World-Wide Web.24


One of us (GVG) thanks Merck Sharp and Dohme and the EPSRC for the award of a scholarship. We also gratefully acknowledge the help given us by Paul May at Bristol and Karl Harrison at Oxford.

5. References and Citations

  1. See for example H. S. Rzepa, P. Murray-Rust and B. J. Whitaker, Chem. Soc. Revs, 1997, 1-10.
  2. P. Murray-Rust "Chemical Markup Language", World Wide Web Journal, 1997, pp 135-147; P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39,
  3. P. Murray-Rust, H. S. Rzepa and M. Wright, New J. Chem., 2001, 618-634; P. Murray-Rust, H. S. Rzepa, M. Wright and S. Zara, ChemComm, 2000, 1471-1472.
  4. H. S. Rzepa, P. Murray-Rust and B. J. Whitaker, J. Chem. Inf. Comp. Sci., 1998, 38, 976-982.
  5. J. S. Brecher, Chimia, 1999, 52, 650; C. Leach and H. S. Rzepa, Article 121, "Electronic Conference on Heterocyclic Chemistry '96", H. S. Rzepa, J. Snyder and C. Leach, (Eds), Royal Society of Chemistry, 1997. CDROM ISBN 0-85404-894-4; L. Patiny Internet J. Chem., 2000, 3, article 2; S. K. Lin and L. Patiny, ibid, 2000, 3, article 1
  6. G. V. Gkoutos, H. S. Rzepa and M. Wright, Internet J. Chem., 2000, 3, article 7.
  7. A. Scherpbier,
  8. S. Weibel, Bull. Am. Soc. Information Science, 1997, 24, 9-11; G. V. Gkoutos and H. S. Rzepa, Electronic Conference on Synthesis in Organic Chemistry (ECSOC-2), Ed. S.-K. Lin and E. Pombo-Villar, 1999, CD-ROM, ISBN 3-906980-01-4 (Publisher MDPI).
  9. G. V. Gkoutos, P. Kenway and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 2001, 41, 253-258; G. V. Gkoutos, P. R. Kenway and H. S. Rzepa, New. J. Chem., 2001, 635-638.
  10. P. Ibison, M. Jacquot, F. Kam, J. Chem. Inf. Comp. Sci. 1992, 32, 373-378; R. Simon and A. P. Johnson, J. Chem. Inf. Comp. Sci., 1993, 33, 338-344; ibid, 1997, 37, 109-116.
  11. R. M. Clark, O. A. Adjei, H. Johal, "Machine classification of textures using incremental learning based on the mean and variance of the multi-dimensional feature space", International conference on Mechatronics and Robotics 2000, Saint Petersburg, Russia, May 2000.
  12. See W. D. Ihlefeldt,
  13. G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa and M. Wright, Internet J. Chemistry, 2001, submitted for publication.
  14. D. Weininger, A. Weininger and J. L. Weininger, J. Chem. Inf. Comp. Sci., 1989, 29, 97-101.
  15. P. Ertl and O. Jacob, Theochem J. Mol. Struct. 1997, 419, 113-120.
  16. for a review of such sites, see W. A. Warr, J. Chem. Inf. Comput. Sci., 1998, 38, 966-975. We used as the source for identifying sites.
  17. For mirroring tools, see J. Langfeldt, For renaming and link renormalising tools, see
  18. G. V. Gkoutos and H. S. Rzepa, Internet J. Chem., 2000, 3, article 8.
  19. G. V. Gkoutos, P. Kenway, P. Murray-Rust, H. S. Rzepa and M. Wright, Internet J. Chem., 2001, 4, article 5.
  20. See M. Kay,
  21. See W. M. Meier,
  22. See N. Antonopoulo, A. Shafarenko, J. of Supercomputing, 2001, 20, 5-35; Aloisio, M. Cafaro, P. Falabella, Lect. Notes Comput. Sc., 2000, 1823, 32-40 and
  23. IUPAC Chemical Identifier (IChI) Project; A. McNaught (Chair), Chemistry Intl., 2001, 23, issue 3. For details, see
  24. H. S. Rzepa and P. Murray-Rust, Learned Publishing, 2001, 14, 177.


Figure 1. (a) The Core ChemDig functionality.
(b) The Extended ChemDig functionality.
Figure 2. (a) Input window for ChemDig search.

(b) Resulting output for the default search.

Figure 3. eXist Database Interface to XML Document Repository.
Table 1. A list of frequently occurring links to Chemical MIME Media Types.


File name-extension     Format

csml, csm               RasMol Script language
gau, gjf                Gaussian Input format
dx, jdx                 JCAMP Spectral format
…                       MDL Molfile
…                       MOPAC Input format
…                       Protein DataBank
…                       Virtual Reality Modeling Language (VRML)
…                       Co-ordinate Animation format
A list of less common chemical file types is available in the supplemental information.
Table 2. Chemical File types and Counts identified at Chemical Sites.

Type Of File            Number of occurrences
(counts for each of the five indexed sites, each concluding with a total)

Table 3. Substructure search of Daylight Database resulting from ChemDig


Table 4. XSLT Querying of XML/CML Generated using ChemDig

Query: Find the total number of molecules

    <xsl:value-of select="count(//cml:molecule)"/>

Query: Convert any molecules containing sulfur to Molfiles

    <xsl:for-each select="cml:molecule[cml:atomArray/cml:atom/cml:string='S']">
      <textarea rows="10" cols="80">
        <xsl:call-template name="convertV2000">
          <xsl:with-param name="endLine">|</xsl:with-param>
        </xsl:call-template>
      </textarea>
    </xsl:for-each>

Query: Display any molecules containing sulfur using Jmol

    <xsl:for-each select="cml:molecule[cml:atomArray/cml:atom/cml:string='S']">
      <xsl:with-param name="display">jmol</xsl:with-param>
      <xsl:with-param name="width">400</xsl:with-param>
      <xsl:with-param name="height">300</xsl:with-param>
      <xsl:with-param name="id">
        <xsl:value-of select="generate-id()" />
      </xsl:with-param>
    </xsl:for-each>

To invoke these stylesheets, use Internet Explorer 6.