Published March 2001, in the Internet Journal of Chemistry, 4, article 5.


A Resource for Transforming HTML and Molfile Documents to XML Compliant Form

Georgios V. Gkoutos (a), Philip R. Kenway (b), Peter Murray-Rust (c), Henry S. Rzepa (a) and Michael Wright (a)

(a) Department of Chemistry, Imperial College of Science, Technology and Medicine, London, SW7 2AY.
(b) Merck Sharp and Dohme Research Laboratories, Neuroscience Research Centre, Terlings Park, Harlow, Essex, CM20 2QR.
(c) School of Pharmacy, University of Nottingham, Nottingham, UK.

Summary: An on-line resource for transforming HTML and MDL-format Molfiles to digitally signed XML-conforming XHTML and CML is described. An option for adding meta-data as RDF (Resource Description Framework) is available, and the resulting information can be visually verified by XSLT-based transformations to a browser display.


Introduction

The introduction of structured markup languages (such as the widely used HTML) has transformed the way information is expressed on the Internet [1]. In 1997 a more general and extensible set of markup rules known as XML (eXtensible Markup Language) was proposed. The objective was to combine the recognised ease of use of HTML with greater formal rigour, to allow automatic machine transformations of the information content [2]. This has resulted in a range of proposals and actual XML implementations in the scientific area. These include SVG (Scalable Vector Graphics) and PlotML for scientific graphing [3], MathML (markup for mathematical and symbolic expressions) [4], TML (Taxonomic Markup Language in bioinformatics) [5], XyML [6] and CDXML [7] (chemical structural representations), BPML (Biopolymer Markup Language) [8] and GEML (Gene Expression Markup Language) [9].

A common feature of all XML-conforming languages is a requirement to be 'well formed' and 'valid'. By well formed we mean that the structure of the document is carried by syntactically correct constructs comprising markup tags (more formally 'elements'), which have specified attributes and associated values. These elements surround the data or other content of the document. A valid document is one where the values and behaviour of its elements, their attributes and the type and range of their values can be verified against a formal definition of the language known as a DTD (Document Type Definition), also expressible in XML form as a Schema. The advantages of using a well-formed and valid document format include:

  1. Access to many existing generic tools for creating, editing and transforming XML documents [2].
  2. The ability to invoke automatic machine transformations of the document data/content using another form of XML known as XSLT (eXtensible Stylesheet Language Transformations).
  3. The ability to create complex documents, combining components marked up using a variety of different XML languages.
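The distinction between well-formed and merely 'HTML-like' markup can be illustrated with a minimal Python sketch. It is not part of the resource described here; the two sample strings are invented for illustration, and the standard-library parser used checks only well-formedness, not validity against a DTD or Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, i.e. is well formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Typical first-generation HTML: unclosed <br> and <p> elements, no single root.
legacy_html = "<p>First paragraph<br><p>Second paragraph"

# The same content with every element closed and a single root element.
xhtml = "<div><p>First paragraph<br /></p><p>Second paragraph</p></div>"

print(is_well_formed(legacy_html))  # False
print(is_well_formed(xhtml))        # True
```

A validating parser would be needed for the second requirement, checking the elements and attributes against the formal definition of the language.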

Our own work has centred on developing and using Chemical Markup Language (CML) [10], which in its original form anticipated many of the subsequent developments in XML [11]. Examples of documents containing CML and a variety of transformations to and from other chemical formats have been illustrated previously [12], along with a demonstration of how components of an XML document (including CML) can be digitally signed to ensure authenticity [13]. Of paramount importance in handling CML is the ability to generate well-formed documents and to verify them for compliance with a specific version of CML [10]. We describe in this article an on-line resource for converting the common Molfile format [14] to CML 1.0, digitally signing the result, and expressing the resulting document in visual form within a browser window.

Results and Discussion

Two of the most common forms for handling and presenting chemical content on the Web are 'first generation HTML' (corresponding approximately to versions 1 through 4 of the language) and a molecular structure format known as an MDL Molfile, for which precise definitions of two versions (V2000 and V3000) are publicly available [14]. We describe here the implementation of a procedure for the on-line conversion of HTML documents (which may be neither well formed nor valid) to well-formed and valid XHTML 1.0 (the current standard), and of V2000 and V3000 Molfiles to CML version 1.0. XSLT stylesheets are then used to transform the resulting XML components to a format that can be displayed in a browser. Access to the procedures described here is via Table 1, or at http://www.ch.ic.ac.uk/chimeral/. The procedure makes use of the following software tools:

  1. An interactive form to invoke a CGI program. The variables passed to this program carry several options: providing additional metadata based on Dublin Core fields, signing the converted document with a digital signature [13], and specifying how the data will be presented in the browser using stylesheet libraries [15].
  2. JChemTidy [16] is used to convert an HTML document to XHTML form.
  3. JChemMeta [17] is used to optionally extract metadata and added-value information.
  4. JChemAgent [13] acts as a robot by following links to any associated Molfiles, extracting chemical information and adding it as metadata to the XHTML document. It can also convert the identified Molfiles into well-formed and valid CML.
  5. JChemSign [13] is used to ensure that the converted file is well formed and XML-valid, and to indicate this by adding a digital signature.
  6. CheMstyLe [15] is a collection of stylesheet libraries invoked via a single 'root stylesheet'. It enables XML and CML documents to be transformed to a variety of formats and, in particular, displayed in a browser window.
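To give a feel for the Molfile to CML step, the following Python sketch reads the counts line, atom block and bond block of a small V2000 Molfile and emits a CML-like fragment. It is an illustration only and not the JChemAgent code: the sample Molfile is hand-written, the function name is hypothetical, and the output element and attribute names (atomArray, bondArray, and child elements carrying a builtin attribute) follow our recollection of CML 1.0 conventions and have not been checked against the DTD.

```python
import xml.etree.ElementTree as ET

# Hand-written V2000 Molfile (methanol, heavy atoms only); fields are fixed-width.
MOLFILE = "\n".join([
    "methanol",
    "  hand-written example",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "M  END",
])


def molfile_to_cml(text, mol_id="m1"):
    """Convert the atom and bond blocks of a V2000 Molfile to a CML-like tree."""
    lines = text.splitlines()
    counts = lines[3]                      # fourth line holds the atom/bond counts
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    molecule = ET.Element("molecule", id=mol_id, title=lines[0].strip())
    atom_array = ET.SubElement(molecule, "atomArray")
    for i, line in enumerate(lines[4:4 + n_atoms], start=1):
        atom = ET.SubElement(atom_array, "atom", id=f"a{i}")
        ET.SubElement(atom, "string", builtin="elementType").text = line[31:34].strip()
        # x, y and z each occupy a ten-character column
        for axis, start in (("x3", 0), ("y3", 10), ("z3", 20)):
            ET.SubElement(atom, "float", builtin=axis).text = line[start:start + 10].strip()

    bond_array = ET.SubElement(molecule, "bondArray")
    first_bond = 4 + n_atoms
    for i, line in enumerate(lines[first_bond:first_bond + n_bonds], start=1):
        bond = ET.SubElement(bond_array, "bond", id=f"b{i}")
        for col in (0, 3):                 # two three-character atom indices
            ET.SubElement(bond, "string", builtin="atomRef").text = f"a{int(line[col:col + 3])}"
        ET.SubElement(bond, "string", builtin="order").text = str(int(line[6:9]))
    return molecule


cml = molfile_to_cml(MOLFILE)
print(ET.tostring(cml, encoding="unicode"))
```

Because the tree is built with an XML library rather than by string concatenation, the output is well formed by construction; validity against a particular CML version would still need a separate check.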

These procedures can act upon files provided by the user from three different sources: a file can be uploaded from the user's local directory, a URL can be supplied giving the location of a specified HTML file or Molfile, or batch processing of multiple Molfiles can be requested by specifying the URL of a remote directory (Table 1). Examples illustrating the conversion process are collected in Table 2. The process for each of these file sources is divided into three stages.

Firstly, the specified file(s) are converted, along with any added-value procedures, and the converted file(s) are stored on a server. The user then has the options of requesting that the document be digitally signed, adding meta-information to the CML or XHTML files produced, and specifying whether any embedded Rasmol scripts present in the HTML document are to be corrected (Table 3). Rasmol scripts may not be XML-compliant and so have to be handled specially. Two alternatives are possible for converting an HTML file which contains links to one or more Molfiles: the document can be converted to XHTML and the Molfiles left unprocessed but invoked using <object> elements, or the Molfiles can be converted to CML and included as in-lined components of the parent XHTML document.
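The signing option mentioned above depends on the document being well formed, because an XML signature is computed over a canonical serialisation of the content rather than over the raw bytes. The Python sketch below illustrates only that digest step and is not the JChemSign implementation: the choice of SHA-1 and the sample fragments are assumptions, and a real signature would additionally sign the digest with a private key.

```python
import hashlib
import xml.etree.ElementTree as ET


def canonical_digest(xml_text: str) -> str:
    """Digest of the canonical form of an XML fragment (needs Python 3.8+)."""
    canonical = ET.canonicalize(xml_text)  # normalises attribute order, quoting, empty tags
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# Two serialisations of the same information, and one with altered content.
a = '<molecule id="m1" title="water"><atomArray/></molecule>'
b = "<molecule  title='water'   id='m1'><atomArray></atomArray></molecule>"
c = '<molecule id="m1" title="heavy water"><atomArray/></molecule>'

print(canonical_digest(a) == canonical_digest(b))  # True
print(canonical_digest(a) == canonical_digest(c))  # False
```

Markup that is not well formed cannot be canonicalised at all, which is one reason why conversion has to precede signing.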

In the final phase, the user selects a stylesheet to be used to display the converted files. This can be either from a pre-defined set or by specifying the URL of their own custom XSLT stylesheet (Table 4). We also include options to specify the CML namespace, based either on the current CML 1.0 schema or the published DTD. Because CML and XML languages in general are designed to be extensible and to be capable of cross-language interactions, a unique namespace must be declared or implied for each document element. This namespace is then used to generate a globally unique ID and hence a globally unique address for each XML element. This scheme does not per se generate a unique molecular identifier, but any scheme for generating such identifiers could readily be included in the procedures described here. A schematic representation of the scripts and Java classes responsible for the overall conversion process is shown in Figure 1.
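The role of the namespace can be sketched in Python as follows. The example is hypothetical throughout: the CML namespace URI, the document URL and the rule for forming an address (document URL plus element ID) are assumptions made for illustration, not the scheme used by the resource. What it does show is that a namespace-aware parser expands every element name to a (namespace, local name) pair, so that elements from different languages in one document cannot be confused.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CML_NS = "http://www.xml-cml.org/schema"   # assumed URI, for illustration only

DOC = f"""<html xmlns="{XHTML_NS}" xmlns:cml="{CML_NS}">
  <body>
    <p id="p1">Water</p>
    <cml:molecule id="m1"/>
  </body>
</html>"""


def describe(xml_text, doc_url):
    """List (namespace, local name, address) for every element carrying an id."""
    rows = []
    for el in ET.fromstring(xml_text).iter():
        if "id" not in el.attrib:
            continue
        if el.tag.startswith("{"):          # ElementTree stores names as {namespace}local
            namespace, local = el.tag[1:].split("}", 1)
        else:
            namespace, local = "", el.tag
        # Hypothetical addressing rule: document URL plus fragment identifier.
        rows.append((namespace, local, f"{doc_url}#{el.get('id')}"))
    return rows


for row in describe(DOC, "http://example.org/water.html"):
    print(row)
```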

[Image: Scheme 1, showing data flow]

Figure 1. Flow diagram for transforming and signing HTML and CML resources.

Visual Verification

It is particularly important for the user to be able to verify any CML produced as part of the procedures described here. We have described in other articles stylesheet transformations which can back-transform CML, e.g. to a Molfile for presentation using applets [12]. Here we include some of these options (Table 4, schematic only), together with a new method termed Jumbo3-JS (Java or JavaScript Universal Molecular Browser) [18]. This involves using a client-side (browser) JavaScript transform of a CML collection to a set of SVG graphics primitives. This option currently requires the user to install an SVG plugin into their browser, but versions of browsers with native SVG support are now becoming available. Because SVG can support other chemically relevant graphical requirements, we regard this approach as having much potential for integrating molecular with other chemical graphical content.
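The idea of mapping a molecule onto SVG primitives can be sketched as below. This is a Python stand-in written for illustration, not the Jumbo3-JS JavaScript: the function name, the input data structure, the scale factor and the water coordinates are all assumptions. Each bond becomes a line element and each atom a text label, with the y axis flipped because SVG coordinates increase downwards.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)          # serialise SVG as the default namespace


def molecule_to_svg(atoms, bonds, scale=40.0, margin=20.0):
    """atoms: {id: (symbol, x, y)}; bonds: [(id1, id2)]. Returns an <svg> element."""
    xs = [x for _, x, _ in atoms.values()]
    ys = [y for _, _, y in atoms.values()]
    min_x, max_y = min(xs), max(ys)

    def to_px(x, y):
        # Shift to the margin, scale, and flip y for the SVG coordinate system.
        return margin + (x - min_x) * scale, margin + (max_y - y) * scale

    width = 2 * margin + (max(xs) - min_x) * scale
    height = 2 * margin + (max_y - min(ys)) * scale
    svg = ET.Element(f"{{{SVG_NS}}}svg", width=f"{width:.0f}", height=f"{height:.0f}")

    for a1, a2 in bonds:
        x1, y1 = to_px(*atoms[a1][1:])
        x2, y2 = to_px(*atoms[a2][1:])
        ET.SubElement(svg, f"{{{SVG_NS}}}line", x1=f"{x1:.1f}", y1=f"{y1:.1f}",
                      x2=f"{x2:.1f}", y2=f"{y2:.1f}", stroke="black")
    for symbol, x, y in atoms.values():
        px, py = to_px(x, y)
        label = ET.SubElement(svg, f"{{{SVG_NS}}}text", x=f"{px:.1f}", y=f"{py:.1f}")
        label.set("text-anchor", "middle")
        label.text = symbol
    return svg


water_atoms = {"a1": ("O", 0.0, 0.0), "a2": ("H", 0.8, -0.6), "a3": ("H", -0.8, -0.6)}
water_bonds = [("a1", "a2"), ("a1", "a3")]
print(ET.tostring(molecule_to_svg(water_atoms, water_bonds), encoding="unicode"))
```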

The supplied stylesheets also extract other components of the transformed documents for inspection, including any metadata and an option to show details of the XML digital signature added to the file. A typical display resulting from this process is shown in Figures 2a and 2b.

Figure 2. Display of a CML document using stylesheet transforms to show (a) meta-data, digital signature and molecule using JMol, and (b) meta-data and molecule using Jumbo3-JS and SVG.


Conclusions

The procedures described here are designed to illustrate how molecular information currently specified in either so-called legacy formats or in semi-presentational formats such as versions 1-4 of HTML can be easily converted to XML-compliant forms, for example XHTML, CML and SVG. The design is deliberately modular, and hence readily amenable to extension at each stage. Thus we have not attempted to comprehensively cover the transformation and presentation of all types of existing molecular information, but have set out a framework for future development. The inclusion of XML-based digital signing procedures is one important aspect of maintaining an audit trail of the transformation history of the molecular components, as is the inclusion of globally unique identifiers for this information. In this demonstration, we include outputs only for browsers such as Internet Explorer V5 or 6, but we also recognise the need for other transformations, e.g. to object-oriented databases or to printable formats such as Acrobat PDF via FOP [12], and the need to include other scientifically important XML languages such as MathML. We envisage the eventual extension of such procedures to automatic robot-based processing of globally distributed molecular information resources.

Acknowledgements

One of us (GVG) thanks Merck Sharp and Dohme and the EPSRC for the award of a scholarship.


References and Citations

  1. H. S. Rzepa, B. J. Whitaker and M. J. Winter, J. Chem. Soc., Chem. Commun., 1994, 1907; O. Casher, G. Chandramohan, M. Hargreaves, C. Leach, P. Murray-Rust, R. Sayle, H. S. Rzepa and B. J. Whitaker, J. Chem. Soc., Perkin Trans. 2, 1995, 7; H. S. Rzepa, P. Murray-Rust and B. J. Whitaker, Chem. Soc. Revs., 1997, 1; H. S. Rzepa, P. Murray-Rust and B. J. Whitaker, J. Chem. Inf. Comp. Sci., 1998, 38, 976-982.
  2. See the home site of the World-Wide Web consortium; http://www.w3c.org/, http://xml.apache.org/ for a collection of Opensource tools for manipulating XML and http://www.xml.org/ for Industrial applications.
  3. For a specification of SVG, see http://www.w3.org/TR/2000/CR-SVG-20001102/. For a specification of PlotML, see http://ptolemy.eecs.berkeley.edu/; J. Davis II, M. Goel, C. Hylands, B. Kienhuis, E. A. Lee, J. Liu, X. Liu, L. Muliadi, S. Neuendorffer, J. Reekie, N. Smyth, J. Tsay and Y. Xiong, ERL Technical Report UCB/ERL No. M99/37, University of California, Berkeley, July 1999.
  4. For a specification of MathML see http://www.w3.org/TR/2001/PR-MathML2-20010108/
  5. R. Gilmour, Bioinformatics, 2000, 16, 406-407.
  6. S. Fujita, J. Chem. Inf. Comp. Sci., 1999, 39, 915-927.
  7. J. Brecher, personal communication. See http://sdk.camsoft.com/chemdraw/cdx/CDXFileformat/index.htm for more detail.
  8. D. Fenyo, Bioinformatics, 1999, 15, 339-340.
  9. M. Pesce, personal communication. See http://www.geml.org/
  10. For a description of Chemical Markup Language, see P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.
  11. The original concept was described in P. Murray-Rust, C. Leach and H. S. Rzepa, Abs. Papers. Am. Chem. Soc., 1995, 210, 40-COMP.
  12. P. Murray-Rust, H. S. Rzepa, M. Wright and S. Zara, ChemComm, 2000, 1471-1472; P. Murray-Rust, H. S. Rzepa and M. Wright, New J. Chem, 2001, in press.
  13. G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa and M. Wright, J. Chem. Inf. Comp. Sci., submitted.
  14. A. Dalby, J. Chem. Inf. Comp. Sci., 1992, 32, 244. The latest specifications of the Molfile formats are to be found at http://www.mdlchime.com/
  15. M. Wright, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., in preparation.
  16. JChemTidy; G. V. Gkoutos, P. Kenway and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 2001, in press.
  17. G. V. Gkoutos, P. Kenway and H. S. Rzepa, New J. Chem., 2001, in press.
  18. P. Murray-Rust, unpublished work.

Table 1. File Selection Options

  • Load a local file, convert (and) sign and view.
  • Provide a URL corresponding to a file, convert (and) sign and view.
  • Provide a URL corresponding to a directory of chemical files, convert (and) sign and view.

Table 2. Test Examples for Conversion

  • Example 1. Molfile (V2000) conversion to CML: Original V2000 Molfile
  • Example 2. Molfile (V3000) conversion to CML: Original V3000 Molfile
  • Example 3. HTML to XHTML conversion: Original HTML

Table 3. Automatic Conversion and Signing for HTML and MDL Molfiles

Please select an HTML file or a Molfile from the WWW and fill in the form with your desired parameters.
Type the URL of the file:
Select type of file:
You will have the option to sign the document with a JChemAgent signature in the next step.
Attempt to correct Rasmol script in HTML document: Yes / No
Add (chemical) metadata to HTML/Molfile? Yes / No
Add your own metadata
Title
Creator (author)
Subject or keywords
Description
Publisher
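
The Dublin Core fields offered in this form map naturally onto RDF. The Python sketch below shows one way such a description could be serialised. It is an assumption about the general shape of the output rather than the JChemMeta implementation: the function name, the resource URL and the field values are invented, and only the RDF and Dublin Core namespace URIs are standard.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_rdf(about, fields):
    """Build an rdf:RDF element describing one resource with Dublin Core fields."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(rdf, f"{{{RDF_NS}}}Description")
    description.set(f"{{{RDF_NS}}}about", about)
    for name, value in fields.items():
        # Dublin Core element names are lower case: title, creator, subject, ...
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = value
    return rdf


# Invented values for the five fields offered in the form.
metadata = {
    "title": "Water",
    "creator": "A. N. Author",
    "subject": "water; CML; example",
    "description": "Converted from a V2000 Molfile",
    "publisher": "Example Publisher",
}
rdf = dublin_core_rdf("http://example.org/water.xml", metadata)
print(ET.tostring(rdf, encoding="unicode"))
```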

Table 4. Stylesheet Processing of Converted Files. This table is only illustrative of the output from Table 3, and is not functional.

Your MDL molfile has been successfully converted.
You now have the option to either view the converted file or sign and view it.
The second option will be a little slow (up to 40 seconds).
First, you have to select or provide a stylesheet and a namespace.
Type the URL of the stylesheet you want to use:
Or select a stylesheet: Select a namespace:

Please note that in order to view the converted file using an XSLT stylesheet, you must use the Internet Explorer 5.5 browser combined with the Microsoft msxml3 parser running in replace mode. See http://msdn.microsoft.com/xml/general/xmlparser.asp. We anticipate that these additions will be consolidated in Version 6.0 of the Internet Explorer release.