Summary: An on-line resource for transforming HTML and MDL-format Molfiles to digitally signed XML-conforming XHTML and CML is described. An option for adding meta-data as RDF (Resource Description Framework) is available, and the resulting information can be visually verified by XSLT-based transformations to a browser display.
The introduction of structured markup languages (such as the widely used HTML) has transformed the way information is expressed on the Internet.1 In 1997 a more general and extensible set of markup rules known as XML (eXtensible Markup Language) was proposed. The objective was to combine the recognised ease of use of HTML with greater formal rigour, to allow automatic machine transformations of the information content.2 This has resulted in a range of proposals and actual XML implementations in the scientific area. These include SVG (Scalable Vector Graphics) and PlotML for scientific graphing,3 MathML (markup for mathematical and symbolic expressions),4 TML (Taxonomic Markup Language in bioinformatics),5 XyML6 and CDXML7 (chemical structural representations), BPML (Biopolymer Markup Language)8 and GEML (Gene Expression Markup Language).9
A common feature of all XML-conforming languages is a requirement to be 'well formed' and 'valid'. By well formed we mean that the structure of the document is carried by syntactically correct constructs comprising markup tags (more formally 'elements'), which have specified attributes and associated values. These elements surround the data or other content of the document. A valid document is one where the values and behaviour of its elements, their attributes and the type and range of their values can be verified against a formal definition of the language known as a DTD (Document Type Definition), also expressible in XML form as a Schema. The advantages of using a well formed and valid document format include:
Our own work has centred on developing and using Chemical Markup Language (CML),10 which in its original form anticipated many of the subsequent developments in XML.11 Examples of documents containing CML and a variety of transformations to and from other chemical formats have been previously illustrated,12 along with a demonstration of how components of an XML document (including CML) can be digitally signed to ensure authenticity.13 Of paramount importance in handling CML is the ability to generate well-formed documents and to verify them for compliance with a specific version of CML.10 We describe in this article an on-line resource for converting the common Molfile format14 to CML 1.0, digitally signing the result, and expressing the resulting document in visual form within a browser window.
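The distinction between generating and verifying well-formed documents can be illustrated with a minimal sketch. It is written in Python using only the standard library, which is an assumption of this example and not the software used by the resource described here; the function name `is_well_formed` is hypothetical. The check covers well-formedness only: the standard-library parser does not validate a document against a DTD or Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the text parses as well-formed XML.

    This is a well-formedness test only; it does NOT check validity
    against a DTD or Schema.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Typical 'first generation HTML': the <b> element is never closed.
legacy_fragment = "<p>Benzene is <b>aromatic</p>"
# The same content with every element explicitly closed.
xhtml_fragment = "<p>Benzene is <b>aromatic</b></p>"

print(is_well_formed(legacy_fragment))
print(is_well_formed(xhtml_fragment))
```

A document that passes this test may still be invalid, for example by using an element that the relevant DTD does not define, so a validating parser is needed for the second stage of verification.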
Two of the most common forms for handling and presenting chemical content on the Web are 'first generation HTML' (corresponding approximately to versions 1 through 4 of the language) and a molecular structure format known as an MDL Molfile, for which precise definitions of two versions (V2000 and V3000) are publicly available.14 We describe here the implementation of a procedure for the on-line conversion of (potentially neither well formed nor valid) HTML documents to well-formed and valid XHTML 1.0 (the current standard), and of V2000 and V3000 Molfiles to CML version 1.0. XSLT stylesheets are then used to transform the resulting XML components to a format that can be displayed in a browser. Access to the procedures described here is via Table 1, or at http://www.ch.ic.ac.uk/chimeral/. The procedure makes use of the following software tools:
These procedures can act upon files provided by the user from four different sources. A file can be uploaded from the user's local directory, a URL can be supplied giving the location of a specified HTML file or Molfile, and batch processing of multiple Molfiles via URL specification of a remote directory can be requested (Table 1). Examples illustrating the conversion process are collected in Table 2. The process for each of these file sources is divided into three stages.
Firstly, the conversion of the specified file(s) takes place, along with any added-value procedures, and the converted file(s) are stored on a server. The user then has the options of requesting that the document be digitally signed, adding meta-information to the CML or XHTML files produced, and specifying whether any embedded Rasmol scripts present in the HTML document are to be corrected (Table 3). Rasmol scripts may not be XML-compliant and so have to be handled specially. Two alternatives are possible for converting an HTML file which contains links to one or more Molfiles. The document can be converted to XHTML and the Molfiles left unprocessed but invoked using <object> elements, or the Molfiles can be converted to CML and included as in-lined components of the parent XHTML document.
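The Molfile conversion step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Java classes used by the resource: the function name `molfile_v2000_to_cml` and the sample data are hypothetical, only the atom and bond blocks of a V2000 Molfile are read, and the element and attribute names loosely follow the CML 1.0 style and should be checked against the published DTD before use.

```python
import xml.etree.ElementTree as ET

# Molfile bond type -> order label (assumed labels; 4 denotes aromatic).
BOND_ORDERS = {1: "1", 2: "2", 3: "3", 4: "A"}


def molfile_v2000_to_cml(molfile_text: str) -> str:
    """Convert the atom and bond blocks of a V2000 Molfile to CML-style XML."""
    lines = molfile_text.splitlines()
    counts = lines[3]  # the fourth line is the counts line
    if "V2000" not in counts:
        raise ValueError("not a V2000 Molfile")
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    molecule = ET.Element("molecule", {"title": lines[0].strip()})
    atom_array = ET.SubElement(molecule, "atomArray")
    for index, line in enumerate(lines[4:4 + n_atoms], start=1):
        atom = ET.SubElement(atom_array, "atom", {"id": f"a{index}"})
        symbol = ET.SubElement(atom, "string", {"builtin": "elementType"})
        symbol.text = line[31:34].strip()
        # Fixed-width coordinate fields: x, y, z, ten characters each.
        for name, start in (("x3", 0), ("y3", 10), ("z3", 20)):
            coord = ET.SubElement(atom, "float", {"builtin": name})
            coord.text = line[start:start + 10].strip()

    bond_array = ET.SubElement(molecule, "bondArray")
    first_bond = 4 + n_atoms
    for index, line in enumerate(lines[first_bond:first_bond + n_bonds], start=1):
        bond = ET.SubElement(bond_array, "bond", {"id": f"b{index}"})
        for ref in (line[0:3], line[3:6]):
            atom_ref = ET.SubElement(bond, "string", {"builtin": "atomRef"})
            atom_ref.text = f"a{int(ref)}"
        order = ET.SubElement(bond, "string", {"builtin": "order"})
        order.text = BOND_ORDERS.get(int(line[6:9]), "unknown")
    return ET.tostring(molecule, encoding="unicode")


# Hypothetical two-atom sample, built with format strings so that the
# fixed-width columns of the Molfile layout are respected.
SAMPLE_MOLFILE = "\n".join([
    "carbon monoxide",
    "  hypothetical-program",
    "",
    f"{2:3d}{1:3d}  0  0  0  0  0  0  0  0999 V2000",
    f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {'C':<3} 0  0",
    f"{1.128:10.4f}{0.0:10.4f}{0.0:10.4f} {'O':<3} 0  0",
    f"{1:3d}{2:3d}{3:3d}  0",
    "M  END",
])

print(molfile_v2000_to_cml(SAMPLE_MOLFILE))
```

Because the output is built through an XML API and not by string concatenation, it is well formed by construction; validity against a particular CML version would still need to be checked separately.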
In the final phase, the user selects a stylesheet to be used to display the converted files. This can be either from a pre-defined set or by specifying the URL of their own custom XSLT stylesheet (Table 4). We also include options to specify the CML namespace, based either on the current CML 1.0 schema or the published DTD. Because CML and XML languages in general are designed to be extensible and to be capable of cross-language interactions, a unique namespace must be declared or implied for each document element. This namespace is then used to generate a globally unique ID and hence globally unique addressing for each XML element. This scheme does not per se generate a unique molecular identifier, but any scheme for generating such identifiers could readily be included in the procedures described here. A schematic representation of the scripts and Java classes responsible for the overall conversion process is shown in Figure 1.
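The role of the namespace when CML is in-lined in an XHTML document can be illustrated with a short Python sketch. The XHTML namespace URI is the standard one; the CML namespace URI below is a placeholder chosen for this example, not the one offered by the resource, and the helper names `qualified` and `build_document` are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CML_NS = "http://example.org/cml"  # placeholder URI, an assumption of this sketch

# Prefixes affect serialisation only; identity comes from the URI.
ET.register_namespace("xhtml", XHTML_NS)
ET.register_namespace("cml", CML_NS)


def qualified(namespace: str, local_name: str) -> str:
    """Return the namespace-qualified name in ElementTree's {uri}local form."""
    return f"{{{namespace}}}{local_name}"


def build_document() -> ET.Element:
    """Build an XHTML document with one in-lined CML molecule."""
    html = ET.Element(qualified(XHTML_NS, "html"))
    body = ET.SubElement(html, qualified(XHTML_NS, "body"))
    paragraph = ET.SubElement(body, qualified(XHTML_NS, "p"))
    paragraph.text = "An in-lined molecule follows."
    ET.SubElement(body, qualified(CML_NS, "molecule"), {"id": "m1"})
    return html


document = build_document()
print(ET.tostring(document, encoding="unicode"))

# The molecule is addressed by namespace URI plus local name, so it cannot
# be confused with an element called 'molecule' from another language.
molecule = document.find(f".//{qualified(CML_NS, 'molecule')}")
print(molecule.get("id"))
```

Searching for an unqualified `molecule` element in this document finds nothing, which is the behaviour that allows elements from several XML languages to coexist in one document without name clashes.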
Figure 1. Flow diagram for transforming and signing HTML and CML resources.
The supplied stylesheets also extract other components of the transformed documents for inspection, including any metadata, and offer an option to show details of the XML digital signature added to the file. A typical display resulting from this process is shown in Figures 2a and 2b.
Figure 2. Display of CML document using Stylesheet transforms to display (a) meta-data, digital signature and molecule using JMol, (b) meta-data and molecule using Jumbo3-JS and SVG
The procedures described here are designed to illustrate how molecular information currently specified in either so-called legacy formats or in semi-presentational formats such as versions 1-4 of HTML can be easily converted to XML-compliant forms, for example XHTML, CML and SVG. The design is deliberately modular, and hence readily amenable to extension at each stage. Thus we have not attempted to comprehensively cover the transformation and presentation of all types of existing molecular information, but have set out a framework for future development. The inclusion of XML-based digital signing procedures is one important aspect of maintaining an audit trail of the transformation history of the molecular components, as is the inclusion of globally unique identifiers for this information. In this demonstration, we include outputs only for browsers such as Internet Explorer V5 or 6, but we also recognise the need for other transformations to e.g. object-oriented databases, or to printable formats such as Acrobat PDF via FOP12, and the need to include other scientifically important XML languages such as MathML. We envisage the eventual extension of such procedures to automatic robot-based processing of globally distributed molecular information resources.
One of us (GVG) thanks Merck Sharp and Dohme and the EPSRC for the award of a scholarship.
|Example 1 Molfile (V2000) conversion to CML: Original V2000 Molfile||Example 2 Molfile (V3000) conversion to CML: Original V3000 Molfile|
|Example 3 HTML to XHTML Conversion: Original HTML|
Please select a HTML or a Molfile from the WWW and fill the forms
with your desired parameters.
Your MDL molfile has been successfully converted.
You have now the option to either view the converted file or sign and view it.
The second option will be a little slow (up to 40 seconds).
First, you have to select or provide a stylesheet and a namespace.
Please note that in order to view the converted file using an XSLT stylesheet, it is required that you use Internet Explorer 5.5 browser combined with the Microsoft msxml3 parser run in replace mode. See http://msdn.microsoft.com/xml/general/xmlparser.asp. We anticipate that these additions will be consolidated in Version 6.0 of the Internet Explorer release.