Development of Chemical Markup Language (CML) as a System for Handling Complex Chemical Content

Michael Wright¹ - Supervisor: Dr. H. S. Rzepa¹ - project in collaboration with S. Zara¹ and Peter Murray-Rust²

(1) Department of Chemistry, Imperial College of Science, Technology and Medicine, UK - (2) School of Pharmaceutical Sciences, University of Nottingham, UK - May 22, 2000

We report the conclusion of an 8 month project culminating in the first fully operational system for managing complex chemical content entirely in interoperating XML-based markup languages.¹ This involved the extension of published Chemical Markup Language (CML 1.0)² and the development of mechanisms allowing the display of CML marked up molecules, spectra and reactions within a standard web browser (Internet Explorer 5.x).³ Integrating these techniques with existing XML compliant languages (e.g. XHTML⁴ and SVG^{5 6}) results in electronic documents with the significant advantages of data retrieval and flexibility over existing HTML/plugin solutions. These documents can be optimised for a variety of purposes (e.g. screen display or printing⁷) by single stylesheet transformations. An XML schema^{8 9} has also been developed for CML 1.0 to allow document validation and the use of data links. The ChiMeraL website,¹ containing a range of online demonstrations, examples and CML resources has been built as an integral part of this project.

1 - Introduction

The World-Wide-Web was originally conceived as a collaborative tool for scientists, allowing the rapid distribution and publication of results and greatly improved online communication. It has become increasingly common for academic papers to be posted online as HTML (hypertext markup language) pages, this being an extremely efficient way of making them available to everyone.³ Databases and knowledge archives are increasingly becoming 'web enabled' allowing faster, easier and more widely available access. This reduces unnecessary or parallel research and allows significantly more efficient literature searches. However, awkward handling of non-textual information (e.g. molecular structures) and difficulties in automatically recognising and extracting data from HTML pages limit their potential.^{10 11 12}

More sophisticated mechanisms are required to solve these problems, and these are presently being developed. This project involved the development and extension of chemical markup language (CML)² and techniques allowing the display of molecules, spectra and reactions within a web browser. These chemical objects can be seamlessly integrated into existing textual markup to create complete electronic documents (an example being this report).

Note: This report assumes a working knowledge of HTML and requires Internet Explorer 5.x with Adobe's SVG plugin to be installed (http://www.adobe.com/svg/viewer/install/)

1.1 - Existing Solutions - HTML and plugins

The powerful text formatting language HTML and its interface, the web browser, are familiar tools to many scientists. While able to handle bitmapped images (normally gif, jpg and png), HTML was originally designed to display plain text on a computer screen. Its markup 'tags' are primarily concerned with formatting textual objects (paragraphs, tables, headings etc.) and supporting inter-document linking via the hypertext mechanism.

HTML documents rely on human processing for content interpretation and error detection since the language's syntax is too flexible to be reliably interpreted by machines. It cannot be validated with respect to the document's integrity and there is very little markup of context (to indicate the meaning and relevance of each information component). Non-textual data normally requires the use of browser 'plugins' (small, platform specific helper programs) or Java 'applets' (non platform specific programs that run on a 'virtual machine' on top of the web browser) - see Figure 1.1.1. As an example, a present day online chemical paper might consist of HTML text, several static images and a molecular structure intended to be visualised using a plugin. A common plugin is MDL's Chime¹³ (based on the open source Rasmol viewer¹⁴) which, like most of these programs, requires an external data file in one of a variety of 'legacy' formats e.g. MDL .mol or Brookhaven .pdb.¹⁵ These files are widely available online and use ASCII but their syntax is heavily implied and requires specialist knowledge to interpret. This severely limits the re-usability of the rapidly growing amount of high-quality chemistry available on Web pages, and a significant amount of time and information is wasted converting between these files.

Figure 1.1.1: Comparing different ways of displaying chemical structures

Taxol intermediate as a .gif image (bitmap) - a 'dead' format, human comprehensible but chemical information can't be automatically extracted
Riboflavin using a plugin - uses the EMBED tag and an external data file
Cochineal using ChemAxon's Marvin applet (requires Java to be enabled) - the structure is stored as CML within this document, hence no data file is needed

Since HTML⁴ wasn't originally designed to support non-textual data, there is no built-in mechanism allowing molecular structures to be shared between a plugin and the parent document. As a result, the external data files become isolated from the text and from each other. This reduces inter-operability and makes it significantly harder to search or data-mine for information within these files, while at the same time retaining their context. An extensive literature search through HTML pages using a conventional search engine requires significant human interaction in order to identify useful 'hits'. If the information were labelled by context, far more could be handled automatically, e.g. filtering and ignoring identical data from different sources. The main problems with existing HTML/plugin solutions include:

1 - The intimate binding of content and formatting in the HTML document. This restricts the document to a single style (fixed colours, object layout, plugin choice etc.) and limits the amount of additional information that can be supplied without overwhelming the reader. More importantly it also means text marked up using HTML is effectively lost as far as machine readability is concerned.

2 - The use of legacy formats with implied syntax and the resulting multiplicity of standards. A single, human readable format is required, combining both textual and non textual information within a single document.

Additional considerations include scalability, validation, platform independence (particularly relevant online), portability and flexibility.¹⁶ Considering the present cost of computer hardware and improvements in communication bandwidths, file size is significantly less of a concern. Many legacy formats are designed to be extremely brief (described as terse) and hence have a large amount of implied syntax.

1.2 - Extensible Markup Language (XML)

In February 1998, the World Wide Web Consortium (W3C) published their recommendations for extensible markup language (XML)^{17 18} as a successor to, and superset of, HTML. XML isn't a markup language in its own right; instead it is a series of rules and conventions defining how to build discipline-specific markup languages (a so called 'meta-markup language'). XML is designed to allow the description of any structured data by the use of a user-defined set of tags. These tags (called elements in XML nomenclature) support precise declarations of a document's content and context using explicit syntax. A defined set of elements is called an XML language and these languages are far more flexible than HTML, since they can be tuned to a particular type of information. Once the language and the means to display it have been developed, non-textual markup becomes trivial.

Figure 1.2.1: Data flow in an XML/XSL transform (SVG Diagram) - an XML document can be transformed by one of many stylesheets depending on the required output.

All XML compliant languages use the same syntax (looking superficially similar to HTML) and can be manipulated using any XML tools or applications. Information marked up using different languages can be easily combined into a single document and manipulated at the same time. These different languages (e.g. MathML,¹⁹ CML,²⁰ SVG (Scalable Vector Graphics)⁵) are identified within the document by the use of discrete namespaces²¹ (described later) and are often glued together using XHTML (HTML following XML rules). An online XML paper might consist of XHTML text combined with SVG diagrams and CML structures or spectra.

Unlike HTML, XML elements do not describe a document's formatting and the browser is no longer able to comprehend and display the source directly (as it can with HTML). Instead the parser uses an external stylesheet, which contains formatting instructions for each element in the XML source. This stylesheet is supplied by either the author or the reader and effectively maps the contents of each element to an output. This would normally be an HTML page or some other text file but is not required to be (the elements can, for example, be mapped to an Acrobat file).²²

The older cascading stylesheets (CSS) are commonly used with HTML.²³ They provide a mechanism for centralising formatting and layout instructions from many pages to a single stylesheet. CSS works by allowing the author to redefine the meaning of existing HTML tags, e.g. <H1> can be defined as meaning 'size 11 and red'. This is useful when the style of an entire web site needs to be changed, since it only requires rewriting the stylesheet and not every HTML page. Much more sophisticated stylesheets can be written using XSL (extensible stylesheet language),²⁴ a powerful formatting, searching and scripting language. A significant proportion of this project involved the construction of a range of chemically useful stylesheets.

Since any number of stylesheets can be used with a single XML document, it can be displayed in a large variety of ways (Figure 1.2.1). The searching and scripting capabilities of XSL mean that element(s) can be picked out and displayed and/or manipulated according to their contents. Other information can be ignored if required, making data searches from large XML documents trivial. A paper containing a mixture of text and chemical information can be scanned for spectra by the stylesheet and these displayed using a suitable applet. Alternatively, a different stylesheet might be used to identify molecular structures and potentially calculate their properties.
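The selection idea described above can be sketched outside XSL. The following Python fragment is only an illustration, not one of the ChiMeraL stylesheets: the sample document, its content and the function name select_titles are assumptions made for this example. It shows two different 'views' being extracted from one unchanged source, in the way two stylesheets would each pick out the elements they care about.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed document: the element names follow the report's
# examples, but the content is invented for illustration.
DOC = """<document>
  <P>Some introductory text.</P>
  <molecule title="caffeine"/>
  <spectrum title="caffeine IR"/>
  <P>More text.</P>
  <spectrum title="caffeine NMR"/>
</document>"""


def select_titles(xml_text, element_name):
    """Return the title of every element with the given name, in document order."""
    root = ET.fromstring(xml_text)
    return [el.get("title") for el in root.iter(element_name)]


# Two views of the same source, analogous to applying two stylesheets.
spectra_view = select_titles(DOC, "spectrum")
molecule_view = select_titles(DOC, "molecule")
print(spectra_view)
print(molecule_view)
```

Everything not selected (here the <P> text) is simply ignored, which is what makes this kind of search cheap once the data are labelled by element name.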

Figure 1.2.2: Dynamic selection of stylesheets - this is a much simpler version of the ChiMeraL demonstration

Stylesheet transformations are carried out using software called an XML/XSL parser - this may be a dedicated program or a component built into an existing web browser (as with Internet Explorer 5.x).²⁵ Stylesheet selections can be defined by a URL (uniform resource locator - also called a web address) from within the document or selected using the parser. This allows a reader to use their own stylesheets. For example, they might dislike or be unable to use the default molecular viewer and would prefer one of their own. This flexibility makes stylesheets very powerful and is exemplified by the ChiMeraL demonstration¹ where sample CML data files containing molecules and spectra can be transformed with a range of XSL stylesheets.

The use of a unified XML format reduces the time wasted in processing legacy files and avoids the risk of serious information loss due to poor semantics. This "plug-compatible" XML approach guarantees a document to be searchable, sortable, mergeable, and printable with minimal extra cost. Ideally all future document and publishing systems will be transferred to using XML languages.

1.3 - Chemical Markup Language (CML 1.0)

The definition of CML version 1.0 was published last year by Peter Murray-Rust and Henry S. Rzepa.² It was developed to carry molecules, crystallographic data and reactions using an XML language and for the first time offers a universal, platform and application independent format for storing and exchanging chemical information. As the first generation of CML, it outlines a variety of general purpose 'data-holder' elements and a smaller number of more specifically chemical elements (e.g. <molecule>, <reaction>, <crystal>) used to indicate chemical 'objects'. For example, a <molecule> will contain a <list> of <atom>s, which in turn have three <float>s giving each atom's Cartesian coordinates.

This framework is flexible but leaves many areas open for later evolution. In particular, CML provides no default conventions for labelling data elements and puts few restrictions on element ordering. This is intentional as CML is designed to be generic and contains minimal preconceptions as to the type of chemical information that will be stored using it. Standardisation and conventions are the concern of the community (CML's users) and will be co-ordinated by publications and projects such as this one.

We have developed rigorous markup procedures for small and medium molecule structures and spectra using an additional element <spectrum> which is not part of CML 1.0. Exploratory work has been carried out into <reaction> markup and data linking within a CML/XHTML document. Simply defining such procedures is of limited use, so a variety of XSL template 'fragments' have also been developed. These allow molecules, spectra and reactions to be displayed using a variety of existing Java applets and can be used for the construction of stylesheets able to format mixed XML documents containing chemical information. These stylesheet fragments and examples are offered as resources to the CML community. The culmination of this work is the development of the first fully operational demonstration of CML parsing within a web browser.

The aim of this project is to raise the profile of CML and promote its acceptance and use in the chemical community. All work is offered as open source and is intended to help stimulate the development of new ideas, tools and resources in this area.

2 - XML, Namespaces and Schema

The markup syntax used for XML languages is superficially similar to that used for HTML (though significantly stricter).¹⁸ Both types of markup are written using ASCII and consist of data units contained within nested tag pairs (identifiable as an open <tag> and a close </tag>). In HTML, these tags are often used to describe formatting properties, e.g. <FONT SIZE="+1" COLOR="#FF0000">here is some <B>bold</B> text</FONT>. Since XML languages strictly separate formatting from semantic markup, they describe the meaning and context of their data as opposed to its formatting (e.g. <float title="melting point" units="degC">238</float>). In XML nomenclature, tag pairs are called 'elements' (not to be confused with chemical elements!) and an element's content consists of any attributes of that element, their values, and any text or sub-elements nested within the tag pair. Text or sub-elements within a parent element are described as children of that element and siblings of each other.

2.1 - XML Rules

The most important rules of XML syntax are as follows. A document that follows these is described as 'well formed'.

All tags must be closed to form an element e.g. <formula>C8 H10 N4 O2</formula> and must be correctly nested (in contrast with HTML, where incorrect nesting is normally ignored).
'Empty' elements with no children may be written using the shorthand <link/> = <link></link>
All attribute values must be within quotes (") e.g. <string title="CAS"> (this report will use a convention; @title = an attribute named 'title')
Element and attribute names are case sensitive and are normally written using lower case to distinguish them from HTML.
Code comments are of the form <!-- comment --> and are ignored by the parser.
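These rules can be checked mechanically. The sketch below uses Python's standard xml.etree.ElementTree parser purely as a stand-in for any conforming XML parser (the report itself relies on the parser built into Internet Explorer 5.x); the helper name is_well_formed and the sample fragments are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One fragment per rule listed above; the last four each break one rule.
samples = {
    "closed and nested": "<molecule><formula>C8 H10 N4 O2</formula></molecule>",
    "empty-element shorthand": "<molecule><link/></molecule>",
    "unclosed tag": "<molecule><formula>C8 H10 N4 O2</molecule>",
    "bad nesting": "<P><B><I>text</B></I></P>",
    "unquoted attribute": "<string title=CAS>58-08-2</string>",
    "case mismatch": "<Formula>C8 H10 N4 O2</formula>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well formed' if ok else 'NOT well formed'}")
```

Note that this tests well-formedness only; whether a well formed document is also valid CML is a separate question handled by a DTD or schema.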

Figure 2.1.1: Simple XML document - note the 'tree' structure

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"
	href="http://www.ch.ic.ac.uk/chimeral/publications/document.xsl" ?>
 
<document title="Simple example of an XML document" id="xmldoc_simpleEx"
	xmlns:xhtml="http://www.w3.org/1999/xhtml"
	xmlns:cml="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml">

<xhtml:P>This is a block of text formatted using XHTML (HTML following
 	XML rules). Normal <xhtml:I>italic</xhtml:I> and <xhtml:B>bold</xhtml:B>
 	elements can be used as can more complex <xhtml:FONT FACE="Helvetica"
 	COLOR="#000080">fonts and colours.</xhtml:FONT> Following this is a block
 	of molecular information marked up using CML. Additional blocks using
 	other XML languages could be added at will.</xhtml:P>

<cml:cml title="Properties of Caffeine" id="cml_caffeine">
   	<cml:molecule title="caffeine" id="mol_caffeine">
      	<cml:formula>C8 H10 N4 O2</cml:formula>
      	<cml:string title="CAS">58-08-2</cml:string>
      	<cml:float title="molecular weight">194.19</cml:float>
      	<cml:float title="melting point" units="degC">238</cml:float>
     	<cml:string title="comments">White powder or white glistening needles
		 usually melted together. LIGHT SENSITIVE</cml:string>
	  	<cml:list title="alternate names">
			<cml:string title="name">1,3,7-Trimethylxanthine</cml:string>
			<cml:string title="name">1,3,7-Trimethyl-2,6-dioxopurine</cml:string>
			<cml:string title="name">7-Methyltheophylline</cml:string>
		</cml:list>
	</cml:molecule>
</cml:cml>
</document>

An example of a short XML document is given in Figure 2.1.1 which contains both XHTML and CML marked up information. Those familiar with HTML will recognise the syntax but will find the element names unfamiliar. Notice how information can be stored both as an element child and also as the value of an attribute.

The structure of this document is best visualised as a branched 'tree' (similar to the directory structure on a hard-drive). It has a single 'root' (in this case <document>) with a number of sub-elements (<P> and <cml>, ignoring the prefix) and sub-sub-elements etc. defining the logical structure of the document. Each branch in the tree may contain information either as a text child or as attribute values. Each piece of information is therefore uniquely described by a list of its ancestors and this provides a convenient method for stylesheets to navigate the XML tree (called pattern matching).

Looking back at Figure 2.1.1 we can see that this document contains a chemical formula, CAS number, molecular weight, melting point and three alternate names for caffeine. In addition, there are two processing instruction tags (lines 1 and 2) that use the syntax <?tag_name_here?>. The first, <?xml version="1.0"?>, indicates that the document has been written to follow XML rules. The second, <?xml-stylesheet .. ?>, gives the URL of the default stylesheet. Unless given over-riding instructions (e.g. by a reader who wishes to use their own stylesheet) the parser will automatically obtain this stylesheet and use it to transform (format) the XML document. This transformation can be carried out locally, on a web server or using a combination of the two.
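As an illustration of how a program might discover the default stylesheet, the following Python sketch reads the <?xml-stylesheet ?> processing instruction using the standard xml.dom.minidom module. It is not how Internet Explorer does this internally; the function name default_stylesheet and the shortened document are assumptions for this example, while the URL is the one given in Figure 2.1.1.

```python
import re
from xml.dom import minidom

# Shortened version of Figure 2.1.1: only the prolog and an empty root element.
DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"
    href="http://www.ch.ic.ac.uk/chimeral/publications/document.xsl" ?>
<document title="Simple example of an XML document"/>"""


def default_stylesheet(xml_text):
    """Return the href of the first xml-stylesheet processing instruction, or None."""
    dom = minidom.parseString(xml_text)
    for node in dom.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The instruction's data is plain text, so the href
            # pseudo-attribute has to be picked out by pattern matching.
            match = re.search(r'href="([^"]*)"', node.data)
            if match:
                return match.group(1)
    return None


stylesheet_url = default_stylesheet(DOC)
print(stylesheet_url)
```

A reader's own tool could ignore the returned URL and substitute a different stylesheet, which is the over-riding behaviour described above.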

When writing an XML document, it is usual to collect data of similar types and 'pigeon hole' them together in a suitably logical structure. For example, the formula for caffeine is stored in branch \document\cml\molecule\formula. Figure 2.1.2 shows a simplified CML document tree.
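The idea that each piece of information is addressed by its list of ancestors can be made concrete with a short Python sketch. The document below is a cut-down version of Figure 2.1.1 with the namespace prefixes left out for brevity, and the function name leaf_paths is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down version of Figure 2.1.1, without namespace prefixes.
DOC = """<document>
  <P>Some text.</P>
  <cml title="Properties of Caffeine">
    <molecule title="caffeine">
      <formula>C8 H10 N4 O2</formula>
      <string title="CAS">58-08-2</string>
    </molecule>
  </cml>
</document>"""


def leaf_paths(element, prefix=""):
    """Yield (path, text) for every element that has no child elements."""
    path = prefix + "\\" + element.tag
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)


paths = list(leaf_paths(ET.fromstring(DOC)))
for path, text in paths:
    print(path, "=", text)
```

The backslash-separated notation simply mirrors the one used in the text; XSL patterns express the same ancestry with forward slashes.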

Figure 2.1.2: Illustrating the CML element 'tree' - (SVG Diagram) note how a molecule contains three lists, one of alternate names, one of atoms with their coordinates and one of bonds and their atom references (the atoms the bond is between)

2.2 - Namespacing

The attribute:

xmlns:cml="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml"

on <document> in Figure 2.1.1 is called an XML namespace declaration.²¹ XML rules allow for the mixing of XML languages within the same document and there is no mechanism to prevent potentially conflicting element names being used in two or more different languages. Indeed such a mechanism would greatly restrict development of XML. Instead, the author assigns each language a different namespace prefix, using the syntax: <namespace:element> and @namespace:attribute. The choice of prefix for any particular language is left to the author and these need only be unique over the document in question. A namespace declaration (normally found on the root element) is then used to index each prefix to a globally unique URL (often the language development site). Alternatively, it can point to a file containing a description of the language, as it does for the CML namespace.

There are two types of namespace declaration. The simplest is of the form @xmlns:mynamespace="http://www.uniqueURL.com" and declares any elements with a mynamespace: prefix as belonging to a language defined at 'http://www.uniqueURL.com'. It is assumed that attributes share the namespace of their element unless declared otherwise. A default namespace declaration, @xmlns="http://www.anotherURL.com", results in the element containing the declaration, and all descendants of that element, belonging to "http://www.anotherURL.com" unless otherwise stated. This avoids having to use a prefix in front of a large number of similar elements.
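The effect of the two kinds of declaration can be inspected with any namespace-aware parser. The Python sketch below uses xml.etree.ElementTree, which reports each element name as '{namespace}localname'. Apart from the XHTML namespace, the namespace URLs are placeholders invented for this example (they are not the ChiMeraL schema locations), as is the helper name namespace_of.

```python
import xml.etree.ElementTree as ET

# The example.org URLs are hypothetical placeholders for this sketch.
DOC = """<doc:document
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:doc="http://example.org/hypothetical-document-language">
  <P>Text in the default namespace, with <B>bold</B> inside.</P>
  <cml xmlns="http://example.org/hypothetical-cml">
    <molecule title="caffeine"/>
  </cml>
</doc:document>"""


def namespace_of(element):
    """Split ElementTree's '{uri}local' tag into (uri, local name)."""
    if element.tag.startswith("{"):
        uri, local = element.tag[1:].split("}", 1)
        return uri, local
    return "", element.tag


resolved = [namespace_of(el) for el in ET.fromstring(DOC).iter()]
for uri, local in resolved:
    print(f"{local:10s} -> {uri}")
```

The unprefixed <P> and <B> pick up the default namespace declared on the root, while <cml> re-declares the default so that it and its descendant <molecule> belong to a different language without needing a prefix.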

Figure 2.2.1: Namespaces used in a XHTML/CML document

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="document.xsl" ?>
<karne:document
	xmlns="http://www.w3.org/1999/xhtml"
	xmlns:karne="x-schema:http://www.ch.ic.ac.uk/chimeral/karne_schema_01.xml"
	xmlns:chimeral="x-schema:http://www.ch.ic.ac.uk/chimeral/spectrum_schema_ie_01.xml">

<karne:abstract>We report the conclusion of an 8 month project, culminating in the ..</karne:abstract>
	
<karne:chapter id="chap_introduction" title="Introduction">
<karne:index>1</karne:index>

<P>The <B>World-Wide-Web</B> was originally developed as a collaborative tool
for scientists, allowing for rapid distribution and publication of results and
greatly improved communication by email. It has become increasingly common for
academic papers and results to be posted online, this being an extremely ..</P>

</karne:chapter>

<cml title="cml data block" id="cml_moldata"
	xmlns="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml">
   	<molecule title="caffeine" id="mol_caffeine">
	..
	</molecule>
	<chimeral:spectrum title="Furan, tetrahydro-" id="spect_furantetrah_ir_1">
	..
	</chimeral:spectrum>
</cml>

</karne:document>

Figure 2.2.1 shows a document mixing XHTML, published CML 1.0,² ChiMeraL¹ CML (which also includes <spectrum>) and a simple document structure language (developed for this report). The four namespace declarations are as follows:

xmlns="http://www.w3.org/1999/xhtml": All elements default to the XHTML namespace (defined by the W3C) unless otherwise stated. There is no need to write stylesheet entries to transform this language; instead the stylesheet can be designed to pass these unrecognised elements directly to the browser (which already knows how to format XHTML). This 'bypass' avoids unnecessary stylesheet templates and makes stylesheet development significantly quicker.
xmlns:chimeral="x-schema:http://www.ch.ic.ac.uk/chimeral/spectrum_schema_ie_01.xml": One element in <cml>, namely <spectrum>, is not in the default CML namespace and requires a namespace of its own - chimeral:. As with the karne: and CML declarations, the URL points to an XML schema for this language.
xmlns:karne="x-schema:http://www.ch.ic.ac.uk/chimeral/karne_schema_01.xml": This language describes the logical structure of the document e.g. <abstract>, <chapter>, <appendix>.
xmlns="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml": This declaration is found on <cml> and changes the default namespace for that element and its children; anything within <cml> is CML unless declared otherwise.

2.3 - Schema and DTD

The URL in a namespace declaration may point to a file describing the XML language. This can be a simple HTML page but a more powerful method is to point to a file defining the language using a machine readable syntax. This allows the parser to automatically validate (check) the document against the language(s) it uses, before transforming it with a stylesheet. The parser can also be informed if elements belong to a particular data-type (string, integer, etc.) or if 'special' attributes exist (e.g. unique @id or @href links) and validate these also. If a document is correct XML (well formed) but does not comply with the definitions of the languages it uses, it is described as non-valid and cannot be parsed. For example, <float builtin="CAS">58-08-2</float> is well formed XML but is not valid CML since <float> may only contain a single floating point number.
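The distinction between well formed and valid can be illustrated with a deliberately minimal check. The Python sketch below is not a schema validator and does not read the CML schema; it hard-codes the single rule quoted above (a <float> holds one floating point number) and uses Python's own float() conversion as an approximation of that data-type. The function name invalid_floats is an assumption for this example.

```python
import xml.etree.ElementTree as ET


def invalid_floats(xml_text):
    """Return the text of every <float> whose content is not a single number."""
    bad = []
    for el in ET.fromstring(xml_text).iter("float"):
        text = (el.text or "").strip()
        try:
            float(text)  # approximation of the schema's float data-type
        except ValueError:
            bad.append(text)
    return bad


# Both documents are well formed; only the first satisfies the <float> rule.
valid_doc = '<molecule><float title="melting point" units="degC">238</float></molecule>'
invalid_doc = '<molecule><float builtin="CAS">58-08-2</float></molecule>'

print(invalid_floats(valid_doc))
print(invalid_floats(invalid_doc))
```

A real validating parser derives such rules from the DTD or schema rather than from code, so that the same parser can check any language for which a description is supplied.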

There are two alternative standards for writing machine readable descriptions. In both cases, the file lists all valid elements and attributes for the language and can declare restrictions on element content and child ordering. The older standard is called DTD (document type definition) and uses a specialised machine readable (but not particularly human readable) syntax. DTDs have wide support amongst XML tools and parsers but are being superseded by XML schemas.⁸ These are written using an XML language defined by the W3C and can therefore be parsed with the same tools as XML documents and XSL stylesheets. Schemas allow more sophisticated descriptions of valid element trees and they are significantly more human readable. Parsers can recognise a schema referenced in a namespace declaration and automatically validate the document against it.⁹ If a document consists of a mixture of XML languages, the use of a schema for each allows more complex manipulations (for example the linking of data between languages). Since the modular structure of XML allows (and indeed expects) the addition and extraction of information 'objects' using many languages, this flexibility is very important.

The CML schema (version 0.2) was developed from the published DTD as part of this project. Two versions have been made available: the first is a generic XML schema, the second is optimised for Internet Explorer 5 and includes additional data-type and linking declarations. A full description of this schema can be found in Appendix C. Since <spectrum> is not included in the CML 1.0 DTD, an additional namespace and schema have been constructed for it. It is expected that additional elements will be required as further areas of chemical markup are developed and it is suggested that this approach be maintained. Once a particular addition has been reviewed and accepted by the community, it can then be incorporated into a future version of published CML and included in the schema. The CML schema is presently at http://www.ch.ic.ac.uk/chimeral; this is likely to change to http://www.xml-cml.org²⁰ as a permanent CML namespace.

2.4 - Platform Choice

There is a wide variety of XML parsers presently under development,²⁶ many written using the programming language Java.²⁷ Java programs are run within a software 'virtual machine' and not directly on a computer's operating system. Since this software has been written for most operating systems (Windows, Mac OS, Linux, Unix, etc.), Java programs can be ported between these platforms with minimal effort. This platform-independence means Java is popular and versatile but programs written using it can be non-trivial to install and require the virtual machine to run. It was the aim of this project to build a fully working online demonstration of real time CML parsing and transforming using a ubiquitous and familiar interface - namely the web browser. This restricts functionality (Java parsers are significantly more advanced) but allows CML examples to be run using software already installed on a large number of computers and requires minimal knowledge of XML from the user.

The choice of browser is somewhat restricted. Later versions of Windows Internet Explorer 4.x have some limited XML support and this is much improved in version 5 onwards, which includes full DOM (Document Object Model),²⁸ XSL and XQL (extensible query language) support. Netscape 6 (based on the open source Mozilla project)⁵¹ supports XML but is only able to display it using CSS stylesheets. It is hoped that XSL will be included in the next release. No other common browsers support XML to any reasonable degree. This being the case, Internet Explorer 5.x is presently the only suitable client for online XML applications. Its powerful scripting support, allowing JavaScript to control the loading and parsing of XML documents, merely adds weight to its choice.

3 - Chemical Markup Language

In keeping with the XML philosophy, CML is designed to be extensible and will be published as a series of drafts. The first draft (CML 1.0)² contains the core components of this new language and defines two main types of markup: data holders and chemical components.

3.1 - Data Holders

These are generic elements designed to contain the standard data types required by chemistry. These include strings, integers, floating point numbers and a series of array and matrix elements. They do not usually contain sub-elements, only text, and their context is described by the values of their attributes (particularly @title); e.g. <float title="melting point" units="degC">238.2</float> contains the melting point data: 238.2 degrees Centigrade.

The values of these attributes are not defined in the DTD or schema so any data can be contained simply by choosing them appropriately. This avoids the need to define large numbers of specific elements within CML. To include a molecule's melting point, you could define an additional element <meltPoint> but this would not be recognised and formatted by a generic CML stylesheet. It is possible to build new stylesheets including the new element, but this reduces their compatibility. The better approach is to use the existing <float> and add @title="melting point" to give it a label. A stylesheet can then display this as 'melting point: ...' without having to recognise the attribute value directly (Figure 3.1.1).

Figure 3.1.1: Simple stylesheet transform (SVG Diagram)
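The benefit of generic data holders can be sketched in a few lines of Python standing in for a generic stylesheet. The renderer below knows only the data-holder element names, never the individual @title values, yet labels every property correctly. The set DATA_HOLDERS (including the element name 'integer') and the function name render are assumptions made for this illustration; the data come from the tetrahydrofuran example used in this report.

```python
import xml.etree.ElementTree as ET

CML = """<molecule title="Furan, tetrahydro-" id="mol_thf_1">
  <formula>C4 H8 O</formula>
  <string title="CAS">109-99-9</string>
  <float title="melting point" units="degC">-108.3</float>
  <float title="boiling point" units="degC">65</float>
</molecule>"""

# Assumed set of generic data-holder element names for this sketch.
DATA_HOLDERS = {"string", "float", "integer"}


def render(xml_text):
    """Format each data holder as 'title: value [units]' without inspecting the title."""
    lines = []
    for el in ET.fromstring(xml_text):
        if el.tag in DATA_HOLDERS:
            label = el.get("title", el.tag)
            units = el.get("units")
            value = (el.text or "").strip()
            lines.append(f"{label}: {value}" + (f" {units}" if units else ""))
    return lines


rendered = render(CML)
print("\n".join(rendered))
```

Adding a new property such as a flash point would need no change to this code, whereas a dedicated <meltPoint>-style element would require a new rule for every property.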

3.2 - Chemical Components

Chemical components represent objects of chemical interest: molecules, atoms, bonds etc. and these elements normally contain a number of data holders. CML 1.0 defined the basic components for molecular structure and crystallographic markup but the extension of CML to other areas of chemistry requires additional elements. The number of these should be kept to a minimum and new elements must be described as an extension to CML with their own DTD or schema. In most cases each chemical component should also be supplied with a unique identity @id. A molecule containing chemical property information is shown in Figure 3.2.1 (in order to simplify the code, processing instructions, <document> tags and namespace declarations have been omitted in the following examples).

Figure 3.2.1: A simple CML block containing property information

<cml title="tetrahydrofuran" id="cml_THF">
	<molecule title="Furan, tetrahydro-" id="mol_thf_1">
		<formula>C4 H8 O</formula>
		<string title="CAS">109-99-9</string>
		<float title="molecular weight">72.1066</float>
		<string title="ACX">I1001473</string>
		<string title="DOT">UN 2056</string>
		<string title="RTECS">LU5950000</string>
		<float title="melting point" units="degC">-108.3</float>
		<float title="boiling point" units="degC">65</float>
		<float title="specific gravity">0.886</float>
		<float title="water solubility" convention="g/100 mL at 23 degC">30</float>
		<string title="comments">
		Colorless liquid with an ether-like odor detectable at 2 to 50 ppm.
		HYGROSCOPIC</string>
		<list title="alternate names">
			<string title="name">THF</string>
			<string title="name">1,4-Epoxybutane</string>
			<string title="name">Butylene oxide</string>
			<string title="name">Cyclotetramethylene</string>
			<string title="name">tetramethylene oxide</string>
			<!-- etc -->
		</list>
	</molecule>
</cml>

Two other important elements used in CML are <list>, which allows for collections of similar data holders, and <link>, which supports linking between data in the same or different documents. Common attributes are @title, @convention, @units and @builtin. The latter uses a series of fixed values predefined in the DTD (e.g. xyzFract, elementType, atomId), but values for the rest depend on convention and published examples. Attempts should be made to use values compatible with those used by others in the field, and it is expected that published examples and online projects will help coordinate this. In a similar manner, while the DTD defines the element names that should be used when marking up a chemical object, the tree structure is left to convention. It is likely that many of these conventions will be based on existing file formats (e.g. MDL's .mol), simply because this makes conversions to and from CML easier. At a later stage we would expect CML to become a standard export format for chemical software packages.

3.3 - Displaying CML Using a Stylesheet

The CML block in Figure 3.2.1 contains only trivial CML property information. A more complex example might contain molecular structures (2D and 3D), spectra (ir, uv/vis, nmr or ms) and possibly a reaction scheme. Techniques for marking up and displaying these more complex components are covered later, but to illustrate how an XSL stylesheet is written a simple example is explained in full (Figure 3.3.1, which transforms Figure 3.2.1).

XSL follows XML rules and therefore requires a namespace declaration in the same way as any other XML document. Internet Explorer recognises http://www.w3.org/TR/WD-xsl²⁵ as an XSL namespace declaration and the ability to understand this language is built into the parser. XSL uses a limited set of 'commands' (approximately 15) which can be recognised by their xsl: prefix. All text and all elements not part of the XSL namespace are passed directly to the output (in this case, the browser). The two exceptions are comments (<!-- ... -->), which the parser ignores, and the contents of the <xsl:eval> element, which are used for scripting and calculations within the stylesheet.

Figure 3.3.1: Simple XSL stylesheet (HTML table)

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<!-- root -->
<xsl:template match="/">
<xsl:apply-templates select="*"/>
</xsl:template>

<!-- match <cml> and build HTML page -->
<xsl:template match="cml">
<HTML>
	<HEAD><TITLE>
	<xsl:value-of select="@title"/> - <xsl:value-of select="@id"/></TITLE></HEAD>
	<BODY>
	<xsl:apply-templates select="molecule"/>
	</BODY>
</HTML>
</xsl:template>
		
<!-- build html table -->
<xsl:template match="molecule">
	<!-- Pull out @id="" etc -->
	<TABLE><TR>
		<TD>Molecule ID:</TD>
		<TD>Formula:</TD>
		<TD>CAS:</TD>
		<!-- etc. -->
		</TR><TR>
		<TD><xsl:value-of select="@id"/></TD>
		<TD><xsl:value-of select="formula"/></TD>
		<TD><xsl:value-of select="*[@title = 'CAS']"/></TD>
		<!-- etc. -->
		</TR><TR>
		<TD>Alternate Names:</TD>
		<TD COLSPAN="6">
		<xsl:for-each select="list[@title = 'alternate names']/string[@title='name']">
		<xsl:value-of select="text()"/>, </xsl:for-each>
		</TD>
		</TR>
		<!-- etc -->
	</TABLE>
</xsl:template>
</xsl:stylesheet>

The stylesheet contains a series of <xsl:template>s, which in turn consist of a mixture of XSL commands and template text/tags. At the simplest level, each template refers to the particular element in the XML source that matches the template's @match value. Therefore, <xsl:template match="molecule"> marks the start of a template for the <molecule> element. The parser navigates through the XML document tree, starting at the document root and selecting matching templates at each branching. Navigation through the tree is controlled by the @select value of <xsl:apply-templates select="..."/>, and branches can be parsed or ignored as required. Matching and selection of elements is handled by a powerful pattern matching language which refers to the document tree using a Unix-like syntax (a full description of pattern matching can be found on the W3C²⁴ and MSDN²⁵ web-sites).

The example shown here (Figure 3.3.1) contains three templates. The first template (@match="/" - the document root) starts the parsing process at the first level element in the document (<cml> in this example). A similar template is required for all XSL stylesheets. The second template (@match="cml") builds the start of a skeleton HTML page and puts the values of @title and @id within the TITLE tags. <xsl:apply-templates select="molecule"/> instructs the parser to select the child element(s) <molecule> and to find a template for them. Once that branch has been completed, the parser returns to the second template and completes the HTML page:

<HTML><HEAD><TITLE>tetrahydrofuran - cml_THF</TITLE></HEAD><BODY></BODY></HTML>

The third template recognises the chemical properties within the <molecule> and formats their values as a simple HTML table. Note how the alternate names are handled: the contents of <xsl:for-each select="list[@title = 'alternate names']/string[@title='name']"> are repeated each time the pattern value of @select matches an element in the XML document. The output from the complete stylesheet is shown in Figure 3.3.2.

Figure 3.3.2: Displaying CML block as a simple HTML table

Molecule ID:	Formula:	CAS:	ACX:	DOT:	RTECS:	MW:
mol_thf_1	C4 H8 O	109-99-9	I1001473	UN 2056	LU5950000	72.1066
MP: -108.3 degC	BP: 65 degC	Spec. Gravity: 0.886		Water Sol.: 30 g/100 mL at 23 degC
Alternate Names:	THF, 1,4-Epoxybutane, Butylene oxide, Cyclotetramethylene, tetramethylene oxide,
Comments:	Colorless liquid with an ether-like odor detectable at 2 to 50 ppm. HYGROSCOPIC
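
The pattern matching used by the second and third templates can also be tried outside the browser. The sketch below is an illustration only: it assumes Python's standard `xml.etree.ElementTree`, whose limited path syntax happens to accept the same `[@title='...']` predicates, and it is not part of the ChiMeraL stylesheets. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the tetrahydrofuran CML block.
CML = """
<cml title="tetrahydrofuran" id="cml_THF">
	<molecule title="Furan, tetrahydro-" id="mol_thf_1">
		<formula>C4 H8 O</formula>
		<string title="CAS">109-99-9</string>
		<list title="alternate names">
			<string title="name">THF</string>
			<string title="name">1,4-Epoxybutane</string>
		</list>
	</molecule>
</cml>
"""

root = ET.fromstring(CML)

# Second template: page title built from the <cml> attributes.
page_title = f"{root.get('title')} - {root.get('id')}"

# Third template: pick values by element name or by @title.
molecule = root.find("molecule")
names = molecule.findall("list[@title='alternate names']/string[@title='name']")
row = {
    "Molecule ID": molecule.get("id"),
    "Formula": molecule.findtext("formula"),
    "CAS": molecule.findtext("*[@title='CAS']"),
    "Alternate Names": ", ".join(name.text for name in names),
}

print(page_title)
for label, value in row.items():
    print(f"{label}\t{value}")
```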

3.4 - Display of CML using Applets

More complex chemical data cannot easily be displayed using text, and a more sophisticated solution is required. A hypothetical CML-aware parser might have this functionality built in, being able to recognise chemical components and display them appropriately. Using existing parsers, the problem is analogous to that encountered with normal HTML pages. Browser plugins and Java applets are used to provide additional display functionality, and a wide range of these add-on programs has been developed, some of them very advanced. An efficient solution is to use a combination of stylesheets and applets (adapted if necessary) to display the CML components contained within a mixed XML document. Using existing software is preferable, as this allows faster development of a working application and is in the spirit of XML's reusability.

Figure 3.4.1: Applet code produced by the stylesheet and its display

<APPLET NAME="jSpec" CODE="Visua.class"
	WIDTH="400" HEIGHT="250">
<PARAM NAME="SOURCE" VALUE="
	##TITLE=Furan, tetrahydro-
	##JCAMP-DX=4.24
	##DATA TYPE=MASS SPECTRUM
	##CAS REGISTRY NO=109-99-9
	##EPA MASS SPEC NO=61352
	##MOLFORM=C4H8O
	etc.">
</APPLET>

Both plugins and Java applets are primarily intended to read external data files, normally in a legacy format (.mol, .pdb, .dx etc.), and to embed a graphical display of this data within an HTML page. In both cases, special tags are required in the page.

CML is normally embedded into a mixed XML document, and therefore it is not easily accessible by a plugin. Applets are more flexible and can receive data in a wider variety of ways, which makes them significantly more useful for XML display (until a plugin is written to read CML natively, as has been done for SVG). Stylesheets can be written to recognise CML components and build appropriate <APPLET> tags to display them. Since all applets are designed to read legacy formats, the data within the components must also be transformed (called a CML to legacy conversion). The transformed data string can then be passed to the applet in two different ways (depending on which the applet supports). The easiest technique is to place the string within the applet's <PARAM> tag - see Figure 3.4.1.

An alternative technique is to place the transformed string within a hidden <INPUT>. These are normally used for HTML forms and cannot be seen by the reader. Here it becomes a convenient holder for the transformed string, and JavaScript is then used to call a public function on the applet and pass it the contents of the INPUT. The JavaScript can be triggered by an HTML button or when the page has completely loaded into the browser. This is less direct than using <PARAM>s but allows more powerful interactions between the page and the applet, since other functions can also be made accessible to JavaScript. This approach is used for the JME applet described in the next chapter (Figure 4.3.1).²⁹

Figure 3.4.3: Data flow in an XML/XSL system (SVG diagram)

A selection of applets has been chosen according to their suitability for XML integration. It is desirable that the applets be small (faster downloads), flexible, easy to use and open source (since this allows changes to be made to the applet code if required). XSL 'template fragments' have been built for each of these applets and for several common CML to legacy conversions (available from http://www.ch.ic.ac.uk/chimeral). Existing stylesheets can be upgraded to comprehend CML by the simple addition of these fragments. By combining these stylesheets with applets and the schema, complete CML applications can be built (Figure 3.4.3).

4 - Molecular Structures

The most commonly used file formats for molecular structures are MDL's .mol¹³ and the Brookhaven protein database .pdb. As its name suggests, the Brookhaven format is normally used for large molecules or proteins and can be extremely complicated (beyond the scope of this project). Small and medium sized molecules (< 500 atoms) often use the .mol format, an example of which is shown in Figure 4.1.1.

4.1 - MDL .mol

This uses a strict new-line delimited structure with a heavily implied syntax. As a result it is very sensitive to the amount of white (empty) space on each line. This can cause serious problems when converting between this format and CML. The format has the advantage of being extremely terse, particularly when compressed (relevant when it was developed but much less so now). Other file formats tend to use rather similar syntaxes (e.g. .xyz lacks the additional atom columns and the bond data).

Figure 4.1.1: Ethanol using the .mol format

rows 1-3:: Comment lines; sometimes these contain a title, a filename and the source of the file, but there is no common convention. These lines are normally discarded on converting to CML (note line 3 is blank).
row 4:: The first number is the total number of atoms in the molecule, the second is the total number of bonds (hydrogen atoms, and bonds to them, are sometimes neglected).
rows 5-13:: Each row refers to an atom and gives (in order) Cartesian coordinates x, y and z (in Angstrom) and the element type (as its periodic table symbol). The remaining columns are used for electrons, charges etc. and are not yet included in CML, since mol files rarely use them.
rows 14-21:: Each row refers to a bond. The first two numbers refer to the atoms the bond is between (hence 2 3 means a bond between the second and third atoms in the list above) and the third number refers to the empirical bond order (2 = double bond etc.). Again, the remaining columns are rarely used.
row 22:: M END indicates the end of the molecule. Files that contain multiple .mol molecules (e.g. reaction .rxn or .sd files) also include a delimiter between molecules ($MOL or $$$$).
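
The row layout described above can be sketched as a small reader. This is a minimal Python illustration, not the converter used in this project: the names `parse_mol`, `Atom` and `Bond` are hypothetical, the column widths (three characters per count, ten per coordinate) are assumed from the common MDL layout, and the sample molecule is a hand-written water-like example with illustrative coordinates rather than the ethanol file of Figure 4.1.1.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    x: float
    y: float
    z: float
    element: str


class Bond(NamedTuple):
    atom1: int  # 1-based position in the atom block
    atom2: int
    order: int


def parse_mol(text):
    """Split a .mol block into atoms and bonds using fixed-width columns."""
    rows = text.split("\n")
    counts = rows[3]  # row 4: atom count, then bond count
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atom_rows = rows[4:4 + n_atoms]
    bond_rows = rows[4 + n_atoms:4 + n_atoms + n_bonds]
    atoms = [
        Atom(float(r[0:10]), float(r[10:20]), float(r[20:30]), r[31:34].strip())
        for r in atom_rows
    ]
    bonds = [Bond(int(r[0:3]), int(r[3:6]), int(r[6:9])) for r in bond_rows]
    return atoms, bonds


def atom_row(x, y, z, element):
    # Build a row with exact widths so the sample cannot drift out of column.
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {element:<3} 0" + "  0" * 11


def bond_row(a1, a2, order):
    return f"{a1:3d}{a2:3d}{order:3d}" + "  0" * 4


# Hand-written sample (illustrative coordinates, not measured data).
SAMPLE = "\n".join([
    "water-like example",      # rows 1-3: free-form comment lines
    "  hand-written sample",
    "",
    f"{3:3d}{2:3d}" + "  0" * 8 + "999 V2000",
    atom_row(0.0, 0.0, 0.0, "O"),
    atom_row(0.9572, 0.0, 0.0, "H"),
    atom_row(-0.24, 0.9266, 0.0, "H"),
    bond_row(1, 2, 1),
    bond_row(1, 3, 1),
    "M  END",
])

atoms, bonds = parse_mol(SAMPLE)
print(atoms)
print(bonds)
```

Slicing by column rather than splitting on white space reflects the sensitivity described in section 4.1: when a count or coordinate fills its whole field, neighbouring values touch and a white-space split would misread them.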

4.2 - CML: molecule

CML molecular structures are reminiscent of the .mol format but with the data being fully marked up and explicitly defined. Implied syntax is reduced to a minimum and conventions are declared not assumed. The aim is to make each chemical component (coordinate, atom, bond etc.) completely separable from the rest of the document, allowing components to be easily added or removed without destroying the document tree. Since an XML document might contain a very large number of molecules (examples containing over 600 molecules have been built), each component requires a unique @id. A large collection of mol files can be easily converted to CML and then concatenated to a single XML document. A stylesheet can then pick out a single molecule and display its structure by searching against @id.

Figure 4.2.1: Example of small molecule markup - see Appendix B for alternative methods

..
<molecule title="ethanol" id="mol_ethanol">
	<formula>C2 H6 O</formula>
	<string title="CAS">64-17-5</string>
	<!-- etc -->
	<list title="atoms">
		<atom id="ethanol_a_1">
			<integer builtin="atomId">1</integer>
			<float builtin="x3" units="A">1.0303</float>
			<float builtin="y3" units="A">0.8847</float>
			<float builtin="z3" units="A">0.9763</float>
			<string builtin="elementType">C</string>
		</atom>
		<!-- eight further atoms -->
	</list>
	<list title="bonds">
		<bond id="ethanol_b_1">
			<integer title="bondId">1</integer>
			<integer builtin="atomRef">1</integer>
			<integer builtin="atomRef">2</integer>
			<integer builtin="order" convention="MDL">1</integer>
		</bond>
		<!-- seven further bonds -->
	</list>
</molecule>
..

As with .mol, lists of atoms and bonds are used. Atoms contain either 3 (x3, y3, z3) or 2 (x2, y2) Cartesian coordinates, an atom type and an 'atomId'. This is a non-unique label referenced by the bond's 'atomRef' values and is not the same as the atom's unique @id (hence <integer builtin="atomRef">1</integer> refers to an atom with <integer builtin="atomId">1</integer>, not @id="1"). The same applies to 'bondId'. While not optimal, this greatly eases the conversion of CML to and from legacy formats. Coordinate units are no longer implied and, since this molecule was converted from a .mol file, the bond order convention is 'MDL'.
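
The distinction between 'atomId' and @id can be made concrete with a short lookup. The Python sketch below is an illustration under stated assumptions: it uses the standard `xml.etree.ElementTree` parser, the variable names are hypothetical, and the second atom (elided in Figure 4.2.1) is assumed to be a carbon purely for the example.

```python
import xml.etree.ElementTree as ET

CML = """
<molecule title="ethanol" id="mol_ethanol">
	<list title="atoms">
		<atom id="ethanol_a_1">
			<integer builtin="atomId">1</integer>
			<string builtin="elementType">C</string>
		</atom>
		<atom id="ethanol_a_2">
			<integer builtin="atomId">2</integer>
			<!-- element assumed for illustration; the figure elides this atom -->
			<string builtin="elementType">C</string>
		</atom>
	</list>
	<list title="bonds">
		<bond id="ethanol_b_1">
			<integer title="bondId">1</integer>
			<integer builtin="atomRef">1</integer>
			<integer builtin="atomRef">2</integer>
			<integer builtin="order" convention="MDL">1</integer>
		</bond>
	</list>
</molecule>
"""

molecule = ET.fromstring(CML)

# Index atoms by their (non-unique) atomId label, not by their unique @id.
by_atom_id = {
    atom.findtext("integer[@builtin='atomId']"): atom
    for atom in molecule.findall("list[@title='atoms']/atom")
}

# Resolve each bond's atomRef values to the @id of the atoms they point at.
resolved = []
for bond in molecule.findall("list[@title='bonds']/bond"):
    refs = [ref.text for ref in bond.findall("integer[@builtin='atomRef']")]
    resolved.append((bond.get("id"), [by_atom_id[ref].get("id") for ref in refs]))

print(resolved)
```

Because 'atomId' is only unique within one molecule, the lookup table has to be built per <molecule>; a document-wide search would need the unique @id instead.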

Additional information (formula, CAS number) is not extracted from the mol file (while such information might be supplied using comments, it is not automatically identifiable) but from various online databases, in particular ChemFinder.³⁰ The author may decide to include any information they wish in this way, simply by adding additional elements at appropriate places in the CML.

Note that all applets and code blocks are being converted from CML 'on the fly' by the browser as this document loads. If an applet appears to have failed, please refresh the page.

4.3 - Java Molecular Editor (JME)

The JME (Java Molecular Editor) applet has been developed as part of an online chem-informatics system at Novartis Crop Protection AG in Basel.²⁹ It is a simple and extremely small (31K) 2D editor, incorporating a sophisticated SMILES calculator (SMILES is a standard for producing text descriptions of a molecule's structure). Rather than include filters for legacy file formats and greatly increase the applet's size, JME reads a simple coordinate and bond string. The syntax used is unique to JME and, since there is no way to calculate it dynamically from a standard data file, it is a problem for HTML solutions. Using CML, this problem is solved with a trivial stylesheet transformation. This makes the applet excellently suited for our purposes (Figure 4.3.1).

Figure 4.3.1: JME editor with JavaScript integration - toolbars used for editing, left click/drag to translate, right click/drag to rotate

JavaScript is able to access functions within the applet. Click a button to see a demonstration of this

Figure 4.3.3: JME stylesheet fragment - reformatted for clarity

..
<!-- match <cml> and build HTML page -->
<xsl:template match="cml">
<HTML>
	<HEAD><TITLE>
	<xsl:value-of select="@title"/> - <xsl:value-of select="@id"/></TITLE></HEAD>
	<BODY ONLOAD="document.JME.readMolecule(jmeoutput.value)">
	<xsl:apply-templates select="molecule"/>
	</BODY>
</HTML>
</xsl:template>
..
<!-- match molecule and display using JME -->
<xsl:template match="molecule">
	<APPLET CODE="JME.class" NAME="JME" ARCHIVE="JME.jar" WIDTH="400" HEIGHT="300">
	You have to enable Java and JavaScript on your machine !
	</APPLET>
	<!-- hidden form element contains the JME data string -->
	<xsl:element name="INPUT">
		<xsl:attribute name="NAME">jmeoutput</xsl:attribute>
		<xsl:attribute name="TYPE">hidden</xsl:attribute>
		<xsl:attribute name="VALUE">
			<!-- select last atom and calculate its number -->
			<xsl:for-each select="list/atom[end()]" xml:space="preserve">
				<xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval> 
			</xsl:for-each>
			<!-- select last bond and calculate its number -->
			<xsl:for-each select="list/bond[end()]" xml:space="preserve">
				<xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval>
			</xsl:for-each>
			<!-- select each atom and extract coordinates and atom type -->
			<xsl:for-each select="list/atom" xml:space="preserve"> 
				<xsl:value-of select="*[@builtin = 'elementType']"/> 		
				<xsl:value-of select="*[@builtin = 'x3']"/> 
				<xsl:value-of select="*[@builtin = 'y3']"/>
			</xsl:for-each>
			<!-- select each bond and extract atom refs and bond order -->
			<xsl:for-each select="list/bond" xml:space="preserve">			
					<xsl:value-of select="*[@builtin = 'atomRef']"/>
				<xsl:value-of select="*[@builtin = 'order']"/>
			</xsl:for-each>
		</xsl:attribute>
	</xsl:element>
</xsl:template>
..

The string read by JME uses the syntax:

n_atoms n_bonds {atomic_symbol x_coord y_coord}per atom {atom1 atom2 bond_order}per bond

This string is built by the stylesheet and held in a hidden <INPUT> labelled with the name 'jmeoutput'. A JavaScript 'onLoad' function is also added to the HTML's <BODY> tag, and this is called when the page has been completely loaded into the browser. The script takes the content of 'jmeoutput' and passes it to a public 'readMolecule' function on the applet, which proceeds to display it. The 'Get SMILES' and 'Get JME Source' buttons work in a similar fashion. Stereoisomers and simple reactions can also be displayed using JME but have not yet been incorporated into CML.
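
The string syntax above can be reproduced in a few lines for checking a conversion by hand. The following Python sketch is an illustration only, assuming the standard `xml.etree.ElementTree` parser; the function name `jme_string` is hypothetical, and the coordinates of the second atom are invented for the example (only the first atom appears in Figure 4.2.1).

```python
import xml.etree.ElementTree as ET

CML = """
<molecule title="ethanol" id="mol_ethanol">
	<list title="atoms">
		<atom id="ethanol_a_1">
			<integer builtin="atomId">1</integer>
			<float builtin="x3" units="A">1.0303</float>
			<float builtin="y3" units="A">0.8847</float>
			<string builtin="elementType">C</string>
		</atom>
		<atom id="ethanol_a_2">
			<integer builtin="atomId">2</integer>
			<!-- illustrative coordinates, not taken from the figure -->
			<float builtin="x3" units="A">2.5303</float>
			<float builtin="y3" units="A">0.8847</float>
			<string builtin="elementType">C</string>
		</atom>
	</list>
	<list title="bonds">
		<bond id="ethanol_b_1">
			<integer builtin="atomRef">1</integer>
			<integer builtin="atomRef">2</integer>
			<integer builtin="order" convention="MDL">1</integer>
		</bond>
	</list>
</molecule>
"""


def jme_string(molecule):
    """Build 'n_atoms n_bonds {symbol x y} {atom1 atom2 order}' from CML."""
    atoms = molecule.findall("list/atom")
    bonds = molecule.findall("list/bond")
    parts = [str(len(atoms)), str(len(bonds))]
    for atom in atoms:
        parts.append(atom.findtext("*[@builtin='elementType']"))
        parts.append(atom.findtext("*[@builtin='x3']"))
        parts.append(atom.findtext("*[@builtin='y3']"))
    for bond in bonds:
        parts.extend(ref.text for ref in bond.findall("*[@builtin='atomRef']"))
        parts.append(bond.findtext("*[@builtin='order']"))
    return " ".join(parts)


jme = jme_string(ET.fromstring(CML))
print(jme)
```

As in the stylesheet fragment, the z coordinate is simply dropped; whether that projection gives a readable 2D drawing depends on the molecule.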

Some additional XSL commands should be explained at this point: <xsl:element> and <xsl:attribute> create new XHTML/XML tags in the output and are used as an alternative to writing the tags directly. For example, <xsl:element name="INPUT"><xsl:attribute name="NAME">jmeoutput</xsl:attribute></xsl:element> is equivalent to <INPUT NAME="jmeoutput"/>. They are used to build complex tags that might otherwise cause the stylesheet to become invalid XML.

The <xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval> command uses IE 5 specific functions to calculate the position of an element with respect to its siblings and return this as a formatted number. It is used in almost all the stylesheet fragments, normally to count the total numbers of atoms and bonds in the molecule. Non-IE stylesheets would use different functions depending on the parser they are designed for.

4.4 - Jmol 3D Viewer

Jmol³¹ is a stand alone Java application (not an applet) being developed by the Open Science project.³² It is designed to read and display molecules using a large variety of legacy formats. Its functionality is similar to the Chime plugin but it is open source and platform independent. An experimental applet has been released and this has been adapted for use in this project. A stylesheet converts the CML <molecule> to a % delimited .xyz string, which is then stored and transferred to the applet in a manner similar to JME. The '%' delimiters must be added because the converted string cannot replicate the line breaks required in the .xyz format (similar to .mol). The applet has been slightly rewritten to recognise % as equivalent to a new line.

Figure 4.4.1: Jmol 3D viewer and .xyz format - click/drag to rotate, shift-c/d to zoom, ctrl-c/d to translate, TAB changes drawing style, L adds atom labels

Problems also occur with white space: .xyz and .mol require strict space handling to ensure their data remains in straight columns. Both formats use right aligned text, in contrast to the XML/HTML standard of left alignment:

right aligned:
	C  -1.7560   0.0000   0.3080%
	C -10.3600   3.0760   5.2880%
	H   8.1400 -11.1520   7.5080%
left aligned:
	C   -1.7560   0.0000   0.3080%
	C   -10.3600   3.0760   5.2880%
	H   8.1400   -11.1520   7.5080%

In both cases, the stylesheet must check each coordinate and add a varying amount of white space in front of it depending on its size and sign (+ or -); unfortunately this complication is unavoidable. As with JME, the converted string is contained within a named <INPUT> and then passed by onLoad="document.JMolApplet.setModelToRenderFromXYZString(jmoloutput.value,'T')" to the applet.
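
For comparison, the padding that the stylesheet has to build up from a series of tests is a single formatting step in a general-purpose language. The Python sketch below is an illustration only: the function name `xyz_string` is hypothetical and the field widths (two characters for the element, then eight, nine and nine for the coordinates) are an assumption inferred from the right-aligned sample rows above, not taken from a format specification.

```python
def xyz_string(title, atoms, delimiter="%"):
    """Build a delimiter-separated .xyz string with right-aligned columns.

    `atoms` is a list of (element, x, y, z) tuples.
    """
    rows = [str(len(atoms)), f"{title} from CML"]
    for element, x, y, z in atoms:
        # Fixed-width fields pad each number on the left, so the sign and
        # magnitude of a coordinate no longer need separate checks.
        rows.append(f"{element:<2}{x:8.4f}{y:9.4f}{z:9.4f}")
    return "".join(row + delimiter for row in rows)


demo = xyz_string("demo", [
    ("C", -1.756, 0.0, 0.308),
    ("C", -10.36, 3.076, 5.288),
    ("H", 8.14, -11.152, 7.508),
])
print(demo)
```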

Figure 4.4.3: Jmol stylesheet fragment - reformatted for clarity

..
<xsl:template match="molecule">
	<!-- build applet -->
	<APPLET CODE="org.openscience.miniJmol.JmolApplet.class" 
		NAME="JMolApplet" ARCHIVE="JmolApplet.jar" 
		WIDTH="300" HEIGHT="300">
	You have to enable Java and JavaScript on your machine !
	</APPLET>
	<xsl:for-each select="id(@idref)">
	<!-- build INPUT containing the xyz source -->
	<xsl:element name="INPUT">
		<xsl:attribute name="NAME">jmoloutput</xsl:attribute>
		<xsl:attribute name="TYPE">hidden</xsl:attribute>
		<xsl:attribute name="VALUE">
		<!-- start conversion to .xyz -->
		<!-- select last atom and add its number -->
		<xsl:for-each select="list/atom[end()]" xml:space="preserve">
		<xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval>
		</xsl:for-each>%
		<!-- add title -->
		<xsl:value-of select="@title"/> from CML%
		<xsl:for-each select="list/atom" xml:space="preserve">
			<xsl:for-each select="string[@builtin='elementType']">
				<xsl:value-of select="."/>
				<!-- single letter elements require an additional space -->
				<xsl:if test=".[. = 'H']"> </xsl:if>
				<xsl:if test=".[. = 'B']"> </xsl:if>
				<xsl:if test=".[. = 'C']"> </xsl:if>
				<xsl:if test=".[. = 'N']"> </xsl:if>
				<xsl:if test=".[. = 'O']"> </xsl:if>
				<!-- etc -->
			</xsl:for-each> 
		<!-- for each atom, extract coord adding white space as required -->
		<xsl:if test="float[@builtin='x3' and . $lt$ 10]"> </xsl:if>
		<xsl:if test="float[@builtin='x3' and . >= 0]"> </xsl:if>
		<xsl:value-of select="float[@builtin='x3']"/>
		<!-- repeat for y3 and z3 -->
		%
		</xsl:for-each>
		<!-- finish conversion to .xyz -->
		</xsl:attribute>
	</xsl:element>
	</xsl:for-each>
</xsl:template>
..

4.5 - Marvin and Structure Drawing Applet (SDA)

Marvin is a commercial applet produced by ChemAxon³³ and is free to academic users. It is a wire frame 3D viewer able to accept .mol data using external files, as a <PARAM> or via JavaScript. The applet is very configurable and includes a 2D editing function. SDA (Structure Drawing Applet) is a freeware editor developed by ACD Labs and also accepts .mol data via <PARAM>.³⁴

Figure 4.5.1: Marvin (left) and SDA applets - Marvin: left click/drag to rotate, shift-lc/d to zoom, ctrl-lc/d to translate, right click for menu - SDA: tool bars contain editing tools, applet can be 'floated' to a separate window by clicking the top left button

Figure 4.5.3: Marvin stylesheet fragment - reformatted for clarity

..
<xsl:template match="molecule">
	<APPLET CODE="MView" ARCHIVE="marvin.jar" WIDTH="300" HEIGHT="300">
		<xsl:element name="PARAM">
			<xsl:attribute name="NAME">mol</xsl:attribute>
			<xsl:attribute name="VALUE">
<!-- start conversion to .mol -->
<xsl:value-of select="@title"/>\
  from CML\
\		
<!-- select last atom, calculate its number and required white space -->
<xsl:for-each select="list/atom[end()]/integer[@builtin='atomId']" xml:space="preserve">
	<xsl:if test=".[text() $lt$ 100]"> </xsl:if>
	<xsl:if test=".[text() $lt$ 10]"> </xsl:if>
	<xsl:for-each select="../../atom[end()]">
		<xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval>
	</xsl:for-each>
</xsl:for-each>
<!-- select last bond, calculate its number and required white space -->
<xsl:for-each select="list/bond[end()]/integer[@title='bondId']" xml:space="preserve">
	<xsl:if test=".[text() $lt$ 100]"> </xsl:if>
	<xsl:if test=".[text() $lt$ 10]"> </xsl:if>
	<xsl:for-each select="../../bond[end()]">
		<xsl:eval>formatIndex(childNumber(this), "1")</xsl:eval>
	</xsl:for-each>
</xsl:for-each>\
<!-- for each atom, extract coord and type, with required white space -->
<xsl:for-each select="list/atom" xml:space="preserve">
 	<xsl:if test="float[@builtin='x3' and . $lt$ 10]"> </xsl:if>
	<xsl:if test="float[@builtin='x3' and . >= 0]"> </xsl:if>
	<xsl:value-of select="float[@builtin='x3']"/>  
	<!-- repeat for y3 and z3 --> 
	<xsl:for-each select="string[@builtin='elementType']">
		<xsl:value-of select="."/>
		<!-- elements with single letter names require an additional space -->
		<xsl:if test=".[. = 'H']"> </xsl:if>
		<xsl:if test=".[. = 'B']"> </xsl:if>
		<xsl:if test=".[. = 'C']"> </xsl:if>
		<xsl:if test=".[. = 'N']"> </xsl:if>
		<xsl:if test=".[. = 'O']"> </xsl:if>
		<!-- etc. -->  0  0  0  0  0  0  0  0  0  0  0  0\
	</xsl:for-each>
</xsl:for-each>
<!-- for each bond, extract atomRefs and order, with required white space -->
<xsl:for-each select="list/bond" xml:space="preserve">
	<xsl:for-each select="integer[@builtin='atomRef']">
		<xsl:if test=".[. $lt$ 100]"> </xsl:if>
		<xsl:if test=".[. $lt$ 10]"> </xsl:if>
		<xsl:value-of select="."/>
	</xsl:for-each>
	<xsl:value-of select="integer[@builtin='order']"/>  0  0  0  0\
</xsl:for-each>
M  END
<!-- finished conversion to mol -->
			</xsl:attribute>
		</xsl:element>
	You must have Java turned on!
	</APPLET>
</xsl:template>
..

The stylesheets for these applets are very similar and use the <PARAM> mechanism. As with Jmol, line delimitation is required for the mol file and is built into both applets (Marvin uses "\" and SDA "|"). As with .xyz, handling white space is complex, particularly since Marvin is very intolerant of errors. The XSL fragment in Figure 4.5.3 works for most small molecules, but a more robust version will need to be developed when more features are available in Internet Explorer's XSL engine (in particular the ability to compare data from different elements).

Experiments have shown that the combination of CML and the Marvin applet scales unexpectedly well and can display over 300 molecules on a modern PC.
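
The CML to .mol conversion performed by the stylesheet can be written compactly where fixed-width number formatting is available. The Python sketch below is an illustration under stated assumptions, not the project's converter: the function name `mol_string` is hypothetical, the column widths follow the common MDL layout (three characters per count, ten per coordinate), and the trailing zero columns simply mirror those written by the stylesheet fragment.

```python
def mol_string(title, atoms, bonds, delimiter="\\"):
    """Build a delimiter-separated .mol string.

    `atoms` is a list of (element, x, y, z) tuples and `bonds` a list of
    (atom1, atom2, order) tuples. The default delimiter is the backslash
    used for Marvin; SDA would need "|".
    """
    rows = [title, "  from CML", ""]  # three header rows
    rows.append(f"{len(atoms):3d}{len(bonds):3d}")
    for element, x, y, z in atoms:
        rows.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {element:<3} 0" + "  0" * 11)
    for atom1, atom2, order in bonds:
        rows.append(f"{atom1:3d}{atom2:3d}{order:3d}" + "  0" * 4)
    rows.append("M  END")
    return delimiter.join(rows)


# First atom from the ethanol example; a one-atom fragment is enough to
# show the layout.
mol = mol_string("ethanol", [("C", 1.0303, 0.8847, 0.9763)], [])
print(mol)
```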

5 - Spectra

Spectra are regularly stored and exchanged using an electronic format, and modern spectrometers can save their data directly to disk. Electronically stored spectra can be easily indexed, searched against or otherwise manipulated, and large databases of spectra files exist online (e.g. the NIST Webbook).³⁵ As with molecular structures, a number of legacy formats are in use, the most common being JCAMP-DX⁵⁰ and SPC.³⁶ One of the aims of this project was to extend existing CML to incorporate spectra markup and to develop techniques for displaying it.

5.1 - JCAMP-DX/CS

The JCAMP-DX format defines a series of conventions for the storage and exchange of spectral data between spectrometers, lab and personal computers. Files using this format are written using ASCII and contain a series of labelled data records (LDR). Each LDR is delimited by a new line and a double hash:

##LDRname= value

While having a much simpler structure than XML, these LDRs can be considered as a form of markup, with the LDR name (data label) being equivalent to the element name. This is unsurprising since the two formats share a number of important design criteria, such as extensibility, human and machine readability, flexibility and platform independence. Approximately 25 reserved data labels were originally defined for JCAMP-DX. These are intended to contain:

Metadata - (e.g. ##ORIGIN, ##OWNER, ##DATE)
Header information - (e.g. ##TITLE, ##DATA TYPE, ##BLOCKS, ##END)
Required spectral parameters - (e.g. ##XUNITS, ##FIRSTX, ##MAXX)
Optional spectral parameters - (e.g. ##RESOLUTION, ##DELTAX)
Spectral data - (e.g. ##XYDATA, ##XYPOINTS, ##PEAK TABLE, ##PEAK ASSIGNMENTS)
Equipment - (e.g. ##INSTRUMENT PARAMETERS)
Sample information - (e.g. ##CAS NAME, ##MOLFORM, ##MP)

Figure 5.1.1: Tetrahydrofuran mass spectrum using the JCAMP-DX format

An example of a simple JCAMP-DX file is given in Figure 5.1.1. Important LDRs are ##DATA TYPE, which describes the type of spectrum, and ##PEAK TABLE, which contains the data points. A variable list label following ##PEAK TABLE describes how the data is tabulated. Data is grouped into XY or XYZ/XYM coordinates and separated by commas within the group, while groups are separated by semicolons or spaces. Any additional white space is ignored. Common variable lists are:

(XY..XY): Grouped XY pairs.
	23,102 19,89 15,87 ..
(XYZ..XYZ) or (XYM): Grouped XYZ or XYM (can be used to give multiplicity).
	23,102,S 19,89,S 15,87,D ..
(X++(Y..Y)): More complex. Each line starts with an X value, followed by a series of Y values at equal X spacings, so
	12 52 48 46 43 42
	22 43 38 36 35 31
is equivalent to
	12,52 14,48 16,46 18,43 20,42 22,43 24,38 26,36 28,35 30,31
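
The expansion of (X++(Y..Y)) into (XY..XY) pairs can be sketched as follows. This is a minimal Python illustration using the numbers from the example above; the function name `expand_xpp` is hypothetical, and the X spacing is passed in explicitly (here 2) rather than derived from other records in the file.

```python
def expand_xpp(text, delta_x):
    """Expand (X++(Y..Y)) rows into a flat list of (x, y) pairs.

    Each non-empty row holds a starting X followed by Y values that are
    `delta_x` apart along the X axis.
    """
    pairs = []
    for row in text.splitlines():
        values = [float(v) for v in row.split()]
        if not values:
            continue  # skip blank rows
        x0, ys = values[0], values[1:]
        for i, y in enumerate(ys):
            pairs.append((x0 + i * delta_x, y))
    return pairs


pairs = expand_xpp("12 52 48 46 43 42\n22 43 38 36 35 31", delta_x=2)
print(" ".join(f"{x:g},{y:g}" for x, y in pairs))
```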

JCAMP-DX also allows the use of user defined data labels. These are distinguished by a $ prefix and are specific to a particular instrument, location, user etc. This effectively means that JCAMP is an extensible language, but it does not include any procedure for systematically describing an LDR's meaning. This seriously limits the use of user defined data labels beyond the original author. A number of JCAMP 'sub-languages' have been defined in an attempt to solve this.^{37 38} These contain lists of LDRs intended for particular areas of spectroscopy (e.g. UV/Vis, ¹H NMR, MS). Particularly interesting is JCAMP-CS,³⁹ which allows the addition of limited structural information. These JCAMP files are split into a number of separate blocks, each containing a different type of information (e.g. molecular structure, peak table and peak assignments). This information can be extracted and marked up as CML, and a JCAMP-CS to CML converter has been written for this purpose.

5.2 - CML: spectrum

Since a well established (if non-XML) markup language already exists, many technical difficulties have already been solved. Rather than develop new conventions for spectral markup, we have chosen to incorporate those used by JCAMP. It is expected that most users will wish to convert CML to and from JCAMP, so this needs to be as easy as possible. Once converted, spectra can be incorporated into XML documents and displayed using techniques similar to those used for molecules.

Comparing Figure 5.1.1 (JCAMP) with Figure 5.2.1 illustrates the approach used. A new element <spectrum> is defined as an extension to CML, and this is given a new namespace chimeral:. The contents of JCAMP's labelled data records are then mapped to a <string> or <float> with an appropriate @title value. Reserved LDRs (##TITLE, ##XUNITS etc.) are directly recognised and given lower case values to better comply with XML conventions. User-defined or unrecognised LDRs can either be ignored or, more correctly, given @title="LDRname". This avoids any information loss on converting JCAMP to CML.
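
The mapping from a labelled data record to a data holder can be sketched in a few lines. The Python fragment below is an illustration only and not the converter written for this project: the function name `ldr_to_holder` is hypothetical, the choice between <float> and <string> is made by a simple numeric test, and labels are only lower-cased (the markup example in this section additionally writes 'datatype' without a space, which this sketch does not reproduce).

```python
import xml.etree.ElementTree as ET


def ldr_to_holder(line):
    """Turn one '##LABEL= value' record into a <float> or <string> element."""
    label, _, value = line.lstrip("#").partition("=")
    value = value.strip()
    try:
        float(value)
        tag = "float"    # value parses as a number
    except ValueError:
        tag = "string"   # anything else is kept as text
    holder = ET.Element(tag, {"title": label.strip().lower()})
    holder.text = value
    return holder


records = [
    "##JCAMP-DX=4.24",
    "##DATA TYPE=MASS SPECTRUM",
    "##CAS REGISTRY NO=109-99-9",
    "##MOLFORM=C4H8O",
]
holders = [ET.tostring(ldr_to_holder(r), encoding="unicode") for r in records]
for holder in holders:
    print(holder)
```

Keeping unrecognised labels as the @title, rather than discarding them, is what allows the reverse conversion to rebuild the original records.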

Figure 5.2.1: Example of spectra markup - see Appendix B for more details

..
<chimeral:spectrum title="Furan, tetrahydro-" id="spect_furantetrah_ir_1"
	convention="JCAMP-DX= 4.24">
	<string title="datatype">MASS SPECTRUM</string>
	<string title="EPA">61352</string>
	<string title="origin">D.HENNEBERG, MAX-PLANCK INSTITUTE, MULHEIM, WEST GERMANY</string>
	<string title="owner">NIST Mass Spectrometry Data Center</string>
	<string title="xunits">M/Z</string>
	<string title="yunits">RELATIVE ABUNDANCE</string>
	<float title="firstx" convention="M/Z">24</float>
	<float title="lastx" convention="M/Z">73</float>
	<float title="xfactor">1</float>
	<float title="firsty" convention="RELATIVE ABUNDANCE">4</float>
	<float title="miny" convention="RELATIVE ABUNDANCE">3</float>
	<float title="maxy" convention="RELATIVE ABUNDANCE">9999</float>
	<float title="yfactor">1</float>
	<float title="npoints">32</float>
	<list title="peak table" convention="(XY..XY)">
		<coordinate2 id="furantetrah_ir_c_1">24, 4</coordinate2>
		<coordinate2 id="furantetrah_ir_c_2">25, 30</coordinate2>
		<coordinate2 id="furantetrah_ir_c_3">26, 171</coordinate2>
		<coordinate2 id="furantetrah_ir_c_4">27, 1545</coordinate2>
		<!-- 28 more coordinate2s -->
	</list>
</chimeral:spectrum>
..

The delimiters between data groups and between data within a group are only loosely defined in the JCAMP specifications. As a result, the syntax used for ##XYDATA and ##PEAKTABLE is extremely variable. In particular, while the (X++(Y..Y)) convention is common, it is distinctly confusing for the non-specialist and is extremely difficult to store using XML elements. It is suggested that CML spectra should only use the (XY..XY) or (XYM)/(XYZ) conventions, with each pair or triplet marked up using <coordinate2> or <coordinate3> respectively. Data within these elements are separated by a comma and a white space (', '). More complex conventions are deconvoluted during the JCAMP to CML conversion process, e.g. (X++(Y..Y)) is expanded to (XY..XY).
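The expansion of (X++(Y..Y)) rows to (XY..XY) pairs can be illustrated with a short sketch. It is written in Python purely for illustration and is not the project's converter. It assumes uncompressed, whitespace-separated rows in which the first field is the abscissa of the first ordinate on that row, and it takes the spacing from the FIRSTX, LASTX and NPOINTS records. The function names `expand_xpp_yy` and `to_coordinate2` are hypothetical, and compressed JCAMP forms are not handled.

```python
def expand_xpp_yy(lines, firstx, lastx, npoints, xfactor=1.0, yfactor=1.0):
    """Expand uncompressed (X++(Y..Y)) rows into a flat list of (x, y) pairs."""
    # Spacing between consecutive abscissa values, taken from the header LDRs.
    delta_x = (lastx - firstx) / (npoints - 1) if npoints > 1 else 0.0
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank rows
        # Assumption: the first field is the X value of the first Y on this row.
        x0 = float(fields[0]) * xfactor
        for i, y in enumerate(fields[1:]):
            pairs.append((x0 + i * delta_x, float(y) * yfactor))
    return pairs


def to_coordinate2(pairs):
    """Render each pair as a <coordinate2> element using the ', ' separator."""
    return [f"<coordinate2>{x:g}, {y:g}</coordinate2>" for x, y in pairs]


rows = ["100 1 2 3", "103 4 5"]
for element in to_coordinate2(expand_xpp_yy(rows, firstx=100, lastx=104, npoints=5)):
    print(element)
```

Keeping the expansion separate from the markup step means the same pairs could equally be written out as a JCAMP (XY..XY) table.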

Since a huge number of LDRs have been defined, we have chosen to include only a small number of generic ones in this project. If an author wishes to use additional LDRs, these need to be added to the converter and stylesheets.
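The mapping of labelled data records onto CML elements described in this section can be sketched as follows. This is a minimal Python illustration, not the converter written for the project. It assumes one `##LABEL= value` record per line, omits the chimeral: namespace for brevity, and chooses <float> whenever the value parses as a number, which is an assumption of this sketch. The function name `ldrs_to_cml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def ldrs_to_cml(jcamp_text):
    """Map '##LABEL= value' records onto children of a <spectrum> element."""
    spectrum = ET.Element("spectrum")  # namespace prefix omitted in this sketch
    for raw in jcamp_text.splitlines():
        line = raw.strip()
        if not line.startswith("##") or "=" not in line:
            continue  # data rows are not handled here
        label, value = (part.strip() for part in line[2:].split("=", 1))
        if label == "END":
            continue
        if label == "TITLE":
            spectrum.set("title", value)
        elif label == "JCAMP-DX":
            spectrum.set("convention", f"JCAMP-DX= {value}")
        else:
            # Reserved labels are lower-cased with spaces removed; user-defined
            # ($-prefixed) labels keep their original name as the @title.
            if label.startswith("$"):
                title = label
            else:
                title = label.replace(" ", "").lower()
            try:
                float(value)
                tag = "float"  # assumption: numeric values become <float>
            except ValueError:
                tag = "string"
            ET.SubElement(spectrum, tag, {"title": title}).text = value
    return spectrum


sample = "##TITLE= Furan, tetrahydro-\n##JCAMP-DX= 4.24\n##DATA TYPE= MASS SPECTRUM\n##NPOINTS= 32\n##END=\n"
print(ET.tostring(ldrs_to_cml(sample), encoding="unicode"))
```

Because unrecognised labels fall through to the generic branch, no record is dropped, which mirrors the aim of avoiding information loss on conversion.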

5.3 - JSpec - JCAMP and SPEC display applet

JSpec is an open source applet written by Guillaume Cottenceau.⁴⁰ It is small (31K) and was designed to display JCAMP and SPC spectra files. It also incorporates a number of useful features, including zooming, peak finding and integration.

Figure 5.3.1: JSpec spectra viewer showing caffeine mass and UV/Vis spectra - click drag to select an area then click buttons

The stylesheet is similar to that used for the Marvin applet. CML is converted to a (cleaned up) JCAMP text string and this is built into the applet's <PARAM>. Line delimitation is not required since JCAMP already has a ## separator, but the applet was slightly adapted to accept comma-separated ##PEAKTABLE and ##XYDATA groups. Infra red spectra can be reversed (X axis) but not inverted (Y axis); this functionality might be built into the applet at a later date.

Figure 5.3.3: Jspec stylesheet fragment - reformatted for clarity

..
<!-- you must include the namespace in the template match -->
<xsl:template match="chimeral:spectrum">
<!-- build applet -->
	<APPLET CODE="Visua.class" WIDTH="400" HEIGHT="250">
	<xsl:element name="PARAM">
		<xsl:attribute name="NAME">SOURCE</xsl:attribute>
		<xsl:attribute name="VALUE" xml:space="preserve">
<!-- start conversion to JCAMP -->
<!-- * means 'any' within a pattern match -->
##TITLE= <xsl:value-of select="@title"/>
##<xsl:value-of select="@convention"/>
##DATA TYPE= <xsl:value-of select="*[@title='datatype']"/>
##XUNITS= <xsl:value-of select="*[@title='xunits']"/>
##YUNITS= <xsl:value-of select="*[@title='yunits']"/>
##FIRSTX= <xsl:value-of select="*[@title='firstx']"/>
##LASTX= <xsl:value-of select="*[@title='lastx']"/>
##FIRSTY= <xsl:value-of select="*[@title='firsty']"/>
##MINY= <xsl:value-of select="*[@title='miny']"/>
##MAXY= <xsl:value-of select="*[@title='maxy']"/>
##NPOINTS= <xsl:value-of select="*[@title='npoints']"/>
<xsl:for-each select="*[@title='peak table' and @convention='(XY..XY)']">
##PEAK TABLE= (XY..XY)
<xsl:for-each select="coordinate2">
<xsl:value-of/>,
</xsl:for-each>
</xsl:for-each>
<!-- repeat for 'xypairs (XY..XY)' and 'peak table (XYM)' -->
##END=
<!-- end conversion -->
		</xsl:attribute>
	</xsl:element>
	<!-- Reverses IR spectra -->
	<xsl:if test="*[@title='datatype' and .='INFRARED SPECTRUM']">
		<PARAM NAME="WAY" value="REVERSE"/>
	</xsl:if>
</APPLET>
</xsl:template>
..
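For testing the mapping outside the browser, the same CML to JCAMP conversion can be approximated in a few lines of Python. This is only a sketch under stated assumptions: it reads an un-namespaced <spectrum> element, handles the (XY..XY) peak table only, and reads each header value from the child element whose @title matches. The names `HEADER_FIELDS` and `spectrum_to_jcamp` are hypothetical and do not appear in the project's code.

```python
import xml.etree.ElementTree as ET

# (JCAMP label, CML @title) pairs written to the header, in output order.
HEADER_FIELDS = [
    ("DATA TYPE", "datatype"), ("XUNITS", "xunits"), ("YUNITS", "yunits"),
    ("FIRSTX", "firstx"), ("LASTX", "lastx"), ("FIRSTY", "firsty"),
    ("MINY", "miny"), ("MAXY", "maxy"), ("NPOINTS", "npoints"),
]


def spectrum_to_jcamp(spectrum):
    """Build a JCAMP-style text string from a CML <spectrum> element."""
    def text_of(title):
        # '*' matches any child element, as in the stylesheet pattern.
        element = spectrum.find(f"*[@title='{title}']")
        return (element.text or "").strip() if element is not None else ""

    lines = [f"##TITLE= {spectrum.get('title', '')}",
             f"##{spectrum.get('convention', '')}"]
    lines += [f"##{label}= {text_of(title)}" for label, title in HEADER_FIELDS]
    for table in spectrum.findall("list[@title='peak table']"):
        if table.get("convention") != "(XY..XY)":
            continue  # other conventions are not handled in this sketch
        lines.append("##PEAK TABLE= (XY..XY)")
        lines += [f"{(c.text or '').strip()}," for c in table.findall("coordinate2")]
    lines.append("##END=")
    return "\n".join(lines)


example = ET.fromstring(
    '<spectrum title="Example" convention="JCAMP-DX= 4.24">'
    '<string title="datatype">MASS SPECTRUM</string>'
    '<list title="peak table" convention="(XY..XY)">'
    '<coordinate2>24, 4</coordinate2></list></spectrum>'
)
print(spectrum_to_jcamp(example))
```

Header records with no matching element are written with an empty value here; a production converter might prefer to omit them.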

Further development is required to include peak assignment tables in the CML <spectrum>. We would also like to develop mechanisms by which applets can communicate with each other and with the XML document (probably via JavaScript). One can envisage clicking on a peak in a spectrum and having the parser look it up in a CML peak assignments table; it would then highlight the appropriate functional group on a molecular display.

5.4 - Scalable Vector Graphics (SVG)

Chemistry, in common with disciplines such as mathematics, conveys information through formal notation (often machine parsable) and more general graphical objects. In chemistry these include full and dotted lines, straight and curly arrows (with various types of arrowheads), links, braces, containers, specialised glyphs and pictorial objects (e.g. for surfaces, solids, etc.). Chemical schemes and diagrams are common and often consist of a mixture of graphics objects and formal notation. These enable great creativity of expression and several editing tools pay great attention to providing these. They are therefore difficult to capture accurately for machine processing using a markup language. The only current method - line art or pixel maps - loses much of the machine readable semantics.

XML provides a very powerful graphics language, Scalable Vector Graphics (SVG).⁵ Most images on the web use bitmapped formats (.gif, .jpg, .png), which use a grid of coloured pixels to form the picture. In contrast, vector images use Cartesian descriptions of drawing objects (lines, polygons, fills, text) and overlay them onto a blank 'canvas'. Vector formats are most efficient for simple line images and diagrams, where the files they produce will be significantly smaller than the equivalent bitmap. Since the drawing objects are described mathematically, vector images can also be zoomed and resized without pixellating. SVG uses a range of drawing 'primitives' (line, circle, path, etc. with fill, stroke, pattern etc.) as elements. Being XML it allows these to contain attributes and other element children, and SVG specifically anticipates that elements from other namespaces will be interspersed with its own information.

The natural use of SVG is for transmission of line art and other semantically rich graphics (2D only) over the Web. For example, most of the diagrams in this manuscript are created directly in SVG and can be viewed with widely available tools. SVG allows local and global rescaling, extraction of sub-components, and searches for text strings. Filters allowing files to be exported as SVG are available for Corel Draw⁴¹ and Adobe Illustrator (popular drawing packages), and Adobe has released an SVG plugin⁶ for Internet Explorer and Netscape Navigator. SVG can also be exported to various outputs, for example pdf files.

A common use of graphics is to display instrumental output such as spectra. SVG is a natural medium for this as the data are not corrupted and accurate values can be extracted from the spectrum if required. Moreover the spectrum can have XML-based links to other elements such as molecules (e.g. in a chromatogram) or atoms (as in molecular spectra).

We can therefore confidently use SVG as a semantically rich tool to co-exist with CML. A typical example is a Scheme with several molecules, linked with lines, and surrounded by containers; this could denote reactions, interrelationships, systematisations, etc. The basic framework consists of an SVG element which contains <cml:molecule>s or, even better, links to <cml:molecule>s. The graphic elements can *contain* metadata stating their function (e.g. "curly arrow" denotes a 2-electron transfer from its tail to head). Hopefully the community will converge on conventions for such metadata. Once agreed, these conventions would let users search documents for such constructs:

Figure 5.4.1: Example of combined SVG and CML

<svg:svg>
  <defs>
    <cml:molecule id="mol1">...</cml:molecule> 
    <cml:molecule id="mol2">...</cml:molecule> 
  </defs>
  <use href="#mol1" x="100" y="0"/>
  <svg:path d="M 100 0 l 100 0 100 100 0 100 z">
    <string type="metadata">Conformational change</string>
    <cml:float name="temperature" units="Celsius">80</cml:float>
  </svg:path>
  <use href="#mol2" x="0" y="100"/>
</svg:svg>

This markup could portray the conformational interconversion of mol1 to mol2 at specific points in the Scheme, using a curved line to link them.

CML containing 2- or 3-D coordinates can be transformed to SVG with XSLT stylesheets. Since bonds contain references to atoms (atomRefs) it is possible to work out where the bonds should be drawn. Atom properties (e.g. calculated charge) can be simulated by spheres or other primitives, so XML/SVG-aware browsers already contain components for a simple molecular renderer.
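The step of working out where bonds should be drawn from their atom references can be sketched outside XSLT. The Python fragment below is illustrative only: the atom coordinates and bond references are supplied as plain Python structures rather than parsed from CML, and the function name `molecule_to_svg`, the scale and the margin are hypothetical choices.

```python
import xml.etree.ElementTree as ET


def molecule_to_svg(atoms, bonds, scale=40.0, margin=20.0):
    """Draw each bond as an SVG <line> between the two atoms it references.

    atoms maps an atom id to its (x, y) coordinates;
    bonds is a list of (atom id, atom id) pairs.
    """
    xs = [x for x, _ in atoms.values()]
    ys = [y for _, y in atoms.values()]

    def to_screen(x, y):
        # SVG's y axis points downwards, so the chemical y axis is flipped.
        return margin + (x - min(xs)) * scale, margin + (max(ys) - y) * scale

    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": f"{2 * margin + (max(xs) - min(xs)) * scale:g}",
        "height": f"{2 * margin + (max(ys) - min(ys)) * scale:g}",
    })
    for first, second in bonds:
        x1, y1 = to_screen(*atoms[first])
        x2, y2 = to_screen(*atoms[second])
        ET.SubElement(svg, "line", {
            "x1": f"{x1:g}", "y1": f"{y1:g}",
            "x2": f"{x2:g}", "y2": f"{y2:g}",
            "stroke": "black",
        })
    return ET.tostring(svg, encoding="unicode")


atoms = {"a1": (0.0, 0.0), "a2": (1.0, 0.0), "a3": (1.0, 1.0)}
bonds = [("a1", "a2"), ("a2", "a3")]
print(molecule_to_svg(atoms, bonds))
```

Atom labels or spheres could be added in the same loop style by emitting <text> or <circle> elements at the transformed atom positions.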

Figure 5.4.2: SVG molecule and infra red spectrum - right click for menu

The examples given in Figure 5.4.2 were produced from CML by Peter Murray-Rust, using the XT parser. SVG promises to be an excellent alternative to applets for the display of static molecules, reactions and spectra. Unfortunately the IE 5 XSL engine does not yet include the functions necessary to carry out CML to SVG stylesheet transformations. It is hoped that later releases of the browser will rectify this. In addition, the Adobe plugin can only accept external SVG files and is unable to display SVG incorporated into an XML document. One solution would be to use an SVG applet, but these are still being developed.^{42 43}

Several diagrams and flow charts in this report use SVG; see Figure 1.2.1, Figure 2.1.1, Figure 3.4.3 (right click on the diagrams and select 'zoom in').

6 - Reactions

Element <reaction> was included in the CML DTD but not elaborated further. The techniques presented here are experimental and further work is required to develop robust and flexible reaction markup. It is hoped that these explorations might provide a starting place for future development.

Reaction markup could potentially get extremely complicated, particularly if atom mapping or functional groups are included. For our initial experiments, we have concentrated on using CML to display simple reaction schemes within the IE browser. This can be considered as an extension of <molecule> markup and is approached in a similar way.

6.1 - Linking via @id

A reaction is considered as a series of molecular species, potentially including reactants, products, reagents, intermediates, transition states etc. These species are then grouped into reaction steps with suitable information given at each step (e.g. reaction conditions, yield, reaction name). If the structure of each species is expressed as a CML <molecule> and supplied with a unique @id value, then a reaction step need only contain references to each species' @id (via @href="idname") and need not contain the entire molecule. This keeps the CML succinct, enhancing human readability. It also enhances the reusability of chemical components; a library of reactions using the same reactive species needs only contain one copy of its structure. An example of a simple CML reaction is given in Figure 6.1.1 with the molecular structures omitted. A full description of reaction markup can be found in Appendix B.

The attribute names @id and @href are those chosen as linkers for CML. Other languages might use different attributes and the parser needs to know which these are. IE uses three data type declarations in the schema for this purpose. An attribute declared as dt:type="id" contains an element's unique identity, dt:type="idref" contains a reference and dt:type="idrefs" may contain one or more references (space separated). The parser is required to ensure all identities contain unique strings and that all references refer to an existing identity.

Figure 6.1.1: Simplified 'stepwise' reaction - Diels Alder Cycloaddition

..
<cml title="Simple Reaction" id="cml_simple">
	..
	<reaction title="Diels-Alder cycloaddition" id="simple_rxn_1" convention="stepwise">
		<!-- overall information -->
		<string title="description">Simple example of a A + B -> C reaction.</string>
		<float title="yield" units="%">88</float>
		<string title="notes">taken from Vollhardt and Schore</string>
		<list title="reactionStep" id="simple_s_1">
			<!-- reaction step information -->
			<string title="description">cycloaddition</string>
			<string title="notes">one step</string>
			<!-- series of links to each reaction species -->
			<link title="reactant" href="simple_mol_reactant1"/>
			<link title="reactant" href="simple_mol_reactant2"/>
			<!-- reagent could contain: a reagent name,
				 link to a structure or as here, a list of 
				 conditions. multiple reagents can be included 
				 if needed -->
			<link title="reagent">
				<integer title="index">1</integer>
				<string title="solvent">Acetonitrile</string>
				<string title="temperature" convention="degC">100</string>
				<string title="duration" convention="hours">3</string>
				<string title="notes">reflux</string>	
			</link>
			<link title="reagent">
				<integer title="index">2</integer>
				<string title="notes">workup</string>	
			</link>
			<link title="product" href="simple_mol_product"/>
			<!-- could also include catalyst, intermediate, 
				transition state etc. -->	
		</list>
	</reaction>
	..
	<!-- the actual molecules can be anywhere, even in a different cml block -->
	<molecule title="2,3-Dimethyl-1,3-butadiene" id="simple_mol_reactant1">
	<!-- etc. -->
	</molecule>
	..
	<molecule title="Propenal" id="simple_mol_reactant2">
	<!-- etc. -->
	</molecule>
	..
	<molecule title="Diels-Alder adduct" id="simple_mol_product">
	<!-- etc. -->
	</molecule>
	..
</cml>
..
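The two checks the parser is required to make, that every identity is unique and that every reference points at an existing identity, can be sketched independently of any particular schema. The Python fragment below is an illustration only. It assumes, as in Figure 6.1.1, that identities are held in @id and that references are held in the @href of <link> elements; the function name `check_links` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_links(xml_text):
    """Return (duplicate ids, dangling hrefs) for a CML-like document."""
    root = ET.fromstring(xml_text)
    # Count how often each @id value occurs anywhere in the document.
    ids = Counter(el.get("id") for el in root.iter() if el.get("id") is not None)
    duplicates = sorted(value for value, count in ids.items() if count > 1)
    # A reference is dangling if no element carries the matching @id.
    dangling = sorted({
        link.get("href") for link in root.iter("link")
        if link.get("href") is not None and link.get("href") not in ids
    })
    return duplicates, dangling


document = (
    '<cml><reaction id="rxn_1"><list title="reactionStep" id="step_1">'
    '<link title="reactant" href="mol_a"/><link title="product" href="mol_b"/>'
    '</list></reaction><molecule id="mol_a"/></cml>'
)
print(check_links(document))
```

Reagent links that carry conditions rather than an @href, as in the figure, are simply ignored by the reference check.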

Links of this sort are particularly useful for comparing or indexing two sets of data against each other. Further chemical examples might be atom mapping (in a reaction) or peak assignment for a spectrum. Significantly more advanced linking techniques called XLink⁴⁴ and XPointer⁴⁵ are being developed by the W3C. These allow for one or two way linking between documents and complex one to many or many to one links. Using these techniques, a <reaction> in an XML document could reference data-sheets in a chemical archive. By clicking on a reaction species, the reader would then have access to full chemical data for that molecule.

Figure 6.1.2: Stepwise epoxidation reaction (directly converted from a rxn file)

[Rendered reaction scheme. Stepwise Reaction: Epoxidation of Styrene; id: epoxidation_rxn_1; Description: Example of a simple A + B -> C + D reaction; Yield: 75 %; Notes: taken from Warren; conditions: reflux. The molecular structures are displayed by Java applets and are not reproduced here.]

Two stylesheets have been developed to display CML reactions. Both use the Marvin applet (in 2D mode) to display the reaction species and an HTML table to format the applets into a reaction scheme (the arrows are normal gif images). Additional information (reaction conditions etc.) is displayed above each reaction step or below the arrow, as appropriate. The first stylesheet (Figure 6.1.3) is able to display multiple reaction steps of the form A + B + .. -> X + Y + ..

Figure 6.1.3: Stepwise reaction stylesheet fragment - reformatted for clarity

..
<!-- use this template for a stepwise reaction -->
<xsl:template match="reaction[@convention='stepwise']">
	<!-- build HTML table -->
	<TABLE><TR>
	    <TD COLSPAN="2"><B>Stepwise Reaction:</B> <xsl:value-of select="@title"/></TD>
    	<TD><B>id:</B> <xsl:value-of select="@id"/></TD>
	</TR>
	<!-- etc. for 'description', 'yield' and 'notes' -->
	<!-- build a reaction scheme for each step -->
	<xsl:for-each select="list[@title='reactionStep']">
	<TR><TD COLSPAN="3">
		<TABLE><TR>
   		<TD>
		<!-- select each link to a reactant, then find and select the elements 
		with matching @id (this is what id(@href) does) -->
		<xsl:for-each select="link[@title='reactant']"> 
			<xsl:for-each select="id(@href)">
				<!-- call a template that builds Marvin applet -->
				<xsl:apply-templates select="list[./atom/float/@builtin='x3']"/>
			</xsl:for-each>
		</xsl:for-each>
		</TD><TD>
		<!-- the arrow is a normal gif -->
		<IMG SRC="writeupdata/arrow.gif" /><BR/>
		<!-- select each reagent in turn -->
		<xsl:for-each select="link[@title='reagent']"> 
			<xsl:if test="*[@title='index']">
				<!-- if an index number is given, include it -->
				<xsl:value-of select="integer[@title='index']"/>)
			</xsl:if> 
			<xsl:for-each select="id(@href)">
				<!-- if a structure is given, include it -->
				<xsl:apply-templates select="list[./atom/float/@builtin='x3']"/>, 
			</xsl:for-each>
			<xsl:for-each select="string[@title='solvent']"> 
				<!-- if a solvent is given, include it -->
				<xsl:value-of select="text()"/>, 
			</xsl:for-each>
			<!-- etc. for temperature, duration and notes -->
		</xsl:for-each>
		</TD><TD>
		<!-- select each link to a product -->
		<xsl:for-each select="link[@title='product']"> 
			<xsl:for-each select="id(@href)">
				<!-- call a template that builds the Marvin applet -->
				<xsl:apply-templates select="list[./atom/float/@builtin='x3']"/>
			</xsl:for-each>
		</xsl:for-each>
		</TD>
		</TR></TABLE>
	</TD></TR>
	</xsl:for-each>
	</TABLE>
</xsl:template>
..

The code to convert the molecular information to .mol and build the Marvin applet is almost identical to that described in Chapter 4 and has been omitted. The stylesheet is significantly more structured than previous examples, and contains larger amounts of HTML. In particular the requirement for TABLE tags restricts the flexibility of the stylesheet. More complex reactions would require dedicated stylesheets written specially for them. An example is the catalytic cycle shown in Figure 6.1.4. Alternative approaches would be to use the JME applet (which can handle limited reactions), or SVG.
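The id(@href) lookups used by the stepwise stylesheet can be mimicked in a few lines of Python, which may help when checking reaction markup without a browser. This is a sketch under stated assumptions rather than a replacement for the stylesheet: it prints molecule titles instead of building applets, reports only the solvent from each reagent, and the function name `describe_steps` is hypothetical.

```python
import xml.etree.ElementTree as ET


def describe_steps(xml_text):
    """Summarise each step of every 'stepwise' reaction as one line of text."""
    root = ET.fromstring(xml_text)
    # Equivalent of id(): a lookup table from @id value to element.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    summaries = []
    for reaction in root.iter("reaction"):
        if reaction.get("convention") != "stepwise":
            continue
        for step in reaction.findall("list[@title='reactionStep']"):
            def titles(role):
                # Follow each link of the given role to the molecule it names.
                links = step.findall(f"link[@title='{role}']")
                return [by_id[link.get("href")].get("title", "?")
                        for link in links if link.get("href") in by_id]

            solvents = [s.text for s in
                        step.findall("link[@title='reagent']/string[@title='solvent']")]
            arrow = f"-[{', '.join(solvents)}]->" if solvents else "->"
            summaries.append(
                f"{' + '.join(titles('reactant'))} {arrow} {' + '.join(titles('product'))}"
            )
    return summaries


document = (
    '<cml><reaction convention="stepwise" id="rxn"><list title="reactionStep" id="s">'
    '<link title="reactant" href="a"/><link title="product" href="b"/>'
    '</list></reaction><molecule title="A" id="a"/><molecule title="B" id="b"/></cml>'
)
print(describe_steps(document))
```

As in the stylesheet, the molecules may sit anywhere in the document because they are located by @id rather than by position.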

Figure 6.1.4: Cross Metathesis (example of a catalytic cycle)

[Rendered reaction scheme. Cycle 4 Reaction: Mechanism of Cross Metathesis; id: cycle_rxn_1; Description: Catalytic cycle; Yield: (not given); Notes: from Dr. Braddock's lecture notes. The molecular structures are displayed by Java applets and are not reproduced here.]

7 - Converting Legacy Formats to CML

Manually creating CML from scratch is unfeasibly laborious and so alternatives are needed. It is hoped that future chemistry software will include filters enabling them to save data directly to CML, but until this occurs the best solution is to convert existing legacy formats to CML. This has the advantage of 'normalising' the existing range of diverse chemical files to a single universal format. The extensibility of XML (and hence CML) and the use of generic data holders means that any ASCII based legacy file can be converted to CML without information loss.

CML has been designed to make these conversions as easy as possible. In most cases (e.g. .mol or JCAMP) this involves identifying blocks of data and wrapping them in the appropriate CML elements. This is equivalent to a complex text search and replace operation. Perl (a widely used text manipulation and report language) is an excellent tool for carrying out exactly this sort of manipulation.⁴⁶ It is open source and, whilst originally designed for Unix, versions are available for all major operating systems. Perl scripts are plain text files and are normally run from the command line. With slight alterations, they can also be run as CGI (common gateway interface) programs on a web server. This allows text written into an online <FORM> to be sent to the server, converted and returned to the user as an HTML page. This is demonstrated in Figure 7.1.

Figure 7.1: Server side conversion of .mol, .rxn and .dx files to CML - http://www.ch.ic.ac.uk/chimeral/resources/file2cml.html

Converts MDL .mol, REACCS .rxn and JCAMP .dx files to CML
Either select an example file or;
1) Open file in a text editor
2) Copy and paste file contents into the textarea
3) Fill in the other fields (they are used in 'id') and select file and display types
4) Press GO

Form fields: File Type, Title, Author, Date, File
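The idea of identifying blocks of data and wrapping them in CML elements can be illustrated with a short sketch. The project's converters are Perl scripts; the fragment below is written in Python purely as an illustration and is not a translation of them. It assumes a fixed-width V2000 .mol file (counts on the fourth line, then the atom block, then the bond block), and the builtin names used on the output elements (elementType, x3, y3, z3, atomRef, order) are assumptions of this sketch. The names `SAMPLE_MOL` and `mol_to_cml`, and the sample coordinates, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative input only: a three-atom V2000 connection table.
SAMPLE_MOL = "\n".join([
    "water",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "  1  3  1  0  0  0  0",
    "M  END",
])


def mol_to_cml(mol_text, mol_id="mol_1"):
    """Wrap the atom and bond blocks of a V2000 .mol file in CML-style elements."""
    lines = mol_text.splitlines()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    molecule = ET.Element("molecule", {"title": lines[0].strip(), "id": mol_id})

    atoms = ET.SubElement(molecule, "list", {"title": "atoms"})
    for i, line in enumerate(lines[4:4 + n_atoms], start=1):
        atom = ET.SubElement(atoms, "atom", {"id": f"{mol_id}_a_{i}"})
        ET.SubElement(atom, "string", {"builtin": "elementType"}).text = line[31:34].strip()
        # Coordinates occupy three fixed-width fields of ten characters each.
        for name, start in (("x3", 0), ("y3", 10), ("z3", 20)):
            ET.SubElement(atom, "float", {"builtin": name}).text = line[start:start + 10].strip()

    bonds = ET.SubElement(molecule, "list", {"title": "bonds"})
    first_bond = 4 + n_atoms
    for j, line in enumerate(lines[first_bond:first_bond + n_bonds], start=1):
        bond = ET.SubElement(bonds, "bond", {"id": f"{mol_id}_b_{j}"})
        # Atom numbers in the bond block become references to the atom ids.
        for field in (line[0:3], line[3:6]):
            ET.SubElement(bond, "string", {"builtin": "atomRef"}).text = f"{mol_id}_a_{int(field)}"
        ET.SubElement(bond, "string", {"builtin": "order"}).text = line[6:9].strip()
    return molecule


print(ET.tostring(mol_to_cml(SAMPLE_MOL, "mol_water"), encoding="unicode"))
```

Deriving every @id from the molecule's own id keeps identities unique when several converted molecules are placed in one document.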

A range of Perl converters have been written and are available for use from the ChiMeraL site. Perl cannot easily access compressed data and, since both .mol and JCAMP files sometimes use compressed sections, they must be expanded before conversion. Describing the workings of each converter is beyond the scope of this report and interested parties should read the script comments. Available converters include:

MDL .mol to CML
MDL .mol to the JME input string
.sd (archive format) to CML
.xyz to CML
JCAMP .dx and .cs to CML (JCAMP-CS files produce both a <molecule> and a <spectrum>)
REACCS .rxn (reaction format based on .mol) to CML

Converters for Gaussian and Mopac source files (amongst others) are under development.⁴⁷ We are experimenting with combining these scripts with a web 'robot' written by G. Gkoutos (Imperial College, Chemistry Department). The robot is able to identify and collect a large range of chemical file formats. The intention is to construct a large archive of CML molecules and spectra to support this project.

8 - Writing Complete XML Documents

Pure CML documents are best suited to chemical data-sheets or archived data. A more flexible approach is to combine chemistry with text and diagrams to form a report or scientific paper. Namespacing allows different XML languages to be intimately mixed and an excellent text formatting language already exists in XHTML. A sophisticated XML document might therefore consist of XHTML formatted text, chemistry using CML and diagrams either as bitmaps (gif or jpg) or drawn in SVG. An XML document language is also needed to describe the document's structure and to wrap these various components. A simple 'docuML' language has been developed to illustrate these concepts and used to write several documents, including this report.

A series of document structure elements are defined using an XML schema (http://www.ch.ic.ac.uk/chimeral/karne_schema_01.xml). These elements can be recognised by their namespace prefix 'karne:' and include: <document>, <metadata>, <abstract>, <chapter>, <subsection> and <bibliography> (Figure 8.1). Additional elements are used to label figures and 'escape' blocks of sample code (this prevents them interfering with real XML).

Figure 8.1: Simplified XML document - outlines the document structure

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="document.xsl" ?>
<karne:document title="Skeleton" id="xmldoc_skele"
	xmlns="http://www.w3.org/1999/xhtml"
	xmlns:karne="x-schema:http://www.ch.ic.ac.uk/chimeral/karne_schema_01.xml"
	xmlns:chimeral="x-schema:http://www.ch.ic.ac.uk/chimeral/spectrum_schema_ie_01.xml">

<!-- *********************** metadata *********************** -->
<karne:metadata>
	<karne:date builtin="date:creation">June 12, 2000</karne:date>
	<karne:author idref="insti_ic" id="author_mw" email="karne@innocent.com"
	 href="http://www.ch.ic.ac.uk/chimeral">Michael Wright</karne:author>
	<karne:institution id="insti_ic" href="http://www.ch.ic.ac.uk">
	Department of Chemistry, Imperial College of Science, Technology and Medicine, UK
	</karne:institution>
</karne:metadata>

<!-- *********************** abstract *********************** -->
<karne:abstract>Demonstrates the key elements of an XML document.</karne:abstract>

<!-- *********************** keywords *********************** -->
<karne:keywords>ChiMeraL, XML, CML, chemical markup language, chemistry</karne:keywords>

<!-- *********************** chapter *********************** -->
<karne:chapter id="chap_introduction" title="Introduction">
	<karne:index/>
	<P>A document is made up of nested chapters and subsections.</P>

<!-- *********************** subsection *********************** -->
<karne:subsection id="sub_xhtml" title="XHTML">
	<karne:index/>
	<P>Elements with no prefix are assumed to be XHTML by the stylesheet and 
passed directly to the output. This allows the author to use familiar <I>'HTML 
like'</I> formatting without disrupting other XML blocks. Example code needs to 
be escaped; 
<FONT CLASS="code"><karne:lt/>formula<karne:gt/>C4 H8 O<karne:lt/>/formula<karne:gt/></FONT>
</P>
</karne:subsection>
</karne:chapter>

<!-- *********************** bibliography *********************** -->
<karne:bibliography title="References">
	<karne:ref id="ref_CML1.0">For a formal description of CML version 1.0, see
	<karne:index>1</karne:index>
		<karne:author href="http://www.xml-cml.org/">
			P. Murray-Rust and H. S. Rzepa</karne:author>
		<karne:publication>J. Chem. Inf. Comp. Sci.</karne:publication>
		<karne:date>1999</karne:date>
		<karne:volume>39</karne:volume>
		<karne:pages>928</karne:pages>
	</karne:ref>
</karne:bibliography>

</karne:document>

An XSL stylesheet has been written combining the CML template fragments with a number of templates designed to format the various document components. The stylesheet is generic and able to display any document written using this docuML language. In order to allow the author to use familiar (X)HTML for text formatting, the stylesheet has been designed to leave unrecognised elements unchanged. These pass through to the output, are recognised by the browser as HTML and formatted appropriately. By setting the document's default namespace to XHTML, no element prefixes are required and the document can be written using standard HTML editing tools. A CSS stylesheet is also used, allowing changes to the appearance of the transformed document without having to edit the significantly more complex XSL.

Both CML and docuML components have @id attributes, allowing sophisticated linking behaviour. The stylesheet can display components in any order and is not restricted by the physical order of the code in the XML source. CML molecules, spectra and blocks of sample code can be extremely long and would normally interrupt the chapter/subsection structure of the document, making it much harder for a human to read or edit. The preferred approach is to separate out CML data and sample code and move them to the end of the document. Links (<karne:link builtin="molecule" display="jmol" idref="mol_thf_1"/>) can then be used within text or figures to reference these external blocks as they are required.

Figure 8.2: Part of an XML document showing linking

..
<!-- *********************** subsection *********************** -->
<karne:subsection id="sub_idlinks" title="Linking">
<karne:index/>
	<P>Links between @id and @idref are used to automatically build references 
to the bibliography<karne:link builtin="ref" idref="ref_CML1.0"/> or 
<karne:link builtin="figure" idref="fig_Jmol"/>. The reference will include the 
correct index number, hyperlinks and mouse-over effects, even if the target is 
moved or completely rewritten. This helps avoid the notorious HTML problem - 
"broken links". Links are also used to build the document's index.</P>

<!-- figure title -->
<karne:figure id="fig_Jmol" builtin="showLabel">Tetrahydrofuran using Jmol</karne:figure>

<!-- this tells the stylesheet to render the CML object with @id="mol_thf_1" 
using Jmol, changing @display changes the applet used -->
<karne:link builtin="molecule" display="jmol" idref="mol_thf_1"/>
</karne:subsection>
..
<!-- *********************** CML data block *********************** -->
<cml title="cml data block" id="cml_moldata" xmlns="x-schema:cml_schema_ie_02.xml">
	<molecule title="Furan, tetrahydro-" id="mol_thf_1">
		..
	</molecule>
	<chimeral:spectrum title="Caffeine" id="spect_caffeine_ms_1" convention="JCAMP-DX= 4.24">
		..
	</chimeral:spectrum>
</cml>
..

Links can also be used to automatically build and maintain literature or figure references. For example, the stylesheet can search for the start of each chapter, sub-section and figure, then use this information to automatically create an index of titles complete with hyperlinks. Document components can be reordered or even completely rewritten, and these links will be automatically updated. Figure 8.2 shows references to a figure and the bibliography (lines 6 and 7) and a link to a CML molecule (line 17). Although the last instructs the stylesheet to use the Jmol applet to render the molecule, a different applet can be chosen simply by changing the value of @display.
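The automatic index of titles can be illustrated with a small sketch. It is written in Python rather than XSL and is not the stylesheet used for this report. It assumes the karne: namespace URI shown in Figure 8.1 and numbers only chapters and their direct subsections; the function name `build_index` and the numbering format are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared for the karne: prefix in Figure 8.1.
KARNE = "x-schema:http://www.ch.ic.ac.uk/chimeral/karne_schema_01.xml"


def build_index(xml_text):
    """Return (number, id, title) entries for each chapter and subsection."""
    root = ET.fromstring(xml_text)
    entries = []
    for c, chapter in enumerate(root.findall(f"{{{KARNE}}}chapter"), start=1):
        entries.append((str(c), chapter.get("id"), chapter.get("title")))
        for s, sub in enumerate(chapter.findall(f"{{{KARNE}}}subsection"), start=1):
            entries.append((f"{c}.{s}", sub.get("id"), sub.get("title")))
    return entries


document = (
    f'<karne:document xmlns:karne="{KARNE}">'
    '<karne:chapter id="chap_introduction" title="Introduction">'
    '<karne:subsection id="sub_xhtml" title="XHTML"/>'
    '</karne:chapter></karne:document>'
)
for number, target, title in build_index(document):
    print(f'{number} {title} (#{target})')
```

Because the numbers are derived from document order at transformation time, reordering chapters changes the index without any edit to the links themselves.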

8.1 - Printing XML - Formatting Object and FOP

Formatting Objects (FO) is a sub-language of XSL, intended to describe the layout of document components on a printed page.²⁴ It comprises formatting elements rather similar in concept to XHTML but describing exact positions and font sizes, in contrast to XHTML's relative positioning. FOP is a Java application being developed by Apache.⁷ When it is combined with a Java XML/XSL parser (e.g. Xalan) it acts as a formatting object parser and can convert an FO file to an Adobe Acrobat .pdf file. Acrobat is very widely used for exchanging 'ready for print' electronic documents and, being a read-only format, has important legal implications.

Using multiple stylesheets, an XML document can be transformed for browser display or converted to an FO file for printing. Since the stylesheets can optimise the document for each purpose, the results are much superior to printing directly from the browser. The paper version of this report was produced in this way. It is expected that further converters will become available as this technology matures.

8.2 - Reusable journals

Journals are currently created for humans to read. Some contain machine readable supplemental material but the formats are not usually standardised. Articles in XML/CML have the revolutionary potential of being read by machines. It is easy to extract all the <cml:molecule> elements in an article or, indeed, in a whole year's publication.

The reader can then ask sophisticated questions of these data, and hypothetical examples could be:

what molecules contain a given functional group?
where do molecules and their spectra co-occur?
extract all molecules which take part in reactions

Note that XML/CML allows very flexible queries so that the author of the paper(s) need not necessarily anticipate the use that the publication will be put to. Given the relative cheapness of theoretical chemistry calculations, the orbitals of all (simple) molecules could be found and, from the XML output, those with given energies extracted. Or compounds not in the user's database could be listed for potential synthesis; in time the synthesis could be automatic!

This article is the first chemical article completely in XML. We expect this to become the standard for chemistry journals and urge the community to develop and adopt the technology. Key aspects will be ease of authoring and we are currently collaborating on how XML/CML systems can be developed simply and distributed easily. When they mature they will allow data checking at authorship time, extensive crosslinking of information and normalisation of existing information.

8.3 - In Conclusion

The chemical community is showing increased interest in CML. Further development of the language, its extensions and applications of its use are required to maintain this interest. It is hoped that by demonstrating the first fully operational systems for managing CML and creating complex CML/XHTML documents, we will have provided the basis for this work.

Applications for the CML/XHTML system described here might include: e-journals, safety data-sheets (for example, as part of FDA drug ratification), patent processing (where microscopic capture of information is essential, and failure can lead to the patent being challenged) or e-commerce. One can envisage future parsers able to comprehend both CML and XHTML without the requirement for sophisticated stylesheets.⁴⁹ By including CML support in all computational chemistry and modelling software, a 'rolling stone' approach can be taken whereby each process the molecule passes through (structure optimisation, property calculation etc.) involves full information retention. This is in stark contrast to present solutions where information is lost at every stage. Enhanced applet-CML communication and the possibility of producing CML from an editor⁴⁸ also have great potential. Ultimately, server-side processing of XML/XSL stylesheets will allow powerful stylesheet transformations to be accessible using any web browser.

All source code and examples written for this project are available from the ChiMeraL website (http://www.ch.ic.ac.uk/chimeral/).

Acknowledgements

Many thanks to: Henry Rzepa for direction and advice, Peter Murray-Rust (CML, SVG and for recoding Jspec), Steve Zara (XSL stylesheets), Tom Grey (Jmol applet), Michelle Osmond (Perl converters), Georgios V. Gkoutos (legacy files) and Eric Schaeffer (FOP).

Appendix A - List of ChiMeraL Resources

Demonstrations

Markup and display of molecules, spectra and reactions
Multiple stylesheet selection and the use of data islands
Testing XML/XSL functionality with various browsers
Scaling test - ECTOC database (9 Mb XML file)
Writing a complex XML document and display through IE and .pdf (via FOP)

Schema

CML schema 0.2 - based on published DTD
CML schema 0.2 (IE version) - includes unique @id and data links
ChiMeraL schema 0.1 - extends CML to include <spectrum>
Document schema 0.1 - developed to write this report

XSL Stylesheets

generic.xsl - displays all CML
datasheet.xsl - property information as an HTML table
jme.xsl - Java Molecular Editor applet - editor
jme_format.xsl - JME data string
marvinview.xsl - Marvin applet, 3D viewer mode
marvinedit.xsl - Marvin applet, editor mode
sda.xsl - Structure Drawing Applet - editor
mol.xsl - MDL .mol file format
jmol.xsl - Open Source applet - 3D viewer
xyz.xsl - .xyz file format
jspec.xsl - Java Spectrum Viewer - viewer
jcamp.xsl - JCAMP .dx format
reaction.xsl - stepwise, linear and cycle.4 reactions using Marvin
document.xsl - displaying complex XML document in IE5
document.css - CSS stylesheet for above
doc2fo.xsl - convert complex XML document to formatting object (for use with FOP)

Perl converters

MDL .mol to CML
MDL .mol to JME format
.sd to CML (multiple mol archive)
.xyz to CML
JCAMP .dx and .cs to CML
REACCS .rxn to CML
.rd to CML
The .mol, .rxn and .dx to CML converters are all available as an online CGI script

A large number of example CML files are also available; these contain properties, structures, spectra and reactions. Further archives of CML files will be made available as they are converted.
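To illustrate what a legacy-to-CML converter of the kind listed above has to do, here is a minimal Python sketch of an .xyz to CML conversion. It is not the project's Perl converter. It assumes the common .xyz layout (atom count, a comment line, then one "symbol x y z" line per atom), and the function name, the sample geometry and the id scheme "<molecule id>_a_<n>" are assumptions chosen to resemble the atom-list syntax in Appendix B.

```python
import xml.etree.ElementTree as ET

# Illustrative input only: approximate coordinates, in Angstrom.
WATER_XYZ = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.467
H 0.000 -0.757 -0.467
"""


def xyz_to_cml(xyz_text, mol_id):
    """Build a CML <molecule> element holding a <list title="atoms">."""
    lines = xyz_text.strip().splitlines()
    atom_count = int(lines[0].split()[0])   # line 1: number of atoms
    title = lines[1].strip()                # line 2: free-text comment
    molecule = ET.Element("molecule", {"title": title, "id": mol_id})
    atoms = ET.SubElement(molecule, "list", {"title": "atoms"})
    for index, line in enumerate(lines[2:2 + atom_count], start=1):
        symbol, x, y, z = line.split()[:4]
        # Prefix the atom id with the molecule id so it stays document-unique.
        atom = ET.SubElement(atoms, "atom", {"id": f"{mol_id}_a_{index}"})
        ET.SubElement(atom, "integer", {"builtin": "atomId"}).text = str(index)
        for axis, value in zip(("x3", "y3", "z3"), (x, y, z)):
            ET.SubElement(atom, "float", {"builtin": axis, "units": "A"}).text = value
        ET.SubElement(atom, "string", {"builtin": "elementType"}).text = symbol
    return molecule
```

Calling `ET.tostring(xyz_to_cml(WATER_XYZ, "water"), encoding="unicode")` serialises the result as text. Since .xyz files carry no connectivity, the sketch writes no bond list.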

Appendix B - CML Syntax and Notes

The published CML 1.0 DTD² declares valid elements and attributes but puts few restrictions on how these elements and attributes are used. As a consequence, it is possible to mark up a CML object (e.g. a molecule) using a variety of different syntaxes. Whilst this gives great flexibility, it also makes it significantly more difficult to build CML applications. In particular, some of these syntaxes cannot be easily parsed using stylesheets. I have selected what I feel is the syntax best suited to XSL and small molecule markup. This, along with the syntax for chimeral:spectrum and reaction (still experimental), is as follows (comments on syntax are given as XML comments):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="document.xsl" ?>

<!-- Declares this document as XML and indicates the URL of its stylesheet -->

<document title="Lipids" id="cmldoc_karne_lipids"
	xmlns="http://www.w3.org/1999/xhtml"
	xmlns:chimeral="x-schema:http://www.ch.ic.ac.uk/chimeral/spectrum_schema_ie_01.xml"> 
	
<!-- <document> isn't part of CML but represents the top element of any XML compliant 
document, this might contain CML, XHTML, MaML etc. - note the XHTML and chimeral namespaces -->

<cml title="Cholesterol" id="cml_karne_cholesterol" 
	xmlns="x-schema:http://www.ch.ic.ac.uk/chimeral/cml_schema_ie_02.xml">
	 
<!-- The CML namespace points to a copy of the schema; <cml> may contain any number of 
CML 'objects' e.g. <molecule>, <reaction> or <chimeral:spectrum> -->

	<molecule title="cholesterol" id="mol_cholesterol">
		<formula>C27 H46 O</formula>
		
<!-- Information specific to this molecule is included here as <string>, <float> or <integer>
 - additional elements can be added as required but note the use of @title to label them. 
Alternate names are marked up as a <list> of <string>s -->

		<string title="CAS">57-88-5</string>
		<string title="ACX">I1001660</string>
		<string title="RTECS">FZ8400000</string>
		<float title="molecule weight">386.6598</float>
		<float title="melting point" units="degC">148.5</float>
		<float title="boiling point" units="degC">360</float>
		<float title="specific gravity">1.067</float>
		<list title="alternate names">
			<string title="name">Cholesterin</string>
			<string title="name">(3beta)-Cholest-5-en-3-ol</string>
			<string title="name">Cholest-5-en-3-ol (3beta)-</string>
			<string title="name">cholest-5-ene-3beta-ol</string>
			<string title="name">3beta-hydroxycholest-5-ene</string>
		</list>
		
<!-- The following is used for small molecular structures 
- this format is much preferred but rather verbose. <integer builtin="atomId"> 
would normally be that used in the MDL .mol format but in contrast, 'id' must be unique 
over (at least) this document. Additional strings are 'formalCharge' and 'hydrogenCount'. 
2D structures will use builtin="x2 | y2" but are otherwise the same -->

		<list title="atoms"> 
			<!-- repeat -->
			<atom id="cholesterol_a_1">
				<integer builtin="atomId">1</integer>
				<float builtin="x3" units="A">-1.9901</float>
				<float builtin="y3" units="A">2.1889</float>
				<float builtin="z3" units="A">-1.8776</float>
				<string builtin="elementType">H</string>
			</atom>
			<!-- /repeat (74 atoms) --> 
		</list>
		
<!-- Large molecular structures - this format is terse but much harder to format/refer 
to in XSL. I have chosen not to use it -->

		<atomArray id="methanol"> 
			<stringArray title="label">a1 a2 a3 a4 a5 a6</stringArray> 
			<stringArray builtin="elementType">C O H H H H</stringArray>
			<floatArray builtin="x3">-0.748 ..</floatArray> 
			<floatArray builtin="y3">-0.015 ..</floatArray> 
			<floatArray builtin="z3">0.024 ..</floatArray> 
			<integerArray builtin="formalCharge"></integerArray> 
		</atomArray>
		
<!-- A <list> of <bond>s is used for small molecules 
- large ones will probably ignore bonds and calculate them directly -->

		<list title="bonds"> 
			<!-- repeat --> 
			<bond id="cholesterol_b_1">
				<integer title="bondId">1</integer>
				<integer builtin="atomRef">2</integer>
				<integer builtin="atomRef">1</integer>
				<integer builtin="order" convention="MDL">1</integer>
			</bond>
			<!-- /repeat (77 bonds) -->
		</list>
	</molecule>
	
<!-- Elements in spectra tend to match the LDRs (labelled data records) in the JCAMP format. 
The namespace chimeral: is very important as spectrum isn't found in CML 1.0 -->

	<chimeral:spectrum title="Cholesterol" id="spect_cholesterol_ms_1"
	 	convention="JCAMP-DX=4.24">
		<string title="datatype">MASS SPECTRUM</string>
		<string title="EPA">67286</string>
		<string title="origin">T.IIDA NIHON UNIVERSITY, KORIYAMA,
		 FUKUSHIMA-KEN, JAPAN</string>
		<string title="owner">NIST Mass Spectrometry Data Center</string>
		<string title="spectrometer">LKB 9000</string>

<!-- The following information is required for the rendering of spectra in Jspec -->

		<float title="xunits">M/Z</float>
		<float title="yunits">RELATIVE ABUNDANCE</float>
		<float title="firstx" convention="M/Z">18</float>
		<float title="lastx" convention="M/Z">387</float>
		<float title="deltax" convention="M/Z"></float>
		<float title="xfactor">1</float>
		<float title="firsty" convention="RELATIVE ABUNDANCE">700</float>
		<float title="miny" convention="RELATIVE ABUNDANCE">100</float>
		<float title="maxy" convention="RELATIVE ABUNDANCE">9999</float>
		<float title="yfactor">1</float>
		<float title="npoints">169</float>
		
<!-- One of the following syntaxes is then used, depending on the type of spectrum -->

<!-- A: simplest and preferred spectra format. Try to avoid (X++(Y..Y)) -->

		<list title="xypairs" convention="(XY..XY)"> 
			<!-- repeat -->
			<coordinate2 id="cholesterol_ms_c_1">18, 700</coordinate2>
			<!-- /repeat (169 data pairs) --> 
		</list>
		
<!-- B: alternate format for peak tables rather than data -->

		<list title="peak table" convention="(XY..XY)">
			<!-- repeat -->
			<coordinate2 id="mol_s_1">X, Y</coordinate2>
			<!-- /repeat --> 
		</list>
		
<!-- C: convention often used for NMR peak tables is (XYM) -->

		<list title="peak table" convention="(XYM)"> 
			<!-- repeat -->
			<coordinate3 id="mol_s_1">X, Y, M</coordinate3>
			<!-- /repeat --> 
		</list>
	</chimeral:spectrum>
	
<!-- Reactions are made up of a series of lists, each list containing a number of links @href 
to molecule @id -->

	<reaction title="Reactions" id="simple_rxn_1" 
		convention="stepwise | linear | cycle.4 | ..">
		
<!-- Linear markup is much preferred since it's the simplest, others provided for formatting 
purposes. The following three elements can be used either at the reaction level or at 
each reaction step -->

		<string title="description">Diels-Alder cycloaddition</string>
		<float title="yield" units="%">88</float>
		<string title="notes">example</string>

<!-- Stepwise (x1 > y1, x2 > y2); use as many reactants, reagents and products 
as needed for each step, reaction steps will be displayed separately. Additional links to 
catalysts, intermediates, transition states etc. can be used as required -->

			<!-- repeat --> 
			<list title="reactionStep" id="simple_s_1">
				<link title="reactant" href="mol_x"/>
				<link title= "reagent" href= "mol_r">
					<integer title="index">1</integer>
					<string title="solvent">Acetonitrile</string>
					<string title="temperature" units="degC">100</string>
					<string title="duration" units="hours">3</string>
					<string title= "notes">reflux</string>
				</link>
				<link title="reagent">
					<integer title="index">2</integer>
					<string title="notes">workup</string>	
				</link>
				<link title="product" href="mol_y"/>
			</list>
			<!-- /repeat -->
			
<!-- Linear (x > y > z); reactant refers to the first reactants, product 
refers to the final product; intermediates use linearReactant and linearProduct	 -->

			<list title= "linearstep" id="step_1">
				<link title="reactant" href="mol_1"/>
				<link title="reagent" href="mol_r1">
					<!-- .. -->
				</link>
				<link title="linearProduct" href="mol_2"/>
			</list>
			<list title= "linearstep" id="step_2">
				<link title="linearReactant" href="mol_2"/>
				<link title="reagent" href="mol_r2">
					<!-- .. -->
				</link>
				<link title="linearProduct" href="mol_3"/>
			</list>
			<list title= "linearstep" id="step_3">
				<link title="linearReactant" href="mol_3"/>
				<link title="reagent" href="mol_r3">
					<!-- .. -->
				</link>
				<link title="product" href="mol_4"/>
			</list>
			
<!-- Catalytic cycle - much more complex (.. > x > y > z > ..) reactant and product 
refer to substances 'in' and 'out' of the cycle in each step, use cycleReactant and 
cycleProduct for 'within' the cycle. Markup should be cyclic 
(final cycleProduct == first cycleReactant) -->

			<list title="reactionStep" id="step_1">
   				<link title="cycleReactant" href="mol_1" id="cycle_lk_1"/>
   				<link title="reactant" href="mol_3" id="cycle_lk_2"/>
 				<link title="cycleProduct" href="mol_7" id="cycle_lk_3"/>
 			</list>
 			<list title="reactionStep" id="step_2">
   				<link title="cycleReactant" href="mol_7" id="cycle_lk_4"/>
   				<link title="cycleProduct" href="mol_6" id="cycle_lk_5"/>
   				<link title="product" href="mol_8" id="cycle_lk_6"/>
 			</list>
 			<list title="reactionStep" id="step_3">
   				<link title="cycleReactant" href="mol_6" id="cycle_lk_7"/>
   				<link title="reactant" href="mol_5" id="cycle_lk_8"/>
   				<link title="cycleProduct" href="mol_4" id="cycle_lk_9"/>
 			</list>
 			<list title="reactionStep" id="step_4">
   				<link title="cycleReactant" href="mol_4" id="cycle_lk_10"/>
   				<link title="cycleProduct" href="mol_1" id="cycle_lk_11"/>
   				<link title="product" href="mol_2" id="cycle_lk_12"/>
 			</list>
			
<!-- list title="atomMap" could be placed within each step -->

 	</reaction>
</cml>
</document>
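The reaction syntax above relies on each <link> @href pointing at the @id of a molecule elsewhere in the document. As a rough sketch of how a consumer other than an XSL stylesheet might follow those links, the Python below resolves every link in a stepwise reaction to the @title of its target. The fragment is a cut-down, namespace-free variant of the listing, and the molecule titles and the function name are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Cut-down, namespace-free variant of the stepwise reaction syntax.
# The molecule titles are placeholders, not real compounds.
DOC = """
<cml>
  <molecule title="molecule X" id="mol_x"/>
  <molecule title="molecule Y" id="mol_y"/>
  <reaction title="Reactions" id="simple_rxn_1" convention="stepwise">
    <list title="reactionStep" id="simple_s_1">
      <link title="reactant" href="mol_x"/>
      <link title="reagent"><string title="notes">workup</string></link>
      <link title="product" href="mol_y"/>
    </list>
  </reaction>
</cml>
"""


def resolve_steps(xml_text):
    """Return [(step id, [(role, target title or None), ...]), ...]."""
    root = ET.fromstring(xml_text)
    # Index every element that carries an @id so links can be looked up.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    steps = []
    for step in root.iter("list"):
        if step.get("title") != "reactionStep":
            continue
        roles = []
        for link in step.findall("link"):
            target = by_id.get(link.get("href"))
            # A link without @href (e.g. a workup note) has no target.
            name = target.get("title") if target is not None else None
            roles.append((link.get("title"), name))
        steps.append((step.get("id"), roles))
    return steps
```

The lookup table is keyed on @id alone, which is why the syntax requires ids to be unique across the whole document rather than within a single molecule.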

Appendix C - CML Schema (IE 0.2) Notes

This version of the schema is based directly on the CML 1.0 DTD but includes datatype declarations for use in IE 5.x. These datatypes allow the use of @id and @href for intra-document linking. A platform independent version (plus schemas for chimeral: and karne:) is available from the ChiMeraL site.¹ Comments on syntax are given as XML comments.

<?xml version="1.0"?>
<Schema name="cml_dev_karne" 
	xmlns="urn:schemas-microsoft-com:xml-data" 
	xmlns:dt="urn:schemas-microsoft-com:datatypes">
<description>
CML development - Version 0.2 - 7/4/00

This document is the first draft of an XML Schema compatible with CML
V1.0 published in JCICS... In converting the schema we have suggested
some closed and some open content models. As CML develops it seems likely
that there will be advantage in opening some of the models (e.g. angle),
perhaps for annotations and ancillary information. Readers should note
that the XML Schema activity is still at draft stage and that this schema
may be revised in the future for compatibility. 

This schema is intended for use with IE5 - a platform independent version
is available. XML documents can be validated against this schema by adding
xmlns="x-schema:URL" within the cml element. 

Please see cml_schema_ie_02.html for further comments and explanations.

Peter Murray-Rust, Henry Rzepa, Michael Wright

Comments to Michael Wright - karne@innocent.com
</description>

<!-- ********** Attribute Types ********** -->
<!-- *** common *** -->

<!-- It is expected that these attributes will be found on almost all elements. 
@title is used for display and general labelling and @id indicates a document 
unique identity string for that element. This identity can then be used to 
reference that element (and hence the object it represents) by the use of 
@href. Care must be taken that @id is unique. The use (for example) of 
<atom id="1"> will cause trouble in a document with more 
than one molecule and should be avoided. Note that @title on data elements 
(string/integer/float) can be used to markup data for which there is no 
explicit CML element (e.g. <string title="CAS">58-08-2</string>) 
and @convention should be declared if this is not obvious. @builtin implies 
an element with a meaning predefined in the DTD -->

<AttributeType name="title" required="no"/>
<AttributeType name="id" required="no" dt:type="id"/>			
<AttributeType name="convention" required="no"/>
<AttributeType name="builtin" required="no" dt:type="enumeration"
	dt:values="x2 y2 xy2 x3 y3 z3 xyz3 xFract yFract zFract xyzFract elementType
		atomId isotope occupancy hydrogenCount atomParity residueType residueId 
		formalCharge atomRef atomRefs length order stereo acell bcell ccell alpha 
		beta gamma z spacegroup" />


<!-- *** linkers *** -->

<!-- Designed to allow the linking of an element to another via @id. For example 
<link href="mol_543"> indicates a link to any element with @id="mol_543".
@unitsRef and @dictRef are intended for future use -->

<AttributeType name="href" required="no" dt:type="idrefs"/>
<AttributeType name="dictRef" required="no" dt:type="idrefs"/>
<AttributeType name="unitsRef" required="no" dt:type="idrefs"/>
<AttributeType name="atomRef" required="no" dt:type="idref"/>
<AttributeType name="atomRefs" required="no" dt:type="idrefs"/>

<!-- *** quantifiers/constraints *** -->

<!-- Various constraints on the values of data elements, only 'units' is in common use -->

<AttributeType name="count" required="no"/>
<AttributeType name="size" required="no"/>
<AttributeType name="rows" required="no"/>
<AttributeType name="columns" required="no"/>
<AttributeType name="min" required="no"/>
<AttributeType name="max" required="no"/>
<AttributeType name="units" required="no"/>

<!-- ********** Element Types ************ -->
<!-- *** data  *** -->

<!-- These elements are intended to contain only text strings and attributes - they may 
not contain other elements - and are hence 'closed' (no additional markup can be 
added beyond the schema). If additional data holders are required, they should be of 
the form <string title="CAS">58-08-2</string> 
where the title value is used to identify the element's contents. For a full list of 
attributes, please see the schema -->

<ElementType name="string" content="textOnly" model="closed">
	<attribute type="id"/>
	<attribute type="builtin"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="float" content="textOnly" model="closed">
	<attribute type="id"/>
	<attribute type="builtin"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="min"/>
	<attribute type="max"/>
	<attribute type="units"/>
	<attribute type="unitsRef"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="integer" content="textOnly" model="closed">
	<attribute type="id"/>
	<attribute type="builtin"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="min"/>
	<attribute type="max"/>
	<attribute type="units"/>
	<attribute type="unitsRef"/>
	<attribute type="dictRef"/>
</ElementType>

<!-- The arrays should be used only when large amounts of data need 
to be stored. In all cases explicit markup of each unit of data is preferred since this 
is much easier to manipulate with a stylesheet. In some cases this may not be possible 
(e.g. large molecules with 200+ atoms) and arrays can be used. These would probably need 
to be parsed by dedicated chemical tools. floatMatrix is intended for computational 
chemistry -->

<ElementType name="stringArray" content="textOnly" model="closed">
	<attribute type="id"/>
	<attribute type="builtin"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="size"/>
	<attribute type="min"/>
	<attribute type="max"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="floatArray" content="textOnly" model="closed">
	<attribute type="id"/>
	<attribute type="builtin"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="size"/>
	<attribute type="min"/>
	<attribute type="max"/>
	<attribute type="units"/>
	<attribute type="unitsRef"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="integerArray" content="textOnly" model="closed">
	<attribute type="id"/>
	<attribute type="builtin"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="size"/>
	<attribute type="min"/>
	<attribute type="max"/>
	<attribute type="units"/>
	<attribute type="unitsRef"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="floatMatrix" content="textOnly" model="closed">
	<attribute type="id"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="rows"/>
	<attribute type="columns"/>
	<attribute type="min"/>
	<attribute type="max"/>
	<attribute type="units"/>
	<attribute type="unitsRef"/>
	<attribute type="dictRef"/>
</ElementType>

<!-- Coordinates consist of 2 or 3 comma separated numbers. They should 
be used for logically linked number groups - e.g. a data point in a spectrum -->

<ElementType name="coordinate2" content="textOnly" model="closed">
	<!-- use for data pairs (e.g spectra xy) -->
	<attribute type="id"/>
	<attribute type="builtin"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="units"/>
	<attribute type="unitsRef"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="coordinate3" content="textOnly" model="closed">
	<!-- use for data triplets (e.g spectra x, y, m) -->			
	<attribute type="id"/>
	<attribute type="builtin"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="units"/>
	<attribute type="unitsRef"/>
	<attribute type="dictRef"/>
</ElementType>

<!-- Angle and torsion haven't yet been developed, hence they have been left open -->

<ElementType name="angle" content="textOnly" model="open">		
	<attribute type="id"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="atomRefs"/>
	<attribute type="min"/>
	<attribute type="max"/>
	<attribute type="units" default="deg"/>
	<attribute type="unitsRef"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="torsion" content="textOnly" model="open">		
	<attribute type="id"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="atomRefs"/>
	<attribute type="min"/>
	<attribute type="max"/>
	<attribute type="units" default="deg"/>
	<attribute type="unitsRef"/>
	<attribute type="dictRef"/>
</ElementType>

<!-- *** mixed *** -->

<!-- Link is used as a 'holder' element - e.g. for 'href' or 'unitsRef' 
and indicates a logical link to another CML object -->

<ElementType name="link" content="mixed" model="open">
	<attribute type="id"/>
	<attribute type="title"/>
	<attribute type="href"/>
</ElementType>	

<ElementType name="formula" content="mixed" model="open">
	<attribute type="id"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="count"/>
	<attribute type="dictRef"/>
</ElementType>
	
<!-- *** structural *** -->	

<!-- These elements define the tree structure of the CML document and are not expected to 
contain text strings - only other elements. Since they are 'open', additional sub elements 
- whether CML or not - can be added with correct namespacing. Attributes and elements 
for these elements have been declared in the schema but these are only suggestions - 
the DTD allows anything. Many, like electron and the crystallographic markup, have yet 
to be developed -->

<ElementType name="atom" content="eltOnly" model="open" order="many">
	<attribute type="id"/>
	<attribute type="title"/>
	<attribute type="convention" default="mol"/>
	<attribute type="count"/>
	<attribute type="dictRef"/>
	<element type="float" minOccurs="0" maxOccurs="*"/>
	<element type="string" minOccurs="0" maxOccurs="*"/>
	<element type="integer" minOccurs="0" maxOccurs="*"/>
	<element type="link" minOccurs="0" maxOccurs="*"/>
</ElementType>

<ElementType name="bond" content="eltOnly" model="open" order="many">
	<attribute type="id"/>
	<attribute type="convention" default="mol"/>
	<attribute type="atomRef"/>
	<attribute type="atomRefs"/>
	<!-- avoid using these and use a subelement with builtin="atomRef" -->
	<element type="integer" minOccurs="0" maxOccurs="*"/>
	<element type="float" minOccurs="0" maxOccurs="*"/>
	<element type="string" minOccurs="0" maxOccurs="*"/>
	<element type="link" minOccurs="0" maxOccurs="*"/>
	<element type="angle" minOccurs="0" maxOccurs="*"/>
	<element type="torsion" minOccurs="0" maxOccurs="*"/>
</ElementType>

<ElementType name="list" content="eltOnly" model="open" order="many">
	<attribute type="id"/>
	<attribute type="title"/>
	<!-- required: atoms/bonds/xypairs/peak table/atom map-->
	<attribute type="convention"/>
	<!-- should be included for spectra: (XY..XY), (XYM) -->
	<element type="string" minOccurs="0" maxOccurs="*"/>
	<element type="integer" minOccurs="0" maxOccurs="*"/>
	<element type="float" minOccurs="0" maxOccurs="*"/>
	<element type="floatArray" minOccurs="0" maxOccurs="*"/>
	<element type="stringArray" minOccurs="0" maxOccurs="*"/>
	<element type="integerArray" minOccurs="0" maxOccurs="*"/>
	<element type="floatMatrix" minOccurs="0" maxOccurs="*"/>
	<element type="angle" minOccurs="0" maxOccurs="*"/>
	<element type="torsion" minOccurs="0" maxOccurs="*"/>
	<element type="coordinate2" minOccurs="0" maxOccurs="*"/>
	<element type="coordinate3" minOccurs="0" maxOccurs="*"/>
	<element type="link" minOccurs="0" maxOccurs="*"/>
	<element type="formula" minOccurs="0" maxOccurs="*"/>
	<element type="atom" minOccurs="0" maxOccurs="*"/>
	<element type="bond" minOccurs="0" maxOccurs="*"/>
</ElementType>

<ElementType name="atomArray" content="eltOnly" model="open" order="many">
	<!-- only use for large molecules -->
	<attribute type="id"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="dictRef"/>
	<element type="stringArray" minOccurs="0" maxOccurs="*"/>
	<element type="floatArray" minOccurs="0" maxOccurs="*"/>
	<element type="integerArray" minOccurs="0" maxOccurs="*"/>
	<element type="link" minOccurs="0" maxOccurs="*"/>
</ElementType>

<ElementType name="bondArray" content="eltOnly" model="open" order="many">
	<!-- only use for large molecules -->
	<attribute type="id"/>
	<attribute type="convention"/>
	<attribute type="dictRef"/>
	<element type="stringArray" minOccurs="0" maxOccurs="*"/>
	<element type="floatArray" minOccurs="0" maxOccurs="*"/>
	<element type="integerArray" minOccurs="0" maxOccurs="*"/>
	<element type="link" minOccurs="0" maxOccurs="*"/>
</ElementType>

<ElementType name="electron" content="eltOnly" model="open">
	<attribute type="id"/>
	<attribute type="count"/>
	<attribute type="convention"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="crystal" content="eltOnly" model="open">
	<attribute type="id"/>
	<attribute type="title" default="crystal"/>
	<attribute type="convention"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="sequence" content="eltOnly" model="open">
	<attribute type="id"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="feature" content="eltOnly" model="open">
	<attribute type="id"/>
	<attribute type="title"/>
	<attribute type="convention"/>
	<attribute type="dictRef"/>
</ElementType>

<ElementType name="molecule" content="eltOnly" model="open" order="many">
	<attribute type="id"/>
	<attribute type="title" default="molecule"/>
	<attribute type="convention" default="mol"/>
	<attribute type="dictRef"/>
	<element type="formula" minOccurs="0" maxOccurs="*"/>
	<element type="list" minOccurs="0" maxOccurs="*"/>
	<element type="atomArray" minOccurs="0" maxOccurs="*"/>
	<element type="bondArray" minOccurs="0" maxOccurs="*"/>
	<element type="string" minOccurs="0" maxOccurs="*"/>
	<element type="float" minOccurs="0" maxOccurs="*"/>
	<element type="integer" minOccurs="0" maxOccurs="*"/>
	<element type="link" minOccurs="0" maxOccurs="*"/>
</ElementType>

<ElementType name="reaction" content="eltOnly" model="open" order="many">
	<attribute type="id"/>
	<attribute type="title" default="reaction"/>
	<attribute type="convention"/>
	<attribute type="dictRef"/>
	<element type="string" minOccurs="0" maxOccurs="*"/>
	<element type="float" minOccurs="0" maxOccurs="*"/>
	<element type="integer" minOccurs="0" maxOccurs="*"/>
	<element type="list" minOccurs="0" maxOccurs="*"/>
</ElementType>

<!-- **********  root ********** -->

<!-- This is the general 'holder' for all CML documents. Normally this 
would then be embedded within an XML <document>. Namespaces/schema 
should be declared within this element or the document root -->

<ElementType name="cml" content="mixed" model="open" order="many">
	<attribute type="id"/>
	<attribute type="title" default="cml document"/>
	<attribute type="convention"/>
	<attribute type="dictRef"/>
	<element type="molecule" minOccurs="0" maxOccurs="*"/>
	<element type="crystal" minOccurs="0" maxOccurs="*"/>
	<element type="reaction" minOccurs="0" maxOccurs="*"/>
</ElementType>
</Schema>
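The schema comments above stress that @id must be unique across the document and that @href should refer to an existing @id. In the system described here these constraints are declared through the dt:type="id" and dt:type="idrefs" datatypes for use in IE 5.x. The Python below is only an independent sketch of the same two checks, for use where no validating parser is available. The function name and the test fragment are assumptions. The fragment deliberately reuses <atom id="1">, the pattern warned against above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Deliberately faulty fragment: <atom id="1"> appears in two molecules,
# and the link points at an id that does not exist in the document.
BAD = """
<cml>
  <molecule id="mol_1"><list title="atoms"><atom id="1"/></list></molecule>
  <molecule id="mol_2"><list title="atoms"><atom id="1"/></list></molecule>
  <link href="mol_543"/>
</cml>
"""


def check_ids(xml_text):
    """Return (duplicated ids, hrefs with no matching id), both sorted."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    # @href is declared as 'idrefs', so it may hold several
    # whitespace-separated references.
    dangling = sorted({
        ref
        for el in root.iter()
        for ref in (el.get("href") or "").split()
        if ref not in counts
    })
    return duplicates, dangling
```

Running such a check before a stylesheet is applied would catch the case where two molecules copied from separate .mol files both number their atoms from 1.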

Appendix D - Glossary

Applet (Java): A small program designed to run embedded within a web page (either HTML or XML). The applet is downloaded from the same site as the web page and does not require installation (unlike a plugin). For security reasons, applets have restricted access to the local hard drive.
ASCII: Refers to the standard set of text characters used by all computers. An ASCII file can be opened in a normal text editor (in contrast to a binary file, which cannot). XML/HTML and most of the legacy formats use ASCII.
Attribute: Used to give more specific information within an HTML/XML tag (hence <FONT SIZE="+1"> has attribute 'SIZE'); @href is shorthand for 'attribute called href'.
CML (Chemical Markup Language): XML language optimised for marking up chemistry.
CSS (Cascading stylesheet): Stylesheet primarily designed for HTML, allows tag properties to be changed (over-riding the parser defaults).
DTD (Document Type Definition): Older standard for describing XML languages, uses a unique syntax.
DOM (Document Object Model): 'Tree' model of a document, allows the parser scripting access to all 'objects' within the tree. Under development by the W3C.
Element: XML equivalent to HTML tags. Refers to the tag pair, any attributes and their contents.
HTML (hypertext markup language): Text formatting and display language - standard electronic format on the web.
JavaScript: A scripting language (which has little to do with Java, although they can communicate) used to add interactivity to web pages. JavaScript functions need to be triggered in some way (normally using a <FORM> button).
Markup: Procedure of labelling and organising information using <tag></tag> pairs and attributes. Basis for most formatting languages.
Namespaces: System for assigning elements to different XML languages. Very important if a document is to be validated.
Parser: Program able to read XML, comprehend its structure and build a DOM and/or use a stylesheet to display it. This project uses Internet Explorer 5.x
Plugin: Small, platform dependent program that 'plugs into' a web browser and allows it to display additional file formats (e.g. .mol, .svg, .vrml). Needs to be installed by the user.
Perl: Powerful text search and replace language (amongst other things), used to build all the legacy-to-CML converters in this project. Also used for server-side CGI scripts.
Schema: Newer standard for describing XML languages, uses XML rules and syntax.
SVG (Scalable Vector Graphics): XML language optimised for describing vector ('line' as compared to bitmap/gif/jpg) images.
Tag: Basic unit of markup, delimited by angle brackets: <tag>
Well formed: A well formed document is one that complies with XML rules and syntax.
Valid: A valid document is well formed and conforms to a language description given in a DTD or schema.
XHTML (extensible hypertext markup language): HTML tags corrected to follow XML rules.
XML (extensible markup language): A series of rules for writing other 'XML compliant' languages.
XSL (extensible stylesheet language): Stylesheet language, includes powerful scripting and querying features.

References