XML in Chemistry

Peter Murray-Rust^a and Henry Rzepa^b

^aUnilever Centre for Molecular Informatics, Cambridge University
^bDepartment of Chemistry, Imperial College of Science, Technology and Medicine.

1. Markup Languages and the publication process

Although the use of markup languages in publishing goes back to the 1960s, with the introduction by IBM of GML (Generalised Markup Language) and its subsequent evolution into the SGML standard, most authors are nowadays more familiar with one particular, more recent implementation, HTML (Hypertext Markup Language). The rapid impact of HTML in conjunction with the World Wide Web was in large measure due to the ease with which it achieves presentational and visual effect; its limitations as a mechanism for expressing precisely defined data and meanings were not always adequately recognised. These limitations meant that in areas such as the molecular sciences, where precise meanings are essential, a variety of often proprietary solutions continued to be used to define and manipulate molecular "data" and information. The publishing processes were seen as quite separate, and translating data, information and knowledge into a published entity remained an activity requiring much human perception. The reverse process of converting published material back into usable data remained equally human-intensive, and hence expensive.

The need to reconcile these two extremes was recognised at the first World Wide Web conference in 1994, and shortly afterwards this gelled into a remarkable communal effort towards the specification of an extensible markup language called XML, with the ultimate vision of what Berners-Lee has described as the creation of a "Semantic Web".¹ The objectives of this impressive effort included the following:

To provide a more universal infrastructure for publishing
A recognition that the use of XML will require subject specific vocabularies ("ontologies"). An ontology is defined as a description, such as a formal specification of a program, of the concepts and relationships that can exist for a (software) agent or a community of agents.
To provide a mechanism for enhancing quality ("validation")
To promote the creation of dynamic hyperdocuments
A recognition of the need to be able to re-use components of documents for other purposes
To provide a mechanism for creating "smart archives", in which the re-usable components (information objects) can be readily identified
To create an infrastructure for underpinning the emerging areas of e-business

The vision therefore included the creation of a new generation of ontologically rich primary publication, and a clear division of the respective roles of humans and software agents (robots). Thus humans should be able to:

publish all their data automatically
eliminate "errors" from publications
use the published literature as a database
"understand" information from other domains

Robots should be able to:

analyse publications (on whatever scale)
create secondary publications
purchase chemicals
synthesize chemicals from literature

To achieve this, we argue that a number of prerequisites must be in place.

Automatic data capture, especially from instruments. We note here that we have moved from a generation some 30 years ago where data capture from instruments was often only analogue (chart paper), to the use of standard computers to capture and process data, to most recently an increasing tendency to place these computers on-line and connect them to centralised data stores.
Common ontologies for the molecular science community.
Ontologically guided authoring

2. An example of the issues involved in "Capturing" Chemistry

The following extract² from a typical molecular science journal illustrates both how precisely data and information must be represented and how much human perception is required to translate this information, as presented in this (linear) form, into e.g. a reproducible experiment or a mechanistic interpretation;

"Thiamin phosphate synthase catalyzes the formation of thiamin phosphate from 4-amino-5-(hydroxymethyl)-2-methylpyrimidine pyrophosphate and 5-(hydroxyethyl)-4-methylthiazole phosphate. The reaction involves... dissociative mechanism...carbenium ion intermediate...and pyrimidine iminemethide observed in the crystal..."

Note the profusion of chemical structure information, concepts and terms, which only a trained human chemist could easily process. Quantitative concepts and units are also ubiquitous;

"A 500 µl aliquot of 0.8 µM TP synthase in 50 mM Tris-HCl (pH 7.5) and 6 mM MgCl₂ incubated at room temperature with 50 µM CF₃HMP-PP."

An even greater degree of human perception is required when handling graphical chemical representations, which may contain many, often fuzzy and dangerous, human-only semantics (e.g. 2D representations of 3D properties, relative stereochemistry, aromaticity, hydrogen and other "weak" bonding, use of generic and "R" groups, reaction arrows and mechanisms, etc.). The challenge therefore is to develop an infrastructure which can be routinely used to capture, store and appropriately filter and display such information.

3. The Current Position of XML (2002)

We will argue here that XML offers a general powerful and extensible mechanism for handling both the "capture" and the publication of chemical information, and most particularly for the first time will allow this process to operate equally well in both directions. Our basis for stating this derives from the following observations:

XML is increasingly widely accepted as an information infrastructure
The protocols are all public and many of the tools are open source.
XML is vendor-neutral, but with heavy vendor involvement
There is a large communal investment in generic tools (e.g. business-to-business, e-commerce)
XML has a modular approach; an application is built from components
Domains are expected to create domain-specific XML protocols and tools
XML is increasingly universal in backends, middleware, servers
XML has rapidly increasing support from database vendors
XML has close interoperability with other informatics standards (UML, OMG/CORBA, etc.)
There is increasing support for "XML over the net" and in browsers (e.g. Internet Explorer, Netscape 6, etc.).
XML is very well supported by books, tutorials.

4. Global Open Activity in Scientific XML

So how has the scientific community adopted these concepts? The first World Wide Web conference in 1994, noted above, specifically identified Maths and Chemistry as requiring specific markup, and provided the spark. During the period 1995-7, CML (Chemical Markup Language) evolved to become the first XML language, and a concurrent effort led to MathML being formalised as such in 1998.³ We estimate that by 2002 perhaps 50 specifically scientific applications have been described in some degree (for example, 37 are quoted on one XML portal⁴, the Science Citation Index shows around 570 references to the keyword XML, and SciFinder retrieves 38 references on the concept "XML in chemistry").

We also emphasize that XML is designed to allow markup languages to be combined, at whatever level of granularity, so that a document can contain any number of components deriving from specific XML languages. HTML, noted above, has evolved into one such language (XHTML), and in its latest development has been modularised into smaller, more easily implemented components (for example XForms, a data entry and validation component, can be implemented separately from other, more display-oriented components). XHTML can co-exist in a document with e.g. SVG (a scalable vector graphics language), MathML and CML. We elaborate on this when discussing namespaces (vide infra).⁵

5. Some Essentials of an XML system

The tasks that can be identified in implementing an XML solution include;

The creation of documents from both legacy sources of data and de novo by humans
The creation and capture of metadata (dictionaries of terms, tables of contents, codes, etc.)
Specification of namespaces (a reserved addressing scheme for information)
Human validation of the system (conformance to agreed specifications)
Machine validation of documents (according to a specified and agreed schema)
Document transformation (XSLT)
Rendering and display (XSL-FO, or domain-specific renderers such as those for molecular representations)

The design of an XML-based markup language should provide for;

a simple, extensible DTD or Schema (do not over complicate, and make it modular)
Agreed semantics
One (or more) agreed and published ontologies
Agreed examples and conformance tests
A community of critical mass

Appropriate tools for accomplishing this should be identified. These might include;

XML Writers
XML Readers (more difficult than writers since the XML may not be normalised to a single form)
Legacy converters (difficult because of variation and ambiguity in the original data which may require some degree of perception for an accurate conversion)
Validators
Dictionaries
Editors

Custom-written XSLT stylesheets and generic editors will do some of these, but a DOM (Document Object Model, which represents a syntax-free abstraction of the data in memory) is probably essential for many subjects.
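
The point about readers and the in-memory model can be illustrated with a minimal sketch. It uses the ElementTree module from the Python standard library as a stand-in for a full DOM, and the two atom records are hypothetical serialisations written for this illustration. They differ in quoting, attribute order, whitespace and empty-element syntax, yet parse to the same in-memory content.

```python
import xml.etree.ElementTree as ET

# Two serialisations of the same hypothetical atom record; they differ in
# quoting, attribute order, whitespace and empty-element syntax.
FORM_A = '<atom elementType="P" hydrogenCount="3"/>'
FORM_B = "<atom   hydrogenCount='3'\n  elementType='P'></atom>"


def as_record(xml_text):
    """Parse one element and return its (tag, attributes) as plain Python data."""
    element = ET.fromstring(xml_text)
    return element.tag, dict(element.attrib)


record_a = as_record(FORM_A)
record_b = as_record(FORM_B)

# The comparison is made on the parsed content, not on the text.
print(record_a == record_b)
```

A reader that compared the raw strings would treat these as different records, which is why software usually works on the parsed tree rather than on the text.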

6. Ontologies of relevance to Chemistry

An overview of the types of ontologies required is shown in Table 1.

Table 1. Types of ontology relevant to chemistry
Category	Examples	Approach
General non-chemical informatics	Business and Commerce, Government, Regulatory, Academic, Publishing, etc.	Reuse existing or emerging approaches
Domain-specific non-chemical	Mathematics (MathML), Healthcare (HL7/XML), Genomics (GeneOntology), etc.	Collaborate to reuse existing or emerging approaches
Chemical-specific but generic information types	Numeric data, Descriptive prose, Safety	Ontologies must be created by the chemical community; reuse generic tools
Chemical-specific information types	Chemical substances, Molecules, Analytical and Spectroscopic, Reactions, Computational chemistry	The chemical community must build the complete tool set

Of the chemically-specific information types, support should include that for;

Molecules and substances
Reactions
Analytical information, especially spectra
Computation and simulation (QM, mechanics, dynamics, etc.)
"Data-centric" concepts (numbers, units, arrays, matrices, etc.)
Specialist software for display, editing, searching etc
Support for "adjoining" disciplines (such as bio areas, materials science, etc.)

7. XML DTDs and Schemas

In this section, we outline some of the existing generic tools and protocols for creating valid XML documents.

The DTD (Document type definition) is a concept rooted in SGML, and is still used in XML to constrain the Markup vocabulary (i.e. the basic elements used for markup) and to some extent the (sub)structure of documents (i.e. what element can be a parent or child of another).

Schemas are a more recent development, and unlike DTDs, are themselves expressed using XML. Of particular relevance to chemistry, they provide advantages over DTDs in that they can also be used for;

Datatyping: numbers and user-defined types
enumeration (for example to specify the list of chemical elements)
Lexical patterns
Inheritance
Additional user-created rules (Schematron/XSLT)

Schemas and dictionaries also support:

Conversion to software (e.g. CML-DOM)
Authoring support (e.g. in editors)
Data validation on entry

The use of DTDs, and Schemas in particular, to create valid documents can bring enormous benefits, including eliminating or reducing software failure due to invalid data, and reducing the difficulty of (human) understanding caused by invalid publications.
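
The kinds of constraint listed above (enumeration, datatyping, lexical patterns) can be sketched in a few lines. This is not an XML Schema processor: the checks are hand-written in Python purely to show what a schema would express declaratively, and the element list, the function name validate_atom and the test records are all assumptions made for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Deliberately tiny enumeration; a real schema would list every element symbol.
ALLOWED_ELEMENTS = {"H", "C", "N", "O", "P", "Na", "Cl"}

# Lexical pattern for a non-negative integer.
NON_NEGATIVE_INTEGER = re.compile(r"[0-9]+")


def validate_atom(atom):
    """Return a list of human-readable problems found in one <atom> element."""
    problems = []
    symbol = atom.get("elementType")
    if symbol not in ALLOWED_ELEMENTS:
        problems.append(f"elementType {symbol!r} is not in the enumeration")
    count = atom.get("hydrogenCount")
    if count is not None and not NON_NEGATIVE_INTEGER.fullmatch(count):
        problems.append(f"hydrogenCount {count!r} is not a non-negative integer")
    return problems


good = ET.fromstring('<atom elementType="P" hydrogenCount="3"/>')
bad = ET.fromstring('<atom elementType="Px" hydrogenCount="three"/>')

print(validate_atom(good))
for problem in validate_atom(bad):
    print(problem)
```

Rejecting the second record at the point of entry is the benefit described above: the invalid data never reaches downstream software or a publication.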

8. Namespaces

Each information object must be uniquely named to avoid collision and ambiguity. This is achieved using XML namespacing.

The example below shows a paragraph of text (derived from XHTML, which inherits the default namespace), within which components of CML are embedded with prefixes using the defined namespaces;

<html
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:cml="http://www.xml-cml.org/schema/cml2/core">
  <p>We can supply the following set of molecules:</p>
  <ul>
  <li><cml:molecule id="p1" title="phosphine">
    <cml:atomArray>
      <cml:atom elementType="P" hydrogenCount="3"/>
    </cml:atomArray>
  </cml:molecule></li>
  <li><cml:molecule id="p2" title="penguinone"/></li>
  </ul>
</html>
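
As a hedged illustration of how software can use these namespaces, the sketch below parses a copy of the example (with the first cml:molecule element explicitly closed so that the document is well-formed) using the Python standard library. The namespace URIs are those given in the example; the variable names are chosen only for this sketch.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CML_NS = "http://www.xml-cml.org/schema/cml2/core"

DOCUMENT = """
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:cml="http://www.xml-cml.org/schema/cml2/core">
  <p>We can supply the following set of molecules:</p>
  <ul>
    <li><cml:molecule id="p1" title="phosphine">
      <cml:atomArray>
        <cml:atom elementType="P" hydrogenCount="3"/>
      </cml:atomArray>
    </cml:molecule></li>
    <li><cml:molecule id="p2" title="penguinone"/></li>
  </ul>
</html>
"""

root = ET.fromstring(DOCUMENT.strip())

# ElementTree names every element as "{namespace-uri}localname", so the
# chemical content can be selected without touching the XHTML around it.
titles = [m.get("title") for m in root.iter(f"{{{CML_NS}}}molecule")]
paragraph = root.find(f"{{{XHTML_NS}}}p").text

print(titles)
print(paragraph)
```

The selection depends on the namespace URI and not on the cml: prefix, so a document that bound the same URI to a different prefix would be processed identically.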

The next example illustrates how CML can be used in conjunction with the STMML namespace⁶ to specify units and their constraints:

<molecule id="m1">
  <crystal spacegroup="Fm3m" z="4">
    <stm:scalar title="a" errorValue="0.001" units="angstrom">5.628</stm:scalar>
    <stm:scalar title="b" errorValue="0.001" units="angstrom">5.628</stm:scalar>
    <stm:scalar title="c" errorValue="0.001" units="angstrom">5.628</stm:scalar>
    <stm:scalar title="alpha" errorValue="0">90</stm:scalar>
    <stm:scalar title="beta" errorValue="0">90</stm:scalar>
    <stm:scalar title="gamma" errorValue="0">90</stm:scalar>
  </crystal>
  <atomArray>
    <atom id="a1" elementType="Na" formalCharge="1" xyzFract="0.0 0.0 0.0" xy2="+23.2 -21.0"/>
    <atom id="a2" elementType="Cl" formalCharge="-1" xyzFract="0.5 0.0 0.0"/>
  </atomArray>
</molecule>

STMML is a proposal⁶ for domain-independent components for Scientific-Technical-Medical information. It contains key elements such as Units, Dictionary, Metadata, item, array and matrix, and supports datatypes such as numbers, max/min, ranges, errors, etc.

A more extended example of this concatenation of namespaces⁷ contains up to eight namespaced components, and illustrates how a complete publication in XML/CML could be achieved.
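
To suggest how the crystal fragment above might be consumed by software, the following sketch reads the six scalars and computes the unit-cell volume from the general triclinic formula. Several things are assumptions made for the illustration: the fragment does not declare the stm prefix, so the namespace URI used here is a placeholder; the angles carry no units attribute and are taken to be in degrees; and cell_volume is a helper written for this sketch, not part of CML or STMML.

```python
import math
import xml.etree.ElementTree as ET

# Assumption: placeholder URI for the stm prefix, which the fragment in the
# text leaves undeclared.
STM_NS = "http://www.xml-cml.org/schema/stmml"

FRAGMENT = f"""
<crystal xmlns:stm="{STM_NS}" spacegroup="Fm3m" z="4">
  <stm:scalar title="a" errorValue="0.001" units="angstrom">5.628</stm:scalar>
  <stm:scalar title="b" errorValue="0.001" units="angstrom">5.628</stm:scalar>
  <stm:scalar title="c" errorValue="0.001" units="angstrom">5.628</stm:scalar>
  <stm:scalar title="alpha" errorValue="0">90</stm:scalar>
  <stm:scalar title="beta" errorValue="0">90</stm:scalar>
  <stm:scalar title="gamma" errorValue="0">90</stm:scalar>
</crystal>
"""


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume from edge lengths and angles (angles in degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


crystal = ET.fromstring(FRAGMENT.strip())
scalars = crystal.findall(f"{{{STM_NS}}}scalar")

# Numeric values and their declared units, keyed by the title attribute.
cell = {s.get("title"): float(s.text) for s in scalars}
units = {s.get("title"): s.get("units") for s in scalars}

volume = cell_volume(cell["a"], cell["b"], cell["c"],
                     cell["alpha"], cell["beta"], cell["gamma"])
print(f"cell volume: {volume:.3f} cubic {units['a']}")
```

Because each value carries its units and error estimate as attributes, a program can check them before computing, which is not possible when the same numbers appear only in running text.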

The use of namespaces can be seen in a more general context in Figure 1, which illustrates how the various specific XML components might relate to each other.

Figure 1. The relation of CML to other ontologies.

In particular we note here how the original CML specification⁸ can be extended by modularisation into a Core namespace, and extended via other schemas into e.g.

CMLReact. A reaction, containing reactantLists, productLists and links between them.
CMLComp. A container for computational and simulation input and results
CMLQuery. A generic query language
Hooks for other Schemas, e.g. SpectHook, for spectral parameters and data, and links to molecular details (assignment)

9. Dictionaries and Schemas

It is useful to separate the domain ontology from the Schema/DTD, which allows the schema to be more abstract and which helps extensibility. Thus a 3- or 4-level hierarchy can be envisaged:

The data instance
The XMLSchema describing the instance
The dictionary/ies describing the instance
The schema describing the dictionaries

where an instance document refers to namespaced dictionaries to add semantics and ontology. In this system, units are themselves verified by the UNITS dictionary. An overview of this process is shown in Figure 2.
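
A minimal sketch of this lookup, under stated assumptions, is given below. The dictionary prefixes, entry identifiers and definitions are invented for the illustration, as is the convention of writing a reference as prefix:id in the dictRef and units attributes; real dictionaries would themselves be XML documents validated against their own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced dictionaries: prefix -> {entry id -> definition}.
DICTIONARIES = {
    "units": {"angstrom": "1e-10 m", "degree": "pi/180 rad"},
    "cryst": {"cell_length_a": "Length of the unit-cell edge a"},
}


def resolve(reference):
    """Look up a 'prefix:id' reference; return None when nothing matches."""
    prefix, _, entry_id = reference.partition(":")
    return DICTIONARIES.get(prefix, {}).get(entry_id)


# Hypothetical data instance pointing at a concept entry and a units entry.
instance = ET.fromstring(
    '<scalar dictRef="cryst:cell_length_a" units="units:angstrom">5.628</scalar>'
)

references = [instance.get("dictRef"), instance.get("units")]
unresolved = [ref for ref in references if resolve(ref) is None]

for ref in references:
    print(ref, "->", resolve(ref))
```

Keeping the meaning of cell_length_a and angstrom in dictionaries, outside the schema, is what allows the schema to stay abstract while the vocabulary grows.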

Structure of Dictionaries

We summarise briefly below some characteristic features of dictionaries:

Dictionaries consist of curated entries
Many dictionaries are "flat" with seeAlso links, e.g. the IUPAC GoldBook
A single hierarchy is common:
- generic ("isA"):
  eukaryote <-- vertebrate <-- mammal <-- human
- partitive ("hasA"):
  body <-- leg <-- foot <-- toe
Dictionaries can now be namespaced for uniqueness and navigation
Dictionaries must have curatorial information
Dictionaries should support versioning
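
The two single hierarchies quoted above can be held as simple child-to-parent links, as in the sketch below. The data structure and the function name ancestors are assumptions made for this illustration and do not describe how any existing dictionary stores its entries.

```python
# Child -> parent links for the generic ("isA") hierarchy quoted above.
IS_A = {"human": "mammal", "mammal": "vertebrate", "vertebrate": "eukaryote"}

# Part -> whole links for the partitive ("hasA") hierarchy quoted above.
PART_OF = {"toe": "foot", "foot": "leg", "leg": "body"}


def ancestors(term, links):
    """Follow the links upwards from term and return every ancestor in order."""
    chain = []
    while term in links:
        term = links[term]
        chain.append(term)
    return chain


print(ancestors("human", IS_A))
print(ancestors("toe", PART_OF))
```

A flat dictionary with seeAlso links offers no such chain, so a question such as "is a human a vertebrate?" can only be answered by software when the hierarchy is stated explicitly.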

The existing IUPAC dictionaries provide a natural base for creating XML-based, machine-processable resources. These dictionaries fall into three broad categories: descriptive (e.g. Medicinal Chemistry, Phys. Org. Chem., Stereochemistry, etc.), validating (e.g. Theoretical Chemistry) and supplemental (e.g. Atomic Weights).⁹ Their availability for XML-based processes would be a considerable asset.

10. XML and Metadata

Metadata is an important component of a document or information object, and it can serve a number of purposes:

Navigational/Discovery. How is a piece of information to be discovered? e.g. Dublin Core and GILS
Descriptive. What does the information mean and how is it to be used?
Constraining. What constraints are there on the structure and content of the information? Is it valid? This would be accomplished mainly using XMLSchemas.
Supplementary. Additional (hyper)data added from metadata
Algorithmic. Deductions can be made from metadata, using e.g. Schematron and XSLT and RDF.
Chemical-descriptive. e.g. Medicinal, PhysOrgChem, GoldBook, StereoChem
Chemical-constraining. e.g. Theoretical Chemistry, CIF
Chemical-supplemental. e.g. tables of Atomic Weights, dictionaries of compounds etc.
Chemical-algorithmic. TheoChem, CIF

Communally agreed schemas for defining such metadata are again seen as an essential component of the XML-infrastructures.
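
As an illustration of navigational metadata, the sketch below attaches Dublin Core elements to a record and then retrieves the creators. The dc namespace URI is the published Dublin Core element set namespace; the wrapper element named metadata and the choice of fields are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical wrapper element holding the Dublin Core fields of one document.
metadata = ET.Element("metadata")
fields = (
    ("title", "XML in Chemistry"),
    ("creator", "Peter Murray-Rust"),
    ("creator", "Henry Rzepa"),
)
for name, value in fields:
    ET.SubElement(metadata, f"{{{DC_NS}}}{name}").text = value

serialised = ET.tostring(metadata, encoding="unicode")

# Discovery: any agent that knows the dc namespace can find the creators
# without understanding the rest of the document.
creators = [e.text for e in metadata.findall(f"{{{DC_NS}}}creator")]

print(serialised)
print(creators)
```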

11. Conclusions

In this brief review of the application of XML in chemistry, we have summarised the essential advantages of adopting the XML approach. We have discussed in particular the benefits of creating re-usable namespaced information components or objects, and how these can be created and validated using subject-specific ontologies and dictionaries, and enhanced with appropriate metadata. The role of communities and global organisations such as IUPAC is seen as crucial in creating these key resources. The use of such XML-based documents opens the prospect of a reversible flow of data and information between the scientific publication processes and the discovery, research and learning processes in molecular sciences, a reversibility that has hitherto only been achieved with considerable (and error-prone) human effort and expense.

12. References

1. T. Berners-Lee and M. Fischetti, "Weaving the Web: The Original Design and the Ultimate Destiny of the World-Wide Web", Orion Business Books, London, 1999. ISBN 0752820907. For discussion of this in a molecular context, see H. S. Rzepa and P. Murray-Rust, Learned Publishing, 2001, 14, 177; P. Murray-Rust and H. S. Rzepa, Data Science, 2002, issue 1, in press.
2. D. H. Peapus, H. J. Chiu and N. Campobasso, Biochemistry, 2001, 40, 10103-10114.
3. See http://www.xml.com/pub/rg/117
4. See http://www.w3.org for details of all XML specifications.
5. G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa and M. Wright, Internet J. Chemistry, 2001, article 13.
6. P. Murray-Rust and H. S. Rzepa, Data Science, 2002, submitted for publication.
7. P. Murray-Rust, H. S. Rzepa and M. Wright, New J. Chem., 2001, 618-634.
8. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928; P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 2001, 1113; G. Gkoutos, P. Murray-Rust, H. S. Rzepa and M. Wright, J. Chem. Inf. Comp. Sci., 2001, 1124.
9. See G. P. Moss, http://www.chem.qmw.ac.uk/iupac/ for IUPAC dictionaries in Web-form. IUPAC home is at http://www.iupac.org/