XML in Chemistry

Peter Murray-Rust (a) and Henry Rzepa (b)

(a) Unilever Centre for Molecular Informatics, Cambridge University
(b) Department of Chemistry, Imperial College of Science, Technology and Medicine


1. Markup Languages and the publication process

Although the use of markup languages in publishing goes back to the 1960s, with the introduction by IBM of GML (Generalised Markup Language) and its subsequent evolution into the SGML standard, most authors are nowadays more familiar with one particular, more recent implementation: HTML (Hypertext Markup Language). Its rapid impact in conjunction with the World Wide Web was in large measure due to the ease with which it achieves presentational and visual effect; its limitations as a mechanism for expressing precisely defined data and meanings were not always adequately recognised. These limitations meant that in areas such as the molecular sciences, where precise meanings are essential, a variety of often proprietary solutions continued to be used to define and manipulate molecular "data" and information. The publishing processes were seen as quite separate, and translating data, information and knowledge into a published entity remained an activity requiring much human perception. It is also worth noting that the reverse process, converting the published material back into usable data, remained equally human-intensive and hence expensive.

The need to reconcile these two extremes was recognised at the first World Wide Web conference in 1994, and shortly afterwards this crystallised into a remarkable communal effort to specify an extensible markup language, XML, with the ultimate vision of what Berners-Lee has described as the creation of a "Semantic Web" [1]. The objectives of this effort included the following:

The vision therefore included the creation of a new generation of ontologically rich primary publication, and a clear division of the respective roles of humans and software agents (robots). Thus humans should be able to:
Robots should be able to:
To achieve this, we argue that a number of prerequisites must be in place.

2. An example of the issues involved in "Capturing" Chemistry

The following extract [2] from a typical molecular science journal illustrates both how precisely data and information must be represented and how much human perception is required to translate information presented in this (linear) form into, for example, a reproducible experiment or a mechanistic interpretation:

"Thiamin phosphate synthase catalyzes the formation of thiamin phosphate from 4-amino-5-(hydroxymethyl)-2-methylpyrimidine pyrophosphate and 5-(hydroxyethyl)-4-methylthiazole phosphate. The reaction involves... dissociative mechanism...carbenium ion intermediate...and pyrimidine iminemethide observed in the crystal..."

Note the profusion of chemical structure information, concepts and terms, which only a trained human chemist could easily process. Quantitative concepts and units are also ubiquitous:

"A 500 µl aliquot of 0.8 µM TP synthase in 50 mM Tris-HCl (pH 7.5) and 6 mM MgCl2 incubated at room temperature with 50 µM CF3HMP-PP."

An even greater degree of human perception is required when handling graphical chemical representations, which may contain many, often fuzzy and dangerous, human-only semantics (e.g. 2D representations of 3D properties, relative stereochemistry, aromaticity, hydrogen and other "weak" bonding, use of generic and "R" groups, reaction arrows and mechanisms, etc.). The challenge therefore is to develop an infrastructure which can be used routinely to capture, store, and appropriately filter and display such information.

3. The Current Position of XML (2002)

We will argue here that XML offers a general, powerful and extensible mechanism for handling both the "capture" and the publication of chemical information and, most particularly, that for the first time it will allow this process to operate equally well in both directions. Our basis for stating this derives from the following observations:

4. Global Open Activity in Scientific XML

So how has the scientific community adopted these concepts? The first World-Wide Web conference in 1994, noted above, specifically identified Maths and Chemistry as requiring their own markup. That conference provided the spark: during the period 1995-7, CML (Chemical Markup Language) evolved to become the first XML language, and a concurrent effort led to MathML being formalised as such in 1998 [3]. We estimate that by 2002 perhaps 50 specifically scientific applications had been described to some degree (for example, 37 are quoted on one XML portal [4], the Science Citation Index shows around 570 references to the keyword XML, and SciFinder retrieves 38 references on the concept "XML in chemistry").

We also emphasize that XML is designed to allow markup languages to be combined at whatever level of granularity, so a document can contain any number of components deriving from specific XML languages. HTML, noted above, has evolved into one such language (XHTML) and, in its latest development, has been modularised into smaller, more easily implemented components (for example XForms, a data entry and validation component, can be implemented separately from other, more display-oriented components). XHTML can co-exist in a document with e.g. SVG (Scalable Vector Graphics), MathML and CML. We elaborate on this when discussing namespaces (vide infra) [5].

5. Some Essentials of an XML system

The tasks that can be identified in implementing an XML solution include;

The design of an XML-based markup language should provide for;

Appropriate tools for accomplishing this should be identified. These might include;

Custom-written XSLT stylesheets and generic editors will do some of these, but a DOM (Document Object Model, which represents a syntax-free abstraction of the data in memory) is probably essential for many subjects.
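
To make the idea of a "syntax-free abstraction" concrete, the following is a minimal sketch using the DOM implementation in the Python standard library (`xml.dom.minidom`). The two documents are illustrative: they reuse element and attribute names from the CML examples later in this article, and the helper `atom_records` is a hypothetical name introduced here. The documents differ in quoting, whitespace, attribute order and empty-element form, yet yield the same data once read through the DOM.

```python
from xml.dom.minidom import parseString

# Two serializations of the same molecule that differ only in syntax.
DOC_A = (
    '<molecule id="m1"><atomArray>'
    '<atom id="a1" elementType="Na" formalCharge="1"/>'
    '<atom id="a2" elementType="Cl" formalCharge="-1"/>'
    '</atomArray></molecule>'
)
DOC_B = """<molecule id='m1'>
  <atomArray>
    <atom id='a1' elementType='Na' formalCharge='1'></atom>
    <atom formalCharge='-1' elementType='Cl' id='a2'></atom>
  </atomArray>
</molecule>"""


def atom_records(xml_text):
    """Parse into a DOM and return (id, element, charge) for each atom."""
    dom = parseString(xml_text)
    return [
        (
            atom.getAttribute("id"),
            atom.getAttribute("elementType"),
            int(atom.getAttribute("formalCharge")),
        )
        for atom in dom.getElementsByTagName("atom")
    ]


records_a = atom_records(DOC_A)
records_b = atom_records(DOC_B)
net_charge = sum(charge for _, _, charge in records_a)

print(records_a == records_b)  # the DOM hides the syntactic differences
print(net_charge)
```

Software written against the DOM therefore works with atoms and attributes rather than with angle brackets and quotation marks, which is the property the paragraph above relies on.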

6. Ontologies of relevance to Chemistry

An overview of the types of ontologies required is shown in Table 1.

Table 1. Types of ontology required.

  Scope:    General non-chemical informatics
  Examples: Business and Commerce, Government, Regulatory, Academic, Publishing, etc.
  Approach: Reuse existing or emerging approaches

  Scope:    Domain-specific non-chemical
  Examples: Mathematics (MathML), Healthcare (HL7/XML), Genomics (GeneOntology), etc.
  Approach: Collaborate to reuse existing or emerging approaches

  Scope:    Chemical-specific but generic information types
  Examples: Numeric data, Descriptive prose, Safety
  Approach: Ontologies must be created by the chemical community; reuse generic tools

  Scope:    Chemical-specific information types
  Examples: Chemical substances, Molecules, Analytical and Spectroscopic, Reactions, Computational chemistry
  Approach: The chemical community must build the complete tool set

Of the chemically-specific information types, support should include that for;



7. XML DTDs and Schemas

In this section, we outline some of the existing generic tools and protocols for creating valid XML documents.

The DTD (Document Type Definition) is a concept rooted in SGML, and is still used in XML to constrain the markup vocabulary (i.e. the basic elements used for markup) and, to some extent, the (sub)structure of documents (i.e. which elements can be a parent or child of another).

Schemas are a more recent development, and unlike DTDs, are themselves expressed using XML. Of particular relevance to chemistry, they provide advantages over DTDs in that they can also be used for;

Schemas and dictionaries also support:

The use of DTDs, and Schemas in particular, for the creation of valid documents can bring enormous benefits, including eliminating or reducing software failure due to invalid data, and reducing the difficulty of (human) understanding caused by invalid publications.
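
The kinds of constraint discussed in this section can be sketched in a few lines of Python. This is not a DTD or Schema processor, and a production system would use a validating parser; it is a hand-written illustration, under stated assumptions, of the checks such tools automate. The content model `ALLOWED_CHILDREN` and the function `validate` are hypothetical names introduced here, and the element names are borrowed from the CML examples in this article. The vocabulary and parent/child checks correspond to what a DTD can express; the integer check on `hydrogenCount` is a datatype constraint of the sort that Schemas add.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> permitted child element names.
ALLOWED_CHILDREN = {
    "molecule": {"atomArray"},
    "atomArray": {"atom"},
    "atom": set(),
}


def validate(xml_text):
    """Return a list of problems; an empty list means the document passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    if root.tag not in ALLOWED_CHILDREN:
        return [f"unknown root element <{root.tag}>"]

    problems = []
    stack = [root]
    while stack:
        node = stack.pop()
        # Vocabulary and structure checks (DTD-style).
        for child in node:
            if child.tag in ALLOWED_CHILDREN[node.tag]:
                stack.append(child)
            else:
                problems.append(f"<{child.tag}> not allowed inside <{node.tag}>")
        # Datatype check (Schema-style): hydrogenCount must be a non-negative integer.
        if node.tag == "atom":
            count = node.get("hydrogenCount", "0")
            if not re.fullmatch(r"[0-9]+", count):
                problems.append(f"hydrogenCount={count!r} is not a non-negative integer")
    return problems


GOOD = '<molecule><atomArray><atom elementType="P" hydrogenCount="3"/></atomArray></molecule>'
MISPLACED = '<molecule><atom elementType="P"/></molecule>'
BAD_TYPE = '<molecule><atomArray><atom elementType="P" hydrogenCount="three"/></atomArray></molecule>'

for document in (GOOD, MISPLACED, BAD_TYPE):
    print(validate(document))
```

Rejecting `MISPLACED` and `BAD_TYPE` before they reach downstream software is the benefit claimed above: the failure is reported at the point of publication rather than discovered later as a program error.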

8. Namespaces

Each information object must be uniquely named to avoid collision and ambiguity. This is achieved using XML namespacing.

  1. The example below shows a paragraph of text in XHTML (which is declared as the default namespace), within which CML components are embedded using a prefix bound to the CML namespace:
    <html
      xmlns="http://www.w3.org/1999/xhtml"
      xmlns:cml="http://www.xml-cml.org/schema/cml2/core">
      <p>We can supply the following set of molecules:</p>
      <ul>
      <li><cml:molecule id="p1" title="phosphine">
        <cml:atomArray>
          <cml:atom elementType="P" hydrogenCount="3"/>
        </cml:atomArray>
      </cml:molecule></li>
      <li><cml:molecule id="p2" title="penguinone"/></li>
      </ul>
    </html>
    
  2. The next example illustrates how CML can be used in conjunction with the STMML namespace [6] to specify units and their constraints:
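    <!-- The stm prefix must be declared for the fragment to parse; the URI used here is an assumption. -->
    <molecule id="m1" xmlns:stm="http://www.xml-cml.org/schema/stmml">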
      <crystal spacegroup="Fm3m" z="4">
        <stm:scalar title="a" errorValue="0.001" units="angstrom">5.628</stm:scalar>
        <stm:scalar title="b" errorValue="0.001" units="angstrom">5.628</stm:scalar>
        <stm:scalar title="c" errorValue="0.001" units="angstrom">5.628</stm:scalar>
        <stm:scalar title="alpha" errorValue="0">90</stm:scalar>
        <stm:scalar title="beta" errorValue="0">90</stm:scalar>
        <stm:scalar title="gamma" errorValue="0">90</stm:scalar>
      </crystal>
      <atomArray>
        <atom id="a1" elementType="Na" formalCharge="1" xyzFract="0.0 0.0 0.0" xy2="+23.2 -21.0"/>
        <atom id="a2" elementType="Cl" formalCharge="-1" xyzFract="0.5 0.0 0.0"/>
      </atomArray>
    </molecule>
    
  3. STMML is a proposal [6] for domain-independent components for Scientific-Technical-Medical information. It contains key elements such as units, dictionary, metadata, item, array and matrix, and supports datatypes such as numbers, max/min, ranges, errors, etc.
  4. A more extended example of this concatenation of namespaces [7] contains up to eight namespaced components, and illustrates how a complete publication in XML/CML could be achieved.
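
How software sees such a mixed-namespace document can be sketched with the standard-library `xml.etree.ElementTree` module. The document below is the XHTML/CML example from item 1, with the `cml:molecule` element closed so that it parses. The short prefixes `h` and `c` in the lookup table are deliberately different from those in the document, on the understanding that it is the namespace URI, not the prefix, that identifies an element.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CML_NS = "http://www.xml-cml.org/schema/cml2/core"

DOC = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:cml="http://www.xml-cml.org/schema/cml2/core">
  <p>We can supply the following set of molecules:</p>
  <ul>
    <li><cml:molecule id="p1" title="phosphine">
      <cml:atomArray>
        <cml:atom elementType="P" hydrogenCount="3"/>
      </cml:atomArray>
    </cml:molecule></li>
    <li><cml:molecule id="p2" title="penguinone"/></li>
  </ul>
</html>"""

root = ET.fromstring(DOC)

# Local prefix-to-URI table used only for the queries below.
ns = {"h": XHTML_NS, "c": CML_NS}

# A chemistry-aware agent extracts only the CML components...
titles = [m.get("title") for m in root.iterfind(".//c:molecule", ns)]

# ...while a display-oriented agent reads only the XHTML components.
paragraph = root.find("h:p", ns).text
list_items = len(root.findall(".//h:li", ns))

# An unqualified name matches nothing, so names cannot collide by accident.
unqualified = root.findall(".//molecule")

print(titles, list_items, len(unqualified))
```

Each agent can thus ignore the vocabulary it does not understand, which is what allows several markup languages to share one document.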

The use of namespaces can be seen in a more general context in Figure 1, which illustrates how the various specific XML components might relate to each other.

Figure 1. The relation of CML to other ontologies.

In particular we note here how the original CML specification [8] can be modularised into a Core namespace, and extended via other schemas into e.g.

9. Dictionaries and Schemas

It is useful to separate the domain ontology from the Schema/DTD; this allows the schema to be more abstract and helps extensibility. A 3- or 4-level hierarchy can thus be envisaged, in which an instance document refers to namespaced dictionaries to add semantics and ontology. In this system, units are themselves verified by the units dictionary. An overview of this process is shown in Figure 2.

Figure 2. Structure of dictionaries.
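
The hierarchy described above, in which the units carried by an instance document are verified against a units dictionary, can be sketched as follows. Everything specific here is an assumption made for illustration: the `UNITS` table stands in for a real namespaced dictionary, the namespace URI bound to `stm` is assumed, the unit `furlong` is included only to show a rejection, and `check_units` is a hypothetical helper. Scalars with no `units` attribute are also reported, which a real system might instead treat as dimensionless.

```python
import xml.etree.ElementTree as ET

STM = "http://www.xml-cml.org/schema/stmml"  # assumed URI for the stm prefix

# Hypothetical units dictionary: unit id -> (quantity type, factor to SI base unit).
UNITS = {
    "angstrom": ("length", 1e-10),
    "nm": ("length", 1e-9),
}

FRAGMENT = f"""<crystal xmlns:stm="{STM}">
  <stm:scalar title="a" units="angstrom">5.628</stm:scalar>
  <stm:scalar title="b" units="furlong">5.628</stm:scalar>
</crystal>"""


def check_units(xml_text):
    """Return (SI values for scalars with known units, titles of the rest)."""
    verified, unknown = {}, []
    for scalar in ET.fromstring(xml_text).iter(f"{{{STM}}}scalar"):
        unit = scalar.get("units")
        if unit in UNITS:
            _, factor = UNITS[unit]
            verified[scalar.get("title")] = float(scalar.text) * factor
        else:
            unknown.append(scalar.get("title"))
    return verified, unknown


verified, unknown = check_units(FRAGMENT)
print(verified)
print(unknown)
```

Because the dictionary, not the schema, holds the knowledge that an angstrom is a length of 1e-10 m, new units can be added without changing the schema, which is the extensibility argument made above.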

We summarise briefly below some characteristic features of dictionaries:

The existing IUPAC dictionaries provide a natural base for creating XML-based, machine-processable resources. These dictionaries fall into three broad categories: descriptive (e.g. Medicinal Chemistry, Phys. Org. Chem., Stereochemistry, etc.), validating (e.g. Theoretical Chemistry) and supplemental (e.g. Atomic Weights) [9]. Their availability for XML-based processes would be a considerable asset.

10. XML and Metadata

Metadata is an important component of a document or information object, and it can serve a number of purposes:

Communally agreed schemas for defining such metadata are again seen as an essential component of the XML infrastructure.

11. Conclusions

In this brief review of the application of XML in chemistry, we have summarised the essential advantages of adopting the XML approach. We have discussed in particular the benefits of creating re-usable namespaced information components or objects, and how these can be created and validated using subject-specific ontologies and dictionaries, and enhanced with appropriate metadata. The role of communities and global organisations such as IUPAC is seen as crucial in creating these key resources. The use of such XML-based documents opens the prospect of a reversible flow of data and information between the scientific publication processes and the discovery, research and learning processes in the molecular sciences, a reversibility that has hitherto been achieved only with considerable (and error-prone) human effort and expense.

12. References

  1. T. Berners-Lee and M. Fischetti, "Weaving the Web: The Original Design and the Ultimate Destiny of the World-Wide Web", Orion Business Books, London, 1999. ISBN 0752820907. For discussion of this in a molecular context, see H. S. Rzepa and P. Murray-Rust, Learned Publishing, 2001, 14, 177; P. Murray-Rust and H. S. Rzepa, Data Science, 2002, issue 1, in press.
  2. D. H. Peapus, H. J. Chiu and N. Campobasso, Biochemistry, 2001, 40, 10103-10114.
  3. See http://www.xml.com/pub/rg/117
  4. See http://www.wr3.org for details of all XML specifications.
  5. G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa, and M. Wright, Internet J. Chemistry, 2001, article 13.
  6. P. Murray-Rust and H. S. Rzepa, Data Science, 2002, submitted for publication.
  7. P. Murray-Rust, H. S. Rzepa and M. Wright, New J. Chem., 2001, 618-634.
  8. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928; P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 2001, 1113; G. Gkoutos, P. Murray-Rust, H. S. Rzepa and M. Wright, J. Chem. Inf. Comp. Sci., 2001, 1124.
  9. See G. P. Moss, http://www.chem.qmw.ac.uk/iupac/ for IUPAC dictionaries in Web-form. IUPAC home is at http://www.iupac.org/