CML First Tutorial

CML is a complex language and there are many ways to begin; this is for those who like tutorials. Because CML is flexible, and can cover most aspects of molecular science, it may take a little while to adjust. It will help if you are familiar with the basics of HTML and the use of hypertext and linked documents on the WWW.

CML is a formal method of organising information and the result will depend very much on the author's views. It's quite possible for two people to represent the same information in different ways in CML, but it's important to realise that these are often equivalent and can be interconverted by machines or humans. CML can be regarded as a (limited) programming language (but no more forbidding than HTML!). CML files can be very small (a few lines) or very large - many megabytes with thousands of tags. Indeed, I am planning to write my next book in CML.

Throughout this tutorial we shall show examples of CML files and the way they look in my JUMBO software. It's important to realise that CML files can be used without any viewing software, and in some cases without any software at all. Most of the time, however, you will need some software to help create, manipulate, view and transform CML files. JUMBO is able to do most of this, but I hope that much other software will be developed.

The CML examples are denoted by hyperlinks like this [hello.cml] and when you follow the link (click) you will see the raw CML data. (You will find it useful to open a second browser window.) For each such file there will also be one or more screen shots from JUMBO. The JUMBO software is written in Java and available as applets on the WWW or as standalone applications. See the FAQ for details. If you are able to run JUMBO, you should be able to reproduce the diagrams (allowing for different windowing systems) and to explore other renderings.

CML can be transparent

In many applications you won't need to see any CML as it will be 'hidden' within the guts of the application software. So the first example shows JUMBO reading an MDL Molfile and displaying it. The command was

java pmr.cml.CMLObj adenosine.mol

and the result is shown here:

Although this looks a bit like any standard molecular viewer, it's organised internally in a hierarchical manner as shown by the Table Of Contents (or TOC). This is a very powerful approach fundamental to everything in CML. Notice how the date in the Molfile is shown as a piece of information (an object or element). All dates 'know how old they are', so that JUMBO is able to show this info without additional programming.

This example also shows that most 'flat files' contain implicit hierarchical information. You may wish to compare the raw file with the picture. (The stereochemistry is not shown - that's a current limitation of the JUMBO depiction, not of CML).
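
To make the idea of implicit hierarchy concrete, here is a minimal sketch in Python (chosen only for illustration; JUMBO itself is written in Java). It reads a small hand-written Molfile-style record and returns a nested structure. The water record, the made-up stamp on its second line and the function name `parse_molfile` are assumptions for this example, and only the fields needed here are read.

```python
# Hypothetical example data: a hand-written Molfile-style record for water.
MOLFILE = "\n".join([
    "water",                                      # line 1: title
    "  sketch  0101970000",                       # line 2: program and date stamp (made up)
    "",                                           # line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",    # line 4: counts line
    "    0.0000    0.0000    0.0000 O   0  0",    # atom block
    "    0.9600    0.0000    0.0000 H   0  0",
    "   -0.2400    0.9300    0.0000 H   0  0",
    "  1  2  1  0",                               # bond block
    "  1  3  1  0",
    "M  END",
])


def parse_molfile(text):
    """Turn the flat record into a nested dict (title, stamp, atoms, bonds)."""
    lines = text.split("\n")
    n_atoms = int(lines[3][0:3])   # first 3-character field of the counts line
    n_bonds = int(lines[3][3:6])   # second 3-character field
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append({"symbol": symbol, "xyz": (float(x), float(y), float(z))})
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a1, a2, order = (int(line[i:i + 3]) for i in (0, 3, 6))
        bonds.append({"atoms": (a1, a2), "order": order})
    return {"title": lines[0], "stamp": lines[1].strip(),
            "atoms": atoms, "bonds": bonds}


mol = parse_molfile(MOLFILE)
print(mol["title"], len(mol["atoms"]), len(mol["bonds"]))
```

The flat file never says "this molecule contains atoms and bonds", but the counts line implies exactly that containment, which is what the TOC makes visible.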

It's straightforward to produce a CML file at this stage and JUMBO will do this if the _DOCROOT icon square is clicked. Here is the resulting CML file. This contains precisely the same information as the original Molfile and you can see that it's easy to understand, and that the information is much better identified. Don't worry about the details at present; they will be explained later. Note the use of the CONVENTION attribute which allows a future reader (machine or human) to identify which convention or conventions are used. For example, you would need to refer to the MDL manuals to understand the figures in the BUILTIN="STER" array (although JUMBO will add an implementation in the future). (NOTE: this type of conversion depends on how easily the original file can be interpreted and parsed, not on CML. In some of my converters information has not been preserved because I didn't know what it meant or whether it was significant! It's possible for someone to add this later if required.)

Can a CML file be converted to a Molfile? That depends entirely on what is in the CML file and whether the Molfile can hold that information. For example, objects like sequences, hypertext and diagrams - all of which can be held in CML - cannot be held in a Molfile and so would be lost. All conversion between current file types involves information loss; CML is the only method of ensuring that everything is kept after conversion.

The next two examples show that CML can hold non-molecular information, especially text.

Hello Benzene!

There is some unavoidable terminology here - don't worry about it on first reading.

Let's start with the simplest CML file of all: [hello.cml], reproduced here:

<!DOCTYPE CML PUBLIC "-//CML//DTD CML//EN">
<CML TITLE="Hello Benzene">
</CML>

In the JUMBO it looks like this:

The document contains a single ELEMENT of type CML. ELEMENT is an SGML term for part of a document, and every CML document can be thought of as a single Object, usually containing subObjects. SGML maps very closely onto Object-Oriented language and all CML ELEMENTS can be mapped onto objects. (JUMBO is written in Java, and every CML ELEMENT corresponds to a Java class). (NB: When referring to a chemical element I shall use ChemicalElement!).

The ELEMENT is defined by the start-tag (<CML>) and end-tag (</CML>). It has no content (in this case), but has a single attribute (TITLE="Hello Benzene"). In general, attributes should describe an element rather than be part of its content, and CML follows this closely. Note that start- and end-tags in CML must always be balanced and must nest neatly. (The only exception is empty tags in HTML such as <BR> or <HR>.)
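
The same anatomy can be inspected programmatically. The following is a minimal sketch using the XML parser in Python's standard library (a choice made for this illustration, not something JUMBO uses). The DOCTYPE line is left out on the assumption that an XML parser will not accept the SGML form shown above, which has a public identifier only.

```python
import xml.etree.ElementTree as ET

# The element from hello.cml, without its SGML DOCTYPE line.
HELLO = '<CML TITLE="Hello Benzene">\n</CML>'

root = ET.fromstring(HELLO)
print(root.tag)      # the element type: CML
print(root.attrib)   # the attributes: {'TITLE': 'Hello Benzene'}
print(len(root))     # the number of child elements: 0
```

The start-tag supplies the type and the attribute, and because nothing sits between the two tags except a line break, the element has no subObjects.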

The JUMBO window renders the information in the CML file precisely. The type of the element is shown by the icon (a small alembic for CML elements). (The large alembic is the CML logo.) JUMBO labels elements with their TITLE attribute if they have one, but otherwise with the type of element (technically known as generic identifier or GI).

Some final technical points:

The document hello.cml is identified internally as an SGML document through the DOCTYPE statement. This statement also defines the DocumentTypeDefinition or DTD. The quoted string (the FormalPublicIdentifier or FPI) can identify precisely the DTD that was used to construct the document. Many DTDs (e.g. HTML, TEI, ISO12200, ISO12083, DOCBOOK, CALS) are in widespread use and allow documents to be precisely and reliably transported and interpreted. All CML documents should use the FPI shown in this example.
The DTD defines strictly what is allowed in an SGML document - you cannot just make up tags and attributes :-). Thus the CML DTD declares the CML element and allows it to have a TITLE attribute whose value can be an ASCII string. All CML documents should be validated against this DTD, because then the recipients can transform them using simple software. (At present you should not worry about where the DTD is or what it looks like!).
The JUMBO creates a dummy object (_DOCROOT) as the root of the document and you will see this on all screenshots. It's not part of the document, but helps with manipulation (e.g. clicking on the _DOCROOT square icon will dump the (SGML) document to a file.)

Hello Benzene again

This is a larger example and shows the power of the TOC and the mixing of different sorts of information. Here is the hello1.cml source file and here is the screen shot of what it produces:

The TOC shows a CML object ('Benzene!' - small alembic icon) which contains 3 subObjects:

Kekulé's account. This is a STRING (the yellow s icon).
Source. This is an ADMIN object (filing cabinet icon), which contains a further subObject (Reference 1).
ADMIN (untitled). This contains two untitled chunks of hypertext (XHTML) and a PERSON (MURRAY-RUST). This contains more subObjects (FIRSTNAME, LASTNAME, and EMAIL).
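
A TOC of this kind is simply a walk over the element tree, with each entry labelled by its TITLE if it has one and by its element type otherwise. Here is a rough Python sketch of that idea. The markup is a hypothetical skeleton that mirrors the TOC described above (including the guess that 'Reference 1' is the BIB element); the real hello1.cml is longer and may differ in detail, and the functions `label` and `toc` are illustrative names, not part of JUMBO.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton mirroring the TOC above; not the real hello1.cml.
HELLO1 = """
<CML TITLE="Benzene!">
  <STRING TITLE="Kekulé's account"/>
  <ADMIN TITLE="Source">
    <BIB TITLE="Reference 1"/>
  </ADMIN>
  <ADMIN>
    <XHTML/>
    <XHTML/>
    <PERSON TITLE="MURRAY-RUST">
      <FIRSTNAME/><LASTNAME/><EMAIL/>
    </PERSON>
  </ADMIN>
</CML>
"""


def label(element):
    """Label an element by its TITLE if it has one, otherwise by its type."""
    return element.get("TITLE") or element.tag


def toc(element, depth=0):
    """Return the indented table-of-contents lines for an element tree."""
    lines = ["  " * depth + label(element)]
    for child in element:
        lines.extend(toc(child, depth + 1))
    return lines


print("\n".join(toc(ET.fromstring(HELLO1.strip()))))
```

Untitled elements such as the second ADMIN and the two XHTML chunks appear under their element type, which matches the '(untitled)' entries in the list above.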

At this stage it's worth comparing the TOC with the hello1.cml source file. You will note that JUMBO has (deliberately) hidden quite a lot of the material so as not to confuse you. An inevitable consequence of marking up your information is that there is suddenly much greater apparent complexity, and friendly navigational tools will become crucial. It's also important for authors to think more carefully about how they present information.

Why go to the complexity of 10 lines (<BIB> ... </BIB>) when it's shorter to write and read:

James Kendall, Great Discoveries by Young Chemists, p94 (1953) publ: Thomas Nelson and Sons Ltd?

In practice it can be very difficult to parse such text (what are the volume numbers, pages, years, etc.?) and it's almost impossible to search it. In CML the containment and markup make it easy to search for, say, a publisher called Nelson rather than an author. It's still possible to display the information succinctly and we'll see that in later examples.
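
A short sketch shows why the markup pays off. The element names inside BIB below (AUTHOR, TITLE, PAGE, YEAR, PUBLISHER) are assumptions chosen for illustration and are not taken from the CML DTD; the function `find_in_field` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the reference above; field names are assumptions.
BIB = """
<BIB TITLE="Reference 1">
  <AUTHOR>James Kendall</AUTHOR>
  <TITLE>Great Discoveries by Young Chemists</TITLE>
  <PAGE>94</PAGE>
  <YEAR>1953</YEAR>
  <PUBLISHER>Thomas Nelson and Sons Ltd</PUBLISHER>
</BIB>
"""


def find_in_field(bib, field, word):
    """Return the text of every <field> element whose text contains word."""
    return [e.text for e in bib.iter(field) if e.text and word in e.text]


bib = ET.fromstring(BIB.strip())
print(find_in_field(bib, "PUBLISHER", "Nelson"))   # ['Thomas Nelson and Sons Ltd']
print(find_in_field(bib, "AUTHOR", "Nelson"))      # []
```

With the one-line text version, a search for "Nelson" cannot tell a publisher from an author; with containment it can.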

The next figure

shows a common way of exploring the information. Clicking on icons for subObjects displays them in appropriate ways and here the STRINGS: TITLE, PUBLISHER and "Kekulé's account" are all displayed in separate windows. Such subwindows are a common feature and many applications will need to limit their number by application-dependent software. For a CML file of unknown structure, however, there is no other easy method of exploration at present. In some cases the subwindows might themselves be (sub)TOCs (e.g. for chapters of a book). (This is a universal problem for electronic publications and I expect that we shall see imaginative generic solutions over the next few years - obviously I can't predict what they will be.)

A feature you may have noticed is the accented e (é). This is provided by an SGML entity with the symbolic name eacute. When the application parses the file it looks for strings of the form &something; and, if possible, replaces them by their value. In this case, eacute is defined in ISO-Latin 1, and is also implemented in the Java character set and all commonly used browsers. There are lots of other entity sets (many from ISO) for Greek, maths, etc., but as yet relatively few browsers or fonts have glyphs for them. Happily there is now much emphasis on extended character sets (e.g. ISO10646) and SGML has been designed to support them.
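
The replacement step can be imitated in a couple of lines. This sketch uses `html.unescape` from Python's standard library, which knows the HTML named entities (eacute among them). It only illustrates the substitution; an SGML parser would take the definition from the entity set declared for the document.

```python
import html

raw = "Kekul&eacute;'s account"   # text as it might appear in the source file
print(html.unescape(raw))          # Kekulé's account
```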

Another Molecule...

We shall now explore in detail a simple molecule, and then its mass spectrum. The molecule is held in the CML file mol.cml, which is displayed as a TOC:

The MOL consists of ATOMS, BONDS and a FORMULA. In detail:

ATOMS

There is no mandatory field in ATOMS, but ELSYM will almost always be present. This is an ARRAY of symbols (in this case 11), separated by whitespace. Normally, as here, they will be from the periodic table but others are permitted (such as dummy) and hopefully an agreed list will evolve. The list is terminated by the closing tag and the count can be deduced by the application (there is a (redundant) SIZE attribute if required).

The ATOMS have X2 and Y2 ARRAYs to provide drawing coordinates. (There are separate attributes X3, Y3 and Z3 for 3-dimensional coordinates which can coexist with 2-D coordinates if appropriate.) There is also a list of the atomic charges (all zero in this case). Note that all ARRAYs within ATOMS have to be the same length, or the file is invalid. There are about 30 possible values for the BUILTIN attribute, and it will be expected that application software should be able to deal with all of these in some way (e.g. by drawing, producing tables, etc.). There is a simple, robust, mechanism for adding additional atomic properties as ARRAYs when necessary.
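
The equal-length rule is easy to check mechanically. The sketch below is a rough illustration in Python: the three-atom fragment is hypothetical (it is not mol.cml), the exact markup is an assumption based on the description above, and `read_arrays` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-atom fragment; markup assumed from the description above.
ATOMS = """
<ATOMS>
  <ARRAY BUILTIN="ELSYM">O H H</ARRAY>
  <ARRAY BUILTIN="X2">0.00 0.96 -0.24</ARRAY>
  <ARRAY BUILTIN="Y2">0.00 0.00 0.93</ARRAY>
</ATOMS>
"""


def read_arrays(atoms):
    """Map each BUILTIN name to its list of whitespace-separated tokens."""
    arrays = {a.get("BUILTIN"): (a.text or "").split()
              for a in atoms.iter("ARRAY")}
    lengths = {len(tokens) for tokens in arrays.values()}
    if len(lengths) > 1:
        raise ValueError("ARRAYs within ATOMS differ in length")
    return arrays


arrays = read_arrays(ET.fromstring(ATOMS.strip()))
symbols = arrays["ELSYM"]
coords = list(zip(map(float, arrays["X2"]), map(float, arrays["Y2"])))
print(list(zip(symbols, coords)))
```

Because the count is deduced from the tokens themselves, no separate atom count has to be kept in step with the data.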

BONDS

BONDS follow a similar format. The normal way to specify a bond is to identify the atoms at the ends (ATID1 and ATID2). These may be serial numbers (as here) and always start from 1 (not zero), but may also be any unique identifier (e.g. 'C13A'). The bond ORDER may have various conventions and if it is other than 1, 2, or 3 this should be specified (e.g. CSD aromatic is -5).
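
The one practical trap is the numbering: serial numbers start at 1, whereas many programming languages index lists from 0. A minimal Python sketch follows, using hypothetical data for a water molecule. It handles serial numbers only, not identifiers such as 'C13A', and `bond_table` is an illustrative name.

```python
symbols = ["O", "H", "H"]      # ELSYM tokens for a hypothetical water molecule
atid1 = "1 1".split()          # ATID1 array: first atom of each bond
atid2 = "2 3".split()          # ATID2 array: second atom of each bond
order = "1 1".split()          # ORDER array


def bond_table(symbols, atid1, atid2, order):
    """Resolve 1-based serial numbers into (symbol, symbol, order) triples."""
    table = []
    for a, b, o in zip(atid1, atid2, order):
        i, j = int(a) - 1, int(b) - 1   # serials start at 1, Python lists at 0
        table.append((symbols[i], symbols[j], int(o)))
    return table


print(bond_table(symbols, atid1, atid2, order))   # [('O', 'H', 1), ('O', 'H', 1)]
```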

FORMULA

The formula of a molecule cannot always be calculated from its coordinates (there could be missing atoms, for example), and FORMULA provides a variety of ways to record it. The most common are 2-D connection tables and SMILES. Other conventions will need to be described with care.

The displays

The TOC is shown in more depth here:

where the components of the ATOMS and BONDS can be seen. Clicking on the molecule icon (water) brings up the 2-D diagram, which - in this case - includes the H-atoms.

This diagram introduces a key aspect of Object-Oriented systems - the objects can have properties and methods which are available to every instance of the object. Here, for example, a molecule (with a formula) can be used to calculate a variety of properties. Three are shown above the diagram, where a menu option offers RingCount, MolecularWeight and other options.
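
The idea of an object that carries both data and methods can be sketched in a few lines of Python. The `Molecule` class below is a toy written for this tutorial, not JUMBO's Java class, and the weight table holds standard atomic weights rounded to three decimals for only four elements.

```python
from collections import Counter

# Standard atomic weights, rounded to three decimals (four elements only).
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


class Molecule:
    """A toy molecule object: data (element symbols) plus methods that use it."""

    def __init__(self, symbols):
        self.symbols = list(symbols)

    def formula(self):
        """Element counts in alphabetical order, e.g. 'H2 O1'."""
        counts = Counter(self.symbols)
        return " ".join(f"{el}{n}" for el, n in sorted(counts.items()))

    def molecular_weight(self):
        """Sum of the atomic weights of all atoms."""
        return sum(ATOMIC_WEIGHT[s] for s in self.symbols)


water = Molecule("O H H".split())
print(water.formula())                      # H2 O1
print(round(water.molecular_weight(), 3))   # 18.015
```

Every instance of `Molecule` gets these methods for free, which is the point made above about properties being available without extra programming.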

This application contains tables of isotopes so it is also possible to calculate the isotopic variation of the molecular mass. This is shown as a simulated mass spectrum (with hydrogen loss) and can be compared with the experimental (see below).
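
As a rough indication of how such a simulation can work, the sketch below computes the relative heights of the M, M+1, M+2, ... peaks that arise from carbon-13 alone, using a binomial distribution. This is a deliberate simplification and an assumption on my part, not a description of JUMBO's method: it ignores the isotopes of every other element and does not model hydrogen loss. The abundance value is approximate.

```python
from math import comb

P_13C = 0.0107   # approximate natural abundance of carbon-13


def carbon_isotope_pattern(n_carbons, max_extra=3):
    """Relative abundance of peaks M, M+1, ... from carbon-13 alone (binomial)."""
    return [comb(n_carbons, k) * P_13C**k * (1 - P_13C)**(n_carbons - k)
            for k in range(min(max_extra, n_carbons) + 1)]


# Example: a molecule with six carbon atoms.
for k, p in enumerate(carbon_isotope_pattern(6)):
    print(f"M+{k}: {p:.4f}")
```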

You might want to revisit the first example and see how the CML file has been constructed.

... its Mass Spectrum...

CML has been specifically designed to cater for experimental data and we show a mass spectrum as an example. It was obtained in the JCAMP standard and automatically translated to CML in JUMBO to give ms.cml. (I have pretty-printed this for visual impact, but this doesn't affect the content.) This introduces a key feature of CML: extensibility through external dictionaries or glossaries. Thus XVALUES is defined in the printed JCAMP standard so its use is precisely defined. An application program which can display spectra from CML files could be expected to understand the role of all these terms.

The spectrum itself is contained in an XLIST with BUILTIN="SPECTRUM", and the attributes are doing the job of defining the information. (You could argue that SPECTRUM is sufficiently important to have its own tag and it would be reasonable for the spectroscopy community to develop its own CML-compatible DTD in the future.) At present the generality of CML is kept by not multiplying tags, but putting a high emphasis on the BUILTIN attribute values. Notice the use of FLOAT to define real numbers.

The spectrum has two more attributes: CONTENT="GRAPH" declares that the content of XLIST (which can be very flexible) is constrained to be representable as a graph (i.e. have two ARRAYs of equal lengths). (It's not really necessary as BUILTIN="SPECTRUM" implies that anyway.) DISPLAY="BAR" is a hint to use a bargraph display rather than a continuous line - this is advisory only as it doesn't form part of the content.
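
The GRAPH constraint and the BAR hint can be illustrated with a short Python sketch. The fragment below is hypothetical: the peak values are made up, the way the two ARRAYs are labelled is an assumption rather than a copy of ms.cml, and `read_graph` and `bar_chart` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with made-up peak values; not taken from ms.cml.
SPECTRUM = """
<XLIST BUILTIN="SPECTRUM" CONTENT="GRAPH" DISPLAY="BAR">
  <ARRAY TITLE="XVALUES">15 16 17 18</ARRAY>
  <ARRAY TITLE="YVALUES">2 10 210 999</ARRAY>
</XLIST>
"""


def read_graph(xlist):
    """Return (x, y) pairs, insisting on two ARRAYs of equal length."""
    arrays = [[float(t) for t in (a.text or "").split()]
              for a in xlist.iter("ARRAY")]
    if len(arrays) != 2 or len(arrays[0]) != len(arrays[1]):
        raise ValueError("a GRAPH needs two ARRAYs of equal length")
    return list(zip(*arrays))


def bar_chart(points, width=40):
    """Render each point as a text bar scaled to the tallest peak."""
    top = max(y for _, y in points)
    return [f"{x:6.1f} " + "#" * round(width * y / top) for x, y in points]


xlist = ET.fromstring(SPECTRUM.strip())
points = read_graph(xlist)
if xlist.get("DISPLAY") == "BAR":   # advisory hint only; content is unchanged
    print("\n".join(bar_chart(points)))
```

The DISPLAY value only selects the rendering; the list of points is the same whichever display an application chooses.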

Note how numeric data can be mixed with other information (although the spectrometer hasn't output very meaningful values!). CML can faithfully capture this and use it in other applications.

Here is the TOC:

The scalar properties are represented by the yellow s(tring), i(nteger), f(loat) icons and clicking on these displays the value (if it's short):

Notice how XLIST/BUILTIN="SPECTRUM" has its special icon. Clicking this displays the spectrum:

... and their combination

CML has very few limitations on the information structure and so it's easy to combine simple components into larger compound documents. In compnd.cml the molecule and its spectrum have been combined by simple manual cut-and-paste. (Shortly this will be possible with graphical tools). The resultant TOC shows the simple combination:

It's natural that the spectrum is contained within the MOL element, as it 'belongs' to the molecule. Any number of such objects could be added here if required.
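
The cut-and-paste step amounts to adding one element as a child of another. Here is a minimal Python sketch with skeleton elements standing in for the real molecule and spectrum; the markup is reduced to empty elements for illustration and `attach` is a hypothetical helper.

```python
import copy
import xml.etree.ElementTree as ET

# Skeleton stand-ins for the molecule and its spectrum (contents omitted).
mol = ET.fromstring('<MOL TITLE="molecule"><ATOMS/><BONDS/><FORMULA/></MOL>')
spectrum = ET.fromstring('<XLIST BUILTIN="SPECTRUM"/>')


def attach(parent, child):
    """Return a copy of parent with a copy of child appended as its last subObject."""
    combined = copy.deepcopy(parent)
    combined.append(copy.deepcopy(child))
    return combined


compound = attach(mol, spectrum)
print([e.tag for e in compound])   # ['ATOMS', 'BONDS', 'FORMULA', 'XLIST']
```

Working on copies leaves the original molecule and spectrum untouched, so the same spectrum could be attached elsewhere as well.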

HTML is CML!

This file you are reading is CML! - because CML includes a subset of HTML2.0. Assuming that it's XML-like (i.e. tags are balanced, attribute values quoted) it's possible to display it in JUMBO. Here's a shorter file faq.html (an early FAQ) and here's its TOC:

The HTML tags are identified by coloured icon text and you should be able to compare the document source with this TOC. JUMBO is NOT a general purpose HTML viewer (there are enough already), but it is possible to view components of the document:

Here two of the P(aragraph) elements have been displayed in separate windows. It is also possible to contract the TOC elements in the display and a more compact version is shown here:

This ability to expand and contract parts of an HTML document is not common in most conventional browsers but has obvious advantages for larger documents.
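
Whether a given HTML file meets the 'XML-like' condition mentioned at the start of this section (balanced tags, quoted attribute values) can be tested with any XML parser. The sketch below uses the one in Python's standard library; note that it checks full XML well-formedness, which is somewhat stricter than those two rules alone, and that `is_xml_like` is an illustrative name.

```python
import xml.etree.ElementTree as ET


def is_xml_like(text):
    """True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_xml_like('<P ALIGN="center">Some <B>bold</B> text</P>'))   # True
print(is_xml_like('<P>Some <B>bold text</P>'))                      # False: B is not closed
print(is_xml_like('<P ALIGN=center>text</P>'))                      # False: value not quoted
```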

Next Steps

Although simple, these examples have shown the variety of documents that can be constructed in CML. After mastering them (especially if you have JUMBO) it's probably best to browse the wider and more complex range of examples.