CML Frequently Asked Questions

Introduction

CML (Chemical Markup Language) is a new approach to managing molecular information using recently developed Internet tools such as SGML/XML and Java. It has a large scope as it covers disciplines from macromolecular sequences to inorganic molecules and quantum chemistry. There is also a lot of detail as many molecular documents can contain many thousand discrete objects, all of which are manageable in CML. Because of this there is no single place to 'start' learning about CML and this FAQ is offered for those who like that approach.

The FAQ has been prepared (Jan 1997) for the V1.0 release of CML. Previous versions have been released incrementally and although they have had processing software, this hasn't been easily portable. This release should be taken as the starting point and earlier versions should be ignored.

Questions

What is Chemical Markup Language?
What disciplines does it cover?
Why would it be useful to me?
How does CML work?
What technology do I need?
What is SGML and its relevance to CML?
What is XML and its relevance to CML?
Do I have to know SGML?
How can I create CML files/documents?
Can I edit CML documents?
What software is available for processing CML?
What is a DTD? Do I need to know?
Why is the CML DTD so complex?
How is whitespace treated in CML?
Does CML understand aromaticity?
How does CML help molecular database users?
What is (chemical) MIME?
How can I convert my *.foo file to CML?
Can CML draw 2-D molecular structures?
Can CML draw 3-D molecular structures?
Could I write a scientific paper in CML?
Is CML Object-Oriented?
Can I search CML documents?
Can I run CML from Netscape/MSIE, etc?
Does CML use standards?
Is CML a 'standard'?
What does CML NOT do?
Are there any known bugs?
How is CML likely to develop?
Are there restrictions on the use of CML?

Answers

What is Chemical Markup Language?

CML (Chemical Markup Language) is a new approach to managing molecular information using recently developed Internet tools such as SGML/XML and Java. It is based strictly on SGML, the most robust and widely used system for precise information management in many areas. It has been developed over 18 months, and has been tested in many areas and on a variety of machines.
CML is NOT 'just another file format'; it is capable of holding extremely complex information structures and so acting as an interchange mechanism or for archival. It interfaces easily with modern database architectures such as relational databases or object-oriented databases.

What disciplines does it cover?

CML has already been used to manage documents and information in:

Macromolecular Sequence
Macromolecular Structure
Spectra
Organic Molecules
Publishing
Quantum Chemistry
Inorganic Crystallography
Hypertext (HTML)
Databases
Terminology

and others. A selection of uses is shown in the examples.

Why would it be useful to me?

CML provides a lossless transmission of information so it is ideally suited to sending complex data over networks. It is machine-independent so guarantees precise machine-machine communication. It is object-oriented so that it supports and interfaces with modern developments such as Java, C++ and Corba. Wherever you find yourself managing complex information, especially from a variety of sources, CML provides an extra level of help to prevent you getting lost.

What is SGML and its relevance to CML?

SGML is a very widely used information standard, employed extensively in commerce and by governments. It's extremely well tested and there are a huge number of tools and support organisations (see the SGML Home page by Robin Cover).
SGML is not a language (despite its name), but is a meta-language for constructing other markup languages. The best known of these is HTML (HyperTextMarkupLanguage) but there are many others in publishing, government, business, music and literature. CML has been developed to support molecular sciences.
SGML is designed to manage 'chunks' of information (called entities) and these can range from whole chapters to single characters. SGML is particularly well placed to support varying character sets and is therefore the system of choice for chemical documents.
Typical phrases are "CML is written in SGML" or "CML is an application of SGML".

How does CML work?

Not yet written.

What technology do I need?

Not yet written.

What is XML and its relevance to CML?

XML (eXtensibleMarkupLanguage) is being developed by a large and dynamic group under the aegis of the W3 consortium. It is a very recently suggested approach for developing SGML applications over the Inter- and Intra-nets. It's essentially a subset (or simplification) of SGML and is much easier to use. If you know how to write valid HTML, it's a very small step to using XML. The main philosophy is that everything can be contained within a single document and you don't need to have supporting documents (especially the DTD or DocumentTypeDefinition).
XML is still very new and not yet fixed, but CML has used an XML-like philosophy for many months and so there is no problem in making it follow the emerging 'standard'. In any case this is unlikely to affect the current versions of CML documents, other than possibly to add some additional boilerplate header information to help parsers. Remember that CML is still strict SGML so nothing has been lost in this simplification.
The main features of XML are:

Easy to use, both authoring and reading
No complex supporting documents
XML documents need not be valid, and can simply be well-formed
Widespread acceptance
Simple to write parsing software

Well-formed is an important new concept. Essentially it means that a document is syntactically correct (e.g. the start and end tags balance, ATTRIBUTEs are quoted, etc.). However the document might not be valid (e.g. it might contain an unknown tag). XML is therefore very well suited to situations where the documents have already been validated (e.g. because the authoring software is authenticated, or because they have already passed through a validating parser). NOTE, however, that all CML documents must be validatable against the CML DTD, but it is possible to manipulate them without necessarily having to validate them.
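The difference between well-formed and valid can be illustrated with a generic XML parser. The following is a minimal sketch in Python, not part of the CML distribution; it assumes the standard xml.etree.ElementTree module, which checks well-formedness only and never consults a DTD. The WIBBLE tag is made up for the illustration, and the fragments are not complete CML documents.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML (no DTD validation is done)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Illustrative fragments only; WIBBLE is a deliberately unknown tag.
balanced = '<MOL><XVAR BUILTIN="MOLWT">46.07</XVAR></MOL>'
unbalanced = '<MOL><XVAR BUILTIN="MOLWT">46.07</MOL>'
unquoted = '<MOL><XVAR BUILTIN=MOLWT>46.07</XVAR></MOL>'
unknown_tag = '<MOL><WIBBLE>46.07</WIBBLE></MOL>'

print(is_well_formed(balanced))     # True
print(is_well_formed(unbalanced))   # False: XVAR has no end tag
print(is_well_formed(unquoted))     # False: the ATTRIBUTE value is not quoted
print(is_well_formed(unknown_tag))  # True: well-formed, but not valid CML
```

The last case is the one to note: only a validating parser working from the DTD would reject it.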
Historical note. Until recently CML used a language called XML. I have now changed this to TecML (TechnicalMarkupLanguage) to avoid collisions with the W3 effort. Please ignore any references to XML unless it refers to this new approach.

Do I have to know SGML?

For the vast majority of likely users of CML, NO. The rules for constructing CML documents are simple:

No tags can be omitted (i.e. every start-tag has a closing end-tag)
All ATTRIBUTE values must have quotes
The documents must have a DOCTYPE statement
In general, whitespace should not be used as markup (e.g. FORTRAN formatting or the use of PRE). CML has better mechanisms.

If you are writing software to create CML files or read them you will probably find that you don't need to know SGML. CML is a flexible DTD so that most ELEMENTs (tags) can occur in 'reasonable' places. You will need to read the documentation to discover what ATTRIBUTEs are available and what values they may have, but again you won't have to read formal SGML texts.
If you want your documents to conform to the XML/CML spec then it will be easiest simply to copy a common header verbatim. (This will be small, but tells the parser some basic information such as which ELEMENTs can contain text.) The most likely area where change will occur is in the definition of entities, especially if we develop special chemical ones.

How can I create CML files/documents?

The following methods are, or will become, available:

Manual authoring, perhaps aided with generic tools (e.g. EMACS)
Graphical authoring
Editing of existing CML documents
Merging of existing CML documents
Conversion of existing file types
Direct output from programs (e.g. calculations or instrumental data)

At present there is a reasonable selection of converters closely following the chemical/* MIME types (e.g. PDB, MOL, JCAMP). It's not too difficult to write other converters and I'd be happy to show you how. The main difficulty is not normally the CML, but parsing the existing files :-(

Can I edit CML documents?

If you know what you are doing. Obviously the result must be well-formed, so you mustn't omit tags, quotes, etc. You also mustn't make up new tags - the processing software is required to complain! There are restrictions about which ELEMENTs can be included in other ELEMENTs, but there is enough fluidity that it's mainly a question of whether the result is meaningful to you and the rest of the world. Some things are forbidden - a variable (XVAR) can only contain a string - but a list container (XLIST) can contain almost anything.
If no one else does I intend to develop a graphical editor for CML. This is NOT a trivial task, especially if it has to validate. It is most likely that this will deal in cut-and-paste chunks of information - e.g. import a Molfile and put it 'here' in the document, and a citation and put it 'there'.

What software is available for processing CML?

Formally CML is a language for describing molecular information, so doesn't comprise software. However it depends on software being available and this is a short account.
There is a huge amount of generic SGML software. For example, if you wish to validate CML, get J Clark's free sgmls or SP - they are very impressive tools. For processing there is Joe English's cost or perl-based equivalents. The W3 consortium is continuing to develop tools for SGML on the Net. In the commercial arena there are many tools, but frequently they are customised for particular disciplines and have not yet supported molecular sciences except in a textual fashion.
To support the molecular applications I have written a large number of Java classes (at least one for each CML ELEMENT). These can render, transform, search and provide some limited molecular perception. These classes are NOT intended to duplicate the many free and commercial tools for managing molecular information (e.g. databases, chemical perception, substructure searching). They are primarily the interfaces between CML and existing systems and are particularly useful for converting and reorganising documents.

What is a DTD? Do I need to know?

A DTD (or DocumentTypeDefinition) is the formal specification of an SGML document. For example it prescribes the syntax completely, such as formats for tags, how entities are recognised, etc. It states what ATTRIBUTEs an ELEMENT may have (e.g. HREF and NAME for A in HTML) and what values these ATTRIBUTEs can have. It also states what an ELEMENT can contain, such as text or other ELEMENTs. The rules can be quite complex and are expressed in a grammar (Extended Backus Naur - EBNF). So, for example, a list item (LI) in HTML must occur within a container such as UL or OL.
CML is sufficiently flexible that these rules can be expressed in simple language if they are needed (e.g. "an XVAR may only contain text data and cannot contain other ELEMENTs"). If you are writing a validating parser you will have to know, but if you rely on a valid document, then you don't. If you have to look at the DTD you should start with the output from dtd2html which is much simpler to understand than the raw DTDs.
You are not allowed to edit the CML DTD or the accompanying files.

Why is the CML DTD so complex?

TecML recognises that molecular science is not the only discipline that it might support, so there is a mechanism for adding DTDs. At present there are two such, HTML (V2.0) and MOL (for molecules). It's possible to include other DTDs from other disciplines if required. However this is not trivial in SGML and is only manageable by complex manipulation of strings ('parameter entities'). They are a useful and moderately robust mechanism for updating the DTD, but only if you are very well practised in SGML.
The 'virtual' DTD that emerges after these manipulations is simpler, but doesn't get printed out. dtd2html gives a reasonable picture.

How is whitespace treated in CML?

Whitespace is very complex in SGML and can be a very common source of errors. It is a particular problem before and after tags and you have to know the precise rules. The good news is that XML has recognised this problem and it is now much easier.
XML (and therefore CML) allows non-significant whitespace (such as between tags) to be discarded and for the rest to be folded into a single space character. Since whitespace is a very poor markup tool (e.g. tabbing) CML does not use it at all, and requires the use of tagged delimiters. HTML has also accustomed many people to treating whitespace as imprecise, and CML follows this approach.
Specifically, therefore, CML regards any contiguous whitespace as a single separating character. The ARRAY ELEMENT below:

<ARRAY>
1.2 3.4        5.6
7.8 9.0
11.234
            </ARRAY>

therefore contains 6 floating-point numbers and no significant whitespace. This means you can format large chunks of data so it can pass through mailers, etc. If you have to include significant spaces (as in PDB atom identifiers) use quotes or entities:

<ARRAY>
" CA" "CA "
</ARRAY>

represents a C-alpha and a Calcium.
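A reader for ARRAY content along these lines might be sketched as follows. This is an illustration in Python rather than the JUMBO code, and it assumes that the standard shlex module is a close enough stand-in for CML's quoting rules; shlex also gives special meaning to single quotes and backslashes, which CML may not.

```python
import shlex

def array_tokens(content):
    """Split ARRAY content on runs of whitespace, keeping quoted strings intact."""
    return shlex.split(content)

# The numeric ARRAY from the text, with its irregular layout preserved.
numbers = """
1.2 3.4        5.6
7.8 9.0
11.234
            """
values = [float(token) for token in array_tokens(numbers)]
print(values)   # [1.2, 3.4, 5.6, 7.8, 9.0, 11.234]

# The quoted ARRAY: the significant spaces survive, the quotes do not.
atoms = array_tokens('\n" CA" "CA "\n')
print(atoms)    # [' CA', 'CA ']
```

Entities are not handled here; a real reader would expand them before splitting.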

Does CML understand aromaticity?

This is typical of a wide range of questions about the level of detail and the algorithmic support that CML provides.
CML provides support for a very large number of chemical concepts but does not presume to supply the details. Thus a common mechanism for transmitting aromaticity is to draw a Kekulé structure with alternating double and single bonds. CML supports you in doing this, but does not place its own interpretation on it. For example, other systems use an 'aromatic bond' (-5 in CSD) and CML will deliberately not recognise that these two correspond to the same molecule (CML does not compare molecules - that is the role of the application program).
It might be tidy to require everyone to use the same convention but the world doesn't work that way! CML allows you to use whatever convention you like BUT you have to tell people what convention you are using. It's easy and more acceptable to do this if the convention is already widely used, and it's probable that CML-aware software will concentrate on the commoner systems. However, so long as you use the CONVENTION ATTRIBUTE to describe what you are doing, you are free to do what you want. It is likely, however, that the CML project will recommend that certain CONVENTIONs (e.g. PDB) are reserved and may not be used for other purposes.

How does CML help molecular database users?

CML is not a database management system, but both database schemas and data can be represented in CML and often this can provide new approaches. Thus a CML document can be regarded as the serialisation of an object - in other words an ASCII representation of objects held in programs or databases. (There will be other tools for serialisation but CML can be made isomorphous with them.) In this sense CML acts as an object schema, or the basis for developing IDLs (for use with CORBA) in molecular sciences.
There are many ways to manage objects, including distributed databases. Thus it could be reasonable to deliver a protein entry from several servers. One could hold its sequence, another its coordinates, a third the small molecules, etc. and these can all be transmitted as CML documents.
CML can also be used to represent data in relational databases and can provide a mechanism for input, output and archival.
CML also has a role to play in data entry. If entries are generated in CML (perhaps with an authoring tool) it becomes much easier to abstract the information from them when checking and validating. Moreover, since CML supports ADMIN info, it's easy for authors to add this before submission.

What is (chemical) MIME?

MIME is an IETF standard for labelling electronic documents (files) for transmission between machines by mail or WWW protocols. Every document can be stamped with a MIME content-type such as "text/sgml" or "image/gif". Thus all documents sent from a WWW server have a content-type provided by the server, which allows the client to decide how to treat it (e.g. what software to use for rendering). In 1994-5 Henry Rzepa put forward the idea that this could be extended to cover molecular science and he, Ben Whittaker and I published a proposal which has been widely adopted. See the Chemical MIME home page. CML will have its own MIME type, chemical/x-cml.
MIME stamps are external to the document and the only safe way of encoding a document type within itself is to use SGML using the DOCTYPE statement. This declares what DTD the document uses. For example, this document (which is of type text/html) starts with:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML//EN">

which declares it as using the HTML DTD of the W3C.
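The two kinds of labelling (an external MIME stamp and an internal DOCTYPE) can be sketched in a few lines of Python. This is only an illustration: the .cml and .pdb file extensions are my assumptions, the regular expression is a rough approximation rather than a full SGML prolog parser, and the helper names are made up.

```python
import mimetypes
import re

# Register the chemical MIME types named in the text.
# The file extensions are assumptions made for this sketch.
mimetypes.add_type("chemical/x-cml", ".cml")
mimetypes.add_type("chemical/x-pdb", ".pdb")

def external_type(filename):
    """MIME type a server might stamp on the file, judged from its name alone."""
    return mimetypes.guess_type(filename)[0]

# Rough pattern for a DOCTYPE statement with an optional PUBLIC identifier.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+(\w+)(?:\s+PUBLIC\s+"([^"]*)")?', re.IGNORECASE)

def internal_type(text):
    """Return (document type name, public identifier or None) from the DOCTYPE."""
    match = DOCTYPE_RE.search(text)
    return (match.group(1), match.group(2)) if match else None

print(external_type("caffeine.cml"))
# chemical/x-cml
print(internal_type('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML//EN">\n<HTML>'))
# ('HTML', '-//W3C//DTD HTML//EN')
```

The external stamp is lost as soon as the file is renamed or saved without it, while the DOCTYPE travels with the document.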

How can I convert my *.foo file to CML?

If *.foo is on the chemical/* MIME list there is a good chance that a full or partial converter has been written. If not, one will need to be written (manual conversion is not recommended). I shall give an example of how to do this soon, but the main task is to identify the information components within your *.foo file, such as molecules, scalar data, text, citations, arrays, tables, graphs, dates, URLs, etc. You must then write a parser that reads one of your files and extracts this information. For each of these there is a simple routine which allows you to poke the object into the CML document.
At this stage you will need to think about the logical structure of your information. What data belongs to this molecule (e.g. a date)? Should all the annotations be in a separate section (e.g. an XLIST)? You will also find that you start creating your own DICTNAMEs for information and so you should draw up a glossary of all those terms used (if you have a user manual this should be in it anyway).
If you are in charge of the generation of *.foo files (e.g. they are output by your software or instrument) consider adding a CML option to the system. This is much easier than writing a parser and is not a lengthy process.
Note also that you don't have to convert every piece of information initially. In some cases it can be held as text (XVAR or XHTML) until you have decided what to do with it (an example is the REMARK cards in the CML version of PDB). But the more markup you add, the more valuable it will be to your readers.
CML also provides a mechanism for total encapsulation of foreign files. Do not use this as a lazy way out, but it's reasonable if you already use standard approaches from other disciplines. Thus a CML file might hold a CGM file (ComputerGraphicsMetafile) for its graphics - there is no real advantage in conversion.
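The two halves of a converter (parse the legacy file, then emit markup) can be sketched as below. This is a hedged illustration in Python, not one of the existing converters. The *.foo layout (a title line followed by 'symbol x y z' rows) is invented, and the BUILTIN values TITLE, ELSYM, X3, Y3 and Z3 are placeholders rather than names taken from the CML DTD; only the MOL, ATOMS, ARRAY and XVAR ELEMENTs come from this FAQ. No DOCTYPE header is written.

```python
from xml.sax.saxutils import escape, quoteattr

def parse_foo(text):
    """Parse a hypothetical *.foo file: a title line, then 'symbol x y z' rows."""
    lines = [line for line in text.splitlines() if line.strip()]
    title = lines[0].strip()
    atoms = []
    for line in lines[1:]:
        symbol, x, y, z = line.split()
        atoms.append((symbol, float(x), float(y), float(z)))
    return title, atoms

def array(builtin, values):
    """Emit one ARRAY ELEMENT holding whitespace-separated values."""
    body = " ".join(str(value) for value in values)
    return "<ARRAY BUILTIN=%s>%s</ARRAY>" % (quoteattr(builtin), escape(body))

def to_cml(title, atoms):
    """Emit a MOL fragment; the BUILTIN names are placeholders, not real CML."""
    parts = ["<MOL>",
             "<XVAR BUILTIN=%s>%s</XVAR>" % (quoteattr("TITLE"), escape(title)),
             "<ATOMS>",
             array("ELSYM", [atom[0] for atom in atoms])]
    for name, index in (("X3", 1), ("Y3", 2), ("Z3", 3)):
        parts.append(array(name, [atom[index] for atom in atoms]))
    parts += ["</ATOMS>", "</MOL>"]
    return "\n".join(parts)

foo_text = """water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""
title, atoms = parse_foo(foo_text)
cml = to_cml(title, atoms)
print(cml)
```

As noted above, most of the effort in practice goes into parse_foo; the output side stays short because every value is escaped and every tag is closed by construction.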

Can CML draw 2-D molecular structures?

CML can hold 2-D molecular information in a variety of ways:

Connection tables
SMILES
2-D coordinates

It is up to the application program what use it makes of these. JUMBO is able to use the 2-D coordinates to draw diagrams, and a connectionTable to 2-D diagram tool is under way. Remember that JUMBO is not intended to duplicate systems which already exist.

Can CML draw 3-D molecular structures?

CML can hold 3-D molecular information in a variety of ways:

3-D cartesian coordinates
fractional coordinates
Z-matrix
Crystallographic Unit Cell
Molecular symmetry
Space group symmetry

It is up to the application program how this is used. JUMBO can use the first two to create single molecules but does not yet apply crystal or molecular symmetry - remember that JUMBO is not intended to duplicate systems which already exist.
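As an illustration of what an application has to do with fractional coordinates and a unit cell, here is a minimal Python sketch of the usual conversion to cartesian coordinates. It is not JUMBO's code. It assumes one common axis convention (a along x, b in the xy plane) and takes cell lengths and angles (in degrees) as plain numbers; the function names are my own.

```python
import math

def cell_vectors(a, b, c, alpha, beta, gamma):
    """Cartesian cell vectors from cell lengths and angles in degrees.

    Convention assumed here: a lies along x and b lies in the xy plane.
    """
    cos_a, cos_b, cos_g = (math.cos(math.radians(t)) for t in (alpha, beta, gamma))
    sin_g = math.sin(math.radians(gamma))
    va = (a, 0.0, 0.0)
    vb = (b * cos_g, b * sin_g, 0.0)
    cx = c * cos_b
    cy = c * (cos_a - cos_b * cos_g) / sin_g
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return va, vb, (cx, cy, cz)

def frac_to_cart(frac, cell):
    """Convert one fractional coordinate triple to cartesian coordinates."""
    va, vb, vc = cell_vectors(*cell)
    return tuple(frac[0] * va[i] + frac[1] * vb[i] + frac[2] * vc[i]
                 for i in range(3))

# An orthorhombic cell: the result is just each fraction times its cell length.
x, y, z = frac_to_cart((0.5, 0.25, 0.1), (10.0, 20.0, 30.0, 90.0, 90.0, 90.0))
print(round(x, 6), round(y, 6), round(z, 6))
```

Applying space group symmetry would be a further step on top of this and is not attempted here.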

Could I write a scientific paper in CML?

Yes. I have already converted a J.C.S. Chemical Communications paper into CML with a very high degree of markup. Many papers, especially viewed as preprints, contain well separated chunks of information of the sort:

(Hyper)text with links to other objects
diagrams
molecules
scalar data
tables
graphs
citations

A paper can then consist of separate sections holding these data which can often be automatically converted from other formats. I am already doing this in collaboration with Henry Rzepa and Chris Leach for the ECHET96 conference (see Henry's home page).
If the author wishes to render the paper in particular ways (e.g. by interleaving molecules in the text, etc.) they will have to find a system for doing this. The advantage of SGML is that it is very widely used and understood in the publishing and printing communities, so if you take a document specification in SGML to a specialist they are likely to be able to help.

Is CML Object-Oriented?

Yes. Any SGML document can be held as a tree of objects and all CML documents are trivially parsable in this way. JUMBO is written in Java and a CML document is displayable as a tree of objects. The CML document is a serialisation of those objects.
There are many advantages to this. Firstly, CML automatically gains the benefit of generic advances in OO technology, such as CORBA/IDL, Java beans, etc. Object databases will accept SGML (CML) as an input specification and can therefore build complex search and management tools using their generic procedures. On top of this objects can have methods which they can invoke, so that they 'carry around with them' methods for rendering, answering questions about themselves, linking to other objects, etc. A key method is validation, so that objects 'know' whether their data is valid or not. For example, all ELEMENTs in CML have a Java class and many of these have over 1000 lines of code.
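The tree-of-objects view can be demonstrated with any generic parser. The sketch below uses Python's standard xml.etree.ElementTree rather than the JUMBO classes, and the small document is made up for the purpose (it is not a valid CML file, merely a well-formed one using ELEMENT names from this FAQ).

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment using ELEMENT names mentioned in this FAQ.
doc = """<XLIST>
  <MOL><XVAR BUILTIN="MOLWT">46.07</XVAR></MOL>
  <BIB><XVAR BUILTIN="JOUR">J. Chem. Soc.</XVAR></BIB>
</XLIST>"""

def outline(element, depth=0):
    """Return one indented line per element, walking the tree depth-first."""
    label = element.tag
    if element.get("BUILTIN"):
        label += " (" + element.get("BUILTIN") + ")"
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(doc)
print("\n".join(outline(root)))
# XLIST
#   MOL
#     XVAR (MOLWT)
#   BIB
#     XVAR (JOUR)

# Serialise the objects back to text and parse again: the tree is unchanged.
again = ET.fromstring(ET.tostring(root, encoding="unicode"))
print(outline(again) == outline(root))   # True
```

In JUMBO each node would be an instance of the Java class for its ELEMENT, carrying its own methods; here the nodes are generic.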

Can I search CML documents?

Yes! SGML is an extremely powerful way of organising information so that it can be searched later. It is possible to search on the content of ELEMENTs, on their ATTRIBUTEs, and - very powerful - to search by context. The following are examples of the sort of questions that are possible:

How many references are not from journals? (i.e. a BIB ELEMENT does not contain any XVAR ELEMENTs with BUILTIN="JOUR")
How many files contain more than 2 molecules? (find and count all MOL ELEMENTs that are not parents of MOL).
Which authors reference their own work? (find the authors of the document (perhaps in ADMIN) and all the authors in the BIB ELEMENTs).
Which files refer to molecules with molecular weights < 500? (search the files for XVAR with TYPE="FLOAT" and BUILTIN="MOLWT". Alternatively, search for all FORMULA with BUILTIN="STOICHIOM" and calculate the molecular weight).

Object Oriented systems will increasingly allow database searches to 'ask the objects' to calculate properties or do their own search. Since CML documents are serialised objects, it will be straightforward to implement this.
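Two of the questions above can be sketched with a generic tool. This is a minimal illustration in Python, using the limited XPath support in the standard xml.etree.ElementTree module rather than an SGML search engine. The sample document is invented, and BUILTIN="TITLE" is a placeholder value of my own; the other ELEMENT and ATTRIBUTE names are those quoted in the questions.

```python
import xml.etree.ElementTree as ET

# An invented, well-formed sample; not a valid CML document.
doc = ET.fromstring("""<XLIST>
  <MOL><XVAR TYPE="FLOAT" BUILTIN="MOLWT">46.07</XVAR></MOL>
  <MOL><XVAR TYPE="FLOAT" BUILTIN="MOLWT">1203.5</XVAR></MOL>
  <BIB><XVAR BUILTIN="JOUR">J. Chem. Soc.</XVAR></BIB>
  <BIB><XVAR BUILTIN="TITLE">A thesis</XVAR></BIB>
</XLIST>""")

# References not from journals: BIB ELEMENTs containing no XVAR with BUILTIN="JOUR".
non_journal = [bib for bib in doc.iter("BIB")
               if bib.find(".//XVAR[@BUILTIN='JOUR']") is None]

# Molecules with molecular weight below 500, read from the MOLWT XVARs.
light = [mol for mol in doc.iter("MOL")
         for var in mol.findall(".//XVAR[@BUILTIN='MOLWT']")
         if float(var.text) < 500]

print(len(non_journal))   # 1
print(len(light))         # 1
```

Both searches depend on context (an XVAR inside a BIB, an XVAR inside a MOL) and not merely on the text content, which is the point being made above.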

Can I run CML from Netscape/MSIE, etc?

There are several possibilities.
Browsers recognise the MIME type of a document and can be configured to launch an appropriate helper application. It's common for a browser to launch RasMol when it gets a file of type "chemical/x-pdb", and it would be simple to configure your browser to recognise "chemical/x-cml" and launch JUMBO or some other tool.
It would be possible to write (or convert) a JUMBO plugin. I don't intend to do this myself - offers?
Many browsers are java-enabled. This means that if the chemical/x-cml file comes from a server which also has the JUMBO *.class files, you can view the files automatically in your browser with no effort! I shall provide an example of this and I hope to get collaborators who provide other CML-viewable information. Because JUMBO can convert other file types, it can be used to view a wide range of molecular data files. For some of these - such as Quantum Chemical calculations - there are no current viewers.
The WWW is moving towards the use of SGML and I expect browsers to become more SGML/XML-aware. CML is ideally placed to take advantage of this.

Does CML use standards?

Yes. CML uses standards wherever possible. It's based in SGML/XML and MIME. Internally it uses ISO standards for dates and terminology. When standard entity sets (e.g. for Unicode characters) are used, CML can take advantage of this and will render them if the software has the appropriate glyphs. CML can also contain information in other standards or near-standards such as CGM, TeX, GIF, etc.

Is CML a 'standard'?

CML is the first application of SGML to molecular science and it is not possible to predict how it will develop. In some respects it is a meta-language in that very varied applications can be constructed and it's therefore not a prescriptive approach. If strictness is required, then either CML can be used to develop a more hardcoded DTD, or a controlled vocabulary must be used. It's very likely that hardcoded DTDs will use bits of CML.

What does CML NOT do?

It's not software, so it won't actively 'do' anything! The relevant question is "what information can CML not carry?"
CML has no special support at present for:

Chemical reactions. Until we get a feel for the complexity of this they can be represented by MOLs and XVARs along with the RELATIONs to describe what part of the reactants correspond to what part of the products. For simple reactions this should be straightforward.
Queries, substructures, Markush structures, libraries, etc. These all require a grammar, and there is no currently accepted grammar for chemistry (could there be?). CML probably provides enough components for developers (e.g. atom types can be configured and molecules can contain other molecules). Remember that these areas are probably not sufficiently well understood anyway.
Heteromers. Glycoproteins, chemically modified proteins, modified nucleic acids, carbohydrates, etc. are very difficult. Ultimately they can (probably) be represented by a connection table but this loses the advantage of polymeric representations. My first approach has been for a molecule to contain references to its oligomeric components, along with details of the covalent linkages and stereochemistry.
Parsable mathematics. Systems like TeX can hold formatting instructions and can therefore be rendered, but it isn't possible to evaluate an equation in TeX. There are several DTD projects in the maths community, but none appears to have emerged - any news would be welcome. (Basic linear and polynomial representations, along with common functions would be a useful start and would, for example, allow graphical curves to be transmitted and rendered on graphs.)

In the last resort it may be possible to use parsable textual descriptions or contained files. But CML will not be widely useful if it is simply a container for a ragbag of legacy file types!
SGML is very good at providing the containment required in Object-Oriented systems, but not well suited to inheritance. Thus special semantics will have to be added for (say) taxonomy - an area I do not intend to venture into! Multiple inheritance will be even worse than single inheritance!

Are there any known bugs?

Strictly speaking, no! CML is a DTD of SGML and parses satisfactorily with sgmls which is as bug-free as any software can be. So the real question is "what has CML got wrong?"
CML has a clear philosophy of supporting very varied applications, so it cannot provide precise structure in a document. For example you might wish to insist that every citation has an author - CML can't do this for you since it provides for cases where documents don't have authors (quite common). This check will have to be made elsewhere (e.g. in a special postprocessor - and CML makes it very easy to write this.)
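A postprocessor of the kind just mentioned might look like the following. It is a sketch in Python using the standard xml.etree.ElementTree module, not an existing CML tool. I have assumed that an author is marked up as an XVAR with BUILTIN="AUTHOR" inside the BIB; that ATTRIBUTE value is a guess made for the illustration, as are the function name and the sample data.

```python
import xml.etree.ElementTree as ET

def citations_missing(root, builtin="AUTHOR"):
    """Return 0-based positions of BIB ELEMENTs with no XVAR of the given BUILTIN.

    BUILTIN="AUTHOR" is an assumed value, not one taken from the CML DTD.
    """
    query = ".//XVAR[@BUILTIN='%s']" % builtin
    return [position for position, bib in enumerate(root.iter("BIB"))
            if bib.find(query) is None]

# Invented sample: the second citation has a journal but no author.
doc = ET.fromstring("""<XLIST>
  <BIB><XVAR BUILTIN="AUTHOR">A. N. Other</XVAR><XVAR BUILTIN="JOUR">J. Chem. Soc.</XVAR></BIB>
  <BIB><XVAR BUILTIN="JOUR">Anonymous Reports</XVAR></BIB>
</XLIST>""")

print(citations_missing(doc))           # [1]
print(citations_missing(doc, "JOUR"))   # []
```

The document itself remains valid CML in either case; the stricter rule lives entirely in the application.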
CML has an implied hierarchy of components (e.g. MOL contains ATOMS contains ARRAY). This seems to work for many varied cases but it's almost certain that there will be systems it can't tackle. In most cases it will be possible to provide a representation, though it may not be pretty.
CML allows flexibility in the size and shape of components through pointers (addresses). Thus complex information 'attached to an atom' can be constructed elsewhere and located by a pointer on the atom. It is possible that this is not powerful enough to hold some systems, but I haven't met them yet.
The most likely problem is that the semantics of the ELEMENTs and their ATTRIBUTEs is poorly described or that the examples are not consistent in their usage. If so, please point this out and I'll try to correct it. In particular, the area of addressing and sub-addressing and RELATIONs are not yet mature.
If you ask "are there any bugs in JUMBO?" the answer is "zillions".

How is CML likely to develop?

CML is offered as a way forward in SGML and Object-Oriented systems for the molecular community. My intention is to freeze the current version (1.0) for a period. The Open Molecule Foundation (OMF) is supporting CML and we are building a community of users. We are especially interested in applications from people developing molecular software and do not intend it to be seen as a competitor to existing systems. JUMBO is a free enabling tool rather than a product.
The philosophy is that collaborators develop their systems to be compatible with the CML architecture. Some may also use JUMBO (or its java classes) as components of their system, but this is not required. They develop their own applications by extending rather than modifying and for this reason CML itself may not be edited nor be redistributed. In this way we hope to avoid the army of mutant files that are so common in molecular science.
Just as there is no single software package for SGML (or HTML) there should not be a single one for CML. We expect both free and commercial products in this area. Serious developers will get special advantages by being members of the OMF in that they will have an input into the development of CML and its priorities. They will also have advance knowledge of the likely developments in the language. There are various levels of membership in the OMF and it is possible to contribute either in cash or in kind (e.g. by authoring extensions, testing, converting legacy systems, etc.).

Are there restrictions on the use of CML?

There is no charge for CML, but it is NOT in the public domain. That means that you may not alter documents in this distribution, nor distribute them to third parties without permission. You may, of course, point them to this page, and this should be used as the definitive reference.
JUMBO consists of a set of Java classes and these may be freely used over the Internet. I intend that their distribution is managed by the Open Molecule Foundation and the intention is that they will be free, but not in the public domain. The classes may not be redistributed without permission but the OMF is actively looking at ways of doing this which will be beneficial to the community. If you wish to include the classes in a product, please contact me.
If you wish to mount the system on your server, there will be a distribution kit, which I hope will be free. The API for the classes will be published. You may therefore extend the classes by standard mechanisms without needing to have source code. This is one of the great benefits of Java and means that the community can rely on a single, stable, core on which they can build. If the extensions are widely valuable it may be possible to incorporate them in future versions.
There will be a community of committed developers who will have access to the source code. This is likely to be managed through the OMF.
