CML Frequently Asked Questions
Introduction
CML (Chemical Markup Language) is a new approach to managing
molecular information using recently developed Internet tools
such as SGML/XML and Java. It has a large scope as it covers
disciplines from macromolecular sequences to inorganic molecules
and quantum chemistry. There is also a lot of detail as many
molecular documents can contain many thousand discrete objects, all
of which are manageable in CML. Because of this there is no single
place to 'start' learning about CML and this FAQ is offered for those
who like that approach.
The FAQ has been prepared (Jan 1997) for the V1.0 release of CML.
Previous versions have been released incrementally and although they
have had processing software, this hasn't been easily portable. This
release should be taken as the starting point and earlier versions
should be ignored.
Questions
Answers
- What is Chemical Markup Language?
-
CML (Chemical Markup Language) is a new approach to managing
molecular information using recently developed Internet tools
such as SGML/XML and Java. It is based strictly on SGML, the most robust
and widely used system for precise information management in many areas.
It has been developed over 18 months, and has been tested in many areas and
on a variety of machines.
CML is NOT 'just another file format'; it is capable of holding extremely
complex information structures and so acting as an interchange mechanism or
for archival. It interfaces easily with modern database architectures
such as relational databases or object-oriented databases.
- What disciplines does it cover?
-
CML has already been used to manage documents and information in:
- Macromolecular Sequence
- Macromolecular Structure
- Spectra
- Organic Molecules
- Publishing
- Quantum Chemistry
- Inorganic Crystallography
- Hypertext (HTML)
- Databases
- Terminology
and others. A selection of uses is shown in
the examples.
- Why would it be useful to me?
-
CML provides a lossless transmission of information so it is ideally suited
to sending complex data over networks. It is machine-independent so
guarantees precise machine-machine communication. It is object-oriented
so that it supports and interfaces with modern developments such as Java,
C++ and Corba. Wherever you find yourself managing complex information,
especially from a variety of sources, CML provides an extra level of help
to prevent you getting lost.
- What is SGML and its relevance to CML?
-
SGML is a very widely used information standard and massively used
commercially and by governments. It's extremely well tested and there are
a huge number of tools and support organisations (see
the SGML Home page by Robin Cover).
SGML is not a language (despite its name), but is a meta-language for
constructing other markup languages. The best known of these is HTML
(HyperTextMarkupLanguage) but there are many others in publishing, government,
business, music and literature. CML has been developed to
support molecular sciences.
SGML is designed to manage 'chunks' of information (called entities)
and these can range from whole chapters to single characters. SGML is
particularly well placed to support varying character sets and is therefore
the system of choice for chemical documents.
Typical phrases are "CML is written in SGML" or "CML is an application
of SGML".
- How does CML work?
-
Not yet written.
- What technology do I need?
-
Not yet written.
- What is XML and its relevance to CML?
-
XML (eXtensibleMarkupLanguage) is being developed by a large and dynamic group
under the aegis of the W3 consortium. It is a very recently suggested
approach for
developing SGML applications over the Inter- and Intra-nets. It's essentially
a subset (or simplification) of SGML and is much easier to use. If you know
how to write valid HTML, it's a very small step to using XML. The
main philosophy is that everything can be
contained within a single document and you don't need to have supporting
documents (especially the DTD or DocumentTypeDefinition).
XML is still very new and not yet fixed, but CML has used an XML-like
philosophy for many months and so there is no problem in making it
follow the emerging 'standard'. In any case this is unlikely to affect
the current versions of CML documents, other than possibly to add some
additional boilerplate header information to help parsers. Remember that
CML is still strict SGML so nothing has been lost in this simplification.
The main features of XML are:
- Easy to use, both authoring and reading
- No complex supporting documents
- XML documents need not be valid, and can simply be well-formed
- Widespread acceptance
- Simple to write parsing software
Well-formed is an important new concept. Essentially it means that a
document is syntactically correct (e.g. the start and end tags balance,
ATTRIBUTEs are quoted, etc.). However the document might not be valid
(e.g. contain an unknown tag). XML is therefore very well suited to
situations where the documents have already been validated (e.g. because the
authoring software is authenticated, or because they have already passed
through a validating parser). NOTE, however, that all CML documents
must be validatable against the CML DTD, but it is possible to manipulate them
without necessarily having to validate them.
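The difference between well-formed and valid can be sketched in a few lines of Python. This is only an illustration: it uses the standard xml.etree.ElementTree parser, which checks well-formedness and nothing more. The ALLOWED tag set, the CML root element and the sample strings are invented for the example and are not taken from the CML DTD, and a set-membership test is a crude stand-in for real validation.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def unknown_tags(text, allowed):
    """Return the tags that are not in the allowed vocabulary (a crude validity check)."""
    root = ET.fromstring(text)
    return {el.tag for el in root.iter()} - allowed

# Hypothetical vocabulary: a handful of element names mentioned in this FAQ.
ALLOWED = {"CML", "MOL", "ATOMS", "ARRAY", "XVAR", "XLIST"}

balanced = '<CML><MOL><XVAR BUILTIN="MOLWT">180.16</XVAR></MOL></CML>'
unbalanced = '<CML><MOL><XVAR BUILTIN="MOLWT">180.16</MOL></CML>'  # XVAR never closed
unquoted = '<CML><XVAR BUILTIN=MOLWT>180.16</XVAR></CML>'          # attribute not quoted
made_up = '<CML><WIBBLE>1</WIBBLE></CML>'                          # well-formed, unknown tag

for name, text in [("balanced", balanced), ("unbalanced", unbalanced),
                   ("unquoted", unquoted), ("made_up", made_up)]:
    print(name, is_well_formed(text))
print(unknown_tags(made_up, ALLOWED))
```

The last sample is the interesting case: it parses without complaint, yet a validating parser working from a DTD would reject the unknown tag.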
Historical note. Until recently CML used a language called XML. I have
now changed this to TecML (TechnicalMarkupLanguage) to avoid collisions
with the W3 effort. Please ignore any references to XML unless it refers to
this new approach.
- Do I have to know SGML?
-
For the vast majority of likely users of CML, NO. The rules for constructing
CML documents are simple:
- No tags can be omitted (i.e. every start-tag has a closing end-tag)
- All ATTRIBUTE values must have quotes
- The documents must have a DOCTYPE statement
- In general, whitespace should not be used as markup (e.g. FORTRAN
formatting or the use of PRE). CML has better mechanisms.
If you are writing software to create CML files or read them you will
probably find that you don't need to know SGML. CML is a flexible DTD so
that most ELEMENTs (tags) can occur in 'reasonable' places. You will need
to read the documentation to discover what ATTRIBUTEs are available and
what values they may have, but again you won't have to read formal SGML
texts.
If you want your documents to conform to the XML/CML spec then it will be
easiest simply to copy a common header verbatim. (This will be small, but
tells the parser some basic information such as which ELEMENTs can contain
text.) The most likely area where change will occur is in the definition
of entities, especially if we develop special chemical ones.
- How can I create CML files/documents?
-
The following methods are, or will become, available:
- Manual authoring, perhaps aided with generic tools (e.g. EMACS)
- Graphical authoring
- Editing of existing CML documents
- Merging of existing CML documents
- Conversion of existing file types
- Direct output from programs (e.g. calculations or instrumental data)
At present there is a reasonable selection of converters closely following
the chemical/* MIME types (e.g. PDB, MOL, JCAMP). It's not too difficult
to write other converters and I'd be happy to show you how. The main
difficulty is not normally the CML, but parsing the existing files :-(
- Can I edit CML documents?
-
If you know what you are doing. Obviously the result must be well-formed,
so you mustn't omit tags, quotes, etc. You also mustn't make up new tags -
the processing software is required to complain! There are restrictions
about which ELEMENTs can be included in other ELEMENTs, but there is enough fluidity
that it's mainly a question of whether the result is meaningful to you and
the rest of the world. Some things are forbidden - a variable (XVAR) can only
contain a string - but a list container (XLIST) can contain almost anything.
If no one else does
I intend to develop a graphical editor for CML. This is NOT a trivial task,
especially if it has to validate. It is most likely that this will deal
in cut-and-paste chunks of information - e.g. import a Molfile and put
it 'here' in the document, and a citation and put it 'there'.
- What software is available for processing CML?
-
Formally CML is a language for describing molecular information, so doesn't
comprise software. However it depends on software being available and this
is a short account.
There is a huge amount of generic SGML software. For example, if you wish
to validate CML, get J Clark's free sgmls or SP - they are
very impressive tools. For processing there is Joe English's cost
or perl-based equivalents. The W3 consortium is continuing to develop
tools for SGML on the Net. In the commercial arena there are many tools,
but frequently they are customised for particular disciplines and have not
yet supported molecular sciences except in a textual fashion.
To support the molecular applications I have written a large number of
java classes (at least one for each CML ELEMENT). These can render,
transform, search and provide some limited molecular perception. These classes
are NOT intended to duplicate the many free and commercial tools for managing
molecular information (e.g. databases, chemical perception, substructure
searching). They are primarily the interfaces between CML and existing
systems and are particularly useful for converting and reorganising documents.
- What is a DTD? Do I need to know?
-
A DTD (or DocumentTypeDefinition) is the formal specification of an SGML
document. For example it prescribes the syntax completely, such as
formats for tags, how entities are recognised, etc. It states what ATTRIBUTEs
an ELEMENT may have (e.g. HREF and NAME for A in HTML) and what values these
ATTRIBUTEs can have. It also states what an ELEMENT can contain, such as
text or other ELEMENTs. The rules can be quite complex and are expressed
in a grammar (Extended Backus Naur - EBNF). So, for example, a list item (LI)
in HTML must occur within a container such as UL or OL.
CML is sufficiently flexible that these rules can be expressed in simple
language if they are needed (e.g. "an XVAR may only contain text data and
cannot contain other ELEMENTs"). If you are writing a validating parser you
will have to know, but if you rely on a valid document, then you
don't. If you have to look at the DTD you should start with the output from
dtd2html which is much simpler to understand than the raw DTDs.
You are not allowed to edit the CML DTD or the accompanying files.
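The two containment rules quoted above (LI must sit inside UL or OL; an XVAR holds text only) can be written down as data and checked against a parsed tree. The sketch below is in Python and is not how a DTD or a validating SGML parser works internally; the table names REQUIRED_PARENTS and TEXT_ONLY and the function name are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Allowed parents per element; the LI rule is the HTML example given above.
REQUIRED_PARENTS = {"LI": {"UL", "OL"}}
# Elements that may hold text only; the XVAR rule is quoted from this FAQ.
TEXT_ONLY = {"XVAR"}

def check_content_rules(root):
    """Return a list of human-readable rule violations found in the tree."""
    problems = []
    for parent in root.iter():
        if parent.tag in TEXT_ONLY and len(parent) > 0:
            problems.append(f"{parent.tag} may not contain other ELEMENTs")
        for child in parent:
            allowed = REQUIRED_PARENTS.get(child.tag)
            if allowed is not None and parent.tag not in allowed:
                problems.append(f"{child.tag} found inside {parent.tag}")
    return problems

print(check_content_rules(ET.fromstring("<UL><LI>one</LI></UL>")))
print(check_content_rules(ET.fromstring("<P><LI>one</LI></P>")))
print(check_content_rules(ET.fromstring("<XLIST><XVAR><XVAR>1</XVAR></XVAR></XLIST>")))
```

A real DTD expresses far more than this (ordering, repetition, attribute values), which is why reading the dtd2html output is still the better route.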
- Why is the CML DTD so complex?
-
TecML recognises that molecular science is not the only discipline that
it might support, so there is a mechanism for adding DTDs. At present there
are two such, HTML (V2.0) and MOL (for molecules). It's possible to include
other DTDs from other disciplines if required. However this is not trivial
in SGML and is only manageable by complex manipulation of strings
('parameter entities'). They are a useful and moderately robust mechanism
for updating the DTD, but only if you are very well practised in SGML.
The 'virtual' DTD that emerges after these manipulations is simpler, but doesn't
get printed out. dtd2html gives a reasonable picture.
- How is whitespace treated in CML?
-
Whitespace is very complex in SGML and can be a very common source of
errors. It is a particular problem before and after tags and you have to
know the precise rules. The good news is that XML has recognised this problem
and it is now much easier.
XML (and therefore CML) allows non-significant whitespace (such as between
tags) to be discarded and for the rest to be folded into a single space
character. Since whitespace is a very poor markup tool (e.g. tabbing)
CML does not use it at all, and requires the use of tagged delimiters.
HTML has also led many people to stop treating whitespace as precise, and CML
follows this approach.
Specifically, therefore, CML regards any contiguous whitespace as a single
separating character. The ARRAY ELEMENT below:
<ARRAY>
1.2 3.4 5.6
7.8 9.0
11.234
</ARRAY>
therefore contains 6 floating-point numbers and no significant whitespace. This
means you can format large chunks of data so it can pass through mailers, etc.
If you have to include significant spaces (as in PDB atom identifiers) use
quotes or entities:
<ARRAY>
" CA" "CA "
</ARRAY>
represents a C-alpha and a Calcium.
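A reader of ARRAY content therefore only has to split on runs of whitespace while keeping quoted strings whole. A minimal Python sketch follows, assuming shell-style quoting is close enough for the two examples above; it uses the standard shlex module, which follows shell rules rather than SGML's and knows nothing about entities. The function name array_tokens is made up.

```python
import shlex

def array_tokens(content):
    """Split ARRAY content on any run of whitespace, keeping quoted strings intact."""
    return shlex.split(content)

# Line breaks and repeated spaces are all just separators.
numbers = [float(tok) for tok in array_tokens("""
1.2 3.4 5.6
7.8   9.0
11.234
""")]

# Quoted tokens keep their leading or trailing space.
atom_names = array_tokens('\n" CA" "CA "\n')

print(len(numbers))
print(atom_names)
```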
- Does CML understand aromaticity?
-
This is typical of a wide range of questions about the level of detail
and the algorithmic support that CML provides.
CML provides support for a very large number of chemical concepts but
does not presume to supply the details. Thus a common mechanism for
transmitting aromaticity is to draw a Kekulé structure with alternating
double and single bonds. CML supports you in doing this, but does not place
its interpretation on it. For example, other systems use an 'aromatic bond'
(-5 in CSD) and CML will deliberately not recognise that these two
correspond to the same molecule (CML does not compare molecules - that is
the role of the application program.)
It might be tidy to require everyone to use the same convention but the world
doesn't work that way! CML allows you to use whatever convention you like
BUT you have to tell people what convention you are using. It's easy and
more acceptable to do this if the convention is already widely used, and it's
probable that CML-aware software will concentrate on the commoner systems.
However, so long as you use the CONVENTION ATTRIBUTE to describe what you
are doing, you are free to do what you want. It is likely, however, that
the CML project will recommend that certain CONVENTIONs (e.g. PDB) are
reserved and may not be used for other purposes.
- How does CML help molecular database users?
-
CML is not a database management system, but both database schemas and data
can be represented in CML and often this can provide new approaches. Thus
a CML document can be regarded as the serialisation of an object - in
other words an ASCII representation of objects held in programs or databases.
(There will be other tools for serialisation but CML can be made
isomorphous with them.) In this sense CML acts as an object schema, or the
basis for developing IDLs (for use with CORBA) in molecular sciences.
There are many ways to manage objects, including distributed databases. Thus
it could be reasonable to deliver a protein entry from several servers. One
could hold its sequence, another its coordinates, a third the small molecules,
etc. and these can all be transmitted as CML documents.
CML can also be used to represent data in relational databases and can provide
a mechanism for input, output and archival.
CML also has a role to play in data entry. If entries are generated in CML
(perhaps with an authoring tool) it becomes much easier to abstract the
information from them when checking and validating. Moreover, since CML
supports ADMIN info, it's easy for authors to add this before submission.
- What is (chemical) MIME?
-
MIME is an IETF standard for labelling electronic documents (files) for
transmission between machines by mail or WWW protocols. Every document can
be stamped with a MIME content-type such as "text/sgml" or "image/gif".
Thus all documents sent from a WWW server have a content-type provided by the
server, which allows the client to decide how to treat it (e.g. what software
to use for rendering). In 1994-5 Henry Rzepa put forward the idea that this
could be extended to cover molecular science and he, Ben Whittaker and I
published a proposal which has been widely adopted. See
the Chemical MIME home page.
CML will have its own MIME type, chemical/x-cml.
MIME stamps are external to the document and the only safe way of encoding
a document type within itself is to use SGML using the DOCTYPE
statement. This declares what DTD the document uses. For example, this
document (which is of type text/html) starts with:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML//EN">
which declares it as using the HTML DTD of the W3C.
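The two kinds of labelling, external (MIME) and internal (DOCTYPE), can be sketched in Python as follows. The ".cml" file extension is an assumption, since this FAQ names only the MIME type; the regular expression is a deliberately simple one that handles the DOCTYPE line shown above and is not a full SGML prolog parser. The function names are made up.

```python
import mimetypes
import re

# Assumption: CML files carry a ".cml" extension; the FAQ only names the MIME type.
mimetypes.add_type("chemical/x-cml", ".cml")

def external_type(filename):
    """MIME type assigned from outside the document, here guessed from the file name."""
    return mimetypes.guess_type(filename)[0]

DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+(\w+)(?:\s+PUBLIC\s+"([^"]*)")?', re.IGNORECASE)

def internal_type(text):
    """Return (document element name, public identifier) from a DOCTYPE statement, or None."""
    match = DOCTYPE_RE.search(text)
    return (match.group(1), match.group(2)) if match else None

print(external_type("aspirin.cml"))
print(internal_type('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML//EN">\n<HTML>'))
```

The first answer depends on whoever configured the server or client; the second travels with the document itself.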
- How can I convert my *.foo file to CML?
-
If *.foo is one of the chemical/* MIME list there is a good chance that a full
or partial converter has been written. If not, one will need to be written
(manual conversion is not recommended). I shall give an example of how to do
this soon, but the main task is to identify the information components within
your *.foo file, such as molecules, scalar data, text, citations, arrays,
tables, graphs, dates, URLs, etc. You must then write a parser that reads
one of your files and extracts this information. For each of these there is
a simple routine which allows you to poke the object into the CML document.
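As a rough illustration of the parse-then-poke pattern, here is a Python sketch for a made-up line-oriented *.foo record. It is not one of the converters mentioned in this FAQ and does not use JUMBO. The element and attribute names (MOL, ATOMS, ARRAY, XVAR, TYPE, BUILTIN, DICTNAME) all appear in this FAQ, but the way they are combined here is a guess and has not been checked against the CML DTD; the *.foo layout and the function name foo_to_cml are entirely hypothetical.

```python
import xml.etree.ElementTree as ET

def foo_to_cml(foo_text):
    """Convert a made-up "key: value" *.foo record into a CML-style element tree."""
    mol = ET.Element("MOL")
    symbols = []
    for line in foo_text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "atom":
            symbols.append(value)          # collected and written as one ARRAY below
        elif key == "molwt":
            xvar = ET.SubElement(mol, "XVAR", {"TYPE": "FLOAT", "BUILTIN": "MOLWT"})
            xvar.text = value
        else:
            # Anything not yet understood is kept as text under a home-made DICTNAME.
            xvar = ET.SubElement(mol, "XVAR", {"DICTNAME": key})
            xvar.text = value
    atoms = ET.SubElement(mol, "ATOMS")
    array = ET.SubElement(atoms, "ARRAY")
    array.text = " ".join(symbols)
    return mol

sample = """
name: water
molwt: 18.015
atom: O
atom: H
atom: H
"""
mol = foo_to_cml(sample)
print(ET.tostring(mol, encoding="unicode"))
```

Most of the effort in a real converter goes into the parsing loop; the part that builds the document stays about this small.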
At this stage you will need to think about the logical structure of your
information. What data belongs to this molecule (e.g. a date)?
Should all the annotations be in a separate section (e.g. an XLIST?). You
will also find that you start creating your own DICTNAMEs for information
and so you should draw up a glossary of all those terms used (if you have
a user manual this should be in it anyway).
If you are in charge of the generation of *.foo files (e.g. they are output
by your software or instrument) consider adding a CML option to the system.
This is much easier than writing a parser and is not a lengthy process.
Note also that you don't have to convert every piece of information initially.
In some cases it can be held as text (XVAR or XHTML) until you have decided
what to do with it (an example is the REMARK cards in the CML version of
PDB). But the more markup you add, the more valuable it will be to your
readers.
CML also provides a mechanism for total encapsulation of foreign files. Do not
use this as a lazy way out, but it's reasonable if you already use standard
approaches from other disciplines. Thus a CML file might hold a CGM file
(ComputerGraphicsMetafile) for its graphics - there is no real advantage in
conversion.
- Can CML draw 2-D molecular structures?
-
CML can hold 2-D molecular information in a variety of ways:
- Connection tables
- SMILES
- 2-D coordinates
It is up to the application program what use it makes of these. JUMBO is
able to use the 2-D coordinates to draw diagrams, and a connectionTable to
2-D diagram tool is under way. Remember that JUMBO is not intended to
duplicate systems which already exist.
- Can CML draw 3-D molecular structures?
-
CML can hold 3-D molecular information in a variety of ways:
- 3-D cartesian coordinates
- fractional coordinates
- Z-matrix
- Crystallographic Unit Cell
- Molecular symmetry
- Space group symmetry
It is up to the application program how this is used. JUMBO can use the
first two to create single molecules but does not yet apply crystal or
molecular symmetry - remember that JUMBO is not intended to duplicate
systems which already exist.
- Could I write a scientific paper in CML?
-
Yes. I have already converted a J.C.S. Chemical Communications paper
into CML with a very high degree of markup. Many papers, especially viewed
as preprints, contain well separated chunks of information of the sort:
- (Hyper)text with links to other objects
- diagrams
- molecules
- scalar data
- tables
- graphs
- citations
A paper can then consist of separate sections holding these data which can
often be automatically converted from other formats. I am already doing
this in collaboration with Henry Rzepa and Chris Leach for the ECHET96
conference (see
Henry's home page).
If the author wishes to render the paper in particular ways (e.g.
by interleaving molecules in the text, etc.) they will have to find a system
for doing this. The advantage of SGML is that it is very widely used and
understood in the publishing and printing communities, so if you take a
document specification in SGML to a specialist they are likely to be able
to help.
- Is CML Object-Oriented?
-
Yes. Any SGML document can be held as a tree of objects and all CML documents
are trivially parsable in this way. JUMBO is written in Java and a CML
document is displayable as a tree of objects. The CML document is a
serialisation of those objects.
There are many advantages to this. Firstly, CML automatically gains the
benefit of generic advances in OO technology, such as CORBA/IDL, Java beans,
etc. Object databases will accept SGML (CML) as an input specification and
can therefore build complex search and management tools using their generic
procedures. On top of this objects can have methods which they can invoke,
so that they 'carry around with them' methods for rendering, answering
questions about themselves, linking to other objects, etc. A key method
is validation, so that objects 'know' whether their data is valid or not.
For example, all ELEMENTs in CML have a Java class and many of these have over
1000 lines of code.
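The idea of one class per ELEMENT, each carrying its own methods, can be sketched briefly. The example is in Python rather than Java and does not reproduce JUMBO's classes or API; the class names, the CLASSES registry and the wrap function are hypothetical. The only rule it encodes is the one quoted elsewhere in this FAQ, that an XVAR holds text and no other ELEMENTs.

```python
import xml.etree.ElementTree as ET

class CMLObject:
    """Generic wrapper: one object per ELEMENT, with its children held as objects too."""
    def __init__(self, element):
        self.element = element
        self.children = [wrap(child) for child in element]

    def validate(self):
        """By default an object is valid when all of its children are valid."""
        return all(child.validate() for child in self.children)

    def serialise(self):
        """Write the object (and everything it contains) back out as markup."""
        return ET.tostring(self.element, encoding="unicode")

class XVar(CMLObject):
    def validate(self):
        # Rule quoted in this FAQ: an XVAR holds text only, no child ELEMENTs.
        return not self.children

CLASSES = {"XVAR": XVar}   # hypothetical registry: element name -> class

def wrap(element):
    """Pick the class for an element, falling back to the generic wrapper."""
    return CLASSES.get(element.tag, CMLObject)(element)

doc = wrap(ET.fromstring("<XLIST><XVAR>1.0</XVAR><MOL><XVAR>x</XVAR></MOL></XLIST>"))
bad = wrap(ET.fromstring("<XLIST><XVAR><MOL/></XVAR></XLIST>"))
print(doc.validate(), bad.validate())
print(doc.serialise())
```

Rendering, searching and linking would be further methods on the same classes, so each object answers questions about itself instead of relying on one central routine.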
- Can I search CML documents?
-
Yes! SGML is an extremely powerful way of organising information so that it
can be searched later. It is possible to search on the content of ELEMENTs,
on their ATTRIBUTEs, and - very powerful - to search by context. The following
are examples of the sort of questions that are possible:
- How many references are not from journals? (i.e. a BIB ELEMENT does
not contain any XVAR ELEMENTs with BUILTIN="JOUR")
- How many files contain more than 2 molecules? (find and count all MOL
ELEMENTs that are not parents of MOL).
- Which authors reference their own work? (find the authors of the document
(perhaps in ADMIN) and all the authors in the BIB ELEMENTs).
- Which files refer to molecules with molecular weights < 500? (search
the files for XVAR with TYPE="FLOAT" and BUILTIN="MOLWT". Alternatively,
search for all FORMULA with BUILTIN="STOICHIOM" and calculate the molecular
weight.)
Object Oriented systems will increasingly allow database searches to 'ask
the objects' to calculate properties or do their own search. Since CML
documents are serialised objects, it will be straightforward to implement
this.
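Two of the questions above can be sketched with a generic tree search. This Python example uses the limited XPath support in the standard xml.etree.ElementTree module, not an SGML query tool. The sample document is invented; the attribute values MOLWT, FLOAT and JOUR come from the questions above, while BUILTIN="TITLE", the XLIST wrapper and the function names are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

def light_molecules(root, limit=500.0):
    """Return MOL elements holding an XVAR TYPE="FLOAT" BUILTIN="MOLWT" value below limit."""
    hits = []
    for mol in root.iter("MOL"):
        for xvar in mol.findall("XVAR[@BUILTIN='MOLWT']"):
            if xvar.get("TYPE") == "FLOAT" and float(xvar.text) < limit:
                hits.append(mol)
                break
    return hits

def non_journal_refs(root):
    """Count BIB elements that contain no XVAR with BUILTIN="JOUR" (search by context)."""
    return sum(1 for bib in root.iter("BIB")
               if bib.find(".//XVAR[@BUILTIN='JOUR']") is None)

# Invented sample document for the two searches.
doc = ET.fromstring("""
<XLIST>
  <MOL><XVAR TYPE="FLOAT" BUILTIN="MOLWT">180.16</XVAR></MOL>
  <MOL><XVAR TYPE="FLOAT" BUILTIN="MOLWT">1202.6</XVAR></MOL>
  <BIB><XVAR BUILTIN="JOUR">J. Chem. Soc., Chem. Commun.</XVAR></BIB>
  <BIB><XVAR BUILTIN="TITLE">A thesis</XVAR></BIB>
</XLIST>
""")
print(len(light_molecules(doc)))
print(non_journal_refs(doc))
```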
- Can I run CML from Netscape/MSIE, etc?
-
There are several possibilities.
Browsers recognise the MIME type of a document and can be configured to
launch an appropriate helper application. It's common for a browser
to launch RasMol when it gets a file of type "chemical/x-pdb", and it would
be simple to configure your browser to recognise "chemical/x-cml" and launch
JUMBO or some other tool.
It would be possible to write (or convert) a JUMBO plugin. I don't
intend to do this myself - offers?
Many browsers are java-enabled. This means that if the chemical/x-cml
file comes from a server which also has the JUMBO *.class files, you can
view the files automatically in your browser with no effort! I shall provide
an example of this and I hope to get collaborators who provide other
CML-viewable information. Because JUMBO can convert other file
types, it can be used to view a wide range of molecular data files. For
some of these - such as Quantum Chemical calculations - there are no current
viewers.
The WWW is moving towards the use of SGML and I expect browsers to become
more SGML/XML-aware. CML is ideally placed to take advantage of this.
- Does CML use standards?
-
Yes. CML uses standards wherever possible. It's based in SGML/XML and MIME.
Internally it uses ISO standards for dates and terminology. When standard
entity sets (e.g. for Unicode characters) are used, CML can take advantage
of this and will render them if the software has the appropriate glyphs.
CML can also contain information in other standards or near-standards such
as CGM, TeX, GIF, etc.
- Is CML a 'standard'?
-
CML is the first application of SGML to molecular science and it is not
possible to predict how it will develop. In some respects it is a
meta-language in that very varied applications can be constructed and it's
therefore not a prescriptive approach. If strictness is required, then
either CML can be used to develop a more hardcoded DTD, or a controlled
vocabulary must be used. It's very likely that hardcoded DTDs will use
bits of CML.
- What does CML NOT do?
-
It's not software, so it won't actively 'do' anything! The relevant question
is "what information can CML not carry?"
CML has no special support at present for:
- Chemical reactions. Until we get a feel for the complexity of this
they can be represented by MOLs and XVARs along with the RELATIONs to
describe what part of the reactants correspond to what part of the products.
For simple reactions this should be straightforward.
- Queries, substructures, Markush structures, libraries, etc. These all
require a grammar, and there is no currently accepted grammar for chemistry
(could there be?). CML probably provides enough components for developers
(e.g. atom types can be configured and molecules can contain other molecules).
Remember that these areas are probably not sufficiently well understood anyway.
- Heteromers. Glycoproteins, chemically modified proteins, modified
nucleic acids, carbohydrates, etc. are very difficult. Ultimately they
can (probably) be represented by a connection table but this loses the
advantage of polymeric representations. My first approach has been for a
molecule to contain references to its oligomeric components, along with
details of the covalent linkages and stereochemistry.
- Parsable mathematics. Systems like TeX can hold formatting instructions
and can therefore be rendered, but it isn't possible to evaluate an equation
in TeX. There are several DTD projects in the maths community, but none
appears to have emerged - any news would be welcome.
(Basic linear and polynomial representations, along with common functions
would be a useful start and would, for example, allow graphical curves to
be transmitted and rendered on graphs.)
In the last resort it may be possible to use parsable textual descriptions
or contained files. But CML will not be widely useful if it is simply
a container for a ragbag of legacy file types!
SGML is very good at providing the containment required in Object-Oriented
systems, but not well suited to inheritance. Thus special semantics will
have to be added for (say) taxonomy - an area I do not intend to venture
into! Multiple inheritance will be even worse than single inheritance!
- Are there any known bugs?
-
Strictly speaking, no! CML is a DTD of SGML and parses satisfactorily with
sgmls which is as bug-free as any software can be. So the
real question is "what has CML got wrong?"
CML has a clear philosophy of supporting very varied applications, so it
cannot provide precise structure in a document. For example you
might wish to insist
that every citation has an author - CML can't do this for you since it
provides for cases where documents don't have authors (quite common). This
check will have to be made elsewhere (e.g. in a special postprocessor - and
CML makes it very easy to write this.)
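A postprocessor of this kind can be very short. The Python sketch below reports citations that have no author. It assumes, purely for illustration, that an author is marked as an XVAR with BUILTIN="AUTHOR" inside a BIB ELEMENT; this FAQ does not say how authors are actually encoded, so the attribute value, the sample document and the function name are all hypothetical.

```python
import xml.etree.ElementTree as ET

def citations_without_author(root):
    """Return the BIB elements that have no author entry."""
    # Assumption: an author is marked as XVAR BUILTIN="AUTHOR"; the FAQ does not say.
    return [bib for bib in root.iter("BIB")
            if bib.find(".//XVAR[@BUILTIN='AUTHOR']") is None]

# Invented sample: the first citation has an author, the second does not.
refs = ET.fromstring("""
<XLIST>
  <BIB><XVAR BUILTIN="AUTHOR">A. N. Other</XVAR><XVAR BUILTIN="JOUR">J. Chem. Soc.</XVAR></BIB>
  <BIB><XVAR BUILTIN="JOUR">Anonymous technical report</XVAR></BIB>
</XLIST>
""")
missing = citations_without_author(refs)
print(len(missing))
```

A house rule such as "every citation has an author" then lives in the application that needs it, and the DTD stays permissive for everyone else.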
CML has an implied hierarchy of components (e.g. MOL contains ATOMS contains
ARRAY). This seems to work for many varied cases but it's almost certain
that there will be systems it can't tackle. In most cases it will be possible
to provide a representation, though it may not be pretty.
CML allows flexibility in the size and shape of components through pointers
(addresses). Thus complex information 'attached to an atom' can be
constructed elsewhere and located by a pointer on the atom. It is possible
that this is not powerful enough to hold some systems, but I haven't
met them yet.
The most likely problem is that the semantics of the ELEMENTs and their
ATTRIBUTEs is poorly described or that the examples are not consistent
in their usage. If so, please point this out and I'll try to correct it.
In particular, the areas of addressing, sub-addressing and RELATIONs are
not yet mature.
If you ask "are there any bugs in JUMBO?" the answer is "zillions".
- How is CML likely to develop?
-
CML is offered as a way forward in SGML and Object-Oriented systems for the
molecular community. My intention is to freeze the current version (1.0) for
a period. The
Open Molecule Foundation
is supporting CML and we are building a community of users. We are especially
interested in applications from people developing molecular software and
do not intend it to be seen as a competitor to existing systems. JUMBO
is a free enabling tool rather than a product.
The philosophy is that collaborators develop their systems to be compatible
with the CML architecture. Some may also use JUMBO (or its java classes)
as components of their system, but this is not required. They develop
their own applications by extending rather than modifying and
for this reason CML itself may not be edited nor be redistributed. In this
way we hope to avoid the army of mutant files that are so common in
molecular science.
Just as there is no single software package for SGML (or HTML) there should
not be a single one for CML. We expect both free and commercial products in
this area. Serious developers will get special advantages by being members
of the OMF in that they will have an input into the development of CML and
its priorities. They will also have advance knowledge of the likely
developments in the language. There are various levels of membership in the
OMF and it is possible to contribute either in cash or in kind (e.g.
by authoring extensions, testing, converting legacy systems, etc.).
- Are there restrictions on the use of CML?
-
There is no charge for CML, but it is NOT in the public domain. That means
that you may not alter documents in this distribution, nor distribute them
to third parties without permission. You may, of course, point them
to this page, and this should be used as the definitive reference.
JUMBO consists of a set of Java classes and these may be freely used
over the Internet. I intend that their distribution is managed by the
Open Molecule Foundation and the intention is that they will be free, but
not in the public domain. The classes may not be redistributed without
permission but the OMF is actively looking at ways of doing this which will
be beneficial to the community. If you wish to include the classes in a
product, please contact me.
If you wish to mount the system on your server, there will be a distribution
kit, which I hope will be free. The API for the classes will be published.
You may therefore extend the classes by standard mechanisms without needing
to have source code. This is one of the great benefits of Java and
means that the community can rely on a single, stable, core on which they
can build. If the extensions are widely valuable it may be possible to
incorporate them in future versions.
There will be a community of committed developers who will have access to the
source code. This is likely to be managed through the OMF.
©
Peter Murray-Rust, 1996, 1997