cml DTD Quick Reference


<a>
Anchor; source/destination of link (HTML 2.0)
<address>
Address, signature, or byline (HTML 2.0)
<admin>
Administrivia, of a more powerful kind than can be provided by HTML META ELEMENTs.

The form of this is flexible and will presumably be governed by the application or conventions. A number of standard terms are provided by ISO12620 (e.g. 'originationDate') and could be used. PERSON is provided as a useful object for authors, owners, etc.

Note that TecML/CML files may be archived as single entities and it may then be extremely valuable to include ADMIN information.

Content Model

An optional/repeatable list of XVAR, XHTML, XLIST and PERSON which should allow any complexity of content.

<array>

A homogeneous (1- or 2-dimensional) array of variables. The values are given as a white-space-separated string, with quotes around elements containing whitespace. The dimension of the matrix is determined as follows:
If no attributes are given: 1-dimensional.
If SIZE (but not ROWS/COLUMNS) is given: 1-dimensional.
If ROWS and COLUMNS are given: 2-dimensional. SIZE is ignored.
If STRUCT represents a square or triangular matrix (SQUARE, ANTISYMMETRIC, LOWERTRIANGLE, ORTHOGONAL, SYMMETRIC, UNITARY, UPPERTRIANGLE) only ONE of ROWS or COLUMNS need be given. SIZE is ignored.
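The rules above can be sketched as a small Python function. This is only an illustration: the name array_shape and the dictionary representation of the attributes are assumptions and not part of the DTD. Cases the text does not cover (e.g. ROWS alone without STRUCT) fall through to 1-dimensional here, which is also an assumption.

```python
# Hypothetical helper: attribute names follow the text, everything else is assumed.
SQUARE_LIKE = {
    "SQUARE", "ANTISYMMETRIC", "LOWERTRIANGLE", "ORTHOGONAL",
    "SYMMETRIC", "UNITARY", "UPPERTRIANGLE",
}


def array_shape(attrs):
    """Return (rows, columns) for a 2-D ARRAY or (size,) for a 1-D ARRAY.

    `attrs` maps upper-case attribute names to their string values.
    A size of None means the length has to be counted from the content.
    """
    rows = attrs.get("ROWS")
    cols = attrs.get("COLUMNS")
    struct = (attrs.get("STRUCT") or "").upper()

    # Square or triangular matrices need only ONE of ROWS/COLUMNS.
    if struct in SQUARE_LIKE and (rows or cols):
        n = int(rows or cols)
        return (n, n)
    # ROWS and COLUMNS together give a 2-D array; SIZE is ignored.
    if rows and cols:
        return (int(rows), int(cols))
    # Otherwise the array is 1-dimensional, with or without SIZE.
    size = attrs.get("SIZE")
    return (int(size),) if size else (None,)
```

For example, array_shape({"STRUCT": "SYMMETRIC", "ROWS": "4"}) gives (4, 4), while array_shape({"SIZE": "5"}) gives (5,).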

An array of anything more complicated (links (A), ARRAY, etc) requires the use of XLIST rather than ARRAY.

Some arrays will be sparse or have missing values. The word NULL (or the quoted null string "") can be used to denote an element for which there is no information. Where many identical values are required, a premultiplier can be used, as in:
1.2 3.4 25*NULL 23*4.5 3*NULL 4.1
which would represent an array of length 54.
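A postprocessor might expand such content along the following lines. This is a minimal Python sketch, not the reference implementation: the function name expand_array_content is hypothetical, values are kept as strings, and shlex is used for the quoting rules on the assumption that shell-style quoting is close enough (an element containing an apostrophe or a literal '*' would need more care).

```python
import shlex


def expand_array_content(content):
    """Expand ARRAY content into a list of values, using None for NULL."""
    values = []
    # shlex.split honours quotes, so "two words" stays one element
    # and the quoted null string "" becomes an empty token.
    for token in shlex.split(content):
        count, sep, rest = token.partition("*")
        if sep and count.isdigit():
            # Premultiplier form such as 25*NULL or 23*4.5.
            repeat, item = int(count), rest
        else:
            repeat, item = 1, token
        value = None if item in ("NULL", "") else item
        values.extend([value] * repeat)
    return values
```

Applied to the example string above, the function returns a list of 54 entries, 28 of which are None.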

Sometimes (as for a controlling variable or an axis on a graph) an array can be generated from a linear expression and the values in the content can be omitted. In this case TYPE must be INTEGER or FLOAT, START and DELTA must both be given and be of this type and SIZE must also be given. An example:
<ARRAY TYPE="FLOAT" START="3.1" DELTA="0.3" SIZE="5"></>
is equivalent to:
<ARRAY>3.1 3.4 3.7 4.0 4.3</ARRAY> .
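The generation of values from START, DELTA and SIZE could be sketched in Python as below. The function name generate_linear_array is an assumption, and decimal arithmetic is chosen here only so that the generated strings match the written form (binary floating point would give values such as 3.7000000000000002).

```python
from decimal import Decimal


def generate_linear_array(start, delta, size):
    """Return the `size` values start, start+delta, ... as strings."""
    first, step = Decimal(start), Decimal(delta)
    return [str(first + i * step) for i in range(int(size))]
```

With the attribute values of the example, generate_linear_array("3.1", "0.3", "5") yields the five strings 3.1, 3.4, 3.7, 4.0 and 4.3.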

Note that ARRAY and XVAR share the same TYPE, FUZZY, DICTNAME, MIME, UNITS and LANG attributes and their associated values. They will differ in the values of BUILTIN.

<atoms>

ATOMS is used for describing either single atoms or, more commonly, a list of atoms in a 'molecule' or 'compound' when it is contained within MOL.

ATOMS is the heart of MOL.DTD and (in MOL) represents an atom-centred description with optional bonds (BONDS). (This is perhaps driven by my background as an inorganic crystallographer, where bonds are a personal 'opinion'!) Elemental identity, atomic positions and spacegroup are often necessary and sufficient to describe what the substance is. Many theoretical chemists would agree that, with the addition of the total electron count, everything else is opinion.

Like the rest of CML, ATOMS does not dictate how a molecule is described, and an author can create whatever atomic (or bond) properties they wish. There is a set of BUILTINs to cover most common cases, but others can be added by using REL="glossary" and HREF= in the ATOMS attributes. The incentive to use BUILTINs is that they will be recognised by the postprocessing software, whilst for a glossary item the author will have to write code themselves.

ATOMS may have CONVENTION and DICTNAME attributes and these can be used to resolve problems of convention (e.g. 'charge', 'valence'). It is assumed that the contained ARRAYs use the same convention unless overridden.

ATOMS and BONDS can be used to give the molecular formula (connectivity) by the use of attributes such as formal ligand count, number of attached hydrogen atoms, formal charge, etc. Where possible, however, we recommend that FORMULA is used since standardisation is likely to be clearer in that format. The current conventions (SMILES and MOL) could be expanded to include others.

ATOMS/BONDS may be difficult to relate to FORMULA. Where ATOMS represents coordinate data, this might relate to multiple copies of a molecule (as in crystallography where an asymmetric unit can contain several identical molecules and all the coordinates must be included so that the crystal structure can be recreated.) A related problem is where some of the atomic coordinates are not determined, a frequent occurrence in some techniques. Hydrogen atoms represent a particular problem - CML does not lay down rules as to how these are used.

ATOMS and BONDS are linked by the ATID attribute of XVAR within ATOMS, and the ATID1 or ATID2 of BONDS. This need not be an integer, and could be a construct such as CA15. If the tables are edited or modified it will be important to make sure that consistency is maintained and that ATIDs are always unique.
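A consistency check of this kind might look like the following Python sketch. It assumes the ATIDs and the ATID1/ATID2 pairs have already been extracted into plain lists; the function name check_atids and this data layout are hypothetical and not defined by CML.

```python
from collections import Counter


def check_atids(atom_ids, bond_ends):
    """Return a list of human-readable problems (empty if consistent).

    atom_ids  -- ATID values taken from ATOMS, e.g. ["CA15", "N1"]
    bond_ends -- (ATID1, ATID2) pairs taken from BONDS
    """
    problems = []
    counts = Counter(atom_ids)
    # Every ATID must occur exactly once in ATOMS.
    for atid, n in counts.items():
        if n > 1:
            problems.append(f"duplicate ATID {atid!r} ({n} occurrences)")
    # Every bond end must point at an ATID that exists in ATOMS.
    known = set(counts)
    for atid1, atid2 in bond_ends:
        for end in (atid1, atid2):
            if end not in known:
                problems.append(
                    f"bond {atid1}-{atid2} refers to unknown ATID {end!r}"
                )
    return problems
```

Running such a check after every edit of the atom or bond tables is one way of catching dangling references early.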

The content model is simple: an optional description (XHTML), followed by a number of (column) arrays all of length equivalent to the number of atoms. Each ARRAY corresponds to an atomic attribute. The semantics of the attribute is given by one of two mechanisms:

The actual enumeration of the attributes is given in a file ../mol-arr-bui.ent and this is definitive, rather than what is written below (although hopefully they are in sync!). In many cases it is difficult to decide whether something is a number or ID or a type. The file contains:

Note that an XVAR can contain pointers to other objects, so that if you need (say) to have multipoles attached to atoms, they can be set up elsewhere (for example in an XLIST) and XVAR TYPE="ADDRESS" can be used to point to them.

Content Model

The generic content model. This is interpreted as:

Note that any ARRAY can have a CONVENTION attribute, so that different ways of holding information can be identified.

<b>
Bold text (HTML 2.0)
<base>
The address of the document, so that relative URLs within the document will be added to the BASE URL. (HTML 2.0)
<bib>

BIB (simple tool for 'most' bibliographic requirements) Compiled from other bibliographic standards. Deliberately kept simple so as to be readable (I couldn't understand the other ones :-). Because there is no structure, the renderer and authoring tools have to have some semantics.

The components of the citation are all XVARS, using the BUILTIN attribute where possible. The builtins are listed in tecml-var-bui.ent from which you may choose. Some useful ones are

Content Model

The content is an optional description (HTML), then optional/repeatable a list of authors (PERSON), XVAR (for all the components of the bibliography) and a list of addresses (XADDR). The addresses should correspond to the citation/organisation since PERSON has its own provision for addresses.

<blockquote>
Quoted passage (HTML 2.0)
<body>
BODY is not used in CML although most of its contents are. Note that, unlike most browser implementations, HTML 2.0 strictly requires that BODY contents start with a tag (e.g. P, HR, Hn, etc.) and NOT with free text. (HTML 2.0)
<bonds>

BONDS contains an arbitrary number of arrays (ARRAY) for carrying bond information. The following BUILTINs are provided.

Normally BONDS corresponds to the contents of an accompanying ATOMS - i.e. all ATIDs or other addresses in BONDS point to ATOMS. However, a MOL can contain an additional BONDS which describes links to other MOLs (e.g. for building macromolecules, combinatorial libraries, etc.). These are contained within a BONDS with the LINK attribute set to 1 (or YES).

Note that it is perfectly valid to have ATOMS without any BONDS (after all, bonds are merely 'opinions').

BONDS may have CONVENTION and DICTNAME attributes and these can be used to resolve problems of convention (e.g. 'order', 'stereochemistry'). It is assumed that the contained ARRAYs use the same convention unless overridden.

Content Model

Note that BONDS can be used for isolated bonds.

<br>
Line break (HTML 2.0)
<cite>
Name or title of cited work (HTML 2.0)
<cml>
CML has a simple content allowing a very flexible approach to the construction of CML files. As CML is a superset of the three DTDs it can be used for any application restricted to two or fewer of them (note that TecML often uses HTML and MOL often uses TecML).

CML can contain itself - this is not encouraged, but may be required if a large number of component documents are being collected together. More commonly it contains some or all of the following (in any order, and any number of occurrences):

Though there is no need to do so, it can be useful to collect all the components of the same type together, and to create a standard order for them.

Note that all these components can be referenced from the XHTML hypertext through the use of the HREF mechanism.

<code>
Source code phrase (HTML 2.0)
<cryst>

Crystallographic data. This is mainly for the unitcell, spacegroup, and crystallographic experimental data (e.g. wavelengths, etc.)

Content Model

<dd>
Definition of term (HTML 2.0)
<dir>
Not used in CML (HTML 2.0)
<dl>
Definition list, or glossary (HTML 2.0)
<dt>
Term in definition list (HTML 2.0)
<em>
Emphasized phrase (HTML 2.0)
<feature>

This is being worked out. It can represent the SW-PROT FEATURES, and is being linked to the sequence. It is also capable of representing SITE, HELIX, etc from PDB.

The SWISS-PROT description is given in the content.

We suggest using the BUILTIN=KEYWORD option for XVAR where possible (or DICTNAME) to hold content like "ACTIVE SITE".

Content Model

FEATURE uses the DICTNAME and CONVENTION attributes and so can resolve ambiguities of terminology.

<figure>

A figure. At present the figure has no internal semantic content, but can carry textual description and other attributes.

How to transport the figure is not yet solved. I have provided for two possibilities:

The content is therefore an optional description (caption) (HTML) and an optional (encoded) file.

<formula>

Chemical formula. The primary purpose of this is to say what the molecule is, not to represent ideas about it. No present method covers all molecules, and for many we have only partial info (e.g. stoichiometry). FORMULA allows for one connexion table in the content - but more than one FORMULA is allowed within MOL to cover multiple components (especially in crystallographic files).

The primary use for the generic content (ARRAY/XLIST/ARRAY) is connexion tables. The connexion tables can be textual (e.g. SMILES) or the components of an atom-bond based table, following the same convention as in ATOMS and BONDS. In FORMULA both atom and bond arrays can be used, which will normally be of different sizes. XVAR can also be used for reference numbers, etc. (MEDLINE, SWISSPROT, Cambridge, etc.).

FORMULA supports CONVENTION/DICTNAME so that differing conventions can be used.

The following BUILTINs are provided (See mol-var-bui.ent).

Content Model

<h1>
Heading, level 1 (HTML 2.0)
<h2>
Heading, level 2 (HTML 2.0)
<h3>
Heading, level 3 (HTML 2.0)
<h4>
Heading, level 4 (HTML 2.0)
<h5>
Heading, level 5 (HTML 2.0)
<h6>
Heading, level 6 (HTML 2.0)
<head>
Container for meta-information. All CML documents must have a HEAD, which must include a TITLE. All other components are optional though users are well advised to think of including them. (HTML 2.0)
<hr>
Horizontal rule (HTML 2.0)
<html>
Document type for HTML (top level container). In HTML documents there is a HEAD and BODY. These would confuse TecML authors and so the body content of HTML is used within the XHTML container (which also has additional attributes). (HTML 2.0)

From the HTML 2.0 DTD:


Document Type Definition for the HyperText Markup Language
(HTML DTD)

$Id: html.dtd,v 1.29 1995/08/04 17:50:22 connolly Exp $

Author: Daniel W. Connolly <connolly@w3.org>
See Also: html.decl, html-1.dtd
http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html

<!ENTITY % HTML.Version
"-//IETF//DTD HTML 2.0//EN"

-- Typical usage:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
...
</html>
--
>

<i>
Italic text (HTML 2.0)
<img>
Image; icon, glyph or illustration, which may also be a clickable map (ISMAP). (HTML 2.0)
<kbd>
Keyboard phrase, e.g. user input (HTML 2.0)
<li>
List item (HTML 2.0)
<link>
This is discussed in Murray Altheim's paper on the semantics of addressing. The descriptions of the attributes are rather short... (HTML 2.0)
<menu>
Not used in CML (HTML 2.0)
<meta>
This is for describing the contents, purpose, etc of the document. The WWW community has yet to produce clear standards for this and the most promising (1995) is the Dublin Core proposal of eleven categories of meta-information. Until this is developed, CML does not give guidance here although the use of META information is strongly recommended. (HTML 2.0)
<mol>

The content model of a MOL (molecule) allows for considerable flexibility in storage (see below).

Although many of the XVAR, XLIST, etc. could also be held in a TecML file without MOL.DTD, the containment within a molecule is very well suited to molecular databases (e.g. crystallography) where all data is "attached" to a molecule.

NOTE: The use of the term 'molecule' is not meant to imply anything about the bonding model or physical nature of the thing in question. MOL can be used to hold data on extended solids (such as NaCl) or van der Waals complexes. The bonding model is kept simple to emphasise that for many molecules there need to be additional semantics to specify it adequately. The simple model may be refined over time.

The primary use of MOL is to provide at least one way of accurately conveying the precise nature and identity of the substance. This may not always be the best or most efficient or the one that you are used to.

The present constraints of MOL are:

Content Model

Among the molecular properties and data MOL can handle (in any order and repeatable although this is not always meaningful);

The more examples I have explored, the fewer constraints can be put on what MOL can contain. The

<ol>
Ordered, or numbered list (HTML 2.0)
<p>
Paragraph. Note that in strict HTML this is a container (<P> ... </P>), so every paragraph should start with <P>; it is not a separator. (HTML 2.0)
<person>

A person. This occurs in BIB but could be used in many other places (e.g. name of experimenter).

Content Model

<reaction>
<relation>

Relationships between objects and 'hyperlinks' are a key part of CML/TecML but wherever possible they should be added to the document as a specific list rather than hardcoding in hyperlinks with the <A NAME= HREF=> technology. When an object has an explicit (or implicit) address, that can be used as part of RELATION content. The content allows for: 1:1 links; 1:n links; m:1 links and m:n links. (Example: <RELATION><XVAR>aspirin:C3</XVAR><ARRAY>peaks3:point-25 peaks3:point-29</ARRAY></RELATION> might link the C3 atom of aspirin to two spectral peaks.)
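To illustrate how a 1:n RELATION might be turned into explicit links, here is a hedged Python sketch. CML is an SGML application, so a real file would need an SGML parser; the standard xml.etree parser is used only because this particular fragment happens to be well-formed XML. The function name relation_links, and the reading of XVAR as one end of the link and ARRAY as the other ends, are assumptions.

```python
import xml.etree.ElementTree as ET


def relation_links(fragment):
    """Return (source, target) address pairs from one RELATION fragment."""
    root = ET.fromstring(fragment)
    # Assumed reading: each XVAR holds a single source address ...
    sources = [(x.text or "").strip() for x in root.findall("XVAR")]
    # ... and each ARRAY holds white-space-separated target addresses.
    targets = []
    for arr in root.findall("ARRAY"):
        targets.extend((arr.text or "").split())
    return [(s, t) for s in sources for t in targets]


EXAMPLE = (
    "<RELATION><XVAR>aspirin:C3</XVAR>"
    "<ARRAY>peaks3:point-25 peaks3:point-29</ARRAY></RELATION>"
)
```

For the aspirin fragment, relation_links(EXAMPLE) returns two pairs, one for each spectral peak linked to aspirin:C3.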

It is possible that RELATION can form the basis of a REACTION element, holding references to the reactants and products.

RELATION can hold the whole range of abstract data types (XVAR, XLIST and ARRAY) as additional content. The use of this is undefined, and probably domain-specific (e.g. it could hold: action for hyperlinks; rendering information; conditions of transformations such as chemical reactions, and so on).

The A->HREF mechanism also allows for relationships (1:1 and n:1). It is less powerful than RELATION since it is hardcoded and difficult to amend, cannot deal with n:m links, and cannot easily be qualified by constraints.

Content Model

This allows for:

I appear to have used (at least!) HEAD/TAIL, TO/FROM and END1/END2 as BUILTINs for XVARs. This seems rather OTT and will need to be reduced.

<samp>
Sample text or characters (HTML 2.0)
<sequence>

Biomolecular Sequence. This is intended to cover only those molecules where the chemical identity is an important aspect, and is not intended to intrude into genome structure, etc. It also covers only 'simple' types of sequence (PROTein, DNA, RNA, CARBohydrate). CML will not (at present) provide a comprehensive list of monomers and there is very limited support for covalently modified molecules, although this will be a major role for CML. (The MOL TYPE=FRAGMENT may be used to describe small molecules for attachment to proteins, although at present this can only be done if the atoms are explicit as in PDB.)

In general, therefore, SEQUENCE should only be used for 'normal' proteins, small stretches of DNA or RNA without 'unusual' components, and carbohydrates which can be represented by a simple linear text string. It is unsuitable for cyclic molecules, modified bases, unusual amino acids, branched saccharides, etc. The chain termination is also unlikely to be well defined (e.g. monophosphate?, acetylated N-terminus?). Covalent modifications may be described textually (e.g. 'glycosylated').

SEQUENCE supports CONVENTION/DICTNAME which should allow precise management of macromolecular data entries.

There is a BUILTIN=STRAND option for XVAR, which could be used as follows:

Content Model

<strong>
Strong emphasis (HTML 2.0)
<symbol>

Symbolic variable. Some numeric or string quantities may have to be represented by a symbolic variable rather than an explicit value. SYMBOL is an experimental approach towards this and the details have not yet been worked out.

Hypothetical... Suppose two objects are constrained to have numerical values which sum to a constant value (e.g. the occupancies of two or more atoms must sum to 1.0). We might write:
<SYMBOL NAME="C13occ">0.73</SYMBOL>
and later
<XVAR TYPE="symbol" NAME="pos1">C13occ</XVAR>
<XVAR TYPE="symbol" NAME="pos2">1-C13occ</XVAR>
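Since the details of SYMBOL have not been worked out, the following Python sketch is purely speculative. It shows one way a postprocessor might resolve the two XVARs above, assuming that symbolic content is a simple arithmetic expression over numbers and symbol names, and that symbol names are valid Python identifiers. The function name eval_symbolic is hypothetical.

```python
import ast
import operator

# Only the four basic arithmetic operators are supported in this sketch.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_symbolic(expr, symbols):
    """Evaluate +, -, *, / over numbers and named symbols."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return symbols[node.id]  # KeyError for an undefined symbol
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))
```

With symbols = {"C13occ": 0.73}, the content of pos1 evaluates to 0.73 and that of pos2 to approximately 0.27, so the two occupancies sum to 1.0 as required.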

Content Model

Yet to be worked out... At present any variable.

<symmetry>

Molecular (not crystallographic) symmetry. The author can specify a point group or a set of symmetry operations (this could be useful for a helical molecule or one in a non-standard orientation.)

The content is the symmetry operators as (4*3) matrices (ARRAY). These should have the form [R|t] where R premultiplies the coordinates and t is a column translation vector. It is up to the author whether they give a complete set of operators (e.g. 48 for Oh) or whether they give just the group generators. The identity matrix can be assumed to be present in all cases.
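Applying one such operator to a set of coordinates amounts to x' = Rx + t. A minimal Python sketch follows, assuming each operator is held as three rows of four numbers, each row being one row of R followed by the matching component of t. The text does not state the storage order, so this layout, like the name apply_operator, is an assumption.

```python
def apply_operator(op, xyz):
    """Apply a [R|t] operator to one coordinate triple: x' = R x + t.

    op  -- three rows of four numbers, each row being (R_i1, R_i2, R_i3, t_i)
    xyz -- the (x, y, z) coordinates of one atom
    """
    return tuple(
        row[0] * xyz[0] + row[1] * xyz[1] + row[2] * xyz[2] + row[3]
        for row in op
    )


# A twofold rotation about z with no translation.
C2_Z = [
    [-1, 0, 0, 0],
    [0, -1, 0, 0],
    [0, 0, 1, 0],
]
```

For instance, apply_operator(C2_Z, (1.0, 2.0, 3.0)) maps the point to (-1.0, -2.0, 3.0). If only the group generators are given, the full set of operators would have to be produced by repeated composition, which is not shown here.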

SYMMETRY supports CONVENTION/DICTNAME and can therefore distinguish between Schoenflies and H-M conventions.

The following BUILTINs for XVAR are provided:

Content Model

<tecml>

The toplevel container for TecML files. It consists of a HEAD and any of the TecML elements in any order. Rarely needed unless MOL is excluded from the DTD.

TecML has relatively few hardcoded ELEMENTS and gets its flexibility from a wide range of attributes that can be applied to 'meta'-elements such as XVAR, ARRAY and XLIST. These attributes can be extended through the DTD, but are provided through files (using the ENTITY mechanism). The definitive DTD, therefore, depends critically on the contents of these files (*.ent) stored in the same directory as the TecML DTD.

TecML is designed to be extended by adding discipline-specific DTDs. SGML does not have a simple mechanism for this and therefore the content and attributes of certain ELEMENTs are defined in *.ent files rather than hardcoded. Unless you understand the ENTITY mechanism in SGML very well, do not touch these files! These are the current files (Jan 1997):

Extension of TecML with other DTDs must be done carefully and requires a good knowledge of SGML. The current TecML DTD shows how the MOL DTD is included and the use of the catalog file. Before extending TecML you should consult PM-R to avoid namespace collisions, and also to agree the most robust method. In general you would expect to extend the content model of XLIST, and the BUILTIN attributes of XLIST, XVAR and ARRAY. It is conceivable that TYPE might be expanded (e.g. to include currency). You should always extend using additional files rather than editing the current ones.

The files that extend TecML to include the MOL DTD. (See the previous paragraphs as well).

Content

TecML only contains ELEMENTs from its own DTD, which can be in any order and any number after an optional HEAD.
<termentry>

Terminology is a key part of TecML and ISO12620 has been used to provide the terms for supporting it. A TERMENTRY consists of:

<title>
All CML documents must have a TITLE. This will normally be rendered as a textual description of the contents or purpose of the document. (HTML 2.0)
<tt>
Typewriter text (HTML 2.0)
<ul>
Unordered list (HTML 2.0)
<var>
Variable phrase or substitutable (HTML 2.0)
<xaddr>

The address of a person or organisation. It can contain electronic components such as E-Mail or URLs.

The XADDR can contain XVARs which give the components of the address. A number of these are BUILTINs and can be picked from the list in tecml-var.ent. These include:

Note that URLs and EMAILs are specific TYPEs for XVAR and this is the best way of using them (SGML forbids them also to be used as BUILTINs).

Content Model

An optional/repeatable list of XVARs carrying either BUILTIN information or specific to the application. Note that E-mail addresses and URLs are already catered for as TYPEs in XVARs so do not need BUILTIN.

<xhtml>

XHTML allows authors to add hypertext of the complexity of the current HTML language (at Sept. 1995 this is HTML 2.0). Authors are assumed to be familiar with HTML and the DTD will not be documented here. There are, however, a few important differences:

Content Model

(See above).
<xlist>

A generic container. It can be used to construct most of the common container classes (although these can only be validated at postprocessing time). The DTD imposes very few constraints on how XLIST can be used, but CONTENT can be set to show certain common methods. XLIST can contain any or all of the common generic data items (A, ARRAY, XVAR and XLIST itself). The commonest uses are:

Note that the counts (COLUMNS, ROWS, SIZE) are advisory and primarily used for checking. The postprocessor is assumed to be able to count.

Content Model

Because XLIST can be used in so many ways the possible content is flexible. The SGML keyword ANY allows any of the elements in the DTD including #PCDATA to be included in any order.

<xnotation>
<xvar>

The generic variable of TecML and the most common ELEMENT. It is used to create an indefinite variety of objects through the use of the TYPE, BUILTIN and DICTNAME attributes. When more than one XVAR of the same type is required ARRAY should be used, as it shares all the important attributes.

XVAR content is an ASCII string (PCDATA) which can include any printable characters. If '<' etc. are required they must be escaped with & or enclosed within a marked section. The content may contain whitespace which is significant except for leading and trailing whitespace which is ignored. XVAR may include newlines if these are specified as '\n', but normal record ends are translated to a single space.
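The whitespace rules above might be applied by a postprocessor roughly as follows. This Python sketch assumes the parser has already resolved entity references and marked sections. The function name normalise_xvar_content and the order of the three steps are assumptions.

```python
import re


def normalise_xvar_content(raw):
    """Apply the whitespace rules described for XVAR content."""
    # Record ends (CR, LF or CRLF) become a single space.
    text = re.sub(r"\r\n|\r|\n", " ", raw)
    # Leading and trailing whitespace is ignored.
    text = text.strip()
    # A literal backslash-n written by the author stands for a newline.
    return text.replace("\\n", "\n")
```

Stripping before the '\n' substitution means that a newline the author asked for explicitly at the very start or end of the content is preserved; the text does not say which behaviour is intended.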

A variety of TYPEs are allowed (the default is STRING).

The type (TYPE) and UNITS may be specified. May be extended to simple geometrical objects (e.g. point, circle, etc).

Note that ARRAY and XVAR share the same TYPE, FUZZY, DICTNAME, MIME, UNITS and LANG attributes and their associated values. They will differ in the values of BUILTIN.

