The form of this is flexible and will presumably be governed by the application or conventions. A number of standard terms are provided by ISO12620 (e.g. 'originationDate') and could be used. PERSON is provided as a useful object for authors, owners, etc.
Note that TecML/CML files may be archived as single entities and it may then be extremely valuable to include ADMIN information.
An optional/repeatable list of XVAR, XHTML, XLIST and PERSON which should allow any complexity of content.
A homogeneous (1- or 2-dimensional) array of variables. The values are
a white-space-separated string, with quotes around elements containing
whitespace. The dimension of the matrix is determined as follows:
If no attributes are given: 1-dimensional.
If SIZE (but not ROWS/COLUMNS) is given: 1-dimensional.
If ROWS and COLUMNS are given: 2-dimensional. SIZE is ignored.
If STRUCT represents a square or triangular matrix (SQUARE, ANTISYMMETRIC, LOWERTRIANGLE, ORTHOGONAL, SYMMETRIC, UNITARY, UPPERTRIANGLE) only ONE of ROWS or COLUMNS need be given. SIZE is ignored.
An array of anything more complicated (links (A), ARRAY, etc) requires the use of XLIST rather than ARRAY.
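The dimension rules above can be sketched as a small function. This is only an illustration: the `attrs` dictionary and the function name are hypothetical, and the cases the text leaves open (for example a square STRUCT with neither ROWS nor COLUMNS) are assumed here to fall back to 1-dimensional.

```python
# STRUCT values that the text lists as square or triangular matrices.
SQUARE_LIKE = {
    "SQUARE", "ANTISYMMETRIC", "LOWERTRIANGLE", "ORTHOGONAL",
    "SYMMETRIC", "UNITARY", "UPPERTRIANGLE",
}


def array_dimension(attrs: dict) -> int:
    """Return 1 or 2 for an ARRAY, following the rules listed above.

    `attrs` is a hypothetical mapping of attribute names to string values.
    """
    has_rows = "ROWS" in attrs
    has_cols = "COLUMNS" in attrs
    struct = attrs.get("STRUCT", "").upper()

    # Square or triangular STRUCT: one of ROWS/COLUMNS is enough.
    if struct in SQUARE_LIKE and (has_rows or has_cols):
        return 2
    # Both ROWS and COLUMNS given: 2-dimensional, SIZE is ignored.
    if has_rows and has_cols:
        return 2
    # No attributes, or SIZE only (assumption: anything else is also 1-D).
    return 1
```

For instance, `array_dimension({"STRUCT": "SYMMETRIC", "ROWS": "3"})` is treated as 2-dimensional, while `array_dimension({"SIZE": "5"})` is 1-dimensional.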
Some arrays will be sparse or have missing values. The word NULL (or the
quoted null string "") can be used
to denote an element for which there is no information. Where many identical
values are required, a premultiplier can be used, as in:
1.2 3.4 25*NULL 23*4.5 3*NULL 4.1
which would represent an array of length 54.
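A postprocessor might expand premultipliers and NULLs along the following lines. This is a minimal sketch, not part of TecML: the function name is hypothetical, values are left as strings, and the standard `shlex` module is assumed to be an adequate stand-in for the quoting rules (it also gives special meaning to single quotes and backslashes, and a quoted element that itself looks like `3*x` would be expanded).

```python
import shlex


def expand_array(content: str) -> list:
    """Expand 'n*value' premultipliers; NULL and "" become None."""
    values = []
    for token in shlex.split(content):
        count, sep, rest = token.partition("*")
        if sep and count.isdigit():
            n, item = int(count), rest   # e.g. '25*NULL' -> 25 copies of NULL
        else:
            n, item = 1, token           # ordinary single element
        values.extend([None if item in ("NULL", "") else item] * n)
    return values


values = expand_array("1.2 3.4 25*NULL 23*4.5 3*NULL 4.1")
print(len(values))  # 54, matching the length quoted above
```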
Sometimes (as for a controlling variable or an axis on a graph) an array
can be generated from a linear expression
and the values in the content can be omitted. In this case
TYPE must be INTEGER or FLOAT, START and DELTA must both be given and be
of this type and SIZE must also be given. An example:
<ARRAY START="3.1" DELTA="0.3" SIZE="5"></>
is equivalent to:
<ARRAY>3.1 3.4 3.7 4.0 4.3</ARRAY>
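The generation of values from START, DELTA and SIZE can be sketched as below. The function name is hypothetical. Each value is computed from its index rather than by repeated addition, which is a design choice made here to limit accumulated floating-point error; the results are still subject to ordinary rounding.

```python
def linear_array(start: float, delta: float, size: int) -> list:
    """Return the values start, start + delta, ... (size elements)."""
    # Compute each element from its index so that rounding error
    # does not build up along the array.
    return [start + i * delta for i in range(size)]


# Corresponds to <ARRAY START="3.1" DELTA="0.3" SIZE="5">.
values = linear_array(3.1, 0.3, 5)
print([round(v, 6) for v in values])
```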
Note that ARRAY and XVAR share the same TYPE, FUZZY, DICTNAME, MIME, UNITS and LANG attributes and their associated values. They will differ in the values of BUILTIN.
ATOMS is used for describing either single atoms or, more commonly, a list of atoms in a 'molecule' or 'compound' when it is contained within MOL.
ATOMS is the heart of MOL.DTD and (in MOL) represents an atom-centred description with optional bonds (BONDS). (This is perhaps driven by my background as an inorganic crystallographer, where bonds are a personal 'opinion'!) Elemental identity, atomic positions and spacegroup are often necessary and sufficient to describe what the substance is. Many theoretical chemists would agree that, with the addition of the total electron count, everything else is opinion.
Like the rest of CML, ATOMS does not dictate how a molecule is described,
and an author can create whatever atomic (or bond) properties they wish.
There is a set of BUILTINs to cover most common cases, but others can be
added by using REL="glossary" and HREF=
ATOMS may have CONVENTION and DICTNAME attributes and these can be used to resolve problems of convention (e.g. 'charge', 'valence'). It is assumed that the contained ARRAYs use the same convention unless overridden.
ATOMS and BONDS can be used to give the molecular formula (connectivity) by the use of attributes such as formal ligand count, number of attached hydrogen atoms, formal charge, etc. Where possible, however, we recommend that FORMULA is used since standardisation is likely to be clearer in that format. The current conventions (SMILES and MOL) could be expanded to include others.
ATOMS/BONDS may be difficult to relate to FORMULA. Where ATOMS represents coordinate data, this might relate to multiple copies of a molecule (as in crystallography where an asymmetric unit can contain several identical molecules and all the coordinates must be included so that the crystal structure can be recreated.) A related problem is where some of the atomic coordinates are not determined, a frequent occurrence in some techniques. Hydrogen atoms represent a particular problem - CML does not lay down rules as to how these are used.
ATOMS and BONDS are linked by the ATID attribute of XVAR within ATOMS, and the ATID1 or ATID2 of BONDS. This need not be an integer, and could be a construct such as CA15. If the tables are edited or modified it will be important to make sure that consistency is maintained and that ATIDs are always unique.
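A consistency check of this kind might look like the sketch below. It assumes the ATIDs and the ATID1/ATID2 pairs have already been extracted from the document into plain Python lists; the function name and the message wording are hypothetical.

```python
def check_atids(atom_ids, bonds):
    """Report duplicate ATIDs and bond ends that point at no atom.

    atom_ids: list of ATID strings, e.g. ["CA15", "N1"]
    bonds:    list of (ATID1, ATID2) pairs
    """
    problems = []
    seen = set()
    for atid in atom_ids:
        if atid in seen:
            problems.append(f"duplicate ATID: {atid}")
        seen.add(atid)
    for atid1, atid2 in bonds:
        for end in (atid1, atid2):
            if end not in seen:
                problems.append(f"bond refers to unknown ATID: {end}")
    return problems


print(check_atids(["CA15", "N1", "N1"], [("CA15", "N1"), ("N1", "O9")]))
```

An empty list means that every ATID is unique and every bond end resolves to an atom.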
The content model is simple: an optional description (XHTML), followed by a number of (column) arrays all of length equivalent to the number of atoms. Each ARRAY corresponds to an atomic attribute. The semantics of the attribute is given by one of two mechanisms:
The actual enumeration of the attributes is given in a file
and this is definitive, rather than what is written below (although hopefully
they are in sync!). In many cases it is difficult to decide whether something
is a number or ID or a type. The file contains:
An atom may be given a serial number which must be a positive unique integer, but the atoms need not be ordered. If ATOMNO is NOT given, the atoms are assumed to be numbered from 1...NATOMS in their occurrence in the ARRAY container. This is potentially fragile, however, and it's best to include explicit ATOMNOs.
It is often conventional to split the ligands into hydrogen atoms and others because many chemical structure diagrams and many connection tables are hydrogen-suppressed. Note that bridging hydrogens (as in electron-deficient compounds) and isotopically substituted hydrogen atoms may need explicit inclusion here.
The chiral volume of a tetrahedron with 4 vertices at X1, X2, X3, X4 is given by the determinant:
$$
V = \frac{1}{6}
\begin{vmatrix}
1 & 1 & 1 & 1 \\
x_1 & x_2 & x_3 & x_4 \\
y_1 & y_2 & y_3 & y_4 \\
z_1 & z_2 & z_3 & z_4
\end{vmatrix}
$$
The four atoms representing the corners of the tetrahedron (PID1-PID4) must be specified. For atoms without described parity, these fields should be NULL.
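The chiral volume can be computed directly from the four positions. The sketch below uses the fact that subtracting the first column of the determinant from the other three reduces it to a 3x3 determinant of difference vectors, i.e. a scalar triple product. The function name is hypothetical and the points are assumed to be simple (x, y, z) sequences.

```python
def chiral_volume(p1, p2, p3, p4):
    """Signed volume: the 4x4 determinant above divided by 6."""
    # Difference vectors from the first vertex.
    a = [p2[i] - p1[i] for i in range(3)]
    b = [p3[i] - p1[i] for i in range(3)]
    c = [p4[i] - p1[i] for i in range(3)]
    # Scalar triple product a . (b x c), equal to the 4x4 determinant.
    triple = (
        a[0] * (b[1] * c[2] - b[2] * c[1])
        - a[1] * (b[0] * c[2] - b[2] * c[0])
        + a[2] * (b[0] * c[1] - b[1] * c[0])
    )
    return triple / 6.0


v = chiral_volume((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
print(v)  # one sixth of the unit cube; swapping two vertices flips the sign
```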
Note that an XVAR can contain pointers to other objects, so that if you need (say) to have multipoles attached to atoms, they can be set up elsewhere (for example in an XLIST) and XVAR TYPE="ADDRESS" can be used to point to them.
Note that any ARRAY can have a CONVENTION attribute, so that different ways of holding information can be identified.
BIB (simple tool for 'most' bibliographic requirements) Compiled from other bibliographic standards. Deliberately kept simple so as to be readable (I couldn't understand the other ones :-). Because there is no structure, the renderer and authoring tools have to have some semantics.
The components of the citation are all XVARS, using the BUILTIN attribute
where possible. The builtins are listed in
tecml-var-bui.ent from which you may choose. Some useful ones are
The content is an optional description (HTML), followed by an optional/repeatable list of authors (PERSON), XVARs (for all the components of the bibliography) and a list of addresses (XADDR). The addresses should correspond to the citation/organisation since PERSON has its own provision for addresses.
BONDS contains an arbitrary number of arrays (ARRAY) for carrying bond information. The following BUILTINs are provided.
Normally BONDS corresponds to the contents of an accompanying ATOMS - i.e. all ATIDs or other addresses in BONDS point to ATOMS. However, a MOL can contain an additional BONDS which describes links to other MOLs (e.g. for building macromolecules, combinatorial libraries, etc.). These are contained within a BONDS with the LINK attribute set to 1 (or YES).
Note that it is perfectly valid to have ATOMS elements without any BONDS elements (after all, bonds are merely 'opinions').
BONDS may have CONVENTION and DICTNAME attributes and these can be used to resolve problems of convention (e.g. 'order', 'stereochemistry'). It is assumed that the contained ARRAYs use the same convention unless overridden.
Note that BONDS can be used for isolated bonds.
CML can contain itself - this is not encouraged, but may be required if a large number of component documents are being collected together. More commonly it contains some or all of the following (in any order, and any number of occurrences):
Note that all these components can be referenced from the XHTML hypertext through the use of the HREF mechanism.
Crystallographic data. This is mainly for the unitcell, spacegroup, and crystallographic experimental data (e.g. wavelengths, etc.)
Cell lengths a, b, c in Angstroms. (optional) and
Cell angles alpha, beta, gamma in Degrees. (optional).
<XVAR BUILTIN=ACELL UNITS=Angstrom>10.23</XVAR>
<XVAR BUILTIN=BCELL >20.34</XVAR>
<XVAR BUILTIN=CCELL >23.34</XVAR>
<XVAR BUILTIN=ALPHA >90.0</XVAR>
<XVAR BUILTIN=BETA >98.23</XVAR>
<XVAR BUILTIN=GAMMA >90.0</XVAR>
<XVAR BUILTIN=CRYSTSYS> Monoclinic </XVAR>
Principal/unique axis of spacegroup: X, Y or Z
Number of molecules per unit cell.
This is being worked out. It can represent the SW-PROT FEATURES, and is being linked to the sequence. It is also capable of representing SITE, HELIX, etc from PDB.
The SWISS-PROT description is given in the content.
We suggest using the BUILTIN=KEYWORD option for XVAR where possible (or DICTNAME) to hold content like "ACTIVE SITE".
FEATURE uses the DICTNAME and CONVENTION attributes and so can resolve ambiguities of terminology.
A figure. At present the figure has no internal semantic content, but can carry textual description and other attributes.
How to transport the figure is not yet solved. I have provided for two possibilities:
The content is therefore an optional description (caption) (HTML) and an optional (encoded) file.
Chemical formula. The primary purpose of this is to say what the molecule is, not to represent ideas about it. No present method covers all molecules, and for many we have only partial info (e.g. stoichiometry). FORMULA allows for one connexion table in the content - but more than one FORMULA is allowed within MOL to cover multiple components (especially in crystallographic files).
The primary use for the generic content (ARRAY/XLIST/ARRAY) is connexion tables. The connexion tables can be textual (e.g. SMILES) or the components of an atom-bond based table, following the same convention as in ATOMS and BONDS. In FORMULA both atom and bond arrays can be used, which will normally be of different sizes. XVAR can also be used for reference numbers, etc. (MEDLINE, SWISSPROT, Cambridge, etc.).
FORMULA supports CONVENTION/DICTNAME so that differing conventions can be used.
The following BUILTINs are provided (See mol-var-bui.ent).
From the HTML 2.0 DTD:
Document Type Definition for the HyperText Markup Language (HTML DTD)
$Id: html.dtd,v 1.29 1995/08/04 17:50:22 connolly Exp $
Author: Daniel W. Connolly <firstname.lastname@example.org>
See Also: html.decl, html-1.dtd
http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html
<!ENTITY % HTML.Version "-//IETF//DTD HTML 2.0//EN"
-- Typical usage:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
...
</html>
-- >
The content model of a MOL (molecule) allows for considerable flexibility in storage (see below).
Although many of the XVAR, XLIST, etc. could also be held in a TecML file without MOL.DTD, the containment within a molecule is very well suited to molecular databases (e.g. crystallography) where all data is "attached" to a molecule.
NOTE: The use of the term 'molecule' is not meant to imply anything about the bonding model or physical nature of the thing in question. MOL can be used to hold data on extended solids (such as NaCl) or van der Waals complexes. The bonding model is kept simple to emphasise that for many molecules there need to be additional semantics to specify it adequately. The simple model may be refined over time.
The primary use of MOL is to provide at least one way of accurately conveying the precise nature and identity of the substance. This may not always be the best or most efficient or the one that you are used to.
The present constraints of MOL are:
Among the molecular properties and data MOL can handle (in any order and repeatable, although this is not always meaningful):
The more examples I have explored, the fewer constraints can be put on what MOL can contain. The
A person. This occurs in BIB but could be used in many other places (e.g. name of experimenter).
Relationships between objects and 'hyperlinks' are a key part of CML/TecML but wherever possible they should be added to the document as a specific list rather than hardcoding in hyperlinks with the <A NAME= HREF=> technology. When an object has an explicit (or implicit) address, that can be used as part of RELATION content. The content allows for: 1:1 links; 1:n links; m:1 links and m:n links. (Example: <RELATION><XVAR>aspirin:C3</XVAR><ARRAY>peaks3:point-25 peaks3:point-29</ARRAY></RELATION> might link the C3 atom of aspirin to two spectral peaks.)
It is possible that RELATION can form the basis of a REACTION element, holding references to the reactants and products.
RELATION can hold the whole range of abstract data types (XVAR, XLIST and ARRAY) as additional content. The use of this is undefined, and probably domain-specific (e.g. it could hold: action for hyperlinks; rendering information; conditions of transformations such as chemical reactions, and so on).
The A->HREF mechanism also allows for relationships (1:1 and n:1). It's less powerful than RELATION since it is hardcoded and difficult to amend, is unable to deal with n:m links, and cannot easily be qualified by constraints.
This allows for:
Biomolecular Sequence. This is intended to cover only those molecules where the chemical identity is an important aspect, and is not intended to intrude into genome structure, etc. It also covers only 'simple' types of sequence (PROTein, DNA, RNA, CARBohydrate). CML will not (at present) provide a comprehensive list of monomers and there is very limited support for covalently modified molecules, although this will be a major role for CML. (The MOL TYPE=FRAGMENT may be used to describe small molecules for attachment to proteins, although at present this can only be done if the atoms are explicit as in PDB.)
In general, therefore, SEQUENCE should only be used for 'normal' proteins, small stretches of DNA or RNA without 'unusual' components, and carbohydrates which can be represented by a simple linear text string. It is unsuitable for cyclic molecules, modified bases, unusual amino acids, branched saccharides, etc. The chain termination is also unlikely to be well defined (e.g. monophosphate?, acetylated N-terminus?). Covalent modifications may be described textually (e.g. 'glycosylated').
SEQUENCE supports CONVENTION/DICTNAME which should allow precise management of macromolecular data entries.
There is a BUILTIN=STRAND option for XVAR, which could be used as follows:
Symbolic variable. Some numeric or string quantities may have to be represented by a symbolic variable rather than an explicit value. SYMBOL is an experimental approach towards this and the details have not yet been worked out.
Hypothetical... Suppose two objects are constrained to have numerical values
which sum to a constant value (e.g. the occupancies of two or more atoms must
sum to 1.0). We might write:
<XVAR TYPE="symbol" NAME="pos1">C13occ</XVAR>
<XVAR TYPE="symbol" NAME="pos2">1-C13occ</XVAR>
Yet to be worked out... At present any variable.
Molecular (not crystallographic) symmetry. The author can specify a point group or a set of symmetry operations (this could be useful for a helical molecule or one in a non-standard orientation.)
The content is the symmetry operators as (4*3) matrices (ARRAY). These should have the form [R|t] where R premultiplies the coordinates and t is a column translation vector. It is up to the author whether they give a complete set of operators (e.g. 48 for Oh) or whether they give just the group generators. The identity matrix can be assumed to be present in all cases. SYMMETRY supports CONVENTION/DICTNAME and can therefore distinguish between Schoenflies and H-M conventions.
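Applying an operator of the form [R|t] to a coordinate triple can be sketched as follows. The ordering of the 12 values inside the ARRAY is not stated in the text; this sketch assumes they are stored row by row, as three rows of four numbers (three elements of R followed by the matching component of t). The function name is hypothetical.

```python
def apply_operator(op, xyz):
    """Return R * xyz + t for an operator given as 12 numbers.

    Assumption: `op` is row-major, three rows of (R row | t component).
    """
    rows = [op[0:4], op[4:8], op[8:12]]
    return [
        r[0] * xyz[0] + r[1] * xyz[1] + r[2] * xyz[2] + r[3]
        for r in rows
    ]


# A twofold rotation about z combined with a translation of 0.5 along z.
screw = [-1, 0, 0, 0.0,
          0, -1, 0, 0.0,
          0, 0, 1, 0.5]
print(apply_operator(screw, (1.0, 2.0, 3.0)))  # [-1.0, -2.0, 3.5]
```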
The following BUILTINs for XVAR are provided:
The toplevel container for TecML files. It consists of a HEAD and any of the TecML elements in any order. Rarely needed unless MOL is excluded from the DTD.
TecML has relatively few hardcoded ELEMENTS and gets its flexibility from a wide range of attributes that can be applied to 'meta'-elements such as XVAR, ARRAY and XLIST. These attributes can be extended through the DTD, but are provided through files (using the ENTITY mechanism). The definitive DTD, therefore, depends critically on the contents of these files (*.ent) stored in the same directory as the TecML DTD.
TecML is designed to be extended by adding discipline-specific DTDs. SGML does not have a simple mechanism for this and therefore the content and attributes of certain ELEMENTs are defined in *.ent files rather than hardcoded. Unless you understand the ENTITY mechanism in SGML very well, do not touch these files! These are the current files (Jan 1997):
Extension of TecML with other DTDs must be done carefully and requires a good knowledge of SGML. The current TecML DTD shows how the MOL DTD is included and the use of the catalog file. Before extending TecML you should consult PM-R to avoid namespace collisions, and also to agree the most robust method. In general you would expect to extend the content model of XLIST, and the BUILTIN attributes of XLIST, XVAR and ARRAY. It is conceivable that TYPE might be expanded (e.g. to include currency). You should always extend using additional files rather than editing the current ones.
The files that extend TecML to include the MOL DTD. (See the previous paragraphs as well).
Terminology is a key part of TecML and ISO12620 has been used to provide the terms for supporting it. A TERMENTRY consists of:
The address of a person or organisation. It can contain electronic components such as E-Mail or URLs.
The XADDR can contain XVARs which give the components of the address. A number of these are BUILTINs and can be picked from the list in tecml-var.ent. These include:
Note that URLs and EMAILs are specific TYPEs for XVAR and this is the best way of using them (SGML forbids them also to be used as BUILTINs).
An optional/repeatable list of XVARs carrying either BUILTIN information or specific to the application. Note that E-mail addresses and URLs are already catered for as TYPEs in XVARs so do not need BUILTIN.
HTML allows authors to add hypertext of the complexity of the current HTML language (at Sept. 1995 this is HTML 2.0). Authors are assumed to be familiar with HTML and the DTD will not be documented here. There are, however, a few important differences:
In other words, it is an allowable %body.content without FORMS.
A generic container. It can be used to construct most of the common container classes (although these can only be validated at postprocessing time). The DTD imposes very few constraints on how XLIST can be used, but CONTENT can be set to show certain common methods. XLIST can contain any or all of the common generic data items (A, ARRAY, XVAR and XLIST itself). The commonest uses are:
<XLIST STRUCT=PERSON><XVAR>John Doe</XVAR><A HREF="email@example.com"></A></XLIST>
The format of the table is different from the HTML 2.1 tables (TAB), and even when those come in, XLIST will be retained, since it offers much more scope for semantics.
Note that the counts (COLUMNS, ROWS, SIZE) are advisory and primarily used for checking. The postprocessor is assumed to be able to count.
Because XLIST can be used in so many ways the possible content is flexible. The SGML keyword ANY allows any of the elements in the DTD including #PCDATA to be included in any order.
The generic variable of TecML and the most common ELEMENT. It is used to create an indefinite variety of objects through the use of the TYPE, BUILTIN and DICTNAME variables. When more than one XVAR of the same type is required ARRAY should be used, as it shares all the important attributes.
XVAR content is an ASCII string (PCDATA) which can include any printable characters. If '<' etc. are required they must be escaped with & or enclosed within a marked section. The content may contain whitespace which is significant except for leading and trailing whitespace which is ignored. XVAR may include newlines if these are specified as '\n', but normal record ends are translated to a single space.
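The whitespace rules for XVAR content might be applied by a postprocessor roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, and the order of the steps (record ends first, then trimming, then the '\n' escape) is a choice made here so that an escaped newline at either end of the content survives trimming.

```python
import re


def normalise_xvar(raw: str) -> str:
    """Apply the whitespace rules described above to raw XVAR content."""
    # Normal record ends are translated to a single space.
    text = re.sub(r"\r\n|\r|\n", " ", raw)
    # Leading and trailing whitespace is ignored.
    text = text.strip()
    # A literal backslash followed by 'n' stands for a newline.
    return text.replace("\\n", "\n")


print(repr(normalise_xvar("  first line\\nsecond\nline  ")))
```

Interior whitespace other than record ends is left untouched, since the text describes it as significant.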
A variety of TYPEs are allowed (the default is STRING).
The type (TYPE) and UNITS may be specified. May be extended to simple geometrical objects (e.g. point, circle, etc).
Note that ARRAY and XVAR share the same TYPE, FUZZY, DICTNAME, MIME, UNITS and LANG attributes and their associated values. They will differ in the values of BUILTIN.