Macromolecular Sequences in CML

At present I have only processed FASTA and SwissProt files into CML format, but it is easy to do the same for other formats.
CML has a SEQUENCE ELEMENT designed specifically for macromolecular sequences. It has flexible content but should be straightforward for linear polymers such as proteins and nucleic acids. There is, as yet, no simple convention for modified macromolecules (e.g. glycosylated proteins) or for branched and cyclic polymers (e.g. oligosaccharides). However, since CML is capable of holding many MOL elements in a document, including hierarchical containment, it should be possible to develop powerful systems for referencing different types of component in oligomers.

An important aspect of macromolecules is the FEATURE, for which a special ELEMENT is also provided. Normally a FEATURE is an annotation, in that it describes the relevance of part of the sequence, but sometimes it also indicates molecular connectivity (e.g. disulphides, glycosylation, or metal binding). It is too early to suggest all the ways that FEATURE can be used, but it will often make use of the ability to address parts of an object ('Sub-addressing'). This is a generic facility in CML and allows an application programmer to let one object (or parts of it) interact with parts of another. The FEATUREs, therefore, can sub-address the SEQUENCE if required, though the action and interpretation are normally left to the application.
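As a rough illustration of sub-addressing, the following Python sketch lets a feature point at a range of residues in a sequence. It is not CML or JUMBO code: the `Feature` class, its field names, the 1-based inclusive numbering and the toy sequence are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for a FEATURE; not the CML element definition."""
    key: str        # feature type, e.g. "SIGNAL" or "DISULFID"
    start: int      # 1-based index of the first residue addressed (assumed)
    end: int        # 1-based index of the last residue addressed, inclusive
    note: str = ""  # free-text annotation


def sub_address(sequence: str, feature: Feature) -> str:
    """Return the stretch of `sequence` that `feature` points at."""
    if not 1 <= feature.start <= feature.end <= len(sequence):
        raise ValueError(
            f"{feature.key} {feature.start}-{feature.end} lies outside the sequence"
        )
    return sequence[feature.start - 1:feature.end]


def bridged_residues(sequence: str, feature: Feature) -> tuple:
    """For a connectivity feature, return the two residues it links."""
    return sequence[feature.start - 1], sequence[feature.end - 1]


# Made-up ten-residue sequence, used only to exercise the functions.
toy = "MACDEFGHCK"
signal = Feature("SIGNAL", 1, 3, "toy signal peptide")
bridge = Feature("DISULFID", 3, 9, "toy disulphide")

print(sub_address(toy, signal))
print(bridged_residues(toy, bridge))
```

The same `Feature` is read in two ways here: as an annotation covering a stretch of chain, or as a link between its two end residues. That choice sits in the calling code, which mirrors the point that interpretation is left to the application.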

FASTA

FASTA is a simple format for holding sequences, and the example here is from the EMBL distribution (hsinsu.fasta). Essentially it only contains an ID, a description and a sequence, so the CML TOC is very simple:

The sequence is shown in a separate window.
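Since a FASTA entry holds only an ID, a description and a sequence, reading one takes very little code. The sketch below is a minimal Python reader, not the converter used to produce the CML files. The `FastaRecord` class and the toy record are hypothetical, and it assumes the ID is the first whitespace-delimited token after the `>`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FastaRecord:
    """Hypothetical container for the three parts of a FASTA entry."""
    identifier: str
    description: str
    sequence: str


def parse_fasta(text: str) -> List[FastaRecord]:
    """Split FASTA text into records of ID, description and sequence."""
    records: List[FastaRecord] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the ID is the first token, the rest is description.
            identifier, _, description = line[1:].partition(" ")
            records.append(FastaRecord(identifier, description.strip(), ""))
        elif records:
            # Sequence lines are concatenated onto the current record.
            records[-1].sequence += line
    return records


# Made-up record for illustration; it is not the content of hsinsu.fasta.
toy_text = ">TOY1 hypothetical example record\nACGTAC\nGTTA\n"
for record in parse_fasta(toy_text):
    print(record.identifier, "|", record.description, "|", record.sequence)
```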

SwissProt

CML has a SwissProt parser which understands 'most' of the records in the file. Particular emphasis has been put on showing the use of FEATURE, BIBliography and SEQUENCE, but many of the other keywords have been objectified (the main omission being the administrivia).

This is the SwissProt entry for human insulin and its automatic conversion to CML (ins_human.cml). The TOC shows the SwissProt file structure to good advantage.

The next screenshot shows the way that information can be combined using CML technology. The SEQUENCE is drawn graphically, but is also annotated with the FEATUREs.

A FEATURE can be a coloured bar (significant length of chain), a coloured oval (single-residue feature), a DISULFID bridge (yellow line), or a structural element (sine curve for HELIX, zigzag for SHEET, 'T' for TURN). These distinctions are hardcoded into JUMBO. Clicking on any of these produces the textual annotation from the SwissProt file; you can see that the reader has just clicked the green 'F' oval and then the grey SIGNAL bar.
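The rendering rules just described amount to a small lookup on the feature key and its extent. The Python sketch below restates them; it is not JUMBO's code. The function name `glyph_for` and the order in which the rules are tried are assumptions.

```python
def glyph_for(key: str, start: int, end: int) -> str:
    """Pick a glyph for a feature, following the rules listed in the text."""
    # Structural elements each have their own shape.
    structural = {"HELIX": "sine curve", "SHEET": "zigzag", "TURN": "T"}
    if key in structural:
        return structural[key]
    # A disulphide bridge is drawn as a line joining its two residues.
    if key == "DISULFID":
        return "yellow line"
    # Anything else: an oval for a single residue, a bar for a longer stretch.
    if start == end:
        return "coloured oval"
    return "coloured bar"


# Toy residue numbers, chosen only to exercise each branch.
for key, start, end in [("SIGNAL", 1, 5), ("VARIANT", 7, 7),
                        ("DISULFID", 3, 9), ("HELIX", 2, 6)]:
    print(key, "->", glyph_for(key, start, end))
```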

CML pays great attention to bibliography and citations and JUMBO can render everything in the SwissProt file.

The JOURNAL, YEAR, etc. are all held as ELEMENTs (using the generic XVAR mechanism) and so could be searched, sorted or otherwise processed. Each citation (BIB, shown with an open book icon) can be displayed individually, but I have also created a tool to show all the citations in one place (BIBLIST).

Finally we show some of the annotations (a mixture of keywords, comments, dates and other administrivia). Note how the simple scalars (e.g. GeneName, Species) can be toggled on or off inline. Those with larger content (e.g. Comments such as FUNCTION) are displayed as a text box. In some cases the information is a list of items (strings, numbers, etc.) as in the KEYWORDS.
