An introduction to Structured Documents

Peter Murray-Rust

Virtual School of Molecular Sciences

Nottingham University, UK

About this paper

This paper is being written to accompany the publication on CDROM of ECHET96 ("Electronic Conference on Heterocyclic Chemistry"), run by Henry Rzepa, Chris Leach and others at Imperial College, London, UK. It is being sponsored by the Royal Society of Chemistry who (along with Cambridge, Leeds and IC) are participants in the CLIC project. This is one of the projects under E-Lib, a UK-based programme to promote electronic publishing. CLIC makes substantial use of SGML, and Chemical Markup Language (an SGML-based approach to molecular information management and publishing) is being developed in parallel with CLIC. The sponsors have agreed to make part of the CDROM available for CML material, of which this paper is part.

At the same time, the W3 consortium is promoting the use of SGML on the WWW, particularly through a simplified, easy-to-use version called XML. Chemical Markup Language is written using XML and this paper is written in the belief that it may be useful to those interested in the XML program, since CML is one of the first working applications of XML.

The paper assumes that the reader knows nothing about Markup Languages (other than an acquaintance with HTML). It is primarily aimed at those who are interested in authoring or browsing documents with the next generation of markup languages, especially those created with XML. In its CDROM version it is accompanied by a structured document browser, JUMBO, which is a general XML browser written in Java, and is enhanced by being specifically extended to support CML and molecular applications. The CDROM contains a CML tutorial, many CML examples, and a number of screenshots of JUMBO displaying CML documents. For those of you reading this from a WWW page this material can be found under: The CML home Page. CML is part of the portfolio of the Open Molecule Foundation which is a newly constituted open body to promote interoperability in molecular sciences.

The paper alludes to various software tools, but does not cover their operation or implementation. However, with the exception of stylesheets, most of the operations described here for CML have already been implemented as a prototype using the JUMBO browser and processor. Nor is the paper a tutorial for CML as one is included in the CML distribution.

Finally I should emphasise that SGML can be used in many ways, and my approach does not necessarily do justice to the commonest use which is the management and publication of complex (mainly textual) documents. Projects in this area often involve many megabytes of data and industrial strength engines. I hope, however, that the principles described here will be generally useful.

Preamble

Two years ago I had never heard of structured documents, and have since come to see them as one of the most effective and cheapest ways of managing information. The basic idea is simple but when I first came across it I failed to see its importance, so this paper is written as a guide to what is now possible. In particular it explains the new simple language XML being developed by a working group (WG) of the W3 consortium. I have used this language as the basis for a markup language in technical subjects (TecML) and particularly molecular sciences (Chemical Markup Language, CML).

The paper is written as a simple structured document, using HTML, although it could have been written in CML. Since CML is being developed at the same time as XML, readers may belong to two categories:

Those who know nothing about structured documents, SGML and XML
Those who know nothing about science.

My hope is that both can read it without problems - the science is minimal and I hope that you can make the mental jump to other disciplines. However I shall slant it towards those who wish to carry precise, possibly non-textual, information arranged in (potentially quite complex) data structures. I shall use the term document, but this could represent a piece of information without conventional text such as a molecule. Moreover, documents can have a very close relation to objects and if you are comfortable with Object-Oriented language you may like to substitute 'object' for 'document'. In practice, XML documents can be directly and automatically transformed into objects, although the reverse may not always be quite so easy.

It will help if you know something about HTML, and you can relate the source of the document to its rendered form. It will be useful if you have been involved in authoring or editing HTML documents at the source level, and you shouldn't feel frightened of tags (strings of characters enclosed in diamond brackets <...>). The markup I shall introduce you to uses essentially the same syntax as HTML, and the main thing that may be new to you will be the concepts underneath this, rather than any new technology. I am primarily writing this paper in the context of document delivery over networks, but markup is also ideally suited to the management of 'traditional' documents. It is often seen as a key tool in making them 'future-proof' and interchangeable between applications (interoperability).

Some of what I say may appear trivial, perhaps just an exhortation to include some structure or navigation aids in your text. For a human reader this may be true, but for a machine (and the people who have to write the programs) it is of immense importance. I have seen several projects (including some of my own) which have tried to produce machine-readable information and failed because the nature of the task hadn't been appreciated.

The important point about the XML approach is that it has been designed to separate different parts of the problem and to solve them independently. I'll explain these ideas in more detail below, but one example is the distinction between syntax (the basic rules for carrying the information components) and semantics (what meaning you put on them and what behaviour a machine is expected to perform). This is a much more challenging area than people realise, since human readers don't have problems with it.

Introduction

One of the great polymaths of this century, J.D.Bernal, inspired the development of information systems in molecular science. In 1962 he urged that the problems of scientific information in crystallography (his own field) and solid state physics should be treated as one in communication engineering. 30 years on we have most of the tools that are required to get the best information in the minimum quantity in the shortest time, from the people who are producing the information to the people who want it, whether they know they want it or not. (Bernal's words, quoted in Sage, Maurice Goldsmith, p219.) I believe that structured documents, especially using markup languages such as CML/XML have a key role to play. Nothing comes free, but where this approach is possible it's very cost effective.

Many scientists are unaware of the research during the last 30 years into the management of information. A recent and valuable review is: "Information Retrieval in Digital Libraries: Bringing Search to the Net", Bruce R. Schatz, Science, 275, pp. 327-334, (1997). [I shall comment on the format of the last sentence shortly.] In this Schatz shows that previous research in the analysis of complex documents, including hyperlinking, concept analysis, and vocabulary switching between disciplines is now possible on a production scale. Much of his emphasis is on analysis of conventional documents produced by authors who have no knowledge of markup and who do not use vocabularies. For that reason, complex systems are required to extract implicit information from the documents, and they rely on having appropriate text to analyse. Automatic extraction of numerical and other non-textual information will be much more difficult.

Structure and Markup

We often take for granted the power of the human brain in extracting implicit information from documents. We have been trained over centuries to realise that documents have structure (Table Of Contents (TOC), Indexes, Chapters with included Sections, and so on). It probably seems 'obvious' to you that you are reading the fourth section (Structure and Markup) in the paper (A simple introduction to structured documents). The HTML language and rendering tools which you are using to read it provide a simple but extremely effective set of visual clues such as the Chapter being set in larger type. However the logical structure of the document is simply:

HTML
  HEAD
    TITLE
  BODY
    H1 (Chapter)
    H2 (Section)
    H3 (Subsection)
    H3 
    H2 
    P  (Paragraph)
    P
    P
    P
    P
    H2
    P
    P
    P
    H2
    P
    P
    ... and so on ...
    ADDRESS

where I have used the convention of indentation to show that one component includes another. This is a common approach in many TOCs and human readers will implicitly deduce a hierarchy from the above diagram. But a machine could not unless it had sophisticated heuristics, and it would also make mistakes.

You may now find it useful to have a window open on your browser with the source of this document visible.

The formal structure in this document is quite limited, and that is one of the reasons that HTML has been so successful. Humans can author them easily and human readers can supply the implicit structure. But if you look again at the TOC diagram you will see that Chapters do NOT include Sections in a formal manner, nor do Sections include Paragraphs. The first occurrence of H2 and H3 is used for the author and affiliation which is not a 'Section'.

An information component (an Element) contains another if the start-tag and end-tag of the container completely enclose the contained. Thus the HEAD element contains a TITLE element, and the TITLE element contains a string of characters (technically the term is #PCDATA). There's a formal set of rules in HTML for what Elements can contain what others and where they can occur. Thus it's not formally allowed to have TITLE in the BODY of your document. These rules, which you won't need to read, are called a Document Type Definition (DTD). They are written in a language called SGML, which you won't need to learn unless you do a great deal of work in this field.

[If you have already come across SGML and been put off for some reason, please don't switch off here. XML has been carefully designed to make it much easier to understand the concepts and there are many fewer terms. For example, you don't even have to have a DTD if you don't want one.]

This document has an inherent structure in the order of its Elements. Most people would reasonably assume that an H2 element 'belongs to' the preceding H1, and that P elements belong to the preceding H2. It would be quite natural to use phrases like "the second sentence of the second paragraph in the section called 'Introduction'". Humans can do this easily although it's easy to get lost in large documents. The important news is that XML now makes it possible for machines to do the same sort of thing with simple rules and complete precision. The Text Encoding Initiative (a large international project to mark up the world's literature) has developed tools for doing this, and they will be available to the XML community.
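As a minimal sketch of this kind of addressing, assume a document in which each section really does contain its paragraphs. The element names PAPER, SECTION, TITLE and P and the helper `find_sentence` are hypothetical and are not taken from any DTD, the sentence splitting is deliberately naive, and Python's standard `xml.etree.ElementTree` module is used purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: PAPER, SECTION, TITLE and P are illustrative names only.
DOC = """
<PAPER>
  <SECTION>
    <TITLE>Preamble</TITLE>
    <P>First paragraph. It has two sentences.</P>
  </SECTION>
  <SECTION>
    <TITLE>Introduction</TITLE>
    <P>Opening paragraph.</P>
    <P>Many scientists are unaware of this research. A recent review is by Schatz. More follows.</P>
  </SECTION>
</PAPER>
"""

def find_sentence(root, section_title, para_no, sentence_no):
    """Return sentence `sentence_no` of paragraph `para_no` (both 1-based)
    in the section whose TITLE equals `section_title`."""
    for section in root.findall("SECTION"):
        if section.findtext("TITLE") == section_title:
            paragraphs = section.findall("P")
            text = paragraphs[para_no - 1].text
            # Naive split on ". " -- adequate for a sketch, not for real prose.
            sentences = [s.strip().rstrip(".") for s in text.split(". ")]
            return sentences[sentence_no - 1]
    raise LookupError("no section called %r" % section_title)

root = ET.fromstring(DOC)
second = find_sentence(root, "Introduction", 2, 2)
print(second)
```

The point of the sketch is that the lookup needs no heuristics: the containment of P within SECTION is explicit in the markup, so the program only has to count.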

[NOTE on HTML: In HTML there are no formal conventions for what constitutes a Chapter or Section, and no restriction as to what elements can follow others. Therefore you can't rely on analysing an arbitrary HTML document in the way I've outlined. This highlights the need for more formal rules, agreements and guidelines. In XML we are likely to see communities such as users of CML develop their own rules, which they enforce or encourage as they feel. For example, there is no restriction on what order Elements can occur in a CML document but there is a requirement that ATOMS can only occur within a MOL (molecule Element). (In CML I use the term ChemicalElement to avoid confusion.)]

In the Schatz reference (Introduction: Para 2, sentence 2), you will probably 'know automatically' what the components are. The thing in brackets must be the year, 'pp.' is short for 'pages', the bold type must be the volume, and the italics are the journal title. But this is not obvious to a machine, and trying to write a parser for this is difficult and error-prone. Many different publishing houses have their own conventions. The Royal Society of Chemistry might format this as: B. R. Schatz, Science, 1997, 275, 327. Any error in punctuation such as missing periods causes serious problems for a machine, and conversions between different formats will probably involve much manual crafting.

The precise components of the reference are well understood and largely agreed within the bibliographic community. They are a good example of something that can be enhanced by markup. Markup is the process of adding information to a document which is not part of the content but adds information about the structure or elements. Using the citation as an example, we can write:

<BIB>
  <TITLE>
  Information Retrieval in Digital Libraries: Bringing Search to the Net
  </TITLE>
  <JOURNAL>Science</JOURNAL>
  <AUTHOR>
    <FIRSTNAME>Bruce</FIRSTNAME>
    <INITIAL>R</INITIAL>
    <LASTNAME>Schatz</LASTNAME>
  </AUTHOR>
  <VOLUME>275</VOLUME>
  <YEAR>1997</YEAR>
  <PAGES>327-334</PAGES>
</BIB>

Even if they had never seen markup before, most scientists would implicitly understand this information. The advantage is that it's also straightforward to parse it by machine. If the tags (<...>) are ignored, then the content is exactly the same as earlier (except for punctuation and rendering). It's often useful to think of markup as invisible annotations on your document. Many modern systems do not mark up the document itself, but provide a separate document with the markup. This is a feature of hypermedia systems and one of the goals of XML is to formalise this through the development of linking syntax and semantics in Phase II, but this is outside the scope of this paper.
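To illustrate how little work the machine has to do, here is a minimal sketch that reads the BIB example with Python's standard `xml.etree.ElementTree` module. The choice of Python and the variable names `fields`, `author` and `content` are assumptions made for this illustration; they are not part of CML or JUMBO.

```python
import xml.etree.ElementTree as ET

BIB = """
<BIB>
  <TITLE>
  Information Retrieval in Digital Libraries: Bringing Search to the Net
  </TITLE>
  <JOURNAL>Science</JOURNAL>
  <AUTHOR>
    <FIRSTNAME>Bruce</FIRSTNAME>
    <INITIAL>R</INITIAL>
    <LASTNAME>Schatz</LASTNAME>
  </AUTHOR>
  <VOLUME>275</VOLUME>
  <YEAR>1997</YEAR>
  <PAGES>327-334</PAGES>
</BIB>
"""

bib = ET.fromstring(BIB)

# Each leaf component is addressed by name; no punctuation rules are involved.
fields = {child.tag: (child.text or "").strip() for child in bib if len(child) == 0}

# AUTHOR has sub-elements of its own, so it is read separately.
author = {part.tag: part.text for part in bib.find("AUTHOR")}

# Ignoring the tags leaves only the content (whitespace normalised here).
content = " ".join("".join(bib.itertext()).split())

print(fields["JOURNAL"], fields["YEAR"], author["LASTNAME"])
print(content)
```

Nothing in the sketch depends on the order of the components or on commas, periods or typefaces, which is exactly what made the conventional citation hard to parse.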

What is so remarkable about this? In essence we have made it possible for a machine to capture some of those things that a human takes for granted.

Punctuation and other syntax are no longer a problem as there are extremely carefully defined rules in XML. If your markup characters are <...>, how do you actually send < and > characters without them being mistaken for markup? One way is to encode them as &lt; and &gt;. Look at this document's source to see it in practice and also to see how &lt; is held.
Character encoding and other character entities have received a huge amount of attention and many Entity Sets have been developed, some by ISO. For example the copyright symbol (©) is number 169 in ISO-Latin 1 and can be encoded as &#169;. It also has a symbolic representation (&copy;). XML itself only has a very few built-in character entities, but will support Unicode and other approaches to encoding characters. (I shall not discuss chemical typesetting here, since the emphasis is on non-typographical ways of encoding chemistry.) Most browsers do not yet support a wide range of glyphs for entities but this is likely to change very rapidly, especially since languages like Java have addressed the problem.
The role of information elements is defined. In the example, you can see what the precise components are and what their extent is. Note how the AUTHOR element is divided into three components. What you do with this information is the remit of semantics, and XML separates syntax precisely from semantics in a way that very few other non-SGML systems manage.
Documents can be reliably restructured or filtered by machine. An author might enter the LASTNAME, FIRSTNAME and INITIAL sequentially, but the machine could be asked to sort them into a different order. This may not appear very important, but to those implementing programs it's an enormous help. If the house style was initials-only, the program could easily turn 'Bruce' into 'B.' .
Documents can be transformed, merged, and edited automatically. This is a great advance in information management. It would be straightforward to write a citation analyser which found all BIB elements in a document and abstracted parts of them by JOURNAL or YEAR.
It's easy to convert from one structured document to another. The bibliographic example above is not in strict CML, but it's very easy to convert it to CML without losing any information.
All information in a document can be precisely identified. In the above example there is markup down to the granularity of a single character (the INITIAL). It is conceptually easy to extend this to markup of numbers, formulae, and parts of things such as regions in diagrams or atoms in molecules.
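The restructuring and filtering described in the points above can be sketched in a few lines. This is an illustration only: the PAPER wrapper element and the names `initials_only_style`, `citations` and `journals_1997` are hypothetical, and the output layout simply imitates the Royal Society of Chemistry style quoted earlier.

```python
import xml.etree.ElementTree as ET

# PAPER is a hypothetical wrapper; the BIB content is the example from the text.
DOC = """
<PAPER>
  <BIB>
    <JOURNAL>Science</JOURNAL>
    <AUTHOR>
      <FIRSTNAME>Bruce</FIRSTNAME>
      <INITIAL>R</INITIAL>
      <LASTNAME>Schatz</LASTNAME>
    </AUTHOR>
    <VOLUME>275</VOLUME>
    <YEAR>1997</YEAR>
    <PAGES>327-334</PAGES>
  </BIB>
</PAPER>
"""

def initials_only_style(bib):
    """Format one BIB as: initials, last name, journal, year, volume, first page."""
    author = bib.find("AUTHOR")
    first = author.findtext("FIRSTNAME")[0] + "."   # 'Bruce' becomes 'B.'
    middle = author.findtext("INITIAL") + "."
    last = author.findtext("LASTNAME")
    first_page = bib.findtext("PAGES").split("-")[0]
    return "%s %s %s, %s, %s, %s, %s." % (
        first, middle, last,
        bib.findtext("JOURNAL"), bib.findtext("YEAR"),
        bib.findtext("VOLUME"), first_page)

root = ET.fromstring(DOC)

# A tiny 'citation analyser': find every BIB element, wherever it occurs.
citations = [initials_only_style(b) for b in root.iter("BIB")]

# Filter by YEAR and abstract only the JOURNAL.
journals_1997 = [b.findtext("JOURNAL") for b in root.iter("BIB")
                 if b.findtext("YEAR") == "1997"]

print(citations[0])
print(journals_1997)
```

The order in which the author typed FIRSTNAME, INITIAL and LASTNAME is irrelevant to the result, since each component is fetched by name.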

Rules, meta-languages and validity

I started writing Chemical Markup Language because I wanted to transfer molecules precisely using ATOMS, BONDS and related information. It was always clear that 'chemistry' was more than this and that we needed the tools to encapsulate numeric and other data such as spectra. I looked at a wide variety of journals in the scientific area to see what sort of information was general to all of them and whether a markup language could be devised which could manage this wide range. It required a meta-language, and this section is an explanation of what that involves.

I'll explain the 'meta-' concept using XML and then show how it extends to applications such as TecML. XML, despite its name, is not a language but a meta-language (a tool for writing languages). XML is a set of rules which enable markup languages to be written and TecML and CML are two such languages. For example, one rule in XML is "every non-empty element must have a start-tag and an end-tag" so that the <AUTHOR> tag must be balanced by a </AUTHOR> tag. This is not a strict requirement of HTML, for example, which uses a more flexible set of rules. Another rule is "all attribute values must occur within quotes (")". Writing a markup language is analogous to writing a program and the relation of XML to CML is much the same as C to hello.c. We say that CML 'is an application of XML', or 'is written in XML', just as 'hello.c is written in C'. XML is a little stricter than HTML in the syntax it allows but the benefit is that it's much easier to write browsers and other applications.

XML allows for two sorts of documents, valid and well-formed. Validity requires an explicit set of rules as a DTD which is usually a separate file, but can be included in the document itself. An example of a validity criterion in HTML is that LI (a ListItem) must occur within a UL or OL container. Well-formedness is a less strict criterion and requires simply that the document can be automatically parsed without the DTD. The bibliographic example above is well-formed, but without a DTD may not be valid. It might have been an explicit rule that the author must include an element describing the language that the article was written in such as <LANGUAGE>EN</LANGUAGE>; in this case the document fragment would be invalid. The importance of validity will depend on the community using XML. In molecular science all *.cml documents will be expected to be valid and this is ensured by running them through a validating parser such as the free sgmls from James Clark. If a browser or other processing application such as a search engine can assume that a certified document is valid (perhaps from a validation stamp) there is no need to write a validating parser. Being valid doesn't mean the contents are necessarily sensible and a further processor may be needed for that.
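The difference between the two criteria can be sketched as follows. Python's `xml.etree.ElementTree` is a non-validating parser, so it can test well-formedness directly; the function `has_language` is a hypothetical hand-written stand-in for one single validity rule, which a real validating parser would instead read from the DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the text parses as XML without reference to any DTD."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def has_language(text):
    """Stand-in for ONE validity rule: a BIB must contain a LANGUAGE element.
    A validating parser would take such rules from the DTD, not from code."""
    return ET.fromstring(text).find("LANGUAGE") is not None

balanced = "<BIB><JOURNAL>Science</JOURNAL><YEAR>1997</YEAR></BIB>"
unbalanced = "<BIB><JOURNAL>Science<YEAR>1997</YEAR></BIB>"  # JOURNAL never closed
with_language = "<BIB><JOURNAL>Science</JOURNAL><LANGUAGE>EN</LANGUAGE></BIB>"

print(is_well_formed(balanced), is_well_formed(unbalanced))
print(has_language(balanced), has_language(with_language))
```

Here `balanced` is well-formed but fails the illustrative rule, which mirrors the situation described for the bibliographic fragment: parsable without a DTD, yet possibly invalid against one.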

Where, and how, you enforce validity depends on what you are trying to do. If you are providing a form for authors to submit abstracts you will enforce fairly strict rules. ("It must have one or more AUTHORs, exactly one ADDRESS for correspondence, and the AUTHOR must contain either a FIRSTNAME or INITIALS but not both"). This can be enforced in a DTD. But this would be too restricting for a general scientific document, which need not always have an AUTHOR. The two forces of precision and flexibility often conflict, but can be reconciled to a large extent by providing different ways of processing documents.

Processing documents

At this stage it's useful to think about how an XML document might be created and processed. At its simplest level a document can be created with any text editor (which is how the BIB example was written). It can then be processed with the human brain. This isn't a trivial point; there is no fundamental requirement for software at any or all stages of managing XML documents. In practice, however, software adds enormously to the value. CML documents such as those including atomic coordinates only make sense when rendered by computer.

A general authoring process can be represented as:


               stylesheets

Authoring       assembly                      validation
Validation ------ // ----> parsing & validation --> postprocessing
                serving 
Editing                                        rendering
Conversion     objects/Java

The break (//) signifies where the document is transferred from author/server to client/reader. Not all XML applications will fit this simple model, but it serves to highlight the components:

Authoring. One of the hardest problems is to write the authoring tools for an SGML/XML system. A good tool has to provide a natural interface for authors, most of whom won't know the principles of markup languages. It may also have to enforce strict and complex rules, possibly after every keystroke. Many current authoring tools are therefore tailored to a limited number of specific applications. [One of the most versatile is an SGML add-on to EMACS]. Sometimes a customer will approach an SGML house and, after agreeing a DTD, a specific tool will be built. For some common document types such as military contracts there is enough commonality that commercial tools are available.
Conversion. In some cases authoring involves conversion of existing (legacy) documents and if these are well understood, conventional programs can be written in Perl or similar languages. Where the XML documents represent database entries or the output from programs, the authoring process is particularly simple and many CML applications will fall in that category. XML makes it particularly easy to reuse material either by "cut-and-paste" of sections, or preferably through entities.
Editing and merging. Editing and merging affect the structure of the document and therefore may require validation. To write programs which do this on the fly is again difficult, and it may be useful, where possible, to divide documents into 'chunks' or entities. SGML has a very powerful concept of entities and can describe documents whose components are distributed over a network. For example, if I have an address such as the one at the bottom of this document, it is extremely useful to refer to that chunk by a symbolic name, such as &pmraddress;. With appropriate software I can include this at appropriate places and the software will include the full content of the entity. (If the entity contains references to other entities, they are also expanded and so on.) How XML uses entities in practice is being actively resolved at present.
The Server: Assembly and Queries. The server has a vital role to play in many XML applications. It is possible to mount sophisticated SGML systems which retrieve document components and assemble them on the fly into XML documents. Alternatively the components could be retrieved from databases, as for chemical and biological molecules or data, and converted into XML files. Since XML maps onto Object storage it is particularly attractive for those developing Object-based systems such as CORBA. Whether the complete document is assembled at the server or the addresses of the fragments are sent to the client will depend on bandwidth, the preference of the community, the availability of software and many other considerations.
Parsing. Parsing is the process of syntactic analysis and validation. It normally produces a standardised output either on file or in memory. Whether you need to validate documents when you receive them will depend on your community's requirements. For example, if I receive a database entry from a major molecular data centre I can rely on its validity, but if I'm a publisher getting a hand-edited XML manuscript I will probably want to validate it. A validating parser requires that the document be valid against a specified DTD. Finding this DTD normally requires interpretation of the DOCTYPE statement at the head of an XML document. Some authors/servers are prepared to distribute the DTDs when documents are downloaded. This adds precision in that the correct DTD is used, but can add to the burden of server maintenance and can increase bandwidth use. If a community agree on a DTD they may find it useful to distribute it with the browsing software. The result of parsing is usually a parse-tree. If this is an unfamiliar concept, think of it as a table of contents (TOC) with every Element corresponding to a chapter or (sub...sub)section. Trees are easy to manipulate and display; JUMBO displays the tree as a TOC. There are already 2 freely available XML parsers written in Java (NXP and Lark) and I have used both. Lark creates a parse tree in memory which can be subclassed, while NXP produces it on the output stream.
Postprocessing, rendering, and validation. Most documents require at least some postprocessing, and many need a lot. Most users of XML applications will think of 'browsers' or 'plugins' as the obvious tools to use on a document. This will probably be true, but because it's machine processable XML is so powerful that many completely new applications will be developed. An XML document might consist of an airline reservation and the postprocessor could decide to order a taxi to the airport. A chemical reaction in a CML document could trigger the supply of chemicals and interrogate the safety databases.
Semantics and the postprocessor. An XML document carries no semantics with it, and there has to be an explicit or implicit agreement between the author and reader. Everyone understands roughly the same thing by the TITLE in HTML documents although they might try and use them in different ways. TITLE is valuable for indexers such as Altavista which abstract their content separately from the body of the document. This emphasises the value of structural markup. However some widely used element names are ambiguous (A is variously used in different DTDs for author, anchor, etc.) and for some such as 'LINK' it's unclear what their role is. Clarifying this for each DTD requires semantics. Traditionally semantics have been carried in documentation, and if this is not done clearly then implementers may provide different actions for the same Element. The XML project is actively investigating formal automatic ways of delivering semantics, such as style sheets and Java classes.
Validation at the postprocessor. The DTD/validating-parser cannot deal with some aspects of validation, which must be tackled by a conventional program/application. Common examples of validation are content ("is this number in the allowed range?"), and occurrence counts ("no more than five sections per chapter"). This is likely to need special coding for each application, and will be most important where high precision and low flexibility is the intention.
Stylesheets. Stylesheets are sets of rules that accompany a document. They can be used to filter or restructure the document ("extract all footnotes and put them at the end of a section"). Their most common use is in formatting or providing typesetting instructions ("all subsections must be indented by x mm and typeset in this font"). ISO has produced a standard for style sheets (DSSSL) which allows their description in Scheme (a derivative of LISP). Stylesheets are generally written to produce a transformed document, rather than to create an object in memory for which Java classes are more suitable. I expect to see the technologies converge and which is used will depend on the application and the community using it. There are four ways in which stylesheets can usefully be used:
- The author. If an author wishes to impart a particular style to a document then they can attach or include a stylesheet. When this reaches the postprocessor it can be invoked, unless it has been overridden.
- The server. If the organisation such as publishing house is running the server it may impose a particular style such as for bibliographic references. XML would give the author the freedom to prepare them in a standard way (e.g. using CML), while the journals could transform this by sending their stylesheets to the reader.
- The client software ('browser'). The software manufacturer has an interest in providing a common look-and-feel to the display. It reduces training and documentation costs and might provide a competitive market edge.
- The reader. She may have personal preferences about the presentation of material, perhaps because of her education. Alternatively her employer may require a common house style because training and internal communication would be made easier.
The technology exists for any one of these four to provide their stylesheet. Which overrides which is a matter of politics, not technology.
Java classes. Every Element can be thought of as an object and have methods (or behaviour) associated with it. Thus a LIST might count and number the ITEMs. Most elements will have a display() method which could be implemented differently from object to object. Thus in JUMBO, MOL.display() brings up a rotatable screen display of the molecule, while BIB.display() displays each citation in a mixture of fonts. As with style sheets, Java classes can be specified at any of the 4 places, and the appropriate one downloaded from a Web site if required. One of the problems the XML-WG is tackling and solving is how to locate such classes. Because Java is a very powerful programming language with full WWW support it offers almost unlimited scope for XML applications. A document need not be passive, but could awake the client to take a whole series of actions such as mailing people, downloading other data, and updating the local database.
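The idea of attaching behaviour to Elements can be sketched as below. This is not JUMBO's design and, to keep one language for the examples in this paper, it is written in Python rather than Java; the class names `Node`, `ListNode` and `BibNode`, the `CLASSES` registry and the DOC and NOTE element names are all hypothetical.

```python
import xml.etree.ElementTree as ET

class Node:
    """Default behaviour: display the element's own text."""
    def __init__(self, element):
        self.element = element

    def display(self):
        return (self.element.text or "").strip()

class ListNode(Node):
    """A LIST counts and numbers its ITEMs."""
    def display(self):
        items = self.element.findall("ITEM")
        return "\n".join("%d. %s" % (n, item.text)
                         for n, item in enumerate(items, 1))

class BibNode(Node):
    """A BIB displays a one-line citation (plain text here, not fonts)."""
    def display(self):
        e = self.element
        return "%s, %s (%s)" % (e.findtext("JOURNAL"),
                                e.findtext("VOLUME"),
                                e.findtext("YEAR"))

# Hypothetical registry mapping element names to classes.
CLASSES = {"LIST": ListNode, "BIB": BibNode}

def make_object(element):
    """Wrap an element in the class registered for its name, or the default."""
    return CLASSES.get(element.tag, Node)(element)

doc = ET.fromstring(
    "<DOC>"
    "<LIST><ITEM>parse</ITEM><ITEM>validate</ITEM></LIST>"
    "<BIB><JOURNAL>Science</JOURNAL><VOLUME>275</VOLUME><YEAR>1997</YEAR></BIB>"
    "<NOTE>plain text</NOTE>"
    "</DOC>")

shown = [make_object(child).display() for child in doc]
print("\n".join(shown))
```

The same call, `display()`, produces different behaviour for each Element type, and replacing an entry in the registry changes the behaviour without touching the document.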

This has been a long section, but I hope it shows that XML is not simply a document processing language.

Attributes

So far I have only used Element names (often called GIs) to carry the markup. XML also provides attributes as another way of modulating the element. Attributes occur within start-tags, and well-known examples from HTML are HREF (in A) and SRC (in IMG):
<A HREF="http://www.venus.co.uk/omf/cml/">
<IMG SRC="mypicture.gif" WIDTH="500" HEIGHT="100">.
Attributes are semantically free in the same way as Elements, and can be used with stylesheets or Java classes to vary their meaning.

Whether Elements or attributes are used to convey markup is a matter of preference and style, but in general the more flexible the document the more I would recommend attributes. As a point of style, many people suggest that document content should not occur in attributes, but this is not universal. Here are some simple examples of the use of attributes:

Describing the type of information (e.g. what language the Element is written in).
Adding information about the document or parts of it (who wrote it, what its origins are)
Suggestions for rendering such as recommended sizes for pictures.
Help for the postprocessor (e.g. the wordcount in a paragraph).

In XML-link attributes will be extensively used.

Flexibility and meta-DTDs

When developing an XML application the author has to decide whether precision and standardisation is required, or whether it is more important to be flexible. If precision is required, then the DTD will be the primary means of enforcing it and as a consequence may become large and complex. It implies that the 'standard' is unlikely to change. When new versions are produced, the complete pipeline from authoring to rendering will need to be revised. As this is a major effort and cost, careful planning of the DTD is necessary.

If flexibility is more important, either because the field is evolving or because it is very broad, a rigid DTD may restrict development. In that case a more general DTD is useful, with flexibility being added through attributes and their values. So in TecML I have created an Element type XVAR, for a scalar variable. I use attributes to tune the use and properties of XVAR, and it's possible to make it do 'almost anything'! For example, it can be given a TYPE such as STRING, FLOAT or DATE, and a TITLE. In this way any number of objects can be precisely described. Here are three examples:

<XVAR TYPE="STRING" TITLE="Greeting">Hello world!</XVAR>
<XVAR TYPE="DATE">2000-01-01</XVAR>
<XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Fahrenheit">451</XVAR>

The last is particularly important because it uses the concept of linking to add semantics. This is a big feature of XML, and the precise syntax is being developed in XML-Phase-II. CML uses DICTNAME to refer to an entry in a specified glossary which defines what "Melting Point" is. This entry could have further links to other resources such as world collections of physical data. Similarly I use UNITS to specify precisely what scale of temperature is used. Again this is provided by a glossary in which SI units are the default. By using this approach it is possible to describe any scalar variable simply by varying the attributes and their values. Note that the attribute types must be defined in the DTD, but their values may either be unlimited or be restricted to a set of possible values.
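
A minimal Python sketch of how a program might interpret the three XVAR examples above is given here. It is not the TecML or CML implementation: the CONVERTERS table, the GLOSSARY dictionary with its one definition, the DOC wrapper element, and the choice of STRING as the default TYPE are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical converters keyed by the TYPE attribute.
CONVERTERS = {
    "STRING": str,
    "FLOAT": float,
    "DATE": date.fromisoformat,
}

# Hypothetical glossary standing in for the one that DICTNAME points to.
GLOSSARY = {"Melting Point": "Temperature at which a solid becomes a liquid"}


def read_xvar(node):
    """Return one XVAR as a dict: typed value plus any linked semantics."""
    convert = CONVERTERS[node.get("TYPE", "STRING")]  # STRING default is assumed
    return {
        "title": node.get("TITLE"),
        "value": convert(node.text.strip()),
        "units": node.get("UNITS"),
        "definition": GLOSSARY.get(node.get("DICTNAME")),
    }


# The three examples from the text, wrapped in an invented DOC element.
source = """<DOC>
<XVAR TYPE="STRING" TITLE="Greeting">Hello world!</XVAR>
<XVAR TYPE="DATE">2000-01-01</XVAR>
<XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Fahrenheit">451</XVAR>
</DOC>"""

xvars = [read_xvar(node) for node in ET.fromstring(source).iter("XVAR")]
for xvar in xvars:
    print(xvar)
```

One Element type serves all three cases; only the attribute values decide how the content is read.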

The TecML DTD uses very few Element types, and these have been carefully chosen to cover most of the general concepts which arise in technical subjects. They include ARRAY, XLIST (a general tool for data structures such as tables and trees), FIGURE (a diagram), PERSON, BIB, and XNOTATION. (NOTATION is an XML concept which allows non-XML data to be carried in a document, and is therefore a way of including 'foreign' file types.) With these simple tools and a wide range of attributes it is possible to mark up most technical scientific publications. Areas which are not covered are: parsable mathematics, fine-grained markup in diagrams, and anything that involves complex relationships. Of course there has to be general agreement about the semantics of the markup, but this is a great advance compared with having no markup at all. In some cases, where adequate methods have already been developed for well-defined components, those can be encapsulated and need not be translated. Examples are NETCDF for multidimensional data and VRML for 3-D graphics.

Searching

It was a revelation when I realised how powerful structured documents (SDs) are for carrying information. I think that data in many disciplines map far more naturally into a tree structure than into a relational database (RDB). An SD has a concept of sequential information while an RDB does not. The exciting thing is that the new Object databases (including the hybrid Object-Relational Databases (ORDBs)) have the exact architecture which is needed to hold XML-like documents, and suppliers now offer SGML interfaces. (For any particular application, of course, there may be a choice between RDBs and ORDBs.) The attraction of Objects over RDBs is that it is much easier to design the data architecture. In many cases simply creating well marked-up documents may be all that is required for their use in the databases of the future.

The reason for this confident statement is that SDs provide a very rich context for individual Elements. Thus we can ask questions like:

"Which DATASET contains one MOLECULE and one (SPECTRUM whose attribute TYPE has a value of "nmr")?"
"Find all MOLECULEs which contain MOLECULEs" (e.g. ligands in proteins)
"Find all references to journals not published by the Royal Society of Chemistry"

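The first two questions above can be sketched with the limited path syntax in Python's standard library. This is not SDQL, only an illustration of how the tree context answers such questions; the sample data, the ARCHIVE wrapper and the ID attributes are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample data; only DATASET, MOLECULE, SPECTRUM and TYPE come from the text.
source = """<ARCHIVE>
<DATASET ID="d1">
  <MOLECULE ID="m1"/>
  <SPECTRUM TYPE="nmr"/>
</DATASET>
<DATASET ID="d2">
  <MOLECULE ID="protein">
    <MOLECULE ID="ligand"/>
  </MOLECULE>
  <SPECTRUM TYPE="ir"/>
</DATASET>
</ARCHIVE>"""
root = ET.fromstring(source)

# DATASETs with exactly one MOLECULE child and one SPECTRUM child of TYPE "nmr".
nmr_datasets = [
    ds.get("ID")
    for ds in root.iter("DATASET")
    if len(ds.findall("MOLECULE")) == 1
    and len(ds.findall("SPECTRUM[@TYPE='nmr']")) == 1
]

# MOLECULEs that contain at least one MOLECULE somewhere beneath them.
nested = [
    mol.get("ID")
    for mol in root.iter("MOLECULE")
    if mol.find(".//MOLECULE") is not None
]

print(nmr_datasets, nested)
```

Both answers depend on where an Element sits in the tree, which is exactly the context that a flat table of rows does not record.
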
Despite their apparent complexity, all these can be managed with standard techniques for searching structured documents. Because of this power, a special language (Structured Document Query Language - SDQL) has been developed and will interoperate with XML. If simple application-specific tools are developed then queries like the following are possible:

"Find all XVARs whose DICTNAME value is "Melting Point"; retrieve the value of the UNITS attribute and use it to convert the content to a floating point number representing a temperature on the Celsius scale. Then include all data with values in the range 150-170"

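A query of this kind might be implemented along the following lines. This is a sketch under stated assumptions, not an SDQL query: the TO_CELSIUS table, the sample values and the DOC wrapper are invented, and a missing UNITS attribute is read as Kelvin on the assumption that this is what "SI units are the default" means for temperature.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit table: functions converting a value to Celsius.
TO_CELSIUS = {
    "Celsius": lambda v: v,
    "Kelvin": lambda v: v - 273.15,
    "Fahrenheit": lambda v: (v - 32.0) * 5.0 / 9.0,
}

# Invented sample values, for illustration only.
source = """<DOC>
<XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Celsius">156</XVAR>
<XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Fahrenheit">451</XVAR>
<XVAR TYPE="FLOAT" DICTNAME="Melting Point">430.15</XVAR>
<XVAR TYPE="FLOAT" DICTNAME="Boiling Point" UNITS="Celsius">160</XVAR>
</DOC>"""


def melting_points_in_range(root, low, high):
    """Return Celsius melting points between low and high, inclusive."""
    hits = []
    for node in root.iter("XVAR"):
        if node.get("DICTNAME") != "Melting Point":
            continue  # skip variables that describe some other quantity
        # Assumed default: no UNITS attribute means Kelvin.
        to_celsius = TO_CELSIUS[node.get("UNITS", "Kelvin")]
        celsius = to_celsius(float(node.text))
        if low <= celsius <= high:
            hits.append(round(celsius, 2))
    return hits


result = melting_points_in_range(ET.fromstring(source), 150, 170)
print(result)
```

The Fahrenheit entry and the boiling point are both rejected, the first on its converted value and the second on its DICTNAME, so the markup alone is enough to make values on different scales comparable.
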
Summary, and the next phase

This document has described only part of what XML can offer to a scientific or publishing community. XML has three phases, and only the first has been covered here. Phase II is to define a hyperlinking system; and Phase III to define how style sheets will be used. Hyperlinking can range from the simple, unverified link (as in HTML's HREF attribute for Anchors) to a complete database of typed and validated links over thousands of documents. Phase II is addressing all of these and has the power to support complex systems.

Technical aspects and the future

How will XML develop in practice? A natural impetus will come from those people who already use SGML and see how it could be used over the WWW. It is certainly something that publishers should look at very closely as it has all the components required, including the likelihood that solutions will interoperate with Java.

XML is the ideal language for the creation and transmission of database entries. The use of entities means it can manage distributed components, it maps well onto objects and it can manage complex relationships through its linking scheme. Most of the software components are already written.

How would it be used with a browser? Assuming that the bulk of tools are written in Java, we can foresee helper applications or plugins, and perhaps there will be more autonomous tools which are capable of independent action. A general XML tool is an excellent approach to managing legacy documents, compared with writing a specific helper for each type.

I hope that there will be enough tools that XML will provide the same creative and expressive opportunities that HTML has done. However, it's important to realise that freely available software is required and any tools for structured document management, especially in Java, will be extremely welcome.

References

The SGML and XML community has excellent WWW resources and so it is unnecessary to give a large list of pointers. Some key sites are: