Electronic Publishing and Molecular Sciences

Peter Murray-Rust

This is a brief overview of some of my current thoughts on the potential electronic publishing revolution and how it will affect molecular sciences.

Reasons for electronic publishing

Electronic publishing has many obvious attractions, and I shall cover those very briefly before moving to the areas where I think that e-publishing, hyperlinking and CML can play a major role. The well-known aspects include:

Changing the nature of publication

The topics above are closely related to conventional publishing, and mainly relate to making certain operations easier. But e-publications can add new opportunities which are impossible on paper. This is a very brief overview, which does not do justice to them.

The document environment

Any document is written in a context, with the assumption that the reader can bring in information to enhance the raw material. A word, a phrase, a diagram can all be linked to other information, either in the reader's brain or in traditional information sources such as books, journals, calculating equipment and so on. With the speed and diversity of today's technical subjects, few readers now have an adequate information resource to bring to those scientific papers which may be of interest to them. The phrase 'knowledge environment' has been used to describe the shells of information surrounding a document, starting with a small number of highly relevant resources and rapidly becoming larger and more diffuse. Electronic publishing is essential in providing this environment, but it is not trivial or cheap to produce. Some ideas and references are provided by Schatz (see An Introduction to Structured Documents in this distribution).

An idea of the document environment for a paper is shown:

Here the emphasis is on citations as a way of adding value to the document, and this is the largest area of activity at present. The main problems are not technical, since citations are very well understood, but lie in the interoperability of material from different publishing houses. Before the scientific literature can be fully hyperlinked in this fashion, complex agreements will be needed about access to intellectual property and methods of charging for it.

Many documents are multidisciplinary as shown in this diagram for a publication in protein structure:

Providing the very varied sets of links into different disciplines may require a large amount of expensive human effort. The author has a role to play by producing careful abstracts and clear keywording, as do the publisher and the abstracter. But the environment itself is diffuse in nature, consisting of primary and secondary sources of raw data, refined data, opinions, educational material and so on. Moreover, each reader needs a different environment for the same document, which may have to contain very elementary resources such as tutorials, or the latest opinions on specialist areas.

I believe that markup can help this process by making it easier to identify the local context of words and concepts, and their relation to other components. A paper with several 'Schemes' is more likely to relate to chemical reactions and one with 'Tables of Results' more likely to relate to data collection. CML will allow the detailed machine analysis of these components, making it easier to determine the concepts in the paper and its context.
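As a rough illustration of how markup makes this kind of analysis mechanical, the sketch below counts structural elements in a small marked-up paper and makes a crude guess at its focus. The element names (`paper`, `scheme`, `table`), the sample document and the function `guess_context` are all invented for this example; they are not CML or any published DTD, and a real analysis would need to be far more careful.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: these element names are invented for this sketch
# and are not CML or any published DTD.
PAPER = """
<paper>
  <title>An example paper</title>
  <scheme id="s1"/>
  <scheme id="s2"/>
  <scheme id="s3"/>
  <table id="t1"/>
</paper>
"""

def guess_context(xml_text):
    """Count structural elements and return a crude guess at the paper's focus."""
    root = ET.fromstring(xml_text)
    # Tally every element in the document by its tag name.
    counts = Counter(element.tag for element in root.iter())
    if counts["scheme"] > counts["table"]:
        return "chemical reactions"
    if counts["table"] > counts["scheme"]:
        return "data collection"
    return "undetermined"

print(guess_context(PAPER))
```

The point is only that, once the components are explicitly tagged, a program can find and count them without having to interpret the layout of the page.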

One simple and cost-effective way of adding document environment is through terminology. If precise, unambiguous terms are used in a paper, it is easier to determine the context. For a human reader terminology is often critical, and a whole paper may be impossible to understand if some of the terms are opaque. Unfortunately, terminology is not standardised on the WWW, and many search engines will return a lot of noise when given an ambiguous term.

With Lesley West and Henry Rzepa's group I have developed the hyperglossary concept (and ECTOC-1 contained a molecular hyperglossary). For further details see the Virtual HyperGlossary Home Page (VHG). The idea is shown below:

Here we assume the existence of discrete, well-maintained terminology resources, in whose upkeep I believe that learned societies like the RSC have a key role. In many cases the terms in the document are clear and can be automatically linked to the appropriate resource. This resource (of which an example is given in the demonstration) may include non-textual data such as molecular structure, or links to other resources such as databases, discussion groups or tutorials.
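The automatic linking of terms can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the VHG implementation: the glossary entries, the `example.org` URLs and the function `link_terms` are all hypothetical, the glossary keys are assumed to be lower case, and the output is assumed to be HTML anchors.

```python
import re

# Hypothetical glossary mapping lower-case terms to the URL of their entry.
# Terms and URLs are invented for illustration; they are not taken from the VHG.
GLOSSARY = {
    "chirality": "https://example.org/glossary/chirality",
    "ligand": "https://example.org/glossary/ligand",
    "active site": "https://example.org/glossary/active-site",
}

def link_terms(text, glossary):
    """Wrap each whole-word glossary term found in text in an HTML anchor."""
    if not glossary:
        return text
    # Try longer terms first so that multi-word terms are matched whole.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        # Look the term up in lower case, but keep the author's own casing
        # in the visible link text.
        url = glossary[match.group(1).lower()]
        return '<a href="{}">{}</a>'.format(url, match.group(1))

    return pattern.sub(replace, text)

print(link_terms("The ligand binds at the Active Site.", GLOSSARY))
```

Matching on whole words only means that a term such as "ligand" is not linked inside "ligands". A real system would also need to handle plurals, synonyms and ambiguous terms, which is where a well-maintained terminology resource earns its keep.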

However the document environment is managed, electronic publication of structured documents will have a key role to play.


© Peter Murray-Rust, 1996, 1997