Electronic Publishing and Molecular Sciences

Peter Murray-Rust

This is a brief overview of some of my current thoughts on the potential electronic publishing revolution and how it will affect molecular sciences.

Reasons for electronic publishing

There are many obvious attractions of electronic publishing and I shall cover those very briefly before moving to the areas where I think that e-publishing, hyperlinking and CML can play a major role. The well-known aspects include:

Elimination of paper. Whether or not e-publishing 'saves trees', there is no doubt that it eliminates some of the costs associated with paper publishing, though these are not as dramatic as sometimes assumed.
Convenience for the reader. Many readers, myself included, now rely largely on the electronic medium to be kept up to date. I now visit paper-based libraries very infrequently, although conventional monographs still have their attractions.
Immediacy. Current awareness tools are now routine, and knowledge of relevant information sources comes from automatic mailings, postings to newsgroups, searches of structured and unstructured resources and often from the authors themselves.
Freedom to publish. The WWW and its technology now allow anyone to create documents rapidly and circulate them widely. This conference is an example, and it is challenging the way in which the value of publications is traditionally measured.

Changing the nature of publication

The topics above are closely related to conventional publishing, and mainly relate to making certain operations easier. But e-publications can add new opportunities which are impossible on paper. This is a very brief overview, which does not do justice to them.

Normalisation. It is now possible to have a single key reference copy of a document, so that everyone can be sure they are referring to the same version of the same document.
Extensibility in Extent. A document is not a fixed piece of paper but has fuzzy boundaries. I explore one aspect of this below ('The document environment'). Authors can provide implicit or explicit links into resources which enhance their document without the material having to be reformatted manually.
Extensibility in Time. The targets of hyperlinks are often not static and are continually enhanced. A link to a major data centre or a learned society will have increased considerably in value over the last 2-3 years as the target pages have expanded in size and quality. Information is kept up to date, and the simplest way of making sure your document does not become obsolete is to link it to key information resources. Again, the RSC has a key role to play in this.
Reusability. A good information resource can be used in many contexts and prevents unnecessary duplication. Markup and hyperlinking are the key technologies in the effective re-use of information.
Documents as data. A structured document is easy to parse and to enhance by addition of semantics. In chemistry there are some exciting possibilities such as:
- Calculations. With CML it is now simple to extract the molecular information from a paper, so that it could be automatically submitted as input to a program. "Run MOPAC calculations on all the molecules in this paper" could be an instruction to a piece of software.
- Assessment of hazards. "Extract all molecules from this paper and search the hazard databases".
- Data mining. This is common in analysis of crystallographic data where many papers now systematically analyse the deposited data. It could be equally possible for less structured data such as reactions, where conditions and yields could be extracted routinely from current publications and analysed.
Non-traditional publications. Conventional publications represent experimental results, or opinions, but are ill-suited to reporting the creation of hyperresources on the WWW. ECTOC and ECHET are best judged on their own terms rather than through a conventional manuscript written to describe them. The same applies to databases, software, virtual meeting places and projects, which are exciting and valuable but are not always given the same peer-accreditation as conventional articles.
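
The 'Documents as data' item above can be illustrated with a minimal sketch. It assumes a paper whose molecules are marked up in a simplified, CML-like form; the element and attribute names (molecule, atomArray, atom, elementType), the sample paper and the helper functions are illustrative assumptions, not taken from any particular CML release. The sketch pulls every molecule out of the document and derives a formula for each, which is the kind of list that could then be handed to a calculation program or a hazard database search.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical paper with molecules in simplified, CML-like markup.
# The element and attribute names are assumptions for illustration.
PAPER = """
<paper>
  <title>Example paper</title>
  <molecule id="m1" title="water">
    <atomArray>
      <atom id="a1" elementType="O"/>
      <atom id="a2" elementType="H"/>
      <atom id="a3" elementType="H"/>
    </atomArray>
  </molecule>
  <para>Some discussion text between the molecules.</para>
  <molecule id="m2" title="methane">
    <atomArray>
      <atom id="a1" elementType="C"/>
      <atom id="a2" elementType="H"/>
      <atom id="a3" elementType="H"/>
      <atom id="a4" elementType="H"/>
      <atom id="a5" elementType="H"/>
    </atomArray>
  </molecule>
</paper>
"""


def hill_formula(counts):
    """Build a formula string in Hill order from element counts."""
    symbols = sorted(counts)
    if "C" in counts:
        # Carbon first, then hydrogen, then the rest alphabetically.
        rest = [s for s in symbols if s not in ("C", "H")]
        ordered = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: purely alphabetical.
        ordered = symbols
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in ordered)


def extract_molecules(xml_text):
    """Return id, title and formula for every molecule in the document."""
    root = ET.fromstring(xml_text.strip())
    molecules = []
    for mol in root.iter("molecule"):
        counts = Counter(atom.get("elementType") for atom in mol.iter("atom"))
        molecules.append({
            "id": mol.get("id"),
            "title": mol.get("title"),
            "formula": hill_formula(counts),
        })
    return molecules


if __name__ == "__main__":
    for entry in extract_molecules(PAPER):
        print(entry["id"], entry["title"], entry["formula"])
```

The point of the sketch is that the surrounding prose is simply skipped: because the molecules are marked up, no text analysis is needed to find them.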

The document environment

Any document is written in a context, with the assumption that the reader can bring in information to enhance the raw material. A word, a phrase, a diagram can all be linked to other information, either in the reader's brain or in traditional information sources such as books, journals, calculating equipment and so on. With the speed and diversity of today's technical subjects, few readers now have an adequate information resource to bring to those scientific papers which may be of interest to them. The phrase 'knowledge environment' has been used to describe the shells of information surrounding a document, starting with a small number of highly relevant resources and rapidly becoming larger and more diffuse. Electronic publishing is essential in providing this environment, but it is not trivial or cheap to produce. Some ideas and references are provided by Schatz (see An Introduction to Structured Documents in this distribution).

An idea of the document environment for a paper is shown:

Here the emphasis is on the value of citations as a way of adding value to the document, and this is the largest area of activity at present. The main problems are not technical, since citations are very well understood, but lie in the interoperability of material from different publishing houses. This will require complex agreements about access to intellectual property and methods of charging for it before the scientific literature is fully hyperlinked in this fashion.

Many documents are multidisciplinary as shown in this diagram for a publication in protein structure:

Providing the very varied sets of links into different disciplines may require a large amount of expensive human effort. The author has a role to play by producing careful abstracts and clear keywording, as do the publisher and the abstracter. But the environment itself is diffuse in nature, consisting of primary and secondary sources of raw data, refined data, opinions, educational material and so on. Moreover, for each reader a document has a different environment, which may need to contain very elementary resources such as tutorials or the latest opinions on specialist areas.

I believe that markup can help this process by making it easier to identify the local context of words and concepts, and their relation to other components. A paper with several 'Schemes' is more likely to relate to chemical reactions and one with 'Tables of Results' more likely to relate to data collection. CML will allow the detailed machine analysis of these components, making it easier to determine the concepts in the paper and its context.

One simple and cost-effective way of adding document environment is through terminology. If precise, unambiguous terms are used in a paper it is easier to determine the context. For a human reader terminology is often critical, and a whole paper may be impossible to understand if some of the terms are opaque. Unfortunately terminology is not standardised on the WWW and many search engines will return a lot of noise when given an ambiguous term.

With Lesley West and Henry Rzepa's group I have developed the hyperglossary concept (and ECTOC-1 contained a molecular hyperglossary). For further details see the Virtual HyperGlossary Home Page (VHG). The idea is shown below:

Here we assume the existence of discrete, well maintained terminology resources, in whose provision I believe that learned societies like the RSC have a key role. In many cases the terms in the document are clear and can be automatically linked to the appropriate resource. This resource (of which an example is given in the demonstration) may include non-textual data such as molecular structure or links to other resources such as databases, discussion groups or tutorials.
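
The automatic linking of clear terms described above might look roughly like the following sketch. It is not the Virtual HyperGlossary implementation: the glossary entries, the example.org URLs and the function name link_terms are all hypothetical, and the matching is deliberately naive (whole-word, case-insensitive, longest term first).

```python
import re

# Hypothetical glossary mapping terms to the URL of their entry.
# The example.org addresses are placeholders, not real resources.
GLOSSARY = {
    "hydrogen": "https://example.org/glossary/hydrogen",
    "hydrogen bond": "https://example.org/glossary/hydrogen-bond",
    "ligand": "https://example.org/glossary/ligand",
}


def link_terms(text, glossary):
    """Wrap each glossary term found in text in an HTML link to its entry."""
    if not glossary:
        return text
    # Try longer terms first so "hydrogen bond" wins over "hydrogen".
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): url for term, url in glossary.items()}

    def replace(match):
        word = match.group(0)
        # Keep the author's original capitalisation inside the link.
        return '<a href="{}">{}</a>'.format(lookup[word.lower()], word)

    return pattern.sub(replace, text)


if __name__ == "__main__":
    print(link_terms("The ligand forms a hydrogen bond.", GLOSSARY))
```

A sketch like this only covers the unambiguous case. Terms with several meanings would still need the context that markup or a human editor can supply.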

However the document environment is managed, electronic publication of structured documents will have a key role to play.
