Electronic Publishing and Molecular Sciences
Peter Murray-Rust
This is a brief overview of some of my current thoughts on the potential
electronic publishing revolution and how it will affect molecular sciences.
Reasons for electronic publishing
Electronic publishing has many obvious attractions, and I shall
cover those very briefly before moving to the areas where I think that
e-publishing, hyperlinking and Chemical Markup Language (CML) can play a
major role. The well-known aspects include:
- Elimination of paper. Whether or not e-publishing 'saves trees',
there is no doubt that it eliminates some of the costs associated with
paper publishing, though these are not as dramatic as sometimes assumed.
- Convenience for the reader. Many readers, myself included, now
rely largely on the electronic medium to be kept up to date. I now visit
paper-based libraries very infrequently, although conventional monographs
still have their attractions.
- Immediacy. Current awareness tools are now routine, and
knowledge of relevant information sources comes from automatic mailings,
postings to newsgroups, searches of structured and unstructured resources
and often from the authors themselves.
- Freedom to publish. The WWW and its technology now allow anyone
to create documents rapidly and circulate them widely. This conference is
an example, and it is challenging the way in which the value of publications
is traditionally measured.
Changing the nature of publication
The topics above are closely related to conventional publishing, and mainly
relate to making certain operations easier. But e-publications can add new
opportunities which are impossible on paper. This is a very brief overview,
which does not do justice to them.
- Normalisation. It is now possible to have a single key reference
copy of a document, so that everyone can be sure they are referring to the
same version of the same document.
- Extensibility in Extent. A document is not a fixed piece of paper but has
fuzzy boundaries. I explore one aspect of this below ('The document
environment'). Authors can provide implicit or explicit links into resources
which enhance their document without the material having to be reformatted
manually.
- Extensibility in Time. The targets of hyperlinks are often not
static and are continually enhanced. A link to a major data centre or
a learned society will have increased considerably in value over the
last 2-3 years as the target pages have expanded in size and quality.
Information is kept up to date, and the simplest way of making sure your
document does not become obsolete is to link it to key information resources.
Again, the RSC has a key role to play in this.
- Reusability. A good information resource can be used in many
contexts and prevents unnecessary duplication. Markup and hyperlinking are
the key technologies in the effective re-use of information.
- Documents as data. A structured document is easy to parse and
to enhance by addition of semantics. In chemistry there are some exciting
possibilities such as:
- Calculations. With CML it is now simple to extract the molecular
information from a paper, so that it could be automatically submitted as
input to a program. "Run MOPAC calculations on all the molecules in this
paper" could be an instruction to a piece of software.
- Assessment of hazards. "Extract all molecules from this paper
and search the hazard databases".
- Data mining. This is common in analysis of crystallographic
data where many papers now systematically analyse the deposited data. It
could be equally possible for less structured data such as reactions, where
conditions and yields could be extracted routinely from current publications
and analysed.
- Non-traditional publications. Conventional publications represent
experimental results, or opinions, but are ill-suited to reporting the
creation of hyperresources on the WWW. ECTOC and ECHET are best judged in
their own right, rather than through a conventional manuscript written to
describe them.
The same applies to databases, software, virtual meeting places and projects,
which are exciting and valuable but not always regarded as having the same
peer-accreditation as conventional articles.
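The 'Documents as data' item above (extract every molecule from a paper and hand it to a program) can be illustrated with a minimal sketch. It assumes a small XML-like marked-up paper; the element and attribute names (`paper`, `molecule`, `atom`, `elementType`) are illustrative assumptions and should not be read as the actual CML vocabulary. The "job" dictionaries are likewise hypothetical stand-ins for whatever input a calculation or hazard-lookup program would need.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical marked-up paper. Tag and attribute names are assumptions
# made for this sketch, not the real CML definitions.
PAPER = """
<paper>
  <title>Example paper</title>
  <para>We studied water
    <molecule id="m1" title="water">
      <atom elementType="O"/><atom elementType="H"/><atom elementType="H"/>
    </molecule>
    and methane
    <molecule id="m2" title="methane">
      <atom elementType="C"/><atom elementType="H"/><atom elementType="H"/>
      <atom elementType="H"/><atom elementType="H"/>
    </molecule>.
  </para>
</paper>
"""


def extract_molecules(xml_text):
    """Return (id, title, element counts) for every molecule in the paper."""
    root = ET.fromstring(xml_text)
    molecules = []
    for mol in root.iter("molecule"):
        counts = Counter(atom.get("elementType") for atom in mol.iter("atom"))
        molecules.append((mol.get("id"), mol.get("title"), counts))
    return molecules


def formula(counts):
    """Format element counts as a simple formula, elements in alphabetical order."""
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(counts.items()))


def make_jobs(xml_text):
    """Build one job description per molecule, in document order."""
    return [
        {"id": mol_id, "title": title, "formula": formula(counts)}
        for mol_id, title, counts in extract_molecules(xml_text)
    ]


if __name__ == "__main__":
    for job in make_jobs(PAPER):
        print(job)
```

The point of the sketch is only that, once molecules are marked up explicitly, finding them needs no natural-language processing: a generic parser walks the document and the rest is bookkeeping.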
The document environment
Any document is written in a context, with the assumption that the reader can
bring in information to enhance the raw material. A word, a phrase, a diagram
can all be linked to other information, either in the reader's brain or in
traditional information sources such as books, journals, calculating equipment
and so on. With the speed and diversity of today's technical subjects, few
readers now have an adequate information resource to bring to those scientific
papers which may be of interest to them. The phrase 'knowledge environment'
has been used to describe the shells of information surrounding a document,
starting with a small number of highly relevant resources and rapidly becoming
larger and more diffuse. Electronic publishing is essential in providing
this environment, but it is not trivial or cheap to produce. Some ideas
and references are provided by Schatz (see An Introduction to Structured
Documents in this distribution).
An idea of the document environment for a paper is shown:
Here the emphasis is on the value of citations as a way of adding value to
the document and this is the largest area of activity at present. The main
problems are not technical, since citations are very well understood, but
lie in the interoperability of material from different publishing houses.
This will require complex agreements about access to intellectual property
and methods of charging for it before the scientific literature is fully
hyperlinked in this fashion.
Many documents are multidisciplinary as shown in this diagram for a publication
in protein structure:
Providing the very varied sets of links into different disciplines may require
a large amount of expensive human effort. The author has a role to play
by producing careful abstracts and clear keywording, as does the publisher
and the abstracter. But the environment itself is diffuse in nature, consisting
of primary and secondary sources of raw data, refined data, opinions,
educational material and so on. Moreover for each reader a document has a
different environment which may need to contain very elementary resources
such as tutorials or the latest opinions on specialist areas.
I believe that markup can help this process by making it easier to identify
the local context of words and concepts, and their relation to other
components. A paper with several 'Schemes' is more likely to relate to
chemical reactions and one with 'Tables of Results' more likely to
relate to data collection. CML will allow the detailed machine analysis of
these components, making it easier to determine the concepts in the paper
and its context.
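As a rough illustration of the point about 'Schemes' and 'Tables of Results', the sketch below counts structural elements in a marked-up paper and uses them as hints about its subject. The tag names (`scheme`, `resultsTable`) and the mapping from tags to topics are assumptions invented for this example; a real system would need a much richer and better-tested set of cues.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from structural elements to the kind of content
# they suggest. Both the tag names and the topics are assumptions.
HINTS = {
    "scheme": "chemical reactions",
    "resultsTable": "data collection",
}

# Hypothetical paper skeleton with three schemes and one table of results.
SAMPLE = """
<paper>
  <scheme id="s1"/>
  <scheme id="s2"/>
  <scheme id="s3"/>
  <resultsTable id="t1"/>
</paper>
"""


def guess_context(xml_text):
    """Return (topic, count) pairs, most frequent first, ties broken by name."""
    root = ET.fromstring(xml_text)
    scores = {}
    for element in root.iter():
        topic = HINTS.get(element.tag)
        if topic is not None:
            scores[topic] = scores.get(topic, 0) + 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(guess_context(SAMPLE))
```

This is deliberately crude: it shows only that structure which is explicit in the markup can be counted and compared by machine, which is much harder when the same information is buried in page layout.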
One simple and cost-effective way of adding document environment is through
terminology. If precise, unambiguous terms are used in a paper it is
easier to determine the context. For a human reader terminology is often
critical and a whole paper may be impossible to understand if some of the
terms are opaque. Unfortunately terminology is not standardised on the
WWW and many search engines will return a lot of noise when given an
ambiguous term.
With Lesley West and Henry Rzepa's group I have developed the hyperglossary
concept (and ECTOC-1 contained a molecular hyperglossary). For further
details see the Virtual HyperGlossary Home Page (VHG). The idea is shown
below:
Here we assume the existence of discrete, well maintained, terminology
resources for which I believe that learned societies like the RSC have
a key role. In many cases the terms in the document are clear and can be
automatically linked to the appropriate resource. This resource (of which
an example is given in the demonstration) may include non-textual data such
as molecular structure or links to other resources such as databases, discussion
groups or tutorials.
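The automatic linking of clear terms to a terminology resource, as described above, might be sketched as follows. This is not the VHG implementation: the glossary entries, the `glossary.example.org` URLs and the `link_terms` function are all hypothetical, and the input is assumed to be plain text with no existing markup.

```python
import re

# Hypothetical terminology resource: lower-case term -> URL of its entry.
# The URLs are placeholders, not real glossary addresses.
GLOSSARY = {
    "chirality": "https://glossary.example.org/chirality",
    "alpha helix": "https://glossary.example.org/alpha-helix",
}


def link_terms(text, glossary):
    """Wrap each glossary term found in plain text in an HTML link.

    Matching is case-insensitive and on whole words only. Longer terms are
    tried first so that multi-word terms are not split by shorter ones.
    Glossary keys are assumed to be lower case.
    """
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        found = match.group(1)
        return f'<a href="{glossary[found.lower()]}">{found}</a>'

    return pattern.sub(replace, text)


if __name__ == "__main__":
    print(link_terms("The alpha helix shows chirality.", GLOSSARY))
```

Simple string matching of this kind only works when terms are unambiguous, which is exactly why well-maintained, discrete terminology resources matter: an ambiguous term would need context, or human judgement, to be linked correctly.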
However the document environment is managed, electronic publication of
structured documents will have a key role to play.
© Peter Murray-Rust, 1996, 1997