To be published in New Review of Information Networking, December, 1995. This article is also available on-line as http://www.ch.ic.ac.uk/clic/video.html

The Case for Content Integrity in Electronic Chemistry Journals: The CLIC Project.

David James

The Royal Society of Chemistry, Thomas Graham House, Milton Road, Cambridge

Benjamin J. Whitaker and Christopher Hildyard

School of Chemistry, Leeds University, Leeds LS2 9JT.

Henry S. Rzepa and Omer Casher

Department of Chemistry, Imperial College, London, SW7 2AY.

Jonathan M. Goodman and David Riddick

Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge

Peter Murray-Rust

Biomolecular Structure, Glaxo Wellcome, Stevenage, Herts, SG1 2NY, UK

Abstract: There is currently intense debate on the future role of electronic journals in all subject disciplines. The debate, carried primarily by traditional publishers, has focused on subject-independent issues such as the "look and feel" of the electronic product, copyright protection and charging mechanisms. Whilst such issues are certainly important, other issues which are felt to be important by specific user communities have not received similar publicity. Arguing from the perspective of science, and of chemistry in particular, we contend in this article that electronic journals should also be perceived as a new form of scientific instrument, allowing the delivery to the user of manipulable 3D molecular images, instrumental data, symbolic algorithms capable of evaluation locally, and other semantically intact molecular data for re-use locally by the reader. Two specific implementations of this concept, the ECTOC electronic conference and the CLIC electronic journal project, are discussed.

Introduction

One of the foundation stones of modern scientific culture is the peer-reviewed printed journal. In a discipline such as chemistry, with a history of publication going back more than 150 years, scholarly journals have been accepted for some time as the principal method for disseminating new theories and advances in the subject. As a consequence, most academic and industrial professionals in the molecular sciences have come to rely on achieving a substantial portfolio of published work in such journals for peer recognition and career advancement.

During this 150-200 year evolution of the learned printed journal or book in molecular science, a rich and often very subtle and sophisticated notation has evolved to describe molecules and their properties on the printed page. For some time now, an international committee structure (IUPAC, the International Union of Pure and Applied Chemistry) has overseen the systematisation and publication of this nomenclature. When it comes to representing this on the printed page, it has been estimated that some 3000 special typesetting characters are required in molecular science. Amongst the younger generations of scientists, the skills necessary to translate complex molecular concepts into printed notation are increasingly absent. In part at least, this has been brought about in the last ten years by the introduction of a range of "chemical structure" drawing computer programs, which allow a highly visual representation of molecular structures to be constructed, thus by-passing the need to use the more traditional printers' symbols and nomenclature. This trend has been reinforced by the growing tendency of scientists to manipulate the graphic objects that result from computer-based database searches, molecular modelling and other techniques. This in turn has the effect that fewer chemists rely on purely text-based nomenclature to formulate and communicate their ideas.

In the last ten years then, most scientists have acquired access to software tools that have enabled them to generate accurate descriptions of their subject in a variety of digital formats. With few exceptions, the ultimate destination of all these digital formats is, ironically, the printed page of the conventional journal. But consider the reader's point of view. They are faced with an essentially "analogue" medium and if they require quantitative information, they will have to re-key it into a computer, with all the risk of transcription error. OCR (optical character recognition) is another option, but its relatively poor accuracy means it is still little used in this context, especially since chemistry has its own unique character set! Molecular structures present even more of a challenge, since 100% accuracy must really be achieved (there is often little or no internal redundancy in such structures). Similar operations would be needed on symbolic or mathematical equations, and the difficulties become extreme when it comes to recognising instrumental data, which is often presented in highly consolidated or perhaps symbolic form in the journal.

It is as well to remind oneself of the philosophy of the scientific paper, which is not only to introduce new concepts and theories to the reader, but to provide the reader with sufficient information to allow them to reproduce the original research and experiments as appropriate. In part this arose as an "error-correction" mechanism, so that erroneous or faulty concepts and theories could be edited out of the "body of knowledge". In practice, with the widespread adoption of printed journals, most readers were also faced with identifying the inevitable transcription and typesetting errors contained on the printed page. Whilst a refereeing system does exist to catch both the errors of science and the transcription errors, neither referee nor reader has any "tools" to assist them in this process other than their own eyes and minds! The reader has in effect to use their own knowledge of the subject to identify whether the scientific "checksums" are correct. Because of the modern career pressures to publish, one suspects that few papers are ever fully subjected to such a rigorous analysis of either their factual content or the scientific concepts presented. To put it simply, whereas the computer era has enabled most of us to simplify the process of producing a learned article, it has done almost nothing to help us read, understand and apply published materials.
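To make the idea of a scientific "checksum" concrete, the following is a minimal sketch, in Python, of one check a reader could automate if an article carried its data in digital form: recomputing a molar mass from a molecular formula and comparing it with the value quoted in the text. The function names, the tolerance, and the quoted values are illustrative assumptions, and the ethanol example is chosen only because its formula is simple.

```python
import re

# Rounded standard atomic weights; only a handful of elements are listed here.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                 "F": 18.998, "Cl": 35.45}


def molar_mass(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C2H6O' (no brackets)."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"cannot parse formula {formula!r}")
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_WEIGHT:
            raise ValueError(f"no atomic weight listed for {symbol!r}")
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


def is_consistent(formula: str, quoted_mass: float, tolerance: float = 0.05) -> bool:
    """True if the quoted molar mass agrees with the formula within the tolerance."""
    return abs(molar_mass(formula) - quoted_mass) <= tolerance


if __name__ == "__main__":
    # Hypothetical quoted values for ethanol; the second has two digits transposed.
    print(is_consistent("C2H6O", 46.07))
    print(is_consistent("C2H6O", 64.07))
```

A check of this kind catches exactly the sort of transposition error that a referee reading a printed page is unlikely to notice.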

The advent of electronic journals, where in principle at least the content can be made fully digital, allows us finally to explore how the science of reading articles as opposed to producing them can be improved. In this respect, it is unfortunate that most of the discussion of electronic journals has not centred around this requirement, but on rather different, and it has to be said, much more commercial points of view. Thus librarians see the medium as an opportunity to re-vitalise a market where the spiralling costs of many journals appear to be casting doubts on the viability of small and departmental libraries. Publishers are largely focusing on what are termed "parallel" track printed and electronic publications, since this allows them to capitalise on any achieved reputation and quality of an existing journal, and to sell into an established market. Put crudely, the "look and feel" of the journal and its copyright ownership become important, dare we say pre-eminent, considerations. This in turn translates into an emphasis on the physical appearance of the electronic representation of the journal, and the ability of the reader to print pages that closely resemble the printed version, as well as the controlled way in which the journal may be disseminated.

The central purpose of the present article is to put forward the reader's and the scientist's perspective, one which we feel has not hitherto been so actively promoted by publishers or librarians. The scientist would very much like to see the journal move towards being regarded as much more of a scientific instrument, to be used and more importantly integrated into the laboratory as part of the everyday research and teaching activity. Thus information and data should be capable of flowing transparently and of course with complete digital accuracy from say a learned article in a journal into the software, tools and instruments used for basic research. In effect, our aim is to reconcile the reader with at least a proportion of the quantitative information the original authors had at their disposal when they formulated their original theories. In the last several years, the various technologies that have become associated with the World-Wide Web have finally allowed this dream to become a reality.

The Implementation of Chemistry Content in an Electronic Journal

Perhaps an explicit example will illustrate these concepts. The antimalarial drug halofantrin exists in two forms, known as R and S. It also has a complex structure, interacting with itself via a very strange hydrogen bond between a C-H bond on one of the rings and an oxygen of a second molecule. Both these features are characteristic of its "three dimensional" structure, and also impinge strongly on its properties.

Figure 1. A 2D representation of the 3D Molecular Structure of Halofantrin. In the e-journal version of this diagram, "clicking" on this diagram will produce a "rotatable" image.

The "R" and "S" nomenclature derives from a very complex set of nomenclature rules known as the "Cahn-Ingold-Prelog" (CIP) convention. The difference between these two forms can only be understood if the chemical structure of the molecule is considered in three dimensions, and it takes a highly experienced and confident chemist to apply the CIP rules reliably to the laboratory synthesis of a safe pharmaceutical product. In the diagram of halofantrin (Figure 1), we have a two dimensional image taken from one particular perspective view of this molecule. However, in this view, the user cannot rotate or inspect the molecule themselves, and any obscured aspects will be hidden. Modern chemistry is often concerned with systems which might be one hundred times larger and more complex, and the reader may also wish to acquire say detailed toxicology or synthesis data, spectral and instrumental information, accurate three dimensional coordinates for the molecule determined from x-ray crystallography, the structures of a few dozen analogues, details of any enzymes involved in metabolic pathways, mathematical algorithms that describe the molecular properties, and theoretical models which might describe the mechanisms of its behaviour. Most importantly, the reader will also wish to acquire all this information without any risk of transcription errors, in an instantly usable form for further processing by computer.

Traditional journals provide all this information in printed and often highly symbolic form. Cost decrees that only a minimal amount of information actually appears on the printed pages. The rest is often discarded, or at best "deposited" in the form of printed supplementary information, and it can require real determination to reconcile all this information into a form that can then be productively put to use. Thus in the electronic version of this article, the "thumbnail image" (Figure 1) should not simply be a static two dimensional diagram which still requires substantial (and possibly error prone) interpretation by the reader of the paper, but in fact a starting point for the reader's own exploration of the content. By hyperlinking this image to a set of three dimensional information about the molecule, the reader can easily acquire a full set of molecular data in digital form which is theirs to process as they wish. We have coined the expression "hyperactive molecules" [1] to express this concept. In this case, it is achieved by using the following HTML markup:
<A HREF="http://www.ch.ic.ac.uk/clic/halo.pdb"><IMG SRC="http://www.ch.ic.ac.uk/clic/halo.gif"></A>

Implicit in this syntax is another concept we have introduced, that of chemical MIME. [2] This is a chemical implementation of a standard mechanism for allowing the reader to associate files with assumed chemical content (in this case halo.pdb) with a "viewer" of their choice that is capable of processing this content, in this case to produce a rotatable image of the molecule on the screen. In the 18 months since we wrote our first "Internet draft" proposing this standard, the concept has come to be widely accepted throughout the molecular community, and we anticipate that it will be extensively used in chemistry electronic journals.
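The association between a file and a viewer can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of any particular browser: the MIME type chemical/x-pdb is the type commonly used for PDB coordinate files, the type is inferred here from the file extension (a real browser would normally take it from the header sent by the server), and the names VIEWERS, parse_pdb and open_chemical are hypothetical. The "viewer" in this sketch simply extracts the atomic coordinates, which is the step that spares the reader any re-keying.

```python
import mimetypes
from typing import Callable, Dict, List, Tuple

Atom = Tuple[str, float, float, float]  # (atom name, x, y, z), coordinates in angstroms

# Register the chemical MIME type for .pdb files with the standard-library table.
mimetypes.add_type("chemical/x-pdb", ".pdb")


def parse_pdb(text: str) -> List[Atom]:
    """Read ATOM/HETATM records using the fixed PDB column layout."""
    atoms: List[Atom] = []
    for line in text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            name = line[12:16].strip()  # columns 13-16: atom name
            x = float(line[30:38])      # columns 31-38: x coordinate
            y = float(line[38:46])      # columns 39-46: y coordinate
            z = float(line[46:54])      # columns 47-54: z coordinate
            atoms.append((name, x, y, z))
    return atoms


# Hypothetical viewer table: MIME type -> function that processes the content.
VIEWERS: Dict[str, Callable[[str], List[Atom]]] = {"chemical/x-pdb": parse_pdb}


def open_chemical(url: str, body: str) -> List[Atom]:
    """Choose a viewer from the MIME type implied by the URL, then apply it."""
    mime_type, _encoding = mimetypes.guess_type(url)
    viewer = VIEWERS.get(mime_type or "")
    if viewer is None:
        raise ValueError(f"no viewer registered for {mime_type!r}")
    return viewer(body)
```

Calling open_chemical with the URL of halo.pdb and the downloaded text would return the coordinates as numbers ready for further processing; a file of an unregistered type raises an error instead of being silently misread.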

Another of our research projects currently involves exploring metaphors for creating 3D objects (scenes as they are referred to), using authoring environments such as the Virtual Reality Modeling Language (VRML) [3]. Here again, one has the ability to directly associate visual images with the quantitative data and definitions behind them. In the case of halofantrin for example, one could hyperlink the carbon atom involved in the so called "chiral centre" which gives rise to the R/S symbolism, with say a remotely held glossary of information describing the rules governing the Cahn-Ingold-Prelog nomenclature. The addition of the secure Java programming language [4] to the VRML object descriptor provides a powerful and flexible authoring environment which we anticipate will allow the creation of many new and innovative applications of electronic journals that far surpass what is currently possible with printed journals. We feel that it is only by the introduction of such tools that the user community will come to fully accept electronic journals as a valuable new resource.
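As a rough illustration of hyperlinking an individual atom, the Python sketch below writes a small VRML 1.0 scene in which each atom is a sphere and one atom is wrapped in a WWWAnchor node pointing at a glossary page. The coordinates, radii and glossary URL are made up for the example, and the function names are our own; the sketch is not taken from the hyperglossary described in ref. [3].

```python
from typing import Iterable, Optional, Tuple

# (x, y, z, sphere radius, optional URL to follow when the atom is selected)
AtomSpec = Tuple[float, float, float, float, Optional[str]]


def atom_node(x: float, y: float, z: float, radius: float,
              url: Optional[str] = None) -> str:
    """Return VRML 1.0 text for one sphere, optionally wrapped in a hyperlink."""
    shape = (
        "Separator {\n"
        f"  Translation {{ translation {x:.3f} {y:.3f} {z:.3f} }}\n"
        f"  Sphere {{ radius {radius:.2f} }}\n"
        "}\n"
    )
    if url is None:
        return shape
    # Indent the shape so that it reads as a child of the anchor node.
    indented = "".join("  " + line + "\n" for line in shape.splitlines())
    return f'WWWAnchor {{\n  name "{url}"\n{indented}}}\n'


def vrml_scene(atoms: Iterable[AtomSpec]) -> str:
    """Assemble a complete VRML 1.0 file from a list of atom specifications."""
    body = "".join(atom_node(*atom) for atom in atoms)
    return "#VRML V1.0 ascii\nSeparator {\n" + body + "}\n"


if __name__ == "__main__":
    # Hypothetical two-atom scene; the first atom links to a made-up glossary page.
    print(vrml_scene([
        (0.0, 0.0, 0.0, 0.40, "http://example.org/glossary/cip.html"),
        (1.5, 0.0, 0.0, 0.30, None),
    ]))
```

Each atom sits in its own Separator so that its translation does not affect the atoms that follow it in the file.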

"Chemistry Markup Language" or CML [5] is another project which explores the ethos of providing full chemistry content to the reader of an electronic journal. This is a semantically rich SGML based language applicable to all areas of chemistry and molecular biology. It is designed for accurate and facile interchange and deposition of information, with the following characteristic features;

Springs from the collaborative ethos of the Internet
Developed through working prototypes
Simple: "human-readable and human-hackable"
Created in a distributed, not a centralised, manner
Extensible; but through glossaries, not syntax
Customisable through style sheets
Data validation and transformation through glossary-based code
Based on strict standards (SGML and HTML 2.0).
Uses existing commercial and public tools
Independent of platform, manufacturer and application.
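The principle of extending through glossaries rather than syntax, and of validating data with glossary-based code, can be illustrated with a short Python sketch. It does not reproduce CML markup or any CML tool: the glossary terms, units and ranges below are hypothetical, and the point is only that a new kind of quantity is supported by adding a glossary entry while the validating code stays unchanged.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class GlossaryEntry:
    unit: str
    minimum: float
    maximum: float


# Hypothetical glossary; the terms and plausible ranges are for illustration only.
GLOSSARY: Dict[str, GlossaryEntry] = {
    "bond_length": GlossaryEntry("angstrom", 0.5, 3.5),
    "bond_angle": GlossaryEntry("degree", 0.0, 180.0),
}


def validate(items: Iterable[Tuple[str, float]],
             glossary: Dict[str, GlossaryEntry]) -> List[str]:
    """Return a list of problems; an empty list means every item passed."""
    problems: List[str] = []
    for term, value in items:
        entry = glossary.get(term)
        if entry is None:
            problems.append(f"{term}: not defined in the glossary")
        elif not entry.minimum <= value <= entry.maximum:
            problems.append(
                f"{term}: {value} {entry.unit} lies outside "
                f"{entry.minimum} to {entry.maximum}"
            )
    return problems


if __name__ == "__main__":
    # The second item has a misplaced decimal point, which the glossary range catches.
    for problem in validate([("bond_angle", 109.5), ("bond_length", 15.4)], GLOSSARY):
        print(problem)
```

Adding, say, a torsion angle would mean adding one more entry to the glossary dictionary; the validate function itself would not need to change.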

A CML consortium is being formed to support the development of the language and tools. It aims to take CML through to release 1.0 (the first stable, public release) by building prototype applications for general evaluation. For further details, consult ref. [5].

The concepts outlined above lead to an interesting potential conflict with one of the basic principles of conventional publishing, namely clear identification of copyright ownership. What we are advocating is that the reader of any learned article is actively encouraged to acquire digital and hence exact copies of information and data associated with any particular article. Potentially, any one article might be associated with many different sources of information, some of which may reside with the publishers of the article, others of which may refer back to the original authors of the article, whilst others may point to third parties or commercial sources of information. The entire ethos of our argument is that the reader of the article becomes enabled to acquire this information easily and quickly. However, the nightmare scenario is that they are faced with negotiating a "transaction" with many different owners of this information. If for example, the reader of an article were to be faced with a demand for payment in exchange for each set of molecular coordinates or other information they wished to download, then the entire system would not function in the manner intended. If the scientific community sets out on such a road, then clear and workable guidelines will need to be formulated. Whether existing copyright law is up to this task is debatable!

Summary

This article therefore is a plea to "re-invent" the journal in a way which serves the reader of the journal as well as the publishers and authors. Ultimately, because such articles will come to contain a rich variety of digital and most importantly easily verifiable information, we would argue that the quality of published information is bound to increase. It certainly appears that few assertions in the bulk of the existing published literature are ever checked for internal or numerical consistency, either by referees or by readers. In the CLIC electronic journal project [6], we hope to implement active molecular information in association with articles. Initially, we will focus on crystallographic coordinates, with the intention of adding instrumental data, algorithms and other forms of information later. We know from an initial pilot project, which took the form of the ECTOC electronic conference [7], that a highly positive user response is achievable, and that technically, no significant problems exist. Around 80 "manuscripts" from 13 countries were received, and the editorial handling was accomplished entirely by the two editors, with no additional resources other than a refereeing panel. During a four week period, some 2000 separate computer systems connected to the electronic material, and the most popular papers attracted well over 500 requests. Over the next two years, we will have an exciting opportunity to establish if the themes outlined above do indeed strike a chord in the chemist's heart via the proliferation of electronic journals implementing "hyperactive chemistry".

References and Notes.

[1] O. Casher, G. Chandramohan, M. Hargreaves, C. Leach, P. Murray-Rust, R. Sayle, H. S. Rzepa and B. J. Whitaker, "Hyperactive Molecules and the World-Wide-Web Information System", J. Chem. Soc., Perkin Trans 2, 1995, 7. See also the URL http://www.ch.ic.ac.uk/rzepa/RSC/P2/4_05970K.html. See also H. S. Rzepa, B. J. Whitaker and M. J. Winter, Chemical Communications, 1994, 1907.
[2] H. S. Rzepa, P. Murray-Rust and B. J. Whitaker, IETF Internet Draft April-October, 1995. See the URL: http://www.ch.ic.ac.uk/chemime2.html
[3] O. Casher, C. Leach, C. S. Page and H. S. Rzepa, "Advanced VRML Based Chemistry Applications: A 3D Molecular Hyperglossary", Presented at the 2nd Electronic Conference on Computational Chemistry, November 1995. See also the URL http://www.ch.ic.ac.uk/eccc2/
[4] For an example of how Java might be used within an electronic journal, connect to the URL: http://www.ch.ic.ac.uk/clic/pop.html For further information about Java and VRML, see the URL: http://www.SGI.com/Products/cosmo/sgisun.html
[5] P. Murray-Rust, H. S. Rzepa and collaborators. For further details, see http://www.dl.ac.uk/CBMT/cml/cml06f/
[6] J. M. Goodman, D. James, H. S. Rzepa and B. J. Whitaker: The CLIC Project. For further details, contact the project director at rzepa@ic.ac.uk or connect to one of the following URLs: http://www-clic.ch.cam.ac.uk/CLIC/; http://chemistry.rsc.org/rsc/clic.htm; http://www.ch.ic.ac.uk/clic/; http://www.chem.leeds.ac.uk/CLIC/leeds.html
[7] Electronic Conference on Trends in Organic Chemistry (Editors H. S. Rzepa and J. M. Goodman), to be published on CD-ROM, 1996. See also the URL http://www.ch.ic.ac.uk/ectoc/ . See also the following conference in the series: ECHET-96 at the URL: http://www.ch.ic.ac.uk/ectoc/echet96/