Correspondence and Proofs to:

Dr H. S. Rzepa,

Department of Chemistry,

Imperial College, London,

SW7 2AY, UK.

The World-Wide Web Information System: Chemical Chaos or Global Scientific Enabler?

Henry S. Rzepa[1]

Department of Chemistry, Imperial College of Science Technology and Medicine, London, SW7 2AY, U.K. E-mail: rzepa@ic.ac.uk.

Summary: A proposed Internet standard based on primary chemical MIME types, used in conjunction with the World-Wide Web information delivery system, allows molecular structural and spectroscopic information to be transparently integrated into scientific publications, providing unparalleled access and control for the user. Examples of "hyperactive molecules" from the areas of organic synthesis, quantitative structural modelling, molecular dynamics, crystallography, NMR and mass spectroscopy are discussed. The implications of these mechanisms for scientific publication in general, their impact on the quality and reproducibility of published experimental data, and the enhancement of serendipitous discoveries are also considered.

Introduction. It is remarkable how little integration currently exists between the three main physical depositories of chemical and molecular information. Most scientists store information locally in non-digital laboratory notebooks, from where it may eventually be consolidated into a variety of word processors, spreadsheets and databases on a personal computer. This in due course may be made available as part of a local area network, where use may be made of a departmental or institutional server running a variety of in-house database systems. In most university chemistry departments, this last layer is quite likely to be entirely missing, if only because of cost.

After a period of months or, more likely, years, a small proportion of this heterogeneous blend of local information will be encapsulated by one or more individuals into a presumed high quality scientific "paper", destined after appropriate peer review to be published in one of the large number of primary scientific journals. At this point, the information will probably have lost any digital origins it may have had, since almost all journals still appear exclusively on paper. The data is now completely divorced from its beginnings in the original instrument or laboratory notebook. After another lapse of a few months, the information may be condensed into secondary form by e.g. the Chemical Abstracts organisation, review writers, specialist database producers or authors of textbooks. At each stage of this process, elaborate and expensive quality control mechanisms are needed to filter out errors, although inevitably a measure of error will still survive.

Is it possible to envisage any mechanism by which, say, original data recorded on an instrument can survive digitally intact all the way to the final stages of the scientific publication process? More importantly, can the data also survive the variety of programs, storage and delivery methods it is likely to experience, and remain easily accessible and searchable? I hope to show in this paper that at least one such mechanism is already in place. If, however, it is to achieve widespread acceptance, a number of important issues and problems will have to be discussed and addressed by the scientific community.

The recent development of the World-Wide Web,[2] a communications system based on the global Internet network, offers one solution to the problem of chemical information delivery.[3] This mechanism also addresses the mismatch between conventional two dimensional printed pages and the three dimensional properties of most molecules. In this paper, the implementation of what we have called the "hyperactive molecule concept"[4] within the World-Wide Web infrastructure is described, and its use is discussed with examples drawn from different areas of synthetic, structural, spectroscopic and computational chemistry.

Chemical MIME Types.

The World-Wide Web operates two protocols known as HTTP (the transport protocol) and HTML (the formatting or style protocol),[5] within the overall Internet domain. Together, these define how data will be transferred between two computers operating in a "client-server" relationship, and how it will appear visually on the client computer. Computer programs that implement the client side of these two protocols are known as World-Wide Web Browsers, whilst separate programs perform the server operations.

A third protocol known as MIME (Multipurpose Internet Mail Extensions)[6] serves to label the data according to its content. Currently, seven primary MIME types have been defined:

text, image, audio, video, application, message, multipart

A MIME type of text/richtext would indicate that a text (as opposed to a binary) document should be treated locally as comprising rich-text formatting instructions, and would require formatting accordingly. World-Wide Web Browsers themselves use the text/html MIME type. Whilst these primary MIME types allow a rich variety of multimedia information to be displayed locally, they do not directly address the labelling requirement for chemical information. In recognition of this, we have proposed[7] that an eighth primary MIME type, chemical, be defined, illustrated by a few of the possible examples:

chemical/pdb, chemical/smiles, chemical/molfile, chemical/cif, chemical/mopac, chemical/gaussian

It is of course essential that these types are capable of explicit and exact definition according to a published standard. Whilst one could easily conceive of a great number of potential secondary chemical MIME types, in the first instance it will be necessary to limit them to a smaller selection of well defined and commonly used ones. Indeed, we hope that our proposal will serve to focus minds on the need to standardise the way in which chemical data is defined, moving away from entirely proprietary formats towards open and extensible formats which serve the community as a whole.
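The relationship between a primary type and its secondary types can be made concrete with a short sketch. It uses only the Python standard library to split a label such as chemical/pdb into its two parts and to check whether the primary part is one of the seven standard types. The function names are illustrative assumptions and form no part of the proposal itself.

```python
from email.message import Message

# The seven primary types defined by the MIME standard.
STANDARD_PRIMARY_TYPES = {
    "text", "image", "audio", "video", "application", "message", "multipart",
}


def split_mime_type(label):
    """Return (primary, secondary) for a label such as 'chemical/pdb'."""
    msg = Message()
    msg["Content-Type"] = label
    return msg.get_content_maintype(), msg.get_content_subtype()


def is_standard_primary(label):
    """True if the primary part of the label is one of the seven standard types."""
    primary, _secondary = split_mime_type(label)
    return primary in STANDARD_PRIMARY_TYPES


if __name__ == "__main__":
    for label in ("text/html", "chemical/pdb", "chemical/smiles"):
        print(label, split_mime_type(label), is_standard_primary(label))
```

A label such as chemical/pdb parses without difficulty; it is simply not recognised as belonging to a standard primary type, which is the gap the proposal sets out to fill.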

Uniform Resource Locators (URLs).

With a formal definition of the chemical content of a document in place, we need to address its location. The HTML formatting standard allows the insertion of a so-called Uniform Resource Locator (URL)[8] into the formatting string. An example will serve to illustrate how this would operate. Imagine a line of text in a screen display:

If you click here, some molecular coordinates in PDB format will be transferred to your local machine and displayed using the program RasMol.

Using HTML as the formatting language, this string would be encoded as:

If you <A HREF="http://www.ch.ic.ac.uk/atp.pdb">click here</A>, some molecular coordinates in PDB format will be transferred to your local machine and displayed using the program RasMol.

It is the task of the World-Wide Web Browser to interpret this string on the user's computer screen, in much the same way that e.g. Microsoft Word would display a file in Rich Text Format. The <A...> text </A> tag is known as an anchor, and serves to define a hypertext link (or hyperlink) within the document. The string beginning http is known as the URL, and serves to define the communication protocol to be used, a unique name for the remote Internet based computer where the data is held (www.ch.ic.ac.uk), and the name of the remote file stored there (atp.pdb).
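The three components of the URL described above can be picked apart programmatically. The following is a minimal sketch using the Python standard library; the function name and the keys of the returned dictionary are illustrative assumptions.

```python
import posixpath
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the protocol, the remote host and the remote file name."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "file": posixpath.basename(parts.path),
    }


if __name__ == "__main__":
    print(describe_url("http://www.ch.ic.ac.uk/atp.pdb"))
```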

There is still one problem to be solved, and that is how to persuade e.g. the RasMol program[9] to process the data. This is a chemically specific process of which existing World-Wide Web Browsers are currently incapable. Our solution was to use a mechanism on the World-Wide Web server which associates appropriate filename extensions (e.g. .pdb for the Brookhaven protein database format) with an equivalent MIME type (e.g. chemical/pdb), together with a client configuration which associates chemical/pdb with the need to invoke e.g. RasMol. By this means, a collection of well-defined chemical data stored on a remote computer running a World-Wide Web server can be displayed in fully-functional active form on another computer running a World-Wide Web Browser. Most importantly, the mechanism for activating the "hyperactive" molecules can be embedded fully within the context of the discussion of the science, and not as an arcane mechanism requiring the overt use of file transfer programs such as the "user-unfriendly" file transfer protocol or FTP.
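The two associations just described, extension to MIME type on the server and MIME type to helper program on the client, might be sketched as follows. This is an illustration under stated assumptions rather than the configuration of any particular server or Browser: the .cif extension and the helper command name "rasmol" are hypothetical choices, and the helper is only looked up, never launched.

```python
import mimetypes

# Server side: associate filename extensions with chemical MIME types.
# Only the .pdb pairing is taken from the text; .cif is an assumed example.
server_types = mimetypes.MimeTypes()
server_types.add_type("chemical/pdb", ".pdb")
server_types.add_type("chemical/cif", ".cif")

# Client side: associate MIME types with helper programs.
# The command name "rasmol" is a hypothetical local setting.
CLIENT_HELPERS = {
    "chemical/pdb": "rasmol",
}


def mime_type_for(filename):
    """Return the MIME type the server would attach to a file, or None."""
    mime_type, _encoding = server_types.guess_type(filename)
    return mime_type


def helper_for(mime_type):
    """Return the helper program the client would invoke, or None."""
    return CLIENT_HELPERS.get(mime_type)


if __name__ == "__main__":
    label = mime_type_for("atp.pdb")
    print(label, helper_for(label))
```

Keeping the two tables separate mirrors the division of labour in the text: the server only labels the data, and each user decides locally which program should act on that label.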

Chemical Structure Markup Language.

Our experiments with "hyperactive"[4] molecules convinced us that a further "value-added" layer needs to be added to this chemical information delivery mechanism. Just as a conventional collection of text benefits from being "marked-up" using e.g. HTML, chemical information comprises a rich syntax of atomic, molecular, and functional group information which itself requires navigation aids. Whereas text functions purely in two dimensions, molecular structures often require three dimensions, and molecular potential energy surfaces may require more than three dimensions for their representation. We have therefore introduced the idea of a standard way of marking-up chemical structures, via a CSML or chemical structure markup language. Our first implementation of this concept involved the insertion of hyperlinks into two dimensional protein NMR spectra.[4] Each cross-peak in such a spectrum can be associated with a particular structural feature defined by a collection of molecular coordinates as found in e.g. a pdb file. By introducing a further MIME type:

chemical/csml

we can now specify how we wish the eventual 3D display of the structure to appear on the screen, much as HTML itself can be used to control attributes of the text displayed by the Browser. In our case, the CSML instructions specify to RasMol that individual residues or protons in a pdb file be coloured differently from the rest, and rendered in e.g. spacefill mode. Such marking up serves to concentrate attention on the scientific points being made, and helps to reduce the "information overload".
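The kind of display instruction involved can be illustrated by generating a short RasMol script that picks out chosen residues. This sketch does not reproduce CSML syntax, which is not specified here; it only assembles ordinary RasMol commands (select, wireframe, colour, spacefill). The function name and the residue numbers in the example are hypothetical.

```python
def rasmol_highlight(residues, colour="red"):
    """Build RasMol script text that highlights the given residue numbers.

    The whole structure is first drawn as a white wireframe; each listed
    residue is then selected, recoloured and rendered in spacefill mode.
    """
    lines = ["select all", "wireframe", "colour white"]
    for residue in residues:
        lines.append(f"select {residue}")
        lines.append(f"colour {colour}")
        lines.append("spacefill")
    return "\n".join(lines)


if __name__ == "__main__":
    # Residue numbers 12 and 45 are arbitrary illustrative values.
    print(rasmol_highlight([12, 45]))
```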

We have extended this concept[4] to marking up structures associated with mass spectra, organic synthetic schemes, rotational/vibrational spectra, potential energy surfaces and simply illustrating binding centres in complex proteins in the context of the scientific discussion of the system. We feel that many interesting and important applications of this concept will emerge in the near future.

Chemical Applications of the World-Wide-Web.

How can this infrastructure assist the scientific communication process? In the first instance, we concentrated on primary scientific publications. The RWW "paper"[3] was prepared by the collaboration of three authors and contained hyperlinks to many of the applications and demonstrations alluded to above. Although the "paper" was submitted in conventional printed form to the journal, the referees were able to access the "live" version of the paper using a World-Wide Web browser, test all the claims made in the article and pronounce on the scientific value of the publication. In that sense, the moment the referees were able to recommend "publication" of the paper, it was in theory available to the entire scientific community. In practice, the publication date was deemed to occur when the printed version was distributed to its readers, a process that occurred some months after the electronic version of the paper became available. We emphasise that in theory at least, the referees would have been able to apply a much more rigorous "quality control" assessment to the paper than is usual, since any original spectroscopic, structural and computational data could have been made available to them in digital and hence verifiable form via appropriate hyperlinks, the use of chemical MIME types, and helper programs such as RasMol. In the area of molecular modelling for example, verification of computational results based purely on printed journal data is often very difficult, and as the calculations become more complex and more heterogeneous, such verification becomes next to impossible without access to all the original data files. The "living" journal proposed here greatly facilitates the quality control process. In that sense, the greater control over the journal available to its readers should help to reduce the current chaos of fragmented papers and data that is less readily verifiable.

Another feature which becomes possible with the chemical MIME mechanism in place is that of global indexing and the possibility of indexed searching against the original data. Currently, this is an immensely expensive and labour intensive process, operating again on a timescale of one or more years from the original report of the science. There already exist tools known as "intelligent agents" which are programmed to perform a specified task, and which could roam the Internet searching for appropriate items. For example, the World-Wide Web Worm[10] can search for all known URLs, and index the contents of the information referred to by the URL against specified criteria. Given that chemical MIME types can potentially define a rich variety of chemical and molecular information, it is entirely possible that a chemically aware agent could be written with the specific task of searching for, say, a chemical sub-structure defined in e.g. a pdb file or a SMILES string. If for example, all synthetic reaction schemes were defined in such terms in World-Wide Web based papers, then a globally indexed reaction database could be produced automatically and, most importantly, cheaply and rapidly. Whilst no-one is currently advocating a proliferation of such uncontrolled agents running nightly on a global scale, it is easy to imagine how a published paper might be routinely and rapidly incorporated into global indices, certainly far more quickly than the timescale on which current abstracting mechanisms operate. The ability to rapidly index information presented in World-Wide Web form will greatly help to cope with the explosion of molecular information that is expected to come about as a result of the mass synthesis associated with modern drug-discovery programs, whilst hopefully preserving to a large extent the element of serendipity so treasured by all creative scientists.
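The first step such a chemically aware agent would need, finding the hyperlinks in a page that point at chemical data, might be sketched as follows. This is a minimal illustration using the Python standard library, not a description of the World-Wide Web Worm or of any existing agent. It works on HTML text already in hand and performs no network access; the class and function names, and the extensions other than .pdb, are assumptions.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Extensions an agent might treat as chemical data.
# Only .pdb is taken from the text; the others are assumed examples.
CHEMICAL_EXTENSIONS = {
    ".pdb": "chemical/pdb",
    ".cif": "chemical/cif",
    ".mol": "chemical/molfile",
}


class AnchorCollector(HTMLParser):
    """Collect the targets of all anchors in a page as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def chemical_links(html, base_url):
    """Return (url, mime_type) pairs for anchors that point at chemical data."""
    collector = AnchorCollector(base_url)
    collector.feed(html)
    found = []
    for url in collector.links:
        extension = posixpath.splitext(urlsplit(url).path)[1].lower()
        if extension in CHEMICAL_EXTENSIONS:
            found.append((url, CHEMICAL_EXTENSIONS[extension]))
    return found


if __name__ == "__main__":
    page = (
        'If you <A HREF="http://www.ch.ic.ac.uk/atp.pdb">click here</A>, '
        'or return to the <a href="index.html">contents</a>.'
    )
    print(chemical_links(page, "http://www.ch.ic.ac.uk/"))
```

A real agent would go on to retrieve each file and search its contents, for example for a sub-structure; matching on the extension is only the cheapest possible filter, and a server-supplied MIME type would be the more reliable signal.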

Other forms of chemical communication can also be envisaged as benefitting from this technology. Presentations and talks differ from scientific papers in being presented to an audience, with the opportunity for questioning the speaker. Such talks would normally be used to convey general themes rather than a mass of factual detail, the latter being left to the scholarly paper that may or may not result from the talk. The talk associated with this paper was mounted using the World-Wide Web mechanism,[11] and hence was available for inspection both before and after the actual meeting by not only the delegates at the conference but by a much wider audience. In theory, a feedback mechanism is possible using a "form" implanted into the talk, so that the physical duration of the talk need not limit discussion. Equally original in concept is the ability to provide usage statistics for documents. Finally, the talk can be integrated with the associated "paper" by hyperlinks, increasing the utility of both. This same mechanism has been applied to conference poster sessions and workshops,[12] and was used to organise and present the first "virtual" chemistry conference.[13] Such an event may not have the immediacy of meeting real people, but it has the advantages of being cheaper to organise, fully distributable via a CD-ROM and easily abstracted by e.g. Chemical Abstracts.

In the arena of teaching, we have started a pilot project known as "Global Instructional Chemistry",[14] where the mechanisms of hyperactive molecules are used to bring interesting chemical stories to students and staff alike. Embedded "forms" are used to solicit new contributions to this medium, and could indeed be used locally for student assignments and other coursework. Internally to our department, many lecture notes, laboratory scripts and other informative documents are available in this form. Such parochial material can easily be removed from general access by the system administrator setting up "access control lists", and even more sensitive commercial material now has the potential for being encrypted for greater security.

The Issues Raised.

Publishing in this format does raise a number of important new issues. What constitutes a single unit currently known as a "paper"? The RWW[3] paper was extensively hyperlinked to other documents, and hence defining a single published "unit" could become a matter of defining the depth of the hyperlinks in the top level document. The "paper" then becomes a matter of defining a potentially widely dispersed body of information. Some of this information may not necessarily be under the control of either the author or any journal, and may indeed even change with time, or disappear entirely. Whilst presumably the essential scientific content and novelty of a paper should be encapsulated in the top level document, the usual "publish-or-perish" requirement for paper counting may be entirely subverted by the concept of an extensively interlinked body of information on which career structures, funding applications and tenure would be based.
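One way to make "the depth of the hyperlinks" concrete is to treat documents as nodes and hyperlinks as edges, and to count as part of the "paper" everything reachable from the top level document within a chosen number of links. The following sketch uses a breadth-first traversal over an in-memory table of links; the document names and the choice of depth are purely hypothetical.

```python
from collections import deque


def documents_within_depth(links, top, max_depth):
    """Map each document reachable from `top` to its link depth.

    `links` maps a document name to the documents it links to. Documents
    more than `max_depth` links away from `top` are excluded.
    """
    depth_of = {top: 0}
    queue = deque([top])
    while queue:
        document = queue.popleft()
        depth = depth_of[document]
        if depth == max_depth:
            continue  # do not follow links beyond the chosen depth
        for target in links.get(document, ()):
            if target not in depth_of:
                depth_of[target] = depth + 1
                queue.append(target)
    return depth_of


if __name__ == "__main__":
    # Hypothetical link structure for illustration only.
    links = {
        "paper.html": ["spectrum.html", "atp.pdb"],
        "spectrum.html": ["raw-data.html", "paper.html"],
        "raw-data.html": ["instrument-log.html"],
    }
    print(documents_within_depth(links, "paper.html", max_depth=1))
    print(documents_within_depth(links, "paper.html", max_depth=2))
```

The size of the "unit" grows with each additional level of depth, and links back to the top level document are harmless because each document is visited only once.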

The mechanism outlined above operates implicitly in a short time scale, of months or perhaps a few years. What then of the longer timescale of decades or centuries? No doubt the World-Wide Web mechanisms will mature and then be replaced by new ones in the coming years. Will information published be capable of transfer to any new medium? What of the cost of archival? Who should be responsible for any archival? Will archived information still be available via a URL, or will the equivalent of an inter-library-loan request have to be made to reload it?

Perhaps the most serious issue is that of authenticity. Currently, our own output in World-Wide Web format resides entirely on a server under our control, running on funded machines that may not always be available in the future. The long term strategy might be to transfer at least the top-level primary documents to a trusted society. That is one way of ensuring that authors cannot "edit" documents after they have been deemed to be published, and of encouraging confidence in the authenticity of documents. Even here however, technology could allow authors to apply a "one-time" digital signature to a document, and to publish the decryption key. This would allow anyone access to the information, and would serve the purpose of verifying that the data has not been tampered with since the original publication date.
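The tamper-detection half of this idea can be illustrated with a cryptographic fingerprint. The sketch below is deliberately simpler than a digital signature: it records a SHA-256 digest of the document at publication and later checks a copy against it. A plain digest shows only that the content is unchanged, not who produced it, so a true signature scheme based on public-key cryptography would still be needed to establish authorship. The function names and the sample text are assumptions.

```python
import hashlib
import hmac


def fingerprint(document: bytes) -> str:
    """Return the SHA-256 digest of a document as a hexadecimal string."""
    return hashlib.sha256(document).hexdigest()


def is_unchanged(document: bytes, published_fingerprint: str) -> bool:
    """Check a copy of a document against the fingerprint recorded at publication."""
    return hmac.compare_digest(fingerprint(document), published_fingerprint)


if __name__ == "__main__":
    original = b"Top-level document as deemed to be published."
    recorded = fingerprint(original)  # would be lodged with a trusted party
    print(is_unchanged(original, recorded))
    print(is_unchanged(original + b" A later edit.", recorded))
```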

What of the hyperlinks to what used to be called "supplemental" information? Should all documents associated with hyperlinks also be transferred to the trusted society or encoded with digital signatures? In many cases this would not be possible, since the authors may have no control over a remote document. In other cases, the volume of supplemental data may be so large that any third party would not find it cost effective to store the information on behalf of the authors. Is this a case for a two-tier system of documents, some of which remain under author control and digitally signed, or perhaps a situation where information needs to be clearly tagged with an "expiry" date, after which it may no longer be accessible via the networks? The case for the continued existence of trusted global depositories of thoroughly authenticated information, as exemplified by the Brookhaven protein database or the Cambridge structural database, must remain strong.

Finally, what of cost recovery and the role of traditional publishers? It is undoubtedly now easier for authors to write their scientific musings in "marked-up" format, and even to set up their own World-Wide Web servers and publish papers entirely from their own resources. For example, our own efforts were accomplished with no explicit funding for the first six months. Scientific societies will continue to have a valuable role in arranging for the quality control mechanisms to be applied via peer review. However, if the resultant papers are freely available to the scientific community with no cost recovery mechanism, where will the cost of authentication be recovered from? Conversely, if the casual browser encounters messages such as "please phone the following number with your credit card ready", which is already happening on some pages of the World-Wide Web, then the entire way in which science is taught as a subject will have to be re-evaluated. Perhaps the best compromise will be to license the electronic journals on an institutional basis for a "top-sliced" fee, whilst also making available other forms of the journal, i.e. in printed form or via CD-ROM.

Conclusions.

The chemical sciences form a well defined and excellent test-bed for investigating new electronic forms of information delivery. Many issues are raised by these new technologies that we have hitherto not had to confront. We will now need to resolve these very rapidly. Nevertheless, an increasing number of scientists are coming to believe that an entirely new era of chemical communication is starting, which brings with it enormous potential for enhancing the subject.

Acknowledgements: The author wishes to thank Mark Winter (Sheffield), Benjamin Whitaker (Leeds), Peter Murray-Rust, Roger Sayle and Martin Hargreaves (Glaxo), and Glaxo (Greenford) for funding.