Presented at the International Chemical Information Conference, Nimes, October, 1996. This article is mounted on-line as http://www.ch.ic.ac.uk/talks/nimes/article.html together with the associated talk on http://www.ch.ic.ac.uk/talks/nimes/

A Paradigm Shift in Chemistry Electronic Publishing

Henry S. Rzepa*(a), Omer Casher(a) and Benjamin J. Whitaker(b)

(a) Department of Chemistry, Imperial College of Science, Technology and Medicine, London, SW7 2AY and (b) School of Chemistry, University of Leeds, Leeds.

Abstract: A review of the development of computational chemistry tools which make direct use of the Internet is presented, together with some recent advances in new Internet methodologies based around chemical MIME standards, Java Applets, VRML (Virtual Reality Modelling Language) and server-side technologies. The use of these tools in new generations of chemistry electronic journals is illustrated via the CLIC project to enhance the journal "Chemical Communications" and the activities of the "Open Molecule Foundation".


Keywords: Internet, World-Wide Web, VRML, Java, Electronic Journals, CML (Chemical-markup-language), OMF (Open-Molecule-foundation).

In the past three years, since the explosive growth of the Internet, it has become apparent that a paradigm shift in the way information is interchanged and used is under way. Molecular science is particularly well placed to take advantage of these changes, since the subject is well structured, can be expressed semantically very precisely, and has a visual, multidimensional and time-dependent quality which translates with great difficulty to the printed page. Given that limitation, it is surprising that some 10,000 printed journals with some chemical content exist, some admittedly very tenuously, and even more surprising that much of the momentum in the area of electronic versions of these journals has been generated by the more physical sciences. What is perhaps less surprising is that many of the recent "conversions" on the road to Damascus concentrate less on innovation, their prime intention being to duplicate the original printed formats as closely as possible. Such innovation as exists in commercial journals appears to focus on the removal of the costly components of journal publishing, namely colour plates, or on providing a solution to financially unattractive aspects such as the archiving of supplemental data. In this article, we continue[1] our argument that readers of journals should by and large reject mere electronic replication of printed texts, and that the role of electronic journals must be more akin to a scientific tool or instrument, where the basic information can be processed actively by the user.

Progress in Chemical MIME: The Netscape Plug-in API

At the 1995 Nimes conference, we introduced[2] our early work on how chemical information could be integrated into a document for electronic publishing using a mechanism known as chemical MIME.[3] This system also achieved a degree of separation of the chemical "content" from the chemical "style" or "form". No longer was the reader of the information bound to accept the particular visual style or interpretation of the data imposed on it by the author. Instead, if they wished, they could ask that the data be passed to their own visual interpreter or indeed much more sophisticated form of processing tool. Perhaps a specific example will emphasise the progress in its implementation in the last year.

The World-Wide Web allows the 3D coordinates of a molecule to be integrated into the document describing the research. The molecular content can be rendered on the screen with a chemical MIME aware "plug-in" such as Chime.[4] The syntax of the command that invokes Chime is shown below:

<EMBED SRC="porphend.pdb" bgcolor=#FFFFFF align=abscenter width=250 height=250 spinx=360 startspin=true display3D=sticks name="main" script="zoom 120; connect true; set bondmode and; set hydrogen false">


Figure 1. Molecular diagram as invoked using the Chime plug-in.

This allows the user to acquire a chemical document from the Internet, and to impose a molecular style on the content such that any 3D molecular coordinates are integrated directly into the screen display. The reader can rotate the 3D representation of the molecule, change its stylistic attributes from the default supplied by the author (e.g. from wireframe to spacefill), decide which colours to display it in, perhaps measure interatomic distances, or even perform computations on the system. Most importantly, they can save all this information to hard disk, or perform the equivalent of a copy-paste operation from the Web client window to some other program of their own choosing. Thus Chime supports such export/input to a full-featured molecular editor called ISIS/Draw.

The earliest example of how such "hyperactive" molecules were seamlessly integrated into a chemical document is in a keynote paper for the 1995 ECTOC electronic conference.[5] The concept has also been extensively used in the later ECHET96 conference[6] and is now a feature of the electronic version of the journal "Chemical Communications"[7] for enhancing the keynote articles. In summary, the World-Wide Web, in conjunction with chemical MIME headers, turns a chemical document into a working live tool, where the user makes the decisions on what to do with the data, rather than simply accepting a single point of view imposed by the author or publisher of the information.
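
To illustrate the chemical MIME mechanism described in this section, the following minimal Python sketch shows how a server or client might map a file extension to a chemical media type, so that a file such as porphend.pdb is handed to a chemistry-aware viewer rather than treated as generic data. The table is a small subset chosen for illustration only, and the helper names chemical_mime_type and content_type_header are our own assumptions; they are not part of Chime or of any browser.

```python
import os

# Illustrative subset of chemical MIME types (an assumption for this sketch,
# not a complete or authoritative registry).
CHEMICAL_MIME_TYPES = {
    ".pdb": "chemical/x-pdb",
    ".mol": "chemical/x-mdl-molfile",
    ".xyz": "chemical/x-xyz",
}


def chemical_mime_type(filename, default="application/octet-stream"):
    """Return the chemical MIME type for a file name, or a generic default."""
    extension = os.path.splitext(filename)[1].lower()
    return CHEMICAL_MIME_TYPES.get(extension, default)


def content_type_header(filename):
    """Build the HTTP header line a server could send along with the file."""
    return "Content-Type: " + chemical_mime_type(filename)
```

A client that receives the header for porphend.pdb can then pass the data to whichever viewer the reader has chosen for that type, which is the separation of content from style argued for above.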

Chime operates on two levels, the basic plug-in and an enhanced version called Chime Pro, which will offer the ability to accept a structure query directly from the clipboard, for searching a proprietary database engine such as Chemscape Server. The ability to display data in-line on an HTML page or within an HTML table promises a substantial performance benefit for the display of search results from a molecule or reaction database. This addresses a problem with a "stateless" communication protocol such as HTTP, in which every component of a document has to be retrieved from a server in a discrete transaction. If the user wishes to retrieve say 100 molecules with associated 2D or 3D co-ordinates, then the large number of HTTP transactions required makes the process very inefficient. By retrieving all the molecules in a single transaction, and then parsing the molecules out of this, one achieves much greater efficiency (and one also recovers the "state" or "context" between the molecules). Such database services, which might in the future be an integral part of reading a research article in a journal, represent an impressive technical advance over what was possible even one year ago. This does, however, raise interesting issues of whether it is appropriate to associate the reporting of primary research results with potentially proprietary and commercial information services.
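
The single-transaction idea can be sketched as follows. This is a minimal Python illustration, assuming that the server returns all the molecules in one response body with each record terminated by a line containing only "$$$$" (a convention borrowed from MDL SD files). We do not know what format Chime Pro actually uses, so the delimiter and the function name split_molecule_records are assumptions.

```python
def split_molecule_records(payload, delimiter="$$$$"):
    """Split one multi-molecule response body into individual records.

    Each record is assumed to end with a line holding only the delimiter.
    """
    records = []
    current = []
    for line in payload.splitlines():
        if line.strip() == delimiter:
            # End of one molecule: store it and start collecting the next.
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that was not terminated by a delimiter.
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records
```

Because the records arrive together and in order, the client also keeps the context between them (for example, their ranking in a search result), which is lost when each molecule is fetched in a separate transaction.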

Other aspects of plug-in technology must also be considered. Firstly, it is still necessary for the author of the software (in the case of Chime, MDL Information Systems) to produce operating system specific versions of the plug-in. For example, MDLI released a Unix version more than six months after the Windows and the Macintosh versions became available, and then only for SGI systems. Secondly, the reader must be pro-active in acquiring this software, and must often download perhaps three different versions to satisfy local installation requirements. Finally, there is no guarantee that any two chemical plug-ins will necessarily inter-operate, since guidelines for interchange of data between various plug-ins do not currently appear to exist. At least the plug-in implementation by some suppliers, e.g. Netscape V 3.0, now allows the user to choose which plug-in to use for any given supported chemical MIME type, a feature not supported in the initial releases.

We noted above that the Chime Pro plug-in was capable of parsing an HTML document for molecule content, solving the problems of the stateless HTTP protocol. This raises the issue of whether the molecular content should be enclosed in any particular standard form. Ideally, this should follow the strict SGML guidelines from which HTML originated, but which nowadays seem rarely followed. A secondary issue is whether HTML is the best carrier for molecular content. In recognition of the many inadequacies of HTML as a molecular descriptor, we have continued development of the "chemical markup language" or CML, which was implemented by Murray-Rust[10] and first reported at the 1995 Nimes meeting. CML functions as a medium for the inter-operability of chemical information between areas such as publications, programs, equipment, databases, other structured file formats (for example, CIF, CEX, asn.1 and CXF), and older legacy formats such as PDB. Because it is derived from a formally defined SGML DTD (document type definition), it can be parsed using standard tools. Furthermore, because of the highly structured nature of its implementation, it can be associated with the molecular class libraries defined by Java in a natural way. Thus CML-encapsulated chemical semantics provide a powerful and extensible way forward for the development of Internet based chemical tools.
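
The benefit of a formally structured markup is that generic parsers can extract the chemistry without any format-specific code. The Python sketch below illustrates this with a CML-like fragment. Note the assumptions: the element and attribute names (molecule, atomArray, atom, elementType, x3, y3, z3) and the approximate water coordinates are illustrative and are not taken from the CML DTD discussed here, and we substitute a standard XML parser for an SGML one purely for convenience.

```python
import xml.etree.ElementTree as ET

# Hypothetical CML-like fragment; names and coordinates are illustrative only.
SAMPLE = """
<molecule id="water">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
</molecule>
"""


def read_atoms(markup):
    """Return a list of (element, x, y, z) tuples from the markup."""
    root = ET.fromstring(markup.strip())
    atoms = []
    for atom in root.iter("atom"):
        atoms.append((
            atom.get("elementType"),
            float(atom.get("x3")),
            float(atom.get("y3")),
            float(atom.get("z3")),
        ))
    return atoms
```

The resulting list of atoms is exactly the kind of structure that maps naturally onto objects in a molecular class library.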

Progress in Java Applet Technologies

There is an alternative paradigm based around an environment known as Java. This is the first attempt to construct an infra-structure for inter-operable applications in a manner which is extensible, does not lead to proprietary turn-key solutions into which the user is locked, and which solves the problem of how to deliver the functionality to the user without repeated installation tasks. Based on an object oriented language known as Java (itself similar in construction to languages such as C++), it uses the concept of class libraries to define inter-operable molecular functionality. The most important aspect of Java however is that the Internet is an intrinsic aspect of its functionality. Firstly, the Java language is used to write small code units known as "Applets", which can be delivered to the user in the same way that a document is delivered. In this sense, Java introduces what has been called "just-in-time" applications, in which the user does not have to pre-install any application on their local system, but acquires each applet only when they need it. Applets themselves, because they derive from object oriented constructions, can call any object class they may need via a suitable request to a class library that may already be present on the user's own local disk system or, if it is not present locally, may be retrieved as a "network object". Because of their intrinsic object orientation, Java applets can potentially inter-operate, i.e. interchange data with other applets. Another feature of the Java environment is its built-in platform independence. Conventional applications are compiled with machine specific compilers to produce executable instructions that only run on a very specific computer and operating system. Java applets are transmitted in a machine independent meta-language or "byte code" form, which can be interpreted by a "virtual machine" on the user's system.

Such a virtual machine can be implemented within a World Wide Web browser (e.g. Netscape 3.0) and thus the availability of the applet is directly related to the availability of a suitable Web browser. The limitation of virtual machines is the speed with which they interpret and hence execute the Java applets. In 1997, this concept will be augmented by so-called "just-in-time compilers" which will be part of the capability of the Web browser, and which should enhance the execution speed of the applet significantly.

By mid 1996, a significant number of molecular Java applets had been produced, including molecular editors, molecule visualisers, sequence alignment editors, front ends to computational programs such as Gaussian 94 and a CML visualiser capable of parsing a CML encoded file.[8] In order to provide a mechanism for standards development in this process, to document the subject in a manner suitable for molecular scientists and to try to ensure that future inter-operability is achieved in the area of molecular science, an organisation known as the "Open Molecule Foundation" has been set up to facilitate developments in this area, and to provide information and support for developers and users.[9]

Progress in 3D Document Descriptors

One area where the initial 1996 releases of Java were significantly limited was the lack of any 3D graphical programming API, and the poor performance of 2D and 3D graphical rendering using this method. The problem of efficient 3D rendering is of course largely solved by codes such as RasMol (and its implementation in the Chime Web browser plug-in). RasMol is essentially a black box for converting 3D molecular coordinates to a pleasing visual form. However, such a product does not allow other forms of visual representation or property to be rendered; for example the molecular electrostatic potential calculated for 1 at the B3LYP/6-31G* level.

We recently argued[11] that the negative electrostatic potential of 1 is highly chiral, a property that we hypothesised is related to the excellence of this reagent as a chiral resolving agent via its weak binding and hydrogen bonding properties (Figure 2).


Figure 2. A rendered electrostatic potential for molecule 1. This figure is linked to a VRML file. You will need an appropriate plug-in such as Live3D to view this.

There is therefore a need for a generic technology that can be easily incorporated into an electronic journal to display such visual information. In 1995 one solution to such problems, VRML (Virtual Reality Modelling Language), was introduced. VRML is an object oriented 3D scene descriptor, much in the way that HTML is a 2D descriptor where the objects are the standard ASCII character set. Because VRML can display 3D objects such as lines, spheres, polygons etc, it can be efficiently optimised for the problems of 3D object and property display. The ability to display VRML objects is now integrated into Web browsers such as Netscape, and via such association, a wide range of authoring tools are now available to produce output in VRML form. In 1996, VRML 2.0 was introduced (the Moving Worlds proposal), which allows greater flexibility in defining various 3D objects, and introduces a mechanism for mapping "actions" on the individual objects in a 3D scene using integrated scripts and algorithms. This then suggests the possibility of building applications which use a combination of Java acting upon underlying molecular data, and then make use of the efficiency of VRML in rendering the information. Because VRML is a 3D format, the user can rotate and navigate this information on the screen under their own control.
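
As a minimal illustration of VRML as a scene descriptor, the Python sketch below writes a VRML 2.0 scene in which each atom becomes a coloured sphere placed at its coordinates. The radii and colours in ELEMENT_STYLE are arbitrary values assumed for display purposes, and the function names are our own; this is not the output format of any tool cited in this article.

```python
# Illustrative sphere radii and RGB colours per element (assumed values).
ELEMENT_STYLE = {
    "O": (0.60, (1.0, 0.0, 0.0)),
    "H": (0.35, (1.0, 1.0, 1.0)),
    "C": (0.65, (0.5, 0.5, 0.5)),
}
DEFAULT_STYLE = (0.50, (0.8, 0.8, 0.8))


def atom_node(element, x, y, z):
    """Return a VRML 2.0 Transform node that draws one atom as a sphere."""
    radius, (r, g, b) = ELEMENT_STYLE.get(element, DEFAULT_STYLE)
    return (
        "Transform {\n"
        f"  translation {x:.3f} {y:.3f} {z:.3f}\n"
        "  children [\n"
        "    Shape {\n"
        "      appearance Appearance { material Material { "
        f"diffuseColor {r:.2f} {g:.2f} {b:.2f} }} }}\n"
        f"      geometry Sphere {{ radius {radius:.2f} }}\n"
        "    }\n"
        "  ]\n"
        "}\n"
    )


def molecule_to_vrml(atoms):
    """Return a complete VRML 2.0 scene for a list of (element, x, y, z)."""
    return "#VRML V2.0 utf8\n" + "".join(atom_node(*atom) for atom in atoms)
```

The same approach extends to properties other than atoms: a surface such as an electrostatic potential could be emitted as a polygon mesh instead of spheres, which is what a fixed molecular viewer cannot easily do.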

We have previously shown[12] how VRML can be used to present a variety of types of molecular information to the user in an integrated manner. Just as with HTML, VRML objects can contain URL hyperlinks to other Internet based resources. In our case, individual points in the 3D scatter diagrams could be hyperlinked to other VRML scenes containing details of inter-molecular interactions relevant to the chemistry being discussed, and these in turn can be hyperlinked to bibliographic information about each molecule, or to any other relevant resource. Other possibilities include linking VRML objects to Java applets or scripts which can perform actions on such objects. The most trivial example would be to change the radius of a spacefill display of any particular atom from e.g. the van der Waals value to some other. A more sophisticated example is the use of Java to display digital spectral information derived from an NMR spectrometer,[13] and to link regions of the spectrum to specific atoms or residues in a 3D molecular object displayed using e.g. VRML. The links can be bi-directional, i.e. clicking on a specified atom will highlight the spectral region containing peaks associated with that atom, and vice versa.
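
The bi-directional linking between spectrum and structure can be sketched with a pair of lookup tables kept in step with each other. The Python class below is a minimal illustration only: the class name, the atom labels and the chemical shift ranges used in the usage note are hypothetical, and the sketch does not describe how the applet of reference [13] is implemented.

```python
class SpectrumStructureLinks:
    """Two-way links between atoms and spectral regions (low_ppm, high_ppm)."""

    def __init__(self):
        self._regions_by_atom = {}
        self._atoms_by_region = {}

    def link(self, atom_id, region):
        """Record that an atom gives rise to peaks in the given region."""
        self._regions_by_atom.setdefault(atom_id, set()).add(region)
        self._atoms_by_region.setdefault(region, set()).add(atom_id)

    def regions_for_atom(self, atom_id):
        """Atom clicked in the 3D view: which regions should be highlighted?"""
        return sorted(self._regions_by_atom.get(atom_id, ()))

    def atoms_at_shift(self, ppm):
        """Spectrum clicked at a shift: which atoms should be highlighted?"""
        hits = set()
        for (low, high), atoms in self._atoms_by_region.items():
            if low <= ppm <= high:
                hits |= atoms
        return sorted(hits)
```

For example, after linking hypothetical atoms "H1" and "H2" to the region (7.2, 7.4), a click at 7.3 ppm returns both atoms, while a click on "H1" returns that region.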

Such bi-directional functionality could of course be implemented entirely using Java applets, and as the 3D rendering tools available in Java improve, this route may well become the favoured one. Indeed with the Java 3D API class now under development, there is no reason why a VRML viewer cannot be completely written in Java. Such a Java-based VRML viewer would consolidate the two technologies and preclude the need for any VRML plug-in. Certainly, the current trend towards the production of very memory intensive Web browsers to which the user might have added a number of plug-ins and potentially duplicate functionality seems unsustainable.

The future of VRML itself depends on the quality of the content available. Tools, such as EyeChem 2.0, [12] are required to facilitate the creation of molecular VRML files from scratch or from data in other formats. A drawback of VRML is that this file format is ill suited for molecular data. One approach to this problem would be to prototype a more appropriate 3D file format, such as the Molecular Inventor [14] format into VRML. Another approach to this problem would be to implement an SGML document type representation for VRML [15] which would also address indexing and archiving issues of VRML files in a structured document server.

Structured Document Servers and Extensible Servers

The increasing chemical structure and content manifesting in Web based documents needs to be matched by chemical functionality on the server side. Among the key issues that need to be addressed are document and more especially hyperlink maintenance, and associated issues such as intelligent indexing of the content of the documents. We have largely focused on the use of a server known as HyperWave (formerly Hyper-G) originating from the University of Graz. Unlike conventional servers, where a document collection is largely defined by the server directory structures, and whatever hyperlinks have been inserted by the administrator and/or the authors of documents, HyperWave uses explicit tools to insert and maintain both the document collection, and any embedded hyperlinks contained within them. Unlike a conventional server therefore, it is not possible for either a document or locally resolved hyperlinks within the documents to become "orphaned" by a mis-directed URL. Furthermore, because a HyperWave server can communicate with other such servers, even non-local URLs to such servers can be automatically checked and automatically maintained. HyperWave also has a sophisticated mechanism for verifying the integrity of individual documents in its collections. By default, it is configured with an explicit declaration of an appropriate SGML DTD describing the document content, i.e. by default with a DTD corresponding to HTML (actually HTML 3.2, but with some Netscape extensions added to the DTD). When a new document is added, it is checked against this DTD, and index entries corresponding to the structure of the document are automatically generated.
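
One way in which a server can prevent orphaned links is to record links between document identifiers in a store of their own, rather than as URLs embedded in the document text. The Python sketch below illustrates that idea under our own assumptions; the class and method names are hypothetical and this is not a description of the HyperWave implementation.

```python
class DocumentStore:
    """Documents and links are managed by the store, not by file paths."""

    def __init__(self):
        self._location = {}   # document id -> current location
        self._links = set()   # (source id, target id) pairs

    def insert(self, doc_id, location):
        self._location[doc_id] = location

    def add_link(self, source_id, target_id):
        # A link can only be created between documents the store knows about.
        if source_id not in self._location or target_id not in self._location:
            raise KeyError("both ends of a link must already be in the store")
        self._links.add((source_id, target_id))

    def move(self, doc_id, new_location):
        # Links refer to identifiers, so moving a document breaks nothing.
        if doc_id not in self._location:
            raise KeyError(doc_id)
        self._location[doc_id] = new_location

    def remove(self, doc_id):
        # Removing a document also removes every link that touches it.
        del self._location[doc_id]
        self._links = {(s, t) for (s, t) in self._links if doc_id not in (s, t)}

    def resolve(self, source_id):
        """Return the current locations of all documents linked from source."""
        return sorted(self._location[t] for (s, t) in self._links if s == source_id)
```

Because every change passes through the store, a link can never point at a document that has been moved or deleted without the store knowing.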

As described thus far, HyperWave has no explicit molecular "intelligence". However, in the future it is expected that this will be added by installing other SGML DTDs to the server functionality, such as for example the CML DTD, or enhanced versions of the ISO 12083 DTD which contain chemically explicit entities. Experiments underway will establish whether this approach will offer the type of functionality and in particular the performance suitable for chemical databases.

A second theme addresses the lack of structure that is already apparent in the so-called CGI (Common Gateway Interface) mechanisms for processing subject specific content. A "cgi" program or script is frequently used to perform specific tasks based on user-provided information derived from a forms-based input, and to return the output to the user via the browser window. These custom written "cgi" processes suffer considerably from a lack of explicit guidelines, and over a period of time they can become essentially unmaintainable. Such scripts are often also not inter-operable; frequently it is easier to introduce minor variations via a new script than to modify the existing script. Whilst this is currently a largely unsolved problem for chemistry, new approaches to the integration of the Web server with its "cgi" functionality are now appearing. Both the Jigsaw server from the World-Wide Web Consortium and the Jeeves server from Sunsoft offer a Java-based solution. In this scenario, the "cgi" functionality is closely integrated in an object-oriented manner with the server itself, thus offering a stable and scalable way of building discrete subject-based services into the document collection. A solution which offers chemical functionality via both the server and the document structure seems a promising development path to follow.
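
The contrast with free-standing scripts can be sketched as follows. In this minimal Python illustration, every service implements one shared interface and is registered with the server, so that variations are added as new classes rather than as copies of a script. The class names and the toy atom-counting service are hypothetical, and the sketch does not reproduce the Jigsaw or Jeeves programming interfaces.

```python
class Service:
    """Common interface that every server-side service implements."""

    def handle(self, form):
        raise NotImplementedError


class AtomCountService(Service):
    """Toy service: count atoms per element in a space-separated list."""

    def handle(self, form):
        counts = {}
        for symbol in form.get("elements", "").split():
            counts[symbol] = counts.get(symbol, 0) + 1
        return counts


class Server:
    """Dispatches form data to registered service objects by path."""

    def __init__(self):
        self._services = {}

    def register(self, path, service):
        if not isinstance(service, Service):
            raise TypeError("services must implement the Service interface")
        self._services[path] = service

    def dispatch(self, path, form):
        service = self._services.get(path)
        if service is None:
            return {"error": "no service at " + path}
        return service.handle(form)
```

Because each service is an object behind a known interface, the server can list, replace or combine services, which is difficult to do with a collection of unrelated scripts.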

Conclusions

In this article, we have largely focused on the technical infra-structure around which new generations of journals, conferences, books and teaching aids will be based. We believe there is much potential in the use of technologies such as Java for interpreting structured chemical information encoded in formats such as CML, the use of VRML for its 3D display, and structured document servers for its delivery and semantic processing. We believe that this infra-structure constitutes a new concept in functional chemical information tools, in which the Internet is an intrinsic component. As we pointed out last year,[3] issues such as security and validation, the implementation of cost recovery and charging models, how intellectual property rights and priority claims are handled, and how archival mechanisms can be developed in an area where the technological advances of a few months can easily render formats obsolete, still need to be resolved. In some ways, these present a far greater challenge than the purely technical ones. There can be little doubt however that after 25 years of development, the Internet can finally be considered a maturing major tool at the disposal of molecular science.

Acknowledgements: The authors thank Peter Murray-Rust, Jurgen Brickmann and Adam Precious for their inspiration and collaboration. Funding from the JISC Electronic libraries (e-Lib) and JTAP programs, from British Telecom for a University Development award and from GlaxoWellcome is gratefully acknowledged.

References and Citations

[1] For earlier articles on this theme, see D. James, B. J. Whitaker, C. Hildyard, H. S. Rzepa, O. Casher, J. M. Goodman, D. Riddick, P. Murray-Rust, New Review of Information Networking, 1996, 61-70 (http://www.ch.ic.ac.uk/clic/video.html); S. M. Bachrach, P. Murray-Rust, H. S. Rzepa, B. J. Whitaker, "Publishing Chemistry on the Internet", Network Science, 1996, 2 (3) (http://www.awod.com/netsci/Issues/Mar96/feature4.html)

[2] B. J. Whitaker and H. S. Rzepa, "Chemical Publishing on the Internet", Conference on Chemical Information, Nimes, France, October, 1995 (http://www.chem.leeds.ac.uk/papers/html/Nimes/nimes.html). See also http://www.ch.ic.ac.uk/chemime/iupac.html for the latest summary of this area.

[3] For the official chemical MIME home page, see http://www.ch.ic.ac.uk/chemime/

[4] See the URL http://www.mdli.com/ Chime is itself a Netscape plug-in enhancement of the RasMol molecular viewer written by Roger Sayle. For the history of RasMol, see http://www.glaxowellcome.co.uk/netscape/software/history.html

[5] A. Padwa, E. A. Curtis, V. P. Sandanayaka, and M. Weingarten, ECTOC, 1995, See http://www.ch.ic.ac.uk/ectoc/papers/01/

[6] See http://www.ch.ic.ac.uk/ectoc/echet96/

[7] The CLIC Project. See http://chemcomm.clic.ac.uk/

[8] For one such collection, see http://www.ch.ic.ac.uk/java/

[9] For information about Java and tools for Bio and Chemo-informatics, see the Open Molecule Foundation; http://www.ch.ic.ac.uk/omf/

[10] P. Murray-Rust; http://www.dl.ac.uk/CBMT/cml/cml06f/newintro/role.html

[11] For further details, see D. O'Hagan and H. S. Rzepa, Chem. Commun., 1996, in press. An electronic version of this article will also be available via the journal home page; http://chemcomm.clic.ac.uk/

[12] For chemically oriented examples of how VRML has been applied, see O. Casher and H. S. Rzepa, J. Mol. Graphics, 1995, 13, 268; H. Vollhardt, C. Henn, G. Moeckel, M. Teschner, J. Brickmann, J. Mol. Graphics, 1995, 13, 368; J. Brickmann, H. Vollhardt, Trends In Biotechnology, 1996, 14, 167-172.

[13] H. S. Rzepa, P. Murray-Rust and R. Kinder; See http://www.ch.ic.ac.uk/java/HyperSpec/

[14] O. Casher and H. S. Rzepa, Proceedings of the 14th UK Eurographics Conference, March, 1996. See http://www.ch.ic.ac.uk/rzepa/eg/

[15] W. E. Kimber; See http://38.145.245.206:80/drmacro/vrml/