A New Publishing Paradigm: STM Articles as part of the Semantic Web

Henry S. Rzepa^a and Peter Murray-Rust^b

^aDepartment of Chemistry, Imperial College, London, SW7 2AY. ^bSchool of Pharmacy, University of Nottingham.

Abstract

An argument is presented for replacing the traditional published scientific, technical or medical article, where heavy reliance is placed on human perception of a printed or printable medium, by a more data-centric model with well defined structures expressed using XML languages. Such articles can be regarded as self-defining or "intelligent": they can be scaled down to finely grained details such as atoms in molecules, or up into journals or higher order collections, and can be used by software agents as well as humans. Our vision is that this higher order concept, often referred to as the Semantic Web, would lay the foundation for the creation of an open and global knowledge base.

Introduction

Both scientists and publishers would agree that the processes involved in publishing, and particularly in reading, scientific articles have changed considerably over the last five years or so. We will argue, however, that these changes relate predominantly to the technical processes of publication and delivery, and that fundamentally most authors' and readers' concepts of what a learned paper is, and how it can be used, remain rooted in the medium rather than the message. The ubiquitous use of the term "reading a paper" implies a perceptive activity that only a human can easily do, especially if complex visual and symbolic representations are included in the paper. We suggest that the learned article should instead be regarded as more of a functional tool, to be used with the appropriate combination of software-based processing and transformation of its content, with the human providing comprehension and enhancement of the knowledge represented by the article.

The Current Publishing Processes

Much of the current debate about learned STM (Scientific, Technical and Medical) publishing centres around how the dynamics of authoring an article might involve review, comment and annotation by peer groups, i.e. the pre-print/self-print experiments and discussion forums.¹ This leads on to the very role of publishers themselves, and the "added value" that they can bring to the publication process, involving concepts such as the aggregation of articles and journals, ever richer contextual full-text searching of articles, and added hyperlinking both between articles and between the article and databases of subject content. These are all layers added by the publishers, and inevitably, since they often involve human perception at some or all of these stages, they remain expensive additions to the publishing process. There is also the implicit assumption that the concept of what represents added value is largely defined by the publishers rather than by the authors and readers.

These debates largely assume that the intrinsic structure of the "article" remains very much what an author or reader from the 19th century might have been familiar with. These structures are mostly associated with what can be described as the "look and feel" of the journal and its articles, namely the manner in which the logical content created by the author is presented on the printed page (or the electronic equivalent of the printed page, the Acrobat file). In our own area of molecular sciences, the content is serialised onto the printed page or Acrobat equivalent into sequential sections such as abstract, introduction, numerical and descriptive results, a section showing schematic representations of any new molecules reported, a discussion relating perhaps to components (atoms, bonds etc) of the molecules and a bibliography. A human being can scan this serialised content and rapidly perceive its structure and more or less accurately infer the meaning of e.g. the schematic drawing of a molecule (although perceiving the three dimensional structure of such a molecule from a paper rendition is much more of a challenge!). A human is less well suited to scan in an error free manner thousands if not millions of such articles, and is subject to the error-prone process of transcribing numerical data from paper. Changing the medium of the article from paper to an Acrobat file does little to change this process. Most people probably end up printing the Acrobat file; few would confess to liking to read it on the computer screen. Yet, this remains the process that virtually everyone "using" modern electronic journals would go through.

We argue here that data must be regarded as a critically important part of the publication process, with documents and data being part of a seamless spectrum. In many disciplines the data are critical for the full use of the "article". To achieve such seamless integration, the data content of an article must be expressed in a far more precise way than is currently achieved, precise enough to be not merely human perceivable, but if necessary machine processable. The concept is summarised by the term "Semantic Web", used by Berners-Lee² to express his vision of how the World-Wide Web will evolve to support the exchange of knowledge on the Web. The Semantic Web by its nature includes the entire publishing process, and we feel that everyone involved in the publishing process will come to recognise that this concept really does represent a paradigm shift in the communication and in particular the use of information and data. A central concept of the Semantic Web is that data must be self-defining, such that decisions about what it represents and the context of how it can be acted upon or transformed are possible not merely by humans but by software agents created by humans for the purpose. The concepts also include some measure of error checking if the structure and associated meaning (ontology) of the data are available, and mechanisms to avoid loss of data if the meaning is not sufficiently well known at any stage.

The stages in the evolution of data and knowledge are part of the well known scientific cycle. An example in molecular and medicinal sciences might serve to illustrate the current process:

1. A human decides a particular molecular sub-structure is of interest, on the basis of reading a journal article reporting one or more whole molecular structures and their biological properties relating to e.g. inhibition of cancerous growth. This process is currently almost entirely dependent on human perception.
2. A search of various custom molecular databases is conducted, using a manual transcription of the relevant molecular structure. This implies a fair degree of knowledge by the human about the representational meaning of the structure they have perceived in the journal article. Chemists tend to use highly symbolic representations of molecules, ranging from text-based complex nomenclature to even more abstract 2D line diagrams where many of the components present are implied rather than declared. Licenses to access the databases must be available, since most molecular databases are proprietary and closed. It is quite probable that a degree of training of the human to use each proprietary interface to these databases will be required.
3. It is becoming more common for both primary and secondary publishers to integrate steps 1 and 2 into a single "added value" environment. This environment is inevitably expensive, because it was created largely by human perception of the original published journal articles. In effect, although the added service is indeed valuable, the processes involved in creating it merely represent an aggregation of what the human starting the process would have done anyway.
4. The result of the search may be a methodology for creating new variants of the original molecule (referred to by chemists as the "synthesis" of the molecule). The starting materials for conducting the synthesis have to be sourced from a supplier, and ordered by raising purchase orders from an accounts officer.
5. Nowadays, it is perfectly conceivable that a "combinatorial" instrument or machine will need to be programmed by the human to conduct the synthesis.
6. The products of the synthesis are then analysed using other instruments, and the results interpreted in terms of both purity and the molecular structure. This can often nowadays be done automatically by software agents. A comparison of the results with previously published and related data is often desirable.
7. Biological properties of the new species can be screened, again often automatically using instrumentation and software agents.
8. The data from all these processes are then gathered, edited by the human, and (nowadays at least) transcribed into a word processing program in which the document structures imposed are those of the journal "guidelines for authors" rather than those implied by the molecular data itself. We emphasize that this step in particular is a very lossy process, i.e. the lack of appropriate data structures will mean loss of data!
9. More often than not, the document is then printed and sent to referees. The data from steps 1-7 above are only accessible to them if they invoke their own human perception, since the process involved in step 8 may adhere (and then often only loosely) merely to the journal publishing and presentational guidelines rather than to those associated with the data harvested from steps 1-7.
10. The article is finally published, the full text indexed, and the bibliography possibly hyperlinked to the other articles cited (in a mono-directional sense). The important term here is of course "full text". In a scientific context at least, and certainly in molecular sciences, the prose-based textual description of the meaning inevitably carries only part of the knowledge and information accumulated during steps 1-10. Full-text prose is inevitably a lossy carrier of data and information. Even contextual operators invoked during a search (is A adjacent to B? Does A come before B?) recover only a proportion of the original data and meaning. The rest must be accomplished by humans as part of the secondary publishing process, and of course the cycle now completes with a return to step 1.

The cycle described above is clearly lossy. Much of the error correction, contextualisation and perception must be done by humans. We argue that too much is left to them (although we certainly do not argue for eliminating the human entirely from the cycle!).

Learned Articles as part of a Semantic Web

It is remarkable how many of the 10 steps described above have the potential for the symbiotic involvement of software agents and humans. If the structures of the data passed between any two stages in the above process and the actions resulting could be mutually agreed, then significant automation becomes possible, and more importantly, data or its context need not be lost or marooned during the process. This very philosophy is at the heart of the development and adoption of XML (extensible markup language)³ as one mechanism for implementing the Semantic Web, together with the other vital concept of meta data, which serves to describe the context and meaning of data. XML is a precise set of guidelines for writing any extended markup language, together with a set of generic tools for manipulating and transforming the content expressed using such languages. Many MLs already exist and are being used; examples include XHTML (for carrying prose descriptions in a precise and formal manner), MathML (for describing mathematical symbolisms),⁴ SVG and PlotML (for expressing numerical data as two dimensional diagrams and charts)⁵ and CML (Chemical markup language,⁶ for expressing the properties and structures of collections of molecules).

We have described in technical detail elsewhere⁷ how we have authored, published and subsequently re-used an article written entirely in XML languages, and so confine ourselves here to how such an approach has the potential to change some if not all of the processes described in steps 1-10 above. Molecular concepts such as molecule structures and properties were captured using CML, schematic diagrams were deployed as SVG, the prose was written in XHTML, the article structure and bibliography were written in DocML, meta data was captured as RDF (resource description framework),⁸ and the authenticity, integrity and structural validity of the article and its various components were verified by using XSIGN digital signatures. All these various components inter-operate with each other, and can be subject to generic tools such as XSLT (transformations) to convert the data into the context required or CSS (stylesheets) to present the content in e.g. a browser window. The structure of each XML component can be machine-verified using documents known as DTDs (document type definitions) or Schemas, and where necessary components of the article (which could be as small or finely grained as individual atoms or bonds) can be identified using a combination of namespaces and identifiers.
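As a minimal sketch of how a combination of namespaces and identifiers can pick out a single atom inside a mixed-vocabulary article, the following Python fragment uses only the standard library. The sample document, the CML namespace URI and the CML element and attribute names (molecule, atomArray, atom, elementType, atomRefs2) are assumptions chosen for illustration; they are not taken from the article described above.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
CML = "http://www.xml-cml.org/schema"  # assumed namespace URI for this sketch

# Hypothetical article fragment mixing XHTML prose with a CML molecule.
ARTICLE = """\
<h:div xmlns:h="http://www.w3.org/1999/xhtml"
       xmlns:cml="http://www.xml-cml.org/schema">
  <h:p>Methanol was used as the solvent.</h:p>
  <cml:molecule id="m1" title="methanol">
    <cml:atomArray>
      <cml:atom id="a1" elementType="C"/>
      <cml:atom id="a2" elementType="O"/>
    </cml:atomArray>
    <cml:bondArray>
      <cml:bond atomRefs2="a1 a2" order="1"/>
    </cml:bondArray>
  </cml:molecule>
</h:div>
"""


def find_atom(root, molecule_id, atom_id):
    """Return the CML atom with the given molecule and atom ids, or None."""
    for molecule in root.iter(f"{{{CML}}}molecule"):
        if molecule.get("id") != molecule_id:
            continue
        for atom in molecule.iter(f"{{{CML}}}atom"):
            if atom.get("id") == atom_id:
                return atom
    return None


root = ET.fromstring(ARTICLE)

# The namespace selects the vocabulary; the ids select the component.
atom = find_atom(root, "m1", "a2")

# The prose lives in a different namespace and is retrieved independently.
prose = [p.text for p in root.iter(f"{{{XHTML}}}p")]
```

The same document therefore serves a reader (via the XHTML paragraphs) and a software agent (via the CML atoms) without either having to parse the other's content.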

The most important new concept that emerges from the use of XML is that the boundaries of what would conventionally be thought of as a "paper" or "article" can be scaled both up and down. Thus as noted above, an article could be disassembled down to an individual marked up component such as one atom in a molecule, or instead aggregated into a journal, collection of journals, or ultimately into the Semantic Web! This need not mean loss of identity, or provenance, since in theory at least, each unit of information can be associated with meta data indicating its originator, and if required a digital signature confirming its provenance. Because the heart of XML contains the concept that the form or style of presentation of data is completely separated from its containment, the "look-and-feel" of the presentation can be applied at any scale (arguably for an individual atom, certainly for an aggregation such as a journal, and potentially for the entire Semantic Web). It is worth now reanalysing the ten steps described above, but in the context that everything is expressed with XML structures.
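The scaling argument can be sketched in a few lines of Python: articles are aggregated into a larger collection, and each atom can still be recovered together with the meta data naming its originator. The element names journal and article, the sample data and the CML namespace URI are hypothetical choices for this illustration, and a Dublin Core creator element stands in for whatever provenance meta data a real article would carry.

```python
import xml.etree.ElementTree as ET

NS = {
    "cml": "http://www.xml-cml.org/schema",    # assumed CML namespace URI
    "dc": "http://purl.org/dc/elements/1.1/",  # Dublin Core element set
}

# Two hypothetical articles; the second carries no creator meta data.
ARTICLE_1 = """\
<article id="art1" xmlns:cml="http://www.xml-cml.org/schema"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Author A</dc:creator>
  <cml:molecule id="m1">
    <cml:atomArray>
      <cml:atom id="a1" elementType="C"/>
      <cml:atom id="a2" elementType="O"/>
    </cml:atomArray>
  </cml:molecule>
</article>
"""

ARTICLE_2 = """\
<article id="art2" xmlns:cml="http://www.xml-cml.org/schema">
  <cml:molecule id="m1">
    <cml:atomArray>
      <cml:atom id="a1" elementType="N"/>
    </cml:atomArray>
  </cml:molecule>
</article>
"""


def aggregate(articles_xml):
    """Scale up: wrap individual articles in a single journal element."""
    journal = ET.Element("journal")
    for text in articles_xml:
        journal.append(ET.fromstring(text))
    return journal


def atoms_with_provenance(journal):
    """Scale down: yield every atom together with its originating article."""
    for article in journal.iterfind("article"):
        creator = article.findtext("dc:creator", default="unknown", namespaces=NS)
        for molecule in article.iterfind(".//cml:molecule", NS):
            for atom in molecule.iterfind(".//cml:atom", NS):
                yield (
                    article.get("id"),
                    creator,
                    molecule.get("id"),
                    atom.get("id"),
                    atom.get("elementType"),
                )


journal = aggregate([ARTICLE_1, ARTICLE_2])
rows = list(atoms_with_provenance(journal))
```

Note that identifiers such as m1 and a1 are only unique within one article here, so the article id has to travel with them once the articles are aggregated.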

1. A human or software agent acting on their behalf can interrogate an XML-based journal, asking questions such as "how many molecules are reported containing a particular molecular fragment with associated biological data relating to cancer?". This would, technically, involve software searching for the CML or related "namespaces" to find molecules, and checking any occurrences for particular patterns of atoms and bonds. We have indeed demonstrated a very similar process for our own XML-based journal articles; the issue is really only one of scale. Any citations retrieved during this process are captured into the XML-based project document along with relevant information such as CML-based descriptors.
2. Any retrieved molecules can now be edited or filtered by the human (or software agent) and presented to specialised databases for further searching (if necessary preceded by the appropriate transformation of the molecule to accommodate any non-standard or proprietary representations required by that database) and any retrieved entries again formulated in XML.
3. With publishers receiving all journal articles in XML forms, the cost of validating, aggregating, and adding value to the content is now potentially much smaller. The publisher can concentrate on higher forms of added value; for example contracting to create similarity indices for various components, or computing additional molecular properties.
4. Other XML-based sources of secondary published information such as "Organic Syntheses" or "Science of Synthesis" (both of which actually happen to be already available at least partially in XML form) can be used to locate potential synthetic methods for the required molecule. The resulting methodology is again returned in XML form. At this stage, purchasing decisions based on identified commercial availability of appropriate chemicals can be made, again with the help of software agents linking to e-commerce systems. Many new e-commerce systems are themselves based on XML architectures.
5. The appropriate instructions, in XML form, can be passed to a combinatorial robot.
6. Processing instructions for instruments can be derived from the XML formulation, and the results similarly returned, or passed to software for heuristic (rule based) interpretation or checking. The software itself will have an authentication and provenance that could be automatically checked, if necessary by resolution back to a journal article and its XML-identified authorship. We also note at this stage that the original molecule fragment originated in step 1 is still part of the data, but obviously subjected to very substantial annotation with each step, the provenance of which can be verified if necessary.
7. The compound along with its accreted XML description can now be passed to biological screening systems, which can extract the relevant information and return the results in the same form.
8. At this stage, much human thought will be needed to make intelligent sense of the accumulated results. To help in this process, the XML document describing the entire project can always be re-presented to the human by appropriately selective filters and transforms, which may include statistical analysis or computational modelling. The human can annotate the document with appropriate prose, taking care to link technical terms to an appropriate dictionary or glossary of such terms so that other humans or agents can make the ontological associations.
9. Any referees of the subsequent article (whether open in a pre-print stage, or closed in the conventional manner) will now have access not only to the annotated prose created by the author in the previous stage, but potentially to the more important data accreted by the document in the previous stages. Their ability to perform their task can only be enhanced by having such access.
10. The article is published. The publisher may choose to add additional value to any of the components of the article, depending on their speciality. They may also make the article available for annotation by others.
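The first step of this revised cycle, in which software searches the CML namespace for molecules and checks them for a pattern of atoms and bonds, might be sketched as follows. This is a deliberately simplified illustration and not the procedure we used: the "fragment" is reduced to a single bond of a given order between two element types, whereas a real substructure search would require graph matching. The sample articles, the CML namespace URI and the element and attribute names are assumptions.

```python
import xml.etree.ElementTree as ET

CML = "http://www.xml-cml.org/schema"  # assumed namespace URI for this sketch

# Hypothetical journal: article id mapped to its XML source.
ARTICLES = {
    "article-1": """\
<article xmlns:cml="http://www.xml-cml.org/schema">
  <cml:molecule id="m1">
    <cml:atomArray>
      <cml:atom id="a1" elementType="C"/>
      <cml:atom id="a2" elementType="O"/>
    </cml:atomArray>
    <cml:bondArray>
      <cml:bond atomRefs2="a1 a2" order="2"/>
    </cml:bondArray>
  </cml:molecule>
</article>
""",
    "article-2": """\
<article xmlns:cml="http://www.xml-cml.org/schema">
  <cml:molecule id="m1">
    <cml:atomArray>
      <cml:atom id="a1" elementType="C"/>
      <cml:atom id="a2" elementType="O"/>
    </cml:atomArray>
    <cml:bondArray>
      <cml:bond atomRefs2="a1 a2" order="1"/>
    </cml:bondArray>
  </cml:molecule>
</article>
""",
}


def molecules_with_bond(article_xml, elem_a, elem_b, order):
    """Return ids of molecules containing an elem_a/elem_b bond of this order."""
    root = ET.fromstring(article_xml)
    wanted = {(elem_a, elem_b), (elem_b, elem_a)}
    hits = []
    for molecule in root.iter(f"{{{CML}}}molecule"):
        # Map atom id to element type so bonds can be resolved to elements.
        element_of = {
            atom.get("id"): atom.get("elementType")
            for atom in molecule.iter(f"{{{CML}}}atom")
        }
        for bond in molecule.iter(f"{{{CML}}}bond"):
            refs = (bond.get("atomRefs2") or "").split()
            if bond.get("order") != order or len(refs) != 2:
                continue
            if (element_of.get(refs[0]), element_of.get(refs[1])) in wanted:
                hits.append(molecule.get("id"))
                break  # one match per molecule is enough
    return hits


def search_journal(articles, elem_a, elem_b, order):
    """Return (article id, molecule id) pairs for every matching molecule."""
    return [
        (article_id, molecule_id)
        for article_id, xml_text in articles.items()
        for molecule_id in molecules_with_bond(xml_text, elem_a, elem_b, order)
    ]


carbonyl_hits = search_journal(ARTICLES, "C", "O", "2")
```

Each hit retains the identity of the article it came from, so the citation can be carried forward into the project document along with the molecule itself.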

This revised cycle is, potentially at least, far less lossy than the conventional route. Of course, some loss of data is probably desirable, since otherwise the article will become over-burdened by superseded data. The issue of how much editing takes place within such a model is one the community (and commercial reality) will decide.

Conclusions

The Semantic Web is far more than just one particular instance of how the scientific discovery and publishing process could be implemented. It involves a recognition by humans of the importance of retaining the structure of data at all stages in the discovery process. It involves them recognising the need for inter-operability of data in the appropriate context, and ultimately agreeing to common ontologies for what the data mean in their own subject areas. At the heart of this model will be the creation of an open model of publishing, which will lay the foundation for the creation of a global knowledge base in a particular discipline. The seamless aggregation of published "articles" will be the foundation of such a knowledge base.

These will be grand challenges which may take a little while to achieve. The technical problems are relatively close to solution, although the business models may not be so! However, the greatest challenge will be convincing authors and readers in the scientific communities to rethink their concepts of what the publishing process is, and instead to think on a global scale about how they must change the way they work, and how they capture and pass on data and information to the global community.

Citations and References

1. Harnad, S, Nature, 1999, 401 (6752), 423; The topic is currently being debated on forums such as the Nature debates; http://www.nature.com/nature/debates/e-access/index.html or the American Scientist Forum; http://amsci-forum.amsci.org/archives/september98-forum.html and at Chemistry pre-print sites such as http://preprint.chemweb.com/. Other interesting points of view are represented by Bachrach, S. M. "The 21st century chemistry journal", Quim. Nova 1999, 22, 273-276; Kircz, J. "New practices for electronic publishing: quality and integrity in a multimedia environment", UNESCO-ICSU Conference Electronic Publishing in Science, 2001.
2. Berners-Lee, T, Hendler, J, and Lassila, O, http://www.scientificamerican.com/2001/0501issue/0501berners-lee.html; Berners-Lee, T and Fischetti, M, "Weaving the Web: The Original Design and the Ultimate Destiny of the World-Wide Web", Orion Business Books, London, 1999. ISBN 0752820907.
3. The definitive source of information about XML projects is available at the World-Wide Web Consortium site; http://www.w3c.org/
4. See http://www.w3c.org/Math/
5. SVG, see http://www.w3c.org/Graphics/SVG/; PlotML, see http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII1.0/
6. Murray-Rust, P. and Rzepa, H. S, J. Chem. Inf. Comp. Sci., 1999, 39, 928 and articles cited therein. See http://www.xml-cml.org/
7. Murray-Rust, P, Rzepa, H. S, Wright, M. and Zara, S, "A Universal approach to Web-based Chemistry using XML and CML", ChemComm, 2000, 1471-1472; Murray-Rust, P, Rzepa, H. S, Wright, M, "Development of Chemical Markup Language (CML) as a System for Handling Complex Chemical Content", New J. Chem., 2001, 618-634. The full XML-based article can be seen at http://www.rsc.org/suppdata/NJ/B0/B008780G/index.sht
8. The RDF specifications provide a lightweight ontology system to support the exchange of knowledge on the Web, see http://www.w3c.org/RDF/