We introduce the concept of a datument as a hyperdocument for transmitting and preserving the complete content of a piece of scientific work. Currently the scientific publishing process loses almost all of the information environment that the author creates or possesses. It is shown that datuments can record and reproduce experiments and act as a lossless way of publishing science. This is illustrated with specific examples drawn from scientific documents and molecular science, showing how a datument containing molecular coordinates can be viewed in various styles and how typical documents deriving from organic and physical chemistry and expressed in XML can be transformed using XSLT.
This article is an expansion of a 5-minute, slightly tongue-in-cheek, invited presentation (by PMR) at ACM Hypertext 2003 (Hypertext, 2003). In subsequent discussion the underlying serious message was felt to be important, and this is the emphasis here.
We start by defining what we mean by the term "data" in an electronic environment. We use it to cover any material which is not usefully human-readable in raw form. Examples include graphs, digitised maps, database tables, computer code, program output, chemical structures, graphics visualisations, audio and video streams, genomic and microarray data and many more (really a superset of "hypermedia"). Our background may emphasise the physical and biological sciences, but the concept can be transferred to many domains and we believe that much of what follows is widely applicable to hypertext in all disciplines. However, practice and technology will differ; for example, the classic concept of transclusion will vary considerably between fields. We often emphasise the "call-by-value" (i.e. direct copy) strategy rather than "call-by-reference", as we feel this is the manner in which Open scientific disciplines wish to work.
We also emphasise the term "Open". This is well understood in software licenses, where the term OpenSource insists on the preservation of authors' moral rights and the integrity of information and metadata. For "data" it is much less clear, and we highlight this concern without providing complete solutions. "Open" therefore refers to the desire to make information universally available without hindrance, on conditions which preserve authors' rights but make it unnecessary to contact the author for responsible re-use.
Our ideas are implemented as working examples within the chemistry domain that act as proofs of concept and have been peer-reviewed (Murray-Rust et al, 2000, 2001, 2003, 2004; Gkoutos et al, 2001). Several of these examples are incorporated into the present article. What is needed now is the political will of the scientific community to provide the impetus for scaling them up.
Some aspects of this discourse may appear as an uncritical diatribe against all scientific publishers. This was indeed one of the themes in the humorous presentation, and it elicited resonance from the audience. We recognise that there are forward-looking publishers and we are currently pleased to be working with them. However we feel that the scientific publishing community is in many ways holding back the vision of increased scientific communication in the digital age. We will be pleased to hear from publishers who want to explore the concept of datument further.
Most publicly funded scientific information is never fully published and decays rapidly. As an example, the crystallography services in typical chemistry departments, such as those at the University of Cambridge or Imperial College London, carry out hundreds of analyses per year. Each is publishable in its own right, but the majority remain as "dusty files" because the effort required to "write them up" for a full peer-reviewed paper cannot be found. Yet these analyses are among the highest-quality scientific experiments performed in any discipline. All the information is produced electronically and only about 1% of the analyses are found to be incorrect in some way. They contain very rich information, and nearly 1000 peer-reviewed papers have been published on information extraction ("data mining") from such crystallographic data alone. The International Union of Crystallography (IUCR, 2003) has produced a very impressive e-only publication process in which the complete "manuscript" is submitted electronically and reviewed not only by humans but by extensive computer programs ("robots"). Such manuscripts are "almost always" accepted if there are no technical errors. Yet well over 50% of such material lies unpublished and unavailable to science.
Why? In some cases the scientists wish to have first use of their data and do not want competitors to get it. This was common in the protein crystallography community, which has developed acceptable practices such as putting data "on hold" for, say, six months. The practice varies between disciplines, but often the result is that the scientific public gets a summary of the work in (e)Paper form but not enough information to repeat the experiment. This is particularly true for in silico experiments (such as quantum mechanical calculations on molecules), where unless readers have complete knowledge of the input information and the installation details of the program they may get different results or behaviour. Frequently a reader will carry out the experiment again from scratch, because the published information is insufficient.
A serious consequence is that data- and text-mining are non-existent in many communities - they lack a sufficiently large corpus to make them useful. Crystallography had J. D. Bernal (Goldsmith, 1980), who, like Bush (Bush, 1945), was a visionary far ahead of his time; science more generally had Eugene Garfield (Garfield, 1962). Both foresaw the globalisation of information and laid the infrastructure for the archival of crystallographic and wider scientific information.
A feature of many sciences is that information is "micropublished" in many different journals. There are ca 3,000,000 new chemical compounds reported per year but few journals carry more than about 50 in any one article. Thus information about chemical compounds becomes spread over perhaps half a million articles each year. There are three main approaches to integrating and coordinating such micropublished information:
The latter approach is essentially an (incomplete) datument and we strongly promote this idea below. However, the number of publishers who actively adopt it is small. Supporting "Supplemental Information" (nowadays prefixed with the term electronic and hence often referred to as "ESI") is a cost without obvious return, and such material is unlikely to be actively refereed or curated. There are few standards and (anecdotally) very little re-use. The publishing process itself militates against datuments. The author is required to recast their information into models that conform to the publisher's technology and business model, often an Office document with a defined template and with all data converted into tables or semantically void images. In any case most manuscripts are re-keyed at some stage in the publication process, so electronic submission by authors has little value. Authors are, not surprisingly, discouraged from datument-like publishing.
Until recently this was inevitable, but now we have the technology to address this. Many information components in a hyperdocument can be recast as context-free XML and integrated with XML text and XML graphics. Here we show the overall information architecture with reference to the latest proofs-of-concept in the chemical field.
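As a sketch of what we mean, a fragment of such a datument might interleave human-readable text with machine-understandable chemistry. The structure below is illustrative only (not a prescribed schema), and hydrogens and most data are omitted for brevity:

<section xmlns="http://www.w3.org/1999/xhtml"
         xmlns:cml="http://www.xml-cml.org/schema">
  <p>The product was obtained as a white solid.</p>
  <!-- the molecule is data, not a picture: any CML-aware tool can act on it -->
  <cml:molecule id="m1" title="methanol (hydrogens omitted)">
    <cml:atomArray>
      <cml:atom id="a1" elementType="C" x3="0.000" y3="0.000" z3="0.000"/>
      <cml:atom id="a2" elementType="O" x3="1.430" y3="0.000" z3="0.000"/>
    </cml:atomArray>
    <cml:bondArray>
      <cml:bond atomRefs2="a1 a2" order="1"/>
    </cml:bondArray>
  </cml:molecule>
  <p>Melting point and spectra are likewise recorded as data rather than prose.</p>
</section>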
The current transition to "e-journals" seems to be welcomed by many - but not us. E-journals published in portable document format (PDF) have missed a great opportunity for change and brought little value to the scientific community (in this sense, portable really means print anywhere rather than re-use anywhere). Many readers still print their reading on paper, so the effect is merely to transfer the cost of producing paper journals (including mail) to the readers' printing bills. Even where readers use the screen there are few or no tools to manage this information - each scientific article is a distinct entity whose linear concept dates from the nineteenth century. Electronic TOCs and bibliographic hyperlinks may provide some value, but the idea of a dynamic knowledge base for the benefit of the community is wholly lacking. We accept that business goals and methods cannot change overnight, but novel forms of communication have usually been ignored. For example the authors have pioneered e-conferences (Rzepa et al, 1996), e-courses (Murray-Rust et al, 1995) and sit on the board of an innovative e-journal where datuments can be published (Gkoutos et al, 2001). These and similar efforts in other disciplines have been largely ignored. The brave new world articulated in many of the talks at the first World-Wide Web conference in 1994 (Rzepa, 1995), which foresaw radically new ways of using the digital age, has been largely stifled by conventional business interests and methods.

A common feature of all mainstream science publication is the universal destruction of high-quality information. Spectra, graphs, etc. are semantically rich but are either never published or must be reduced to an emasculated chunk of linear text to fit the paper model. The reader has to carry out "information archeology" from the few bricks that remain of the building.
The true vision of the digital age is to use information beyond the limitations of paper. We use the test of the "robotic scientific reader". This robot can read and understand scientific discourse such as papers and emails. The understanding is very limited and has very carefully controlled semantics but it has several major advantages over human reading.
The last feature is critical. Science is being overwhelmed with information and it is essential that we develop robotic readers. This is keenly understood in the biosciences where "text mining" is an active area. We accept that human natural language is a major current barrier, but this can be dramatically lowered if we have the will. By contrast most publishers are continuing to make their products inaccessible to robots, mainly through PDF. Even Microsoft Word (TM) is better. GIF or JPEG images carry content which machines can understand only with great difficulty (Gkoutos et al, 2003). It is symptomatic that this journal supports only "dumb" image formats such as GIF and JPEG and that SVG (graphics in XML), the natural choice for digital image information, is not actively promoted. [We shall use it anyway so if readers want to see our diagrams they need to take their first steps in reading a datument.]
This is not science fiction. A program undertaken at Cambridge (Murray-Rust et al, 2003) has resulted in robots that can read and understand most of the data in a typical paper on the synthesis of new chemical compounds. The robots can read a paper in ca 5 seconds and create a complete datument of all analytical information. Using XSLT stylesheets the robots can answer trivial (chemical) questions like:
Unfortunately the real spectra have been destroyed in the (conventional) publication process. Even so, these tools can carry out information archeology to make reasonable estimates of what the original spectra might have looked like.
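As an illustrative sketch of the kind of stylesheet query referred to above (the dictionary reference cml:mpt and the document structure are stand-ins, not the vocabulary actually used by the Cambridge robots), an XSLT fragment might list every compound whose reported melting point exceeds 200 degrees:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:cml="http://www.xml-cml.org/schema">
  <xsl:output method="text"/>
  <!-- report the title of each molecule whose melting-point property exceeds 200 -->
  <xsl:template match="/">
    <xsl:for-each select="//cml:molecule[cml:property[@dictRef='cml:mpt']/cml:scalar > 200]">
      <xsl:value-of select="@title"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>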
It is easily conceivable that robots could take action on reading papers, such as "find all inhibitors of HIV protease in J. Med. Chem., order them from suppliers or, where unavailable, repeat the syntheses." In practice this will still require human oversight for some years, but it illustrates the power of the semantics.
This discourse, therefore, is a call for "accessibility for robots as well as humans".
A datument is a hyperdocument for transmitting "complete" information including content and behaviour. We differentiate between "machine-readability" - merely that a document such as a JPEG image can be read into a system - and "understandability", where the machine is supplied with tools which are semantically aware of the document content. Examples of the latter are domain-specific XML components such as maps (GML), graphics (SVG) and molecules (Chemical Markup Language, CML). Understandability may require ontological (meaning) or semantic (behaviour) support for components. Neither is yet fully formalised, but within domains it is often possible to agree that certain concepts are sufficiently well defined that programs from different authors will behave in acceptable ways on the same documents. We shall assume that most scientific disciplines can, given the will, support machine-understandability for large parts of their information.
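The difference is easily seen in a single data item. Below, the same melting point appears first as prose, which only a human (or a sophisticated text-mining robot) can interpret, and then as a CML property which any CML-aware program can act on; the dictionary reference and units identifiers shown are illustrative rather than a mandated vocabulary:

<p>m.p. 123-125 °C</p>

<cml:property dictRef="cml:mpt" title="melting point"
              xmlns:cml="http://www.xml-cml.org/schema">
  <!-- a range expressed as data: units, minimum and maximum are explicit -->
  <cml:scalar units="units:celsius" minValue="123" maxValue="125"/>
</cml:property>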
In principle datuments can be infinite in size, both in terms of the semantic and ontological recursion and the need to provide complete information for every component. For example a scientific paper has citations which themselves are datuments and which may be required to create the complete knowledge environment. In principle, also, a datument can be dynamic with components changing in time. Nonetheless we believe that in many sciences bounded static datuments are of great value and that many primary publications are valuable as such.
Classical transclusion normalises information by providing a single copy of each component and providing links to, rather than copies of, such sources. This works well on the web as long as integrity is regarded as relatively unimportant (or at least poor integrity can be "lived with"). It also works where a single (monopolistic) supplier has control over all the transcluded information. In a heterogeneous environment it does not yet work. A supplier of transcludable content may have little business or moral motivation to provide continued integrity. A primary publisher may have no contractual obligation to continue to support authors' supplemental data, or even full text, indefinitely. While transclusion may work where microcontent is of very high value (e.g. arts and literature), it is difficult to see a business model in science.
An alternative model is the datument "snapshot" where all the components are copied and aggregated at "time of publication" (Figure 2).
While this forgoes the power of dynamic linking, it provides an enormous enrichment of the original material. An example could be a scientific thesis with multiple generic components such as:
and domain-specific ones such as:
The graduand creates a thesis by aggregating the information as a single datument with integral XML copies of all the information collected to support the scientific work. After the examiners have torn it to pieces (critical examination!), the revised datument can then be published in its entirety. Whereas most paper theses are never re-read, PhDatuments can be universally accessible to humans and robots alike.
Remarkably, models for such aggregation are already arising within the so-called "blogging" communities, which are united by their published "Web logs" and achieve some degree of semantic and ontological unity using meta-data (RSS) feeds (Murray-Rust and Rzepa, 2003, 2004).
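A sketch of how such a feed might carry chemistry as data rather than as text or images is shown below: a single RSS 1.0 item (the enclosing channel is omitted, and the URLs and identifiers are purely illustrative) embedding a CML molecule that an aggregator could index or display:

<item xmlns="http://purl.org/rss/1.0/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:cml="http://www.xml-cml.org/schema"
      rdf:about="http://example.org/compounds/42">
  <title>New compound 42</title>
  <link>http://example.org/compounds/42</link>
  <description>Synthesis and characterisation of compound 42</description>
  <!-- the chemistry travels with the metadata, ready for machine re-use -->
  <cml:molecule id="c42" title="compound 42">
    <cml:atomArray>
      <cml:atom id="a1" elementType="N"/>
      <cml:atom id="a2" elementType="C"/>
    </cml:atomArray>
  </cml:molecule>
</item>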
This article is addressed to those communities who genuinely wish to share scientific information. We believe that "most" scientists wish their data to be re-used, even if it occasionally leads to embarrassing retractions and revisions. Many authors do not recognise the value of aggregating their micropublished work, although this tradition has been common for 200+ years. We hope that the datument will show that mutual contribution leads to a vastly richer resource for scientific discovery.
We accept that certain data cannot be made freely available, for reasons such as patient confidentiality or patentability. We are however urging that all data published in the primary literature be Openly available for re-use. "Free" does not necessarily mean Open, as re-use may be prohibited. By "Open" we mean that the information can be aggregated, filtered and redistributed, and derivative works can be made, subject to appropriate license conditions. In OpenSource these are well explored and (paraphrased) include the preservation of original authorship, details of any changes in derivative works (if allowed) and full access to source code (not merely executable functionality).
A datument is generally composed of components from many sources. If these sources have any barriers to re-use, the distributability and re-use of the datument is very severely limited. Among the barriers are:
The protection of intellectual property (IP) on datuments is potentially extremely complex. Creative works are copyrightable but "facts" are not. However collections of facts may be held to be creative works. The status of a datument, where many components including text are assembled, is unlikely to be clear, and this could jeopardise the process of making the community's data Open.
This could be simplified if authors made it clear that the complete scientific datument was made Openly available by them. In most cases it has been created before submission to the publisher and we see little reason why copyright should be reassigned. If compromise seems inevitable, we have heard of a recent case where the authors keep copyright of the original manuscript and the publisher has copyright of the paginated form that appears "in print".
The International Scientific Unions have emphasised the importance of data being publicly available to the scientific community. In our view authors must not hand over copyright of the "data" to publishers. The datument (perhaps eviscerated of some of its "text") should be regarded as "data" and published in Open view. We show how this is technically straightforward and manageable at marginal cost.
This article contains two small examples of datuments (see Figures 4 and 5). Their subject matter is chemistry but readers need no detailed domain knowledge. They are interactive, but are not just another example of scientific multimedia or hypermedia. We stress that the content is independent of the presentation and the graphical displays are created by tools operating on the display-neutral datuments. For example a graphical display is irrelevant to a robot reader.
The content is two snippets of published scientific information and both incorporate a mixture of "text" and "data".
Although datuments are expressed in XML, this is not (yet) the format in which most scientists work. Data and text are collected in a variety of (often proprietary) non-extensible legacy formats, many in binary form. The two strategies are:
We have discussed this elsewhere but remark that the second approach, though unaesthetic and lossy, is likely to be the more tractable. Moreover, when it succeeds the community may be sufficiently impressed to invest in the infrastructure of XML. But 5000 years of linearisation will not disappear immediately. Each domain will have to create a significant amount of infrastructure and technology. In some cases this is well understood and under construction. We illustrate it from our own subject of molecular science (with the CML family of languages) and expect that the structure will map to other disciplines. We have created, often with the help of the OpenSource community:
The social dynamics of this daunting enterprise will vary considerably between domains. In some areas (e.g. crystallography) it is overseen by the appropriate Scientific Union or learned body. In others (e.g. new drug applications, NDAs) it will be part of the regulatory process. In the biosciences the (inter)national data curators have a major role. In chemistry the established nature of the chemical information industry has left a vacuum in communal development, which is filled by a smallish group of OpenSource enthusiasts such as ourselves. In all cases considerable investment of some sort is required, though it is much reduced by the availability of Open generic tools.
The datument is therefore a hypermedia document accessible to robots and humans. At ACM Hypertext we were impressed by the developments in human-understandable hypermedia but felt that robots were neglected in comparison. Web hypermedia systems are largely aimed at human readers and have few concessions for robots. Much of the analysis is post facto - analysing how humans and metadata-deprived robots navigate rather than building global hyperstructures ab initio. Developments such as ZigZag (Nelson, 2003) with a non-traditional information structure are exciting but it will require much evangelism before they become tools in mainstream publishing.
We argue that a cultural change in our approach to information is needed and that money on its own will not bring it about. Indeed, greater investment in mainstream publishing may worsen the situation. The publishers' primary selling point is their impact factor, not necessarily the functionality of the product. Funders and academic bodies compound this, and novel initiatives are often not welcomed if they have low impact. The model of publication must therefore change. Realistically this will take time, but we have to create something whose benefit falls to the scientific community and whose practitioners can be visionary. We propose students and their theses or reports as fertile ground.
Students have less fear of the impossible and less legacy to unlearn. We have involved both under- and post-graduate students in authoring XML in many of the ways shown above and they have not only picked it up quickly but added their innovations. We therefore suggest that positive incentives should be given to students to create their theses as XML datuments.
We illustrate this approach with an example derived from a small part of a typical student chemistry thesis (Figure 4). The original component of the thesis is written in XML, with the chemistry carried directly in CML, itself an XML language. This datument can then be transformed into different representations for human assimilation. Figure 4a illustrates its conversion to an Acrobat file, destined largely for those humans who wish to print or archive the content, whereas the same datument can be transformed to, e.g., Figure 4b, where the chemical content can now be viewed using either SVG (for 2D perception) or directly using a Java applet (where 3D perception might be needed).
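The stylesheets used for Figure 4 are necessarily more elaborate, but a minimal sketch conveys the principle. The stylesheet below is illustrative only (coordinate scaling and styling are arbitrary, and bonds are omitted): it renders the atoms of a 2D CML molecule as labelled circles in SVG, leaving the underlying datument untouched:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:cml="http://www.xml-cml.org/schema"
                xmlns="http://www.w3.org/2000/svg">
  <!-- wrap each molecule in an SVG canvas -->
  <xsl:template match="cml:molecule">
    <svg width="300" height="300" viewBox="0 0 300 300">
      <xsl:apply-templates select="cml:atomArray/cml:atom"/>
    </svg>
  </xsl:template>
  <!-- place each atom from its 2D coordinates and label it with its element symbol -->
  <xsl:template match="cml:atom">
    <circle cx="{150 + @x2 * 40}" cy="{150 - @y2 * 40}" r="10" fill="#dddddd"/>
    <text x="{150 + @x2 * 40}" y="{155 - @y2 * 40}" text-anchor="middle">
      <xsl:value-of select="@elementType"/>
    </text>
  </xsl:template>
</xsl:stylesheet>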
What are the immediate benefits of this approach? Some examples, which we contend may immediately save the student work:
The longer-term benefits are even more dramatic. Assuming the research group has 5 years worth of student PhDatuments, they could:
When the thesis is accepted, corrections will be easier to make. By using XSLT, the components of the thesis can be prepared as datuments for publication into the wider community. A working illustration of this process in action is given in Figure 5, where the action of XSLT stylesheets upon XML-based datuments can provide a variety of (user-driven) representations, including functional ones such as transformation of scientific units and manipulations of mathematical terms.
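As an illustration of the units example (the units identifiers below are stand-ins for whatever dictionary the datument actually references), a single XSLT template can re-express a Celsius quantity in kelvin at display time while the stored data remain unchanged:

<xsl:template match="cml:scalar[@units='units:celsius']"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns:cml="http://www.xml-cml.org/schema">
  <!-- convert the displayed value; the datument itself still holds Celsius -->
  <xsl:value-of select="format-number(number(.) + 273.15, '0.00')"/>
  <xsl:text> K</xsl:text>
</xsl:template>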
How could funding make this happen? We need:
PMR thanks Adam Moore, Helen Ashman and ACM Hypertext 2003 for the invitation to speak on the panel and for discussions with many delegates. They have also made valuable editorial suggestions. We thank our students (Sam Adams, Vanessa de Souza, Joe Townsend and Chris Waudby (Cambridge) and Mark Williamson (Imperial College)) for inspiration from their projects (to be reported elsewhere).