Reporting scientific investigations in the form of a periodic journal is a concept dating back some 350 years to the 17th century.1 For much of that time, the only mechanism for dissemination involved bound paper (the "volume" or "issue"). This of course has changed in the last twenty years, at least in terms of delivery, but the basic structure and format of the scientific article has undergone less change. The article continues to interleave a narrative with the experimental data that supports it. The data is presented in the form of figures, tables, schemes and often just plain text, in a highly visual form that humans can easily absorb. Often, only a small subset of the data actually available can be presented, for reasons of "space". The online era, dating back perhaps 30 years, has allowed "supporting information (data)" to be separated from the physical constraints of the printed journal article and deposited with the publisher or a national library as a separate archive.
The logical connection between data present in the main article and its supporting counterpart is in fact tenuous; both sets of data continue to depend on the human reader to extract value from them. Text and data mining (TDM), however, is making enormous strides2 in allowing machines to harvest data, being capable of far higher and less error-prone throughput than humans. This in turn facilitates human verification of assertions made in the narrative component of an article, or indeed the discovery of connections and patterns across a body of articles. The paper-bound article, including electronic emulations of paper such as the PDF format, is not well set up for a clear separation of the narrative and the data on which the former is so dependent. The two types of content are also caught up in complex issues of copyright: are the rights to and ownership of the narrative held by the publisher or retained by the author? Is the ability to perform TDM unrestricted, or is it limited by the publisher? Unfortunately, the data component, which we may presume is not covered by copyright (one cannot copyright the boiling point of water), is often entrained in these complexities. Here we suggest a new model of how the scientific journal can take advantage of some of the many technical advances in publishing by emancipating the data from its interleaved co-existence with the narrative.3 We reflect on how this might allow the journal to evolve in a manner more appropriate for a fully online environment, and present several examples.
A slowly growing innovation is the electronic laboratory notebook as the primary holding stage for the capture of data from e.g. instruments, databases, computational resources and other data-rich sources.4,5 This model can be represented as in Figure 1. The flow of data is primarily from data source to notebook; much less information is likely to flow in the other direction. When a project is complete, the data held in the notebook is assembled into a narrative plus data-visuals in a word processor and submitted in this latter form to a journal. The data, together with its expression and semantics, is rarely well preserved in this process. There is no communication at all in the reverse direction, from the journal article back to the notebook; if nothing else, the notebook security model would not permit this.
Figure 1. The standard model for the flow of data from the laboratory to the scientific journal.
Consider however an alternative model (Figure 2). It now incorporates perhaps the single most important game-changing technology introduced a little less than ten years ago: the digital repository, in chemistry6 and other domains.7,8 This combines the concept of using rich (reliable) metadata to describe a dataset with an infrastructure that allows automated retrieval of datasets, potentially on a vast scale. The other basic component of a digital repository is the idea of a persistent identifier for the data, one that is abstracted away from any explicit hardware installation. There are other differences from the first model.
In the remainder of the current narrative, we will describe how a working implementation of this model was constructed.
Figure 2. A proposed model for the bidirectional data flows between the laboratory and the scientific journal.
Our starting point was computational chemistry, although solutions for molecular synthesis and spectroscopy have also been trialled.6 The basic resource for this is a high-performance-computing centre. Since the latter operates in a fully digital manner, it provides a good test-bed on which to construct an electronic notebook customised for the purpose. Here we will refer to it as uportal (Figure 3). The design of such a system has to factor in requirements specific to computations:
Figure 3. The data flows between the uportal notebook and a digital repository.
The uportal notebook interfaces via a developer API11 to a digital repository to enable deposition there of:
It is challenging and expensive to build/acquire/configure a general-purpose electronic laboratory notebook system that can accomplish all these specialised tasks for a local environment. We believe it is more straightforward to instead construct a lightweight portal using standard scripting environments such as Python or PHP. An overview of such a system can be seen in Figure 4, where some of these attributes are listed for each entry. As can be seen from the sequential ID, the system scales easily: ~100,000 entries have accumulated over a period of around seven years (averaging well over 10,000 entries each year), created by around 600 users distributed amongst staff, postgraduates and undergraduates. The last column shows the interface to the next component, the digital repository. The bidirectional nature is reflected in the capture of the assigned repository DOI back into the lightweight notebook if the entry has been published. Other actions include deleting the entry, or simply leaving it unpublished.
Figure 4. Uportal: a job submission and notebook system interfacing to a high-performance computing resource.
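To make this concrete, the following is a minimal sketch (in Javascript, for consistency with the scripting used later in this article) of the kind of entry record such a lightweight portal might maintain, together with the capture of the repository DOI back into the notebook; all names and values here are illustrative, not taken from the actual uportal implementation.

// Illustrative notebook entry: a sequential ID, ownership and job
// metadata, and a DOI field that stays empty until publication.
const entry = {
  id: 100001,            // sequential entry ID (cf. Figure 4)
  owner: "postgrad42",   // submitting user (hypothetical)
  job: "input.com",      // job file submitted to the HPC cluster
  status: "complete",    // job state reported back by the cluster
  doi: null              // assigned only if the entry is published
};

// After a successful deposition, the repository response supplies the
// persistent identifier, which is written back into the notebook entry.
function recordPublication(entry, deposition) {
  entry.doi = deposition.doi;  // e.g. "10042/26065"
  return entry;
}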
We have described elsewhere6 the principles behind our DSpace-based repository (SPECTRa), introduced in 2006. Two others have been added since then, Chempound13 and Figshare11. There is no limitation to the number of repositories that can be associated with any given electronic notebook.
In general, therefore, any dataset collected at the uportal as a result of a job submitted to the high-performance computing cluster can be simultaneously published into any combination of these three repositories. An example of how one particular deposited dataset or fileset is presented in these three repositories is shown in Figure 5, highlighting the auto-determined metadata and other attributes. The metaphor is that each dataset relates to a specific molecule with specific metadata for that molecule, and that this collection of data is then assigned its own repository identifier. Sets of such depositions can then be grouped into collections in the form of e.g. datuments (see below).
Figure 5. Repository metadata for the same dataset as sequentially published in (a) DSpace/SPECTRa, (b) Chempound, and (c) Figshare.
The Figshare repository differs in one regard from the other two. The initial deposition process reserves a persistent identifier for the object, inherits any project association from the original uportal entry, and creates a private entry within that project. At this stage, Figshare allows collaborators to be given exclusive permission to access the items in any given project, but the item is otherwise not open. Only when the project is deemed complete and submitted for publication does each entry need to be converted to public mode. One aspect of this process is not yet supported: a private but nevertheless anonymous mode that would enable only referees to view the depositions as part of any review process. Currently, we make the data fully public even at the review stage, with priority afforded by the date stamp and other metadata associated with the deposition.
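As an illustration of this private-then-public workflow, the following sketch uses the current Figshare REST API (v2), which postdates the system described here; the endpoints are those documented by Figshare, but the token and title are placeholders, and this is not the uportal code itself.

// Sketch of reserving a private Figshare item and publishing it later.
const FIGSHARE = "https://api.figshare.com/v2";
const headers = {
  "Authorization": "token <personal-token>",  // placeholder credential
  "Content-Type": "application/json"
};

// 1. Create a private item; its identifier is reserved at this point
//    and the item can be shared with named collaborators only.
async function createPrivateItem(title) {
  const response = await fetch(`${FIGSHARE}/account/articles`, {
    method: "POST", headers, body: JSON.stringify({ title })
  });
  return (await response.json()).location;  // URL of the new private item
}

// 2. Only when the project is complete is the item switched to public.
async function publishItem(itemUrl) {
  await fetch(`${itemUrl}/publish`, { method: "POST", headers });
}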
A technology for assigning and resolving persistent identifiers ("handles") for digital objects (IETF RFCs 3650-3652) was developed and is maintained by the Corporation for National Research Initiatives (CNRI). Handles are analogous to Web URIs (uniform resource identifiers) in being hierarchic descriptors containing a naming authority and a path to a resource. The system has the following features:
The most common implementation of the handle by journal publishers is the Digital Object Identifier (DOI) System.9 A short form of the standard DOI has recently been introduced which limits the identifier length to seven characters (it can be as few as three), the purpose being to facilitate their use by humans. Handles are typically resolved using http://hdl.handle.net or http://doi.org. These resolvers can also display the records returned from the prefix server, by appending the ?noredirect query to the handle (see Figure 6).
It is more common to use the Handle resolver to redirect the client immediately to the destination page, often referred to as the "landing page", using a "URL" record type. Although it would be possible to assign such a URL record to each individual data file, this rapidly becomes unwieldy and the associations between related files are lost. Such URL records also have the limitation that there is no easy way of specifying what action is required for the file, the default being simply to attempt to display its contents in the browser DOM (document object model). A more flexible standard way is therefore needed to directly specify the individual files within a deposited record, including files not on the landing page. This can in fact be achieved by an extension to the handle system: the little-known "10320/loc" record type, with its locatt feature, was developed to improve the selection of specific resource URLs and to add features to the handle-to-URL resolution process.15 This type contains an XML-encoded list of location entries, each containing:
The servers hdl.handle.net or doi.org accept the URL-encoded "?locatt=key:value" and return the URL of the first entry matching key=value. We define the keys "filename" and "mime-type"16 in our custom DSpace handle records (Figure 6).
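For example, for a deposition containing a file named gaussian.log (a hypothetical filename; the actual records are shown in Figure 6), the request

http://hdl.handle.net/10042/26065?locatt=filename:gaussian.log

would redirect the client directly to that file rather than to the landing page.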
Figure 6. The response returned for a query http://doi.org/10042/26065?noredirect showing the filename and mime-type records.
This now works as follows:
The most valuable feature of this extended experimental system is that the resolvers hdl.handle.net/api/10042/26065 and doi.org/api/10042/26065 return the full handle record encoded as JSON (JavaScript Object Notation), which we then process in Javascript. Issues do remain which will need eventual resolution.
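The following is a minimal sketch of such client-side processing, assuming the Handle System REST API (whose documented endpoint takes the form /api/handles/<handle>) and location entries carrying filename and href attributes as in Figure 6; the function name and filename are illustrative.

// Resolve a handle record and pick a file location by filename,
// mirroring on the client what "?locatt=filename:<name>" does server-side.
// (DOMParser assumes a browser environment.)
async function resolveFile(handle, filename) {
  const response = await fetch(`https://hdl.handle.net/api/handles/${handle}`);
  const record = await response.json();
  // The record is a list of typed values; find the 10320/loc entry.
  const loc = record.values.find(v => v.type === "10320/loc");
  if (!loc) return null;
  // Its value is an XML-encoded list of <location> elements.
  const xml = new DOMParser().parseFromString(loc.data.value, "text/xml");
  const match = Array.from(xml.getElementsByTagName("location"))
    .find(l => l.getAttribute("filename") === filename);
  return match ? match.getAttribute("href") : null;
}

// Usage: resolveFile("10042/26065", "gaussian.log").then(url => ...);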
With a system established which can directly and automatically address individual files (objects) held in a repository store, we can now consider how more complex objects such as datuments8 might be constructed. Consider a table or a figure which can be built8 from basic HTML5/CSS3/SVG/Javascript components, as is typical for a complex marked-up web page. The complete collection may comprise hundreds of files. In chemistry, such datument collections have been in use since 2006,17 with descriptors such as Web-enhanced objects (WEO, as used by the American Chemical Society) or interactivity-boxes. As these descriptions imply, they are a combination of data together with a scripted environment that renders the data into an interactive visual presentation for the reader (a datument).9 Most of the existing examples are interwoven with the narrative of a journal article17 and occur in the HTML version of the article, whereas a static equivalent is presented in the printable PDF version. The infrastructure described above now allows us to formally separate such datuments from the narrative by depositing the data fileset into a repository and assigning it a persistent identifier of its own. The datument can then be re-absorbed from the repository using e.g. an <iframe> declaration.
We have now created a number of such datuments on the Figshare repository, which have the following features:
<a href="javascript:handle_jmol('10042/26065',';frame 21;connect (atomno=1) (atomno=11) partial;')">log</a>, with components listed below:
<iframe src="http://wl.figshare.com/articles/840483/embed?show_title=1" width="850" height="300" frameborder="1"></iframe>
the effect being shown in WEO 1. In principle, the iframe declaration could itself be derived purely from the datument DOI using the locatt selection method described above; in this specific instance, it was obtained manually from the Figshare DOI landing page. It is also possible that this HTML element will be superseded by the use of a <link> element, which is regarded as having superior document properties: <link rel="import" href="/path/to/imports/stuff.html">
The functionality implemented in the resolve-api.js script is linear. One or more persistent identifiers for datasets specified in a datument are each resolved, using a handle server, into 10320/loc handle record types pointing to a data-repository server. This server returns the specified files to the calling datument, which itself can be requested by a journal server via its own persistent identifier. A total of up to four services in possibly four locations can be involved in this sequence, each being a potential point of failure. Here we briefly discuss what redundancies could be built into the system.
The general repository structure would be as follows:
Repository 1 handle record:
    URL       - URL of the landing page (repository 1)
    URL       - persistent identifier of the landing page (e.g. repository 2)
    URL       - persistent identifier of the landing page (e.g. repository 3)
    10320/loc - locations of files at repository 1
Repository 2 handle record:
    URL       - URL of the landing page (repository 2)
    URL       - persistent identifier for the additional deposition (e.g. repository 1)
    URL       - persistent identifier for the additional deposition (e.g. repository 3)
    10320/loc - locations of files at repository 2
etc.
This scheme relies on the alternative resources having the same or a similar handle record structure, including the 10320/loc type. Currently, only our DSpace/SPECTRa server has this specified, so the scheme above is not yet capable of practical resolution. It is nevertheless useful to include all instances of alternative depositions in the handle record if possible, in anticipation of other repositories implementing this scheme.
In the 10320/loc scheme, locatt is a selection method that selects a location based upon a specified key:value pair. The scheme also allows two other selection methods, country and weighted, specified by a chooseby attribute.15 If this attribute is not defined, it defaults to locatt,country,weighted. Our implementation (which allows the value of chooseby to default) uses locatt followed by weighted. We suggest it is good practice to include an explicit chooseby attribute in the Handle records, to anticipate any changes or enhancements in the repository structures. We also note that increasing consideration is being given to country records, since it can be desirable to select locations based on the legal frameworks in place for cloud-based data.
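To illustrate, a minimal sketch of such a 10320/loc value with an explicit chooseby attribute follows; the URLs, weights and country code are illustrative rather than copied from our records:

<locations chooseby="locatt,country,weighted">
  <location href="https://repository1.example.org/10042/26065/gaussian.log"
            filename="gaussian.log" mime-type="text/plain" weight="1" country="gb" />
  <location href="https://repository2.example.org/10042/26065/gaussian.log"
            filename="gaussian.log" mime-type="text/plain" weight="0" />
</locations>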
The redundancy model described above is suitable for a tightly coupled set of repositories into which deposition is managed by e.g. the uportal front end. Specifications for a complementary solution known as ResourceSync have recently been published19 to enable remote systems to remain in step with evolving resources. There are as yet no working implementations that could be demonstrated here.
Data emancipation along the lines of the model set out in Figure 2 has been used in five articles to date19-23 (six if you include this one).
In all five cases, much of the data originated from the quantum modelling of the systems, controlled using the uportal. The individual calculations were published into both DSpace and Figshare simultaneously as public objects. The assigned DOIs were then incorporated into tables and figures as hyperlinks using HTML. For articles 1-3 and 5, explicit data files were also included in the file collection; the complete set was then converted to a datument, uploaded to the Figshare repository and itself assigned a DOI (the one quoted above). The reader can either retrieve this local copy of the data and view it in Jmol, or use the original DOI for that item to download a more complete set of data, which includes the input specification defining how the calculation was performed and a checkpoint file containing the complete set of calculated properties. For the fourth article, no local copies of the data files are present in the complex data object; instead, the calculation log files are retrieved on demand from the original repository (in this instance DSpace/SPECTRa). There is one exception to this type of retrieval. Some of the original datasets were converted into electron-density cubes using the calculation checkpoint file, and a non-covalent-interaction (NCI) surface was then generated from each. The result comprised two files, a .xyz coordinate file and a .jvxl surface file; such operations can take 10-15 minutes or longer per molecule, which is too long for an on-demand interactive process. These specific surface files were therefore included in the datument as local files.
The model above was clearly developed to handle and illustrate the type of data we are interested in; it is not a generic solution for chemistry! But it serves to demonstrate that the entire workflow can be successfully implemented, which suggests that solutions for many other kinds of chemical data should be developed.
Search engines are starting to appear which focus on citable data. For example, all the metadata associated with persistent data identifiers issued by DataCite14 is available for querying. Thus:
http://search.datacite.org/ui?q=InChIKey%3DLQPOSWKBQVCBKS-PGMHMLKASA-N
will return all deposited data objects associated with the InChIKey chemical structure identifier30 LQPOSWKBQVCBKS-PGMHMLKASA-N. As the "SEO" (search engine optimisation) of the metadata included in the depositions becomes more effective, so too will e.g. searches for molecular information held in digital repositories. Similar features are also offered by Google Scholar31 and ORCID (Open Researcher and Contributor ID).12 A search using either of these sites for one of the present authors reveals multiple data-citation entries from both the DSpace/SPECTRa and Figshare repositories. Although data citations cannot yet be directly compared with article citations in terms of impact, the infrastructure that would allow useful altmetrics to be constructed for this purpose is appearing. Thus one example of how "added value" can accrue is illustrated by a resource32 harvesting metadata from the data repository Figshare,11 the ORCID12 database of researchers and collaborators, and Google Scholar.31 This information includes metrics which allow usage of the data to be estimated, and hence, rather indirectly, some measure of its scientific impact.
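Such metadata can also be queried programmatically. The following is a minimal sketch assuming DataCite's current REST API (https://api.datacite.org), which has since superseded the search interface quoted above; the function name is illustrative.

// Query DataCite for data objects whose metadata mentions an InChIKey.
async function findDataForInChIKey(inchikey) {
  const url = `https://api.datacite.org/dois?query=${encodeURIComponent(inchikey)}`;
  const response = await fetch(url);
  const { data } = await response.json();
  // Each record carries the DOI and its deposited metadata.
  return data.map(d => ({
    doi: d.id,
    title: d.attributes.titles?.[0]?.title
  }));
}

// Usage: findDataForInChIKey("LQPOSWKBQVCBKS-PGMHMLKASA-N").then(console.log);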
A two-component narrative-data model for the journal article (Figure 2) has the potential to solve one major current problem associated with scientific journals: the serious and permanent loss and emasculation of data between its point of creation in the laboratory and its final permanent presentation to the community in the published article. A very recent publication serves to illustrate the extent of this data loss.33 A significant computational resource was used to create 123,000 sets of optimised molecular coordinates in an impressive exploration of the conformational space of four pyranosides. Only 907 of these coordinate sets are available via the article's supporting information, and they are presented in the form of a double-column, unstructured, monolithic PDF document containing page breaks and page numbering, with very little associated metadata for each entry. A fair amount of effort would be required of the reader of this article to (re)create a usable database from this collection for further analysis. Absence of what could be regarded as key data is unfortunately the norm rather than the exception. Some reported data can be recast into a structured re-usable form by data-mining techniques, but where this occurs it has traditionally been conducted by commercial abstracting agencies, the substantial cost of which is passed back to the scientist. In the example described here, however, data that is absent in the first place cannot be recovered by any form of data-mining. It is worth at this stage noting a recent and concise declaration of principles regarding data-sharing known as the Amsterdam Manifesto,34 which is reproduced here in full:
Of particular note are articles 6 and 7 of the manifesto, which we address as described above using the 10320/loc records. Article 8 proposes that credit for data citation should be facilitated, which resources such as ImpactStory32 are starting to do.
The clear separation of narrative and data also addresses the vexed issues of copyright; data need no longer be constrained by the limitations and costs imposed upon the narrative. There are of course costs associated with the data: the repositories must have a sound business model to ensure their long-term permanence. Whether these responsibilities are borne by the agencies where the research is initially contracted, or by new agencies set up for the purpose, will be determined by the communities involved.
Other than the infrastructures implicit in e.g. Figure 2, there is also the issue of how to persuade the authors of a scientific article to create the two components we propose. The narrative is straightforward, but can authors be persuaded to create and use the data objects? Very few are currently well versed or confident in using the HTML5/CSS3/SVG/Javascript toolkit to write scientific articles, although it is worth noting that this combination of tools is specified in epub3,35 an open distribution and interchange format standard for digital publications and documents. As adoption of such standards increases, so will familiarity with the concepts. An interim solution for promoting adoption may lie in the creation of standard templates: most of the complex detail is actually carried in Javascript and stylesheet declarations that need not be edited, and these can be easily transcluded into a document template via header declarations (a sketch of such a header follows the code below). Indeed, almost the only code that authors need be aware of is that shown earlier:
<a href="javascript:handle_jmol('persistent identifier', 'presentation script')">linked hypertext</a>
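A minimal sketch of the template header declarations referred to above follows; resolve-api.js is the script named earlier, while the other filenames are purely illustrative:

<head>
  <!-- Transcluded machinery; authors need not edit these declarations. -->
  <script src="resolve-api.js"></script>         <!-- persistent-identifier resolution (named earlier) -->
  <script src="Jmol.js"></script>                <!-- molecular viewer assumed by handle_jmol (illustrative) -->
  <link rel="stylesheet" href="datument.css" />  <!-- presentation styles (illustrative name) -->
</head>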
A more challenging problem is that most authors would see little reward in the current system for undertaking such tasks. Such rewards accrue from the narrative they present (be it a scientific article or a Ph.D. dissertation) and not currently from the data they associate with that narrative (except, of course, that the narrative would not stand on its own if no data had been presented). Here, the scientific community must agree that preserving data and curating it for the future is a worthwhile activity, and bestow appropriate rewards for doing so, or indeed apply sanctions if it is not done. Technically at least, there is nothing preventing the scientific journal from evolving in this manner.