Data descriptors in molecular science: quantum computational simulation.

Henry S. Rzepa

Department of Chemistry, Imperial College London, Exhibition Road Campus, London, SW7 2AZ.

M. J. Harvey

High Performance Computing Service, ICT division, Imperial College London, Exhibition Road Campus, London, SW7 2AZ.

Abstract

This dataset comprises the description of a set of molecular 3D coordinates obtained from quantum mechanical simulation at a variety of theoretical levels, together with associated properties derived from computed wavefunctions for the molecules. These include molecular (harmonic) vibrations, transition state calculations, analysis of the topological properties of the computed wavefunctions (QTAIM) and evaluation of an electron localisation function (ELF) that provides information regarding the local bond properties. The datasets were generated using two quantum chemical computer codes, Gaussian09 and ORCA.

Keywords

Molecular 3D coordinates, molecular geometry, molecular wavefunctions, harmonic vibrations, transition state geometry, QTAIM topological analysis of electron density, ELF analysis of wavefunction.

Introduction

This data descriptor aims to provide an overview of datasets which have been made available as a primary component of published scientific articles. It is meant to be "human readable", rather than any formal procedural specification aimed at defining data structures or protocols for generating the data, and is aimed at both those who might wish to generate such datasets and the readers who might wish to (re)-use them. As such, it is not intended to provide a comprehensive review of all the features, but merely to outline the basic features.

The scientific context of the dataset

This data set descriptor takes as its scientific context a dataset associated with a published article¹ entitled "The rational design of helium bonds", in which the scientific conclusions derive from a sequence of quantum chemical calculations at various theoretical levels. Obtaining molecular coordinates in this manner is a general procedure, applied across the molecular sciences. The scope applies to discrete covalently bound molecules or ionic systems, containing up to around 250 atoms of any element in the periodic table. The coordinates for these atoms can then be associated with computed wavefunctions for the entire system, which in turn define the electron density distribution ρ(r) in the molecule. From this distribution, conclusions can be drawn regarding the length and strength of individual bonds in the molecule. In this particular case, it is the strength of bonds to the element helium that is the topic of the article. The computational procedures also allow a potential energy surface for putative reactions of the molecule to be computed, and from this thermodynamic quantities such as activation free energies of reaction ΔG^‡₂₉₈ to be estimated in order to provide estimates of the probable lifetime of the molecule and the rates of its reactions.
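The step from an activation free energy to a rate and a lifetime can be illustrated with the Eyring equation of transition state theory, k = (k_B T / h) exp(−ΔG^‡ / RT), with a half-life of ln 2 / k for a first-order process. The sketch below is an illustration only: the article does not state that exactly this form was used, a transmission coefficient of 1 and a unimolecular reaction are assumed, and the function names are hypothetical.

```python
import math

# Physical constants (SI units)
K_B = 1.380649e-23     # Boltzmann constant, J/K
H = 6.62607015e-34     # Planck constant, J s
R = 8.314462618        # molar gas constant, J/(mol K)


def eyring_rate(dg_kj_mol: float, temperature: float = 298.15) -> float:
    """First-order rate constant (s^-1) for an activation free energy in kJ/mol.

    Assumes a transmission coefficient of 1 (an assumption of this sketch).
    """
    prefactor = K_B * temperature / H
    return prefactor * math.exp(-dg_kj_mol * 1000.0 / (R * temperature))


def half_life(dg_kj_mol: float, temperature: float = 298.15) -> float:
    """Half-life in seconds of a first-order process with the given barrier."""
    return math.log(2) / eyring_rate(dg_kj_mol, temperature)
```

Calling `half_life(100.0)` gives the estimated half-life in seconds for a barrier of 100 kJ/mol at 298.15 K. Because the dependence on the barrier is exponential, small errors in ΔG^‡ translate into large changes in the estimated lifetime.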

Whilst the dataset deployed in the example above is specific to the scientific problem being addressed, the concepts addressed here are in fact quite general to the area of quantum computation in molecular sciences.

Dataset description and availability

The available data relate to the molecules shown below.
[Image: Molecules 9]

The dataset itself is available² via the published article in a container therein entitled "Web Enhanced Table. Calculated properties for molecular compounds of Helium". A thumbnail is illustrated in Figure 1.
Figure 1. The dataset for quantum computational simulation. Click on graphic to load data.

This table has the following data content:

Calculated geometries (Column 2 of table), expressed using XML syntax and the CML schema,³ and associated with the MIME type chemical/x-cml. This is a modern "self defining" format, where the semantics and ontology of the data are defined through the use of declared XML schemas, vocabularies and dictionaries. The data can be validated⁴ against the schema and are designed to be readily re-used in other scientific contexts and applications. A total of 20 CML files are provided in this dataset. This is the preferred data format.
Seven files are provided in XYZ format, MIME type chemical/x-xyz and contain information about the coordinates of the molecule and the molecular displacement coordinates for one vibrational mode of that molecule (column 2 of table). The use of this format was largely determined by the availability of software able to generate displacement coordinates by pattern-recognition processing from the original dataset, which in this case was the logfile from a Gaussian calculation. The use of the XYZ format is now deprecated in this context.
One file is provided in Gaussian logfile format, MIME type chemical/x-gaussian-log, and contains the complete output from one quantum mechanical simulation using the Gaussian09 program system.⁵ The format is free-text and must be processed by appropriate pattern-recognition of the text at the time of display, a procedure which can be used to replace the use of XYZ files as above (which can be regarded as a sub-set of the Gaussian logfile). The Gaussian format does contain units, but these must be parsed by pattern-recognition and are not formally declared. A formal complete dictionary of Gaussian terms is not available, and sub-sets are normally used.⁶ A more extensive set of parsers for quantum chemical codes has been developed by the Quixote project.⁶
Thirteen files are provided in Molfile format, MIME type chemical/x-mdl-molfile, a much older format for molecular coordinates, but still widely used in molecular sciences. This type is not associated with an XML schema, and its ontology is defined by proprietary documentation originally provided by Molecular Design Ltd., a company no longer in existence. These files define two molecular properties:
1. The 3D coordinates of critical points derived from the topology of the electron density ρ(r). Of particular interest are the properties of the bond critical points as they pertain to the bonds to the helium atom. Not formally declared in this data set are the implicit units used for the coordinates (Bohr). The normal (implied) units for coordinates in this format are Å. Formats where datatypes and units are implicit should be deprecated in favour of formats where they can be made explicit (such as the CML format noted above).
2. The 3D coordinates of basin centroids obtained by an ELF analysis of the wavefunction. Not formally declared in this data set are the implicit units used for the coordinates (Bohr).
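Because these Molfiles carry coordinates in Bohr while most software assumes Å, a re-user may wish to rescale the atom block before loading the file elsewhere. The following is a minimal sketch, assuming the files follow the fixed-column V2000 Molfile layout (atom count in the first three characters of line 4, x, y and z in three 10-character fields); the dataset description does not confirm the Molfile version, and the function name is hypothetical.

```python
# Bohr radius expressed in Angstrom (CODATA 2018 value)
BOHR_TO_ANGSTROM = 0.529177210903


def molfile_bohr_to_angstrom(text: str) -> str:
    """Return a copy of a V2000 Molfile with atom coordinates scaled from Bohr to Angstrom.

    Assumes the fixed-column V2000 layout; other variants are not handled.
    """
    lines = text.splitlines()
    # Line 4 is the counts line; its first three characters hold the atom count.
    n_atoms = int(lines[3][0:3])
    for i in range(4, 4 + n_atoms):
        line = lines[i]
        # x, y and z occupy three consecutive 10-character fields.
        xyz = [float(line[j:j + 10]) * BOHR_TO_ANGSTROM for j in (0, 10, 20)]
        lines[i] = "".join(f"{value:10.4f}" for value in xyz) + line[30:]
    return "\n".join(lines) + "\n"
```

Only the coordinate fields are rewritten; the element symbols and the rest of each line are passed through unchanged.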
Column 5 of the table contains the URL to the digital repository link for each explicit calculation.⁷ More complete datasets can be found there for each of the files noted above (which should be regarded as a sub-set of the data, and are provided primarily for display purposes). The digital repository link contains the following data:
1. A time stamp for the data, and its author and publisher.
2. Unique identifiers for the molecule the dataset is describing in the form of either an InChI or an InChI key identifier.⁸
3. The URI handle for the dataset.
4. A METS manifest for the document collection associated with this URI.
5. An input file defining the protocol used to generate the data set, in this case appropriate for the Gaussian09 program.
6. A Gaussian or ORCA logfile, and any associated checkpoint file (which contains data suitable for deriving properties such as the molecular orbitals associated with the molecule). The equivalent files for other such programs would be presumed included in any related repositories.
7. A wavefunction file, from which topological properties such as QTAIM or ELF can be derived.
8. Text files containing the InChI identifier and a SMILES identifier.
9. Project and project description files which provide further information about the scientific context of the dataset.
10. One example of such a collection can be found at http://dx.doi.org/10042/to-2899
11. An example of a script that can be used to automate the harvesting of such data is:
```
wget --no-check-certificate "https://spectradspace.lib.imperial.ac.uk:8443/dspace/bitstream/10042/$T/2/logfile.out" -O "$T.log"
```
  where the variable in the above example would have the value
```
T=to-2899
```
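To harvest several entries in one pass, the same request can be wrapped in a loop. The sketch below is a Python equivalent of the wget command above, under stated assumptions: the base URL and the `/2/logfile.out` suffix are copied from that command and may not hold for every entry, and the function names are hypothetical. Unlike `wget --no-check-certificate`, the default `urllib` download verifies the server certificate.

```python
import urllib.request

# Base URL taken from the wget example; assumed to be common to all entries.
BASE = "https://spectradspace.lib.imperial.ac.uk:8443/dspace/bitstream/10042"


def logfile_url(handle: str) -> str:
    """Build the logfile URL for one repository handle suffix, e.g. 'to-2899'."""
    return f"{BASE}/{handle}/2/logfile.out"


def harvest(handles, fetch=urllib.request.urlretrieve):
    """Download the logfile of each handle to '<handle>.log' and return the file names.

    `fetch` is injectable so the loop can be exercised without network access.
    """
    saved = []
    for handle in handles:
        target = f"{handle}.log"
        fetch(logfile_url(handle), target)
        saved.append(target)
    return saved
```

A call such as `harvest(["to-2899"])` would then save the logfile as `to-2899.log`, provided the repository is still reachable at that address.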
The footnote to this table provides technical information on the computational procedures, either directly or via additional links.
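As an illustration of re-using the CML geometries listed for column 2, the atoms and their 3D coordinates can be read with a standard XML parser. This is a minimal sketch, assuming the files declare the usual CML namespace `http://www.xml-cml.org/schema` and store coordinates in `x3`, `y3` and `z3` attributes on `atom` elements; files written without that namespace would need the tag name adjusted. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for the CML schema; adjust if a file declares another one.
CML_NS = "http://www.xml-cml.org/schema"


def read_cml_atoms(cml_text: str):
    """Return a list of (element, x, y, z) tuples for atoms carrying 3D coordinates."""
    root = ET.fromstring(cml_text)
    atoms = []
    for atom in root.iter(f"{{{CML_NS}}}atom"):
        # Skip atoms that have no 3D coordinates (e.g. 2D-only entries).
        if not all(key in atom.attrib for key in ("x3", "y3", "z3")):
            continue
        atoms.append((
            atom.get("elementType"),
            float(atom.get("x3")),
            float(atom.get("y3")),
            float(atom.get("z3")),
        ))
    return atoms
```

Because element types and coordinates are named attributes rather than positional columns, no pattern-recognition of free text is needed, which is the practical benefit of the "self defining" format.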

Dataset display and annotation

The dataset is presented to the reader via a Jmol applet,⁹ which is invoked by a Jmol script that also provides further annotation of the dataset. Examples of such annotation include:

Highlighting using colour of particular atoms or bonds.
Annotating key atoms (or pseudo atoms) with properties such as charge, or electron integrations.
Annotating bonds with lengths, in Å.
Animating molecular vibrations, and annotating the vibration with vectors illustrating the displacement coordinates.

An example of a script used to annotate is given below:

```
load "/nchem/journal/v2/n5/media/nchem.596/C4-BeHe-H-ccsdt.cml";zoom 5;moveto 4 0 2 0 90 100;
connect (atomno=2) (atomno=4) partial;connect (atomno=5) (atomno=6) single;
set measurementUnits Angstroms; measure 2 1;
set fontscaling TRUE; font label 24;select atomno=2;label %A Be;select atomno=1;label %A He;
```

A description of how a dataset can be extracted from its display frame is available.¹⁰

Related datasets and publications

There are further examples of the types of dataset as described here.¹¹ There is also discussion of how such datasets can be used to enhance the scientific presentation.¹²

Data provenance/Author contributions

In this model, the provenance (including date stamp) and authorship of the datasets are formally declared as part of the digital repository entry for each item. Such declarations can also be made in XML-based datasets such as CML as part of the meta-data. These declarations relate to the person in whose name the data was originally generated, and are not necessarily those of the principal investigator or project manager.

Data collection methods or protocol/methodology

The use of job submission and management tools provides an opportunity to structure the data capture process from the inception of the calculation. All data required to provide complete provenance of the eventual publication are recorded, ensuring future reproducibility. Furthermore, it enables the output of the calculation to be transparently reprocessed into the various formats useful for publication (CML, etc.) without additional effort on the user's part. For all the examples given here, the workflow for generating the datasets is formalised in the shape of a job submission portal (Figure 2). This workflow controls submission of a job to the job queuing system for the High Performance Computing resource, and the collection of the job outputs upon completion.

Figure 2. The workflow for data generation.

At this stage, the person who initiated the workflow has the option of publishing these outputs to a digital repository (DSpace for all the examples noted above) at a time of their choosing. An embargo system is used (Figure 3) whereby the user can configure a declared delay (up to a maximum of 999 days) that applies if they choose to take no immediate action. Once this period has elapsed for any specific entry, that entry is automatically published. This component also optionally invokes an RDF declaration (using the FOAF dictionary) of the interests and details of the researcher generating the data. This specification is also deposited in the digital repository, where it can be harvested if needed and used to establish connections with other datasets created either by the same author or by collaborators of that author.
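The embargo rule described above amounts to simple date arithmetic. The sketch below only illustrates that rule and is not the portal's actual implementation; the function names are hypothetical, and it assumes the delay is counted in whole days from the date the job outputs were collected.

```python
from datetime import date, timedelta

# Upper limit on the configurable delay, as stated in the text.
MAX_EMBARGO_DAYS = 999


def publication_date(collected: date, embargo_days: int) -> date:
    """Date on which an entry is published automatically if no action is taken."""
    if not 0 <= embargo_days <= MAX_EMBARGO_DAYS:
        raise ValueError(f"embargo must be between 0 and {MAX_EMBARGO_DAYS} days")
    return collected + timedelta(days=embargo_days)


def is_published(collected: date, embargo_days: int, today: date) -> bool:
    """True once the embargo period has elapsed."""
    return today >= publication_date(collected, embargo_days)
```

For example, an entry collected on 2012-01-01 with a 30-day embargo would be released on 2012-01-31 under these assumptions.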

Figure 3. The embargo control.

The actual methodology used to generate the data takes the form of keywords¹³ driving the Gaussian09 program (corresponding keywords are available for the other programs⁶). A thesaurus for mapping such keywords between programs can be generated.⁶ There is a large measure of consensus on the mathematical descriptions of the procedures used in these programs.

Data processing

The primary inputs and outputs take the form of an input file for the program in question (Gaussian or ORCA) and the logfile this generates. The input file can be easily assembled using a text editor, although approximate Cartesian coordinates for the molecular system being studied will need to be generated. This can be done in several ways:

Using a graphical chemical structure editor. The program used to generate the example datasets described above was Gaussview 5.09.
3D coordinates can also be acquired from databases such as the CCDC¹⁴ where the data are closed, or the PDB,¹⁵ IUCR journals¹⁶ or CrystalEye¹⁷ where the data are open.
3D coordinates can also be generated from 2D diagrams which are rapidly drawn using templates and sketching tools.¹⁸

Some further information has to be added to the input file, such as the spin state of the molecule, the charge on the component being calculated and other factors such as the nature of the solvent simulation required, using a straightforward text editor if need be.
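The assembly of such an input file can also be scripted. The following is a minimal sketch of a Gaussian-style input: an optional checkpoint line, a route line of keywords, a title, the charge and spin multiplicity, then the Cartesian coordinates, each section separated by a blank line. The function name and the default route `# B3LYP/6-31G(d) Opt Freq` are illustrative assumptions and not the theoretical levels used for the datasets described here; a solvent model, if required, would be requested through a further keyword on the route line.

```python
def gaussian_input(title, charge, multiplicity, atoms,
                   route="# B3LYP/6-31G(d) Opt Freq", checkpoint=None):
    """Assemble the text of a Gaussian-style input file.

    `atoms` is a sequence of (element symbol, x, y, z) with coordinates in Angstrom.
    The default route line is an illustrative choice only.
    """
    lines = []
    if checkpoint:
        # Optional Link 0 line naming the checkpoint file.
        lines.append(f"%chk={checkpoint}")
    # Route section, blank line, title, blank line, then charge and multiplicity.
    lines += [route, "", title, "", f"{charge} {multiplicity}"]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol:<2s} {x:12.6f} {y:12.6f} {z:12.6f}")
    # A blank line terminates the molecule specification.
    lines.append("")
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file gives an input that can be inspected or adjusted in a text editor before submission, so the scripted and manual routes remain interchangeable.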

The generated dataset itself is primarily intended to be viewed using either customised closed software such as Gaussview, or a general open package such as Jmol¹⁹ (which requires Java 1.3.1). Some forms of this output are in fact "human readable" text files, designed to be inspected using simple text editors. The specialised codes can in fact be used to "round-trip" the dataset, whereby the output can be used as the input in a further cycle, either by the original creator, or by someone who may have acquired the dataset from e.g. the Journal.

Usage and Tools

The dataset is primarily presented to the reader in the form of a hyperlinked table, as illustrated in the thumbnail above. A mini-tutorial in its use⁹ outlines the basic procedures involved in re-using data presented by the Jmol applet. A comprehensive set of documentation and tutorials is available.¹⁹

Figure 4. The Jmol tool, illustrating its usage to obtain data from a dataset.

Figure 4 shows this process in action. The reader loads the dataset they wish to inspect by invoking the appropriate link in the Table. A menu containing further options can be displayed by a right-mouse click in the display window. This is hierarchical in nature; the red arrow points to a file sub-menu which allows various actions. In this case, a request is being made to save a copy of the file to the local file system. Once the file is saved, the user can invoke any software tool that supports the file type. Alternatively, the files can be acquired (individually or in volume using automated scripts) from the digital repository for data-mining¹⁷ and other operations.

Competing financial interests

The authors have no competing financial interests.

References and notes

1. (a) H. S. Rzepa, "The rational design of helium bonds", Nature Chem., 2010, 2, 390-393. DOI: 10.1038/NCHEM.596 (b) H. S. Rzepa, "The importance of being bonded", Nature Chem., 2009, 1, 510-512. DOI: 10.1038/nchem.373 offers an interactive "data exploratorium", URL: http://www.nature.com/nchem/journal/v1/n7/media/nchem.373_jmol.html
2. Dataset container used in this descriptor "Web Enhanced Table. Calculated properties for molecular compounds of Helium", URL: http://www.nature.com/nchem/journal/v2/n5/media/nchem.596_jmol.html
3. P. Murray-Rust and H. S. Rzepa, "CML: Evolution and Design", J. Cheminformatics, 2011, 3:44. DOI: 10.1186/1758-2946-3-44. URL: http://www.xml-cml.org
4. Chemical Markup Language validator. URL: http://validator.xml-cml.org/
5. (a) Gaussian09: URL: www.gaussian.com; (b) ORCA 2.8: URL: http://www.thch.uni-bonn.de/tc/orca/
6. S. Adams, P. de Castro, P. Echenique, J. Estrada, M. D. Hanwell, P. Murray-Rust, P. Sherwood, J. Thomas and J. A. Townsend, "The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age", J. Cheminformatics, 2011, 3:38. DOI: 10.1186/1758-2946-3-38
7. J. Downing, P. Murray-Rust, A. P. Tonge, P. Morgan, H. S. Rzepa, F. Cotterill, N. Day and M. J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", J. Chem. Inf. Mod., 2008, 48, 1571-1581. DOI: 10.1021/ci7004737
8. The IUPAC International Chemical Identifier (InChI™), URL: http://www.iupac.org/inchi/
9. Jmol: an open-source Java viewer for chemical structures in 3D. URL: http://www.jmol.org/. A full bibliography is available at URL: http://wiki.jmol.org/index.php/Literature
10. H. S. Rzepa, "(re)Use of data from chemical journals", URL: http://www.ch.imperial.ac.uk/rzepa/blog/?p=3154
11. H. S. Rzepa, "(Hyper)activating the chemistry journal", URL: http://www.ch.imperial.ac.uk/rzepa/blog/?p=701
12. H. S. Rzepa, "The past, present and future of Scientific discourse", J. Cheminformatics, 2011, 3:46. DOI: 10.1186/1758-2946-3-46
13. Gaussian09 Users reference, Keyword list: URL: http://www.gaussian.com/g_tech/g_ur/l_keywords09.htm
14. F. H. Allen, "The Cambridge Structural Database: a quarter of a million crystal structures and rising", Acta Cryst., 2002, B58, 380-388. DOI: 10.1107/S0108768102003890. URL: http://www.ccdc.cam.ac.uk/
15. RCSB Protein databank. URL: http://www.pdb.org/pdb/home/home.do
16. International Union of Crystallography. URL: http://journals.iucr.org/
17. N. Day, J. Downing, S. Adams, N. W. England and P. Murray-Rust, "CrystalEye: Automated aggregation, semantification and dissemination of the world’s Open crystallographic data", J. Appl. Cryst., (submitted). URL: http://wwmm.ch.cam.ac.uk/crystaleye/
18. Molecular networks, Online demo CORINA. URL: http://www.molecular-networks.com/online_demos/corina_demo
19. Jmol documentation. URL: http://jmol.sourceforge.net/docs/

M. J. Harvey and H. S Rzepa, "Data descriptors in molecular science: quantum computational simulation", 2012-07-24. URL:http://www.ch.ic.ac.uk/rzepa/data-descriptors/. Accessed: 2012-07-24. (Archived by WebCite® at http://www.webcitation.org/69OD2TqpJ)