Chemical Metadata Standards for the World-Wide Web

Henry S. Rzepa

August, 1996

Metadata is information about data which facilitates the processing and the indexing of the data, and which cannot be discovered by scanning other parts of the document that contains this data. Unlike many proprietary chemical databases and library search mechanisms (Z39.50 protocols) where metadata is a well established and implemented concept (see e.g. http://stas.cnidr.org/STAS.html), few of the 60 million or so networked information objects on the Internet that comprise the World-Wide Web contain any explicit metadata declarations. To quote from a recent W3C development document "Link relationships and meta-information are an underutilized aspect of the web architecture which can be used to extend the expressive capability of web pages and increase the effectiveness of web communications".

The current state of affairs might be illustrated by the results of a keyword search performed using any of the popular Internet search engines on "chemistry". Such a search retrieves around 500,000 "hits" found in text-based (normally HTML) documents. The true number of occurrences is likely to be much larger, if only because there is no obligation on the part of the author of a chemically oriented document to actually insert the word "chemistry" into its header or its body, or in the case of a graphical bitmapped "GIF" image to name it sensibly. The document might be part of a larger collection, and the author might have assumed this context and therefore felt no need to insert "chemistry" into each and every document. Furthermore, the author might use synonyms such as "chemical" or "molecular" instead.

Most of the current generation of Internet search engines nevertheless make some attempt to look for metadata elements in the document header (i.e. for an HTML document, contained between the <HEAD> and </HEAD> tags) before indexing the text-based body of the document, and on this basis try to give some indication of the quality of the hit. However, none of this current generation of search engines is likely to succeed in, e.g., associating molecular bond connection and 2D/3D coordinate data linked to a document with the possibility of performing chemical structure or sub-structure searches on this chemical content. The function of metadata is to clearly establish such associations if they exist.
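The header scan described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library's html.parser module, not a description of how any particular search engine works; the class and function names (HeadMetaParser, extract_head_metadata) are invented here for the example.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect NAME/CONTENT pairs from META elements found inside <HEAD>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = []  # list of (name, content) tuples, in document order

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name is not None and content is not None:
                self.meta.append((name, content))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_head_metadata(html_text):
    """Return the (name, content) pairs declared in the document header."""
    parser = HeadMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.meta


sample = """<HTML><HEAD>
<TITLE>Chemical Metadata</TITLE>
<META NAME="DC.SUBJECT" CONTENT="Chemical Metadata Types">
</HEAD><BODY>
<META NAME="ignored" CONTENT="not in the header">
Some text about chemistry.
</BODY></HTML>"""

print(extract_head_metadata(sample))
```

Only the declaration in the header is returned; an indexer built along these lines would still know nothing about any molecular data linked from the body, which is the gap that chemical metadata is meant to fill.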

From this it becomes clear that a standard protocol for defining and utilising chemical metadata must be an urgent priority for inclusion in all Internet-based documents with molecular content. The OCLC/NCSA Metadata Workshop Report specifies so-called "metadata" entries in document headers as one of a series of steps being taken to improve the description of networked information objects. Known as the Dublin Core proposals for meta-elements, the basic description comprises 13 elements.
Table 1. The Basic Metadata Elements in the Dublin Core Proposal

Core Metadata: Description
Subject: The topic addressed by the work
Title: The name of the object
Author: The person(s) primarily responsible for the intellectual content of the object
Publisher: The agent or agency responsible for making the object available
OtherAgent: The person(s), such as editors and transcribers, who have made other significant intellectual contributions to the work
Date: The date of publication
ObjectType: The genre of the object, such as novel, poem, or dictionary
Form: The physical manifestation of the object, such as an HTML file or file with chemical content
Identifier: String or number used to uniquely identify the object
Relation: Relationship to other objects
Source: Objects, either print or electronic, from which this object is derived, if applicable
Language: Language of the intellectual content
Coverage: The keywords, spatial locations and temporal durations characteristic of the object
In June 1996, a consensus document for implementing the Dublin Core proposals (referred to as DC) within the framework of a document encoded in HTML (HyperText Markup Language) appears to have been agreed. This convention takes the following form:

<META NAME="schema_identifier.element_name" CONTENT="string data">
Thus, a partial Dublin Core citation might be encoded in HTML as follows. Here several of the elements are modified by qualifiers such as TYPE and SCHEME. This follows a suggestion made in the original Proposal from 1995 for an extension of the HTML META descriptor. Whether such extensions are to be followed is currently the subject of considerable discussion.
<HEAD>
<TITLE>Chemical Metadata</TITLE>
<META NAME = "DC.URC" TYPE="CHEMETA" CONTENT = "0.1">
<META NAME = "DC.SUBJECT" CONTENT = "Chemical Metadata Types">
<META NAME = "DC.TITLE" TYPE="MAIN" CONTENT = "A proposal for defining chemical metadata entries 
in document headers to improve the description of networked chemical information objects">
<META NAME = "DC.TITLE" TYPE="SUB" CONTENT = "chemical metadata types">
<META NAME = "DC.AUTHOR" CONTENT = "H. S. Rzepa">
<META NAME = "DC.PUBLISHER" CONTENT = "ICSTM">
<META NAME = "DC.DATE" CONTENT = "1996">
<META NAME = "DC.OBJECTTYPE" CONTENT = "ACS Nomenclature  Committee">
<META NAME = "DC.FORM" SCHEME="IMT" CONTENT = "text/html">
<META NAME = "DC.IDENTIFIER" SCHEME="URL" CONTENT = "http://www.ch.ic.ac.uk/chemime/chemeta.html">
<META NAME = "DC.RELATION" TYPE = "CHILD" SCHEME="URL"  CONTENT = "http://www.ch.ic.ac.uk/chemime/chemeta_organic_chemistry.html">
<META NAME = "DC.RELATION" TYPE = "SIBLING" SCHEME="URL" CONTENT = "http://www.ch.ic.ac.uk/chemime/chemeta_acs.html">
</HEAD>
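
A header of this kind need not be typed by hand. The following Python sketch generates META elements that follow the NAME="schema_identifier.element_name" convention; the function name meta_element and its keyword-argument treatment of qualifiers are assumptions made for this illustration and are not part of the Dublin Core proposal.

```python
from html import escape


def meta_element(schema, element, content, **qualifiers):
    """Render one META element using the NAME="schema.element" convention.

    Qualifiers such as TYPE or SCHEME are the optional extension attributes
    discussed in the text; they are passed as keyword arguments here.
    """
    parts = ['NAME="%s.%s"' % (schema.upper(), element.upper())]
    for key, value in qualifiers.items():
        parts.append('%s="%s"' % (key.upper(), escape(value, quote=True)))
    # Escape quotes and angle brackets so the attribute value stays well formed.
    parts.append('CONTENT="%s"' % escape(content, quote=True))
    return "<META " + " ".join(parts) + ">"


record = [
    ("Author", "H. S. Rzepa", {}),
    ("Date", "1996", {}),
    ("Form", "text/html", {"scheme": "IMT"}),
]

for element, content, qualifiers in record:
    print(meta_element("DC", element, content, **qualifiers))
```

The last line printed is <META NAME="DC.FORM" SCHEME="IMT" CONTENT="text/html">, matching the corresponding entry in the example above apart from the spacing around the equals signs.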

It is judged useful to provide a means for linking to the reference definition of the metadata schema (or schemata) used in a document. Doing so serves as a primitive registration mechanism for metadata schemata, and lays the foundation for a more formal, machine-readable linkage mechanism in the future. The proposed convention for doing so is as follows:

<LINK REL = SCHEMA.schema_identifier HREF="URL">
Thus, the reference description of one metadata scheme, the Dublin Core Metadata Element Set, would be referenced in the LINK HREF as follows:
<LINK REL = SCHEMA.DC HREF = "http://purl.org/metadata/dublin_core">
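
As a rough illustration of how such a registration mechanism might be used by software, the Python sketch below collects the schema declarations made with LINK and reports any schema prefix that appears in a META NAME without a matching declaration. The names SchemaLinkParser and check_schemas are invented for the example, and the CHEM prefix in the sample is purely hypothetical; no such schema is defined in this paper.

```python
from html.parser import HTMLParser


class SchemaLinkParser(HTMLParser):
    """Record LINK REL=SCHEMA.x declarations and the NAMEs of META elements."""

    def __init__(self):
        super().__init__()
        self.schemas = {}     # schema identifier -> reference URL
        self.meta_names = []  # NAME values of all META elements seen

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link":
            rel = attributes.get("rel") or ""
            href = attributes.get("href")
            if rel.upper().startswith("SCHEMA.") and href:
                self.schemas[rel[len("SCHEMA."):].upper()] = href
        elif tag == "meta":
            name = attributes.get("name")
            if name:
                self.meta_names.append(name)


def check_schemas(html_text):
    """Return (declared schemas, sorted list of used but undeclared prefixes)."""
    parser = SchemaLinkParser()
    parser.feed(html_text)
    parser.close()
    used = {name.split(".", 1)[0].upper()
            for name in parser.meta_names if "." in name}
    return parser.schemas, sorted(used - set(parser.schemas))


sample = """<HEAD>
<LINK REL = SCHEMA.DC HREF = "http://purl.org/metadata/dublin_core">
<META NAME = "DC.AUTHOR" CONTENT = "H. S. Rzepa">
<META NAME = "CHEM.EXAMPLE" CONTENT = "hypothetical element">
</HEAD>"""

declared, undeclared = check_schemas(sample)
print(declared)
print(undeclared)
```

Here the DC prefix resolves to its reference URL, while the hypothetical CHEM prefix is flagged because no LINK element declares it.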

Chemical Applications

In the Dublin Core set, the FORM meta-element in particular deserves further discussion. Here, the optional qualifier SCHEME=IMT refers to "Internet Media Types", more commonly known as MIME types. The FORM meta element is a descriptor for the document itself, and in the context of an HTML document, this could only take the value: text/html.

In a separate proposal, we have advocated a number of chemical Internet media types. Few of these have any extensible syntax which would allow metadata elements to be inserted into their header (if they have such an element) or their body. Thus the popular chemical/x-pdb media type does not have a format which allows the Dublin Core metadata elements to be implemented directly. One solution which needs to be developed is to identify all the Internet media types associated not only with a parent document but also with any associated children where direct inclusion of metadata might not be appropriate. Metadata for chemical data files could be linked to the actual data, in a manner similar to that shown above. This would allow an unambiguous association to be made between the content of a chemical document and 2D/3D connection data, analytical data, and other semantically rich forms of information. The precise form that such mapping might take is not currently clear, although it is the subject of much discussion.
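
Since the form of this mapping is still open, the following Python sketch shows only one conceivable first step: listing the children linked from a parent HTML document and guessing a chemical media type for each from its file extension. The extension table is an assumption made for illustration (only chemical/x-pdb is named in this paper), and the names ChildLinkParser and child_media_types are invented here.

```python
from html.parser import HTMLParser
from posixpath import splitext
from urllib.parse import urlparse

# Assumed mapping from file extension to chemical media type, for illustration.
CHEMICAL_TYPES = {
    ".pdb": "chemical/x-pdb",
    ".xyz": "chemical/x-xyz",
    ".mol": "chemical/x-mdl-molfile",
}


class ChildLinkParser(HTMLParser):
    """Collect the HREF targets of all anchor elements in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def child_media_types(html_text):
    """Return (href, media type) pairs for linked files with a known extension."""
    parser = ChildLinkParser()
    parser.feed(html_text)
    parser.close()
    found = []
    for href in parser.hrefs:
        extension = splitext(urlparse(href).path)[1].lower()
        media_type = CHEMICAL_TYPES.get(extension)
        if media_type is not None:
            found.append((href, media_type))
    return found


sample = """<BODY>
<A HREF="caffeine.pdb">3D coordinates</A>
<A HREF="notes.html">Notes</A>
</BODY>"""

print(child_media_types(sample))
```

Guessing from file extensions is fragile, which is part of the argument for declaring such associations explicitly in the metadata of the parent document.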

Conclusion

Currently, almost no use is made of metadata entries in chemical document headers in Internet-based documents encoded in HTML. For this reason, the current state of "chemical" indexing of such documents can only be described as primitive and certainly not easily automated. One possible solution is for the community to adopt a consistent set of meta-elements which help to identify the type of chemical content of these documents.

The purpose of this discussion paper is to introduce the concept of metadata in document headers, and to invite extensions of the Dublin Core proposals for meta-elements which might help to identify such chemical content. Clearly, the implementation of meta-elements in document descriptors such as HTML is the subject of much current discussion, and the final form is not yet settled. Other mappings will be carried out in the future to enable Dublin Core descriptions to be embedded in various image file formats to allow sensible indexing of e.g. figure elements as objects in HTML documents.


New links added subsequently

  1. Dublin Core Meta-data: http://128.253.70.110/DC5/UserGuide3.html
  2. Resource Description Framework: http://www.w3.org/TR/WD-rdf-syntax/

H. S. Rzepa, August 12, 1996 and April 27, 1998.