The current state of affairs can be illustrated by a keyword search for "chemistry" using any of the popular Internet search engines. Such a search retrieves around 500,000 "hits" in text-based (normally HTML) documents. The true number of occurrences is likely to be much larger, if only because the author of a chemically oriented document is under no obligation to insert the word "chemistry" into its header or body or, in the case of a bitmapped "GIF" image, to name the file sensibly. The document might be part of a larger collection, and the author, assuming this context, might feel no need to insert "chemistry" into each and every document. Furthermore, the author might use synonyms such as "chemical" or "molecular" instead.
Most of the current generation of Internet search engines nevertheless make some attempt to look for metadata elements in the document header (i.e. for an HTML document, the content between the <HEAD> and </HEAD> tags) before indexing the text-based body of the document, and on this basis try to give some indication of the quality of the hit. However, none of this generation of search engines is likely to succeed in, for example, recognising that molecular bond connectivity and 2D/3D coordinate data are linked to a document, and hence in offering chemical structure or sub-structure searches on this chemical content. The function of metadata is to establish such associations clearly where they exist.
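As an illustration of the header-first behaviour described above, the following is a minimal sketch of how an indexer might collect NAME/CONTENT pairs from META elements in a document header. It uses Python's standard `html.parser` module; the class name `HeadMetaParser` and the function `extract_head_metadata` are hypothetical names chosen for this example, and no actual search engine is claimed to work this way.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect NAME/CONTENT pairs from META elements inside <HEAD>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = []  # (name, content) tuples in document order

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            fields = dict(attrs)
            if "name" in fields and "content" in fields:
                self.meta.append((fields["name"], fields["content"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_head_metadata(html_text):
    """Return the header metadata of an HTML document as a list of pairs."""
    parser = HeadMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.meta
```

META elements that appear outside the header are ignored, which mirrors the idea that the header is examined before the body is indexed.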
From this it becomes clear that a standard protocol for defining and utilising chemical metadata is an urgent priority, and that such metadata should be included in all Internet-based documents with molecular content. The OCLC/NCSA Metadata Workshop Report specifies so-called "metadata" entries in document headers as one of a series of steps being taken to improve the description of networked information objects. Known as the Dublin Core proposals for meta-elements, the basic description comprises 13 elements.
Core Metadata | Description |
---|---|
Subject: | The topic addressed by the work |
Title: | The name of the object |
Author: | The person(s) primarily responsible for the intellectual content of the object |
Publisher: | The agent or agency responsible for making the object available |
OtherAgent: | The person(s), such as editors and transcribers, who have made other significant intellectual contributions to the work |
Date: | The date of publication |
ObjectType: | The genre of the object, such as novel, poem, or dictionary |
Form: | The physical manifestation of the object, such as an HTML file or file with chemical content |
Identifier: | String or number used to uniquely identify the object |
Relation: | Relationship to other objects |
Source: | Objects, either print or electronic, from which this object is derived, if applicable |
Language: | Language of the intellectual content |
Coverage: | The keywords, spatial locations and temporal durations characteristic of the object |
<META NAME="schema_identifier.element_name" CONTENT="string data">

Thus, a partial Dublin Core citation might be encoded in HTML as follows. Here several of the elements are modified with additional attributes (shown in bold type in the original). This follows a suggestion made in the original Proposal from 1995 for an extension of the HTML META descriptor. Whether such extensions are to be followed is currently the subject of considerable discussion.
<HEAD>
<TITLE>Chemical Metadata</TITLE>
<META NAME = "DC.URC" TYPE="CHEMETA" CONTENT = "0.1">
<META NAME = "DC.SUBJECT" CONTENT = "Chemical Metadata Types">
<META NAME = "DC.TITLE" TYPE="MAIN" CONTENT = "A proposal for defining chemical metadata entries in document headers to improve the description of networked chemical information objects">
<META NAME = "DC.TITLE" TYPE="SUB" CONTENT = "chemical metadata types">
<META NAME = "DC.AUTHOR" CONTENT = "H. S. Rzepa">
<META NAME = "DC.PUBLISHER" CONTENT = "ICSTM">
<META NAME = "DC.DATE" CONTENT = "1996">
<META NAME = "DC.OBJECTTYPE" CONTENT = "ACS Nomenclature Committee">
<META NAME = "DC.FORM" SCHEME="IMT" CONTENT = "text/html">
<META NAME = "DC.IDENTIFIER" SCHEME="URL" CONTENT = "http://www.ch.ic.ac.uk/chemime/chemeta.html">
<META NAME = "DC.RELATION" TYPE = "CHILD" SCHEME="URL" CONTENT = "http://www.ch.ic.ac.uk/chemime/chemeta_organic_chemistry.html">
<META NAME = "DC.RELATION" TYPE = "SIBLING" SCHEME="URL" CONTENT = "http://www.ch.ic.ac.uk/chemime/chemeta_acs.html">
</HEAD>
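A header of this kind could be generated rather than typed by hand. The following is a minimal sketch, assuming each record carries a schema identifier, an element name, a content string and optional TYPE and SCHEME qualifiers; the function names `meta_element` and `render_head` are hypothetical, and the output follows the `schema_identifier.element_name` pattern only as far as it is described in this paper.

```python
from html import escape


def meta_element(schema, element, content, type_=None, scheme=None):
    """Render one META element named schema_identifier.element_name."""
    attrs = [f'NAME="{schema}.{element.upper()}"']
    if type_ is not None:
        attrs.append(f'TYPE="{escape(type_, quote=True)}"')
    if scheme is not None:
        attrs.append(f'SCHEME="{escape(scheme, quote=True)}"')
    # Escape quotes and ampersands so the content cannot break the attribute.
    attrs.append(f'CONTENT="{escape(content, quote=True)}"')
    return "<META " + " ".join(attrs) + ">"


def render_head(title, records):
    """Render a HEAD section from a title and a list of record dictionaries."""
    lines = ["<HEAD>", f"<TITLE>{escape(title)}</TITLE>"]
    lines += [meta_element(**record) for record in records]
    lines.append("</HEAD>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_head("Chemical Metadata", [
        {"schema": "DC", "element": "author", "content": "H. S. Rzepa"},
        {"schema": "DC", "element": "form", "content": "text/html", "scheme": "IMT"},
    ]))
```

Keeping the qualifiers optional reflects the fact that the TYPE and SCHEME extensions were still under discussion: a record without them degrades to the plain NAME/CONTENT form.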
<LINK REL = SCHEMA.schema_identifier HREF="URL">

Thus, the reference description of one metadata scheme, the Dublin Core Metadata Element Set, would be referenced in the LINK HREF as follows:
<LINK REL = SCHEMA.DC HREF = "http://purl.org/metadata/dublin_core">
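To show how such a LINK element could be used by software reading the header, the following minimal sketch maps each schema identifier to the URL of its reference description, and then looks up the scheme for a given META name. The names `SchemaLinkParser`, `find_schemas` and `schema_for` are hypothetical, and matching the `SCHEMA.` prefix without regard to case is an assumption of this example.

```python
from html.parser import HTMLParser

PREFIX = "SCHEMA."


class SchemaLinkParser(HTMLParser):
    """Collect LINK elements of the form REL=SCHEMA.identifier HREF=url."""

    def __init__(self):
        super().__init__()
        self.schemas = {}  # schema identifier -> reference URL

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        fields = dict(attrs)
        rel = fields.get("rel") or ""
        href = fields.get("href")
        if rel.upper().startswith(PREFIX) and href:
            self.schemas[rel[len(PREFIX):].upper()] = href


def find_schemas(html_text):
    """Return a dictionary of schema identifiers declared in the document."""
    parser = SchemaLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.schemas


def schema_for(meta_name, schemas):
    """Return the reference URL for a META name such as DC.TITLE, if declared."""
    identifier, dot, _element = meta_name.partition(".")
    return schemas.get(identifier.upper()) if dot else None
```

With the LINK element shown above, `schema_for("DC.TITLE", ...)` would resolve to the Dublin Core reference URL, whereas a META name without a schema identifier resolves to nothing.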
In a separate proposal, we have advocated a number of chemical Internet media types. Few of these have an extensible syntax that would allow metadata elements to be inserted into their header (if they have one) or their body. Thus the popular chemical/x-pdb media type has no format that allows the Dublin Core metadata elements to be implemented directly. One solution, which still needs to be developed, is to identify all the Internet media types (IMT) associated not only with a parent document but also with any associated children where direct inclusion of metadata might not be appropriate. Metadata for chemical data files could then be linked to the actual data, in a manner similar to that shown above. This would allow an unambiguous association to be made between the content of a chemical document and 2D/3D connection data, analytical data, and other semantically rich forms of information. The precise form that such a mapping might take is not yet clear, although it is the subject of much discussion.
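Since the form of this mapping is still open, the following is only a speculative sketch of one possibility: a separate set of META elements that describes a data file which cannot carry its own header, and that points back to the parent document. The function name `sidecar_metadata`, the `example.org` URLs and the relation type `PARENT` are all assumptions made for illustration; only `CHILD` and `SIBLING` appear in the example header earlier in this paper.

```python
from html import escape


def sidecar_metadata(data_url, media_type, parent_url, title):
    """Return META lines describing a data file with no header of its own."""
    records = [
        # (name, TYPE qualifier, SCHEME qualifier, content)
        ("DC.TITLE", None, None, title),
        ("DC.FORM", None, "IMT", media_type),
        ("DC.IDENTIFIER", None, "URL", data_url),
        ("DC.RELATION", "PARENT", "URL", parent_url),  # TYPE value is assumed
    ]
    lines = []
    for name, type_, scheme, content in records:
        attrs = [f'NAME="{name}"']
        if type_:
            attrs.append(f'TYPE="{type_}"')
        if scheme:
            attrs.append(f'SCHEME="{scheme}"')
        attrs.append(f'CONTENT="{escape(content, quote=True)}"')
        lines.append("<META " + " ".join(attrs) + ">")
    return lines


if __name__ == "__main__":
    for line in sidecar_metadata(
        "http://example.org/caffeine.pdb",   # hypothetical data file
        "chemical/x-pdb",
        "http://example.org/caffeine.html",  # hypothetical parent document
        "Caffeine 3D coordinates",
    ):
        print(line)
```

The DC.FORM entry is what would let an indexer recognise that coordinate data of a particular chemical media type is attached to the parent document.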
The purpose of this discussion paper is to introduce the concept of metadata in document headers, and to invite extensions of the Dublin Core proposals for meta-elements which might help to identify such chemical content. The implementation of meta-elements in document descriptors such as HTML is the subject of much current discussion, and the final form is not yet settled. Other mappings will be carried out in the future to enable Dublin Core descriptions to be embedded in various image file formats, allowing sensible indexing of, for example, figure elements as objects in HTML documents.