Perhaps because this growth has been commercially driven, areas considered specialist, such as the chemical and molecular sciences, have been largely ignored by the commercial robot-based indexing processes. Instead, the concept of server-side meta-collections (also referred to as portals or one-stop collections) of the more important chemical sites has developed. For example, the large chemical societies tend to offer pages on their own Web sites which in effect collect, organise, and occasionally evaluate the various chemically oriented Internet resources for the benefit of their members. Commercial publishers also offer such "chemical villages". These meta-collections date back to 1993, when academic services such as ChemDex at Sheffield University2 were first established. The Sheffield service currently has approximately 4000 sites collected and organised. It is based not on robot-based indexing and searching, but on a largely automated process that handles information sent to the central site. In practice, of the estimated 4000 chemical sites, a relatively small proportion offer their own dedicated search engine, although this particular attribute appears not to be recorded at ChemDex or any other collection. A characteristic difficulty of a single large portal of information is maintaining its currency and its comprehensiveness. Unlike databases of explicitly chemical information such as Chemical Abstracts or Beilstein, no generally accepted pre-eminent portal has emerged from this model, and some feel that none should. Rather, the tendency has been to develop national portals.
An alternative to creating large server-side portals is to create a distributed client system that accomplishes the same objective. One interesting technology has recently emerged which offers potential for achieving such a distributed system. It is this model that we implement and evaluate in this article. We have chosen to focus on one proprietary technology, known as Sherlock and developed by Apple Computer. Other more or less equivalent implementations have arisen in parallel, which we also briefly cover. The issues of using proprietary technologies will be addressed below.
We illustrate how such a plugin can be written by reference to Scheme 1.
Scheme 1. A Representative Sherlock Plug-in.
The action attribute of the <search> element defines the address of the search engine software at the remote site for which the plugin is being created. A unique feature is that an update address for the plugin file is mandatory, along with a specification of how frequently the site should be checked for updates. Thus distribution of updated versions of the plugin is entirely automatic, and eliminates the need for the user to be pro-active in this regard. The routeType attribute allows specification of a so-called Channel, or collection of similar plugins, into which the newly installed plugin is inserted. We indicate here only a single channel, termed "chemical", but obviously more specific categories could be created depending on demand.
<search
    name="Chemistry Department, Imperial College"
    action="http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch"
    method=get
    description = "Search the Chemistry Department, Imperial College"
    update="http://www.ch.ic.ac.uk/chemime/chemdig/chemdig.src.hqx"
    updateCheckDays = 3
    bannerImage = '<img src = "http://www.ch.ic.ac.uk/chemcrest_s.gif">'
    bannerLink = "http://www.ch.ic.ac.uk/"
    routeType="chemical">
<input type=hidden name=config value=origin>
<input name="matchesperpage" value="30">
<input name="words" user>
<interpret
    resultListStart="<!-- LIST -->"
    resultListEnd="<!-- /LIST -->"
    resultItemStart="<!-- ITEM -->"
    resultItemEnd="<!-- /ITEM -->"
    relevanceStart="<!-- SCORE -->"
    relevanceEnd="<!-- /SCORE -->"
>
</search>
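The automatic update check described for Scheme 1 amounts to a simple date comparison. The following is a minimal Python sketch of that idea, not Sherlock's actual logic: the function name `update_due` and the notion of a stored "last checked" date are illustrative assumptions, as is the choice that a check becomes due once the stated number of days has fully elapsed.

```python
from datetime import date, timedelta


def update_due(last_checked: date, update_check_days: int, today: date) -> bool:
    """Return True once update_check_days have elapsed since last_checked.

    Hypothetical helper: Sherlock's real scheduling rule is not documented here.
    """
    return today - last_checked >= timedelta(days=update_check_days)


# updateCheckDays = 3, as in Scheme 1
print(update_due(date(1999, 12, 1), 3, date(1999, 12, 4)))  # True
print(update_due(date(1999, 12, 1), 3, date(1999, 12, 2)))  # False
```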
The variables associated with a search are defined using the <input /> element. The plugin must predefine precisely all characteristics of the search except for the actual user-supplied search string. The user can specify the value of only one of these variables, which in the example (Scheme 1) is that labelled "words". Other variables, such as the number of matches per page, and other characteristics of the search, must be pre-specified. This does result in some inflexibility. For example, it is common on many search engines to restrict a search to e.g. "title" or "authors", selected from a menu presented to the user at the time of the search. Currently, the only way to implement such choices is to create separate plugins for each option.
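To make the role of the pre-specified and user-supplied variables concrete, the sketch below assembles the GET request that the inputs in Scheme 1 would imply. It is an illustration in Python rather than a description of how Sherlock is implemented; the function name `build_search_url` is an assumption, and only the action address and variable names are taken from Scheme 1.

```python
from urllib.parse import urlencode


def build_search_url(action, fixed_inputs, user_input_name, query):
    """Combine the plugin's fixed variables with the single user-supplied one."""
    params = dict(fixed_inputs)      # pre-specified by the plugin author
    params[user_input_name] = query  # the only value the user controls
    return action + "?" + urlencode(params)


url = build_search_url(
    "http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch",
    {"config": "origin", "matchesperpage": "30"},
    "words",
    "pericyclic reaction",
)
print(url)
# http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch?config=origin&matchesperpage=30&words=pericyclic+reaction
```

Offering a "title" or "authors" restriction would, under this model, mean shipping a second plugin whose fixed inputs differ by one entry.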
To establish the required parameters and their values, one can adopt two approaches.
(a) Inspect the URL string that is formulated via the search site entry page. This string is generated when the <form> declaration in the HTML page uses the "GET" method to invoke the search via a so-called CGI request. Thus a URL of the type
http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch?words=pericyclic&config=echet98
implies two variables, words and config, with their corresponding values passed to the search program htsearch.
(b) If the search string is not visible, possibly because use of the alternative POST method within a <form> declaration invokes the CGI request without displaying the full URL in the browser window, then the source code for the corresponding HTML document must be inspected. The relevant section is that enclosed by the <form> element:
<form method="post" action="http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch">
<input type="hidden" name="config" value="echet98" />
<input type="text" size="9" name="words" value="" /><br />
</form>
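Both approaches can be automated to some degree. The following Python sketch, using only the standard library, recovers the variables from a GET-style URL and from a <form> declaration. The names `params_from_url` and `FormInputs` are illustrative assumptions; the URL and the form are the examples given above.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlsplit


def params_from_url(url):
    """Approach (a): split a search URL into its action and its variables."""
    parts = urlsplit(url)
    action = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return action, dict(parse_qsl(parts.query))


class FormInputs(HTMLParser):
    """Approach (b): collect the action, method and inputs of a <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.action = attributes.get("action")
            self.method = (attributes.get("method") or "get").lower()
        elif tag == "input" and attributes.get("name"):
            self.inputs[attributes["name"]] = attributes.get("value") or ""


action, params = params_from_url(
    "http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch?words=pericyclic&config=echet98"
)
print(action, params)

FORM_HTML = """<form method="post" action="http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch">
<input type="hidden" name="config" value="echet98" />
<input type="text" size="9" name="words" value="" /><br />
</form>"""

form = FormInputs()
form.feed(FORM_HTML)
form.close()
print(form.method, form.action, form.inputs)
```

An input with an empty value (here "words") is the likely candidate for the user-supplied variable; the remainder would become fixed inputs of the plugin.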
It is more common, however, to find that the search engine presents no such containers for its output. An example of how a more unstructured output can be handled is shown in Scheme 2. Here, for example, inspection of the HTML source for the search output revealed that each individual item from a search was always preceded by the string <font size="+1"><b>. It is apparent from this that even a slight modification to this output presentation will invalidate the configuration of the plugin. This phenomenon has of course been known for many years to computational chemists who post-process the output from e.g. a modelling program, only to find that subsequent revisions of the program alter the output format such that the post-processor no longer works. With the advent of XML, and the expectation that most search engines will produce well-formed, and perhaps even valid, output, the stability of search plugins may be expected to increase.
Scheme 2.
The example shown in Scheme 2 also indicates a more finely grained information component, namely the relevance ranking. It is common for search engines to rank the search hits according to a predefined (but rarely declared) algorithm. With full text indexing, the algorithm may depend on properties such as word proximity, or even key words or phrases derived from a dictionary. With HTML indexing, the algorithm may allocate higher weighting to words found in the document header (e.g. bounded by <title></title> or <meta></meta>) compared to the document body.
<interpret
    skipLocal = true
    resultItemStart='<font size="+1"><b>'
    resultItemEnd = "</b></font><br>"
    relevanceStart = '<img alt="" src="/images/buttons/blue_button.gif" width=13 height=12 border=0 hspace=0 align=right>'
    relevanceEnd = "%<br clear=right>"
>
The Sherlock program assumes this ranking is expressed as a percentage value between 0 and 100. In theory at least, if the ranking also carried an attribute specifying its units, then any other expression of the ranking could be transformed to the percentage scale automatically. Perhaps more controversial is whether the ranking value has any relative significance when normalized against the results of a range of different search engines. Since the algorithm used to derive each ranking is in general not specified in the output, the relative ranking across a range of search engines/plugins may have little meaning. In a chemical context for example, one might specify the "aromaticity" of a given molecule, but unless one knows how each aromaticity index was defined and evaluated originally, a comparison across different scales would have little meaning.
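The delimiter-based interpretation discussed above, and its fragility, can be illustrated with a short Python sketch. This is not Sherlock's parser: the helper names and the sample page are invented for illustration, the delimiters are those of Scheme 1, and we assume for simplicity that the relevance markers lie inside each item.

```python
def between(text, start, end, pos=0):
    """Return (text between start and end, index after end), or (None, pos)."""
    i = text.find(start, pos)
    if i == -1:
        return None, pos
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None, pos
    return text[i:j], j + len(end)


def extract_items(page, item_start, item_end, rel_start, rel_end):
    """Cut the page into items and read a 0-100 relevance from each, if present."""
    hits, pos = [], 0
    while True:
        item, next_pos = between(page, item_start, item_end, pos)
        if item is None:
            break
        pos = next_pos
        score_text, _ = between(item, rel_start, rel_end)
        relevance = None
        if score_text is not None:
            try:
                # Clamp to the percentage range that Sherlock assumes.
                relevance = max(0.0, min(100.0, float(score_text.strip())))
            except ValueError:
                relevance = None
        hits.append({"raw": item, "relevance": relevance})
    return hits


# Invented sample output using the comment markers of Scheme 1.
PAGE = (
    "<!-- ITEM --><a href='a.html'>Pericyclic reactions</a> "
    "<!-- SCORE -->87<!-- /SCORE --><!-- /ITEM -->"
    "<!-- ITEM --><a href='b.html'>Aromaticity</a> "
    "<!-- SCORE -->42<!-- /SCORE --><!-- /ITEM -->"
)
MARKERS = ("<!-- ITEM -->", "<!-- /ITEM -->", "<!-- SCORE -->", "<!-- /SCORE -->")

hits = extract_items(PAGE, *MARKERS)
print([hit["relevance"] for hit in hits])  # [87.0, 42.0]

# A trivial change to the site's output silently breaks the plugin.
changed = PAGE.replace("<!-- ITEM -->", "<!-- item -->")
print(extract_items(changed, *MARKERS))  # []
```

Note that clamping to the 0 to 100 range says nothing about whether scores from different engines are comparable; it only enforces the scale.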
Since the purpose of the above discussion is to introduce the concept of a chemical channel, it is interesting to speculate how the concept below could be extended in this direction. For example, if the search engine actually returned molecular hits, then the output might contain some components of e.g. CML (Chemical Markup Language)4 such as:
MoleculeStart='<molecule>' MoleculeEnd = "</molecule>"
Molecule attributes such as molecular formula or molecular weight could be included, along with children of the element such as <atom> or <bond>. It would then become possible to display these attributes in a consistent manner using e.g. the Sherlock software in the form of an element tree (element here meaning XML element rather than the chemical term). Currently, such extension is not possible.
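As a purely speculative illustration of the element-tree display envisaged here, the Python sketch below walks a <molecule> fragment and lists its elements and attributes. The fragment is invented: the attribute names (formula, weight, elementType, atomRefs, order) are placeholders chosen for this example and should not be read as the CML vocabulary, and nothing of the kind is supported by Sherlock.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; attribute names are illustrative, not taken from CML.
FRAGMENT = """<molecule formula="C6H6" weight="78.11">
  <atom id="a1" elementType="C"/>
  <atom id="a2" elementType="C"/>
  <bond atomRefs="a1 a2" order="2"/>
</molecule>"""


def element_tree_lines(element, depth=0):
    """Return one indented line per XML element, with its attributes."""
    attrs = " ".join(f'{key}="{value}"' for key, value in element.attrib.items())
    lines = ["  " * depth + element.tag + (f" [{attrs}]" if attrs else "")]
    for child in element:
        lines.extend(element_tree_lines(child, depth + 1))
    return lines


tree_lines = element_tree_lines(ET.fromstring(FRAGMENT))
print("\n".join(tree_lines))
```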
Table 1

| Plugin | Description |
|---|---|
| MetaChem | MetaChem, University College (UNSW) ADFA5 |
| CCDC | Cambridge Crystallographic Data Centre |
| MOPAC | MOPAC 2000 Manual |
| RSC-Journals | The Royal Society of Chemistry. Journals |
| acs-journals-title | ACS Journals. Title search |
| acs-journals-author.src | ACS Journals. Author search |
| ChemFinder | ChemFinder Molecular Search Site7 |
| Echet98 molsearch | ECHET98 Conference Molecule Database |
| Chemdig-IC | Chemistry Department, Imperial College |
| Chemdig-UK | Chemistry Departments, UK11 |
These plugins must be installed on the user's computer. On the MacOS system (version 8.6 or higher) they are installed simply by dragging the file to the System Folder, whereupon they are automatically transferred to the directory "System Folder:Internet Search Sites:Chemical:". Upon invoking the Find (Sherlock) function at the operating system level, the chemical channel becomes selectable (Figure 1).

The user then has the option of specifying which individual sites they
wish to search, e.g. only the journals index, or perhaps only the molecule
properties searches.
The search itself is specified as simple text-based keywords. These can include chemically significant strings such as the molecular formula or a SMILES atom connection descriptor.6 Unfortunately, various types of search can be initiated based on the contents of a simple text string. These include a "normal" search, where stemming of the search terms is permitted (i.e. SEARCH will also allow items such as SEARCHES or SEARCHING to be located), exact searches, and boolean searches, where certain words are reserved as search operators. A significant limitation of the Sherlock system (version 2) is that boolean logic operations within the search string are not forwarded to the search engine. This is due in part to the lack of a universal syntax for passing boolean queries, i.e. sometimes the OR operator is declared with a comma rather than OR, and likewise the AND with a + sign, and the NOT with a -. This syntax would have to be defined in the plugin configuration file, which is currently not supported within Sherlock.
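Were the boolean syntax of each engine declared in its plugin, translating a query would be straightforward. The following Python sketch illustrates the idea under stated assumptions: the two syntax tables and the function `translate_query` are hypothetical (no such declaration exists in the Sherlock plugin format), and the query is assumed to arrive already split into terms and upper-case reserved operators.

```python
OPERATORS = ("AND", "OR", "NOT")

# Hypothetical per-engine declarations of how each operator is written.
WORD_SYNTAX = {"AND": " AND ", "OR": " OR ", "NOT": " NOT "}
SYMBOL_SYNTAX = {"AND": " +", "OR": ", ", "NOT": " -"}


def translate_query(tokens, syntax):
    """Rewrite a tokenised boolean query in one engine's operator syntax."""
    parts = []
    previous_was_term = False
    for token in tokens:
        if token in OPERATORS:
            parts.append(syntax[token])
            previous_was_term = False
        else:
            if previous_was_term:
                parts.append(" ")  # adjacent terms stay separated by a space
            parts.append(token)
            previous_was_term = True
    return "".join(parts)


query = ["benzene", "AND", "aromaticity", "NOT", "furan"]
print(translate_query(query, WORD_SYNTAX))    # benzene AND aromaticity NOT furan
print(translate_query(query, SYMBOL_SYNTAX))  # benzene +aromaticity -furan
```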
A typical search result is shown in Figure 2. Note that each plugin has an associated icon (displayed on the left). The name displayed is the text enclosed within the first <a></a> container in the defined item list. Unfortunately, within the design of many search engines, this text is often not particularly descriptive or unique, the context being provided by other text not bounded by the hyperlink reference. In fact it should be possible to recover this context via an HTML declaration of the type <a href="..." title="...">text</a>, in which the value of the title attribute could also be displayed by Sherlock. The site at which the remote document resides is shown on the right of the display. Note that some search engines index only their local content, whilst others (MetaChem in this example) provide links to a variety of remote sites. These links can often point to other meta-collections, which of course implies the possibility of a circular route to any topic!
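The suggestion that the title attribute could supply a more descriptive name can be sketched as follows. This Python example describes a possible behaviour, not what Sherlock does: the class `FirstLink`, the function `display_name` and the sample items are all illustrative assumptions.

```python
from html.parser import HTMLParser


class FirstLink(HTMLParser):
    """Record the href, title attribute and text of the first <a> element."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.title = None
        self.text = ""
        self._state = "before"  # before -> inside -> done

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self._state == "before":
            attributes = dict(attrs)
            self.href = attributes.get("href")
            self.title = attributes.get("title")
            self._state = "inside"

    def handle_data(self, data):
        if self._state == "inside":
            self.text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._state == "inside":
            self._state = "done"


def display_name(item_html):
    """Prefer the title attribute, falling back to the link text."""
    parser = FirstLink()
    parser.feed(item_html)
    parser.close()
    return parser.title or parser.text.strip()


# Invented result items for illustration.
print(display_name('<a href="/p/12.html">Paper 12</a> Pericyclic reactions'))
print(display_name('<a href="/p/12.html" title="Pericyclic reactions">Paper 12</a>'))
```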

Whilst the Sherlock search program itself is a proprietary technology, currently available only for MacOS, the plugin syntax is also supported on the Windows operating system via e.g. software called SHO.8 In addition, SHO supports its own proprietary format, an example of which is shown in Scheme 3. Conversion of one plugin format to another is relatively trivial. An example of the SHO search interface is shown in Figure 3.
Scheme 3.
[Plugin]
Author=George Gkoutos
VerInfo=SHO_Plginver_2.0
RevNum=1
[DisplayName]
Value=Search the Imperial Chemistry Site
[UpdateURL]
Value=http://www.ch.ic.ac.uk/chemime/chemdig/chemdigImperial.SHO
[SearchURL]
Value=http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch
[GetOrPost]
Value=GET
[Input]
UserInput=words
MaxReturnVar=20
Var1=config
config=origin
Var2=method
method=and
Var3=format
format=long
Var4=sort
sort=score
[ResultListBegin]
Value=<!-- LIST -->
Occurrence=1
[ResultListEnd]
Value=<!-- /LIST -->
[ResultItemBegin]
Value=<!-- ITEM -->
[ResultItemEnd]
Value=<!-- /ITEM -->
[Field1]
Name=Title
Extract=Text
Start=">
End=</a>
Occurrence=1
[Field2]
Name=Description
Extract=Text
Start=</dt><dd>
End=<b><tt>
[Field3]
Name=Relevance
Extract=Text
Start=<!-- SCORE -->
End=<!-- /SCORE -->
[Field4]
Name=Date
Extract=Text
Start=<font size=-1>
End=</font>
[IconName]
Value=None
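The conversion between the two formats can be sketched as a simple mapping of fields. The Python example below assumes the Sherlock plugin has already been read into a dictionary (the parsing step is omitted) and emits only a subset of the sections seen in Scheme 3; the [Plugin] header and the [FieldN] sections are left out. The function `to_sho` and the dictionary layout are illustrative assumptions, not part of either product.

```python
# Sherlock <interpret> attribute -> SHO section name, as seen in Schemes 1 and 3.
SHERLOCK_TO_SHO = [
    ("resultListStart", "ResultListBegin"),
    ("resultListEnd", "ResultListEnd"),
    ("resultItemStart", "ResultItemBegin"),
    ("resultItemEnd", "ResultItemEnd"),
]


def to_sho(plugin):
    """Emit a partial SHO-style description from Sherlock-style fields."""
    lines = [
        "[DisplayName]", "Value=" + plugin["description"],
        "[UpdateURL]", "Value=" + plugin["update"],
        "[SearchURL]", "Value=" + plugin["action"],
        "[GetOrPost]", "Value=" + plugin["method"].upper(),
        "[Input]", "UserInput=" + plugin["user_input"],
    ]
    for number, (name, value) in enumerate(plugin["fixed_inputs"].items(), start=1):
        lines += [f"Var{number}={name}", f"{name}={value}"]
    for sherlock_key, sho_section in SHERLOCK_TO_SHO:
        lines += [f"[{sho_section}]", "Value=" + plugin[sherlock_key]]
    return "\n".join(lines)


# Values taken from Scheme 1. A real conversion would presumably also need
# the update address to point at a SHO-format file (an assumption).
PLUGIN = {
    "description": "Search the Chemistry Department, Imperial College",
    "update": "http://www.ch.ic.ac.uk/chemime/chemdig/chemdig.src.hqx",
    "action": "http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch",
    "method": "get",
    "user_input": "words",
    "fixed_inputs": {"config": "origin", "matchesperpage": "30"},
    "resultListStart": "<!-- LIST -->",
    "resultListEnd": "<!-- /LIST -->",
    "resultItemStart": "<!-- ITEM -->",
    "resultItemEnd": "<!-- /ITEM -->",
}

sho_text = to_sho(PLUGIN)
print(sho_text)
```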
Another example of this type of concept is a product called WebLynx.9 This tool in fact addresses some of the limitations noted above, namely multi-variable selection at search time and correct handling of boolean queries. In general however, it does seem that end-user software capable of invoking multiple searches of remote sites in parallel is still at a very early stage in its evolution, and that the current generation of products, whilst promising, still requires development. Whether standards for this development will evolve, or whether a profusion of different control definitions will emerge, remains to be established.
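The core operation of such software, namely submitting one query to several plugins in parallel and tolerating individual failures, can be sketched in a few lines of Python. The function `search_all`, the stub `fake_fetch` and the plugin dictionaries are illustrative assumptions; a real client would replace the stub with an HTTP request and the result parsing defined by each plugin.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(plugins, query, fetch):
    """Run fetch(plugin, query) for every plugin concurrently.

    Returns {plugin name: result list, or the exception that was raised}.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(plugins))) as pool:
        futures = {
            name: pool.submit(fetch, plugin, query)
            for name, plugin in plugins.items()
        }
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as error:  # one failing site must not stop the rest
                results[name] = error
    return results


def fake_fetch(plugin, query):
    """Stand-in for a network request (hypothetical, no real site is contacted)."""
    if plugin.get("offline"):
        raise ConnectionError("site unreachable")
    return [f"{plugin['label']}: hit for {query}"]


# Plugin names from Table 1; the labels and the offline flag are invented.
PLUGINS = {
    "Chemdig-IC": {"label": "Imperial"},
    "CCDC": {"label": "CCDC", "offline": True},
}

results = search_all(PLUGINS, "pericyclic", fake_fetch)
print(results["Chemdig-IC"])        # ['Imperial: hit for pericyclic']
print(type(results["CCDC"]).__name__)  # ConnectionError
```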
We have presented in this article a complementary model to the server-based portal of meta-information, in which the meta-collection of chemically relevant sites is implemented on the client side rather than the server side, in the form of a library of chemical search plugins. This library could potentially be organised in various ways. It could be comprehensive (estimated at perhaps 500-1000 sites around the world at the end of 1999), or focused on specific types of searches, e.g. electronic journals only, e-commerce sites, or a collection relevant to a corporate intranet. Within any collection of plugins, the user has the option of selecting all, or indeed only one. The search can function at a system level (and hence, via the system, be installed in the standard menus of all application programs). The task of collecting and maintaining the chemistry collection of such plugins might be undertaken by e.g. learned societies, or as a commercial service, whilst specialized chemistry sub-channels run by individuals could also develop.
Although our model presents some new features, we also recognize that at its current level of development there are some significant limitations in its implementation. The most obvious is that plugins can only be created for sites that offer their own search engine; such engines tend to be associated with larger sites and are less common on small specialized sites. Many advanced molecular searches involve the user specifying a complex set of alternatives and options to define the search. These variables are obviously specific to each search engine, and so cannot be used in a generic search across multiple search engines. Instead they have to be pre-defined for each search engine. Any search of a chemical channel must then involve a specific set of pre-declared options. To cope with multiple options, separate variants of each plugin must be prepared. For this reason, a channel approach to chemical searching is a complementary tool rather than a replacement for using a dedicated service.
It is also apparent that the standards for post-processing the search outputs needed to implement any significant form of chemical processing do not currently exist. Thus, driven by e-commerce, one can perform a search across multiple engines on a specific item, and return the price of each in the output, but not something as chemically specific as e.g. the molecular formula. It is fairly clear that as the deployment of XML-compliant structured documents and searches evolves, such features will become possible to implement. One possible XML-based model for integrating finely grained chemical content into a site for delivery via search engine output is the Chimeral project.12
In conclusion, we have seeded the formation of a chemical search channel via the small collection noted above. The task of creating a more comprehensive range for the majority of the globally chemically significant search engines could be a distributed one, involving e.g. the administrator of each site. Most significantly, this concept introduces a new model for information retrieval which appears capable of significant development in the future.
Acknowledgements. One of us (GVG) thanks Merck Sharp and Dohme and the EPSRC for the award of a scholarship.