A Mechanism for Creating Chemically Oriented Internet Search Channels

Georgios V. Gkoutos and Henry S. Rzepa*

Department of Chemistry, Imperial College of Science, Technology and Medicine, London, SW7 2AY.

One of the many remarkable features of the development of the Internet since 1994 has been the creation of so-called robot indexing and search engines. Starting in 1994, projects such as Lycos¹ took advantage of the semi-structured nature of interlinked documents marked up in languages such as HTML to create a globally searchable index of Internet-based content. Many of these companies have subsequently grown enormously on the basis of this technology, particularly with the recent focus on e-commerce.

Perhaps because of the commercially driven characteristics of this growth, areas which are considered specialist, such as chemical and molecular sciences, have been largely ignored by the commercial robot-based indexing processes. Instead, the concept of server-side driven meta-collections (also referred to as a portal or a one-stop collection) of the more important chemical sites has developed. For example, the large chemical societies tend to offer pages on their own Web sites which in effect collect, organise, and occasionally evaluate the various chemically oriented Internet resources for the benefit of their members. Commercial publishers also offer such "chemical villages". These meta-collections trace their origins back to 1993, when academic services such as ChemDex at Sheffield University² were first established. The Sheffield service currently has approximately 4000 sites collected and organised. This latter site is not based on robot-based indexing and searching, but on a largely automated process involving the processing of information sent to the central site. In practice, of the estimated 4000 chemical sites, a relatively small proportion offer their own dedicated search engine, although this particular attribute appears not to be recorded at ChemDex or any other collection. A characteristic feature of a single large portal of information is the difficulty in maintaining its currency and its comprehensiveness. Unlike databases of explicitly chemical information such as Chemical Abstracts or Beilstein, no generally accepted pre-eminent portal has emerged from this model, and some feel that none should. Rather the tendency has been to develop national portals.

An alternative model to creating large server-side portals is to create a distributed client system for accomplishing the same objective. An interesting technology has recently emerged which offers potential for achieving such a distributed system. It is this model that we implement and evaluate in this article. We have chosen to focus on one proprietary technology known as Sherlock, developed by Apple Computer. Other more or less equivalent implementations have arisen in parallel, which we also briefly cover. The issues of using proprietary technologies will be addressed below.

Search Channels and Search Plug-ins

In October 1998, Apple Computer introduced the Sherlock system.³ This comprises a system-level search interface, which in turn is configured for operation by installing a library of search definition "plugins". The system has several characteristic features.

Each plugin is based on a structured definition file, containing a number of key elements, each with specified attributes. In effect, this is similar in structure to the recently developed extension to the HTML protocols known as XML (eXtensible Markup Language).⁴
Three elements are used in the Sherlock plugins. The parent element is <search> which defines fully the search characteristics (and is nominally equivalent to the <html> </html> container for an HTML document).
The children of <search> are <input> which defines the unique parameters for invoking a search and <interpret> which indicates how the search results should be parsed and presented to the user in a single consistent output format.

We illustrate how such a plugin can be written by reference to Scheme 1.

 Scheme 1. A Representative Sherlock Plug-in.


<search 
    name="Chemistry Department, Imperial College"
    action="http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch"
    method=get
    description = "Search the Chemistry Department, Imperial College"
    update="http://www.ch.ic.ac.uk/chemime/chemdig/chemdig.src.hqx"
    updateCheckDays = 3
    bannerImage = '<img src = "http://www.ch.ic.ac.uk/chemcrest_s.gif">'
    bannerLink = "http://www.ch.ic.ac.uk/"
    routeType="chemical">

<input type=hidden name=config value=origin>
<input name="matchesperpage" value="30">
<input name="words" user>

<interpret
resultListStart="<!-- LIST -->"
resultListEnd="<!-- /LIST -->"
resultItemStart="<!-- ITEM -->"
resultItemEnd="<!-- /ITEM -->"
relevanceStart="<!-- SCORE -->"
relevanceEnd="<!-- /SCORE -->"
>
</search>

The action attribute of the <search> element defines the address of the search engine software at the remote site for which the plugin is being created. A unique feature is that an update address for the plugin file is mandatory, along with a specification of how frequently the site should be checked for updates. Thus distribution of updated versions of the plugin is entirely automatic, and eliminates the need for the user to be pro-active in this regard. The routeType attribute allows specification of a so-called Channel, or collection of similar plugins, into which the newly installed plugin is inserted. We indicate here only a single channel, termed "chemical", but obviously more specific categories could be created depending on demand.

The variables associated with a search are defined using the <input> element. The plugin must predefine precisely all characteristics of the search except for the actual user-supplied search string. The user can specify the value of only one of these variables, which in the example (Scheme 1) is the one labelled "words". Other variables, such as the number of matches per page, and other characteristics of the search, must be pre-specified. This does result in some inflexibility. For example, it is common on many search engines to restrict a search to e.g. "title" or "authors", selected from a menu presented to the user at the time of the search. Currently, the only way to implement such choices is to create separate plugins for each option.
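
The way the preset inputs and the single user-supplied value combine into a request can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a GET request and using the input names from Scheme 1; the function name build_search_url is hypothetical and does not describe how the Sherlock software itself is implemented.

```python
from urllib.parse import urlencode

def build_search_url(action, fixed_inputs, user_field, user_text):
    """Combine the plugin's preset inputs with the single user-supplied value."""
    params = dict(fixed_inputs)      # values fixed in advance by the plugin author
    params[user_field] = user_text   # the one field the user may fill in
    return action + "?" + urlencode(params)

url = build_search_url(
    "http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch",   # action from Scheme 1
    {"config": "origin", "matchesperpage": "30"},       # preset inputs from Scheme 1
    "words",                                            # the input marked "user"
    "pericyclic reaction",
)
print(url)
```

Offering a menu choice such as "title" or "authors" would mean adding a further preset entry to fixed_inputs, which is why each option currently needs its own plugin.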

To establish the required parameters and their values, one can adopt two approaches.
(a) Inspect the URL string that is formulated via the search site entry page. This string is generated using the "GET" method within a <form> declaration in the HTML page invoking the search via a so-called CGI request. Thus a URL of the type

http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch?words=pericyclic&config=echet98

implies two variables, words and config, with their corresponding values passed to the search program htsearch.

(b) If the search string is not visible, possibly because use of the alternative POST method within a <form> declaration invokes the CGI request without displaying the full URL in the browser window, then inspect the source code for the corresponding HTML document. The relevant section is that enclosed by the <form> element:



<form method="post" action="http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch">
<input type="hidden" name="config" value="echet98" />
<input type="text" size="9" name="words" value="" /><br />
</form>
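
Both approaches can be automated. The following Python sketch uses only the standard library and the URL and form fragment quoted above; the names variables_from_url and FormInputs are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs

def variables_from_url(url):
    """Approach (a): read the variables from a GET query string."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}

class FormInputs(HTMLParser):
    """Approach (b): collect the form action and its <input> names and values."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and "name" in attrs:
            self.inputs[attrs["name"]] = attrs.get("value", "")

url_vars = variables_from_url(
    "http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch?words=pericyclic&config=echet98")

form_source = """<form method="post" action="http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch">
<input type="hidden" name="config" value="echet98" />
<input type="text" size="9" name="words" value="" /><br />
</form>"""
parser = FormInputs()
parser.feed(form_source)
print(url_vars, parser.action, parser.inputs)
```

In the form, the input with an empty value ("words") is the natural candidate for the user-supplied field, while the hidden input ("config") becomes a preset value in the plugin.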

Search Output

Another key feature of the plugin is that the search results must be presented (transformed) to the user in a single consistent manner. Since no standards exist for presenting output from a search engine, only a minimal subset of output fields can be easily processed. Here we see much of the concept of XML in operation, in that all important information components are identified and transformed for presentation to the user. In XML, however, this component identification has to follow a strict set of protocols to create a valid document (i.e. one specified exactly by a formal DTD or document type definition) and a well formed document (i.e. one in which all the elements and their attributes are correctly structured). By contrast, the output from most search engines may well be neither well formed nor valid against any available DTD specification. To address this problem, it is useful if the output of a search engine can have additional elements added to help transform the output in a standard manner. In the example of Scheme 1, the output template of the search engine (htsearch) has been modified such that the start and end of the result list and the result item are clearly contained within an element, in this case e.g. <!-- ITEM --> and <!-- /ITEM -->. These in fact correspond to HTML comments, and so are not rendered on the screen of search results presented to the user. The Sherlock software then assumes that any text bounded by this container is the relevant search result. It is further assumed that at least one URL anchor <a>...</a> will be found within these bounds, which can be resolved by the Sherlock software for passing to a Web browser. In fact, this can be a significant chemical limitation, since the <a>...</a> might be replaced by an <object></object>, <embed> or <applet></applet> invocation of e.g. molecular coordinates or other "hyperactive" information. The Sherlock system will currently ignore such information.
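
The delimiter-based extraction described here can be imitated with a short Python sketch. It assumes the comment markers used in Scheme 1 and a made-up sample of search output (the example.org address and the file mol.pdb are placeholders); it mimics the described behaviour and is not the parsing code used by Sherlock.

```python
import re

def extract_items(html, start="<!-- ITEM -->", end="<!-- /ITEM -->"):
    """Return (href, link text) for the first anchor in each delimited item."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    anchor = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.S | re.I)
    results = []
    for body in re.findall(pattern, html, flags=re.S):
        match = anchor.search(body)
        if match:                      # items without an anchor are skipped
            results.append((match.group(1), match.group(2).strip()))
    return results

# Invented sample output: one item with a hyperlink, one with only an <embed>.
sample = """<!-- LIST -->
<!-- ITEM --><a href="http://www.example.org/a.html">Page A</a> text<!-- /ITEM -->
<!-- ITEM --><embed src="mol.pdb"><!-- /ITEM -->
<!-- /LIST -->"""
items = extract_items(sample)
print(items)
```

The second sample item, which carries only an <embed>, yields no result. This reproduces the limitation noted above for molecular content that is not delivered through a hyperlink.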

It is more common however to find that the search engine presents no such containers for its output. An example of how a more unstructured output can be handled is shown in Scheme 2. Here for example, inspection of the HTML source for the search output revealed that each individual item from a search was always preceded by the string <font size="+1"><b>. It is apparent from this that even a slight modification to this output presentation will invalidate the configuration of the plugin. This phenomenon has of course been known for many years to computational chemists who post-process the output from e.g. a modelling program, only to find that subsequent revisions of the program alter the output format such that the post-processor no longer works. With the advent of XML, and the expectation that most search engines will output well formed, and perhaps even valid outputs, the stability of search plugins may be expected to increase.

Scheme 2.


<interpret
    skipLocal = true
    resultItemStart='<font size="+1"><b>'
    resultItemEnd = "</b></font><br>"
    relevanceStart = '<img alt="" src="/images/buttons/blue_button.gif" 
    width=13 height=12 border=0 hspace=0 align=right>'
    relevanceEnd = "%<br clear=right>"
>

The example shown in Scheme 2 also indicates a more finely grained information component, namely the relevance ranking. It is common for most search engines to rank the search hits according to a predefined (but rarely declared) algorithm. With full text indexing, the algorithm may depend on properties such as word proximity, or even key words or phrases derived from a dictionary. With HTML indexing, the algorithm may allocate higher weighting to words found in the document header (e.g. bounded by <title></title> or <meta></meta>) compared to the document body.

The Sherlock program assumes this ranking is expressed as a percentage value within the limits 0 to 100. In theory at least, if the ranking also has an attribute of units, then any other expression of the ranking could be transformed to the percentage scale automatically. Perhaps more controversial is whether the ranking value has any relative significance when normalized against the results of a range of different search engines. Since the algorithm used to derive each ranking is in general not specified in the output, the relative ranking across a range of search engines/plugins may have little meaning. In a chemical context for example, one might specify the "aromaticity" of a given molecule, but unless one knows how each aromaticity index was defined and evaluated originally, then a comparison across different scales would have little meaning.
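
If a search engine declared the scale of its ranking, the transformation to a percentage would be simple. The Python sketch below assumes a linear scale starting at zero; the function name to_percent and the example scales are hypothetical.

```python
def to_percent(score, scale_max):
    """Rescale a score from 0..scale_max to 0..100, clamping the result."""
    if scale_max <= 0:
        raise ValueError("scale_max must be positive")
    return max(0.0, min(100.0, 100.0 * score / scale_max))

print(to_percent(4, 5))        # a hypothetical "4 out of 5 stars" ranking
print(to_percent(750, 1000))   # a hypothetical ranking out of 1000
```

Such a rescaling puts the numbers on a common scale only. It does not make rankings derived from different, undeclared algorithms comparable.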

Since the purpose of the above discussion is to introduce the concept of a chemical channel, it is interesting to speculate how the concept could be extended in this direction. For example, if the search engine actually returned molecular hits, then the output might contain some components of e.g. CML (Chemical Markup Language)⁴ such as:



MoleculeStart='<molecule>' 
MoleculeEnd = "</molecule>"

Molecule attributes such as molecular formula or molecular weight could be included, along with children of the element such as <atom> or <bond>. It would then become possible to display these attributes in a consistent manner using e.g. the Sherlock software in the form of an element tree (element here meaning XML element rather than the chemical term). Currently, such extension is not possible.
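
What such an element tree display might involve can be sketched with the Python standard library. The fragment below is invented for illustration: only the element names molecule, atom and bond come from the text, while the attribute names (id, formula, elementType, atomRefs, order) are assumptions and should not be read as the CML specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of search output; attribute names are assumptions.
fragment = """<molecule id="m1" formula="C6H6">
  <atom id="a1" elementType="C"/>
  <atom id="a2" elementType="C"/>
  <bond atomRefs="a1 a2" order="2"/>
</molecule>"""

def describe(element, depth=0):
    """Return the element tree as indented lines of tag name and attributes."""
    attrs = " ".join(f'{k}="{v}"' for k, v in element.attrib.items())
    lines = ["  " * depth + element.tag + (" " + attrs if attrs else "")]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines

molecule = ET.fromstring(fragment)
tree_lines = describe(molecule)
print("\n".join(tree_lines))
```

This works only because the fragment is well formed XML, which is exactly the property that most current search engine output lacks.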

A Demonstration Chemistry Search Channel

To illustrate the search channel concept, we have created a small collection of Sherlock plugins. These include examples of a typical departmental site, examples of technical manual indices, electronic journal search pages, a general molecular database site, two forms of an electronic conference index, and a plugin to a large index of chemical meta information (Table 1). We also include a plugin for our own site and encourage Webmasters of all sites offering chemical content to consider doing likewise.

Table 1
Plugin	Description
MetaChem	MetaChem, University College (UNSW) ADFA⁵
CCDC	Cambridge Crystallographic Data Centre
MOPAC	MOPAC 2000 Manual
RSC-Journals	The Royal Society of Chemistry. Journals
acs-journals-title	ACS Journals. Title search
acs-journals-author.src	ACS Journals. Author search
ChemFinder	ChemFinder Molecular Search Site⁷
Echet98 molsearch	ECHET98 Conference Molecule Database
Chemdig-IC	Chemistry Department, Imperial College
Chemdig-UK	Chemistry Departments, UK¹¹

These plugins must be installed on the user's computer. On the MacOS system (version 8.6 or higher) they are installed simply by dragging the file to the system folder, whereupon they are automatically transferred to the directory "System Folder:Internet Search Sites:Chemical:". Upon invoking the Find (Sherlock) function from the operating system level, the chemical channel becomes selectable (Figure 1).
Figure 1. The Sherlock Chemical Channel Search interface.
The user then has the option of specifying which individual sites they wish to search, e.g. only the journals index, or perhaps only the molecule properties searches.

The search itself is specified as simple text-based keywords. These can include chemically significant strings such as the molecular formula or a SMILES atom connection descriptor.⁶ Unfortunately, various types of search can be initiated based on the contents of a simple text string. These include a "normal" search, where stemming of the search terms is permitted (i.e. SEARCH will also allow items such as SEARCHES or SEARCHING to be located), exact searches, and boolean searches, where certain words are reserved as search operators. A significant limitation of the Sherlock system (version 2) is that boolean logic operations within the search string are not forwarded to the search engine. This is due in part to the problem that no universal syntax for passing boolean type queries is available, i.e. sometimes the OR operator can be declared with a comma rather than OR, and likewise the AND with a + sign, and the NOT with a -. This syntax would have to be defined in the plugin configuration file, which is currently not supported within Sherlock.
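
A per-plugin declaration of boolean syntax might work along the following lines. This Python sketch is deliberately naive: the engine names and operator tables are hypothetical, and it handles neither quoted phrases nor brackets.

```python
# Hypothetical per-engine operator tables; real engines would need checking.
SYNTAX = {
    "engine_a": {"AND": "AND", "OR": "OR", "NOT": "NOT"},
    "engine_b": {"AND": "+", "OR": ",", "NOT": "-"},
}

def translate(query, engine):
    """Replace the reserved words AND, OR and NOT with the engine's operators."""
    table = SYNTAX[engine]
    return " ".join(table.get(token, token) for token in query.split())

translated = translate("benzene AND aromaticity NOT toluene", "engine_b")
print(translated)
```

With a table of this kind stored in each plugin, a single query typed by the user could be rewritten separately for every search engine in the channel.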

A typical search result is shown in Figure 2. Note that each plugin has an associated icon (displayed on the left). The name displayed is the text enclosed within the first <a></a> container found in the defined item list. Unfortunately, within the design of many search engines, this text is often not particularly descriptive or unique, the context being provided by other text not bounded by the hyperlink reference. In fact it should be possible to recover this context via an HTML declaration of the type <a href="..." title="...">text</a>, in which the value of the title attribute could also be displayed by Sherlock. The site at which the remote document resides is shown on the right of the display. Note that some search engines index only their local content, whilst others (MetaChem in this example) provide links to a variety of remote sites. These links can often point to other meta-collections, which of course implies the possibility of a circular route to any topic!
Figure 2. Search Results from Sherlock using the Chemical Channel.

The next stage involves the user selecting one of the search result items. This results in the display of a logo for the remote search engine site linked to the site itself, and below that the content bounded by the start and end of the list item declarations, along with all <a></a> elements with appropriate hyperlinks. Selecting one of these will invoke a browser and result in the display of the document. As noted above, other elements such as <object>, <embed> and <applet> are not resolved by the Sherlock software.

Other Implementations of Channel Plugins

Whilst the Sherlock search program itself is a proprietary technology, currently available only for MacOS, the plugin syntax is also supported on the Windows operating system via e.g. software called SHO.⁸ In addition, SHO also supports its own proprietary format, an example of which is shown in Scheme 3. Conversion of one plugin format to another is relatively trivial. An example of the SHO search interface is shown in Figure 3.

Scheme 3.

[Plugin]
Author=George Gkoutos
VerInfo=SHO_Plginver_2.0
RevNum=1
[DisplayName]
Value=Search the Imperial Chemistry Site
[UpdateURL]
Value=http://www.ch.ic.ac.uk/chemime/chemdig/chemdigImperial.SHO
[SearchURL]
Value=http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch
[GetOrPost]
Value=GET
[Input]
UserInput=words
MaxReturnVar=20
Var1=config
config=origin
Var2=method
method=and
Var3=format
format=long
Var4=sort
sort=score
[ResultListBegin]
Value=<!-- LIST -->
Occurrence=1
[ResultListEnd]
Value=<!-- /LIST -->
[ResultItemBegin]
Value=<!-- ITEM -->
[ResultItemEnd]
Value=<!-- /ITEM -->
[Field1]
Name=Title
Extract=Text
Start=">
End=</a>
Occurrence=1
[Field2]
Name=Description
Extract=Text
Start=</dt><dd>
End=<b><tt>
[Field3]
Name=Relevance
Extract=Text
Start=<!-- SCORE -->
End=<!-- /SCORE -->
[Field4]
Name=Date
Extract=Text
Start=<font size=-1>
End=</font>
[IconName]
Value=None
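
The claim that conversion between the two plugin formats is straightforward can be illustrated with a short Python sketch. It covers only a subset of the sections shown in Scheme 3 (it omits the [Plugin] header, the update address and the [Field] definitions), and the function name sherlock_to_sho and its argument layout are hypothetical.

```python
def sherlock_to_sho(name, action, method, user_field, fixed_inputs, markers):
    """Render Sherlock-style settings as SHO-style sections (subset only)."""
    lines = ["[DisplayName]", f"Value={name}",
             "[SearchURL]", f"Value={action}",
             "[GetOrPost]", f"Value={method.upper()}",
             "[Input]", f"UserInput={user_field}"]
    # Each preset input becomes a numbered VarN entry plus its value.
    for number, (key, value) in enumerate(fixed_inputs.items(), start=1):
        lines += [f"Var{number}={key}", f"{key}={value}"]
    # Result delimiters map onto sections of the same kind in Scheme 3.
    for section, marker in markers.items():
        lines += [f"[{section}]", f"Value={marker}"]
    return "\n".join(lines)

sho = sherlock_to_sho(
    "Search the Imperial Chemistry Site",
    "http://origin.ch.ic.ac.uk/cgi-bin/new/htsearch",
    "get",
    "words",
    {"config": "origin"},
    {"ResultItemBegin": "<!-- ITEM -->", "ResultItemEnd": "<!-- /ITEM -->"},
)
print(sho)
```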

Another example of this type of concept is a product called WebLynx.⁹ This tool in fact addresses some of the limitations noted above, namely multi-variable selection at search time and correct handling of boolean queries. In general however, it does seem that end-user software capable of invoking multiple searches of remote sites in parallel is still at a very early stage in its evolution, and that the current generation of products, whilst promising, still requires development. Whether standards for this development will evolve, or whether a profusion of different control definitions will emerge, remains to be established.

Conclusions

If one looks at how chemoinformatics is typically implemented in an Internet-based search,¹⁰ it is striking how the proprietary programs and interfaces found prior to around 1997 have very often been replaced by Web-based interfaces. Even highly specialized molecular search interfaces requiring dedicated structure editors are now evolving to Web-based metaphors invoking Web-delivered software based on Java rather than requiring pre-installed custom software. It is also striking how very little standardisation there currently is in this area. To a certain extent the SMILES descriptor⁶ offers some standardisation for describing molecular connection tables, but few of the Web interfaces inter-operate in the sense of being able to submit a single search query to multiple databases. The ChemFinder interface is in fact a very elegant example of "keep it simple", in that a search query of virtually any form can be passed to it, and the relevant form (i.e. text string, compound name, molecular formula etc.) will be automatically recognized and processed.⁷ The search output from ChemFinder also presents much useful meta-information, collecting all the alternative possible sites where further information about the molecular search query might be found.

We have presented in this article a complementary model to the server-based portal of meta-information, in which the meta-collection of chemically relevant sites is implemented on the client side rather than the server side, in the form of a library of chemical search plugins. This library could potentially be organised in various ways. It could be comprehensive (estimated at perhaps 500-1000 sites around the world at the end of 1999), or focused on specific types of searches, e.g. electronic journals only, e-commerce sites, or a collection relevant to a corporate intranet. Within any collection of plugins, the user has the option of selecting all, or indeed only one. The search can function at a system level (and hence via the system, be installed in the standard menus of all application programs). The task of collecting and maintaining the chemistry collection of such plugins might be undertaken by e.g. learned societies, or as a commercial service, whilst specialized chemistry sub-channels run by individuals could also develop.

Although our model presents some new features, we also recognize that at its current level of development there are some significant limitations in its implementation. The most obvious is that plugins can only be created for sites that offer their own search engine. Such engines tend to be associated with larger sites, and are less common on specialized small sites. Many advanced molecular searches involve the user specifying a complex set of alternatives and options to define the search. These variables are obviously specific to each search engine, and so cannot be used in a generic search across multiple search engines. Instead they have to be pre-defined for each search engine. Any search of a chemical channel then must involve a specific set of pre-declared options. To cope with multiple options, separate variants of each plugin must be prepared. For this reason, a channel approach to chemical searching is a complementary tool rather than a replacement for using a dedicated service.

It is also apparent that the standards for post-processing the search outputs needed to implement any significant form of chemical processing do not currently exist. Thus, driven by e-commerce, one can perform a search across multiple engines on a specific item, and return the price of each in the output, but not something as chemically specific as e.g. the molecular formula. It is fairly clear that as the deployment of XML compliant structured documents and searches evolves, such features will become possible to implement. One possible model based on XML for integrating finely grained chemical content into a site for delivery via search engine output is the Chimeral project.¹²

In conclusion, we have seeded the formation of a chemical search channel via the small collection noted above. The task of creating a more comprehensive range for the majority of the globally chemically significant search engines could be a distributed one, involving e.g. the administrator of each site. Most significantly, this concept introduces a new model for information retrieval which appears capable of significant development in the future.

Acknowledgements. One of us (GVG) thanks Merck Sharp and Dohme and the EPSRC for the award of a scholarship.

References

1. See http://www.lycos.com/. This Web-based Internet search engine has been discussed in a number of articles; H. V. Leighton and J. Srivastava, J. Amer. Soc. Inf. Science, 1999, 50, 870-881; R. Green and S. Pant, Commun. ACM, 1999, 42, 70.
2. M. J. Winter; http://www.chemdex.org/
3. See http://www.apple.com/sherlock/
4. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928.
5. A. Arnold, the MetaChem Project: http://metachem.ch.adfa.edu.au/
6. See http://www.daylight.com/dayhtml/smiles/
7. See http://www.chemfinder.com/. For a discussion of this search engine, see J. S. Brecher, Chimia, 1999, 52, 658.
8. See http://www.3potato4.com/
9. See http://lcms.biomed.hawaii.edu/alohasoft/
10. H. S. Rzepa; http://www.ch.ic.ac.uk/local/it/ for a taught course illustrating a diverse collection of chemical search sites, including links to the Web interfaces.
11. G. V. Gkoutos and H. S. Rzepa, to be submitted.
12. P. Murray-Rust, H. S. Rzepa and M. Wright, Chem. Commun., submitted for publication.