Building QSAR Models on the Basis of Feature Space Search and Selection

Kumskov M.I.

Zelinskii Institute of Organic Chemistry, Russian Academy of Sciences,
Leninskii Prosp., 47, 117913, Russia, Moscow
fax: (095) 135-53-28, E-mail: kumskov@cacr.ioc.ac.ru
Abstract. We propose to solve the classification task for structural objects such as molecules by searching for feature spaces (in the class of structural spectra) that are adequate to a physicochemical or biological molecular property.

At present, computer support is needed for the classification of objects such as molecules (the QSAR task). Known QSAR programs describe molecules (as feature vectors) using lists of features given in advance [1-3]. Conventional statistical data-processing programs (linear regression analysis, factor analysis, cluster analysis, etc.) are then applied to study and analyze the resulting "molecule-feature" table [4]. This paper describes a computer-implemented method that searches for an adequate level of molecular representation for the QSAR task on the basis of automatically generated structural molecular descriptors.

In general, the QSAR task can be defined as follows. We have a "training" (learning) structural database (SDB) containing both the structures of chemical molecules and their activity data: the i-th record contains the pair ( Gi, Ui ), where Gi is the molecular graph (or the atom connectivity table) [3] and Ui is the numerical value of the experimental activity of the i-th substance. It is required to choose a feature space describing the SDB molecules that generates, in the class of linear functions, the QSAR model with the best predictive power [4]. Thus, the QSAR task includes two independent stages: the stage of choosing feature spaces and the stage of searching for predictive QSAR models (on the basis of the "molecule-feature" table obtained at the first stage). Because the SDB is not given in advance in the form of a "molecule-feature" table, the task of choosing a method of feature formation arises. The situation is similar to that in image processing and recognition, where the task of an adequate computer representation of images also exists. The type of feature space is determined by the level of complexity at which the structural objects (images or molecules) are represented in the computer.
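
As one possible reading of the "linear class of functions", the model sought at the second stage can be written as below. The notation is introduced here only for illustration and is not taken from the original text: x_{ik} is the value of the k-th feature for the i-th molecule, w_k are the weight coefficients, m is the number of features, and n is the number of SDB records. The weights would typically be fitted by linear regression, so that the predicted value is as close as possible to the experimental activity Ui.

```latex
\hat{U}_i = w_0 + \sum_{k=1}^{m} w_k \, x_{ik}, \qquad i = 1, \dots, n
```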

Multilevel Molecule Representation. The molecular structure can be defined at several levels of detail: topological; 2D-topological (with a planar projection of the molecular graph vertices defined); three-dimensional (3D), with the minimum-energy conformation; and 3D with additional account of spatial electrostatic potentials. It is not known in advance at which level the molecules (stored in the SDB only as atom connection tables) should be described for the QSAR analysis of a particular property. For example, it is known that accounting for the 3D molecular structure plays an important role in predicting biological properties. We propose a uniform method for forming and selecting molecular feature spaces (based on different representation levels). Structural descriptors are constructed in two stages. First, all "base fragments" (or specific points) of a given type are enumerated; then the descriptors describing the mutual arrangement of the base fragments in the molecule (at the representation level under study) are formed. The general outline of this approach has the following steps [5-10]:

1. The molecule is considered successively as a topological, planar, or spatial object having an energy structure;

2. At each level, the molecular "specific points" (or "base fragments") are determined by algorithmic rules;

3. Each specific point has (planar or spatial) coordinates and a type identifier. The point type (its name) is defined by rules given by experts;

4. For all specific points of the molecule, the distance matrix D = { Dij } is built, where Dij is the (topological or Euclidean) distance between the i-th and j-th points. The bounds of the distance intervals are chosen, and the matrix P = { Pij } is built, where Pij is the interval that contains Dij;

5. The resulting interval matrix P is the basis for describing the molecule in the form of its structural spectrum, as follows. The pairs of specific points are listed, and the program generates the list of descriptors in the form "(T1,T2,P),N", where T1 and T2 are the names of the points in the pair, P is the (code of the) distance interval between them, and N is the number of occurrences of the (T1,T2,P) fragment in the molecule. The record "(T1,T2,P)" is the descriptor's name: two descriptors are equal if their names coincide.
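
Steps 4 and 5 can be illustrated with a minimal Python sketch. It is not the author's program: the function names (interval_code, structural_spectrum, feature_table), the toy coordinates, and the interval bounds are hypothetical. The sketch also assumes Euclidean distances, treats the pair (T1,T2) as unordered, and assigns a distance lying exactly on a bound to the lower interval.

```python
from collections import Counter
from itertools import combinations
from math import dist


def interval_code(d, bounds):
    """Return the code (index) of the distance interval that contains d."""
    for code, upper in enumerate(bounds):
        if d <= upper:
            return code
    return len(bounds)  # d lies beyond the last bound


def structural_spectrum(points, bounds):
    """Count (T1, T2, P) descriptors for one molecule.

    points: list of (type_name, coordinates) specific points.
    Returns a Counter mapping the descriptor name (T1, T2, P) to N.
    """
    spectrum = Counter()
    for (t1, c1), (t2, c2) in combinations(points, 2):
        p = interval_code(dist(c1, c2), bounds)  # P_ij from D_ij
        a, b = sorted((t1, t2))  # assumption: the pair is unordered
        spectrum[(a, b, p)] += 1
    return spectrum


def feature_table(spectra):
    """Build a "molecule-feature" table from per-molecule spectra."""
    names = sorted(set().union(*spectra))
    rows = [[s.get(name, 0) for name in names] for s in spectra]
    return names, rows


# Hypothetical toy molecule: four specific points with type names.
points = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("O", (3.0, 0.0, 0.0)),
    ("N", (1.5, 1.0, 0.0)),
]
bounds = [2.0, 4.0]  # intervals: [0, 2], (2, 4], (4, inf)

spectrum = structural_spectrum(points, bounds)
for name, n in sorted(spectrum.items()):
    print(name, n)

names, rows = feature_table([spectrum, Counter({("C", "C", 0): 3})])
```

Each row of the table returned by feature_table corresponds to one molecule and each column to one descriptor name, so descriptors with coinciding names share a column, as the equality rule in step 5 requires.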

The "specific points" can be chosen as follows:

* Atoms labelled in various ways. A label is constructed from information about the local topological, geometrical, or physicochemical properties of the atom [5,6,12].

* Chains of marked atoms [5,6]. The distance between chains is defined as the minimum distance between the atoms of the chains.

* Points not directly connected with atoms [7,8]. These points are chosen in the space surrounding the molecule. For example, they can be located on molecular surfaces of various kinds.
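
The chain-to-chain distance from the second item can be sketched as follows. The example assumes Euclidean atom coordinates; the function name chain_distance and the coordinates are hypothetical. For the topological level, the same minimum would be taken over graph distances instead.

```python
from itertools import product
from math import dist


def chain_distance(chain_a, chain_b):
    """Minimum distance over all atom pairs, one atom from each chain."""
    return min(dist(p, q) for p, q in product(chain_a, chain_b))


# Hypothetical coordinates of the atoms of two chains.
chain_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
chain_b = [(4.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

d = chain_distance(chain_a, chain_b)
print(d)
```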

The feature spaces are selected by repeatedly constructing various QSAR equations of gradually increasing complexity. The transition is made from simple base fragments to complex ones, and from the topological level of molecular representation to the 3D level. The best QSAR model is constructed for each feature space and then saved. The predictive ability of the obtained QSAR models is checked by the "cross-validation" method [4]:

1. the i-th compound is removed from the learning SDB, and new weight coefficients are found for the parameters of the given model;

2. a property (activity) of the removed i
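
The description of the procedure breaks off here. The Python sketch below therefore rests on an assumption: it follows the usual leave-one-out scheme, in which the activity of the removed compound is predicted by the refitted model and compared with the experimental value. The least-squares fit via normal equations, the q-squared summary statistic, the function names, and the toy data are all illustrative and are not taken from the original text.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear(rows, activities):
    """Least-squares weight coefficients (intercept first)."""
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * u for x, u in zip(xs, activities)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


def loo_predictions(rows, activities):
    """Remove each compound in turn, refit, and predict the removed one."""
    preds = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_u = activities[:i] + activities[i + 1:]
        weights = fit_linear(train_rows, train_u)  # new weight coefficients
        preds.append(predict(weights, rows[i]))
    return preds


def q_squared(activities, preds):
    """Cross-validated q^2 = 1 - PRESS / total sum of squares (assumed metric)."""
    mean_u = sum(activities) / len(activities)
    press = sum((u - p) ** 2 for u, p in zip(activities, preds))
    total = sum((u - mean_u) ** 2 for u in activities)
    return 1.0 - press / total


# Hypothetical "molecule-feature" table (descriptor counts) and activities
# generated from U = 1 + 2*x1 + 3*x2 without noise.
rows = [[1, 0], [0, 1], [2, 1], [1, 3], [3, 2]]
activities = [3.0, 4.0, 8.0, 12.0, 13.0]

preds = loo_predictions(rows, activities)
q2 = q_squared(activities, preds)
print(preds, q2)
```

Because the toy activities are exactly linear in the two descriptors, the removed values are reproduced up to rounding error. With real data, the same statistic could be computed for every candidate feature space and used to compare the saved models.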

