The chemical meaning of topological indices


Chemometrics and Intelligent Laboratory Systems, 15 (1992) 51-59 Elsevier Science Publishers B.V., Amsterdam

R. Todeschini
Dipartimento di Chimica Fisica ed Elettrochimica, Via Golgi 19, 20133 Milano (Italy)

R. Cazar
Escuela Politecnica Superior del Chimborazo, Panamericana Sur Km. 1, Riobamba (Ecuador)

E. Collina
Dipartimento di Chimica Fisica ed Elettrochimica, Via Golgi 19, 20133 Milano (Italy)

(Received 6 July 1991; accepted 4 November 1991)

Abstract

Todeschini, R., Cazar, R. and Collina, E., 1992. The chemical meaning of topological indices. Chemometrics and Intelligent Laboratory Systems, 15: 51-59.

Twenty-three topological indices were calculated for a data set of 667 different chemicals in an effort to further underline their chemical meaning. Several analyses using chemometric methods were performed on this data set. A refined model, obtained by using principal components analysis, was achieved with only ten relevant topological descriptors. The representation of the data set, in terms of the first six principal components, was used to determine a flexible measure of similarity for chemicals; this representation is then used to deal with different chemical problems and to allow chemical interpretations of topological index behaviour.

INTRODUCTION

The use of topological indices has become increasingly important in the prediction of physicochemical properties of compounds, in quantitative structure-activity relationship studies and in related topics. Moreover, their availability is free of cost; even so, some chemists are unwilling to use them because topological indices have no direct chemical meaning and are generally considered as empirical descriptors.

The aim of this paper is to identify chemical information contained in the topological indices; such information provides a relevant chemical approach to the structural features of compounds within a topological index framework [1,2]. Thus the use of chemometric methods appears to be the natural way to deal with this kind of problem [3-5]. In this study 23 topological indices were considered on the basis of the distance or the bond matrices. The symbols used to represent the different quantities used in the topological indices are defined as follows:

$N$ - Number of atoms of the molecular skeleton
$N_t$ - Total number of atoms
$n_i$ - Number of atoms of the ith type (e.g. C, O, N, ...)
$t_i$ - Number of topologically equivalent atoms [6] in the ith subset of atoms
$p_i$ - Probability that a randomly selected atom will be of the ith type; $p_i = n_i/N$
$pc_i$ - Probability that a randomly selected vertex will have first-order coordinates in the ith subset [7]; $pc_i = nc_i/N$ ($nc_i$: number of vertices with first-order coordinates in the ith class)
$B$ - Number of bonds
$C$ - Number of cycles [8]
$D$ - Distance matrix of a graph; its entries $d_{ij}$ represent the number of edges in the shortest path between the vertices i and j
$\delta_i$ - Degree of vertex i, i.e. the number of edges incident with vertex i
$r_i$ - Radius of vertex i, i.e. the maximum distance vertex i has to the other vertices of the graph
$d_i$ - Distance sum or distance degree for vertex i, i.e. the sum of all entries in the ith row of D

Correspondence to: Dr. R. Todeschini, Dipartimento di Chimica Fisica ed Elettrochimica, Via Golgi 19, 20133 Milano, Italy.

0169-7439/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved

The indices considered are defined as follows (lb being the base 2 logarithm, $\log_2$):

(a) MIC - Bertz molecular complexity index:

$MIC = N\,\mathrm{lb}\,N - \sum_i n_i\,\mathrm{lb}\,n_i$

(the sum runs over the atoms of the molecular skeleton)

(b) $I_{AC}$ - Information index on atomic composition (total) [9]:

$I_{AC} = N_t\,\mathrm{lb}\,N_t - \sum_i n_i\,\mathrm{lb}\,n_i$

$\bar{I}_{AC}$ - Information index on atomic composition (mean) [8]:

$\bar{I}_{AC} = -\sum_i p_i\,\mathrm{lb}\,p_i$

(c) ISIZ - Dimension index:

$ISIZ = N\,\mathrm{lb}\,N$

(d) M - Zagreb index on vertex degree [9]:

$M = \sum_i \delta_i^2$

(e) $\chi$ - Randić connectivity index (total) [10]:

$\chi = \sum_{ij} (\delta_i \delta_j)^{-1/2}$

(the sum runs over adjacent atoms)

$\bar{\chi}$ - Randić connectivity index (mean)

(f) W - Wiener index on distance code (total) [11]:

$W = \frac{1}{2} \sum_{ij} d_{ij}$

$\bar{W}$ - Wiener index on distance code (mean):

$\bar{W} = \frac{2W}{N(N-1)}$

(g) J - Balaban distance connectivity index [12]:

$J = \frac{B}{C+1} \sum_{ij} (d_i d_j)^{-1/2}$

(the sum runs over adjacent vertices)

(h) $\bar{I}^E_D$ - Mean information on distance equality [13]:

$\bar{I}^E_D = -\sum_i \frac{2g_i}{N(N-1)}\,\mathrm{lb}\,\frac{2g_i}{N(N-1)}$

($g_i$: incidence of distance i in D)

$I^E_D$ - Total information on distance equality [13]:

$I^E_D = \frac{N(N-1)}{2}\,\mathrm{lb}\,\frac{N(N-1)}{2} - \sum_i g_i\,\mathrm{lb}\,g_i$

($g_i$: incidence of distance i in D)

(i) $\bar{I}^W_D$ - Mean information on the magnitude of distances [13]:

$\bar{I}^W_D = -\sum_i g_i \frac{i}{W}\,\mathrm{lb}\,\frac{i}{W}$

($g_i$: incidence of distance i in D; W: Wiener index)

$I^W_D$ - Total information on the magnitude of distances [13]:

$I^W_D = W\,\mathrm{lb}\,W - \sum_i g_i\, i\,\mathrm{lb}\,i$

($g_i$: incidence of distance i in D; W: Wiener index)

(j) $\bar{I}^E_{D,deg}$ - Mean information on equality of distance degree:

$\bar{I}^E_{D,deg} = -\sum_d \frac{g_d}{N}\,\mathrm{lb}\,\frac{g_d}{N}$

($g_d$: incidence of distance degree d)

(k) $\bar{I}^W_{D,deg}$ - Mean information on magnitude of distance degree:

$\bar{I}^W_{D,deg} = -\sum_i \frac{d_i}{2W}\,\mathrm{lb}\,\frac{d_i}{2W}$

(W: Wiener index)

(l) $\bar{I}_{C,R}$ - Radial centric information index:

$\bar{I}_{C,R} = -\sum_r \frac{g_r}{N}\,\mathrm{lb}\,\frac{g_r}{N}$

($g_r$: incidence of radius r)

(m) IC - Multigraph information content:

$IC = -\sum_i pc_i\,\mathrm{lb}\,pc_i$

(n) SIC - Structural information content:

$SIC = \frac{-\sum_i pc_i\,\mathrm{lb}\,pc_i}{\mathrm{lb}\,N}$

(o) BIC - Bonding information content:

$BIC = \frac{-\sum_i pc_i\,\mathrm{lb}\,pc_i}{\mathrm{lb}\,B}$

(p) CIC - Complementary information content:

$CIC = \mathrm{lb}\,N - IC$

(q) $\bar{I}_{TOP}$ - Topological information content:

$\bar{I}_{TOP} = -\sum_i \frac{t_i}{N}\,\mathrm{lb}\,\frac{t_i}{N}$

TABLE 1

Chemical characteristics of the 667 compounds of the database. Compounds with several functional groups are considered in more than one class

| Functional groups | No. | Functional groups | No. |
|---|---|---|---|
| Alcohols | 49 | Halides | 98 |
| Aldehydes | 44 | Isocyanates | 5 |
| Amides | 18 | Ketones | 71 |
| Amines | 104 | Imides | 5 |
| Anhydrides | 6 | Lactames | 12 |
| Carbamic acids | 6 | Lactones | 18 |
| Carboxylic acids | 76 | Nitriles | 11 |
| Ethers | 52 | Sulfides | 14 |
| Esters | 14 | Thiols | 9 |
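As an illustration of how some of the distance- and degree-based indices defined in this section can be evaluated, the following is a minimal sketch, not the authors' software. It assumes a connected, hydrogen-depleted skeleton given as a plain adjacency list, ignores bond multiplicity and atom types, and takes the number of cycles C as the cyclomatic number B - N + 1; the function names `distance_matrix` and `indices` and the dictionary keys are hypothetical.

```python
from collections import deque
from math import log2, sqrt


def distance_matrix(adj):
    """All-pairs topological distances of a connected graph (breadth-first search)."""
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def indices(adj):
    """A few of the indices for a simple connected graph (assumptions in the lead-in)."""
    n = len(adj)
    D = distance_matrix(adj)
    edges = [(i, j) for i in range(n) for j in adj[i] if i < j]
    B = len(edges)
    C = B - n + 1                       # assumed: cyclomatic number
    delta = [len(nb) for nb in adj]     # vertex degrees
    dsum = [sum(row) for row in D]      # distance sums d_i
    W = sum(dsum) / 2                   # Wiener index
    g = {}                              # g[k]: number of vertex pairs at distance k
    for i in range(n):
        for j in range(i + 1, n):
            g[D[i][j]] = g.get(D[i][j], 0) + 1
    return {
        "W": W,
        "W_mean": 2 * W / (n * (n - 1)),
        "M": sum(d * d for d in delta),
        "chi": sum(1 / sqrt(delta[i] * delta[j]) for i, j in edges),
        "J": B / (C + 1) * sum(1 / sqrt(dsum[i] * dsum[j]) for i, j in edges),
        "I_D_W_mean": -sum(gk * (k / W) * log2(k / W) for k, gk in g.items()),
        "I_Ddeg_W_mean": -sum(d / (2 * W) * log2(d / (2 * W)) for d in dsum),
    }


# Three-atom chain (e.g. the skeleton of propane) and three-membered ring
propane = [[1], [0, 2], [1]]
cyclopropane = [[1, 2], [0, 2], [0, 1]]
print(indices(propane))
print(indices(cyclopropane))
```

For the three-atom chain this sketch gives W = 4, M = 6 and a mean information on the magnitude of distances of 1.5, which can be checked by hand from the definitions.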

DATA SET

A large data set of very heterogeneous chemical compounds was generated to extract the chemical meaning of the topological indices and to give these indices a more acceptable validity. A choice of 667 compounds was made from the main organic groups, e.g. aliphatic compounds, ketones, aldehydes, amines, etc. The database was created to provide a wide representation of chemical compounds by selecting, from each chemical group, compounds in increasing order of size, complexity, etc., this selection starting from the very basic compounds. A count was made of the number of times various important chemical functional groups occurred (Table 1), independently counting different functional groups present in the same compound. In the data set there are 136 hydrocarbons (20%): 49 alkanes, 28 alkenes, 10 alkynes and 49 benzenoid compounds. Approximately 48% of the compounds are aromatic and 40% acyclic. The 23 topological indices were calculated for the 667 compounds, including the molecular weight to better display the topological indices more closely related to molecular size.

METHODS

As a first step the main univariate statistical parameters of the topological indices were calculated, together with a degeneracy index D% (Table 2). This index gives information about the capability of a variable to discriminate between different compounds and is calculated as

$D\% = 100 \times \frac{-\mathrm{lb}(1/N) - H_j}{-\mathrm{lb}(1/N)}$

where N is the total number of compounds (N = 667) and $H_j$ is the entropy due to the jth variable ($H_j = 0$ if all the N values are equal and $H_j = -\mathrm{lb}(1/N) = 9.382$ if all the values are different).

A preliminary analysis of the correlation matrix was performed to reveal any high correlations between topological index pairs. The most correlated pairs of indices (r > 0.9) are reported in Table 3.

Multivariate analysis was carried out by using principal components analysis (PCA) [14] applied to the autoscaled data of 667 compounds, each compound being described by 23 variables. PCA is a powerful tool for exploratory data analysis and is a starting point to demonstrate the interdependence of the variables and to obtain uncorrelated information.

In view of the complexity of the problem a cluster analysis of the descriptors was performed: the unweighted average linkage hierarchical method [15] was used on the first six principal components obtained from the whole data set, such components representing 96% of the total variance (Fig. 1). The information obtained from the different approaches led to ten topological indices being excluded because of their degeneracy, redundancy, correlation or similarity to other indices. A new PCA model was calculated on the remaining ten descriptors: the first six PCs were considered significant (eigenvalues > 1; Table 4) and then rotated by using the raw varimax orthogonal factor rotation [16] to provide for a better interpretation of the PCs (Table 5). Information regarding the lowest PCs, undesired correlations and noise is rejected.

The best informative subset of descriptors is then searched for by using the results obtained from cluster analysis applied to variables (R-mode analysis), the system being described in terms of the first six PCs, weighted in proportion to their explained variance. A similarity level of 80% was chosen to keep variables with correlation > 0.9 in the same clusters: eleven clusters were obtained

Fig. 1. Variable selection. Dendrogram obtained by using a hierarchical clustering method, where the similarity between the variables is evaluated by describing them in terms of their first six principal components. The similarity level of 80% is indicated; triangles indicate the selected variables.
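The degeneracy index D% used in the variable selection can be sketched as follows. The text only fixes the two limiting cases of the entropy (zero when all values are equal, lb N when all are different), so this sketch assumes that the entropy is the Shannon entropy of the classes of equal values; the function name `degeneracy_index` and the rounding precision `ndigits` used to decide when two values count as equal are hypothetical.

```python
from collections import Counter
from math import log2


def degeneracy_index(values, ndigits=2):
    """D% of one variable: 0 if all values differ, 100 if all are equal.

    Assumption: H is the Shannon entropy (base 2) over classes of equal
    values, where equality is judged after rounding to `ndigits` decimals.
    Requires at least two values.
    """
    n = len(values)
    counts = Counter(round(v, ndigits) for v in values)
    h = -sum(c / n * log2(c / n) for c in counts.values())
    h_max = log2(n)  # equals -lb(1/N)
    return 100.0 * (h_max - h) / h_max


print(degeneracy_index([1.0, 1.0, 2.0, 2.0]))  # two classes of two values each
print(log2(667))                               # maximum entropy for N = 667
```

With this reading, a variable taking only two equally populated values among four compounds has half of the maximum entropy and therefore D% = 50.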

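TABLE 2

Average, standard deviation, minimum, maximum and degeneracy index for the 23 topological indices

| Index | Average | S.D. | Min. | Max. | D% |
|---|---|---|---|---|---|
| Mw | 142.97 | 86.03 | 40.06 | 459.72 | 18.0 |
| MIC | 7.25 | 6.90 | 0.00 | 29.09 | 43.1 |
| $I_{AC}$ | 23.66 | 8.94 | 3.90 | 64.94 | 22.3 |
| $\bar{I}_{AC}$ | 1.32 | 0.26 | 0.65 | 2.01 | 26.9 |
| M | 43.99 | 29.62 | 6.00 | 128.00 | 44.3 |
| $\chi$ | 4.56 | 2.27 | 1.41 | 10.87 | 26.3 |
| W | 168.60 | 223.72 | 3.00 | 1506.00 | 25.6 |
| J | 2.27 | 0.52 | 1.29 | 4.85 | 22.4 |
| $\bar{I}^E_D$ | 2.08 | 0.61 | 0.00 | 4.03 | 23.8 |
| $\bar{I}^W_D$ | 4.80 | 1.48 | 1.50 | 7.66 | 22.3 |
| $\bar{I}_{C,R}$ | 1.58 | 0.68 | 0.00 | 3.40 | 35.8 |
| $\bar{I}^E_{D,deg}$ | 2.05 | 1.02 | 0.00 | 4.25 | 35.2 |
| $\bar{I}^W_{D,deg}$ | 3.06 | 0.70 | 1.56 | 4.44 | 22.2 |
| $\bar{W}$ | 2.54 | 0.84 | 1.00 | 7.17 | 29.0 |
| $\bar{\chi}$ | 0.50 | 0.06 | 0.41 | 0.71 | 30.6 |
| $I^W_D$ | 1078.55 | 1641.78 | 4.75 | 11137.33 | 22.0 |
| $I^E_D$ | 132.93 | 161.74 | 0.00 | 846.56 | 22.2 |
| ISIZ | 32.60 | 22.99 | 4.75 | 98.11 | 58.0 |
| $\bar{I}_{TOP}$ | 1.32 | 0.48 | 0.00 | 2.00 | 35.3 |
| IC | 2.17 | 0.76 | 0.00 | 3.45 | 33.1 |
| SIC | 0.70 | 0.23 | 0.00 | 1.00 | 33.3 |
| CIC | 0.92 | 0.67 | 0.00 | 3.81 | 35.6 |
| BIC | 0.66 | 0.22 | 0.00 | 1.00 | 20.1 |

TABLE 3

Main correlation values between the topological descriptors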

Indices

X

iw D&g W ISIZ

X

IED

IE

I,w ISIZ

i; ID”

P

Indices

1.000

ISIZ

iw D&g

0.951

0.998

iE B&Y

w

0.948

SIC

0.946

0.990 0.987

i;

ISIZ

0.945

M

0.983

iw D&g M ISIZ ID” ISIZ

0.942 0.938

0.920

0.996

P

M

0.981

ID” MW

ISIZ ISIZ

0.972 0.971

i,W W M I,w

X

iw D&g

0.969

W

X

Mw: iz X ID” MWX

0.957 0.963

IAC CIC

IDE MW

0.955 0.954 0.952

CIC ID” X

BIC w SIC MW ID”

M” X

0.938 0.930 0.926

- 0.910 0.918 - 0.910 0.909 0.903

TABLE 4

Eigenvalues, percent of explained variance and cumulative percent of the first six principal components of the refined PCA model for the unrotated and rotated models

| PC | Eigenvalue (unrotated) | Variance (%) (unrotated) | Cumulative (%) (unrotated) | Variance (%) (rotated) | Cumulative (%) (rotated) |
|---|---|---|---|---|---|
| 1 | 4.630 | 46.3 | 46.3 | 36.3 | 36.3 |
| 2 | 2.767 | 27.7 | 74.0 | 25.8 | 62.1 |
| 3 | 1.005 | 10.0 | 84.0 | 12.1 | 74.2 |
| 4 | 0.717 | 7.2 | 91.2 | 10.9 | 85.1 |
| 5 | 0.508 | 5.1 | 96.3 | 10.5 | 95.6 |
| 6 | 0.179 | 1.8 | 98.1 | 2.5 | 98.1 |

TABLE 5

Rotated loadings of the first six principal components of the refined model and their degeneracy index D%. The main contributions of the variables in each PC are printed bold (in the original)

| Variable | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|
| Mw | 0.853 | -0.128 | -0.219 | 0.259 | -0.201 | 0.161 |
| $\bar{I}_{AC}$ | -0.181 | 0.109 | -0.124 | 0.965 | -0.078 | 0.018 |
| J | -0.095 | 0.171 | 0.080 | -0.078 | 0.973 | -0.009 |
| $\bar{I}^E_{D,deg}$ | 0.760 | 0.370 | -0.223 | 0.095 | -0.037 | 0.457 |
| $\bar{I}^W_{D,deg}$ | 0.872 | -0.082 | -0.440 | 0.123 | -0.102 | 0.007 |
| $\bar{W}$ | 0.976 | -0.069 | -0.062 | 0.036 | 0.023 | -0.113 |
| $\bar{\chi}$ | -0.421 | 0.185 | 0.860 | -0.165 | 0.108 | -0.050 |
| IC | 0.548 | 0.755 | -0.307 | 0.165 | 0.007 | 0.029 |
| CIC | 0.298 | -0.933 | -0.107 | -0.060 | -0.110 | -0.023 |
| BIC | 0.006 | 0.948 | 0.214 | 0.010 | 0.152 | 0.039 |

D% of the six PCs (column assignment uncertain): 2.02, 1.79, 1.87, 1.80, 1.95, 2.01

(Fig. 1) and from these the variables with the lower degeneracy index were selected. Although it had a similarity level lower than 80%, the variable MIC was excluded for the same reason (D% = 43.1).

The concept of similarity was used to interpret the principal components. The similarity between two chemicals i and j can be defined as a function of the distance between them:

$s_{ij} = 1 - d_{ij}/d_{\max}$

where $d_{\max}$ is the maximum distance calculated between all the compounds and the distance d is, here, the Euclidean distance calculated on the variables describing the compounds. Very similar compounds have values of similarity near 1 (or 100%), while very dissimilar compounds take values near 0.
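The similarity measure defined above can be sketched in a few lines. This is only an illustration under stated assumptions: each compound is given as a list of PC scores, and the per-component weights (for example the explained variances of the rotated PCs) are assumed to multiply the squared score differences, which is one possible reading of "weighted" and is not specified in the text. The function name `similarity_matrix` is hypothetical.

```python
from math import sqrt


def similarity_matrix(scores, weights=None):
    """s_ij = 1 - d_ij / d_max from (optionally weighted) Euclidean distances.

    scores  : list of compounds, each a list of PC scores
    weights : one non-negative weight per PC (assumed to scale the squared
              differences); a zero weight removes that PC from the comparison
    """
    k = len(scores[0])
    w = weights or [1.0] * k

    def dist(a, b):
        return sqrt(sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b)))

    d = [[dist(a, b) for b in scores] for a in scores]
    d_max = max(max(row) for row in d)
    if d_max == 0.0:
        raise ValueError("all compounds coincide; similarity is undefined")
    return [[1.0 - dij / d_max for dij in row] for row in d]


# Three hypothetical compounds described by two PCs
scores = [[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]]
all_pcs = similarity_matrix(scores)
first_pc_only = similarity_matrix(scores, weights=[1.0, 0.0])
print(all_pcs[0])
print(first_pc_only[0])
```

Setting a weight to zero mimics the exclusion of a component carrying information that is irrelevant to the problem under study.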

RESULTS AND DISCUSSION

The purpose of this work was to gain greater insight into the interpretation of topological indices via their representation in principal components, and to check the capability of the PCs to give a flexible measure of similarity to face different chemical problems. Thus the whole data system with 23 topological descriptors was described in terms of the first six PCs, which represent 96% of the total variance. A reduced model was obtained by using cluster analysis on the variables, leading to only ten variables. These ten descriptors were then used to calculate a refined PCA model: Mw, $\bar{I}_{AC}$, J, $\bar{I}^E_{D,deg}$, $\bar{I}^W_{D,deg}$, $\bar{W}$, $\bar{\chi}$, IC, CIC and BIC. In this refined model the first six PCs explained 98% of the total variance (Table 4). The rotated loadings for this ten-descriptor refined model, together with the degeneracy index of the factors, are shown in Table 5. As expected, the values of the degeneracy index show how well the PCs discriminate between the different compounds. The few percent of degeneracy are due to the presence in the data set of some cis/trans isomers which cannot, in principle, be discerned in a topological approach.

Because the descriptors of the data are the six PCs obtained from the refined model, a measure of similarity must take account of the different contribution of each component to an explanation of the variability of the data systems. This means that the similarity between any of the chemicals of the data set must be calculated from PCs weighted for their proportion of explained variance (% VAR. in the rotated PCA model of Table 5). However, in cases where a subset of analogous chemicals is investigated for some particular chemical property, a different weighting scheme must be used, e.g. setting to zero a component carrying irrelevant information. This highlights properties pertinent to the investigated problem and provides effective similarity scales.

The first PC carries information about the molecular size, both with respect to the total number of atoms and to the molecular weight.

The second PC carries information about the bond distribution in a chemical compound, also including information about different bond multiplicity, e.g. all monocyclic compounds, with atoms having the same valence bonds in the hydrogen-depleted scheme, have minimum values of PC2 (minimum entropy, as for cyclohexane, benzene, [14]-annulene).

The third PC carries information linked to $\bar{\chi}$. For this topological descriptor a discriminant point exists at 0.5, the value assumed for all monocyclic compounds. All the other cyclic compounds assume values lower than 0.5, as do some non-cyclic, highly branched compounds, which are characterized by a tendency towards central symmetry (for example, 3-methyl-3-ethylpentane, $\bar{\chi}$ = 0.4611). For PC3 the discriminant value is about zero, with a trend towards higher values with increasing size.

The fourth PC carries information mainly linked to $\bar{I}_{AC}$. This means that values ≤ 1 (i.e. lb 2) represent hydrocarbons or, more correctly, compounds with only two different atom types, as in the case of the 1,3,5-triazine nitrile (containing only C and N atoms). A second discriminant value is 1.585 (i.e. lb 3), for compounds with three different atom types, and a third discriminant value is 2 (i.e. lb 4) for compounds with four different atom types. Within each interval, compounds with unbalanced numbers of different atoms assume lower values (less entropy) than compounds with balanced numbers of different atoms (more entropy). This means, for example, that within the class of aliphatic compounds the more saturated ones assume the lowest values because the number of hydrogen atoms greatly exceeds the number of carbon atoms. The main characteristics of this component can be shown by evaluating the similarity with and without this component. Table 6 lists the six compounds most similar to a target compound (toluene) for the two cases. Without this component (Model B) compounds with heteroatoms also appear among those that are most similar, because the difference in the constituent atom types is not taken into account.

TABLE 6

Similarity considering and not considering the presence of heteroatoms. Similarity evaluated with all PCs highlights the presence/absence of heteroatoms (Model A); excluding PC4 highlights only structural similarities (Model B). Toluene (C7H8) is the selected target compound

| Model A formula | Model A compound | Similarity (%) | Model B formula | Model B compound | Similarity (%) |
|---|---|---|---|---|---|
| C7H10 | Ethynylcyclopentane | 96.3 | C6H7N | Aniline | 97.1 |
| C6H10 | Fulvene | 94.6 | C6H7N | α-Picoline | 97.1 |
| C8H12 | Ethynylcyclohexane | 92.9 | C6H7N | β-Picoline | 97.1 |
| C8H6 | Ethynylbenzene | 92.8 | C6H7N | γ-Picoline | 97.1 |
| C8H8 | Styrene | 92.5 | C6H6O | Phenol | 96.9 |
| C8H10 | o-Xylene | 92.5 | C6H6S | Thiophenol | 96.8 |

High values of the fifth PC give information about the presence of high branching and a high number of substituents, mainly due to the Balaban connectivity index J. These values are corrected toward lower values for compounds with higher molecular weights and toward higher values of BIC (different structural isomers). To highlight the meaning of this component, the similarity between a linear target compound (n-octane) and its seventeen isomers was evaluated by using only PC5; as shown in Table 7, the more branched isomers appear less similar to the target compound in the list.

TABLE 7

Branching of the eighteen C8H18 isomers evaluated by using PC5. The reference compound is n-octane (compound 1, linear)

| No. | Compound | Similarity (%) |
|---|---|---|
| 1 | n-Octane | - |
| 2 | 2-Methylheptane | 98.5 |
| 3 | 3-Methylheptane | 93.5 |
| 4 | 4-Methylheptane | 87.8 |
| 5 | 2,5-Dimethylhexane | 81.4 |
| 6 | 2,4-Dimethylhexane | 79.4 |
| 7 | 2,2-Dimethylhexane | 79.4 |
| 8 | 3-Ethylhexane | 79.3 |
| 9 | 2,3-Dimethylhexane | 76.5 |
| 10 | 3,4-Dimethylhexane | 68.2 |
| 11 | 3,3-Dimethylhexane | 67.4 |
| 12 | 2-Methyl-3-ethylpentane | 66.9 |
| 13 | 2,2,4-Trimethylpentane | 63.9 |
| 14 | 2,2,3-Trimethylpentane | 55.4 |
| 15 | 2,3,4-Trimethylpentane | 54.4 |
| 16 | 2,3,3-Trimethylpentane | 51.3 |
| 17 | 2,2,3,3-Tetramethylbutane | 26.8 |
| 18 | 3-Methyl-3-ethylpentane | 0.0 |

The sixth PC seems to carry information about the position of the substituents in a fixed molecular skeleton. It would seem, for example, that PC6 has a role in the discrimination of the ortho-, meta- and para-isomers; however, other kinds of discriminating properties are not so clear and deserve further investigation.
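The discriminant values quoted for the mean information index on atomic composition (lb 2 = 1, lb 3 = 1.585, lb 4 = 2) follow directly from its entropy form. The sketch below assumes that the index is evaluated on the full molecular formula, hydrogens included, as the discussion of saturated compounds suggests; the function name `mean_atomic_information` and the dictionary representation of a formula are hypothetical.

```python
from math import log2


def mean_atomic_information(composition):
    """Mean information on atomic composition, -sum(p_i * lb p_i).

    composition: mapping from element symbol to atom count; the counts are
    assumed to include hydrogen atoms.
    """
    total = sum(composition.values())
    return -sum(n / total * log2(n / total)
                for n in composition.values() if n > 0)


ethane = {"C": 2, "H": 6}
ethylene = {"C": 2, "H": 4}
acetylene = {"C": 2, "H": 2}
toluene = {"C": 7, "H": 8}

for name, formula in [("ethane", ethane), ("ethylene", ethylene),
                      ("acetylene", acetylene), ("toluene", toluene)]:
    print(name, round(mean_atomic_information(formula), 3))
```

Under these assumptions the value never exceeds lb k for a compound with k atom types and reaches it only for perfectly balanced compositions, so the more saturated hydrocarbons fall lowest within the two-atom-type interval.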

Fig. 2. The six compounds randomly extracted from the data set: (1) cytosine, (2) tetrahydropyrane, (3) di-isopropylether, (4) dimethylamine, (5) cyclopentylamine, (6) cumarine. These compounds were used to evaluate the performances of pattern recognition in terms of the obtained six-PC model.

The six-PC model was used to assess the similarity (with respect to all the aspects present in the data and described by the six components) between six randomly selected target compounds and the other compounds belonging to the whole data set. Fig. 2 shows the structures of the six compounds, while Table 8 shows the results obtained for the five most similar compounds.

TABLE 8

Similarity between six randomly selected compounds used as target chemicals, using all six PCs, weighted on the explained variance. T is the target compound, No. the ordered sequence of similar compounds

| T | No. | Formula | Similarity (%) | Name |
|---|---|---|---|---|
| 1 | | C4H5N3O | | Cytosine |
| | 1.1 | C4H4O2N2 | 96.6 | Uracil |
| | 1.2 | C5H7NO2 | 93.6 | Glutaric imide |
| | 1.3 | C4H4N2O3 | 91.6 | Barbituric acid |
| | 1.4 | C5H9NO2 | 89.7 | Proline |
| | 1.5 | C4H5NO2 | 89.1 | Succinimide |
| 2 | | C5H10O | | Tetrahydropyrane |
| | 2.1 | C5H11N | 98.8 | Piperidine |
| | 2.2 | C5H5N | 94.4 | Pyridine |
| | 2.3 | C8H4O3 | 93.2 | Phthalic anhydride |
| | 2.4 | C4H8O2 | 93.2 | 1,4-Dioxane |
| | 2.5 | C4H9N | 92.8 | Pyrrolidine |
| 3 | | C6H14O | | Diisopropylether |
| | 3.1 | C6H15N | 98.9 | Diisopropylamine |
| | 3.2 | C6H15N | 92.4 | Triethylamine |
| | 3.3 | C7H16 | 89.8 | 2,4-Dimethylpentane |
| | 3.4 | C7H16 | 88.0 | 3-Ethylpentane |
| | 3.5 | C8H18 | 86.3 | 2,5-Dimethylhexane |
| 4 | | C2H7N | | Dimethylamine |
| | 4.1 | C2H7N | 100.0 | Ethylamine |
| | 4.2 | C2H6O | 97.5 | Ethanol |
| | 4.3 | C2H6O | 97.5 | Dimethylether |
| | 4.4 | C2H6S | 97.1 | Ethanethiol |
| | 4.5 | C2H6S | 97.1 | Dimethylsulfide |
| 5 | | C5H11N | | Cyclopentylamine |
| | 5.1 | C5H10O | 98.8 | Cyclopentanol |
| | 5.2 | C5H8O | 96.0 | Cyclopentanone |
| | 5.3 | C6H7N | 94.1 | Aniline |
| | 5.4 | C6H7N | 94.1 | α-Picoline |
| | 5.5 | C6H7N | 94.1 | β-Picoline |
| | 5.6 | C6H7N | 94.1 | γ-Picoline |
| 6 | | C9H6O2 | | Cumarine |
| | 6.1 | - | 95.2 | Delta-benzobutyrrolactone |
| | 6.2 | C10H9N | 93.9 | 1-Naphthylamine |
| | 6.3 | - | 93.8 | 1,2-Naphthylamine |
| | 6.4 | C9H6O2 | 93.1 | Cromone (1,4) |
| | 6.5 | C11H8O2 | 92.8 | 2-Methyl-1,4-naphthoquinone |

CONCLUSIONS

Chemometric methods appear to provide an interesting methodological framework for a good understanding of topological descriptors. The pattern space, determined by the six PCs from linear combinations of ten topological descriptors, contains information useful in ascertaining the structural similarity of molecules. Exploratory and confirmatory analysis for PC interpretations by using subsets of compounds with known properties will be the subject of further investigations. Although not conclusive and definitive, a chemical interpretation of the topological indices and of the PCs obtained can bring a greater sensitivity to their use for different chemical problems, bearing in mind the great advantage of having cheap and easily computable variables. Particularly relevant is the objective to search for relationships between these descriptors and experimental physicochemical and biological properties of chemicals by using regression analysis and discriminant analysis. For example, preliminary encouraging results were obtained by using regression models to predict the boiling points of 300 compounds of the data set, selected from among those not containing more than one type of heteroatom. However, it must be pointed out that none of the topological variables or their PCs has any relationship to the metric and conformational aspects of the compounds. Thus, the proposed pattern space must only be used if one bears these constraints in mind.

REFERENCES

1 L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis, Research Studies Press, Letchworth, 1986.
2 D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structures, Research Studies Press, Letchworth, 1983.
3 S.C. Basak, V.R. Magnuson, G.J. Niemi, R.R. Regal and G.D. Veith, Topological indices: their nature, mutual relatedness, and applications, Mathematical Modelling, 8 (1987) 300-305.
4 M. Randić, Resolution of ambiguities in structure-activity studies by use of orthogonal descriptors, Journal of Chemical Information and Computer Science, 31 (1991) 311-320.
5 D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structures, Research Studies Press, Letchworth, 1983, p. 85.
6 D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structures, Research Studies Press, Letchworth, 1983, pp. 150-153.
7 D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structures, Research Studies Press, Letchworth, 1983, pp. 74-75.
8 S.M. Dancoff and H. Quastler, in H. Quastler (Editor), Essays on the Use of Information Theory in Biology, University of Illinois, Urbana, IL, 1953.
9 I. Gutman, N. Ruščić, N. Trinajstić and C.F. Wilcox Jr., Graph theory and molecular orbitals. XII. Acyclic polyenes, Journal of Chemical Physics, 62 (1975) 3339-3405.
10 M. Randić, On characterization of molecular branching, Journal of the American Chemical Society, 97 (1975) 6609-6615.
11 H. Wiener, Structural determination of paraffin boiling points, Journal of the American Chemical Society, 69 (1947) 17-20.
12 A.T. Balaban, Topological indices based on topological distances in molecular graphs, Pure & Applied Chemistry, 55 (1983) 199-206.
13 D. Bonchev and N. Trinajstić, Information theory, distance matrix and molecular branching, Journal of Chemical Physics, 67 (1977) 4517-4533.
14 L. Lebart, A. Morineau and K.M. Warwick, Multivariate Descriptive Statistical Analysis: Correspondence Analysis and Related Techniques for Large Matrices, Wiley, New York, 1984.
15 D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, 1983.
16 R.J. Rummel, Applied Factor Analysis, Northwestern University Press, Evanston, IL, 1970.