Advanced Drug Delivery Reviews 55 (2003) 1119–1147  www.elsevier.com/locate/addr
Hierarchy neural networks as applied to pharmaceutical problems

Hiroshi Ichikawa*

Hoshi University School of Pharmacy, Department of Information Science, Ebara 2-4-41, Shinagawa, Tokyo 142-8501, Japan

Received 3 February 2003; accepted 12 May 2003
Abstract

Optimization and prediction are the main purposes of pharmaceutical applications of artificial neural networks (ANNs). To this end, hierarchy-type networks with the backpropagation learning method are most frequently used. This article reviews the basic operating characteristics of such networks. ANNs have outstanding abilities in both classification and fitting. Their operation is basically nonlinear; this nonlinearity brings merits as well as a small number of demerits. The reasons for the demerits are analyzed and remedies are indicated. The mathematical relationships between the operation of ANNs and the ALS method, as well as multiregression analysis, are reviewed. An ANN can be regarded as a function that transforms an input vector into another (output) vector. We examine the analytical formula for the partial derivative of this function with respect to the elements of the input vector; this is a powerful means of learning the relationship between the input and the output. The reconstruction-learning method determines the minimum number of necessary neurons in the network and is useful for finding the necessary descriptors or for tracing the flow of information from the input to the output. Finally, the descriptor-mapping method is reviewed as a way to find nonlinear relationships between the output intensity and the descriptors.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Artificial neural network; Basic theory of operation; Hierarchy; Backpropagation; Reconstruction; Forgetting; Descriptor mapping; Partial derivative; Correlation between input and output
Contents

1. Introduction
2. Simulation of the nerve system
   2.1. Biological nerve system in essence
   2.2. Biological neuron and artificial neuron
   2.3. Operation of artificial neuron
3. Basic theory of hierarchy-type neural network
   3.1. Network for classification
   3.2. Network for fitting
   3.3. Training
4. Characteristics and/or problems of the neural network's operation
   4.1. Basic operating characteristics
   4.2. Some difficulties
5. Relationship of operation between ANNs and conventional methods
   5.1. The ALS method
   5.2. The multiregression analysis
6. How to overcome defects of ANN's operation and purposive extension
   6.1. How to deal with excessive nonlinear operation
   6.2. How to obtain partial differential coefficients by the neural network
      6.2.1. Accuracy of the partial derivatives obtained by the neural network
      6.2.2. Is the independency between input neurons kept?
      6.2.3. Isolation of functions out of the mixed functions
      6.2.4. Recognition of two similar functions
      6.2.5. Recognition of two similar functions by correlated input data
      6.2.6. Application of the partial derivative method
   6.3. Reconstruction learning
      6.3.1. Introduction of the forgetting procedure into the learning phase
      6.3.2. How does reconstruction learning work?
      6.3.3. Practical application: the relationship between 13C NMR chemical shift and the conformation of norbornane and norbornene
   6.4. Descriptor mapping
      6.4.1. Method
      6.4.2. Examination using mathematical functions w = x + 2y + 3z
      6.4.3. Application to SAR analysis
7. Concluding remarks
Acknowledgements
References

*Tel./fax: +81-3-5498-5761. E-mail address: [email protected] (H. Ichikawa).

0169-409X/03/$ – see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0169-409X(03)00115-7
1. Introduction

When a fundamental phenomenon comes to light, it can have an enormous influence on many fields. The question of information processing in living things has been the object of such research. The functioning mechanism of the nerve system of living things had gradually emerged by the early 1940s. If one removes the additional or decorative factors from the nerve system, the central elements reduce to neurons and the nerve fibers that connect them. The physiology of such a nerve system reveals that a neuron transmits discrete information as an action potential when a certain amount of information has accumulated on the cell membrane. It is also understood that the functions of memory and cognition are carried by the 'thickness' of the fibers between neurons. The nerve system is an assembly of such neurons and nerve fibers [1]. Since the nerve system is easily simulated in a digital computer, the behavior of variously arranged and connected neurons has been studied [2], where the functioning of a neuron is replaced by a mathematical
function and the thickness of fibers is expressed as a weight value between neurons. If one teaches them, the simulated nerve systems learn and behave as if they were a kind of brain of a living thing; they are called artificial neural networks (ANNs). ANNs may have two types of connection, i.e. the hierarchy connection and the mutual connection, although the former is regarded as a special case of the latter (Fig. 1). In the hierarchy network, the signals (information) proceed from the input to the output without feeding back to the neurons already passed through. The mutual-connection type of neural network, which allows such feedback, developed into the Hopfield-type networks [3,4] and, incorporating ideas from statistical mechanics, into the Boltzmann-machine type networks [5]. Turning to the method of learning, supervised and unsupervised learning systems may be used (Fig. 2). To achieve learning, a standard of evaluation is necessary. The results of the evaluation must be fed back to change the thickness of the fibers, i.e. the weight values, to adapt the output to the
Fig. 1. Hierarchy connection-type neurons (a) and mutual connection-type neurons (b). Here, arrows indicate the flow of information.
standard. In supervised learning such a standard is given from outside the network system. As a supervised learning method, the backpropagation method [2] has been established for hierarchy neural networks, which have been most frequently applied to pharmaceutical problems. Unsupervised learning is based only on the internal structure of the network. In the Hopfield-type and Boltzmann-machine network structures, the unsupervised learning method has been adopted. We do not intend to go further into a detailed explanation of these structures; instead, recommended textbooks are cited here [2,6]. One thing to add: since a molecule has a 3D geometry, 3D information is conveniently reduced to a lower dimension to analyze molecular properties. For this purpose, Hopfield-type, Boltzmann-machine and Kohonen-type networks [7,8] have been studied for application; they are discussed and reviewed by Zupan, Gasteiger and co-workers [6,9] and by Doucet and Panaye [10]. Here, the Kohonen-type network may be regarded as a combination of one-layer networks with hierarchy neurons. Now let us give a brief bibliography of ANN
application in the fields of the pharmaceutical and related sciences. Application of ANNs in these fields may be traced back to 1989, when a general treatment in the pharmaceutical field was discussed [11]. The first application in QSAR appeared in 1990 [12]. Since Devillers has reviewed the early applications of ANNs to drug-related fields [13], we quote here review articles from 1994 to 2001. Since the first successful application to QSAR studies [11], a large number of articles have appeared, and they are continuously reviewed [14–18]. ANN is also a subject of clinical and biomedical application, and review articles are given in Refs. [19–24]. Traditional Chinese medicine requires great experience and a delicate balance of herbal components; applications to this end are considerable and have been carried out successfully, as seen in Refs. [25–28], although these are not review articles. Although ANN is a relatively new modeling method that has not yet been applied broadly across the pharmaceutical sciences, a number of articles have been published and several groups have already reviewed them [22,28–31]. As mentioned, the hierarchy-type neural network
Fig. 2. Supervised learning (a) and unsupervised learning (b).
with the backpropagation learning procedure has been the one mostly used in the medical and pharmaceutical areas. This is because classification and fitting are the most commonly used techniques for rationalization and prediction in these fields. In most cases, however, the ANN with the backpropagation algorithm has been applied without a rationalization of its operation. This article therefore treats the hierarchy-type neural network in an attempt to explain its proper application, with special emphasis placed on the rationalization of its operation.
2. Simulation of the nerve system
2.1. Biological nerve system in essence

The idea of modeling the biological neuron can be traced back to McCulloch and Pitts, who described the mechanism of the nerve system of a living organism in 1943 [32]. A biological neuron, which is essential for information processing, has two types of extensions: the dendrites and the axon. The dendrites receive signals from other neurons through synaptic connections to the cell body (soma). Here, the signal is actually an electric potential that may change the potential of another soma. There are positive and negative synaptic connections according to the
chemical transmitters: the former transmit the potential to other somas in the regular way, the latter in the reverse way. Thus, a neuron may receive the signals from other neurons as the sum of those positive and negative signals. Without a signal from other neurons, the neuron holds a standing electric potential (ca. -50 mV), but it gives a pulse of high potential (ca. +70 mV) when it receives more than a certain amount of information, and sends this potential to other neurons. Concerning learning in a living organism, Hebb proposed a mechanism in which the connection between neurons is strengthened according to the frequency of signal passing, so that signals pass more easily through such a connection. This makes the neural network of a living organism plastic enough to form memory and cognition [33]. This is called the Hebbian rule. An ANN is a network of artificial neurons in which the Hebbian rule is applied to the connections between neurons.
2.2. Biological neuron and artificial neuron Let us consider how to simulate the neuron’s function. Neuron j in Fig. 3 has the 0 and 1 states of output corresponding to standing and excited states. In the standing state, the neuron produces a low potential (0) and the excited one gives a high
Fig. 3. How to simulate the biological nerve network system; x and W are the output strength from a neuron and the thickness of the fiber that is connected to neuron j.
Fig. 4. Function of neuron j. When the sum of the signals (information) from presynaptic neurons (x) through the fibers (W) exceeds a threshold value (θ), neuron j ignites.
potential (1). The operation of such a neuron may be expressed as shown in Fig. 4, where θ is the threshold value for ignition. The z axis represents the potential of the output and the y axis the amount of the input signal, Σ_i W_ij x_i, which is the sum of the potentials from the presynaptic neurons. W_ij is called an element of the weight matrix (W) between neuron j and presynaptic neuron i. This function may be expressed as
$$y = \sum_{i=1}^{n} W_{ij} x_i - \theta = WX - \theta, \qquad z = f(y) \qquad (1)$$

here W and X are the vectors (W_1j, W_2j, ..., W_nj) and (x_1, x_2, ..., x_n), and n is the number of presynaptic neurons. Adding the elements W_{n+1,j} = θ and x_{n+1} = 1 to give (W_1j, W_2j, ..., W_nj, θ) and (x_1, x_2, ..., x_n, 1), Eq. (1) is rewritten simply as

$$y = WX \qquad (2)$$

z in Eq. (1) is the output value of neuron j and is a step function which takes the following values:

$$f(y) = \begin{cases} 1 & \text{if } y > 0 \\ 0 & \text{if } y < 0 \end{cases} \qquad (3)$$

This function is called the activation function or transfer function. In an ANN, the activation function of a neuron should be differentiable; the most popular one is the sigmoid function

$$f(y) = \frac{1}{1 + \exp(-\alpha y)} \qquad (4)$$

According to the value of α, this function changes from a nearly linear function to a strongly nonlinear function such as the step function (Fig. 5). The sigmoid function is suitable for backpropagation learning because it is differentiable.

Fig. 5. The step function of a biological neuron is simulated by the sigmoid function.

2.3. Operation of artificial neuron

Let us examine the operation of each neuron. As Eq. (2) shows, W determines the output. If y is set to 0, XW defines a supersurface in the X space whose position and gradient are determined by W. Fig. 6 illustrates the concept of the neuron's operation: the vectors X are plotted as dots and the supersurface is the straight line on which y is 0. The z value on the upper side of the line is 1 and that on the lower side is 0. In other words, the neuron performs recognition/classification based on the W obtained by learning. The decision line of the step function of Eq. (3) has no width, but that of the sigmoid function may be wide, and the z value is then somewhere between 0 (lower side) and 1 (upper side).
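To make Eqs. (1)–(4) concrete, the following short Python sketch (our own illustration; the function and variable names are invented for this example, not taken from the original work) evaluates a single artificial neuron with the threshold folded into the weight vector, and shows how the value of α moves the sigmoid between a nearly linear and a step-like response.

```python
import numpy as np

def sigmoid(y, alpha=1.0):
    """Eq. (4): f(y) = 1 / (1 + exp(-alpha*y))."""
    return 1.0 / (1.0 + np.exp(-alpha * y))

def neuron_output(x, w, theta, alpha=1.0):
    """Single artificial neuron, Eqs. (1)-(2) and (4).

    The threshold is absorbed as an extra weight acting on a constant
    input of 1; the sign is chosen so that y = sum_i W_ij x_i - theta,
    as in Eq. (1).
    """
    w_aug = np.append(w, -theta)   # (W_1j, ..., W_nj, -theta)
    x_aug = np.append(x, 1.0)      # (x_1, ..., x_n, 1)
    y = w_aug @ x_aug              # Eq. (2): y = WX
    return sigmoid(y, alpha)       # z = f(y)

# A neuron with W = (1, 1) and theta = 1 separates the region x1 + x2 > 1:
w, theta = np.array([1.0, 1.0]), 1.0
for alpha in (0.5, 4.0, 50.0):     # small alpha ~ nearly linear, large alpha ~ step
    z_lower = neuron_output(np.array([0.2, 0.2]), w, theta, alpha)
    z_upper = neuron_output(np.array([0.8, 0.8]), w, theta, alpha)
    print(f"alpha={alpha:5.1f}  lower side z={z_lower:.3f}  upper side z={z_upper:.3f}")
```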
Fig. 6. Operation of an artificial neuron. According to W, the neuron outputs 0 (lower region) or 1 (upper region) in the X space. If the sigmoid function is used, a wide decision band appears, on which the value is somewhere between 0 and 1.
3. Basic theory of hierarchy-type neural network
3.1. Network for classification

Shown in Fig. 7 is the hierarchy-type three-layer network. In fact, it is easily shown that networks with more than three layers are unnecessary. The circles are neurons, which in simulation are variables taking values ranging from 0 to 1. The data are input to A and output from B. Usually, the number of neurons in the input layer is set equal to the number of parameters (descriptors) plus 1 (bias), while that of the output layer is equal to the number of categories. It should be noted, however, that a bias for the second layer is unnecessary. The number of neurons in the second layer is arbitrary; a small number may reduce the resolution ability, whereas a large number consumes considerable time in the training phase. The recommended number is somewhere between the number of input neurons and double that number, unless the number of input parameters is small (e.g. less than ~5). Here, we recommend that the structure of a hierarchy-type neural network be expressed as N(a,b,c), in which a, b and c are the numbers of neurons of the first, second and third layers [34]. The activation function of the first-layer neurons is usually set to be the linear function that outputs the input value without any change [Eq. (5)]:

$$O_j = y_j \qquad (5)$$

Fig. 7. Three-layer network for classification. The data are input to A and output from B.
As the activation function of the second- and third-layer neurons, the sigmoid function is the usual choice, although variations are possible. Namely, the value of a neuron (O_j) at the second or third layer can be expressed by Eq. (6):

$$O_j = f(y_j) = \frac{1}{1 + \exp(-\alpha y_j)}, \qquad y_j = \Bigl(\sum_i W_{ij} x_i\Bigr) - \theta_j \qquad (6)$$

where x_i is one of the values of a neuron at the first or second layer; W_ij, an element of the weight matrix, expresses the weight value of the connection between neurons i and j and takes either a positive or a negative value; α is a parameter which expresses the nonlinearity of the neuron's operation; θ_j is the threshold value for neuron j. Usually one need not set this value if a constant 1 is always given as a bias to one of the neurons in the input layer. The reason is that the connection weights between the neuron with constant 1 and any of the neurons in the second layer are optimized during the learning (training) phase and so play the same role as an optimized θ. The sigmoid function is not the only possible activation function f(y); it may be replaced by any function with appropriate behavior provided it is differentiable, but such variations are of little practical benefit.
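For concreteness, a minimal forward-pass sketch of an N(a,b,c) network under the conventions just described (linear first layer, sigmoid second layer, bias supplied as a constant-1 input so no explicit θ_j is needed) might look as follows; this is our own illustration, not code from the cited work.

```python
import numpy as np

def sigmoid(y, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * y))

def forward(x, W12, W23, alpha=1.0, linear_output=False):
    """Forward pass of an N(a, b, c) hierarchy network.

    x             -- input pattern of length a; its last element should be the
                     constant 1 that plays the role of the bias / threshold.
    W12, W23      -- weight matrices of shape (a, b) and (b, c).
    linear_output -- False for classification (sigmoid output layer, Fig. 7),
                     True for fitting (linear output layer, Fig. 8).
    """
    o1 = np.asarray(x, dtype=float)        # Eq. (5): first layer is linear
    o2 = sigmoid(o1 @ W12, alpha)          # Eq. (6): second layer
    y3 = o2 @ W23
    return y3 if linear_output else sigmoid(y3, alpha)

# Example: an N(4, 6, 2) classification network (3 descriptors + bias).
rng = np.random.default_rng(0)
W12 = rng.uniform(-1, 1, size=(4, 6))      # initial weights in [-1, 1]
W23 = rng.uniform(-1, 1, size=(6, 2))
pattern = np.array([0.3, 0.7, 0.5, 1.0])   # three scaled descriptors + bias
print(forward(pattern, W12, W23, alpha=1.0))
```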
3.2. Network for fitting

There is not much difference in the network structure between classification and fitting (Fig. 8). The only difference is the activation function of the third-layer neurons: the activation function should be the linear function [Eq. (5)]. This type of neural network is termed an MR-type network [35].
Fig. 8. Three-layer network for fitting. The activation function for the third layer should be a linear function.

3.3. Training

Given N neurons at the first layer, a set of input data with N elements for the input neurons can be expressed as a vector, which is called here an 'input pattern'. Likewise, the output data can also be regarded as a vector and called an 'output pattern' (O_j). The vector which is compared with an output pattern to fix W_ij is called a 'training pattern' (t_j). The training of the network is based on the following equations:

$$\Delta W_{ij}^{(n-1,n)} = -\varepsilon\, \delta_j^{(n)}\, x_i \qquad (7)$$

$$\delta_j^{(3)} = (O_j - t_j)\, g'(y_j) \qquad (8a)$$

$$\delta_j^{(2)} = \Bigl(\sum_l W_{jl}^{(2,3)}\, \delta_l^{(3)}\Bigr) f'(y_j) \qquad (8b)$$

Here, f′() and g′() are the derivative functions of the activation functions of the second- and third-layer neurons, f() and g() respectively, and ε is a parameter which determines the step size of the correction in backpropagation. The superscripts on W and δ indicate the relevant layer(s); thus Eq. (8a) is used only to correct the connections between the second and third (output) layers, while Eq. (8b) is used for the other connections. If f and g are sigmoid functions, the derivative functions f′ and g′ in Eq. (8) are

$$f'(y_j) = g'(y_j) = \alpha\, f(y_j)\bigl[1 - f(y_j)\bigr] \qquad (9)$$

In the above equations, both ε and α can be set independently for each layer. Since the value of each neuron is defined between 0 and 1, the given data must be scaled into the defined region. It should be noted here that if the value of a neuron in the input layer is zero, the connections from such a neuron are always null, i.e. the information from that neuron is not propagated to the second and third layers. To avoid this difficulty, the smallest scaled value should be set slightly larger than zero, typically 0.1. Therefore, the following scaling equation may be used:
$$\bar{p} = \frac{(q_{\max} - q_{\min})\,p + q_{\min}\,p_{\max} - q_{\max}\,p_{\min}}{p_{\max} - p_{\min}} \qquad (10)$$

where q_max and q_min are the maximum and minimum values of the scaled range, p is the element of the input pattern to be scaled, and p_max and p_min are the maximum and minimum values of the elements of the actual input patterns. Finally, each element of the training pattern should also lie between 0 and 1. There are a number of ways to express graded categories. A simple way is to use the pattern (0,0,0,0,1) for the first grade and (1,0,0,0,0) for the fifth grade in a five-graded classification. By doing so, one can observe the degree of contamination of the determined class by other classes. Training is carried out according to the above backpropagation algorithm until the error function

$$E = \sum_j (O_j - t_j)^2 \qquad (11)$$
becomes small enough. The initial weight matrices are created using random numbers so that all elements take values between -1 and 1. Even when M sets of input and training patterns are given, all of the output patterns can be made close enough to the training patterns by iteration through Eqs. (7) and (8), owing to the convergence theorem of the perceptron [36]. If convergence is attained, the neural network for classification has the ability to classify the input patterns into M groups, while that for fitting performs the function of a nonlinear multiregression analysis.
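As an illustration of how Eqs. (7)–(11) fit together with the scaling of Eq. (10), here is a compact training sketch in Python. It is our own reconstruction under the conventions used above (sigmoid second and output layers, bias supplied as a constant input, batch updates); it is not the authors' implementation, and details such as the stopping threshold and ε are arbitrary choices.

```python
import numpy as np

def scale(p, qmin=0.1, qmax=0.9):
    """Eq. (10): map raw descriptor values p onto [qmin, qmax]."""
    p = np.asarray(p, dtype=float)
    pmin, pmax = p.min(), p.max()
    return ((qmax - qmin) * p + qmin * pmax - qmax * pmin) / (pmax - pmin)

def sigmoid(y, alpha):
    return 1.0 / (1.0 + np.exp(-alpha * y))

def train(X, T, n_hidden=10, alpha=1.0, eps=0.5, tol=1e-2, max_iter=50000, seed=0):
    """Backpropagation for an N(a, b, c) classification network, Eqs. (7)-(9)."""
    rng = np.random.default_rng(seed)
    W12 = rng.uniform(-1, 1, (X.shape[1], n_hidden))   # initial weights in [-1, 1]
    W23 = rng.uniform(-1, 1, (n_hidden, T.shape[1]))
    for _ in range(max_iter):
        o2 = sigmoid(X @ W12, alpha)                   # second layer, Eq. (6)
        o3 = sigmoid(o2 @ W23, alpha)                  # output layer
        if np.sum((o3 - T) ** 2) < tol:                # error function, Eq. (11)
            break
        d3 = (o3 - T) * alpha * o3 * (1 - o3)          # Eq. (8a) with Eq. (9)
        d2 = (d3 @ W23.T) * alpha * o2 * (1 - o2)      # Eq. (8b)
        W23 -= eps * o2.T @ d3                         # Eq. (7), layers 2 -> 3
        W12 -= eps * X.T @ d2                          # Eq. (7), layers 1 -> 2
    return W12, W23

# Eq. (10) in use: map a raw descriptor column onto [0.1, 0.9].
print(scale([3.0, 7.5, 12.0, 18.0]))

# Two scaled descriptors plus a constant-1 bias input; two output categories.
X = np.array([[0.1, 0.1, 1.0], [0.1, 0.9, 1.0], [0.9, 0.1, 1.0], [0.9, 0.9, 1.0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
W12, W23 = train(X, T)
print(sigmoid(sigmoid(X @ W12, 1.0) @ W23, 1.0).round(2))
```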
4. Characteristics and/or problems of the neural network's operation

4.1. Basic operating characteristics

The neural network accepts any kind of data that can be expressed numerically. In the learning phase, information about the relationship between the input and training patterns is accumulated in the weight matrices. Once these matrices are determined, the network predicts the categories or intensities of untrained data, even if they lie outside the defined regions. We therefore need to know the characteristics of such operation. Aoyama and Ichikawa studied the basic operating characteristics of neural networks when applied to structure–activity studies [37]. According to their report, an asymmetric character appears in classification. A typical example is as follows. They placed two regions, class 1 and class 2, on a plane (Fig. 9). After training those classes, they examined how points a to d and points o, p and q are classified. Points a and b are assigned to class 1, and c and d to class 2. It was shown, however, that the output patterns for points o, p and q are uniformly shifted toward region 2, the region with the larger positional data.

Fig. 9. Symmetric learning. Positional data on the x–y plane for regions 1 and 2 are trained, and positions a–d and o–q are examined to see how they are classified.

As already discussed, one can have the neural network do a job similar to that of multiregression analysis. In fact, the analysis by the network always gives better results. This stems from the basic difference between the two analytic methods: multiregression analysis seeks linear dependence, while the network uses nonlinear dependence, which includes linear dependence as a special case. This nonlinear fitting, however, implies an asymmetric fitting. How the fitting is carried out can be seen for a straight line, for example y = x. Fig. 10 shows how the fitting proceeds according to the given threshold value of the error function (convergence), where the straight dotted line is y = x, on which the input and training data are located. The network structure was N(2,5,1). One may observe an interesting characteristic of the fitting performance: at both ends (i.e. 0.0 and 19.5) the difference between the input and calculated values
becomes substantial, and the deviation appears sinusoidally along the straight line. It was also shown that as the α value is reduced, the sinusoidal deviation becomes small. This is understandable because the α value in Eq. (4) can be regarded as a nonlinearity parameter.

Fig. 10. How fitting is carried out. The dotted line represents y = x.

It was observed that an asymmetric character generally appears in both classification and fitting. Here, we discuss this problem, although the degree is small and it is unnecessary to worry about it in practical use. The asymmetric behavior stems from the asymmetric evaluation of W_ij in Eq. (6): the information carried by an input element with a large value propagates more intensively to the second layer through the weight matrix. It is, therefore, possible to make the network perform a symmetric operation by adopting a symmetric output function or simply by adopting a 'symmetrical training' procedure. Let us explain the latter method. Suppose that the input data, {p_i}, and the training data, {t_i}, are scaled between 0 and 1. Then the reverses of {p_i} and {t_i} are defined as {p̄_i} (= 1 - p_i) and {t̄_i} (= 1 - t_i). Using them, the following four groups of backpropagation combinations are considered:

1: [{p_i}, {t_i}]   2: [{p̄_i}, {t_i}]   3: [{p_i}, {t̄_i}]   4: [{p̄_i}, {t̄_i}]

Among those training sets there may be several combinations that are equivalent. Removing such redundancies, the effective number of combinations is determined to be N. After the N trainings, the symmetric output pattern is derived by

$$N^{-1}\Bigl\{\sum_{[1,2]} O_i^{I} + \sum_{[3,4]} \bigl(1 - O_i^{I}\bigr)\Bigr\} \qquad (12)$$

where the first summation covers groups 1 and 2 and the second covers groups 3 and 4. By this procedure one can obtain completely symmetric results [37]. However, such a procedure might be unnecessary in practical use.
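A minimal sketch of this 'symmetrical training' bookkeeping and of Eq. (12) is given below. It is our own illustration: the grouping of the four combinations follows our reading of the text, the network outputs in the example are made up, and in a real application one network would be trained per combination with the backpropagation procedure described above.

```python
import numpy as np

def symmetric_training_sets(P, T):
    """Build the four backpropagation combinations used for symmetrical
    training: 1:[P,T], 2:[P_rev,T], 3:[P,T_rev], 4:[P_rev,T_rev],
    where the 'reverse' of a pattern scaled to [0, 1] is 1 - pattern.
    Equivalent combinations should be removed before counting N."""
    P, T = np.asarray(P, float), np.asarray(T, float)
    Pr, Tr = 1.0 - P, 1.0 - T
    return [(P, T), (Pr, T), (P, Tr), (Pr, Tr)]

def symmetric_output(outputs):
    """Eq. (12) for N = 4: average the outputs of the trained networks,
    taking O for the networks trained on T (groups 1, 2) and 1 - O for
    those trained on the reversed targets (groups 3, 4)."""
    o1, o2, o3, o4 = [np.asarray(o, float) for o in outputs]
    return (o1 + o2 + (1 - o3) + (1 - o4)) / 4.0

# Toy illustration with made-up network outputs for one input pattern:
outputs = [np.array([0.83, 0.12]),   # network trained on group 1
           np.array([0.78, 0.15]),   # group 2
           np.array([0.20, 0.90]),   # group 3 (reversed targets)
           np.array([0.17, 0.86])]   # group 4 (reversed targets)
print(symmetric_output(outputs))     # symmetric prediction, Eq. (12)
```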
4.2. Some difficulties

The operation of ANNs is basically nonlinear. The characteristics of nonlinear classification and fitting may be illustrated in a two-dimensional space. Fig. 11 illustrates classification. When one wants to classify the open and filled circles, there are two ways: one is to use a linear line (actually a linear supersurface) and the other is to use a nonlinear line (a nonlinear supersurface). One may easily understand the advantage of the nonlinear separation. At the same time, it is easily predicted for nonlinear classification that if one of the open circles close to the separation surface is missing, a different separation surface is formed, so that the removed circle may not be predicted correctly (Fig. 11B). The same applies to the case of fitting (Fig. 12). Another problem is that such a flexible nonlinear fitting line easily adapts itself even to fit errors (Fig. 13). Nonlinear operation thus has merits and demerits. To avoid the demerits, one needs to introduce some linear character into the network.

Fig. 11. In (A) the dotted straight line is the supersurface of linear separation, while the solid curve is that of nonlinear separation. In (B), if the open circle is missing, a new supersurface of nonlinear separation is created and the missing circle may no longer be classified correctly.

Fig. 12. Nonlinear operation in fitting. If the bracketed data are taken out, a new fitting curve (B) may be created; the bracketed data may then not be correctly predicted.

5. Relationship of operation between ANNs and conventional methods

5.1. The ALS method [38]

Now consider the relationship of operation between the ALS method and the hierarchy neural network [34]. The discriminative function L in the ALS method is expressed using m descriptors x and weight coefficients w as

$$L = w_0 + w_1 x_1^{(j)} + w_2 x_2^{(j)} + \cdots; \qquad L^{(j)} = XW^{(j)} \qquad (13)$$

The rule of discrimination is given according to the value of L: thus, if a_n < L < a_{n+1}, the group is placed in class n. The weight coefficient at cycle j, W^{(j)}, is obtained by

$$W^{(j)} = (X^{t} X)^{-1} X^{t} S^{(j)} \qquad (14)$$

Using the correction term C, S is given by

$$S^{(j+1)} = L_i^{(j)} \quad \text{or} \quad = L_i^{(j)} - C^{(j)} \qquad (15)$$

Here, consider the role of the term S in Eq. (15). Since the dimension of W is null, S must have the same mathematical and physical characteristics as X. Therefore, the expression of Eq. (15) is appropriate and unique, since there is no other quantity equivalent to X in the ALS system. Eq. (15) indicates that S receives feedback from the output and is, indeed, a kind of backpropagation procedure. It is, therefore, easy to simulate the ALS operation in a neural network by imposing the following restrictions on the neural network for classification.

1. Use a two-layer neural network.
2. Use a linear output function for all neurons.
3. Set θ to be a_n.
4. Set w_ij = w_ik, where j and k represent any of the neurons in the second layer.
5. Give a training pattern that ignites n of the output neurons for the n-graded classification.

Fig. 13. Excessive fitting. Even measurement errors may be incorporated as normal input data.

It is, therefore, understood that the operation of the
ALS method is simply a special case of the neural network: linear classification using a two-layer neural network. The details will be given elsewhere [34].
5.2. The multiregression analysis [35]

Here, we describe the relationship between the operation of the neural network and multiregression analysis. For simplicity, let us consider a three-layer network. Since the operation expressed by Eq. (6) results in vector elements that are too close to 0 or 1, Eq. (6) is not very suitable when it is applied to situations where the values between 0 and 1 are important. Therefore, we considered a new operation equation. Without losing generality, one can omit θ_j in Eq. (6), giving

$$y_j = \sum_i W_{ij} x_i \qquad (16)$$

namely y = Wx [Eq. (2)], where W and x are the weight matrix and the input vector, respectively. Thus, if all neurons of each layer are governed by Eq. (16), i.e.

$$y = W_1 x, \qquad z = W_2 y \qquad (17)$$

then the output pattern, z, becomes

$$z = (W_1 W_2)\,x = Wx \qquad (18)$$

where W_1 and W_2 are the matrices which express the weights between layers 1 and 2 and between layers 2 and 3, respectively. The method of multiregression analysis seeks the optimal coefficients of the linear equation

$$z_i = a_i + \sum_i b_i x_i \qquad (19)$$

where z and x are, respectively, the elements of the expectation vector and the input data. Eq. (19) is equivalently rewritten as

$$z = B'(1 + x) \qquad (20)$$

Eq. (18), a special case of the neural network's operation, shows that this operation is equivalent to that of a two-layer network and to that of a generalized multiregression analysis if the variables are set so that x is the vector of observed values plus the constant 1. It should be emphasized here that the addition of the constant 1 to the input data means that the optimization of θ_j in Eq. (6) is carried out through the weight matrix (W_ij). The neural network with Eq. (18) performs a linear operation equivalent to that of multiregression analysis. In order to exceed this level, it is necessary to introduce a nonlinear operation into the network. This is possible by incorporating the hidden layer. By letting O_j = y_j and using Eq. (16) for the last layer, a generalized nonlinear multiregression analysis is established. However, the number of neurons in the second layer must be larger than that of the input layer, to avoid loss of the information carried by the input pattern. In consequence, the operation of the three-layer neural network is nonlinear. Since the linear operation is included as a special case, the neural network is expected to work far better than the ALS method and the multiregression analysis in both classification and fitting.
6. How to overcome defects of ANN's operation and purposive extension

As classification and fitting machines, ANNs are hard to beat, because their operation is nonlinear. However, excessive nonlinearity causes some inconvenience. In addition, the information processing in an ANN is parallel, which makes it difficult to trace the flow of information through the network and to give the reason why a particular decision was made. This section discusses these problems.

6.1. How to deal with excessive nonlinear operation

The neural network based on Eqs. (2)–(4) performs a nonlinear operation. As discussed, a nonlinear operation is not always convenient in practical application, and it may therefore be preferable if a linear operation can be introduced into the neural network. Although this is possible by adopting a smaller α value for the sigmoid function, a (mathematically) simpler way is to define a new activation function as a combination of the sigmoid function and the linear function:
$$O_j = \beta\, h(y_j) + (1 - \beta)\, y_j = f(y_j) \qquad (21)$$

Here h is the sigmoid function and the parameter β expresses the degree of mixing of the linear operation into the nonlinear operation; by changing β, one can therefore pour the linear operation into the network at any level. If β is set at 0, the network can be expected to perform a purely linear operation. In practice, however, a problem arises: if β is set close to 0, the values of the neurons in the second layer easily exceed the defined region (0–1), resulting in a destruction of learning. This difficulty can be removed by introducing the concept of 'neuron fatigue' into the usual backpropagation learning method [39]. Our experience is that the smallest usable value of β is around 0.5 when the learning procedure does not include the 'neuron fatigue' procedure. The training of the network is carried out in the same way, based on Eqs. (7) and (8), until the sum of the squared errors, Σ_j (O_j - t_j)² [Eq. (11)], becomes small enough. If the new activation function is adopted, the derivative function f′() or g′() in Eq. (8) becomes

$$f'(y_i) = \alpha\beta\, h(y_i)\bigl[1 - h(y_i)\bigr] + 1 - \beta \qquad (22)$$

In the above equations ε, α, and even β can be set independently for each layer.
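Eqs. (21) and (22) translate directly into code. The sketch below is illustrative (the names are ours); the finite-difference check at the end simply confirms that Eq. (22) is the derivative of Eq. (21).

```python
import numpy as np

def sigmoid(y, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * y))

def blended_activation(y, alpha=1.0, beta=0.5):
    """Eq. (21): O = beta*h(y) + (1 - beta)*y, a mixture of the sigmoid h
    and the identity; beta = 1 recovers the pure sigmoid neuron, beta = 0
    the purely linear neuron."""
    h = sigmoid(y, alpha)
    return beta * h + (1.0 - beta) * y

def blended_derivative(y, alpha=1.0, beta=0.5):
    """Eq. (22): f'(y) = alpha*beta*h(y)[1 - h(y)] + 1 - beta."""
    h = sigmoid(y, alpha)
    return alpha * beta * h * (1.0 - h) + 1.0 - beta

# Finite-difference check of Eq. (22) at a few points:
y = np.linspace(-3, 3, 7)
num = (blended_activation(y + 1e-6) - blended_activation(y - 1e-6)) / 2e-6
print(np.allclose(num, blended_derivative(y), atol=1e-6))
```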
6.2. How to obtain partial differential coefficients by the neural network [39,40]

Since the operation of the hierarchy-type neural network is completely defined by mathematical formulae, it is possible to take the partial derivative of an output with respect to any input parameter. Since

$$dy_j = W_{ij}^{(1,2)}\, dx_i \qquad (23)$$

and

$$dO_j = f'(y_j)\, dy_j \qquad (24)$$

the partial derivative of the output in the second layer becomes

$$\frac{\partial O_j^{(2)}}{\partial x_i} = f'(y_j)\, W_{ij}^{(1,2)} \qquad (25)$$

Likewise, the partial derivative of the output in the third layer with respect to an input parameter is given by

$$\frac{\partial O_j^{(3)}}{\partial x_i} = \sum_k f'(y_k)\, W_{ik}^{(1,2)}\, g'(y_j)\, W_{kj}^{(2,3)} \qquad (26)$$

where f′ and g′ are, respectively, the derivatives of the activation functions of the second and third layers, while the superscripts on W express the layer order. Using Eq. (21) as the activation function, the expression becomes

$$\frac{\partial O_j^{(3)}}{\partial x_i} = \sum_k \bigl[\beta^{(2)}\alpha^{(2)} h(y_k)\{1 - h(y_k)\} + (1 - \beta^{(2)})\bigr] W_{ik}^{(1,2)} \times \bigl[\beta^{(3)}\alpha^{(3)} h(y_j)\{1 - h(y_j)\} + (1 - \beta^{(3)})\bigr] W_{kj}^{(2,3)} \qquad (27)$$

Note that when β is near 1, the partial derivative ∂O_j/∂x_i approaches 0 as the output value nears 0 or 1. This character stems from the sigmoid function.
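Eq. (26) can be implemented in a few lines and checked against a numerical derivative. The sketch below assumes the same conventions as the earlier forward-pass example (row-vector patterns, bias as a constant input) and is our own illustration rather than the authors' code.

```python
import numpy as np

def sigmoid(y, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * y))

def d_sigmoid(y, alpha=1.0):
    h = sigmoid(y, alpha)
    return alpha * h * (1.0 - h)          # Eq. (9)

def forward(x, W12, W23, alpha=1.0):
    y2 = x @ W12
    y3 = sigmoid(y2, alpha) @ W23
    return sigmoid(y3, alpha), y2, y3

def d_output_d_input(x, W12, W23, alpha=1.0):
    """Eq. (26): dO_j(3)/dx_i = sum_k f'(y_k) W_ik(1,2) g'(y_j) W_kj(2,3).
    Returns a matrix D with D[i, j] = dO_j / dx_i."""
    _, y2, y3 = forward(x, W12, W23, alpha)
    return (W12 * d_sigmoid(y2, alpha)) @ (W23 * d_sigmoid(y3, alpha))

# Check against a numerical derivative for a random N(3, 5, 2) network.
rng = np.random.default_rng(2)
W12, W23 = rng.uniform(-1, 1, (3, 5)), rng.uniform(-1, 1, (5, 2))
x = np.array([0.3, 0.6, 1.0])             # two descriptors + bias
D = d_output_d_input(x, W12, W23)
i, h = 0, 1e-6                            # perturb descriptor x_0
xp, xm = x.copy(), x.copy()
xp[i] += h; xm[i] -= h
num = (forward(xp, W12, W23)[0] - forward(xm, W12, W23)[0]) / (2 * h)
print(np.allclose(D[i], num, atol=1e-6))
```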
6.2.1. Accuracy of the partial derivatives obtained by the neural network

Unlike the case of linear multiregression analysis, there is no analytical method to determine the reliability of the results given by the neural network. We may, therefore, discuss the tendencies of the fitting curve or separation surface using concrete numerical data, although such tendencies are not a proof of reliability. Table 1 shows the analytical and calculated derivatives for the functions y = 2x and y = x², together with the function values reproduced by the network. The network structure was N(2,10,1), where, as a rule, one of the neurons in the first layer was used as a bias. The 21 points 0, 1, 2, ..., 20 were used for training to simulate the y = 2x function, and the 21 points -10, -9, ..., -1, 0, 1, ..., 9, 10 for y = x². One can see that the network reproduced the function values well. Except for the terminal points, the calculated derivatives are within an error of ~5%, and the errors increase at the terminal and extreme points. In our experience this is observed very generally and must be regarded as a defect of nonlinear fitting. Table 2 shows the derivatives in the case of classification. We used the function y = -x + 10; 12 equally spaced points in the region from (0,10) to (10,0) on this line were selected, and the points from (0,10) to
Table 1. Calculated derivatives of simple functions by the neural network^a

| y = 2x: x | y (calc.) | y′ (calc.) | y = x²: x | y (calc.) | y′ (calc.) | y′ (theor.) |
|---|---|---|---|---|---|---|
| 0 | 0.14 | 1.93 | -10 | 99.97 | -18.00 | -20 |
| 1 | 2.08 | 1.94 | -9 | 82.14 | -17.53 | -18 |
| 2 | 4.03 | 1.96 | -8 | 65.15 | -16.33 | -16 |
| 3 | 5.99 | 1.97 | -7 | 49.67 | -14.55 | -14 |
| 4 | 7.96 | 1.98 | -6 | 36.18 | -12.40 | -12 |
| 5 | 9.94 | 1.99 | -5 | 24.93 | -10.10 | -10 |
| 6 | 11.93 | 1.99 | -4 | 15.98 | -7.83 | -8 |
| 7 | 13.93 | 2.00 | -3 | 9.24 | -5.68 | -6 |
| 8 | 15.93 | 2.01 | -2 | 4.56 | -3.70 | -4 |
| 9 | 17.94 | 2.01 | -1 | 1.80 | -1.84 | -2 |
| 10 | 19.95 | 2.02 | 0 | 0.85 | -0.07 | 0 |
| 11 | 21.97 | 2.02 | 1 | 1.67 | 1.71 | 2 |
| 12 | 23.99 | 2.02 | 2 | 4.30 | 3.57 | 4 |
| 13 | 26.01 | 2.02 | 3 | 8.85 | 5.56 | 6 |
| 14 | 28.02 | 2.02 | 4 | 15.49 | 7.74 | 8 |
| 15 | 30.04 | 2.91 | 5 | 24.38 | 10.06 | 10 |
| 16 | 32.05 | 2.01 | 6 | 35.64 | 12.45 | 12 |
| 17 | 34.05 | 2.00 | 7 | 49.24 | 14.72 | 14 |
| 18 | 36.05 | 1.99 | 8 | 64.95 | 16.62 | 16 |
| 19 | 38.03 | 1.97 | 9 | 82.27 | 17.89 | 18 |
| 20 | 40.00 | 1.96 | 10 | 100.45 | 18.32 | 20 |

^a The network structure was N(2,10,1), where α = 4, β = 1, and the threshold of convergence < 10^-5 in terms of the scaled units.
Table 2. Derivatives in classification^a

| Point | Variable x | Variable y |
|---|---|---|
| 1 (0,10) | 0.00 (0.00) | 0.00 (0.00) |
| 2 (0.83,9.17) | 0.00 (0.00) | 0.00 (0.00) |
| 3 (1.67,8.33) | 0.00 (0.00) | 0.00 (0.00) |
| 4 (2.50,7.50) | 0.00 (0.00) | 0.00 (0.00) |
| 5 (3.33,6.67) | -0.03 (0.03) | 0.03 (-0.03) |
| 6 (4.17,5.83) | -0.75 (0.74) | 0.75 (-0.74) |
| 7 (5.83,4.17) | -0.74 (0.73) | 0.74 (-0.73) |
| 8 (6.67,3.33) | -0.04 (0.03) | 0.04 (-0.03) |
| 9 (7.50,2.50) | 0.00 (0.00) | 0.00 (0.00) |
| 10 (8.33,1.67) | 0.00 (0.00) | 0.00 (0.00) |
| 11 (9.17,0.83) | 0.00 (0.00) | 0.00 (0.00) |
| 12 (10,0) | 0.00 (0.00) | 0.00 (0.00) |

^a The network structure was N(3,10,2), where α = 1, β = 1, and the threshold of convergence < 10^-3 in terms of the scaled units. Points 1 to 6 were used for training as class 1, and points 7 to 12 as class 2. The values in parentheses are the derivatives for class 2.
(4.17,5.83) were trained as class 1 and the other points as class 2. The network structure was N(3,10,2). As the center (5,5) is the inflection point, the derivatives around the center show steep gradients, while those around the trained points are zero. The results are reasonable and satisfactory.
6.2.2. Is the independency between input neurons kept?

When the input parameters are independent of each other, one may wonder whether or not the decision or derivative given by the neural network is influenced by the values of the other input neurons. This should be checked carefully if one discusses interpolated values, since they are not trained points. To examine such independency we introduced dummy input neurons against which the predictions and decisions were checked. The network structure was N(3,10,2). This time the 21 points with the dummy neuron
Table 3. Fluctuation of the predicted function's values and their derivatives with the value of the dummy neuron, for y = 2x^a

| x^b | Function's values: -10^c | -5 | 0 | 5 | 10 | Derivatives: -10 | -5 | 0 | 5 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 1.17 | 0.67 | 0.44 | 0.59 | 1.19 | 1.86 | 1.90 | 1.90 | 1.88 | 1.83 |
| 1.5 | 3.05 | 2.58 | 2.36 | 2.49 | 3.04 | 1.89 | 1.93 | 1.94 | 1.92 | 1.87 |
| 2.5 | 4.95 | 4.53 | 4.32 | 4.42 | 4.93 | 1.92 | 1.97 | 1.98 | 1.95 | 1.90 |
| 3.5 | 6.89 | 6.52 | 6.31 | 6.39 | 6.85 | 1.95 | 2.00 | 2.01 | 1.99 | 1.94 |
| 4.5 | 8.85 | 8.53 | 8.34 | 8.40 | 8.80 | 1.97 | 2.03 | 2.04 | 2.02 | 1.96 |
| 5.5 | 10.83 | 10.57 | 10.40 | 10.43 | 10.77 | 1.99 | 2.05 | 2.07 | 2.05 | 1.99 |
| 6.5 | 12.83 | 12.64 | 12.48 | 12.50 | 12.77 | 2.01 | 2.07 | 2.10 | 2.07 | 2.01 |
| 7.5 | 14.85 | 14.72 | 14.59 | 14.58 | 14.79 | 2.03 | 2.09 | 2.12 | 2.09 | 2.03 |
| 8.5 | 16.88 | 16.82 | 16.71 | 16.68 | 16.83 | 2.04 | 2.11 | 2.13 | 2.11 | 2.04 |
| 9.5 | 18.92 | 18.93 | 18.84 | 18.79 | 18.87 | 2.04 | 2.11 | 2.14 | 2.12 | 2.05 |
| 10.5 | 20.97 | 21.04 | 20.99 | 20.91 | 20.92 | 2.05 | 2.12 | 2.15 | 2.12 | 2.05 |
| 11.5 | 23.02 | 23.16 | 23.13 | 23.04 | 22.98 | 2.05 | 2.12 | 2.15 | 2.13 | 2.06 |
| 12.5 | 25.07 | 25.28 | 25.28 | 25.16 | 25.03 | 2.04 | 2.12 | 2.14 | 2.12 | 2.05 |
| 13.5 | 27.11 | 27.39 | 27.42 | 27.28 | 27.08 | 2.04 | 2.11 | 2.13 | 2.11 | 2.05 |
| 14.5 | 29.14 | 29.49 | 29.55 | 29.39 | 29.12 | 2.03 | 2.09 | 2.12 | 2.10 | 2.03 |
| 15.5 | 31.26 | 31.58 | 31.66 | 31.48 | 31.15 | 2.01 | 2.08 | 2.10 | 2.09 | 2.02 |
| 16.5 | 33.16 | 33.64 | 33.75 | 33.56 | 33.16 | 1.99 | 2.05 | 2.08 | 2.06 | 2.00 |
| 17.5 | 35.14 | 35.68 | 35.82 | 35.61 | 35.15 | 1.97 | 2.03 | 2.06 | 2.04 | 1.98 |
| 18.5 | 37.09 | 37.70 | 37.86 | 37.64 | 37.11 | 1.94 | 2.00 | 2.02 | 2.01 | 1.95 |
| 19.5 | 39.02 | 39.68 | 39.87 | 39.63 | 39.05 | 1.91 | 1.97 | 1.99 | 1.98 | 1.92 |

^a The network structure was N(3,10,1), where one of the neurons in the first layer was used as the dummy input. The other network parameters and the convergence condition were the same as those in Table 1.
^b Predicted point.
^c Value input to the dummy neuron.
were trained at the same time, with the value of the dummy neuron changed to -10, -5, 0, 5, and 10. That is, the 21 input data were given with -10 for the dummy neuron, the same 21 input data with -5 for the dummy neuron, and so on. By doing so, one can teach the network that the dummy neuron is independent of the decision. Then intermediate points between the training points were predicted by the network, with various values for the dummy neuron (-10, -5, 0, 5 and 10) input in order to see its influence. Table 3 shows the results. The predicted values are not very accurate at the ends, and the fluctuations caused by the dummy neuron become large near both ends, especially at the low end. Table 4 shows the case of the y = x² function. Since the neural network is good at treating nonlinear correlations, the fluctuation in both the predicted and derivative values is smaller than that for the linear relation. However, one can see that, as a rule, near the inflection point the fluctuation tends to be large. The results for classification are shown in Table 5, where only two points, (0,10) and (10,0), were trained. As one can see, the fluctuations in both cases are small enough to say that the independency is virtually maintained. From the above results, it may be said in general that if the independency of the input parameters is properly incorporated in the network by training, the predicted values and their derivatives are not influenced by the values of other input parameters. However, a small amount of fluctuation appears at or near the terminal and extreme points.
6.2.3. Isolation of functions out of the mixed functions It is possible to take the partial derivatives of the output strength with respect to each input parameter. This means that isolation of individual linear func-
Table 4. Fluctuation of the predicted function's values and their derivatives with the value of the dummy neuron, for y = x²^a

| x^b | Predicted values: -10^c | -5 | 0 | 5 | 10 | Derivatives: -10 | -5 | 0 | 5 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| -9.5 | 90.93 | 90.89 | 90.76 | 90.66 | 90.69 | -17.89 | -17.98 | -18.00 | -17.96 | -17.87 |
| -8.5 | 73.44 | 73.32 | 73.19 | 73.14 | 73.25 | -16.99 | -17.04 | -17.04 | -16.99 | -16.92 |
| -7.5 | 57.17 | 57.01 | 56.89 | 56.88 | 57.05 | -15.47 | -15.49 | -15.47 | -15.43 | -15.39 |
| -6.5 | 42.65 | 42.49 | 42.39 | 42.41 | 42.60 | -13.51 | -13.50 | -13.48 | -13.45 | -13.44 |
| -5.5 | 30.24 | 30.09 | 30.02 | 30.05 | 30.24 | -11.31 | -11.28 | -11.25 | -11.24 | -11.25 |
| -4.5 | 20.06 | 19.96 | 19.91 | 19.95 | 20.12 | -9.04 | -9.00 | -8.97 | -8.07 | -9.00 |
| -3.5 | 12.14 | 12.08 | 12.05 | 12.09 | 12.22 | -6.83 | -6.78 | -6.76 | -6.76 | -6.81 |
| -2.5 | 6.37 | 6.36 | 6.35 | 6.38 | 6.46 | -4.73 | -4.68 | -4.66 | -4.68 | -4.73 |
| -1.5 | 2.63 | 2.67 | 2.68 | 2.69 | 2.72 | -2.76 | -2.72 | -2.70 | -2.72 | -2.78 |
| -0.5 | 0.81 | 0.89 | 0.91 | 0.90 | 0.87 | -0.89 | -0.85 | -0.84 | -0.87 | -0.93 |
| 0.5 | 0.84 | 0.96 | 0.98 | 0.94 | 0.85 | 0.93 | 0.97 | 0.97 | 0.94 | 0.88 |
| 1.5 | 2.68 | 2.84 | 2.87 | 2.79 | 2.64 | 2.77 | 2.80 | 2.80 | 2.77 | 2.72 |
| 2.5 | 6.40 | 6.59 | 6.61 | 6.51 | 6.31 | 4.68 | 4.71 | 4.71 | 4.69 | 4.63 |
| 3.5 | 12.08 | 12.31 | 12.33 | 12.21 | 11.96 | 6.71 | 6.75 | 6.75 | 6.73 | 6.68 |
| 4.5 | 19.86 | 20.12 | 20.16 | 20.02 | 19.73 | 8.87 | 8.92 | 8.93 | 8.92 | 8.88 |
| 5.5 | 29.86 | 30.17 | 30.23 | 30.07 | 29.75 | 11.14 | 11.19 | 11.22 | 11.21 | 11.19 |
| 6.5 | 42.15 | 42.52 | 42.60 | 42.45 | 42.11 | 13.43 | 13.49 | 13.52 | 13.52 | 13.51 |
| 7.5 | 56.67 | 57.09 | 57.21 | 57.07 | 56.72 | 15.57 | 15.63 | 15.67 | 15.68 | 15.67 |
| 8.5 | 73.16 | 73.65 | 73.81 | 73.68 | 73.33 | 17.34 | 17.41 | 17.45 | 17.46 | 17.46 |
| 9.5 | 91.16 | 91.71 | 91.90 | 91.80 | 91.44 | 18.53 | 18.59 | 18.63 | 18.65 | 18.65 |

^a The network structure was N(3,10,1), where one of the neurons in the first layer was used as the dummy input. The other network parameters and the convergence condition were the same as those in Table 1.
^b Predicted point.
^c Value input to the dummy neuron.
Table 5. Fluctuation of predicted values and derivatives with the value of the dummy neuron, in classification^a

| Point^b | Predicted values: 10^c | 5 | 0 | -5 | -10 | Derivatives: 10 | 5 | 0 | -5 | -10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 (0,10) | 0.978 | 0.978 | 0.978 | 0.977 | 0.977 | 0.065 | 0.065 | 0.064 | 0.064 | 0.064 |
| 2 (0.83,9.17) | 0.988 | 0.966 | 0.966 | 0.966 | 0.965 | 0.118 | 0.117 | 0.116 | 0.115 | 0.114 |
| 3 (1.67,8.33) | 0.944 | 0.944 | 0.944 | 0.944 | 0.943 | 0.223 | 0.220 | 0.218 | 0.215 | 0.213 |
| 4 (2.50,7.50) | 0.901 | 0.902 | 0.902 | 0.902 | 0.902 | 0.428 | 0.422 | 0.417 | 0.412 | 0.407 |
| 5 (3.33,6.67) | 0.821 | 0.823 | 0.823 | 0.824 | 0.825 | 0.784 | 0.774 | 0.765 | 0.756 | 0.748 |
| 6 (4.17,5.83) | 0.686 | 0.687 | 0.689 | 0.691 | 0.692 | 1.227 | 1.219 | 1.211 | 1.202 | 1.192 |
| 7 (5.83,4.17) | 0.500 | 0.502 | 0.504 | 0.506 | 0.508 | 1.446 | 1.448 | 1.449 | 1.449 | 1.448 |
| 8 (6.67,3.33) | 0.315 | 0.316 | 0.318 | 0.819 | 0.320 | 1.209 | 1.219 | 1.229 | 1.238 | 1.246 |
| 9 (7.50,2.50) | 0.180 | 0.180 | 0.180 | 0.181 | 0.182 | 0.766 | 0.755 | 0.784 | 0.793 | 0.802 |
| 10 (8.33,1.67) | 0.100 | 0.100 | 0.100 | 0.100 | 0.100 | 0.419 | 0.424 | 0.429 | 0.434 | 0.439 |
| 11 (9.17,0.83) | 0.058 | 0.057 | 0.057 | 0.057 | 0.057 | 0.220 | 0.221 | 0.224 | 0.226 | 0.228 |
| 12 (10,0) | 0.035 | 0.035 | 0.034 | 0.034 | 0.034 | 0.117 | 0.118 | 0.118 | 0.119 | 0.120 |
| 1 (0,10) | 0.023 | 0.023 | 0.022 | 0.022 | 0.022 | 0.065 | 0.066 | 0.066 | 0.066 | 0.066 |

^a Only two points, (0,10) and (10,0), on the x–y plane were used for training as classes 1 and 2, respectively. The network structure was N(4,19,2). One of the first-layer neurons was used as the dummy input. The other network parameters and convergence conditions were the same as those in Table 2.
^b Predicted point.
^c Value input to the dummy neuron.
Table 6. Example of input and training data to isolate individual functions out of their combined function

| Sample no. | Input data x | Input data y | Training data (x + 2y) |
|---|---|---|---|
| 1 | 8.5 | 0.6 | 9.7 |
| 2 | 0.1 | 2.0 | 4.1 |
| 3 | 9.3 | 4.1 | 17.5 |
| 4 | 0 | 0.6 | 1.2 |
tions out of the mixed function is possible. We performed this separation test to find the minimum number of samples needed for practical isolation, according to the following procedure. For example, let us consider the isolation of x and 2y out of the function x + 2y. As shown in Table 6, random numbers for x and y are used as the input patterns, and the values of x + 2y are computed and used as the training pattern. By obtaining the partial derivatives, one can recover the relationship x + 2y. The derivatives are averaged over the sample data. Table 7 shows the results for x + 2y, x - y, x + 2y + 3z, x + y², and x - y + z², where the values in parentheses show the maximum deviations from the mean values. Although these derivatives may vary
Table 8. Recognition test in x + ay^a

| a | 1 | 2 | Ratio |
|---|---|---|---|
| 0.6 | 0.982 | 0.595 | 0.606 |
| 0.7 | 0.991 | 0.689 | 0.695 |
| 0.8 | 0.989 | 0.795 | 0.804 |
| 0.9 | 0.998 | 0.895 | 0.897 |

^a The network structure was N(3,10,1), where α = 1, β = 1, and the threshold of convergence < 10^-5 in terms of the scaled units.
according to the random numbers that are fed in, one can see that the neural network can detect the individual functions from a relatively small number of data. The number required depends on the complexity of the mixed function. In a simple relationship like x + 2y, only five sets of data seem to be enough to separate them. However, for the rather complex x - y + z² function, 15 sets of random numbers were needed.
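The isolation test is easy to reproduce in outline. In the sketch below, scikit-learn's MLPRegressor with a single logistic hidden layer stands in for the N(3,10,1) backpropagation network of the text (its training details differ from those described here), and the partial derivatives are taken numerically and averaged over the samples, as in Table 7; the exact numbers depend on the random data and the quality of the fit.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
n = 15                                         # number of random data sets
X = rng.uniform(0, 1, size=(n, 2))             # random, scaled x and y (cf. Table 6)
w = X[:, 0] + 2.0 * X[:, 1]                    # training pattern w = x + 2y

# A small sigmoid-hidden-layer regressor stands in for the fitting network.
net = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                   solver='lbfgs', max_iter=20000, random_state=0)
net.fit(X, w)

# Average the numerical partial derivatives dw/dx and dw/dy over the samples;
# for w = x + 2y they should come out near 1 and 2 (cf. Table 7).
h = 1e-3
derivs = []
for i in range(2):
    Xp, Xm = X.copy(), X.copy()
    Xp[:, i] += h
    Xm[:, i] -= h
    derivs.append(np.mean((net.predict(Xp) - net.predict(Xm)) / (2 * h)))
print([round(d, 2) for d in derivs])
```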
6.2.4. Recognition of two similar functions

As already mentioned, the neural network easily recognizes the two component functions, x and 2y, in x + 2y. We then examined the limit of this separation ability using the function x + ay, where a was changed from 0.6 to
Table 7. Isolation of individual functions out of the mixed function^a

| Function | Derivative | Number of data sets: 5 | 10 | 15 | 20 | 30 |
|---|---|---|---|---|---|---|
| x + 2y | x′ | 0.953 (0.188) | 0.979 (0.121) | 0.977 (0.143) | 0.985 (0.519) | 0.989 (0.141) |
|  | y′ | 1.946 (0.273) | 1.953 (0.137) | 1.949 (0.161) | 1.949 (0.134) | 1.970 (0.136) |
| x - y | x′ | 1.433 (0.388) | 0.972 (0.077) | 0.973 (0.093) | 0.985 (0.066) | 0.984 (0.011) |
|  | y′ | -0.674 (0.197) | -0.962 (0.085) | -0.976 (0.101) | -0.981 (0.086) | -0.996 (0.126) |
| x + 2y + 3z | x′ | 0.848 (0.411) | 0.979 (0.254) | 0.968 (0.200) | 0.989 (0.254) | 1.001 (0.168) |
|  | y′ | 1.636 (0.690) | 1.928 (0.230) | 1.949 (0.260) | 1.931 (0.339) | 1.980 (0.301) |
|  | z′ | 3.263 (1.269) | 2.915 (0.558) | 2.962 (0.460) | 2.921 (0.441) | 2.960 (0.410) |
| x + y² | x′ | 1.646 (4.014) | 1.111 (0.528) | 1.055 (0.081) | 1.007 (0.109) | 1.009 (0.072) |
| x - y² | x′ | 1.656 (3.953) | 0.963 (0.622) | 1.073 (0.442) | 1.010 (0.446) | 0.994 (0.366) |
| x - y + z² | x′ | -0.350 (1.052) | 0.449 (0.370) | 1.065 (0.188) | 1.008 (0.315) | 1.051 (0.157) |
|  | y′ | 0.005 (0.965) | -0.406 (0.343) | -0.936 (0.430) | -1.025 (0.169) | -1.009 (0.193) |

^a The network structure was N(n,10,1); α = 4, β = 1, and the convergence condition < 10^-5 in terms of the scaled units. Averaged derivatives are shown. The values in parentheses are the maximum deviations from the mean values.
0.9. In order to overcome the problem of scarcity of data, we used 40 sets of random numbers, generated independently for x and y. The results are shown in Table 8. It is surprising that, if the data are sufficient, the neural network can distinguish the two functions, x and 0.9y, in the function x + 0.9y.
6.2.5. Recognition of two similar functions by correlated input data

So far, we have used completely independent random numbers for the variables x, y and z. Actual data for quantitative structure–property relationship (QSPR) analysis, however, show some degree of correlation among the descriptors. It is, therefore, necessary to determine the relationship between the separability and the dispersion of the input data. To this end, we again used the function x + ay, where the input data for x and y were correlated in the following way. The similarity of a data set is measured by the dispersion σ² [= Σ(x_i - y_i)²/N, where N is the total number of data sets (= 40)]. The correlation was produced by discarding a data pair (x_i, y_i) if (x_i - y_i)² > ξ. If ξ is chosen to be 0.82, 0.59 or 0.37, σ² is approximately 0.2, 0.1 or 0.05, respectively. The cases with the coefficient a being 1.1, 1.3 and 2.0 were examined. Table 9 shows the results. When a is > 1.3, the neural network can distinguish the two functions even at σ² = 0.05. However, when a = 1.1, the separation cannot be carried out unless σ² is > 0.1.

Table 9. Recognition of two similar linear functions by correlated input data^a

|  |  | Trial 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| σ² = 0.05 (a = 1.1) | x | 1.009 | 1.058 | 1.034 | 1.082 | 1.103 | 1.112 |
|  | y | 1.079 | 1.034 | 1.052 | 1.005 | 0.981 | 0.981 |
| σ² = 0.1 (a = 1.1) | x | 1.022 | 1.025 | 0.953 | 0.976 | 0.977 | 1.022 |
|  | y | 1.070 | 1.064 | 1.141 | 1.115 | 1.105 | 1.072 |
| σ² = 0.2 (a = 1.1) | x | 1.009 | 1.039 | 0.988 | 0.977 | 1.012 | 0.978 |
|  | y | 1.082 | 1.055 | 1.105 | 1.115 | 1.074 | 1.104 |
| σ² = 0.05 (a = 1.3) | x | 1.006 | 1.082 | 1.110 | 1.120 | 1.115 | 1.067 |
|  | y | 1.281 | 1.209 | 1.174 | 1.116 | 1.168 | 1.225 |
| σ² = 0.05 (a = 2.0) | x | 1.085 | 1.106 | 1.153 | 1.178 | 1.130 | 1.230 |
|  | y | 1.90 | 1.881 | 1.829 | 1.807 | 1.845 | 1.764 |

^a All conditions concerning the network as in Table 8.
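The data-correlating procedure can be sketched as follows. Note that the sampling range of the original x, y data is not stated in this excerpt, so the σ² values produced by this illustration (which assumes uniform sampling on [0, 1]) will not match the ξ → σ² correspondence quoted above; only the rejection mechanism itself is reproduced.

```python
import numpy as np

def correlated_pairs(n, xi, rng):
    """Draw (x, y) pairs but discard any pair with (x - y)**2 > xi, as
    described in the text; smaller xi gives more strongly correlated data
    and hence a smaller dispersion sigma^2."""
    pairs = []
    while len(pairs) < n:
        x, y = rng.uniform(0, 1, size=2)   # sampling range assumed, see note above
        if (x - y) ** 2 <= xi:
            pairs.append((x, y))
    return np.array(pairs)

rng = np.random.default_rng(4)
for xi in (0.82, 0.59, 0.37):
    d = correlated_pairs(40, xi, rng)              # N = 40 data sets
    sigma2 = np.mean((d[:, 0] - d[:, 1]) ** 2)     # dispersion sigma^2
    print(f"xi = {xi:4.2f}  ->  sigma^2 = {sigma2:.3f}")
```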
6.2.6. Application of the partial derivative method [40]

The recognition tests of individual functions out of a combined function indicate that the recognition ability of the hierarchy neural network is powerful, in fact far better than we had expected, although its reliability is not given in terms of any mathematical formula. As an example of application, we used a rather standard problem: the relationship between the ¹³C NMR chemical shift and the configuration of the substituent in norbornanes and norbornenes. This problem is the so-called Kowalski

Table 10. Derivatives of norbornane and norbornene^a

| Substituent | Compound no. (exo) | Compound no. (endo) |
|---|---|---|
| CH3 | 1 | 14 |
| NH2 | 2 | 15 |
| OH | 3 | 16 |
| COOH | 4 | 17 |
| CH2OH | 5 | 18 |
| CH3, =O(3)^b | 6 | 19 |
| CH3, =O(5) | 7 | 20 |
| CH3, F2(6) | 8 | 21 |
| CH3, 5=6^c | 9 | 22 |
| OH, 5=6 | 10 | 23 |
| CH3, CH3(4) | 11 | 24 |
| CH3, =CH2(3) | 12 | 25 |
| OH, CH3(1), (CH3)2(7) | 13 | 26 |
| CN | 26 | 27 |
| COOCH3 | 28 | 29 |
| CH3, =O(6) | 30 | 31 |
| CH3, F2(5) | 32 | 33 |
| CH2OH, 5=6 | 34 | 35 |
| Cl, CH3(1), (CH3)2(7) | 37 | 38 |

^a The data are quoted from the literature [41].
^b Indicates that the attached substituent is at position 3.
^c Indicates that the double bond is between positions 5 and 6.
Table 11. Relative ¹³C NMR chemical shifts and conformations in norbornanes and norbornenes^a

| Compound | C1 | C2 | C3 | C4 | C5 | C6 | C7 | Exo/endo |
|---|---|---|---|---|---|---|---|---|
| 1 | 6.7 | 6.7 | 10.1 | 0.5 | 0.2 | -1.1 | -3.7 | Exo |
| 2 | 8.9 | 25.3 | 12.4 | -0.4 | -1.2 | -3.1 | -4.4 | Exo |
| 3 | 7.7 | 44.3 | 12.3 | -1.0 | -1.3 | -5.2 | -4.4 | Exo |
| 4 | 4.6 | 16.7 | 4.4 | -0.2 | -0.3 | -1.0 | -1.8 | Exo |
| 5 | 1.8 | 15.1 | 4.4 | -0.2 | 0.2 | -0.7 | -3.3 | Exo |
| 6 | 5.7 | 3.0 | 2.6 | -0.5 | -0.4 | 0.7 | -3.5 | Exo |
| 7 | 6.1 | 5.9 | 10.6 | 0.6 | 0.2 | 0.2 | -3.7 | Exo |
| 8 | 6.5 | 6.3 | 10.4 | 0.3 | -0.8 | -0.1 | -3.5 | Exo |
| 9 | 6.5 | 7.5 | 9.5 | 0.5 | 1.7 | 0.7 | -3.8 | Exo |
| 10 | 7.8 | 47.0 | 11.7 | -1.3 | 3.9 | -2.7 | -3.2 | Exo |
| 11 | 6.9 | 6.4 | 10.1 | 0.7 | -1.2 | 0.1 | -3.9 | Exo |
| 12 | 5.6 | 4.9 | 7.0 | 0.2 | -1.1 | 0.2 | -3.9 | Exo |
| 13 | 2.5 | 42.5 | 11.9 | -0.8 | -1.1 | -2.4 | 1.4 | Exo |
| 14 | 5.4 | 4.5 | 10.6 | 1.4 | 0.5 | -7.7 | 0.2 | Endo |
| 15 | 6.8 | 23.3 | 10.5 | 1.2 | 0.6 | -9.5 | 0.3 | Endo |
| 16 | 6.3 | 42.4 | 9.5 | 0.9 | 0.2 | -9.7 | -0.9 | Endo |
| 17 | 4.2 | 16.2 | 2.1 | 0.9 | -0.6 | -4.8 | 1.9 | Endo |
| 18 | 1.7 | 12.8 | 4.0 | 0.4 | 0.2 | -7.2 | 1.4 | Endo |
| 19 | 4.7 | 3.1 | 2.2 | 0.3 | 1.3 | -6.5 | -0.6 | Endo |
| 20 | 4.7 | 5.3 | 9.2 | 1.3 | -0.4 | -6.5 | 1.4 | Endo |
| 21 | 4.6 | 11.5 | 8.9 | -0.1 | 0.8 | 0.4 | 1.8 | Endo |
| 22 | 5.6 | 7.5 | 8.7 | 1.4 | 1.7 | -3.0 | 1.7 | Endo |
| 23 | 7.1 | 47.8 | 13.3 | 2.2 | 3.6 | -3.4 | 0.6 | Endo |
| 24 | 4.1 | 1.2 | 7.0 | 0.7 | 0.5 | -7.4 | 0.0 | Endo |
| 25 | 3.2 | 40.2 | 10.4 | -0.5 | 0.0 | -10.3 | 3.1 | Endo |
| 26 | 5.5 | 1.0 | 6.3 | -0.3 | -1.5 | -1.6 | -1.3 | Exo |
| 27 | 3.4 | 0.1 | 5.5 | 0.2 | -0.7 | -4.9 | 1.0 | Endo |
| 28 | 5.1 | 16.4 | 4.2 | -0.4 | -1.1 | -1.4 | -2.1 | Exo |
| 29 | 4.0 | 15.9 | 2.2 | 0.7 | -0.7 | -5.0 | 1.7 | Endo |
| 30 | 6.6 | 7.0 | 10.1 | 0.2 | -1.2 | 0.5 | -3.7 | Exo |
| 31 | 6.0 | 8.4 | 11.2 | -0.1 | 0.7 | -1.5 | -1.6 | Endo |
| 32 | 6.3 | 7.2 | 9.8 | 0.7 | -0.1 | 0.8 | -3.5 | Exo |
| 33 | 5.1 | 4.8 | 8.4 | 1.1 | -0.1 | -7.3 | 1.6 | Endo |
| 34 | 1.9 | 17.1 | 5.2 | -0.1 | 0.9 | 0.9 | -3.4 | Exo |
| 35 | 2.3 | 18.3 | 5.0 | 0.3 | 1.3 | -2.9 | 1.4 | Endo |
| 36 | 5.1 | 4.0 | 8.4 | 1.1 | 0.2 | -7.7 | 1.6 | Endo |
| 37 | 2.9 | 30.3 | 13.4 | -0.5 | -2.1 | -0.7 | 2.0 | Exo |
| 38 | 3.7 | 29.8 | 10.8 | -1.6 | -1.1 | -9.0 | 2.2 | Endo |

^a The data are quoted from the literature [41].
Shown in Tables 10 and 11 are the endo/exo configurations and the relative 13C NMR chemical shifts in the derivatives of norbornane and norbornene, quoted from the literature [41], where the same compound numbers are used. In accordance with former studies, we used 25 (nos. 1-25) out of the 38 data as training data. The network structure was set to N(8,14,2). Table 12 shows the averaged partial derivatives. The absolute values for each pair of parameters are nearly the same, but the signs are opposite; this means that each input parameter contributes to the exo/endo decision in opposite directions with the same magnitude.
Table 12
Correlation analysis between the 13C NMR chemical shifts and the configuration a

Configuration   C1       C2       C3       C4       C5       C6       C7
Exo             -0.009   0.088    0.031    -0.111   -0.107   0.141    -0.213
Endo            0.001    -0.088   -0.032   0.110    0.109    -0.141   0.214

a The network structure was N(8,14,2), where a = 1, b = 1 and the threshold of convergence was <10⁻³. Averaged values are shown.
The absolute values for parameters 1 and 3 are negligibly small, indicating that these parameters have almost nothing to do with the exo/endo decision. The largest absolute values are found for parameter 7, showing that this parameter makes the major contribution to the decision. These results are in good accord with chemical experience. The merit of the present method, however, is that one can quantitatively handle the degree of the contribution of each input parameter.
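The partial derivative calculation itself is easy to reproduce for a one-hidden-layer network with sigmoid units in both the hidden and output layers. The sketch below is only a schematic illustration: the weights are random stand-ins for a trained network of the N(8,14,2) type used here, the bias is omitted, and the sigmoid gain a is treated as a plain parameter. It computes dO_k/dx_i analytically by the chain rule and checks the result against central finite differences.

```python
import numpy as np

def sigmoid(u, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * u))

def forward(x, W1, W2, a=1.0):
    """One hidden layer; sigmoid units in the hidden and output layers (no bias)."""
    h = sigmoid(W1 @ x, a)
    o = sigmoid(W2 @ h, a)
    return h, o

def input_derivatives(x, W1, W2, a=1.0):
    """Analytical dO_k/dx_i: chain rule through the two sigmoid layers."""
    h, o = forward(x, W1, W2, a)
    dh = a * h * (1.0 - h)                    # sigmoid' at the hidden layer
    do = a * o * (1.0 - o)                    # sigmoid' at the output layer
    return do[:, None] * ((W2 * dh) @ W1)     # shape: (outputs, inputs)

rng = np.random.default_rng(1)
W1 = rng.normal(size=(14, 7))                 # stand-ins for trained weights
W2 = rng.normal(size=(2, 14))
x = rng.random(7)                             # one (scaled) chemical-shift pattern

J = input_derivatives(x, W1, W2)              # rows: two outputs, columns: seven inputs

# numerical check by central differences
eps = 1e-6
J_num = np.empty_like(J)
for i in range(x.size):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    J_num[:, i] = (forward(xp, W1, W2)[1] - forward(xm, W1, W2)[1]) / (2 * eps)
print(np.allclose(J, J_num, atol=1e-8))       # True: the two derivatives agree
```

Averaging such derivatives over all training patterns gives values of the kind listed in Table 12.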
6.3. Reconstruction learning [34] As we study the operation of such neural networks, we are surprised to see how the behavior of the neural network resembles that of the brain. In the neural network, the information, which is accumulated in the learning phase, is kept as the strength of the connections between the neurons recorded in the weight matrices. As our experience tells us, we learn things repeatedly through the processes of learning and forgetting. The memory that is obtained through such processes is settled firmly in the mind. We considered that these processes may be incorporated in the hierarchy neural networks, and when this is done, it is interesting to know what happens to the connections of the neurons. It is shown that the weight matrices are not unique even if the network gives the same results [35]. This indicates that various kinds of reconstruction of the weight matrices are possible. Therefore, we tried to introduce the procedures of both the learning and forgetting processes into the learning phase of the neural network. The weight matrices thus obtained are called ‘reconstructed matrices.’ The reconstructed matrices were surprising and suggestive. They were found to be widely applicable in finding active neurons of the network and could serve in the analysis of the relationship between the input and output data.
6.3.1. Introduction of the forgetting procedure into the learning phase
The training is carried out according to the usual backpropagation algorithm until the error function [Eq. (11)] becomes small enough. Suppose M sets of input and training patterns are given; all of the output patterns can be made close enough to the training patterns by iteration through Eqs. (7) and (8). If convergence is attained, the neural network has the ability to classify the input patterns into M groups. Here we consider a procedure in which the absolute values of the weight matrices are lessened according to

W_ij = W_ij - sgn(W_ij) ζ {1 - D(W_ij)}        (28)

where D is a function that gives 1 at |W_ij| < ζ and 0 at |W_ij| > ζ, and ζ is set to about a tenth of ε as an initial value and is varied so as not to greatly change E [Eq. (11)], i.e. not to greatly change the decision made by the network. If this procedure is applied to the network, some of the information given by the training is partly erased. This corresponds to forgetting in memory and is termed 'erasing'. We call the training procedure for the M sets of data a 'training cycle'. If, after a training cycle is carried out, the erasing procedure is applied to the same network, the information accumulated in the training cycle is partly lost from the network. Remarkably, we discovered that these contradictory procedures do not affect all connections equally. Some connections are affected more strongly by the training cycle than by the erasing procedure, giving stronger connections, while others are affected more strongly by the erasing procedure, giving weak or null connections. Therefore, the information accumulated between neurons can be reconstructed without changing the contents of the information originally embodied in the network. This series of procedures was termed 'reconstruction-learning' [34]. Reconstruction-learning often reveals the role of each neuron and gives characteristic connections between the neurons; if one traces the connections between the input and output neurons, one can understand the role of the input parameters in the decision or the output intensity.
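A minimal sketch of the erasing step of Eq. (28) is given below, assuming that ζ is a small constant supplied by the caller; in the actual procedure ζ is started at about a tenth of ε and adjusted so that the error function E does not change appreciably, which is not reproduced here.

```python
import numpy as np

def erase(W, zeta):
    """One erasing pass, Eq. (28): W_ij <- W_ij - sgn(W_ij) * zeta * (1 - D(W_ij)),
    where D(W_ij) = 1 if |W_ij| < zeta and 0 otherwise, so weights already smaller
    than zeta are left untouched and all other weights shrink toward zero."""
    D = (np.abs(W) < zeta).astype(float)
    return W - np.sign(W) * zeta * (1.0 - D)

# example: repeated erasing passes let the weaker connections bottom out near zero
W = np.array([[1.35, -0.12], [0.31, 0.04]])
for _ in range(5):
    W = erase(W, zeta=0.05)
print(np.round(W, 2))
```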
6.3.2. How does reconstruction learning work?
As a simple example, we show the case in which the relationship y = x² is trained with three dummy neurons. Thus, the network structure is N(4,4,1) with a being 4. The data were as follows: the values 0, 1, 2, . . . , 10 were fed to neuron 1 of the first layer and, for each input value, the training datum was the corresponding square (0, 1, 4, 9, . . . , 100); one-digit random numbers were fed to neurons 2-4 of the first layer. Table 13A shows the results obtained by the usual backpropagation method, while Table 13B shows those obtained by the reconstruction learning method. In both tables, the numbers in the first line indicate the neuron numbers of the first layer and those in the first column, the neuron numbers of the second layer. The connections obtained by backpropagation learning involve all neurons between the first and the second layers. In the reconstruction learning method, on the other hand, most connections are null: the surviving connections are those between the two first neurons and those between neurons 2-4 of the first layer and the first neuron of the second layer, and the latter values are negligibly small. This demonstrates that, to simulate the y = x² function, the second layer needs only one neuron and the first layer essentially needs only one neuron. It is understandable why the small connections starting from neurons 2-4 of the first layer remain: the neural network expands the y = x² function using sigmoid functions, and since a single sigmoid function cannot completely adapt itself to the y = x² function, some compensation is necessary to express the function accurately.

Table 13
Backpropagation learning (A) and reconstruction learning (B)

(A)     1        2        3        4
1       1.357    -0.116   0.061    0.040
2       -0.511   -0.231   0.350    0.339
3       0.316    -0.114   -0.786   0.018
4       0.206    0.086    -0.271   0.018

(B)     1        2        3        4
1       1.217    0.002    0.003    0.009
2       0.000    0.000    0.000    0.000
3       0.000    0.000    0.000    0.000
4       0.000    0.000    0.000    0.000

Numbers in the top line are those of neurons in the first layer, while the numbers in the left column are those of neurons in the second layer.
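The whole reconstruction-learning cycle can be sketched for the y = x² toy problem. The following is an illustration under stated assumptions (a hand-rolled batch backpropagation, illustrative learning rate, epoch counts and erasing threshold), not the code used for Table 13; it simply alternates training cycles with the erasing pass of Eq. (28) and then inspects the first weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# inputs: x = 0..10 (scaled to 0..1) on neuron 1, one-digit random numbers on neurons 2-4
x_main = np.arange(11) / 10.0
X = np.column_stack([x_main, rng.integers(0, 10, size=(11, 3)) / 10.0])
t = (x_main ** 2)[:, None]                 # training data: the squares, scaled

def sigmoid(u, a=4.0):
    return 1.0 / (1.0 + np.exp(-a * u))

W1 = rng.normal(scale=0.3, size=(4, 4))    # first -> second layer
W2 = rng.normal(scale=0.3, size=(1, 4))    # second -> third layer

def train_cycle(W1, W2, epochs=200, eta=0.05, a=4.0):
    """Plain batch backpropagation on the 11 training patterns."""
    for _ in range(epochs):
        H = sigmoid(X @ W1.T, a)
        O = sigmoid(H @ W2.T, a)
        dO = (O - t) * a * O * (1 - O)
        dH = (dO @ W2) * a * H * (1 - H)
        W2 = W2 - eta * dO.T @ H
        W1 = W1 - eta * dH.T @ X
    return W1, W2

def erase(W, zeta=0.02):                   # erasing pass of Eq. (28), as sketched above
    D = np.abs(W) < zeta
    return W - np.sign(W) * zeta * (1.0 - D)

for _ in range(50):                        # reconstruction learning: train, then erase
    W1, W2 = train_cycle(W1, W2)
    W1, W2 = erase(W1), erase(W2)

print(np.round(W1, 3))   # connections from inputs 2-4 tend to end up small, as in Table 13B
```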
Table 14
Matrices by the backpropagation learning a

(A) Weight matrix between the first- and second-layer neurons b

        1        2        3        4        5        6        7        8        9        10       11       12       13       14
1(C1)   -0.022   -0.149   -0.188   -0.173   0.093    0.063    -0.009   0.112    -0.044   -0.214   -0.117   -0.112   0.092    -0.221
2(C2)   0.133    -0.406   -0.147   -0.141   0.421    0.265    -0.355   -0.483   0.333    -0.054   -0.099   0.230    -0.358   0.011
3(C3)   0.219    -0.095   0.047    -0.273   0.006    0.083    -0.143   -0.186   0.156    -0.035   -0.029   0.097    -0.182   0.128
4(C4)   0.203    0.449    0.240    -0.192   -0.095   -0.559   0.489    0.451    -0.339   -0.023   0.043    -0.350   0.445    -0.242
5(C5)   0.321    0.068    -0.040   0.205    -0.495   -0.261   0.367    0.324    -0.429   -0.149   0.318    -0.288   0.381    0.067
6(C6)   -0.043   -0.773   0.108    -0.097   -0.086   0.664    -0.760   -0.747   0.524    0.180    -0.251   0.358    -0.535   0.276
7(C7)   -0.233   -0.381   -0.013   0.120    -0.038   -0.716   0.905    0.999    -0.654   -0.235   0.583    -0.278   0.615    -0.404

(B) Weight matrix between the third- and second-layer neurons c

         1        2        3        4        5        6        7        8        9        10       11       12       13       14
1(exo)   0.751    0.221    0.624    0.149    -0.217   0.747    -0.913   -0.998   0.692    0.188    -0.437   0.431    -0.655   0.467
2(endo)  -0.732   -0.374   -0.387   0.038    -0.078   -0.876   0.970    0.956    -0.746   -0.237   0.453    -0.445   0.703    -0.219

a The sum of the squared errors (E), 0.011.
b Scale factor = 0.4176.
c Scale factor = 0.390.
Table 15
Weight matrices by the reconstruction learning a

(A) Weight matrix between the first- and second-layer neurons b

        1        2        3        4        5        6        7        8        9        10       11       12       13       14
1(C1)   0.000    0.000    0.000    0.000    0.000    -0.002   0.000    0.002    0.000    0.000    0.000    0.000    0.000    0.000
2(C2)   0.000    0.000    0.000    0.000    0.000    0.594    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
3(C3)   0.000    0.000    0.000    0.000    0.000    0.004    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
4(C4)   0.000    0.000    0.000    0.000    0.000    -0.515   0.000    0.006    0.000    0.000    0.000    0.000    0.000    0.000
5(C5)   0.000    0.000    0.000    0.000    0.000    -0.626   0.000    0.003    0.000    0.000    0.000    0.000    0.000    0.000
6(C6)   0.000    0.000    0.000    0.000    0.000    0.994    0.000    -0.122   0.000    0.000    0.000    0.000    0.000    0.000
7(C7)   0.000    0.000    0.000    0.000    0.000    -0.998   0.000    0.744    0.000    0.000    0.000    0.000    0.000    0.000

(B) Weight matrix between the third- and second-layer neurons c

         1        2        3        4        5        6        7        8        9        10       11       12       13       14
1(exo)   -0.001   -0.001   -0.001   -0.001   -0.001   0.998    0.001    -0.577   -0.001   -0.001   -0.001   -0.001   -0.001   -0.001
2(endo)  0.001    0.001    0.001    0.001    0.001    -0.998   0.001    0.577    0.001    0.001    0.001    0.001    0.001    0.001

a The sum of the squared errors (E), 0.153.
b Scale factor = 0.323.
c Scale factor = 0.168.
6.3.3. Practical application: the relationship between the 13C NMR chemical shift and the conformation of norbornane and norbornene
The reconstruction learning method was applied to the Kowalski problem. In accordance with the former studies, we used 25 (nos. 1-25) out of the 38 data as training data and the others for prediction. The network structure was set to N(7,14,2); to avoid complexity, we did not adopt a bias. The reconstruction learning, which consists of fifty training cycles and one erasing procedure, was repeated fifty times. Table 14 shows the values of the connections obtained by the usual backpropagation and Table 15 those obtained by the reconstruction learning, where the maximum value of the connection was scaled to 0.999 so that the two can be compared at the same level of magnitude (the scale factor is given in each table); the corresponding outputs are compared in Table 16. We have shown in a previous paper that the predictions by the neural network were better than those by the linear learning machine and cluster analysis; the results by the present method are the same. Without reconstruction, the information obtained through the learning phase is widely distributed among the neurons. With reconstruction, however, the connections are localized between particular neurons. It is also seen that the neurons other than 6 and 8 in the second layer have nothing to do with the resulting classification other than to optimize the θ values. Firing of neuron 1 in the last layer corresponds to the exo conformation and that of the other neuron to the endo conformation. Therefore, the conformation is determined by neurons 6 and 8 propagating their information to the two neurons in the third layer with connections of the types (a, -a) and (-b, b); the neurons in the third layer make the decision by combining them. Neurons 6 and 8 of the second layer have their strongest connections with neurons 6 and 7 of the first layer. Since the order of the neurons in the first layer corresponds to the numbering of the carbon atoms, it is understood that the information on the chemical shifts at C6 and C7 plays the major role in deciding the endo/exo conformations of the derivatives of norbornene. This is consistent with the chemical idea that the C6 and C7 carbon atoms are located near the substituent and that the effect of the substituent on C6 may be reversed on C7. Note here that neuron 8 in the second layer connects only with neurons 6 and 7 in the first layer, suggesting that the endo/exo information carried by neuron 6 of the second layer is corrected by neuron 8.
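Reading off the surviving connections from a reconstructed matrix can be done mechanically. The sketch below is an illustration, not the paper's code; the tolerance used to call a connection 'null' is an assumption. It rescales a weight matrix so that its largest absolute element becomes 0.999, as done for Tables 14 and 15, and returns the row/column pairs of the connections that remain.

```python
import numpy as np

def scale_and_trace(W, scale_to=0.999, tol=0.01):
    """Scale W so that max|w| = scale_to and list the surviving (row, column) pairs."""
    factor = scale_to / np.abs(W).max()
    Ws = W * factor
    rows, cols = np.nonzero(np.abs(Ws) > tol)
    return Ws, factor, [(int(r) + 1, int(c) + 1) for r, c in zip(rows, cols)]  # 1-based

# usage on a reconstructed matrix, here the third/second-layer matrix of Table 15(B)
W = np.array([[-0.001] * 5 + [0.998, 0.001, -0.577] + [-0.001] * 6,
              [ 0.001] * 5 + [-0.998, 0.001,  0.577] + [ 0.001] * 6])
_, factor, active = scale_and_trace(W)
print(active)    # only columns 6 and 8 survive for both output neurons
```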
6.4. Descriptor mapping [42]
Descriptors in QSAR/QSPR analysis are not always linearly related to the output intensity. In addition, descriptors are often mutually dependent. The neural network makes the analysis of such descriptors possible. Andrea and Kalayeh showed mutual dependencies among descriptors in the QSAR of dihydrofolate reductase inhibitors [43].
Table 16
Comparison of the results by the backpropagation learning and reconstruction learning methods

            Backpropagation           Reconstruction
Compound    Exo       Endo            Exo       Endo
1           0.913     0.084           0.899     0.101
2           0.989     0.009           0.973     0.027
3           0.994     0.005           0.978     0.022
4           0.844     0.155           0.855     0.145
5           0.955     0.049           0.939     0.061
6           0.961     0.039           0.955     0.045
7           0.948     0.051           0.931     0.069
8           0.964     0.034           0.947     0.053
9           0.926     0.073           0.914     0.086
10          0.968     0.030           0.935     0.065
11          0.970     0.028           0.959     0.041
12          0.974     0.026           0.962     0.038
13          0.841     0.161           0.759     0.241
14          0.014     0.985           0.050     0.950
15          0.015     0.900           0.051     0.949
16          0.088     0.979           0.136     0.864
17          0.020     0.979           0.063     0.937
18          0.017     0.984           0.049     0.951
19          0.025     0.975           0.072     0.928
20          0.014     0.986           0.046     0.954
21          0.220     0.788           0.226     0.774
22          0.015     0.984           0.050     0.950
23          0.067     0.926           0.115     0.885
24          0.021     0.979           0.059     0.941
25          0.019     0.981           0.042     0.958
26          0.703     0.293           0.707     0.293
27          0.079     0.924           0.115     0.885
28          0.906     0.091           0.904     0.096
29          0.025     0.974           0.070     0.930
30          0.981     0.018           0.966     0.034
31          0.708 a   0.293           0.629     0.371
32          0.953     0.045           0.941     0.059
33          0.009     0.990           0.041     0.959
34          0.974     0.029           0.958     0.042
35          0.063     0.941           0.100     0.900
36          0.008     0.992           0.038     0.962
37          0.825     0.177           0.738     0.262
38          0.103     0.898           0.093     0.907

a Error. Data 1-25 were used for training and those of 26-38 for prediction.
In the third layer they used one neuron with a sigmoid function (this is not, therefore, the MR-type network described here). Although they did not show the rationale of interpolation in their network, the results are suggestive. We therefore discuss the rationale of the interpolation and extend their method to three-dimensional analyses using the MR-type network.
6.4.1. Method
To make the method easy to understand, let us consider the case with two variables, i.e. the intensity I is a function of the variables r and s. Using the minimum values of r and s (r_0 and s_0) given by the training data, the differences in the intensities, ΔI(r) and ΔI(s), are obtained as

ΔI(r) = I(r, s_0 + Δs) - I(r_0, s_0)
ΔI(s) = I(r_0 + Δr, s) - I(r_0, s_0)        (29)
where Δr and Δs are scanned in the regions given by the training data. By displaying ΔI(r) and ΔI(s), the mutual relationship between the variables can be found. If there are more than two variables, one variable is scanned as the descriptor and the others are treated all together as the background intensity. To show the usefulness and reliability of the descriptor mapping method, we examined the reproducibility of individual mathematical functions from a mixed function. These include simple linear combinations of linear functions and/or a nonlinear function. Since the number of samples in QSAR analysis is generally rather small, we used 30 sets of independent random numbers for x, y (and z). These numbers ranged from 0.0 to 9.9. The threshold in the backpropagation learning was less than 10⁻⁵ in terms of the scaled unit. The a and b values were set at 4.0 and 1.0, respectively.
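A sketch of the scanning in Eq. (29) is given below. It assumes only that the trained network is available as a callable predict(r, s) returning the intensity and that the descriptor and background ranges are taken from the training data; function and parameter names are illustrative.

```python
import numpy as np

def descriptor_map(predict, r_range, s_range, sections=11):
    """Map for descriptor r, Eq. (29): DI(r) = I(r, s0 + ds) - I(r0, s0).
    r is scanned over its training range (the descriptor intensity, 0-100 %),
    while ds steps the background descriptor s up from its minimum s0."""
    (r0, r1), (s0, s1) = r_range, s_range
    r_grid = np.linspace(r0, r1, sections)
    ds_grid = np.linspace(0.0, s1 - s0, sections)
    base = predict(r0, s0)
    return np.array([[predict(r, s0 + ds) - base for ds in ds_grid] for r in r_grid])

# usage with any trained model wrapped as predict(r, s) -> intensity, e.g.
# surface = descriptor_map(lambda r, s: float(net.predict([[r, s]])), (0.0, 9.9), (0.0, 9.9))
```

With more than two variables, the remaining descriptors are stepped together as the single background intensity, exactly as described in the text.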
6.4.2. Examination using mathematical functions
w = x + 2y + 3z. First, using an ideal linear relationship, we examined whether or not the neural network could reproduce the fact that each variable is exactly linear and independent of the other variables. The network structure was N(4,10,1); one neuron in the first layer was used as the bias neuron. Taking variable x, for example, the range of actual input values was sectioned into 11 parts, from 0 to 100%, while the ranges of the other variables were sectioned in the same way. These are designated as the descriptor intensity for x and the background intensity, respectively. As the background intensity, the lowest values for both x and y were at 0%, while the highest values were 100%.
Fig. 14. Analysis of correlation between variables in w = x + 2y + 3z.
Fig. 14 shows a three-dimensional mapping for each variable. It is clear that, as each descriptor increases, the intensity w increases linearly in accord with its coefficient. Such linearity is independent of the values of the other variables, resulting in flat planes with constant gradients. The deviations are small enough to be ignored in practical applications. These results are for ideal linear relationships of the variables with the intensity (w); if a nonlinear correlation is included, the plane would be deformed.
w = x - y + z². This function includes a nonlinear part (z²). It may be necessary to show whether the neural network properly reproduces such a nonlinear function. The network structure and parameters were the same as those in the former case. The number of sets of random numbers was 30, plus two additional sets at x = y = z = 0 and x = y = z = 10. The lowest value for each background intensity was set at 0% and the highest at 100%. Fig. 15 shows the obtained results. The three functions are reasonably separated. However, a small distortion can be observed in the plane for descriptor x. This may be because the z value is large; such distortion may be negligible in practical applications. We investigated this distortion and found that it arose because the number of training data was small: it was greatly improved by increasing the sample number to 60. Consequently, the neural network extracts the characteristics of each function in a mixed function independently of the other variables.
z = x + 10 exp[-(y - 5)²/4]. This function has a maximum at y = 5 (50%). The network structure was N(3,10,1) and the sample number was 32.
Fig. 15. Analysis of correlation between variables in w = x - y + z².
Fig. 16. Analysis of correlation between the x and y variables in z = x + 10 exp[-(y - 5)²/4].
However, to reproduce smooth surfaces, the ranges between the maximum and minimum values of x and y were sectioned into 21 parts; therefore, 441 (= 21 × 21) points were predicted. Fig. 16 shows the results. Each function is beautifully reproduced without distortion.
Fig. 17. Descriptor mapping of structural parameters of carboquinones. Here, 'bg' indicates the background intensity.
It is rather surprising that only 30 sets of data were good enough to reproduce such a complicated function (Fig. 17).
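An end-to-end version of the w = x + 2y + 3z check can be put together with any off-the-shelf backpropagation regressor. The sketch below uses scikit-learn's MLPRegressor as a stand-in for the N(4,10,1) network used above (the hidden-layer size, solver settings and the common scanning of y and z as a single background are assumptions); it trains on 30 random sets and builds the descriptor map for x.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 9.9, size=(30, 3))                 # 30 random (x, y, z) sets
w = X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2]                 # w = x + 2y + 3z

net = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                   solver="lbfgs", max_iter=20000, random_state=0).fit(X, w)

# descriptor map for x: scan x over 11 sections while the background (y and z
# stepped together) goes from 0 % to 100 % of its range
x_grid = np.linspace(0.0, 9.9, 11)
bg_grid = np.linspace(0.0, 9.9, 11)
base = net.predict([[0.0, 0.0, 0.0]])[0]
surface = np.array([[net.predict([[x, bg, bg]])[0] - base for bg in bg_grid]
                    for x in x_grid])
print(surface.round(1))    # should grow roughly linearly with x at every background level
```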
6.4.3. Application to SAR analysis
As we have seen, the descriptor mapping method seems to be useful for analyzing the characteristics of structural parameters in QSAR/QSPR analysis. Here, we show the results of applying this method to an actual QSAR analysis, using the data set that contained the least discrepancy. The example here concerns carboquinones (anticarcinogenic agents).
Table 17
Input data for carboquinones a

No.  R1, R2                            MR1,2   π1,2    π2      MR1    F       R       Activity b
1    C6H5, C6H5                        5.08    3.92    1.96    2.54   0.16    -0.16   4.33
2    CH3, (CH2)3C6H5                   4.5     3.66    3.16    0.57   -0.08   -0.26   4.47
3    C5H11, C5H11                      4.86    5       2.5     2.43   -0.08   -0.26   4.63
4    CH(CH3)2, CH(CH3)2                3       2.6     1.3     1.5    -0.08   -0.26   4.77
5    CH3, CH2C6H5                      3.57    2.51    2.01    0.57   -0.12   -0.14   4.85
6    C3H7, C3H7                        3       3       1.5     1.5    -0.08   -0.26   4.92
7    CH3, CH2OC6H5                     3.79    2.16    1.66    0.57   -0.04   -0.13   5.15
8    R1=R2=CH2CH2OCON(CH3)2            6.14    0.72    0.36    3.07   -0.08   -0.26   5.16
9    C2H5, C2H5                        2.06    2       1       1.03   -0.08   -0.26   5.46
10   CH3, CH2CH2OCH3                   2.28    1.03    0.53    0.57   -0.08   -0.26   5.57
11   OCH3, OCH3                        1.58    -0.04   -0.02   0.79   0.52    -1.02   5.59
12   CH3, CH(CH3)2                     2.07    1.8     1.3     0.57   -0.08   -0.26   5.6
13   C3H7, CH(OCH3)CH2OCONH2           4.24    0.98    -0.52   1.5    -0.04   -0.13   5.63
14   CH3, CH3                          1.14    1       0.5     0.57   -0.08   -0.26   5.66
15   H, CH(CH3)2                       1.6     1.3     1.3     0.1    -0.04   -0.13   5.68
16   CH3, CH(OCH3)C2H5                 2.75    1.53    1.03    0.57   -0.04   -0.13   5.68
17   C3H7, CH2CH2OCONH2                3.56    1.45    -0.05   1.5    -0.08   -0.26   5.68
18   R1=R2=CH2CH2OCH3                  3.42    1.03    0.53    1.71   -0.08   -0.26   5.69
19   C2H5, CH(OC2H5)CH2OCONH2          4.23    0.98    -0.02   1.03   -0.04   -0.13   5.76
20   CH3, CH2CH2OCOCH3                 2.78    1.23    0.73    0.57   -0.08   -0.26   5.78
21   CH3, (CH2)3-dimer                 1.96    2       1.5     0.57   -0.08   -0.26   5.82
22   CH3, C2H5                         1.6     1.5     1       0.57   -0.08   -0.26   5.86
23   CH3, CH(OCH2CH2OCH3)CH2OCONH2     4.45    0.01    -0.49   0.57   -0.04   -0.13   6.03
24   CH3, CH2CH(CH3)OCONH2             3.09    0.75    0.25    0.57   -0.08   -0.26   6.14
25   C2H5, CH(OCH3)CH2OCONH2           3.77    0.48    -0.52   1.03   -0.04   -0.13   6.16
26   CH3, CH(C2H5)CH2OCONH2            3.55    1.25    0.75    0.57   -0.08   -0.26   6.18
27   CH3, CH(OC2H5)CH2OCONH2           3.77    0.48    -0.02   0.57   -0.04   -0.13   6.18
28   CH3, (CH2)3OCONH2                 3.09    0.95    0.45    0.57   -0.08   -0.26   6.18
29   CH3, (CH2)2OCONH2                 2.63    0.45    -0.05   0.57   -0.08   -0.26   6.21
30   C2H5, (CH2)2OCONH2                3.09    0.95    -0.05   1.03   -0.08   -0.26   6.25
31   CH3, CH2CH2OH                     1.78    0.34    -0.16   0.57   -0.08   -0.26   6.39
32   CH3, CH(CH3)CH2OCONH2             3.09    0.75    0.25    0.57   -0.08   -0.26   6.41
33   CH3, CH(OCH3)CH2OCONH2            3.31    -0.02   -0.52   0.57   -0.04   -0.13   6.41
34   H, N(CH2)2                        1.66    0.18    0.18    0.1    0.1     -0.92   6.45
35   R1=R2=CH2CH2OH                    2.42    -0.32   -0.16   1.21   -0.08   -0.26   6.54
36   CH3, N(CH2)2                      2.13    0.68    0.18    0.57   0.06    -1.05   6.77
37   CH3, CH(CH3)CH2OH                 2.47    -0.13   -0.63   0.57   -0.04   -0.13   6.9

a The data were taken from the literature [44].
b Chronic injection, log(1/C).
Carboquinones were synthesized by Nakao et al. and other groups and were developed into an anticarcinogenic drug for clinical use. A detailed QSAR study based on the Hansch method was carried out by Yoshimoto et al. We have already used those data to compare the results of the neural network with those of conventional QSAR techniques [44]. We used the same structural parameters and the same compound numbers as in the literature. Table 17 shows the input data. The input data, i.e. the physicochemical parameters, are the molecular refractivity constants (MR), the hydrophobicity constant (π) and the substituent constants (F and R), as well as MR1,2 and π1,2. As biological data, we used the minimum effective dose (MED) on a chronic treatment schedule only; MED is the dose giving a 40% increase in lifespan compared to the controls. The input data were scaled to values between 0.1 and 0.9 and fed to the network together with a constant 1 for the bias. The network structure was N(7,6,1), while the network parameters a and b were 4.0 and 1.0, respectively. The iterative backpropagation learning was repeated until the sum of the errors became less than 0.003 in terms of scaled units. Fig. 17 shows the results. Here, for example, a background intensity of 10% indicates that all of the other physicochemical parameters take 10% of their maximum magnitudes. At first glance, the descriptors behave nonlinearly and irregularly. Descriptor 1 (= parameter MR1,2) in Fig. 17 has the biggest contribution to the intensity (around 60-70%) when the background intensity is around 20-30%, while descriptor 2 (= parameter π1,2) has a negative contribution to the intensity whose maximum strength appears at 60% intensity of the background. In this way one can analyze the characteristics of each descriptor. This article is not intended to give a concrete analysis of the present data but to present a method by which the structural parameters can be analyzed; therefore, we will not go into further detail. In practice, the analysis may be carried out on a computer display, and three descriptors may be analyzed simultaneously, for example MR1,2, π1,2 and the background intensity. The neural network method takes some time in the learning phase. However, prediction by the trained network is rapid; even 1000 points can be handled in a second on a moderately high-speed
personal computer. This enables one to rotate the three-dimensional graph to find the optimal point in relation to other descriptors such as the background intensity.
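The preprocessing described above is straightforward to reproduce. The sketch below is an illustration (the function name is hypothetical and the exact scaling used in the original work may differ in detail); it maps each physicochemical parameter linearly onto [0.1, 0.9] and appends the constant bias input of 1.

```python
import numpy as np

def scale_descriptors(X, lo=0.1, hi=0.9):
    """Column-wise linear scaling of the input parameters into [lo, hi],
    plus a constant-1 column for the bias neuron."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    Xs = lo + (hi - lo) * (X - xmin) / (xmax - xmin)
    return np.column_stack([Xs, np.ones(len(Xs))])

# e.g. X holds the six columns MR1,2, pi1,2, pi2, MR1, F and R for the 37 carboquinones,
# giving the seven network inputs of N(7,6,1) once the bias column is appended.
```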
7. Concluding remarks
Recently, the number of applications of ANNs in the pharmaceutical sciences has been increasing. Since optimization and prediction problems frequently appear in this field, the hierarchy-type ANN is the main target of application. Most articles that deal with application simply compare the ANN with conventional methods for prediction, fitting, etc.; the operation itself is not dealt with. This article has reviewed ANN articles from the basic viewpoint of such operating characteristics. ANNs have outstanding abilities in both classification and fitting. The operation is basically carried out in a nonlinear manner. The nonlinearity has merits as well as a small number of demerits; the reasons for the demerits were analyzed and their remedies shown. The operation of the neural network can be fully expressed mathematically. The mathematical relationships between the ANN's operation and the ALS method, as well as the multiregression analysis, were reviewed. An ANN can be regarded as a function that transforms an input vector into another (output) vector. We examined the analytical formula for the partial derivative of this function with respect to the elements of the input vector; this is a powerful means to determine the relationship between the input and the output, i.e. to find the causes of the results. The reconstruction-learning method determines the minimum number of necessary neurons in the network and is useful for finding the necessary descriptors or for tracing the flow of information from the input to the output. Finally, the descriptor-mapping method was reviewed; it is a useful method for finding the nonlinear relationships between descriptors, or between the output intensity and the descriptors.
Acknowledgements The author thanks The Ministry of Education, Culture, Sports, Science and Technology of Japan for financial support.
References
[1] E.R. Kandel, J.H. Schwarz, Principles of Neural Science, Elsevier, North-Holland, New York, 1982.
[2] D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing, Explorations in the Microstructure of Cognition, Vols. 1 and 2, MIT Press, Cambridge, MA, 1986.
[3] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sci. USA 79 (1982) 2554–2558.
[4] J.J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proc. Natl. Acad. Sci. USA 81 (1984) 3088–3092.
[5] D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Sci. 9 (1985).
[6] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, 2nd Edition, Wiley–VCH, Weinheim, 1999.
[7] T. Kohonen, Analysis of a simple self-organizing process, Biol. Cybern. 43 (1982) 59–62.
[8] T. Kohonen, Self-Organization and Associative Memory, 3rd Edition, Springer, Berlin, 1989.
[9] S. Anzali, G. Garnickel, M. Krug, J. Sadowski, M. Wagener, J. Gasteiger, Evaluation of molecular surface properties using a Kohonen neural network, in: J. Devillers (Ed.), Neural Networks in QSAR and Drug Design, Academic Press, London, 1996.
[10] J.P. Doucet, A. Panaye, 3D structural information: from property prediction to substructure recognition with neural networks, SAR QSAR Environ. Res. 8 (1998) 249–272.
[11] T. Aoyama, Y. Suzuki, H. Ichikawa, Neural networks applied to pharmaceutical problems. I. Method and application to decision making, Chem. Pharm. Bull. 37 (1989) 2558–2560.
[12] T. Aoyama, Y. Suzuki, H. Ichikawa, Neural networks applied to structure–activity relationships, J. Med. Chem. 33 (1990) 905–908.
[13] J. Devillers (Ed.), Neural Networks in QSAR and Drug Design, Academic Press, London, 1996.
[14] D.A. Winkler, D.J. Madellena, QSAR and neural networks in life sciences, Ser. Math. Biol. Med. 5 (1994) 126–163.
[15] D. Manallack, D.J. Livingstone, Neural networks and expert systems in molecular design, Methods Princ. Med. Chem. 3 (1995) 293–318.
[16] S. Anzali, J. Gasteiger, U. Holzgrabe, J. Polanski, J. Sadowski, A. Techentrup, M. Markus, The use of self-organizing neural networks in drug design, 3D QSAR Drug Design 2 (1998) 273–299.
[17] D.J. Maddalena, Applications of soft computing in drug design, Exp. Opin. Ther. Pat. 8 (1998) 249–258.
[18] T. Savid, D.J. Livingstone, Neural networks in drug discovery: have they lived up to their promise?, Eur. J. Med. Chem. 34 (1999) 195–208.
[19] M.E. Brier, G.R. Aronoff, Application of artificial neural networks to clinical pharmacology, Int. J. Clin. Pharmacol. Ther. 34 (1996) 510–514.
[20] J. Rui, S. Ling, Neural networks model and its application in clinical pharmacology, Zhongguo Linchuang Yaolixue Zazhi 13 (1997) 170–176 (in Chinese).
[21] E. Tafeit, G. Reibnegger, Artificial neural networks in laboratory medicine and medical outcome prediction, Clin. Chem. Lab. Med. 37 (1999) 845–853.
[22] S. Agatonovic-Kustrin, R. Beresford, Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research, J. Pharm. Biomed. Anal. 22 (2000) 717–727.
[23] S. Nagl, Neural network models of protein domain evolution, Hyle 6 (2000) 143–159.
[24] C. Ochoa, A. Chana, Applications of neural networks in the medicinal chemistry field, Curr. Med. Chem.: Central Nervous System Agents 1 (2001) 247–256.
[25] Y. Cai, J. Gong, Z. Cheng, N. Chen, Artificial neural network method for quality estimation of traditional Chinese medicine, Zhongcaoyao 25 (1994) 187–189 (in Chinese).
[26] Y.J. Qiao, X. Wang, K.S. Bi, X. Luo, Application of artificial neural networks to the feature extraction in chemical pattern recognition of the traditional Chinese medicine, Venenum bufonis, Yaoxue Xuebao 30 (1995) 698–701 (in Chinese).
[27] L. Geng, A. Luo, R. Fu, J. Li, Identification of Chinese herbal medicine using artificial neural network in pyrolysis–gas chromatography, Fenxi Huaxue 28 (2000) 549–553 (in Chinese).
[28] J. Bourquin, H. Schmidli, P. van Hoogevest, H. Leuenberger, Basic concepts of artificial neural networks (ANN) modeling in the application to pharmaceutical development, Pharm. Dev. Technol. 2 (1997) 95–109.
[29] K. Takayama, M. Fujikawa, T. Nagai, Artificial neural network as a novel method to optimize pharmaceutical formulations, Pharm. Res. 16 (1999) 1–6.
[30] R.C. Rowe, R.J. Roberts, Artificial intelligence in pharmaceutical product formulation: neural computing and emerging technologies, Pharm. Sci. Technol. Today 1 (1998) 200–205.
[31] T. Takagi, Pharmacometrics. New region in pharmaceutical science, Farumashia 37 (2001) 695–699.
[32] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943) 115–133.
[33] D.O. Hebb, The Organization of Behavior, Wiley, New York, 1949.
[34] T. Aoyama, H. Ichikawa, Reconstruction of weight matrices in neural networks. A method of correlating output with inputs, Chem. Pharm. Bull. 39 (1991) 1222–1228.
[35] T. Aoyama, Y. Suzuki, H. Ichikawa, Neural networks as applied to quantitative structure–activity relationship analysis, J. Med. Chem. 33 (1990) 2583–2590.
[36] M. Minsky, S. Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, Cambridge, MA, 1969.
[37] T. Aoyama, H. Ichikawa, Basic operating characteristics of neural networks when applied to structure–activity studies, Chem. Pharm. Bull. 39 (1991) 358–366.
[38] I. Moriguchi, K. Komatsu, Adaptive least-squares classification applied to structure–activity correlation of antitumor mitomycin derivatives, Chem. Pharm. Bull. 25 (1977) 2800–2802.
[39] T. Aoyama, H. Ichikawa, Obtaining the correlation indices between drug activity and structural parameters using a neural network, Chem. Pharm. Bull. 39 (1991) 372–378.
[40] T. Aoyama, H. Ichikawa, Neural networks as nonlinear structure–activity relationship analyzers. Useful functions of the partial derivative method in multilayer neural networks, J. Chem. Inf. Comput. Sci. 32 (1992) 492–500.
[41] B.R. Kowalski, Chemometrics: Theory and Applications, in: ACS Symposium Series, Vol. 53, American Chemical Society, Washington, DC, 1977, p. 43.
[42] H. Ichikawa, A. Aoyama, How to see characteristics of structural parameters in QSAR analysis: descriptor mapping using neural networks, SAR QSAR Environ. Res. 1 (1993) 115–130.
[43] T.A. Andrea, H. Kalayeh, Applications of neural networks in quantitative structure–activity relationships of dihydrofolate reductase inhibitors, J. Med. Chem. 34 (1991) 2824–2836.
[44] M. Yoshimoto, H. Miyazawa, H. Nakao, K. Shinkai, M. Arakawa, Quantitative structure–activity relationships in 2,5-bis(1-aziridinyl)-p-benzoquinone derivatives against leukemia L-1210, J. Med. Chem. 22 (1979) 491–496.