Applied Catalysis A: General 324 (2007) 90–93 www.elsevier.com/locate/apcata

Reply to Letter to the Editor

Predictive performance of "highly complex" artificial neural networks

András Tompos, József L. Margitfalvi *, Ernő Tfirst, Károly Héberger

Institute of Surface Chemistry and Catalysis, Chemical Research Center, Hungarian Academy of Sciences, POB 17, 1525 Budapest, Hungary

Received 27 February 2007; accepted 28 February 2007; available online 12 March 2007

Abstract

The effectiveness and the "indeterminacy" of artificial neural networks (ANNs) are discussed. In determining the parameters of different models, a clear distinction should be made between two approaches: (i) parameter estimation (the classical approach) and (ii) the ANN approach, the aim of which is prediction (mainly interpolation). In the classical approach different regression procedures are used, while in the latter the so-called "training" with the back propagation algorithm is applied. The latter approach is widely employed in combinatorial materials science for "information mining" purposes. Parameters obtained in the classical approach should have a definite physical meaning. It also means that in the absence of a presumed hypothetical model the classical regression procedures cannot be applied. The advantage of using ANNs is that they do not require a causal model of the system investigated. However, parameters obtained by ANNs have no physical meaning. Generally, in ANNs the number of parameters to be determined is significantly higher than the number of data in the training set. It has to be emphasized that in the ANN approach good predictive ability is of superior importance to the physical meaning of the parameters.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Catalyst library design; Combinatorial catalysis; Holographic research strategy; Genetic algorithm; Visualization; Chemometrics; Artificial neural networks; Overfit

1. Introduction

In a previous paper, Sha [1] claimed "mathematical indeterminacy" (overfit) of artificial neural networks in general, and of our ANN models in particular. He stated that highly complicated ANN models were used as compared with the relatively few data applied in the training, and therefore "the networks are not logical and determined". Although Sha correctly enumerates some of the features of neural networks, we are afraid he is not fully aware of the advantages and disadvantages of the ANN approach. To achieve good predictive performance, complex networks are generally required, having more connection weights than the number of data available for training. Contrary to Sha's opinion [1], these networks are logical and determined (see later). Some degree of overfit, as well as the complexity of an appropriate neural architecture, is inherent in the method. It is a bit peculiar to damn an approach because of its main features. The author missed the main point of our papers [2–4]: it was not related to the determination of parameters for a mathematical model, but to finding a set of parameters which ensures good generalization ability of ANNs. It has to be emphasized that these two tasks are completely different from each other and their solutions require different approaches.


2. Difference of ANN and regression approaches

A theoretical model is generally based on the former knowledge of the researcher. The parameters of that model have physical meaning, and the goal is to estimate their values statistically. In this case traditional regression procedures have to be used (based on the maximum likelihood or least squares principles). Regression procedures obviously require more observations than the number of parameters to be determined, as the author of Ref. [1] states correctly. The fitted model will provide acceptable outputs within the domain of the input variables only if the theory was correct. However, there is no guarantee for the validity of the model.
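A minimal, purely illustrative sketch of this classical approach is given below. The model form, the data and the parameter values are hypothetical, not taken from any of the catalytic systems discussed here; the point is only that physically meaningful parameters of a presumed model are estimated by ordinary least squares from more observations than parameters.

```python
# Classical approach: parameters of a presumed model are estimated by
# ordinary least squares and retain their physical meaning.
# Hypothetical model: y = a*x1 + b*x2 + c (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.uniform(0.0, 1.0, size=30)
x2 = rng.uniform(0.0, 1.0, size=30)
y = 2.0 * x1 - 0.5 * x2 + 1.0 + rng.normal(0.0, 0.05, size=30)

# Design matrix: 30 observations, 3 parameters (observations exceed parameters).
X = np.column_stack([x1, x2, np.ones_like(x1)])
params, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated a, b, c:", params)   # close to 2.0, -0.5, 1.0
```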


In contrast to the traditional approach, ANNs with the back propagation algorithm are applied to achieve good predictive ability. This goal can be accomplished by using a sufficient number of data in the training set, and the generalization ability should be checked on a validation set. The training is stopped when the error between the measured and predicted outputs in the validation set starts to increase. The emphasis is laid on the use of a validation set, the application of which is crucial, since it ensures good generalization ability of the ANNs. The set of obtained connection weights is of subordinate importance here. The weights have no physical meaning and their values are never used later on to draw any conclusions. Actually, our goal is to achieve good generalization ability of the trained ANNs; for this reason we formulate our claims with respect to prediction and not to the values of the ANN weights.

The author of Ref. [1] correctly admits that ". . . the neural networks can still be trained"; i.e. using the same initial weights, the same final weights are obtained even if the number of parameters exceeds the number of data. In this sense the approach is reproducible and mathematically determined, and the networks are logical. Contrary to regression procedures, redundancy is irrelevant in the course of training. Actually, starting from different initial connection weights, an arbitrary number of similarly good models, i.e. several sets of final connection weights, can be obtained, leading to similar generalization ability. Although ANN models can easily be overfitted to the training set, this does not necessarily exclude good generalization ability, i.e. interpolation in the investigated domain of the experimental variables.

A disturbing feature of the back propagation learning procedure is that the error surface may contain local minima, so that the global minimum may not be found [5]. However, this behavior is not necessarily undesirable. If a quasi-endless number of local minima are located around the global minimum, any of them is suitable for a good approximation. In many practical cases it is inexpedient to search for the global minimum.

A significantly larger number of weights can be used in a neural network than the number of training cases, i.e. a low number of training data does not limit the complexity of the neural network. An excellent example of this phenomenon is the classical eXclusive-OR (XOR) problem. Fig. 1 shows the truth table of the logical XOR function. It is trivial that a single-layered network is unable to solve this problem and a hidden layer with 1–3 hidden nodes has to be added (see Fig. 1B) [6–8]. Hence, in this case there are only four realizations of the inputs, while at least five adjustable weights are required to obtain acceptable accuracy of the resulting outputs. It is therefore not surprising that the approximation of n experimental data points may require a network with more than n adjustable weights.

Upon building up an ANN model, only the first (initialization) step is a random process; the further training steps are deterministic. This is why we trained the 19 different network architectures 1000 times [2–4]. As each training process is initialized with random connection weights, the procedure leads to 19,000 different networks, which practically rules out the random influence of the weights. However, for further application only the 100 networks having the best generalization ability are kept.
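A minimal sketch of the XOR example illustrates both points above: the network below has nine adjustable weights (including biases) for only four training patterns, yet, once the initial weights are fixed, training by back propagation is fully deterministic. The 2-2-1 architecture, learning rate and number of iterations are illustrative choices, not those used in Refs. [2–4].

```python
# XOR with a 2-2-1 sigmoid network trained by back propagation (full batch).
# Four training patterns, nine adjustable weights (more weights than data).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR outputs

rng = np.random.default_rng(1)            # fixed seed: training is reproducible ("determined")
W1 = rng.normal(0, 1, (2, 2)); b1 = np.zeros(2)
W2 = rng.normal(0, 1, (2, 1)); b2 = np.zeros(1)

lr = 0.5
for epoch in range(20000):
    h = sigmoid(X @ W1 + b1)               # hidden layer
    out = sigmoid(h @ W2 + b2)             # output layer
    err = out - y                          # derivative of squared error w.r.t. output
    d_out = err * out * (1 - out)          # delta at output
    d_h = (d_out @ W2.T) * h * (1 - h)     # delta at hidden layer
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

# Ideally close to [0, 1, 1, 0]; some initializations end in a local minimum,
# echoing the local-minima behavior discussed in the text.
print(np.round(out.ravel(), 3))
```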


Fig. 1. Truth table of the logical XOR function and its neural network implementations (A and B).

The approach proposed by Cundari et al. [9] makes an automatic selection of network architectures possible. Actually, a so-called optimal linear combination of different networks is used instead of a single architecture. The combination coefficients for the best 100 networks are computed using the ordinary least squares algorithm, and the statistically insignificant architectures (at a 95% level of significance) are discarded.
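A minimal sketch of this combination step is given below. The trained networks are replaced by stand-in prediction vectors, and the significance test used to discard architectures is omitted; only the least-squares computation of the combination coefficients on a validation set is shown.

```python
# Optimal linear combination of several trained predictors, in the spirit of
# Cundari et al. [9]; the "networks" here are synthetic prediction vectors.
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical validation targets and predictions of three already-trained networks.
y_val = rng.uniform(0.0, 1.0, size=30)
preds = np.column_stack([
    y_val + rng.normal(0, 0.05, 30),   # network 1: accurate
    y_val + rng.normal(0, 0.10, 30),   # network 2: noisier
    rng.uniform(0, 1, 30),             # network 3: uninformative
])

# Ordinary least squares for the combination coefficients.
coef, *_ = np.linalg.lstsq(preds, y_val, rcond=None)
print("combination coefficients:", np.round(coef, 3))

# The combined predictor is the weighted sum of the individual networks.
y_combined = preds @ coef
print("validation MSE:", np.mean((y_combined - y_val) ** 2))
```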


In order to evaluate the predictive ability of ANNs of different complexity, simulated experiments were performed in one of our previous papers using a multidimensional polynomial [10]. The multidimensional polynomial determined the input–output relationship unambiguously. The problem was definitely not linear, as the polynomial contained products of second- and third-order terms of different variables. We were interested in how the values of this polynomial could be estimated by neural networks. For the training, validation and test sets, 140, 30 and 30 data points, respectively, were selected from the simulated experimental space. All the networks investigated gave acceptable predictions. Nevertheless, it was shown that as the number of connection weights increased, the mean square error for the test set decreased. In the most complex network the number of connection weights was 451. Actually, this is similar to the case raised in Ref. [1]: the number of connection weights is significantly larger than the number of training data. Nevertheless, this network had the best generalization ability.

In Ref. [1] the use of complex neural networks when clear trends are obvious has justly been criticized. Generally, the simplest possible model should be used to describe a phenomenon. As mentioned briefly in the previous paragraph, even the simplest network can give an acceptable prediction of a multidimensional polynomial. It has to be admitted that our methodology, adopted from Cundari et al. [9], does not strictly meet the "principle of parsimony". The algorithm selects the neural architectures automatically. The linear combination of the selected architectures is supposed to give the best prediction, although simpler architectures may perform similarly well. Nevertheless, the use of our methodology has several technical advantages: (i) no deep knowledge about the system is required, only experimental data have to be known; (ii) the most complex architectures, present amongst the nineteen applied ones, make it very probable that highly complex functions can also be described; (iii) a straightforward and well-verified algorithm is available for the determination of the connection weights (the back propagation algorithm); (iv) the learning process takes only a few hours on a computer, finally resulting in ANNs with good predictive ability.

It has to be emphasized again that the parameters of a model obtained by traditional parameter fitting have physical meaning, while those obtained using ANNs (the back propagation algorithm) have not. Instead, we have succeeded in obtaining good correlation between the measured and predicted values [2–4]. In our case good generalization ability has been superior to the physical meaning of the parameters.

3. Effectiveness of the combined application of optimization algorithms and ANNs

The author of Ref. [1] probably misunderstands the role of the holographic research strategy (HRS) and the genetic algorithm (GA). It has to be pointed out that they are not neural network training techniques, but optimization algorithms. They are used in combination with neural networks in virtual optimization. In fact, neural networks cannot be used alone for optimization: they do not propose new samples to be tested according to an optimization criterion. Therefore, when virtual optimization is performed, ANNs are usually combined with different optimization algorithms such as HRS [2–4,10] and GA [11,12]. In such a combination the optimization algorithm generates virtual compositions that are tested by the ANNs. In virtual experiments, repeated application of virtual preparation and testing steps leads to virtual hits (promising catalysts). In conclusion, HRS and GA are not neural network training techniques; the optimization relates to the optimization of the catalytic composition and not to the optimization of the connection weights in the neural networks.
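A minimal sketch of such a virtual optimization loop is shown below. The optimization algorithm is simplified to a random search rather than HRS or GA, and the trained ANN is replaced by a stand-in scoring function; the composition space and all numerical values are illustrative assumptions only, not the implementations of Refs. [2–4,10–12].

```python
# Virtual optimization loop: an optimizer proposes candidate compositions,
# a predictive model scores them, and the most promising "virtual hits" are kept.
import numpy as np

rng = np.random.default_rng(3)

def surrogate_activity(compositions):
    """Stand-in for a trained ANN predicting activity from composition."""
    target = np.array([0.2, 0.5, 0.3])            # hypothetical optimum
    return -np.sum((compositions - target) ** 2, axis=1)

def propose_compositions(n):
    """Simplified 'virtual preparation': three components summing to 1."""
    x = rng.random((n, 3))
    return x / x.sum(axis=1, keepdims=True)

hits = []
for generation in range(20):                      # repeated virtual preparation + testing
    candidates = propose_compositions(50)
    scores = surrogate_activity(candidates)
    best = np.argsort(scores)[-3:]                # keep the most promising candidates
    hits.extend(candidates[best])

hits = np.array(hits)
print("best virtual hit:", hits[np.argmax(surrogate_activity(hits))])
```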

The author of Ref. [1] claims that the effectiveness and the capability of ANNs have been over-elaborated in our studies [2–4]. In order to answer the comments given in Ref. [1], the effectiveness of the ANNs evidently has to be evaluated. According to the approach described in the previous paragraph, virtual hits were obtained. Consequently, the performance of the virtual hits has to be evaluated in real catalytic experiments. In our studies [2–4] it was clearly demonstrated that ANNs could give excellent interpolation between experimental points. It was also shown that ANNs could extrapolate beyond the range of the experimental regions measured so far. In combinatorial materials science the interpolation capability of ANNs is generally exploited. In the case of HRS, fixed levels of the parameters determine the investigated experimental space. Using ANNs, the virtual optimum composition in the experimental space defined by the fixed concentration levels can be found. This composition was successfully validated by real experiments. Hence, the networks are not over-elaborated. Whether the predictive ANNs can be substituted with simpler architectures remains a task for further investigation.

We have also tested the extrapolation ability of ANNs. To a certain extent, about 10% beyond the intervals of metal concentrations investigated so far, extrapolation gives acceptable predictions. This extrapolation ability, which is not thought to be common for ANNs, may be due to the linear combination of individual networks adopted from Cundari et al. [9].

In the learning process by back propagation we applied two data sets, i.e. the training and the validation sets. As emerges from Section 2, the validation set is used in order to ensure good generalization ability. A third data set, the so-called test set, is separated from the previous two data sets in order to have an independent set that is not used in any phase of the creation of the ANNs. The application of an external test set is compulsory according to the relevant literature [13]. Actually, it is mandatory to evaluate the predictive ability of ANNs independently of the model building and training (estimation of weights). Single networks prior to the linear combination cannot be ranked on the basis of the test set.

In Ref. [1] our methane oxidation model [3] has also been criticized: the inputs corresponding to the concentrations of the elements are not independent of each other. First of all, it is worth mentioning that six elements constituted the mixed oxide support, while the remaining three noble metal components were supported on the mixed oxide. The concentrations of the former six components were given in mole percent, while those of the three noble metals were given in weight percent with respect to the support. In any case, Sha has correctly realized that the six components of the support are indeed not independent of each other. However, the networks were obviously trained only with "feasible" compositions, i.e. those in which the sum of the six components of the support was 100%. Moreover, it has to be emphasized that the trained networks were used only for the prediction of the performance of "feasible" compositions, and the behavior of the ANNs at any other coordinates of the nine-dimensional space is completely outside our interest. Evidently, only an eight-dimensional subspace of the nine-dimensional space is realizable. Actually, after proper transformation of the concentrations to molar ratios, eight input units could also be used, but the application of the original nine concentration units leads to the same results.
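A minimal sketch of the data handling described above is given below: "feasible" compositions with six support components summing to 100 mol% plus three noble-metal loadings in wt%, split into training, validation and test sets. The concentration ranges are illustrative assumptions, and the 140/30/30 split follows the simulated experiments of Ref. [10], not the methane oxidation study.

```python
# "Feasible" compositions (six mol% support components summing to 100, plus
# three wt% noble-metal loadings) and a three-way data split.
import numpy as np

rng = np.random.default_rng(4)

n = 200
support = rng.random((n, 6))
support = 100.0 * support / support.sum(axis=1, keepdims=True)   # six mol% values, sum = 100
metals = rng.uniform(0.0, 5.0, size=(n, 3))                      # three wt% loadings (illustrative range)
X = np.hstack([support, metals])                                 # nine input units per sample

# Split into training, validation and test sets (e.g. 140/30/30 as in Ref. [10]).
idx = rng.permutation(n)
train, validation, test = idx[:140], idx[140:170], idx[170:]
print(X[train].shape, X[validation].shape, X[test].shape)
```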


4. Summary

Differences between classical parameter estimation by regression procedures and the determination of the parameters of ANNs by the back propagation learning algorithm have been discussed. In the former case the focus is on the parameters themselves. Regression methods cannot work with more parameters than samples, and overfitting can only be avoided if the number of data significantly exceeds the number of parameters. Contrary to regression, in the training of ANNs the actual values of the connection weights are of minor importance. It has to be emphasized that our goal is to attain good predictions and not to analyze the weights of the networks. The demand for good generalization ability makes it possible to apply a larger number of connection weights than the number of training data available. Good generalization ability can be achieved if a validation set prevents overtraining.

From our point of view ANNs are considered as black boxes. Our interest is limited to whether they are applicable in a given catalytic system for the establishment of composition–activity (performance) relationships. The method adopted from Cundari et al. [9] meets our goal. By means of the linear combination of ANNs the appropriate networks are selected, and their combination leads to a network with unique generalization ability. It can also be mentioned that overfitting as a phenomenon is inherent in ANNs, but this fact does not invalidate the predictive performance of ANNs.

Acknowledgment

Partial financial support given to A.T. (OTKA grant F 049742) is gratefully acknowledged.


References

[1] W. Sha, Appl. Catal. A: Gen. 324 (2007) 87.
[2] A. Tompos, J.L. Margitfalvi, E. Tfirst, L. Végvári, Appl. Catal. A: Gen. 303 (2006) 72.
[3] A. Tompos, J.L. Margitfalvi, E. Tfirst, L. Végvári, M.A. Jaloull, H.A. Khalfalla, M.M. Elgarni, Appl. Catal. A: Gen. 285 (2005) 65.
[4] A. Tompos, J.L. Margitfalvi, E. Tfirst, L. Végvári, Appl. Catal. A: Gen. 254 (2003) 161.
[5] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Nature 323 (1986) 533.
[6] I.G. Sprinkhuizen-Kuyper, E.J.W. Boers, Neural Comput. 8 (1996) 1301.
[7] I.G. Sprinkhuizen-Kuyper, E.J.W. Boers, Neural Networks 11 (1998) 683.
[8] I.G. Sprinkhuizen-Kuyper, E.J.W. Boers, IEEE Trans. Neural Networks 10 (1999) 968.
[9] T.R. Cundari, J. Deng, Y. Zhao, Ind. Eng. Chem. Res. 40 (2001) 5475.
[10] A. Tompos, J.L. Margitfalvi, L. Végvári, E. Tfirst, Comb. Chem. High Throughput Screen. 10 (2007) 121.
[11] A. Corma, J.M. Serra, E. Argente, V. Botti, S. Valero, Chem. Phys. Chem. 3 (2002) 939.
[12] L.A. Baumes, D. Farrusseng, M. Lengliz, C. Mirodatos, QSAR Comb. Sci. 23 (2004) 767.
[13] A. Tropsha, P. Gramatica, V.K. Gombar, QSAR Comb. Sci. 22 (2003) 69.