Learning with generalized-mean neuron model


Neurocomputing 69 (2006) 2026–2032 www.elsevier.com/locate/neucom

Learning with generalized-mean neuron model R.N. Yadava,, Nimit Kumarb, Prem K. Kalrac, Joseph Johnc a

Department of Electronics Engineering, Maulana Azad National Institute of Technology, Bhopal, India; IBM India Research Lab, Block-1, IIT Campus, New Delhi, India; Department of Electrical Engineering, Indian Institute of Technology, Kanpur, India

Received 17 April 2004; received in revised form 17 October 2005; accepted 31 October 2005. Available online 20 February 2006. Communicated by L.C. Jain. doi:10.1016/j.neucom.2005.10.006

Abstract

The artificial neuron has come a long way in modeling the functional capabilities of various neuronal processes. Higher-order neurons have shown improved computational power and generalization ability. However, these models are difficult to train because of a combinatorial explosion of higher-order terms as the number of inputs to the neuron increases. This work presents an artificial neural network using a neuron architecture called the generalized mean neuron (GMN) model. This neuron model consists of an aggregation function based on the generalized mean of all the inputs applied to it. The proposed neuron model, with the same number of parameters as the McCulloch–Pitts model, demonstrates better computational power. Its performance has been benchmarked on both classification and time series prediction problems.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Generalized mean neuron; Classification; Function approximation; Multilayer perceptrons

1. Introduction

An artificial neuron is a mathematical model of the biological neuron and approximates its functional capabilities. The major issue in artificial neuron models is the description of single-neuron computation and of the interaction among neurons under applied input signals. In the past [2,6,7,9,16,18,12], various neuron models and their ability to solve linear and nonlinear classification and function approximation problems have been discussed. In [17], the author discusses the function of neurons based on their use. McCulloch and Pitts [9] initiated the use of summing units as the neuron model, while neglecting all possible nonlinear capabilities of the single neuron and the role of dendrites in information processing in the neural system. Their model operates under a discrete-time assumption and assumes synchrony of operation of all neurons in a larger network. We propose a simple model, the generalized mean neuron (GMN), with a well-defined training procedure based on error back-propagation. The proposed model considers a weighted generalized mean (GM) of all inputs in the space. This ensures that the McCulloch–Pitts model is a special case of the proposed neuron model.

In Section 2, we discuss the motivation, architecture and mathematical representation of the proposed neuron model. Section 3 presents the learning rule for a multilayer feedforward neural network based on the GMN model. Section 4 discusses the performance of a neural network using the proposed neuron model on two typical pattern recognition problems: classification and function approximation. We solve the channel equalization, Pima Indians diabetes and synthetic two-class problems using the GMN-based network and compare it with a multilayer perceptron (MLP). Similar experiments are carried out on the Mackey–Glass, Box–Jenkins gas furnace and HCL Internet incoming traffic datasets to demonstrate the function approximation capabilities of the proposed neuron model. Section 5 presents the final conclusions and some future directions for the work.


2. Generalized mean neuron model

Neuron modeling is concerned with relating the function of a neuron to its structure on the basis of its operation. As the name suggests, the GMN model is motivated by the idea of a generalized mean of the input signals [11]. The GM of N input signals x_j (j = 1, 2, ..., N) is given by

GM = \left[ \frac{1}{N} \sum_{j=1}^{N} x_j^r \right]^{1/r},   (1)

where r (r \in \mathbb{R}, r \neq 0) is the generalization parameter; depending on the value it takes, Eq. (1) yields the familiar means (harmonic for r = -1, geometric in the limit r \to 0, arithmetic for r = 1). Inspired by the flexibility of the above equation, the aggregation function of the GMN is defined as

y(x_j, w_j) = \left[ \sum_{j=1}^{N} w_j x_j^r + w_0 \right]^{1/r},   (2)

where w_j is the adaptive parameter corresponding to each x_j and w_0 is the bias of the neuron. From Eq. (2) we find that

y(x_j, w_j) = \sum_{j=1}^{N} w_j x_j + w_0   for r = 1,   (3)

which is the output of the McCulloch–Pitts model. Thus the perceptron model is a special case of the proposed GMN model. The physical architecture of the GMN model is the same as that of the perceptron model.
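To make the aggregation concrete, the following minimal sketch evaluates Eq. (2) numerically and checks the r = 1 reduction to the weighted sum of Eq. (3). It assumes positive inputs and a positive bracketed sum so that the real r-th root exists; the specific values of x, w and w0 are illustrative and not taken from the paper.

```python
import numpy as np

def gmn_aggregate(x, w, w0, r):
    """Generalized-mean aggregation of Eq. (2): y = (sum_j w_j * x_j**r + w0)**(1/r).
    Assumes the bracketed term is positive so the real r-th root is defined."""
    return (np.dot(w, x ** r) + w0) ** (1.0 / r)

x = np.array([0.3, 0.6, 0.9])   # inputs, normalized as in Section 4
w = np.array([0.5, 0.2, 0.7])   # synaptic weights (illustrative values)
w0 = 0.1                        # bias

# For r = 1 the aggregation reduces to the McCulloch-Pitts weighted sum of Eq. (3).
print(gmn_aggregate(x, w, w0, r=1.0))   # equals np.dot(w, x) + w0
print(np.dot(w, x) + w0)
print(gmn_aggregate(x, w, w0, r=1.15))  # a GMN with r close to the value used in Table 1
```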

3. Multilayer feedforward network using GMN model

3.1. Network architecture and description

Let us consider a feedforward multilayer network in which M hidden-layer neurons receive N inputs, as shown in Fig. 1. The input and output vectors of the network are X = [x_1 x_2 ... x_N]^T and Y = [y_1 y_2 ... y_K]^T, respectively. If w^h_{ij} is the weight that connects the ith hidden neuron to the jth input, the activation value of the ith hidden neuron is

net_{hi} = \left[ \sum_{j=1}^{N} w^h_{ij} x_j^r + w^h_{0i} \right]^{1/r}   for i = 1, 2, ..., M,   (4)

where w^h_{0i} is the bias of the ith hidden neuron. The nonlinear transformation performed by each of the M hidden neurons is

y_{hi} = f(net_{hi})   for i = 1, 2, ..., M,   (5)

where f denotes a sigmoid function. Similarly, the output of the kth neuron in the output layer is

y_k = f(net_k)   for k = 1, 2, ..., K,   (6)

where

net_k = \left[ \sum_{i=1}^{M} w_{ki} y_{hi}^r + w_{0k} \right]^{1/r}   for k = 1, 2, ..., K,   (7)

where w_{ki} is the weight that connects the ith hidden-layer neuron to the kth output-layer neuron and w_{0k} is the bias of the corresponding output-layer neuron. For simplicity, the value of the generalization parameter r is taken to be the same for every neuron in our experiments.

Fig. 1. Multilayer feedforward network using GMN model.
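The forward pass of Eqs. (4)–(7) can be sketched as follows for a single input pattern. The logistic sigmoid for f, the matrix shapes and the random positive weights are our own assumptions for illustration; with positive inputs and weights the bracketed sums stay positive, so the real r-th root is defined.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gmn_forward(x, Wh, wh0, Wo, wo0, r):
    """Forward pass of the GMN network, Eqs. (4)-(7).
    x  : (N,) input pattern
    Wh : (M, N) hidden weights w^h_ij,  wh0 : (M,) hidden biases
    Wo : (K, M) output weights w_ki,    wo0 : (K,) output biases"""
    net_h = (Wh @ x ** r + wh0) ** (1.0 / r)    # Eq. (4)
    y_h = sigmoid(net_h)                         # Eq. (5)
    net_o = (Wo @ y_h ** r + wo0) ** (1.0 / r)   # Eq. (7)
    y = sigmoid(net_o)                           # Eq. (6)
    return y_h, net_h, y, net_o

rng = np.random.default_rng(0)
N, M, K, r = 4, 3, 1, 1.15
x = rng.uniform(0.1, 0.9, N)                     # inputs normalized to [0.1, 0.9]
Wh, wh0 = rng.uniform(0.1, 0.5, (M, N)), rng.uniform(0.1, 0.5, M)
Wo, wo0 = rng.uniform(0.1, 0.5, (K, M)), rng.uniform(0.1, 0.5, K)
print(gmn_forward(x, Wh, wh0, Wo, wo0, r)[2])
```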

3.2. Learning rule

We describe an error back-propagation learning rule for a network of the proposed GMN model. The simplicity of the learning method makes the model convenient to use in different situations, unlike the higher-order neuron model [5], which is difficult to train. A simple gradient descent rule, using a mean-squared error (MSE) function, is described by the following set of equations.

Output layer: From Eq. (6) we have

y_k = f(net_k) = \frac{1}{1 + e^{-net_k}}.   (8)

The MSE is given as

E_{MSE} = \frac{1}{2PK} \sum_{k=1}^{K} \sum_{p=1}^{P} (y_k^p - y_{dk}^p)^2,   (9)

where y_k^p and y_{dk}^p are the actual and desired values of the kth output-layer neuron for the pth pattern, respectively, and P is the number of training patterns in the input space. The weight update rule is defined by the following equations:

\Delta w_{ki} = -\eta \frac{\partial E}{\partial w_{ki}} = \frac{\delta\, y_{hi}^{r}\, net_k^{1-r}}{r},   (10)

\Delta w_{0k} = -\eta \frac{\partial E}{\partial w_{0k}} = \frac{\delta\, net_k^{1-r}}{r},   (11)

w_{ki}^{new} = w_{ki}^{old} + \Delta w_{ki},   (12)

w_{0k}^{new} = w_{0k}^{old} + \Delta w_{0k},   (13)

where \delta = -\eta\, y_k (y_k - y_{dk})(1 - y_k)/(PK) and \eta (\eta \in [0, 1]) is the learning rate.
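For concreteness, a per-pattern version of the output-layer step of Eqs. (10)–(13) might look as follows. This is a sketch with P = 1; the array names and shapes are our own, and delta is computed as defined above (it already carries the factor -eta).

```python
import numpy as np

def output_layer_update(y_h, net_o, y, y_d, Wo, wo0, r, eta=0.1, P=1):
    """Output-layer gradient step, Eqs. (10)-(13), written per pattern (P = 1 here).
    y_h: (M,) hidden outputs; net_o, y, y_d: (K,) output nets, outputs and targets."""
    K = y.shape[0]
    delta = -eta * y * (y - y_d) * (1.0 - y) / (P * K)      # delta as defined after Eq. (13)
    dWo = np.outer(delta * net_o ** (1 - r), y_h ** r) / r   # Eq. (10), all (k, i) pairs at once
    dwo0 = delta * net_o ** (1 - r) / r                      # Eq. (11)
    return Wo + dWo, wo0 + dwo0                              # Eqs. (12)-(13)
```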

Hidden layer: From Eqs. (4) and (5) we can define the update rules for the weights w^h_{ij} and w^h_{0i} by Eqs. (14) and (15):

\Delta w^h_{ij} = -\eta \frac{\partial E}{\partial w^h_{ij}} = \frac{\delta (1 - y_{hi})\, y_{hi}^{r}\, net_{hi}^{1-r}\, x_j^{r-1} \left[ \sum_{k=1}^{K} net_k^{1-r} w_{ki} \right]}{r},   (14)

\Delta w^h_{0i} = -\eta \frac{\partial E}{\partial w^h_{0i}} = \frac{\delta (1 - y_{hi})\, y_{hi}^{r}\, net_{hi}^{1-r} \left[ \sum_{k=1}^{K} net_k^{1-r} w_{ki} \right]}{r}.   (15)

The new weights w^{h,new}_{ij} and w^{h,new}_{0i} are determined as in Eqs. (12) and (13). The learning rate \eta can either be adapted with epochs or be fixed to a small number based on heuristics. This learning method is used in the next section to train the network on well-known benchmark problems of both classification and function approximation.
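A corresponding sketch of the hidden-layer step of Eqs. (14)–(15) is given below. Note one deliberate deviation: we keep the per-output delta_k inside the sum over k (the usual chain-rule form), whereas Eqs. (14)–(15) write a single delta outside the bracket. The variable names and shapes are again our own.

```python
import numpy as np

def hidden_layer_update(x, y_h, net_h, net_o, delta, Wo, Wh, wh0, r):
    """Hidden-layer gradient step in the spirit of Eqs. (14)-(15), for one pattern.
    x: (N,) inputs; y_h, net_h: (M,); net_o, delta: (K,); Wo: (K, M); Wh: (M, N); wh0: (M,)."""
    # back[i] = sum_k delta_k * net_k^(1-r) * w_ki  (delta_k folded into the sum)
    back = (delta * net_o ** (1 - r)) @ Wo
    common = back * (1.0 - y_h) * y_h ** r * net_h ** (1 - r) / r
    dWh = np.outer(common, x ** (r - 1))     # Eq. (14)
    dwh0 = common                            # Eq. (15)
    return Wh + dWh, wh0 + dwh0              # updated as in Eqs. (12)-(13)
```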

4. Results and discussions

We demonstrate the performance of the GMN model on two important problems arising in machine learning: classification and function approximation. The results obtained are compared with the performance of the multilayer neural network (MLN) architecture. We observe that the GMN model achieves better accuracy and enhanced performance because of its improved architecture. The experiments were carried out for 10 runs and the results averaged. The datasets were uniformly normalized to the range [0.1, 0.9]. All networks were trained using an error back-propagation based gradient descent learning algorithm. The network topology is reported in the form n-h_i-o, where n is the number of input nodes, h_i is the number of nodes in the ith hidden layer and o is the number of output nodes. Along with the training and testing errors for each network topology, we also use statistical measures such as covariance, correlation and Akaike's information criterion (AIC) [1,4] to test the capability of the proposed neuron model. The AIC, defined in Eq. (16), evaluates the goodness of fit of a model based on the MSE for the training data and the number of estimated parameters:

AIC = -2 \log(\text{maximum likelihood}) + 2L,   (16)

where L is the number of independently estimated parameters. If the output errors are statistically independent of each other and follow a normal distribution with zero mean and constant variance, Eq. (16) can be written as

AIC = 2Pk \log(\sigma^2) + 2L,   (17)

where P is the number of training data, k is the number of output units and \sigma^2 is the maximum likelihood estimate of the MSE. The model which minimizes the AIC is optimal in the minimal averaging loss sense [10]. In our experiments we considered the absolute values of the estimated MSE and outputs to avoid any computation on complex values. The average CPU training time in seconds is also reported to show the time complexity of the proposed model in comparison with MLNs. All programs were run on a Pentium-4 (2.8 GHz) machine with 512 MB RAM.
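As a small illustration of Eq. (17), the helper below computes the AIC from the training MSE, assuming the natural logarithm; the 2-3-1 parameter count and the MSE value are hypothetical and only show how the terms combine.

```python
import math

def aic(mse, n_train, n_outputs, n_params):
    """AIC in the form of Eq. (17): 2 * P * k * log(sigma^2) + 2 * L,
    with sigma^2 taken as the maximum likelihood estimate of the MSE."""
    return 2.0 * n_train * n_outputs * math.log(mse) + 2.0 * n_params

# Hypothetical illustration: a 2-3-1 network has L = 2*3 + 3 + 3*1 + 1 = 13 parameters.
print(aic(mse=0.012, n_train=500, n_outputs=1, n_params=13))
```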

4.1. Classification

In this section, we discuss results on some of the most popular classification problems for artificial neural networks: channel equalization, synthetic two-class data and the Pima Indians diabetes data.

4.1.1. Channel equalization

Band-limited communication channels driven at high data rates often display inter-symbol interference (ISI). Nonlinear channel equalization [13] is a popular problem in communication systems: given the channel outputs y(t), recover an estimate of s(t - \tau), denoted \hat{s}(t - \tau), where \tau is the equalizer delay. Thus, the channel output vector

y(t) = [y(t), y(t-1), ..., y(t-m+1)]   (18)

is used to compute \hat{s}(t - \tau), where m is the equalizer order. Since \hat{s}(t - \tau) is binary, the problem is essentially a classification task. We consider a nonlinear channel model given by Eqs. (19) and (20), with both the equalizer delay and the order equal to two:

o = s(i) + 0.5\, s(i-1),   (19)

x(i-2) = o - 0.9\, o^3.   (20)

The generated data is then subjected to a 10 dB noise level. Fig. 2 shows the plot of the two classes (zero and one) in the space spanned by y(t) and y(t-1). Five hundred points were taken for training and the system was validated with 4500 testing points. The performance on the channel equalization problem is reported in Table 1. With a similar network structure, the GMN-based network proves to be better for nonlinear channel equalization than the multilayer network. The networks of both models were trained for up to 500 epochs and the weights corresponding to the minimum error were used for generalization.
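A sketch of how the channel-equalization data of Eqs. (18)–(20) can be generated is shown below. The ±1 source symbols thresholded into classes zero and one, the additive Gaussian noise scaled to roughly 10 dB SNR, and the exact indexing of the delay are our assumptions, not taken from the paper.

```python
import numpy as np

def make_channel_data(n, snr_db=10.0, seed=0):
    """Generate nonlinear channel data following Eqs. (19)-(20) and add noise.
    Returns equalizer inputs [y(t), y(t-1)] and the delayed binary symbol as the class label."""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n + 2)           # binary source symbols
    o = s[1:] + 0.5 * s[:-1]                          # Eq. (19)
    x = o - 0.9 * o ** 3                              # Eq. (20)
    noise_power = np.mean(x ** 2) / (10 ** (snr_db / 10.0))
    y = x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    X = np.column_stack([y[1:], y[:-1]])              # channel output vector, Eq. (18) with m = 2
    labels = (s[: len(X)] > 0).astype(int)            # symbol delayed by two, as classes 0 and 1
    return X, labels

X, labels = make_channel_data(500)
print(X.shape, labels[:10])
```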

Fig. 2. A channel equalization problem with two classes.

Table 1
Comparison of performance for the channel equalization problem between the GMN-based network and a standard multilayer network

                      GMNs (r = 1.15)    MLPs
Topology              2-3-1              2-3-1
Epochs                12                 16
Training error (%)    1.2                1.4
Testing error (%)     1.58               1.81
AIC                   -4.58              -4.19
Training time (s)     19.79              22.14

4.1.2. Diabetes—Pima Indians

The well-known diabetes dataset of Pima Indian women is used as a benchmark for classifier systems. The task is to predict the presence of diabetes from seven variables: number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, body mass index (weight/height^2), diabetes pedigree, and age. In [15], the author provides an analysis of the dataset, which contains a total of 532 records. Of these 532, 200 are used for training and 332 for testing, with about 33% of the total dataset having diabetes. Table 2 shows the performance of the GMN-based network compared with a multilayer network. The networks of both models were trained for up to 1000 epochs and the weights corresponding to the minimum error were used for generalization.

Table 2
Comparison of performance for the Pima Indians diabetes dataset between the GMN-based network and a standard multilayer network

                      GMNs (r = 1.2)     MLPs
Topology              7-4-1              7-4-1
Epochs                401                488
Training error (%)    20                 22
Testing error (%)     12.59              13.35
AIC                   -2.28              -2.13
Training time (s)     27.67              31.52

4.1.3. Synthetic two-class problem

This problem from [14] is used to illustrate how learning methods work. There are two features and two classes; each class has a bimodal distribution. The class distributions were chosen to allow a best-possible error rate of about 8% and are in fact equal mixtures of two distributions. The GMN-based multilayer network was trained using 250 data samples and tested with 1000 samples. Table 3 shows the performance of the GMN-based network compared with a multilayer network.

Table 3
Comparison of performance for the synthetic two-class data between the GMN-based network and a standard multilayer network

                      GMNs (r = 1.2)     MLPs
Topology              2-5-1              2-5-1
Epochs                214                335
Training error (%)    14                 19.2
Testing error (%)     8.64               13.84
AIC                   -13.24             -12.84
Training time (s)     10.91              14.23

4.2. Function approximation

We evaluate the capabilities of the proposed GMN model on the following problems:

(1) Mackey–Glass time series dataset
(2) Short-term Internet incoming traffic dataset
(3) Box–Jenkins gas furnace dataset

The Mackey–Glass and Box–Jenkins datasets are benchmark problems that are popularly used to evaluate a proposed learning method. We also investigate short-term Internet incoming traffic prediction using the HCL-Infinet Internet traffic dataset.

4.2.1. Mackey–Glass time series dataset

The Mackey–Glass (MG) time series [8] represents a model for white blood cell production in leukemia patients and exhibits nonlinear oscillations. The MG delay-difference equation is given by Eq. (21):

y(t+1) = (1 - b)\, y(t) + a\, \frac{y(t - \tau)}{1 + y^{10}(t - \tau)},   (21)

where a = 0.2, b = 0.1 and \tau = 17. The time delay \tau is a source of complications in the nature of the time series. The objective of the modeling is to predict the value of the time series from four previous values: the measurements y(t), y(t-6), y(t-12) and y(t-18) are used to predict y(t+1). Training is performed on 250 samples and the model is tested on 200 time instants after training. The networks of both models were trained for up to 10,000 epochs and the weights corresponding to the minimum error were used for generalization. Fig. 3 shows the training and prediction results. In Table 4, the performance of the GMN-based network is compared with a multilayer network with one hidden layer of three nodes, trained using gradient descent.

Fig. 3. Long-term prediction results for the Mackey–Glass time series dataset using the proposed neuron model.

Table 4
Comparison of performance for the Mackey–Glass time series dataset between the GMN network and a standard multilayer network, both trained using the gradient descent method with learning rate 0.1

                      GMNs (r = 0.75)    MLPs
Topology              4-3-1              4-3-1
Epochs                911                10,000
Training error        1.90 × 10^-4       4.45 × 10^-4
Testing error         1.89 × 10^-4       3.87 × 10^-4
Covariance            2.22 × 10^-4       3.26 × 10^-4
Correlation           0.9954             0.9938
AIC                   -8.4149            -7.5646
Training time (s)     268.43             356.95
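The series itself can be generated by iterating Eq. (21) and then embedding it with the lags used above. The constant initial history, the series length and the helper names are assumptions of this sketch.

```python
import numpy as np

def mackey_glass(n, a=0.2, b=0.1, tau=17, y0=1.2):
    """Iterate the discrete Mackey-Glass map of Eq. (21):
    y(t+1) = (1 - b) * y(t) + a * y(t - tau) / (1 + y(t - tau)**10).
    The constant initial history y0 for t <= tau is an assumption of this sketch."""
    y = np.full(n + tau + 1, y0)
    for t in range(tau, n + tau):
        y[t + 1] = (1 - b) * y[t] + a * y[t - tau] / (1 + y[t - tau] ** 10)
    return y[tau + 1:]

def embed(series, lags=(0, 6, 12, 18), horizon=1):
    """Build inputs [y(t), y(t-6), y(t-12), y(t-18)] and the target y(t+1)."""
    start = max(lags)
    X = np.column_stack([series[start - lag: len(series) - horizon - lag] for lag in lags])
    target = series[start + horizon:]
    return X, target

series = mackey_glass(600)
X, target = embed(series)
print(X.shape, target.shape)
```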

4.2.2. Short-term Internet incoming traffic dataset

Short-term Internet traffic data was supplied by HCL-Infinet (a leading Indian ISP). The weekly Internet traffic graph with a 30-min average is shown in Fig. 4: the solid graph in gray shows the incoming traffic while the line graph in black represents the outgoing traffic, with all values reported in bits/s. We propose a model for predicting the Internet traffic from previous values: the four measurements y(t), y(t-1), y(t-2) and y(t-3) are used to predict y(t+1) for the incoming traffic. In this case, 150 training samples were taken and the model was tested for prediction on 150 samples. The networks of both models were trained for up to 10,000 epochs and the weights corresponding to the minimum error were used for generalization. Fig. 5 shows the prediction results for the incoming Internet traffic data. The performance is compared with a multilayer network in Table 5.

Fig. 4. Weekly graph (30-min average) of the Internet traffic for the HCL-Infinet router at Delhi, India.

Fig. 5. Testing result on the HCL-Infinet MRTG incoming Internet bandwidth usage data.

Table 5
Comparison of performance for the incoming Internet bandwidth usage of the HCL-Infinet router data between the GMN-based network and a standard multilayer network, both trained using the gradient descent method with learning rate 0.2

                      GMNs (r = 1.2)     MLPs
Topology              4-7-1              4-7-1
Epochs                630                10,000
Training error        1.54 × 10^-4       5.50 × 10^-3
Testing error         2.63 × 10^-3       4.2 × 10^-3
Covariance            4.1 × 10^-3        4.4 × 10^-3
Correlation           0.9090             0.9070
AIC                   -8.3667            -4.7056
Training time (s)     216.68             403.21
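The same windowing idea applies to the traffic series; the sketch below also shows the uniform normalization to [0.1, 0.9] mentioned in the experimental setup. The synthetic stand-in series is hypothetical, since the HCL-Infinet data is not reproduced here.

```python
import numpy as np

def normalize(series, lo=0.1, hi=0.9):
    """Uniformly rescale a series to [0.1, 0.9], the range used for all datasets in Section 4."""
    s_min, s_max = series.min(), series.max()
    return lo + (hi - lo) * (series - s_min) / (s_max - s_min)

def traffic_patterns(series, n_lags=4, horizon=1):
    """Form inputs [y(t), y(t-1), y(t-2), y(t-3)] and the target y(t+1) for one-step prediction."""
    X = np.column_stack([series[n_lags - 1 - k: len(series) - horizon - k] for k in range(n_lags)])
    target = series[n_lags - 1 + horizon:]
    return X, target

# Hypothetical stand-in for the incoming-traffic samples (bits/s).
raw = np.abs(np.random.default_rng(1).normal(5e6, 1e6, size=300))
X, target = traffic_patterns(normalize(raw))
print(X.shape, target.shape)   # the paper splits such patterns into 150 training and 150 testing samples
```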

4.2.3. Box–Jenkins gas furnace dataset

The Box–Jenkins gas furnace dataset [3] reports the furnace input as the gas flow rate u(t) and the furnace output y(t) as the CO2 concentration. In this gas furnace, air and methane were combined to obtain a mixture of gases containing CO2. We model the furnace output y(t+1) as a function of the previous output y(t) and the input u(t-3). The training and testing results of the GMN-based network and the MLP network are shown in Fig. 6. The networks of both models were trained for up to 4000 epochs and the weights corresponding to the minimum error were used for generalization. Table 6 gives a detailed comparison of these networks.

Fig. 6. Performance result on the Box–Jenkins dataset.

Table 6
Comparison of performance for the Box–Jenkins gas furnace dataset between the GMN model and a standard multilayer network, both trained using the gradient descent method with learning rate 0.1

                      GMNs (r = 0.9)     MLPs
Topology              2-5-1              2-5-1
Epochs                400                4000
Training error        4.68 × 10^-6       3.84 × 10^-4
Testing error         3.48 × 10^-4       7.02 × 10^-4
Covariance            6.53 × 10^-4       2.26 × 10^-5
Correlation           0.9901             0.9856
AIC                   -12.0599           -7.5846
Training time (s)     78.67              142.74

5. Conclusions

In this work, we propose a new neuron model based on the generalized mean. The idea is inspired by nonlinear activities in neuronal interactions. The model makes it possible to construct a better neural network architecture, and we demonstrate these abilities empirically through experiments on benchmark and real-life datasets. The GMN-based neural network performs better than the existing multilayer network, while maintaining the simplicity of error back-propagation learning for the weights.

The generalization parameter r is an important value which has to be fine-tuned in the present set-up. A similar error back-propagation scheme could be used to obtain an optimal value of r. However, the neuron output is extremely sensitive to the value of r, so we chose to obtain r by cross-validation. A future effort would be to devise a learning method for obtaining an optimal value of the generalization parameter r. While the performance of the proposed model has been demonstrated empirically, it will be interesting to investigate its theoretical properties.

Acknowledgement

We would like to thank P.V. Ramadas of HCL-Infinet, New Delhi for providing the Internet traffic data and Prof. Dana H. Ballard, University of Rochester, USA for useful discussions.

References

[1] H. Akaike, A new look at the statistical model identification, IEEE Trans. Autom. Control AC-19 (1974) 716–723.
[2] M. Basu, T.K. Ho, Learning behavior of single neuron classifiers on linearly separable or nonseparable inputs, in: IEEE IJCNN'99, 1999, pp. 1259–1264.
[3] G.E.P. Box, G.M. Jenkins, G.C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice-Hall, Englewood Cliffs, NJ, 1994.
[4] D.B. Fogel, An information criterion for optimal neural network selection, IEEE Trans. Neural Networks 2 (1991) 490–497.
[5] M. Guler, E. Sahin, A new higher-order binary-input neural unit: learning and generalizing effectively via using minimal number of monomials, in: Proceedings of the Third Turkish Symposium on Artificial Intelligence and Neural Networks, 1994, pp. 51–60.
[6] E.M. Iyoda, H. Nobuhara, K. Hirota, A solution for the N-bit parity problem using a single translated multiplicative neuron, Neural Process. Lett. 18 (2003) 233–238.
[7] R. Labib, New single neuron structure for solving nonlinear problems, in: IEEE IJCNN'99, 1999, pp. 617–620.
[8] M. Mackey, L. Glass, Oscillation and chaos in physiological control systems, Science 197 (1977) 287–289.
[9] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943) 115–133.
[10] N. Murata, S. Yoshizawa, S. Amari, Network information criterion: determining the number of hidden units for an artificial neural network model, IEEE Trans. Neural Networks 5 (1994) 865–872.
[11] A. Piegat, Fuzzy Modeling and Control, Physica-Verlag, Heidelberg, NY, 2001.
[12] T.A. Plate, Randomly connected sigma-pi neurons can form associator networks, Network: Comput. Neural Syst. 11 (2000) 321–332.
[13] J.G. Proakis, Digital Communications, McGraw-Hill International, Singapore, 2001.
[14] B.D. Ripley, Neural networks and related methods of classification, J. R. Stat. Soc. Ser. B 56 (1994) 409–456.
[15] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996.
[16] M. Schmitt, On the complexity of computing and learning with multiplicative neural networks, Neural Comput. 14 (2001) 241–301.
[17] K. Schreiner, Neuron function: the mystery persists, IEEE Intell. Syst. 16 (2001) 4–7.
[18] C.N. Zhang, M. Zhao, M. Wang, Logic operations based on single neuron rational model, IEEE Trans. Neural Networks 11 (2000) 739–747.


R.N. Yadav received his B.E. degree from Motilal Nehru Regional Engineering College Allahabad, India, and M.Tech. degree from Maulana Azad College of Technology Bhopal, India, in 1993 and 1997, respectively. In 2005, he received his Ph.D. degree, under the Quality Improvement Programme, from the Department of Electrical Engineering, Indian Institute of Technology Kanpur, India. Currently, he is working as an Assistant Professor in the Department of Electronics and Communication Engineering, Maulana Azad National Institute of Technology Bhopal, India. He is a member of IEEE and a life member of IETE and IE(I), India.

Nimit Kumar graduated from the Indian Institute of Technology Kanpur in 2004 with a Bachelor of Technology in Electrical Engineering. Since then, he has been working at IBM India Research Laboratory, Delhi, India. Nimit's research focus has been on Machine Learning, particularly on Kernel-based Learning and Learning in Hilbert Spaces, Neuro-Fuzzy Systems, Computational Models of Learning, Text Analytics and Visuo-Motor Control. Nimit is a member of IEEE and its Computational Intelligence Society and Systems, Man and Cybernetics Society. He serves as a referee for the IEEE Transactions on Systems, Man and Cybernetics, Part B and the IEEE Transactions on Fuzzy Systems.

Prem K. Kalra received his B.Sc. (Engg.) degree from DEI Agra, India, in 1978, M.Tech. degree from the Indian Institute of Technology, Kanpur, India, in 1982 and Ph.D. degree from the University of Manitoba, Canada, in 1987. From January 1987 to June 1988, he worked as an assistant professor in the Department of Electrical Engineering, Montana State University, Bozeman, MT, USA. In July–August 1988, he was a visiting assistant professor in the Department of Electrical Engineering, University of Washington, Seattle, WA, USA. Since September 1988, he has been with the Department of Electrical Engineering, Indian Institute of Technology Kanpur, India, where he is a Professor. He is a member of IEEE, a fellow of IETE and a life member of IE(I), India. He has published over 150 papers in reputed national and international journals and conferences. His research interests are Expert Systems Applications, Fuzzy Logic, Neural Networks and Power Systems.

J. John received his B.Sc. (Engg.) degree from Kerala University, India, in 1978 and M.Tech. degree from the Indian Institute of Technology Madras, India, in 1980. He received his Ph.D. from the University of Birmingham, UK, in 1993. Currently he is a Professor in the Department of Electrical Engineering, Indian Institute of Technology Kanpur, India. He is a member of IEEE and a fellow of IETE. His research interests are Fibre Optics, Optical Wireless Systems, Electronic Circuits and Intelligent Instrumentation Systems.