St. Petersburg Polytechnical University Journal: Physics and Mathematics 000 (2017) 1–8 www.elsevier.com/locate/spjpm
Clustering algorithms application to forming a representative sample in the training of a multilayer perceptron

Aleksey A. Pastukhov∗, Aleksander A. Prokofiev

National Research University of Electronic Technology, 5 Pass. 4806, Zelenograd, Moscow 124498, Russian Federation

∗ Corresponding author.
E-mail addresses: [email protected], [email protected] (A.A. Pastukhov), [email protected] (A.A. Prokofiev).

Available online xxx
Abstract

In this paper, we have considered the problem of effectively forming a representative sample for training a neural network of the multilayer perceptron (MLP) type. An approach based on clustering, which makes it possible to increase the entropy of the training set, was put forward. Various clustering algorithms were examined with respect to forming the representative sample. Clustering of factor spaces of various dimensions was carried out with these algorithms, and a representative sample was formed. To verify our approach, we synthesized an MLP neural network and trained it; the training was performed on sets formed both with and without clustering. A comparative analysis of the effectiveness of the clustering algorithms was carried out in relation to the problem of representative sample formation.

Copyright © 2017, St. Petersburg Polytechnic University. Production and hosting by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: Neural network; Clustering algorithm; Representative sample; Multilayer perceptron.
Introduction

Training a neural network of the multilayer perceptron (MLP) type involves a data preprocessing stage that must be completed before the backpropagation algorithm can be applied. The majority of published studies on the application of neural networks limit the preprocessing techniques to normalization, scaling and weight pre-initialization. While these operations are undoubtedly necessary, they can hardly be considered sufficient. For a factor space of small dimension, the specifics of the distribution of the initial data need to be taken into account to train the neural network effectively. This task is substantially complicated with a large number of factors. In such a case, it is advisable to apply clustering to form a training set consisting of examples with the most unique attributes in the set.
There are numerous clustering algorithms, but all of them can be conditionally divided into two groups: crisp and fuzzy. In turn, there are two types of crisp methods: hierarchical and non-hierarchical ones [1]. A separate class includes clustering algorithms based on neural networks, which have found wide application in various fields. For example, Ref. [2] explored data clustering based on a Markov clustering algorithm and on self-organizing growing neural networks. Ref. [3] compared k-means clustering and density-based spatial clustering of applications with noise (DBSCAN) on a random sample, and assessed the efficiency of these algorithms based on the Davies–Bouldin index.
http://dx.doi.org/10.1016/j.spjpm.2017.05.004 2405-7223/Copyright © 2017, St. Petersburg Polytechnic University. Production and hosting by Elsevier B.V. This is an open access article under the CC BY-NC-ND license. (http://creativecommons.org/licenses/by-nc-nd/4.0/) (Peer review under responsibility of St. Petersburg Polytechnic University).
Fig. 1. Training results for the MLP neural network from the data formed without clustering: MSE is the mean square training error; Epoch number is the current epoch. The error patterns are shown for the training, validation and test sets; the goal and best error values are marked; the latter was achieved for the test set.
Fig. 2. Results of the MLP neural network training on the data generated using the SOM algorithm.
Ref. [4] is dedicated to clustering of text documents for creating automated classification systems using the Euclidean–Mahalanobis distance, and Ref. [5] discusses applying various algorithms of neural associative memory to creating the memory of an anthropomorphic robot.
In this paper, we have examined one fuzzy and three crisp clustering methods within the problem of forming representative samples for training a neural network; we have analyzed the efficiency of these
algorithms from the standpoint of increasing the entropy of the training set and improving the training quality for an MLP-type neural network. We have additionally analyzed the variations in the entropy of the training set and in the mean square error (MSE) of training via these algorithms. The crisp clustering algorithms discussed in this paper are the k-means algorithm [3] that is the most common and easiest to implement, the self-organizing Kohonen maps [6] (the maps are considered in Ref. [7]), and a clustering algorithm based on constructing a hierarchical tree of clusters [8].
The c-means fuzzy algorithm [1] is the basic one for a large number of other algorithms of the fuzzy class and has many software implementations (for example, the FCM (fuzzy c-means) algorithm implemented in the MATLAB package).

Problem setting

The training of a neural network is usually performed on three subsets of the factor space: the training, the validation and the test ones. Together they form a representative sample for training the neural network. The training set is used for adjusting the free parameters of the neural network, the validation set is used to control overtraining, and the test set is used for independent testing of the already trained neural network.
One of the elements of effective training is forming a training set from the elements of the factor space that have the most unique attributes. It was established in Ref. [7] that this could be achieved through clustering the factor space and choosing representatives from each cluster to form the training set. The most suitable clustering algorithm that meets certain criteria has to be selected for achieving this goal.
Problem setting. Let X = {X_1, ..., X_M, Y_1, ..., Y_M} be a factor space, where X_i = {x_1, x_2, x_3, x_4}, Y_i = {y(X_i)}; M is the number of vectors in the factor space. The goal is to determine the clustering algorithm that allows finding a partition of the factor space into three sets (the training T, the validation V and the test E) for which the following conditions are satisfied:

H_0(T) < H(T) ≤ H_max(T),     (1)
S_T takes the minimal value.     (2)
Here H(T) is the entropy of the training set with clustering; H_0(T) is the entropy of the training set for a random partition of the factor space into a representative sample; H_max(T) = log2 N_t is the maximal entropy of this set (N_t is the size of the training set, comprising 80% of the factor space); S_T is the mean square error of the training set.

Analysis of the clustering algorithms

As noted above, the study was conducted with four clustering algorithms: k-means, c-means, self-organizing Kohonen maps and the hierarchical method. The number of clusters was chosen equal to 80% of the volume of the factor space in all experiments.
The first stage of the study involved assessing the entropy increase in a training set consisting of the data generated using the above-mentioned algorithms. Entropy was calculated by Shannon's formula [9]:

H(x) = −∑_{i=1}^{n} p_i log2 p_i,     (3)

where p_i is the probability of selecting an element from the cluster.
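Formula (3) is straightforward to evaluate once each element of the training set is tagged with the cluster it was drawn from. The following minimal sketch (Python/NumPy rather than the authors' MATLAB code, and assuming that p_i is estimated as the fraction of training examples taken from cluster i) illustrates the computation:

```python
import numpy as np

def training_set_entropy(cluster_labels):
    """Shannon entropy (3) of a training set.

    cluster_labels[j] is the index of the cluster from which the j-th
    training example was taken; p_i is estimated as the fraction of the
    training set drawn from cluster i (an assumption for illustration).
    """
    _, counts = np.unique(cluster_labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# The maximal entropy log2(Nt) is reached when every training example
# comes from its own cluster, e.g. training_set_entropy(range(160))
# gives log2(160) ≈ 7.32 bits.
```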
At the second stage, the MLP-type neural network with a 4–4–1 architecture was trained on the data generated using the clustering algorithms. The factor space includes four randomly generated input parameters and one output parameter that is the response. The link between the input and the output parameters is set by the nonlinear function

y = e^{x_1} + e^{x_2} + 2e^{x_3} − 3e^{x_4},

where x_1, x_2, x_3, x_4 are the corresponding input parameters and y is the output parameter. Noise described by a normally distributed random variable with a variance of 0.02 was added to the output signal vector.
The neural network training was performed with functions from the Neural Network Toolbox (NNtoolbox) of the MATLAB package. The training parameters are given in Table 1.
Fig. 1 shows the results of training the neural network on randomly generated data (without clustering), with the purpose of assessing the effectiveness of the algorithms (i.e., the difference between the mean square errors of the training, validation and test sets). The results listed here and below are for training the neural network on a factor space consisting of 200 training vectors; this does not violate generality, since, as will be demonstrated later, the increase in the entropy of the training set does not depend on the number of elements comprising it.
The training procedure was conducted ten times for each case. The pre-initialization of the weights was performed by the NNtoolbox functions using the Nguyen–Widrow algorithm [10] for each training run. The attempt that was the most successful from the standpoint of the minimal mean square training error is given as the result.
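For illustration, the factor space described above can be generated along the following lines. This is a hedged NumPy sketch: the paper does not specify the distribution of the inputs, so a uniform distribution on [0, 1] is assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 200                                    # number of vectors in the factor space
X = rng.random((M, 4))                     # four input parameters (assumed uniform on [0, 1])
y_clean = (np.exp(X[:, 0]) + np.exp(X[:, 1])
           + 2.0 * np.exp(X[:, 2]) - 3.0 * np.exp(X[:, 3]))
noise = rng.normal(0.0, np.sqrt(0.02), size=M)   # variance 0.02, as stated in the text
y = y_clean + noise                        # response with additive noise
```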
Fig. 3. Results for training the MLP neural network on the data generated using the k-means algorithm.

Table 1
Training parameters of the neural network.

Parameter name                             Parameter's authors/function    Parameter name in MATLAB
Optimization algorithm                     Levenberg–Marquardt             TRAINLM
Adaptation function                        Gradient descent                LEARNGD
Optimization criterion                     Mean square error               MSE
Initialization of the free parameters      Nguyen–Widrow                   INITNW
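The authors trained the network with the MATLAB NNtoolbox functions listed in Table 1. As a rough stand-in (not the original setup), the sketch below trains a 4–4–1 network with scikit-learn; since scikit-learn provides neither the Levenberg–Marquardt optimizer (TRAINLM) nor Nguyen–Widrow initialization (INITNW), the 'lbfgs' solver and default initialization are substituted purely for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))                                   # factor space as in the previous sketch
y = (np.exp(X[:, 0]) + np.exp(X[:, 1]) + 2 * np.exp(X[:, 2]) - 3 * np.exp(X[:, 3])
     + rng.normal(0.0, np.sqrt(0.02), 200))

# 80/10/10 split by index (the 10% validation slice is set aside and not used
# in this minimal sketch; MATLAB's training functions handle it internally).
n_train, n_val = 160, 20
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]

# 4-4-1 architecture: four inputs, one hidden layer of four neurons, one output.
net = MLPRegressor(hidden_layer_sizes=(4,), activation='tanh',
                   solver='lbfgs', max_iter=2000, random_state=0)
net.fit(X_train, y_train)

mse = np.mean((net.predict(X_test) - y_test) ** 2)         # mean square error on the test set
print(f"test MSE: {mse:.5f}")
```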
The mean square error of training is calculated by the following formula:

MSE = (1/n) ∑_{i=1}^{n} (x_i − y_i)²,     (4)

where x_i and y_i are the actual and the expected training results for the training vector i, respectively.
Tables 2 and 3 list the results of computing the entropy with the SOM algorithm and for the random partition without clustering (Variant 1, for which H(T) = 0), as well as the neural network's training time T1 for training on the data generated without clustering (for which T2 = 0). The mean square error of training is equal to 0.31462 in this case.

Self-organizing Kohonen maps (SOM)

These maps [6,11] are a class of neural networks with unsupervised learning and belong to the non-hierarchical clustering algorithms. Self-organizing maps are easy to implement and provide a reliable distribution of data over a given number of clusters after the data has passed through the layers of the map. Additionally, due to self-organization, the algorithm can independently determine the cluster centers.
Table 2
Entropy computations for factor spaces of various dimensions with two variants of applying the SOM algorithm.

N        Hmax(T), bits    H0(T), bits    H(T), bits
100      6.32             6.05           6.19
200      7.32             7.02           7.15
300      7.91             7.67           7.79
400      8.32             8.07           8.17
500      8.64             8.38           8.50
600      8.91             8.66           8.77
700      9.13             8.86           8.97
800      9.32             9.05           9.17
900      9.49             9.16           9.31
1000     9.64             9.29           9.45

Notations: N is the number of elements in the factor space; Hmax(T) is the maximal entropy value of the training set; H(T), H0(T) are the entropy values of this set with clustering (Variant 2) and for a random partition of the factor space into a representative sample (Variant 1, when H(T) = 0), respectively.
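The Hmax(T) column of Table 2 is simply log2 N_t with N_t = 0.8N, which is easy to verify with a short script (an illustrative check, not part of the original computations):

```python
import numpy as np

for N in range(100, 1100, 100):
    Nt = int(0.8 * N)                              # training-set size
    print(N, round(float(np.log2(Nt)), 2))         # reproduces the Hmax(T) column: 6.32, 7.32, ...
```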
The SOM learning algorithm consists of minimizing the difference between the input neurons in the corresponding layer and the weight coefficients of the output neurons:

ω_i(t) = ω_i(t − 1) + α·[y_i^{(n−1)}(t) − ω_i(t − 1)],     (5)

where y_i^{(n−1)} is the output neuron of the previous layer corresponding to the input neuron of the current layer; ω_i is the weight coefficient of neuron i; t is the number of the training epoch; α is the training rate coefficient (in the simplest case α ∈ [0; 1], α = const); n = 1, 2, ..., N are the layers of the map; i = 1, 2, ..., M are the numbers of the neurons of the current layer.
The results of the computations listed below have been obtained by training the MLP network on the data generated by self-organizing Kohonen maps (SOM). Detailed computations are discussed in Ref. [7]. The results of the computations for factor spaces with the number of vectors ranging from 100 to 1000 are listed in Tables 2 and 3. The mean square error of training is 0.11601.
The analysis of the results presented in Tables 2 and 3 leads to the conclusion that conditions (1) and (2) are satisfied for any N. The entropy value for the case with clustering (Variant 2) lies between the H0(T) and Hmax(T) values for all N. The time spent on clustering with the self-organizing Kohonen map (see Table 3, the data for SOM) increases almost linearly; however, this clustering time T2 increases more slowly than the training time T1 of the multilayer perceptron.

Table 3
Dependence of the processing time of the different algorithms on the size of the factor space.

         Time spent, s
N        T1       T2 (SOM)    T2 (k-means)    T2 (c-means)    T2 (hierarchical)
100      3        1           0.1             1               0.093
200      5        2           0.1             2               0.101
300      7        4           0.1             3               0.103
400      10       3           0.3             4               0.108
500      15       3           0.4             5               0.111
600      21       4           0.5             6               0.112
700      29       9           0.7             9               0.114
800      36       7           1.0             12              0.117
900      42       8           1.2             16              0.118
1000     53       11          1.5             22              0.120

Notations: N is the number of elements in the factor space; T1 is the time spent on training the MLP on the data of the corresponding dimension (the architecture of the perceptron was chosen in accordance with formula (4), assuming that ε = 0.2); T2 is the time spent on clustering by the corresponding algorithm.
Notes. The algorithms analyzed were k-means, c-means, the hierarchical method and the self-organizing Kohonen maps (SOM); the number of clusters was 0.8N; the network was trained on the data generated without clustering (Variant 1, T2 = 0) and with clustering (T2 ≠ 0).

The results of training the neural network using self-organizing Kohonen maps are shown in Fig. 2. The best performance of the neural network (the minimal MSE value) is in this case 0.11601, which is much less than the corresponding value for the case when clustering is not used (see Fig. 1). In addition, the difference between the errors of the validation and the test sets is much smaller when clustering is used.
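For orientation only (the authors' SOM implementation and the detailed computations are described in Ref. [7]), the sketch below applies the simplified update rule (5) with no neighbourhood function and then picks one representative per output neuron to form the training set; both simplifications are assumptions made for illustration.

```python
import numpy as np

def competitive_som(data, n_units, epochs=50, alpha=0.1, seed=0):
    """Single-layer SOM trained with the simplified rule (5): only the
    best-matching unit is moved and no neighbourhood function is used
    (an illustrative simplification, not the authors' implementation)."""
    rng = np.random.default_rng(seed)
    w = data[rng.choice(len(data), size=n_units, replace=False)].copy()  # initial weights
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            bmu = int(np.argmin(np.linalg.norm(w - x, axis=1)))  # best-matching unit
            w[bmu] += alpha * (x - w[bmu])                       # update rule (5)
    return w

def representatives(data, w):
    """Index of the data vector closest to each trained unit (duplicates are
    merged, so the result can be slightly smaller than the number of units)."""
    idx = {int(np.argmin(np.linalg.norm(data - wi, axis=1))) for wi in w}
    return np.array(sorted(idx))

# Example: training set = data[representatives(data, competitive_som(data, int(0.8 * len(data))))]
```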
The k-means algorithm
This algorithm [12] also belongs to the class of non-hierarchical clustering methods and always distributes the data over the specified number of clusters. Thus, when it was applied, 80% of the vectors were included in the training set, 10% in the test set, and the remaining 10% in the validation set.
As in the previous section, the increase in the entropy of the training set due to clustering was computed. As a result of the clustering, the entropy increased by 0.19 bits. The mean square error of training was 0.23909. The dependence of the clustering algorithm's running time on the size of the factor space is also shown in Table 3. As in the case of SOM, the volume of the factor space virtually does not affect the increase in entropy. The clustering time turned out to be smaller by an order of magnitude than with SOM; besides, a greater increase in entropy was observed compared to the self-organizing maps.
The results of training the neural network on the set generated using the k-means algorithm are shown in Fig. 3. It should be noted that, despite the greater increase in entropy, the results of neural network training were worse than those obtained for training on the data generated with the SOM: the value of the mean square error, as well as the difference between the errors of the validation/test and training sets, turned out to be greater than those obtained with the SOM algorithm. However, the quality of training improved compared with the case when no clustering was performed.
Summarizing the above, we should note that the value of the entropy increase should not be used as an indicator of the performance of the k-means algorithm. This is explained by the fact that the algorithm guarantees that the data is distributed over the specified number of clusters but cannot cope with the case when an object belongs to several clusters in equal measure or does not belong to any of them at all.
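A hedged sketch of the same sample-formation step with k-means is given below; scikit-learn's KMeans stands in for whatever implementation the authors used, and the choice of the vector closest to each centroid as the cluster representative is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_training_indices(data, frac=0.8, seed=0):
    """Cluster the factor space into frac*N clusters and take, from each
    cluster, the vector closest to the centroid as a training example."""
    k = int(frac * len(data))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
    idx = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        d = np.linalg.norm(data[members] - km.cluster_centers_[c], axis=1)
        idx.append(members[np.argmin(d)])
    return np.array(idx)

# The remaining vectors can then be split evenly into the validation and test sets.
```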
Fig. 4. Results for training the MLP neural network on the data generated with the c-means algorithm.
The class of fuzzy clustering algorithms is of the greatest interest from the standpoint of computing the increase in entropy. The results of such an investigation, carried out for the c-means algorithm, are given below.

The c-means algorithm

This algorithm [1] belongs to the class of fuzzy clustering algorithms, which is to say that it determines whether the data belongs to a cluster with a certain probability. In this case, the value of the probability p of a training example belonging to the cluster is expressed as

p = p_i^k / N,

where p_i^k is the probability that a vector i of the initial data falls into a cluster k. Probability values less than 0.1 are regarded as small and are not taken into account in the computations.
The increase in entropy was 0.36 bits, and the mean square error of training was equal to 0.1141. The difference in the mean square error between the training/validation and the test sets turned out to be significantly lower than in the case when clustering was not used. The running time of the algorithm, depending on the size of the factor space, is given in Table 3; it can be seen that the time increases considerably with increasing size of the factor space.
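Since the exact c-means settings are not reported (the authors mention MATLAB's FCM implementation), the following NumPy sketch is a minimal fuzzy c-means with an assumed fuzzifier m = 2; memberships below 0.1 are zeroed out, mirroring the threshold mentioned above.

```python
import numpy as np

def fuzzy_cmeans(data, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: returns cluster centres and the membership
    matrix u (shape: n_samples x c). Not the MATLAB FCM implementation."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(data), c))
    u /= u.sum(axis=1, keepdims=True)                    # memberships sum to 1 per sample
    for _ in range(iters):
        um = u ** m
        centres = um.T @ data / um.sum(axis=0)[:, None]  # weighted cluster centres
        d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        u = 1.0 / (d ** (2.0 / (m - 1.0)))               # standard membership update
        u /= u.sum(axis=1, keepdims=True)
    return centres, u

# centres, u = fuzzy_cmeans(X, c=int(0.8 * len(X)))
# u[u < 0.1] = 0.0   # memberships below 0.1 are disregarded, as in the text
```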
The results of training the MLP neural network are shown in Fig. 4. It should be noted that, despite the significant increase in entropy compared with, for example, the self-organizing Kohonen maps, the reduction in the mean square error of training was not as large and amounted to 0.11419. Additionally, the decrease in the mean square error was negligibly small in comparison with the increase in the running time of the algorithm on large factor spaces.

Algorithm based on the construction of a hierarchical cluster tree

This algorithm belongs to the hierarchical clustering class [8]. The Euclidean distance between the elements of the factor space is used as a metric, calculated by the following formula:

d(X^1, X^2) = ( ∑_{i=1}^{n} (X_i^1 − X_i^2)² )^{1/2},     (6)
where X^1, X^2 are elements of the factor space, with components X_i^1 ∈ X^1, X_i^2 ∈ X^2.
The increase in entropy was 0.1207 bits. As in the previous experiments, the difference in the MSE between the training and the test sets was smaller than in the case when clustering was not used. The running time of the algorithm, depending on the size of the factor space, is given in Table 3; it can be seen that the running time increases insignificantly with increasing size of the factor space.
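A sketch of this step with SciPy's hierarchical clustering is shown below; the Euclidean metric corresponds to formula (6), while average linkage and the medoid-as-representative rule are assumptions made for illustration, since the paper does not state them.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_training_indices(data, frac=0.8):
    """Build a hierarchical cluster tree with the Euclidean metric (formula (6)),
    cut it into frac*N flat clusters, and take one representative (the vector
    closest to the cluster mean) from each cluster for the training set."""
    k = int(frac * len(data))
    Z = linkage(data, method='average', metric='euclidean')   # cluster tree
    labels = fcluster(Z, t=k, criterion='maxclust')           # k flat clusters
    idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        centre = data[members].mean(axis=0)
        d = np.linalg.norm(data[members] - centre, axis=1)
        idx.append(members[np.argmin(d)])
    return np.array(idx)
```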
Fig. 5. Results for training the MLP neural network on the data generated with the hierarchical clustering algorithm; the MSE minimum for the validation set is shown by a circle.

Table 4
General comparison of experimental results.

Algorithm              Entropy increase, bits    Mean square error
SOM                    0.14                      0.11601
k-means                0.19                      0.23909
c-means                0.36                      0.11410
Hierarchical           0.12                      0.22595
Without clustering     –                         0.31462
The results of training the MLP neural network are shown in Fig. 5. In this case, the value of the entropy increase is the lowest among all the clustering algorithms. The mean square error of training is 0.22595, which, despite the low entropy increase, is less than for the case when the k-means algorithm was used, even though the latter yielded the highest entropy increase among the crisp methods. It should also be noted that the hierarchical method has a considerable advantage over all the other algorithms considered in terms of its running time (see Table 3).
The main results of the study for all clustering algorithms are summarized in Table 4.

Conclusion

The study carried out leads us to conclude that condition (1), formulated in the problem setting, is satisfied for all the clustering algorithms considered. In all experiments, we have observed an increase in entropy
and a decrease in the mean square error of the training set, as well as a decrease in the difference between the mean square errors of the validation/training and test sets, which indicates an improvement in the quality of training.
The best result in satisfying condition (2) was obtained with the c-means algorithm. However, it should be borne in mind that, for a substantially high dimension of the factor space, the gain in efficiency is significantly lower in comparison with the increase in clustering time.
In conclusion, let us note that, despite the positive effect of the clustering algorithms, the experiments have shown that the entropy of the training set is an important but not the determining factor in improving the training quality of a neural network such as the MLP.

References

[1] I.M. Neyskiy, Klassifikatsiya i sravneniye metodov klasterizatsii (Classification and comparison of clustering procedures), in: Intellektualnyye tekhnologii i sistemy, 8, 2006, pp. 130–142.
[2] Yu.S. Fedorenko, Yu.E. Gapanyuk, Klasterizatsiya dannykh na osnove samoorganizuyushchikhsya rastushchikh neyronnykh setey i markovskogo algoritma klasterizatsii (Data clustering based on self-organizing growing neural networks and on a Markovian clustering algorithm), Neurocomput.: Build. Appl. 4 (2016) 3–13.
[3] S.L. Podvalnyy, A.V. Plotnikov, A.M. Belyanin, Sravneniye algoritmov klasternogo analiza na sluchaynom nabore dannykh (Comparison of cluster analysis algorithms on a random data set), Vestnik VGTU 5 (2012).
[4] M.V. Khachumov, Zadacha klasterizatsii tekstovykh dokumentov (Clustering of textual documents), Inf. Tekh. Vych. Syst. 2 (2010) 42–49.
[5] E.V. Koryagin, Razrabotka modeli assotsiativnoy pamyati robota AR-600 dlya zadachi klasterizatsii i obobshcheniya dannykh (Model building of the associative memory of the AR-600 robot for clustering and data integration), in: Neuroinformatics-2015, Collected Papers, 2015, pp. 38–47.
[6] T. Kohonen, Samoorganizuyushchiyesya Karty (Self-organizing Maps), Binom. Laboratoriya znaniy, Moscow, 2008.
[7] A.A. Pastukhov, A.A. Prokofyev, Kohonen self-organizing map application to representative sample formation in the training of the multilayer perceptron, St. Petersburg Polytechnical State Univ. J. Phys. Math. 2 (242) (2016) 95–107.
[8] A.P. Kulaichev, Metody i Sredstva Kompleksnogo Analiza Dannykh (Methods and Resources of Integrated Data Analysis), Infra-M, Moscow, 2006.
[9] K. Shennon, Raboty Po Teorii Informatsii i Kibernetike (Studies in Information Theory and Cybernetics), Izd-vo inostr. lit-ry, Moscow, 1963.
[10] Y. LeCun, Efficient Learning and Second-Order Methods, MIT Press, MA, 1993.
[11] S. Khaykin, Neyronnyye Seti (Neural Networks), second ed., ID 'Williams', Moscow, 2008.
[12] J. Tu, R. Gonsales, Printsipy Raspoznavaniya Obrazov (Pattern Recognition Principles), Mir, Moscow, 1978, pp. 109–112.