Mining massive datasets by an unsupervised parallel clustering on a GRID: Novel algorithms and case study

Alberto Faro, Daniela Giordano, Francesco Maiorana

Dipartimento di Ingegneria Informatica e Telecomunicazioni, University of Catania, Italy

Future Generation Computer Systems 27 (2011) 711–724

Article history: Received 12 March 2010; Received in revised form 2 January 2011; Accepted 15 January 2011; Available online 26 January 2011.

Keywords: Neural networks; Clustering algorithm; GRID computing; Bioinformatics; Parallel algorithms

Abstract

This paper proposes three novel parallel clustering algorithms based on Kohonen's SOM, aiming at preserving the topology of the original dataset for a meaningful visualization of the results and for discovering associations between features of the dataset by topological operations over the clusters. In all these algorithms the data to be clustered are subdivided among the nodes of a GRID. In the first two algorithms each node executes an on-line SOM, whereas in the third algorithm the nodes execute a quasi-batch SOM called MANTRA. The algorithms differ in how the weights computed by the slave nodes are recombined by a master to launch the next epoch of the SOM in the nodes. A proof outline demonstrates the convergence of the proposed parallel SOMs and provides indications on how to select the learning rate to outperform both the sequential SOM and the parallel SOMs available in the literature. A case study dealing with bioinformatics is presented to illustrate that with our parallel SOM we may obtain meaningful clusters in massive data mining applications at a fraction of the time needed by the sequential SOM, and that the obtained classification supports fruitful knowledge extraction from massive datasets.

1. Introduction

The main concern of clustering algorithms is to group in a reasonable time large amounts of data into meaningful classes that can be disjoint, overlapping or organized in some hierarchical fashion [1]. A ''meaningful class'' is a class characterized by a maximum similarity among its items and a minimum similarity between its items and those belonging to other classes; the similarity metrics may vary with the application domain. Clustering analysis has been used in numerous fields ranging from machine learning, artificial intelligence and pattern recognition, to web mining, spatial database analysis, text mining and image segmentation [2]. Recently, clustering algorithms have been used extensively in life sciences such as genomics and proteomics (e.g., [3,4]). All the mentioned application fields have in common the exponentially increasing number of records to be analyzed, a high number of features for each record of the dataset, and a need for interactivity to allow the users to visualize and analyze the dataset from different perspectives. Currently, if one needs only to cluster a dataset efficiently, many scalable clustering techniques are available, such as BIRCH [5] or bEMADS [6], which are able to process massive datasets efficiently even on a normal PC. Alternatively, to save time, one could resort
to the parallelization of the clustering technique most suitable for the problem at hand. For example, if one is interested in clustering large high density data, a parallelization of the original DBSCAN [7] or its recent improved version [8] could be used. However, often the need is to cluster a massive dataset by preserving its original topology. This would allow an efficient visualization of the clusters and the extraction, by means of topological operations, of cluster properties which can be referred to the original dataset too. In this case a parallelization of either the Self Organizing feature Maps (SOM) by Kohonen et al. [9], Kaski et al. [10] and Oja et al. [11], or the Generative Topographic Mapping (GTM) by Bishop and Svensen [12] would be appropriate, as pointed out in the study carried out by [13] on topographic maps and data visualization, where a thorough comparison between the SOM, the GTM based methods and some variants of the K-Means is presented. Another method that is able to cluster a dataset by preserving its topology is the Growing Neural Gas (GNG), which is also able to match the temporal distribution of the original dataset [14], but it is too computationally demanding. Although the GTM and its extensions, as well as the extended K-Means, show a very low topographic error,1 the SOM is significantly faster [15]; thus it is a suitable target of parallelization efforts for dealing with massive datasets.

1 A low topological error, guaranteed by the SOM, implies that close clusters contain close data and that the second cluster in order of importance to which a data item also belongs is adjacent to the first cluster.


The Parallel SOMs (PSOMs) that have been proposed in the literature fall into two classes: (1) parallelizations of the original SOM, known as on-line SOM, mainly implemented on ad-hoc hardware (e.g., [16]), and (2) parallelizations of an approximation of the original SOM, i.e., the so called batch SOM, implemented on loosely coupled distributed computing systems, which preserves the original dataset topology less well [17]. The paper presents a novel approach to the efficient parallelization of the on-line SOM on a loosely coupled distributed computing system. This approach can be used not only to improve the on-line SOM processing time, but also to improve the performance of other SOM based techniques, namely: (1) the ESOM, an Extended version of the SOM [18,19], to better preserve the topology of the original dataset and to solve complex optimization problems; (2) the Growing SOM (GSOM), widely studied in this decade (e.g., [20–22]), to achieve the same topological accuracy as GTM methods; and (3) the Relative Density SOM (ReDSOM), proposed recently by [23] to model high density datasets with variable topology, which may achieve the same performance as the GNG. The proposed PSOMs will be compared with the original SOM, but not with the techniques that do not preserve the dataset's topology (e.g., DBSCAN), nor with techniques that preserve topology but are much slower than the SOM (e.g., GTM). Moreover, we will show that the moderate differences between the classifications obtained by the original SOM and our PSOM have little influence on the associations between features that one can derive by topological operations from the obtained clusters. This makes the method particularly suitable for data mining purposes. In particular, the SOM algorithm to be parallelized is briefly described in Section 2. Section 3 reviews the literature on parallel SOMs to point out the limits of the existing algorithms. A novel parallel clustering method that overcomes these limits is proposed with three variants in Section 4, where its suitability for parallelizing other SOM based techniques characterized by a very low topographic error (i.e., GSOM and ReDSOM) is also pointed out. In all these variants the dataset is partitioned among the nodes of a GRID, or any loosely coupled computing system. Section 5 outlines the proof of the convergence of the proposed clustering method and gives some measures of its topological properties. The time performance is analyzed in Section 6, to demonstrate that all the three variants are scalable and converge with a precision of about 85%–90% to the same clusters obtainable by the sequential SOM. Also, Section 6 shows a knowledge discovery methodology based on the proposed PSOMs that is able to discover about 85% of the associations one can extract from the same methodology based on the sequential SOM. Finally, in Section 7, we first describe how the GRID environment is used to execute the proposed SOMs, then a bioinformatics case study is discussed to demonstrate the effectiveness of the PSOMs in discovering feature associations in a realistic case.

2. Kohonen SOM and cluster analysis

A SOM consists of a neural net with two layers: an input layer which receives the values of input vectors (one vector at a time) and an output layer, consisting of nout neurons, used to identify the class to which the current input vector belongs. These two layers are connected by a synaptic weight matrix, usually denoted by W.
The class of an input vector X, with components Xi (for i = 1 to nin), belonging to a set X is given by the output neuron most activated by it, i.e.,

Maxj [Yj = Σi Wji Xi],  for j = 1 to nout.    (1)

The SOM algorithm establishes how to train the network to find the matrix W that allows us to classify the input vectors preserving their topological properties, i.e., close input vectors belong to the same class or to close classes.

Fig. 1. Typical mono-dimensional layout of a SOM with two output neurons and whose input vectors Xi have four components.

In principle, we may have both mono- and bi-dimensional SOMs, depending on whether the output neurons are disposed along one or two dimensions. The paper deals with the parallel clustering obtained by using a mono-dimensional SOM, since it has been theoretically proven that such a network converges, under certain conditions, towards a solution Wfinal starting from a random synaptic weight matrix Winitial [9]. Fig. 1 shows the mono-dimensional SOM used in the paper, whose input nodes may refer either to the dataset features or to the similarities between the dataset items. In the former case we say that we operate on the feature space, whereas in the latter case on the similarity space. Let us recall that if one operates on the feature space, the value of the input node i, i.e., Xi in Fig. 1, refers to the ith feature of the records of the dataset, whereas if one operates on the similarity space, this value refers to the similarity between the current item and the ith one. Advantages and disadvantages of operating on the feature or similarity space are pointed out in [21]. To fully appreciate the differences among the PSOMs proposed in the literature and the basic features of the one proposed in the paper, let us recall that the original SOM, also called on-line SOM, executes for each input the following two main steps:

(a) On-line winning selection: determination of the winning output neuron, i.e., the output neuron characterized by the minimum distance of its synaptic weights from the current input vector X, i.e.:

Minj [Σi (Xi − Wji)²]^0.5,  for j = 1 to N    (2)

where N is the total number of output neurons. Let us note that the winning neuron can also be computed as the one that shows the maximum output value Yj in response to the input vector according to Eq. (1).

(b) On-line weight updating: updating the weights of the winning neuron w and of the output neurons j that are in a user defined neighborhood of w, as follows:

∆Wji = −η h(j) Yj (Xi − Wji).    (3)

Usually η, called the learning rate, ranges between 0 and 1, whereas h(j) is equal to 1 if j refers to the winning neuron w; otherwise it is inversely proportional to the distance between neuron j and the winning one. The algorithm terminates when the update of the weights falls below a suitable threshold or when it has executed the number of epochs chosen by the user, where by epoch we mean the weight updating due to the processing of all the input vectors.
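To make steps (a) and (b) concrete, the following minimal Python sketch performs one on-line epoch over a block of data. It is only an illustration: the conventional update ∆Wji = η h(j)(Xi − Wji) is used in place of the exact form of Eq. (3), the neighborhood function is an illustrative choice, and all names are hypothetical.

    import numpy as np

    def online_som_epoch(X, W, eta=0.5, radius=1):
        # X: (n_items, n_in) input vectors; W: (n_out, n_in) synaptic weights
        n_out = W.shape[0]
        for x in X:
            # (a) on-line winning selection: output neuron at minimum distance, Eq. (2)
            winner = np.argmin(np.linalg.norm(x - W, axis=1))
            # (b) on-line weight update of the winner and of its neighborhood
            for j in range(n_out):
                d = abs(j - winner)                         # distance on the output layout
                if d <= radius:
                    h = 1.0 if d == 0 else 1.0 / (1.0 + d)  # neighborhood factor h(j)
                    W[j] += eta * h * (x - W[j])            # conventional form of Eq. (3)
        return W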

Since the on-line SOM may become very slow as the input space grows, faster versions have been proposed: a batch version, which performs a batch winner selection and a batch weight update [24], and a quasi-batch version, called MANTRA, which performs a batch winner selection and an on-line weight update [25]. A batch winner selection is obtained by computing the winning neuron for each input using the weights of the previous epoch (not the ones computed in the current one). Analogously, a batch weight update is computed at the end of the current epoch as the sum of the weight updates computed for each input with respect to the weights of the previous epoch. As said above, the batch SOM saves processing time but does not satisfy the topology preservation requirement (e.g., [17]). On the contrary, MANTRA saves processing time with respect to the on-line SOM and its clustering is a satisfactory approximation of the one obtainable by the on-line algorithm [25].

3. Parallel SOMs

The algorithms available in the literature to parallelize the SOM can be grouped into two main categories: (a) network partitioning and (b) data partitioning methods.

The network partitioning approach, proposed in [26], subdivides the computational effort among different Computing Elements (CEs), by assigning to each CE a portion of the output nodes. Fig. 2(a) shows the case in which each CE deals with only one output node. In this case, for any input, each CEj receives all the input components, but it computes the distance Dj only for the output node assigned to it, i.e., Dj = [Σi (Xi − Wji)²]^0.5. Each CE then broadcasts the computed distance to all the others, thus allowing every CE to compute in parallel and redundantly the winning node and to update the weights only for the assigned network portion. In this approach, at each data presentation, every CE has to broadcast to all the other CEs the distances needed to find the winning node according to Eq. (2) and to update the weights according to Eq. (3), as shown in Fig. 2(b). Thus, if the input nodes are nin and the nout output nodes are divided among NP processors, each CE is in charge of q = nout/NP output nodes, computes nin · nout/NP distances and receives (NP − 1) · nin · nout/NP distances. If q = 1 this method does not introduce any approximation compared to the original on-line algorithm, but its high communication overhead limits the parallel scalability, especially when applied to a highly multidimensional input space, as in the case of biological datasets. Alternatively, one could assign to each CE a mini network consisting of q > 1 output nodes, thus decreasing the communication overhead. However, this solution, denoted in the paper as q-NP PSOM, leads to a clustering similar to the one obtainable by parallelizing the batch SOM [25], which preserves the dataset's topology less well.

Fig. 2. Network Partitioning: each Computing Element (CE) manages a mini network consisting of only one output node (a), and broadcasts to all the others (b) the distances needed to find the winning node according to Eq. (2) and to update the weights according to Eq. (3).

The data partitioning approach [27,25] assigns to each node a portion nd/NP of the input data, where nd is the overall number of items of the dataset. This solution is scalable, since the parallel granularity is determined only by the number of objects to classify, which can be very high. In this case, each node computes the winning neuron for each input by using the weights of the previous epoch, i.e., following a batch approach for winning neuron selection. After processing its input vectors, each node sends to all the others the weight updates for the winning neurons, so that all the nodes can compute the new weights. The synaptic weights between the input neurons and the output neurons are computed at the end of the epoch by adding the weight update due to each input at different times. In particular, the weight update is given by the following Eq. (4):

wk(tf) = wk(t0) + η(t0) Σ_{t′=t0 to tf} hwk(t′)(x(t′) − wk(t0))    (4)

where t0 and tf are the start and end times of the considered epoch, η is the learning rate and hwk is the function that establishes the degree of activation of the neurons that are in a user defined neighborhood of the winning neuron k. Alternatively, the weight update could be done on-line, thus implementing a parallel MANTRA. The data partitioning approach requires less communication overhead than the network partitioning approach, since the broadcast primitive is performed by the slaves nd/NP times per iteration, instead of the nd times per iteration needed by the network partitioning. However, this better time performance is obtained at the expense of the clustering precision, i.e., with a higher topological error. In fact, the winning neuron selection in each node does not take into account the weight update due to the inputs residing on the other nodes. Moreover, the parallel batch SOM is less accurate than the MANTRA, since the weight update is delayed to the end of the epoch, i.e., it is done for each input on the basis of the difference between the input and the weights of the previous epoch. Consequently, as pointed out in [17], a batch PSOM leads to a clustering that may present differences compared to the on-line sequential solution. To improve the batch SOM, a k-batch PSOM has been proposed which better preserves the dataset's topology [28]. In this case, the dataset is divided into k slices, and each slice is clustered by a batch SOM which uses the synaptic weights computed by the clustering of the (k − 1)th slice. However, the parallelization of the batch PSOM related to each slice suffers from a high communication overhead. In addition, topology preservation is satisfied only for k ≤ 5 [28]. Thus both the batch and k-batch approaches are not taken into account in the paper. Alternatively, the nodes may process the inputs executing the MANTRA algorithm instead of the batch SOM, i.e., the winning selection is batch, but the weight update is done on-line. This approach preserves topology better than the batch SOM [25], even if the communication overhead is still high. Thus the parallelization of the MANTRA version, as the one proposed in the following, may still make sense if another data partitioning scheme which decreases the communication overhead significantly is chosen. Besides the data partitioning methods above, which subdivide the dataset ''horizontally'', other methods have been proposed which subdivide the dataset ''vertically'', i.e., which send to each node a vertical slice of the feature matrix, e.g., [29]. Such methods show a satisfactory accuracy, but their speed-up decreases sharply as the number of nodes increases. Consequently, these approaches are also not considered in the paper. From the analysis of the existing parallel SOMs we derive that: (a) the network partitioning schemes are more suitable for a multiprocessor environment where the communication overhead may be greatly reduced thanks to the high speed of the common bus shared by the processors, as shown in [16,30,31];


Fig. 3. Computational flow of the WR PSOM. Matrices Wf,k(i) are computed by the slaves from matrix W(i). The computation starts from matrix W(1), randomly generated by the master; for i ≠ 1, W(i + 1) = f(Wf,1(i), . . . , Wf,NP(i)).

Fig. 4. Flow diagram of the WR PSOM.

(b) the data partitioning schemes become appropriate when the parallel SOM is executed in loosely coupled systems, as shown in [32] and in [33], although the proposed implementations exploit only 60%–80% of the overall speed-up, due to the increase of the communication overhead as the number of CEs increases [34].

Thus, especially with the advent of the GRID, new PSOMs are envisaged to exploit the overall speed-up offered by the GRID by greatly reducing not only the communication overhead but also the number of operations (and related execution time) carried out by each node. With this aim in mind, in the next sections we propose and evaluate three novel PSOMs based on different ways to recombine the neural network weights: two are based on both on-line winning selection and weight update, while the third is the parallelization of the MANTRA, characterized by a low communication overhead.

4. A novel parallel clustering method: the Weight Recombining (WR) PSOM

In order to parallelize Kohonen's algorithm we propose an approach based on data partitioning and Weight Recombining (WR), to be executed in a master–slave computing environment, that greatly decreases the communication overhead. Fig. 3 shows the architecture of the master and the computation flow of the WR PSOM. In particular, the input dataset is subdivided by the dividing unit into NP blocks to be delivered to the NP slaves. Each slave has the task of clustering the received block of data. Each block consists of nd/NP input data. Let us note that the data of each block can be chosen in several ways, e.g., randomly or by dividing the dataset sequentially. The best way to assign the data to each node (e.g., depending on how many fields are different from zero) is left for further study. After executing some epochs of the original Kohonen sequential SOM over the proper block, starting from the weight matrix received from the master, each jth slave sends the final weight matrix to the master, which recombines these matrices, by using the recombining unit, into a new matrix that will be sent to each slave for another iteration of the computation. At the end of the iterations, a final iteration starting from the last weight matrix is done by the master (or by the slaves in parallel) to compute the class of each data item. The structure of the algorithm is summarized in Fig. 4. According to the flow diagram of Fig. 4, the inputs of each slave to be supplied by the master are: (a) a slice of the similarity matrix composed of nd/NP rows and nd columns; and (b) the initial weight matrix composed of nd rows and nc columns. In order to compare the results between the sequential on-line SOM and the
parallel one, we always start from the same randomly selected weight matrix for both the sequential and parallel versions.

We have designed and tested three variants of the WR PSOM. The three algorithms follow the same flow diagram shown in Figs. 3 and 4, but differ in the way the master recombines the weights received from the slaves.

Algorithm N.1 is the on-line WR WM (Weight Mean) PSOM, where the slaves execute an on-line SOM and the master recombines the weight matrices to obtain the next weight matrix by averaging all the weights received from the slaves as follows:

W(i + 1) = [V1(i + 1) . . . Vnc(i + 1)],  Vn(i + 1) = Σ_{k=1 to NP} Vfkn(i)/NP,  for n = 1 to nc    (5)

where Vn(i + 1) is the nth column of matrix W(i + 1), and Vfkn(i) is the nth column of the matrix Wf,k(i) obtained by node k after processing W(i).

Algorithm N.2 is the on-line WR WWM (Weighted Weight Mean) PSOM, where the slaves execute an on-line SOM and the master computes the next weight matrix as the weighted average of all the weights received from the slaves as follows:

Vn(i + 1) = Σ_{k=1 to NP} Nkn(i) Vfkn(i) / (Σ_{k=1 to NP} Nkn(i)),  for n = 1 to nc    (6)

where Nkn(i) is the number of objects pertaining to slave k which belong to class n at the ith iteration.

Algorithm N.3 is the MANTRA WR WM (Weight Mean) PSOM, where the slave nodes execute the MANTRA algorithm and the master recombines the weight matrices to obtain the next weight matrix by averaging all the weights received from the slaves according to Eq. (5).

A discussion on when the parallel MANTRA could be preferable to the other two parallel SOMs is given in Section 6, where we also show how to choose the number of slaves, the number of iterations of the parallel SOM, and the number of epochs of the sequential SOM executed by the slaves. Fig. 5 clearly shows that in all the above algorithms the communication overhead is very low with respect to that of the PSOMs discussed in the previous section, since there is no communication between the slaves but only between master and slaves. Moreover, each slave performs nd · nc operations per epoch to update the weights, an order of magnitude less than the operations carried out by the nodes of the existing parallel SOMs, described in the previous section, to compute the weights of the next epoch.

Fig. 5. Communication scheme between the master and the CEs that execute the proposed WR PSOM.

A proof outline of the convergence of our parallel algorithms is given in the next section by taking advantage of the theorems described in [9] and in [25].
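As a concrete illustration of the master's recombination step, the following Python sketch implements Eqs. (5) and (6); the data layout (one weight matrix per slave, one per-class object count vector per slave) and all names are assumptions made for the example.

    import numpy as np

    def recombine_wm(slave_weights):
        # Eq. (5): plain average of the weight matrices returned by the NP slaves
        return np.mean(np.stack(slave_weights), axis=0)

    def recombine_wwm(slave_weights, slave_counts):
        # Eq. (6): per-class weighted average, the weights being the numbers Nkn(i)
        # of objects that slave k assigned to class n at the current iteration
        W = np.stack(slave_weights)            # shape (NP, n_rows, nc)
        N = np.asarray(slave_counts, float)    # shape (NP, nc)
        num = np.einsum('kn,krn->rn', N, W)    # sum_k Nkn * Vfkn
        den = np.maximum(N.sum(axis=0), 1.0)   # sum_k Nkn (guard against empty classes)
        return num / den

In Algorithm N.3 the same plain-average step would be applied to the matrices produced by the MANTRA epochs.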


Before concluding this section, it is useful to outline how the proposed WR PSOMs may be used to parallelize relevant extensions of the original SOM, such as the GSOM and the ReDSOM.

Concerning the GSOM, all the proposed techniques start from a set of output neurons (called prototypes), whose coordinates on the physical mono- or bi-dimensional layout roughly fit the dataset distribution projected on the layout. Then, the prototypes are moved (i.e., by adjusting their synaptic weights) to better approximate the dataset, and in case their influence area is too large (i.e., the subset of the dataset pertaining to them is too spread in space), another prototype is added. Conversely, if two prototypes are too close, they are substituted by a prototype located between them. The master of the WR PSOMs can find the initial prototype configuration if, during the weight updates, it also adds/deletes output neurons in the current neural network depending on the density of the data around each output neuron, e.g., following the rule proposed in [21]. To this aim, each slave k should send to the master the mean distance Dkn(i) from the winning neuron n of the data items residing on the slave which belong to class n at the ith iteration. After the initialization phase, the master could adopt a fine tuning rule to adjust the output neurons' weights (or to increase/delete the output neurons) following the formulas of the MIGSOM [22], thus obtaining a clustering whose topographic error is much lower than the one obtained by using the original SOM.

The parallel management of time changing clusters could be achieved by extending the ReDSOM approach, but it is more complicated. Indeed, by exploiting the WR PSOM one can quickly cluster the dataset at times t1, t2, . . . , tn, but further sophisticated computations should be done by the master, according to the ReDSOM formulas, to identify how each cluster evolves in time, and the appearance (disappearance) of new (old) clusters. A complete study on how to parallelize the MIGSOM and ReDSOM is left for future work.

5. Convergence and topological properties of the WR PSOM

Let Kk, for k = 1 to NP, be a set of NP mono-dimensional SOM networks having the same number of nf or nd inputs and nc output neurons, each dedicated to classify a dataset Dk into nc classes. The main aim of this section is to demonstrate that the weight matrix computed by the master using the parallel algorithms proposed in the previous section converges towards a weight matrix W = [V1 . . . VNP] very close to the weight matrix WS = [VS1 . . . VSNP] computed by an on-line sequential SOM to classify the dataset D obtained by joining the blocks D1, . . . , DNP. In addition, this section will present some heuristics to satisfy the assumptions of the proof outline, thus ensuring convergence of the WR PSOM, and the approach used to characterize its topological properties.

5.1. Proof outline

The computation of the matrix WS by a sequential SOM could be implemented by the mentioned set of NP neural networks by initializing the K1 network with a random weight matrix, and by starting, for the first epoch, the computation of the Kk+1 network with the weight matrix computed by the Kk network. In the subsequent epochs, we should proceed in the same manner, with the difference that the K1 network will start with the weight matrix computed by the KNP network. This implementation has the advantage of requiring less powerful computers with respect to the one needed by a SOM implementation carried out on a single computer, but the time to execute the entire parallel algorithm is the same as the sequential one, since the nodes work one at a time. For this reason, in our parallel SOMs all the Kk neural nets start with the same random matrix. Then, they execute simultaneously one epoch and, instead of starting the next epoch with the final value of the net Kk−1, they start with the weight matrix W(i + 1) given by Eqs. (5) or (6). By selecting a suitably small value for the learning rate η of the sequential SOM executed by the slaves, it is possible to maintain a small distance between W(i) and WS(i) at any ith iteration, i.e., between each vector pair Vn(i) and VSn(i), in all the three algorithms. To explain this, let us consider a dataset consisting of records with two features, subdivided into NP blocks to be classified into nc classes by using nc nodes. In this case, Vn(i) can be represented by a point in a two dimensional setting, as in Fig. 6, where its coordinates, i.e., vn1(i) and vn2(i), are, respectively, the weights between the nth output neuron and the first and the second input neuron. According to the theorems described in [9] and in [25], each kth network, starting from the vector Vn(i), will converge towards a value Vkn(∞, i) which depends on Vn(i), whether the slaves execute the on-line SOM or the MANTRA. Thus, if one chooses a suitably small value of the learning rate, let us say ηp, the final value V1n(i + 1) will be on the path from Vn(i) to V1n(∞, i). Since the distance |V1n(i) − V1n(i + 1)| is very small, the value V2n(i + 1) will also be on the path between V1n(i + 1) and V2n(∞, i), and so on. This means that the next value of the weight matrix of the sequential SOM starting from Vn(i) can be expressed as follows:

WS(i + 1) = [VS1(i + 1) . . . VSNP(i + 1)], with VSn(i + 1) = Σ_{k=1 to NP} V′kn(i + 1)    (7)

where V′1n(i + 1) is obtained by executing one epoch of the first K-net starting from Vn(i), and V′kn(i + 1) for k ≠ 1 is obtained by executing one epoch of the kth K-net starting from V′(k−1)n(i + 1). Eq. (7) allows us to establish that:

Vn(i + 1) ≈ VSn(i + 1)/NP.    (8)

In fact, recalling that in Algorithms N.1 and N.3 the weight recombination follows Eq. (5), and taking into account the diagram drawn in Fig. 6, it is easy to understand that:

Vn(i + 1) = (Σ_{k=1 to NP} Vfkn(i))/NP ≈ Σ_{k=1 to NP} (V′k,n(i + 1)/NP)    (9)

and, consequently, that Eq. (8) is valid for both these algorithms, being VSn(i + 1) = Σ_{k=1 to NP} V′k,n(i + 1). However, since it is reasonable to assume that the Vn(i + 1) obtained by Eq. (5) is very close to the one obtainable from Eq. (6) starting from the same Vn(i), we can conclude that Eq. (8) is valid also for Algorithm N.2.


Fig. 6. Weights between the nth output neuron and the two input neurons (i.e., vn1 and vn2), assuming NP = 3.

Fig. 7. Distance |W(i + 1) − W(i)| between two subsequent weight matrices at the increase of the epochs for the WR PSOM and its sequential analogue.

Let us note that the final clustering of the SOMs carried out by any node starting from Vn(i) may differ from the one obtained starting from Vn(i + 1). Thus, the final weights between the input neurons and the nth output neuron may differ from Vn(∞, i), which is the average of V1,n(∞, i), V2,n(∞, i) and V3,n(∞, i) at the ith iteration. Consequently, the final weights have to be computed by a step by step procedure.

5.2. Heuristics

At this point, we have to point out that in the proof outline we have implicitly assumed that, starting from Vn(i), the output neurons remain ordered in the same way in all the nodes, i.e., that the homologous output neurons of all the nodes refer to the same classes. Since this is true with probability one only after a certain number of epochs [9], the parallel algorithms could behave differently from the sequential one, especially for the output neurons which are very close in the synaptic weight hyperspace. Therefore, suitable heuristics should be adopted to avoid this problem. For example, one could distribute among the slaves a smaller dataset Dϑ containing data belonging to all the classes and whose distance from the relevant neurons is less than a threshold ϑ, thus reinforcing the probability that homologous neurons in the slaves refer to the same classes. Another expedient is that of starting the parallel SOMs after executing a few epochs of the sequential SOM, to approach the weight configuration in which the output neurons remain ordered in the same way. Moreover, the dataset should be subdivided among the slaves in such a way that each slave contains data of all classes (a simple stratified split is sketched below). In Section 6 we show that by proper heuristics one can satisfy the hypotheses of the above theorem, and that for any class j of the sequential SOM there is a class k of the WR PSOM which contains almost all its data, and vice versa.
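A minimal sketch of the last heuristic (each slave receives items of all classes); the preliminary labels used to stratify the split are assumed to come from a coarse pre-clustering and are purely illustrative.

    from collections import defaultdict

    def stratified_blocks(item_ids, labels, n_slaves):
        # Group items by their (preliminary) class, then deal them out round-robin
        # so that every slave block contains items of every class.
        by_class = defaultdict(list)
        for item, lab in zip(item_ids, labels):
            by_class[lab].append(item)
        blocks = [[] for _ in range(n_slaves)]
        for items in by_class.values():
            for i, item in enumerate(items):
                blocks[i % n_slaves].append(item)
        return blocks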

Fig. 8. Distance |WS (i + 1) − WS (i)| between two subsequent weight matrices at the increase of the epochs computed by a sequential SOM for different learning rates.

Since the update after one epoch of the weight vector VSn(i) related to the nth output neuron is given by a rotation proportional to ηs towards those input vectors for which it is the winning neuron, the parallel SOMs will evolve towards the final solution with a speed very close to that of a sequential SOM having ηs = ηp/NP. Fig. 7 confirms experimentally that the time needed by the parallel SOM carried out by NP nodes with ηp = 0.5 to cluster a dataset of 64 rows and 22 columns, dealing with the amount of goods that 8 countries sold from 1991 to 1998 in 22 world areas [35], is roughly equal to the one spent by a sequential SOM with ηs = 0.5/NP. Indeed, Fig. 7 shows that the decrease of the distance |W(i + 1) − W(i)| between two subsequent weight matrices of the parallel Algorithm N.1 with ηp = 0.5 and NP = 8 coincides, as the epochs increase, with that of the sequential algorithm evolving with ηs = 0.5/8 = 0.0625. (Here |W(i + 1) − W(i)| is computed as Σ_{r=1 to nd} Σ_{s=1 to nc} (wrs(i + 1) − wrs(i))²/(nd · nc).) Similar results were obtained with Algorithms N.2 and N.3. Consequently, although the convergence time of a sequential SOM is not inversely proportional to ηs (see Fig. 8), the WR PSOMs certainly reach the final clustering in more epochs than those needed by the sequential SOM having ηs = ηp. A further heuristic is then needed to eliminate this ''slowness'', so that the parallel SOM may reach its final clustering in a time TTp = TTs/NP, where TTs is the time spent by its sequential analogue to reach the final clustering. For this reason, we should satisfy the following equation:

Np Ts/(NP · E) + Np Tc = Ns Ts/NP    (10)

where Np and Ns are the number of iterations needed, respectively, by the parallel and the sequential SOM to reach the final clustering; Tc is the time needed by the slaves to send the weights to the master plus the time needed by the master to recombine the weights and resend them to the slaves; Ts is the time for executing one epoch of the sequential SOM; and Tp = Ts/(NP · E) is the time for executing one epoch of the parallel SOM, E being the efficiency of the parallel algorithm. If we assume in Eq. (10) that E ≈ 1, according to the results shown in Section 6.3, and neglect the term Np Tc, since Tc ≪ Tp due to the high speed of the LAN connecting the GRID nodes, a way to achieve TTp = TTs/NP (i.e., to achieve a full speed-up) is to set ηp so that Np = Ns. Since Ns mainly depends on the first phase (also known as the ''weight ordering phase'') of a sequential SOM, and our parallel SOMs evolve as a sequential SOM having ηs = ηp/NP,
this may be obtained by imposing that ηp decreases from NP · ηs to ηs during the first epochs (usually three) related to the weight ordering phase. Fig. 9 depicts a possible learning rate decay according to this constraint, whereas Fig. 10 shows that, by adopting this learning rate decay, the number of iterations needed to reach the final clustering of the mentioned dataset is the same for NP = 1, 3 and 5. This result points out that the proposed WR PSOMs may achieve a full speed-up.

Fig. 9. Suggested learning rate decay for the WR SOMs.

Fig. 10. Distance between two subsequent weight matrices at the increase of the epochs computed for the PSOMs with NP = 3 and 5 and for their sequential analogue (i.e., NP = 1).
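One possible realization of the decay of Fig. 9 is sketched below; since the exact shape of the curve is not reproducible here, the linear decrease over the first three (weight ordering) epochs is an assumption.

    def parallel_learning_rate(epoch, eta_s, NP, ordering_epochs=3):
        # Decrease eta_p from NP*eta_s to eta_s during the weight ordering phase,
        # then keep it constant at eta_s (epoch is 0-based).
        if epoch >= ordering_epochs:
            return eta_s
        return NP * eta_s - (NP - 1) * eta_s * (epoch / ordering_epochs)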
5.3. Topological properties

The convergence of the WR PSOMs, possibly provided with a satisfactory convergence speed, is the first requirement to satisfy. However, we also have to verify that the clustering of the WR PSOM shows:

(a) a low topological error Et, where Et is given by:

Et = Σ_{k=1,N} u(k)/nd    (11)

where nd is the total number of items, and u(k) is 1 if the first and second best matching output neurons of the kth item are not next to each other, and 0 otherwise. Let us note that in the paper we verify that Et is ''low'' indirectly, i.e., by verifying that the final clustering of the WR PSOMs approximates the one achievable by the sequential SOM;

(b) a low quantization error, i.e., a low mean distance between the items and their best matching output neurons. Also in this case we assume that the quantization error of the clustering of a dataset D obtained by the WR PSOMs is ''low'' if it approaches the one of the clustering of D obtained by the sequential SOM.

Concerning the topological error, an important difference between our clustering method and the original SOM is that the topology of our clustering does not necessarily refer to a physical mono- or bi-dimensional layout of the output neurons, but to the n-dimensional space identified by the dimensions of the input vectors, where the coordinates of each item are given by the components of its descriptive vector and the coordinates of each
output neuron are given by its synaptic weights. In fact, function h of Eq. (3) is slightly different from the one proposed in the SOM, since it aims at increasing the weights of the neighborhood of the winning neurons in the n-dimensional space. As a consequence, the neuron n2c identifying the second cluster to which a vector v also belongs is, by construction, very close to the neuron n1c identifying the first cluster of v in the above mentioned hyperspace, but this does not ensure that n2c is necessarily located in the neighborhood of n1c in the mono- or bi-dimensional layout. In this way we may avoid, when needed, the simplification entailed by reducing the n-dimensional space to a mono- or bi-dimensional layout. Indeed, visualizing the dataset in a mono- or bi-dimensional layout is not always a worthwhile goal, since one could be interested in visualizing zones of the above hyperspace to better explore the relationships between neighboring clusters and related items. This is especially useful for mining massive datasets from different perspectives and for discovering hidden relationships between clusters, beyond the ones highlighted by mono- or bi-dimensional layouts. A suitable way to visualize the data clustered in an n-dimensional setting is by a hyperspace whose axes refer to the clusters and where each item is represented by the coordinates associated to its membership degrees to the clusters [36]. In any case, if dimension reduction is desirable, it is always possible to force function h in Eq. (3) so as to have the WR PSOMs preserve the topology of the physical layout of the output neurons. Three notions will be adopted to evaluate the effectiveness of the parallel SOM from the mentioned different points of view, i.e.:

• Weak Precision (WP): measures the percentage of items classified into a cluster by the sequential SOM that remain clustered together in the classification obtained by the parallel SOM.
• Strong Precision (SP): measures the percentage of items of each cluster of the WR PSOM that are also present in the best matching cluster of the classification of the sequential SOM. Let us note that each WR PSOM cluster is matched at best to only one cluster of the sequential SOM, i.e., the one with which it has the maximum intersection.
• Distortion [25]: measures the mean distance between the inputs and the relevant winning neuron.

According to the above definitions, if WP ≈ 100%, each cluster obtained by the sequential SOM may be tracked within one of the clusters obtained by the WR PSOM, whereas SP ≈ 100% implies that both the sequential and parallel SOMs produce the same clusters and therefore have comparable topological errors. To have a first evaluation of the WR PSOMs, we computed WP, SP, and the distortion of the clustering Cp of the mentioned dataset obtained by Algorithm N.1 with respect to the clustering Cs obtained by its sequential analogue. In particular, Cs has been computed by a sequential SOM with ηs = 0.5, whereas Cp has been obtained by a parallel SOM carried out by 3 nodes with ηp following the decay law drawn in Fig. 9.

Table 1. Precision and distortion of the WR WM PSOM executed by 3 nodes to cluster the dataset documented in [35] into 5 clusters.

              Weak precision (WP)   Strong precision (SP)   Cp distortion/Cs distortion
3 slaves      0.90                  0.90                    0.98

Table 1 shows that there is an overlap of about 90% between the clusterings Cp and Cs, and that the distortion of Cp is about the same as that of Cs. This is an important result, since it shows that by the WR SOMs we may obtain an accurate classification by fully exploiting the available distributed processing resources.
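The Weak and Strong Precision defined above can be computed directly from the two label assignments; the greedy best-match below is one straightforward reading of the definitions (cluster labels are assumed to be non-negative integers).

    import numpy as np

    def weak_strong_precision(seq_labels, par_labels):
        seq, par = np.asarray(seq_labels), np.asarray(par_labels)
        wp_hits = sp_hits = 0
        # WP: items of each sequential cluster that stay together in the parallel clustering
        for c in np.unique(seq):
            wp_hits += np.bincount(par[seq == c]).max()
        # SP: items of each parallel cluster found in its best matching sequential cluster
        for c in np.unique(par):
            sp_hits += np.bincount(seq[par == c]).max()
        return wp_hits / len(seq), sp_hits / len(seq)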


A deeper discussion on the topological properties of the WR PSOMs in terms of topological accuracy and time performance for a fruitful feature analysis is given in the next sections, where we take into account larger datasets. This will allow us to evaluate the performance at the increase of the number of slaves in realistic cases.

6. Performance evaluation

The parallel algorithms presented in the previous section were conceived for use in a GRID infrastructure. Their performance has been evaluated first by simulating the GRID on a sequential machine before testing them in practice. In particular, this section first points out the precision achievable by the WR PSOMs according to the above definitions of precision and distortion; then the processing time needed to carry out one epoch and the number of epochs needed to reach the final clustering by both the sequential and the WR PSOMs are compared. In addition, a comparison with the other PSOMs reviewed in Section 3 is performed. Let us note also that the clustering accuracy might not be so critical in data mining problems where the main aim is to discover the most relevant associations between the features of the items. Thus, a detailed analysis is also carried out in the next sections to investigate whether the proposed WR PSOMs allow us to discover the same relevant feature associations we may discover by using a knowledge discovery methodology based on the sequential SOM.

6.1. Similarity matrix

Three different metrics to build the similarity matrix have been tested in two realistic experiments to evaluate how they influence the topological error. The first, called Mean Closeness (MC) based similarity, was adopted to cluster the dataset described in [35], whose feature matrix is characterized by many fields different from zero. The MC similarity between two records vr and vs depends on the average of the similarity ratios di between the values of the corresponding fields, i.e.:

MC = Σ_{i=1 to nf} di/nf    (12)

where di = min(vr(i), vs(i))/max(vr(i), vs(i)) if max(vr(i), vs(i)) ≠ 0, otherwise di = 1. The second and the third metrics are feature based similarities. They are denoted in the following as F1 and F2 similarities and are defined as follows:

• F1 similarity is computed, for each pair of records, as the sum of the minima of the corresponding fields divided by the number of fields; it was adopted by the authors to cluster biological data in [37], characterized by a sparse feature matrix.
• F2 similarity is given, for each pair of records, by the number of features that are treated in both records divided by the number of fields.

Thus, for example, according to the above definitions, for the records r1 (2, 0, 5, 1, 0, 4) and r2 (1, 0, 3, 0, 1, 0) we have MC12 = 0.35, F112 = 4/6 = 0.66, and F212 = 2/6 = 0.33. Let us note that the MC similarity between two records increases, but not F1 and F2, if the corresponding fields of these records are equal to zero.
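The three metrics translate directly into code; for the example records above the functions below return MC = 0.35, F1 ≈ 0.66 and F2 ≈ 0.33.

    import numpy as np

    def mc_similarity(r, s):
        # Mean Closeness, Eq. (12): average of the per-field min/max ratios (di = 1 when both fields are 0)
        r, s = np.asarray(r, float), np.asarray(s, float)
        hi = np.maximum(r, s)
        d = np.where(hi != 0, np.minimum(r, s) / np.where(hi == 0, 1, hi), 1.0)
        return d.mean()

    def f1_similarity(r, s):
        # F1: sum of the field-wise minima divided by the number of fields
        return float(np.minimum(r, s).sum()) / len(r)

    def f2_similarity(r, s):
        # F2: fraction of fields that are nonzero in both records
        r, s = np.asarray(r), np.asarray(s)
        return np.count_nonzero((r != 0) & (s != 0)) / len(r)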
In the first experiment we classify a dataset of 615 rows and 100 features according to the three proposed WR PSOMs. The chosen dataset came from a database containing a collection of documents with abstract, title, keywords, and other information. The whole dataset, described in [38], was indexed by a set of user selected words, and the feature and similarity matrices were obtained. The clustering has been carried out by using the similarity matrix. We classify the documents into fifteen classes. The optimal number
of classes came from a preliminary study with the K-Means algorithm. This number represents the best compromise between reducing the total sum of distances and the mean silhouette value [39]. Both the MC and F1 similarity criteria have been tested for each algorithm. The results for different numbers of slaves are summarized in Tables 2a and 2b. These results point out that the WR WM PSOM achieves the highest precision, even if its precision decreases as the number of nodes increases. Moreover, the F1 similarity is more appropriate than MC. However, if one assumes that the lowest acceptable limit of the strong precision is about 80%, the previous results are not satisfactory. Then, the F2 similarity has also been tested to verify whether it is possible to achieve better precision. The simulation results showed that a higher strong precision of about 85% may be achieved. Thus the choice of the similarity metric is certainly an issue that deserves a thorough investigation in future work. As an example, by using the Pearson correlation based similarity [40] the WR WM PSOM further increases its strong precision from 85% to 90%. Therefore, from the first experiment we derive that the MC similarity is more suitable for clustering datasets whose feature matrix possesses many fields different from zero, whereas the F1 and F2 similarities are more appropriate for sparse feature matrices.

In the second experiment, we used a dataset of 3528 abstracts collected from Pubmed (www.ncbi.nlm.nih.gov/pubmed). Each abstract was indexed with a user selected list of genes obtained by a search on the Entrez gene database (www.ncbi.nlm.nih.gov/gene) with the same keyword used for searching the abstract (e.g., genetic disease). We obtained a sparse feature matrix with 3528 rows and 262 columns. From this feature matrix, both the MC and F1 similarity matrices were computed and used as input data for the classification task. The similarity matrix has a 3528 × 3528 size. The number of classes (i.e., 80 classes) was determined by a preliminary study with the K-Means algorithm. Tables 3 and 4 summarize the results for different numbers of slaves and for ten iterations of the WR PSOMs. Each slave executes one epoch. The results were compared with those of a sequential SOM after 10 epochs. The WR WM PSOM shows the highest precision. The F1 similarity is more appropriate than MC. Since the feature matrix is a sparse matrix, we have also carried out a simulation by using the F2 similarity. The simulation results, shown in Table 5, point out that all the WR PSOMs may achieve a precision of about 90%. This is a very important result, since our primary aim is to apply the parallel algorithms to biological databases characterized by sparse feature matrices.

Table 2a. Weak and strong precision of the WR PSOMs depending on the number of nodes for ten iterations. MC similarity. Dataset extracted from [38].

              Weak precision                 Strong precision
              WM      WWM     P-Mantra       WM      WWM     P-Mantra
3 nodes       0.79    0.75    0.79           0.75    0.73    0.67
9 nodes       0.78    0.78    0.83           0.73    0.70    0.70
18 nodes      0.78    0.78    0.77           0.73    0.72    0.63

Table 2b. Weak and strong precision of the WR PSOMs depending on the number of processing nodes for ten iterations. F1 similarity. Dataset extracted from [38].

              Weak precision                 Strong precision
              WM      WWM     P-Mantra       WM      WWM     P-Mantra
3 nodes       0.83    0.77    0.79           0.80    0.76    0.77
9 nodes       0.78    0.77    0.77           0.77    0.75    0.74
18 nodes      0.76    0.74    0.76           0.75    0.70    0.73

Table 3. Precision of the WR PSOMs as a function of the number of processing nodes for ten iterations. MC similarity. Dataset extracted from Pubmed.

              Weak precision (WP)            Strong precision (SP)
              WM      WWM     P-Mantra       WM      WWM     P-Mantra
3 nodes       0.87    0.79    0.88           0.77    0.69    0.70
9 nodes       0.89    0.83    0.90           0.68    0.62    0.55
18 nodes      0.91    0.91    0.95           0.60    0.62    0.56
24 nodes      0.93    0.94    0.92           0.60    0.59    0.58

Table 4. Precision of the WR PSOMs as a function of the number of processing nodes for ten iterations. F1 similarity. Dataset extracted from Pubmed.

              Weak precision (WP)            Strong precision (SP)
              WM      WWM     P-Mantra       WM      WWM     P-Mantra
3 nodes       0.87    0.83    0.89           0.84    0.80    0.82
9 nodes       0.85    0.74    0.88           0.82    0.71    0.80
18 nodes      0.86    0.77    0.87           0.80    0.71    0.74
24 nodes      0.88    0.79    0.89           0.81    0.74    0.76

Table 5. Precision of the WR PSOMs as a function of the number of processing nodes for ten iterations. F2 similarity. Dataset extracted from Pubmed.

              Weak precision (WP)            Strong precision (SP)
              WM      WWM     P-Mantra       WM      WWM     P-Mantra
3 nodes       0.93    0.92    0.95           0.90    0.92    0.91
9 nodes       0.93    0.92    0.93           0.91    0.89    0.89
18 nodes      0.93    0.94    0.96           0.91    0.88    0.92
24 nodes      0.93    0.94    0.94           0.92    0.87    0.87

Table 6. Strong Precision (SP) and distortion of the WR WM PSOM compared with the clustering obtained by the sequential SOM. Dataset extracted from Pubmed.

              SP      Dp       Ds      Dp/Ds
3 nodes       0.90    9.96     8.84    0.89
9 nodes       0.91    9.22     8.84    0.95
18 nodes      0.91    9.57     8.84    0.92
24 nodes      0.92    10.10    8.84    0.88

Table 7. Execution time of one iteration of the WR WM PSOM.

Number of grid nodes    Processing time (s)
3                       164
6                       84
12                      43
18                      29.5
24                      22
In particular, P-MANTRA is convenient for a moderate parallelization degree, i.e., for NP ≤ 9, whereas the WR WM PSOM is more convenient as the number of nodes increases.

6.2. Clustering distortion

The notion of precision introduced in the paper assumes the sequential SOM as the ''gold standard'', since our primary aim is to demonstrate that the results obtained by our parallel SOMs are almost the same as those of the sequential one, but obtained at a fraction of its processing time. However, this notion does not suggest to the user how to set the values of the parameters which optimize precision and time performance (e.g., learning rate or total number of epochs). Indeed, if the user changes these parameters, she/he is not able to know whether the WR PSOM fits the clusters of the sequential SOM better. The notion of distortion, on the contrary, may be useful for this aim. Namely, when the clustering has reached the minimum distortion, the output neurons are in their final positions and the WR PSOMs best approach the gold standard. As an example, Table 6 shows that, terminating the above third example when the distortion difference ∆DSp between two subsequent iterations is less than a suitable threshold, the distortion Dp of the clustering obtained by the WR WM PSOM is very close to the one, let us say Ds, related to the sequential SOM. Moreover, the Strong Precision SP, computed at the last iteration, is very close to its highest value.
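A sketch of the distortion measure and of the stopping rule described above; the threshold value and all names are illustrative.

    import numpy as np

    def distortion(X, W):
        # Mean distance between each item and its best matching output neuron
        d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)   # (n_items, n_out)
        return d.min(axis=1).mean()

    def should_stop(prev_d, curr_d, threshold=1e-3):
        # Terminate when the distortion difference between two iterations is below the threshold
        return abs(prev_d - curr_d) < threshold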

Fig. 11. Execution time of one iteration of the WR WM PSOM as a function of the number of nodes.

6.3. Processing time and convergence speed

Since the WR WM PSOM has shown the best performance in all the previous cases, we present in this section the time performance of this version with respect to the sequential SOM. To this aim, let us note that the execution of the above mentioned clustering by a sequential SOM on a Pentium IV with 3.2 GHz and 2 GB of RAM took about 90 min, whereas the WR WM PSOM with ten iterations and one last cycle to compute the class of each item of the dataset took 30 min and 30 s with 3 computing elements. Table 7 reports the execution time for one iteration of the parallel algorithm with one epoch as a function of the computing elements. Such results are also reported in Fig. 11 to better visualize how the execution time of one iteration of the WR WM PSOM is inversely proportional to the number of computing elements. Fig. 11 shows the scalability of the WR WM PSOM. In fact, recalling that the efficiency E of a parallel algorithm executed over p processors is a measure of the possibility of achieving a speed-up equal to the number of parallel processors, i.e.:

E = Tseq/(p · Tpar)    (13)

(where Tseq and Tpar are, respectively, the execution time of the algorithm on a single node and on one of the p nodes) and observing from Fig. 11 that Tpar is inversely proportional to p, i.e.:

Tpar ≈ 520/p = const/p    (14)

we can conclude that our parallel algorithm is scalable in the selected range with a more or less constant efficiency E. To have a measure of the speed-up of the WR WM PSOM, we draw its efficiency in Fig. 12. As we can see, the efficiency is about 0.96–0.99 in the selected range of processors, and consequently the WR WM PSOM is almost fully scalable. This high value of E allows the WR WM PSOM to approximate a full speed-up.
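A quick check of Eqs. (13) and (14) against the timings of Table 7: the product p · Tpar remains roughly constant (about 490–530 s), which is why the efficiency stays nearly constant in the selected range. The loop below simply reproduces this observation.

    table7 = {3: 164.0, 6: 84.0, 12: 43.0, 18: 29.5, 24: 22.0}   # nodes -> time per iteration (s)

    for p, t_par in table7.items():
        print(f"{p:2d} nodes: p * Tpar = {p * t_par:.0f} s")     # roughly constant, cf. Eq. (14)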

Fig. 12. Efficiency versus the number of computing elements (CEs) for the WR WM PSOM.

Fig. 13. Distance between two subsequent weight matrices at the increase of the iterations computed by the parallel SOMs and by their sequential analogue. The diagram refers to a clustering performed by 24 slaves. Similar diagrams were obtained by using NP = 3, 9, 18.

Indeed, recalling the discussion done in Section 5 and
taking into account that the efficiency E ≈ 1, a full speed-up may be achieved if the number of iterations to reach the final clustering of the parallel SOM, i.e., Np , is very close to the one needed by the sequential SOM, i.e., Ns . Fig. 13 confirms that, by choosing the learning rate decay according to Fig. 9, it is possible to satisfy this condition not only in the toy example discussed in Section 5, but also in the clustering of the realistic dataset extracted from Pubmed. In particular, let us note that Fig. 13 shows that both the parallel and sequential SOMs achieve their final clustering for Np = Ns = 5. For this reason, the experiments reported in Tables 3–5 were repeated for five iterations. Since the results obtained by executing five iterations differ little (about 1.5%) from the ones reported in mentioned tables, no further considerations are deserved. Also, we have found that increasing the number of epochs per iteration does not result into a significant improvement of the classification performance. Then, one epoch per iteration is recommended to classify with a reasonable processing time the dataset on the GRID. Thus, all the proposed WR PSOMs improve the time performance of the PSOMs available in the literature for clustering large datasets by means of a loosely coupled computing system. Indeed, the speed-up of these latter algorithms rapidly decays from 0.8 to 0.6 at the increase of NP if the data are partitioned among the nodes of a LAN [41], whereas it is reasonable to expect that the WR PSOMs have a quite full speed-up even for high NP. Of course, the degree of parallelism cannot be increased indefinitely because the resulting clusters become too approximate. Moreover, the efficiency, at a certain point, decreases significantly due to the communication overhead and to the not negligible probability of node faults. For example, in [37] we have shown that with a parallel K -Means algorithm the maximum number of nodes is obtained by imposing that each node processes at least 10% of the overall records since if one decreases such a value (thus increasing the degree of parallelism), the precision of the algorithm becomes unsatisfactory (i.e., it is less than 80%). The WR PSOMs proposed in this paper outperform this result since, as shown in Table 5 for

Table 8a
Topological preservation and communication overhead comparison between the WR PSOMs and the existing PSOMs.

Algorithm          Topology preservation   Communication overhead
Batch PSOM         Medium                  Medium
k-BATCH PSOM       Medium–high             Medium
Mantra PSOM        Medium–high             Medium
p-NP PSOM          Medium–high             Medium
Mantra WM PSOM     Medium–high             Low
WR WM PSOM         High                    Low

Table 8b
Efficiency in case of low, medium and high parallelism. Comparison of the efficiency between the WR PSOMs and the existing PSOMs.

Algorithm          Efficiency, 2–5 nodes   Efficiency, 6–15 nodes   Efficiency, 16–20 nodes
                   (low parallelism)       (medium parallelism)     (high parallelism)
Batch PSOM         0.80                    0.70                     0.60
k-BATCH PSOM       0.80                    N.A.                     N.A.
Mantra PSOM        0.80                    0.70                     0.60
p-NP PSOM          1                       0.90                     0.80
Mantra WM PSOM     1                       0.95                     0.90
WR WM PSOM         1                       0.95                     0.90

The WR PSOMs proposed in this paper outperform this result since, as shown in Table 5 for a realistic large dataset, they may achieve a high strong precision (i.e., 90%) even if the computing nodes process only 5% of the overall dataset.

6.4. Comparison with the existing PSOMs

Tables 8a and 8b summarize the overall performance achieved by the WR PSOMs with respect to the four PSOMs discussed in the previous section, i.e., the Batch, k-Batch, MANTRA and p-NP PSOMs. In the tables, the values dealing with the existing techniques are taken from [34,17,25], whereas the values of the WR PSOMs are the ones obtained in the former sections. In particular, Table 8a shows that the WR PSOMs feature a low communication overhead and a high preservation of the dataset topology. Table 8b reports the efficiency values to point out that we have achieved a high speed-up by taking advantage of the low communication overhead of the novel methods.

6.5. Feature analysis

In many data mining applications, the value of the result does not lie just in the correct presence of an item in a cluster, but rather in the discovery of most of the relevant features that characterize the clusters, called positive features in [42], i.e., the features possessed by the majority of the items of the cluster. This is especially true when the aim is the discovery of associations from the clustered data, which is a pressing need in research fields such as proteomics and genomics. For this reason, we have carried out some experiments to verify whether the positive features emerging from the clustering of biological datasets by a sequential SOM are almost the same as the ones we may discover from the clusters obtained with the WR PSOMs. In the experiments we do not take into account clusters consisting of fewer elements than a threshold, which has been fixed at 6. Moreover, we use the threshold function drawn in Fig. 14 to find the positive features in both the sequential and the parallel algorithms. The threshold function is used as follows: (a) if the number of items of a cluster is greater than N2 (e.g., 100), then a feature is considered positive if it is possessed by a percentage P1 of the items of the cluster (e.g., 10%); (b) if the number of items of a cluster is lower than N1 (e.g., 10), then a feature is considered positive if it is possessed by a percentage P2 of the items of the cluster (e.g., 30%); and (c) if the number N of the items of a cluster is between N1 and N2, then a feature is considered positive if it is possessed by a percentage P(N) of the items of the cluster.

Fig. 14. Threshold function used to find the positive features of a cluster.
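A minimal sketch of this positive-feature extraction is given below. The linear interpolation of P(N) between P2 (at N1) and P1 (at N2) is an assumption based on the shape of Fig. 14, and the default parameter values are the examples quoted above.

#include <cstddef>
#include <vector>

// Fraction of a cluster's items that must possess a feature for it to be "positive",
// as a function of the cluster size n (rules (a)-(c) above).
double positiveThreshold(std::size_t n, std::size_t n1 = 10, std::size_t n2 = 100,
                         double p1 = 0.10, double p2 = 0.30) {
    if (n >= n2) return p1;                         // (a) large clusters
    if (n <= n1) return p2;                         // (b) small clusters
    const double t = double(n - n1) / double(n2 - n1);
    return p2 + t * (p1 - p2);                      // (c) assumed linear interpolation
}

// items: one row per item of the cluster, one 0/1 entry per feature.
std::vector<std::size_t> positiveFeatures(const std::vector<std::vector<int>>& items) {
    std::vector<std::size_t> positives;
    if (items.size() < 6) return positives;         // clusters with fewer than 6 items are skipped
    const double thr = positiveThreshold(items.size());
    for (std::size_t f = 0; f < items[0].size(); ++f) {
        std::size_t count = 0;
        for (const auto& row : items) count += (row[f] != 0);
        if (double(count) / items.size() >= thr) positives.push_back(f);
    }
    return positives;
}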


Table 9
Matching percentage between the positive features discovered by the WR WM PSOM executed by 9 slaves and its sequential analogue. P2 = 0.30. Dataset extracted from Pubmed.

P1     # Pos. features (Seq.)   # Pos. features (Par.)   # Pos. features shared   Matching % per class   Avg. # per class (Seq.)   Avg. # per class (Par.)
0.06   44                       18                       18                       0.92                   6.15                      5.03
0.04   44                       37                       28                       0.94                   6.15                      7.00
0.02   44                       69                       40                       0.90                   6.15                      9.49

The results for the dataset that will also be used in the case study are reported in Table 9, where the items were clustered according to the F1 similarity by the WR WM PSOM executed by 9 slaves. It is important to note that the average positive-feature matching between the sequential and the parallel SOM is very high (about 90%–94%). Table 9 also shows that the matching value depends on P1: an increase of P1 from 4% to 6%, i.e., from 0.04 to 0.06, produces a decrease of the positive-feature matching percentage. Thus, in general, rules to find the optimal values of P1, P2, N1 and N2 are certainly of interest; this is left for further study. To fully appreciate the effectiveness of the proposed parallel SOMs in the positive feature extraction, a comparison between the relevant associations discovered from clusters obtained by a sequential SOM and the ones discovered in a much shorter time starting from clusters obtained by the WR PSOMs is discussed in the case study.

7. Case study

In this case study we automatically search for associations between genes and diseases by using the scientific abstracts freely available in Pubmed. These associations are extracted starting from the clusters obtained by the WR WM PSOM and are then compared with the ones derived from clusters obtained by a sequential SOM. In detail, we started by searching Pubmed for all the abstracts containing the phrase ‘‘genetic disease’’ and selecting a collection of 110,000 abstracts. Then we performed a search in Entrez Gene, the above mentioned NCBI database containing all the genes, by using the same keyword; from this search we selected 738 genes. Another search using the same keyword (genetic disease) was performed in MeSH (www.ncbi.nlm.nih.gov/mesh), resulting in 102 diseases. These data were then used for parsing and indexing the abstracts in order to obtain a matrix that has as many rows as the abstracts that contain at least one gene or one disease, and as many columns as the genes and the diseases that appear in at least one abstract. A feature matrix of 12,204 rows and 435 columns was obtained. A similarity matrix of 12,204 rows and 12,204 columns was then built from the feature matrix using the F1 similarity. The abstracts were clustered into 80 clusters using the WR WM PSOM; also in this case the number of classes was obtained from a preliminary study with the K-Means algorithm.

To test the mentioned mining methodology in practice, the clustering has been executed on a GRID architecture based on the Globus Toolkit 4 [43,44]. The nodes of the GRID were connected by a switch and a local area network operating at 100 Mb/s. On the fastest computer we installed the Globus Toolkit server, which provides the services used by our master to execute the steps of the WR WM PSOM, i.e.:

(a) secure authentication and communication over the network by using the GRID Security Infrastructure (GSI);
(b) execution of remote jobs by using the GRID Resource Allocation and Management (GRAM);
(c) accessing and moving data by using GridFTP, a high-performance, secure, reliable data transfer protocol based on the File Transfer Protocol.

In particular, the master, implemented as a C++ process installed in a Toolkit Container of another PC of the GRID, reads the similarity matrix, divides this matrix into horizontal slices and sends the pieces, by using GridFTP, to the slaves installed in all the available PCs of the LAN, including the one hosting the Toolkit server. Then, the master forks as many processes as the number of slaves. These processes activate in each slave, by using GRAM, the programs that execute the on-line SOM for the received piece of the input dataset. After completing the submission phase, the master and the slaves execute Kohonen's algorithm for the relevant piece of the input data. During this phase, the master verifies the slave status and may resubmit the data to some other slave if the submission or the Java process fails. After receiving the weight matrices from all the slaves, the master performs the weight recombination according to formula (4) and launches another iteration of the clustering process. A final computation is done by the master, after terminating all the iterations, to compute the class of each object of the dataset (a minimal sketch of this loop is given below).

To achieve a high precision, both weak (WP) and strong (SP), we have taken into account the heuristics suggested in Section 6, i.e., a first epoch of the sequential SOM was carried out to find a suitable initial weight configuration for the WR WM PSOM, whereas a limited number of GRID nodes (from 3 to 9) was used to increase the probability that the data assigned to each node belong to all the eighty classes, thus satisfying the hypotheses underlying the theorem described in Section 5. Let us note that the optimal choice of the node number NP should depend on the number of classes nc and on the number of data items nd of the overall dataset. Typically one can choose NP = nd/(20 nc) to have roughly twenty data items per class in each node, so that a deviation of a few elements around this mean value results in about a 10% deviation from the uniform distribution of the data among the classes envisaged by the theorem of Section 5 (in our case, NP = 12,204/(20 × 80) ≈ 7.6). For this reason, the choice of a GRID with 9 nodes seems the most suitable to obtain a parallel clustering very close to the one achievable by the sequential SOM. Indeed, this is confirmed by the high accuracy and low distortion, corresponding to low topological and quantization errors, obtained by executing the WR WM PSOM with 3 and 9 nodes, as reported in Table 10. Let us note that we also carried out a simulation of the above algorithm for an 18-node GRID, obtaining the same performance. This suggests that the performance of the WR WM PSOM is quite robust.
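A minimal sketch of the master loop just described is reported below. All helper types and bodies are illustrative assumptions rather than the actual implementation: in the real system the slave epoch is a GRAM job submission plus GridFTP transfers running concurrently on the GRID nodes (with resubmission on failure), and the weight recombination follows formula (4), which is not reproduced here and is replaced by a plain element-wise average purely as a placeholder.

#include <cstddef>
#include <functional>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// One on-line SOM epoch executed by a slave on its slice of the similarity matrix,
// starting from the common weight matrix w (hypothetical stand-in for GRAM + GridFTP).
using SlaveEpoch = std::function<Matrix(const Matrix& slice, const Matrix& w)>;

// Horizontal partitioning: consecutive blocks of rows, one slice per slave.
std::vector<Matrix> splitRows(const Matrix& data, std::size_t nSlaves) {
    std::vector<Matrix> slices(nSlaves);
    const std::size_t block = (data.size() + nSlaves - 1) / nSlaves;
    for (std::size_t r = 0; r < data.size(); ++r) slices[r / block].push_back(data[r]);
    return slices;
}

// Placeholder for formula (4): a plain element-wise average of the slaves' weight
// matrices (all matrices are assumed to have the same neuron-by-feature shape).
Matrix recombine(const std::vector<Matrix>& ws) {
    Matrix out = ws[0];
    for (std::size_t k = 1; k < ws.size(); ++k)
        for (std::size_t i = 0; i < out.size(); ++i)
            for (std::size_t j = 0; j < out[i].size(); ++j) out[i][j] += ws[k][i][j];
    for (auto& row : out) for (double& v : row) v /= ws.size();
    return out;
}

// Master loop: for each of the Np iterations, broadcast the current weights,
// let every slave run one epoch on its slice, then recombine the results.
Matrix masterLoop(const Matrix& data, Matrix w, std::size_t nSlaves,
                  int nIterations, const SlaveEpoch& slaveEpoch) {
    const std::vector<Matrix> slices = splitRows(data, nSlaves);
    for (int it = 0; it < nIterations; ++it) {
        std::vector<Matrix> partial;
        for (const Matrix& slice : slices) partial.push_back(slaveEpoch(slice, w));
        w = recombine(partial);
    }
    return w;   // the classes of the individual items are computed by the master afterwards
}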


Table 10
Accuracy and distortion of the parallel clustering of the dataset in the case study; SP = strong precision, WP = weak precision.

Nodes   SP     WP     N. docs original set   SP filtered set   WP filtered set   N. docs filtered set   Ds/Dp (filtered sets)
3       0.94   0.85   12,204                 0.98              0.97              9,302                  0.89
9       0.95   0.78   12,204                 0.98              0.94              9,302                  0.82
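The ‘‘filtered set’’ columns of Table 10 are obtained, as discussed in the next paragraph, by discarding the data that lie too far from their winning neurons. A minimal sketch of such a filter follows; the Euclidean distance and the threshold are assumptions.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>;

double distance(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { const double d = a[i] - b[i]; s += d * d; }
    return std::sqrt(s);
}

// Keep only the items whose best matching neuron is closer than maxDist.
std::vector<std::size_t> filterPeripheral(const std::vector<Vec>& items,
                                          const std::vector<Vec>& neurons, double maxDist) {
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < items.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        for (const Vec& n : neurons) best = std::min(best, distance(items[i], n));
        if (best <= maxDist) kept.push_back(i);
    }
    return kept;
}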

With respect to the results shown in Table 10, we have to point out also that the accuracy increases considerably if one does not consider the data that are at a certain distance from the winning neurons, as shown in the columns denoted as SP and WP filtered. In fact, if one filters these data, reducing the overall dataset from 12,204 to 9,302 elements, the accuracy is greater than 90%. Since the peripheral data with respect to the winning neurons behave as noise that does not contribute to the extraction of the positive features, the high accuracy shown in Table 10 suggests that the discovery of the gene-disease associations should be almost the same starting from either the parallel or the sequential SOM. To illustrate this important result, we report in Table 11 the genes that are the positive features of all the classes that have the disease ‘‘Cystic fibrosis’’ among their disease positive features. In this table the columns report, respectively, the class id, the number of abstracts of the class that support the association, the gene ID taken from Entrez Gene, the gene name, and the number of abstracts obtained by searching Pubmed for Cystic fibrosis and the gene symbol. Table 11 points out that many genes that may have a role in Cystic fibrosis are present in only a few abstracts of the joint search, i.e., their role does not emerge explicitly from Pubmed. Thus, in general we can say that our method is effective, since the possible associations are not only the ones explicitly supported by Pubmed, but include the ones mediated through other features which are not necessarily positive features. By comparing the associations derived from the WR PSOM with the ones obtained by the sequential SOM we found that they are almost the same.

Indeed, Table 12 shows that the method based on the WR WM PSOM finds 4/26 new associations, and Table 13 shows that it misses only 5/26 of the associations discovered by the sequential SOM. This confirms that the associations discovered by both the parallel and the sequential SOM are about 85% of the total discovered associations. Let us note that, in principle, a better precision could be obtained by choosing appropriately the parameters P1 and P2 of the threshold function the master uses to extract the positive features of each class. In fact, if P1 and P2 are too low, we will extract a large number of positive features, whereas if P1 and P2 are too high the number of positive features could be very low. In both cases the gene-disease associations discovered by using a parallel SOM may differ from the ones derived from a sequential SOM. Thus a suitable compromise should be sought; for example, we have found that it is enough to change P2 from 30% to 32.5% to achieve a complete coincidence between the associations discovered by the sequential SOM and the WR WM PSOM. How to optimally choose P1 and P2 is left for further study.

8. Concluding remarks

In this paper we have proposed a novel parallel SOM based on Weight Recombination (WR) and discussed three variants of the method. The WR PSOM aims at preserving the dataset topology to support a satisfactory visualization of the results and to discover relevant associations between the data features by using topological operations over the clusters obtained. The WR PSOMs presented use a ‘‘horizontal’’ data partitioning scheme where each computational element of a GRID performs an on-line SOM or a MANTRA algorithm over a horizontal slice of the feature matrix, or of the similarity matrix, characterizing the dataset. At the end of each iteration, the weights computed by each slave are sent to a master which recombines them and sends them back to the slaves. This new approach takes advantage of the fact that most of the weight ordering phase takes place in the first phase of Kohonen's algorithm and that a limited number of epochs is enough to reach a stable clustering solution, thus requiring a small communication overhead between the slaves and the master.

Table 11
Genes that are the positive features in the classes that have Cystic fibrosis as a disease positive feature. The last column reports the number of abstracts obtained by searching Pubmed for Cystic fibrosis and the gene symbol.

Class   # Abs in the class   Gene ID   Gene name                                                                                            # Abs in Pubmed
1       204                  3,036     Hyaluronan synthase 1                                                                                0
                             4,524     5,10 Methylenetetrahydrofolate reductase (NADPH)                                                     1
                             5,573     Protein kinase, cAMP-dependent, regulatory, type I, alpha (tissue specific extinguisher 1)           0
3       70                   1,277     Collagen, type I, alpha 1                                                                            2
                             324       Adenomatosis polyposis coli                                                                          8
                             4,524     5,10 Methylenetetrahydrofolate reductase (NADPH)                                                     1
                             5,573     Protein kinase, cAMP-dependent, regulatory, type I, alpha (tissue specific extinguisher 1)           0
12      70                   154       Adrenergic, beta-2-, receptor, surface                                                               40
                             3,036     Hyaluronan synthase 1                                                                                0
61      21                   1,080     Cystic fibrosis transmembrane conductance regulator, ATP-binding cassette (sub-family C, member 7)   5409
                             3,036     Hyaluronan synthase 1                                                                                0
                             6,833     ATP-binding cassette, sub-family C (CFTR/MRP), member 8                                              5
53      77                   3,064     Huntingtin                                                                                           1
                             5,979     Ret proto-oncogene                                                                                   4
                             6,532     Solute carrier family 6 (neurotransmitter transporter, serotonin), member 4                          0
56      17                   1,277     Collagen, type I, alpha 1                                                                            2
                             1,760     Dystrophia myotonica-protein kinase                                                                  1
                             51,422    Protein kinase, AMP-activated, gamma 2 non-catalytic subunit                                         0
                             5,573     Protein kinase, cAMP-dependent, regulatory, type I, alpha (tissue specific extinguisher 1)           0
                             958       CD40 molecule, TNF receptor superfamily member 5                                                     9
77      91                   1,277     Collagen, type I, alpha 1                                                                            2
                             80,781    Collagen, type XVIII, alpha 1                                                                        0
78      38                   1,080     Cystic fibrosis transmembrane conductance regulator, ATP-binding cassette (sub-family C, member 7)   5409
                             154       Adrenergic, beta-2-, receptor, surface                                                               40
                             5,573     Protein kinase, cAMP-dependent, regulatory, type I, alpha (tissue specific extinguisher 1)           0
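A minimal sketch of how a table such as Table 11 can be derived once the positive features of each class are available: for a given disease, the gene positive features of every class that also has that disease among its disease positive features are listed. The data structures and names are illustrative, not those of the actual implementation.

#include <string>
#include <vector>

struct ClassFeatures {
    int classId;
    std::vector<std::string> positiveGenes;     // gene positive features of the class
    std::vector<std::string> positiveDiseases;  // disease positive features of the class
};

struct Association { int classId; std::string gene; };

// All (class, gene) pairs supporting an association between 'disease' and a gene.
std::vector<Association> genesAssociatedWith(const std::string& disease,
                                             const std::vector<ClassFeatures>& classes) {
    std::vector<Association> out;
    for (const ClassFeatures& c : classes) {
        bool hasDisease = false;
        for (const std::string& d : c.positiveDiseases) hasDisease = hasDisease || (d == disease);
        if (!hasDisease) continue;
        for (const std::string& g : c.positiveGenes) out.push_back({c.classId, g});
    }
    return out;
}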

Table 12
Positive features shared among classes that have Cystic fibrosis as a positive feature which were found by using the parallel SOM but not by using the sequential SOM.

Gene ID   Gene name                                                                      # Abs in Pubmed
3,064     Huntingtin                                                                     1
5,979     Ret proto-oncogene                                                             4
6,532     Solute carrier family 6 (neurotransmitter transporter, serotonin), member 4    0
51,422    Protein kinase, AMP-activated, gamma 2 non-catalytic subunit                   0

Table 13
Positive features shared among classes that have Cystic fibrosis as a positive feature found by the sequential SOM, but not by using the WR PSOM.

Gene ID   Gene name                                        # Abs in Pubmed
2,147     Coagulation factor II (thrombin)                 28
356       Fas ligand (TNF superfamily, member 6)           5
4,210     Mediterranean fever                              8
5,354     Proteolipid protein 1                            0
538       ATPase, Cu++ transporting, alpha polypeptide     0

The proposed WR PSOMs converge towards a solution which is a valuable approximation of the sequential on-line SOM. The number of operations needed to recombine the weights for launching a new epoch in the WR PSOMs is an order of magnitude smaller than the one needed by the nodes of the existing PSOMs to compute, at the end of each epoch, how the weights have to be updated because of the contribution of the other nodes. Moreover, if the classification is done on the feature space, the number of operations for the weight recombination phase of the WR PSOMs depends linearly on the number of features nf, whereas the number of operations executed by the nodes of the existing PSOMs is proportional to the product of the number of features nf and the number nd of data items of the dataset. The reported simulation studies demonstrate an almost linear speed-up due to the limited communication overhead and suggest how to compute the similarity matrix to achieve better accuracy and time performance and to support an effective data mining of large datasets. The test of the method on a GRID infrastructure confirms the theoretical considerations. The use of novel similarity criteria, and how to best fix the thresholds drawn in Fig. 14 for feature extraction, should be further investigated to facilitate the overall mining process, even if the measured performance of our parallel algorithms (i.e., about 90% speed-up, accuracy, correct positive feature extraction and correct feature association discovery) in our target domain, i.e., genomics and related fields, poses a challenge for further improvements. Finally, the exploitation of the WR PSOMs to improve valuable SOM extensions which better preserve the dataset topology even for time-changing clusters (i.e., ESOM, MIGSOM and ReDSOM) will also be investigated.

References

[1] Y. Zhao, G. Karypis, Data clustering in life science, Molecular Biotechnology 31 (2005).
[2] R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Transactions on Neural Networks 16 (3) (2005).
[3] J.T.L. Wang, et al., Data Mining in Bioinformatics, Springer-Verlag, London, 2005.
[4] J. Augen, Bioinformatics in the Post-Genomic Era, Addison Wesley, 2005.
[5] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: a new data clustering algorithm and its applications, Data Mining and Knowledge Discovery 1 (2) (1997).
[6] H. Jin, M.L. Wong, K.S. Leung, Scalable model-based clustering for large databases based on data summarization, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (11) (2005).


[7] D. Arlia, M. Coppola, Experiments in parallel clustering with DBSCAN, in: Proc. EuroPar 2001, 2001.
[8] J. Tan, J. Zhang, W. Li, An improved clustering algorithm based on density distribution function, Computer and Information Science 3 (3) (2010).
[9] T. Kohonen, Self Organizing Maps, Springer, 1995.
[10] S. Kaski, J. Kangas, T. Kohonen, Bibliography of self organizing map (SOM) papers: 1981–1997, Neural Computing Survey 1 (1) (1998).
[11] M. Oja, S. Kaski, T. Kohonen, Bibliography of self organizing map (SOM) papers: 1998–2001 addendum, Neural Computing Survey 3 (1) (2003).
[12] C. Bishop, M. Svensen, Developments of the generative topographic mapping, Neurocomputing 21 (1998) 203–224.
[13] M. Pena, W. Barbakh, Topology-preserving mappings for data visualisation, in: Principal Manifolds for Data Visualization and Dimension Reduction, Lecture Notes in Computational Science and Engineering, vol. 58, 2008, pp. 131–150. doi:10.1007/978-3-540-73750-6_5.
[14] I.J. Sledge, J.M. Keller, Growing neural gas for temporal clustering, in: Proc. Int. Conf. on Pattern Recognition, Tampa, FL, USA, 2008.
[15] E. Pampalk, Limitations of the SOM and the GTM, 2001.
[16] C. Garcia, M. Prieto, A. Pascual-Montano, A speculative parallel algorithm for self organizing maps, in: Proceedings of the International Conference ParCo, Parallel Computing: Current & Future Issues of High-End Computing, 2005.
[17] M. Cottrel, J.C. Fort, P. Letremy, Advantages and drawbacks of the batch Kohonen algorithm, in: 10th European Symp. on Artificial Neural Networks, Bruges, Belgium, 2005.
[18] H. Jin, M.L. Wong, K.S. Leung, W. Shum, K.-S. Leung, Expanding self-organizing map for data visualization and cluster analysis, Information Sciences 163 (1–3) (2004).
[19] M. Morchren, A. Ultsch, ESOM Maps, Technical Report 45, Dept. CS, University of Marburg, Germany, 2005.
[20] D. Alahakoon, S.H. Halgamuge, B. Srinivasan, Dynamic self-organizing maps with controlled growth for knowledge discovery, IEEE Transactions on Neural Networks 11 (3) (2000).
[21] A. Faro, D. Giordano, F. Maiorana, Discovering complex regularities by adaptive self organizing classification, in: The Second World Enformatika Conference, WEC'05, Istanbul, 2005.
[22] T. Ayadi, T.M. Hamdani, A.M. Alimi, Enhanced data topology preservation with multilevel interior growing self-organizing maps, in: Proc. World Congress on Engineering, WCE 2010, London, 2010.
[23] G. Denny, J. Williams, P. Christen, ReDSOM: relative density visualization of temporal changes in cluster structures using self-organizing maps, in: Proceedings of the Eighth IEEE International Conference on Data Mining, ICDM, 2008, pp. 173–182.
[24] M. Cottrel, J.C. Fort, P. Letremy, Stochastic on-line algorithm versus batch algorithm for quantization and self organizing maps, in: Second NNSP Conference, Falmouth, September 2001.
[25] P. Ienne, P. Thiran, N. Vassilas, Modified self-organizing feature map algorithms for efficient digital hardware implementation, IEEE Transactions on Neural Networks 8 (2) (1997).
[26] M. Ceccarelli, A. Petrosino, R. Vaccaro, Competitive neural networks on message passing parallel computers, Concurrency: Practice and Experience 5 (6) (1993).
[27] R.D. Lawrence, G.S. Almasi, H.E. Rushmeier, A scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems, Data Mining and Knowledge Discovery 3 (171) (1995).
[28] M. Nocker, M. Morchren, A. Ultsch, An algorithm for fast and reliable ESOM learning, in: ESANN 2006, Bruges, Belgium, 2006.
[29] V. Fiolet, B. Toursel, A clustering method to distribute a database on a GRID, Future Generation Computer Systems 23 (2007) 997–1002.
[30] P. Tomsich, A. Rauber, D. Merkl, Optimizing the parSOM neural network implementation for data mining with distributed memory systems and cluster computing, in: Proceedings of the 11th International Workshop on Database and Expert Systems Applications, 2000.
[31] P. Ozdzynski, A. Lin, M. Liljeholm, J. Beatty, A parallel general implementation of Kohonen's self organizing map algorithm: performance and scalability, Neurocomputing 44 (2002).
[32] E. Schikula, C. Weidmann, Data parallel simulation of self-organizing maps on hypercube architectures, in: Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4–6, 1997.
[33] M. Yang, N. Ahuja, A data partition method for parallel self-organizing map, in: International Joint Conference on Neural Networks, IJCNN'99, vol. 3, 1999.
[34] J. Kwiatkowski, M. Pawlik, U. Markowska-Kaczmar, D. Konieczny, Performance evaluation of different Kohonen network parallelization techniques, in: IEEE Proc. Parallel Computing in Electrical Engineering, PARELEC'06, 2006.
[35] D. Giordano, F. Maiorana, A Visual Tool for Mining Macroeconomics Data, in: V. Zanasi, N.F.F. Ebecken, C. Brebbia (Eds.), Data Mining, WIT Press, 2004.
[36] A. Faro, D. Giordano, F. Maiorana, Navigating in neurobiological hyperspaces in search of relevant associations, in: Proc. Neuroinformatics Conference, Computational Neuroscience Track, Pilzen, Germany, 2009.
[37] A. Faro, D. Giordano, F. Maiorana, C. Spampinato, Discovering gene-disease associations from specialized literature, IEEE Transactions on Information Technology in Biomedicine (2009).
[38] J.D. Fekete, G. Grinstein, C. Plaisant, IEEE InfoVis 2004 Contest, the history of InfoVis, 2004. www.cs.umd.edu/hcil/iv04contest.
[39] P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53–65.
[40] P.J. Huber, Robust Statistics, Wiley, 2004.


[41] J. Kwiatkowski, M. Pawlik, U. Markowska-Kaczmar, D. Konieczny, Performance evaluation of different Kohonen network parallelization techniques, in: Proceedings of the International Symposium on Parallel Computing in Electrical Engineering, PARELEC'06, 2006.
[42] A. Faro, D. Giordano, F. Maiorana, Discovering complex regularities: from tree to semi-lattice classifications, International Journal of Computational Intelligence 2 (1) (2005).
[43] I. Foster, Globus Toolkit version 4: software for service-oriented systems, in: IFIP Int. Conference on Network and Parallel Computing, LNCS, vol. 3779, Springer-Verlag, 2005, pp. 2–13.
[44] I. Foster, C. Kesselman, S. Tuecke, The anatomy of the GRID: enabling scalable virtual organizations, International Journal of Supercomputer Applications 15 (3) (2001).

Alberto Faro received the Laurea in Nuclear Engineering from Politecnico of Milan in 1971. He is full Professor of Artificial Intelligence at the University of Catania. His research interests include knowledge discovery in distributed and parallel systems, signal and image processing, and bioinformatics.

Daniela Giordano received the Laurea in Electronic Engineering (1990) from the University of Catania, where she is associate Professor of Information Systems, and the Ph.D. in Educational Technology (1998) from Concordia University, Montreal. Her research deals with knowledge-based systems, computer vision and learning environments.

Francesco Maiorana received the Laurea in Electronic Engineering from the University of Catania in 1990, and the M.S. in Computer Science from New York University in 1993. He is a researcher in bioinformatics at the ICT-E1 Project of the Catania Municipality. His research interests include data mining, image processing and grid computing.