Chemometrics and Intelligent Laboratory Systems 135 (2014) 110–125
Visualizing Big data with Compressed Score Plots: Approach and research challenges

José Camacho ⁎

Departamento de Teoría de la Señal, Telemática y Comunicaciones, Universidad de Granada, 18071 Granada, Spain
Article info

Article history: Received 18 February 2014. Received in revised form 12 April 2014. Accepted 15 April 2014. Available online 24 April 2014.

Keywords: Exploratory Data Analysis; Big data; Principal Component Analysis; Partial Least Squares; Score plots
Abstract

Exploratory Data Analysis (EDA) can be defined as the initial exploration of a data set with the aim of generating a hypothesis of interest. Projection models based on latent structures and associated visualization techniques are valuable tools within EDA. In particular, score plots are a main tool to discover patterns in the observations. This paper addresses the extension of score plots to very large data sets, with an unlimited number of observations. The proposed solution, based on clustering and approximation techniques, is referred to as the Compressed Score Plots (CSPs). The approach is presented to deal with high volume data sets and high velocity data streams. The objective is to retain the visualization capabilities of traditional score plots while making the user-supervised analysis of huge data sets affordable in a similar time scale to that of low size data sets. Efficient processing and updating approaches, visualization techniques, performance measures and challenges for future research are identified throughout the paper. The approach is illustrated with several data sets, including a data set of five million observations and more than one hundred variables.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction

According to [1], Exploratory Data Analysis (EDA) is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favour of the more direct approach of allowing the data themselves to reveal their underlying structure and model. EDA has been employed for decades in many research fields, including social sciences, psychology, education, medicine, chemometrics and related fields [2,3]. EDA is both a data analysis philosophy and a set of visualization tools [4]. Nevertheless, while the philosophy has essentially remained the same, the tools are in constant evolution, as numerous recent references suggest [5–8]. This is the direct consequence of the increasing complexity of the problems tackled with data analysis methods, enabled by growing computing capabilities.

The advances in technology in the last decade have led to the so-called Big data era, where exabytes1 of data are generated daily by humans and, more importantly, machines [9]. This has drawn the attention of the scientific and technological community, driving initiatives for Big data analysis like Hadoop (http://hadoop.apache.org/). Also, extensions of modeling, classification and data mining techniques to Big data, like the Mahout project (http://mahout.apache.org/), have been developed. Unfortunately, the application of the EDA philosophy, which relies so much on visualizations, to Big data problems is
⁎ Tel.: +34 958 248898; fax: +34 958 240831. E-mail address: [email protected].
1 One exabyte corresponds to 10^18 bytes, or one million terabytes.
http://dx.doi.org/10.1016/j.chemolab.2014.04.011
challenging due to the large scale of the data sets involved. However, this application deserves attention, since EDA and modelling are complementary, with EDA a suggested first step prior to data modelling [10,11]. Omitting EDA carries the risk of misinterpreting the modelling results, as illustrated in [12] with a number of real examples. Therefore, there is a need to develop EDA methods that can handle the aforementioned data scales, while taking advantage of the basic importance of simply looking at data [4].

Big data are commonly defined by the so-called 4 Vs [13]:

• Variety: Data are varied in nature. Different sources, including unstructured and structured information, need to be properly combined in order to make the most of the analysis.
• Veracity: The search for valuable information in large data sets is very much like the problem of finding a needle in a haystack. Big data present a low signal-to-noise ratio, and exploratory and data mining techniques are needed to find patterns or trends of practical use, which are more reliable than isolated measurements.
• Volume: The amount of data that needs to be handled simultaneously makes processing parallelism a must. Exabytes, zettabytes, and even higher amounts of data are described in Big data applications.
• Velocity: In Big data problems, a high sampling rate is common. This further complicates the analysis and makes parallelism even more necessary.

In data sets with a large number of variables, collinear data and missing values, projection models based on latent structures are valuable tools within EDA. Standard well-known projection models are Principal Component Analysis (PCA) [14,15,2] and Partial Least Squares (PLS)
[16–18]. These models and the set of tools used in combination with them [19–22] simplify the visual analysis of complex data sets. While the calibration of projection models from large scale data has already been studied [23,24], the extension of the visual tools to this context has not been treated. One of the most widely used visualization tools in the context of projection models is the score plot. The score plot is a very useful tool to discover the distribution of the observations, including special observations (outliers) and clusters of related observations. All this information may be of paramount importance to improve data knowledge, and a valuable starting point for the selection and proper application of modeling and data mining tools.

Projection models have outstanding capabilities for combining data from different sources and for handling uncertain data. This covers two of the aforementioned 4 Vs of Big data. Addressing the other two, Volume and Velocity, is the focus of this paper. For this, an approach to extend score plots to Big data, the compressed score plots (CSPs), is introduced. It is argued that this extension is of interest not only for the chemometric community, but for the general community involved in Big data analytics. For this reason, the examples used to illustrate the approach are not limited to chemometric data. Limitations and research challenges in the proposed approach are pointed out throughout the document.

The paper is organized as follows. Section 2 introduces PCA and PLS and their iterative computation. Section 3 discusses the limitations of traditional score plots in visualizing a large number of observations. This is illustrated using a data set collected from a continuous process. Section 4 introduces the proposed solution to that limitation: the Compressed Score Plots. Section 5 introduces a methodology to update Compressed Score Plots when the subspace of interest changes. Section 6 illustrates the complete approach with two additional case studies. Section 7 presents conclusions and future research challenges.

2. Projection subspaces

Both PCA and PLS provide a similar solution to the same problem: data collinearity. The approach of these methods to overcome the problems derived from collinearity is to identify a reduced number of new variables, referred to as latent variables (LVs) or, specifically in PCA, as principal components (PCs). These LVs are obtained as a combination of the original variables in the data. In standard PCA and PLS, the LVs are linear combinations of the original variables, but non-linear extensions also exist [25]. For a given data set, the LVs are found by maximizing a given quadratic function: variance in the case of PCA and covariance for PLS. The operation to obtain the LVs from the original variables can be geometrically interpreted as a projection. Thus, projection models can be understood as projection subspaces of the original variable space. PCA follows the expression:
$X = T_A P_A^T + E_A$,    (1)
where T_A is the N × A score matrix containing the projection of the observations onto the A-PC subspace, P_A is the M × A loading matrix containing the A eigenvectors of X^T X with the highest associated eigenvalues, and E_A is the N × M matrix of residuals. The number of PCs retained in a PCA model, A, is a key choice [15,26], which in general can be regarded as an application-dependent decision [27,28].

PLS provides a biased solution to the linear regression problem, alternative to the least squares solution. The linear regression problem is defined by the following expression:

$Y = X B + F$,    (2)
where Y is the N × K matrix of variables that are to be estimated, X is the N × M matrix of variables available to estimate Y, B is the M × K matrix of regression coefficients and F is the N × K matrix of residuals. A possible way to interpret B is as a model of Y, with X being the input to the model. The aim of PLS regression is to estimate Y from the subspace of X that maximizes its covariance with Y. The partial least squares regression problem between the normalized matrices X and Y can be stated as:

$X = T_A P_A^T + E_A$
$Y = T_A Q_A^T + F_A$,    (3)
where T_A is the N × A score matrix which contains the projections of X onto the latent A-dimensional subspace, P_A and Q_A are the M × A and K × A regressor matrices, also called loading matrices, and E_A and F_A are the N × M and N × K matrices of residuals of X and Y, respectively. Eq. (3) can be rearranged in the following form:

$Y = X R_A Q_A^T + F_A$,    (4)

with:

$R_A = W_A (P_A^T W_A)^{-1}$,    (5)
where W_A is an M × A matrix of weights. Thus, a PLS model is represented by matrices P_A, W_A and Q_A. A useful variant of PLS for supervised classification in EDA is PLS-Discriminant Analysis (PLS-DA) [29]. In PLS-DA, matrix Y is artificially generated with dummy variables, which codify the different classes in the data set. Typically, Y is constructed with as many variables as classes. All the observations belonging to a class have value 1 for the corresponding variable, and −1 (or 0) for the rest.

In the following subsections, some problems that arise when fitting projection models from Big Data are discussed. Since the main goal of the present paper is to provide a solution for data visualization, model fitting problems are only briefly reviewed and a practical solution based on the iterative computation of cross-product matrices is adopted.

2.1. Computation of projection models from large volumes of data

There are two problems of interest to the scientific community in the calibration of projection models from very large data sets. The main problem is to compute the model out-of-core [30], that is, without maintaining the whole data set in the main memory of the computer. A second problem is the development of fast, approximate and computationally efficient calibration algorithms [31,32]. Model fitting does not represent the most computationally intensive step in the proposal of this paper. For this reason, this section is only devoted to the first problem.

Several algorithms for PCA or PLS model fitting take the calibration data set X (and optionally Y), with N observations, as input. Due to limited computer resources, in particular computer memory, this approach is infeasible when N grows beyond a certain number, as is the case for large volume data sets. In those cases, cross-product matrices can be used for model fitting. The loading vectors of PCA can be identified using the eigendecomposition (ED) of the cross-product matrix XX = X^T X. Similarly, the loadings and weights in PLS regression can be identified from matrices XX and XY = X^T Y using the kernel algorithm [33–35]. The computation of these cross-product matrices can be performed in an iterative manner. This procedure assumes the data have been previously preprocessed, which can also be performed in an iterative fashion. For the sake of generality and to reduce computational overhead, the cross-product matrices are updated for each batch of observations of size B, instead of for each single observation. The observation-wise iterative computation can be derived for B = 1. The iterative computation follows:
$XX_t = XX_{t-1} + X_t^T X_t$,    (6)
and

$XY_t = XY_{t-1} + X_t^T Y_t$,    (7)

where XX_t and XY_t are the cross-product matrices after the t-th batch of observations X_t and Y_t has been considered. Using this method, only the cumulative cross-product matrices need to be maintained in memory for model fitting, instead of the whole set of observations. The computation of cross-product matrices has been a widespread choice for the iterative computation of PCA [31] and is appropriate for parallelization [36], a must in Big Data analysis. Also, this is a suitable approach for continuous model update and for the computation of several models, e.g. PCA and PLS. An alternative is to split the data into tractable parts, for instance with the use of hierarchical models [37]. This assumes the existence of meaningful splits, e.g. clusters in the observations. Also, such models are more difficult to update with incoming data. Another alternative is data sampling, which results in an approximated model rather than an exact one.

2.2. Computation of projection models from high velocity data streams

The problem of model fitting from high velocity streams bears some similarities with that of large data sets discussed in the previous section. A main difference, however, is that it is commonly assumed that the data stream cannot be stored in memory, and therefore the model needs to be iteratively fitted from incoming data [38]. The model fitting solution based on the computation of cross-product matrices can also be applied to this problem, but in this case the preprocessing parameters need to be updated as the data flow. In order to account for the non-stationarity in the data, which is the common case in streaming data, the update law may follow an EWMA procedure based on the proposal of [39]. This approach makes use of a forgetting factor λ that takes values between 0 and 1, so that if 1 is chosen no forgetting is applied. For brevity, only the x-block updating procedure will be discussed. The EWMA computations involving the y-block follow exactly the same approach. Again, for the sake of generality and to reduce computational overhead, the EWMA is updated for each batch of observations of size B. Also, B may be variable in time. For a new batch of observations of the variables at time interval t, X_t, with B_t observations, the EWMA update of the mean is computed in two steps. First, the accumulate is computed from:

$M_t^x = \lambda \cdot M_{t-1}^x + \sum_{i=1}^{B_t} x_t^i$    (8)

where λ is the forgetting factor and x_t^i represents the i-th observation in X_t. Then, the actual mean is computed following:

$m_t^x = (1/N_t) \cdot M_t^x$    (9)

where N_t, which represents the number of observations used to compute the mean at a given time t, is also computed using an EWMA:

$N_t = \lambda \cdot N_{t-1} + B_t$    (10)

so that, for B_t = B constant in time, N_t converges to B/(1 − λ). If data are auto-scaled, the standard deviation of the blocks needs to be computed. First, an EWMA estimate of the accumulated variance is obtained:

$(\sigma_t^x)^2 = \lambda \cdot (\sigma_{t-1}^x)^2 + \sum_{i=1}^{B_t} (x_t^i - m_t^x)^2$    (11)

From this estimate, the standard deviation can be derived:

$\sigma_t^x = \sqrt{(1/(N_t - 1)) \cdot (\sigma_t^x)^2}$    (12)

Finally, the cross-product matrix is computed from:

$XX_t = \lambda \cdot XX_{t-1} + \tilde{X}_t^T \tilde{X}_t$,    (13)

with:

$\tilde{x}_t^i = (x_t^i - m_t^x) \oslash \sigma_t^x$    (14)

where $\tilde{x}_t^i$ represents the i-th observation in $\tilde{X}_t$ and ⊘ is the Hadamard (element-wise) division.
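As an illustration of Eqs. (6)–(14), a minimal NumPy sketch of the batch-wise EWMA update of the preprocessing parameters and of the cross-product matrix is given below, with the PCA loadings recovered from the eigendecomposition of XX as indicated in Section 2.1. The class and variable names are illustrative only and not part of the paper.

```python
import numpy as np

class EWMAStream:
    """EWMA update of mean, std and cross-product matrix (Eqs. (8)-(14))."""
    def __init__(self, n_vars, lam=0.99):
        self.lam = lam                        # forgetting factor (lambda = 1: no forgetting)
        self.M = np.zeros(n_vars)             # accumulated sum, Eq. (8)
        self.S2 = np.zeros(n_vars)            # accumulated squared deviations, Eq. (11)
        self.N = 0.0                          # effective number of observations, Eq. (10)
        self.XX = np.zeros((n_vars, n_vars))  # cross-product matrix, Eq. (13)

    def update(self, Xt):
        Bt = Xt.shape[0]
        self.M = self.lam * self.M + Xt.sum(axis=0)                  # Eq. (8)
        self.N = self.lam * self.N + Bt                              # Eq. (10)
        m = self.M / self.N                                          # Eq. (9)
        self.S2 = self.lam * self.S2 + ((Xt - m) ** 2).sum(axis=0)   # Eq. (11)
        s = np.sqrt(self.S2 / max(self.N - 1, 1.0))                  # Eq. (12)
        Xtilde = (Xt - m) / s                                        # Eq. (14)
        self.XX = self.lam * self.XX + Xtilde.T @ Xtilde             # Eq. (13)
        return m, s

    def pca_loadings(self, n_pcs):
        """Loadings: eigenvectors of XX with largest eigenvalues (Section 2.1)."""
        w, V = np.linalg.eigh(self.XX)        # ascending eigenvalues
        return V[:, ::-1][:, :n_pcs]          # M x A loading matrix P_A

# usage: feed batches of a (simulated) stream
stream = EWMAStream(n_vars=36, lam=0.99)
for _ in range(100):
    Xt = np.random.randn(200, 36)             # one batch of B_t = 200 observations
    stream.update(Xt)
P = stream.pca_loadings(n_pcs=2)
```

With λ = 1 the forgetting is disabled and the scheme reduces to the plain cumulative update of Eqs. (6)–(7).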
An alternative to handle projection models from non-stationary data streams is the use of recursive model fitting approaches like [40]. However, this is less suitable for visualization, considering that it may be necessary to compute several variants of a model throughout a data analysis. For this, computing intermediate parameters like cross-product matrices has the advantage that different models, with different subsets of variables and/or observations, may be obtained straightforwardly.

2.3. Handling missing data

The approaches of the previous two sections can be directly applied in the absence of missing data. Missing elements may be handled in a different way depending on the estimates of interest. To estimate preprocessing parameters and cross-product matrices, missing elements can be handled by imputation methods for the parameters of the multivariate distribution [41], either for the complete data set or for parts or batches of data. If desired, missing data methods based on projection models can also be used to improve the imputation of the scores. In that domain, most contributions focus on the estimation of missing data once a model is built [42–46], although the problem of model building with missing elements has also been considered [47] based on an expectation–maximization iterative approach. These approaches can be seamlessly integrated into the methodology of this paper.

3. Limitations of score plots in Big data

In EDA, PCA and PLS can be used to improve data understanding mainly in the following three aspects:

a) The distribution of the observations. This distribution, in particular the existence of outliers and clusters, contains relevant information for data mining and data understanding. Thus, outliers may distort the result of data mining algorithms. They also represent abnormal situations which may be very informative in certain contexts. Clusters of observations may represent habitual situations. The separation among clusters may reflect different operation points in the phenomenon of interest. Periodic switching among clusters may reflect repetitive processes. For labelled data, clustered classes may lead to the conclusion that the set of variables considered is adequate for classification. Data sets with several tens, hundreds or even thousands of variables are common in many fields. The direct observation of the data distribution in these data sets is not possible, and the visualization of selected pairs of variables at a time is a tedious approach. PCA and PLS can be used straightforwardly to visualize the distribution of the data in the latent subspace, considering a number of LVs that is commonly much lower than the number of variables. Score plots are used for this purpose.

b) The relationships among variables. PCA and PLS are not specifically designed to find relationships among variables, as for instance Factor Analysis (FA) is [2,15]. Nevertheless, there are a number of visualization tools which may be useful for this purpose, like loading plots or MEDA [19].

c) The connection between observations and variables. The investigation of the connection between observations and variables in the latent subspace is essential to gain a complete picture of the data, for instance to unveil potential causes for the appearance of outliers and clusters. Traditionally, biplots [20] have been used for this purpose. Also, an extension of MEDA named observation-based MEDA, or oMEDA [48], is useful in this context.
Fig. 1. Three different scores distributions for the same loadings in the PCA model with 2 PCs. Panels (a)–(c) show scores on PC 1 (71.88%) vs. scores on PC 2 (28.12%).
There are several additional tools that may be valuable to improve data understanding using projection models. Most of them come from chemometrics and related fields like econometrics, psychometrics, biometrics or statistics, where PCA and PLS models are customarily used to interpret data, rather than for simple compression tasks. In the context of PLS, randomization tests [49–51] are a useful model validation and variable selection tool. Other interesting variable selection tools are introduced in [52,53].

When a projection model is calibrated, no matter the purpose, the distribution of the scores should always be inspected. This is because the same model (same covariance structure) may be the result of very different data distributions. For instance, in Fig. 1, the score plots of three completely different data distributions that provide the same PCA loadings are shown: a multinormal distribution, a distribution with one outlier and a distribution with two clusters.2 The inspection of score plots or equivalent tools is strongly recommended in EDA. The EDA for large data sets, with even millions of observations, is a particular case in which additional considerations must be taken into account. These include how to process score plots to improve their interpretability. In very large data sets, traditional score plots such as those in Fig. 1 are limited as a visualization tool.

To illustrate the limitations of score plots in large data sets, a data set from an industrial continuous process collected by Perceptive Engineering LTD (http://www.perceptiveapc.com/) [54] is used. The data set was collected during a period of more than 4 days of continuous operation of a fluidized bed reactor fed with four reactants. Data were collected every 20 s. Data consist of 18,887 observations on 36 process variables. The process variables include feed flows, temperatures, pressures, vent flow and steam flow, and the observations include 22 different operational points of the process. Fig. 2 depicts the score plot corresponding to the Perceptive data set for the first two PCs of the PCA model. In the figure, the number of observations (scores) is 18,887. This number is excessive for proper visualization. The plot shows clouds of points for each operational point, some of them overlapping and hiding part of the others. As a consequence, the interpretability of the plot is reduced.
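To make the construction of a traditional score plot concrete, a minimal NumPy/Matplotlib sketch is shown below. The names and plotting choices are illustrative and not taken from the paper; X stands for any N × M data matrix.

```python
import numpy as np
import matplotlib.pyplot as plt

def score_plot(X, a=0, b=1):
    """Scatter the scores of two PCs of a mean-centered data matrix X."""
    Xc = X - X.mean(axis=0)                 # mean centering
    XX = Xc.T @ Xc                          # cross-product matrix
    w, V = np.linalg.eigh(XX)               # eigendecomposition, ascending order
    P = V[:, ::-1]                          # loadings sorted by decreasing variance
    T = Xc @ P                              # scores
    expl = 100 * w[::-1] / w.sum()          # explained variance per PC
    plt.scatter(T[:, a], T[:, b], s=2)
    plt.xlabel(f"PC {a+1} ({expl[a]:.1f}%)")
    plt.ylabel(f"PC {b+1} ({expl[b]:.1f}%)")
    plt.show()

# usage with simulated data: with tens of thousands of rows the plot becomes
# an overlapping cloud, which is the limitation discussed above
score_plot(np.random.randn(20000, 36))
```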
4. Compressed Score Plots (CSPs)

One procedure to solve the visibility problem illustrated in Fig. 2 is using clustering techniques [55,56]. In more demanding cases, with even millions of observations, the volume of data is so large that data cannot be treated and plotted directly due to limitations in computer resources. Then, clustering or an alternative procedure turns out to be mandatory. Suitable alternatives to clustering in this problem are density plots [57] or other forms of bivariate histograms, see [58] and references therein. These methods are based on tessellating a bivariate subspace with a regular grid of bins of a specific form (e.g. rectangles or hexagons) and counting the number of points contained in each bin [59]. As will be shown, these methods are faster than clustering but less visually representative of the data, especially when several classes are shown in the plot.

The clustering approach has been widely considered in many contexts and in many areas of knowledge as one possibly-needed step in EDA and data visualization [60]. The grouping criterion is fundamental for the definition of a cluster. For that, a measure of the similarity between observations of a data set is employed. Commonly, this measure is distance-based. The most popular distance metric is the Euclidean distance. When the difference in scale and the correlation between variables are not negligible, the Mahalanobis distance is preferred. In a general manner, a quadratic distance can be expressed as:

$d_K(x_i, x_j) = \|x_i - x_j\|_K = \left( (x_i - x_j)^T K^{-1} (x_i - x_j) \right)^{1/2}$    (15)
2 The ADICOV algorithm [54] was employed to simulate these data sets.
Fig. 2. Score plot for the first two PCs of the PCA model of the Perceptive data set.
where K is the M-dimensional identity matrix in the case of Euclidean distance and the covariance matrix of the original M variables in the case of Mahalanobis distance. A negative side-effect of the use of the Mahalanobis distance is that noisy components of low variability may be enhanced when inverting K. To avoid this, the Mahalanobis distance can be restricted to a projection sub-space, such as that of the first LVs in PCA or PLS. This is conveniently treated in [54]. Eq. (15) also holds for the definition of the Euclidean and Mahalanobis distances in PCA and PLS subspaces, where K^{-1} is chosen as follows:

$K^{-1} = P K_{PCA}^{-1} P^T$,    (16)

for PCA and:

$K^{-1} = R K_{PLS}^{-1} R^T$,    (17)

for PLS. K_PCA in Eq. (16) and K_PLS in Eq. (17) is set to the A-dimensional identity matrix, for A LVs, in the case of Euclidean distance and to the covariance matrix of the scores in the case of the Mahalanobis distance.

4.1. Clustering method

The application of clustering methods to compress score plots is not straightforward. Score plots are used for data interpretation. Therefore, the clustering should not destroy significant details or introduce artifacts in the data distribution. Also, since the size of the data is very large, a fast clustering approach needs to be defined. Jain [61] defines five categories of efficient clustering techniques for large data sets:

• Efficient Nearest Neighbor Search.
• Data summarization.
• Distributed computing.
• Incremental Clustering.
• Sampling-based methods.

The clustering algorithm used in this paper is a variant of the algorithm of Bradley et al. [62]. It belongs to the incremental clustering category and only requires one scan of the data set. Also, it can be easily extended to distributed computing, or combined with sampling-based or summarization methods. The aim of the clustering is to compress the scores in the score plot corresponding to a given projection subspace while maintaining a distribution that is visually representative of that of the original scores. For the sake of representativeness, the distance-based measures should be restricted to the projection subspace, i.e. Eqs. (16) and (17) should be applied. The choice between Euclidean and Mahalanobis distances depends on the procedure to depict the score plots. When both dimensions in the score plot are drawn with a similar size regardless of the magnitude in the margins, the Mahalanobis distance is preferred.

The clustering has been designed considering the aforementioned issues. Its features are summarized as follows:

• Mahalanobis or Euclidean distance in the PCA or PLS subspace (Eqs. (16) and (17)).
• Iterative exploration of the data set, previously arranged into packets of data.
• When the data sets contain different classes, the clustering is applied to each one separately.3

The clustering algorithm is presented in Algorithm 1. In it, the input data set X is divided into T packets of B_t observations each. That way, only part of the data rather than the whole data set needs to be maintained in the processor memory at the same time. Most of the time, large data sets and data streams are naturally ordered in packets, so that this operation does not need to be explicitly implemented. The partition of the data into packets is a common step in Big data analysis, like in the Hadoop Distributed File System [9]. In Algorithm 1, from the packets of data, a set of centroids is iteratively computed using the merge() routine. The computation of centroids is optimized for a given definition of distance in K^{-1}. The number of original observations represented by each centroid, i.e. its multiplicity, is stored in μ. The multiplicity of original observations is 1. Therefore, each time a packet of data X_t is joined to the previously computed clusters, a vector of B_t ones, 1_{B_t}, is included in the vector of multiplicities μ.

3 If there are several underlying classes, the analyst needs to decide whether to repeat the clustering per each class of interest, or to divide the observations in groups according to several common classes.

Algorithm 1. Clustering algorithm

C ← Cluster(X, K^{-1}):
  C = []
  μ = []
  [X_1, …, X_T] ← partition(X)
  for each packet X_t,
    C ← [C, X_t]
    μ ← [μ, 1_{B_t}]
    [C, μ] ← merge(C, μ, K^{-1})
  end

The merge() routine in the clustering algorithm is presented in Algorithm 2. In the routine, the pair of observations/centroids with minimum distance in C is iteratively replaced by its centroid, and the multiplicities are conveniently recomputed. The distance is computed following Eq. (15) for the given matrix K^{-1} in Eq. (16) or Eq. (17). For observations/centroids in different classes, the distance is set to infinity. This replacement operation is repeated until only L_end elements are left in C (a Python sketch of this procedure is given below).

Algorithm 2. Merge routine in the clustering algorithm

[C, μ] ← merge(C, μ, K^{-1}):
  L := #(C)
  C := [c_1, …, c_L]
  μ := [μ_1, …, μ_L]
  while (#(C) > L_end),
    [c_i, c_j] ← min_dist(C, K^{-1})
    c_i ← centroid(μ_i ⋅ c_i, μ_j ⋅ c_j)
    C ← [c_1, …, c_{j−1}, c_{j+1}, …, c_L]
    μ_i ← μ_i + μ_j
    μ ← [μ_1, …, μ_{j−1}, μ_{j+1}, …, μ_L]
  end

When all the packets have been processed by the clustering algorithm, a reduced data set of centroids is provided, along with the associated multiplicities. The number of remaining centroids is L_end. L_end is user-defined and should be chosen so that the visualization of the CSP is adequate. Typically, a total of 100–300 points are adequately visualized in a CSP.

Let us illustrate the clustering approach with the Perceptive data set. The Mahalanobis distance in the PCA subspace corresponding to the first 2 PCs is used to perform the clustering. Thus, matrix K^{-1} is set according to Eq. (16) with K_PCA equal to the covariance matrix of the scores. The resulting CSP in the PCA subspace is shown in Fig. 3. For the sake of interpretability, multiplicity is included in the plot by using different marker sizes. This CSP design makes an efficient use of three visualization components [63]: position and volume, to represent the data distribution, and color, to distinguish among classes. Comparing the original score plot in Fig. 2 with the CSP, it can be seen that the distribution of the scores of the different operational points is clearer in the latter.
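A minimal NumPy sketch of Algorithms 1 and 2, with the subspace distance of Eqs. (15)–(17), is given below. It is illustrative only: the names are not from the paper, the pairwise search is the naive O(L²) version rather than an optimized nearest-neighbor search, and the per-class handling (infinite distance between classes) is omitted.

```python
import numpy as np

def subspace_metric(P, K_sub):
    """Return K^-1 = P K_sub^-1 P^T for a loading/weight matrix P (Eqs. (16)-(17))."""
    return P @ np.linalg.inv(K_sub) @ P.T

def merge(C, mu, K_inv, L_end):
    """Algorithm 2: merge closest pairs until only L_end centroids remain."""
    C, mu = list(C), list(mu)
    while len(C) > L_end:
        # find the pair of centroids with minimum quadratic distance (Eq. (15))
        best, pair = np.inf, None
        for i in range(len(C)):
            for j in range(i + 1, len(C)):
                d = C[i] - C[j]
                dist = d @ K_inv @ d
                if dist < best:
                    best, pair = dist, (i, j)
        i, j = pair
        # replace the pair by its weighted centroid and update multiplicities
        C[i] = (mu[i] * C[i] + mu[j] * C[j]) / (mu[i] + mu[j])
        mu[i] = mu[i] + mu[j]
        del C[j], mu[j]
    return C, mu

def cluster(X, K_inv, L_end=100, B=200):
    """Algorithm 1: one-pass incremental clustering over packets of B observations."""
    C, mu = [], []
    for t in range(0, X.shape[0], B):
        Xt = X[t:t + B]
        C += list(Xt)                 # join the packet to the current centroids
        mu += [1.0] * Xt.shape[0]     # new observations have multiplicity 1
        C, mu = merge(C, mu, K_inv, L_end)
    return np.array(C), np.array(mu)

# usage: Mahalanobis distance restricted to the first two PCs
# X: preprocessed N x M data, P: M x 2 loading matrix, T = X @ P its scores
# K_inv = subspace_metric(P, np.cov(T, rowvar=False)); C, mu = cluster(X, K_inv)
```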
Fig. 3. Compressed score plot of the Perceptive data set and PCA. Multiplicity is shown in the size of the markers.
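A CSP like the one in Fig. 3 can be rendered directly from the centroids and their multiplicities; the following Matplotlib sketch is illustrative (the names and size scaling are not from the paper):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_csp(T_centroids, mu, classes=None):
    """Compressed score plot: marker size encodes multiplicity, color encodes class."""
    size = 10 + 200 * mu / mu.max()          # marker area grows with multiplicity
    color = classes if classes is not None else "tab:blue"
    plt.scatter(T_centroids[:, 0], T_centroids[:, 1], s=size, c=color, alpha=0.7)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()

# usage: project the centroids C obtained from the clustering onto the loadings P
# plot_csp(C @ P, mu, classes)
```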
This enhancement of visualization is the goal of CSPs, but it turns out to be mandatory for millions of observations, as will be shown later on. The procedure to visualize the multiplicity is a principal decision in a CSP design. An alternative CSP is shown in Fig. 4, where the multiplicity is shown on the Z-axis. Thus, the plot needs to be rotated to inspect the multiplicity.

For comparison purposes, in Fig. 5 two views of a bivariate histogram based on a regular grid are shown. A histogram is much faster to compute than a CSP computed from scratch. However, the fidelity of the histogram is also reduced in comparison to a CSP. Furthermore, a principal limitation is how to represent class information, in particular when there is a high number of classes. Neither the marker form (Fig. 5(a)) nor a third dimension (Fig. 5(b)) is an adequate choice for that purpose. The most popular forms of bivariate plots are density plots and hexagonal binning plots. An example of the corresponding plots for the PCA subspace of the Perceptive data set, without class information, is shown in Fig. 6. Again, the main shortcomings are the limited fidelity and the difficulty to include class information.

When the input is a data stream of high velocity, an EWMA approach for CSP building can be used. The clustering algorithm can be straightforwardly extended to the EWMA law by updating the multiplicities following this law. For each batch of incoming information, the merge() routine is launched after the following update of the multiplicities:

$\mu \leftarrow [\lambda \cdot \mu, 1_{B_t}]$    (18)

Also, centroids with multiplicities below a given value (say 0.1) are discarded. In Fig. 7, the EWMA CSP for the Perceptive data set with λ equal to 0.9 is shown. Only the last 10 operational points of the process remain in the plot, while the others have been forgotten by the EWMA law.

4.2. Performance measures

The assessment of the quality of a given CSP is essential to compare alternative approaches. There are at least three performance criteria which should be quantified:

• Visualization quality: is the visualization adequate to draw the correct conclusions from the data?
• Computational efficiency: how long does it take to compute and visualize the CSP?
• Representativeness: is the centroid distribution representative of the original data distribution?
The visualization quality falls within the so-called visual analytics research area [60,63] and is beyond the scope of this paper. Computational efficiency can be measured in computational time or by performing an asymptotic analysis of the algorithm complexity. If the latter is used, the relative efficiency between memory access and computation in a given computational setup should be taken into account. Finally, representativeness may be computed by measuring distances between the centroid set and the original data set in the subspace of interest, in order to detect artifacts or uncovered areas in a CSP. Two complementary representativeness measures are defined here: the Mean Distance to the Furthest Observation (MDFO) and the Mean Distance to the Closest Observation (MDCO). Let us call X_{c_i} the set of observations whose closest centroid is c_i for a given distance and subspace definition K, that is:

$X_{c_i} = \{ x \in X : c_i = \arg\min_c d_K(x, c) \}$    (19)

and take:

$d_i^M = \max_{x \in X_{c_i}} d_K(x, c_i)$    (20)

$d_i^m = \min_{x \in X_{c_i}} d_K(x, c_i)$    (21)
Fig. 4. Compressed score plot of the Perceptive data set and PCA. Multiplicity is shown in a third dimension: (a) X-axis vs Y-axis and (b) rotated view to inspect the multiplicity.
Fig. 5. Bivariate histogram of the Perceptive data set and PCA. Multiplicity is shown in the size of the markers. Classes are shown in the form of the markers and in a third dimension: (a) X-axis vs Y-axis and (b) rotated view to inspect the classes.
so that $d_i^M$ and $d_i^m$ are the maximum and minimum distances, respectively, from observations in X_{c_i} to c_i according to K. Then:

$MDFO = \frac{1}{L_{end}} \sum_{i=1}^{L_{end}} d_i^M$    (22)

$MDCO = \frac{1}{L_{end}} \sum_{i=1}^{L_{end}} d_i^m$    (23)
A high MDFO means that there are original points that are far from their closest centroid, and therefore are not correctly represented in the CSP. This is useful to assess whether there are uncovered areas in the plot. A high MDCO value, on the other hand, highlights artifacts in the CSP which do not reflect the actual data distribution, since the minimum distance from centroids to observations is high.

In Fig. 8, a number of performance measures for different executions of the clustering algorithm on the Perceptive data set are shown. The executions differ in the packet size B used in the algorithm. Performance measures include the processing elapsed time (in seconds), the MDFO and the MDCO. In Fig. 8(a), the elapsed times of executing the clustering algorithm with and without storing the original observations associated with each centroid are compared. If the original observations are not stored, the computation is faster. However, storing the original data has many advantages in terms of model recomputation, for instance after outlier
extraction. The comparison shows that all the executions provide similar results in terms of representativeness, which means that the partition of the complete data set into packets does not lead to representativeness degradation. Also, Fig. 8(a) shows that the partition is quite effective in reducing the computational time. A packet size around 100–200 observations (0.5%–1% in the plot) seems to be an adequate choice.
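The representativeness measures of Eqs. (19)–(23) translate directly into code. The following NumPy sketch is illustrative (function and variable names are not from the paper) and assumes every centroid is the closest one for at least one observation.

```python
import numpy as np

def mdfo_mdco(T, C, K_inv):
    """Mean Distance to the Furthest/Closest Observation (Eqs. (19)-(23)).

    T: N x d scores of the original observations, C: L x d centroids,
    K_inv: inverse covariance (Mahalanobis) or identity (Euclidean) in the subspace.
    """
    # quadratic distances between every observation and every centroid (Eq. (15))
    diff = T[:, None, :] - C[None, :, :]                  # N x L x d differences
    d2 = np.einsum("nlk,kj,nlj->nl", diff, K_inv, diff)   # N x L squared distances
    d = np.sqrt(d2)
    closest = d.argmin(axis=1)                            # Eq. (19): closest centroid per point
    d_M = np.array([d[closest == i, i].max() for i in range(C.shape[0])])  # Eq. (20)
    d_m = np.array([d[closest == i, i].min() for i in range(C.shape[0])])  # Eq. (21)
    return d_M.mean(), d_m.mean()                         # Eqs. (22)-(23)
```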
5. Model update

The common use of projection models in EDA is subject to continuous model update or model rebuild. The EWMA update strategy previously introduced can only be used when the model update is smooth. This is the case when the analysis is performed on-line on streaming data, so that incoming data are added to the model, continuously modifying it. However, there are several reasons to rebuild the model from scratch during an EDA. For instance, when a number of outliers are found in the score plot, the typical procedure is to isolate them from the data and re-compute the model. This procedure commonly changes the projection subspace to a large extent. The model also changes after variable selection [52,53], which is also a common step in EDA. A useful tool for Big data analysis should be capable of performing model update/rebuild in an efficient manner. As a matter of fact, due to the strongly interactive nature of EDA, the time scale to perform model update/rebuild in Big data should be similar to that in low size data. Otherwise, the EDA would be paused to perform long computations.
Fig. 6. Classless density plot [57] (top) and hexagonal binning plot computed with the EDA toolbox [64] (bottom) of the Perceptive data set and PCA.
Fig. 7. Exponentially weighted moving average compressed score plot of the Perceptive data set and PCA. Multiplicity is shown in the size of the markers and λ = 0.9.
Fig. 8. Performance measures for different batch sizes in the clustering algorithm on the Perceptive data set: (a) elapsed time, (b) Mean Distance to the Furthest Observation and (c) Mean Distance to the Closest Observation. Batch sizes considered range from 20 observations (0.1% of the complete data set) to 2000 observations (10%).
This section is devoted to presenting the proposal of this paper on this matter. As discussed before, PCA and PLS can be fitted from the cross-product matrices XX and XY, the latter only for PLS. One approach for fast model update after outlier extraction is to store the cross-product matrices associated with each centroid. Thus, if a centroid is determined to be an outlier and deleted from the model, the total cross-product matrices and projection models can be easily recomputed by subtracting the cross-product matrices associated with the centroid from the cross-product matrices corresponding to the complete data set. Under the assumption that the volume of the data is sufficiently larger than the volume of outliers, no preprocessing parameters or cross-product matrices need to be recomputed from scratch. Other types of model updates, such as after variable selection, are also easily computed from cross-product matrices.

The most challenging step is the update of the CSPs. The distance in the clustering is computed according to Eq. (16) or Eq. (17), which in turn depend on the specific subspace fitted. Therefore, after a model (subspace) update, the centroids yielded by the clustering may not remain representative of the data distribution in the new subspace. This is illustrated in Fig. 9. Fig. 9(a) presents the original observations and centroid corresponding to one cluster of Fig. 3. Recall that this cluster was obtained using the Mahalanobis distance in the PCA subspace of the first 2 PCs. In Fig. 9(b), both data and centroid are projected onto a different subspace, in particular the PCA subspace corresponding to the 3rd and 4th PCs of the same data set, which is completely orthogonal to the subspace where the clustering was performed. The result is that the cloud of observations is dispersed and clustered, which means that the centroid is not representative of the data distribution in that new subspace. To further illustrate this problem, Fig. 10 compares the projection onto the 3rd and 4th PCs subspace of the original data, the clustering obtained in that specific subspace and the clustered data of Fig. 3.
When the clustering is performed on the same subspace as the CSP, the representativeness is better than when a different subspace was considered in the clustering. Thus, Fig. 10(b) represents Fig. 10(a) with higher fidelity than Fig. 10(c) does. This is also concluded from the MDFO index. The clustering performed on the 3rd and 4th PCs subspace yields less than half the MDFO value (MDFO = 0.096) of the clustering performed on the 1st and 2nd PCs (MDFO = 0.253), showing that the CSP of Fig. 10(c) leaves uncovered areas. This can also be observed by visually comparing Figs. 10(a) and 10(c). However, the CSP in Fig. 10(c) does not introduce artifacts, as illustrated by its MDCO value (MDCO = 1.6 × 10−3), which is even lower than that of the CSP in Fig. 10(b) (MDCO = 2.0 × 10−3).

One approach for updating the CSPs when the subspace of interest changes is to repeat the whole clustering operation from scratch in the new subspace. Unfortunately, the clustering of large data sets is a time demanding operation which may take up to days depending on the size of the data. It is not realistic to expect that the framework proposed in this paper would be of any use if the EDA has to be stopped for such a long time after each model update/rebuild. This is especially true considering that a typical EDA with projection models may need several chained model updates, which should be performed almost immediately to maintain the interactiveness of the process. An alternative approach for CSP update is to split the already identified centroids into a reduced number of observations and then re-cluster them. Provided that the number of intermediate observations generated in this splitting operation is (much) lower than that of the original data set, the re-clustering operation would be (much) faster than performing a clustering from scratch. Nevertheless, the splitting operation using the original data is time demanding itself and requires the storage of the original data during the clustering. In this paper, an approximation method is proposed instead for this splitting operation.
Fig. 9. Original data (+) and centroid (□) of one cluster (number 99) of Fig. 3 with multiplicity 52 projected on the PCs 1 vs 2 subspace (a) and the PCs 3 vs 4 subspace (b). Perceptive data set.
Fig. 10. Score plots of the Perceptive data set for the 3rd and 4th PCs subspace: (a) score plot of the original data, (b) CSP computed in the 3rd and 4th PCs subspace and (c) CSP computed in the 1st and 2nd PCs subspace. The CSP in (b) yields MDFO = 0.096 and MDCO = 2.0 × 10−3. The CSP in (c) yields MDFO = 0.253 and MDCO = 1.6 × 10−3.
This is based on the method termed Approximation of a DIstribution for a given COVariance (ADICOV) [54]. ADICOV approximates a data set by another data set with constrained covariance. Using this method, artificial data can be generated for each centroid according to the information in its corresponding cross-product matrix XX. Then, the clustering is recomputed in the new subspace from the artificial observations. Regarding the validity of this approximation, it should be noted that the clustering itself is an approximation of the original data. Furthermore, EDA is concerned with visual information, not exact computations, and therefore approximation techniques such as ADICOV are suitable in this context.

The proposed approximation approach is depicted in Fig. 11. A first clustering is computed in subspace A. After model update, the subspace changes to B. To perform the update of the CSP, as discussed, there are two choices: (a) the repetition of the clustering in the new projection subspace, represented by the dashed arrow in Fig. 11, and (b) the approximation using ADICOV and re-clustering. Choice (b) is much faster than choice (a) provided the volume of data is high. Therefore, and also provided the approximation by ADICOV is good enough, choice (b) is preferred.

The splitting algorithm based on ADICOV is presented in Algorithm 3. D is the output artificial set of points, which has to be re-clustered afterwards with the clustering algorithm. Routine est_disp is used to estimate the number of new observations into which each centroid needs to be split; this is explained afterwards. The core of the splitting algorithm is the two following steps. Firstly, a random data set is generated using a multi-normal distribution. This data set is distributed randomly in all directions in the space of the original variables. Secondly, ADICOV is applied over this data set with the cross-product matrix XX(i).
Fig. 11. Compressed score plots update: splitting and re-clustering procedure compared to the clustering from the original data.
This is the cross-product matrix computed from the data in the cluster, centered according to the centroid. Thus, it contains the variance–covariance structure around the centroid. Matrix I performs the linear transformation from the original space of the variables to the subspace of interest, and O performs the inverse linear transformation from that subspace to the original space of the variables. The result of ADICOV is a random data set distributed according to the information in XX(i). Concluding, the basic idea in the algorithm is to select one reference point, the centroid, and compute the variance of the data around this point. Using this information, an artificial data set which resembles the original data distribution in the cluster is yielded. The details on the ADICOV algorithm are given in the Appendix.

Algorithm 3. Approximation algorithm based on ADICOV

D ← Split(C, μ, XX(), th, I, I_old, O):
  C := [c_1, …, c_L]
  μ := [μ_1, …, μ_L]
  XX() := [XX(1), …, XX(L)]
  D = []
  for each cluster i,
    j = est_disp(XX(i), μ_i, I, I_old, th)
    L = random(j)
    A = ADICOV(L, XX(i), μ_i, I, O) + c_i
    D ← [D, A]
  end

The estimation of the number of new observations per cluster in routine est_disp is also a relevant step. This estimation is performed by computing the Ratio of Variance (RoV) of the cluster between the new subspace and the original subspace where the clustering was performed. The RoV of cluster i follows:

$RoV^{(i)} = \frac{var(I^T XX^{(i)} I)}{var(I_{old}^T XX^{(i)} I_{old})}$    (24)

where in the numerator the projection of XX(i) onto the new subspace is obtained and its variance is computed, and the same is done in the denominator for the clustering subspace. In a two-dimensional subspace of a score plot, computing the variance essentially amounts to summing the two eigenvalues of the covariance matrix. Then, the number of new observations is obtained as:

$j = \min\left( \frac{RoV^{(i)}}{th}, \mu_i \right)$    (25)
where th is a user-defined input parameter used to control the splitting. The lower the value of th, the higher the number of observations in D, the output of the approximation algorithm.

Fig. 12(a) illustrates the application of this approach to the cluster depicted in Fig. 9. It should be noted that the cluster used for illustration is the one with the poorest approximation in the case study, so that limitations of the approximation can be discussed. The plot in Fig. 12(a), referred to as a cluster plot, is very useful to validate the approximation of the clusters when necessary, provided the original data are stored. It can be seen that the approximation of the cluster in Fig. 12(a) is not completely satisfactory, since there is a group of original observations towards the bottom-right corner which is not represented by any circle. This illustrates a main limitation of the approximation approach. ADICOV only takes into account the variance–covariance information in the cluster and around the centroid. Therefore, if the cluster is itself clustered in the new subspace, the approximation of some parts will be poor. Another side problem is that if th is low enough, artifacts are introduced in D, as will be discussed afterwards. Still, the approximation is quite representative in most of the cases. In Fig. 12(b), the multiplicities of the clusters of Figs. 3 and 10(c) are displayed against the RoV between the 3rd and 4th PCs subspace (I in Eq. (24)) and the first 2 PCs (I_old in Eq. (24)) in the Perceptive data set. This is a very useful plot to predict poorly approximated clusters. In our experience, those clusters which combine high multiplicity and RoV are more likely to be poorly approximated. This is the case, for instance, of cluster 99 displayed in Fig. 12(a). The multiplicity vs RoV plot is also convenient from a practical point of view. In a Big data domain, the original data may not always be accessible in a practical time scale. The approximation algorithm, as defined, does not store the complete data. Rather, the centroids C, multiplicities μ, and inner covariance matrices XX() of the clusters need to be stored. Without storing the original data, a cluster plot like Fig. 12(a) cannot be computed, but the RoV and multiplicity in Fig. 12(b) can still be obtained.

In Fig. 13(a), the artificial data set generated with ADICOV in the 3rd and 4th PCs subspace from the clustered data in Fig. 3 is shown. Data in Fig. 13(a) are re-clustered and presented in Fig. 13(b). The two-step process takes 0.2 s, which is two orders of magnitude faster than starting the clustering in the 3rd and 4th PCs subspace from scratch. Also, the resulting centroids improve the representativeness of the original distribution in terms of MDFO. Thus, the MDFO = 0.253 of Fig. 10(c) is improved to MDFO = 0.19 in Fig. 13(b). This is also visually appreciable. However, the approximation negatively affects the MDCO, which means that some degree of artifacts is introduced. The MDCO = 1.6 × 10−3 of Fig. 10(c) is increased by one order of magnitude in Fig. 13(b). This is partly the consequence of cluster 99. One approach to reduce the risk of introducing significant artifacts in the visualization is to isolate the outliers in terms of multiplicity against RoV.
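To make the splitting step concrete, the following NumPy sketch computes the RoV of Eq. (24) and the number of points per cluster of Eq. (25), and then generates the artificial points. Since ADICOV itself is described in the Appendix of the paper and is not reproduced here, the sketch uses a plain multivariate normal draw with the cluster covariance as a stand-in for the ADICOV approximation; all function and variable names are illustrative.

```python
import numpy as np

def rov(XXi, I_new, I_old):
    """Ratio of Variance of a cluster between the new and the old subspace (Eq. (24))."""
    var_new = np.trace(I_new.T @ XXi @ I_new)   # sum of variances in the new subspace
    var_old = np.trace(I_old.T @ XXi @ I_old)   # sum of variances in the clustering subspace
    return var_new / var_old

def est_disp(XXi, mu_i, I_new, I_old, th):
    """Eq. (25): number of artificial observations for one centroid."""
    return int(min(rov(XXi, I_new, I_old) / th, mu_i))

def split(C, mu, XX_list, th, I_new, I_old, rng=np.random.default_rng(0)):
    """Split each centroid into artificial points (stand-in for Algorithm 3).

    Instead of ADICOV, points are drawn from a multivariate normal with the
    cluster covariance XX^(i)/mu_i, centered at the centroid c_i.
    """
    D = []
    for ci, mu_i, XXi in zip(C, mu, XX_list):
        j = max(est_disp(XXi, mu_i, I_new, I_old, th), 1)
        A = rng.multivariate_normal(ci, XXi / mu_i, size=j)
        D.append(A)
    return np.vstack(D)
```

The output D would then be re-clustered with the clustering algorithm of Section 4.1 to obtain the approximated CSP.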
Fig. 12. Subfigure (a) shows the cluster plot of one cluster (number 99) of Fig. 3 with multiplicity 52 projected on the PCs 3 vs 4 subspace: original data (+), centroid (□) and artificial data (○). Subfigure (b) shows the multiplicity against the RoV in the 100 clusters. Perceptive data set.
Fig. 13. Approximated compressed score plot (ACSP) of the Perceptive data set in the 3rd and 4th PCs subspace: (a) artificial data generated with ADICOV (th = 0.1) from the clusters in Fig. 3 and (b) centroids re-computed with the clustering algorithm. The ACSP in (b) yields MDFO = 0.19 and MDCO = 0.013 and is computed in 0.2 s.
From Fig. 12(b), cluster number 99 may be identified as an outlier which combines high multiplicity and high RoV. One common approach in EDA is simply to discard outliers, provided that their multiplicity is not very high in comparison with the total data size (52 observations out of the 18,887 total observations for cluster 99). If cluster number 99 is not used in the approximation, the increase of MDCO in the approximation is highly reduced. In Fig. 14, performance results on the Perceptive data set, with and without the outlier removal, are compared for different values of parameter th in Eq. (25). Recall that low values of th lead to higher numbers of intermediate points generated in the approximation algorithm, from which the clustering is recomputed. The lower th, the higher the processing time. Still, the approximation is more than one to two orders of magnitude faster than the clustering algorithm. MDFO values in the approximation are slightly below 0.2, no matter the value of th or whether the outlier is considered in the approximation. However, MDCO values are reduced to a large extent with the outlier removal. This effect was not observed when removing high RoV but low multiplicity centroids (e.g. centroid 100 in Fig. 12(b)) or low RoV but high multiplicity centroids (e.g. centroid 78 in Fig. 12(b)).
To end this case study, in Fig. 15 a pair of bivariate histograms in the 3rd and 4th PCs subspace is shown. The plot in Fig. 15(a) has the same number of points as the Approximated CSP (ACSP) in Fig. 13(b), but in the former several of the points corresponding to different classes lie in the same location. Visualization may be improved by slightly modifying the locations for the different classes. Data fidelity in terms of MDFO and MDCO in the histogram is poorer than in the ACSP, but the computation time is lower in the former. Notice also that the computation time in the ACSP assumes an initial set of centroids is already available. That initial set of centroids can be computed without direct analyst supervision and prior to starting the EDA. To improve the visualization of the histogram, an intuitive approach is to increase the number of bins, as in Fig. 15(b). Unfortunately, the plot may then get blurred and more difficult to interpret.
Fig. 14. Performance measures for different th values in the approximation algorithm on the Perceptive data set. Considering the 100 clusters: (a) elapsed time, (b) Mean Distance to the Furthest Observation and (c) Mean Distance to the Closest Observation. Isolating outlier number 99: (d) elapsed time, (e) Mean Distance to the Furthest Observation and (f) Mean Distance to the Closest Observation. Values of th considered range from 0.001, which generates approximately 10,000 intermediate observations, to 0.2, which generates only 15 new observations.
Fig. 15. Bivariate histogram of the Perceptive data set in the 3rd and 4th PCs subspace: (a) with 100 filled bins and (b) with 318 filled bins. The histogram in (a) yields MDFO = 0.33 and MDCO = 0.099 and is computed in 0.05 s. The histogram in (b) yields MDFO = 0.04 and MDCO = 0.009 and is computed in 0.1 s.
As a summary, Table 1 compares the performance measurements for the CSP, the ACSPs obtained for th = 0.1, with and without the outlier, and the bivariate histogram with 100 points. The results of the second bivariate histogram are not included, since the MDFO and MDCO for very different numbers of observations cannot be compared. The CSP provides the best fidelity results at the expense of a very high computational time. The bivariate histogram yields the opposite results: reduced fidelity and visualization but very fast computation. The ACSPs yield an intermediate case, with adequate fidelity and visualization and reasonable computational time.

6. Other case studies

The same experiment of Fig. 14 is repeated for a character recognition database of handwritten digits. This data set, available from http://yann.lecun.com/exdb/mnist/, consists of a set of different handwritten digits. In this work, a subset of 9 digits (from 1 to 9) has been selected, each digit with a sample of 1000 observations. The digits have been size-normalized and centered in a fixed-size image of 28 × 28 pixels. Original images were centered by computing the center of mass of the pixels, and the digits were relocated according to the bounding box of the digit in the image. The resulting post-processed database consists of 9000 observations (images) grouped in 9 different classes (digits 1 to 9), with 1000 observations each. Each observation contains 784 variables, corresponding to the 28 × 28 grid of pixels in grey scale with values from 0 to 255.

The clustering algorithm was applied on the PCA subspace corresponding to the first 2 PCs. Three hundred centroids were obtained and then projected onto the PLS-DA subspace. Approximations and clustering in the PLS-DA subspace were also computed, and the results in terms of performance are shown in Fig. 16. Slightly different conclusions than those for the Perceptive data set can be extracted in this case, in which the number of variables is much higher. In this case, the computational time of the approximations is at most one order of magnitude lower than that of the clustering, since the number of original observations (9000) is not truly large. Also, the approximations improve the MDFO to values close to that of the clustering performed in the PLS-DA subspace.
Table 1
Performance comparison: Perceptive data set in the 3rd and 4th PCs subspace. The CSP in the first row was directly computed in that subspace.

Plot                   # obs.   MDFO   MDCO    Elapsed time (s)
CSP                    100      0.10   0.002   16.3
ACSP (100) th = 0.1    100      0.19   0.013   0.2
ACSP (99) th = 0.1     99       0.20   0.007   0.16
Biv. H.                100      0.33   0.099   0.05
Also, the approximations improve the MDFO to values close to that of the clustering performed in the PLS-DA subspace. As the number of intermediate observations is increased, the MDFO is lowered at the expense of an increase in the MDCO.

Fig. 17 visually compares the score plot of the original observations with the CSP computed in PCA, the ACSP for th = 1 and a bivariate histogram with 300 filled bins, all of them in the PLS-DA subspace. The CSP computed in PCA is not a good representation of the original data distribution, since several areas of specific classes are not correctly covered. After the approximation, a representative ACSP is obtained in less than 8 s. The bivariate histogram hides some of the classes (e.g. black circles). Table 2 compares the performance measurements for the CSP from the centroids computed in the PLS-DA subspace, the ACSP obtained for th = 1 and the bivariate histogram with 300 points. Again, the CSP provides the best fidelity results at the expense of a very high computational time. The bivariate histogram yields good results in terms of MDFO and computational time. The ACSP yields a better MDCO, but worse results in the other two performance measures.

A third experiment was performed on a much larger data set, generated from the 1998 DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Labs [65,66]. The objective of this program was to survey and evaluate research in network intrusion detection. For that purpose, a large data set including a wide variety of intrusions simulated in a military network environment was provided. The original data set includes 4,880,000 observations belonging to 22 different classes: one class for normal traffic and the remaining ones for different types of network attacks. For illustrative purposes, the analysis is restricted to two types of attacks, smurf and neptune, and to normal traffic. These three classes represent 99.3% of the total traffic in the data set. For each connection, 42 variables are computed, including numerical and categorical variables. To consider categorical variables in the EDA, one dummy variable per category is included in the data set. The resulting data set contains 4,844,253 observations, each with 122 variables. This data set, although not huge, can already be considered a Big data set; the other two data sets were illustrative examples and not the real target of this paper.

The EDA is aimed at detecting differences between normal traffic and attacks. For that, a PCA model is built and the CSP for the first 2 PCs is obtained with 100 centroids. This takes 1 h and 20 min if the original data are not stored, or almost 9 h if they are. Clearly, this computational time is too high to perform an interactive EDA involving many model recomputations. After having a look at the PCA space, the analyst wishes to inspect the PLS-DA space. Once again, there are several choices: (i) the clustering can be recomputed, waiting at least one hour for the result; (ii) the PCA clustering can be used to visualize the CSP in the PLS-DA subspace, which may not be representative of the real data distribution in that subspace; or (iii) the ADICOV approximation can be performed, yielding an intermediate case. The performance results of these choices are compared in Fig. 18.
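The dummy-variable encoding of the categorical connection features mentioned above can be sketched as follows (a toy example; the feature names shown are illustrative and the actual preprocessing of the KDD data may differ):

    import pandas as pd

    # Toy frame mimicking the mix of numerical and categorical connection features
    df = pd.DataFrame({
        "duration": [0, 2, 0],
        "src_bytes": [181, 105, 1032],
        "protocol_type": ["tcp", "udp", "icmp"],   # categorical
        "service": ["http", "domain_u", "ecr_i"],  # categorical
    })

    # One dummy (0/1) column per observed category of each categorical variable
    X = pd.get_dummies(df, columns=["protocol_type", "service"])
    print(X.columns.tolist())   # numerical columns plus one column per category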
Fig. 16. Performance measures for different th values in the approximation algorithm on the Character data set: (a) elapsed time, (b) Mean Distance to the Furthest Observation and (c) Mean Distance to the Closest Observation. Values of th considered range from 0.1, which generates approximately 4000 intermediate observations, to 11, which generates only 19. The elapsed-time panel distinguishes between storing and not storing the original data, and the horizontal axes include, besides the th values, the clusterings computed directly in the PLS-DA (Cpls) and PCA (Cpca) subspaces.
Fig. 17. Score plots of the Character data set in the PLS-DA subspace: (a) original data, (b) CSP from the centroids computed in the PCA subspace, (c) ACSP (th = 1) and (d) bivariate histogram for 300 filled bins. The CSP in (b) yields MDFO = 3.18 and MDCO = 0.7 and is computed in 48 s. The ACSP in (c) yields MDFO = 1.88 and MDCO = 0.18 and is computed in 7.9 s from the clustering of (b). The histogram in (d) yields MDFO = 1.48 and MDCO = 0.30 and is computed in 0.15 s.
Since the data set has a much higher volume than those of the previous experiments, a much larger difference in computation time is now observed. The elapsed time for any of the ACSPs considered for the different th values is below 15 s, and most of them take less than 1 s. These values are comparable to the computation time of a model/score plot in low-size data sets. Also, as in the previous experiments, the MDCO grows and the MDFO decreases as the number of intermediate observations grows.

Fig. 19 visually compares the CSPs computed for the three aforementioned choices and the bivariate histogram with 100 points (note that, in this case and due to limited computer resources, the score plot of the original data cannot be visualized). A very good approximation of Fig. 19(a) is obtained with the proposed approach in the ACSP of Fig. 19(c), and in only 14 s. The bivariate histogram in Fig. 19(d) bears less fidelity to Fig. 19(a). Also, in this case study the computation of the bivariate histogram takes an elapsed time similar to that of the ACSP. This may be a consequence of the iterative out-of-core computation needed for the histogram. The reported elapsed times do not include disk accesses; if they were included, the ACSP would be more computationally efficient than the histogram, since the ACSP computation does not need to access the disk. Performance results are summarized in Table 3.
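The out-of-core computation referred to above can be sketched as follows: blocks of observations are read one at a time, projected onto the two latent variables of interest, and accumulated into a fixed grid of bins, so that the full score matrix never resides in memory. This is an assumed scheme, not the code behind the reported timings; chunks, R, mean and the bin edges are hypothetical inputs.

    import numpy as np

    def streamed_histogram(chunks, R, mean, edges_x, edges_y):
        # chunks : iterable of (n_i x M) blocks read from disk one at a time
        # R      : (M x 2) projection matrix onto the two latent variables of interest
        # mean   : (M,) vector used to centre each block
        # Scores falling outside the fixed grid are simply ignored.
        counts = np.zeros((len(edges_x) - 1, len(edges_y) - 1))
        for block in chunks:
            t = (block - mean) @ R                   # scores of this block only
            h, _, _ = np.histogram2d(t[:, 0], t[:, 1], bins=[edges_x, edges_y])
            counts += h
        return counts

    # Illustrative use with synthetic blocks standing in for disk reads
    rng = np.random.default_rng(0)
    chunks = (rng.standard_normal((100000, 122)) for _ in range(10))
    R = rng.standard_normal((122, 2))
    mean = np.zeros(122)
    edges = np.linspace(-40, 40, 51)
    counts = streamed_histogram(chunks, R, mean, edges, edges)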
Table 2
Performance comparison: Character data set in the PLS-DA subspace. The CSP in the first row was directly computed in that subspace.

Plot             # obs.   MDFO   MDCO    Elapsed time (s)
CSP              300      1.13   0.047   47.0
ACSP (th = 1)    300      1.88   0.184   7.9
Biv. H.          300      1.48   0.302   0.15
7. Conclusion

The application of Exploratory Data Analysis (EDA) to Big data problems using projection models has been very limited. This is because visualization tools for user supervision are scarce in domains with millions of observations. There is a clear interest in extending the EDA approach based on projection models for its efficient application to very large data sets. This paper introduces a methodology that meets the interactive nature of EDA for visualizing data sets with an unlimited number of observations, in which the analyst jumps from one visualization to another in a timely manner, guided by what the data show. By combining the inherent capability of projection models to handle large numbers of variables with the methods introduced here to deal with a large, potentially unlimited, number of observations, the proposed framework is broadly applicable.

A first step in the approach, and only this first step, is computationally intensive. Data are clustered in a subspace of interest in order to make their visualization in a single score plot possible. The visualization of the cluster centroids in the subspace is called a Compressed Score Plot (CSP). Once the first visualization is obtained, it may need to be updated. A distinction is made between smooth, continuous updates and sharp updates. Smooth updates due to incoming streams of data can be handled using an Exponentially Weighted Moving Average (EWMA) approach (a minimal sketch of such an update is given after the list of challenges below). Sharp updates are more difficult to handle: the analyst may decide to remove centroids, focus her attention on a subset of observations, perform variable selection, etc. Any of these actions leads to a sharp change in the subspace of interest, which in turn reduces the representativeness of the original clustering in the CSP. At this point, it is argued that it is not realistic to repeat a computationally intensive operation like clustering for each update in the analysis; otherwise, the EDA would not be truly interactive and useful. An alternative based on the approximation of the data from covariance matrices is proposed, together with a consistent methodology to compute and evaluate the approximations. The proposal makes the user-supervised visualization and analysis of a huge data set affordable, while avoiding an overwhelming amount of work for the user.

There are a number of limitations and research challenges that should be addressed to improve the proposed methodology. Some of them are listed in the following:

• The visualization of CSPs, in particular how to present the multiplicity of the clusters, should be studied in detail. Alternatives to the proposals presented may include the visualization of the variability information of each cluster, or a procedure to inspect its Ratio of Variance (RoV) when the CSP is approximated. Also, alternatives to reduce clutter in CSPs with different classes should be addressed.
• Parallelization of the computations should be addressed for data sets of huge volume or input velocity, and new designs based on current parallelization approaches, such as MapReduce [9], should be defined.
• Fast clustering algorithms are already available in the literature for their application to Big data [67]. Their application should be assessed in terms of computational efficiency and data representativeness.
• Alternative approximation algorithms may be defined. One possibility is to define several reference points for the covariance-based approximation, not just the centroid; in that solution, a main decision is where to locate those reference points. Another alternative is to define approximations based on parameters other than covariance matrices.
• Bivariate histograms may be employed as a first step to compute an approximated CSP.
• Alternative procedures to detect and handle outliers may be studied, possibly defining measures different from the RoV.
• The proposed procedure may be extended to other projection models.
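As an illustration of the smooth EWMA update mentioned above, the following sketch updates the centroid and cross-product matrix of the cluster closest to each incoming observation. The update rule and the forgetting factor lam are assumptions made for illustration; they are not necessarily the exact formulation used in the paper.

    import numpy as np

    def ewma_update(centroids, crossprod, x, lam=0.99):
        # centroids : (K x M) current cluster centroids
        # crossprod : (K x M x M) per-cluster cross-product matrices
        # lam       : forgetting factor in (0, 1); values close to 1 forget slowly
        k = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))   # nearest centroid
        centroids[k] = lam * centroids[k] + (1 - lam) * x
        d = (x - centroids[k])[:, None]
        crossprod[k] = lam * crossprod[k] + (1 - lam) * (d @ d.T)
        return k   # (a full implementation would also update the cluster multiplicity)

    # Illustrative use on a synthetic stream of observations
    rng = np.random.default_rng(0)
    K, M = 100, 122
    centroids = rng.standard_normal((K, M))
    crossprod = np.stack([np.eye(M)] * K)
    for x in rng.standard_normal((1000, M)):
        ewma_update(centroids, crossprod, x)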
Conflict of interest

None.

Acknowledgments

Research in this paper was partially supported by the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grants TEC2011-22579 and CEI BioTIC GENIL (CEB090010). Perceptive Engineering LTD (http://www.perceptiveapc.com/) is acknowledged for providing the Perceptive data set. Dr. Pablo Padilla, Associate Professor at the University of Granada, is gratefully acknowledged for the pre-processing of the Character data set and his support in the development of some of the figures. Dr. Jesús Díaz-Verdejo, Professor at the University of Granada, and the anonymous reviewers are gratefully acknowledged for their comments on the paper.

Appendix A. The ADICOV algorithm

ADICOV computes the approximation matrix A as the least squares approximation of L constrained to the covariance Cx. The approximation is performed in a given subspace. Matrix I performs the linear transformation from the original space of the variables to the subspace of interest, and O performs the linear transformation from the subspace back to the original space of the variables. In the context of this paper, L is a random matrix and Cx is the cross-product matrix XX(i) corresponding to a given cluster, computed around its centroid. Also, if the subspace of interest is the PCA subspace, with P the loading matrix [15], then I and O are set to P. If the subspace of interest is the PLS subspace, with P the loading matrix of the x-block and W the weights matrix [17], then I = W·(PᵀW)⁻¹ and O = P.
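As a small numeric illustration of these transformations (a sketch with randomly generated loadings and weights standing in for a fitted model; it is not tied to any of the data sets above):

    import numpy as np

    rng = np.random.default_rng(0)
    M, A = 122, 2                                   # number of variables and of latent variables

    # PCA subspace: I and O are both the loading matrix P
    P = np.linalg.qr(rng.standard_normal((M, A)))[0]
    I_pca, O_pca = P, P

    # PLS subspace: I = W (P^T W)^-1 and O = P, with W the weight matrix of the x-block
    W = np.linalg.qr(rng.standard_normal((M, A)))[0]
    P_pls = rng.standard_normal((M, A))
    I_pls = W @ np.linalg.inv(P_pls.T @ W)
    O_pls = P_pls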
Fig. 18. Performance measures for different th values in the approximation algorithm on the KDD data set: (a) elapsed time, (b) Mean Distance to the Furthest Observation and (c) Mean Distance to the Closest Observation. Values of th considered range from 0.01, which generates approximately 13600 intermediate observations, to 20, which generates only 18. The elapsed-time panel distinguishes between storing and not storing the original data, and the horizontal axes include, besides the th values, the clusterings computed directly in the PLS-DA (Cpls) and PCA (Cpca) subspaces.
Fig. 19. Compressed score plot of the KDD data set in the PLS-DA subspace: (a) clustering performed on the PLS-DA subspace, (b) clustering performed on the PCA subspace, (c) ACSP (th = 0.01) and (d) bivariate histogram for 100 filled bins. The CSP in (a) yields MDFO = 0.25, MDCO = 4.2 × 10⁻³ and is computed in 1 h and 5 min. The CSP in (b) yields MDFO = 0.51, MDCO = 5.0 × 10⁻³ and is computed in 1 h and 20 min. The ACSP in (c) yields MDFO = 0.34, MDCO = 93.1 × 10⁻³ and is computed in 14 s. The histogram in (d) yields MDFO = 0.58 and MDCO = 83.0 × 10⁻³ and is computed in 0.16 s.
The ADICOV algorithm is as follows:

Inputs:
L: current data set.
Cx: covariance (cross-product) matrix.
Nx: number of observations from which Cx was computed.
I, O: subspace of interest.

Algorithm:
Nl ← countObservations(L)
Gx = Iᵀ·Cx·I
Tl = L·I
[Vx, Dx] ← ED(Gx)
Sx ← elementwiseSqrt(Dx)
Sa = sqrt((Nl − 1)/(Nx − 1))·Sx
M = Tl·Sa·Vxᵀ
[Um, Sm, Vm] ← SVD(M)
Ua = Um·Vmᵀ
Ta = Ua·Sa·Vxᵀ
A = Ta·Oᵀ

Please refer to [54] for more details on the ADICOV algorithm.
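The following is a minimal numpy transcription of the steps listed above, intended only as an illustrative sketch; refer to [54] for the definitive formulation. The random cluster data and the choice of a PCA subspace are assumptions made for the example.

    import numpy as np

    def adicov(L, Cx, Nx, I, O):
        # Approximate L by a matrix A whose cross-product in the subspace defined by I and O
        # reproduces Cx (rescaled from Nx observations to the number of rows of L).
        Nl = L.shape[0]
        Gx = I.T @ Cx @ I                            # cross-product projected onto the subspace
        Tl = L @ I                                   # scores of the current data set
        Dx, Vx = np.linalg.eigh(Gx)                  # eigendecomposition ED(Gx)
        Sx = np.sqrt(np.clip(Dx, 0, None))           # guard against tiny negative eigenvalues
        Sa = np.sqrt((Nl - 1) / (Nx - 1)) * Sx       # rescale to the target number of observations
        M = Tl @ np.diag(Sa) @ Vx.T
        Um, Sm, Vm_t = np.linalg.svd(M, full_matrices=False)
        Ua = Um @ Vm_t
        Ta = Ua @ np.diag(Sa) @ Vx.T                 # scores with the imposed covariance structure
        return Ta @ O.T                              # back to the original variable space

    # Illustrative call with random data standing in for one cluster
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20))               # observations behind the cluster
    Cx, Nx = X.T @ X, X.shape[0]                     # cross-product matrix and its observation count
    P = np.linalg.svd(X, full_matrices=False)[2][:2].T   # loadings of the first 2 PCs
    L = rng.standard_normal((50, 20))                # random matrix to be approximated
    A = adicov(L, Cx, Nx, P, P)                      # PCA case: I = O = P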
Table 3
Performance comparison: KDD data set in the PLS-DA subspace. The CSP in the first row was directly computed in that subspace.

Plot       # obs.   MDFO   MDCO    Elapsed time
CSP        100      0.25   0.004   1 h, 5 min
ACSP       100      0.34   0.093   14.3 s
Biv. H.    100      0.58   0.083   16.4 s

References
[1] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook.
[2] I. Jolliffe, Principal Component Analysis, Springer Series in Statistics, Springer-Verlag, 2002.
[3] J. Han, M. Kamber, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Elsevier, 2006.
[4] G. Keren, C. Lewis, A Handbook for Data Analysis in the Behavioral Sciences: Statistical Issues, L. Erlbaum, 1993.
[5] D. Ruppert, Statistics and Data Analysis for Financial Engineering, Springer Texts in Statistics, Springer, 2010.
[6] T. Yu, An exploratory data analysis method to reveal modular latent structures in high-throughput data, BMC Bioinforma. 11 (2010) 440.
[7] Y.Y. Teo, Exploratory data analysis in large-scale genetic studies, Biostatistics 11 (2010) 70–81.
[8] L.M. Sangalli, P. Secchi, S. Vantini, A. Veneziani, A case study in exploratory functional data analysis: geometrical features of the internal carotid artery, J. Am. Stat. Assoc. 104 (2009) 485.
[9] T. White, Hadoop: The Definitive Guide, 1st edition, O'Reilly Media, Inc., 2009.
[10] V.G. Grishin, A.S. Sula, M. Ulieru, Pictorial analysis: a multi-resolution data visualization approach for monitoring and diagnosis of complex systems, Inf. Sci. 152 (2003) 1–24.
[11] P. Loslever, Obtaining information from time data statistical analysis in human component system studies. (I). Methods and performances, Inf. Sci. 132 (1–4) (2001) 133–156.
[12] J.T. Behrens, C.-H. Yu, Exploratory Data Analysis, Handbook of Psychology, vol. 2, John Wiley & Sons, Hoboken, NJ, 2003.
[13] M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, P. Tufano, Analytics: the real-world use of big data, IBM Institute for Business Value, Executive Report, 2012.
[14] K. Pearson, On lines and planes of closest fit to systems of points in space, Philos. Mag. 2 (6) (1901) 559–572.
[15] J. Jackson, A User's Guide to Principal Components, Wiley Series in Probability and Mathematical Statistics, Wiley-Interscience, 2003.
[16] H. Wold, E. Lyttkens, Nonlinear iterative partial least squares (NIPALS) estimation procedures, Bull. Intern. Statist. Inst. Proc., 37th session, London, 1969, pp. 1–15.
[17] P. Geladi, B. Kowalski, Partial least-squares regression: a tutorial, Anal. Chim. Acta 185 (1986) 1–17.
[18] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemom. Intell. Lab. Syst. 58 (2001) 109–130.
[19] J. Camacho, Missing-data theory in the context of exploratory data analysis, Chemom. Intell. Lab. Syst. 103 (2010) 8–18.
[20] K. Gabriel, The biplot graphic display of matrices with application to principal component analysis, Biometrika 58 (1971) 453–467.
[21] J. Westerhuis, S. Gurden, A. Smilde, Generalized contribution plots in multivariate statistical process monitoring, Chemom. Intell. Lab. Syst. 51 (2000) 95–114.
[22] J. Camacho, J. Picó, A. Ferrer, Data understanding with PCA: structural and variance information plots, Chemom. Intell. Lab. Syst. 100 (1) (2010) 48–56.
[23] M. Berry, Large scale sparse singular value computations, Int. J. Supercomput. Appl. 6 (1992) 13–49.
[24] J. Romero, E. Plotkin, On-line SVD-based iterative method for signal recovery from nonuniform samples, Electron. Lett. 32 (1) (1996) 20–21.
[25] C. Dhanjal, S. Gunn, J. Shawe-Taylor, Efficient sparse kernel feature extraction based on partial least squares, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 1347–1361.
[26] S. Wold, Cross-validatory estimation of the number of components in factor and principal components, Technometrics 20 (4) (1978) 397–405.
[27] J. Camacho, A. Ferrer, Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects, J. Chemom. 26 (7) (2012) 361–373.
[28] J. Camacho, A. Ferrer, Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: practical aspects, Chemom. Intell. Lab. Syst. 131 (2014) 37–50.
[29] M. Barker, W. Rayens, Partial least squares for discrimination, J. Chemom. 17 (2003) 166–173.
[30] E. Rabani, S. Toled, Out-of-core SVD and QR decompositions, Proceedings of the 10th SIAM Conference on Parallel Processing for Scientific Computing, 2001.
[31] N. Halko, P.-G. Martinsson, Y. Shkolnisky, M. Tygert, An algorithm for the principal component analysis of large data sets, arXiv:1007.5510, 2010.
[32] F. Vogt, M. Tacke, Fast principal component analysis of large data sets, Chemom. Intell. Lab. Syst. 59 (1–2) (2001) 1–18.
[33] F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS, J. Chemom. 7 (1993) 45–59.
[34] S. de Jong, C. ter Braak, Comments on the PLS kernel algorithm, J. Chemom. 8 (1994) 169–174.
[35] B. Dayal, J. MacGregor, Improved PLS algorithms, J. Chemom. 11 (1997) 73–85.
[36] C. Ordon, N. Mohanam, C. Garcia-Alvarado, PCA for large data sets with parallel data summarization, Distrib. Parallel Databases (2013) 1–27.
[37] N. Kettaneh, A. Berglund, S. Wold, PCA and PLS with very large data sets, Comput. Stat. Data Anal. 48 (1) (2005) 69–85.
[38] A. Balsubramani, S. Dasgupta, Y. Freund, The fast convergence of incremental PCA, in: C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, 26, 2013, pp. 3174–3182.
[39] B.S. Dayal, J.F. Macgregor, Recursive exponentially weighted PLS and its applications to adaptive control and prediction, J. Process Control 7 (3) (1997) 169–179.
[40] S. Qin, Recursive PLS algorithms for adaptive data modeling, Comput. Chem. Eng. 22 (1998) 503–514.
[41] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, Inc., New York, NY, USA, 1986.
[42] P. Nelson, P. Taylor, J. MacGregor, Missing data methods in PCA and PLS: score calculations with incomplete observations, Chemom. Intell. Lab. Syst. 35 (1996) 45–65.
[43] D. Andrews, P. Wentzell, Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer, Anal. Chim. Acta 350 (1997) 341–352.
[44] B. Walczak, D. Massart, Dealing with missing data: Part I, Chemom. Intell. Lab. Syst. 58 (2001) 15–27.
[45] F. Arteaga, A. Ferrer, Dealing with missing data in MSPC: several methods, different interpretations, some examples, J. Chemom. 16 (2002) 408–418.
[46] F. Arteaga, A. Ferrer, Framework for regression-based missing data imputation methods in on-line MSPC, J. Chemom. 19 (2005) 439–447.
[47] F. Arteaga, A. Ferrer, New proposals for PCA model building with missing data, 11th International Conference on Chemometrics for Analytical Chemistry, vol. 1, 2008, p. 87.
[48] J. Camacho, Observation-based missing data methods for exploratory data analysis to unveil the connection between observations and variables in latent subspace models, J. Chemom. 25 (11) (2011) 592–600.
[49] F. Lindgren, B. Hansen, W. Karcher, M. Sjöström, L. Eriksson, Model validation by permutation tests: applications to variable selection, J. Chemom. 10 (1996) 521–532.
[50] S. Wiklund, D. Nilsson, L. Eriksson, M. Sjöström, S. Wold, K. Faber, A randomization test for PLS component selection, J. Chemom. 21 (2007) 427–439.
[51] N. Faber, R. Rajkó, How to avoid over-fitting in multivariate calibration—the conventional validation approach and an alternative, Anal. Chim. Acta 595 (2007) 98–106.
[52] A. Lazraq, R. Cléroux, J. Gauchi, Selecting both latent and explanatory variables in the PLS1 regression model, Chemom. Intell. Lab. Syst. 66 (2003) 117–126.
[53] J. Gauchi, P. Chagnon, Comparison of selection methods of explanatory variables in PLS regression with application to manufacturing process data, Chemom. Intell. Lab. Syst. 58 (2001) 171–193.
[54] J. Camacho, P. Padilla, J. Díaz-Verdejo, Least-squares approximation of a space distribution for a given covariance and latent sub-space, Chemom. Intell. Lab. Syst. 105 (2) (2011) 171–180.
[55] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (3) (1999) 264–323.
[56] J. Grabmeier, A. Rudolph, Techniques of cluster algorithms in data mining, Data Min. Knowl. Disc. 6 (2002) 303–360.
[57] P.H.C. Eilers, J.J. Goeman, Enhancing scatterplots with smoothed densities, Bioinformatics 20 (5) (2004) 623–628.
[58] M.C. Hao, U. Dayal, R.K. Sharma, D.A. Keim, H. Janetzko, Visual analytics of large multidimensional data using variable binned scatter plots, in: J. Park, M.C. Hao, P.C. Wong, C. Chen (Eds.), VDA, SPIE Proceedings, vol. 7530, SPIE, 2010.
[59] N. Lewin-Koh, Hexagon binning: an overview, Technical Report, 2011. URL http://cran.r-project.org/web/packages/hexbin/vignettes/hexagon-binning.pdf.
[60] G. Ellis, A. Dix, A taxonomy of clutter reduction for information visualization, IEEE Trans. Vis. Comput. Graph. 13 (2007) 1216–1223.
[61] A.K. Jain, Data clustering: 50 years beyond k-means, Pattern Recogn. Lett. 31 (8) (2010) 651–666 (award winning papers from the 19th International Conference on Pattern Recognition (ICPR)).
[62] P.S. Bradley, U.M. Fayyad, C. Reina, Scaling clustering algorithms to large databases, Knowledge Discovery and Data Mining, 1998, pp. 9–15.
[63] N. Yau, Data Points: Visualization That Means Something, 1st edition, Wiley, 2013.
[64] W.L. Martinez, Exploratory Data Analysis with MATLAB (Computer Science and Data Analysis), Chapman & Hall/CRC, 2004.
[65] KDD Cup 1999 data set, The UCI KDD Archive, Information and Computer Science, University of California, 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[66] H. Kayacik, A.N. Zincir-Heywood, M.I. Heywood, Selecting features for intrusion detection: a feature relevance analysis on KDD 99 intrusion detection datasets, Citeseer (2005) 3–8.
[67] T. Zhang, R. Ramakrishnan, M. Livny, Birch: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96, ACM, New York, NY, USA, 1996, pp. 103–114.