An effective discretization method for disposing high-dimensional data

Information Sciences 270 (2014) 73–91

Yu Sang a, Heng Qi a, Keqiu Li a,*, Yingwei Jin b, Deqin Yan c, Shusheng Gao d

a School of Computer Science and Technology, Dalian University of Technology, No. 2, Linggong Road, Dalian 116023, China
b School of Management, Dalian University of Technology, No. 2, Linggong Road, Dalian 116024, China
c School of Computer and Information Technology, Liaoning Normal University, No. 850, Huanghe Road, Dalian 116029, China
d Institute of Computing Technology, Research Institute of Exploration and Development, Liaohe Oilfield, PetroChina, No. 98, Oil Street, Panjin 124010, China

Article info

Article history: Received 15 September 2012; Received in revised form 20 November 2013; Accepted 15 February 2014; Available online 28 February 2014

Keywords: Feature discretization; High-dimensional data; Dimension reduction; Local Linear Embedding (LLE)

Abstract

Feature discretization is an extremely important preprocessing task for classification in data mining and machine learning, as many classification methods require that each dimension of the training dataset contains only discrete values. Most discretization methods concentrate on discretizing low-dimensional data. In this paper, we focus on discretizing high-dimensional data, which frequently present nonlinear structures. Firstly, we present a novel supervised dimension reduction algorithm that maps high-dimensional data into a low-dimensional space while preserving the intrinsic correlation structure of the original data. This algorithm overcomes the deficiency that the geometric topology of the data is easily distorted when mapping data that present an uneven distribution in high-dimensional space. To the best of our knowledge, this is the first approach that solves high-dimensional nonlinear data discretization with a dimension reduction technique. Secondly, we propose a supervised area-based chi-square discretization algorithm to effectively discretize each continuous dimension in the low-dimensional space. This algorithm overcomes the deficiency that existing methods do not consider the possibility of each interval pair being merged from the view of probability. Finally, we conduct experiments to evaluate the performance of the proposed method. The results show that our method achieves higher classification accuracy and yields a more concise knowledge of the data than existing discretization methods, especially for high-dimensional datasets. In addition, our discretization method has also been successfully applied to computer vision and image classification.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Discretization is one of the preprocessing methods used frequently in data mining, machine learning and knowledge discovery [50,12,20]. It slices the value domain of each continuous dimension into a finite number of intervals, each associated with a discrete value. The significance of discretization derives from the interest in extending classification methods that operate on discretized data, such as decision trees, rough set theory, and Bayesian classifiers or Bayesian networks, to continuous dimensions. Discretization can also facilitate the interpretation of the obtained results and improve the accuracy of classification tasks [29,48]. It has been used as a necessary preprocessing step in applications such as medical diagnosis [34,38].

* Corresponding author. Tel.: +86 411 84709242. E-mail address: [email protected] (K. Li).


Existing discretization methods, such as chi2-based heuristic algorithms [24,30,44,43,8], class-attribute interdependency algorithms [11,26,46,31], entropy-based methods [16,25] and correlation-based discretization methods [3,10,32], are proposed to find a good partition of each continuous dimension of a dataset. Presently, there are some works addressing the discretization of high-dimensional data. Entropy-based methods perform well on high-dimensional data regarding both the discretization intervals and classification accuracy. Ferreira et al. [17] proposed an incremental supervised feature discretization technique based on recursive bit allocation, which achieves the highest mutual information with the class label after discretization. Mehta et al. [32] proposed a correlation preserving discretization algorithm based on principal component analysis (PCA), which can discretize high-dimensional data by considering the correlation structure of the data. However, high-dimensional data in the real world frequently present nonlinear structures; therefore, it is still a challenging task to study more efficient discretization methods for nonlinear high-dimensional data.

In this paper, we present a novel high-dimensional data discretization method. The main contributions of this paper are summarized as follows:

1. We present a supervised dimension reduction algorithm for data discretization, which effectively maps high-dimensional data into a lower intrinsic dimensional space. This method keeps the intrinsic correlation structure of the original data and overcomes the deficiency that the geometric topology of the data is easily distorted when mapping data that present an uneven distribution in high-dimensional space.
2. We propose a supervised area-based discretization algorithm to effectively discretize each continuous dimension in the low-dimensional space. This algorithm overcomes the deficiency of chi2-based methods, which use the change of chi-square as the merging criterion without considering the possibility of each interval pair being merged from the view of probability.
3. We conduct experiments on real-world and synthetic datasets to evaluate the performance of the proposed method by comparison with several popular discretization methods. The experimental results show that the proposed method outperforms existing methods on the performance metrics considered. Furthermore, we also apply our proposed method to computer vision and image classification.

The remainder of this paper is organized as follows. We introduce related work in Section 2. We present our proposed method in Section 3. Experiments and performance evaluation are presented in Section 4. Finally, we summarize our work and conclude this paper in Section 5.

2. Related work

Existing discretization methods mainly focus on disposing low-dimensional data. Liu et al. [29] and Tsai et al. [46] present taxonomies of discretization methods along several main axes: bottom-up vs. top-down, supervised vs. unsupervised [15], and so on. Top-down methods, such as class-attribute interdependency discretization, start from the initial interval and recursively split it into smaller intervals. Bottom-up methods, such as chi2-based discretization, begin with the set of single-value intervals and iteratively merge adjacent intervals.

In the unsupervised methods, continuous ranges are divided into subranges by a user-specified width (range of values) or frequency (number of instances in each interval). There are not many unsupervised methods available in the literature, which may be attributed to the fact that discretization is commonly associated with the classification task. Unsupervised methods use no class information; examples are equal-width and equal-frequency [15], kernel density estimation (KDE) [6] and tree-based density estimation (TDE) [42]. The equal-width and equal-frequency methods can be implemented at a low computational cost, and the EFB method [49] with naive Bayes (NB) classification produces good results. The KDE and TDE methods are state-of-the-art unsupervised top-down methods, which use density estimators to select the best cut-points and automatically adapt subintervals to the data; they determine the number of discretized intervals by the cross-validated log-likelihood.

Supervised methods use the class information attached to each feature value and are much more sophisticated; examples are the entropy-based discretization methods, the class-attribute interdependency methods, and the chi2-based heuristic algorithms. The entropy-based method proposed by Fayyad and Irani [16] recursively selects the cut-points on each target feature to minimize the overall entropy and determines the appropriate number of intervals by using the minimum description length principle (MDLP). Kononenko's method [25], a variant of Fayyad and Irani's method, analyzes the biases of eleven measures for estimating the quality of multi-valued features. The values of these measures tend to increase linearly with the number of values of a feature, whereas Kononenko introduces a function based on the MDL principle whose value slightly decreases with an increasing number of feature values. Class-attribute interdependency methods are distinguished top-down supervised discretization methods whose objective is to maximize the interdependency between the class and the continuous-valued feature and to generate a possibly minimal number of discrete intervals. The chi2-based methods are well-known bottom-up supervised discretization methods based on statistical independence: the chi-square statistic is used to determine whether the current interval pair is to be merged or not. These methods trade off the number of intervals against the number of inconsistent instances and control the process of discretization by introducing inconsistency, with the aim of controlling the degree of misclassification.


Recently, many researchers have focused on producing new discretization methods, e.g., interval distance-based discretization (IDD) [40], data discretization unification (DDU) [22], Bayes optimal discretization (MODL) [9], semi-supervised discretization (SSD) [7], refining discretization [2], M&S discretization [41], and a survey [37] of discretization techniques. IDD, which is neither a bottom-up nor a top-down method, can be used with any number of output variable values; it is based on interval distances and a so-called D-neighborhood. It considers the order of the output variable and makes the class distributions of two contiguous intervals as different as possible, while the classes within the same interval can be distributed with a broad deviation. Jin et al. present a recent unification of data discretization (DDU). They prove that discretization methods based on information theory and statistical independence are approximately equivalent, and derive a parameterized goodness function that unifies six discretization criteria, providing a flexible framework to access a potentially infinite family of goodness functions. MODL is another recent discretization method: it builds an optimal criterion based on a Bayesian model and proposes three algorithms to find the optimal criterion. Beyond the supervised and unsupervised methods, Bondu et al. [7] developed a semi-supervised method. It is based on the MODL framework and discretizes the numerical domain of a continuous input variable while keeping the information relative to the prediction of classes. Refining discretization introduces a method, called the n-procedure, that constructs classical partitions on the range of a feature taking continuous values; these partitions can be seen as refinements of those given by an expert or by a standard discretization method. M&S discretization presents a combined univariate and multivariate merging criterion by using the minimum description length principle (MDLP) and developing a measure of the significance of an interval pair among features. A stopping criterion is proposed to control the degree of misclassification while maximizing the merging accuracy, and a heuristic algorithm is developed to find the best discretization based on the merging and stopping criteria. Garcia et al. [37] present a survey of discretization techniques, with a taxonomy and an empirical analysis in supervised learning. They develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all methods known to date.

All of the aforementioned methods discretize continuous features individually. As a result, they cannot generate optimal intervals for all involved continuous features in multivariate cases, as they do not take the correlation structure of the data into account. Therefore, correlation-based discretization methods [10,3,32] are more concerned with multivariate discretization. However, they offer no insight into the intrinsic geometry of high-dimensional data, which frequently present nonlinear structures. To handle nonlinear high-dimensional data well, we rely on a special nonlinear dimension reduction technique for discretization in this paper.

At present, dimensionality reduction based on spectral analysis plays an important role in many areas. It is the process of transforming measurements from a high-dimensional space to a low-dimensional subspace through spectral analysis of specially constructed matrices.
Representative spectral analysis-based dimensionality reduction algorithms can be classified into two groups: (1) conventional linear dimensionality reduction algorithms; and (2) manifold learning-based nonlinear algorithms. Representative conventional linear algorithms include principal component analysis (PCA) [23] and linear discriminant analysis (LDA) [18]. Representative manifold learning-based algorithms include locally linear embedding (LLE) [39], ISOMAP [45], Laplacian eigenmaps (LE) [4] and local tangent space alignment (LTSA) [52]. These have been successfully applied in many areas such as image classification, data visualization [47], machine learning and seismic signal processing [35]. Recently, Musa et al. [36] gave a comparison of l1-regularization for dimensionality reduction in logistic regression, including principal component analysis (PCA), kernel principal component analysis (KPCA) and independent component analysis (ICA). Lee and Verleysen's book [27] describes existing and advanced methods to reduce the dimensionality of numerical databases; it aims to summarize clear facts and ideas about well-known methods as well as recent developments in nonlinear dimensionality reduction.

As high-dimensional data frequently present nonlinear structures, we use a simple and outstanding manifold learning-based algorithm (LLE) as a basis and extend it to explore the intrinsic correlation structure of the data. The extended algorithm maps high-dimensional data to a low-dimensional space; it intrinsically considers the relationships among the dimensions to retain the geometry of the original data and achieves better inter-class separability of the data in the low-dimensional space. This helps to obtain a good discretization scheme.

3. Our discretization method

In this section, we present our proposed discretization method. As we know, the discretization of continuous dimensions depends on the characteristics of a dataset [20]. The discretization result is sensitive to changes in the intrinsic correlation structure of the data, especially for high-dimensional data that frequently present nonlinear structures. We first propose a novel local linear embedding algorithm for dimension reduction, which investigates the intrinsic geometric structure of the data to discover the hidden relationships and mine the lower representative dimensions. Secondly, we propose an area-based discretization algorithm to effectively discretize each continuous dimension in the low-dimensional space.

3.1. A novel local linear embedding algorithm

In this section, we describe the proposed novel local linear embedding algorithm in detail. It can be viewed as an extension of LLE. Let X = {x_1, ..., x_N} be a given dataset of N samples, where x_i = (x_{i1}, x_{i2}, ..., x_{iD})^T ∈ R^D and D is the dimension of


the dataset. LLE represents x_i by constructing its locally linear structure using its selected k nearest neighbors. The optimal reconstruction weights {w_ij} are determined by solving the constrained least squares problem as follows:

\begin{cases} \arg\min_{w_{ij}} \; \sum_{i=1}^{N} \left\| x_i - \sum_{j=1}^{k} w_{ij} x_j \right\|_2^2, \\ \text{s.t.} \quad \sum_{j=1}^{k} w_{ij} = 1, \end{cases} \qquad (1)

where ||·||_2 denotes the L2 norm. Once the reconstruction weights are computed, LLE maps X to Y = {y_1, ..., y_N} in a lower-dimensional space R^d according to (2):

\begin{cases} \arg\min_{y_i} \; \sum_{i=1}^{N} \left\| y_i - \sum_{j=1}^{k} w_{ij} y_j \right\|_2^2, \\ \text{s.t.} \quad Y^T Y = I, \end{cases} \qquad (2)

where I is an identity matrix and y_i = (y_{i1}, y_{i2}, ..., y_{id})^T ∈ R^d. LLE obtains the d representative eigenvectors that constitute Y in the low-dimensional space by solving (2).

LLE achieves a successful dimension reduction when the data points are evenly distributed in the high-dimensional space. However, it may distort the local geometric structure of the original data and lead to a poor low-dimensional embedding when the data points present an uneven distribution in high-dimensional space. For example, Fig. 1 shows an uneven distribution of data points. The nearest neighbors of the cross-point selected by the LLE algorithm are the points in the red circle. This results in information loss when mapping such unevenly distributed data, because the selected neighborhood does not cover the local information of the cross-point; the right neighborhood should be the points along the curved black line. An intuitive example is shown in Fig. 2, which presents the embedding result of randomly generated S curve sample points mapped from 3 dimensions to 2 dimensions by LLE. We can clearly see from Fig. 2 that LLE greatly changes the geometric topology structure of the original data.

To overcome this deficiency, we propose a novel local linear embedding algorithm, which not only effectively keeps the geometric topology structure of the original data, but also achieves better inter-class separability of the data in the low-dimensional space. Specifically, LLE uses the reconstruction weights {w_ij} to preserve the geometric structure of the original data, but they cannot reflect the density information of the k nearest neighbors of x_i. We therefore use the class label information to achieve better inter-class separability of the data. This aims to avoid bringing together sample points of different classes when mapping unevenly distributed data. Note that each sample point corresponds to exactly one class label. For every y_i ∈ Y, we use Y_{y_i} to denote the k nearest neighbors of y_i. According to the class label information, we divide Y_{y_i} into two parts, Y^1_{y_i} and Y^2_{y_i}, where Y^1_{y_i} = {y_1, y_2, ..., y_{k_1}} denotes the k_1 nearest neighbors with the same class as y_i, and Y^2_{y_i} = {y_{k_1+1}, y_{k_1+2}, ..., y_k} denotes the k_2 nearest neighbors with classes different from y_i. We then construct the local neighborhood of y_i as {y_i, Y^1_{y_i}, Y^2_{y_i}}. The objective is to find Y* such that the Euclidean distance between y_i and Y^1_{y_i} is minimized and the Euclidean distance between y_i and Y^2_{y_i} is maximized.

Fig. 1. Selection of the local neighbors. This presents an uneven distribution of data points. The nearest neighbors of the cross-point for LLE algorithm are the points in the red circle. To cover the local information of the cross-point, the right neighborhood should be the points in the curved black line. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


[Figure 2 panels: the geometric topology of a local neighborhood of S curve sampled points (N = 3000) in 3-dimensional space, and the corresponding embedding result in 2-dimensional space by LLE, in which the geometric topology of the original local neighborhood is greatly changed.]

Fig. 2. 2D embedding result of S curve by LLE. It greatly changes the geometric topology structure of the original data.

From the above analysis, we can obtain Y* by solving the following optimization problem:

\begin{cases} \arg\min_{y_i} \; \sum_{i=1}^{N} \left( \left\| y_i - \sum_{j=1}^{k_1} w^1_{ij} y_j \right\|_2^2 - \left\| y_i - \sum_{j=k_1+1}^{k} w^2_{ij} y_j \right\|_2^2 \right), \\ \text{s.t.} \quad Y^T Y = I, \end{cases} \qquad (3)

where

w^1_{ij} = \begin{cases} w_{ij} \Big/ \sum_{j=1}^{k_1} w_{ij}, & j = 1, 2, \ldots, k_1 \\ 0, & \text{otherwise} \end{cases} \qquad w^2_{ij} = \begin{cases} w_{ij} \Big/ \sum_{j=k_1+1}^{k} w_{ij}, & j = k_1+1, \ldots, k \\ 0, & \text{otherwise.} \end{cases}

To solve this optimization problem, we first transform (3) into (4) according to Theorem 1:

\begin{cases} \arg\min_{Y} \; \operatorname{Tr}\left( Y^T \left[ \left(W^2 - W^1\right)^T \left(2I - W^1 + W^2\right) \right] Y \right), \\ \text{s.t.} \quad Y^T Y = I, \end{cases} \qquad (4)

where W^1 = \{w^1_{ij}\}_{N \times N} and W^2 = \{w^2_{ij}\}_{N \times N}.

Theorem 1. (3) and (4) are equivalent. (See the proof in the Appendix.)

Now, we obtain the d representative eigenvectors that constitute Y* by using the Lagrange multiplier method [23] to solve (4). Fig. 3 presents an intuitive example. We can easily see that the geometric topology structure of the original data is exactly kept when our proposed method is applied. The example above presents the case of dimension reduction for 3D S curve data; the same holds for unevenly distributed high-dimensional data such as sparse bag-of-words data. The term bag of words (BoW) refers to a representation of text data and derives from experiments in text recognition: when categorizing a document as belonging to a certain category (e.g. scientific paper or children's story), the bag-of-words approach models the frequencies of different words in each type of document.


[Figure 3 panels: the geometric topology of a local neighborhood of S curve sampled points (N = 3000) in 3-dimensional space, an optimized local neighborhood in the low-dimensional space, and the embedding result in 2-dimensional space obtained by local neighborhood optimization.]

Fig. 3. 2D embedding result of S curve by the proposed local linear embedding algorithm. In the low-dimensional space, we expect the distances between a given sample point and its neighbor samples of the same class to be as small as possible, while the distances between the given sample point and its neighbor samples of different classes are as large as possible.

The term "bag" comes from the absence of sequence/structure when describing text data in this format. So we might classify a new document containing the words 'princess', 'wizard' and 'castle' as a children's story because these words are more frequent in the children's stories than in the scientific papers of our training corpus. For sparse bag-of-words data, the vector representation can be high-dimensional, in which case the vectors are sparse, i.e., most of the elements in the vectors are equal to zero. Our proposed dimension reduction covers the local information of each sample point well when selecting the local neighborhood, as shown in Fig. 1. The aim is that, by considering class information, the distances between a given sample point and its neighbor samples of the same class are as small as possible, while the distances between the given sample point and its neighbor samples of different classes are as large as possible. This achieves better inter-class separability of the data in the low-dimensional space while keeping the geometric topology structure of the original data. Actually, dimension reduction does not necessarily reduce the accuracy, as already observed in [21]; indeed, a limited reduction tends to improve the accuracy. The detailed algorithm can be found in Algorithm 1.

As we know, it is significant to determine the reduced dimension. Most manifold learning algorithms for dimension reduction do not give an explicit standard for the selection of dimensions [39,45,4,52,51]. The mapped dimension is a key parameter for dimension reduction methods: if the obtained dimension is too small, this can lead to a loss of significant features; and if the dimension is too large, the projections become noisy. Therefore, we first use the well-established maximum likelihood estimation (MLE) method [28] to estimate the intrinsic dimension (ID) d from X in Step 1. The MLE method for dimension estimation is based on probabilistic assumptions about the data and has been tested on various datasets with stable performance. It estimates the intrinsic dimension d from a collection of i.i.d. observations X = [x_1, ..., x_N] ∈ R^D and assumes that close neighbors lie on the same manifold. The basic idea is to fix a point x, assume f(x) ≈ const in a small sphere S_x(R) of radius R around x, and treat the observations as a homogeneous Poisson process in S_x(R). The estimator proceeds as follows, letting k be a fixed number of nearest neighbors of sample x_i:

" dk ðxi Þ ¼

# k1 1 X T k ðxi Þ log ; k  1 j¼1 T j ðxi Þ

ð5Þ

where T_k(x_i) is the distance from point x_i to its k-th nearest neighbor in X, measured with some suitable metric. The intrinsic dimension of the dataset can then be estimated as the average over all observations:

\hat{d} = \frac{1}{N} \sum_{i=1}^{N} d_k(x_i). \qquad (6)
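To make the estimator concrete, the following is a minimal sketch of (5)–(6); the Euclidean metric, the function name mle_intrinsic_dimension and the use of scipy's cdist are our own illustrative choices, not the paper's implementation.

```python
# A minimal sketch of the MLE intrinsic dimension estimator of (5)-(6), assuming a
# Euclidean metric; names and structure are illustrative, not the authors' code.
import numpy as np
from scipy.spatial.distance import cdist

def mle_intrinsic_dimension(X, k=10):
    """Estimate the intrinsic dimension of X (N x D) from k-nearest-neighbor distances."""
    dist = cdist(X, X)                         # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # exclude each point from its own neighbor list
    T = np.sort(dist, axis=1)[:, :k]           # T_1(x_i), ..., T_k(x_i)
    logs = np.log(T[:, -1:] / T[:, :-1])       # log T_k(x_i) / T_j(x_i), j = 1, ..., k-1
    d_local = (k - 1) / logs.sum(axis=1)       # (5): local estimate at each sample
    return d_local.mean()                      # (6): average over all observations

# Example: points on a 2-D plane linearly embedded in R^10 should give an estimate near 2.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
X = Z @ rng.normal(size=(2, 10))
print(round(mle_intrinsic_dimension(X, k=10), 2))
```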


In Step 2, we find the k nearest neighbors of each sample x_i and calculate its local reconstruction weights according to (1). Finally, the algorithm maps X to Y in a lower intrinsic dimensional space R^d by solving (4).

Algorithm 1. The Novel Local Linear Embedding Algorithm
Input: dataset X of N samples with dimension D; k: number of nearest neighbors at each sample point x_i
Output: embedded result Y with the intrinsic dimension d
1 Estimate the global intrinsic dimension d of the dataset X using the MLE method;
2 Find the k nearest neighbors of each x_i;
3 Calculate the local reconstruction weights according to (1);
4 Map the dataset X → Y in a lower dimensional space R^d;
5 Determine the global Y by solving the optimization problem (4);
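As a rough illustration of Algorithm 1, the sketch below builds the reconstruction weights of (1), splits them by class label into W^1 and W^2, and embeds the data through an eigendecomposition of M = (I − W^1)^T(I − W^1) − (I − W^2)^T(I − W^2), which is one direct reading of cost (3). This is an assumption-laden sketch (regularization constants, helper structure), not the authors' implementation of the trace form (4).

```python
# A hedged sketch of Algorithm 1 (NLLE): LLE-style weights, split by class into W1/W2,
# then an eigendecomposition of M = (I-W1)^T(I-W1) - (I-W2)^T(I-W2).
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def nlle_embed(X, labels, d, k):
    """Map X (N x D) with class labels to an N x d embedding."""
    labels = np.asarray(labels)
    N = X.shape[0]
    dist = cdist(X, X)
    np.fill_diagonal(dist, np.inf)
    knn = np.argsort(dist, axis=1)[:, :k]               # k nearest neighbors of each sample

    W1 = np.zeros((N, N))                                # renormalized same-class weights w^1
    W2 = np.zeros((N, N))                                # renormalized different-class weights w^2
    for i in range(N):
        nbrs = knn[i]
        Z = X[nbrs] - X[i]                               # local coordinates
        G = Z @ Z.T
        G += np.eye(k) * 1e-3 * (np.trace(G) + 1e-12)    # regularize the local Gram matrix
        w = np.linalg.solve(G, np.ones(k))
        w /= w.sum()                                     # reconstruction weights of (1)
        same = labels[nbrs] == labels[i]
        for part, mask in ((W1, same), (W2, ~same)):
            s = w[mask].sum()
            if abs(s) > 1e-12:
                part[i, nbrs[mask]] = w[mask] / s
    I = np.eye(N)
    M = (I - W1).T @ (I - W1) - (I - W2).T @ (I - W2)
    vals, vecs = eigh((M + M.T) / 2)                     # ascending eigenvalues
    return vecs[:, :d]                                   # d eigenvectors with smallest eigenvalues
```

In practice, k, the regularization strength and the choice of retained eigenvectors are tuning decisions; the paper delegates d to the MLE step and k to user input.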

3.2. An area-based chi-square discretization algorithm

Based on the proposed dimension reduction algorithm, we find the lower representative dimensions of high-dimensional data, which intrinsically keep the geometric structure of the original data and achieve better inter-class separability of the data in the low-dimensional space. This paves the way for high-dimensional data discretization. For example, the embedded outcome in Fig. 3 is more amenable to discretization, as shown in Fig. 4: the sample points of different classes can be well separated along a single horizontal axis. In this section, we propose an area-based chi-square algorithm to discretize the continuous dimensions in the low-dimensional space.

We now state the discretization problem. A discretization task requires a training dataset consisting of N samples y_i ∈ R^d (i = 1, 2, ..., N) obtained by Algorithm 1, where each sample belongs to exactly one of S classes. For the purpose of discretization, the entire dataset is projected onto the targeted continuous dimension. The result of such a projection is a two-dimensional contingency table with I rows and S columns (see Table 1). Each row corresponds to an initial data interval and each column corresponds to a different class. d_ij denotes the number of samples of the jth class in the ith interval T_i; d_{·j} is the total number of samples belonging to the jth class, and d_{i·} is the total number of samples within T_i.

As we know, the chi2-based algorithms are well-known discretization methods based on statistical independence. They use the distance from χ² to χ²_α as the merging criterion, with a given significance level α, to determine whether the current interval pair is to be merged or not. The merging criterion is defined as follows:

Dist = \chi^2_\alpha - \chi^2, \qquad (7)

where \chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{S} (d_{ij} - E_{ij})^2 / E_{ij}, and E_{ij} = d_{i\cdot} d_{\cdot j} / N is the expected frequency of d_{ij}. \chi^2_\alpha is determined by selecting a desired significance level \alpha. The larger the value of Dist, the more important the interval pair; thus, the interval pair with the largest Dist is merged first.

Fig. 4. An example of discretization: the sample points of different classes can be well separated.


Table 1. Notation of the contingency table.

Intervals               Class 1    Class 2    ...    Class S    Sum of row
T_1: [t_0, t_1]         d_11       d_12       ...    d_1S       d_1.
T_2: (t_1, t_2]         d_21       d_22       ...    d_2S       d_2.
...                     ...        ...        ...    ...        ...
T_I: (t_{I-1}, t_I]     d_I1       d_I2       ...    d_IS       d_I.
Sum of column           d_.1       d_.2       ...    d_.S       N (Total)

Actually, \chi^2 is the value of a random variable whose probability density function (PDF) is the chi-square density f(x). If we evaluate the change of \chi^2, the corresponding probability \int_{\chi^2}^{\chi^2_\alpha} f(x)\,dx is the most precise measure [5]. From the view of probability, the change of \chi^2 should therefore be calculated by \int_{\chi^2}^{\chi^2_\alpha} f(x)\,dx instead of Dist. This reflects statistical and probabilistic information and measures the possibility of each interval pair being merged. Therefore, the significance of an interval pair is determined by (8), defined as follows:

Area = \int_{\chi^2}^{\chi^2_\alpha} f(x)\,dx, \qquad (8)

where f(x) is the probability density function of the \chi^2 distribution [5]. Thus, the interval pair with the largest Area is merged first. Consider an intuitive example of discretizing a dataset about people's education, shown at the top of Fig. 5, where persons of different ages have the corresponding education levels. If the merging criterion (7) is applied to discretize the dimension 'age' according to the class label, we have Dist_2 > Dist_1, as shown in Fig. 6; the merging result is presented in Fig. 5-A. However, from the view of education, the samples 1–8 are all in elementary and secondary school, whereas the samples 5–13 span a large range from elementary and secondary school to post-doctoral. So the people in the first interval pair, connected by cut point 1, are more similar than the people in the second interval pair, connected by cut point 2; thus, the first interval pair should be merged first. If (8) is applied to discretize 'age', we have Area_1 > Area_2, as shown in Fig. 6, and Fig. 5-B presents the merging result.

As we know, any discretization process leads to a loss of information, and the goal of a good discretization algorithm is to minimize such information loss. We therefore use the level of consistency L_c from Rough Set Theory (RST) [33] as the stopping rule of our algorithm. It is intended to capture the degree of completeness of the knowledge about the dataset. For discretization, the level of consistency is the degree of dependency of the class label on A = {a_1, a_2, ..., a_d}, the set of all the low-dimensional features, where a_i denotes the ith feature. Specifically, let Y denote the set of instances of the transformed dataset. We say two instances y_i and y_j are indiscernible if their projections onto the space A are the same. For any instance y_i ∈ Y, we use [y_i]_A to denote its equivalence class with respect to the indiscernibility relation, and we use IND(A) to denote the set of all equivalence classes. We use Table 2 as an example. The table contains seven instances, two condition features (Age and LEMS, or Lower Extremity Motor Score) and one class feature (Walk). We have

IND({Age, LEMS}) = {{y_1}, {y_2}, {y_3, y_4}, {y_5, y_7}, {y_6}},
IND({LEMS}) = {{y_1}, {y_2}, {y_3, y_4}, {y_5, y_6, y_7}},
IND({Walk}) = {{y_1, y_4, y_6}, {y_2, y_3, y_5, y_7}}.
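These partitions can be reproduced mechanically. The short sketch below groups the Table 2 instances by their projection onto a chosen feature subset; the data literals mirror Table 2, while the function name and structure are our own illustration.

```python
# A small sketch reproducing the IND(.) partitions above from the Table 2 data by grouping
# instances that agree on the chosen features (the indiscernibility relation).
from collections import defaultdict

ROWS = {  # Table 2: (Age, LEMS, Walk)
    "y1": ("16-30", "50",    "Yes"), "y2": ("16-30", "0",     "No"),
    "y3": ("31-45", "1-25",  "No"),  "y4": ("31-45", "1-25",  "Yes"),
    "y5": ("46-60", "26-49", "No"),  "y6": ("16-30", "26-49", "Yes"),
    "y7": ("46-60", "26-49", "No"),
}
FEATS = {"Age": 0, "LEMS": 1, "Walk": 2}

def ind(features):
    """Partition the instances by their projection onto the given features."""
    groups = defaultdict(list)
    for name, values in ROWS.items():
        groups[tuple(values[FEATS[f]] for f in features)].append(name)
    return list(groups.values())

print(ind(["Age", "LEMS"]))  # [['y1'], ['y2'], ['y3', 'y4'], ['y5', 'y7'], ['y6']]
print(ind(["LEMS"]))         # [['y1'], ['y2'], ['y3', 'y4'], ['y5', 'y6', 'y7']]
print(ind(["Walk"]))         # [['y1', 'y4', 'y6'], ['y2', 'y3', 'y5', 'y7']]
```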

Fig. 5. Example of discretization running on the feature dimension age. Notes: 0: Elementary school, 1: Junior high school, 2: Senior high school, 3: Bachelor, 4: Master, 5: Ph.D, 6: Post-doctoral.


[Figure 6 panels: (a) the first interval pair, connected by cut point 1 (chi-square PDF with v = 2 degrees of freedom, marking \chi^2_1, \chi^2_\alpha, Dist_1 and Area_1); (b) the second interval pair, connected by cut point 2 (chi-square PDF with v = 6 degrees of freedom, marking \chi^2_2, \chi^2_\alpha, Dist_2 and Area_2).]

Fig. 6. Probability density function of the \chi^2 distribution.
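The contrast between the two criteria illustrated in Figs. 5 and 6 can be checked numerically. The sketch below computes \chi^2, Dist (7) and Area (8) for a single adjacent interval pair using scipy's chi-square distribution; the contingency counts are made-up illustrative numbers rather than the education data of Fig. 5, and taking \chi^2_\alpha = chi2.ppf(1 − \alpha, dof) is one common convention we assume here.

```python
# A hedged sketch of the merging criteria (7) and (8) for one adjacent interval pair.
import numpy as np
from scipy.stats import chi2

def merge_scores(pair_counts, alpha=0.5):
    """pair_counts: 2 x S array of class counts for two adjacent intervals."""
    observed = np.asarray(pair_counts, dtype=float)
    expected = observed.sum(1, keepdims=True) @ observed.sum(0, keepdims=True) / observed.sum()
    stat = ((observed - expected) ** 2 / np.where(expected > 0, expected, 1)).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    threshold = chi2.ppf(1 - alpha, dof)                     # chi^2_alpha
    dist = threshold - stat                                  # criterion (7)
    area = chi2.cdf(threshold, dof) - chi2.cdf(stat, dof)    # criterion (8)
    return dist, area

print(merge_scores([[8, 1], [7, 2]]))   # similar class distributions: positive Dist and Area
print(merge_scores([[9, 0], [1, 8]]))   # very different distributions: negative scores
```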

We pay special attention to IND({Walk}), the equivalence classes of the class feature Walk. We have two equivalence classes, that is, IND({Walk}) = {Z_1, Z_2}, where Z_1 = {y_1, y_4, y_6} and Z_2 = {y_2, y_3, y_5, y_7}. Intuitively, discretization may make two instances indiscernible. However, if the two instances fall into the same equivalence class of the class feature, then we can consider that there is no loss in discretization. Let Z_i be an equivalence class in the subspace defined by the class feature. We define

s(A, Z_i) = \{\, y \mid [y]_A \subseteq Z_i \,\}, \qquad (9)

that is, s(A, Z_i) is the set of instances whose equivalence classes in the space defined by A are entirely contained within a single equivalence class Z_i of the subspace defined by the class feature. The level of consistency of A with respect to the class label is defined as follows:

L_c(A) = \frac{\sum_i |s(A, Z_i)|}{|Y|}, \qquad (10)

where Z_i is the ith equivalence class in the subspace defined by the class feature. For a consistent dataset, L_c(A) = 1.

In the following, we compare the area-based chi-square method with other well-known measurements, i.e., the entropy-based and class-attribute interdependence measurements. A measure of the discrepancy is introduced in [1], which is calculated according to (11):

\lambda_1 = 2 \sum_{i=1}^{I} \sum_{j=1}^{S} d_{ij} \log \frac{N \, d_{ij}}{d_{i\cdot} \, d_{\cdot j}}. \qquad (11)

Actually, it is known that \chi^2 is approximately equivalent to \lambda_1 [22]. So, we have

\chi^2 \approx \lambda_1 = 2 \sum_{i=1}^{I} \sum_{j=1}^{S} d_{ij} \log \frac{N \, d_{ij}}{d_{i\cdot} \, d_{\cdot j}}
= 2 \sum_{i=1}^{I} \sum_{j=1}^{S} d_{ij} \log \frac{d_{ij}}{d_{i\cdot}} - 2 \sum_{j=1}^{S} d_{\cdot j} \log \frac{d_{\cdot j}}{N}
= 2N \left[ H(T_1 \cup \cdots \cup T_I) - H(T_1, \ldots, T_I) \right],

Table 2. Walk: an example decision table.

Y     Age      LEMS     Walk
y1    16-30    50       Yes
y2    16-30    0        No
y3    31-45    1-25     No
y4    31-45    1-25     Yes
y5    46-60    26-49    No
y6    16-30    26-49    Yes
y7    46-60    26-49    No
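Continuing the Table 2 example, the sketch below computes s(A, Z_i) and the level of consistency L_c of (9)–(10) for A = {Age, LEMS}: the equivalence classes {y_1} and {y_6} fall inside Z_1, and {y_2} and {y_5, y_7} inside Z_2, giving L_c = 5/7. The helper names are ours, not the paper's.

```python
# A small sketch of (9)-(10) on the Table 2 data: level of consistency of A = {Age, LEMS}
# with respect to the class feature Walk.
from collections import defaultdict

ROWS = {  # (Age, LEMS, Walk)
    "y1": ("16-30", "50",    "Yes"), "y2": ("16-30", "0",     "No"),
    "y3": ("31-45", "1-25",  "No"),  "y4": ("31-45", "1-25",  "Yes"),
    "y5": ("46-60", "26-49", "No"),  "y6": ("16-30", "26-49", "Yes"),
    "y7": ("46-60", "26-49", "No"),
}

def partition(indices):
    """Equivalence classes of the instances under the features at the given positions."""
    groups = defaultdict(set)
    for name, values in ROWS.items():
        groups[tuple(values[i] for i in indices)].add(name)
    return list(groups.values())

ind_A = partition([0, 1])            # IND({Age, LEMS})
ind_class = partition([2])           # IND({Walk}) = {Z_1, Z_2}

total = 0
for Z in ind_class:                  # s(A, Z_i): A-classes lying entirely inside Z_i, see (9)
    total += sum(len(g) for g in ind_A if g <= Z)
lc = total / len(ROWS)               # (10)
print(lc)                            # 0.714... = 5/7
```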


where H(T_1 \cup \cdots \cup T_I) = -\sum_{j=1}^{S} \frac{d_{\cdot j}}{N} \log \frac{d_{\cdot j}}{N} denotes the entropy of the entire interval with respect to the class label, and H(T_1, \ldots, T_I) = \sum_{i=1}^{I} \frac{d_{i\cdot}}{N} H(T_i) = -\sum_{i=1}^{I} \sum_{j=1}^{S} \frac{d_{ij}}{N} \log \frac{d_{ij}}{d_{i\cdot}} denotes the joint entropy of the I intervals when treating each interval independently. The entropy reflects a measure of the uncertainty (classification error) associated with a random variable when discretizing a continuous dimension. We can now see that the entropy-based and statistical independence-based discretization criteria are equivalent or comparable for measuring the significance of interval pairs.

Next, we use the mutual information [13,11] as the class-attribute interdependency measurement. It reflects the interdependence between the input variables and the output variable: it takes its maximum value if input and output are totally dependent and its minimum value if they are totally independent. It can be calculated according to (12):

\lambda_2 = \sum_{i=1}^{I} \sum_{j=1}^{S} d_{ij} \log \frac{d_{ij}}{d_{i\cdot} \, d_{\cdot j}} = \frac{\lambda_1}{2} - N \log N. \qquad (12)
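As a quick sanity check of these identities, the sketch below evaluates \lambda_1, \lambda_2 and the entropy form 2N[H(T_1 \cup \cdots \cup T_I) − H(T_1, \ldots, T_I)] on a small made-up contingency table (all counts kept positive to avoid log 0); the numbers are ours, chosen only to exercise (11) and (12).

```python
# Numeric check of the identities around (11)-(12) on a made-up 3x2 contingency table.
import numpy as np

d = np.array([[8.0, 2.0],
              [5.0, 5.0],
              [1.0, 9.0]])                 # d_ij: I intervals x S classes, all positive
N = d.sum()
di = d.sum(axis=1, keepdims=True)          # row sums d_i.
dj = d.sum(axis=0, keepdims=True)          # column sums d_.j

lam1 = 2 * (d * np.log(N * d / (di * dj))).sum()
lam2 = (d * np.log(d / (di * dj))).sum()
H_union = -(dj / N * np.log(dj / N)).sum()                 # entropy of the class label
H_joint = -(d / N * np.log(d / di)).sum()                  # sum_i (d_i./N) H(T_i)

print(np.isclose(lam1, 2 * N * (H_union - H_joint)))       # True
print(np.isclose(lam2, lam1 / 2 - N * np.log(N)))          # True
```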

We can clearly see from the above analysis that (12) is also comparable to the other two kinds of discretization criteria for measuring the significance of interval pairs. However, our proposed area-based method evaluates the significance of an interval pair from the view of probability, which overcomes the weakness of using the change of the \chi^2 variable to evaluate the significance of an interval pair. An additional advantage is that the area-based chi-square method takes into account the number of inconsistencies in multivariate datasets by introducing the level of consistency from RST, which avoids unnecessary information loss after discretization.

3.3. Algorithm description

Based on the above analysis, a heuristic algorithm is given in Algorithm 2. The objective of this algorithm is to find a better discretization scheme, and it consists of two stages: application of the novel local linear embedding algorithm to an input dataset, and a greedy bottom-up discretization in the low-dimensional space.

Algorithm 2.

Function notations in Algorithm 2:
Area(data): calculate the Area of all the interval pairs.
Merge(data): return true or false depending on whether the interval pairs of all the continuous dimensions are merged or not.
SubArea(att): calculate the Area values that need to be updated.
Merge(att): return true or false depending on whether the interval pairs of a single continuous dimension are merged or not.
\xi is a predefined tolerable amount of information loss supplied by the user.
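To make the bottom-up stage concrete, the sketch below gives one possible shape of the greedy merging loop for a single embedded dimension, scored by the Area criterion (8). The stand-in stopping rule (stop when no pair passes the independence test at level \alpha) is a simplification of the paper's L_c-based rule with tolerance \xi, and all helper names are our own, not the authors' Algorithm 2.

```python
# A hedged sketch of greedy area-based bottom-up merging for one continuous dimension.
import numpy as np
from scipy.stats import chi2

def area_score(counts, alpha=0.5):
    """Criterion (8) for a 2 x S block of adjacent-interval class counts."""
    counts = np.asarray(counts, float)
    expected = counts.sum(1, keepdims=True) @ counts.sum(0, keepdims=True) / counts.sum()
    stat = ((counts - expected) ** 2 / np.where(expected > 0, expected, 1)).sum()
    dof = max((counts.shape[0] - 1) * (counts.shape[1] - 1), 1)
    crit = chi2.ppf(1 - alpha, dof)
    return chi2.cdf(crit, dof) - chi2.cdf(stat, dof)

def discretize_1d(x, y, alpha=0.5):
    """Merge adjacent single-value intervals of x while the best pair looks independent."""
    classes = sorted(set(y))
    cuts = sorted(set(x))                      # right boundary of each current interval
    counts = [[sum(1 for a, b in zip(x, y) if a == v and b == c) for c in classes]
              for v in cuts]
    while len(counts) > 1:
        scores = [area_score(counts[i:i + 2], alpha) for i in range(len(counts) - 1)]
        i = int(np.argmax(scores))
        if scores[i] <= 0:                     # stand-in stopping rule (the paper monitors L_c)
            break
        counts[i] = [a + b for a, b in zip(counts[i], counts[i + 1])]
        del counts[i + 1]
        del cuts[i]                            # drop the cut point between the merged pair
    return cuts

x = [1, 2, 3, 10, 11, 12]
y = ['a', 'a', 'a', 'b', 'b', 'b']
print(discretize_1d(x, y))                     # [3, 12]: one internal cut at 3 plus the maximum
```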


Table 3. The summary of datasets.

Dataset          Number of dimensions    Number of classes    Number of samples
CNAE-9           857                     9                    1080
Int-adv          1558                    2                    3279
MADELON          500                     2                    4400
Isolet           617                     26                   6238
Secom            591                     2                    1567
Multi-features   649                     10                   2000
DEXTER           20,000                  2                    2600
Arcene           10,000                  2                    900
AR10P            2400                    10                   130
ORL10P           10,304                  10                   100

The discretization in the low-dimensional space is similar to the bottom-up discretization algorithms [44,43]. We initially set the significance level to \alpha = 0.5 and dynamically adjust it to monitor the level of consistency L_c^{dis}(A) of the discretized data. We introduce a predefined parameter \xi to test the condition L_c^{ori}(A) - L_c^{dis}(A) < \xi, which aims to control the degree of misclassification. As a general rule, we set \xi = 0 in Algorithm 2, that is, no information loss is allowed. Note that the value of k in Algorithm 1 of the proposed discretization method is commonly determined by user input as a fixed number of neighbors (k nearest neighbors, k << N), similar to LLE.

The computational complexity of Algorithm 2 is analyzed as follows. The time complexity of estimating the intrinsic dimension using MLE is O(DN log N). The time complexity of performing the linear embedding algorithm based on local neighborhood optimization is O(DN log N) + O(Nk^3) + O(dpN^2), where p is the ratio of nonzero to zero elements in the sparse matrix established by the algorithm. Note that although it adds one consideration (i.e., optimizing the local neighborhood), it does not increase the computational complexity compared with the LLE algorithm. In addition, sorting the values of the continuous dimensions needs O(N log N), and the algorithm only needs one sort. The complexity of searching for the best merge is O(N^2), and the computation of the level of consistency also needs O(N^2). Therefore, the computational complexity of Algorithm 2 is O(DN log N) + O(Nk^3) + O(dpN^2).

4. Experiments and performance evaluation

This section first evaluates the performance of our proposed high-dimensional discretization method on University of California, Irvine (UCI) machine learning datasets [19] and Arizona State University (ASU) datasets (see footnote 1) in comparison with five representative discretization algorithms, i.e., Entropy [16], modified chi2 [44], the PCA-based method [32], DDU [22] and MODL [9]. We demonstrate the general-purpose utility of the proposed method as a preprocessing step for data mining tasks [50] such as the Naive Bayes (NB) classifier and the C4.5 decision tree. In addition, we evaluate our proposed method on image data. The experimental results show that our method can be successfully extended to image classification.

4.1. Performance evaluation on UCI datasets

In order to evaluate our proposed method in a real-world situation, we conduct experiments on ten datasets, where eight datasets are selected from the UCI machine learning data repository with numeric features and varying data sizes, and the other two datasets (AR10P and ORL10P) are selected from the ASU data repository. The data are fully consistent or correct (the level of consistency is equal to 1), and they have been used widely in testing pattern recognition and machine learning methods. A summary of the datasets can be found in Table 3. In the experiments, we set \xi = 0 in Algorithm 2. We select the number of nearest neighbors at each sample from 4 to 40 for N < 1000, and from 4 to 200 for N > 1000 when the MLE method is applied.
Among the discretization methods, modified chi2, the PCA-based method and DDU require the user to specify some discretization parameters in advance. For modified chi2, we set the level of significance to 0.9995. There are two parameters (\alpha_0 and \beta_0) in DDU; we create a 5 x 5 uniform grid for 0 <= \alpha_0 <= 1 and 0 <= \beta_0 <= 1, and we use the value 10^{-5} in place of 0 for \beta_0 since it cannot be equal to 0. For the PCA-based method, we retain the d' (d' < D) eigenvectors that correspond to the largest eigenvalues adding up to 90 percent. MODL and Entropy have their own automatic stopping rules and do not require any parameter setting.

The Naive Bayes (NB) classifier is chosen as the benchmark for evaluating and comparing the performance of the original data, our proposed discretization method, and the other five outstanding methods. The reason for our choice is that NB is a simple classifier that works well for many decision-making problems.

1 http://featureselection.asu.edu/datasets.php.


All experiments were run on a PC equipped with the Windows XP operating system, a Pentium IV 2.4 GHz CPU, and 2 GB of SDRAM. Each dataset is discretized by each of the above-mentioned discretization methods. The 10-fold cross-validation test method is applied to all datasets: each one is divided into ten parts, among which nine parts are used as the training set and one as the testing set, and the experiments are repeated ten times. The final predictive accuracy is taken as the average predictive accuracy. For our method, we need to identify the intrinsic dimension (ID) of the datasets using MLE. Table 4 presents the identified information: the dimension determined by PCA with a user-specified threshold, the range of the ID, and the dimension selected by MLE. For PCA, we retain the d' eigenvectors that correspond to the largest eigenvalues adding up to 90 percent. For example, we can see from Table 4 that the dimension determined by PCA is 91 for the 'Isolet' dataset, while its ID identified by MLE is about 63; that is to say, we only need 63 intrinsic dimensions to explain the 'Isolet' dataset adequately. We show for each dataset the mean execution time, the mean number of discrete intervals and the Naive Bayes learning accuracy in Table 5.

Table 4. Dimension identified by MLE.

Dataset          PCA-based    Ours: Range                  Ours: Selection
CNAE-9           136          [67.6413, 100.9465]          {99, 100}
Int-adv          394          [102.435, 163.4357]          {162, 163}
MADELON          74           [26.3464, 47.1952]           {46, 47}
Isolet           91           [32.0620, 65.2428]           {63, 64, 65}
Secom            87           [41.1435, 64.3621]           {63, 64}
Multi-features   104          [45.3672, 56.6452]           {55, 56}
DEXTER           3058         [1205.3694, 1568.3742]       {1567, 1568}
Arcene           1228         [569.35, 704.26]             {703, 704}
AR10P            96           [69.4352, 77.2679]           {77, 78}
ORL10P           104          [82.5412, 91.2153]           {91, 92}

Table 5. The comparison of discretization schemes and Naive Bayes performance on 10 real datasets.

Mean discretization time (s)
Dataset          Continuous   Entropy   Modified Chi2   PCA-based   MODL      DDU        Our method
CNAE-9           -            269.7     1879.6          204.9       1598.6    28745.3    {1247.5, 1294.6}
Int-adv          -            523.5     2934.7          396.7       2469.3    34306.6    {1997.5, 2003.6}
MADELON          -            152.73    934.56          123.37      889.61    19434.64   {894.91, 902.43}
Isolet           -            189.2     1267.4          157.6       1027.4    24549.7    {946.2, 958.4, 972.8}
Secom            -            224.5     1497.3          156.8       1264.9    27653.2    {1047.6, 1053.4}
Multi-features   -            95.6      654.5           76.5        578.1     10241.5    {696.7, 672.5}
DEXTER           -            1840.5    27912.3         1743.4      26364.5   60267.3    {2848.5, 2901.6}
Arcene           -            865.6     19155.5         796.4       17434.7   43935.7    {1012.6, 1067.3}
AR10P            -            79.2      347.6           38.7        269.7     1476.9     {104.1, 108.2}
ORL10P           -            227.2     563.8           147.5       634.3     1273.6     {263.8, 270.1}

Mean number of intervals
Dataset          Continuous   Entropy   Modified Chi2   PCA-based   MODL      DDU        Our method
CNAE-9           -            9.4       20.5            11.0        19.6      25.3       {19.4, 19.9}
Int-adv          -            5.3       13.4            4.5         12.1      19.8       {15.5, 15.7}
MADELON          -            5.7       11.2            6.5         13.4      17.6       {11.9, 12.3}
Isolet           -            7.0       14.5            9.0         15.8      17.0       {15.4, 14.5, 14.0}
Secom            -            4.9       12.8            4.5         10.6      17.5       {13.8, 13.5}
Multi-features   -            11.6      23.2            10.0        25.6      31.2       {20.4, 20.6}
DEXTER           -            11.5      28.6            15.5        24.3      27.5       {16.5, 16.9}
Arcene           -            5.6       25.3            9.5         19.6      22.4       {14.6, 13.8}
AR10P            -            11.7      12.6            10.0        11.8      14.5       {12.2, 12.2}
ORL10P           -            10.6      12.5            10.0        14.6      16.7       {13.5, 13.8}

Mean accuracy (%)
Dataset          Continuous   Entropy   Modified Chi2   PCA-based   MODL      DDU        Our method
CNAE-9           76.3         88.5      89.8            89.2        91.5      91.9       {92.8, 92.8}
Int-adv          95.4         97.2      98.0            98.2        97.6      97.9       {98.7, 98.7}
MADELON          83.4         88.7      84.0            91.2        86.4      90.5       {92.3, 92.3}
Isolet           N/A          90.6      91.4            93.2        94.5      95.1       {96.1, 96.1, 94.2}
Secom            60.9         64.5      69.7            65.5        67.4      67.3       {72.5, 72.5}
Multi-features   88.6         91.5      92.4            92.9        92.5      92.3       {94.6, 94.9}
DEXTER           N/A          86.9      92.9            88.6        90.7      94.2       {95.2, 95.2}
Arcene           75.9         78.7      79.8            81.9        82.2      82.6       {84.6, 85.3}
AR10P            91.5         93.1      92.8            94.6        93.2      93.9       {95.7, 95.7}
ORL10P           90.9         94.5      93.6            96.1        94.2      94.7       {96.8, 96.8}
Mean rank        7.0          5.5       4.4             3.3         3.8       3.0        1.0


As is well known, the main goals of a discretization algorithm are to improve the accuracy of a learning algorithm, to simplify the discretization results, and to carry out the discretization process as fast as possible. Ideally, the best discretization result would score highest in all of these respects; in reality, this may not be achievable or necessary. Although the number of discrete intervals is not our main concern in this experiment, a discretization scheme with too few intervals may lower the quality of the discretization scheme and decrease the accuracy of a classifier. In addition, we hope that the discretization algorithm achieves the highest learning accuracy of the classifiers within an acceptable discretization time. As seen from Table 5, although the execution time of our proposed discretization algorithm is at a medium level compared with the other algorithms, our algorithm achieves the highest learning accuracy of the NB classifier on average. For our method, the dimension reduction process increases the execution time of the algorithm, but the execution time of the discretization itself decreases after dimension reduction. The execution time of the algorithm is obtained by the clock() function of the C programming language (header file "time.h"). In short, finding the best discretization is in fact finding the best trade-off among the measures; accordingly, we mainly pay attention to the accuracy of a learning algorithm applied to the discretized data.

The predictive accuracy of the NB classifier is presented in the last block of Table 5. For the large datasets, such as 'Isolet' and 'DEXTER', Weka fails to report any result due to the high memory requirement (represented by N/A). Besides, since the dimension is too large for the 'Int-adv', 'Isolet', 'DEXTER', 'Arcene', 'AR10P' and 'ORL10P' datasets, the modified chi2, MODL and DDU discretization methods are implemented after using LLE. The predictive accuracies of the seven methods are presented in Table 5. The comparison results show that, on average, our method achieves the highest classification accuracy, which demonstrates that it produces a high-quality discretization scheme. A quick comparison of the seven methods can be obtained by checking the mean rank in the last row of Table 5. Each value of this row is the average rank of the corresponding discretization method over the ten datasets: we rank the algorithms for each dataset separately, with the best-performing algorithm getting rank 1, the second best rank 2, and so on.

In order to obtain statistical support, we use the Friedman test to check whether the measured mean ranks show statistically significant differences. If the Friedman test shows that there is a significant difference, the Bonferroni-Dunn test in Holm's post hoc test is used to further analyze the comparisons of all the methods against UniDis. The Friedman statistic is described as follows:

\chi^2_F = \frac{12Q}{P(P+1)} \left[ \sum_j L_j^2 - \frac{P(P+1)^2}{4} \right], \qquad (13)

where P is the number of discretization algorithms, Q is the number of datasets, L_j = \frac{1}{Q} \sum_i u_i^j, and u_i^j is the rank of the jth of the P algorithms on the ith of the Q datasets. For the measured mean ranks in Table 5, the corresponding value of the Friedman test is

\chi^2_F = \frac{12 \times 10}{7 \times 8} \left[ (7.0^2 + 5.5^2 + 4.4^2 + 3.3^2 + 3.8^2 + 3.0^2 + 1.0^2) - \frac{7 \times 8^2}{4} \right] = 47.0143,

which is larger than the threshold 12.6. Therefore, the visualization of the Bonferroni-Dunn test in Holm's post hoc test is illustrated in Fig. 7 according to [14]. The top line in the figure is the axis on which we plot the average ranks of all the methods, and a method further to the left performs better. A method with rank outside the marked

[Figure 7 panels: (a) average ranks of all seven methods (Our method, DDU, MODL, PCA-based, Modified Chi2, Entropy, Continuous) on a 1-7 rank axis; (b) average ranks of Our method, MODL, PCA-based and DDU on a 1-4 rank axis.]

Fig. 7. Comparison of Naive Bayes performance with Holm's post hoc tests (\alpha = 0.05).


interval in Fig. 7-A means that it is significantly different from our discretization method. We can see from Fig. 7-A that the mean predictive accuracy of our method is statistically comparable to that of DDU, MODL and PCA-based, and that it performs significantly better than the remaining three methods. The comparison of the measured mean ranks among our method, DDU, MODL and PCA-based does not reach statistical significance because all seven algorithms are compared simultaneously. If we remove Continuous, Entropy and Modified chi2, we obtain Fig. 7-B, in which our method performs significantly better than DDU, MODL and PCA-based. Note that, from the statistical point of view, the mean predictive accuracy of our method is comparable to that of DDU even though the mean rank of our method is higher than that of DDU.

Considering the discretization algorithms in Table 5, we briefly describe their time complexity in the following. The MODL method [9] runs in O(DN^3) time with a straightforward implementation, but it can be optimized to O(DN log N) time by using heuristic categories. The time complexity of the Entropy discretization method [16] is also O(DN log N). The DDU algorithm [22] has time complexity O(DI^3) with a dynamic programming implementation, where I is the number of initial intervals. For modified Chi2, the computational complexity of the algorithm [44] is O(KDN log N), where K is the number of incremental steps in the algorithm. The PCA-based correlation preserving discretization algorithm [32] mainly has two phases. The first phase generates a set of orthogonal vectors from the input dataset of dimensionality D using PCA, with computational complexity O(ND) + O(D^3); distance-based clustering is then applied to the data projected on each eigenvector to produce the cut points, and the clustering phase needs O(TCN), where T is the number of iterations and C is the number of clusters. Therefore, the total time complexity of the PCA-based algorithm is O(ND) + O(D^3) + O(TCN).

In order to evaluate the effectiveness of the area-based chi-square discretization algorithm proposed in Section 3.2, we also discretize the lower-dimensional data produced by Algorithm 1 (NLLE for short in this paper) with the area-based chi-square discretization and other supervised discretization methods. The same classifier, i.e., the Naive Bayes classifier, is then applied to the data generated by these supervised discretization algorithms. The comparison of the learning accuracies of the NB classifier is presented in Table 6. The area-based algorithm achieves the highest classification accuracy on average. Besides, these supervised discretization algorithms generally improve the learning accuracy of the NB classifier after applying a dimension reduction algorithm such as LLE or NLLE. This is because the dimension reduction algorithms explore the geometric correlation structure and remove redundant information. We also compare LLE + Chi2 with the other supervised algorithms. As seen from Table 6, the accuracy obtained by LLE + Chi2 is lower. We briefly analyze the reason. First, LLE may change the geometric topology structure of the original data when the data present an uneven distribution, while NLLE keeps the geometric topology structure of the original data and achieves better inter-class separability of the data in the low-dimensional space by considering class information.
In addition, in Chi2 [30], some input variables are removed according to a larger inconsistency count. However, these results are obtained at the cost of decreasing the fidelity of the original data, because the inconsistency rate in Chi2 is calculated as the total number of instances minus the largest number of instances of a class label, considering only the largest number and not the differences among the instance counts of all class labels. This leads to a loss of information, and if a classifier is learned from such a discretized dataset produced by Chi2, the accuracy will be worse.

Table 6. Naive Bayes learning accuracy (%) on 10 real datasets.

Dataset          LLE + Chi2   NLLE + Entropy   NLLE + Modified Chi2   NLLE + MODL   NLLE + DDU   Our method   Reduced dimension by MLE
CNAE-9           91.2         90.5             91.6                   92.2          92.8         92.8         100
Int-adv          97.4         97.4             97.6                   98.4          98.4         98.7         163
MADELON          90.8         90.2             91.9                   91.7          91.2         92.3         46
Isolet           94.3         93.6             94.4                   94.7          95.6         96.1         63
Secom            66.3         68.6             68.9                   69.7          70.2         72.5         64
Multi-features   92.5         92.9             93.4                   92.9          93.7         94.9         56
DEXTER           94.3         93.6             94.3                   94.3          94.8         95.2         1567
Arcene           82.3         81.9             82.4                   84.5          84.9         85.3         704
AR10P            93.2         93.9             94.2                   95.2          95.1         95.7         77
ORL10P           94.7         95.6             95.4                   96.5          95.3         96.8         91

Table 7. Compression results (in kbytes) of the input datasets.

                                   CNAE-9   Int-adv   MADELON   Isolet   Secom    Multi-features   DEXTER   Arcene   AR10P   ORL10P
Original size                      1843     10035.2   7373      27,942   5222.4   8704             3681     2653     273     930
Compressed size by entropy         924      5045      4618      14,633   2539     4357             1654     1461     147     528
Compressed size by modified Chi2   1694     9126      7124      22,034   8064     7543             3436     2154     254     873
Compressed size by PCA-based       38       137       796       4057     87       106              126      307      19      35
Compressed size by MODL            1728     5283      7046      25,492   4518     7963             3329     2362     207     847
Compressed size by DDU             1734     5312      7245      23,971   4894     8231             3578     2184     215     858
Compressed size by ours            31       125       613       3108     83       98               89       197      17      28


Fig. 8. (a) S curve dataset. (b) S curve sampled points with N = 3000.

[Figure 9 panels: (a) LLE, k = 7; (b) NLLE, k = 7; (c) LLE, k = 11; (d) NLLE, k = 11; (e) LLE, k = 15; (f) NLLE, k = 15; (g) LLE, k = 20; (h) NLLE, k = 20.]

Fig. 9. 2-dimensional embedding results of S curve data by LLE and NLLE.

In the following, we evaluate compressibility, which is a useful property for large and high-dimensional datasets and data warehousing environments, since it easily reduces the storage requirements. Table 7 shows the results of compression on the various datasets. As seen from the results, our proposed discretization method compresses the data more than the original data and the other discretization methods, as it reduces a large number of dimensions.

4.2. Performance evaluation on images

As the quality of dimension reduction directly determines the quality of discretization in this paper, we first evaluate the effectiveness of the proposed NLLE dimension reduction method. The tested dataset is an artificial S curve in R^3, as shown in Fig. 8(a). Fig. 8(b) shows the scatter plot of the S curve dataset. As we can see, the S curve data points lie on a highly nonlinear manifold with 3000 randomly generated points; this is a typical dataset with a nonlinear structure (see footnote 2). We compare NLLE with LLE for a 2D manifold with the S curve embedded in 3D space and run them with k = 7, 11, 15, 20.

2 A data structure can be classified into two categories: linear and nonlinear. A data structure is linear if the elements form a sequence, for example an array, linked list or queue. Elements in a nonlinear data structure do not form a sequence, for example a tree, hash tree or binary tree.


Fig. 10. Freyface classification by our method.

Fig. 11. 2D (left) and 3D (right) embedding results of N = 2832 handwritten digits by our method.

Note that k is the number of nearest neighbors selected for each sample point x_i, as mentioned in Section III-A. We can see clearly from Fig. 9 that NLLE keeps the geometric topology structure of the original S curve data and generates better embedding results than LLE for the different numbers of nearest neighbors. On the contrary, LLE greatly changes the geometric topology structure of the original data.

The second dataset is the Freyface dataset [39], which contains 1965 samples in a 560-dimensional space. We apply our method with k = 12 to this dataset. We classify the dataset into five classes according to facial expression: 558 neutral faces, 630 happy faces, 627 unhappy faces, 77 tongue faces and 73 pouting faces. Fig. 10 presents the visualization of the 2D and 3D classification of the Freyface dataset by our discretization method. For the 2D classification, we do not consider the class label. We can clearly see from Fig. 10(a) that our method still separates the happy, unhappy and neutral expression faces well in 2-dimensional space, and it can further classify each type of expression in detail. We consider the class label for the 3D classification, and the five facial expressions are clearly separated in 3-dimensional space, as shown in Fig. 10(b).

For the third dataset, we consider the MNIST dataset3 containing N = 2832 handwritten digits ('4'-'6'). The gray-scale images of handwritten numerals are at 28 × 28 resolution and are converted to 784-dimensional vectors. The data points are mapped into 2D and 3D space, respectively, using our method with k = 12. These results are shown in Fig. 11. It is clear that our method performs very well: most of the sample points with the same digit class are well clustered in the resulting embedding, and they can also be finely classified within each digit class through discretization.

3 The dataset can be downloaded at http://www.cs.nyu.edu/roweis/data.html.
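As a rough illustration of the overall pipeline evaluated in these experiments (dimension reduction, per-dimension discretization, then Naive Bayes), the following sketch substitutes ordinary LLE for NLLE, equal-frequency binning for the area-based chi-square algorithm, and scikit-learn's small 8 × 8 digits for the MNIST subset. It only shows how the stages fit together and is not intended to reproduce the reported accuracies.

```python
# Illustrative pipeline only: ordinary LLE and quantile binning stand in for the
# paper's NLLE and area-based chi-square discretization; digits 4-6 of sklearn's
# 8x8 digits stand in for the MNIST '4'-'6' subset.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = load_digits()
mask = np.isin(digits.target, [4, 5, 6])
X, y = digits.data[mask], digits.target[mask]

# Step 1: map the 64-dimensional images to a 2-D embedding.
Y2 = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)

# Step 2: discretize each embedded dimension into equal-frequency intervals.
disc = KBinsDiscretizer(n_bins=8, encode="ordinal", strategy="quantile")
Xd = disc.fit_transform(Y2).astype(int)

# Step 3: learn a Naive Bayes classifier on the discrete interval codes.
Xtr, Xte, ytr, yte = train_test_split(Xd, y, test_size=0.3, random_state=0, stratify=y)
nb = CategoricalNB(min_categories=8).fit(Xtr, ytr)   # min_categories guards against unseen bins
print("NB accuracy on the discretized 2-D embedding: %.3f"
      % accuracy_score(yte, nb.predict(Xte)))
```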


Fig. 12. Some samples of the Isoface dataset. As can be seen, the head undergoes left–right pose, up–down pose and lighting changes.

Table 8
Naive Bayes learning accuracy (%) on three image datasets.

Image dataset        LLE + Chi2   NLLE + Entropy   NLLE + Modified Chi2   NLLE + MODL   NLLE + DDU   Our method
Freyface   2D        65.3         64.9             64.5                   66.7          65.3         67.3
           3D        74.9         73.2             76.8                   76.5          77.1         79.3
MNIST      2D        92.5         91.7             93.2                   91.4          94.9         94.3
           3D        95.7         93.3             96.5                   97.1          96.4         97.8
Isoface    3D        94.7         94.3             95.5                   95.5          96.3         96.7

Finally, we consider the real Isoface dataset [45] of 698 face images. The image size is 64-by-64 pixels, and each image is converted to a D = 4096-dimensional vector. We classify the dataset into three categories according to a three-dimensional movement: up-down pose, left–right pose and lighting changes. Some samples of the Isoface dataset are shown in Fig. 12. In [45], the Isomap algorithm estimated its intrinsic dimension (ID) as 3 using the projection approach. We apply NLLE with k = 12 and d = 3 to the dataset. Then, the mapped 3D Isoface dataset can be discretized by the different discretization algorithms. As a whole, we report the classification accuracy of NB on the three image datasets for the different discretization algorithms in Table 8. Our method has better classification ability overall.

5. Conclusions

Discretization methods have played an important role in data mining and knowledge discovery. They not only produce a concise summarization of continuous dimensions that helps experts understand the data more easily, but also make learning more accurate and faster. In this paper, we present a supervised high-dimensional data discretization method that learns the intrinsic geometry of the data to derive lower representative dimensions by proposing a novel local linear embedding algorithm for dimension reduction. A supervised area-based chi-square algorithm is proposed to discover the discretization scheme in the low-dimensional space. Our proposed method achieves a significant improvement in the mean learning accuracy of the classifiers over existing discretization methods and generates a more concise knowledge of the data, especially for high-dimensional data. Our discretization method has also been successfully applied to computer vision and image classification.

Acknowledgments

This work is supported by the National Science Foundation for Distinguished Young Scholars of China (Grant No. 61225010); NSFC under Grant Nos. 61173161, 61173162, 61173165 and 61300189; a project funded by the China Postdoctoral Science Foundation; and the Fundamental Research Funds for the Central Universities.

Appendix A

Theorem 1. (3) and (4) are equivalent.

Proof.

\[
\operatorname*{arg\,min}_{y_i} \sum_{i=1}^{N}\left( \left\| y_i - \sum_{j=1}^{k_1} w_{ij}^{1} y_j \right\|^{2} - \left\| y_i - \sum_{j=k_1+1}^{k} w_{ij}^{2} y_j \right\|^{2} \right)
= \operatorname*{arg\,min}_{Y} \operatorname{Tr}\!\left( Y^{T}\left(I-W^{1}\right)^{T}\left(I-W^{1}\right)Y - Y^{T}\left(I-W^{2}\right)^{T}\left(I-W^{2}\right)Y \right).
\]

As $W^{1}$ and $W^{2}$ are orthogonal, $(W^{1})^{T}W^{2}=0$ and $(W^{2})^{T}W^{1}=0$. In addition, according to the nature of the matrix trace, we have


\begin{align*}
& \operatorname{Tr}\!\left( Y^{T}\left(I-W^{1}\right)^{T}\left(I-W^{1}\right)Y - Y^{T}\left(I-W^{2}\right)^{T}\left(I-W^{2}\right)Y \right) \\
&\quad = \operatorname{Tr}\!\left( Y^{T}\!\left[ \left(W^{2}-W^{1}\right) + \left(W^{2}-W^{1}\right)^{T} + \left(W^{1}\right)^{T}W^{1} - \left(W^{2}\right)^{T}W^{2} \right]\! Y \right) \\
&\quad = \operatorname{Tr}\!\left( Y^{T}\left(W^{2}-W^{1}\right)Y \right) + \operatorname{Tr}\!\left( Y^{T}\left(W^{2}-W^{1}\right)^{T}Y \right) + \operatorname{Tr}\!\left( Y^{T}\left(W^{1}-W^{2}\right)^{T}\left(W^{1}+W^{2}\right)Y \right) \\
&\quad = \operatorname{Tr}\!\left( Y^{T}\, 2\left(W^{2}-W^{1}\right)^{T} Y \right) + \operatorname{Tr}\!\left( Y^{T}\left(W^{1}-W^{2}\right)^{T}\left(W^{1}+W^{2}\right)Y \right) \\
&\quad = \operatorname{Tr}\!\left( Y^{T}\!\left[ 2\left(W^{2}-W^{1}\right)^{T} + \left(W^{1}-W^{2}\right)^{T}\left(W^{1}+W^{2}\right) \right]\! Y \right) \\
&\quad = \operatorname{Tr}\!\left( Y^{T}\left(W^{2}-W^{1}\right)^{T}\left[ 2I - \left(W^{1}+W^{2}\right) \right] Y \right).
\end{align*}

Therefore, the theorem is proven. □
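The identity can also be checked numerically. The sketch below builds W^1 and W^2 with disjoint row supports, which is one simple way to satisfy (W^1)^T W^2 = (W^2)^T W^1 = 0, and compares the two trace expressions; the construction and dimensions are illustrative assumptions only.

```python
# Numerical check of the trace identity used in the proof of Theorem 1.
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 12, 3, 5            # illustrative sizes

# W1 is supported on the first m rows and W2 on the remaining rows, so every
# column of W1 is orthogonal to every column of W2: W1.T @ W2 == W2.T @ W1 == 0.
W1 = np.zeros((N, N)); W1[:m, :] = rng.normal(size=(m, N))
W2 = np.zeros((N, N)); W2[m:, :] = rng.normal(size=(N - m, N))
Y = rng.normal(size=(N, d))
I = np.eye(N)

lhs = np.trace(Y.T @ ((I - W1).T @ (I - W1) - (I - W2).T @ (I - W2)) @ Y)
rhs = np.trace(Y.T @ (W2 - W1).T @ (2 * I - (W1 + W2)) @ Y)
print(lhs, rhs)               # the two traces agree up to floating-point error
assert np.isclose(lhs, rhs)
```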

References [1] A. Agresti, Categorical Data Analysis, Wiley, New York, 1990. [2] E. Armengol, A. Garcia-Cerdana, Refining Discretizations of Continuous-valued Attributes, Modeling of Decisions of Artificial Intelligence Conference, LNAI, Springer, Heidelberg, 2012. pp. 258–269. [3] S.D. Bay, Multivariate discretization for set mining, Knowl. Inform. Syst. 3 (4) (2001) 491–512. [4] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and representation, Neural Comput. 15 (6) (2003) 1373–1396. [5] G.R. Bian, L.D. Wu, X.P. Li, J.G. Wang, Probability Theory, Mathematical Statistics, vol. 2, People’s Education Press, Beijing, 1979. [6] M. Biba, F. Esposito, S. Ferilli, N.D. Mauro, T. Basile. Unsupervised discretization using kernel density estimation, in: The Twentieth International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 696–701. [7] A. Bondu, M. Boulle, V. Lemaire, S. Loiseau, B. Duval, A non-parametric semi-supervised discretization method, in: 2008 Eighth IEEE International Conference on Data Mining (ICDM), 2008, pp. 53–62. [8] M. Boulle, Khiops: a statistical discretization method of continuous attributes, Machine Learn. 55 (1) (2004) 53–69. [9] M. Boulle, MODL: a bayes optimal discretization method for continuous attributes, Machine Learn. 65 (1) (2006) 131–165. [10] M. Boulle, Optimum simultaneous discretization with data grid models in supervised classification: a bayesian model selection approach, Adv. Data Anal. Classif. 3 (1) (2009) 39–61. [11] J.Y. Ching, A.K.C. Wong, K.C.C. Chan, Class-dependent discretization for inductive learning from continuous and mixed-mode data, IEEE Trans. Pattern Anal. Machine Intell. 17 (7) (1995) 641–651. [12] K.J. Cios, L.A. Kurgan, CLIP4: hybrid inductive machine learning algorithm that generates inequality rules, Inform. Sci. 177 (17) (2007) 3592–3612. [13] T.M. Cover, J.A. Thomas, Elements of Information Theory, second ed., Wiley-Interscience, 2006. [14] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Machine Learn. Res. 7 (1) (2006) 1–30. [15] J. Dougherty, R. Kohavi, M. Sahami, Supervised and unsupervised discretization of continuous feature, in: Proceedings of 12th International Conference of Machine Learning (ICML), 1995, pp. 194–202. [16] U. Fayyad, K. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in: Proceedings of Thirteenth International Joint Conference on Artificial Intelligence, Morgan Kaufman, San Mateo, CA, 1993, pp. 1022–1027. [17] A. Ferreira, M. Figueiredo, An incremental bit allocation strategy for supervised feature discretization, in: Pattern Recognition and Image Analysis, Lecture Notes in Computer Science, vol. 7887, 2013, pp. 526–534. [18] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (2) (1936) 179–188. [19] S. Hettich, S.D. Bay, The UCI KDD archive [db/ol], 1999. . [20] C.W. Hsu, C.J. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425. [21] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, C. Schmid, Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Machine Intell. 33 (10) (2011) 1–12. [22] R.M. Jin, Y. Breitbart, C. Muoh, Data discretization unification, in: The Seventh IEEE International Conference on Data Mining (ICDM Best Paper), 2007, pp. 183–192. [23] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986. [24] R. 
Kerber, Chimerge: discretization of numeric attributes, in: Proceedings Ninth National Conference on Artificial Intelligence, AAAI Press, 1992, pp. 123–128. [25] I. Kononenko, On biases in estimating multi-valued attributes, in: Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), 1995, pp. 1034–1040. [26] L.A. Kurgan, K.J. Cios, CAIM discretization algorithm, IEEE Trans. Knowl. Data Eng. 16 (2) (2004) 145–153. [27] J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, Information Science and Statistics, Springer, 2007. [28] E. Levina, P.J. Bickel, Maximum likelihood estimation of intrinsic dimension, Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, 2005, pp. 777–784. [29] H. Liu, F. Hussain, C.L. Tan, M. Dash, Discretization: an enabling technique, J. Data Min. Knowl. Disc. 6 (4) (2002) 393–423. [30] H. Liu, R. Setiono, Feature selection via discretization, IEEE Trans. Knowl. Data Eng. 9 (4) (1997) 642–645. [31] L.L. Liu, A.K.C. Wong, Y. Wang, A global optimal algorithm for class-dependent discretization of continuous data, Intell. Data Anal. 8 (2) (2004) 151– 170. [32] M. Mehta, S. Parthasarathy, H. Yang, Toward unsupervised correlation preserving discretization, IEEE Trans. Knowl. Data Eng. 17 (8) (2005) 1–14. [33] Z. Pawlak, Rough sets, Int. J. Comput. Inform. Sci. 11 (5) (1982) 341–356. [34] K. Polat, S. Karab, A. Guven, S. Gunes, Utilization of discretization method on the diagnosis of optic nerve disease, Comput. Methods Prog. Biomed. 91 (3) (2008) 255–264. [35] J. Ramirez, F.G. Meyer, Machine learning for seismic signal processing: Seismic phase classification on a manifold, in: Proceedings of 10th International Conference on Machine Learning and Applications, 2011, pp. 382–388.


[36] A.B. Musa, PCA, KPCA and ICA for dimensionality reduction in logistic regression, Int. J. Machine Learn. Cybernet. PP (99) (2013), http://dx.doi.org/ 10.1007/s13042-013-0171-7. [37] S. Garcia, J. Luengo, J.A. Saez, V. Lopez, F. Herrera, A survey of discretization techniques: taxonomy and empirical analysis in supervised learning, IEEE Trans. Knowl. Data Eng. 25 (4) (2013) 734–750. [38] M.X. Ribeiro, A.J.M. Traina, C. Traina, P.M. Azevedo-Marques, An association rule-based method to support medical image diagnosis with efficiency, IEEE Trans. Multimedia 10 (2) (2008) 277–285. [39] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326. [40] F.J. Ruiz, C. Angulo, N. Agell, IDD: a supervised interval distance-based method for discretization, IEEE Trans. Knowl. Data Eng. 20 (9) (2008) 1230– 1238. [41] Y. Sang, K.Q. Li, Combining univariate and multivariate bottom-up discretization, Multiple-Valued Logic Soft Comput. 20 (1–2) (2012) 161–187. [42] G. Schmidberger, E. Frank, Unsupervised discretization using tree-based density estimation, in: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2005, pp. 240–251. [43] C.T. Su, J.H. Hsu, An extended chi2 algorithm for discretization of real value attributes, IEEE Trans. Knowl. Data Eng. 17 (3) (2005) 437–441. [44] E.H. Tay, L. Shen, A modified chi2 algorithm for discretization, IEEE Trans. Knowl. Data Eng. 14 (3) (2002) 666–670. [45] J.B. Tenenbaum, V.D. Sliva, J.C. Landford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323. [46] C.J. Tsai, C.I. Lee, W.P. Yang, A discretization algorithm based on class-attribute contingency coefficient, Inform. Sci. 178 (3) (2008) 714–731. [47] M. Vlachos, C. Domeniconi, D. Gunopulos, G. Kollios, N. Koudas, Non-linear dimensionality reduction techniques for classification and visualization, in: Proceedings of Eighth ACM SIGKDD International Conference of Knowledge Discovery and Data Mining, 2002, pp. 645–651. [48] X.Z. Wang, Y.L. He, D.D. Wang, Non-naive bayesian classifiers for classification problems with continuous attributes, IEEE Trans. Cybernet. (2013), http://dx.doi.org/10.1109/TCYB.2013.2245891. [49] I. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second ed., Diane Cerra, San Francisco, CA, 2005. [50] X.D. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.H. Zhou, M. Steinbach, D.J. Hand, D. Steinberg, Top 10 algorithms in data mining, Knowl. Inform. Syst. 14 (1) (2008) 1–37. [51] T.H. Zhang, D.C. Tao, X.L. Li, J. Yang, Patch alignment for dimensionality reduction, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1299–1313. [52] Z.Y. Zhang, H. Zha, Principal manifolds and nonlinear dimension reduction via local tangent space alignment, SIAM J. Sci. Comput. 26 (1) (2005) 313– 338.