A group lasso based sparse KNN classifier


Shuai Zheng, Chris Ding
University of Texas at Arlington, TX, USA

Pattern Recognition Letters 131 (2020) 227–233

Article history: Received 2 August 2019; Revised 8 November 2019; Accepted 28 December 2019; Available online 3 January 2020

Keywords: Sparse learning; Group lasso; Explainable classifier

Abstract

Sparse features have been shown to be effective in applications in computer vision, machine learning, signal processing, etc. Group sparsity was proposed based on the observation that natural group structures exist in many problems. Previous research mainly focuses on improving the way sparse features are extracted, e.g., lasso, group lasso, overlapped group lasso, and sparse group lasso. In existing work, sparse features are usually taken as input for classifiers such as SVM, KNN, or SRC (Sparse Representation based Classification). In this paper, we find that, instead of using sparse group features as input for classifiers, sparse group features are good candidates for selecting the most relevant classes/groups. We design a new classifier to improve classification accuracy: (1) we use sparse group lasso to select the K most relevant classes/groups, which makes the approach robust because it filters out unrelated classes/groups at the group level instead of the individual sample level; (2) KSVD is used to obtain the exact desired sparsity (k nonzero entries), which eliminates the difficulty of hyperparameter tuning; (3) the simple summation of regression weights within each class/group contains sufficient class-discriminant information, and the chance of a sample belonging to a specific class is denoted simply by the summation of the corresponding regression weights within that class, which is in line with the need for Explainable AI (XAI). The K most relevant groups/classes can be considered as K neighbors of the correct class; thus, we call this classifier Group Lasso based Sparse KNN (GLSKNN). Compared to 8 other approaches, the GLSKNN classifier outperforms the other methods in terms of classification accuracy on two public image datasets and on images with different occlusion/noise levels.

1. Introduction

The Least Absolute Shrinkage and Selection Operator (Lasso) [1,2] is a sparse regression model that finds a sparse coefficient vector for a regression problem. Lasso has been widely used in machine learning, pattern recognition, and signal processing. Given a p-dimensional test sample y ∈ R^{p×1} and a training data matrix X ∈ R^{p×m} with m samples, linear regression finds a coefficient vector α ∈ R^{m×1} that minimizes the reconstruction error $\min_{\alpha}\ \tfrac{1}{2}\|y-X\alpha\|_2^2$. Lasso introduces sparsity to the solution α by adding an ℓ1 penalty to this objective, $\min_{\alpha}\ \tfrac{1}{2}\|y-X\alpha\|_2^2+\lambda_1\|\alpha\|_1$, where λ1 controls the sparsity of α. Similar regularization approaches have also been applied to SVD and SVM for subspace learning and classification [3–5], and various regularizations are also widely used in deep learning [6–8]. Due to the non-smoothness of the regularization term, many research works have proposed efficient algorithms for Lasso [9–11]. Standard Lasso does not consider the structure of the sparse solutions or the dependencies among input data.
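For illustration only (a minimal sketch, not part of the original paper), the sparse coefficient vector α can be obtained with scikit-learn's Lasso; the `alpha` argument below plays the role of λ1, up to scikit-learn's 1/(2p) scaling of the squared loss, and the data here are random placeholders:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, m = 50, 200                                   # sample dimension p, number of training samples m
X = rng.standard_normal((p, m))                  # training matrix, one column per sample
y = X[:, :3] @ np.array([1.0, -0.5, 0.8])        # test sample built from 3 training samples

# scikit-learn's Lasso minimizes 1/(2p) * ||y - X a||_2^2 + alpha * ||a||_1,
# so `alpha` corresponds to lambda_1 up to that scaling.
model = Lasso(alpha=0.1, fit_intercept=False, max_iter=10000)
model.fit(X, y)
alpha = model.coef_                              # sparse coefficient vector, most entries are 0
print(np.flatnonzero(alpha))                     # indices of the selected training samples
```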




However, in many real problems the structure of the data is complex and natural grouping structures exist. For example, there are groups of training samples in supervised learning problems (with respect to different classes); signals monitoring the same object exhibit group properties (multi-view computer vision problems) [12]; and cluster computing nodes exhibit group properties [13]. This group information is valuable for machine learning objectives. Individual samples tend to be noisy and unreliable. By requiring that samples in the same group have higher correlation and larger similarity while samples from different groups have lower correlation and smaller similarity, these noisy and unreliable factors from individual samples can be reduced. This idea is also similar to subspace learning problems [14–17], where group relationships are considered in terms of within-class and between-class distances. In the problem of variable selection for regression [18–20], we expect the model to find the correct groups of variables instead of focusing on individual variables. In computer vision and machine learning, many publications have worked on finding the groups of factors that are most related to the input signals, such as group lasso [18,19,21,22]. Yuan and Lin's work [18] extends the classical Lasso problem and adds a group penalty to the regression in order to expose the group structure of the regression coefficients.

Specifically, Yuan and Lin solve

$\min_{\alpha}\ \tfrac{1}{2}\big\|y-\sum_{j=1}^{J}X_{G_j}\alpha_{G_j}\big\|_2^2+\lambda_2\sum_{j=1}^{J}\|\alpha_{G_j}\|_2,$   (1)

where y ∈ R^{p×1} is a test sample, p is the sample dimension, X_{G_1} ∈ R^{p×n_{G_1}}, X_{G_2} ∈ R^{p×n_{G_2}}, ..., X_{G_J} ∈ R^{p×n_{G_J}} are the training samples from J groups, and α_{G_j} ∈ R^{n_{G_j}×1} are the regression coefficients of group G_j. The regularization term in Yuan and Lin's work imposes the group structure, because Eq. (1) forces sparsity on the length ||α_{G_j}||_2 of each α_{G_j}; λ2 controls the desired group-level sparsity of the solution. In this way, the joint dependency among input data can be discovered. In practice, group sparsity has been shown to improve the performance of Lasso [23].
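To make the group structure concrete, here is a minimal NumPy sketch (ours, not from the paper) that evaluates the group lasso objective of Eq. (1); `groups` is an assumed list of per-class index arrays:

```python
import numpy as np

def group_lasso_objective(y, X, alpha, groups, lam2):
    """Eq. (1): 0.5 * ||y - X @ alpha||_2^2 + lam2 * sum_j ||alpha[G_j]||_2.

    `groups` is a list of J integer index arrays; groups[j] holds the column
    indices of X (equivalently, the entries of alpha) belonging to class G_j.
    """
    residual = y - X @ alpha
    penalty = sum(np.linalg.norm(alpha[g]) for g in groups)
    return 0.5 * residual @ residual + lam2 * penalty

# Example grouping: 6 training samples split into J = 2 classes of 3 samples each
groups = [np.arange(0, 3), np.arange(3, 6)]
```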

Several extensions of group lasso have been developed. Zhao's work [21] proposes a generalized penalty class, in which the regularization of Eq. (1) is replaced with the Composite Absolute Penalties family $T(\alpha)=\sum_{j=1}^{J}\|\alpha_{G_j}\|_{l_j}^{l_0}$, where l_0 ranges from 1 to +∞. Jacob's work [22] considers overlapping groups, with the assumption that some data is shared across different groups. Swirszcz's work [19] proposes a Group Orthogonal Matching Pursuit approach for variable selection, a group-structured extension of the Orthogonal Matching Pursuit (OMP) algorithm [24], also known as the "forward greedy feature selection algorithm". Most of these publications focus on how to extract good sparse features (the sparse coefficient vector α), and the regularization hyperparameters are difficult to tune. In real applications, the learned sparse features are used as input for standard classifiers such as KNN, SVM, or regression. These classifiers are not specifically designed for sparse features. Sparse group features can be viewed as assigning different weights at the individual sample level, with some groups of samples jointly receiving larger weights. However, sparse group features do not filter out or select groups directly. When these sparse features are fed into classifiers, the unrelated groups may still affect the performance of standard classifiers.

In this paper, we find that, instead of using sparse group features as input for classifiers, sparse group features are good candidates for selecting the K most relevant classes/groups. In this way, we can filter out unrelated classes/groups at the group level instead of the individual sample level, which makes the approach robust to individual noise. Due to the difficulty of hyperparameter tuning in the optimization, we use KSVD to obtain the exact desired sparsity (k nonzero entries), which eliminates the difficulty of hyperparameter tuning. Finally, we simply use the summation of regression weights within each group to denote the chance of a sample belonging to a specific class. The K most relevant groups/classes can be considered as K neighbors of the correct class. Thus, we call this classifier Group Lasso based Sparse KNN (GLSKNN).

With the increasing application of AI in industry, there is a strong need for Explainable AI (XAI) [25–27]. The ability to explain the rationale behind an algorithm's decision is a prerequisite for establishing trust between researchers and industrial practitioners, especially for industrial use cases with heavy safety and economic impact [28–30]. GLSKNN is a straightforward and explainable classifier: it simply uses the summation of weights as an indicator of the predicted class.

The contributions of this work can be summarized as follows:

1. GLSKNN is robust. Sparse group features are used to select the K most relevant classes/groups. In this way, we can filter out unrelated classes/groups at the group level instead of the individual sample level.
2. GLSKNN requires less hyperparameter tuning. Using KSVD eliminates the hyperparameter tuning needed in the regularization optimization.
3. GLSKNN results and features are explainable. The chance of a sample belonging to a specific class is given simply by the summation of regression weights within that class.
4. Experiments show that the GLSKNN classifier outperforms other approaches by more than 27% in terms of classification accuracy on some data (see Table 4).

2. Group Lasso Sparse KNN Classifier (GLSKNN)

In this section, we introduce the three steps of the Group Lasso Sparse KNN Classifier (GLSKNN). First, GLSKNN applies Sparse Group Lasso to select the most relevant classes/groups and to extract sparse features with a large signal-to-noise ratio (SNR). Second, GLSKNN applies KSVD to the sparse features for denoising. Finally, GLSKNN uses the summation of regression coefficients as class weights for classification.

2.1. Selecting most relevant classes/groups using Sparse Group Lasso

Group Lasso can expose group structures of data. However, problems with group structures often require sparsity both at the group level and within groups. This leads to an improved version of Group Lasso, Sparse Group Lasso. Thus, we choose Sparse Group Lasso to select groups and extract sparse features.

Given a testing data point y ∈ R^{p×1} and m training samples X ∈ R^{p×m}, the data dimension is p and there are J classes in the training data X. Samples of each class form a group, so we can write the J groups as X_{G_1} ∈ R^{p×n_{G_1}}, X_{G_2} ∈ R^{p×n_{G_2}}, ..., X_{G_J} ∈ R^{p×n_{G_J}}, and we arrange the training data as X = [X_{G_1}, X_{G_2}, ..., X_{G_J}]. Here n_{G_j} is the number of samples in group j, j = 1, 2, ..., J. Sparse Group Lasso finds α ∈ R^{m×1} by solving the following convex minimization problem:

$\min_{\alpha}\ \tfrac{1}{2}\|y-X\alpha\|_2^2+\lambda_1\|\alpha\|_1+\lambda_2\sum_{j=1}^{J}\|\alpha_{G_j}\|_2,$   (2)

where α_{G_j} ∈ R^{n_{G_j}×1} are the regression coefficients of group G_j and α = [α_{G_1}, α_{G_2}, ..., α_{G_J}]. The weight λ1 controls the overall sparsity of α, and the weight λ2 controls the group-level sparse structure of α. If ||α_{G_j}||_2 is too small, we set α_{G_j} ← 0 and discard this unrelated class/group (as shown in steps 2 and 3 of Algorithm 1). In this way, we can select the most relevant classes/groups.

Algorithm 1. Proximal Gradient Algorithm for Sparse Group Lasso.
Input: data matrix X = [X_{G_1}, X_{G_2}, ..., X_{G_J}], with X_{G_j} ∈ R^{p×n_{G_j}}, j = 1, 2, ..., J; testing data y ∈ R^{p×1}; initial guess α = [α_{G_1}, α_{G_2}, ..., α_{G_J}], with α_{G_j} ∈ R^{n_{G_j}×1}, j = 1, 2, ..., J; Lipschitz continuous constant p; γ = 1.1
Output: regression coefficient vector α
1:  while objective function not converged do
2:      # Set groups with small weights to 0
3:      if ||α_{G_j}||_2 < δ, set α_{G_j} ← 0, j = 1, 2, ..., J
4:      compute the gradient of the squared loss, ∇ = X^T(Xα − y)
5:      while True do
6:          b = α − ∇/p
7:          for j = 1, 2, ..., J do
8:              c_j ← arg min_{c_j} (1/2)||c_j − b_j||_2^2 + λ1||c_j||_1 + λ2||c_j||_2
9:          end for
10:         if J(c) < J(α) then
11:             Lipschitz condition satisfied: α ← c
12:             break loop
13:         else
14:             Lipschitz condition not satisfied: p ← γp
15:         end if
16:     end while
17: end while
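The paper does not spell out how the per-group subproblem in step 8 is solved; a standard closed form from the sparse group lasso literature [33,34] is sketched below (our illustration, with λ1 and λ2 assumed to already include any step-size scaling): soft-threshold elementwise for the ℓ1 term, then shrink or zero out the whole group for the ℓ2 term.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def group_prox(b_j, lam1, lam2):
    """One instance of step 8:
    argmin_c 0.5 * ||c - b_j||_2^2 + lam1 * ||c||_1 + lam2 * ||c||_2.
    """
    s = soft_threshold(b_j, lam1)
    norm_s = np.linalg.norm(s)
    if norm_s <= lam2:                 # the entire group is set to zero
        return np.zeros_like(b_j)
    return (1.0 - lam2 / norm_s) * s   # group-level shrinkage of the survivors
```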


In the following sections, we explore algorithms to solve Eq. (2) and the properties of the sparse feature α.

Table 1. Datasets.

Name        Dimension p    Training m         Testing n    Class # J
Caltech101  3000           3060 (30/class)    6084         102
YaleB       504            1216 (32/class)    1198         38

2.2. Denoising using KSVD

Given the sparse features from Sparse Group Lasso, we now focus on how to decide the predicted group/class of a testing sample y. The solution of Eq. (2) is a sparse vector with group structure. One problem with the sparse solution of Eq. (2) is that it is difficult and time-consuming to tune λ1 and λ2. KSVD can ensure that the solution has exactly k non-zero values, and thus controls the sparsity of Sparse Group Lasso explicitly. To simplify model tuning and to further reduce noise in the sparse feature, we apply KSVD [31] to the sparse group lasso feature. In the experiment section, we show the effectiveness of the KSVD step. Formally, given a solution α of Eq. (2), KSVD [31] solves the following problem:

$\min_{\alpha}\ \|y-X\alpha\|_2^2 \quad \text{s.t.}\ \|\alpha\|_0\le k,$   (3)

where the solution α has exactly k non-zero elements. KSVD can be solved using Orthogonal Matching Pursuit [31].
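As a hedged illustration of Eq. (3) (not the authors' code), scikit-learn's OrthogonalMatchingPursuit returns a solution with at most k nonzero coefficients; the data shapes below are placeholders modeled on Table 1:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
X = rng.standard_normal((504, 1216))             # p x m training matrix (YaleB-like sizes)
y = X[:, 0] + 0.01 * rng.standard_normal(504)    # a test sample close to training sample 0
k = 30                                           # desired sparsity in Eq. (3)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
omp.fit(X, y)
alpha_k = omp.coef_                              # at most k nonzero entries
print(np.count_nonzero(alpha_k))
```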

2.3. Decision functions of GLSKNN

Standard classifiers are not well suited to sparse features. First, standard classifiers such as KNN and SVM compute the ℓ2 distance between two vectors. However, the Sparse Group Lasso feature α is a very sparse representation and, due to the curse of dimensionality, the ℓ2 distance is not a good measure of the difference between two sparse feature vectors. Second, sparse features usually contain much discriminant information between classes and have a high Signal-to-Noise Ratio (SNR). Thus, we can directly use the regression coefficients to compute class weights. This is in line with the strong need for Explainable AI (XAI).

We propose three decision functions, GLSKNN, GLABS, and GLSRC, for classification using the Sparse Group Lasso feature. For the indexes corresponding to the same class, we use the summation of the sparse coefficient values as the class weight. Since we are not sure about the effect of negative values in sparse features, we designed two methods for classification: GLSKNN sums the sparse coefficient values directly, and GLABS sums the absolute values of the sparse coefficients. Since Sparse Group Lasso features are sparse, and SRC (Sparse Representation based Classification) [32] is specifically designed for sparse feature classification, applying SRC to Sparse Group Lasso features is another obvious option, which we call GLSRC.

Define the class indicator matrix H = [h_1, h_2, ..., h_m] ∈ R^{J×m}, where H has m columns (samples) and J rows (classes); it contains only 0s and 1s, and each column has exactly one 1, denoting the class of the corresponding column of X.

GLSKNN. The GLSKNN classifier determines the class weight using the summation of coefficients of the denoised Sparse Group Lasso solution of Eq. (3). The predicted class label is the class with the maximum class weight. The decision function is:

$\arg\max_{j}\ (l=H\alpha)_j, \quad \|\alpha\|_0\le k,$   (4)

where the solution has k non-zeros and the predicted class is the class with the maximum sum of class coefficients. We call it the Group Lasso Sparse KNN classifier because the solution in Eq. (4) has only k non-zero values, and those k non-zero values indicate the k training samples most related to the test data, much like the k nearest neighbors in a KNN classifier carry the information most relevant to the target data.

GLABS. This classifier determines the class weight using the sum of absolute values of the Sparse Group Lasso solution of Eq. (2). The predicted class label is the class with the maximum class weight. The decision function is:

$\arg\max_{j}\ (l=H|\alpha|)_j,$   (5)


where |α| denotes the element-wise absolute value of the Sparse Group Lasso solution.

GLSRC. This classifier applies SRC [32] to the Sparse Group Lasso solution of Eq. (2). It computes the residual of each class separately and chooses the class with the minimum residual. The decision function is:

$\arg\min_{j}\ \|y-X_j(\alpha)_j\|_2^2,$   (6)

where (α)_j is the coefficient vector of group j from the solution of Eq. (2).
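A compact NumPy sketch of the three decision rules above (our illustration; H is the J × m class indicator matrix defined earlier, and `groups` is an assumed list of per-class column-index arrays):

```python
import numpy as np

def glsknn_predict(H, alpha_k):
    """GLSKNN, Eq. (4): class with the largest sum of denoised coefficients."""
    return int(np.argmax(H @ alpha_k))

def glabs_predict(H, alpha):
    """GLABS, Eq. (5): class with the largest sum of absolute coefficients."""
    return int(np.argmax(H @ np.abs(alpha)))

def glsrc_predict(y, X, alpha, groups):
    """GLSRC, Eq. (6): class whose own samples give the smallest residual."""
    residuals = [np.linalg.norm(y - X[:, g] @ alpha[g]) for g in groups]
    return int(np.argmin(residuals))
```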

2.4. Connections to SRC

Sparse Representation based Classification (SRC) [32] is a classification approach based on the Lasso solution. To predict the class of a sample y, SRC computes the residual with respect to each class and finds the class with the minimal residual:

$\arg\min_{j}\ \|y-X_j(\alpha)_j\|_2^2,$   (7)

where α is the optimal solution of standard Lasso and j = 1, 2, ..., J denotes the class index. The predicted class of y is the class with the minimum residual. As we can see from the objective function, within-group and between-group relationship information is not considered. When the data structure is complex and the group dependency is strong, SRC does not perform well. We compare SRC with GLSKNN in the experiment section.

3. Algorithm to solve Eq. (2)

Some research [33,34] solves Sparse Group Lasso using accelerated generalized gradient descent, block coordinate descent, etc. In this section, we use an efficient proximal gradient algorithm. The proximal gradient algorithm has several advantages: (1) it converges quickly and can be accelerated in many ways [35]; (2) it can be generalized to both convex and non-convex problems; (3) no matrix factorization is required, so the algorithm can be used for larger problems.

In Algorithm 1, during each iteration, if the weights of a specific group are too small, i.e., ||α_{G_j}||_2 < δ, we set all the regression weights of this group to 0, as shown in step 3. δ is a data-dependent hyperparameter. Step 4 computes the gradient of the squared loss (the first term) of Eq. (2). We then break the problem into J small problems; within each group, step 8 enforces the within-group sparsity and the group-level sparsity. In step 11, J(c) is the objective function value using the new solution c, and J(α) is the objective function value using the solution from the last iteration. Fig. 1 shows the convergence of Algorithm 1 on some example data (see Table 1 for data details). The objective function is Eq. (2), and each iteration is one outer "while" loop of Algorithm 1. This computation is very fast, and the objective function converges within 400 to 800 iterations.

4. Illustration of sparse features

Sparse Group Lasso can extract features with a large Signal-to-Noise Ratio (SNR). Large-SNR features are important for classification problems.


Fig. 1. Algorithm 1 convergence on sample data.

We use example YaleB data to show the difference between Sparse Group Lasso (Eq. (2)) and linear regression, Lasso, and Group Lasso (Eq. (1)). We use 1216 training samples from the YaleB data [36], with 32 training images per class and 38 classes in total. Taking a testing image from class 1 as an example, Fig. 2 shows the regression coefficients α of linear regression, Lasso, Group Lasso, and Sparse Group Lasso, respectively. The x-axis is the training data index, where training data of the same class are grouped together; for example, indices 1–32 are from class 1, indices 33–64 are from class 2, etc. We use a black vertical line to indicate the separation between two classes (since the testing data is from class 1, we only show the first several class separation lines). The y-axis is the magnitude of the coefficient. We can see that the Sparse Group Lasso coefficients from class 1 have larger magnitude than those of the remaining groups, and sparsity is enforced both at the group level and within groups. To further investigate the relationship between α and the predicted class, we define the Signal-to-Noise Ratio (SNR) of a testing point from class i as:

$\mathrm{SNR}=\overline{|\alpha_i|}_{\,i\in C_i}\ \big/\ \overline{|\alpha_i|}_{\,i\notin C_i},$   (8)

where $\overline{|\alpha_i|}_{\,i\in C_i}$ denotes the signal information, i.e., the average absolute value of α_i over the indices i that correspond to class C_i, and i ∉ C_i denotes the indices that do not correspond to class C_i. We use average values instead of summations because i ∉ C_i covers all classes other than C_i, so the number of elements with i ∉ C_i is larger than the number with i ∈ C_i; averaging trades off this imbalance.
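A small NumPy helper (ours, for illustration) that computes the SNR of Eq. (8) given the coefficient vector and the indices belonging to the true class:

```python
import numpy as np

def snr(alpha, class_indices):
    """Eq. (8): mean |alpha_i| over the true-class indices divided by
    the mean |alpha_i| over all remaining indices."""
    mask = np.zeros(alpha.shape[0], dtype=bool)
    mask[class_indices] = True
    return np.abs(alpha[mask]).mean() / np.abs(alpha[~mask]).mean()
```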

As we can see from Figs. 2a–2d, linear regression has the smallest SNR, while Sparse Group Lasso gives the largest SNR, 11.9584. This property is very useful for classification, and large-SNR features improve classification accuracy.

5. Experiments

In this section, we evaluate the proposed GLSKNN classifier on two public databases, Caltech101 and YaleB, and compare the classification performance of GLSKNN with other algorithms. We also evaluate the robustness of GLSKNN by applying it to occluded images and images with noise, and we study the effect of k in KSVD.

5.1. Data sets

For each data set, we randomly create 10 training sets and 10 testing sets. Table 1 shows the data attributes and experiment settings.

The Caltech101 data contains 9144 images from 102 classes (101 object classes and a "background" class) [37], including human faces, leopards, motorbikes, binocular, brain, camera, etc. Some samples are shown in Fig. 3. The images in each class have significant shape variability, and the number of images per class varies from 31 to 800. We use the spatial pyramid image features processed by Jiang et al. [38]: SIFT descriptors are first extracted, and then spatial pyramid features [39] are built on the extracted SIFT features with three grids of size 1 × 1, 2 × 2, and 4 × 4. The dimension of the spatial pyramid feature is reduced to 3000 using PCA. In our experiments, we use 30 random images per class for training and the remaining images for testing. The dataset properties are shown in Table 1.

The YaleB data contains 2414 face images of 38 persons under 64 illumination conditions [36]. We use the processed version from [40], where only the frontal-pose images are used. Some samples are shown in Fig. 3. This database is challenging due to the varying illumination conditions. The images have been resized to 24 × 21, and there are about 64 to 65 images for each person. Some images were corrupted during image acquisition and have been dropped [40]. In our experiments, we use 32 random images per person for training and the remaining images for testing. The dataset properties are shown in Table 1.

5.2. Comparison algorithms

We compare the proposed GLSKNN classifier with the existing classifiers KNN, SRC, Regression, and SVM. We tune the hyperparameters and choose, for each classifier, the value that gives the highest accuracy. To isolate the effects of Group Lasso and KSVD, SKNN eliminates the group selection step based on Group Lasso, and GL eliminates the KSVD denoising step.

SKNN. To evaluate the effectiveness of group selection using Group Lasso in the GLSKNN classifier, this classifier applies the decision function of Eq. (4) directly to the solution of linear regression. There is no group selection in this process.

GL. To evaluate the effectiveness of KSVD denoising in the GLSKNN classifier, this classifier directly uses the sparse features from Eq. (2) for classification; the KSVD denoising step is eliminated. Formally, this classifier uses the following decision function to decide the class of a testing point:

$\arg\max_{j}\ (l=H\alpha)_j,$   (9)

where l ∈ R^{J×1} denotes the class weights of the J classes. The testing signal is classified into the class corresponding to the maximum weight in l. We call the decision function of Eq. (9) the GL (Group Lasso) method (as shown in Table 2).


Fig. 2. Illustration of sparse features and Signal-to-Noise Ratio (SNR).


Fig. 3. Example images.

5.3. Caltech101

Caltech101 is a widely used benchmark dataset [37], and various algorithms have been tested on it, such as the Yang method [41], the Wang method [42], D-KSVD [43], and LC-KSVD [44]. In this experiment, we use the same setting and the same spatial pyramid features as in [38] to compare classification accuracy with these approaches. We randomly generate 10 training and testing sets and run our experiments on all 10 of them; we then compute the mean accuracy and standard deviation. Table 2 reports the classification accuracy results.

Table 2. Recognition results on Caltech101 (reported results use only 1 random training/testing set; our experiments use 10 random training/testing sets).

Algorithms                        Accuracy
Yang method (reported result)     0.732
Wang method (reported result)     0.734
D-KSVD (reported result)          0.730
LC-KSVD (reported result)         0.736
KNN (10 random runs)              0.597 (± 0.006)
Regression (10 random runs)       0.729 (± 0.003)
SRC (10 random runs)              0.718 (± 0.004)
SKNN (10 random runs)             0.742 (± 0.004)
GL (10 random runs)               0.703 (± 0.005)
GLABS (10 random runs)            0.705 (± 0.006)
GLSRC (10 random runs)            0.692 (± 0.006)
GLSKNN (10 random runs)           0.762 (± 0.004)


Fig. 4. Effects of k in Eq. (4) (single run): (1) SKNN applies KSVD to the linear regression solution; (2) GLSKNN applies KSVD to the Group Lasso solution of Eq. (2).

Table 3. Recognition results on YaleB (10 random runs).

Algorithms   YaleB
KNN          0.778 (± 0.010)
Regression   0.950 (± 0.005)
SVM          0.908 (± 0.009)
SRC          0.948 (± 0.007)
SKNN         0.965 (± 0.005)
GL           0.921 (± 0.007)
GLABS        0.931 (± 0.006)
GLSRC        0.920 (± 0.007)
GLSKNN       0.971 (± 0.004)

Fig. 5. Occlusion on YaleB (from 5b to 5f, the occlusion block covers 10% to 50% of the total image size).

Overall, the GLSKNN classifier performs the best. The k in Eq. (4) is chosen as the value giving the best accuracy within the range 10 to 150.

Effects of parameter selection of k. Fig. 4a shows the effects of different k in Eq. (4). GLSKNN in the figure denotes the result of the GLSKNN classifier; SKNN denotes the result when the Group Lasso group selection step is eliminated. As we can see, GLSKNN accuracy increases as k grows from 10 to 150 in steps of 10, while SKNN accuracy first increases and then, after k = 70, decreases. This tells us that Sparse Group Lasso can select the most relevant groups; if we eliminate the group selection step, the selected sparse features are not relevant, and as k increases, more noise enters the data.

5.4. YaleB

Table 3 shows the average classification accuracy and standard deviation over the 10 runs. As we can see from the table, GLSKNN outperforms the other methods.

Effects of parameter selection of k. Fig. 4b shows the effect of choosing different k in Eq. (4) on a single training and testing set. Both GLSKNN accuracy and SKNN accuracy increase as k grows from 10 to 150 in steps of 10, but GLSKNN gives slightly higher classification accuracy. In Table 3, we use the k with the best accuracy for k = 10 to k = 150.

Robustness on image occlusion. In machine learning research, images are assumed to be well positioned, clean, and clear. In applications, however, images can be occluded for various reasons. To test the robustness of the GLSKNN classifier, we add an occlusion block of varying size to each testing YaleB image and test whether this affects the recognition accuracy; the training images are not occluded. As shown in Fig. 5, the position of the occlusion block is randomly determined, and we increase the size of the occlusion block from 0.1 to 0.5 (10% to 50%) of the total image pixels. Fig. 5a is the original image. Table 4 reports the classification accuracy of the different classifiers at all occlusion levels. We can see that the GLSKNN classifier achieves the best classification accuracy at all levels of occlusion.

Robustness on images with noise. Sparse models have been shown to be effective for problems with noise. Thus, we create YaleB faces with Gaussian noise to evaluate the robustness of the proposed method. The original YaleB face images are grayscale with values from 0 to 255. We add Gaussian noise to each pixel of the original training images; the noise has mean 0 and standard deviation 10, 20, or 30. Fig. 6 shows one image with different standard deviations of Gaussian noise. Table 5 reports the classification accuracy of the different classifiers. We can see that the GLSKNN classifier outperforms the other approaches in terms of classification accuracy for images with Gaussian noise.
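A small NumPy sketch of these two image corruptions (ours; the paper does not specify the block shape or fill value, so a zero-filled square block is assumed):

```python
import numpy as np

def add_occlusion(img, frac, rng):
    """Zero out a randomly positioned square block covering roughly `frac` of the pixels."""
    out = img.copy()
    h, w = img.shape
    side = int(np.sqrt(frac * h * w))
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    out[top:top + side, left:left + side] = 0
    return out

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`, clipped to [0, 255]."""
    noisy = img.astype(float) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0, 255)

# Example on a 24 x 21 YaleB-sized image:
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(24, 21)).astype(float)
occluded = add_occlusion(img, 0.3, rng)      # 30% occlusion block
noisy = add_gaussian_noise(img, 20, rng)     # sigma = 20
```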

Table 4. Recognition results on YaleB with different occlusion levels (10 random runs).

Occlusion %   KNN              SRC              SKNN             Regression       GL               GLABS            GLSRC            SVM              GLSKNN
0             0.778 (±0.010)   0.948 (±0.006)   0.965 (±0.005)   0.950 (±0.005)   0.921 (±0.007)   0.931 (±0.006)   0.920 (±0.007)   0.908 (±0.009)   0.971 (±0.004)
0.1           0.616 (±0.014)   0.869 (±0.008)   0.877 (±0.006)   0.742 (±0.011)   0.846 (±0.007)   0.859 (±0.009)   0.854 (±0.007)   0.776 (±0.013)   0.891 (±0.009)
0.2           0.515 (±0.013)   0.741 (±0.009)   0.752 (±0.013)   0.612 (±0.011)   0.705 (±0.008)   0.711 (±0.009)   0.729 (±0.007)   0.621 (±0.014)   0.776 (±0.017)
0.3           0.387 (±0.015)   0.575 (±0.010)   0.591 (±0.007)   0.502 (±0.016)   0.565 (±0.015)   0.544 (±0.015)   0.585 (±0.015)   0.482 (±0.022)   0.627 (±0.018)
0.4           0.287 (±0.019)   0.412 (±0.012)   0.441 (±0.013)   0.395 (±0.023)   0.435 (±0.018)   0.396 (±0.021)   0.457 (±0.019)   0.363 (±0.029)   0.485 (±0.024)
0.5           0.234 (±0.012)   0.337 (±0.012)   0.374 (±0.013)   0.355 (±0.016)   0.384 (±0.013)   0.328 (±0.019)   0.400 (±0.014)   0.325 (±0.030)   0.429 (±0.024)


Table 5. Recognition results on YaleB with Gaussian noise (10 random runs).

Standard deviation σ   KNN              SRC              SKNN             Regression       GL               GLABS            GLSRC            SVM              GLSKNN
0                      0.778 (±0.010)   0.948 (±0.006)   0.965 (±0.005)   0.950 (±0.005)   0.921 (±0.007)   0.931 (±0.006)   0.920 (±0.007)   0.908 (±0.009)   0.971 (±0.004)
10                     0.583 (±0.011)   0.935 (±0.005)   0.943 (±0.005)   0.931 (±0.006)   0.915 (±0.006)   0.915 (±0.007)   0.919 (±0.008)   0.903 (±0.009)   0.969 (±0.009)
20                     0.584 (±0.012)   0.931 (±0.006)   0.942 (±0.005)   0.911 (±0.004)   0.916 (±0.008)   0.918 (±0.005)   0.914 (±0.010)   0.903 (±0.007)   0.958 (±0.009)
30                     0.582 (±0.012)   0.929 (±0.005)   0.936 (±0.005)   0.885 (±0.005)   0.910 (±0.009)   0.917 (±0.004)   0.916 (±0.008)   0.900 (±0.009)   0.955 (±0.014)

Fig. 6. Gaussian noise on YaleB with different standard deviations.

6. Conclusion and future work

In conclusion, we proposed an efficient and robust GLSKNN classifier. GLSKNN works very well on different levels of occluded images and on images with noise. For future work, we will apply this classifier to industrial use cases, where noisy data is very common and IoT data tends to have strong group effects.

References

[1] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B (Methodol.) (1996) 267–288.
[2] F. Windmeijer, H. Farbmacher, N. Davies, G.D. Smith, On the use of the lasso for instrumental variables estimation with some invalid instruments, J. Am. Stat. Assoc. (2019) 1–12.
[3] S. Zheng, C. Ding, F. Nie, Regularized singular value decomposition and application to recommender system, (2018) arXiv:1804.05090.
[4] S. Zheng, C. Ding, Minimal support vector machine, (2018) arXiv:1804.02370.
[5] S. Zheng, Machine Learning: Several Advances in Linear Discriminant Analysis, Multi-View Regression and Support Vector Machine, Ph.D. thesis, The University of Texas at Arlington, 2017.
[6] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.
[7] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, (2015) arXiv:1502.03167.
[8] S. Zheng, A. Vishnu, C. Ding, Accelerating deep learning with shrinkage and recall, in: Proceedings of the IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), IEEE, 2016, pp. 963–970.
[9] M.Y. Park, T. Hastie, L1-regularization path algorithm for generalized linear models, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 69 (4) (2007) 659–677.
[10] M. Schmidt, G. Fung, R. Rosales, Fast optimization methods for l1 regularization: a comparative study and two new approaches, in: Proceedings of Machine Learning: ECML, Springer, 2007, pp. 286–297.
[11] T. Zhang, Some sharp performance bounds for least squares regression with l1 regularization, Ann. Stat. 37 (5A) (2009) 2109–2144.
[12] C. Xu, D. Tao, C. Xu, A survey on multi-view learning, (2013) arXiv:1304.5634.
[13] S. Zheng, Z.-Y. Shae, X. Zhang, H. Jamjoom, L. Fong, Analysis and modeling of social influence in high performance computing workloads, in: Proceedings of the European Conference on Parallel Processing, Springer, Berlin Heidelberg, 2011, pp. 193–204.
[14] S. Zheng, C.H. Ding, F. Nie, H. Huang, Harmonic mean linear discriminant analysis, IEEE Trans. Knowl. Data Eng. (2018), doi:10.1109/TKDE.2018.2861858.
[15] S. Zheng, X. Cai, C.H. Ding, F. Nie, H. Huang, A closed form solution to multi-view low-rank regression, in: Proceedings of the AAAI, 2015, pp. 1973–1979.
[16] S. Zheng, F. Nie, C. Ding, H. Huang, A harmonic mean linear discriminant analysis for robust image classification, in: Proceedings of the IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2016, pp. 402–409.
[17] S. Zheng, C. Ding, Kernel alignment inspired linear discriminant analysis, in: Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer Berlin Heidelberg, 2014, pp. 401–416.
[18] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68 (1) (2006) 49–67.


[19] G. Swirszcz, N. Abe, A.C. Lozano, Grouped orthogonal matching pursuit for variable selection and prediction, in: Advances in Neural Information Processing Systems, 2009, pp. 1150–1158.
[20] S. Zheng, C. Ding, Sparse classification using group matching pursuit, Neurocomputing 338 (2019) 83–91.
[21] P. Zhao, G. Rocha, B. Yu, Grouped and hierarchical model selection through composite absolute penalties, Technical Report 703, Department of Statistics, UC Berkeley, 2006.
[22] L. Jacob, G. Obozinski, J.P. Vert, Group lasso with overlap and graph lasso, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 433–440.
[23] J. Huang, T. Zhang, The benefit of group sparsity, Ann. Stat. 38 (4) (2010) 1978–2004.
[24] S.G. Mallat, Z. Zhang, Matching pursuits with time-frequency dictionaries, IEEE Trans. Signal Process. 41 (12) (1993) 3397–3415.
[25] D. Gunning, Explainable artificial intelligence (XAI), Defense Advanced Research Projects Agency (DARPA), nd Web, 2, 2017.
[26] W. Samek, T. Wiegand, K.R. Müller, Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models, (2017) arXiv:1708.08296.
[27] A. Holzinger, C. Biemann, C.S. Pattichis, D.B. Kell, What do we need to build explainable AI systems for the medical domain?, (2017) arXiv:1712.09923.
[28] S. Zheng, K. Ristovski, A. Farahat, C. Gupta, Long short-term memory network for remaining useful life estimation, in: Proceedings of the IEEE International Conference on Prognostics and Health Management (ICPHM), IEEE, 2017, pp. 88–95.
[29] S. Zheng, A. Farahat, C. Gupta, Generative adversarial networks for failure prediction, (2019) arXiv:1910.02034.
[30] S. Zheng, C. Gupta, S. Serita, Manufacturing dispatching using reinforcement and transfer learning, (2019) arXiv:1910.02035.
[31] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322.
[32] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[33] J. Friedman, T. Hastie, R. Tibshirani, A note on the group lasso and a sparse group lasso, (2010) arXiv:1001.0736.
[34] N. Simon, J. Friedman, T. Hastie, R. Tibshirani, A sparse-group lasso, J. Comput. Graph. Stat. 22 (2) (2013) 231–245.
[35] K.-C. Toh, S. Yun, An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems, Pacif. J. Optim. 6 (3) (2010) 615–640.
[36] A.S. Georghiades, P.N. Belhumeur, D. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 643–660.
[37] L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories, Comput. Vis. Image Understand. 106 (1) (2007) 59–70.
[38] Z. Jiang, Z. Lin, L.S. Davis, Learning a discriminative dictionary for sparse coding via label consistent K-SVD, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2011, pp. 1697–1704.
[39] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2, IEEE, 2006, pp. 2169–2178.
[40] K.-C. Lee, J. Ho, D.J. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684–698.
[41] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 1794–1801.
[42] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 3360–3367.
[43] Q. Zhang, B. Li, Discriminative K-SVD for dictionary learning in face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 2691–2698.
[44] Z. Jiang, Z. Lin, L.S. Davis, Label consistent K-SVD: learning a discriminative dictionary for recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2651–2664.
Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2, IEEE, 2006, pp. 2169–2178. [40] K.-C. Lee, J. Ho, D.J. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684–698. [41] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition CVPR, IEEE, 2009, pp. 1794–1801. [42] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 3360–3367. [43] Q. Zhang, B. Li, Discriminative K-SVD for dictionary learning in face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 2691–2698. [44] Z. Jiang, Z. Lin, L.S. Davis, Label consistent K-SVD: learning a discriminative dictionary for recognition, Proceedings of the IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2651–2664.