Accepted Manuscript

Semi-supervised Two Phase Test Sample Sparse Representation Classifier

Y. El Traboulsi, F. Dornaika, Y. Ruichek

PII: S0950-7051(18)30325-3
DOI: 10.1016/j.knosys.2018.06.018
Reference: KNOSYS 4391

To appear in: Knowledge-Based Systems

Received date: 11 December 2017
Revised date: 5 June 2018
Accepted date: 19 June 2018

Please cite this article as: Y. El Traboulsi, F. Dornaika, Y. Ruichek, Semi-supervised Two Phase Test Sample Sparse Representation Classifier, Knowledge-Based Systems (2018), doi: 10.1016/j.knosys.2018.06.018

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Semi-supervised Two Phase Test Sample Sparse Representation Classifier

Y. El Traboulsi (1), F. Dornaika (2,3,*), Y. Ruichek (1)

(1) Le2i FRE2005, CNRS, Arts et Métiers, Université Bourgogne Franche-Comté, UTBM, F-90010 Belfort, France
(2) University of the Basque Country UPV/EHU, San Sebastian, Spain
(3) IKERBASQUE, Basque Foundation for Science, Bilbao, Spain

Abstract

The Two Phase Test Sample Sparse Representation (TPTSSR) classifier was recently proposed as an efficient alternative to the Sparse Representation Classifier (SRC). It aims at classifying data using sparse coding in two phases with $\ell_2$ regularization. Although high performance can be obtained by the TPTSSR classifier, it is a supervised classifier and therefore cannot benefit from unlabeled samples, which are very often available. In this paper, we introduce a semi-supervised version of the TPTSSR classifier called Semi-supervised Two Phase Test Sample Sparse Representation (STPTSSR). STPTSSR combines the merits of sparse coding, active learning and two phase collaborative representation classifiers. The proposed framework is able to make any sparse representation based classifier semi-supervised. Extensive experiments carried out on six benchmark image datasets show that the proposed STPTSSR can outperform the classical TPTSSR as well as many state-of-the-art semi-supervised methods.

Keywords: Semi-supervised learning, active learning, sparse coding, two phase test sample representation classifiers, pattern classification.


1. Introduction

Automatic image classification has attracted tremendous attention in the computer vision community. Thanks to semi-supervised learning (SSL), image classification has seen remarkable progress [1, 2, 3, 4]. Nowadays, SSL-based methods have become a hot topic in machine learning. Indeed, obtaining labeled data for machine learning is often very difficult and expensive [5, 6, 7, 8, 9]. Therefore, the use of unlabeled data is an excellent way to substantially improve learning methods and to broaden their range of application. Such methods benefit from labeled samples, which explicitly contribute to the discrimination capacity of the final model, and from unlabeled ones, which maintain the intrinsic structure of the data. However, SSL encounters difficulties, especially when labeled samples are scarce.

In this context, the number of labeled data can be increased using active learning paradigms (e.g., [10, 11, 12, 13]). The main concern of active learning is to obtain more labeled samples by predicting the labels of unlabeled samples, and possibly to use them to build new models or classifiers. One of the problems of active methods is to correctly identify the samples that should be handled. This problem has been thoroughly studied by active learning researchers. In general, the proposed solutions rely on notions of confidence in classification [12]. This can be achieved either by recovering the labels of the selected relevant samples [14] or by eliminating non-confident decisions. For instance, if misclassification is predominant for a specific sample (i.e., high uncertainty), then its predicted label will not be used, and the sample keeps its previous role as an unlabeled sample. In addition to uncertainty, other criteria have also been proposed. In order to avoid repeatedly labeling samples in the same cluster, Nguyen et al. [15] introduce diversity by proposing a pre-clustering method. In [16], the authors proposed an active cluster-based sampling technique. Nevertheless, since it uses a hierarchical clustering of unlabeled samples, its performance is directly affected by the performance of the clustering process. In [12], Jain et al. introduce an active probabilistic variant of the K-NN classifier that can further be used in multi-class scenarios. In [17], the authors proposed an algorithm that exploits the informativeness and representativeness of unlabeled samples. Active learning methods have a broad spectrum of applications in various fields such as social network analysis [18], saliency detection [19], object tracking [20], and zero-shot visual object recognition [21].

Besides active learning paradigms, sparse representation has brought a breakthrough to the pattern recognition community [22, 23]. This is due to its enormous capacity to acquire, represent and compress knowledge of the domain, and thus to reconstruct the data with minimal loss [24]. Sparse representation approaches construct a dictionary from the gallery images, then reconstruct the query image using a sparse linear combination of the dictionary atoms. The query image is assigned to the class giving the minimal reconstruction error with the least possible number of entries. Thus, the Sparse Representation based Classifier (SRC) [25] can be considered as a generalization of the Nearest Neighbor classifier (NN) and the Nearest Feature Subspace (NFS) [26]. The advantage of SRC compared to NN and NFS is its robustness to deviations and occlusions


[27]. The Robust Sparse Coding (RSC) model was proposed by Yang and Zhang [28]. The classifier built on this coding scheme is robust to various kinds of outliers (e.g., occlusion and facial expression). In [29, 30], He et al. proposed a robust sparse representation for face recognition based on the maximum correntropy criterion.

Despite the good results SRC offers, it suffers from slow processing because it adopts $\ell_1$ minimization. This problem grows rapidly as the number of atoms in the dictionary increases. Thus, SRC is not very useful for scenarios demanding a rapid decision or classification. This prompted many researchers to exploit the locality of data. In this context, the work of [31] demonstrated that the computational complexity of SRC can be reduced without sacrificing accuracy by simply restricting the sparse coding to the nearest neighbors. Other researchers were interested in the implementation of other coding schemes (using different norms) that achieve similar or even better results than the original SRC [32, 33]. For instance, He et al. [34] propose a two-stage coding scheme based on Non-negative Least Squares representation. In contrast, using $\ell_2$ minimization, Xu et al. propose in [35] a Two Phase Test Sample Sparse Representation (TPTSSR) method. This method is based on two phases. In the first phase, the testing sample is represented as a linear combination of all training samples. Then, the $M$ samples which give the best representation of the testing sample are considered to be its nearest neighbors. In the second phase, the testing sample is represented as a linear combination of its $M$ nearest neighbors, and the classification decision is made based on this representation.

1.1. Paper contribution

The main contribution of the paper is the extension of the TPTSSR classifier to the semi-supervised case. The resulting classifier is termed semi-supervised TPTSSR (STPTSSR). In the literature, the mainstream approach to semi-supervised learning is to exploit the unlabeled samples in the process of data projection estimation or in the label propagation process. Our proposed semi-supervised variant jointly uses the unlabeled samples as data samples and integrates the concept of active learning so that predicted labels are used judiciously in the final decision of the resulting classifier.

Our idea is to make the collaborative representation classifier TPTSSR (or any other similar classifier) semi-supervised in the sense that it learns its discrimination power by exploiting a given amount of unlabeled data samples. The main properties of the proposed STPTSSR are: (i) it can be considered as an inductive semi-supervised classifier, (ii) it actively expands its discrimination ability, and (iii) it uses a self-taught parameter. To the best of our knowledge, the proposed STPTSSR is the first attempt to introduce semi-supervised learning to collaborative representation based classifiers. The most important properties that characterize our work can be summarized as follows:

• Our proposed STPTSSR combines the merits of sparse coding, active learning and semi-supervised learning.

• The introduced framework is able to make any collaborative representation based classifier semi-supervised.

• It regularizes the deviation provided by the labeled samples and the one provided by the whole training data. This regularization maintains the balance between ground truth samples and the predicted ones.

• Unlike SRC, which is based on $\ell_1$ minimization, our proposed classifier inherits the computational efficiency of the TPTSSR classifier, which is based on the $\ell_2$ norm; consequently, the classification task is not computationally expensive.

This paper is organized as follows: in section 2, we review some related works. This section briefly describes some semi-supervised learning approaches. It also reviews the classical Two Phase Test Sample Sparse Representation classifier. Section 3 describes two schemes for the automatic estimation of the TPTSSR classifier parameter. Section 4 introduces our STPTSSR classifier. Experimental results and a comparative study of our proposed classifier and state-of-the-art methods are described in section 5. The conclusion is presented in section 6. In the sequel, capital bold letters denote matrices and bold letters denote vectors.

2. Related works

2.1. Semi-supervised learning

This section briefly describes some works on semi-supervised classification. All semi-supervised methods exploit two types of data samples: labeled and unlabeled. The semi-supervised methods can be classified into two main categories: transductive and inductive. Transductive methods are able to infer the labels of the unlabeled data samples only. On the other hand, inductive methods are able to infer the labels of new unseen data samples in addition to those of the unlabeled samples. Furthermore, semi-supervised approaches can be classified into two main families: (i) graph-based label propagation methods, and (ii) graph-based embedding methods.


2.1.1. Graph-based label propagation methods

Semi-supervised learning methods using graph-based label propagation have attracted much attention in the last decade. All of them impose that samples with high similarity should share similar labels. They differ by the regularization term as well as by the loss function used for fitting the label information associated with the labeled samples. All of these methods use the graph similarity matrix and the initial labels of some samples. Some recent label propagation algorithms (they can also be called classifiers [36]) are: Gaussian Fields and Harmonic Functions (GFHF) [37], Local and Global Consistency (LGC) [38], Laplacian Regularized Least Square (LapRLS) [39], Robust Multi-class Graph Transduction (RMGT) [40], Flexible Manifold Embedding (FME) [41], and Kernel Flexible Manifold Embedding (KFME) [6]. These techniques can be either transductive (defined for training samples only) or inductive (defined for both training and unseen samples). For instance, the GFHF method is transductive whereas the LapRLS and FME methods are inductive. The FME method solves the label propagation problem by adopting a criterion for both settings: transductive and inductive.

2.1.2. Graph-based embedding methods

In this category of methods, the labeled and unlabeled samples are simultaneously used in order to estimate a discriminant projection. Once the data are represented in the new space, a simple classifier is invoked in order to predict the labels of the testing samples. In practice, the classifier used is the nearest neighbor classifier.

Semi-supervised Discriminant Analysis (SDA). By adding a geometrically based regularizer, Cai et al. extend LDA into its semi-supervised version: SDA [42]. This method seeks a projection matrix that projects samples to a reduced subspace wherein samples having the same label are as close as possible, while keeping the intrinsic geometric structure of the data.

Semi-supervised Discriminant Embedding (SDE). SDE [43] can be seen as a semi-supervised extension of LDE. Its purpose is to predict a transformation matrix that transforms data points from their high dimensional space to a reduced subspace while contributing, at the same time, to the classification task.

Sparsity Preserving Discriminant Analysis (SPDA). Like SDA, SPDA [44] can be considered as an extension of LDA. Instead of using a Laplacian smoothness criterion related to the whole labeled and unlabeled data, SPDA proposes to use the Sparsity Preserving property. Therefore, the SDA method uses the criterion related to Locality Preserving Projections, while SPDA uses the criterion related to Sparsity Preserving Projections (SPP).

Exponential Semi-supervised Discriminant Embedding (ESDE). ESDE [45] was proposed in order to overcome the Small Sample Size problem as well as to emphasize the discrimination property by enlarging the distances between samples that belong to different classes.

2.2. Review of the Two Phase Test Sample Sparse Representation (TPTSSR)

The Two Phase Test Sample Sparse Representation (TPTSSR) classifier was proposed in [35]. It tries to reconstruct the testing sample as a sparse combination of training samples using two phases while minimizing the residual error with $\ell_2$ regularization.

Assume $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ is the training data matrix, where each column $\mathbf{x}_i$, $1 \le i \le N$, is a training sample and $D$ is the dimension of a sample. Let $C$ denote the number of classes.

First Phase:

In the first phase, the testing sample $\mathbf{y} \in \mathbb{R}^D$ is represented as a linear combination of all training samples as follows:

$$\mathbf{y} = \mathbf{X}\,\mathbf{a} \tag{1}$$

where $\mathbf{a} = [a_1, a_2, \ldots, a_N]^T$ and $a_i$ ($1 \le i \le N$) are the coefficients. Using $\ell_2$ regularization, TPTSSR estimates $\mathbf{a}$ by:

$$\mathbf{a}^\star = (\mathbf{X}^T \mathbf{X} + \lambda\,\mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$$

where $\lambda$ is a small positive constant and $\mathbf{I}$ is the identity matrix of appropriate size. Based on Equation (1), the contribution of the sample $\mathbf{x}_i$ to the construction of $\mathbf{y}$ is $a_i \mathbf{x}_i$. Thereby, $\mathbf{x}_i$ is considered to have a large contribution when $\|\mathbf{y} - a_i \mathbf{x}_i\|^2$ is small. Accordingly, the $M$ samples ($1 \le M \le N$) that have the $M$ largest contributions pass to the second phase. These samples are named $\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_M$, and they are denoted by the matrix $\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_M]$.

Second Phase:

In the second phase, the testing sample is represented as a combination of the selected $M$ training samples. This can be expressed by:

$$\mathbf{y} = \tilde{\mathbf{X}}\,\mathbf{b}$$

where $\mathbf{b}$ is the coefficient vector. Similarly to $\mathbf{a}$, the unknown coefficients $\mathbf{b}$ can be computed by:

$$\mathbf{b}^\star = (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}} + \gamma\,\mathbf{I})^{-1} \tilde{\mathbf{X}}^T \mathbf{y}$$

where $\gamma$ is a small positive regularization constant.

Suppose that $t$ samples belong to the $k$-th class: $\tilde{\mathbf{x}}_1^k, \ldots, \tilde{\mathbf{x}}_t^k$, and that their corresponding coefficients are $b_1^k, \ldots, b_t^k$. The deviation of class $k$ is given by:

$$\mathrm{Dev}(k) = \Big\| \mathbf{y} - \sum_{j=1}^{t} \tilde{\mathbf{x}}_j^k\, b_j^k \Big\|^2 \tag{2}$$

From the above expression, one can realize that a greater contribution corresponds to a smaller deviation. Therefore, the estimated class of $\mathbf{y}$, $l(\mathbf{y})$, is given by:

$$l(\mathbf{y}) = \arg\min_{k} \mathrm{Dev}(k), \quad k = 1, \ldots, C$$

The TPTSSR classifier requires a parameter, denoted by $M$, that defines the number of samples selected in the first phase. In [35], the authors provide no mechanism for choosing $M$ automatically. Instead, the authors try a large number of values for $M$ when applying the algorithm to the testing data. This is not a practical solution because, based on the results shown in [35, 32], the value of $M$ can have a significant impact on the performance of the classifier. Indeed, unlike the regularization constants $\lambda$ and $\gamma$, the value of $M$ significantly influences the final performance of the TPTSSR classifier. In [32], we proposed two schemes that aim to automatically estimate $M$ using only the training set. The supervised scheme presented in [32], as well as a revisited version of an unsupervised scheme, are briefly presented in the next section.
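To make the two phases concrete, the following NumPy sketch outlines the procedure reviewed above (an illustration only: the helper name tptssr_classify and the default values of the regularization constants are our own choices, not prescribed by [35]).

```python
import numpy as np

def tptssr_classify(X, labels, y, M, lam=0.01, gamma=0.01):
    """Sketch of the two-phase TPTSSR classifier described above.

    X: (D, N) training matrix (one sample per column); labels: length-N
    array of class labels; y: (D,) test sample; M: samples kept after
    phase one; lam/gamma: l2 regularization constants.
    """
    N = X.shape[1]
    # Phase 1: l2-regularized coding of y over all training samples.
    a = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)
    # Contribution of sample i is a_i * x_i; keep the M samples with the
    # smallest distances ||y - a_i x_i|| (ranking matches the squared form).
    residuals = np.linalg.norm(y[:, None] - X * a[None, :], axis=0)
    keep = np.argsort(residuals)[:M]
    X_kept, labels_kept = X[:, keep], labels[keep]
    # Phase 2: l2-regularized coding over the M retained samples.
    b = np.linalg.solve(X_kept.T @ X_kept + gamma * np.eye(M), X_kept.T @ y)
    # Class-wise deviation, Eq. (2); the class minimizing it is predicted.
    classes = np.unique(labels_kept)
    devs = [np.linalg.norm(y - X_kept[:, labels_kept == k] @ b[labels_kept == k]) ** 2
            for k in classes]
    return classes[int(np.argmin(devs))]
```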

3. Optimization schemes of TPTSSR parameter


3.1. Global supervised scheme

In this scheme, for a given value of $M$, a TPTSSR classification is applied to each training sample. This sample is treated as a test sample, while all other training samples form the training data of the classifier (i.e., when a sample $\mathbf{x}_i$ is considered as a testing sample, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{i-1}, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_N$ constitute the training samples). For this $M$, a scoring measure is estimated according to the number of correctly classified samples. This process is repeated for all possible values of $M \in [M_{\min}, M_{\max}]$. This scheme is very similar to the Leave One Out Cross Validation scheme, except that only training data contribute to the estimation of $M$. Consequently, the estimated optimal $M$ (one single value) is chosen as the value that gives the best score:

$$M^\star = \arg\max_{M \in [M_{\min}, M_{\max}]} \mathrm{score}(M).$$
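A minimal sketch of this selection loop, reusing the hypothetical tptssr_classify helper from the previous section (candidate values of M are assumed to satisfy M ≤ N − 1):

```python
import numpy as np

def select_m_global(X, labels, m_candidates):
    """Choose M by leave-one-out style scoring on the training set only."""
    N = X.shape[1]
    best_m, best_score = None, -1
    for M in m_candidates:                     # values within [M_min, M_max]
        score = 0
        for i in range(N):
            rest = np.delete(np.arange(N), i)  # all training samples except x_i
            pred = tptssr_classify(X[:, rest], labels[rest], X[:, i], M)
            score += int(pred == labels[i])
        if score > best_score:
            best_m, best_score = M, score
    return best_m
```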

3.2. Local unsupervised scheme


This scheme is composed of two stages: a training stage and a testing stage. In the training stage, for each training sample $\mathbf{x}_i$, $M$ is calculated using Algorithm 1. In this algorithm, $dist(\mathbf{x}_i, \mathbf{x}_j)$ denotes the distance that separates $\mathbf{x}_i$ from $\mathbf{x}_j$, and $d(\mathbf{x}_j, CS)$ denotes the minimal distance between $\mathbf{x}_j$ and any sample of the “Current Set” ($CS$). The parameter $K$ should be large enough in order to capture the local structure. The basic idea of Algorithm 1 is to compute a neighborhood size for every training sample using a kind of jump detection in the current cluster.

The threshold value used in Algorithm 1 is sample-based in the sense that it is computed for each training sample whose $M$ should be estimated. In our work, we set this threshold to the average of the distances to the $K$ nearest neighbors of the sample in question:

$$\mathrm{Threshold}(\mathbf{x}_i) = \frac{\sum_{j=1}^{K} d(\mathbf{x}_i, \mathbf{x}_j)}{K} \tag{3}$$

where $K$ is a large neighborhood size.

In the testing stage, the value of $M$ provided for a testing sample depends on its nearest training neighbors. In our work, the $M$ value of the testing sample is set to the mean of the $M$ values of its nearest training neighbors.

4. Semi-supervised Two Phase Test Sample Sparse Representation classifier


4.1. Proposed Method

Since TPTSSR is a supervised classifier, its final performance can be limited by the scarcity of labeled samples. In many situations, labeled samples are rare in practice, unlike unlabeled ones, which are very often available in sheer numbers. Inspired by active learning principles, we introduce in this section the Semi-supervised Two Phase Test Sample Sparse Representation (STPTSSR) classifier. The inputs to our framework are labeled and unlabeled data samples. The objective is to build another TPTSSR classifier that exploits both labeled and unlabeled samples.


Algorithm 1 Local unsupervised scheme
Inputs: training samples; index i of the current sample xi
Output: optimal M for xi

  Initialize a large neighborhood size K
  Search the set of K nearest neighbors of xi: KNN(xi) = {xj}, j = 1, ..., K
  Initialize the Current Set: CS = {xi}
  Initialize the Remain Set: RS = KNN(xi) − {xi}
  Initialize the threshold: Threshold(xi) = (sum_{j=1}^{K} d(xi, xj)) / K
  j = 1
  while (RS ≠ ∅) do
      xj = arg min_{x ∈ RS} dist(x, xi)
      M = j
      if d(xj, CS) < Threshold(xi) then
          CS = CS ∪ {xj}; RS = RS − {xj}; j = j + 1
      else
          break
      end if
  end while
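A direct Python transcription of Algorithm 1 might look as follows (a sketch assuming Euclidean distances; the function name local_m is ours). At test time, as described above, the M of a test sample would then be set to the mean of the local_m values of its nearest training neighbors.

```python
import numpy as np

def local_m(X, i, K):
    """Sketch of Algorithm 1: neighborhood size M for training sample x_i.

    X is (D, N) with one sample per column; Euclidean distances assumed.
    """
    xi = X[:, i]
    dists = np.linalg.norm(X - xi[:, None], axis=0)
    dists[i] = np.inf                      # exclude x_i itself
    knn = np.argsort(dists)[:K]            # K nearest neighbors of x_i
    threshold = dists[knn].mean()          # Eq. (3)
    cs = [i]                               # Current Set
    M = 0
    # knn is already sorted by distance to x_i, so iterating over it is
    # equivalent to repeatedly taking arg min over the Remain Set.
    for j, idx in enumerate(knn, start=1):
        d_cs = min(np.linalg.norm(X[:, idx] - X[:, c]) for c in cs)
        M = j
        if d_cs < threshold:
            cs.append(idx)
        else:
            break                          # jump detected: stop growing
    return M
```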

Assume that $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_L$ are the labeled samples and that $\mathbf{x}_{L+1}, \mathbf{x}_{L+2}, \ldots, \mathbf{x}_N$ are the unlabeled samples. We define the matrix of labeled samples by $\mathbf{X}_l = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_L] \in \mathbb{R}^{D \times L}$ and the matrix of unlabeled samples by $\mathbf{X}_u = [\mathbf{x}_{L+1}, \mathbf{x}_{L+2}, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times U}$, where $L$ and $U = N - L$ are the numbers of labeled and unlabeled samples, respectively. The training matrix $\mathbf{X}$ is defined by $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$.

Using active learning principles, we aim to recover the labels of the unlabeled samples $\mathbf{X}_u$, and then use the original labeled data together with the predicted ones, i.e. $\mathbf{X}$, to build a new classifier. We emphasize that TPTSSR is a lazy classifier in the sense that all its computation steps run in the testing phase. Recovering the labels of the unlabeled samples can be carried out using any classifier. In our work, we use the TPTSSR classifier based on the original set of labeled samples. While the active learning principle seems trivial, we will introduce a new mechanism for performing the testing sample classification once the unlabeled samples have their predicted labels. This mechanism can be used by all classifiers that are based on collaborative representation. Our proposed scheme is able to upgrade any collaborative representation-based classifier to the semi-supervised case. Once we have the data matrix $\mathbf{X}$, where each sample has either a ground-truth label or a predicted one, the testing sample classification of the proposed STPTSSR proceeds as follows.


Similarly to TPTSSR, the proposed STPTSSR has two phases. The basic idea is to run two coding schemes separately, each containing two phases of coding. The first coding scheme is invoked on the labeled data $\mathbf{X}_l$. The second coding scheme is invoked on the training data matrix $\mathbf{X}$. Then a fusing scheme for the residual errors is proposed in order to get the class label of the testing sample. Let $M_l$ and $M$ denote the parameters of the two coding processes.

First Phase:

In the first phase, the testing sample $\mathbf{y} \in \mathbb{R}^D$ is represented in two forms: the first one is a linear combination of the labeled samples $\mathbf{X}_l$ and the second one is a linear combination of the whole training data $\mathbf{X}$. The representations of $\mathbf{y}$ are given by the following equations:

$$\mathbf{y} = a_1^l \mathbf{x}_1 + a_2^l \mathbf{x}_2 + \ldots + a_L^l \mathbf{x}_L \tag{4}$$

$$\mathbf{y} = a_1 \mathbf{x}_1 + a_2 \mathbf{x}_2 + \ldots + a_L \mathbf{x}_L + \ldots + a_N \mathbf{x}_N \tag{5}$$

The above equations can be written in matrix form as follows:

$$\mathbf{y} = \mathbf{X}_l\,\mathbf{a}^l \quad \text{and} \quad \mathbf{y} = \mathbf{X}\,\mathbf{a}$$

where $\mathbf{a}^l = [a_1^l, a_2^l, \ldots, a_L^l]^T$ and $\mathbf{a} = [a_1, a_2, \ldots, a_N]^T$. Thus, the problem is to find the coefficients $a_i^l$ ($1 \le i \le L$) and $a_j$ ($1 \le j \le N$).

The unknown code vectors $\mathbf{a}^l$ and $\mathbf{a}$ are estimated using $\ell_2$ regularized Least Squares. These two vectors are estimated by optimizing the following criteria, respectively:

$$\mathbf{a}^{l\star} = \arg\min_{\mathbf{a}^l} \|\mathbf{y} - \mathbf{X}_l\,\mathbf{a}^l\|^2 + \lambda_l \|\mathbf{a}^l\|^2$$

$$\mathbf{a}^\star = \arg\min_{\mathbf{a}} \|\mathbf{y} - \mathbf{X}\,\mathbf{a}\|^2 + \lambda \|\mathbf{a}\|^2$$

where $\lambda_l$ and $\lambda$ are small positive constants. Using simple algebraic manipulation, we can show that $\mathbf{a}^{l\star}$ and $\mathbf{a}^\star$ are given by:

$$\mathbf{a}^{l\star} = (\mathbf{X}_l^T \mathbf{X}_l + \lambda_l\,\mathbf{I}_l)^{-1} \mathbf{X}_l^T \mathbf{y}, \qquad \mathbf{a}^\star = (\mathbf{X}^T \mathbf{X} + \lambda\,\mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} \tag{6}$$

where $\mathbf{I}$ and $\mathbf{I}_l$ are identity matrices of appropriate sizes.

With reference to equations (4) and (5), it can be noted that each sample has its own


contribution in the construction of $\mathbf{y}$. According to Equation (4), the contribution of a sample $\mathbf{x}_i$ is $a_i^l \mathbf{x}_i$. From Equation (5), the contribution is $a_i \mathbf{x}_i$. Thus, $\mathbf{x}_i$ has a large contribution in Equation (4) if $\|\mathbf{y} - a_i^l \mathbf{x}_i\|^2$ is small, and it has a large contribution in Equation (5) if $\|\mathbf{y} - a_i \mathbf{x}_i\|^2$ is small. Based on this, the $M_l$ samples ($1 \le M_l \le L$) that have the $M_l$ largest contributions when reconstructing $\mathbf{y}$ in Equation (4) and the $M$ samples ($1 \le M \le N$) that have the $M$ largest contributions when reconstructing $\mathbf{y}$ in Equation (5) are chosen to be handed over to the second phase. The two subsets of retained samples are denoted by $\{\tilde{\mathbf{x}}_1^l, \tilde{\mathbf{x}}_2^l, \ldots, \tilde{\mathbf{x}}_{M_l}^l\}$ and $\{\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_M\}$. In matrix form, these data samples are given by $\tilde{\mathbf{X}}_l = [\tilde{\mathbf{x}}_1^l, \tilde{\mathbf{x}}_2^l, \ldots, \tilde{\mathbf{x}}_{M_l}^l]$ and $\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_M]$.

Second Phase:

Similarly to the first phase, in the second phase the testing sample is represented in two forms: the first one is a combination of the remaining $M_l$ labeled samples and the second one is a combination of the remaining $M$ training samples. This can be expressed by:

$$\mathbf{y} = \tilde{\mathbf{X}}_l\,\mathbf{b}^l \quad \text{and} \quad \mathbf{y} = \tilde{\mathbf{X}}\,\mathbf{b}$$

where $\mathbf{b}^l$ and $\mathbf{b}$ represent the coefficient vectors. Similarly to equations (4) and (5), the vectors $\mathbf{b}^l$ and $\mathbf{b}$ are given by:

$$\mathbf{b}^{l\star} = (\tilde{\mathbf{X}}_l^T \tilde{\mathbf{X}}_l + \gamma_l\,\mathbf{I}_l)^{-1} \tilde{\mathbf{X}}_l^T \mathbf{y}, \qquad \mathbf{b}^\star = (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}} + \gamma\,\mathbf{I})^{-1} \tilde{\mathbf{X}}^T \mathbf{y} \tag{7}$$

where $\gamma$ and $\gamma_l$ are two positive constants.

where γ and γl are two positive constants.

Suppose that there are tl samples, from the Ml labeled samples, belonging to the k th class:

CE

(e xl1 )k , (e xl2 )k , . . . , (e xltl )k and their corresponding coefficients are (bl1 )k , (bl2 )k , . . . , (bltl )k , and from the M training samples, there are t samples belonging to the k th class (or estimated to be of

AC

this class): (e x1 )k , (e x2 )k , . . . , (e xt )k and their corresponding coefficients are (b1 )k , (b2 )k , . . . , (bt )k . We define the deviation of class k as follows:

2

2

tl t X X



k k l k k

ej (bj ) + (1 − η) y − ej (bj ) x x Dev(k) = η y −

(8)

j=1

j=1

where η is a balance parameter (0 ≤ η ≤ 1).

As we can see from the above equation, the proposed deviation is a way of evaluating the collaborative contribution of the retained samples of the k th class, in representing the testing 11

ACCEPTED MANUSCRIPT

sample y by both Xl and X. A large contribution corresponds to small deviation. Therefore, y is finally classified as follows: l(y) = arg min Dev(k) k

1≤k≤C

where C is the number of classes and l(y) is the estimated class label of y. The motivation

CR IP T

behind this merging rule is the fact the labels associated with the samples in X are not all correct, and thus it is safe to down-weigh their deviation in each class they have representatives. 4.2. The Algorithm

The inputs of the algorithm are: the labeled data matrix $\mathbf{X}_l = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_L] \in \mathbb{R}^{D \times L}$, the training data matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ (it contains both labeled and unlabeled samples), the testing sample $\mathbf{y} \in \mathbb{R}^D$, and the parameters $M$ and $M_l$.

1. Predict labels for the samples $\mathbf{x}_{L+1}, \mathbf{x}_{L+2}, \ldots, \mathbf{x}_N$ using the TPTSSR classifier or any other classifier. The number of samples passed to the second phase of TPTSSR is $M$.

2. Calculate $\mathbf{a}^\star$ and $\mathbf{a}^{l\star}$ using Equation (6).

3. Compute the vector $\mathbf{e} = (e_1, e_2, \ldots, e_N)^T$ where $e_i = \|\mathbf{y} - a_i \mathbf{x}_i\|^2$, then sort $\mathbf{e}$. Select the samples that correspond to the lowest $M$ values of $\mathbf{e}$. These samples are denoted by $\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_M$. Finally, form the matrix $\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_M]$.

4. Form $\tilde{\mathbf{X}}_l = [\tilde{\mathbf{x}}_1^l, \tilde{\mathbf{x}}_2^l, \ldots, \tilde{\mathbf{x}}_{M_l}^l]$ in the same way, using $e_i^l = \|\mathbf{y} - a_i^l \mathbf{x}_i\|^2$ instead of $e_i = \|\mathbf{y} - a_i \mathbf{x}_i\|^2$ and $M_l$ instead of $M$.

5. Compute the coefficients $\mathbf{b}^\star$ and $\mathbf{b}^{l\star}$ using Equation (7).

6. For each class $k$ ($1 \le k \le C$), compute the global deviation using Equation (8).

7. The class that corresponds to the lowest deviation is selected as the estimated class of $\mathbf{y}$.
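The following sketch summarizes these steps in Python (an illustration under our assumptions: it reuses the hypothetical tptssr_classify helper from Section 2.2 for step 1 and factors the $\ell_2$ coding of Equations (6) and (7) into a ridge_code helper; the default η = 0.8 mirrors the value used in our experiments).

```python
import numpy as np

def ridge_code(Xmat, y, reg):
    """l2-regularized coding of y over the columns of Xmat (cf. Eqs. (6)-(7))."""
    n = Xmat.shape[1]
    return np.linalg.solve(Xmat.T @ Xmat + reg * np.eye(n), Xmat.T @ y)

def stptssr_classify(Xl, yl, Xu, y, M, Ml, lam=0.01, gamma=0.01, eta=0.8):
    """Sketch of STPTSSR. Xl/yl: labeled samples/labels; Xu: unlabeled; y: test."""
    # Step 1: predict labels of the unlabeled samples (active learning step).
    yu = np.array([tptssr_classify(Xl, yl, Xu[:, j], M) for j in range(Xu.shape[1])])
    X, labels = np.hstack([Xl, Xu]), np.concatenate([yl, yu])

    def two_phase(Xmat, labs, k_keep):
        # Steps 2-5: code y, keep the k_keep best contributors, re-code y.
        a = ridge_code(Xmat, y, lam)
        res = np.linalg.norm(y[:, None] - Xmat * a[None, :], axis=0)
        keep = np.argsort(res)[:k_keep]
        b = ridge_code(Xmat[:, keep], y, gamma)
        return Xmat[:, keep], labs[keep], b

    Xt_l, lab_l, b_l = two_phase(Xl, yl, Ml)   # labeled branch
    Xt, lab, b = two_phase(X, labels, M)       # whole-training-data branch

    def dev(Xr, lr, br, k):
        m = lr == k
        recon = Xr[:, m] @ br[m] if m.any() else np.zeros_like(y)
        return np.linalg.norm(y - recon) ** 2  # empty class -> ||y||^2

    # Steps 6-7: fused deviation of Eq. (8), minimized over the classes.
    classes = np.unique(yl)
    scores = [eta * dev(Xt_l, lab_l, b_l, k) + (1 - eta) * dev(Xt, lab, b, k)
              for k in classes]
    return classes[int(np.argmin(scores))]
```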

5. Performance Study

In order to assess the performance of the proposed classifier, we compared it to twelve methods: K-Nearest Neighbor (K-NN), Support Vector Machines (SVM) adopting a polynomial kernel, Sparse Representation based Classifier (SRC) [25], Two Phase Test Sample Sparse Representation (TPTSSR) [35], Semi-supervised Discriminant Embedding (SDE) [46], Semi-supervised Discriminant Analysis (SDA) [42], Transductive Component Analysis (TCA) [47], Sparsity Preserving Discriminant Analysis (SPDA) [48], Laplacian Regularized Least Squares (LapRLS) [49], Flexible Manifold Embedding (FME) [50], Kernel Flexible Manifold Embedding (KFME) [6], and Semi-supervised Exponential Discriminant Embedding (ESDE) [45].

In this context, six benchmark image datasets are used. The selected datasets correspond to several types: four face datasets (Extended Yale, FERET, UMIST and Honda), one object

database (COIL20) and one handwritten digits database (USPS). 5.1. Data description

• Extended Yale1 : We use the cropped version of Extended Yale which consists of 1774 facial images belonging to 28 individuals. Multiple variations in illumination and facial expression are detected in this database. Images are resized to 32 × 32.

AN US

• FERET2 : We use a subset of FERET. This subset contains 7 images for each of the 200 subjects. According to the position of eyes, original images are cropped and rescaled to 32 × 32 pixels. Variations in facial expressions, illuminations and poses are present. • UMIST3 : This database contains 575 face images. These images belong to 20 different

M

individuals. Variations in head pose are observed in this database.

• Honda: We use a subset taken from the public Honda Video DataBase (HVDB). This

ED

subset consists of 1138 images belonging to 22 persons. • COIL204 : The Columbia Object Image Library (COIL20) database consists of 1440 im-

PT

ages of 20 objects. For each object 72 images are taken depending on the rotation angle (0o , 5o , 10o , . . . , 355o ). Objects display a wide variety of complex geometry and reflectance

CE

characteristics. We use a subset of the database containing 18 images for each object (one image for every 20o of rotation).

AC

• USPS Handwritten Digits5 : This dataset consists of 11000 images of handwritten digits from “0” to “9” (1100 images per digit). We use the tenth of this dataset (for each digit 110 images).

1

http : //vision.ucsd.edu/ leekc/ExtY aleDatabase/ExtY aleB.html http : //www.itl.nist.gov/iad/humanid/f eret 3 https : //www.shef f ield.ac.uk/eee/research/iel/research/f ace 4 http : //www.cs.columbia.edu/CAV E/sof tware/sof tlib/coil − 20.php 5 http : //www.cs.nyu.edu/ roweis/data.html 2

13

ACCEPTED MANUSCRIPT

5.2. Experimental Setup We randomly divide each dataset into labeled, unlabeled and testing samples. Unless stated otherwise, three different partitions of data were used in our experiments. These partitions are illustrated in Table 1. For each partition, the splitting process is repeated ten times. As a preprocessing step, PCA was applied to the training data of all datasets. The percentage of

CR IP T

PCA variability was fixed to 98%. Table 1: Data partitions for the used image datasets.

Training Samples Labeled Samples Unlabeled Samples 15% 35% 25% 25% 35% 15%

Testing Samples 50% 50% 50%

AN US

Partition Partition 1 Partition 2 Partition 3

For a multi-class classification problem, each class can be regarded as ”positive” and all other classes are considered as ”negative”. Thus, we adopt the following three metrics for evaluating the performance of classification. These are as follows:

1. Correct Classification Rate This is given by the proportion of testing data that are

M

correctly classified. For multi-class classification of images, this is the most commonly used quantitative metric.

ED

2. Recall The recall, Rj , of a given positive class j (also called True Positive Rate), is the Correct Classification Rate associated with the testing examples of that class. For a given

PT

set of testing images belonging to C classes, the associated overall recall R is given by:

CE

R=

C C T Pj 1 X 1 X Rj = C C T Pj + F Nj j=1

(9)

j=1

3. Precision. The precision, Pj , of a given positive class j measures the number of true

AC

positives out of the samples predicted as positives. Thus, for a given set of testing images belonging to C classes, the associated overall precision P is given by:

P =

C C T Pj 1 X 1 X Pj = C C T Pj + F Pj j=1

(10)

j=1

In the above formulas, the jth class is considered as the positive class and the other classes are considered as the negative class. Furthermore, we have:

14

ACCEPTED MANUSCRIPT

• T Pj (True Positive for the jth class) is the number of examples in class j that are correctly classified as belonging to class j. • F Nj (False Negative for the jth class) is the number of examples in class j that are wrongly classified as belonging to the negative class (all other classes). • F Pj (False Positive for the jth class) is the number of examples in the negative class (all

CR IP T

other classes) that are wrongly classified as being belonging to class j.

For testing data containing imbalanced data samples, the overall recall and precision are given by:

P

=

j=1 nj

PC

PC

Rj

j=1 nj

AN US

R =

PC

j=1 nj

PC

Pj

j=1 nj

(11) (12)

where nj is the number of testing examples belonging to class j.

M

5.3. Global comparison

Table 2 presents the recognition performance of the proposed method together with ten of the

ED

already mentioned state of the art methods. This table reports the recognition rate average and its standard deviation over ten random trials. For FME, KFME, SDE, SDA, SPDA, LapRLS and TCA methods, all parameters are tuned using the interval {10−9 , 10−6 , 10−3 , 1, 10+3 , 10+6 , 10+9 }.

PT

Regarding the STPTSSR classifier, M and Ml parameters are chosen from {30, 60, 90, ..., N }. The regularization parameters λ, λl , γ and γl of the STPTSSR method are fixed to 0.01 and

CE

η = 0.8. It is worthy noting that the search step for the M and Ml parameters can be smaller than 30. Indeed, a small step can provide better final performance of the proposed method.

AC

However, we report results using this arbitrary choice for M and Ml . For the embedding methods (SDE, SDA, SPDA, TCA, and ESDE), the classification was

performed using the nearest neighbor (1-NN) classifier. Reported results are the top-1 recognition rate from the best parameters configuration over ten splits. The best recognition rates are shown in bold. Table 3 illustrates the precision and recall of the competing methods obtained with the USPS, Honda, COIL20 and Extended Yale datasets. Since our problem is a multi-class problem, the precision and recall are derived from the multi-class confusion matrices and correspond to an 15

ACCEPTED MANUSCRIPT

average over all classes. In Table 3, the results correspond to one random split associated with partition 3. Several conclusions can be drawn from Tables 2 and 3. These are as follows. • In general, our proposed semi-supervised classifier outperforms many other competing methods.

CR IP T

• Although our method is outperformed by some methods for the UMIST dataset, the recognition rate of STPTSSR is very close to the best rate obtained by other methods. In fact, the difference between the accuracy of the STPTSSR method and the best accuracy is less than 1% in both cases.

AN US

• The outperformance of the proposed STPTSSR method is significant for the Honda and Extended Yale datasets which contain face images with a high variability. • For the FERET dataset, the recognition accuracy obtained by the 1-NN classifier seems to be good: it obtains the highest recognition rate in one case (one labeled image per class) and the second highest rate in two other cases. This may be due to the large number of

M

classes in FERET dataset and to the very few labeled samples per class. Therefore, the simplicity of the 1-NN classifier can contribute to a good classification performance for

ED

such cases.

• For the FERET dataset, all tested methods have provided a large standard deviation for

PT

the accuracy. This can be explained by the fact that this dataset contains images of 200 subjects (classes) where each class has a large variation. At the same time, the training

CE

process considers very few labeled images per class (1 or 2 images par class). Adopting few random splits for performance evaluation, the accuracies of the individual splits will

AC

have a broad interval. • In general, the recall and precision obtained by the proposed semi-supervised classifier are better than those obtained by the competing methods.

5.4. Statistical significance In order to rigorously analyze the results obtained and derive strong conclusions out of them, we perform a statistical significance analysis. In the literature, one can find several statistical significance tests [51, 52, 53]. According to the study conducted in [52], the Wilcoxon Rank-Sum

16

ACCEPTED MANUSCRIPT

Test seems to be a powerful tool since it has many advantages over the paired t-test [51]. Thus, we consider the Wilcoxon Rank-Sum Test and adopt a confidence level of 95% (i.e., we consider a statistical significance threshold of p < 0.05). Table 4 illustrates the results of Wilcoxon test on the performances depicted in Table 2. Each cell in the table corresponds to a comparison between the proposed STPTSSR method and a given competing method. The symbol (X) means that the proposed method (STPTSSR) significantly improves over the corresponding

CR IP T

competing method. The symbol (≡) indicates that there is no statistical evidence that the method outperforms the compared method.

As it can be seen, out of 180 configurations the proposed semi-supervised method was significantly better in 146 configurations representing 81.33% of the configurations. The statistical significance tests on the FERET dataset were not similar to those obtained on the other

AN US

databases. This can be explained by the low number of splits and the low number of images per class for the FERET database. 5.5. Imbalanced datasets

The above results concerned balanced datasets. This means that all classes have the same

M

number of labeled images. In order to study the performance of all competing methods in the presence of imbalanced classes, we conducted the following two experiments.

ED

In the first experiment, we consider the Extended Yale dataset. This dataset has 28 classes where each class has about 62 images. We modify the partition of the labeled images in the following way. Let L denote the number of labeled images per class. Let U denote the number

PT

of unlabeled images par class. For the first six classes, the pair (L, U ) is set to (6, 26), for the next six classes this pair is set to (13, 19), for the next six classes the pair is set to (19,

CE

13), and for the remaining classes it is set to (26, 6). Table 5 illustrates the class-wise recall and precision obtained by all competing methods. The last row of the upper (lower) table

AC

illustrates the overall recall (precision). As it can be seen, for the majority of the competing methods (including our proposed method), the classification accuracy associated with a given class increases as the number of the labeled images in this class increases. For the first 12 classes, this phenomenon is not observed for the FME and KFME methods which are graph-based label propagation methods. The overall recall and precision of the proposed method are still better than the those of the competing methods. In the second experiment, we consider the COIL20 dataset. This dataset contains 20 classes where each class has 18 images. We modify the partition of the images in each class in the 17

ACCEPTED MANUSCRIPT

following way. Let T denote the number of testing images per class. For every class, we consider the triplet (L, U, T ) where L and U are respectively the numbers of labeled and unlabeled images per class. For the first five classes, this triplet is set to (4, 5, 9), for the next five classes, this is set to (5, 4, 9), for the next five classes this is set to (6, 3, 9), and for the remaining classes this triplet is set to (7, 2, 9). Table 6 shows the class-wise recall and precision obtained by all competing methods. The last row of the upper (lower) table depicts the overall recall (precision).

methods.

CR IP T

As it can be seen, the performance of the proposed method is better than that of the competing

Table 2: Average and standard deviation over ten random splits of the correct classification rate (%) using several methods. SDE, SDA, SPDA, TCA, and ESDE methods use the Nearest Neighbor (1-NN) classifier.

Method 1-NN SVM SDE SDA TCA SPDA LapRLS FME KFME ESDE STPTSSR

Partition 78.4 ± 66.1 ± 75.1 ± 77.3 ± 66.9 ± 55.6 ± 74.0 ± 73.7 ± 80.5 ± 78.5 ± 81.1 ±

1 1.6 5.3 1.7 1.9 2.2 2.8 1.8 1.8 1.7 1.6 2.0

USPS Partition 83.1 ± 74.9 ± 82.7 ± 83.4 ± 74.0 ± 76.9 ± 75.3 ± 76.5 ± 84.7 ± 83.1 ± 86.2 ±

1 6.1 6.8 6.6 6.5 4.4 3.2 5.4 5.7 5.3 6.1 6.2

COIL20 Partition 2 81.5 ± 5.4 80.8 ± 4.4 80.7 ± 4.7 77.1 ± 3.6 66.8 ± 4.1 56.2 ± 3.5 81.2 ± 4.2 74.7 ± 3.9 78.7 ± 6.0 81.5 ± 5.4 84.8 ± 4.1

Partition 73.5 ± 69.6 ± 72.2 ± 66.8 ± 68.0 ± 34.8 ± 76.0 ± 68.9 ± 72.0 ± 73.6 ± 76.8 ±

2 2.3 4.1 2.4 1.0 1.5 2.7 4.6 1.7 1.3 2.3 1.2

Partition 3 60.7 ± 10.2 57.3 ± 15.4 59.6 ± 11.3 56.2 ± 20.7 55.1 ± 20.1 54.4 ± 20.6 44.1 ± 19.1 47.7 ± 8.7 60.9 ± 13.3 60.6 ± 13.2 64.0 ± 15.4

Partition 74.9 ± 73.4 ± 76.5 ± 78.3 ± 77.0 ± 52.2 ± 75.9 ± 63.4 ± 71.8 ± 77.2 ± 78.5 ±

1 3.3 3.0 2.7 4.4 3.9 4.5 3.2 4.5 5.1 3.4 3.7

Partition 85.1 ± 81.1 ± 85.4 ± 85.4 ± 77.8 ± 83.5 ± 76.5 ± 77.6 ± 87.4 ± 85.1 ± 88.4 ±

3 1.8 3.5 1.2 0.8 1.5 1.5 4.3 1.6 1.8 1.8 1.6

Partition 56.9 ± 31.1 ± 55.7 ± 56.7 ± 49.7 ± 50.6 ± 42.9 ± 51.4 ± 56.9 ± 57.9 ± 62.2 ±

Partition 83.6 ± 84.8 ± 83.2 ± 80.1 ± 69.6 ± 67.1 ± 83.9 ± 77.1 ± 80.7 ± 83.6 ± 87.3 ±

3 3.7 4.5 3.4 3.2 3.9 5.3 3.7 3.2 4.6 3.7 3.7

Partition 1 66.4 ± 5.6 75.2 ± 11.1 81.3 ± 6.5 79.5 ± 9.9 84.7 ± 8.4 78.3 ± 9.8 76.1 ± 8.9 73.2 ± 5.8 80.4 ± 8.7 78.6 ± 10.1 88.2 ± 3.8

ED

PT

CE

AC Method 1-NN SVM SDE SDA TCA SPDA LapRLS FME KFME ESDE STPTSSR

FERET Partition 2 53.9 ± 12.4 45.8 ± 16.4 51.1 ± 13.6 42.3 ± 19.9 40.1 ± 18.9 29.8 ± 15.4 45.6 ± 17.5 43.4 ± 9.4 53.6 ± 12.8 53.9 ± 16.3 54.3 ± 16.0

UMIST Partition 2 88.4 ± 2.7 88.3 ± 3.1 88.9 ± 2.7 91.7 ± 2.4 89.5 ± 3.0 89.1 ± 2.1 86.0 ± 3.5 74.4 ± 3.5 83.8 ± 3.6 90.3 ± 2.0 91.5 ± 3.4

Partition 93.6 ± 93.9 ± 93.9 ± 96.0 ± 95.9 ± 96.0 ± 91.8 ± 80.6 ± 89.4 ± 94.4 ± 95.3 ±

3 2.4 2.3 2.0 1.3 1.4 1.3 1.2 3.1 2.2 2.1 1.5

Honda Partition 2 67.5 ± 3.0 35.9 ± 3.2 66.3 ± 2.9 67.8 ± 3.0 61.9 ± 3.4 67.2 ± 2.9 45.4 ± 2.6 57.0 ± 2.3 66.0 ± 3.0 68.3 ± 3.5 73.3 ± 2.7

Partition 73.7 ± 38.4 ± 72.7 ± 74.7 ± 69.4 ± 74.6 ± 53.4 ± 59.6 ± 69.9 ± 74.4 ± 78.9 ±

3 2.4 3.4 2.5 3.8 2.6 3.6 1.5 1.6 2.0 3.0 2.1

Ext. YALE Partition 2 75.2 ± 5.3 87.7 ± 5.8 86.5 ± 4.1 88.0 ± 5.8 91.7 ± 3.0 87.8 ± 6.0 80.9 ± 6.0 77.1 ± 5.6 90.3 ± 5.2 85.4 ± 7.8 93.0 ± 2.7

Partition 79.3 ± 92.2 ± 88.5 ± 91.4 ± 93.7 ± 91.3 ± 82.1 ± 79.6 ± 94.0 ± 89.2 ± 94.4 ±

3 3.2 2.2 3.2 3.3 2.1 3.4 5.3 4.6 3.5 5.8 1.9

AN US

Partition 1 38.5 ± 14.5 25.8 ± 13.7 35.9 ± 13.0 20.9 ± 12.2 31.9 ± 12.8 12.8 ± 9.6 34.0 ± 14.9 35.2 ± 11.5 40.0 ± 14.1 38.5 ± 14.6 36.3 ± 15.8

M

Method 1-NN SVM SDE [46], SDA [42] TCA [47] SPDA [48] LapRLS [49] FME [50] KFME [6] ESDE [45] STPTSSR

18

1 2.6 4.4 2.7 3.1 2.4 2.4 2.8 3.4 3.1 2.9 2.4

ACCEPTED MANUSCRIPT

Table 3: Precision and recall (%) obtained by all competing methods. The results correspond to the USPS, Honda, COIL20, and Extended Yale datasets adopting partition 3.

Honda COIL20 Ext.Yale

Precision Recall Precision Recall Precision Recall Precision Recall

1-NN 84.0 84.2 72.9 76.6 82.8 89.5 78.3 81.9

SVM 82.0 87.0 46.3 66.0 85.0 89.7 92.4 93.3

SDE 86.4 86.6 69.9 75.9 82.8 88.9 86.7 88.3

SDA 86.2 86.8 75.3 76.5 83.9 88.2 93.0 93.7

TCA 79.1 81.1 70.5 77.1 71.7 82.5 93.5 94.1

SPDA 84.7 85.0 72.7 74.2 66.7 68.7 93.0 93.7

LapRLS 74.7 77.6 71.3 72.1 87.8 88.0 88.2 89.1

FME 72.0 75.4 61.5 63.0 78.9 80.4 78.0 81.1

5.6. STPTSSR versus sparse classifiers

KFME 86.0 86.5 73.2 74.5 83.9 88.0 95.9 93.7

ESDE 84.9 85.4 73.5 77.5 82.8 89.0 82.1 84.0

STPTSSR 88.0 88.0 80.5 80.9 88.3 90.5 95.0 95.4

CR IP T

USPS

In this section, we compare our proposed STPTSSR with other sparse classifiers. More precisely, we evaluate the performance of the SRC classifier [25], the original TPTSSR classifier

of using TPTSSR classifier:

AN US

and an intuitive way to propagate active learning into TPTSSR. We have two different scenarios

• TPTSSR Scenario 1 (original TPTSSR): In this scenario, the training samples are limited to the labeled ones.

Thus, the unlabeled samples are not exploited and the

M

TPTSSR is used in its original fashion.

• TPTSSR Scenario 2: Scenario 2 can be considered as an intuitive active learning for

ED

the TPTSSR classifier. It consists of two main steps. In the first step, the class of all unlabeled samples is predicted using TPTSSR classifier. Notice that this step is also used

PT

by the proposed STPTSSR. In the second step, all unlabeled samples are considered to be correctly classified, and then, these samples form together with labeled samples a new set

CE

used as labeled samples in the resulting final classifier. It is worthy to note that TPTSSR Scenario 2 is a special case of the proposed STPTSSR in which we set η to zero in Eq.

AC

(8).

Table 7 depicts a comparison between the already mentioned classifiers. In this table, values

of Ml and M parameters of STPTSSR are respectively estimated using the global supervised and local unsupervised methods described in sections 3.1 and 3.2. The value of M used by the TPTSSR scenarios is also set using the local unsupervised method. Concerning the local scheme, the value of M for a testing sample depends on its nearest neighbors. For each testing sample, the value of M is computed as the mean of the values of M associated with the nearest neighbors in the training set. The number of nearest neighbors that 19

ACCEPTED MANUSCRIPT

Table 4: Statistical significance of the proposed method using Wilcoxon method with 95% confidence level (i.e., the p-value is set to 0.05). The symbol (X) means that the proposed method (STPTSSR) significantly improves over the corresponding competing method. The symbol (≡) indicates that there is no statistical evidence that the method outperforms the compared method.

FERET Partition 2 ≡ X ≡ ≡ X X X X ≡ ≡

Partition 3 ≡ X X ≡ ≡ X X X X ≡

Partition 1 ≡ X ≡ ≡ X X X X X ≡

Method 1-NN SVM SDE SDA TCA SPDA LapRLS FME KFME ESDE

Partition 1 X X X X X X X X ≡ X

USPS Partition 2 X X X X X X X X X X

Partition 3 X X X X X X X X ≡ X

Partition 1 X X X X X X X X X X

Method 1-NN SVM SDE SDA TCA SPDA LapRLS FME KFME ESDE

Partition 1 ≡ X X X X X ≡ X X ≡

COIL20 Partition 2 X X X X X X X X X X

Partition 3 X X X X X X X X X X

Partition 1 X X X X ≡ X X X X X

M

ED

PT

CE

AC

UMIST Partition 2 X X X ≡ X X X X X ≡

Partition 3 X X X ≡ ≡ ≡ X X X ≡

Honda Partition 2 X X X X X X X X X X

Partition 3 X X X X X X X X X X

CR IP T

Partition 1 ≡ X ≡ X ≡ X X ≡ ≡ ≡

AN US

Method 1-NN SVM SDE SDA TCA SPDA LapRLS FME KFME ESDE

20

Ext. YALE Partition 2 Partition 3 X X X X X X X X ≡ ≡ X X X X X X X ≡ X X

ACCEPTED MANUSCRIPT

Table 5: Class-wise recall and precision for the imbalanced Extended Yale dataset (see text for details). Recall

SVM

SDE

SDA

TCA

SPDA

LapRLS

FME

KFME

ESDE

STPTSSR

32.1 37.0 64.3 38.7 30.0 32.3 80.6 75.0 71.9 75.0 81.3 71.9 78.1 71.9 81.3 84.4 90.6 84.4 75.0 84.4 96.9 84.4 84.4 87.5 71.9 84.4 84.4 84.4 72.1

60.7 51.9 60.7 71.0 36.7 58.1 87.1 84.4 81.3 84.4 81.3 84.4 87.5 90.6 90.6 90.6 93.8 90.6 87.5 96.9 96.9 87.5 93.8 96.9 93.8 90.6 90.6 90.6 82.5

64.3 59.3 67.9 67.7 50.0 71.0 90.3 81.3 78.1 81.3 84.4 78.1 87.5 84.4 93.8 96.9 96.9 90.6 81.3 96.9 96.9 90.6 100.0 90.6 87.5 93.8 87.5 93.8 83.7

67.9 66.7 64.3 67.7 76.7 74.2 100.0 90.6 71.9 87.5 75.0 96.9 93.8 81.3 90.6 100.0 90.6 87.5 87.5 93.8 90.6 90.6 96.9 96.9 90.6 93.8 87.5 96.9 86.0

75.0 74.1 67.9 77.4 63.3 80.6 100.0 100.0 93.8 87.5 84.4 96.9 100.0 90.6 100.0 100.0 96.9 96.9 90.6 100.0 96.9 100.0 93.8 96.9 90.6 93.8 90.6 96.9 90.5

67.9 66.7 64.3 64.5 76.7 74.2 100.0 87.5 71.9 87.5 75.0 96.9 93.8 81.3 87.5 100.0 90.6 87.5 87.5 93.8 87.5 90.6 96.9 96.9 90.6 96.9 87.5 90.6 85.4

71.4 55.6 60.7 61.3 70.0 51.6 80.6 68.8 68.8 78.1 78.1 81.3 87.5 81.3 78.1 100.0 96.9 84.4 75.0 93.8 100.0 96.9 93.8 96.9 84.4 100.0 87.5 96.9 81.4

78.6 77.8 75.0 71.0 70.0 58.1 64.5 68.8 50.0 71.9 65.6 65.6 78.1 68.8 65.6 93.8 96.9 78.1 56.3 62.5 100.0 65.6 81.3 65.6 56.3 71.9 62.5 78.1 71.4

78.6 77.8 85.7 80.6 80.0 87.1 93.5 96.9 87.5 84.4 93.8 87.5 100.0 84.4 90.6 100.0 96.9 96.9 87.5 93.8 100.0 93.8 96.9 93.8 96.9 93.8 93.8 100.0 91.1

100.0 70.4 64.3 71.0 53.3 71.0 83.9 84.4 75.0 81.3 81.3 87.5 96.9 90.6 93.8 93.8 96.9 87.5 84.4 100.0 96.9 93.8 96.9 93.8 90.6 87.5 90.6 96.9 86.2

75.0 77.8 92.9 83.9 70.0 77.4 96.8 93.8 90.6 93.8 90.6 93.8 100.0 93.8 100.0 100.0 96.9 100.0 90.6 100.0 96.9 100.0 96.9 100.0 100.0 96.9 96.9 100.0 93.0

Precision

1-NN

SVM

SDE

SDA

Class 01 Class 02 Class 03 Class 04 Class 05 Class 06 Class 07 Class 08 Class 09 Class 10 Class 11 Class 12 Class 13 Class 14 Class 15 Class 16 Class 17 Class 18 Class 19 Class 20 Class 21 Class 22 Class 23 Class 24 Class 25 Class 26 Class 27 Class 28 Overall

75.0 90.9 94.7 100.0 25.7 100.0 80.6 82.8 85.2 92.3 92.9 50.0 96.2 79.3 83.9 90.0 85.3 57.4 85.7 77.1 100.0 54.0 96.4 60.9 56.1 48.2 73.0 62.8 77.7

85.0 100.0 100.0 100.0 44.0 78.3 93.1 79.4 96.3 81.8 100.0 79.4 96.6 78.4 96.7 96.7 68.2 78.4 80.0 86.1 100.0 75.7 93.8 75.6 69.8 67.4 90.6 78.4 84.6

90.0 100.0 100.0 100.0 55.6 100.0 80.0 100.0 92.6 83.3 96.4 67.6 100.0 93.1 93.8 93.9 86.1 74.4 83.9 88.6 100.0 69.0 96.8 63.6 71.1 57.7 84.8 76.9 85.7

100.0 100.0 100.0 80.0 62.2 92.0 83.8 87.5 95.8 87.5 96.0 79.5 93.8 86.7 73.7 91.4 93.5 84.8 87.5 76.9 96.6 82.9 91.2 77.5 74.4 81.1 84.8 96.7 87.1

AC

CE

AN US

M TCA

SPDA

LapRLS

FME

KFME

ESDE

STPTSSR

100.0 100.0 100.0 85.2 72.0 86.2 88.2 100.0 87.9 84.8 100.0 86.1 100.0 93.1 91.2 93.9 81.1 96.9 87.9 91.2 96.7 80.0 90.9 76.9 74.4 75.0 93.1 88.6 89.3

100.0 100.0 100.0 80.0 62.2 92.0 83.8 87.5 95.8 87.5 96.0 79.5 93.8 86.7 73.7 91.4 93.5 84.8 90.3 76.9 96.6 82.9 91.2 77.5 74.4 81.6 84.8 96.7 87.2

100.0 100.0 100.0 95.0 63.6 100.0 89.3 80.8 100.0 96.2 89.3 86.7 82.4 76.5 83.3 91.4 82.9 79.4 96.0 81.1 85.3 72.1 76.9 63.3 60.0 76.2 71.8 72.1 84.0

78.6 72.4 38.5 65.6 38.8 85.7 80.0 91.7 84.2 60.5 75.0 80.8 92.6 72.4 84.0 66.7 76.9 65.8 81.0 95.2 58.2 87.0 74.3 44.7 75.0 92.0 90.9 77.4 74.5

100.0 100.0 100.0 96.0 68.6 66.7 96.6 96.9 100.0 100.0 96.8 77.8 82.1 92.9 100.0 94.1 73.2 93.3 100.0 96.8 93.9 73.2 91.2 78.4 85.7 96.8 100.0 94.1 90.9

90.9 100.0 91.7 85.0 42.1 77.3 88.9 82.8 71.0 83.9 95.8 54.8 96.4 67.6 88.5 80.6 68.3 86.2 75.0 70.3 90.3 38.4 90.3 63.8 55.8 64.3 78.8 75.7 76.9

90.5 100.0 100.0 100.0 71.4 100.0 93.8 93.8 93.3 93.3 96.6 84.8 97.0 93.3 96.9 88.9 81.1 80.0 92.9 97.0 96.8 76.2 93.8 78.9 94.1 85.7 96.9 86.1 91.2

ED

(6 lab.) (6 lab.) (6 lab.) (6 lab.) (6 lab.) (6 lab.) (13 lab.) (13 lab.) (13 lab.) (13 lab.) (13 lab.) (13 lab.) (19 lab.) (19 lab.) (19 lab.) (19 lab.) (19 lab.) (19 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) precision

PT

(6 lab.) (6 lab.) (6 lab.) (6 lab.) (6 lab.) (6 lab.) (13 lab.) (13 lab.) (13 lab.) (13 lab.) (13 lab.) (13 lab.) (19 lab.) (19 lab.) (19 lab.) (19 lab.) (19 lab.) (19 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) (26 lab.) recall

CR IP T

1-NN

Class 01 Class 02 Class 03 Class 04 Class 05 Class 06 Class 07 Class 08 Class 09 Class 10 Class 11 Class 12 Class 13 Class 14 Class 15 Class 16 Class 17 Class 18 Class 19 Class 20 Class 21 Class 22 Class 23 Class 24 Class 25 Class 26 Class 27 Class 28 Overall

21

ACCEPTED MANUSCRIPT

Table 6: Class-wise recall and precision for the imbalanced COIL20 dataset (see text for details).

AC

SVM 75.0 100.0 87.5 90.0 100.0 100.0 60.0 69.2 100.0 100.0 87.5 100.0 52.9 100.0 100.0 100.0 100.0 100.0 62.5 100.0 89.2

SDE 45.0 100.0 100.0 100.0 100.0 50.0 80.0 75.0 100.0 100.0 100.0 100.0 75.0 90.0 100.0 100.0 81.8 88.9 41.7 100.0 86.4

LapRLS 100.0 77.8 77.8 100.0 66.7 44.4 100.0 100.0 33.3 100.0 77.8 100.0 100.0 88.9 100.0 100.0 100.0 88.9 22.2 100.0 83.9

FME 55.6 44.4 66.7 55.6 77.8 88.9 66.7 100.0 11.1 100.0 66.7 100.0 100.0 100.0 100.0 100.0 100.0 77.8 55.6 100.0 78.3

KFME 77.8 55.6 77.8 77.8 77.8 77.8 77.8 100.0 33.3 100.0 77.8 100.0 100.0 100.0 100.0 100.0 100.0 77.8 55.6 100.0 83.3

ESDE 100.0 55.6 33.3 100.0 55.6 11.1 88.9 100.0 55.6 100.0 77.8 100.0 100.0 100.0 100.0 100.0 100.0 100.0 55.6 100.0 81.7

STPTSSR 100.0 77.8 77.8 100.0 66.7 66.7 100.0 100.0 55.6 100.0 77.8 100.0 100.0 100.0 100.0 100.0 100.0 88.9 66.7 100.0 88.9

LapRLS 47.4 87.5 70.0 100.0 46.2 80.0 90.0 69.2 100.0 100.0 100.0 100.0 90.0 88.9 100.0 100.0 100.0 100.0 100.0 100.0 88.5

FME 62.5 100.0 100.0 100.0 50.0 66.7 66.7 52.9 50.0 100.0 100.0 90.0 100.0 64.3 90.0 69.2 100.0 100.0 83.3 90.0 81.8

KFME 58.3 100.0 100.0 100.0 53.8 77.8 70.0 60.0 100.0 100.0 100.0 100.0 100.0 69.2 90.0 81.8 100.0 100.0 100.0 90.0 87.6

ESDE 45.0 100.0 75.0 100.0 100.0 100.0 57.1 69.2 100.0 100.0 100.0 100.0 90.0 100.0 100.0 100.0 100.0 100.0 33.3 100.0 88.5

STPTSSR 64.3 87.5 87.5 100.0 100.0 85.7 75.0 69.2 100.0 100.0 100.0 90.0 81.8 100.0 100.0 100.0 100.0 100.0 75.0 100.0 90.8

AN US

CR IP T

Recall TCA SPDA 100.0 100.0 44.4 22.2 11.1 0.0 77.8 88.9 55.6 0.0 11.1 44.4 55.6 55.6 100.0 100.0 22.2 11.1 44.4 88.9 77.8 77.8 100.0 100.0 77.8 55.6 88.9 55.6 100.0 100.0 100.0 100.0 100.0 100.0 88.9 55.6 55.6 11.1 100.0 44.4 70.6 60.6

SDA 100.0 55.6 77.8 100.0 66.7 66.7 100.0 100.0 22.2 88.9 77.8 100.0 100.0 88.9 100.0 100.0 100.0 88.9 33.3 100.0 83.3

M

SDE 100.0 66.7 22.2 100.0 55.6 22.2 88.9 100.0 55.6 100.0 88.9 100.0 100.0 100.0 100.0 100.0 100.0 88.9 55.6 100.0 82.2

SDA 50.0 83.3 58.3 100.0 60.0 85.7 81.8 69.2 100.0 100.0 100.0 100.0 90.0 100.0 100.0 100.0 100.0 100.0 50.0 100.0 86.4

Precision TCA SPDA 37.5 52.9 80.0 33.3 100.0 0.0 100.0 80.0 83.3 0.0 100.0 44.4 50.0 83.3 60.0 64.3 100.0 50.0 100.0 34.8 77.8 63.6 64.3 100.0 100.0 50.0 100.0 83.3 90.0 90.0 64.3 75.0 100.0 100.0 100.0 100.0 29.4 14.3 100.0 44.4 81.8 58.2

ED

1-NN 45.0 100.0 75.0 100.0 100.0 100.0 57.1 69.2 100.0 100.0 100.0 100.0 90.0 100.0 100.0 100.0 100.0 100.0 33.3 100.0 88.5

SVM 100.0 55.6 77.8 100.0 55.6 33.3 100.0 100.0 55.6 100.0 77.8 100.0 100.0 100.0 100.0 100.0 100.0 88.9 55.6 100.0 85.0

CE

Class \ Method Class 01 (4 lab.) Class 02 (4 lab.) Class 03 (4 lab.) Class 04 (4 lab.) Class 05 (4 lab.) Class 06 (5 lab.) Class 07 (5 lab.) Class 08 (5 lab.) Class 09 (5 lab.) Class 10 (5 lab.) Class 11 (6 lab.) Class 12 (6 lab.) Class 13 (6 lab.) Class 14 (6 lab.) Class 15 (6 lab.) Class 16 (7 lab.) Class 17 (7 lab.) Class 18 (7 lab.) Class 19 (7 lab.) Class 20 (7 lab.) Overall precision

1-NN 100.0 55.6 33.3 100.0 55.6 11.1 88.9 100.0 55.6 100.0 77.8 100.0 100.0 100.0 100.0 100.0 100.0 100.0 55.6 100.0 81.7

PT

Class \ Method Class 01 (4 lab.) Class 02 (4 lab.) Class 03 (4 lab.) Class 04 (4 lab.) Class 05 (4 lab.) Class 06 (5 lab.) Class 07 (5 lab.) Class 08 (5 lab.) Class 09 (5 lab.) Class 10 (5 lab.) Class 11 (6 lab.) Class 12 (6 lab.) Class 13 (6 lab.) Class 14 (6 lab.) Class 15 (6 lab.) Class 16 (7 lab.) Class 17 (7 lab.) Class 18 (7 lab.) Class 19 (7 lab.) Class 20 (7 lab.) Overall recall

22


are considered in our experiments was set to 20% of the training samples. All other parameters of TPTSSR and STPTSSR have the same values as in the previous section.

Table 7: Recognition rate and standard deviation in (%), averaged over ten random splits using SRC, TPTSSR (in two scenarios) and our proposed classifier. Ml and M parameters of STPTSSR are calculated using the global supervised and local unsupervised optimizer methods respectively. Regarding TPTSSR, M is calculated using the local unsupervised method.

SRC FERET UMIST COIL20 Ext. YALE USPS Honda

± ± ± ± ± ±

50.33 90.24 81.61 91.31 82.05 69.26

16.92 2.88 5.9 4.55 1.89 3.01

34.18 76.92 60.23 84.50 80.00 60.99

± ± ± ± ± ±

16.14 3.13 6.63 5.43 1.69 2.20

33.42 77.83 74.94 84.08 82.29 61.25

± ± ± ± ± ±

12.93 3.63 5.39 5.38 2.48 3.33

35.22 78.20 76.78 86.38 81.60 62.42

± ± ± ± ± ±

15.45 3.80 0.87 5.01 2.07 2.13

Partition 2: 25% Labeled, 25% Unlabeled, 50% Test TPTSSR scenario1 TPTSSR scenario2 STPTSSR 17.92 3.33 4.27 3.39 1.46 2.31

50.78 90.24 83.67 91.20 85.89 72.49

± ± ± ± ± ±

AN US

± ± ± ± ± ±

32.35 77.08 73.89 86.14 74.18 57.61

18.84 3.28 4.44 3.37 1.79 2.87

49.48 91.25 82.22 87.74 86.95 72.56

M

FERET UMIST COIL20 Ext. YALE USPS Honda

CR IP T

Partition 1: 15% Labeled, 35% Unlabeled, 50% Test SRC TPTSSR scenario1 TPTSSR scenario2 STPTSSR

± ± ± ± ± ±

14.16 3.50 3.41 2.58 1.77 2.47

52.18 91.59 84.17 91.57 86.95 73.99

± ± ± ± ± ±

17.46 3.31 3.97 2.85 1.54 2.59

± ± ± ± ± ±

15.51 1.86 3.42 2.13 1.48 2.02

59.75 94.51 86.72 93.22 85.89 78.62

PT

60.72 94.92 83.42 93.51 85.09 73.13

± ± ± ± ± ±

19.15 2.08 3.38 2.00 1.61 2.87

59.40 94.58 84.83 89.41 88.82 79.70

± ± ± ± ± ±

13.72 1.71 2.19 2.07 1.76 2.12

62.42 94.81 86.44 93.29 88.91 79.59

± ± ± ± ± ±

16.92 1.96 3.71 1.94 1.54 2.21

CE

FERET UMIST COIL20 Ext. YALE USPS Honda

ED

Partition 3: 35% Labeled, 15% Unlabeled, 50% Test SRC TPTSSR scenario1 TPTSSR scenario2 STPTSSR

The observations that can be drawn from these results are as follows:


• In most cases, the proposed STPTSSR outperforms the other sparse classifiers.

• The results show the efficacy of the proposed active learning in STPTSSR for exploiting the unlabeled data. Indeed, the latter surpasses the intuitive way in which active learning is used in scenario 2 in all cases except two.

• The utility of unlabeled samples is also proven. From Table 7, we can appreciate the superiority of STPTSSR compared to TPTSSR scenario 1, which exploits only the limited labeled samples. This emphasizes the importance of semi-supervised learning.


• In a significant number of cases, the recognition accuracy of scenario 1 is better than that of scenario 2. This demonstrates the challenge of applying the active learning and semi-supervised paradigms to the TPTSSR framework. The inadequacy of scenario 2 (which uses more training data) can result from misclassified samples, which can directly degrade the performance of the classifier. Such deterioration does not occur with our proposed STPTSSR, thanks to the introduced merging rule for the deviation, which maintains the balance between the information transmitted by the labeled samples and the predicted ones.
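To make this merging rule concrete, the following minimal Python sketch gives one plausible reading of the STPTSSR deviation: the TPTSSR-style ℓ2 coding deviation (Eq. (2)) is evaluated once over the labeled dictionary and once over the full dictionary of labeled plus label-predicted samples, and the two scores are combined. The function names, the equal weighting, the value of mu, and the assumption that the labeled samples occupy the first columns of the full dictionary are illustrative assumptions, not the paper's exact formulation of Eq. (8).

import numpy as np

def tptssr_deviation(y, X, mu=1e-2):
    """TPTSSR-style deviation: ridge-code y over the columns of X, then score
    each column by how poorly y is approximated by that column's weighted
    contribution alone (smaller deviation = more useful sample)."""
    n = X.shape[1]
    beta = np.linalg.solve(X.T @ X + mu * np.eye(n), X.T @ y)
    return np.array([np.linalg.norm(y - beta[i] * X[:, i]) ** 2 for i in range(n)])

def merged_deviation(y, X_lab, X_full, mu=1e-2):
    """Sketch of a merged deviation in the spirit of Eq. (8): the deviation of
    Eq. (2) is computed twice, once over the labeled samples and once over all
    (labeled + label-predicted) samples, and the two sources are balanced."""
    e_lab = tptssr_deviation(y, X_lab, mu)    # information from labeled samples
    e_full = tptssr_deviation(y, X_full, mu)  # information including predictions
    n_lab = X_lab.shape[1]                    # labeled columns assumed to come first
    return e_lab + e_full[:n_lab]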

5.7. Computational time evaluation

In this section, we compare the processing time of the following classifiers: SRC, TPTSSR and STPTSSR. To this end, we measure the CPU time needed by the processor while running the classification of the testing set presented in the previous section (only the first split is considered for this study, and the execution times reported for TPTSSR are those of scenario 1). Table 8 reports the CPU time needed by STPTSSR for predicting the labels of the unlabeled samples. This time, although short compared to that of the testing phase, is excluded when we compare the CPU time of the different classifiers, since in practice, predicting the labels of the unlabeled data is done offline, before the classifier is used in the testing phase.
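This measurement protocol can be sketched as follows (a minimal outline; the classifier object and its method names are hypothetical placeholders, and only process CPU time is recorded):

import time

def cpu_time(fn, *args):
    """Return a function's result together with the process CPU time it used."""
    t0 = time.process_time()
    out = fn(*args)
    return out, time.process_time() - t0

# Offline step, excluded from the reported comparison (cf. Table 8):
# _, t_offline = cpu_time(stptssr.predict_unlabeled, X_unlabeled)
# Reported measurement: classification of the testing set only (cf. Table 9):
# _, t_test = cpu_time(stptssr.classify, X_test)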

Table 8: Processing time (in seconds) of predicting unlabeled samples for STPTSSR.

Dataset     Partition 1   Partition 2   Partition 3
FERET       0.95          1.60          1.65
UMIST       0.12          0.12          0.11
COIL20      0.04          0.05          0.04
Ext. Yale   1.38          1.91          1.88
USPS        0.44          0.52          0.48
Honda       0.45          0.62          0.48


Table 9 shows the processing time of the aforementioned classifiers when applied to the testing part of the six datasets. The experiments were run using MATLAB on a computer with an Intel Core i7-6900K CPU (8 cores, 3.6 GHz) and 128 GB of RAM. The size of each testing set is equal to half of the whole dataset. From this table, one can see that TPTSSR is the fastest classifier and SRC is the slowest. In general, the ratio between the SRC time and the STPTSSR time is more than 6; that is, STPTSSR is more than six times faster than SRC. It can also be seen that the ratio between the processing time of STPTSSR and that of TPTSSR decreases as the number of labeled samples increases. This could be due to the fact that, based on Eq. (8), STPTSSR uses all training samples (labeled and unlabeled) when measuring the contribution of each sample, whereas this contribution is limited to the labeled samples in TPTSSR. Thus, in order to compare the processing time of the classifiers when a similar number of labeled samples is used, Table 10 reports the time taken by SRC, TPTSSR and STPTSSR when all training samples are labeled (50% of the whole data). In this table, TPTSSR is generally two times faster than STPTSSR. It is not surprising that STPTSSR takes twice the time needed by TPTSSR, since the formula that defines the deviation in TPTSSR (Eq. (2)) is applied twice in the deviation of STPTSSR (Eq. (8)). Due to the independence of the two codings used in Eq. (8), the CPU time of the proposed STPTSSR could be reduced by a parallel implementation.
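Since the two codings are independent, they can be dispatched to two workers; a minimal sketch of such a parallel implementation is given below (the function ridge_code and the dictionary names are illustrative; NumPy's LAPACK routines release the GIL, so the two solves can genuinely overlap in threads):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def ridge_code(y, X, mu=1e-2):
    """Closed-form l2 coding of y over the columns of X."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + mu * np.eye(n), X.T @ y)

def two_codings_parallel(y, X_lab, X_full, mu=1e-2):
    """Run the two independent codings of Eq. (8) concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_lab = pool.submit(ridge_code, y, X_lab, mu)
        f_full = pool.submit(ridge_code, y, X_full, mu)
        return f_lab.result(), f_full.result()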

Table 9: Processing time (in seconds) taken when applying the SRC, TPTSSR and STPTSSR classifiers for different partitions of the data.

            Partition 1                 Partition 2                 Partition 3
Dataset     SRC     STPTSSR  TPTSSR     SRC     STPTSSR  TPTSSR     SRC     STPTSSR  TPTSSR
FERET       18.77   6.89     0.88       27.13   8.42     2.28       36.32   11.17    4.78
UMIST       5.55    0.74     0.15       7.74    0.83     0.26       8.65    0.95     0.39
COIL20      2.66    0.26     0.07       3.62    0.34     0.12       3.92    0.34     0.14
Ext. Yale   33.02   11.78    2.06       42.00   13.42    3.74       50.41   16.10    5.96
USPS        16.56   3.34     0.65       19.96   3.78     1.10       23.23   4.27     1.66
Honda       17.64   3.42     0.63       21.34   4.05     1.25       23.48   4.54     1.61


Table 10: Processing time (in seconds) taken when applying the SRC, TPTSSR and STPTSSR classifiers while 50% of the data is labeled and 50% is used for testing.

Dataset     SRC     STPTSSR  TPTSSR
FERET       39.77   11.37    5.70
UMIST       10.14   1.10     0.48
COIL20      5.13    0.45     0.20
Ext. Yale   60.33   19.59    9.22
USPS        28.11   5.09     2.55
Honda       29.96   5.68     2.75

6. Conclusion

In this paper, we proposed a new Semi-supervised Two Phase Test Sample Sparse Representation (STPTSSR) classifier. Upgrading the original TPTSSR to an active variant is a challenging task. In addition to the sparsity property, our proposed classifier is active, in the sense that it makes predictions for unlabeled data in order to use them as labeled samples. Unlike the classic SRC, which applies ℓ1 minimization to all training samples, our STPTSSR uses ℓ2 minimization (which is much faster than ℓ1 minimization) and applies it to a carefully selected part of the training data. STPTSSR can also be seen as a semi-supervised classifier that, unlike the supervised TPTSSR, can benefit from unlabeled data samples. Experiments performed on six different datasets prove the superiority of our classifier over nine state-of-the-art classification methods. Moreover, the experiments prove that the appropriate use of active learning contributes to the performance of STPTSSR. Furthermore, STPTSSR is more than six times faster than SRC, and its speed is roughly half that of TPTSSR.
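As a brief illustration of where this speed gap comes from, the following sketch (synthetic data; the dimensions, regularization values and iteration count are illustrative assumptions, not the paper's settings) contrasts the single closed-form solve behind ℓ2 coding with the iterative soft-thresholding loop (ISTA) that an ℓ1 coder typically runs:

import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 400))   # 400 training samples of dimension 1024
y = rng.standard_normal(1024)
mu, lam = 1e-2, 1e-2

# l2 coding: a single closed-form linear solve, no iterations.
t0 = time.perf_counter()
beta_l2 = np.linalg.solve(X.T @ X + mu * np.eye(X.shape[1]), X.T @ y)
print(f"l2 (ridge) coding: {time.perf_counter() - t0:.4f} s")

# l1 coding via ISTA: hundreds of iterations, each with two matrix-vector products.
L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the least-squares gradient
beta_l1 = np.zeros(X.shape[1])
t0 = time.perf_counter()
for _ in range(300):
    g = beta_l1 - (X.T @ (X @ beta_l1 - y)) / L
    beta_l1 = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
print(f"l1 (ISTA) coding:  {time.perf_counter() - t0:.4f} s")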

References

[1] R. A. R. Ashfaq, X.-Z. Wang, J. Z. Huang, H. Abbas, and Y.-L. He, “Fuzziness based semi-supervised learning approach for intrusion detection system,” Information Sciences, vol. 378, pp. 484–497, 2017.

[2] F. Dornaika and Y. El Traboulsi, “Matrix exponential based semi-supervised discriminant embedding for image classification,” Pattern Recognition, vol. 61, pp. 92–103, 2017.

[3] A. Iwayemi and C. Zhou, “SARAA: Semi-supervised learning for automated residential appliance annotation,” IEEE Transactions on Smart Grid, vol. 8, no. 2, pp. 779–786, 2017.

[4] F. Dornaika and Y. El Traboulsi, “Learning flexible graph-based semi-supervised embedding,” IEEE Transactions on Cybernetics, vol. 46, no. 1, pp. 206–218, 2016.

[5] D. Dai and L. Van Gool, “Unsupervised high-level feature learning by ensemble projection for semi-supervised image classification and image clustering,” arXiv preprint arXiv:1602.00955, 2016.

[6] Y. El Traboulsi, F. Dornaika, and A. Assoum, “Kernel flexible manifold embedding for pattern classification,” Neurocomputing, vol. 167, pp. 517–527, 2015.

[7] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang, “Multi-modal curriculum learning for semi-supervised image classification,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3249–3260, 2016.

[8] X. Liu, T. Guo, L. He, and X. Yang, “A low-rank approximation-based transductive support tensor machine for semisupervised classification,” IEEE Transactions on Image Processing, vol. 24, no. 6, pp. 1825–1838, 2015.

[9] D. Dai and L. Van Gool, “Ensemble projection for semi-supervised image classification,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2072–2079, 2013.

[10] T. Drugman, J. Pylkkönen, and R. Kneser, “Active and semi-supervised learning in ASR: Benefits on the acoustic and language models,” in INTERSPEECH, pp. 2318–2322, 2016.


[11] M. M. Al Rahhal, Y. Bazi, H. AlHichri, N. Alajlan, F. Melgani, and R. R. Yager, “Deep learning approach for active classification of electrocardiogram signals,” Information Sciences, vol. 345, pp. 340–354, 2016.

[12] P. Jain and A. Kapoor, “Active learning for large multi-class problems,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 762–769, 2009.

[13] A. J. Joshi, F. Porikli, and N. Papanikolopoulos, “Multi-class active learning for image classification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2372–2379, 2009.

[14] C. Gong, D. Tao, W. Liu, L. Liu, and J. Yang, “Label propagation via teaching-to-learn and learning-to-teach,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 1452–1465, June 2017.

[15] H. T. Nguyen and A. Smeulders, “Active learning using pre-clustering,” in Proceedings of the Twenty-First International Conference on Machine Learning, p. 79, ACM, 2004.

[16] S. Dasgupta and D. Hsu, “Hierarchical sampling for active learning,” in Proceedings of the 25th International Conference on Machine Learning, pp. 208–215, ACM, 2008.

[17] S.-J. Huang, R. Jin, and Z.-H. Zhou, “Active learning by querying informative and representative examples,” in Advances in Neural Information Processing Systems, pp. 892–900, 2010.

[18] J. Ugander and L. Backstrom, “Balanced label propagation for partitioning massive graphs,” in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 507–516, ACM, 2013.

[19] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3166–3173, 2013.

[20] K. Amit Kumar and C. De Vleeschouwer, “Discriminative label propagation for multi-object tracking with sporadic appearance features,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2000–2007, 2013.

[21] C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453–465, 2014.

[22] Y. Zhang, G. Zhou, J. Jin, Y. Zhang, X. Wang, and A. Cichocki, “Sparse Bayesian multiway canonical correlation analysis for EEG pattern recognition,” Neurocomputing, vol. 225, pp. 103–110, 2017.


[23] X. Chang, Z. Ma, M. Lin, Y. Yang, and A. Hauptmann, “Feature interaction augmented sparse learning for fast Kinect motion detection,” IEEE Transactions on Image Processing, 2017.

[24] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.

[25] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.

[26] S. Jafarpour, W. Xu, B. Hassibi, and R. Calderbank, “Efficient and robust compressed sensing using optimized expander graphs,” IEEE Transactions on Information Theory, vol. 55, no. 9, pp. 4299–4308, 2009.

[27] J. Wang, C. Lu, M. Wang, P. Li, S. Yan, and X. Hu, “Robust face recognition via adaptive sparse representation,” IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2368–2378, 2014.

[28] M. Yang, L. Zhang, J. Yang, and D. Zhang, “Robust sparse coding for face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 625–632, 2011.

[29] R. He, W.-S. Zheng, B.-G. Hu, and X.-W. Kong, “A regularized correntropy framework for robust pattern recognition,” Neural Computation, vol. 23, no. 8, pp. 2074–2100, 2011.

[30] R. He, W.-S. Zheng, and B.-G. Hu, “Maximum correntropy criterion for robust face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1561–1576, 2011.

[31] C.-G. Li, J. Guo, and H.-G. Zhang, “Local sparse representation based classification,” in 20th International Conference on Pattern Recognition (ICPR), pp. 649–652, 2010.

[32] F. Dornaika, Y. El Traboulsi, and A. Assoum, “Adaptive two phase sparse representation classifier for face recognition,” in International Conference on Advanced Concepts for Intelligent Vision Systems, pp. 182–191, Springer, 2013.

[33] F. Dornaika, Y. El Traboulsi, C. Hernandez, and A. Assoum, “Self-optimized two phase test sample sparse representation method for image classification,” in 2nd International Conference on Advances in Biomedical Engineering (ICABME), pp. 163–166, 2013.

[34] R. He, W.-S. Zheng, B.-G. Hu, and X.-W. Kong, “Two-stage nonnegative sparse representation for large-scale face recognition,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 35–46, 2013.


[35] Y. Xu, D. Zhang, J. Yang, and J.-Y. Yang, “A two-phase test sample sparse representation method for use with face recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 9, pp. 1255–1262, 2011.

[36] C. Sousa, S. Rezende, and G. Batista, “Influence of graph construction on semi-supervised learning,” in European Conference on Machine Learning, pp. 160–175, 2013.

[37] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” in Proceedings of the 20th International Conference on Machine Learning (ICML), vol. 3, pp. 912–919, 2003.

[38] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.

[39] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.

[40] W. Liu and S.-F. Chang, “Robust multi-class transductive learning with graphs,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 381–388, 2009.

[41] F. Nie, D. Xu, I. W.-H. Tsang, and C. Zhang, “Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction,” IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1921–1932, 2010.

[42] D. Cai, X. He, and J. Han, “Semi-supervised discriminant analysis,” in IEEE International Conference on Computer Vision, 2007.

[43] H. Huang, J. Liu, and Y. Pan, “Semi-supervised marginal Fisher analysis for hyperspectral image classification,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 1, p. 3, 2012.

[44] L. Qiao, S. Chen, and X. Tan, “Sparsity preserving discriminant analysis for single training image face recognition,” Pattern Recognition Letters, vol. 31, no. 5, pp. 422–429, 2010.

[45] F. Dornaika and Y. El Traboulsi, “Matrix exponential based semi-supervised discriminant embedding,” Pattern Recognition, vol. 61, pp. 92–103, 2017.

[46] H. Huang, J. Liu, and Y. Pan, “Semi-supervised marginal Fisher analysis for hyperspectral image classification,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 3, pp. 377–382, 2012.

[47] W. Liu, D. Tao, and J. Liu, “Transductive component analysis,” in Eighth IEEE International Conference on Data Mining (ICDM), pp. 433–442, 2008.


[48] L. Qiao, S. Chen, and X. Tan, “Sparsity preserving discriminant analysis for single training image face recognition,” Pattern Recognition Letters, vol. 31, no. 5, pp. 422–429, 2010.

[49] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.

[50] F. Nie, D. Xu, I. W.-H. Tsang, and C. Zhang, “Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction,” IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1921–1932, 2010.

[51] http://www.statisticssolutions.com/manova-analysis-paired-sample-ttest/.

[52] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.

[53] S. Garcia and F. Herrera, “An extension on ‘Statistical comparisons of classifiers over multiple data sets’ for all pairwise comparisons,” Journal of Machine Learning Research, vol. 9, pp. 2677–2694, 2008.
