A novel multiple kernel-based dictionary learning for distributive and collective sparse representation based classifiers
Tahereh Zare, Mohammad Taghi Sadeghi
Electrical Engineering Department, Yazd University, Yazd, Iran
Abstract
In recent years, sparse representation theory has attracted the attention of many researchers in the signal processing, pattern recognition and computer vision communities. The choice of the dictionary matrix plays a key role in sparse representation based methods. It can be a pre-defined dictionary or can be learned via an optimization procedure. Furthermore, the dictionary learning process can be extended to a non-linear setting using an appropriate kernel function in order to handle non-linearly structured data. In this framework, the choice of kernel function is also a key step. Multiple kernel learning is an appealing strategy for dealing with this problem. In this paper, within the framework of kernel sparse representation based classification, we propose an iterative algorithm for jointly learning the dictionary matrix and the multiple kernel function. The weighted sum of a set of basis functions is considered as the multiple kernel function, where the weights are optimized such that the reconstruction error of the sparse coded data is minimized. In our proposed algorithm, the sparse coding, dictionary learning and multiple kernel learning processes are performed in three steps. The optimization process is performed considering two different structures, namely distributive and collective, for the sparse representation based classifier. Our experimental results show that the proposed algorithm outperforms the other existing sparse coding based approaches. These results also confirm that the collective setting leads to better results when the number of training examples is limited. On the other hand, the distributive setting is more appropriate when there are enough training samples.
Keywords: Sparse representation classification; Dictionary learning; Multiple kernel learning; K-SVD algorithm; Kernel K-SVD algorithm
1. Introduction

In recent years, sparse representation based algorithms have become increasingly important in machine learning applications such as image denoising, object recognition and classification. The Sparse Representation based Classifier (SRC) is among these fruitful algorithms [1]. The success of sparse representation based classification stems from the fact that high-dimensional signals and images are naturally sparse in an appropriate feature space. Such signals can therefore be represented by a few samples (atoms) of a proper set of exemplars (a dictionary). Although the SRC was originally introduced for the classification task, its ability to represent an input signal with a few training samples has led to its use in different applications such as image retrieval [2], click prediction [3] and human pose recovery [4]. In the last decade, various studies have been conducted to improve the performance of this classifier. Most of this research has focused on two aspects of the SRC: one is optimizing the sparse coding function and the other is improving the dictionary matrix. In [4], the authors incorporated a local similarity preserving term into the objective function of sparse coding which groups similar silhouettes to alleviate the instability of sparse codes. Wang et al. [5] demonstrated that similar inputs are represented by similar atoms. They therefore introduced Locality-constrained Linear Coding (LLC), which implements locality by projecting each descriptor onto its local coordinate system. In [6], the authors presented a method which makes use of the Histogram Intersection Kernel (HIK) technique within the LLC framework. In that method, the dictionary is learned in the feature space induced by the HIK and the local sparse codes of the input histograms are computed there.

The dictionary matrix plays a key role in the sparse representation of signals. The feature vectors derived from a training data set can be directly used in order to determine the dictionary elements. Alternatively, a learning process can be applied to form a learned dictionary. It has been shown that applying a proper learning process significantly improves the results [7–9]. There are several known algorithms for dictionary learning. Among them, the K-SVD algorithm is widely used due to its effectiveness in practical applications [10]. The main goal of the K-SVD algorithm is to
find an overcomplete dictionary matrix which contains K atoms such that the reconstruction error of the resulting sparse representation is minimized. The algorithm uses an iterative two-step procedure in which the sparse representation of the training data and the associated dictionary elements are alternately updated. In the K-SVD algorithm, the feature vectors extracted from a training data set are linearly combined in order to design the dictionary. However, because of the non-linear structure of some real world data, such a linear combination is not always efficient. Non-linear transformation using kernel methods is a well-known technique widely used for generalizing linear methods. In the non-linear version of the K-SVD algorithm, the Kernel K-SVD (KK-SVD) algorithm, the data points are implicitly mapped into a new high dimensional feature space. The sparse coding and dictionary learning steps are then performed in this new feature space. The authors in [11] argue that iterative dictionary learning algorithms such as the K-SVD are computationally expensive and proposed a new method called Orthogonal Projective Sparse Coding (OPSC). This algorithm integrates manifold learning and sparse coding techniques. It has been shown that kernel based sparse representation algorithms can provide better results compared to their linear counterparts [12]. However, the type of kernel function and the value(s) of the kernel parameter(s) have to be selected appropriately. A typical solution to the problem of kernel selection is to apply the cross-validation technique in order to find the best kernel function among a set of candidates. However, this procedure is time-consuming. Moreover, there is no guarantee that the best possible solution is found. The other solution to this problem is to use an appropriate combination of different kernel functions [13]. So far, only a few works have used multiple kernels in the sparse representation field [14–17]. In [14], the authors proposed a multiple kernel Sparse Representation based Classification (SRC) algorithm in which two basis kernel matrices are combined by the weighted sum rule. The resulting matrix is considered as the dictionary matrix. For each test sample, the sparse coefficients and the kernel weights are iteratively updated. This method is obviously not suitable for real-time applications. The other problem is that in their proposed method, the required two basis functions are selected by applying the cross-validation procedure to a set of Gaussian and polynomial kernels. This process seems to contradict the multiple kernel learning concept. In [15] also, the dictionary matrix is assumed to be the weighted sum of different kernel matrices. However, in contrast to [14], the weights are determined in a training phase via an iterative process. In [16], the same structure has been considered for the multiple kernel function, and the kernel weights and dictionary elements are learned using a three-step algorithm. In these three steps, the kernel weights are optimized based on graph embedding principles, the sparse coding is performed using a simple level-wise pursuit scheme and the dictionary elements are learned using multiple levels of 1-D subspace clustering successively. Different image descriptors such as color, shape and texture, along with a kernel mapping function, have been used for generating the basis kernels.
The authors in [17] proposed a multiple instance learning algorithm using a multiple kernel dictionary learning framework in a weakly supervised condition where the labels are in the form of positive and negative bags. Their proposed algorithm is composed of four steps in which the kernel weights, the sparse codes of the positive and negative bag data, the positive and negative dictionaries and the sample selection matrix are optimized. The sample selection matrix is used for selecting a true positive sample from the related positive bag. They defined a cost function for optimizing the kernel weights that increases the discrimination between the positive and negative bags.

In this paper, we propose an iterative multiple kernel-based dictionary learning algorithm. Each iteration consists of three steps in which the sparse representation of the training samples, the kernel weights and the dictionary matrix elements are respectively updated. In each step, it is assumed that all parameters other than those related to that step are fixed. The optimization process is performed within the framework of the SRC algorithm considering two different structures for the dictionary matrix, namely the distributive and collective schemes [12]. The key contributions of our work are:
• We incorporate a multiple kernel learning process into the dictionary learning framework.
• We propose a new three-step algorithm for sparse coding, multiple kernel learning and dictionary learning. Our main contribution is the multiple kernel learning stage.
• We introduce distributive and collective points of view in sparse representation based classification and formulate the proposed algorithm for both structures.
• We find an analytical solution for the multiple kernel learning stage which speeds up the learning process.
• We demonstrate the effectiveness of the proposed algorithms on three popular datasets.
1.1. Notations

Vectors are denoted by lowercase bold letters and matrices by uppercase bold letters. Scalars and matrix elements are shown by non-bold letters. The ℓ0 pseudo-norm, denoted by ‖·‖_0, is the number of non-zero elements of a vector, and ‖·‖_F denotes the Frobenius norm. T_0 refers to the sparsity level.

1.2. Paper organization

This paper is organized as follows. Section 2 defines and formulates the sparse representation problem in the kernel space. Section 3 presents the proposed multiple kernel based dictionary learning in the two scenarios of the collective and distributive SRC algorithm. The classification procedures for both the distributive and collective structures are presented in Section 4. The experimental setups and results are presented in Section 5. Finally, Section 6 concludes the paper with a summary and suggests some future research directions.

2. Sparse representation and dictionary learning in the kernel space

Given a set of training samples, Y = [y_1, …, y_N] ∈ ℝ^{n×N}, the goal is to learn a dictionary D ∈ ℝ^{n×K} with K atoms that leads to the best representation of the training samples. By best representation, we mean the one that leads to the least reconstruction error. The optimization problem for achieving this goal is as follows:
arg min_{D,X} ‖Y − DX‖_F²   s.t.   ‖x_i‖_0 ≤ T_0,  ‖d_j‖_2 = 1,   ∀ i, j.   (1)
where x_i is the sparse representation of the i-th training sample, y_i; that is, the i-th column of the sparse coefficient matrix X with at most T_0 non-zero entries. Similarly, d_j is the j-th column of the dictionary matrix, D, which is referred to as the j-th atom of D. Two well-known algorithms for solving the above problem are the Method of Optimal Directions (MOD) [18] and the K-SVD algorithm [10]. In the above formulation of the problem, the sparse representation of the samples in the original feature space is calculated. Other feature spaces can also be taken into account. Let ϕ: ℝ^n → H ⊂ ℝ^ñ be a non-linear mapping function that maps the data samples from the original feature space ℝ^n into a dot product space (a Hilbert space H). It is worth noting that the dimensionality of the new feature space, ñ, is often much larger than n; it can possibly be infinite. Using this non-linear mapping function, one can generate the non-linear form of the sparse representation problem in (1) as follows:
arg min_{D,X} ‖ϕ(Y) − DX‖_F²   s.t.   ‖x_i‖_0 ≤ T_0,   ∀ i.   (2)

where ϕ(Y) = [ϕ(y_1), …, ϕ(y_N)] ∈ ℝ^{ñ×N}. It has been shown that the following arrangement can be used for obtaining the optimal non-linear dictionary matrix [12]:

D* = ϕ(Y)A   (3)

where A ∈ ℝ^{N×K} is a matrix with K atoms. The above definition of the optimal dictionary allows us to tune the dictionary to the training samples via modifying the coefficient matrix A. Hence, the optimal dictionary can be sought through optimization of A instead of D. By substituting (3) into (2), the optimization problem becomes:

arg min_{A,X} ‖ϕ(Y) − ϕ(Y)AX‖_F²   s.t.   ‖x_i‖_0 ≤ T_0,  ‖d_j‖_2 = 1,   ∀ i = 1, …, N,  ∀ j = 1, …, K.   (4)

By this formulation, the problem of optimizing a high dimensional dictionary matrix, D*, is replaced by optimizing a finite dimensional matrix, A. Moreover, this formulation allows us to apply the well-known kernel trick. As is known, a kernel function k(y_i, y_j) can be used as a similarity measure between two samples, y_i and y_j. A kernel function computes the inner product of the samples in the related feature space, that is k(y_i, y_j) = 〈ϕ(y_i), ϕ(y_j)〉, where 〈·,·〉 is the dot product operator. The kernel trick states that any dot product in a projected feature space can be computed with a much lower computational burden using an appropriate non-linear function. The Gram matrix or kernel matrix, K = [k(y_i, y_j)]_{ij}, is a positive semi-definite matrix constructed by applying the assumed kernel function to the pairs of i-th and j-th training samples. After some algebraic manipulations and using the kernel trick, the cost function in (4) can be re-written as:

‖ϕ(Y) − ϕ(Y)AX‖_F² = trace((I − AX)^T K(Y, Y)(I − AX))   (5)

where I ∈ ℝ^{N×N} is the identity matrix and K(Y, Y) ∈ ℝ^{N×N} is the associated finite dimensional kernel matrix. For the convenience of the reader, Table 1 summarizes the dimensions of the various variables.

Table 1. Variables' dimensions.
Variable:    Y      ϕ(Y)    D      d_i    X      x_i    A
Dimension:   n×N    ñ×N     ñ×K    ñ×1    K×N    K×1    N×K
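The practical value of Eq. (5) is that the objective can be evaluated from the Gram matrix alone, without ever forming ϕ(Y). The following is a minimal NumPy sketch of that computation; the RBF kernel and the random A and X used in the toy example are placeholders, not part of the paper's setup.

```python
import numpy as np

def rbf_kernel(Y, Z, sigma=1.0):
    """Gaussian (RBF) Gram matrix between the columns of Y and Z (columns are samples)."""
    d2 = (np.sum(Y**2, axis=0)[:, None] + np.sum(Z**2, axis=0)[None, :]
          - 2.0 * Y.T @ Z)                       # squared Euclidean distances
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_reconstruction_error(K, A, X):
    """Eq. (5): ||phi(Y) - phi(Y) A X||_F^2 = trace((I - AX)^T K (I - AX))."""
    R = np.eye(K.shape[0]) - A @ X
    return np.trace(R.T @ K @ R)

# Toy example: n = 20 features, N = 50 samples, K = 10 atoms.
rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 50))
A = rng.standard_normal((50, 10))
X = rng.standard_normal((10, 50))
K = rbf_kernel(Y, Y)                             # Gram matrix K(Y, Y)
print(kernel_reconstruction_error(K, A, X))
```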
3. Multiple kernel-based dictionary learning

As the cost function in (5) shows, the kernel function plays a key role in this problem. Selecting an appropriate kernel function and tuning its parameters are important issues in kernel based methods. Multiple Kernel Learning (MKL) approaches suggest that, instead of a single kernel, a proper combination of different kernels is used for this purpose. Such an arrangement not only relieves the kernel based method from choosing the best kernel through a time consuming procedure, but also provides a framework for processing more complicated data structures. Let {K_m}_{m=1}^M be a set of M kernels produced either by M descriptors such as color, texture, etc. or by M different kernel functions applied to a single descriptor. We use a linear combination of these kernels to produce a multiple kernel as follows:

K_M = Σ_{m=1}^M β_m K_m   (6)

where β_m is the weight corresponding to the m-th kernel matrix, K_m. Suppose that the feature space corresponding to this multiple kernel is denoted by Φ, that is k(y_i, y_j) = 〈Φ(y_i), Φ(y_j)〉. The objective function for optimizing the dictionary produced by this multiple kernel matrix would be:

arg min_{A,X,β} ‖Φ(Y) − Φ(Y)AX‖_F²   s.t.   Σ_{m=1}^M β_m² ≤ 1   and   ‖x_i‖_0 ≤ T_0,   ∀ i = 1, …, N.   (7)

The ℓ2-norm constraint for the kernel weights (‖β‖_2 ≤ 1) leads to a non-sparse solution for the kernel combination coefficients. This makes it possible to use the complementary information of as many kernels as possible. Moreover, as we will show later, this form of constraint simplifies the subsequent formulations. The function in (7) can be reformulated as:

‖Φ(Y) − Φ(Y)AX‖_F² = Σ_{i=1}^N ‖Φ(y_i) − Φ(Y)Ax_i‖_2²
    = Σ_{i=1}^N [ K_M(y_i, y_i) − 2 x_i^T A^T K_M(Y, y_i) + x_i^T A^T K_M(Y, Y) A x_i ]   (8)
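The sketch below illustrates Eqs. (6)-(8) numerically: it combines a few base Gram matrices with ℓ2-bounded weights and evaluates the per-sample error terms of Eq. (8), checking that their sum matches the trace form of Eq. (5). The base kernel choices and sizes are illustrative only.

```python
import numpy as np

def combine_kernels(base_kernels, beta):
    """Eq. (6): K_M = sum_m beta_m K_m, with the constraint sum_m beta_m^2 <= 1."""
    beta = np.asarray(beta, dtype=float)
    norm = np.linalg.norm(beta)
    if norm > 1.0:                               # project onto the feasible set ||beta||_2 <= 1
        beta = beta / norm
    return sum(b * K for b, K in zip(beta, base_kernels)), beta

def samplewise_errors(KM, A, X):
    """Per-sample terms of Eq. (8) computed from the multiple kernel matrix KM."""
    B = A @ X                                    # column i equals A x_i
    return (np.diag(KM)
            - 2.0 * np.sum(B * KM, axis=0)
            + np.sum(B * (KM @ B), axis=0))

# Toy check that Eq. (8) agrees with the trace form of Eq. (5).
rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 40))                # n = 20 features, N = 40 samples
def rbf(Y, sigma):
    d2 = np.sum(Y**2, 0)[:, None] + np.sum(Y**2, 0)[None, :] - 2 * Y.T @ Y
    return np.exp(-d2 / (2 * sigma**2))
base = [rbf(Y, s) for s in (0.5, 1.0, 2.0)] + [Y.T @ Y]   # three Gaussian kernels and one linear kernel
KM, beta = combine_kernels(base, [0.5, 0.5, 0.5, 0.5])
A = rng.standard_normal((40, 8)); X = rng.standard_normal((8, 40))
err = samplewise_errors(KM, A, X)
R = np.eye(40) - A @ X
assert np.isclose(err.sum(), np.trace(R.T @ KM @ R))
```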
In the following parts of this section, we present our multiple kernel dictionary learning algorithm. The proposed algorithm involves three stages in which the kernel weights (the β_m's), the sparse representations of the training examples (the x_i's) and the atom selection matrix (A) are iteratively updated. The optimization process is performed within the framework of sparse representation based classification. We consider two different structures for the sparse representation based classifier, namely the distributive and collective settings. These structures lead to two optimization algorithms. In the following, each algorithm is introduced in detail.

3.1. Distributive algorithm

Within the framework of sparse representation based classification, the word distributive was coined in [12] for the case where, for each class, a dictionary matrix is formed using the training samples of that class. The class-based sparse representation of a test sample is then computed using the related dictionary. In [12], the KK-SVD algorithm has been applied for dictionary learning and for obtaining the representation. This approach motivated us to introduce a distributive form of the multiple kernel dictionary learning algorithm in which the learning process is performed in three steps: updating the kernel weights, sparse coding of the training samples and updating the dictionaries. Let Y = [Y_1, …, Y_C] contain the feature vectors obtained from a set of training samples, where Y_c contains the data belonging to the c-th class and C is the total number of classes. Also, suppose that {D_c = Φ(Y_c)A_c}_{c=1}^C are the class dependent dictionary matrices, where A_c is the atom selection matrix corresponding to the c-th class. In fact, this matrix is updated via the KK-SVD algorithm. Also, X = [X_1, …, X_C] is the sparse coefficient matrix, where X_c contains the sparse coefficients of the data samples of class c. Details of the learning process steps are as follows.

3.1.1. Updating the coefficient matrix, X, when {A_c}_{c=1}^C and β are fixed

In this stage, the dictionary matrices ({A_c}_{c=1}^C) and the kernel weights β are assumed to be fixed and the coefficient matrix, X, is optimized. From the distributive point of view, the sparse coefficient matrices, {X_c}_{c=1}^C, are computed by minimizing the following constrained cost function:
arg min_{X_c} ‖Φ(Y_c) − Φ(Y_c)A_cX_c‖_F²   s.t.   ‖x_{i_c}‖_0 ≤ T_0,   ∀ i = 1, …, N_c,  ∀ c = 1, …, C.   (9)

This optimization problem can be solved using the kernel orthogonal matching pursuit (KOMP) algorithm proposed in [12]. Note that C different representations are obtained, using each class's training samples separately. The other classes' samples make no contribution to the calculation of the sparse representation of a sample.
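The KOMP routine of [12] is not reproduced in the paper; the following is a minimal sketch of kernel orthogonal matching pursuit under our reading of [12]: atoms are selected by their correlation with the residual in the feature space, and the coefficients on the selected atoms are obtained by a kernel-domain least-squares solve. The ridge term and variable names are our own choices.

```python
import numpy as np

def komp(KYY, kz, A, T0, eps=1e-10):
    """Sketch of kernel orthogonal matching pursuit (KOMP), in the spirit of [12].

    KYY : (N, N) Gram matrix K(Y, Y) of the training samples.
    kz  : (N,)   kernel vector K(Y, z) between the training samples and the query z.
    A   : (N, K) atom-selection matrix; the j-th dictionary atom is phi(Y) A[:, j].
    T0  : sparsity level (maximum number of selected atoms).
    """
    x = np.zeros(A.shape[1])
    support = []
    for _ in range(T0):
        # Correlation of every atom with the current residual, expressed via the kernel:
        # tau_j = (phi(Y) a_j)^T (phi(z) - phi(Y) A x) = a_j^T (kz - KYY A x).
        tau = A.T @ (kz - KYY @ (A @ x))
        tau[support] = 0.0                       # do not re-select atoms
        j = int(np.argmax(np.abs(tau)))
        if np.abs(tau[j]) < eps:
            break
        support.append(j)
        # Least-squares coefficients on the selected atoms (small ridge for stability).
        As = A[:, support]
        G = As.T @ KYY @ As + eps * np.eye(len(support))
        x[:] = 0.0
        x[support] = np.linalg.solve(G, As.T @ kz)
    return x
```

For a training sample y_i the kernel vector kz is simply the i-th column of K(Y, Y); in the distributive setting the same routine is run per class with K_M(Y_c, Y_c) and A_c.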
3.1.2. Updating the dictionary matrices via updating {A_c}_{c=1}^C when X and β are fixed

As mentioned before, the dictionary matrices are updated by updating the associated A_c matrices. For this purpose, similar to [12], we update one atom of each matrix {A_c}_{c=1}^C at a time in an efficient way. To update the k-th atom of the c-th dictionary, a_k^c, we minimize the following cost function:

‖Φ(Y_c) − Φ(Y_c)A_cX_c‖_F² = ‖Φ(Y_c) − Φ(Y_c) Σ_{j=1}^{K_c} a_j^c x_{cT}^j‖_F²
    = ‖Φ(Y_c)( I − Σ_{j≠k} a_j^c x_{cT}^j ) − Φ(Y_c) a_k^c x_{cT}^k‖_F²
    = ‖Φ(Y_c)E_k^c − Φ(Y_c)M_k^c‖_F²   (10)

where a_k^c and x_{cT}^j denote the k-th column of A_c and the j-th row of X_c respectively. Also, E_k^c = (I − Σ_{j≠k} a_j^c x_{cT}^j), M_k^c = a_k^c x_{cT}^k and K_c is the total number of atoms of the c-th dictionary, A_c. Note that Φ(Y_c)E_k^c is the difference between the true value of the c-th class data and the approximated value when the k-th dictionary atom of the c-th class is removed from the related dictionary matrix. Also, Φ(Y_c)M_k^c is the contribution of the k-th dictionary atom to the approximated values (the sparsely represented data). Minimization of the above cost function is equivalent to finding the values of a_k^c and x_{cT}^k such that the matrix Φ(Y_c)M_k^c best approximates Φ(Y_c)E_k^c in terms of mean squared error. In order to keep the sparsity level fixed, we consider and update only the non-zero entries of x_{cT}^k, which results in shrinking the matrices M_k^c and E_k^c. Let us define p_k^c as the set of indices corresponding to the non-zero elements of x_{cT}^k, i.e. p_k^c = {i | 1 ≤ i ≤ N_c, x_{cT}^k(i) ≠ 0}, and let P_k^c ∈ ℝ^{N_c×|p_k^c|} be a matrix with ones at the (p_k^c(j), j)-th entries and zeros elsewhere. By multiplying M_k^c and E_k^c by P_k^c, only their desired elements, i.e. those corresponding to the non-zero elements of x_{cT}^k, remain. Therefore, the cost function in (10) can be rewritten as follows:

‖Φ(Y_c)Ẽ_k^c − Φ(Y_c) a_k^c x̃_{cT}^k‖_F²   (11)

where x̃_{cT}^k contains the non-zero elements of x_{cT}^k and Ẽ_k^c = E_k^c P_k^c. The above minimization problem can be solved by the Singular Value Decomposition (SVD) Φ(Y_c)Ẽ_k^c = UΣV^T in order to obtain the optimal values of a_k^c and x̃_{cT}^k as:

Φ(Y_c) a_k^c x̃_{cT}^k = σ_1 u_1 v_1^T   (12)

where σ_1 = Σ(1,1) is the largest singular value and u_1 and v_1 are the corresponding singular vectors, i.e. the first columns of the sorted U and V. We choose Φ(Y_c)a_k^c = u_1 and x̃_{cT}^k = σ_1 v_1^T. This selection guarantees the unit norm of the dictionary atoms, because each dictionary atom is Φ(Y_c)a_k^c and we know that u_1^T u_1 = 1. Please note that we usually do not have the mapping function Φ explicitly. Moreover, the associated feature space may have infinite dimension, which makes the SVD operation on Φ(Y_c)Ẽ_k^c practically impossible. However, the kernel trick solves this problem by creating a dot product in the feature space as follows:

(Φ(Y_c)Ẽ_k^c)^T(Φ(Y_c)Ẽ_k^c) = (UΣV^T)^T(UΣV^T)   ⟹   (Ẽ_k^c)^T K_M(Y_c, Y_c) Ẽ_k^c = VΔV^T   (13)

where Δ = Σ^TΣ and K_M = Σ_{m=1}^M β_m K_m is the corresponding multiple kernel matrix. After some manipulations, it can be shown that the k-th atom, a_k^c, is updated as:

a_k^c = σ'_1 Ẽ_k^c v_1   (14)

where σ'_1 = Δ(1,1)^{−1/2} = 1/σ_1. So, in this step, the dictionary atoms of the A_c matrices are updated using (14).
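A minimal sketch of this atom update is given below, assuming the class index is dropped and the multiple kernel matrix K_M(Y_c, Y_c) is available. It restricts the error matrix to the support of the k-th coefficient row, eigendecomposes Ẽ^T K_M Ẽ as in Eq. (13), and applies Eqs. (12) and (14). The in-place update convention and the numerical safeguards are our own choices.

```python
import numpy as np

def update_atom(KM, A, X, k, eps=1e-12):
    """Sketch of the kernel dictionary atom update of Eqs. (11)-(14) for one class.

    KM : (N, N) multiple kernel matrix K_M(Y, Y) of this class.
    A  : (N, K) atom-selection matrix (column k is updated).
    X  : (K, N) sparse coefficient matrix (row k is updated on its support).
    """
    omega = np.flatnonzero(X[k, :])              # indices p_k of the non-zero coefficients
    if omega.size == 0:
        return A, X                              # unused atom; leave it unchanged
    # E_k = I - sum_{j != k} a_j x^j; restricting the columns to omega gives E~_k = E_k P_k.
    E = np.eye(KM.shape[0]) - A @ X + np.outer(A[:, k], X[k, :])
    Et = E[:, omega]
    # Eigen-decomposition of E~^T K_M E~ = V diag(delta) V^T  (Eq. (13)).
    M = Et.T @ KM @ Et
    delta, V = np.linalg.eigh(0.5 * (M + M.T))   # symmetrize for numerical safety
    d1, v1 = delta[-1], V[:, -1]                 # largest eigenvalue and its eigenvector
    if d1 < eps:
        return A, X
    A[:, k] = Et @ v1 / np.sqrt(d1)              # Eq. (14): a_k = delta_1^{-1/2} E~ v_1
    X[k, :] = 0.0
    X[k, omega] = np.sqrt(d1) * v1               # x~ = sigma_1 v_1^T  (Eq. (12))
    return A, X
```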
3.1.3. Updating the kernel weights, β, when {A_c}_{c=1}^C and X are fixed

In this subsection, we present our approach for learning the kernel weights while the sparse coefficient matrices, {X_c}_{c=1}^C, and the dictionary matrices are assumed to be fixed. Toward this goal, the optimization problem in (7) can be re-written as:

arg min_β Σ_{c=1}^C ‖Φ(Y_c) − Φ(Y_c)A_cX_c‖_F² + ζ(A, β)   s.t.   Σ_{m=1}^M β_m² ≤ 1   (15)

We include the penalty term ζ in the cost function in order to force the dictionary atoms to have unit norm. This penalty term is defined as:

ζ(A, β) = Σ_{c=1}^C trace( A_c^T K_M(Y_c, Y_c) A_c − I_c )   (16)

where I_c ∈ ℝ^{K_c×K_c} is the identity matrix and K_c is the number of atoms of the c-th dictionary. So, the optimization function of (15) can be written as:

J = Σ_{c=1}^C ‖Φ(Y_c) − Φ(Y_c)A_cX_c‖_F² + Σ_{c=1}^C trace( A_c^T K_M(Y_c, Y_c) A_c − I_c ) + λ(β^Tβ − 1)   (17)

where λ is the Lagrange multiplier. After substituting the multiple kernel function of Eq. (6) into (17) and some manipulations, the optimization problem becomes:

J = Σ_{c=1}^C Σ_{i_c=1}^{N_c} Σ_{m=1}^M β_m K_m(y_{i_c}, y_{i_c}) − Σ_{c=1}^C Σ_{i_c=1}^{N_c} ( 2 x_{i_c}^T A_c^T Σ_{m=1}^M β_m K_m(Y_c, y_{i_c}) )
    + Σ_{c=1}^C Σ_{i_c=1}^{N_c} ( x_{i_c}^T A_c^T Σ_{m=1}^M β_m K_m(Y_c, Y_c) A_c x_{i_c} )
    + Σ_{c=1}^C trace( A_c^T Σ_{m=1}^M β_m K_m(Y_c, Y_c) A_c − I_c ) + λ(β^Tβ − 1)   (18)

Choosing the convex constraint ‖β‖_2 ≤ 1 for the kernel weights provides an analytical solution to the above problem. By taking the first derivative of the above function with respect to β_m and setting it to zero, the m-th kernel weight is obtained as:

β_m = (−1/(2λ)) ( Σ_{c=1}^C Σ_{i_c=1}^{N_c} [ K_m(y_{i_c}, y_{i_c}) − 2 x_{i_c}^T A_c^T K_m(Y_c, y_{i_c}) + x_{i_c}^T A_c^T K_m(Y_c, Y_c) A_c x_{i_c} ] + Σ_{c=1}^C trace(A_c^T K_m(Y_c, Y_c) A_c) )   (19)

The Lagrange multiplier, λ, can also be found by taking the first derivative of the function with respect to λ and setting it to zero.
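A sketch of this closed-form weight update is given below. It accumulates the bracketed quantity of Eq. (19) for each base kernel and then, instead of solving for the Lagrange multiplier explicitly, rescales the weight vector onto the constraint set Σ β_m² ≤ 1; absorbing λ through this normalization is our reading of the update, not a detail stated in the paper.

```python
import numpy as np

def update_kernel_weights(base_kernels, A_list, X_list):
    """Sketch of the distributive kernel-weight update of Eq. (19).

    base_kernels : list over m of lists over classes c of Gram matrices K_m(Y_c, Y_c).
    A_list, X_list : per-class atom-selection and sparse coefficient matrices.
    """
    M = len(base_kernels)
    g = np.zeros(M)
    for m in range(M):
        for Kc, Ac, Xc in zip(base_kernels[m], A_list, X_list):
            B = Ac @ Xc                          # column i is A_c x_i
            # sum_i [K_m(y_i,y_i) - 2 x_i^T A_c^T K_m(Y_c,y_i) + x_i^T A_c^T K_m A_c x_i]
            g[m] += (np.trace(Kc)
                     - 2.0 * np.sum(B * Kc)
                     + np.sum(B * (Kc @ B)))
            # + trace(A_c^T K_m(Y_c, Y_c) A_c) from the unit-norm penalty term
            g[m] += np.trace(Ac.T @ Kc @ Ac)
    beta = g / max(np.linalg.norm(g), 1e-12)     # enforce ||beta||_2 <= 1
    return beta
```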
3.2. Collective algorithm

The word collective was also coined in [12], for the case where a single dictionary matrix is used for all classes. Compared to the distributive setting, this dictionary is formed by concatenating the class-based dictionaries; that is, D = [D_1, …, D_C]. In this framework, the sparse representation matrices are also concatenated as X = [X_1, …, X_C]. Similar to the distributive setting, we propose a three-stage algorithm for the sparse coding, dictionary learning and kernel weight learning.
3.2.1. Updating the coefficient matrix, X, when {A_c}_{c=1}^C and β are fixed

In this stage, the dictionary matrix and the kernel weights, β, are assumed to be fixed and the sparse coefficient matrix, X, is optimized. Unlike the aforementioned distributive algorithm, in the collective algorithm the sparse representation of a sample is calculated using the dictionary derived from all the classes' samples. Hence, the coefficient matrix, X, is computed by minimizing the following cost function:
arg min_X ‖Φ(Y) − Φ(Y)AX‖_F²   s.t.   ‖x_i‖_0 ≤ T_0,   ∀ i = 1, …, N.   (20)
Similar to the distributive algorithm, this optimization problem can be solved using the kernel orthogonal matching pursuit (KOMP) algorithm proposed in [12].

3.2.2. Updating the dictionary matrices, {A_c}_{c=1}^C, when X and β are fixed

Owing to the structure of the SVD-based update, although a single dictionary matrix is used in the collective setting, the elements of the dictionary matrix have to be learned for each class separately. So, in this stage, the class-dependent sub-matrices {A_c}_{c=1}^C are used and, similar to the distributive algorithm, one atom of each matrix is updated at a time using Eq. (14). After updating these sub-matrices, the dictionary matrix is formed by concatenating the related dictionary matrices, D = [D_1, …, D_C], where D_c = Φ(Y_c)A_c.

3.2.3. Updating the kernel weights, β, when {A_c}_{c=1}^C and X are fixed

In this part, we present our collective approach for learning the kernel weights while the sparse coefficient matrix, X, and the matrices {A_c}_{c=1}^C are assumed to be fixed. The collective form of the optimization problem in (7) can be re-written as:
arg min_β ‖Φ(Y) − DX‖_F² + ζ(D, β)   s.t.   Σ_{m=1}^M β_m² ≤ 1   and   ‖x_i‖_0 ≤ T_0,   ∀ i = 1, …, N.   (21)
Fig. 1. The proposed distributive and collective algorithms.
Similar to the distributive algorithm, we consider a penalty term, ζ , to force the dictionary atoms to have unit norm. This penalty term is defined as:
ζ(D, β) = trace( D^T D − I )   (22)

where I ∈ ℝ^{K×K} is the identity matrix and K = Σ_{c=1}^C K_c is the total number of dictionary atoms. After some manipulations similar to those of Section 3.1.3, the m-th kernel weight is obtained as:

β_m = (−1/(2λ)) ( Σ_{i=1}^N [ K_m(y_i, y_i) − 2 x_i^T Q + x_i^T G x_i ] + trace(G) )   (23)
where

Q = D^T Φ(y_i) = [ A_1^T K_m(Y_1, y_i) ; A_2^T K_m(Y_2, y_i) ; ⋮ ; A_C^T K_m(Y_C, y_i) ]

and

G = D^T D =
    [ A_1^T K_m(Y_1, Y_1)A_1   ⋯   A_1^T K_m(Y_1, Y_C)A_C ]
    [          ⋮               ⋱              ⋮            ]
    [ A_C^T K_m(Y_C, Y_1)A_1   ⋯   A_C^T K_m(Y_C, Y_C)A_C ].

It is noteworthy that in the collective setting, all of the dictionary elements contribute to the sparse representation of a sample. It has to be also noted that in the updating process of the kernel weights, the similarities between samples from different classes are taken into account via K(Y_i, Y_j). Algorithm 1 in Fig. 1 contains the pseudo-code of the multiple kernel-based dictionary learning algorithm in the distributive and collective settings.

3.3. Computational complexity of the proposed methods

As discussed in the previous section, the training phase of the proposed methods contains three stages. Table 2 gives the order of computational complexity of each stage in the distributive and collective settings. In this table, N_c, C and N are respectively the number of training samples of each class, the number of classes and the total number of training examples. Also, |p_k^c| is the number of non-zero elements selected by P_k^c.

Table 2. Computational complexity of the proposed methods.
                Stage 1 (sparse coding)    Stage 2 (dictionary learning)    Stage 3 (kernel weights learning)
Distributive    O(C N_c²)                  O(|p_k^c|³)                      O(C N_c²)
Collective      O(N²)                      O(|p_k^c|³)                      O(N²)
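Since Algorithm 1 (Fig. 1) is not reproduced in this extracted text, the following is a high-level sketch of one reading of the distributive training loop, alternating the three stages. It assumes the komp, update_atom and update_kernel_weights sketches given earlier are in scope; the random initialization of A_c and the iteration cap are our own choices, consistent with the stopping rule mentioned in Section 5.

```python
import numpy as np

def train_distributive(base_kernels_per_class, K_atoms, T0, n_iter=40, seed=0):
    """Sketch of the distributive variant of the three-stage training loop.

    base_kernels_per_class : list over m of lists over classes c of K_m(Y_c, Y_c).
    """
    rng = np.random.default_rng(seed)
    M = len(base_kernels_per_class)
    C = len(base_kernels_per_class[0])
    beta = np.ones(M) / np.sqrt(M)               # start from the average kernel
    sizes = [base_kernels_per_class[0][c].shape[0] for c in range(C)]
    A_list = [rng.standard_normal((Nc, K_atoms)) for Nc in sizes]
    X_list = [np.zeros((K_atoms, Nc)) for Nc in sizes]
    for _ in range(n_iter):                      # stopping criterion: iteration cap
        for c in range(C):
            KMc = sum(b * Ks[c] for b, Ks in zip(beta, base_kernels_per_class))
            # Stage 1: sparse coding of the class-c samples (Eq. (9), via KOMP).
            X_list[c] = np.column_stack(
                [komp(KMc, KMc[:, i], A_list[c], T0) for i in range(sizes[c])])
            # Stage 2: update each atom of A_c in turn (Eqs. (11)-(14)).
            for k in range(K_atoms):
                A_list[c], X_list[c] = update_atom(KMc, A_list[c], X_list[c], k)
        # Stage 3: closed-form kernel-weight update (Eq. (19)).
        beta = update_kernel_weights(base_kernels_per_class, A_list, X_list)
    return A_list, X_list, beta
```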
4. Classification procedure for a test sample

Given a test sample, the sparse representation of the sample is computed first. The sample is then reconstructed using the coefficients associated with each class separately and the related residual values are computed. The test sample is assigned to the class that leads to the smallest reconstruction error. The classification procedures in the distributive and collective settings are detailed in the next subsections.
4.1. Distributive algorithm

Let D_c be the learned kernel based dictionary matrix of the c-th class, where D_c = Φ(Y_c)A_c for c = 1, …, C. Given a test sample, z ∈ ℝ^n, the optimization problem in (9) is solved for each class separately using the KOMP algorithm, the results of which are the sparse coefficient vectors x_c ∈ ℝ^{K_c}. The reconstruction error for each x_c is then computed as:

r_c = ‖Φ(z) − Φ(Y_c)A_c x_c‖_2² = K_M(z, z) − 2 x_c^T A_c^T K_M(Y_c, z) + x_c^T A_c^T K_M(Y_c, Y_c) A_c x_c,   ∀ c = 1, …, C.   (24)

The test sample is simply classified to the class that gives the lowest reconstruction error among {r_c}_{c=1}^C.
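A minimal sketch of this distributive test procedure is given below. It assumes the komp sketch above is in scope, that the base kernels are available as functions of two sample sets, and that the learned weights beta and matrices A_c are given; all names are illustrative.

```python
import numpy as np

def classify_distributive(base_kernel_fns, beta, Y_classes, A_list, z, T0):
    """Sketch of the distributive classification rule of Section 4.1, Eq. (24).

    base_kernel_fns : list of functions k_m(U, V) returning a Gram matrix (columns = samples).
    Y_classes       : list of per-class training matrices Y_c.
    A_list          : learned per-class atom-selection matrices A_c.
    """
    z = z.reshape(-1, 1)
    errors = []
    for Yc, Ac in zip(Y_classes, A_list):
        KM_cc = sum(b * k(Yc, Yc) for b, k in zip(beta, base_kernel_fns))
        kM_cz = sum(b * k(Yc, z) for b, k in zip(beta, base_kernel_fns)).ravel()
        kM_zz = sum(b * k(z, z) for b, k in zip(beta, base_kernel_fns))[0, 0]
        xc = komp(KM_cc, kM_cz, Ac, T0)          # class-based sparse code of z
        # Eq. (24): r_c = K_M(z,z) - 2 x_c^T A_c^T K_M(Y_c,z) + x_c^T A_c^T K_M(Y_c,Y_c) A_c x_c
        Axc = Ac @ xc
        errors.append(kM_zz - 2.0 * Axc @ kM_cz + Axc @ KM_cc @ Axc)
    return int(np.argmin(errors)), errors
```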
4.2. Collective algorithm

In the collective setting, the dictionary is constructed as D = [D_1, …, D_C], where D_c = Φ(Y_c)A_c. The sparse representation of a test sample, z ∈ ℝ^n, computed by applying the KOMP algorithm to an optimization problem similar to (20), is x = [x_1, …, x_C] ∈ ℝ^K, where x_c contains the sparse coefficients associated with the c-th class. The reconstruction error for each class is then computed using (24). Finally, the test sample is classified to the class with the smallest reconstruction error. Note that the main difference between the distributive and collective algorithms in the test phase is in the sparse coding process.

5. Experimental results

In this section, we present our experimental results demonstrating the effectiveness of the proposed multiple kernel based dictionary learning approach. The evaluation is performed within the framework of a classification task using the Extended Yale B face database [19], the USPS digit dataset [20] and the LUNGML dataset [21].
5.1. Experimental setup

We compare our proposed algorithm in the distributive and collective modes to some benchmark algorithms. The sparse representation based classifier (SRC) is used as the baseline algorithm. We applied the KSRC algorithm using a variety of kernel functions; the best obtained results are presented as another benchmark. As a basic multiple kernel SRC algorithm, we also present the KSRC results when the average of the adopted basis kernels is considered as the final kernel. We compare the proposed algorithm to some state of the art dictionary learning approaches too. The kernel K-SVD (KK-SVD) is one of these algorithms. Here again, the best results of the KK-SVD method over different kernels are reported. The KK-SVD results using the average kernel are also reported as a simple multiple kernel method in which the kernel weights are assumed to have a uniform distribution. We also report the results of the SRC algorithm when the number of dictionary atoms is reduced to the number of atoms of the learned dictionary, K; in this case, K dictionary atoms are selected randomly. Also, for a better comparison, the results of the proposed algorithms when the number of dictionary atoms is set to the number of training samples (as in SRC) are reported. It means that in this case the learning process is performed but the number of dictionary atoms is not reduced to K. Thus, overall, we compare the performance of the following algorithms (the short names inside the brackets will be used subsequently):

• The Sparse Representation based Classification algorithm (SRC).
• The best results obtained from the KSRC using different kernel functions (KSRC(Best)).
• The KSRC algorithm when the kernel is the average of the basis kernels (KSRC(Mean)).
• The best result of the kernel K-SVD algorithm using different kernel functions (KK-SVD(Best)).
• The KK-SVD algorithm when the average kernel is used (KK-SVD(Mean)).
• The proposed multiple kernel-based dictionary learning algorithm (MKDL).
• The SRC algorithm when the number of dictionary atoms is reduced to K (SRC††).
• The proposed multiple kernel-based dictionary learning algorithm when the number of dictionary atoms is not reduced (MKDL†).

We use a total of 31 basis kernels for learning the optimal kernel: the linear kernel, 10 Gaussian kernels with the parameter σ varying from 0.5 to 5 in steps of 0.5, 10 polynomial kernels of degree d = 2 with the constant coefficient c varying from 0.5 to 5 in steps of 0.5, and 10 polynomial kernels of degree d = 3 with the constant coefficient c varying from 0.5 to 5 in steps of 0.5; a sketch of this kernel pool is given below. For a fair comparison, the experiments are performed in both the distributive and collective configurations while the dictionary matrices are determined using the above mentioned approaches. It means that in the distributive setting the sparse representation of a test sample is computed for each class separately using the associated dictionary, whereas in the collective setting the sparse coefficients are computed using a dictionary obtained by concatenating the class based dictionaries.
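The sketch below builds the 31-kernel pool described above as a list of Gram-matrix functions. The exact polynomial convention (u^T v + c)^d is an assumption; the paper does not spell out the scaling of its polynomial kernels.

```python
import numpy as np

def make_basis_kernels():
    """The 31 base kernel functions of Section 5.1 (our reading): 1 linear,
    10 Gaussian (sigma = 0.5..5), 10 polynomial d=2 and 10 polynomial d=3
    (constant c = 0.5..5). Each entry maps two sample matrices (columns = samples)
    to a Gram matrix."""
    kernels = [lambda U, V: U.T @ V]                      # linear kernel
    for sigma in np.arange(0.5, 5.01, 0.5):               # 10 Gaussian kernels
        def rbf(U, V, s=sigma):
            d2 = np.sum(U**2, 0)[:, None] + np.sum(V**2, 0)[None, :] - 2 * U.T @ V
            return np.exp(-d2 / (2 * s**2))
        kernels.append(rbf)
    for d in (2, 3):                                      # 10 + 10 polynomial kernels
        for c in np.arange(0.5, 5.01, 0.5):
            kernels.append(lambda U, V, c=c, d=d: (U.T @ V + c) ** d)
    return kernels

kernels = make_basis_kernels()
assert len(kernels) == 31
```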
5.2. Results for the Extended Yale B database

The Extended Yale B database contains 2414 frontal face images of 38 individuals taken under varying illumination conditions [22]. We used the cropped and normalized face images of 192×168 pixels. Fig. 2 shows examples of the cropped face images. We randomly choose 30 images per person as the training samples and use the rest as the test samples. Based on the Random face method proposed in [23], each face image, f_i ∈ ℝ^{192×168}, is projected onto a lower dimensional vector, y_i ∈ ℝ^{504}.

Fig. 2. Examples of the Extended Yale B face images.
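The following is a small sketch of the Random face projection used above: each vectorized face image is mapped to a 504-dimensional feature vector by a random matrix. The Gaussian draw and the row normalization are assumptions about the details of [23], not specifications from this paper.

```python
import numpy as np

def random_face_features(images, dim=504, seed=0):
    """Project vectorized face images (one per column, 192*168 = 32256 pixels)
    onto `dim` random directions, in the spirit of the Random face method [23]."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((dim, images.shape[0]))
    R /= np.linalg.norm(R, axis=1, keepdims=True)   # unit-norm projection rows (assumption)
    return R @ images                               # dim x n_images feature matrix
```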
5.3. Parameter setting

Fig. 3. Comparison of the Extended Yale B recognition accuracies for different sparsity levels T0 and numbers of dictionary atoms Kc.

In the proposed methods, there are three important parameters which have to be chosen appropriately: the sparsity level, T0, the number of dictionary atoms in the KK-SVD algorithm, Kc, and the number of basis kernels, M. In this section, the process of tuning these parameters and the sensitivity of the algorithms to them are analyzed. We randomly choose 10% of the training data as the evaluation set and the rest as the training set. We utilized the evaluation samples to set the parameters. Toward this goal, T0 was varied from 1 to 40 with a step size of 1 and Kc from 5 to 20 with a step size of 5. Fig. 3 shows the recognition rate of the proposed methods for the Extended Yale B database when different parameter values are used. As expected, the recognition rate improves considerably when the dictionary size and the sparsity level are initially increased. The accuracy of the methods then remains at almost a constant level as the sparsity level is further increased. Based on these experiments, the sparsity level was set to 15 and 20 for the distributive and collective methods respectively. We also chose {Kc}_{c=1}^C = 15 for both settings.

We also performed a set of experiments to investigate the effect of the number of basis kernels on the proposed methods. We considered the 31 kernels mentioned above as the basis kernels and implemented the proposed algorithm with different numbers of kernels. Fig. 4 demonstrates the results. It can be seen that by increasing the number of kernels, the performance of both algorithms, especially the distributive one, is improved. This is likely due to adding more information sources to the associated classifier. In the case of the distributive algorithm, the input data is represented by each class's data separately; therefore, the presence of more information sources is more important in order to represent the data accurately.

Fig. 4. Comparison of the Extended Yale B recognition accuracies for different numbers of kernels.

Obviously, the computational complexity of the training phase of the algorithms increases with the number of kernels. We also measured the computational time of the training phase for different numbers of kernels. Fig. 5 shows the computational time of the proposed algorithms in the distributive and collective settings. As expected, the computational time of both algorithms increases when the number of kernels is increased. We set the maximum number of iterations in Algorithm 1 to 40 and use it as the stopping criterion.

Fig. 5. The computational time of the proposed algorithms for different numbers of kernels.
Table 3. Recognition rate of different algorithms (in %) for the Extended Yale B database.

              SRC      KSRC(Best)  KSRC(Mean)  KK-SVD(Best)  KK-SVD(Mean)  MKDL     MKDL†    SRC††
Distributive  98.6656  64.0502     63.8932     97.5981       97.3210       97.8807  98.8226  94.7410
Collective    98.0377  98.2732     98.5341     98.9985       98.7598       99.7645  99.8321  92.8571
Table 3 summarizes the recognition rates of the different algorithms in both the distributive and collective settings. As these results show, in the distributive case the basic SRC algorithm highly outperforms the KSRC ones. In the distributive mode of the KSRC algorithm, each class dictionary matrix is formed by applying the related kernel function to the class training samples, D_c = K(Y_c, Y_c) ∈ ℝ^{N_c×N_c}. A kernel function measures the similarity of pairs of data samples, which usually takes higher values for data samples from the same class than for those of different classes. In the distributive setting, the dictionary matrices contain only within-class similarity information. Hence, the class-based dictionary matrices contain almost similar values. This causes the sparse representations of an input sample on different classes, {x_c}_{c=1}^C, to become similar to each other. In such a condition, the classifier is usually biased toward selecting the class which involves more training examples. It has to be also noted that the n × N_c dictionary matrix of the SRC is replaced by an N_c × N_c matrix in the KSRC algorithm. Since N_c is usually much smaller than n, the latter is computationally faster. However, in the sparse representation problem, y = Dx where D ∈ ℝ^{m1×m2}. Now, if m1 = m2 there could be a unique solution to the sparse representation problem, but the obtained x is not necessarily sparse. Therefore, the reconstruction errors for the obtained x on different classes may become approximately similar to each other and the classifier is confused in adopting the correct class label.

The results in Table 3 show that the KK-SVD algorithms (in the different cases) lead to better results compared to the KSRC algorithms. This improvement is due to two reasons: first, decreasing the number of dictionary atoms from N_c in KSRC to K_c, i.e. D_c ∈ ℝ^{N_c×K_c}, which eliminates the above mentioned problem of the KSRC; second, the K_c dictionary atoms are learned to be optimal through an optimization problem that reduces the reconstruction error. However, the problem of choosing an optimal kernel still remains. The accuracy of the MKDL algorithm is 97.5667%, which is approximately the same as that of the KK-SVD(Best) algorithm. It shows that the proposed algorithm converges to the optimal case. Moreover, there is no need to search for the best kernel thanks to the use of multiple kernels. The SRC algorithm leads to slightly better results compared to the MKDL algorithm but it is computationally more expensive in the classification phase. As mentioned before, it has to be also noted that in kernel based methods the original feature vectors are replaced by the similarities between the feature vectors; via this transformation some information may be lost. In the MKDL† algorithm, the learning process is performed again but the number of dictionary atoms remains equal to the number of training samples. It can be seen that MKDL† outperforms the SRC with the same number of dictionary atoms. In the SRC†† algorithm, the number of dictionary atoms is reduced to K_c, i.e. the same as in the MKDL and KK-SVD algorithms. In fact, we randomly select K_c training samples in order to construct the corresponding dictionary. As expected, in such a condition the performance of the SRC algorithm is degraded. The results of the MKDL† and SRC†† experiments show that the proposed method outperforms the SRC under the same conditions.

In the collective setting (the second row of Table 3), KSRC(Best) and KSRC(Mean) outperform the SRC algorithm. Compared to the distributive case, much better results are obtained using the KSRC approaches. This is due to the fact that in the collective setting the dictionary is constructed as D = K(Y, Y), where Y refers to all the training samples. It means that the training samples of different classes contribute to every atom of the dictionary, so each dictionary atom contains both within-class similarity and between-class dissimilarity information. This removes the previously explained problem of the KSRC in the distributive case. The reconstruction errors for different classes are not necessarily similar and the classifier can work properly. As expected, the KK-SVD and MKDL algorithms lead to better results. Another interesting observation is that the SRC algorithm in the distributive setting leads to slightly better results in comparison to the collective one. An explanation for this observation is that here a large number of training samples per class is available (30 images for each person), so the main characteristics of each class are well described by the associated dictionary matrix. On the other hand, frontal faces of different persons have some similarities which can create undesirable effects on the collective representation of the input image. However, due to the learning process, the KK-SVD and MKDL algorithms have better performance in the collective mode. Moreover, the result of MKDL in the collective setting (99.76%) is better than the other reported results for this dataset. As far as we know, the best previously reported result for this data set is a recognition rate of 99.47% [16]. The authors in [16] improved the SRC algorithm using low-rank representation and eigenface extraction techniques; these techniques are based on Robust Principal Component Analysis and Singular Value Decomposition approaches. The superiority of our proposed method is mainly due to the effectiveness of the non-linear mapping of the learned multiple kernel and of the learned dictionary matrix.

As mentioned before, in these experiments the feature vectors are produced using the Random face method. The dimensionality of the feature vectors is an important factor which affects the classification performance. In order to investigate the effect of this factor, we applied random matrices of different sizes to each face image, f_i ∈ ℝ^{192×168}, for extracting the related feature vector. Fig. 6 contains the associated results (since the distributive KSRC method has very poor results, we have not shown them in the figure). In this figure, "D" and "C" refer to the Distributive and Collective settings respectively. It can be seen that the proposed MKDL-C always leads to the best results in terms of classification accuracy.
5.3.1. Convergence of the proposed methods

The proposed methods iterate until the adopted cost criterion (J) no longer decreases significantly or the maximum number of iterations is reached. Fig. 7 shows the cost value versus the iteration number of the distributive and collective methods for the Extended Yale B database. Although there is no theoretical guarantee that the cost converges to a global minimum, Fig. 7 shows that the cost value monotonically decreases and the algorithms reach their convergence point after a few iterations.
Fig. 6. The classification accuracies for different dimensions of input images.
Fig. 7. Convergence of the cost for the Extended Yale B database.
5.4. The USPS database results

The USPS database contains 10 digit classes (0–9) of 256-dimensional handwritten digits with a total of 9298 samples. For each class, we randomly select 500 samples as the training set and 200 samples as the test set. In the dictionary learning process, we chose the following values for the related parameters: T0 = 10, {Kc}_{c=1}^C = 300 and JMax = 40, where JMax is the maximum number of training iterations (the stopping criterion). Fig. 8 contains examples of the USPS images.

Fig. 8. Sample images of the USPS database.

Table 4 contains the results of the different algorithms. In this table, the classification accuracy for both the distributive and collective settings is presented. Similar to the previous results, the KSRC algorithms lead to very poor results compared to the SRC algorithm in the distributive setting. However, learning the dictionary atoms, as in the KK-SVD and MKDL algorithms, improves the classification accuracy (by about 2% compared to the SRC). An interesting point is that, in contrast to the face recognition experiments, the distributive results here are comparable to or slightly better than the collective ones. This can be explained by the fact that in the case of the USPS experiments the number of training examples is sufficiently large, so the distributive dictionaries contain the required information to produce an accurate sparse representation for a test sample. Moreover, the best recognition rate previously reported for the USPS dataset was obtained with the kernel KSVD algorithm; the results confirm that the proposed MKDL algorithm outperforms the kernel KSVD (KK-SVD) algorithm. We also investigated the robustness of the methods with respect to the missing pixels problem. For this purpose, a fraction of the image pixels are randomly selected and their values are replaced with zero. Fig. 9 demonstrates the classification accuracy of the different methods for different fractions of missing pixels. It can be seen that the proposed method always outperforms the others.

Fig. 9. The classification accuracies for different fractions of missing pixels.
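The missing-pixel corruption just described is straightforward to reproduce; a minimal sketch is given below, with the random seed and in-place conventions as our own choices.

```python
import numpy as np

def corrupt_missing_pixels(images, fraction, seed=0):
    """Zero out a random `fraction` of the pixels of each column (vectorized image),
    as in the robustness experiment reported in Fig. 9."""
    rng = np.random.default_rng(seed)
    corrupted = images.copy()
    n_pixels = images.shape[0]
    n_drop = int(round(fraction * n_pixels))
    for j in range(images.shape[1]):
        idx = rng.choice(n_pixels, size=n_drop, replace=False)
        corrupted[idx, j] = 0.0
    return corrupted
```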
Table 4. Recognition rate of different algorithms (in %) for the USPS handwritten digit database.

              SRC      KSRC(Best)  KSRC(Mean)  KK-SVD(Best)  KK-SVD(Mean)  MKDL     MKDL†    SRC††
Distributive  96.2800  43.1600     30.0400     98.5000       98.2600       98.9800  98.9600  94.4000
Collective    95.7600  96.6800     96.0800     97.0400       96.6800       98.2200  98.4800  94.2400
5.5. Results for the LUNG database

In order to evaluate the effectiveness of the proposed methods on a high dimensional, small sample-size database, we applied the adopted multiple kernel-based dictionary learning approaches to a bioinformatics dataset. The LUNG dataset contains in total 203 samples in 5 classes, which have 139, 20, 21, 17 and 6 samples respectively. Each sample has 12600 genes, of which a subset of the 3312 most variable genes over the five classes has been identified, i.e. y_i ∈ ℝ^{3312}. We use the 10-fold cross-validation technique to split the data samples into training and test sets, and the average result over the relevant experiments is reported. In this database, the distribution of samples over the different classes is imbalanced, and choosing a fixed value for the number of dictionary atoms, K_c, seems to be unfair. Hence, we choose 100, 12, 12, 12 and 3 as the numbers of dictionary atoms for the associated classes respectively. Moreover, the maximum sparsity level is set to T0 = 10 and the maximum number of training iterations is set to 40.
Table 5. Recognition rate of different algorithms (in %) for the LUNG database.

              SRC      KSRC(Best)  KSRC(Mean)  KK-SVD(Best)  KK-SVD(Mean)  MKDL     MKDL†    SRC††
Distributive  91.9048  68.3333     22.6190     94.2857       95.1743       96.1905  96.6667  92.2584
Collective    95.2381  94.2857     94.7619     94.2857       94.7619       96.6667  96.6667  93.3333
The classification results of both the distributive and collective settings are summarized in Table 5. These results are consistent with the earlier reported results and confirm the superiority of the proposed schemes. The recognition rates of the proposed algorithms are about 96%, while, to the best of our knowledge, the best previously reported result for this database is about 93% [24]. In [24], the best four features (genes) have been selected and the K-NN and SVM classifiers have been used.

6. Conclusion

In this paper, we proposed a multiple kernel-based dictionary learning approach. The learned dictionary is used within the framework of a classification task, where we defined two different classification setups, the distributive and collective settings. We introduced the proposed approach in both the distributive and collective settings and formulated the associated optimization problems. It was shown that, in both cases, the optimization problem leads to an analytical solution. Our experimental results show the superiority of the proposed approach in terms of classification accuracy. Moreover, compared to the SRC method, the proposed method reduces the computational burden in the classification phase by decreasing the size of the associated dictionary. In future work, in order to improve the classification power, we intend to impose some discriminative constraints on the kernel weights in the learning process.
References

[1] J. Wright, A.Y. Yang, A. Ganesh, et al., Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[2] C. Hong, J. Zhu, Hypergraph-based multi-example ranking with sparse representation for transductive learning image retrieval, Neurocomputing 101 (2013) 94–103.
[3] J. Yu, Y. Rui, D. Tao, Click prediction for web image reranking using multimodal sparse coding, IEEE Trans. Image Process. 23 (5) (2014) 2019–2032.
[4] C. Hong, J. Yu, D. Tao, M. Wang, Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval, IEEE Trans. Ind. Electron. 62 (6) (2015) 3742–3751.
[5] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: CVPR, 2010, pp. 3360–3367.
[6] P. Li, Y. Liu, G. Liu, M. Guo, Z. Pan, A robust local sparse coding method for image classification with histogram intersection kernel, Neurocomputing 182 (2016) 36–42.
[7] R. Rubinstein, A. Bruckstein, M. Elad, Dictionaries for sparse representation modeling, Proc. IEEE 98 (6) (2010) 1045–1057.
[8] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (6) (2010) 1031–1044.
[9] M. Elad, M. Figueiredo, Y. Ma, On the role of sparse and redundant representations in image processing, Proc. IEEE 98 (6) (2010) 972–982.
[10] M. Aharon, M. Elad, A.M. Bruckstein, The K-SVD: an algorithm for designing of overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322.
[11] W. Zhao, Z. Liu, Z. Guan, B. Lin, D. Cai, Orthogonal projective sparse coding for image representation, Neurocomputing 173 (2016) 270–277.
[12] V.H. Nguyen, V.M. Patel, N.M. Nasrabadi, R. Chellappa, Design of non-linear kernel dictionaries for object recognition, IEEE Trans. Image Process. 22 (12) (2013) 5123–5135.
[13] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.
[14] H. Zheng, F. Liu, Z. Jin, Multiple kernel sparse representation based classification, in: Chinese Conference on Pattern Recognition, 2012, pp. 48–55.
[15] A. Shrivastava, V.M. Patel, R. Chellappa, Multiple kernel learning for sparse representation-based classification, IEEE Trans. Image Process. 23 (7) (2014) 3013–3024.
[16] J.J. Thiagarajan, K.N. Ramamurthy, A. Spanias, Multiple kernel sparse representations for supervised and unsupervised learning, IEEE Trans. Image Process. 23 (7) (2014) 2905–2915.
[17] A. Shrivastava, J.K. Pillai, V.M. Patel, Multiple kernel-based dictionary learning for weakly supervised classification, Pattern Recognit. 48 (8) (2015) 2667–2675.
[18] K. Engan, S.O. Aase, J.H. Husoy, Method of optimal directions for frame design, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, 1999, pp. 2443–2446.
[19] A.S. Georghiades, P.N. Belhumeur, D. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 643–660.
[20] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. (1994) 550–554.
[21] K. Yang, Z.P. Cai, J.Z. Li, G.H. Lin, A stable gene selection in microarray data analysis, BMC Bioinformatics (2006).
[22] J. Yin, X. Liu, Z. Jin, W. Yang, Kernel sparse representation based classification, Neurocomputing 77 (2012) 120–128.
[23] S. Gao, I.W. Tsang, L.-T. Chia, Sparse representation with kernels, IEEE Trans. Image Process. 22 (2) (2013) 423–434.
[24] D. Wang, F. Nie, H. Huang, Feature selection via global redundancy minimization, IEEE Trans. Knowl. Data Eng. 27 (10) (2015) 2743–2755.

Tahereh Zare received the B.Sc. and M.Sc. degrees in electrical engineering and electronics from Yazd University, Yazd, Iran, in 2006 and 2009 respectively. She is currently a Ph.D. candidate at Yazd University. Her research interests include machine learning and computer vision.
Mohammad Taghi Sadeghi received a B.Sc. in electrical engineering and electronics from Sharif University in Tehran, Iran, in 1991. In 1995 he completed an M.Sc. in electrical engineering and communications at Tarbiat Modarres University, Iran. In 2003, he completed a Ph.D. in machine vision at the Centre for Vision, Speech and Signal Processing (CVSSP) within the Department of Electronics and Electrical Engineering of the University of Surrey in the UK. He worked at the CVSSP as a research fellow for two years and joined the Department of Electrical Engineering of Yazd University in 2005. His current research interests include pattern recognition, image processing and computer vision.