A novel multiple kernel-based dictionary learning for distributive and collective sparse representation based classifiers
Tahereh Zare, Mohammad Taghi Sadeghi
Electrical Engineering Department, Yazd University, Yazd, Iran
Abstract
In recent years, sparse representation theory has attracted the attention of many researchers in the signal processing, pattern recognition and computer vision communities. The choice of the dictionary matrix plays a key role in sparse representation based methods. It can be a pre-defined dictionary or can be learned via an optimization procedure. Furthermore, the dictionary learning process can be extended to a non-linear setting using an appropriate kernel function in order to handle non-linearly structured data. In this framework, the choice of kernel function is also a key step. Multiple kernel learning is an appealing strategy for dealing with this problem. In this paper, within the framework of kernel sparse representation based classification, we propose an iterative algorithm for jointly learning the dictionary matrix and the multiple kernel function. The weighted sum of a set of basis functions is considered as the multiple kernel function, where the weights are optimized such that the reconstruction error of the sparse coded data is minimized. In our proposed algorithm, the sparse coding, dictionary learning and multiple kernel learning processes are performed in three steps. The optimization process is performed considering two different structures, namely distributive and collective, for the sparse representation based classifier. Our experimental results show that the proposed algorithm outperforms the other existing sparse coding based approaches. These results also confirm that the collective setting leads to better results when the number of training examples is limited. On the other hand, the distributive setting is more appropriate when there are enough training samples.
Keywords: Sparse representation classification; Dictionary learning; Multiple kernel learning; K-SVD algorithm; Kernel K-SVD algorithm
1. Introduction

In recent years, sparse representation based algorithms have become increasingly important in machine learning applications such as image denoising, object recognition and classification. The Sparse Representation based Classifier (SRC) is among these fruitful algorithms [1]. The success of sparse representation based classification stems from the fact that high-dimensional signals and images are naturally sparse in an appropriate feature space. Such signals can therefore be represented by a few samples (atoms) of a proper set of exemplars (a dictionary). Although the SRC was originally introduced for the classification task, its ability to represent an input signal with a few training samples has led to its use in different applications such as image retrieval [2], click prediction [3] and human pose recovery [4]. In the last decade, various studies have been conducted to improve the performance of this classifier. Most of this research has focused on two aspects of the SRC: one is optimizing the sparse coding function and the other is improving the dictionary matrix. In [4], the authors incorporated a local similarity preserving term into the objective function of sparse coding which groups similar silhouettes to alleviate the instability of sparse codes. Wang et al. [5] demonstrated that similar inputs are represented by similar atoms. They therefore introduced Locality-constrained Linear Coding (LLC), which implements locality by projecting each descriptor onto its local coordinate system. In [6], the authors presented a method which makes use of the Histogram Intersection Kernel (HIK) technique within the LLC framework. In that method, the dictionary is learned in the feature space induced by the HIK and the local sparse codes of the input histograms are computed there.

The dictionary matrix plays a key role in the sparse representation of signals. The feature vectors derived from a training data set can be directly used in order to determine the dictionary elements. Alternatively, a learning process can be applied to form a learned dictionary. It has been shown that applying a proper learning process significantly improves the results [7–9]. There are several known algorithms for dictionary learning. Among them, the K-SVD algorithm is widely used due to its effectiveness in practical applications [10]. The main goal of the K-SVD algorithm is to
find an overcomplete dictionary matrix which contains K atoms such that the reconstruction error of the resulting sparse representation is minimized. The algorithm uses an iterative two-step procedure in which the sparse representation of the training data and the associated dictionary elements are alternately updated. In the K-SVD algorithm, the feature vectors extracted from a training data set are linearly combined in order to design the dictionary. However, because of the non-linear structure of some real world data, such a linear combination is not always efficient. Non-linear transformation using kernel methods is a well-known technique widely used for generalizing linear methods. In the non-linear version of the K-SVD algorithm, the Kernel K-SVD (KK-SVD) algorithm, the data points are implicitly mapped into a new high dimensional feature space. The sparse coding and dictionary learning steps are then performed in this new feature space. The authors in [11] argue that iterative dictionary learning algorithms such as the K-SVD are computationally expensive and proposed a new method called Orthogonal Projective Sparse Coding (OPSC). This algorithm integrates manifold learning and sparse coding techniques. It has been shown that kernel based sparse representation algorithms can provide better results compared to their linear counterparts [12]. However, the type of kernel function and the value(s) of the kernel parameter(s) have to be selected appropriately. A typical solution to the problem of kernel selection is to apply the cross-validation technique in order to find the best kernel function among a set of candidates. However, this procedure is time-consuming. Moreover, there is no guarantee that the best possible solution is found. The other solution to this problem is to use an appropriate combination of different kernel functions [13]. So far, only a few works have used multiple kernels in the sparse representation field [14–17]. In [14], the authors proposed a multiple kernel Sparse Representation based Classification (SRC) algorithm in which two basis kernel matrices are combined by the weighted sum rule. The resulting matrix is considered as the dictionary matrix. For each test sample, the sparse coefficients and the kernel weights are iteratively updated. This method is obviously not suitable for real-time applications. The other problem is that in their proposed method, the required two basis functions are selected by applying the cross-validation procedure to a set of Gaussian and polynomial kernels. This process seems to contradict the multiple kernel learning concept. In [15] also, the dictionary matrix is assumed to be the weighted sum of different kernel matrices. However, in contrast to [14], the weights are determined in a training phase via an iterative process. In [16], the same structure has been considered for the multiple kernel function, and the kernel weights and dictionary elements are learned using a three-step algorithm. In these three steps, the kernel weights are optimized based on graph embedding principles, the sparse coding is performed using a simple level-wise pursuit scheme and the dictionary elements are learned using multiple levels of 1-D subspace clustering successively. Different image descriptors such as color, shape and texture, along with a kernel mapping function, have been used for generating the basis kernels.
The authors in [17] proposed a multiple instance learning algorithm using a multiple kernel dictionary learning framework in a weakly supervised condition where the labels are in the form of positive and negative bags. Their proposed algorithm is composed of four steps in which the kernel weights, the sparse codes of the positive and negative bag data, the positive and negative dictionaries and the sample selection matrix are optimized. The sample selection matrix is used for selecting a true positive sample from the related positive bag. They defined a cost function for optimizing the kernel weights that increases the discrimination between the positive and negative bags.

In this paper, we propose an iterative multiple kernel-based dictionary learning algorithm. Each iteration consists of three steps in which the sparse representation of the training samples, the kernel weights and the dictionary matrix elements are respectively updated. In each step, it is assumed that all parameters other than those related to that step are fixed. The optimization process is performed within the framework of the SRC algorithm considering two different structures for the dictionary matrix, namely the distributive and collective schemes [12]. The key contributions of our work are:
• We incorporate a multiple kernel learning process into the dictionary learning framework.
• We propose a new three-step algorithm for sparse coding, multiple kernel learning and dictionary learning. Our main contribution is the multiple kernel learning stage.
• We introduce distributive and collective points of view in sparse representation based classification and formulate the proposed algorithm for both structures.
• We find an analytical solution for the multiple kernel learning stage which speeds up the learning process.
• We demonstrate the effectiveness of the proposed algorithms on three popular datasets.
1.1. Notations

Vectors are denoted by lowercase bold letters and matrices by uppercase bold letters. Scalars and matrix elements are shown by non-bold letters. The ℓ0 pseudo-norm, denoted by ‖·‖_0, is the number of non-zero elements of a vector, and ‖·‖_F denotes the Frobenius norm. T_0 refers to the sparsity level.

1.2. Paper organization

This paper is organized as follows. Section 2 defines and formulates the sparse representation problem in the kernel space. Section 3 presents the proposed multiple kernel based dictionary learning in the two scenarios of the collective and distributive SRC algorithm. The classification procedures for both the distributive and collective structures are presented in Section 4. The experimental setups and results are presented in Section 5. Finally, Section 6 concludes the paper with a summary and suggests some future research directions.

2. Sparse representation and dictionary learning in the kernel space

Given a set of training samples, Y = [y_1, …, y_N] ∈ ℝ^{n×N}, the goal is to learn a dictionary D ∈ ℝ^{n×K} with K atoms that leads to the best representation of the training samples. By best representation, we mean the one that leads to the least reconstruction error. The optimization problem for achieving this goal is as follows:
arg min_{D,X} ‖Y − DX‖_F²   s.t.   ‖x_i‖_0 ≤ T_0,  ‖d_j‖_2 = 1,   ∀ i, j.   (1)
where x_i is the sparse representation of the i-th training sample, y_i; that is, the i-th column of the sparse coefficient matrix X with at most T_0 non-zero entries. Similarly, d_j is the j-th column of the dictionary matrix, D, which is referred to as the j-th atom of D. Two well-known algorithms for solving the above problem are the Method of Optimal Directions (MOD) [18] and the K-SVD algorithm [10]. In the above formulation of the problem, the sparse representation of the samples in the original feature space is calculated. Other feature spaces can also be taken into account. Let ϕ: ℝ^n → H ⊂ ℝ^ñ be a non-linear mapping function that maps the data samples from the original feature space ℝ^n into a dot product space (a Hilbert space H). It is worth noting that the dimensionality of the new feature space, ñ, is often much larger than n; it can possibly be infinite. Using this non-linear mapping function, one can generate the non-linear form of the sparse representation problem in (1) as follows:
arg min_{D,X} ‖ϕ(Y) − DX‖_F²   s.t.   ‖x_i‖_0 ≤ T_0,   ∀ i.   (2)

where ϕ(Y) = [ϕ(y_1), …, ϕ(y_N)] ∈ ℝ^{ñ×N}. It has been shown that the following arrangement can be used for obtaining the optimal non-linear dictionary matrix [12]:

D* = ϕ(Y)A   (3)

where A ∈ ℝ^{N×K} is a matrix with K atoms. The above definition of the optimal dictionary allows us to tune the dictionary to the training samples via modifying the coefficient matrix A. Hence, the optimal dictionary can be sought through optimization of A instead of D. By substituting (3) into (2), the optimization problem becomes:

arg min_{A,X} ‖ϕ(Y) − ϕ(Y)AX‖_F²   s.t.   ‖x_i‖_0 ≤ T_0,  ‖d_j‖_2 = 1,   ∀ i = 1, …, N,  ∀ j = 1, …, K.   (4)

By this formulation, the problem of optimizing a high dimensional dictionary matrix, D*, is replaced by optimizing a finite dimensional matrix, A. Moreover, this formulation allows us to apply the well-known kernel trick. As is known, a kernel function k(y_i, y_j) can be used as a similarity measure between two samples, y_i and y_j. A kernel function computes the inner product of the samples in the related feature space, that is k(y_i, y_j) = 〈ϕ(y_i), ϕ(y_j)〉, where 〈·,·〉 is the dot product operator. The kernel trick states that any dot product in a projected feature space can be computed with a much lower computational burden using an appropriate non-linear function. The Gram matrix or kernel matrix, K = [k(y_i, y_j)]_{ij}, is a positive semi-definite matrix constructed by applying the assumed kernel function to the pairs of i-th and j-th training samples. After some algebraic manipulations and using the kernel trick, the cost function in (4) can be re-written as:

‖ϕ(Y) − ϕ(Y)AX‖_F² = trace((I − AX)^T K(Y, Y)(I − AX))   (5)

where I ∈ ℝ^{N×N} is the identity matrix and K(Y, Y) ∈ ℝ^{N×N} is the associated finite dimensional kernel matrix. For the convenience of the reader, Table 1 summarizes the dimensions of the various variables.

Table 1. Variables' dimensions.
Variable:    Y      ϕ(Y)    D      d_i    X      x_i    A
Dimension:   n×N    ñ×N     ñ×K    ñ×1    K×N    K×1    N×K
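The practical value of Eq. (5) is that the objective can be evaluated from the Gram matrix alone, without ever forming ϕ(Y). The following is a minimal NumPy sketch of that computation; the RBF kernel and the random A and X used in the toy example are placeholders, not part of the paper's setup.

```python
import numpy as np

def rbf_kernel(Y, Z, sigma=1.0):
    """Gaussian (RBF) Gram matrix between the columns of Y and Z (columns are samples)."""
    d2 = (np.sum(Y**2, axis=0)[:, None] + np.sum(Z**2, axis=0)[None, :]
          - 2.0 * Y.T @ Z)                       # squared Euclidean distances
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_reconstruction_error(K, A, X):
    """Eq. (5): ||phi(Y) - phi(Y) A X||_F^2 = trace((I - AX)^T K (I - AX))."""
    R = np.eye(K.shape[0]) - A @ X
    return np.trace(R.T @ K @ R)

# Toy example: n = 20 features, N = 50 samples, K = 10 atoms.
rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 50))
A = rng.standard_normal((50, 10))
X = rng.standard_normal((10, 50))
K = rbf_kernel(Y, Y)                             # Gram matrix K(Y, Y)
print(kernel_reconstruction_error(K, A, X))
```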
3. Multiple kernel-based dictionary learning

As the cost function in (5) shows, the kernel function plays a key role in this problem. Selecting an appropriate kernel function and tuning its parameters are important issues in kernel based methods. Multiple Kernel Learning (MKL) approaches suggest that, instead of a single kernel, a proper combination of different kernels is used for this purpose. Such an arrangement not only relieves the kernel based method from choosing the best kernel through a time consuming procedure, but also provides a framework for processing more complicated data structures. Let {K_m}_{m=1}^M be a set of M kernels produced either by M descriptors such as color, texture, etc. or by M different kernel functions applied to a single descriptor. We use a linear combination of these kernels to produce a multiple kernel as follows:

K_M = Σ_{m=1}^M β_m K_m   (6)

where β_m is the weight corresponding to the m-th kernel matrix, K_m. Suppose that the feature space corresponding to this multiple kernel is denoted by Φ, that is k(y_i, y_j) = 〈Φ(y_i), Φ(y_j)〉. The objective function for optimizing the dictionary produced by this multiple kernel matrix would be:

arg min_{A,X,β} ‖Φ(Y) − Φ(Y)AX‖_F²   s.t.   Σ_{m=1}^M β_m² ≤ 1   and   ‖x_i‖_0 ≤ T_0,   ∀ i = 1, …, N.   (7)

The ℓ2-norm constraint for the kernel weights (‖β‖_2 ≤ 1) leads to a non-sparse solution for the kernel combination coefficients. This makes it possible to use the complementary information of as many kernels as possible. Moreover, as we will show later, this form of constraint simplifies the subsequent formulations. The function in (7) can be reformulated as:

‖Φ(Y) − Φ(Y)AX‖_F² = Σ_{i=1}^N ‖Φ(y_i) − Φ(Y)Ax_i‖_2²
    = Σ_{i=1}^N [ K_M(y_i, y_i) − 2 x_i^T A^T K_M(Y, y_i) + x_i^T A^T K_M(Y, Y) A x_i ]   (8)
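The sketch below illustrates Eqs. (6)-(8) numerically: it combines a few base Gram matrices with ℓ2-bounded weights and evaluates the per-sample error terms of Eq. (8), checking that their sum matches the trace form of Eq. (5). The base kernel choices and sizes are illustrative only.

```python
import numpy as np

def combine_kernels(base_kernels, beta):
    """Eq. (6): K_M = sum_m beta_m K_m, with the constraint sum_m beta_m^2 <= 1."""
    beta = np.asarray(beta, dtype=float)
    norm = np.linalg.norm(beta)
    if norm > 1.0:                               # project onto the feasible set ||beta||_2 <= 1
        beta = beta / norm
    return sum(b * K for b, K in zip(beta, base_kernels)), beta

def samplewise_errors(KM, A, X):
    """Per-sample terms of Eq. (8) computed from the multiple kernel matrix KM."""
    B = A @ X                                    # column i equals A x_i
    return (np.diag(KM)
            - 2.0 * np.sum(B * KM, axis=0)
            + np.sum(B * (KM @ B), axis=0))

# Toy check that Eq. (8) agrees with the trace form of Eq. (5).
rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 40))                # n = 20 features, N = 40 samples
def rbf(Y, sigma):
    d2 = np.sum(Y**2, 0)[:, None] + np.sum(Y**2, 0)[None, :] - 2 * Y.T @ Y
    return np.exp(-d2 / (2 * sigma**2))
base = [rbf(Y, s) for s in (0.5, 1.0, 2.0)] + [Y.T @ Y]   # three Gaussian kernels and one linear kernel
KM, beta = combine_kernels(base, [0.5, 0.5, 0.5, 0.5])
A = rng.standard_normal((40, 8)); X = rng.standard_normal((8, 40))
err = samplewise_errors(KM, A, X)
R = np.eye(40) - A @ X
assert np.isclose(err.sum(), np.trace(R.T @ KM @ R))
```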
In the following parts of this section, we present our multiple kernel dictionary learning algorithm. The proposed algorithm involves three stages in which the kernel weights (the β_m's), the sparse representations of the training examples (the x_i's) and the atom selection matrix (A) are iteratively updated. The optimization process is performed within the framework of sparse representation based classification. We consider two different structures for the sparse representation based classifier, namely the distributive and collective settings. These structures lead to two optimization algorithms. In the following, each algorithm is introduced in detail.

3.1. Distributive algorithm

Within the framework of sparse representation based classification, the word distributive was coined in [12] for the case where, for each class, a dictionary matrix is formed using the training samples of that class. The class-based sparse representation of a test sample is then computed using the related dictionary. In [12], the KK-SVD algorithm has been applied for dictionary learning and for obtaining the representation. This approach motivated us to introduce a distributive form of the multiple kernel dictionary learning algorithm in which the learning process is performed in three steps: updating the kernel weights, sparse coding of the training samples and updating the dictionaries. Let Y = [Y_1, …, Y_C] contain the feature vectors obtained from a set of training samples, where Y_c contains the data belonging to the c-th class and C is the total number of classes. Also, suppose that {D_c = Φ(Y_c)A_c}_{c=1}^C are the class dependent dictionary matrices, where A_c is the atom selection matrix corresponding to the c-th class. In fact, this matrix is updated via the KK-SVD algorithm. Also, X = [X_1, …, X_C] is the sparse coefficient matrix, where X_c contains the sparse coefficients of the data samples of class c. Details of the learning process steps are as follows.

3.1.1. Updating the coefficient matrix, X, when {A_c}_{c=1}^C and β are fixed

In this stage, the dictionary matrices ({A_c}_{c=1}^C) and the kernel weights β are assumed to be fixed and the coefficient matrix, X, is optimized. From the distributive point of view, the sparse coefficient matrices, {X_c}_{c=1}^C, are computed by minimizing the following constrained cost function:
arg min_{X_c} ‖Φ(Y_c) − Φ(Y_c)A_cX_c‖_F²   s.t.   ‖x_{i_c}‖_0 ≤ T_0,   ∀ i = 1, …, N_c,  ∀ c = 1, …, C.   (9)

This optimization problem can be solved using the kernel orthogonal matching pursuit (KOMP) algorithm proposed in [12]. Note that C different representations are obtained, using each class's training samples separately. The other classes' samples make no contribution to the calculation of the sparse representation of a sample.
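The KOMP routine of [12] is not reproduced in the paper; the following is a minimal sketch of kernel orthogonal matching pursuit under our reading of [12]: atoms are selected by their correlation with the residual in the feature space, and the coefficients on the selected atoms are obtained by a kernel-domain least-squares solve. The ridge term and variable names are our own choices.

```python
import numpy as np

def komp(KYY, kz, A, T0, eps=1e-10):
    """Sketch of kernel orthogonal matching pursuit (KOMP), in the spirit of [12].

    KYY : (N, N) Gram matrix K(Y, Y) of the training samples.
    kz  : (N,)   kernel vector K(Y, z) between the training samples and the query z.
    A   : (N, K) atom-selection matrix; the j-th dictionary atom is phi(Y) A[:, j].
    T0  : sparsity level (maximum number of selected atoms).
    """
    x = np.zeros(A.shape[1])
    support = []
    for _ in range(T0):
        # Correlation of every atom with the current residual, expressed via the kernel:
        # tau_j = (phi(Y) a_j)^T (phi(z) - phi(Y) A x) = a_j^T (kz - KYY A x).
        tau = A.T @ (kz - KYY @ (A @ x))
        tau[support] = 0.0                       # do not re-select atoms
        j = int(np.argmax(np.abs(tau)))
        if np.abs(tau[j]) < eps:
            break
        support.append(j)
        # Least-squares coefficients on the selected atoms (small ridge for stability).
        As = A[:, support]
        G = As.T @ KYY @ As + eps * np.eye(len(support))
        x[:] = 0.0
        x[support] = np.linalg.solve(G, As.T @ kz)
    return x
```

For a training sample y_i the kernel vector kz is simply the i-th column of K(Y, Y); in the distributive setting the same routine is run per class with K_M(Y_c, Y_c) and A_c.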
3.1.2. Updating the dictionary matrices via updating {A_c}_{c=1}^C when X and β are fixed

As mentioned before, the dictionary matrices are updated by updating the associated A_c matrices. For this purpose, similar to [12], we update one atom of each matrix {A_c}_{c=1}^C at a time in an efficient way. To update the k-th atom of the c-th dictionary, a_k^c, we minimize the following cost function:

‖Φ(Y_c) − Φ(Y_c)A_cX_c‖_F² = ‖Φ(Y_c) − Φ(Y_c) Σ_{j=1}^{K_c} a_j^c x_{cT}^j‖_F²
    = ‖Φ(Y_c)( I − Σ_{j≠k} a_j^c x_{cT}^j ) − Φ(Y_c) a_k^c x_{cT}^k‖_F²
    = ‖Φ(Y_c)E_k^c − Φ(Y_c)M_k^c‖_F²   (10)

where a_k^c and x_{cT}^j denote the k-th column of A_c and the j-th row of X_c respectively. Also, E_k^c = (I − Σ_{j≠k} a_j^c x_{cT}^j), M_k^c = a_k^c x_{cT}^k and K_c is the total number of atoms of the c-th dictionary, A_c. Note that Φ(Y_c)E_k^c is the difference between the true value of the c-th class data and the approximated value when the k-th dictionary atom of the c-th class is removed from the related dictionary matrix. Also, Φ(Y_c)M_k^c is the contribution of the k-th dictionary atom to the approximated values (the sparsely represented data). Minimization of the above cost function is equivalent to finding the values of a_k^c and x_{cT}^k such that the matrix Φ(Y_c)M_k^c best approximates Φ(Y_c)E_k^c in terms of mean squared error. In order to keep the sparsity level fixed, we consider and update only the non-zero entries of x_{cT}^k, which results in shrinking the matrices M_k^c and E_k^c. Let us define p_k^c as the set of indices corresponding to the non-zero elements of x_{cT}^k, i.e. p_k^c = {i | 1 ≤ i ≤ N_c, x_{cT}^k(i) ≠ 0}, and let P_k^c ∈ ℝ^{N_c×|p_k^c|} be a matrix with ones at the (p_k^c(j), j)-th entries and zeros elsewhere. By multiplying M_k^c and E_k^c by P_k^c, only their desired elements, i.e. those corresponding to the non-zero elements of x_{cT}^k, remain. Therefore, the cost function in (10) can be rewritten as follows:

‖Φ(Y_c)Ẽ_k^c − Φ(Y_c) a_k^c x̃_{cT}^k‖_F²   (11)

where x̃_{cT}^k contains the non-zero elements of x_{cT}^k and Ẽ_k^c = E_k^c P_k^c. The above minimization problem can be solved by the Singular Value Decomposition (SVD) Φ(Y_c)Ẽ_k^c = UΣV^T in order to obtain the optimal values of a_k^c and x̃_{cT}^k as:

Φ(Y_c) a_k^c x̃_{cT}^k = σ_1 u_1 v_1^T   (12)

where σ_1 = Σ(1,1) is the largest singular value and u_1 and v_1 are the corresponding singular vectors, i.e. the first columns of the sorted U and V. We choose Φ(Y_c)a_k^c = u_1 and x̃_{cT}^k = σ_1 v_1^T. This selection guarantees the unit norm of the dictionary atoms, because each dictionary atom is Φ(Y_c)a_k^c and we know that u_1^T u_1 = 1. Please note that we usually do not have the mapping function Φ explicitly. Moreover, the associated feature space may have infinite dimension, which makes the SVD operation on Φ(Y_c)Ẽ_k^c practically impossible. However, the kernel trick solves this problem by creating a dot product in the feature space as follows:

(Φ(Y_c)Ẽ_k^c)^T(Φ(Y_c)Ẽ_k^c) = (UΣV^T)^T(UΣV^T)   ⟹   (Ẽ_k^c)^T K_M(Y_c, Y_c) Ẽ_k^c = VΔV^T   (13)

where Δ = Σ^TΣ and K_M = Σ_{m=1}^M β_m K_m is the corresponding multiple kernel matrix. After some manipulations, it can be shown that the k-th atom, a_k^c, is updated as:

a_k^c = σ'_1 Ẽ_k^c v_1   (14)

where σ'_1 = Δ(1,1)^{−1/2} = 1/σ_1. So, in this step, the dictionary atoms of the A_c matrices are updated using (14).
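A minimal sketch of this atom update is given below, assuming the class index is dropped and the multiple kernel matrix K_M(Y_c, Y_c) is available. It restricts the error matrix to the support of the k-th coefficient row, eigendecomposes Ẽ^T K_M Ẽ as in Eq. (13), and applies Eqs. (12) and (14). The in-place update convention and the numerical safeguards are our own choices.

```python
import numpy as np

def update_atom(KM, A, X, k, eps=1e-12):
    """Sketch of the kernel dictionary atom update of Eqs. (11)-(14) for one class.

    KM : (N, N) multiple kernel matrix K_M(Y, Y) of this class.
    A  : (N, K) atom-selection matrix (column k is updated).
    X  : (K, N) sparse coefficient matrix (row k is updated on its support).
    """
    omega = np.flatnonzero(X[k, :])              # indices p_k of the non-zero coefficients
    if omega.size == 0:
        return A, X                              # unused atom; leave it unchanged
    # E_k = I - sum_{j != k} a_j x^j; restricting the columns to omega gives E~_k = E_k P_k.
    E = np.eye(KM.shape[0]) - A @ X + np.outer(A[:, k], X[k, :])
    Et = E[:, omega]
    # Eigen-decomposition of E~^T K_M E~ = V diag(delta) V^T  (Eq. (13)).
    M = Et.T @ KM @ Et
    delta, V = np.linalg.eigh(0.5 * (M + M.T))   # symmetrize for numerical safety
    d1, v1 = delta[-1], V[:, -1]                 # largest eigenvalue and its eigenvector
    if d1 < eps:
        return A, X
    A[:, k] = Et @ v1 / np.sqrt(d1)              # Eq. (14): a_k = delta_1^{-1/2} E~ v_1
    X[k, :] = 0.0
    X[k, omega] = np.sqrt(d1) * v1               # x~ = sigma_1 v_1^T  (Eq. (12))
    return A, X
```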
3.1.3. Updating the kernel weights, β, when {A_c}_{c=1}^C and X are fixed

In this subsection, we present our approach for learning the kernel weights while the sparse coefficient matrices, {X_c}_{c=1}^C, and the dictionary matrices are assumed to be fixed. Toward this goal, the optimization problem in (7) can be re-written as:

arg min_β Σ_{c=1}^C ‖Φ(Y_c) − Φ(Y_c)A_cX_c‖_F² + ζ(A, β)   s.t.   Σ_{m=1}^M β_m² ≤ 1   (15)

We include the penalty term ζ in the cost function in order to force the dictionary atoms to have unit norm. This penalty term is defined as:

ζ(A, β) = Σ_{c=1}^C trace( A_c^T K_M(Y_c, Y_c) A_c − I_c )   (16)

where I_c ∈ ℝ^{K_c×K_c} is the identity matrix and K_c is the number of atoms of the c-th dictionary. So, the optimization function of (15) can be written as:

J = Σ_{c=1}^C ‖Φ(Y_c) − Φ(Y_c)A_cX_c‖_F² + Σ_{c=1}^C trace( A_c^T K_M(Y_c, Y_c) A_c − I_c ) + λ(β^Tβ − 1)   (17)

where λ is the Lagrange multiplier. After substituting the multiple kernel function of Eq. (6) into (17) and some manipulations, the optimization problem becomes:

J = Σ_{c=1}^C Σ_{i_c=1}^{N_c} Σ_{m=1}^M β_m K_m(y_{i_c}, y_{i_c}) − Σ_{c=1}^C Σ_{i_c=1}^{N_c} ( 2 x_{i_c}^T A_c^T Σ_{m=1}^M β_m K_m(Y_c, y_{i_c}) )
    + Σ_{c=1}^C Σ_{i_c=1}^{N_c} ( x_{i_c}^T A_c^T Σ_{m=1}^M β_m K_m(Y_c, Y_c) A_c x_{i_c} )
    + Σ_{c=1}^C trace( A_c^T Σ_{m=1}^M β_m K_m(Y_c, Y_c) A_c − I_c ) + λ(β^Tβ − 1)   (18)

Choosing the convex constraint ‖β‖_2 ≤ 1 for the kernel weights provides an analytical solution to the above problem. By taking the first derivative of the above function with respect to β_m and setting it to zero, the m-th kernel weight is obtained as:

β_m = (−1/(2λ)) ( Σ_{c=1}^C Σ_{i_c=1}^{N_c} [ K_m(y_{i_c}, y_{i_c}) − 2 x_{i_c}^T A_c^T K_m(Y_c, y_{i_c}) + x_{i_c}^T A_c^T K_m(Y_c, Y_c) A_c x_{i_c} ] + Σ_{c=1}^C trace(A_c^T K_m(Y_c, Y_c) A_c) )   (19)

The Lagrange multiplier, λ, can also be found by taking the first derivative of the function with respect to λ and setting it to zero.
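A sketch of this closed-form weight update is given below. It accumulates the bracketed quantity of Eq. (19) for each base kernel and then, instead of solving for the Lagrange multiplier explicitly, rescales the weight vector onto the constraint set Σ β_m² ≤ 1; absorbing λ through this normalization is our reading of the update, not a detail stated in the paper.

```python
import numpy as np

def update_kernel_weights(base_kernels, A_list, X_list):
    """Sketch of the distributive kernel-weight update of Eq. (19).

    base_kernels : list over m of lists over classes c of Gram matrices K_m(Y_c, Y_c).
    A_list, X_list : per-class atom-selection and sparse coefficient matrices.
    """
    M = len(base_kernels)
    g = np.zeros(M)
    for m in range(M):
        for Kc, Ac, Xc in zip(base_kernels[m], A_list, X_list):
            B = Ac @ Xc                          # column i is A_c x_i
            # sum_i [K_m(y_i,y_i) - 2 x_i^T A_c^T K_m(Y_c,y_i) + x_i^T A_c^T K_m A_c x_i]
            g[m] += (np.trace(Kc)
                     - 2.0 * np.sum(B * Kc)
                     + np.sum(B * (Kc @ B)))
            # + trace(A_c^T K_m(Y_c, Y_c) A_c) from the unit-norm penalty term
            g[m] += np.trace(Ac.T @ Kc @ Ac)
    beta = g / max(np.linalg.norm(g), 1e-12)     # enforce ||beta||_2 <= 1
    return beta
```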
3.2. Collective algorithm

The word collective was also coined in [12], for the case where a single dictionary matrix is used for all classes. Compared to the distributive setting, this dictionary is formed by concatenating the class-based dictionaries; that is, D = [D_1, …, D_C]. In this framework, the sparse representation matrices are also concatenated as X = [X_1, …, X_C]. Similar to the distributive setting, we propose a three-stage algorithm for the sparse coding, dictionary learning and kernel weight learning.
3.2.1. Updating the coefficient matrix, X, when {A_c}_{c=1}^C and β are fixed

In this stage, the dictionary matrix and the kernel weights, β, are assumed to be fixed and the sparse coefficient matrix, X, is optimized. Unlike the aforementioned distributive algorithm, in the collective algorithm the sparse representation of a sample is calculated using the dictionary derived from all the classes' samples. Hence, the coefficient matrix, X, is computed by minimizing the following cost function:
arg min_X ‖Φ(Y) − Φ(Y)AX‖_F²   s.t.   ‖x_i‖_0 ≤ T_0,   ∀ i = 1, …, N.   (20)
Similar to the distributive algorithm, this optimization problem can be solved using the kernel orthogonal matching pursuit (KOMP) algorithm proposed in [12].

3.2.2. Updating the dictionary matrices, {A_c}_{c=1}^C, when X and β are fixed

Owing to the structure of the SVD-based update, although a single dictionary matrix is used in the collective setting, the elements of the dictionary matrix have to be learned for each class separately. So, in this stage, the class-dependent sub-matrices {A_c}_{c=1}^C are used and, similar to the distributive algorithm, one atom of each matrix is updated at a time using Eq. (14). After updating these sub-matrices, the dictionary matrix is formed by concatenating the related dictionary matrices, D = [D_1, …, D_C], where D_c = Φ(Y_c)A_c.

3.2.3. Updating the kernel weights, β, when {A_c}_{c=1}^C and X are fixed

In this part, we present our collective approach for learning the kernel weights while the sparse coefficient matrix, X, and the matrices {A_c}_{c=1}^C are assumed to be fixed. The collective form of the optimization problem in (7) can be re-written as:
arg min_β ‖Φ(Y) − DX‖_F² + ζ(D, β)   s.t.   Σ_{m=1}^M β_m² ≤ 1   and   ‖x_i‖_0 ≤ T_0,   ∀ i = 1, …, N.   (21)
Fig. 1. The proposed distributive and collective algorithms.
Similar to the distributive algorithm, we consider a penalty term, ζ , to force the dictionary atoms to have unit norm. This penalty term is defined as:
ζ(D, β) = trace( D^T D − I )   (22)

where I ∈ ℝ^{K×K} is the identity matrix and K = Σ_{c=1}^C K_c is the total number of dictionary atoms. After some manipulations similar to those of Section 3.1.3, the m-th kernel weight is obtained as:

β_m = (−1/(2λ)) ( Σ_{i=1}^N [ K_m(y_i, y_i) − 2 x_i^T Q + x_i^T G x_i ] + trace(G) )   (23)
where

Q = D^T Φ(y_i) = [ A_1^T K_m(Y_1, y_i) ; A_2^T K_m(Y_2, y_i) ; ⋮ ; A_C^T K_m(Y_C, y_i) ]

and

G = D^T D =
    [ A_1^T K_m(Y_1, Y_1)A_1   ⋯   A_1^T K_m(Y_1, Y_C)A_C ]
    [          ⋮               ⋱              ⋮            ]
    [ A_C^T K_m(Y_C, Y_1)A_1   ⋯   A_C^T K_m(Y_C, Y_C)A_C ].

It is noteworthy that in the collective setting, all of the dictionary elements contribute to the sparse representation of a sample. It has to be also noted that in the updating process of the kernel weights, the similarities between samples from different classes are taken into account via K(Y_i, Y_j). Algorithm 1 in Fig. 1 contains the pseudo-code of the multiple kernel-based dictionary learning algorithm in the distributive and collective settings.

3.3. Computational complexity of the proposed methods

As discussed in the previous section, the training phase of the proposed methods contains three stages. Table 2 gives the order of computational complexity of each stage in the distributive and collective settings. In this table, N_c, C and N are respectively the number of training samples of each class, the number of classes and the total number of training examples. Also, |p_k^c| is the number of non-zero elements selected by P_k^c.

Table 2. Computational complexity of the proposed methods.
                Stage 1 (sparse coding)    Stage 2 (dictionary learning)    Stage 3 (kernel weights learning)
Distributive    O(C N_c²)                  O(|p_k^c|³)                      O(C N_c²)
Collective      O(N²)                      O(|p_k^c|³)                      O(N²)
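Since Algorithm 1 (Fig. 1) is not reproduced in this extracted text, the following is a high-level sketch of one reading of the distributive training loop, alternating the three stages. It assumes the komp, update_atom and update_kernel_weights sketches given earlier are in scope; the random initialization of A_c and the iteration cap are our own choices, consistent with the stopping rule mentioned in Section 5.

```python
import numpy as np

def train_distributive(base_kernels_per_class, K_atoms, T0, n_iter=40, seed=0):
    """Sketch of the distributive variant of the three-stage training loop.

    base_kernels_per_class : list over m of lists over classes c of K_m(Y_c, Y_c).
    """
    rng = np.random.default_rng(seed)
    M = len(base_kernels_per_class)
    C = len(base_kernels_per_class[0])
    beta = np.ones(M) / np.sqrt(M)               # start from the average kernel
    sizes = [base_kernels_per_class[0][c].shape[0] for c in range(C)]
    A_list = [rng.standard_normal((Nc, K_atoms)) for Nc in sizes]
    X_list = [np.zeros((K_atoms, Nc)) for Nc in sizes]
    for _ in range(n_iter):                      # stopping criterion: iteration cap
        for c in range(C):
            KMc = sum(b * Ks[c] for b, Ks in zip(beta, base_kernels_per_class))
            # Stage 1: sparse coding of the class-c samples (Eq. (9), via KOMP).
            X_list[c] = np.column_stack(
                [komp(KMc, KMc[:, i], A_list[c], T0) for i in range(sizes[c])])
            # Stage 2: update each atom of A_c in turn (Eqs. (11)-(14)).
            for k in range(K_atoms):
                A_list[c], X_list[c] = update_atom(KMc, A_list[c], X_list[c], k)
        # Stage 3: closed-form kernel-weight update (Eq. (19)).
        beta = update_kernel_weights(base_kernels_per_class, A_list, X_list)
    return A_list, X_list, beta
```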
4. Classification procedure for a test sample

Given a test sample, the sparse representation of the sample is computed first. The sample is then reconstructed using the coefficients associated with each class separately and the related residual values are computed. The test sample is assigned to the class that leads to the smallest reconstruction error. The classification procedures in the distributive and collective settings are detailed in the next subsections.
4.1. Distributive algorithm

Let D_c be the learned kernel based dictionary matrix of the c-th class, where D_c = Φ(Y_c)A_c for c = 1, …, C. Given a test sample, z ∈ ℝ^n, the optimization problem in (9) is solved for each class separately using the KOMP algorithm, the results of which are the sparse coefficient vectors x_c ∈ ℝ^{K_c}. The reconstruction error for each x_c is then computed as:

r_c = ‖Φ(z) − Φ(Y_c)A_c x_c‖_2² = K_M(z, z) − 2 x_c^T A_c^T K_M(Y_c, z) + x_c^T A_c^T K_M(Y_c, Y_c) A_c x_c,   ∀ c = 1, …, C.   (24)

The test sample is simply classified to the class that gives the lowest reconstruction error among {r_c}_{c=1}^C.
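A minimal sketch of this distributive test procedure is given below. It assumes the komp sketch above is in scope, that the base kernels are available as functions of two sample sets, and that the learned weights beta and matrices A_c are given; all names are illustrative.

```python
import numpy as np

def classify_distributive(base_kernel_fns, beta, Y_classes, A_list, z, T0):
    """Sketch of the distributive classification rule of Section 4.1, Eq. (24).

    base_kernel_fns : list of functions k_m(U, V) returning a Gram matrix (columns = samples).
    Y_classes       : list of per-class training matrices Y_c.
    A_list          : learned per-class atom-selection matrices A_c.
    """
    z = z.reshape(-1, 1)
    errors = []
    for Yc, Ac in zip(Y_classes, A_list):
        KM_cc = sum(b * k(Yc, Yc) for b, k in zip(beta, base_kernel_fns))
        kM_cz = sum(b * k(Yc, z) for b, k in zip(beta, base_kernel_fns)).ravel()
        kM_zz = sum(b * k(z, z) for b, k in zip(beta, base_kernel_fns))[0, 0]
        xc = komp(KM_cc, kM_cz, Ac, T0)          # class-based sparse code of z
        # Eq. (24): r_c = K_M(z,z) - 2 x_c^T A_c^T K_M(Y_c,z) + x_c^T A_c^T K_M(Y_c,Y_c) A_c x_c
        Axc = Ac @ xc
        errors.append(kM_zz - 2.0 * Axc @ kM_cz + Axc @ KM_cc @ Axc)
    return int(np.argmin(errors)), errors
```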
4.2. Collective algorithm

In the collective setting, the dictionary is constructed as D = [D_1, …, D_C], where D_c = Φ(Y_c)A_c. The sparse representation of a test sample, z ∈ ℝ^n, computed by applying the KOMP algorithm to an optimization problem similar to (20), is x = [x_1, …, x_C] ∈ ℝ^K, where x_c contains the sparse coefficients associated with the c-th class. The reconstruction error for each class is then computed using (24). Finally, the test sample is classified to the class with the smallest reconstruction error. Note that the main difference between the distributive and collective algorithms in the test phase is in the sparse coding process.

5. Experimental results

In this section, we present our experimental results demonstrating the effectiveness of the proposed multiple kernel based dictionary learning approach. The evaluation is performed within the framework of a classification task using the Extended Yale B face database [19], the USPS digit dataset [20] and the LUNGML dataset [21].
5.1. Experimental setup

We compare our proposed algorithm in the distributive and collective modes to some benchmark algorithms. The sparse representation based classifier (SRC) is used as the baseline algorithm. We applied the KSRC algorithm using a variety of kernel functions; the best obtained results are presented as another benchmark. As a basic multiple kernel SRC algorithm, we also present the KSRC results when the average of the adopted basis kernels is considered as the final kernel. We compare the proposed algorithm to some state of the art dictionary learning approaches too. The kernel K-SVD (KK-SVD) is one of these algorithms. Here again, the best results of the KK-SVD method over different kernels are reported. The KK-SVD results using the average kernel are also reported as a simple multiple kernel method in which the kernel weights are assumed to have a uniform distribution. We also report the results of the SRC algorithm when the number of dictionary atoms is reduced to the number of atoms of the learned dictionary, K; in this case, K dictionary atoms are selected randomly. Also, for a better comparison, the results of the proposed algorithms when the number of dictionary atoms is set to the number of training samples (as in SRC) are reported. It means that in this case the learning process is performed but the number of dictionary atoms is not reduced to K. Thus, overall, we compare the performance of the following algorithms (the short names inside the brackets will be used subsequently):

• The Sparse Representation based Classification algorithm (SRC).
• The best results obtained from the KSRC using different kernel functions (KSRC(Best)).
• The KSRC algorithm when the kernel is the average of the basis kernels (KSRC(Mean)).
• The best result of the kernel K-SVD algorithm using different kernel functions (KK-SVD(Best)).
• The KK-SVD algorithm when the average kernel is used (KK-SVD(Mean)).
• The proposed multiple kernel-based dictionary learning algorithm (MKDL).
• The SRC algorithm when the number of dictionary atoms is reduced to K (SRC††).
• The proposed multiple kernel-based dictionary learning algorithm when the number of dictionary atoms is not reduced (MKDL†).

We use a total of 31 basis kernels for learning the optimal kernel: the linear kernel, 10 Gaussian kernels with the parameter σ varying from 0.5 to 5 in steps of 0.5, 10 polynomial kernels of degree d = 2 with the constant coefficient c varying from 0.5 to 5 in steps of 0.5, and 10 polynomial kernels of degree d = 3 with the constant coefficient c varying from 0.5 to 5 in steps of 0.5; a sketch of this kernel pool is given below. For a fair comparison, the experiments are performed in both the distributive and collective configurations while the dictionary matrices are determined using the above mentioned approaches. It means that in the distributive setting the sparse representation of a test sample is computed for each class separately using the associated dictionary, whereas in the collective setting the sparse coefficients are computed using a dictionary obtained by concatenating the class based dictionaries.
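The sketch below builds the 31-kernel pool described above as a list of Gram-matrix functions. The exact polynomial convention (u^T v + c)^d is an assumption; the paper does not spell out the scaling of its polynomial kernels.

```python
import numpy as np

def make_basis_kernels():
    """The 31 base kernel functions of Section 5.1 (our reading): 1 linear,
    10 Gaussian (sigma = 0.5..5), 10 polynomial d=2 and 10 polynomial d=3
    (constant c = 0.5..5). Each entry maps two sample matrices (columns = samples)
    to a Gram matrix."""
    kernels = [lambda U, V: U.T @ V]                      # linear kernel
    for sigma in np.arange(0.5, 5.01, 0.5):               # 10 Gaussian kernels
        def rbf(U, V, s=sigma):
            d2 = np.sum(U**2, 0)[:, None] + np.sum(V**2, 0)[None, :] - 2 * U.T @ V
            return np.exp(-d2 / (2 * s**2))
        kernels.append(rbf)
    for d in (2, 3):                                      # 10 + 10 polynomial kernels
        for c in np.arange(0.5, 5.01, 0.5):
            kernels.append(lambda U, V, c=c, d=d: (U.T @ V + c) ** d)
    return kernels

kernels = make_basis_kernels()
assert len(kernels) == 31
```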
5.2. Results for the Extended Yale B database

The Extended Yale B database contains 2414 frontal face images of 38 individuals taken under varying illumination conditions [22]. We used the cropped and normalized face images of 192×168 pixels. Fig. 2 shows examples of the cropped face images. We randomly choose 30 images per person as the training samples and use the rest as the test samples. Based on the Random face method proposed in [23], each face image, f_i ∈ ℝ^{192×168}, is projected onto a lower dimensional vector, y_i ∈ ℝ^{504}.

Fig. 2. Examples of the Extended Yale B face images.
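The following is a small sketch of the Random face projection used above: each vectorized face image is mapped to a 504-dimensional feature vector by a random matrix. The Gaussian draw and the row normalization are assumptions about the details of [23], not specifications from this paper.

```python
import numpy as np

def random_face_features(images, dim=504, seed=0):
    """Project vectorized face images (one per column, 192*168 = 32256 pixels)
    onto `dim` random directions, in the spirit of the Random face method [23]."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((dim, images.shape[0]))
    R /= np.linalg.norm(R, axis=1, keepdims=True)   # unit-norm projection rows (assumption)
    return R @ images                               # dim x n_images feature matrix
```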
5.3. Parameter setting

Fig. 3. Comparison of the Extended Yale B recognition accuracies for different sparsity levels T0 and numbers of dictionary atoms Kc.

In the proposed methods, there are three important parameters which have to be chosen appropriately: the sparsity level, T0, the number of dictionary atoms in the KK-SVD algorithm, Kc, and the number of basis kernels, M. In this section, the process of tuning these parameters and the sensitivity of the algorithms to them are analyzed. We randomly choose 10% of the training data as the evaluation set and the rest as the training set. We utilized the evaluation samples to set the parameters. Toward this goal, T0 was varied from 1 to 40 with a step size of 1 and Kc from 5 to 20 with a step size of 5. Fig. 3 shows the recognition rate of the proposed methods for the Extended Yale B database when different parameter values are used. As expected, the recognition rate improves considerably when the dictionary size and the sparsity level are initially increased. The accuracy of the methods then remains at almost a constant level as the sparsity level is further increased. Based on these experiments, the sparsity level was set to 15 and 20 for the distributive and collective methods respectively. We also chose {Kc}_{c=1}^C = 15 for both settings.

We also performed a set of experiments to investigate the effect of the number of basis kernels on the proposed methods. We considered the 31 kernels mentioned above as the basis kernels and implemented the proposed algorithm with different numbers of kernels. Fig. 4 demonstrates the results. It can be seen that by increasing the number of kernels, the performance of both algorithms, especially the distributive one, is improved. This is likely due to adding more information sources to the associated classifier. In the case of the distributive algorithm, the input data is represented by each class's data separately; therefore, the presence of more information sources is more important in order to represent the data accurately.

Fig. 4. Comparison of the Extended Yale B recognition accuracies for different numbers of kernels.

Obviously, the computational complexity of the training phase of the algorithms increases with the number of kernels. We also measured the computational time of the training phase for different numbers of kernels. Fig. 5 shows the computational time of the proposed algorithms in the distributive and collective settings. As expected, the computational time of both algorithms increases when the number of kernels is increased. We set the maximum number of iterations in Algorithm 1 to 40 and use it as the stopping criterion.

Fig. 5. The computational time of the proposed algorithms for different numbers of kernels.
Table 3. Recognition rate of different algorithms (in %) for the Extended Yale B database.

              SRC      KSRC(Best)  KSRC(Mean)  KK-SVD(Best)  KK-SVD(Mean)  MKDL     MKDL†    SRC††
Distributive  98.6656  64.0502     63.8932     97.5981       97.3210       97.8807  98.8226  94.7410
Collective    98.0377  98.2732     98.5341     98.9985       98.7598       99.7645  99.8321  92.8571
Table 3 summarizes the recognition rates of the different algorithms in both the distributive and collective settings. As these results show, in the distributive case the basic SRC algorithm highly outperforms the KSRC ones. In the distributive mode of the KSRC algorithm, each class dictionary matrix is formed by applying the related kernel function to the class training samples, D_c = K(Y_c, Y_c) ∈ ℝ^{N_c×N_c}. A kernel function measures the similarity of pairs of data samples, which usually takes higher values for data samples from the same class than for those of different classes. In the distributive setting, the dictionary matrices contain only within-class similarity information. Hence, the class-based dictionary matrices contain almost similar values. This causes the sparse representations of an input sample on different classes, {x_c}_{c=1}^C, to become similar to each other. In such a condition, the classifier is usually biased toward selecting the class which involves more training examples. It has to be also noted that the n × N_c dictionary matrix of the SRC is replaced by an N_c × N_c matrix in the KSRC algorithm. Since N_c is usually much smaller than n, the latter is computationally faster. However, in the sparse representation problem, y = Dx where D ∈ ℝ^{m1×m2}. Now, if m1 = m2 there could be a unique solution to the sparse representation problem, but the obtained x is not necessarily sparse. Therefore, the reconstruction errors for the obtained x on different classes may become approximately similar to each other and the classifier is confused in adopting the correct class label.

The results in Table 3 show that the KK-SVD algorithms (in the different cases) lead to better results compared to the KSRC algorithms. This improvement is due to two reasons: first, decreasing the number of dictionary atoms from N_c in KSRC to K_c, i.e. D_c ∈ ℝ^{N_c×K_c}, which eliminates the above mentioned problem of the KSRC; second, the K_c dictionary atoms are learned to be optimal through an optimization problem that reduces the reconstruction error. However, the problem of choosing an optimal kernel still remains. The accuracy of the MKDL algorithm is 97.5667%, which is approximately the same as that of the KK-SVD(Best) algorithm. It shows that the proposed algorithm converges to the optimal case. Moreover, there is no need to search for the best kernel thanks to the use of multiple kernels. The SRC algorithm leads to slightly better results compared to the MKDL algorithm but it is computationally more expensive in the classification phase. As mentioned before, it has to be also noted that in kernel based methods the original feature vectors are replaced by the similarities between the feature vectors; via this transformation some information may be lost. In the MKDL† algorithm, the learning process is performed again but the number of dictionary atoms remains equal to the number of training samples. It can be seen that MKDL† outperforms the SRC with the same number of dictionary atoms. In the SRC†† algorithm, the number of dictionary atoms is reduced to K_c, i.e. the same as in the MKDL and KK-SVD algorithms. In fact, we randomly select K_c training samples in order to construct the corresponding dictionary. As expected, in such a condition the performance of the SRC algorithm is degraded. The results of the MKDL† and SRC†† experiments show that the proposed method outperforms the SRC under the same conditions.

In the collective setting (the second row of Table 3), KSRC(Best) and KSRC(Mean) outperform the SRC algorithm. Compared to the distributive case, much better results are obtained using the KSRC approaches. This is due to the fact that in the collective setting the dictionary is constructed as D = K(Y, Y), where Y refers to all the training samples. It means that the training samples of different classes contribute to every atom of the dictionary, so each dictionary atom contains both within-class similarity and between-class dissimilarity information. This removes the previously explained problem of the KSRC in the distributive case. The reconstruction errors for different classes are not necessarily similar and the classifier can work properly. As expected, the KK-SVD and MKDL algorithms lead to better results. Another interesting observation is that the SRC algorithm in the distributive setting leads to slightly better results in comparison to the collective one. An explanation for this observation is that here a large number of training samples per class is available (30 images for each person), so the main characteristics of each class are well described by the associated dictionary matrix. On the other hand, frontal faces of different persons have some similarities which can create undesirable effects on the collective representation of the input image. However, due to the learning process, the KK-SVD and MKDL algorithms have better performance in the collective mode. Moreover, the result of MKDL in the collective setting (99.76%) is better than the other reported results for this dataset. As far as we know, the best previously reported result for this data set is a recognition rate of 99.47% [16]. The authors in [16] improved the SRC algorithm using low-rank representation and eigenface extraction techniques; these techniques are based on Robust Principal Component Analysis and Singular Value Decomposition approaches. The superiority of our proposed method is mainly due to the effectiveness of the non-linear mapping of the learned multiple kernel and of the learned dictionary matrix.

As mentioned before, in these experiments the feature vectors are produced using the Random face method. The dimensionality of the feature vectors is an important factor which affects the classification performance. In order to investigate the effect of this factor, we applied random matrices of different sizes to each face image, f_i ∈ ℝ^{192×168}, for extracting the related feature vector. Fig. 6 contains the associated results (since the distributive KSRC method has very poor results, we have not shown them in the figure). In this figure, "D" and "C" refer to the Distributive and Collective settings respectively. It can be seen that the proposed MKDL-C always leads to the best results in terms of classification accuracy.
5.3.1. Convergence of the proposed methods

The proposed methods iterate until the adopted cost criterion (J) no longer decreases significantly or the maximum number of iterations is reached. Fig. 7 shows the cost value versus the iteration number of the distributive and collective methods for the Extended Yale B database. Although there is no theoretical guarantee that the cost converges to a global minimum, Fig. 7 shows that the cost value monotonically decreases and the algorithms reach their convergence point after a few iterations.
Fig. 6. The classification accuracies for different dimensions of input images.
Fig. 7. Convergence of the cost for the Extended Yale B database.
5.4. The USPS database results

The USPS database contains 10 digit classes (0–9) of 256-dimensional handwritten digits with a total of 9298 samples. For each class, we randomly select 500 samples as the training set and 200 samples as the test set. In the dictionary learning process, we chose the following values for the related parameters: T0 = 10, {Kc}_{c=1}^C = 300 and JMax = 40, where JMax is the maximum number of training iterations (the stopping criterion). Fig. 8 contains examples of the USPS images.

Fig. 8. Sample images of the USPS database.

Table 4 contains the results of the different algorithms. In this table, the classification accuracy for both the distributive and collective settings is presented. Similar to the previous results, the KSRC algorithms lead to very poor results compared to the SRC algorithm in the distributive setting. However, learning the dictionary atoms, as in the KK-SVD and MKDL algorithms, improves the classification accuracy (by about 2% compared to the SRC). An interesting point is that, in contrast to the face recognition experiments, the distributive results here are comparable to or slightly better than the collective ones. This can be explained by the fact that in the case of the USPS experiments the number of training examples is sufficiently large, so the distributive dictionaries contain the required information to produce an accurate sparse representation for a test sample. Moreover, the best recognition rate previously reported for the USPS dataset was obtained with the kernel KSVD algorithm; the results confirm that the proposed MKDL algorithm outperforms the kernel KSVD (KK-SVD) algorithm. We also investigated the robustness of the methods with respect to the missing pixels problem. For this purpose, a fraction of the image pixels are randomly selected and their values are replaced with zero. Fig. 9 demonstrates the classification accuracy of the different methods for different fractions of missing pixels. It can be seen that the proposed method always outperforms the others.

Fig. 9. The classification accuracies for different fractions of missing pixels.
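The missing-pixel corruption just described is straightforward to reproduce; a minimal sketch is given below, with the random seed and in-place conventions as our own choices.

```python
import numpy as np

def corrupt_missing_pixels(images, fraction, seed=0):
    """Zero out a random `fraction` of the pixels of each column (vectorized image),
    as in the robustness experiment reported in Fig. 9."""
    rng = np.random.default_rng(seed)
    corrupted = images.copy()
    n_pixels = images.shape[0]
    n_drop = int(round(fraction * n_pixels))
    for j in range(images.shape[1]):
        idx = rng.choice(n_pixels, size=n_drop, replace=False)
        corrupted[idx, j] = 0.0
    return corrupted
```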
Table 4. Recognition rate of different algorithms (in %) for the USPS handwritten digit database.

              SRC      KSRC(Best)  KSRC(Mean)  KK-SVD(Best)  KK-SVD(Mean)  MKDL     MKDL†    SRC††
Distributive  96.2800  43.1600     30.0400     98.5000       98.2600       98.9800  98.9600  94.4000
Collective    95.7600  96.6800     96.0800     97.0400       96.6800       98.2200  98.4800  94.2400
5.5. Results for the LUNG database

In order to evaluate the effectiveness of the proposed methods on a high dimensional, small sample-size database, we applied the adopted multiple kernel-based dictionary learning approaches to a bioinformatics dataset. The LUNG dataset contains in total 203 samples in 5 classes, which have 139, 20, 21, 17 and 6 samples respectively. Each sample has 12600 genes, of which a subset of the 3312 most variable genes over the five classes has been identified, i.e. y_i ∈ ℝ^{3312}. We use the 10-fold cross-validation technique to split the data samples into training and test sets, and the average result over the relevant experiments is reported. In this database, the distribution of samples over the different classes is imbalanced, and choosing a fixed value for the number of dictionary atoms, K_c, seems to be unfair. Hence, we choose 100, 12, 12, 12 and 3 as the numbers of dictionary atoms for the associated classes respectively. Moreover, the maximum sparsity level is set to T0 = 10 and the maximum number of training iterations is set to 40.
Table 5. Recognition rate of different algorithms (in %) for the LUNG database.

              SRC      KSRC(Best)  KSRC(Mean)  KK-SVD(Best)  KK-SVD(Mean)  MKDL     MKDL†    SRC††
Distributive  91.9048  68.3333     22.6190     94.2857       95.1743       96.1905  96.6667  92.2584
Collective    95.2381  94.2857     94.7619     94.2857       94.7619       96.6667  96.6667  93.3333
The classification results of both the distributive and collective settings are summarized in Table 5. These results are consistent with the earlier reported results and confirm the superiority of the proposed schemes. The recognition rates of the proposed algorithms are about 96%, while, to the best of our knowledge, the best previously reported result for this database is about 93% [24]. In [24], the best four features (genes) have been selected and the K-NN and SVM classifiers have been used.

6. Conclusion

In this paper, we proposed a multiple kernel-based dictionary learning approach. The learned dictionary is used within the framework of a classification task, where we defined two different classification setups, the distributive and collective settings. We introduced the proposed approach in both the distributive and collective settings and formulated the associated optimization problems. It was shown that, in both cases, the optimization problem leads to an analytical solution. Our experimental results show the superiority of the proposed approach in terms of classification accuracy. Moreover, compared to the SRC method, the proposed method reduces the computational burden in the classification phase by decreasing the size of the associated dictionary. In future work, in order to improve the classification power, we intend to impose some discriminative constraints on the kernel weights in the learning process.
References

[1] J. Wright, A.Y. Yang, A. Ganesh, et al., Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[2] C. Hong, J. Zhu, Hypergraph-based multi-example ranking with sparse representation for transductive learning image retrieval, Neurocomputing 101 (2013) 94–103.
[3] J. Yu, Y. Rui, D. Tao, Click prediction for web image reranking using multimodal sparse coding, IEEE Trans. Image Process. 23 (5) (2014) 2019–2032.
[4] C. Hong, J. Yu, D. Tao, M. Wang, Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval, IEEE Trans. Ind. Electron. 62 (6) (2015) 3742–3751.
[5] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: CVPR, 2010, pp. 3360–3367.
[6] P. Li, Y. Liu, G. Liu, M. Guo, Z. Pan, A robust local sparse coding method for image classification with histogram intersection kernel, Neurocomputing 182 (2016) 36–42.
[7] R. Rubinstein, A. Bruckstein, M. Elad, Dictionaries for sparse representation modeling, Proc. IEEE 98 (6) (2010) 1045–1057.
[8] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (6) (2010) 1031–1044.
[9] M. Elad, M. Figueiredo, Y. Ma, On the role of sparse and redundant representations in image processing, Proc. IEEE 98 (6) (2010) 972–982.
[10] M. Aharon, M. Elad, A.M. Bruckstein, The K-SVD: an algorithm for designing of overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322.
[11] W. Zhao, Z. Liu, Z. Guan, B. Lin, D. Cai, Orthogonal projective sparse coding for image representation, Neurocomputing 173 (2016) 270–277.
[12] V.H. Nguyen, V.M. Patel, N.M. Nasrabadi, R. Chellappa, Design of non-linear kernel dictionaries for object recognition, IEEE Trans. Image Process. 22 (12) (2013) 5123–5135.
[13] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.
[14] H. Zheng, F. Liu, Z. Jin, Multiple kernel sparse representation based classification, in: Chinese Conference on Pattern Recognition, 2012, pp. 48–55.
[15] A. Shrivastava, V.M. Patel, R. Chellappa, Multiple kernel learning for sparse representation-based classification, IEEE Trans. Image Process. 23 (7) (2014) 3013–3024.
[16] J.J. Thiagarajan, K.N. Ramamurthy, A. Spanias, Multiple kernel sparse representations for supervised and unsupervised learning, IEEE Trans. Image Process. 23 (7) (2014) 2905–2915.
[17] A. Shrivastava, J.K. Pillai, V.M. Patel, Multiple kernel-based dictionary learning for weakly supervised classification, Pattern Recognit. 48 (8) (2015) 2667–2675.
[18] K. Engan, S.O. Aase, J.H. Husoy, Method of optimal directions for frame design, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, 1999, pp. 2443–2446.
[19] A.S. Georghiades, P.N. Belhumeur, D. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 643–660.
[20] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. (1994) 550–554.
[21] K. Yang, Z.P. Cai, J.Z. Li, G.H. Lin, A stable gene selection in microarray data analysis, BMC Bioinformatics (2006).
[22] J. Yin, X. Liu, Z. Jin, W. Yang, Kernel sparse representation based classification, Neurocomputing 77 (2012) 120–128.
[23] S. Gao, I.W. Tsang, L.-T. Chia, Sparse representation with kernels, IEEE Trans. Image Process. 22 (2) (2013) 423–434.
[24] D. Wang, F. Nie, H. Huang, Feature selection via global redundancy minimization, IEEE Trans. Knowl. Data Eng. 27 (10) (2015) 2743–2755.

Tahereh Zare received the B.Sc. and M.Sc. degrees in electrical engineering and electronics from Yazd University, Yazd, Iran, in 2006 and 2009 respectively. She is currently a Ph.D. candidate at Yazd University. Her research interests include machine learning and computer vision.
Mohammad Taghi Sadeghi received a B.Sc. in electrical engineering and electronics from Sharif University in Tehran, Iran, in 1991. In 1995 he completed an M.Sc. in electrical engineering and communications at Tarbiat Modarres University, Iran. In 2003, he completed a Ph.D. in machine vision at the Centre for Vision, Speech and Signal Processing (CVSSP) within the Department of Electronics and Electrical Engineering of the University of Surrey in the UK. He worked at the CVSSP as a research fellow for two years and joined the Department of Electrical Engineering of Yazd University in 2005. His current research interests include pattern recognition, image processing and computer vision.