Noise learning based discriminative dictionary learning algorithm for image classification


Journal Pre-proof

Noise Learning based Discriminative Dictionary Learning Algorithm for Image Classification
Tian Zhou, Yunyi Li, Guan Gui

PII: S0016-0032(20)30022-3
DOI: https://doi.org/10.1016/j.jfranklin.2020.01.007
Reference: FI 4369

To appear in: Journal of the Franklin Institute

Received date: 15 July 2019
Revised date: 23 November 2019
Accepted date: 9 January 2020

Please cite this article as: Tian Zhou, Yunyi Li, Guan Gui, Noise Learning based Discriminative Dictionary Learning Algorithm for Image Classification, Journal of the Franklin Institute (2020), doi: https://doi.org/10.1016/j.jfranklin.2020.01.007

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier Ltd on behalf of The Franklin Institute.

Noise Learning based Discriminative Dictionary Learning Algorithm for Image Classification

Tian Zhou, Yunyi Li, and Guan Gui
College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
Email: [email protected]

Abstract: Dictionary learning is an efficient and effective method to preserve label information in supervised learning for image classification. However, different noises in the image samples may cause unstable residuals during the training stage, which leads to an inaccurate dictionary and an inefficient utilization of label information. To fully exploit the supervised information for learning a discriminative dictionary, we propose an effective dictionary learning algorithm for designing a structured dictionary in which each atom is related to a corresponding label. The proposed algorithm is implemented via the alternating direction method of multipliers (ADMM) based on noise learning, where the noise is composed of interference signals and reconstruction residuals. In the training stage, we first adopt a cross-label suppression method to enlarge the difference among the representations of different labels. Meanwhile, a Laplacian-matrix operator from spectral clustering, named N-cut, is utilized to reduce the difference among representations of the same label. In the testing stage, to take full advantage of the learnt dictionary, two efficient classifiers based on global coding and local coding are adopted in the denoising step. Experiments are conducted on different datasets covering face recognition, scene classification, object categorization, and dynamic texture categorization. Simulation results confirm the effectiveness of our proposed method in terms of both classification performance and computational efficiency.

Keywords: cross-label suppression; N-cut spectral clustering; denoising classifier; image classification; supervised dictionary learning.

I. Introduction

Dictionary learning has received intense attention and has been applied successfully in many areas, such as image processing [1][2], clustering [3] and classification [4][5], owing to its superior performance. According to sparse representation (SR) theory [6], various signals, including images, videos, and audio, can be transformed into a subspace [7] of much lower dimension. That is, an original column signal can be compressed into a linear combination of a few representative elements, which are also arranged as columns. It is also believed that any given signal/image can be well represented by a set of discriminative patterns, where the whole set of patterns is named the dictionary and each column of the dictionary is called an atom [4][5]. Therefore, dictionary learning is regarded as an effective tool for signal reconstruction and classification. It is well documented that a dictionary learnt from a certain dataset, rather than a predefined dictionary such as wavelets or Fourier bases [1][8], can provide superior performance in different applications, including face recognition [9][10], scene classification [11] and object identification [12]. For the purpose of learning an effective and efficient dictionary, several algorithms have been proposed to attain a good reconstruction of physical signals and have obtained good results, such as the method of optimal directions (MOD) [13], least squares optimization [14][15][16], the classical K-SVD [1], and a structured dictionary model [17]. However, these unsupervised learning methods focus on the reconstruction of the physical signals rather than exploiting the label information of the training samples; thereby, they may have many limitations in classification tasks.

To achieve better classification performance, supervised dictionary learning has been considered. On the one hand, sparse representation based classification (SRC) [4] directly takes the training samples as the dictionary, and its face recognition performance exhibits robustness against noise; on the other hand, a classifier can be learnt jointly with the dictionary in the training process [18], where the hinge loss function [19], the logistic loss function [20], and a linear prediction cost [11] are adopted for training the classifier. To pursue a more accurate sparse representation, other supervised algorithms adopt the ℓ0-norm [11][18] or ℓ1-norm [9][21] as the regularization term to penalize sparsity. However, owing to the non-smoothness of the model, these methods are computationally complex and require abundant calculations to reach the optimal solution. Recently, a cross-label suppression with group regularization model (CLS-GR) [22] has been reported, which combines a cross-label suppression term with an ℓ2-norm representation penalty and achieves good performance in both classification and computational efficiency.
Additionally, some traditional mathematical algorithms such as Eigenface [23] and SIFT [24] are commonly used to extract features as a prior and to simplify the learning process through preprocessing before classification. Although these approaches have advanced recognition applications, they are devoted to improving the robustness of learning models so as to decrease the influence of noise on classification, rather than learning the noise from the dataset and processing it. In this paper, we propose an innovative learning model consisting of both dictionary learning and noise learning in the training stage, which learns a relatively small but refined dictionary while considering reconstruction residuals and discriminative atoms at the same time. Following the structured dictionary model [25][26] with label-particular and shared atoms, we adopt an ℓ2-norm representation penalty term as well as the cross-label suppression term [22] to obtain a smooth optimization model and enlarge the discrimination among different labels. Motivated by spectral clustering, we exploit the N-cut [27] as an optimization goal to smooth the differences among representations that share the same label. In the training stage, we utilize two decoupled components of our proposed model, inspired by the variable-splitting scheme of ADMM, to learn the representations and the noise respectively with analytical solutions. The optimization model of the dictionary itself is not convex, and the dictionary is learnt from the updated representations atom by atom via least squares. Finally, the learnt dictionary is sufficiently discriminative, and the analytical solutions greatly accelerate the learning process. In the testing stage, any original image signal is transformed into the corresponding subspace through the learnt dictionary. Then, two practical denoising classifiers, the Global Coding deNoising classifier (GCN) and the Local Coding deNoising classifier (LCN), are designed to predict labels and further improve classification performance. The contribution of our work contains three aspects: first, we separate the optimization problem into several sub-problems via ADMM, which largely decreases the complexity of the algorithm; second, we add a noise learning step on the image dataset, which aims at obtaining more representative codes and a more discriminative dictionary; third, we improve the classification schemes by utilizing the learnt noise to achieve higher classification accuracy.

The rest of this paper is organized as follows. Section II describes some related and basic methods. Section III introduces the proposed algorithm and its formulation in detail. Section IV presents experiments comparing the proposed method with former algorithms on public datasets of different applications. Section V concludes the paper.

II. Related Work

A. Standard ADMM Algorithm

ADMM [28] is a simple but powerful framework that can flexibly solve high-dimensional optimization problems. It utilizes a variable-splitting scheme to equivalently separate a coupled optimization model into several sub-problems by introducing auxiliary constraint variables, and the decoupled variables can then be solved efficiently in an alternating minimization manner. Introducing an auxiliary variable v ∈ ℝ^m, the problem can be written as

min_{x,v} (1/μ)‖v‖₁ + P(x)   subject to   Ax − y = v    (1)

where x ∈ ℝ^n is the primal variable and the function P(x) is the regularization term; y ∈ ℝ^m denotes the original physical signal of the image, and A ∈ ℝ^{m×n} is the measurement matrix or dictionary. The augmented Lagrangian of the problem can be derived as

L(v, x, w) = (1/μ)‖v‖₁ + P(x) − ⟨w, Ax − y − v⟩ + (ρ/2)‖Ax − y − v‖²₂    (2)

where w ∈ ℝ^m is the Lagrangian multiplier and ρ > 0 is the penalty parameter. ADMM then consists of the following three steps (sub-problems):

x^(k+1) = argmin_x ( P(x) + (ρ/2)‖Ax − y − v^(k) − w^(k)/ρ‖²₂ )    (3)

v^(k+1) = argmin_v ( (1/μ)‖v‖₁ + (ρ/2)‖Ax^(k+1) − y − v − w^(k)/ρ‖²₂ )    (4)

w^(k+1) = w^(k) − ρ( Ax^(k+1) − y − v^(k+1) )    (5)

The above three steps form the general framework for decoupled, penalized optimization problems with ℓ1-norm regularization.
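To make the three-step scheme concrete, the following minimal NumPy sketch applies Eqs. (3)-(5) with an assumed ridge regularizer P(x) = (β/2)‖x‖²₂, so that the x-step has a closed form and the v-step reduces to soft-thresholding; the function name and parameter defaults are illustrative, not from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def admm_l1_residual(A, y, mu=1.0, beta=1e-2, rho=1.0, n_iter=500):
    """ADMM for  min_x (1/mu)||v||_1 + (beta/2)||x||_2^2  s.t.  Ax - y = v,
    following steps (3)-(5); the ridge P(x) is an assumed example choice."""
    m, n = A.shape
    x, v, w = np.zeros(n), np.zeros(m), np.zeros(m)
    AtA = A.T @ A
    for _ in range(n_iter):
        # x-step, Eq. (3): ridge-regularized least squares (closed form)
        x = np.linalg.solve(rho * AtA + beta * np.eye(n),
                            rho * (A.T @ (y + v + w / rho)))
        # v-step, Eq. (4): soft-thresholding of the shifted residual
        v = soft_threshold(A @ x - y - w / rho, 1.0 / (mu * rho))
        # dual update, Eq. (5)
        w = w - rho * (A @ x - y - v)
    return x, v
```

The alternation drives the constraint residual Ax − y − v toward zero, so at convergence v plays the role of the learnt sparse noise.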

B. Cross-label Suppression

To make the learnt dictionary more discriminative and compact, the supervised information of the whole training set can be fully exploited by the cross-label suppression operator [22]; as a result, each atom in the dictionary is associated with a specific label. By the multiplication rule of matrices, the rows of the representation also correspond to specific labels, as shown in Figure 1.

Figure 1. Structured representation corresponding to an image sample with label c.

Therefore, the cross-label suppression approach is a term devoted to minimizing the values of the representation rows whose labels differ from that of the corresponding training sample. Since the shared atoms are not suppressed, the cross-label suppression operator P^c ∈ ℝ^{(N−n−s)×N} is defined as

P^c(m, n) = { 1, if m = n and m ∉ (L^0 ∪ L^c);  0, elsewhere }    (6)

where N is the total number of atoms of the dictionary, n is the number of atoms belonging to each label, and s is the number of shared atoms; given an image dataset of C classes, N = nC + s. Here c denotes the label of the training sample, L^c denotes the columns in P^c that belong to the c-th label, and L^0 represents the label related to the shared atoms. The cross-label suppression operator P^c is thus a structured matrix consisting of zeros in the areas of L^c and L^0 and an identity matrix elsewhere; therefore, the coefficients belonging to neither L^c nor L^0 are extracted and suppressed by the minimization model.
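As a sketch, the operator of Eq. (6) can be built as a row-selection matrix. The atom layout (n atoms for each of the C classes followed by s shared atoms) and the helper name are our illustrative assumptions:

```python
import numpy as np

def cross_label_suppressor(c, C, n, s):
    """Build P^c of Eq. (6): selects the rows of X^c whose atoms belong to
    neither label c nor the shared group, so ||P^c X^c||_F can be penalized.
    Assumed layout: n atoms per class for classes 1..C, then s shared atoms."""
    N = n * C + s
    keep = np.ones(N, dtype=bool)
    keep[(c - 1) * n : c * n] = False   # rows of label c (L^c): not suppressed
    keep[n * C :] = False               # shared rows (L^0): not suppressed
    # identity restricted to the suppressed rows: shape (N - n - s, N)
    return np.eye(N)[keep]
```

Multiplying P^c by a code vector keeps exactly the cross-label coefficients, which the model then drives toward zero.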

C. N-Cut Spectral Clustering

Given an N-vertex undirected graph G with vertex set V and edge set E, we define ω_ij as the weight between vertices v_i and v_j, with ω_ij = ω_ji. The Laplacian matrix of the graph is defined as

L = M − W    (7)

where W and M are the adjacency matrix and the degree matrix of the graph, respectively. Spectral clustering defines the weight of the cut between two sub-graphs G₁ and G₂ as

Ω(G₁, G₂) = Σ_{i∈G₁, j∈G₂} ω_ij    (8)

According to this definition, the cut over all sub-graphs can be defined as

cut(G₁, G₂, ⋯, G_k) = (1/2) Σ_{i=1}^{k} Ω(G_i, Ḡ_i)    (9)

where Ḡ_i denotes the complement of G_i. Minimizing cut(G₁, G₂, ⋯, G_k) alone may result in an unbalanced segmentation of the original graph; that is, the model should also maximize the number of points belonging to each sub-graph, as illustrated in Figure 2.

Figure 2. Diagram of representations in different sub-graphs by spectral clustering.

Therefore, a new criterion named Ncut is proposed and formulated as

Ncut(G₁, G₂, ⋯, G_k) = Σ_{i=1}^{k} Ω(G_i, Ḡ_i) / vol(G_i)    (10)

where vol(G_i) = Σ_{j∈G_i} tr(M_j) and M_i denotes the degree matrix of the i-th sub-graph. An indicator variable h is introduced as

h_ij = { 1/√vol(G_j), if v_i ∈ G_j;  0, if v_i ∉ G_j }    (11)

thus, the optimization can be rewritten as

Ncut(G₁, G₂, ⋯, G_k) = Σ_{i=1}^{k} Ω(G_i, Ḡ_i) / vol(G_i)
  = (1/2) Σ_{i=1}^{k} ( Ω(G_i, Ḡ_i)/vol(G_i) + Ω(Ḡ_i, G_i)/vol(G_i) )
  = (1/2) Σ_{i=1}^{k} ( Σ_{m∈G_i, n∉G_i} ω_mn / vol(G_i) + Σ_{m∉G_i, n∈G_i} ω_mn / vol(G_i) )
  = (1/2) Σ_{i=1}^{k} Σ_m Σ_n ω_mn ( h_mi − h_ni )²
  = Σ_{i=1}^{k} ( Hᵀ L H )_ii
  = tr( Hᵀ L H )    (12)

Since the indicator matrix H is not a standard orthogonal basis, we set H = M^{−1/2} F and L̄ = M^{−1/2} L M^{−1/2}; therefore, Var(f) = fᵀ L̄ f. This variation measures the smoothness of the sub-graph signal f among neighboring vertices: the smaller the variation, the smoother the map.
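A small NumPy sketch of the normalized Laplacian L̄ = M^{−1/2}(M − W)M^{−1/2} of Eq. (7) and of the smoothness measure Var(f) = fᵀL̄f; the function names are ours for illustration, and an isolated vertex (zero degree) is assumed not to occur:

```python
import numpy as np

def normalized_laplacian(W):
    """L_bar = M^{-1/2} (M - W) M^{-1/2} for a symmetric adjacency matrix W."""
    d = W.sum(axis=1)                      # vertex degrees (diagonal of M)
    M_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - W                     # unnormalized Laplacian, Eq. (7)
    return M_inv_sqrt @ L @ M_inv_sqrt

def smoothness(f, W):
    """Var(f) = f^T L_bar f: small when f varies little across heavy edges."""
    return float(f @ normalized_laplacian(W) @ f)
```

For example, the vector f = M^{1/2}·1 is perfectly smooth (zero variation), while any vector that differs across connected vertices has positive variation.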

III. Our Proposed Method

For the purpose of learning a more discriminative and efficient dictionary, the joint optimization of the dictionary and the representation is a crucial step, and the noise of the training images is also learnt by the proposed model. The noise is learnt from the dataset during the optimization procedure itself, rather than by a traditional mathematical method.

A. Proposed dictionary model with denoising learning

To achieve good classification performance, both dictionary learning and noise learning on the training dataset are considered simultaneously. The mathematical optimization model thus comprises a general residual term, a regularization term, a cross-label suppression term and a spectral clustering term; in addition, the solution space is constrained with consideration of the noise. Given an image dataset of C classes, the optimization is proposed as

min_{D,X} Σ_{c=1}^{C} ( ‖Y^c − D X^c‖²_F + β Σ_{j=1}^{N_c} ‖x_j^c‖²₂ + λ ‖P^c X^c‖²_F + γ tr( X^c L̄^c (X^c)ᵀ ) )

s.t.  D X^c − Y^c = V^c,  ‖d_k‖₂ = 1    (13)

where Y^c ∈ ℝ^{m×p} denotes the training samples in the dataset, the superscript c marks the samples belonging to the c-th class, D ∈ ℝ^{m×N} with N = nC + s is the dictionary learnt from the training dataset, d_k is the k-th column of D, and X^c ∈ ℝ^{N×p} is the structured representation matrix. Furthermore, P^c ∈ ℝ^{(N−n−s)×N} is the cross-label suppression operator, and L̄^c = M^{−1/2} L M^{−1/2} ∈ ℝ^{p×p} is the Laplacian operator of Ncut. V^c ∈ ℝ^{m×p} denotes the noise, defined as the whole interference signal of the c-th class. Moreover, the corresponding Lagrangian of the optimization model is formulated as (14).

L(D, X, V) = Σ_{c=1}^{C} ( Σ_{j=1}^{N_c} ‖v_j^c‖²₂ + f(X^c) − Σ_{j=1}^{N_c} ⟨w_j^c, D x_j^c − y_j^c − v_j^c⟩ + ρ ‖D X^c − Y^c − V^c‖²_F )    (14)

where f(X^c) = β Σ_{j=1}^{N_c} ‖x_j^c‖²₂ + λ ‖P^c X^c‖²_F + γ tr( X^c L̄^c (X^c)ᵀ ). Although the above objective function of the group regularization dictionary model is not convex, we can still update the representations as well as the dictionary column by column, following the principle used in K-SVD [1]. According to the ADMM framework, the Lagrangian can be divided into three components, namely the solution of the representation X ∈ ℝ^{N×P}, the noise of the images V ∈ ℝ^{m×P}, and the Lagrange multiplier w ∈ ℝ^{m×P}, with P = pC. The three components are formulated respectively as (15)-(17):

X^{c(k+1)} = argmin_{X^c} ( f(X^c) + ρ ‖D X^c − Y^c − V^{c(k)} − w^{(k)}/ρ‖²_F )    (15)

V^{c(k+1)} = argmin_{V^c} ( Σ_{j=1}^{N_c} ‖v_j^c‖²₂ + ρ ‖D X^{c(k+1)} − Y^c − V^c − w^{(k)}/ρ‖²_F )    (16)

w^{(k+1)} = w^{(k)} − ρ ( D X^{c(k+1)} − Y^c − V^{c(k+1)} )    (17)

where f(X^c) denotes the group regularization, which consists of three terms: the ℓ2-norm representation penalty, cross-label suppression, and spectral clustering. Thanks to the cross-label suppression term, each sub-problem is convex and differentiable, which avoids solving an optimization with an ℓ1-norm term. Therefore, the proposed model with the ℓ2-norm admits closed-form solutions rather than abundant iterations, for the representation X, the noise V, and the Lagrange multiplier w. In addition, the dictionary D is learnt after the step of renewing the representation X, which constitutes a joint optimization. In summary, the least-squares solutions aim at achieving better results, in both accuracy and computational complexity, than alternatives that are inaccurate or require iteration. In detail, we renew the variables X and V label by label, and the solutions of the above three optimizations are formulated as (18), (22) and (23).

X^c = ( ρ DᵀD + λ (P^c)ᵀP^c + β I )^{−1} · [ ρ Dᵀ( Y^c + V^c + w^k/ρ ) − γ X^c L̄^c ]    (18)

After updating the whole representation X, the dictionary is renewed atom by atom, so as to fully utilize the parts of the dictionary that have already been updated, instead of updating the whole dictionary at once. Suppose i ∈ L^c and the other parts of D are fixed; to update the i-th atom d_i, we arrive at the optimization problem

min_{d_i} ‖ Y + V − Σ_{k∉L^c} d_k x̄^k − Σ_{k∈L^c, k≠i} d_k x̄^k − d_i x̄^i ‖²_F    (19)

where x̄^k denotes the estimated value of the k-th row of the whole code matrix X. Letting

Z = Y + V − Σ_{k∉L^c} d_k x̄^k − Σ_{k∈L^c, k≠i} d_k x̄^k    (20)

the solution of d_i can be obtained by least squares as

d_i = Z (x̄^i)ᵀ / ( x̄^i (x̄^i)ᵀ )    (21)

Given the unit energy of each atom, the updated atom is further normalized as

d_i = Z (x̄^i)ᵀ / ‖ Z (x̄^i)ᵀ ‖₂    (22)
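The atom update of Eqs. (19)-(22) can be sketched as follows; `R_target` stands for Y + V, and the helper name is our own illustrative choice:

```python
import numpy as np

def update_atom(D, X, R_target, i):
    """Least-squares update of atom d_i per Eqs. (19)-(22).
    R_target plays the role of Y + V; row k of X is the code x_bar^k of atom k."""
    # residual with atom i's contribution added back:
    # Z = (Y + V) - sum_{k != i} d_k x_bar^k     (Eq. (20))
    Z = R_target - D @ X + np.outer(D[:, i], X[i, :])
    # unnormalized least-squares atom  Z (x_bar^i)^T   (Eq. (21) numerator)
    d = Z @ X[i, :]
    nrm = np.linalg.norm(d)
    if nrm > 0:
        D[:, i] = d / nrm                        # unit-norm atom, Eq. (22)
    return D
```

Note that the normalization of Eq. (22) makes the scalar denominator of Eq. (21) irrelevant, so the code normalizes the numerator directly.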

Furthermore, if the parameter of the cross-label suppression term is very large, the reconstruction of the samples depends little on the label-particular atoms of other classes (excluding shared atoms), i.e., those whose indices are not in L^c. The above procedure can then replace Y and X with Y^c and X^c to accelerate the dictionary learning. Afterwards, the noise of the training samples can be learnt according to the Lagrangian of the proposed model; the update of the noise V^c is obtained by least squares as

V^c = ρ/(1+ρ) · ( D X^c − Y^c − w^k/ρ )    (23)

The learnt noise of the training samples is then processed by the practical denoising classifier. The flowchart of our proposed model is shown in Figure 3.
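A minimal sketch of the closed-form per-class updates: Eq. (18) for the codes (as a fixed-point pass, since the previous X^c appears on the right-hand side) and Eq. (23) for the noise. All function names are ours:

```python
import numpy as np

def update_codes(D, Y, V, W, P, Lbar, X, beta, lam, gamma, rho):
    """One fixed-point pass of Eq. (18) for the codes of a single class."""
    N = D.shape[1]
    lhs = rho * D.T @ D + lam * P.T @ P + beta * np.eye(N)
    rhs = rho * D.T @ (Y + V + W / rho) - gamma * X @ Lbar
    return np.linalg.solve(lhs, rhs)

def update_noise(D, X, Y, W, rho):
    """Closed-form noise update of Eq. (23)."""
    return rho / (1.0 + rho) * (D @ X - Y - W / rho)
```

Both steps are single linear-algebra operations, which is what makes the training loop fast compared with iterative ℓ1 solvers.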

Figure 3. The flowchart of our proposed algorithm for image classification.

B. Proposed denoising classifier

Different from traditional classifiers, we propose an innovative classifier with a denoising procedure before image classification, in which the learnt dictionary is exploited to represent a query image and predict its label. Thanks to the cross-label suppression term, after the denoising step the largest representation coefficients should be mainly concentrated on the related atoms. Two classification schemes, namely GCN and LCN, are proposed to improve the accuracy of image classification.

1) Global Coding deNoising classifier

Given a test image sample y with unknown label, as well as the learnt dictionary D, the representation of y without cross-label suppression can be obtained by least squares as

x̂ = ( DᵀD + βI )^{−1} Dᵀ y    (24)

According to the dictionary structure and the proposed dictionary learning model, if sample y belongs to the c-th label, the large coefficients should be mainly concentrated on the part of the representation associated with the c-th label. We can then obtain the following metric for visual recognition:

label(y) = argmin_c ‖ y + v^c − Σ_{k∈(L^c ∪ L^0)} d_k x̂_k ‖²₂ / Σ_{k∈(L^c ∪ L^0)} |x̂_k|    (25)

where v^c ∈ ℝ^m is the averaged value of the noise V^c, learnt from the images of each label.

2) Local Coding deNoising classifier

We combine the shared atoms of the learnt dictionary, D^0, with each label-particular part-dictionary D^c, and force the query sample y to be represented by each combined part-dictionary:

x̂^c = ( (D̂^c)ᵀ D̂^c + βI )^{−1} (D̂^c)ᵀ y    (26)

where D̂^c = [D^0, D^c], c = 1, 2, ⋯, C, denotes the combination of the atoms associated with L^c and the shared atoms of L^0. The label is then predicted as

label(y) = argmin_c ‖ y + v^c − D̂^c x̂^c ‖²₂    (27)

Therefore, the better of these two classification schemes can easily be selected for different datasets.
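A sketch of the GCN decision rule of Eqs. (24)-(25); the per-class averaged noise vectors v^c are assumed to be precomputed during training, and all names are illustrative:

```python
import numpy as np

def gcn_classify(y, D, labels, noise_means, beta, shared_idx=()):
    """Global Coding deNoising classifier, Eqs. (24)-(25).

    labels[k]      : class of label-particular atom k
    shared_idx     : indices of shared atoms (L^0)
    noise_means[c] : averaged per-class noise vector v^c from training
    """
    N = D.shape[1]
    # global ridge code of the query, Eq. (24)
    x = np.linalg.solve(D.T @ D + beta * np.eye(N), D.T @ y)
    best_label, best_score = None, np.inf
    for c, v_c in noise_means.items():
        idx = [k for k in range(N) if labels[k] == c or k in shared_idx]
        # class-wise denoised residual normalized by coefficient mass, Eq. (25)
        resid = y + v_c - D[:, idx] @ x[idx]
        score = resid @ resid / (np.abs(x[idx]).sum() + 1e-12)
        if score < best_score:
            best_label, best_score = c, score
    return best_label
```

The LCN variant of Eqs. (26)-(27) differs only in recoding y against each combined part-dictionary [D^0, D^c] instead of reusing one global code.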

C. Initialization of the dictionary

Empirically, different initializations of the dictionary result in different learning processes, which is essential for the classification results and the training time. We therefore initialize the dictionary D₀ label by label with the k-means method, in order to take full advantage of the relevant training samples. The initial representations of each class can be reconstructed through the corresponding part of the dictionary:

X₀^c = argmin_X ‖Y^c − D₀^c X‖²_F + β‖X‖²_F = [ (D₀^c)ᵀ D₀^c + βI ]^{−1} (D₀^c)ᵀ Y^c    (28)

and the residuals of each class are formulated as

E^c = Y^c − D₀^c X₀^c    (29)

Afterwards, the shared atoms of D₀ are obtained by applying the k-means method to the whole residual matrix E = [E¹, ⋯, E^C], which completes the initialized dictionary D₀. Using the whole training set Y = [Y¹, ⋯, Y^C], the initial representations X₀ for all classes are formulated as

X₀ = argmin_X ‖Y − D₀ X‖²_F + β‖X‖²_F = [ (D₀)ᵀ D₀ + βI ]^{−1} (D₀)ᵀ Y    (30)

IV. Experiment Results

In this section, we experimentally evaluate our proposed dictionary learning method for image classification on several publicly available datasets covering face recognition, scene classification and object categorization, and compare it with former works. The experiments are run in MATLAB 2018a on a PC with a 64-bit Windows 10 operating system and an i7-7700 CPU; the proposed approach is executed on an NVIDIA GTX1060 GPU with 8 GB memory to fairly evaluate the performance of each algorithm. Each public dataset is divided into two parts: the former is used for training and the latter for testing the different models and algorithms. Furthermore, we also examine the convergence of the different algorithms, where the change per update is quantified by

Increment(r) = ‖ψ_r − ψ_{r−1}‖ / ‖ψ_{r−1}‖    (31)

A. Face recognition

The database contains 165 gray-scale images of 15 individuals, with 11 images per class. Each class covers different lighting conditions and facial expressions, namely center-light, glasses, happy, left-light, no glasses, normal, right-light, sad, sleepy, surprised, and wink, as shown in Figure 4. Each image is represented as a 576-dimensional normalized vector. Some of the images are selected as training samples, and the remaining images are used to check the generalization performance of the classification scheme.

Figure 4. Samples of the Yale Face dataset.

In the experiments, the proposed dictionary learning model is compared with developed methods including GCC [22], SVM [29] and K-SVD [1]. To evaluate the different algorithms fairly, the size of the learnt dictionary is set to 65 atoms for both our proposed model and CLS-GR [22], i.e., 4 label-particular atoms for each class (person) and 5 shared atoms; furthermore, the parameters β, γ, λ and ρ are set to 8 × 10⁻⁴, 2 × 10⁻², 2 × 10⁴ and 6 × 10⁻¹, respectively. To acquire a stable recognition rate, each algorithm is run 100 times; the performances of the different schemes are shown in Figure 5 and Table I.

Figure 5. Performances of different classification schemes on the Yale Face Dataset. TABLE I. RESULTS OF EXPERIMENTS ON THE YALE FACE DATASET.

Method        Accuracy (%)    Training Time (s)
SVM [29]      94.67           -
K-SVD [1]     80.81           -
GCC [22]      95.60 ± 2.3     0.21
GCN (ours)    96.78 ± 1.4     0.54

Due to the variety of facial-expression details, and because the Yale face dataset is very small compared with the other datasets, the results are not very stable and show relatively large standard deviations across independent experiments. As demonstrated by the experiments, the proposed GCN obtains an accuracy about 1.2% higher than GCC [22] in the setting of 6 training samples per label. Moreover, GCN achieves the best classification performance, about 2.1% higher in accuracy than SVM [29] and about 16% higher than the classical dictionary learning method K-SVD [1]. The convergence of dictionary learning for our proposed approach and the previous work CLS-GR, measured using (31), is shown in Figure 6.

Figure 6. Comparison of convergence on the Yale Face Dataset.

Another face dataset, the Extended YaleB face database, contains 2414 frontal face images of 38 people, with 64 facial samples per person, as shown in Figure 7. In this experiment, 300-dimensional eigenface [23] features, instead of the original pictures cropped to 192 × 168 pixels, are adopted as preprocessing for better performance.

Figure 7. Samples of the Extended YaleB face dataset.

In order to evaluate the different algorithms fairly, the number of label-particular atoms per class is set to 15 with no shared atoms; the model hyperparameters β, γ, λ are fixed to 8 × 10⁻⁴, 2 × 10⁻², 2 × 10³, and the step size ρ = 0.5. Each setting is run 100 times to obtain steady results; the performances of the two classifiers, GCC [22] and our GCN, are shown in Figure 8 and Table II.

Figure 8. Performances of different classification schemes on the Extended YaleB Dataset. TABLE II. RESULTS OF EXPERIMENTS ON THE EXTENDED YALEB DATASET.

Method        Accuracy (%)    Training Time (s)
SVM [29]      96.68           -
K-SVD [1]     92.89           -
GCC [22]      98.44 ± 0.3     25.52
GCN (ours)    98.62 ± 0.3     9.51

Due to the large amount of detail in the face database, we can see that our proposed approach achieves great robustness by learning the noise and removing it. On the Extended YaleB database, our designed classifier GCN improves recognition accuracy by 0.18% over GCC [22], by 2% over SVM [29], and by 5.7% over K-SVD [1]. Furthermore, the previous training model CLS-GR [22] needs about 20 iterations to converge, whereas our proposed model needs only about 10. Therefore, our proposed training model largely decreases the training time, needing only about a quarter of that of CLS-GR [22]. The convergence of updating with iterations, measured using (31), is shown in Figure 9.

Figure 9. Comparison of convergence on the Extended YaleB Face Dataset.

B. Object identification

The Caltech101 dataset [30] is a challenging dataset for object recognition with a large number of classes: it contains images of different objects belonging to 101 labels, plus an extra class that does not belong to any label (open training). The dataset covers a variety of objects, including animals, faces, vehicles, flowers, insects and so on. Caltech101 was created by Fei-Fei Li and consists of 9144 images of about 300 × 200 pixels each. For a fair comparison, the 3000-dimensional SIFT-based features [24] used in LC-KSVD [9] are adopted in our experiments, where a four-level spatial pyramid and PCA are employed.

Figure 10. Samples of the Caltech101 object dataset.

In the independent experiments, for a fair comparison between the different algorithms, the number of atoms for each label (object) is set to 19, with 100 shared atoms. The parameters β, γ, λ and ρ are set to 1, 18, 2 × 10⁴ and 1, respectively. To obtain a steady recognition rate, each algorithm is run 100 times; the performances of the different algorithms are shown in Figure 11 and Table III.

Figure 11. Performances of different classification schemes on the Caltech101 Dataset.

Figure 12. Confusion matrix of our classification algorithm on the Caltech101 Dataset. TABLE III. RESULTS OF EXPERIMENTS ON THE CALTECH101 DATASET.

Method        Accuracy (%)    Training Time (s)
SVM [29]      71.03           -
K-SVD [1]     70.52           -
LCC [22]      75.35 ± 1.2     127.27
LCN (ours)    76.21 ± 0.9     82.04

According to the results of the different classification frameworks, our proposed dictionary learning model with the LCN classifier obtains the best performance, about 0.9% more accurate than CLS-GR with LCC [22], 5.2% more than SVM [29], and 5.7% more than K-SVD [1]. This dataset demonstrates that the noise in object image samples can degrade the quality of classification schemes to a certain extent. Figure 12 shows the per-label classification accuracy: some objects are classified correctly with high accuracy (90%-100%), while others reach only about 30% or even lower, which may be caused by the different clarity of pictures under different labels. It is worth mentioning that although the dataset is tough and large, the samples of each label are not very abundant, only about 40-200 per class, plus an extra class of about 800 samples that do not belong to any label. Since Caltech101 contains many labels while the training samples are limited, this experiment mainly shows the generalization of our proposed approach on a complicated recognition task.

C. Texture categorization on DynTex++

The DynTex++ dataset [31] is a challenging dataset comprising 36 categories of specific dynamic textures, i.e., videos or sequences of moving scenes exhibiting certain stationary properties, ranging from waves on a beach to cars on a road (samples in Figure 13). The 177-dimensional LBP-TOP histogram [32] extracted from each dynamic texture signal is adopted for categorization and comparison.

Figure 13. Samples of the DynTex++ texture dataset.

In these experiments, the independent experiment for each number of training samples is repeated 100 times. The dimension of the representation for each label is set to 30 with no shared atoms; β, γ, λ and ρ are set to 8 × 10⁻³, 1 × 10⁻², 2 × 10⁴ and 1, respectively. The performances of the different algorithms are shown in Figure 14 and Table IV.

Figure 14. Performances of different classification schemes on the DynTex++ dataset.

TABLE IV. RESULTS OF EXPERIMENTS ON THE DYNTEX++ DATASET.

Method          Accuracy (%)    Training Time (s)
SVM [29]        95.42           -
K-SVD [1]       94.88           -
LCC [22]        95.42±0.9       23.74
LCN (ours)      95.81±0.7       45.67

In the experiments on the DynTex++ dataset, our proposed model with the LCN classifier obtains the best performance: about 0.4% more accurate than CLS-GR with LCC [22] and SVM [29], and about 1% more than K-SVD [1]. However, our proposed dictionary learning method costs more training time than CLS-GR [22]; it achieves better performance at a higher time cost. The convergence of the updates on this dataset, measured by (31), is shown in Figure 15.

Figure 15. Comparison of convergence on the DynTex++ dataset.

D. Scene classification on Scene15

In this section, we consider the Scene15 dataset [33], which consists of different outdoor and indoor scene environments such as bedroom, tall building, library, grass, and so on. Each category has 200 to 400 images, 4485 in total, and the average image size is about 250 × 300 pixels. For a fair comparison, the 3000-dimensional SIFT-based features applied by LC-KSVD [11] are adopted in this experiment, as in the Caltech101 case.

Figure 16. Samples of the Scene15 scene dataset.

In order to evaluate the different algorithms fairly, the parameters 𝛽, 𝛾, 𝜆 and 𝜌 are set to 1, 10, 2 × 103 and 750, respectively. Each setting is run 100 times to obtain stable results; the performances of several classification schemes are shown in Figure 17 and Table V.

Figure 17. Performances of different classification schemes on the Scene15 Dataset.

Figure 18. Confusion matrix of our classification algorithm on the Scene15 Dataset.

TABLE V. RESULTS OF EXPERIMENTS ON THE SCENE15 DATASET.

Method          Accuracy (%)    Training Time (s)
SVM [29]        95.76           -
K-SVD [1]       87.12           -
CNN [34]        96.34           -
GCC [22]        98.22±1.2       30.38
GCN (ours)      98.48±0.9       12.59

In this experiment, the training samples are abundant enough to avoid overfitting; therefore, we add a result of a deep learning framework, namely a Convolutional Neural Network (CNN), for comparison. The CNN is trained in Python 3.6.5 on an NVIDIA GTX 1060, and its structure is shown in Figure 19.
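Since the exact layer configuration of Figure 19 is not reproduced in the text, the following only illustrates how spatial sizes propagate through a hypothetical conv/pool stack; the input resolution and kernel sizes below are assumptions for illustration:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial size after a convolution or pooling layer:
    floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# hypothetical stack on an illustrative 32x32 input
s = 32
s = conv_out(s, kernel=3, pad=1)      # 3x3 conv with 'same' padding
s = conv_out(s, kernel=2, stride=2)   # 2x2 max-pooling
s = conv_out(s, kernel=3, pad=1)      # 3x3 conv with 'same' padding
s = conv_out(s, kernel=2, stride=2)   # 2x2 max-pooling -> 8x8 feature maps
```

Feature maps of this final spatial size would then be flattened and fed to fully connected layers for the 15-way scene classification.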

Figure 19. Structure of the Convolutional Neural Network.

As illustrated in Figure 17 and Table V, our proposed dictionary learning method with the GCN classifier also obtains the highest classification accuracy on the Scene15 dataset, even better than the CNN given the relatively small number of training samples: in total about 0.25% more accurate than CLS-GR with GCC [22], 2.7% more than SVM [29], 2.1% more than CNN [34], and 11.3% more than K-SVD [1]. Figure 18 shows the classification accuracy of each label; the accuracy of each class is relatively similar and stable across the repeated experiments. The convergence comparison using (31) is shown in Figure 20.
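The per-class accuracies summarized by a confusion matrix such as Figure 18 are its row-normalized diagonal; a minimal sketch, using an illustrative matrix rather than real Scene15 counts:

```python
def per_class_accuracy(confusion):
    """Row-normalized diagonal of a confusion matrix, where
    confusion[i][j] counts samples of true class i predicted as class j."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]

# illustrative 3-class confusion matrix (not real Scene15 counts)
cm = [[18, 1, 1],
      [2, 16, 2],
      [0, 0, 20]]
acc = per_class_accuracy(cm)   # [0.9, 0.8, 1.0]
```

Classes whose rows place most mass on the diagonal, as in Figure 18, are the ones classified stably across the repeated runs.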

Figure 20. Comparison of convergence on the Scene15 dataset.

Another point worth mentioning is that the training time of the proposed dictionary learning model is less than 40% of that of CLS-GR, mostly due to the efficiency of the ADMM-based framework. Moreover, the cross-label suppression term, which yields closed-form solutions for the dictionary and the representation instead of requiring many iterations, also reduces the complexity of the algorithm to a large extent.
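The closed-form-per-step character of ADMM can be illustrated on a scalar toy problem; this is a generic sketch, not the paper's dictionary and representation update rules:

```python
def soft_threshold(v, t):
    """Closed-form proximal operator of t*|.| (scalar shrinkage)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def admm_l1(a, lam, rho=1.0, iters=50):
    """ADMM for the toy problem min 0.5*(x - a)**2 + lam*|z| s.t. x = z.
    Each subproblem is solved in closed form, the kind of per-iteration
    efficiency the ADMM framework above relies on."""
    x = z = u = 0.0
    for _ in range(iters):
        x = (a + rho * (z - u)) / (1.0 + rho)  # closed-form x-update
        z = soft_threshold(x + u, lam / rho)   # closed-form z-update
        u = u + x - z                          # dual variable ascent
    return z
```

The known minimizer of this toy problem is soft_threshold(a, lam), so admm_l1(3.0, 1.0) converges to 2.0; the matrix-scale updates in the proposed method follow the same alternating pattern.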

V. Conclusion and Future Work

In this paper, we have combined a cross-label suppression term with noise learning from the training samples via ADMM, and two classifiers, namely GCN and LCN, are designed for better classification performance. Experiments on different datasets demonstrate that our proposed model obtains better performance, in both classification accuracy and training time, than several earlier learning algorithms such as CLS-GR [22], SVM [29], K-SVD [1], and, particularly on scene classification, CNN [34]. Given the limitations of traditional optimization models, we intend to study how to train a more effective dictionary, more accurate parameters, and a more reliable noise model, in order to achieve better classification performance in both accuracy and efficiency. In future work, a model-driven deep learning method, ADMM-Net, can be utilized for further improvement. Besides, the closed-form solution may accelerate the training of the network, and the learnt dictionary may be smoother than one learnt via traditional mathematical optimization.

VI. Acknowledgement

This work was supported by the National Science and Technology Major Project of China under Grant TC190A3WZ-2, the National Natural Science Foundation of China under Grant 61671253, the Jiangsu Specially Appointed Professor Program under Grant RK002STP16001, the Innovation and Entrepreneurship of Jiangsu High-level Talent Program under Grant CZ0010617002, the Six Top Talents Program of Jiangsu under Grant XYDXX-010, and the 1311 Talent Plan of Nanjing University of Posts and Telecommunications.

Declaration of interest statement

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "Noise Learning based Discriminative Dictionary Learning Algorithm for Image Classification".

References

[1] M. Aharon, M. Elad, A. Bruckstein, K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (2006) 4311–4322. https://doi.org/10.1109/TSP.2006.881199.

[2] R. Rubinstein, M. Zibulevsky, M. Elad, Double sparsity: Learning sparse dictionaries for sparse signal approximation, IEEE Trans. Signal Process. 58 (2010) 1553–1564. https://doi.org/10.1109/TSP.2009.2036477.

[3] P. Sprechmann, G. Sapiro, Dictionary learning and sparse coding for unsupervised clustering, in: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2010: pp. 2042–2045. https://doi.org/10.1109/ICASSP.2010.5494985.

[4] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 210–227. https://doi.org/10.1109/TPAMI.2008.79.

[5] J. Mairal, F.R. Bach, J. Ponce, G. Sapiro, A. Zisserman, Discriminative learned dictionaries for local image analysis, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Anchorage, AK, USA, 2008: pp. 1–8. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4587652.

[6] Y. Li, F. Dai, X. Cheng, L. Xu, G. Gui, Multiple-prespecified-dictionary sparse representation for compressive sensing image reconstruction with nonconvex regularization, J. Franklin Inst. 356 (2019) 2353–2371. https://doi.org/10.1016/j.jfranklin.2018.12.013.

[7] Y. Li, Y. Lin, X. Cheng, Z. Xiao, G. Gui, Nonconvex penalized regularization for robust sparse recovery in the presence of SαS noise, IEEE Access 6 (2018) 25474–25485.

[8] B.A. Olshausen, D.J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vision Res. 37 (1997) 3311–3325. https://doi.org/10.1016/S0042-6989(97)00169-7.

[9] D. Wang, S. Kong, A classification-oriented dictionary learning model: Explicitly learning the particularity and commonality across categories, Pattern Recognit. 47 (2014) 885–898. https://doi.org/10.1016/j.patcog.2013.08.004.

[10] M. Harandi, R. Hartley, C. Shen, B. Lovell, C. Sanderson, Extrinsic methods for coding and dictionary learning on Grassmann manifolds, Int. J. Comput. Vis. 114 (2015) 113–136. https://doi.org/10.1007/s11263-015-0833-x.

[11] Z. Jiang, Z. Lin, L.S. Davis, Label consistent K-SVD: Learning a discriminative dictionary for recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 2651–2664. https://doi.org/10.1109/TPAMI.2013.88.

[12] D.S. Pham, S. Venkatesh, Joint learning and dictionary construction for pattern recognition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2008. https://doi.org/10.1109/CVPR.2008.4587408.

[13] K. Engan, S.O. Aase, J.H. Husoy, Frame based signal compression using method of optimal directions (MOD), in: Proc. IEEE Int. Symp. Circuits Syst., 1999: pp. 1–4. https://doi.org/10.1109/ISCAS.1999.779928.

[14] H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in: Adv. Neural Inf. Process. Syst. 19, 2006: pp. 801–808.

[15] S. Kong, S. Punyasena, C. Fowlkes, Spatially aware dictionary learning and coding for fossil pollen identification, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2016: pp. 1305–1314. https://doi.org/10.1109/CVPRW.2016.165.

[16] J. Wen, Z. Zhou, D. Li, X. Tang, A novel sufficient condition for generalized orthogonal matching pursuit, IEEE Commun. Lett. 21 (2017) 805–808. https://doi.org/10.1109/LCOMM.2016.2642922.

[17] A. Torralba, K.P. Murphy, W.T. Freeman, Sharing visual features for multiclass and multiview object detection, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 854–869. https://doi.org/10.1109/TPAMI.2007.1055.

[18] Q. Zhang, B. Li, Discriminative K-SVD for dictionary learning in face recognition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, USA, 2010: pp. 2691–2698. https://doi.org/10.1109/CVPR.2010.5539989.

[19] X.C. Lian, Z. Li, B.L. Lu, L. Zhang, Max-margin dictionary learning for multiclass image categorization, in: Proc. Eur. Conf. Comput. Vis., 2010: pp. 157–170. https://doi.org/10.1007/978-3-642-15561-1_12.

[20] J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Supervised dictionary learning, in: Adv. Neural Inf. Process. Syst., 2009: pp. 1033–1040. http://arxiv.org/abs/0809.3083.

[21] I. Ramirez, P. Sprechmann, G. Sapiro, Classification and clustering via dictionary learning with structured incoherence and shared features, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, USA, 2010: pp. 3501–3508. https://doi.org/10.1109/CVPR.2010.5539964.

[22] X. Wang, Y. Gu, Cross-label suppression: A discriminative and fast dictionary learning with group regularization, IEEE Trans. Image Process. 26 (2017) 3859–3873.

[23] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cogn. Neurosci. 3 (1991) 209–232.

[24] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004) 91–110.

[25] N. Zhou, Y. Shen, J. Peng, J. Fan, Learning inter-related visual dictionary for object recognition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012: pp. 3490–3497. https://doi.org/10.1109/CVPR.2012.6248091.

[26] S. Gao, I.W.H. Tsang, Y. Ma, Learning category-specific dictionary and shared dictionary for fine-grained image categorization, IEEE Trans. Image Process. 23 (2014) 623–634. https://doi.org/10.1109/TIP.2013.2290593.

[27] U. von Luxburg, A tutorial on spectral clustering, 2006. https://www.cs.cmu.edu/~aarti/Class/10701/readings/Luxburg06_TR.pdf.

[28] F. Wen, L. Pei, Y. Yang, W. Yu, P. Liu, Efficient and robust recovery of sparse signal and image using generalized nonconvex regularization, IEEE Trans. Comput. Imaging 3 (2017) 566–579.

[29] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.

[30] L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2004: p. 178. https://doi.org/10.1109/CVPR.2004.109.

[31] B. Ghanem, N. Ahuja, Maximum margin distance learning for dynamic texture recognition, in: Proc. Eur. Conf. Comput. Vis., Lect. Notes Comput. Sci. 6312, 2010: pp. 223–236. https://doi.org/10.1007/978-3-642-15552-9_17.

[32] G. Zhao, M. Pietikäinen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 915–928.

[33] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2006: pp. 2169–2178.

[34] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 (2017) 84–90.