Pattern Recognition Letters 30 (2009) 368–376
Image description using joint distribution of filter bank responses

Timo Ahonen, Matti Pietikäinen
Machine Vision Group, University of Oulu, PL 4500, FI-90014 Oulun yliopisto, Finland
Article history: Received 4 January 2008; received in revised form 15 August 2008; available online 11 November 2008. Communicated by Q. Ji.

Keywords: Texture; Face image description; Local binary pattern; LBP; MR8; Gabor filters
Abstract

This paper presents a unified framework for image descriptors based on the quantized joint distribution of filter bank responses and evaluates the significance of filter bank and vector quantizer selection. First, a filter bank based representation of the local binary pattern (LBP) operator is introduced, which shows that LBP can also be presented as an operator producing vector quantized filter bank responses. Maximum response 8 (MR8) and Gabor filters are widely used alternatives to the derivative filters which are used to implement LBP, and the performance of these three sets is compared in texture categorization and face recognition tasks. Despite their small spatial support, the local derivative filters are shown to outperform Gabor and MR8 filters in texture categorization with the KTH-TIPS2 images. In the face recognition task with CMU PIE images, the Gabor filter-based representation achieves the best recognition rate. Second, it is shown that when the filter response vectors are quantized for histogram based joint density estimation, thresholding is clearly faster than using learned codebooks and, being robust to gray-level changes, it yields a better recognition rate in most cases. Third, automatic selection of the filter bank is discussed, and excellent performance in the face recognition task is achieved with the optimized filter bank.

© 2008 Elsevier B.V. All rights reserved.
1. Introduction

Quantitative description of local image appearance has a wide range of applications in image analysis and computer vision. Describing the appearance locally, e.g., using co-occurrences of gray values or filter bank responses, and then forming a global description by computing statistics over the image area is a well-established technique in texture analysis (Tuceryan and Jain, 1998). On the other hand, recent findings in applying texture methods to face image analysis, for example, indicate that texture methods might have applications in new fields of computer vision that have not traditionally been considered texture analysis problems. In this work we extend the findings of our preliminary work (Ahonen and Pietikäinen, 2008) to more general image analysis.

Because of the importance of texture analysis, a wide variety of texture descriptors have been presented in the literature. However, there is no formal definition of texture itself that researchers would agree upon. This is possibly one of the reasons why no unified theory or framework of texture descriptors has been presented so far. The local binary pattern (LBP) (Ojala et al., 2002), maximum response 8 (Varma and Zisserman, 2005) and Gabor filter-based texture descriptors are among the most studied and best known recent texture analysis techniques. Despite the large number of publications discussing and applying these methods, the connections and differences between them are not well understood. This paper presents a new unified framework for these texture descriptors, which allows for a systematic comparison of these widely used descriptors and the parts they are built of.

LBP is an operator for image description that is based on the signs of differences of neighboring pixels. It is fast to compute and invariant to monotonic gray-scale changes of the image. Despite being simple, it is very descriptive, which is attested by the wide variety of tasks it has been successfully applied to. The LBP histogram has proven to be a widely applicable image feature for, e.g., texture classification, face analysis and video background subtraction (The Local Binary Pattern Bibliography, 2008). Another frequently used approach in texture description is using distributions of quantized filter responses to characterize the texture (Leung and Malik, 2001; Varma and Zisserman, 2005).

In the field of texture analysis, filtering and pixel value based texture operators have been seen as somewhat contradictory. However, in this paper we show that the local binary pattern operator can be seen as a filter operator based on local derivative filters at different orientations and a special vector quantization function. Apart from clarifying the connections between LBP and filter-based methods, this also helps in analyzing the properties of the LBP operator.

The estimated distribution of local image appearance is widely used in image or image patch description, and different implementations of this idea have resulted in excellent performance in a wide
range of applications (e.g., Schiele and Crowley, 2000; Ojala et al., 2002; Varma and Zisserman, 2005; Lowe, 2004; Ahonen et al., 2006). There are still a number of open questions regarding how to describe the local appearance, how to estimate the distribution, how to use the estimated distribution in the selected application, and whether the optimal methods are application specific or more generic. This paper contributes to these questions by setting up a framework and providing systematic experimental results in two different applications, namely texture categorization and illumination invariant face recognition. Commonly used filter sets with different characteristics and varying spatial support are tested in local appearance description. Then two different methods for quantizing the filter responses are compared. Finally, a method for selecting a subset of filters from a large filter bank is proposed.

2. Image descriptors

This paper discusses image descriptors that are based on estimating the distribution of local characteristics of the image. In the texture analysis literature, a variety of such local characteristics have been studied. The well-known co-occurrence matrix introduced by Haralick (1979) is based on the gray values of pixel pairs defined by a displacement vector. Another local characteristic computed directly from pixel gray values is the LBP label, which is computed from the gray-level differences of neighboring pixels. On the other hand, at the core of many texture descriptors is a filter bank or wavelet coefficient based description of local image appearance. In the following we take a closer look at the three image descriptors that are studied in this paper.

2.1. The local binary pattern operator

The local binary pattern operator (Ojala et al., 2002) is a powerful means of texture description.
The original version of the operator labels the pixels of an image by thresholding the 3 × 3 neighborhood of each pixel with the center value and summing the thresholded values weighted by powers of two. The histogram of the labels can then be used as a texture descriptor. See Fig. 1 for an illustration of the basic LBP operator. The operator can also be extended to use neighborhoods of different sizes (Ojala et al., 2002). Using circular neighborhoods and bilinearly interpolating the pixel values allows any radius and number of pixels in the neighborhood. For neighborhoods we will use the notation (P, R), which means P sampling points on a circle of radius R. See Fig. 2 for examples of different circular neighborhoods. Let us denote the center pixel value by g_c and the gray values of the P sampling points by g_1, g_2, ..., g_P. Now the generic LBP_{P,R} operator is defined as
LBP_{P,R} = \sum_{n=1}^{P} s(g_n - g_c)\, 2^{n-1},    (1)

where

s(z) = \begin{cases} 1, & z \geq 0, \\ 0, & z < 0. \end{cases}    (2)

Fig. 2. Three circular neighborhoods: (8, 1), (16, 2), (6, 1). The pixel values are bilinearly interpolated whenever the sampling point is not in the center of a pixel.
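For the (8, 1) neighborhood the eight sampling points coincide with pixel centers, so Eqs. (1) and (2) can be illustrated without the bilinear interpolation step. The following is a minimal sketch in plain Python, not the authors' implementation; the neighbor ordering is an arbitrary illustrative choice:

```python
# Sketch of the generic LBP operator of Eqs. (1)-(2), restricted to the
# (8, 1) neighborhood on an integer grid, where the eight sampling points
# coincide with pixel centers and no interpolation is needed.
def lbp_8_1(img, x, y):
    """Return the LBP code of pixel (x, y); img is a 2-D list of gray values."""
    gc = img[y][x]
    # Eight sampling points, ordered counter-clockwise starting from the right
    # (the ordering is an illustrative convention, not fixed by Eq. (1)).
    offsets = [(1, 0), (1, 1), (0, 1), (-1, 1),
               (-1, 0), (-1, -1), (0, -1), (1, -1)]
    code = 0
    for n, (dx, dy) in enumerate(offsets):
        gn = img[y + dy][x + dx]
        if gn - gc >= 0:          # s(g_n - g_c) of Eq. (2)
            code += 1 << n        # weight 2^(n-1) for n = 1, ..., 8
    return code
```

Because only the signs of the differences g_n − g_c enter the code, any monotonic gray-scale mapping of the image leaves the result unchanged.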
Fig. 1. The basic LBP operator: the 3 × 3 neighborhood is thresholded by the center value and the thresholded values, weighted by powers of two, sum to the LBP code 1 + 2 + 8 + 64 + 128 = 203.

Further extensions to the original operator are uniform and rotationally invariant binary patterns (Ojala et al., 2002). A local binary pattern is called uniform if the binary pattern contains at most two bitwise transitions from 0 to 1 or vice versa when the bit pattern is considered circular. For example, the patterns 00000000 (0 transitions), 01110000 (2 transitions) and 11001111 (2 transitions) are uniform, whereas the patterns 11001001 (4 transitions) and 01010011 (6 transitions) are not. In the computation of the LBP histogram, uniform patterns are used so that the histogram has a separate bin for every uniform pattern and all non-uniform patterns are assigned to a single bin. In the context of LBPs, rotational invariance is achieved by circularly rotating each bit pattern to its minimum value. For instance, the bit sequences 1000011, 1110000 and 0011100 arise from different rotations of the same local pattern, and they all correspond to the normalized sequence 0000111.

2.2. Maximum response 8 filters

The second descriptor considered here is the maximum response 8 (MR8) descriptor (Varma and Zisserman, 2005). At the core of the descriptor is a filter set consisting of 38 filters: two isotropic filters, a Gaussian and a Laplacian of Gaussian, both at scale σ = 10 pixels, and an edge and a bar filter, both at 3 scales (σ_x, σ_y) ∈ {(1, 3), (2, 6), (4, 12)} and 6 orientations. The filter kernels are shown in Fig. 3. After the image has been convolved with the filter bank, the maximum of the 6 responses at different orientations is computed for each oriented filter. This results in a total of 8 responses: 2 from the isotropic filters and 6 from the edge and bar filters at different scales. Finally, the response vector is labeled with the nearest codebook vector (texton) and the histogram of these labels is used to represent the texture. In the learning stage, the codebook is obtained by clustering a set of training samples with the k-means algorithm.

2.3. Gabor filters

Another type of filter kernel that is widely used in image description is the Gabor filter. The complex Gabor function can be defined as

g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y}\, e^{-(x^2/2\sigma_x^2 + y^2/2\sigma_y^2)}\, e^{2\pi j(ux + vy)},    (3)
in which σ_x and σ_y define the scale of the Gabor function and (u, v) defines the frequency of the complex sinusoid. Thus, the Gabor function is a product of an elliptical Gaussian and a complex plane wave. The typical way Gabor filters are applied in texture description is to convolve the input image with a bank of Gabor filters at different scales and frequencies and to compute a set of features from the output images. In texture description, the best known method applying Gabor filters is the one proposed by Manjunath and Ma (1996), in which a vector of means and standard deviations of Gabor filter responses is used for texture description. Another classic work, applying Gabor filters in face recognition, is the Elastic
Fig. 3. The MR8 filter bank. The filter bank consists of an edge and a bar filter both at 3 scales and 6 orientations and a Gaussian and a Laplacian of Gaussian filter.
Bunch Graph Matching method (Wiskott et al., 1997), which is based on Gabor filter bank responses at certain facial landmarks. In a recent work by Zou et al. (2007), Gabor filters and local binary patterns were compared in the face recognition task, and a face descriptor based on Gabor filter responses computed at points spaced one wavelength apart over the whole face area was developed. In that work, the Gabor filter-based descriptor was shown to produce better recognition rates than LBP, especially on difficult image sets containing lighting variation and aging of the subjects.

3. Framework for filter bank and vector quantization based texture descriptors

A widely used approach to texture analysis is to convolve an image with N different filters whose responses at a certain position (x, y) form an N-dimensional vector. At the learning stage, a set of such vectors is collected from training images and the set is clustered using, e.g., k-means to form a codebook. Then each pixel of a texture image is labeled with the label of the nearest cluster center, and the histogram of these labels over a texture image is used to describe the texture (Leung and Malik, 2001; Varma and Zisserman, 2005). More formally, let I(x, y) be the image to be described by the texture operator. Now the vector valued image I_f(x, y) is obtained by convolving the original image with the filter kernels F_1, F_2, ..., F_N:
I_f(x, y) = \begin{bmatrix} I_1(x, y) \\ I_2(x, y) \\ \vdots \\ I_N(x, y) \end{bmatrix}, \quad I_n(x, y) = I(x, y) * F_n,    (4)

The labeled image I_{lab}(x, y) is obtained with a vector quantizer f: \mathbb{R}^N \to \{0, 1, 2, \ldots, M - 1\}, where M is the number of different labels produced by the quantizer. Thus, the labeled image is

I_{lab}(x, y) = f(I_f(x, y))    (5)
and the histogram of labels is
H_i = \sum_{x,y} \delta\{i, I_{lab}(x, y)\}, \quad i = 0, \ldots, M - 1,    (6)

in which δ is the Kronecker delta

\delta\{i, j\} = \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases}    (7)

If the task is classification or categorization, as in this work, several possibilities exist for classifier selection. The most typical strategy is to use a nearest neighbor classifier with, e.g., the χ² distance to measure the distance between histograms (Leung and Malik, 2001; Varma and Zisserman, 2005). In (Varma and Zisserman, 2004), the nearest neighbor classifier was compared to Bayesian classification, but no significant difference in performance was found. In (Caputo et al., 2005) it was shown that the performance of a material categorization system can be enhanced by using a suitably trained support vector machine classifier. In this work, the main interest is not in classifier design but in the local descriptors, and thus the nearest neighbor classifier with the χ² distance was selected for the experimental part.

The following two subsections discuss in more detail the two parts that define an image descriptor in the proposed framework: the filter bank F_1, F_2, ..., F_N and the quantization function f.

3.1. Filter bank

In this paper we compare three different types of filter kernels that are commonly used in texture description. The first filter bank is a set of oriented derivative filters whose thresholded output is shown to be equivalent to the local binary pattern operator. The other two filter banks included in the comparison are Gabor filters and the maximum response 8 filter set.

A novel way to look at the LBP operator, proposed in this paper, is to see it as a special filter-based texture operator. The filters for implementing LBP are approximations of image derivatives computed at different orientations. The filter coefficients are chosen to be equal to the weights of bilinear interpolation of pixel values at the sampling points of the LBP operator, and the coefficient at the filter center is obtained by subtracting 1 from the center value. For example, the kernels shown in Fig. 4 can be used for a filter-based implementation of the local binary pattern operator in the circular (8, 1) neighborhood. The response of such a filter at location (x, y) gives the signed difference between the sampling point corresponding to the filter and the center pixel. These filters, which will be called local derivative filters in the following, can be constructed for any radius and any number of sampling points.

Fig. 4. Filters F1–F3 of the total of 8 local derivative filters in the (8, 1) neighborhood. The remaining 5 filters are obtained by mirroring the filters shown here.

Applying the maximum response 8 descriptor in this framework is straightforward. In the filter bank design we follow the
procedure of Varma and Zisserman (2005). This filter bank produces a 38-dimensional vector valued image. Selecting the maximum over orientations is handled in the vector quantizer and is described in more detail in the following section.

For Gabor filters, a lot of work has been devoted to designing the filter bank and feature computation methods (see, e.g., Manjunath and Ma, 1996; Clausi and Jernigan, 2000; Grigorescu et al., 2002). In this work we apply the Gabor filters in the proposed image description framework, which is to say that the responses of the filter bank at a certain position are stacked into a vector which is used as input for the vector quantizer. This resembles the Gabor filter-based face description suggested by Zhang et al. (2007), in which the Gabor filter responses are quantized and a histogram of them is then formed to encode a facial image. In the design of the filter bank, i.e. the selection of the scale and frequency parameters, the procedure of Manjunath and Ma (1996) is applied. Furthermore, at each chosen scale and frequency, the real and imaginary parts of the complex Gabor filter are treated as separate filters, so the total number of filters is 2 n_s n_f, in which n_s is the number of scales and n_f is the number of orientations at each scale.

3.2. Vector quantization

The assumption on which the proposed texture description framework is based is that the joint distribution of filter responses can be used to describe the image appearance. Depending on the size of the filter bank, the dimension of the vectors in the image I_f(x, y) can be high, and quantization of the vectors is needed for reliable estimation of the histogram. A simple, non-adaptive way to quantize the filter responses is to threshold them and to compute the sum of the thresholded values multiplied by powers of two:
I_{lab}(x, y) = \sum_{n=1}^{N} s_t\{I_n(x, y)\}\, 2^{n-1},    (8)

where s_t is the thresholding function

s_t\{z\} = \begin{cases} 1, & z \geq t, \\ 0, & z < t, \end{cases}    (9)
in which the parameter t is the threshold. Thresholding divides each dimension of the filter bank output into two bins. The total number of different labels produced by threshold quantization is 2^N, where N is the number of filters. If the threshold t = 0 and the coefficients of each of the filters in the bank have zero mean (i.e. they sum up to zero), the value of s_t{I_n(x, y)} and thus the value of I_{lab}(x, y) is not affected by affine gray-level changes I'(x, y) = aI(x, y) + b, a > 0: if the filter coefficients F_n(x, y) sum up to zero, then I'(x, y) * F_n = a(I(x, y) * F_n) and, assuming a > 0, the sign is not changed. Such global gray-level changes could easily be normalized away, but the invariance also holds locally for areas in which changes in pixel values, e.g., due to lighting changes, can be modeled as an affine gray-level change within the filter support area. Now, let us consider the case where the filter bank used to obtain the image I_f(x, y) is the set of local derivative filters (e.g., the filters presented in Fig. 4), designed so that the filter responses are equal to the signed differences, i.e. at each pixel location
I_n = g_n - g_c,

where I_n is the response of the n-th filter at a given location, and g_c and g_n are the gray values of the center pixel and the n-th sampling point (see Eq. (1)) at the same location. When the quantizer (8) with threshold t = 0 is applied to I_f, the resulting label I_{lab} is equal to that produced by the local binary pattern operator LBP_{P,R} using the same
neighborhood. Therefore, the LBP operator can be represented in the proposed framework. It should be noted that for LBPs, an even stronger invariance to gray-level changes holds than presented above. As discussed by Ojala et al. (2002), the LBP labels are invariant to any monotonic mapping of gray values. For these reasons, the choice of threshold t = 0 has been common, especially with the LBP operator. Still, under some circumstances a different choice may yield better results. For instance, when LBP features are used for background subtraction, choosing a non-zero threshold was observed to result in more stable codes in nearly flat gray areas and, consequently, increased performance (Heikkilä and Pietikäinen, 2006). Another method for quantizing the filter responses is to construct a codebook of them at the learning stage and then use the nearest codeword to represent the filter bank output at each location:
I_{lab}(x, y) = \arg\min_m \| I_f(x, y) - c_m \|,    (10)
in which c_m is the m-th vector (codeword) in the codebook. This approach is used in (Leung and Malik, 2001; Varma and Zisserman, 2005), which use k-means to construct the codebook whose elements are called textons. Codebook based quantization of the signed differences of neighboring pixels (which correspond to local derivative filter outputs) was presented in (Ojala et al., 2001). When comparing these two methods for quantizing the filter responses, one might expect that if the number of labels produced by the quantizers is kept roughly the same, the codebook based quantizer handles possible statistical dependencies between the filter responses better. On the other hand, since codebook based quantization requires a search for the closest codeword at each pixel location, it is clearly slower than simple thresholding, even though a number of both exact and approximate techniques have been proposed for finding the nearest codeword without exhaustive search through the codebook (Gray and Neuhoff, 1998, p. 2362; Chávez et al., 2001).

3.3. Rotational invariance

It is important to note that a clever co-design of the filter bank and the vector quantizer can also make the texture descriptor rotationally invariant. Again, two different strategies have been proposed. Rotationally invariant LBP codes are obtained by circularly shifting an LBP binary code to its minimum value (Ojala et al., 2002). In the joint framework this can be represented as further combining the labels of threshold quantization (8) so that all the different labels that can arise from rotations of the local gray pattern are joined into a single label. On the other hand, the approach chosen for the MR8 descriptor to achieve rotational invariance is to select only the maximum over the 6 different rotations of each bar and edge filter.
Only these maximum values and the responses of the two isotropic filters are used in further quantization, so the 8-dimensional response of the filter set is invariant to rotations of the gray pattern.

4. Experiments

The proposed framework and the relative descriptiveness of the different filter banks and vector quantization methods were systematically tested in two different application areas: material categorization and illumination invariant face recognition. Both are very challenging, unsolved problems, so they clearly highlight the performance differences between the operators. To test the proposed framework and to systematically explore the relative descriptiveness of the different filter banks and vector
Table 1
Properties of the tested filter kernels.

Filter bank               Size     Number of filters
Local derivative filters  3 × 3    8
Gabor(1, 4)               7 × 7    8
Gabor(4, 6)               49 × 49  48
MR8                       49 × 49  38
quantization methods, the challenging task of material categorization using the KTH-TIPS2 database (Mallikarjuna et al., 2006) was utilized. The widely used CMU PIE (Pose, Illumination, and Expression) database by Sim et al. (2003) was selected to serve as test material in the face recognition experiments. It is especially interesting how the different filter banks and vector quantizers respond to changes in lighting conditions. The images in the PIE database contain
systematic lighting variation, so it is very well suited for these experiments. The same filter banks were utilized in both experiments. The proposed framework allowed testing the performance of different filters and different quantization methods independently. The filter banks included in the tests were local derivative filters, two different banks of Gabor filters, and the MR8 filters. The local derivative filter bank was chosen to match the LBP_{8,1} operator, which resulted in 8 filters (see Fig. 4). Two very different types of Gabor filter banks were tested: one with only 1 scale and 4 orientations and small spatial support (7 × 7), and another with 4 scales and 6 orientations and larger spatial support. The properties of the tested filter kernels are listed in Table 1.

4.1. Material categorization

The KTH-TIPS2 database (Mallikarjuna et al., 2006) was utilized to test the performance of the descriptors in the material categorization
Fig. 5. Examples of images from the KTH-TIPS2 database. Figures in each column belong to the same texture category.
Fig. 6. Example images of 2 out of the 68 subjects in the CMU PIE database.
Fig. 7. (a) The KTH-TIPS2 categorization rates and (b) CMU PIE recognition rates for k-means based vector quantization of filter bank responses as a function of codebook size (16–256), for the local derivative, Gabor(4, 6), Gabor(1, 4) and MR8 filter banks.
Fig. 8. (a) The KTH-TIPS2 categorization rates and (b) CMU PIE recognition rates for thresholding quantization of filter bank responses as a function of the threshold value (−0.1 to 0.1), for the local derivative, Gabor(1, 4) and MR8 filter banks.
task. The database contains 4 samples of 11 different materials, each sample imaged at 9 different scales and under 12 lighting and pose setups, totaling 4752 images. Examples of texture images from the KTH-TIPS2 database are shown in Fig. 5. Caputo et al. performed material categorization tests using KTH-TIPS2 and considered especially the significance of classifier selection (Caputo et al., 2005). In that paper, the main conclusions were that state-of-the-art descriptors such as LBP and MR8 show relatively small differences in performance, but significant gains in classification rate can be obtained by using a support vector machine classifier instead of nearest neighbor. Moreover, the classification rates can be enhanced by increasing the number of samples used for training. In this work, the main interest is to examine the relative descriptiveness of different setups of the filter bank based texture descriptors. To facilitate this, we chose a very challenging test setup that resembles the most difficult setup used in (Caputo et al., 2005). Using each of the descriptors to be tested, a nearest neighbor classifier using the χ² distance was trained with one sample (i.e. 9 × 12 = 108 images) per material category. The remaining 3 × 9 × 12 = 324 images per category were used for testing. This was repeated with 10,000 random splits into training and testing data, and the mean and standard deviation over the splits were used to assess the performance.

4.2. Illumination invariant face recognition

To test the performance of the descriptors in illumination invariant face recognition, the CMU PIE database was used. In total, the database contains 41,368 images of 68 subjects taken from different angles, under different lighting conditions and with varying expression. For our experiments, we selected a set of 23 images of each of the 68 subjects. Two of these were taken with the room lights on and the remaining 21 each with a flash at a varying position.
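The classification protocol shared by both experiments, nearest neighbor over label histograms with the χ² distance, can be sketched as follows. This is a plain-Python illustration, not the authors' Matlab code, and since the paper does not write out its exact χ² variant, the common symmetric form χ²(H, K) = Σ_i (H_i − K_i)² / (H_i + K_i) is assumed:

```python
# Nearest neighbor classification of label histograms with the symmetric
# chi-square distance (an assumed variant; the paper does not spell it out).
def chi_square(h, k):
    """chi2(H, K) = sum_i (H_i - K_i)^2 / (H_i + K_i), skipping empty bins."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h, k) if a + b > 0)

def nearest_neighbor(query, train_hists, train_labels):
    """Return the class label of the training histogram closest to `query`."""
    dists = [chi_square(query, h) for h in train_hists]
    return train_labels[dists.index(min(dists))]
```

In the experiments this classifier is simply re-run over the 10,000 random training/testing splits and the resulting rates are averaged.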
Preprocessed example images from the database are shown in Fig. 6. In obtaining a descriptor for a facial image, the procedure of Ahonen et al. (2006) was followed. The faces were first normalized so that the eyes are at fixed positions. The selected filter bank and vector quantization method were then applied, and the resulting label image was cropped to a size of 128 × 128 pixels. Thus, for further analysis, the size of the vector valued image was the same irrespective of the filter kernel size. The labeled image was further divided into blocks of size 16 × 16 pixels, histograms were computed in each block individually, and the histograms were then concatenated to form the spatially enhanced histogram describing the face.
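The block-wise histogram step can be sketched as below, a minimal plain-Python illustration of the spatially enhanced histogram rather than the authors' implementation; `n_labels` and `block` correspond to the number of quantizer labels and the 16 × 16 block size above:

```python
# Sketch of the spatially enhanced histogram: the label image is divided
# into non-overlapping blocks, a label histogram is computed in each block,
# and the block histograms are concatenated into one feature vector.
def spatial_histogram(label_img, n_labels, block=16):
    rows, cols = len(label_img), len(label_img[0])
    feature = []
    for by in range(0, rows, block):
        for bx in range(0, cols, block):
            hist = [0] * n_labels
            for y in range(by, min(by + block, rows)):
                for x in range(bx, min(bx + block, cols)):
                    hist[label_img[y][x]] += 1
            feature.extend(hist)
    return feature
```

For a 128 × 128 label image, 16 × 16 blocks and M labels this yields a feature vector of length 64 M, which retains coarse spatial information that a single global histogram would lose.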
A nearest neighbor classifier with the χ² distance was utilized for classification. One image per person was used for training and the remaining 22 images for testing. Again, 10,000 random selections into training and testing data were used.

4.3. Codebook based vector quantization

All four filter banks were tested using two types of vector quantization: thresholding and codebook based quantization. For codebook based quantization, the selected approach was to aim for compact, universal texton codebooks, i.e. codebooks of rather small size that are not tailored to this specific set of textures or faces. Therefore, images from the CUReT texture database (Dana et al., 1999) were used to learn the codebooks for the texture categorization test, and images from the Yale B face database (Georghiades et al., 2001) for the face recognition test. The codebook sizes tested were 16–256 codewords. The texture categorization rates and face recognition rates as a function of codebook size, obtained with each filter bank and codebook based quantization, are plotted in Fig. 7a and b. Fig. 7a shows that, most of the time, using a larger codebook enhances the texture categorization rate, but the selection of the filter bank is a clearly more dominant factor than the codebook size. For example, the local derivative filters achieve a higher categorization rate with the smallest codebook size than the MR8 filters with any codebook size. The same applies to the face recognition results in Fig. 7b. Here it is seen that the Gabor(4, 6) filters with a large support area achieve better recognition rates than the Gabor(1, 4) and local derivative filters, and each of these performs better than the MR8 descriptor. A probable cause for this is the rotational invariance built into the MR8 descriptor, which might actually lose some useful information.

4.4. Thresholding based vector quantization

In the next experiment, the material categorization and face recognition tests were performed using the same filter banks but thresholding based vector quantization. The Gabor(4, 6) filter bank was omitted from this experiment due to the large number of filters in the bank (the resulting histograms would have been of length 2^48). Since the local derivative and Gabor filters have zero mean, the thresholding function (9) was applied directly with different choices of the threshold t. For the edge and bar filters in the MR8 filter set, only the maximum of the responses over different orientations is retained, and therefore in that case the mean of the 8-dimensional response
Fig. 9. The KTH-TIPS2 categorization rates with thresholding and codebook (256) based quantization for each filter bank.
Fig. 10. The CMU PIE recognition rates with thresholding and codebook (256) based quantization for each filter bank.
vectors over all the training images was computed and subtracted from the response before applying thresholding. Fig. 8 shows the texture categorization and face recognition rates as a function of the threshold value. Interestingly, the shapes of the curves are somewhat different in the two tasks. Non-zero threshold values give the best results in texture categorization (Fig. 8a) with the Gabor(1, 4) and local derivative filters, but in general the performance differences caused by changing the threshold are small. In face recognition, on the other hand, the situation is reversed: with all the filter sets, the best results are obtained with threshold t = 0. This is likely due to lighting effects since, as discussed in Section 3.2, using threshold t = 0 with filters having zero mean yields a descriptor that is invariant to affine changes of gray values. Both the texture and the face experiments contain lighting changes, but in the texture experiments these are compensated in part by the larger number of training images taken under different lighting conditions, whereas in the face recognition experiment there is only one training image per subject.

Table 2 and Figs. 9 and 10 show the texture categorization and face recognition rates using thresholding based quantization with t = 0 and codebook based quantization with codebook size 256. In texture categorization, codebook based quantization yields a slightly worse categorization rate than thresholding when using local derivative filters, but the difference is smaller than the standard deviation of the rates. With the Gabor(1, 4) filter bank, thresholding performs worse than codebook based quantization, but interestingly, with the MR8 filters thresholding yields a better rate. The local derivative filters give the best categorization rate over the tested filter sets with both quantization functions.
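The codebook quantizer of Eq. (10) being compared here can be sketched as follows; a plain-Python illustration with a hand-picked codebook standing in for one learned by k-means, not the authors' implementation:

```python
# Sketch of codebook (texton) quantization, Eq. (10): a filter response
# vector is replaced by the index of the nearest codeword in Euclidean
# distance. In the paper the codebook comes from k-means clustering of
# training responses; here a fixed toy codebook is assumed instead.
def quantize(vec, codebook):
    def dist2(c):
        return sum((v - w) ** 2 for v, w in zip(vec, c))
    return min(range(len(codebook)), key=lambda m: dist2(codebook[m]))
```

The speed gap reported above follows directly: this quantizer costs one distance evaluation per codeword per pixel, whereas the threshold quantizer of Eq. (8) needs only N sign tests per pixel.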
The results obtained in these experiments differ slightly from those presented in (Ahonen and Pietikäinen, 2008) and those presented earlier by Ojala et al. (2001), which showed that codebook-based quantization of signed gray-level differences yields slightly better recognition than LBPs, although at the cost of higher computational complexity. We believe this is due to the different training data for the k-means algorithm. Our earlier experiments suggested that codebook-based quantization might perform slightly better than thresholding, but in those experiments the material used to learn the codebook overlapped with the testing material. In the present experiments the training and testing data sets were completely separate, so it may be that the codebook learned from the training images does not describe the test images well enough. Moreover, the images used for testing by Ojala et al. (2001) had less variation in lighting conditions than the KTH-TIPS2 images, so the robustness of thresholding to gray-level variations discussed in Section 3.2 also explains these results.
In illumination-invariant face recognition, the Gabor filters show better performance than the local derivative filters. With all the filter banks, thresholding-based quantization yields better recognition rates than codebooks; here the performance gain of replacing codebook-based quantization with thresholding is apparent. Again, this somewhat surprising result is probably explained to some degree by the invariance of thresholding-based quantization to gray-level changes.
Considering the computational cost of the presented methods, thresholding-based quantization is much faster than codebook-based quantization. As for the filter bank operations, the computational cost grows with the size and number of filters, but FFT-based convolution can speed them up. Still, at the two extremes, local derivative filtering and thresholding-based labeling of an image of size 256 × 256 take 0.04 s, whereas codebook-based labeling of the same image using the Gabor(4, 6) filters (performing the convolutions with FFT) takes 10.98 s. Both running times were measured using unoptimized Matlab implementations of the methods on a PC with an AMD Athlon 2200 MHz processor.
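For comparison, codebook-based labeling assigns each pixel's response vector to its nearest codeword. A minimal sketch follows; in practice the codebook would be learned with k-means on training responses, and all names here are illustrative:

```python
import numpy as np

def codebook_labels(responses, codebook):
    """Assign each pixel's N-dim response vector to its nearest codeword.
    responses: (N, H, W) filter response stack; codebook: (K, N) codewords."""
    n, h, w = responses.shape
    vecs = responses.reshape(n, -1).T  # (H*W, N) response vectors
    # Squared Euclidean distance from every pixel to every codeword: (H*W, K)
    d2 = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).reshape(h, w)  # label image of shape (H, W)
```

Each pixel costs K distance evaluations in N dimensions here, against a single comparison per filter for thresholding, which is consistent with the timing gap reported above.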
4.5. Filter subset selection

Table 2
Recognition rates for different filter banks and quantization methods.

                  Local derivative filters  Gabor(1, 4)  Gabor(4, 6)  MR8 filters
Texture, CB       0.521                     0.458        0.447        0.455
Texture, thresh.  0.528                     0.366        –            0.492
Face, CB          0.482                     0.502        0.704        0.204
Face, thresh.     0.742                     0.883        –            0.348
The third experiment tested whether it is possible to select a representative subset of filters from a large filter bank for thresholding-based quantization. The number of labels produced by the quantizer is 2^N, where N is the number of filters, so the length of the label histograms grows exponentially with the number of filters. A small filter bank is therefore desirable for thresholding quantization.
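The exponential growth is easy to make concrete; the figures below follow directly from 2^N and are purely illustrative:

```python
# Label histogram length is 2**N for N thresholded filters.
hist_len = {n: 2 ** n for n in (8, 16, 48)}
# 8 filters give a manageable 256-bin histogram, whereas the 48 filters of
# the Gabor(4, 6) bank would need 2**48 bins, which presumably explains why
# thresholding is not reported for that bank in Table 2.
```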
In this experiment, the sequential floating forward selection (SFFS) algorithm (Pudil et al., 1994) was used to select a maximum of 8 filters from a larger filter bank. The optimization criterion was the recognition rate on the training set (KTH-TIPS1). Two different initial filter banks were tested. First, 8 filters were selected from the 48 filters of the Gabor(4, 6) filter bank. However, the resulting 8-filter bank did not perform well on the testing database, yielding a categorization rate of only 0.295. The same experiment was done for the face recognition problem, using the Yale B images as training data for SFFS. Again, 8 filters were selected from the full Gabor(4, 6) filter bank, using the recognition rate on the Yale B set as the optimization criterion, and the recognition rate on the test set, the CMU PIE dataset, was then recorded. Here the filter subset selection performed superbly, achieving a recognition rate of 1.000. In the face recognition literature, there are findings that LBP and Gabor filter-based information are complementary: in (Zhang et al., 2005), LBP histograms were extracted from Gabor-filtered images, and in (Yan et al., 2007), score-level fusion of LBP and Gabor filter-based similarity scores was performed. Motivated by these findings, SFFS was used to select 8 filters from the union of the local derivative and Gabor(1, 4) filter banks. This resulted in a set of 6 local derivative and 2 Gabor filters, and the resulting filter bank reached a categorization rate of 0.544, which is significantly higher than the rate of the Gabor(1, 4) filter bank and slightly higher than that of the local derivative filter bank. Unfortunately, we were not able to achieve performance gains in face recognition by combining the two types of filters.
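The SFFS loop used here can be sketched in simplified form. In the experiments the criterion would be the training-set recognition rate; the `score` callable and all names below are illustrative assumptions:

```python
def sffs(candidates, score, max_size=8):
    """Simplified sequential floating forward selection (Pudil et al., 1994).
    candidates: filter indices; score: criterion over a tuple of indices."""
    selected, best = [], {}
    while len(selected) < max_size:
        # Forward step: add the candidate that maximizes the criterion.
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        add = max(remaining, key=lambda c: score(tuple(selected + [c])))
        selected = selected + [add]
        best[len(selected)] = (score(tuple(selected)), list(selected))
        # Floating backward step: drop the least significant filter while
        # doing so improves on the best known subset of the smaller size.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda c: score(tuple(s for s in selected if s != c)))
            reduced = [s for s in selected if s != drop]
            if score(tuple(reduced)) > best[len(reduced)][0]:
                selected = reduced
                best[len(selected)] = (score(tuple(selected)), list(selected))
            else:
                break
    return max(best.values())[1]  # best-scoring subset found
```

The floating backward step is what distinguishes SFFS from plain forward selection: it lets the algorithm undo earlier greedy choices when a smaller subset turns out to score higher.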
5. Discussion and conclusion

In this paper we have presented a novel unified framework under which histogram-based image description methods such as the well-known local binary pattern and MR8 descriptors can be explained and analyzed. Even though this is still far from a complete unified theory of statistical image description, the framework makes the differences and similarities between the methods apparent. Moreover, the presented framework allows for systematic comparison of different descriptors and the parts they are built of. Such an analytic approach can be useful in analyzing texture descriptors, as they are usually presented in the literature as a sequence of steps whose relation to other description methods is unclear. The framework presented in this work allows for explicitly illustrating the connection between the parts of the LBP and MR8 descriptors and experimenting with the performance of each part.
The filter sets and vector quantization techniques of the LBP, MR8 and Gabor filter-based texture descriptors were compared in this paper. In this comparison it was found that the local derivative filter responses are both the fastest to compute and the most descriptive in the texture categorization task. This somewhat surprising result further supports the previous findings that texture descriptors relying on small-scale pixel relations yield comparable or even superior results to those based on filters of larger spatial support (Ojala et al., 2002; Varma and Zisserman, 2003). On the other hand, in face recognition, the Gabor filters showed better performance than the local derivatives or the MR8 descriptor. It seems that larger spatial support is beneficial in face description, as Gabor(4, 6), having the largest spatial support of 49 × 49 pixels, performed better than Gabor(1, 4) (7 × 7 pixels), and the local derivatives with the smallest support (3 × 3 pixels) gave the worst performance of the three.
Furthermore, there is evidence that Gabor filters suppress the effects of lighting variation in facial images and for this reason they have even been used as preprocessing for the LBP descriptor (Zhang et al., 2005).
When comparing the different vector quantization methods, the experimental results show that thresholding is the faster and, in most cases, also the more descriptive quantization method compared to codebooks. The likely explanation is the robustness of thresholding with t = 0 to illumination changes. This is especially apparent in the face description experiments, where the performance drops when the threshold t = 0 is replaced with a different value or with codebook quantization.
Finally, the experiments on filter subset selection showed that the local derivative and Gabor filter sets may be complementary and may yield better performance than either set alone in texture description. In face recognition, a significant performance gain was achieved by selecting a subset of the Gabor(4, 6) filters and applying it to thresholding-based description.
Not only does the presented framework contribute to the understanding and comparison of existing texture descriptors, but it can also be utilized for more systematic development of new, even better performing methods. The framework is simple to implement, and together with the publicly available KTH-TIPS2 and CMU PIE image databases it can easily be used for comparing novel descriptors with the current state-of-the-art methods. We believe that further advances in both filter bank and vector quantizer design are possible, especially as new invariance properties of the descriptors are aimed for.

References

Ahonen, T., Pietikäinen, M., 2008. A framework for analyzing texture descriptors. In: Third Internat. Conf. on Computer Vision Theory and Applications (VISAPP 2008), pp. 1:507–1:512.
Ahonen, T., Hadid, A., Pietikäinen, M., 2006. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Machine Intell. 28 (12), 2037–2041.
Caputo, B., Hayman, E., Mallikarjuna, P., 2005. Class-specific material categorisation. In: Proc. 10th IEEE Internat. Conf. on Computer Vision (ICCV 05), vol. 2, pp. 1597–1604.
Chávez, E., Navarro, G., Baeza-Yates, R.A., Marroquín, J.L., 2001. Searching in metric spaces. ACM Comput. Surv. 33 (3), 273–321.
Clausi, D.A., Jernigan, M.E., 2000. Designing Gabor filters for optimal texture separability. Pattern Recognition 33 (11), 1835–1849.
Dana, K.J., van Ginneken, B., Nayar, S.K., Koenderink, J.J., 1999. Reflectance and texture of real-world surfaces. ACM Trans. Graphics 18 (1), 1–34.
Georghiades, A., Belhumeur, P., Kriegman, D., 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Machine Intell. 23 (6), 643–660.
Gray, R.M., Neuhoff, D.L., 1998. Quantization. IEEE Trans. Inform. Theory 44 (6), 2325–2383.
Grigorescu, S.E., Petkov, N., Kruizinga, P., 2002. Comparison of texture features based on Gabor filters. IEEE Trans. Image Process. 11 (10), 1160–1167.
Haralick, R.M., 1979. Statistical and structural approaches to texture. Proc. IEEE 67 (5), 786–804.
Heikkilä, M., Pietikäinen, M., 2006. A texture-based method for modeling the background and detecting moving objects. IEEE Trans. Pattern Anal. Machine Intell. 28 (4), 657–662.
Leung, T., Malik, J., 2001. Representing and recognizing the visual appearance of materials using three-dimensional textons. Internat. J. Comput. Vision 43 (1), 29–44.
Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. Internat. J. Comput. Vision 60 (2), 91–110.
Mallikarjuna, P., Fritz, M., Targhi, A.T., Hayman, E., Caputo, B., Eklundh, J.-O., 2006. The KTH-TIPS and KTH-TIPS2 Databases.
Manjunath, B.S., Ma, W.-Y., 1996. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Machine Intell. 18 (8), 837–842.
Ojala, T., Valkealahti, K., Oja, E., Pietikäinen, M., 2001. Texture discrimination with multidimensional distributions of signed gray-level differences. Pattern Recognition 34 (3), 727–739.
Ojala, T., Pietikäinen, M., Mäenpää, T., 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Machine Intell. 24 (7), 971–987.
Pudil, P., Novovicová, J., Kittler, J., 1994. Floating search methods in feature selection. Pattern Recognition Lett. 15 (10), 1119–1125.
Schiele, B., Crowley, J.L., 2000. Recognition without correspondence using multidimensional receptive field histograms. Internat. J. Comput. Vision 36 (1), 31–50.
Sim, T., Baker, S., Bsat, M., 2003. The CMU pose, illumination, and expression database. IEEE Trans. Pattern Anal. Machine Intell. 25 (12), 1615–1618.
The Local Binary Pattern Bibliography, 2008.
Tuceryan, M., Jain, A.K., 1998. Texture analysis. In: Chen, C.H., Pau, L.F., Wang, P.S.P. (Eds.), The Handbook of Pattern Recognition and Computer Vision, second ed. World Scientific Publishing Co., pp. 207–248.
Varma, M., Zisserman, A., 2003. Texture classification: Are filter banks necessary? In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2003), vol. 2, pp. 691–698.
Varma, M., Zisserman, A., 2004. Unifying statistical texture classification frameworks. Image Vision Comput. 22 (14), 1175–1183.
Varma, M., Zisserman, A., 2005. A statistical approach to texture classification from single images. Internat. J. Comput. Vision 62 (1–2), 61–81.
Wiskott, L., Fellous, J.-M., Kuiger, N., von der Malsburg, C., 1997. Face recognition by elastic bunch graph matching. IEEE Trans. Pattern Anal. Machine Intell. 19, 775–779.
Yan, S., Wang, H., Tang, X., Huang, T.S., 2007. Exploring feature descriptors for face recognition. In: Internat. Conf. on Acoustics Speech and Signal Processing (ICASSP 2007), Honolulu, Hawaii, USA. IEEE Signal Processing Society, pp. I:629–I:632.
Zhang, W., Shan, S., Gao, W., Chen, X., Zhang, H., 2005. Local Gabor binary pattern histogram sequence (LGBPHS): A novel non-statistical model for face representation and recognition. In: Proc. 10th IEEE Internat. Conf. on Computer Vision (ICCV 05), vol. 1, pp. 786–791.
Zhang, B., Shan, S., Chen, X., Gao, W., 2007. Histogram of Gabor phase patterns (HGPP): A novel object representation approach for face recognition. IEEE Trans. Image Process. 16 (1), 57–68.
Zou, J., Ji, Q., Nagy, G., 2007. A comparative study of local matching approach for face recognition. IEEE Trans. Image Process. 16 (10), 2617–2628.