Pattern Recognition Letters 30 (2009) 368–376
Image description using joint distribution of filter bank responses

Timo Ahonen, Matti Pietikäinen
Machine Vision Group, University of Oulu, PL 4500, FI-90014 Oulun yliopisto, Finland
Article history: Received 4 January 2008; received in revised form 15 August 2008; available online 11 November 2008. Communicated by Q. Ji.

Keywords: Texture; Face image description; Local binary pattern; LBP; MR8; Gabor filters
Abstract

This paper presents a unified framework for image descriptors based on the quantized joint distribution of filter bank responses and evaluates the significance of filter bank and vector quantizer selection. First, a filter bank based representation of the local binary pattern (LBP) operator is introduced, which shows that LBP can also be presented as an operator producing vector quantized filter bank responses. Maximum response 8 (MR8) and Gabor filters are widely used alternatives to the derivative filters which are used to implement LBP, and the performance of these three sets is compared in texture categorization and face recognition tasks. Despite their small spatial support, the local derivative filters are shown to outperform Gabor and MR8 filters in texture categorization with the KTH-TIPS2 images. In the face recognition task with CMU PIE images, the Gabor filter-based representation achieves the best recognition rate. Second, it is shown that when the filter response vectors are quantized for histogram based joint density estimation, thresholding is clearly faster than using learned codebooks and, being robust to gray-level changes, it yields a better recognition rate in most cases. Third, automatic selection of the filter bank is discussed, and excellent performance in the face recognition task is achieved with the optimized filter bank.

© 2008 Elsevier B.V. All rights reserved.
1. Introduction

Quantitative description of local image appearance has a wide range of applications in image analysis and computer vision. Describing the appearance locally, e.g., using co-occurrences of gray values or filter bank responses, and then forming a global description by computing statistics over the image area is a well-established technique in texture analysis (Tuceryan and Jain, 1998). On the other hand, recent findings in applying texture methods to face image analysis, for example, indicate that texture methods might have applications in new fields of computer vision that have not traditionally been considered texture analysis problems. In this work we extend the findings of our preliminary work (Ahonen and Pietikäinen, 2008) to more general image analysis.

Because of the importance of texture analysis, a wide variety of texture descriptors have been presented in the literature. However, there is no formal definition of texture itself that researchers would agree upon. This is possibly one of the reasons why no unified theory or framework of texture descriptors has been presented so far. The local binary pattern (LBP) (Ojala et al., 2002), maximum response 8 (Varma and Zisserman, 2005) and Gabor filter-based texture descriptors are among the most studied and best known recent texture analysis techniques. Despite the large number of publications discussing and applying these methods, the connections and differences between them are not well understood. This paper presents a new unified framework for these texture descriptors, which allows for a systematic comparison of these widely used descriptors and the parts they are built of.

LBP is an operator for image description that is based on the signs of differences of neighboring pixels. It is fast to compute and invariant to monotonic gray-scale changes of the image. Despite being simple, it is very descriptive, which is attested by the wide variety of tasks it has been successfully applied to. The LBP histogram has proven to be a widely applicable image feature for, e.g., texture classification, face analysis and video background subtraction (The Local Binary Pattern Bibliography, 2008). Another frequently used approach in texture description is using distributions of quantized filter responses to characterize the texture (Leung and Malik, 2001; Varma and Zisserman, 2005).

In the field of texture analysis, filtering and pixel value based texture operators have been seen as somewhat contradictory. However, in this paper we show that the local binary pattern operator can be seen as a filter operator based on local derivative filters at different orientations and a special vector quantization function. Apart from clarifying the connections between LBP and filter-based methods, this also helps in analyzing the properties of the LBP operator.

The estimated distribution of local image appearance is widely used in image or image patch description, and different implementations of this idea have resulted in excellent performance in a wide
range of applications (e.g., Schiele and Crowley, 2000; Ojala et al., 2002; Varma and Zisserman, 2005; Lowe, 2004; Ahonen et al., 2006). There are still a number of open questions regarding how to describe the local appearance, how to estimate the distribution, how to use the estimated distribution in the selected application, and whether the optimal methods are application specific or more generic. This paper contributes to these questions by setting up a framework and providing systematic experimental results in two different applications, namely texture categorization and illumination invariant face recognition. Commonly used filter sets with different characteristics and varying spatial support are tested in local appearance description. Then two different methods for quantizing the filter responses are compared. Finally, a method for selecting a subset of filters from a large filter bank is proposed.

2. Image descriptors

This paper discusses image descriptors that are based on estimating the distribution of local characteristics of the image. In the texture analysis literature, a variety of such local characteristics have been studied. The well-known co-occurrence matrix introduced by Haralick (1979) is based on the gray values of pixel pairs defined by a displacement vector. Another local characteristic computed directly from pixel gray values is the LBP label, which is computed from the gray-level differences of neighboring pixels. On the other hand, at the core of many texture descriptors is a filter bank or wavelet coefficient based description of local image appearance. In the following we take a closer look at the three image descriptors that are studied in this paper.

2.1. The local binary pattern operator

The local binary pattern operator (Ojala et al., 2002) is a powerful means of texture description.
The original version of the operator labels the pixels of an image by thresholding the 3 × 3 neighborhood of each pixel with the center value and summing the thresholded values weighted by powers of two. The histogram of the labels can then be used as a texture descriptor. See Fig. 1 for an illustration of the basic LBP operator. The operator can also be extended to use neighborhoods of different sizes (Ojala et al., 2002). Using circular neighborhoods and bilinearly interpolating the pixel values allows any radius and number of pixels in the neighborhood. For neighborhoods we will use the notation (P, R), which means P sampling points on a circle of radius R. See Fig. 2 for examples of different circular neighborhoods. Let us denote the center pixel value by g_c and the gray values of the P sampling points by g_1, g_2, ..., g_P. Now the generic LBP_{P,R} operator is defined as
LBP_{P,R} = \sum_{n=1}^{P} s(g_n - g_c)\, 2^{n-1},    (1)

where

s(z) = \begin{cases} 1, & z \geq 0, \\ 0, & z < 0. \end{cases}    (2)

Fig. 2. Three circular neighborhoods: (8, 1), (16, 2), (6, 1). The pixel values are bilinearly interpolated whenever the sampling point is not in the center of a pixel.
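For the (8, 1) neighborhood the eight sampling points coincide with pixel centers, so Eqs. (1) and (2) can be illustrated without the bilinear interpolation step. The following is a minimal sketch in plain Python, not the authors' implementation; the neighbor ordering is an arbitrary illustrative choice:

```python
# Sketch of the generic LBP operator of Eqs. (1)-(2), restricted to the
# (8, 1) neighborhood on an integer grid, where the eight sampling points
# coincide with pixel centers and no interpolation is needed.
def lbp_8_1(img, x, y):
    """Return the LBP code of pixel (x, y); img is a 2-D list of gray values."""
    gc = img[y][x]
    # Eight sampling points, ordered counter-clockwise starting from the right
    # (the ordering is an illustrative convention, not fixed by Eq. (1)).
    offsets = [(1, 0), (1, 1), (0, 1), (-1, 1),
               (-1, 0), (-1, -1), (0, -1), (1, -1)]
    code = 0
    for n, (dx, dy) in enumerate(offsets):
        gn = img[y + dy][x + dx]
        if gn - gc >= 0:          # s(g_n - g_c) of Eq. (2)
            code += 1 << n        # weight 2^(n-1) for n = 1, ..., 8
    return code
```

Because only the signs of the differences g_n − g_c enter the code, any monotonic gray-scale mapping of the image leaves the result unchanged.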
Fig. 1. The basic LBP operator: the 3 × 3 neighborhood is thresholded by the center value and the thresholded values, weighted by powers of two, sum to the LBP code 1 + 2 + 8 + 64 + 128 = 203.

Further extensions to the original operator are uniform and rotationally invariant binary patterns (Ojala et al., 2002). A local binary pattern is called uniform if the binary pattern contains at most two bitwise transitions from 0 to 1 or vice versa when the bit pattern is considered circular. For example, the patterns 00000000 (0 transitions), 01110000 (2 transitions) and 11001111 (2 transitions) are uniform, whereas the patterns 11001001 (4 transitions) and 01010011 (6 transitions) are not. In the computation of the LBP histogram, uniform patterns are used so that the histogram has a separate bin for every uniform pattern and all non-uniform patterns are assigned to a single bin. In the context of LBPs, rotational invariance is achieved by circularly rotating each bit pattern to its minimum value. For instance, the bit sequences 1000011, 1110000 and 0011100 arise from different rotations of the same local pattern, and they all correspond to the normalized sequence 0000111.

2.2. Maximum response 8 filters

The second descriptor considered here is the maximum response 8 (MR8) descriptor (Varma and Zisserman, 2005). At the core of the descriptor is a filter set consisting of 38 filters: two isotropic filters, a Gaussian and a Laplacian of Gaussian, both at scale σ = 10 pixels, and an edge and a bar filter, both at 3 scales (σ_x, σ_y) ∈ {(1, 3), (2, 6), (4, 12)} and 6 orientations. The filter kernels are shown in Fig. 3. After the image has been convolved with the filter bank, the maximum of the 6 responses at different orientations is computed for each oriented filter. This results in a total of 8 responses: 2 from the isotropic filters and 6 from the edge and bar filters at different scales. Finally, the response vector is labeled with the nearest codebook vector (texton) and the histogram of these labels is used to represent the texture. In the learning stage, the codebook is obtained by clustering a set of training samples with the k-means algorithm.

2.3. Gabor filters

Another type of filter kernel that is widely used in image description is the Gabor filter. The complex Gabor function can be defined as

g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y}\, e^{-(x^2/2\sigma_x^2 + y^2/2\sigma_y^2)}\, e^{2\pi j(ux + vy)},    (3)
in which σ_x and σ_y define the scale of the Gabor function and (u, v) defines the frequency of the complex sinusoid. Thus, the Gabor function is a product of an elliptical Gaussian and a complex plane wave. The typical way Gabor filters are applied in texture description is to convolve the input image with a bank of Gabor filters at different scales and frequencies and to compute a set of features from the output images. In texture description, the best known method applying Gabor filters is the one proposed by Manjunath and Ma (1996), in which a vector of means and standard deviations of Gabor filter responses is used for texture description. Another classic work, applying Gabor filters in face recognition, is the Elastic
Fig. 3. The MR8 filter bank. The filter bank consists of an edge and a bar filter both at 3 scales and 6 orientations and a Gaussian and a Laplacian of Gaussian filter.
Bunch Graph Matching method (Wiskott et al., 1997), which is based on Gabor filter bank responses at certain facial landmarks. In a recent work by Zou et al. (2007), Gabor filters and local binary patterns were compared in the face recognition task, and a face descriptor based on Gabor filter responses computed at points spaced one wavelength apart over the whole face area was developed. In that work, the Gabor filter-based descriptor was shown to produce better recognition rates than LBP, especially on difficult image sets containing lighting variation and aging of the subjects.

3. Framework for filter bank and vector quantization based texture descriptors

A widely used approach to texture analysis is to convolve an image with N different filters whose responses at a certain position (x, y) form an N-dimensional vector. At the learning stage, a set of such vectors is collected from training images and the set is clustered using, e.g., k-means to form a codebook. Then each pixel of a texture image is labeled with the label of the nearest cluster center, and the histogram of these labels over a texture image is used to describe the texture (Leung and Malik, 2001; Varma and Zisserman, 2005). More formally, let I(x, y) be the image to be described by the texture operator. Now the vector valued image I_f(x, y) is obtained by convolving the original image with the filter kernels F_1, F_2, ..., F_N:
I_f(x, y) = \begin{bmatrix} I_1(x, y) \\ I_2(x, y) \\ \vdots \\ I_N(x, y) \end{bmatrix}, \quad I_n(x, y) = I(x, y) * F_n,    (4)

The labeled image I_{lab}(x, y) is obtained with a vector quantizer f: \mathbb{R}^N \to \{0, 1, 2, \ldots, M - 1\}, where M is the number of different labels produced by the quantizer. Thus, the labeled image is

I_{lab}(x, y) = f(I_f(x, y))    (5)
and the histogram of labels is
H_i = \sum_{x,y} \delta\{i, I_{lab}(x, y)\}, \quad i = 0, \ldots, M - 1,    (6)

in which δ is the Kronecker delta

\delta\{i, j\} = \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases}    (7)

If the task is classification or categorization, as in this work, several possibilities exist for classifier selection. The most typical strategy is to use a nearest neighbor classifier with, e.g., the χ² distance to measure the distance between histograms (Leung and Malik, 2001; Varma and Zisserman, 2005). In (Varma and Zisserman, 2004), the nearest neighbor classifier was compared to Bayesian classification, but no significant difference in performance was found. In (Caputo et al., 2005) it was shown that the performance of a material categorization system can be enhanced by using a suitably trained support vector machine classifier. In this work, the main interest is not in classifier design but in the local descriptors, and thus the nearest neighbor classifier with the χ² distance was selected for the experimental part.

The following two subsections discuss in more detail the two parts that define an image descriptor in the proposed framework: the filter bank F_1, F_2, ..., F_N and the quantization function f.

3.1. Filter bank

In this paper we compare three different types of filter kernels that are commonly used in texture description. The first filter bank is a set of oriented derivative filters whose thresholded output is shown to be equivalent to the local binary pattern operator. The other two filter banks included in the comparison are Gabor filters and the maximum response 8 filter set.

A novel way to look at the LBP operator, proposed in this paper, is to see it as a special filter-based texture operator. The filters for implementing LBP are approximations of image derivatives computed at different orientations. The filter coefficients are chosen to be equal to the weights of bilinear interpolation of pixel values at the sampling points of the LBP operator, and the coefficient at the filter center is obtained by subtracting 1 from the center value. For example, the kernels shown in Fig. 4 can be used for a filter-based implementation of the local binary pattern operator in the circular (8, 1) neighborhood. The response of such a filter at location (x, y) gives the signed difference between the sampling point corresponding to the filter and the center pixel. These filters, which will be called local derivative filters in the following, can be constructed for any radius and any number of sampling points.

Fig. 4. Filters F1–F3 of the total of 8 local derivative filters in the (8, 1) neighborhood. The remaining 5 filters are obtained by mirroring the filters shown here.

Applying the maximum response 8 descriptor in this framework is straightforward. In the filter bank design we follow the
procedure of Varma and Zisserman (2005). This filter bank produces a 38-dimensional vector valued image. Selecting the maximum over orientations is handled in the vector quantizer and is described in more detail in the following section.

For Gabor filters, a lot of work has been devoted to designing the filter bank and feature computation methods (see, e.g., Manjunath and Ma, 1996; Clausi and Jernigan, 2000; Grigorescu et al., 2002). In this work we apply the Gabor filters in the proposed image description framework, which is to say that the responses of the filter bank at a certain position are stacked into a vector which is used as input for the vector quantizer. This resembles the Gabor filter-based face description suggested by Zhang et al. (2007), in which the Gabor filter responses are quantized and a histogram of them is then formed to encode a facial image. In the design of the filter bank, i.e. the selection of the scale and frequency parameters, the procedure of Manjunath and Ma (1996) is applied. Furthermore, at each chosen scale and frequency, the real and imaginary parts of the complex Gabor filter are treated as separate filters, so the total number of filters is 2 n_s n_f, in which n_s is the number of scales and n_f is the number of orientations at each scale.

3.2. Vector quantization

The assumption on which the proposed texture description framework is based is that the joint distribution of filter responses can be used to describe the image appearance. Depending on the size of the filter bank, the dimension of the vectors in the image I_f(x, y) can be high, and quantization of the vectors is needed for reliable estimation of the histogram. A simple, non-adaptive way to quantize the filter responses is to threshold them and to compute the sum of the thresholded values multiplied by powers of two:
I_{lab}(x, y) = \sum_{n=1}^{N} s_t\{I_n(x, y)\}\, 2^{n-1},    (8)

where s_t is the thresholding function

s_t\{z\} = \begin{cases} 1, & z \geq t, \\ 0, & z < t, \end{cases}    (9)
in which the parameter t is the threshold. Thresholding divides each dimension of the filter bank output into two bins. The total number of different labels produced by threshold quantization is 2^N, where N is the number of filters. If the threshold t = 0 and the coefficients of each of the filters in the bank have zero mean (i.e. they sum up to zero), the value of s_t{I_n(x, y)} and thus the value of I_{lab}(x, y) is not affected by affine gray-level changes I'(x, y) = aI(x, y) + b, a > 0: if the filter coefficients F_n(x, y) sum up to zero, then I'(x, y) * F_n = a(I(x, y) * F_n) and, assuming a > 0, the sign is not changed. Such global gray-level changes could easily be normalized away, but the invariance also holds locally for areas in which changes in pixel values, e.g., due to lighting changes, can be modeled as an affine gray-level change within the filter support area. Now, let us consider the case where the filter bank used to obtain the image I_f(x, y) is the set of local derivative filters (e.g., the filters presented in Fig. 4), designed so that the filter responses are equal to the signed differences, i.e. at each pixel location
I_n = g_n - g_c,

where I_n is the response of the n-th filter at a given location, and g_c and g_n are the gray values of the center pixel and the n-th sampling point (see Eq. (1)) at the same location. When the quantizer (8) with threshold t = 0 is applied to I_f, the resulting label I_{lab} is equal to that produced by the local binary pattern operator LBP_{P,R} using the same
neighborhood. Therefore, the LBP operator can be represented in the proposed framework. It should be noted that for LBPs, an even stronger invariance to gray-level changes holds than presented above. As discussed by Ojala et al. (2002), the LBP labels are invariant to any monotonic mapping of gray values. For these reasons, the choice of threshold t = 0 has been common, especially with the LBP operator. Still, under some circumstances a different choice may yield better results. For instance, when LBP features are used for background subtraction, choosing a non-zero threshold was observed to result in more stable codes in nearly flat gray areas and, consequently, increased performance (Heikkilä and Pietikäinen, 2006). Another method for quantizing the filter responses is to construct a codebook of them at the learning stage and then use the nearest codeword to represent the filter bank output at each location:
I_{lab}(x, y) = \arg\min_m \| I_f(x, y) - c_m \|,    (10)
in which c_m is the m-th vector (codeword) in the codebook. This approach is used in (Leung and Malik, 2001; Varma and Zisserman, 2005), which use k-means to construct the codebook whose elements are called textons. Codebook based quantization of the signed differences of neighboring pixels (which correspond to local derivative filter outputs) was presented in (Ojala et al., 2001). When comparing these two methods for quantizing the filter responses, one might expect that if the number of labels produced by the quantizers is kept roughly the same, the codebook based quantizer handles possible statistical dependencies between the filter responses better. On the other hand, since codebook based quantization requires a search for the closest codeword at each pixel location, it is clearly slower than simple thresholding, even though a number of both exact and approximate techniques have been proposed for finding the nearest codeword without exhaustive search through the codebook (Gray and Neuhoff, 1998, p. 2362; Chávez et al., 2001).

3.3. Rotational invariance

It is important to note that a clever co-design of the filter bank and the vector quantizer can also make the texture descriptor rotationally invariant. Again, two different strategies have been proposed. Rotationally invariant LBP codes are obtained by circularly shifting an LBP binary code to its minimum value (Ojala et al., 2002). In the joint framework this can be represented as further combining the labels of threshold quantization (8) so that all the different labels that can arise from rotations of the local gray pattern are joined into a single label. On the other hand, the approach chosen for the MR8 descriptor to achieve rotational invariance is to select only the maximum over the 6 different rotations of each bar and edge filter.
Only these maximum values and the responses of the two isotropic filters are used in further quantization, so the 8-dimensional response of the filter set is invariant to rotations of the gray pattern.

4. Experiments

The proposed framework and the relative descriptiveness of the different filter banks and vector quantization methods were systematically tested in two different application areas: material categorization and illumination invariant face recognition. Both are very challenging, unsolved problems, so they clearly highlight the performance differences between the operators. To test the proposed framework and to systematically explore the relative descriptiveness of the different filter banks and vector
Table 1
Properties of the tested filter kernels.

Filter bank               Size     Number of filters
Local derivative filters  3 × 3    8
Gabor(1, 4)               7 × 7    8
Gabor(4, 6)               49 × 49  48
MR8                       49 × 49  38
quantization methods, the challenging task of material categorization using the KTH-TIPS2 database (Mallikarjuna et al., 2006) was utilized. The widely used CMU PIE (Pose, Illumination, and Expression) database by Sim et al. (2003) was selected to serve as test material in the face recognition experiments. It is especially interesting how the different filter banks and vector quantizers respond to changes in lighting conditions. The images in the PIE database contain
systematic lighting variation, so it is very well suited for these experiments. The same filter banks were utilized in both experiments. The proposed framework allowed testing the performance of different filters and different quantization methods independently. The filter banks included in the tests were local derivative filters, two different banks of Gabor filters, and the MR8 filters. The local derivative filter bank was chosen to match the LBP_{8,1} operator, which resulted in 8 filters (see Fig. 4). Two very different types of Gabor filter banks were tested: one with only 1 scale and 4 orientations and small spatial support (7 × 7), and another with 4 scales and 6 orientations and larger spatial support. The properties of the tested filter kernels are listed in Table 1.

4.1. Material categorization

The KTH-TIPS2 database (Mallikarjuna et al., 2006) was utilized to test the performance of the descriptors in the material categorization
Fig. 5. Examples of images from the KTH-TIPS2 database. Figures in each column belong to the same texture category.
Fig. 6. Example images of 2 out of the 68 subjects in the CMU PIE database.
Fig. 7. (a) The KTH-TIPS2 categorization rates and (b) CMU PIE recognition rates for k-means based vector quantization of filter bank responses as a function of codebook size (16–256), for the local derivative, Gabor(4, 6), Gabor(1, 4) and MR8 filter banks.
Fig. 8. (a) The KTH-TIPS2 categorization rates and (b) CMU PIE recognition rates for thresholding quantization of filter bank responses as a function of the threshold value (−0.1 to 0.1), for the local derivative, Gabor(1, 4) and MR8 filter banks.
task. The database contains 4 samples of 11 different materials, each sample imaged at 9 different scales and under 12 lighting and pose setups, totaling 4752 images. Examples of texture images from the KTH-TIPS2 database are shown in Fig. 5. Caputo et al. performed material categorization tests using KTH-TIPS2 and considered especially the significance of classifier selection (Caputo et al., 2005). In that paper, the main conclusions were that state-of-the-art descriptors such as LBP and MR8 show relatively small differences in performance, but significant gains in classification rate can be obtained by using a support vector machine classifier instead of nearest neighbor. Moreover, the classification rates can be enhanced by increasing the number of samples used for training. In this work, the main interest is to examine the relative descriptiveness of different setups of the filter bank based texture descriptors. To facilitate this, we chose a very challenging test setup that resembles the most difficult setup used in (Caputo et al., 2005). Using each of the descriptors to be tested, a nearest neighbor classifier using the χ² distance was trained with one sample (i.e. 9 × 12 = 108 images) per material category. The remaining 3 × 9 × 12 = 324 images per category were used for testing. This was repeated with 10,000 random splits into training and testing data, and the mean and standard deviation over the splits were used to assess the performance.

4.2. Illumination invariant face recognition

To test the performance of the descriptors in illumination invariant face recognition, the CMU PIE database was used. In total, the database contains 41,368 images of 68 subjects taken from different angles, under different lighting conditions and with varying expression. For our experiments, we selected a set of 23 images of each of the 68 subjects. Two of these were taken with the room lights on and the remaining 21 each with a flash at a varying position.
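The classification protocol shared by both experiments, nearest neighbor over label histograms with the χ² distance, can be sketched as follows. This is a plain-Python illustration, not the authors' Matlab code, and since the paper does not write out its exact χ² variant, the common symmetric form χ²(H, K) = Σ_i (H_i − K_i)² / (H_i + K_i) is assumed:

```python
# Nearest neighbor classification of label histograms with the symmetric
# chi-square distance (an assumed variant; the paper does not spell it out).
def chi_square(h, k):
    """chi2(H, K) = sum_i (H_i - K_i)^2 / (H_i + K_i), skipping empty bins."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h, k) if a + b > 0)

def nearest_neighbor(query, train_hists, train_labels):
    """Return the class label of the training histogram closest to `query`."""
    dists = [chi_square(query, h) for h in train_hists]
    return train_labels[dists.index(min(dists))]
```

In the experiments this classifier is simply re-run over the 10,000 random training/testing splits and the resulting rates are averaged.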
Preprocessed example images from the database are shown in Fig. 6. In obtaining a descriptor for a facial image, the procedure of Ahonen et al. (2006) was followed. The faces were first normalized so that the eyes are at fixed positions. The selected filter bank and vector quantization method were then applied, and the resulting label image was cropped to a size of 128 × 128 pixels. Thus, for further analysis, the size of the vector valued image was the same irrespective of the filter kernel size. The labeled image was further divided into blocks of size 16 × 16 pixels, histograms were computed in each block individually, and the histograms were then concatenated to form the spatially enhanced histogram describing the face.
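The block-wise histogram step can be sketched as below, a minimal plain-Python illustration of the spatially enhanced histogram rather than the authors' implementation; `n_labels` and `block` correspond to the number of quantizer labels and the 16 × 16 block size above:

```python
# Sketch of the spatially enhanced histogram: the label image is divided
# into non-overlapping blocks, a label histogram is computed in each block,
# and the block histograms are concatenated into one feature vector.
def spatial_histogram(label_img, n_labels, block=16):
    rows, cols = len(label_img), len(label_img[0])
    feature = []
    for by in range(0, rows, block):
        for bx in range(0, cols, block):
            hist = [0] * n_labels
            for y in range(by, min(by + block, rows)):
                for x in range(bx, min(bx + block, cols)):
                    hist[label_img[y][x]] += 1
            feature.extend(hist)
    return feature
```

For a 128 × 128 label image, 16 × 16 blocks and M labels this yields a feature vector of length 64 M, which retains coarse spatial information that a single global histogram would lose.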
A nearest neighbor classifier with the χ² distance was utilized for classification. One image per person was used for training and the remaining 22 images for testing. Again, 10,000 random selections into training and testing data were used.

4.3. Codebook based vector quantization

All four filter banks were tested using two types of vector quantization: thresholding and codebook based quantization. For codebook based quantization, the selected approach was to aim for compact, universal texton codebooks, i.e. codebooks of rather small size that are not tailored to this specific set of textures or faces. Therefore, images from the CUReT texture database (Dana et al., 1999) were used to learn the codebooks for the texture categorization test, and images from the Yale B face database (Georghiades et al., 2001) for the face recognition test. The codebook sizes tested were 16–256 codewords. The texture categorization rates and face recognition rates as a function of codebook size, obtained with each filter bank and codebook based quantization, are plotted in Fig. 7a and b. Fig. 7a shows that, most of the time, using a larger codebook enhances the texture categorization rate, but the selection of the filter bank is a clearly more dominant factor than the codebook size. For example, the local derivative filters achieve a higher categorization rate with the smallest codebook size than the MR8 filters with any codebook size. The same applies to the face recognition results in Fig. 7b. Here it is seen that the Gabor(4, 6) filters with a large support area achieve better recognition rates than the Gabor(1, 4) and local derivative filters, and each of these performs better than the MR8 descriptor. A probable cause for this is the rotational invariance built into the MR8 descriptor, which might actually lose some useful information.

4.4. Thresholding based vector quantization

In the next experiment, the material categorization and face recognition tests were performed using the same filter banks but thresholding based vector quantization. The Gabor(4, 6) filter bank was omitted from this experiment due to the large number of filters in the bank (the resulting histograms would have been of length 2^48). Since the local derivative and Gabor filters have zero mean, the thresholding function (9) was applied directly with different choices of the threshold t. For the edge and bar filters in the MR8 filter set, only the maximum of the responses over different orientations is retained, and therefore in that case the mean of the 8-dimensional response
Fig. 9. The KTH-TIPS2 categorization rates with thresholding and codebook (256) based quantization for each filter bank.
Fig. 10. The CMU PIE recognition rates with thresholding and codebook (256) based quantization for each filter bank.
vectors over all the training images was computed and subtracted from the response before applying thresholding. Fig. 8 shows the texture categorization and face recognition rates as a function of the threshold value. Interestingly, the shapes of the curves are somewhat different in the two tasks. Non-zero threshold values give the best results in texture categorization (Fig. 8a) with the Gabor(1, 4) and local derivative filters, but in general the performance differences caused by changing the threshold are small. In face recognition, on the other hand, the situation is reversed: with all the filter sets, the best results are obtained with threshold t = 0. This is likely due to lighting effects since, as discussed in Section 3.2, using threshold t = 0 with filters having zero mean yields a descriptor that is invariant to affine changes of gray values. Both the texture and the face experiments contain lighting changes, but in the texture experiments these are compensated in part by the larger number of training images taken under different lighting conditions, whereas in the face recognition experiment there is only one training image per subject.

Table 2 and Figs. 9 and 10 show the texture categorization and face recognition rates using thresholding based quantization with t = 0 and codebook based quantization with codebook size 256. In texture categorization, codebook based quantization yields a slightly worse categorization rate than thresholding when using local derivative filters, but the difference is smaller than the standard deviation of the rates. With the Gabor(1, 4) filter bank, thresholding performs worse than codebook based quantization, but interestingly, with the MR8 filters thresholding yields a better rate. The local derivative filters give the best categorization rate over the tested filter sets with both quantization functions.
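The codebook quantizer of Eq. (10) being compared here can be sketched as follows; a plain-Python illustration with a hand-picked codebook standing in for one learned by k-means, not the authors' implementation:

```python
# Sketch of codebook (texton) quantization, Eq. (10): a filter response
# vector is replaced by the index of the nearest codeword in Euclidean
# distance. In the paper the codebook comes from k-means clustering of
# training responses; here a fixed toy codebook is assumed instead.
def quantize(vec, codebook):
    def dist2(c):
        return sum((v - w) ** 2 for v, w in zip(vec, c))
    return min(range(len(codebook)), key=lambda m: dist2(codebook[m]))
```

The speed gap reported above follows directly: this quantizer costs one distance evaluation per codeword per pixel, whereas the threshold quantizer of Eq. (8) needs only N sign tests per pixel.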
The results obtained in these experiments differ slightly from those presented in (Ahonen and Pietikäinen, 2008) and those presented earlier by Ojala et al. (2001), which showed that codebook-based quantization of signed gray-level differences yields slightly better recognition than LBPs, although at the cost of higher computational complexity. We believe this is due to the different training data for the k-means algorithm. Our earlier experiments suggested that codebook-based quantization might perform slightly better than thresholding, but in those experiments the material used to learn the codebook overlapped with the testing material. In the present experiments the training and testing data sets were completely separate, so it may be that the codebook learned from the training images does not describe the test images well enough. Moreover, the images used for testing by Ojala et al. (2001) had less variation in lighting conditions than the KTH-TIPS2 images, so the robustness of thresholding to gray-level variations discussed in Section 3.2 also explains these results.
In illumination-invariant face recognition, the Gabor filters show better performance than the local derivative filters. With all the filter banks, thresholding-based quantization yields better recognition rates than codebooks; here the performance gain of replacing codebook-based quantization with thresholding is apparent. Again, this somewhat surprising result is probably explained to some degree by the invariance of thresholding-based quantization to gray-level changes.
Considering the computational cost of the presented methods, thresholding-based quantization is much faster than codebook-based quantization. As for the filter bank operations, the computational cost grows with the size and number of filters, but FFT-based convolution can speed them up. Still, at the two extremes, local derivative filtering and thresholding-based labeling of an image of size 256 × 256 take 0.04 s, whereas codebook-based labeling of the same image using the Gabor(4, 6) filters (performing the convolutions with FFT) takes 10.98 s. Both running times were measured using unoptimized Matlab implementations of the methods on a PC with an AMD Athlon 2200 MHz processor.
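For comparison, codebook-based labeling assigns each pixel's response vector to its nearest codeword. A minimal sketch follows; in practice the codebook would be learned with k-means on training responses, and all names here are illustrative:

```python
import numpy as np

def codebook_labels(responses, codebook):
    """Assign each pixel's N-dim response vector to its nearest codeword.
    responses: (N, H, W) filter response stack; codebook: (K, N) codewords."""
    n, h, w = responses.shape
    vecs = responses.reshape(n, -1).T  # (H*W, N) response vectors
    # Squared Euclidean distance from every pixel to every codeword: (H*W, K)
    d2 = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).reshape(h, w)  # label image of shape (H, W)
```

Each pixel costs K distance evaluations in N dimensions here, against a single comparison per filter for thresholding, which is consistent with the timing gap reported above.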
4.5. Filter subset selection

Table 2
Recognition rates for different filter banks and quantization methods.

                  Local derivative filters  Gabor(1, 4)  Gabor(4, 6)  MR8 filters
Texture, CB       0.521                     0.458        0.447        0.455
Texture, thresh.  0.528                     0.366        –            0.492
Face, CB          0.482                     0.502        0.704        0.204
Face, thresh.     0.742                     0.883        –            0.348
The third experiment tested whether it is possible to select a representative subset of filters from a large filter bank for thresholding-based quantization. The number of labels produced by the quantizer is 2^N, where N is the number of filters, so the length of the label histograms grows exponentially with the number of filters. A small filter bank is therefore desirable for thresholding quantization.
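The exponential growth is easy to make concrete; the figures below follow directly from 2^N and are purely illustrative:

```python
# Label histogram length is 2**N for N thresholded filters.
hist_len = {n: 2 ** n for n in (8, 16, 48)}
# 8 filters give a manageable 256-bin histogram, whereas the 48 filters of
# the Gabor(4, 6) bank would need 2**48 bins, which presumably explains why
# thresholding is not reported for that bank in Table 2.
```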
In this experiment, the sequential floating forward selection (SFFS) algorithm (Pudil et al., 1994) was used to select a maximum of 8 filters from a larger filter bank. The optimization criterion was the recognition rate on the training set (KTH-TIPS1). Two different initial filter banks were tested. First, 8 filters were selected from the 48 filters of the Gabor(4, 6) filter bank. However, the resulting 8-filter bank did not perform well on the testing database, yielding a categorization rate of only 0.295. The same experiment was done for the face recognition problem, using the Yale B images as training data for SFFS. Again, 8 filters were selected from the full Gabor(4, 6) filter bank, using the recognition rate on the Yale B set as the optimization criterion, and the recognition rate on the test set, the CMU PIE dataset, was then recorded. Here the filter subset selection performed superbly, achieving a recognition rate of 1.000. In the face recognition literature, there are findings that LBP and Gabor filter-based information are complementary: in (Zhang et al., 2005), LBP histograms were extracted from Gabor-filtered images, and in (Yan et al., 2007), score-level fusion of LBP and Gabor filter-based similarity scores was performed. Motivated by these findings, SFFS was used to select 8 filters from the union of the local derivative and Gabor(1, 4) filter banks. This resulted in a set of 6 local derivative and 2 Gabor filters, and the resulting filter bank reached a categorization rate of 0.544, which is significantly higher than the rate of the Gabor(1, 4) filter bank and slightly higher than that of the local derivative filter bank. Unfortunately, we were not able to achieve performance gains in face recognition by combining the two types of filters.
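The SFFS loop used here can be sketched in simplified form. In the experiments the criterion would be the training-set recognition rate; the `score` callable and all names below are illustrative assumptions:

```python
def sffs(candidates, score, max_size=8):
    """Simplified sequential floating forward selection (Pudil et al., 1994).
    candidates: filter indices; score: criterion over a tuple of indices."""
    selected, best = [], {}
    while len(selected) < max_size:
        # Forward step: add the candidate that maximizes the criterion.
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        add = max(remaining, key=lambda c: score(tuple(selected + [c])))
        selected = selected + [add]
        best[len(selected)] = (score(tuple(selected)), list(selected))
        # Floating backward step: drop the least significant filter while
        # doing so improves on the best known subset of the smaller size.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda c: score(tuple(s for s in selected if s != c)))
            reduced = [s for s in selected if s != drop]
            if score(tuple(reduced)) > best[len(reduced)][0]:
                selected = reduced
                best[len(selected)] = (score(tuple(selected)), list(selected))
            else:
                break
    return max(best.values())[1]  # best-scoring subset found
```

The floating backward step is what distinguishes SFFS from plain forward selection: it lets the algorithm undo earlier greedy choices when a smaller subset turns out to score higher.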
5. Discussion and conclusion

In this paper we have presented a novel unified framework under which histogram-based image description methods such as the well-known local binary pattern and MR8 descriptors can be explained and analyzed. Even though this is still far from a complete unified theory of statistical image description, the framework makes the differences and similarities between the methods apparent. Moreover, the presented framework allows for systematic comparison of different descriptors and the parts they are built of. Such an analytic approach can be useful in analyzing texture descriptors, as they are usually presented in the literature as a sequence of steps whose relation to other description methods is unclear. The framework presented in this work allows for explicitly illustrating the connection between the parts of the LBP and MR8 descriptors and experimenting with the performance of each part.
The filter sets and vector quantization techniques of the LBP, MR8 and Gabor filter-based texture descriptors were compared in this paper. In this comparison it was found that the local derivative filter responses are both the fastest to compute and the most descriptive in the texture categorization task. This somewhat surprising result further supports the previous findings that texture descriptors relying on small-scale pixel relations yield comparable or even superior results to those based on filters of larger spatial support (Ojala et al., 2002; Varma and Zisserman, 2003). On the other hand, in face recognition, the Gabor filters showed better performance than the local derivatives or the MR8 descriptor. It seems that larger spatial support is beneficial in face description, as Gabor(4, 6), having the largest spatial support of 49 × 49 pixels, performed better than Gabor(1, 4) (7 × 7 pixels), and the local derivatives with the smallest support (3 × 3 pixels) gave the worst performance of the three.
Furthermore, there is evidence that Gabor filters suppress the effects of lighting variation in facial images and for this reason they have even been used as preprocessing for the LBP descriptor (Zhang et al., 2005).
When comparing the different vector quantization methods, the experimental results show that thresholding is the faster and, in most cases, also the more descriptive quantization method compared to codebooks. The likely explanation is the robustness of thresholding with t = 0 to illumination changes. This is especially apparent in the face description experiments, where the performance drops when the threshold t = 0 is replaced with a different value or with codebook quantization.
Finally, the experiments on filter subset selection showed that the local derivative and Gabor filter sets may be complementary and may yield better performance than either set alone in texture description. In face recognition, a significant performance gain was achieved by selecting a subset of the Gabor(4, 6) filters and applying it to thresholding-based description.
Not only does the presented framework contribute to the understanding and comparison of existing texture descriptors, but it can also be utilized for more systematic development of new, even better performing methods. The framework is simple to implement, and together with the publicly available KTH-TIPS2 and CMU PIE image databases it can easily be used for comparing novel descriptors with the current state-of-the-art methods. We believe that further advances in both filter bank and vector quantizer design are possible, especially as new invariance properties of the descriptors are aimed for.

References

Ahonen, T., Pietikäinen, M., 2008. A framework for analyzing texture descriptors. In: Third Internat. Conf. on Computer Vision Theory and Applications (VISAPP 2008), pp. 1:507–1:512.
Ahonen, T., Hadid, A., Pietikäinen, M., 2006. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Machine Intell. 28 (12), 2037–2041.
Caputo, B., Hayman, E., Mallikarjuna, P., 2005. Class-specific material categorisation. In: Proc. 10th IEEE Internat. Conf. on Computer Vision (ICCV 05), vol. 2, pp. 1597–1604.
Chávez, E., Navarro, G., Baeza-Yates, R.A., Marroquín, J.L., 2001. Searching in metric spaces. ACM Comput. Surv. 33 (3), 273–321.
Clausi, D.A., Jernigan, M.E., 2000. Designing Gabor filters for optimal texture separability. Pattern Recognition 33 (11), 1835–1849.
Dana, K.J., van Ginneken, B., Nayar, S.K., Koenderink, J.J., 1999. Reflectance and texture of real-world surfaces. ACM Trans. Graphics 18 (1), 1–34.
Georghiades, A., Belhumeur, P., Kriegman, D., 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Machine Intell. 23 (6), 643–660.
Gray, R.M., Neuhoff, D.L., 1998. Quantization. IEEE Trans. Inform. Theory 44 (6), 2325–2383.
Grigorescu, S.E., Petkov, N., Kruizinga, P., 2002. Comparison of texture features based on Gabor filters. IEEE Trans. Image Process. 11 (10), 1160–1167.
Haralick, R.M., 1979. Statistical and structural approaches to texture. Proc. IEEE 67 (5), 786–804.
Heikkilä, M., Pietikäinen, M., 2006. A texture-based method for modeling the background and detecting moving objects. IEEE Trans. Pattern Anal. Machine Intell. 28 (4), 657–662.
Leung, T., Malik, J., 2001. Representing and recognizing the visual appearance of materials using three-dimensional textons. Internat. J. Comput. Vision 43 (1), 29–44.
Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. Internat. J. Comput. Vision 60 (2), 91–110.
Mallikarjuna, P., Fritz, M., Targhi, A.T., Hayman, E., Caputo, B., Eklundh, J.-O., 2006. The KTH-TIPS and KTH-TIPS2 Databases.
Manjunath, B.S., Ma, W.-Y., 1996. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Machine Intell. 18 (8), 837–842.
Ojala, T., Valkealahti, K., Oja, E., Pietikäinen, M., 2001. Texture discrimination with multidimensional distributions of signed gray-level differences. Pattern Recognition 34 (3), 727–739.
Ojala, T., Pietikäinen, M., Mäenpää, T., 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Machine Intell. 24 (7), 971–987.
Pudil, P., Novovicová, J., Kittler, J., 1994. Floating search methods in feature selection. Pattern Recognition Lett. 15 (10), 1119–1125.
Schiele, B., Crowley, J.L., 2000. Recognition without correspondence using multidimensional receptive field histograms. Internat. J. Comput. Vision 36 (1), 31–50.
Sim, T., Baker, S., Bsat, M., 2003. The CMU pose, illumination, and expression database. IEEE Trans. Pattern Anal. Machine Intell. 25 (12), 1615–1618.
The Local Binary Pattern Bibliography, 2008.
Tuceryan, M., Jain, A.K., 1998. Texture analysis. In: Chen, C.H., Pau, L.F., Wang, P.S.P. (Eds.), The Handbook of Pattern Recognition and Computer Vision, second ed. World Scientific Publishing Co., pp. 207–248.
Varma, M., Zisserman, A., 2003. Texture classification: Are filter banks necessary? In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2003), vol. 2, pp. 691–698.
Varma, M., Zisserman, A., 2004. Unifying statistical texture classification frameworks. Image Vision Comput. 22 (14), 1175–1183.
Varma, M., Zisserman, A., 2005. A statistical approach to texture classification from single images. Internat. J. Comput. Vision 62 (1–2), 61–81.
Wiskott, L., Fellous, J.-M., Kuiger, N., von der Malsburg, C., 1997. Face recognition by elastic bunch graph matching. IEEE Trans. Pattern Anal. Machine Intell. 19, 775–779.
Yan, S., Wang, H., Tang, X., Huang, T.S., 2007. Exploring feature descriptors for face recognition. In: Internat. Conf. on Acoustics Speech and Signal Processing (ICASSP 2007), Honolulu, Hawaii, USA. IEEE Signal Processing Society, pp. I:629–I:632.
Zhang, W., Shan, S., Gao, W., Chen, X., Zhang, H., 2005. Local Gabor binary pattern histogram sequence (LGBPHS): A novel non-statistical model for face representation and recognition. In: Proc. 10th IEEE Internat. Conf. on Computer Vision (ICCV 05), vol. 1, pp. 786–791.
Zhang, B., Shan, S., Chen, X., Gao, W., 2007. Histogram of Gabor phase patterns (HGPP): A novel object representation approach for face recognition. IEEE Trans. Image Process. 16 (1), 57–68.
Zou, J., Ji, Q., Nagy, G., 2007. A comparative study of local matching approach for face recognition. IEEE Trans. Image Process. 16 (10), 2617–2628.