Optik 126 (2015) 5733–5739
Individualized boosting learning for classification
Zizhu Fan∗, Ming Ni
School of Basic Science, East China Jiaotong University, Nanchang, China
Article info

Article history: Received 24 August 2014; accepted 24 August 2015.

Keywords: Boosting learning; Individualized boosting learning (IBL); Learning region; Face recognition
Abstract

Boosting is a very popular learning approach in pattern recognition and machine learning. Current boosting algorithms generally aim to obtain a number of classifiers by learning the training samples. They work well in learning settings where the distributions of the training and testing samples are identical. However, typical boosting algorithms pay no attention to learning the test samples, and they may not deal well with large-scale, high-dimensional data sets in which the distributions of the training and testing samples tend not to be identical. In order to deal well with such data sets, for example face image data sets, we investigate boosting algorithms from a new perspective and propose a novel boosting learning algorithm, termed individualized boosting learning (IBL), in this paper. The proposed IBL algorithm focuses on learning both the training and testing samples. For each test sample, IBL determines a part of the training set, referred to as the learning region, within which it performs the boosting algorithm and classifies the test sample. Experiments on several popular real-world data sets show that the proposed IBL algorithm achieves desirable recognition performance. © 2015 Elsevier GmbH. All rights reserved.
1. Introduction

Boosting has become a widely used learning approach since Freund and Schapire proposed the first practical boosting algorithm, AdaBoost [1–4]. Boosting builds a strong classifier by linearly combining a set of weak classifiers or base hypotheses [5]. The typical boosting algorithm and its variants have two main properties. The first is that they learn a predictor or classifier by combining a number of weak classifiers or base learners. The second is that they employ a sample reweighting scheme to emphasize data points that are difficult to classify [6,7]. In recent years, researchers have proposed a number of approaches to modify and improve boosting algorithms. G. Ratsch and M.K. Warmuth pointed out that AdaBoost has not been proven to maximize the margin of the final hypothesis, and proposed an efficient algorithm that produces a final hypothesis with maximized margin [8]. In general, boosting-like learning is not suitable for a strong and stable learner. Nevertheless, J. Lu broke this limitation and developed a boosting algorithm based on linear discriminant analysis (LDA) [9], which was applied to face recognition [10]. By using a tiny amount of new data and a large amount of old data, W. Dai presented a novel transfer learning framework, TrAdaBoost, which extends boosting-based learning
∗ Corresponding author. Tel.: +86 13970993650. E-mail address:
[email protected] (Z. Fan). http://dx.doi.org/10.1016/j.ijleo.2015.08.153 0030-4026/© 2015 Elsevier GmbH. All rights reserved.
algorithms [11]. Based on the same basic idea, Y. Yao proposed TaskTrAdaBoost, a boosting framework for transferring knowledge from multiple sources [12]. D. Masip used the multitask learning principle and developed boosted online learning for face recognition [13]. In order to train a supervised learning algorithm with a limited number of labeled samples and many unlabeled samples, P. Kumar proposed a boosting framework for semi-supervised learning, referred to as SemiBoost, which can improve any supervised learning algorithm with large amounts of unlabeled data [14]. S. Chen presented ranked minority oversampling in boosting (RAMOBoost), an ensemble learning technique based on the idea of adaptive synthetic data generation [15]. C. Shen investigated the dual problems of AdaBoost, LogitBoost [7], and LPBoost [16], and developed column-generation-based optimization algorithms, which make it possible to build the ensemble with fewer weak classifiers [17]. H. Masnadi-Shirazi proposed a novel framework of cost-sensitive boosting and discussed two necessary conditions, concerning expected losses and empirical loss minimization, for optimal cost-sensitive learning [18]. In [19], a boosting k-NN algorithm for the categorization of natural scenes was proposed; this approach builds a strong classifier by linearly combining predictions from the k closest prototypes via minimizing a surrogate risk. C. Shan used the intensity differences between pixels in grayscale face images as features and adopted AdaBoost to build a strong classifier over these differences, which was applied to smile detection [20]. In order to avoid solving the complicated joint optimization of kernels and classifiers in multiple kernel learning (MKL), H. Xia proposed a
framework of multiple kernel boosting, MKBoost [21], and its variants. MKBoost is expected to be more effective and efficient than conventional MKL approaches. Note that all the typical boosting algorithms exploit only one learning model derived from the whole training set, which implies that all the test samples (whatever their distribution) are classified by this unique and common learning model. These algorithms work well if the data structure of the training and testing samples is consistent. It is worth noting that all the above boosting algorithms pay attention to how to learn the training set, and they do not take into account the distribution of the test samples. When dealing with large-scale and high-dimensional data sets in which the data structure of the training and testing samples tends to be inconsistent, these algorithms may not achieve desirable classification performance. To address this problem of typical boosting learning, we investigate boosting algorithms from a new perspective in this paper. Similar to other typical learning approaches, such as principal component analysis (PCA) [22] and LDA, typical boosting learning does not consider the distribution information of the test samples in the training stage. It exploits the whole training set to generate a unique learning model, referred to as the universal learning model, for all the test samples. In other words, all the test samples share the universal learning model in typical boosting algorithms. These boosting algorithms cannot build different learning models for different individual test samples. Given a test sample, the universal learning model may not be appropriate for learning it. By contrast, if we employ a number of training samples that are close or similar to this test sample to build the learning model, the model is more likely to classify this test sample correctly.
Clearly, if we can build appropriate learning models for individual test samples, then these models can achieve better learning performance than the universal learning model, particularly when the data set is heterogeneous, or the data structures of the training and testing samples are inconsistent. Actually, the individualized learning scheme, that is, learning the samples one by one, is currently popular in the machine learning community. For example, sparse representation for classification (SRC) [23] can be viewed as a typical individualized learning approach. Compared with the universal model, an individualized learning model can improve classification performance since it builds a model for each test sample, and this model can capture the distribution property of the test sample. Besides SRC, there are many other individualized learning approaches, such as linear regression for classification (LRC) [24], two-phase test sample sparse representation (TPTSR) [25], and collaborative representation for classification (CRC) [26]. We note that TPTSR exploits a simple and efficient way to obtain the sparse representation of the test sample and still achieves desirable classification performance. In this work, we seek to build an appropriate boosting learning model for each individual test sample. Our approach is based on the following assumption: among all the training samples, there are a number of training samples that are very helpful for learning a given test sample. In general, these training samples come from the class to which the test sample belongs, or they are close or similar to the test sample. Here, these training samples are termed positive training samples. On the other hand, there may be some training samples that are not helpful for learning the test sample, for example, noisy samples. These training samples are referred to as nonpositive training samples.
For each test sample, our approach tries to pick out the positive training samples and simultaneously discard the nonpositive training samples from the whole training set. The main procedure of our approach is as follows: for a test sample, we first determine its positive training samples, that is, the training samples that are close or similar to it. These positive training samples compose a special region, which is referred to as the learning region.
Then, we build a boosting learning model for the test sample within this learning region. The typical boosting algorithms focus on learning the training set. Our algorithm focuses not only on learning the training samples, but also on learning the distribution information of the test samples. Actually, the proposed algorithm aims to capture the distribution information of the test samples by determining the learning regions for individual test samples. Our approach learns and classifies the test samples one by one, and is referred to as individualized boosting learning (IBL). Compared with typical boosting learning, IBL is theoretically well suited to learning high-dimensional and large-scale data sets, such as image data sets, particularly when the distributions of the training and testing images are inconsistent, or the image data are heterogeneous. We apply IBL to face and handwritten digit recognition in this work. A number of experiments on popular real-world data sets show that IBL achieves desirable recognition performance. The rest of this paper is organized as follows. Section 2 reviews the typical boosting learning algorithm for multi-class classification problems. Section 3 introduces our IBL algorithm. Section 4 reports experimental results on popular data sets. We offer conclusions in Section 5.

2. Review of the boosting learning

The famous AdaBoost (AB) algorithm was first proposed by Freund and Schapire. The goal of the AB algorithm is to construct a classifier by linearly combining a number of base learners or classifiers. The classification performance of the constructed classifier is required to be much better than that of each individual base classifier. In the construction of the final classifier, AB focuses on how to correctly learn the training samples, particularly the hardest training samples.
To this end, AB adopts a sample weight updating scheme: it assigns larger weights to the training samples that are incorrectly classified [27]. The AB algorithm and its variants can be applied to both binary and multi-class classification [1]. In order to deal with multi-class classification problems, for example, face recognition, J. Lu et al. proposed a variant of the AB algorithm, referred to as boosting LDA (BLDA) [10], which uses an LDA-style learner as the base learner. Unlike most boosting algorithms, such as TrAdaBoost mentioned in Section 1, BLDA can directly deal with multi-class problems. The BLDA algorithm is shown in Algorithm 1 (for details, please refer to [10]).

Algorithm 1. The BLDA algorithm (J. Lu et al.)

Input: training samples Z = {(z_ij, y_ij)}, j = 1, ..., L_i, i = 1, ..., c, with labels y_ij = i ∈ Y, where Y = {1, ..., c}; an LDA-style learner; and the number of iterations T. Let B = {(z_ij, y) : z_ij ∈ Z, y ∈ Y, y ≠ y_ij}.
Initialize: the mislabel distribution over B, D_1(z_ij, y) = 1/|B| = 1/(N(c − 1)), where N is the total number of training samples.
Do for t = 1, 2, ..., T:
1. Update the pseudo sample distribution D̂_t from D_t.
2. If t = 1, randomly choose r samples per class to form a learning set R_1; else choose the r hardest samples per class based on D̂_t to form the learning set R_t.
3. Train an LDA-style feature extractor and build a gClassifier h_t.
4. Calculate the pseudo-loss produced by h_t:
   e_t = (1/2) Σ_{(z_ij, y) ∈ B} D_t(z_ij, y) (1 − h_t(z_ij, y_ij) + h_t(z_ij, y)).
5. Set β_t = e_t/(1 − e_t). If β_t = 0, then set T = t − 1 and abort the loop.
6. Update the mislabel distribution:
   D_{t+1}(z_ij, y) = D_t(z_ij, y) · β_t^{(1 + h_t(z_ij, y_ij) − h_t(z_ij, y))/2}.
7. Normalize D_{t+1} so that it is a distribution.
Output the final composite gClassifier:
   h_f(z) = arg max_{y ∈ Y} Σ_{t=1}^{T} (log(1/β_t)) h_t(z, y).
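As a concrete illustration, the reweighting core of Algorithm 1 (steps 4–6: pseudo-loss, β_t, and the mislabel-distribution update) can be sketched as follows. This is a minimal sketch, not the paper's implementation: the LDA-style learner is abstracted away, and h_t is assumed to be given as a matrix of class-confidence scores h_t(z, y) ∈ [0, 1]; the function and variable names are ours.

```python
import numpy as np

def blda_round(D, h_scores, true_idx):
    """One reweighting round of Algorithm 1 (steps 4-6), sketched.

    D        : mislabel distribution over B, shape (N, c), with
               D[i, true_idx[i]] = 0 (pairs (z, y_true) are not in B)
    h_scores : class-confidence scores h_t(z_i, y) in [0, 1], shape (N, c)
    true_idx : index of the true label of each sample, shape (N,)
    """
    N, c = D.shape
    rows = np.arange(N)
    h_true = h_scores[rows, true_idx]                 # h_t(z_ij, y_ij)
    # Step 4: pseudo-loss e_t = (1/2) * sum_B D * (1 - h_true + h_y)
    e = 0.5 * np.sum(D * (1.0 - h_true[:, None] + h_scores))
    # Step 5: beta_t = e_t / (1 - e_t)
    beta = e / (1.0 - e)
    # Step 6: exponent (1 + h_true - h_y)/2, then multiplicative update
    expo = (1.0 + h_true[:, None] - h_scores) / 2.0
    D_new = D * beta ** expo
    D_new[rows, true_idx] = 0.0                       # keep (z, y_true) pairs at zero
    D_new /= D_new.sum()                              # Step 7: renormalize
    return D_new, beta, e
```

Note that with e_t < 1/2 we have β_t < 1, so mislabel pairs that the current learner handles well (large exponent) are shrunk more, and the hard pairs retain relatively more weight in D_{t+1}.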
The BLDA algorithm is based on the multi-class version of the AB algorithm. It trains an LDA-style base learner on the whole training set to obtain the final classifier, and it has been effectively applied to face recognition. Like the other AB algorithms, the BLDA algorithm focuses only on how to learn the training set; it does not take into account the distribution information of the test samples. Its classification performance can therefore be further improved in face recognition and other large-scale and high-dimensional applications. To this end, we propose the IBL algorithm, built on BLDA, to perform more effective face recognition and other applications. Our algorithm is presented in the following section.
3. Individualized boosting learning (IBL) algorithm

In this section, we propose the individualized boosting learning (IBL) algorithm, which is based on BLDA. IBL contains two main steps. First, for each individual test sample, IBL exploits similarity measures to determine a number of positive training samples constituting the learning region. Second, IBL uses BLDA to construct a learning model within the learning region and then classifies the individual test sample. We first introduce how to determine the learning regions by applying similarity measures.

3.1. Learning region

In the IBL algorithm, we exploit three similarity measures to determine the training samples that are similar to the test samples. Note that many previous works use only a single similarity measure, such as the Euclidean distance or the representation residual [25,28], to determine the similar samples. Although computationally simple, a single similarity measure may not be suitable for determining the similar samples of a specified sample. For example, on the ORL face data set [29], we use the Euclidean distance to determine the first five nearest neighbors of a test sample, as shown in Fig. 1. In this figure, the test image is from Class 19, whereas its first five nearest neighbors are not from this class. Therefore, if we use these neighbors to build the boosting learning model, the model clearly cannot correctly learn and classify the test image. In this case, the Euclidean distance is not a suitable similarity measure for determining the five neighbors of this test image, and other similarity measures are needed to correctly determine its nearest neighbors.

Therefore, in order to effectively address this problem of the single similarity measure, we use three similarity measures, namely the Euclidean distance, the cosine distance, and the representation residual, to determine the similar samples composing the learning region of a given sample. They are introduced as follows.

Fig. 1. A test sample and its five nearest neighbors obtained by using the Euclidean distance. (a) The test image. (b) The five nearest neighbors of the test image obtained by using the Euclidean distance.

Before determining the learning region for a test sample, we need to specify a value K that indicates the number of samples most similar to the test sample in terms of each similarity measure. Suppose that the training set is denoted by X = [x1, x2, ..., xL], where xi ∈ R^D (i = 1, 2, ..., L) is the ith training sample and D is the sample dimensionality, and that y ∈ R^D is a test sample. In the following, we use the three similarity measures to determine three similarity sets for the test sample. The first similarity set consists of the K nearest neighbors of the test sample, determined by the popular Euclidean distance based on the L2-norm, which can be viewed as a similarity measure: a small distance between samples indicates a large similarity. We determine the second similarity set by employing the representation method newly proposed in [25]. Suppose that there is a vector β = [b1, b2, ..., bL]^T that satisfies the following equation:

y = Xβ.  (1)

If X is nonsingular, we can solve for β as follows:

β = X^(-1) y.  (2)

Otherwise, we can solve for it by employing

β = (X^T X + μI)^(-1) X^T y,  (3)

where μ is a small positive constant (in this work, μ is set to 0.01) and I is the identity matrix. After obtaining β, we compute the following representation residual for each individual training sample:

re_j = ||y − b_j x_j||²,  j = 1, 2, ..., L.  (4)
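Equations (3) and (4) can be sketched in code as follows. This is a minimal sketch; the function name is ours, and μ denotes the small positive regularization constant set to 0.01 in the text.

```python
import numpy as np

def representation_residuals(X, y, mu=0.01):
    """Solve beta = (X^T X + mu*I)^(-1) X^T y (Eq. (3)) and return the
    residual re_j = ||y - b_j * x_j||^2 of each training sample (Eq. (4)).

    X  : (D, L) matrix whose columns x_j are the training samples
    y  : (D,) test sample
    mu : small positive regularization constant
    """
    L = X.shape[1]
    beta = np.linalg.solve(X.T @ X + mu * np.eye(L), X.T @ y)
    residuals = np.array([np.sum((y - beta[j] * X[:, j]) ** 2)
                          for j in range(L)])
    return beta, residuals
```

The K training samples with the smallest residuals, e.g. `np.argsort(residuals)[:K]`, then form the second similarity set.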
Also, the representation residual can be viewed as a similarity measure [25]. When applied to different algorithms, this similarity measure enables the consequent classification procedure to obtain good results [30,31]. If a training sample yields a smaller residual when it is used to represent the test sample, then this training sample is more similar to the test sample. Thus, we choose the K training samples corresponding to the K smallest representation residuals to generate the second similarity set. The third similarity set is determined by using the cosine distance. Given two samples x_i and x_j, the cosine distance between them is defined as

cos(x_i, x_j) = (x_i^T x_j) / (||x_i|| ||x_j||).  (5)
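Equation (5), together with the selection of the K most similar training samples, can be sketched as follows (the function names are ours):

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Eq. (5): cos(x_i, x_j) = x_i^T x_j / (||x_i|| * ||x_j||)."""
    return float(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

def cosine_neighbors(X, y, K):
    """Indices of the K columns of X most similar to y under Eq. (5)."""
    sims = np.array([cosine_similarity(X[:, j], y)
                     for j in range(X.shape[1])])
    return np.argsort(-sims)[:K]
```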
Therefore, for a given test sample y, we choose the K training samples that are most similar to the test sample in terms of the cosine distance to yield the third similarity set. Among the three similarity measures we employ, the first can evaluate the correlation between samples. The second evaluates the similarity between samples from the viewpoint of representation, and is essentially an extension of the Euclidean distance. Unlike the Euclidean distance, the cosine distance focuses on the difference in the orientations of sample vectors. These similarity measures can thus capture three types of information from the data set. After obtaining the above three similarity sets, we can determine a learning region for the test sample. We compute the similarity between the test sample and the individual classes that contain samples from the similarity sets. Let the total number of samples in the three similarity sets be H = 3K, and let these samples belong to l classes {C_1, C_2, ..., C_l}. Assume that among the three similarity sets, there are n_s samples belonging to class C_s (s = 1, 2, ..., l). Then, the similarity between the test sample and class C_s is defined
as S_s = n_s/H. Clearly, a large S_s indicates a high similarity between the test sample and class C_s. Thus, we obtain a similarity value set S = {S_1, S_2, ..., S_l}, in which the values are sorted in descending order. It is possible for all values in S to be identical in particular cases, for example, when the samples of the similarity sets are all from different classes; we can properly specify the number of nearest neighbors K in the construction of the similarity sets to avoid this case, which rarely occurs in practice. We consider that the samples corresponding to the smallest values in S are usually not helpful for learning and classifying the test sample; in some cases, they may be noises or outliers within the data. We discard these samples from the three similarity sets and use the remaining training samples to constitute a new set, the learning region, which is used to build the learning model. For the test sample shown in Fig. 1a, Fig. 2 shows the two similarity sets, that is, the two sets of nearest neighbors obtained by using the representation residual and the cosine distance, respectively, as well as the final learning region. Fig. 2a shows the first five nearest neighbors of the test sample obtained by using the representation residual. We observe that the label of the test sample is included in the labels of these neighbors (the first neighbor). It is also included in the labels of the nearest neighbors obtained by using the cosine distance, as shown in Fig. 2b; actually, the third and fifth neighbors in Fig. 2b and the test sample are from the same class (Class 19). The final learning region also contains these neighbors, as shown in Fig. 2c. Therefore, compared with the neighbors shown in Fig. 1b, the learning region is more suitable for learning the test sample in Fig. 1a.
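The class-voting construction of the learning region described above can be sketched as follows. This is a minimal sketch: the names are ours, and the `keep_classes` cutoff is our simplification of the paper's rule of discarding the classes with the smallest values in S.

```python
import numpy as np
from collections import Counter

def learning_region(labels, sim_sets, keep_classes):
    """Pool the three similarity sets, score each class by S_s = n_s / H,
    and keep only the samples of the highest-scoring classes.

    labels       : array of class labels, one per training sample
    sim_sets     : three lists of K training-sample indices each
    keep_classes : number of top-ranked classes to retain
    """
    pooled = [i for s in sim_sets for i in s]        # H = 3K indices in total
    H = len(pooled)
    votes = Counter(labels[i] for i in pooled)       # n_s for each class
    ranked = sorted(votes.items(), key=lambda kv: -kv[1])
    kept = {c for c, _ in ranked[:keep_classes]}
    region = sorted({i for i in pooled if labels[i] in kept})
    scores = {c: n / H for c, n in votes.items()}    # the S_s values
    return region, scores
```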
3.2. Implementation of IBL

After determining the learning region, the proposed IBL algorithm builds the BLDA learning model within this region. Algorithm 2 shows the IBL algorithm.

Algorithm 2. The IBL algorithm
1. Input data: x_i ∈ R^n (i = 1, 2, ..., L) and their labels.
2. From the training set, determine the learning region of the test sample x_s.
3. Construct the BLDA model within the determined learning region.
4. Classify the test sample x_s using the constructed model.

It is clear that the IBL algorithm is an extension of the BLDA algorithm, and BLDA is a special case of IBL: if the learning region of a test sample is the whole training set, then the recognition performance of IBL is theoretically equivalent to that of BLDA. IBL is slower than BLDA. Note, however, that IBL learns and classifies the test samples one by one, and learning and classifying one test sample does not affect learning and classifying another. In other words, learning and classifying the test samples in IBL is parallelizable; hence, if we want to improve the computational efficiency of IBL, we can perform IBL through parallel computation, which will largely improve its efficiency. On the other hand, in the general learning scenario, that is, when the main memory can load the whole training set, the space complexity of our IBL algorithm is nearly the same as that of the BLDA algorithm. When the training set is so large-scale and high-dimensional that the main memory cannot load the whole training set, the BLDA algorithm and most other usual boosting algorithms might fail to learn, since they need to load all the training samples into main memory in the training procedure. Nevertheless, our proposed IBL can compute the distances between a test sample and the training set in advance. Note that in the distance computation, we do not need to load all the training samples into main memory at once.
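The per-test-sample procedure of Algorithm 2 can be sketched as follows. This is a minimal sketch in which both the learning-region step and the BLDA learner are abstracted as plug-ins; the nearest-class-mean stand-in below is ours, for illustration only, and is not the paper's BLDA model.

```python
import numpy as np

def ibl_classify(X, labels, y, region_fn, train_fn):
    """Algorithm 2, sketched: build one model per test sample.

    region_fn(X, labels, y) -> indices of the learning region for y (step 2)
    train_fn(X_sub, labels_sub) -> classifier with .predict(y)      (step 3)
    """
    idx = region_fn(X, labels, y)                 # step 2: learning region
    model = train_fn(X[:, idx], labels[idx])      # step 3: model on the region
    return model.predict(y)                       # step 4: classify the sample

class NearestClassMean:
    """Stand-in base model (an assumption for illustration, not BLDA)."""
    def __init__(self, X, labels):
        self.classes = np.unique(labels)
        self.means = {c: X[:, labels == c].mean(axis=1) for c in self.classes}
    def predict(self, y):
        return min(self.classes,
                   key=lambda c: np.linalg.norm(y - self.means[c]))
```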
When computing the distances between a test sample and the training set, we can read the training samples from the disk one by one. After determining the neighbors of a sample, we can implement our IBL.

Fig. 2. Two similarity sets obtained by using the representation residual and the cosine distance, respectively, as well as the final learning region. (a) The first five nearest neighbors of the test sample obtained by using the representation residual. (b) The first five nearest neighbors of the test sample obtained by using the cosine distance. (c) The final learning region.

4. Experiments
In this section, we conduct several experiments on popular data sets to demonstrate the recognition effectiveness of the proposed IBL algorithm. The first two experiments are performed on the Georgia Tech (GT) and AR data sets, respectively. The third experiment is conducted on a new data set that combines the ORL and Yale data sets (ORL + Yale). The fourth experiment is conducted on the MNIST handwritten digit data set. For comparison, we implemented the nearest neighbor classifier combined with linear discriminant analysis (Fisherfaces, simply denoted by LDA in the experiments) [9], the boosting LDA (BLDA) algorithm [10], collaborative representation for classification (CRC) [26], two-phase test sample sparse representation (TPTSR) [25], and the sparse representation based classification (SRC) algorithm [23]. The last three algorithms, all newly proposed in recent years, apply nearly the same learning scheme as our IBL algorithm: they learn and classify the test samples one by one. For each data set, we randomly choose a part of the samples for training, and the remaining samples are used for testing, as done in many works. Randomly choosing the training set ensures that the classification results do not depend on any special choice of the training data [23]. In IBL, we
need to set the parameter r, which is the ratio of the number of samples in the learning region to the total number of training samples, to determine the learning region for the test samples. We examine the relationship between the ratio r and the recognition performance in the experiments.

4.1. Experiment on the GT data set

We conducted the first experiment on the Georgia Tech (GT) face data set. The GT data set contains 50 subjects with 15 images per subject and exhibits several variations, such as pose, expression, and illumination [24]. All the images are cropped and resized to a resolution of 60 × 50 pixels. We randomly grouped the image samples of each subject into two parts: one part is used for training and the other for testing. The number of training images chosen for each subject is 4, 5, 6, 7, and 8, making up five subsets of the training data. For a fair comparison, we compute the recognition rate with a feature space dimension of 100 obtained by using PCA; that is, the face image data dimensionality is reduced to 100 by PCA. In TPTSR, the number of neighbors of the test sample is set to 15, which yields the best performance of TPTSR on this database. In Fisherfaces, the number of discriminant vectors is set to c − 1, where c is the number of classes. These schemes are also applied in the second, third, and fourth experiments. In IBL, the ratio r is set to 0.2. We randomly ran each algorithm 10 times on each training subset. Table 1 reports the recognition rates on the five training subsets (denoted by N = 4, 5, 6, 7, and 8). From Table 1, we can observe that the recognition performance of the proposed IBL algorithm is better than that of the other five algorithms.

Table 1. Recognition rates (mean ± std-dev percent) on the GT data set.

Algorithms    N = 4          N = 5          N = 6          N = 7          N = 8
CRC           58.02 ± 1.88   62.54 ± 1.74   65.47 ± 1.79   68.03 ± 1.70   70.03 ± 2.30
SRC           58.49 ± 1.77   63.70 ± 1.57   66.40 ± 1.99   70.22 ± 1.28   72.09 ± 2.39
TPTSR         60.95 ± 1.65   64.54 ± 1.73   67.56 ± 1.68   70.15 ± 2.27   71.03 ± 2.43
LDA           54.26 ± 2.71   63.02 ± 2.23   67.44 ± 2.45   72.20 ± 1.79   76.63 ± 2.26
BLDA          58.27 ± 3.28   65.42 ± 2.72   68.64 ± 2.28   72.42 ± 3.08   75.89 ± 1.98
IBL           64.11 ± 2.20   69.18 ± 2.64   71.17 ± 2.11   75.10 ± 2.58   78.06 ± 1.91

The main reason is that IBL can effectively capture the distribution of the test samples, so it outperforms the BLDA algorithm, a typical boosting algorithm. Compared with the CRC, TPTSR, and SRC algorithms, which can also be viewed as individualized learning methods, IBL pays more attention to learning the discriminative information within the data set, and accordingly achieves better classification results.

4.2. Experiment on the AR data set

The second experiment is conducted on the AR face data set. The AR data set consists of over 4000 face images of 126 subjects, including frontal views of faces with different facial expressions, lighting conditions, and occlusions. We used the face images of 120 subjects, with 26 images per subject [32,33]. All the images are cropped and resized to a resolution of 50 × 40 pixels. For each subject, N (N = 3, 4, 5, 6, and 7) images are randomly selected for training, and the rest are used for testing. We compute the recognition rate with a feature space dimension of 120 obtained by using PCA. In TPTSR, the number of neighbors is set to 30, which also yields the best performance of TPTSR on this database. The ratio r in IBL is set to 0.3. As in the first experiment, we randomly ran each algorithm 10 times on each training subset. Table 2 reports the recognition rates on the five training subsets (denoted by N = 3, 4, 5, 6, and 7). From Table 2, we can also observe that the recognition performance of the proposed IBL algorithm is better than that of the other five algorithms.

Table 2. Recognition rates (mean ± std-dev percent) on the AR data set.

Algorithms    N = 3          N = 4          N = 5          N = 6          N = 7
CRC           78.86 ± 1.74   84.07 ± 0.97   87.37 ± 1.12   89.70 ± 0.64   91.21 ± 0.75
SRC           78.39 ± 1.31   84.19 ± 0.70   87.80 ± 0.84   90.07 ± 0.82   92.29 ± 0.72
TPTSR         78.75 ± 1.38   84.06 ± 0.88   88.06 ± 1.09   90.46 ± 0.45   92.45 ± 0.82
LDA           76.59 ± 1.51   82.97 ± 0.95   86.08 ± 1.20   88.80 ± 1.0    89.75 ± 0.87
BLDA          79.66 ± 1.52   86.31 ± 1.25   90.37 ± 1.03   92.96 ± 1.02   94.46 ± 0.49
IBL           81.08 ± 1.72   87.41 ± 1.20   90.72 ± 1.13   93.27 ± 0.91   94.68 ± 0.58

4.3. Experiment on the ORL + Yale data set

The third experiment is conducted on a hybrid data set, the ORL + Yale data set. The ORL face database contains 40 individuals with 400 face images, 10 images per individual. These images were captured at different times and contain different variations, including expression and facial details [34]. The Yale face database contains 165 images of 15 individuals (each person has 11 images) under various facial expressions and lighting conditions [29]. The variations of the ORL database and the Yale database clearly differ; hence, the combined ORL + Yale database is heteroscedastic. All the images are cropped and resized to a resolution of 32 × 32 pixels. For each subject, N (N = 3, 4, and 5) images are randomly selected for training, and the rest are used for testing. We compute the recognition rate with a feature space dimension of 60 obtained by using PCA. In TPTSR, the number of neighbors is set to 15. The ratio r in IBL is set to 0.2. Again, we randomly ran each algorithm 10 times on each training subset. Table 3 reports the recognition results on the three training subsets.
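The evaluation protocol shared by all the experiments (a random per-class split, PCA dimension reduction, and the mean ± std-dev over 10 runs) can be sketched as follows. This is a minimal sketch under our own naming, with `classify` standing in for any of the compared algorithms.

```python
import numpy as np

def pca_project(X_train, X_test, dim):
    """Reduce dimensionality with PCA (via SVD of the centered training data)."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    W = Vt[:dim].T                                # top `dim` principal directions
    return (X_train - mean) @ W, (X_test - mean) @ W

def random_split_runs(X, labels, n_train, classify, runs=10, dim=2, seed=0):
    """Mean and std-dev recognition rate over repeated random splits.

    X        : (n_samples, n_features) data matrix, one sample per row
    n_train  : training samples drawn per class (the N of the tables)
    classify : classify(X_train, y_train, X_test) -> predicted labels
    """
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(runs):
        tr, te = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            tr.extend(idx[:n_train])
            te.extend(idx[n_train:])
        X_tr, X_te = pca_project(X[tr], X[te], dim)
        preds = classify(X_tr, labels[tr], X_te)
        rates.append(float(np.mean(preds == labels[te])))
    return float(np.mean(rates)), float(np.std(rates))
```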
From Table 3, we can also observe that IBL outperforms the other five algorithms. Moreover, the above experiments show that IBL tends to achieve a larger performance improvement (compared to BLDA) when the number of training samples is smaller. The potential reason is that when the training set is small-scale, the distributions of the training and testing samples are not consistent, a situation that IBL can address well.

Table 3
Recognition rates (mean ± std-dev percent) on the ORL + Yale data set.

Algorithms    N = 3            N = 4            N = 5
CRC           83.92 ± 1.16     86.29 ± 1.15     88.48 ± 1.97
SRC           85.75 ± 1.31     89.04 ± 1.20     89.45 ± 1.43
TPTSR         86.45 ± 1.90     89.80 ± 0.69     91.17 ± 1.30
LDA           82.90 ± 1.77     89.86 ± 2.00     91.34 ± 2.43
BLDA          84.38 ± 2.59     89.36 ± 1.78     92.72 ± 1.60
IBL           86.97 ± 1.67     90.96 ± 1.37     93.10 ± 1.76

Table 4
Recognition rates (mean ± std-dev percent) on the MNIST data set.

Algorithms    N = 5            N = 10           N = 15           N = 20
CRC           62.64 ± 2.99     69.24 ± 2.35     72.80 ± 1.16     73.72 ± 0.93
SRC           60.56 ± 2.80     69.55 ± 2.22     74.11 ± 1.01     75.08 ± 0.84
TPTSR         63.08 ± 3.81     70.35 ± 2.33     74.37 ± 1.11     75.74 ± 1.45
LDA           59.90 ± 3.23     63.88 ± 2.13     64.84 ± 2.09     62.38 ± 1.84
BLDA          60.90 ± 3.07     70.18 ± 1.11     74.00 ± 1.86     75.34 ± 1.83
IBL           63.21 ± 3.91     71.42 ± 1.46     75.02 ± 1.75     76.37 ± 1.78
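The individualized scheme discussed above — for each test sample, determine a learning region from the training set and classify within it — can be sketched as follows. This is a minimal sketch under our own assumptions: we use Euclidean distance to select the region, and a nearest-class-mean rule stands in for the boosted ensemble that IBL actually trains inside the region; all function names are ours.

```python
import numpy as np

def learning_region(train_X, test_x, r):
    """Indices of the ceil(r * M) training samples closest to test_x."""
    m = int(np.ceil(r * len(train_X)))
    dists = np.linalg.norm(train_X - test_x, axis=1)
    return np.argsort(dists)[:m]

def ibl_predict(train_X, train_y, test_x, r=0.2):
    """Classify test_x using only its learning region.

    A nearest-class-mean rule is a placeholder for the boosted classifier
    that IBL builds per test sample inside the region.
    """
    region = learning_region(train_X, test_x, r)
    Xr, yr = train_X[region], train_y[region]
    classes = np.unique(yr)
    means = np.array([Xr[yr == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(means - test_x, axis=1))]
```

Because the region is rebuilt per test sample, the effective training distribution adapts to each query, which is the intuition behind IBL's robustness to train/test distribution mismatch.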
4.4. Experiment on the MNIST data set
The fourth experiment is conducted on the MNIST data set [35], a handwritten digit data set with ten classes, 0, 1, 2, ..., 9 (each numeral corresponds to a class). For each numeral, the training set contains 6000 image samples and the test set contains 1000 image samples. The size of each image is 28 × 28. We randomly selected samples from the training set of each class and used them as the training samples in this experiment. The number of training samples chosen for each class is 5, 10, 15, and 20, which makes up four training subsets. For each training subset, we randomly selected samples from the test set of each class and used them as the test samples. The number of test samples chosen for each class is 300. Thus, we generate four test subsets, respectively corresponding to the four training subsets. Each test subset contains 3000 test samples. We compute the recognition rate with the feature space dimension 40 obtained by using PCA. In TPTSR, the number of neighbors is set to 15. The ratio r in IBL is set to 0.2. Similarly, we randomly ran each algorithm 10 times on each training subset. Table 4 reports the recognition results on the four training subsets. From Table 4, we can also observe that IBL outperforms the other five algorithms. Again, the above experiments show that IBL tends to achieve a larger performance improvement (compared to BLDA) when the number of training samples is smaller.

4.5. Relationship between the ratio r and the recognition performance

We experimentally investigate the relationship between the ratio r and the recognition performance on the above popular data sets. Using the same parameters except r, we randomly ran IBL 10 times on the GT, AR, ORL + Yale, and MNIST data sets. Fig. 3 shows the recognition results on these data sets with the different ratios. In this figure, the values under the bars indicate the numbers of training samples of each class.
It is impossible to report the recognition results for all ratios. We report the cases r = 0.2 and r = 0.3, which usually yield desirable recognition performance and are the values used in our experiments. From Fig. 3, we can see that the recognition results are not significantly sensitive to the ratio r. Fig. 3a shows that the recognition performance yielded by r = 0.2 is nearly the same as that yielded by r = 0.3 on the AR database. Fig. 3b shows that the performance yielded by r = 0.2 is slightly better than that yielded by r = 0.3 on the GT database; this implies that a larger ratio does not necessarily yield better recognition performance. Fig. 3c shows that the performance yielded by r = 0.2 is similar to that yielded by r = 0.3 on the ORL + Yale data set. Likewise, Fig. 3d shows that r = 0.2 leads to recognition performance similar to that of r = 0.3 on the MNIST data set. From Fig. 3, we can conclude that it is easy to choose the ratio r for the above data sets. In general, the ratio r can be chosen from the range 0.1 to 0.4; empirically, r is selected from the set {0.1, 0.2, 0.3, 0.4}.

Fig. 3. The relationship between the ratio r and the recognition performance. (a) The AR database. (b) The GT database. (c) The ORL + Yale database. (d) The MNIST database.

5. Conclusions

This paper improves the conventional boosting algorithms by means of an individualized learning scheme, and proposes a novel boosting algorithm, individualized boosting learning (IBL). Unlike the conventional boosting algorithms, which focus only on learning the training samples, IBL focuses on learning both the training and the testing samples. Thus, IBL can learn the test samples more effectively and, as a consequence, improve the recognition performance. Experiments show that our IBL approach is promising. Future work will apply this individualized learning scheme to other learning settings.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China (NSFC) under Grants Nos. 61472138, 61263032, 71262011, 61271385, and 61362031, and by the Jiangxi Provincial Science and Technology Foundation of China under Grants KJLD12607 and GJJ14375.

References

[1] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
[2] R.E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions, Mach. Learn. 37 (3) (1999) 297–336.
[3] C. Domingo, O. Watanabe, MadaBoost: a modification of AdaBoost, in: Proceedings of the Annual Conference on Computational Learning Theory (COLT), 2000, pp. 180–189.
[4] C. Rudin, I. Daubechies, R.E. Schapire, The dynamics of AdaBoost: cyclic behavior and convergence of margins, J. Mach. Learn. Res. 5 (2004) 1557–1595.
[5] C. Shen, P. Wang, F. Shen, H. Wang, UBoost: boosting with the Universum, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2012) 825–832.
[6] L. Breiman, Arcing classifiers, Ann. Stat. 26 (3) (1998) 801–849.
[7] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Stat. 28 (2) (2000) 337–474.
[8] G. Rätsch, M.K. Warmuth, Efficient margin maximizing with boosting, J. Mach. Learn. Res. 6 (2005) 2131–2152.
[9] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.
[10] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, S.Z. Li, Ensemble-based discriminant learning with boosting for face recognition, IEEE Trans. Neural Netw. 17 (1) (2006) 166–178.
[11] W. Dai, Q. Yang, G.-R. Xue, Y. Yu, Boosting for transfer learning, in: Proceedings of the 24th International Conference on Machine Learning (ICML), 2007, pp. 193–200.
[12] Y. Yao, G. Doretto, Boosting for transfer learning with multiple sources, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1855–1862.
[13] D. Masip, À. Lapedriza, J. Vitrià, Boosted online learning for face recognition, IEEE Trans. Syst. Man Cybern. B Cybern. 39 (2) (2009) 530–538.
[14] P.K. Mallapragada, R. Jin, A.K. Jain, Y. Liu, SemiBoost: boosting for semi-supervised learning, IEEE Trans. Pattern Anal. Mach. Intell. 31 (11) (2009) 2000–2014.
[15] S. Chen, H. He, E.A. Garcia, RAMOBoost: ranked minority oversampling in boosting, IEEE Trans. Neural Netw. 21 (10) (2010) 1624–1642.
[16] A. Demiriz, K.P. Bennett, J. Shawe-Taylor, Linear programming boosting via column generation, Mach. Learn. 46 (2002) 225–254.
[17] C. Shen, H. Li, On the dual formulation of boosting algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 32 (12) (2010) 2216–2231.
[18] H. Masnadi-Shirazi, N. Vasconcelos, Cost-sensitive boosting, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2) (2011) 294–309.
[19] R. Nock, P. Piro, F. Nielsen, W.B.H. Ali, et al., Boosting k-NN for categorization of natural scenes, Int. J. Comput. Vis. 100 (3) (2012) 294–314.
[20] C. Shan, Smile detection by boosting pixel differences, IEEE Trans. Image Process. 21 (1) (2012) 431–436.
[21] H. Xia, S.C.H. Hoi, MKBoost: a framework of multiple kernel boosting, IEEE Trans. Knowl. Data Eng. 25 (7) (2013) 1574–1586.
[22] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71–86.
[23] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, et al., Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[24] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 32 (11) (2010) 2106–2112.
[25] Y. Xu, D. Zhang, J. Yang, J.-Y. Yang, et al., A two-phase test sample sparse representation method for use with face recognition, IEEE Trans. Circ. Syst. Vid. Technol. 21 (9) (2011) 1255–1262.
[26] L. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: which helps face recognition? in: IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.
[27] L. Reyzin, Boosting on a budget: sampling for feature-efficient prediction, in: Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
[28] H. Zhang, A.C. Berg, M. Maire, J. Malik, SVM-KNN: discriminative nearest neighbor classification for visual category recognition, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 2126–2136.
[29] J. Yang, D. Zhang, A.F. Frangi, J.-Y. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (1) (2004) 131–137.
[30] Y. Xu, Q. Zhu, Z. Fan, D. Zhang, et al., Using the idea of the sparse representation to perform coarse-to-fine face recognition, Inf. Sci. 238 (2013) 138–148.
[31] Y. Xu, Q. Zhu, Z. Fan, Y. Wang, et al., From the idea of "sparse representation" to a representation-based transformation method for feature extraction, Neurocomputing (2013).
[32] A.M. Martínez, A.C. Kak, PCA versus LDA, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2) (2001) 228–233.
[33] J. Yang, D. Zhang, J.-Y. Yang, B. Niu, Globally maximizing, locally minimizing: unsupervised discriminant projection with applications to face and palm biometrics, IEEE Trans. Pattern Anal. Mach. Intell. 29 (4) (2007) 650–664.
[34] W.H. Yang, D.Q. Dai, Two-dimensional maximum margin feature extraction for face recognition, IEEE Trans. Syst. Man Cybern. B Cybern. 39 (4) (2009) 1002–1012.
[35] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.