Regularization of boosted decision stumps using tabu search

Michał Bereta
Institute of Computer Science, Cracow University of Technology, ul. Warszawska 24, 31-155 Kraków, Poland
Highlights
• A novel method of regularization of the boosting procedure is proposed.
• A tabu list contains features forbidden when training subsequent base classifiers.
• The proposed method is also hybridized with ε-AdaBoost regularization.
• Results for 20 machine learning datasets are presented.
• An application to face verification and gender recognition is also presented.
Article history: Received 2 October 2018; Received in revised form 11 March 2019; Accepted 2 April 2019; Available online 8 April 2019

Keywords: Boosting; AdaBoost; Tabu search; Regularization; Face verification; Gender recognition; Local binary patterns; Gabor filters; FERET database
Abstract

This study proposes a novel method to improve the well-known AdaBoost algorithm by combining it with a procedure inspired by tabu search. After each iteration of AdaBoost, the attribute used by the weak learner is placed on the tabu list, which prevents it from being utilized by the subsequent weak learners. The length of the tabu list becomes a new meta-parameter of the learning process and can be tuned based on the cross-validation procedure. This study shows that the proposed approach can improve the original AdaBoost procedure, preventing it from over-fitting to the training data. This study also demonstrates that the novel method can act as a regularization procedure. Finally, the paper presents results for the proposed algorithm for 20 classification problems from the UCI repository and for face verification and gender recognition problems.
1. Introduction

There are many approaches to machine and statistical learning, including Bayesian methods, linear and kernel models such as Support Vector Machines, and neural networks [1]. Among these methodologies, the family of boosting algorithms takes a prominent place, as it is both conceptually simple and practically effective [2].

This work proposes a method to improve the AdaBoost algorithm with a simple modification inspired by the tabu search (TS) algorithm [3,4], well known from optimization problems. In fact, the training of any classifier can be interpreted as a process optimizing a given performance criterion, such as a classification ratio, a margin or an exponential loss. TS is an established technique for improving greedy search optimization algorithms. The main idea is to maintain a list of forbidden solutions that cannot be used by the greedy search algorithm when exploring the search space. In effect, such algorithms can often escape local minima.

In this work, we first revisit the basic boosting algorithm, AdaBoost, and then interpret it as a greedy optimization process.
This point of view allows modifying the learning process by introducing a list of forbidden features that cannot be used while training the subsequent base classifiers in the boosting procedure. We show that this simple modification improves the final classifier when compared to the basic AdaBoost algorithm. Note, however, that the proposed algorithm is not a feature selection method. Several classification datasets from the UCI Machine Learning Repository [5] were used to test the characteristics of the proposed TabuBoost algorithm. Additionally, in order to further test the proposed approach, face verification and face-based gender recognition tasks were considered as examples of image recognition problems. We point out similarities between the proposed method and regularization methods in neural network learning as a potential explanation of the positive effect of combining AdaBoost and tabu search.

1.1. Previous work

Boosting was proposed as a method of combining a set of classifiers [6]. One of the most popular and most analyzed boosting algorithms is AdaBoost [7]. There have been several approaches to explaining the source of boosting's success, such
as probably approximately correct (PAC) learning [8], margin maximization [7] and a statistical learning explanation [9]. In [9], AdaBoost was presented as a stage-wise gradient descent optimization method in functional space. This formulation allowed interpreting boosting as an algorithm optimizing a more general, usually convex, loss function.

Boosting is well regarded as a method with excellent generalization ability that can provide satisfactorily low test errors for many problems. For a while, some researchers believed that boosting forever (i.e., adding arbitrarily many base classifiers) was a safe approach, and thus that regularization was unnecessary in boosting. However, it was discovered that some forms of regularization may improve the general boosting procedure [10] beyond merely improving the test errors. Proper regularization can also provide simpler final classifiers (i.e., consisting of a smaller number of base learners), which is useful in some applications. In fact, regularization is often not necessary when only the test error is considered [11]. Many approaches to tuning the boosting classifier have focused on selecting the proper number of base learners, which in practice can be done by analyzing the cross-validation error.

Regularization can be interpreted as any modification of the primary learning procedure or the optimized criterion that constrains the learning based on the characteristics of the training data and/or the structure of the trained classifier. For example, some regularization techniques punish the classifier for being too complicated; in others, the classifier cannot focus too much on selected features (i.e., attributes describing the training examples). An example is the l2-norm regularization often used in neural network (NN) learning, where the original loss function L is modified as

L′(Θ) = L(Θ) + λ‖Θ‖²   (1)
where λ weights the importance of the regularization term and Θ is the vector of parameters of the classifier being optimized (e.g., the connection weights in the case of an NN).

Most work on boosting regularization has focused on the stage-wise functional gradient descent formulation of boosting [12], which converts the initially simple formulation of AdaBoost into a more complicated form. It allows using different loss functions during learning, which can be modified by a regularization term. One of the most common approaches is to include an l1-norm penalty, which effectively punishes too large boosting coefficients [13]. An example extension of the basic AdaBoost algorithm that allows optimizing a given loss function and using regularization is AnyBoost [14]. Others are TotalBoost [15] and LPBoost [15], which are examples of fully corrective boosting algorithms that update the coefficients of all previously learned weak learners in each iteration. Usually, this introduces an additional optimization problem that must be solved in every iteration of the boosting algorithm. As a result, the final classifier can consist of a smaller number of weak learners and can converge in a smaller number of iterations; however, the cost of each iteration increases. The approach presented in this work is different: the original formulation of AdaBoost is kept, and we show that regularization can be achieved with a simple modification based on TS ideas.

Hybridizing TS with other algorithms has already been shown to be beneficial in other works. For example, in [16], TS was combined with a simulated annealing algorithm and applied to the symmetrical traveling salesman problem, while in [17], it was integrated with a whale optimization algorithm for the quadratic assignment problem.

In the context of classifier learning algorithms, search algorithms have often been used to perform feature selection. The quality of the classifier can often be improved by choosing an appropriate subset of features. In general, feature selection algorithms are divided into filtering and wrapper-based methods.
Filtering methods are designed to select the features that are useful for classification without indicating which specific classifier will use them. In contrast, wrapper algorithms select a subset of features for a specific classification algorithm; the quality of that classifier, evaluated on candidate subsets of features, is used to guide the search through the space of all possible subsets. To this end, evolutionary algorithms have often been used; an example can be found in [18]. TS has also been used in a similar context. In [19], TS was used to optimize the selection of features for the nearest neighbor classifier. In [20], TS was used to select features in the problem of hyperspectral data analysis using a binary decision tree. Similar work is found in [21] and [22].

Note that the method proposed in this work differs significantly from these approaches. Despite some similarities, the proposed method is not a wrapper method for selecting features: it is based on a modification of boosting and is itself the method of creating the classifier. In the proposed method, TS is not an external algorithm that wraps the AdaBoost learning process; it is an integral part of it. The proposed technique is therefore unique in comparison with applications of TS as a wrapper method of feature selection.

1.2. Contributions of this work

The main contributions of this work are as follows. A novel approach to the regularization of AdaBoost is proposed, in which an idea from tabu search (TS) is incorporated into the boosting procedure. This has a positive effect on the final classifier, both on test errors and on the number of weak learners. This positive effect can be explained by two factors. First, we show that AdaBoost can be treated as a greedy optimization method and, as such, can provide suboptimal solutions; introducing TS into the boosting procedure makes it less greedy, which allows finding better solutions. Second, we show that the proposed method gives effects similar to a well-known method of regularization used in learning neural networks. Compared to more traditional methods of regularized boosting, the proposed method does not require the functional gradient-based formulation of boosting, which makes it much more straightforward.

The effectiveness of the proposed approach was tested on problems from the UCI Machine Learning Repository [5] and on two image recognition tasks, face verification and gender recognition. We present comparisons of the proposed approach (TabuBoost) with the original AdaBoost and with the well-known regularization method of AdaBoost known as ε-AdaBoost [23]. ε-AdaBoost is also a simple modification of AdaBoost; it is important because it has been shown to converge asymptotically to l1-norm regularization.

The rest of this paper is organized as follows. Section 2 briefly presents the original AdaBoost algorithm and introduces decision stumps (DS) as the weak learners used in this work. Additionally, we point out the greedy characteristic of the boosting procedure and refer to TS as an approach to limiting the disadvantages of greedy optimization methods. Section 3 proposes the idea of combining AdaBoost with TS and presents the novel algorithm, TabuBoost, in detail. Section 4 presents the results of numerical experiments, while Section 5 discusses possible explanations of the useful properties of the proposed algorithm. Section 6 gives the conclusions.

2. Boosting and greedy optimization

In this section, we review the original AdaBoost algorithm. The decision stumps, which are used as the weak learners in this work, are also presented. Additionally, AdaBoost is interpreted as a greedy optimization procedure. This allows us to introduce TS as a method for limiting the general disadvantages of greedy optimization.
Fig. 1. Train error, test error and train exponential loss in an example training process of AdaBoost with decision stumps as weak learners.
2.1. AdaBoost

The main idea of boosting is straightforward: apply a weak learning algorithm that cannot solve the classification problem by itself but is still better than a random guess. Then, modify the weights of the training examples and apply the weak learning algorithm repeatedly, each time using the modified set of weights. The weight of each training example should reflect the relative difficulty the example caused to the previously created weak learners. Thus, the weak classifier currently being trained should focus more on examples with higher weights.

One of the earliest and most widely known boosting algorithms is AdaBoost.M1 [2]. It was designed for two-class classification problems. Despite its simplicity, it has been effectively applied to difficult practical problems. A notable example is the highly popular face detection algorithm proposed by Viola and Jones [24], which has become a standard approach and a reference algorithm for other researchers. The basic AdaBoost algorithm, presented as Algorithm 1, forms the base of the discussion in this work. In Algorithm 1, I(.) is the indicator function, which returns 1 if the argument is true and 0 otherwise.

Despite its simplicity, AdaBoost is still a widely used learning algorithm [25]. The roots of boosting algorithms' excellent performance in many real-world problems are still not fully understood. Nevertheless, one of the sources of AdaBoost's success is that the exponential loss is minimized during learning. Compared to methods minimizing the classification error on the training set, this criterion has some obvious benefits. Fig. 1 shows an example situation when AdaBoost was used in one of the problems described in this paper, namely face-based gender recognition. It can be observed that after adding a given number of weak learners to the model (around 300 in this case), the training classification error is zero. However, the exponential loss is non-zero and can still be minimized by continuing the learning process. On the other hand, minimizing the exponential loss does not guarantee optimal testing performance of the final classifier.

In this work, a regularized version of AdaBoost, ε-AdaBoost, is also used for comparison with the proposed approach. In ε-AdaBoost, each weak learner is given a constant boosting coefficient, which is usually a small real value. This means that step 2c of Algorithm 1 is modified and the boosting coefficient α_m is always set to a small value ε. The proper value of ε depends on the problem being solved. Usually, ε-AdaBoost requires many iterations to converge. However, it has been shown to asymptotically converge to the l1-norm regularized version of AdaBoost [23], so it is used for comparison in this work.
Algorithm 1. AdaBoost.M1
Input: Training examples (x_i, y_i), x_i ∈ R^K, y_i ∈ {−1, 1} – the class label of the ith example; x_i – the feature vector of the ith example; i = 1, 2, ..., N; N – the number of training examples; M – the number of weak classifiers to train; K – the number of features.
Output: Aggregated classifier H(x).
Notation: I(.) is the indicator function, which returns 1 if the argument is true and 0 otherwise.
1. Initialize weights for training examples as w_i = 1/N.
2. For m = 1 to M:
   a. Train a weak classifier h_m(x) based on the training examples and the current weights distribution.
   b. Calculate the weighted error of classifier h_m(x) as
      error_m = [Σ_{i=1..N} w_i · I(y_i ≠ h_m(x_i))] / [Σ_{i=1..N} w_i].   (1)
   c. Calculate the weight of classifier h_m(x), indicating its impact on the final classification, as
      α_m = log((1 − error_m) / error_m).   (2)
   d. Update the weights distribution as
      w_i = w_i · exp[α_m · I(y_i ≠ h_m(x_i))],   (3)
      w_i = w_i / Σ_{j=1..N} w_j   (4)
      for i = 1, ..., N.
3. Return the final aggregated classifier as
   H(x) = sign[Σ_{m=1..M} α_m · h_m(x)]   (5)
   for any example with the feature vector x, where sign(.) is the signum function.
End of Algorithm 1
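For concreteness, the following minimal Python sketch mirrors Algorithm 1 (an illustration, not the paper's reference implementation). The weak-learner routine train_stump is a placeholder assumed to return a predictor; a matching sketch is given after Algorithm 2 below.

import numpy as np

def adaboost_m1(X, y, M, train_stump):
    """Minimal sketch of Algorithm 1. X: (N, K) features; y: labels in {-1, +1}."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                  # step 1: uniform example weights
    learners, alphas = [], []
    for _ in range(M):
        h = train_stump(X, y, w)             # step 2a: fit weak learner to weighted data
        miss = (h(X) != y)                   # I(y_i != h_m(x_i))
        error = w[miss].sum() / w.sum()      # step 2b: weighted error (Eq. 1)
        if error == 0 or error >= 0.5:       # degenerate cases: stop early
            break
        alpha = np.log((1 - error) / error)  # step 2c: classifier weight (Eq. 2)
        w *= np.exp(alpha * miss)            # step 2d: re-weight examples (Eq. 3)
        w /= w.sum()                         # normalize (Eq. 4)
        learners.append(h)
        alphas.append(alpha)

    def H(X_new):                            # step 3: weighted vote (Eq. 5)
        votes = sum(a * h(X_new) for a, h in zip(alphas, learners))
        return np.sign(votes)                # sign(0) maps to 0; break ties as needed
    return H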
2.2. Decision stumps

A stump classifier (a.k.a. decision stump) is essentially a one-level decision tree, i.e., a tree with just two leaves. There is only one node, the root node, which performs a split test based on the value of a single feature. While building a decision stump, a threshold value is selected that minimizes the weighted classification error under the current weighting of the training examples. The search for the optimal threshold (in the sense of minimizing the weighted classification error) is carried out by exhaustively checking possible values for the given training set. Each feature is considered separately: the values of the feature over the training examples are sorted, and the algorithm checks each candidate threshold, calculated as the average of successive sorted values. The procedure for finding the decision stump for a given training set with a weight distribution is provided as Algorithm 2. Algorithm 2 is used by all boosting algorithms in this work (Algorithms 1, 5 and 6) to find each subsequent weak classifier (e.g., step 2a in Algorithm 1).
Algorithm 2. Creating a stump classifier (decision stump)
Input: Training examples (x_i, y_i), x_i ∈ R^K, y_i ∈ {−1, 1}; {w_i}, i = 1, ..., N – the current weight distribution of the training examples; N – the number of examples; K – the number of features.
Output: h(x) = {j, thresh, class}, where class ∈ {−1, 1}, j = 1, ..., K, thresh ∈ R – the decision stump minimizing the weighted classification error.
1. Find the feature index j, the threshold value thresh and class so that the weighted classification error
   error(h(x), {(x_i, y_i)}, {w_i}) = Σ_{i=1..N} w_i · I(h(x_i) ≠ y_i)   (1)
   is minimized.
2. Return h(x) = {j, thresh, class} as
   h(x) = class        if x_j ≥ thresh,
        = −1 · class   if x_j < thresh.   (2)
End of Algorithm 2
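A direct, unoptimized implementation of Algorithm 2 might look as follows (a sketch kept simple for clarity; the optional allowed argument anticipates the tabu-list restriction introduced in Section 3 and is our own addition):

import numpy as np

def train_stump(X, y, w, allowed=None):
    """Exhaustive decision-stump search (sketch of Algorithm 2).
    `allowed` optionally restricts the usable feature indices."""
    N, K = X.shape
    features = range(K) if allowed is None else allowed
    best = (np.inf, None)                           # (weighted error, stump params)
    for j in features:
        vals = np.unique(X[:, j])                   # sorted unique feature values
        thresholds = (vals[:-1] + vals[1:]) / 2.0   # midpoints of successive values
        for thresh in thresholds:
            for cls in (-1, 1):
                pred = np.where(X[:, j] >= thresh, cls, -cls)
                err = w[pred != y].sum()            # weighted error (Eq. 1)
                if err < best[0]:
                    best = (err, (j, thresh, cls))
    j, thresh, cls = best[1]
    def h(Xn):                                      # the stump of Eq. (2)
        return np.where(Xn[:, j] >= thresh, cls, -cls)
    h.feature = j                                   # expose the chosen feature (for the tabu list)
    return h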
The boosting algorithm has been shown to work well even with extremely simple weak classifiers. One exceptionally successful application is the Viola–Jones model [24], in which a boosting algorithm with decision stumps is used as the backbone of a face detection system. This approach became widely popular and is still in extensive use. The root of the success, in this case, is the ability to select the most useful Haar-like features from among thousands. The example of the Viola–Jones model shows the high potential of boosting combined with decision stumps as a method of classifier training with a built-in feature selection mechanism. In fact, each weak classifier in the form of a decision stump selects just one feature.

In this work, a decision stump is also used as the weak classifier in the proposed modification of AdaBoost. However, the main focus is not on the feature selection ability of this boosting framework, but on introducing a kind of regularization mechanism to AdaBoost. The main idea is to limit access to features already utilized by previously trained weak classifiers. This is done by introducing a tabu list with features not available to the currently trained weak classifier. While this idea could be implemented with any type of base classifier, using it with decision stumps is straightforward, as each of them uses just one feature, which in turn becomes unavailable for the next ones. This work shows that introducing a tabu list with forbidden features can improve the performance of the original AdaBoost. The length of the tabu list becomes a hyper-parameter of the proposed algorithm and must be tuned for a given problem. In Section 4, its importance is discussed based on the results of numerical experiments.

2.3. Boosting as a greedy optimization procedure

One challenge of learning algorithms is the danger of over-fitting to the training data. This can happen when the model is trained for too long or the model is too complicated. Boosting is not free from this challenge. Although boosting forever used to be seen as a safe strategy, adding more weak learners to the aggregated model can decrease the final test performance due to over-fitting: adding weak learners results in a more complicated final classifier that is more adjusted to the training data but is not guaranteed to decrease the test error. This can be observed empirically in Section 5 of this work: for AdaBoost, the maximum number of weak learners was 500 for the UCI problems, but the best test performances were usually obtained with much smaller numbers of weak learners. Over-fitting explains this behavior. Thus, setting the proper number of base learners is a usual meta-parameter of boosting procedures; this number can be estimated using cross-validation.

The problem of over-fitting can also be partially attributed to the fact that boosting is a greedy optimization procedure. In each step, a weak learner is selected that minimizes the objective function (i.e., the exponential loss) while all the previous learners are frozen and no longer modifiable. This does not guarantee an optimal set of base learners, i.e., one that minimizes the classification error on the test examples. Thus, the whole boosting procedure can be interpreted as an iterated greedy optimization procedure, which opens up the opportunity to utilize techniques from search and optimization heuristics.

2.4. Greedy optimization and tabu search
Greedy algorithms are widely known and used in the optimization community, and they are relatively simple from a practical point of view. Examples include the simplex algorithm and the gradient-based learning algorithms popular in the neural networks community. The alternatives are global optimization algorithms, often in the form of heuristics such as simulated annealing or bio-inspired methods (genetic algorithms, swarm optimization and many others). Analyzing the boosting framework reveals that its greedy procedure is also one of its main advantages, as the ability to create one weak learner at a time is desirable from the practical point of view. Trying to apply a global optimization algorithm to optimize the set of weak learners and their weights would effectively eliminate this benefit. However, some techniques from the greedy optimization family can be used to limit the negative effects of boosting's greediness. This paper shows that combining AdaBoost with a simple tabu list can improve both the classification accuracy and the simplicity of the resulting aggregated model. It can also prevent over-fitting and play a regularization role similar to methods known from, for example, neural network learning algorithms. The general greedy optimization algorithm is presented as Algorithm 3.
Algorithm 3. General greedy optimization (local search)
Input: f(x) – the function to be minimized (the problem to be solved); n – the number of iterations.
Output: x* – the best solution found.
1. Initialize x* (randomly or based on problem-specific knowledge).
2. For i = 1 to n:
   a. Generate a new solution x from the neighborhood of x*.
   b. If f(x) < f(x*), replace x* with x.
3. Return x*.
End of Algorithm 3
Tabu search is the general name of the approach used in many practical greedy search methods. It can be easily adapted to the problem being solved, while its central idea remains mostly unchanged. A list of previously encountered solutions is maintained, and the algorithm is forbidden to visit them again, at least for a specified number of iterations. This simple idea often allows the algorithm to escape from a local minimum and explore the search space, potentially leading to better solutions. The general tabu search algorithm is given as Algorithm 4. In Algorithm 4, tabu_list.length is the number of solutions that are currently forbidden (they are tabu). The function tabu_list.contains(x) returns true if solution x is currently on the tabu list.

Algorithm 4. General tabu search
Input: f(x) – the function to be minimized (the problem to be solved); n – the number of iterations; max_tabu_len – the maximum length of the tabu list.
Output: x* – the best solution found.
1. Initialize x* (randomly or based on problem-specific knowledge).
2. Set tabu_list as an empty list.
3. For i = 1 to n:
   a. Generate a new solution x from the neighborhood of x*.
   b. If not tabu_list.contains(x) and f(x) < f(x*), replace x* with x.
   c. Add x at the end of tabu_list.
   d. If tabu_list.length > max_tabu_len, remove the first element from tabu_list.
4. Return x*.
End of Algorithm 4
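A compact Python sketch of Algorithm 4 follows; the objective f and the neighbor-generating function are user-supplied placeholders, not part of the original algorithm specification:

from collections import deque

def tabu_search(f, neighbor, x0, n_iters, max_tabu_len):
    """Sketch of Algorithm 4. `f` scores a solution; `neighbor`
    proposes a new solution near the current best."""
    best = x0
    tabu = deque(maxlen=max_tabu_len)   # old entries fall off automatically (steps 3c-3d)
    for _ in range(n_iters):
        x = neighbor(best)              # step 3a: sample the neighborhood
        if x not in tabu and f(x) < f(best):
            best = x                    # step 3b: accept improving, non-tabu solutions
        tabu.append(x)
    return best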
3. Proposed approach

It is evident from the previous section that AdaBoost can be interpreted as a greedy optimization procedure. The idea introduced in this section is based on the TS scheme presented in Algorithm 4. Algorithm 5 introduces a list of features that have already been used by some of the previous weak learners and are forbidden for the currently trained one (step 3a of Algorithm 5). The novel steps (compared to Algorithm 1) are 3a, 3b and 3c, where the maintenance of the tabu list is carried out. In this work, decision stumps are used as weak learners. This is very convenient in the proposed framework, as precisely one feature is utilized by a given decision stump; this feature is placed on the tabu list. The optimal length of the tabu list is not known. It becomes a new meta-parameter of the proposed algorithm and can be fine-tuned for a given classification problem based on cross-validation experiments.

Algorithm 5. TabuBoost (AdaBoost.M1 with a tabu list)
Input: Training examples (x_i, y_i), x_i ∈ R^K, y_i ∈ {−1, 1}, i = 1, 2, ..., N; M – the number of weak classifiers to train; tabu_len – the maximum length of the tabu list.
Output: Aggregated classifier H(x).
1. Initialize weights for training examples as w_i = 1/N.
2. Initialize tabu_list as an empty list.
3. For m = 1 to M:
   a. Train a weak classifier h_m(x) based on the training examples and the current weights distribution. The features on tabu_list cannot be used by h_m(x).
   b. Add the feature used by h_m(x) at the end of tabu_list.
   c. If the length of tabu_list is greater than tabu_len, remove the first feature from tabu_list.
   d. Calculate the weighted error of classifier h_m(x) as
      error_m = [Σ_{i=1..N} w_i · I(y_i ≠ h_m(x_i))] / [Σ_{i=1..N} w_i].   (1)
   e. Calculate the weight of classifier h_m(x), indicating its impact on the final classification, as
      α_m = log((1 − error_m) / error_m).   (2)
   f. Update the weights distribution as
      w_i = w_i · exp[α_m · I(y_i ≠ h_m(x_i))],   (3)
      w_i = w_i / Σ_{j=1..N} w_j.   (4)
4. Return the final aggregated classifier as
   H(x) = sign[Σ_{m=1..M} α_m · h_m(x)].   (5)
End of Algorithm 5
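Reusing the train_stump sketch from Section 2.2, the tabu-list bookkeeping adds only a few lines relative to the AdaBoost sketch. Again, this is a hedged illustration, not the author's reference implementation:

import numpy as np
from collections import deque

def tabuboost(X, y, M, tabu_len, train_stump):
    """Sketch of Algorithm 5. Requires tabu_len < K so that at
    least one feature is always available to the stump."""
    N, K = X.shape
    w = np.full(N, 1.0 / N)
    tabu = deque(maxlen=tabu_len)        # steps 3b-3c: FIFO list of forbidden features
    learners, alphas = [], []
    for _ in range(M):
        allowed = [j for j in range(K) if j not in tabu]
        h = train_stump(X, y, w, allowed=allowed)   # step 3a: tabu features excluded
        tabu.append(h.feature)           # the stump's single feature becomes tabu
        miss = (h(X) != y)
        error = w[miss].sum() / w.sum()  # Eq. (1)
        if error == 0 or error >= 0.5:
            break
        alpha = np.log((1 - error) / error)   # Eq. (2)
        w *= np.exp(alpha * miss)             # Eq. (3)
        w /= w.sum()                          # Eq. (4)
        learners.append(h)
        alphas.append(alpha)
    return lambda Xn: np.sign(sum(a * h(Xn) for a, h in zip(alphas, learners)))  # Eq. (5)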
3.1. Hybridization of TabuBoost and ε-AdaBoost

Simultaneously using a tabu list and the ε parameter is possible, as these two modifications of the original AdaBoost algorithm are not mutually exclusive. Thus, we can test whether using both modifications (a tabu list and a constant ε value) further improves the boosting procedure. It is therefore possible to propose another hybrid algorithm, ε-TabuBoost, that combines both described methods; it is presented as Algorithm 6. The resulting algorithm is simply TabuBoost with a constant value ε used as the base classifier weight in each iteration of the boosting procedure.

Two possible uses of ε-TabuBoost are tested. First, for each classification problem, we select the values of the tabu length and ε that perform best when used separately (i.e., in TabuBoost and ε-AdaBoost) and use them together in ε-TabuBoost; this approach is called ε-TabuBoost_v1. In the second approach, there is an extensive search for the best performing pair of the two parameters, the length of the tabu list and the ε value; this approach is called ε-TabuBoost_v2. All possible lengths of the tabu list in a given problem are checked, paired with the ε values from the considered set. The next section presents the effectiveness of the proposed approaches.
Algorithm 6. ε-TabuBoost (ε-AdaBoost with a tabu list)
Input: Training examples (x_i, y_i), x_i ∈ R^K, y_i ∈ {−1, 1}, i = 1, 2, ..., N; M – the number of weak classifiers to train; tabu_len – the maximum length of the tabu list; ε – a parameter of the algorithm, a positive real value.
Output: Aggregated classifier H(x).
1. Initialize weights for training examples as w_i = 1/N.
2. Initialize tabu_list as an empty list.
3. For m = 1 to M:
   a. Train a weak classifier h_m(x) based on the training examples and the current weights distribution. The features on tabu_list cannot be used by h_m(x).
   b. Add the feature used by h_m(x) at the end of tabu_list.
   c. If the length of tabu_list is greater than tabu_len, remove the first feature from tabu_list.
   d. Update the weights distribution as
      w_i = w_i · exp[ε · I(y_i ≠ h_m(x_i))],   (1)
      w_i = w_i / Σ_{j=1..N} w_j.   (2)
4. Return the final aggregated classifier as
   H(x) = sign[Σ_{m=1..M} ε · h_m(x)].   (3)
End of Algorithm 6
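Algorithm 6 differs from Algorithm 5 only in the weight update and the final vote (the fitted coefficient α_m is replaced by the constant ε), so a sketch is equally short; train_stump is the same placeholder as before:

import numpy as np
from collections import deque

def eps_tabuboost(X, y, M, tabu_len, eps, train_stump):
    """Sketch of Algorithm 6: TabuBoost with a constant coefficient eps."""
    N, K = X.shape
    w = np.full(N, 1.0 / N)
    tabu = deque(maxlen=tabu_len)
    learners = []
    for _ in range(M):
        allowed = [j for j in range(K) if j not in tabu]
        h = train_stump(X, y, w, allowed=allowed)  # step 3a: tabu features excluded
        tabu.append(h.feature)                     # steps 3b-3c
        miss = (h(X) != y)
        w *= np.exp(eps * miss)                    # step 3d, Eq. (1): constant eps
        w /= w.sum()                               # Eq. (2)
        learners.append(h)
    # Eq. (3): every weak learner gets the same weight eps
    return lambda Xn: np.sign(sum(eps * h(Xn) for h in learners))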
Table 1
Summary of the test datasets.

Dataset                    Features   Examples
Pima                       8          768
Parkinson's                22         195
Sonar                      60         208
Ionosphere                 33         351
QSAR                       41         1055
ILPD                       10         579
SPECTF                     44         267
Messidor                   19         1151
Wisconsin diagnostic       30         569
Wisconsin original         9          683
Wisconsin prognostic       32         194
Banknote authentication    4          1372
Planning-relax             12         182
Thoracic surgery           16         470
kc1                        21         2109
kc3                        39         458
Spambase                   57         4601
Ozone level 8 h            72         2534
Hill-valley                100        1212
Musk                       166        6598
4. Experimental studies

This section presents the results of numerical experiments. The proposed TabuBoost algorithm was applied to several two-class classification problems of varying difficulty, with different numbers of training examples and different numbers of features. These problems include 20 datasets from the UCI Machine Learning Repository [5], a face verification problem and a face-based gender recognition problem. The results of the proposed algorithm are compared with those of the baseline AdaBoost and ε-AdaBoost. Other classifiers are not included, as the main purpose of the tests is to show the advantage of including the tabu list in the AdaBoost framework, not to provide the best classifier for each problem. The importance of the proper selection of the length of the tabu list is also studied. Additionally, a possible hybridization of TabuBoost with ε-AdaBoost is analyzed.

Note that AdaBoost is itself a very well performing classification algorithm, so we did not expect large improvements in the classification ratio in every case. However, it should be underlined at the outset that the proposed TabuBoost algorithm has advantages according to several criteria used to compare classifiers and learning processes. The first advantage is an increase in classification accuracy, which is obviously desirable. The second is the ability to achieve the same level of classification accuracy with a smaller number of base classifiers (weak learners) for some problems, which means a simpler (and thus faster) classifier. The third advantage is connected with the regularization ability of TabuBoost and can be described as the ability to prevent over-fitting when adding subsequent decision stumps: as more and more weak learners are added, the test error often keeps decreasing for TabuBoost, while for AdaBoost the learning process does not seem to further improve the classifier. Although these three benefits are not always simultaneously present in each classification problem, all three types of benefits of using TabuBoost over AdaBoost are seen in this section.
4.1. Results for machine learning datasets

This section presents a comparative analysis of TabuBoost, AdaBoost, ε-AdaBoost and ε-TabuBoost based on 20 two-class classification problems from the UCI Machine Learning Repository. Table 1 describes the datasets used, including the names, the numbers of features (all numeric) and the numbers of examples.

The algorithms were compared with 10-fold cross-validation. Even when a dataset has a recommended division into training and test sets, all examples were used in the cross-validation in order to keep the analysis uniform across datasets. Additionally, the divisions into 10 folds in the cross-validation experiments were the same for all algorithms. The maximum number of weak learners (decision stumps) for all algorithms was set to 500. Based on the 10-fold cross-validation error, the best performing number of weak learners is reported for each algorithm.

Additionally, in the case of TabuBoost, there is a new parameter to be selected: the length of the tabu list. In each classification problem, all possible lengths of the tabu list were tried and the best performing one (according to the mean cross-validation classification error) is reported. In each problem, the number of different lengths of the tabu list depends on the number of available features: the minimum length is one and the maximum length equals the number of features minus 1. Of course, there is no sense in setting the length of the tabu list greater than the number of decision stumps; however, in this set of experiments, the number of features never exceeded the maximum number of weak learners (500) in any problem. For ε-AdaBoost, the possible values of ε were {1, 0.1, 0.01, 0.001, 0.0001, 0.00001} and {0.5, 0.05, 0.005, 0.0005, 0.00005}. The best results are reported.

4.1.1. Comparison of AdaBoost, ε-AdaBoost and TabuBoost

Table 2 presents the best mean test error rates and the standard deviations for the classification problems considered. The best performing number of weak learners is reported; for TabuBoost, the optimal tabu length is also given, and for ε-AdaBoost, the best performing ε value. Table 2 also presents the results for ε-TabuBoost_v1 and ε-TabuBoost_v2; their analysis is found in the next section. Let us focus now on analyzing the effects of introducing the tabu list.
Table 2
Results for UCI datasets. Best results were marked as bold in the original. Per algorithm: mean test error rate (Err), standard deviation (Std), number of stumps (St.), tabu list length (Tabu) and ε, where applicable.

Dataset                  | AdaBoost            | TabuBoost                | ε-AdaBoost                | ε-TabuBoost_v1                 | ε-TabuBoost_v2
                         | Err    Std    St.   | Err    Std    St.  Tabu  | Err    Std    St.   ε     | Err    Std    St.  Tabu  ε     | Err    Std    St.  Tabu  ε
Pima                     | 0.2326 0.0554 43    | 0.2421 0.0576 37   2     | 0.2252 0.0497 148   0.01  | 0.2325 0.0628 401  2     0.01  | 0.2255 0.0574 48   3     0.1
Parkinson's              | 0.0583 0.0527 364   | 0.0446 0.0474 208  13    | 0.0456 0.0524 495   1.0   | 0.0513 0.0493 389  13    1.0   | 0.0420 0.0445 196  7     0.5
Sonar                    | 0.1426 0.0797 415   | 0.1249 0.0650 236  46    | 0.1350 0.0715 50    0.1   | 0.1408 0.0835 79   46    0.1   | 0.1162 0.0451 240  16    1.0
Ionosphere               | 0.0797 0.0540 92    | 0.0715 0.0475 470  25    | 0.0683 0.0444 128   0.1   | 0.0854 0.0438 178  25    0.1   | 0.0622 0.0435 463  14    1.0
QSAR                     | 0.1372 0.0357 171   | 0.1277 0.0252 487  27    | 0.1324 0.0242 130   0.1   | 0.1281 0.0341 476  27    0.1   | 0.1230 0.0313 488  29    0.1
ILPD                     | 0.3164 0.0602 430   | 0.3089 0.0593 356  5     | 0.3092 0.0597 353   0.1   | 0.3119 0.0599 263  5     0.1   | 0.2758 0.0673 473  5     1.0
SPECTF                   | 0.1876 0.0659 89    | 0.1861 0.0654 294  32    | 0.1826 0.0732 49    0.1   | 0.1733 0.0696 149  32    0.1   | 0.1531 0.0486 141  5     1.0
Messidor                 | 0.2760 0.0305 255   | 0.2844 0.0381 303  1     | 0.2787 0.0468 452   0.1   | 0.2781 0.0406 212  1     0.1   | 0.2774 0.0400 205  3     0.05
Wisconsin diagnostic     | 0.0148 0.0146 64    | 0.0116 0.0127 257  24    | 0.0165 0.0123 213   1.0   | 0.0136 0.0141 351  24    1.0   | 0.0115 0.0122 241  22    0.5
Wisconsin original       | 0.0339 0.0194 28    | 0.0313 0.0193 39   8     | 0.0261 0.0193 32    0.1   | 0.0239 0.0183 107  8     0.1   | 0.0224 0.0136 26   4     0.1
Wisconsin prognostic     | 0.3195 0.0963 422   | 0.2889 0.1062 42   9     | 0.3128 0.0937 47    0.1   | 0.3388 0.0786 358  9     0.1   | 0.2707 0.0878 213  9     1.0
Banknote authentication  | 0.0014 0.0029 115   | 0.0007 0.0021 73   3     | 0.0014 0.0029 59    0.5   | 0.0007 0.0021 200  3     0.5   | 0.0000 0.0000 119  3     1.0
Planning-relax           | 0.4389 0.0934 1     | 0.4159 0.0705 103  11    | 0.4171 0.1094 383   0.5   | 0.4488 0.1059 465  11    0.5   | 0.3726 0.1332 381  11    1.0
Thoracic surgery         | 0.2984 0.0497 380   | 0.2869 0.0517 202  4     | 0.2824 0.0396 285   0.1   | 0.2809 0.0495 99   4     0.1   | 0.2346 0.0467 363  3     0.5
kc1                      | 0.2579 0.0303 481   | 0.2472 0.0252 499  19    | 0.2186 0.0224 375   0.1   | 0.2253 0.0277 485  19    0.1   | 0.1922 0.0235 495  5     0.5
kc3                      | 0.1472 0.0441 192   | 0.1164 0.0415 338  38    | 0.1289 0.0168 77    0.5   | 0.1320 0.0358 475  38    0.5   | 0.1116 0.0373 153  8     1.0
Spambase                 | 0.0538 0.0105 326   | 0.0512 0.0133 460  54    | 0.0528 0.0103 488   0.05  | 0.0549 0.0111 500  54    0.05  | 0.0494 0.0101 344  40    0.1
Ozone level 8 h          | 0.0801 0.0115 261   | 0.0765 0.0135 434  27    | 0.0804 0.0142 435   0.1   | 0.0811 0.0122 361  27    0.1   | 0.0674 0.0157 471  69    0.5
Hill-valley              | 0.4423 0.0347 497   | 0.4303 0.0383 493  19    | 0.3877 0.0475 499   1.0   | 0.3921 0.0388 423  19    1.0   | 0.3883 0.0445 487  4     0.5
Musk                     | 0.0267 0.0053 499   | 0.0252 0.0064 490  2     | 0.0263 0.0048 476   0.1   | 0.0248 0.0075 500  2     0.1   | 0.0188 0.0052 448  36    0.5
Fig. 2. Mean test error rates for AdaBoost (‘‘0’’) and TabuBoost (‘‘32’’) with a tabu list of length 32. Example run for the QSAR database.
The results show the clear benefits of using the tabu list. First, both TabuBoost and ε-AdaBoost improved over AdaBoost in all cases except two (TabuBoost) and four (ε-AdaBoost): TabuBoost did not improve over AdaBoost for the Pima and Messidor datasets, and ε-AdaBoost for the Wisconsin diagnostic, Messidor, Banknote authentication and Ozone level datasets. The Messidor dataset was the only dataset for which AdaBoost performed best.

Second, TabuBoost is strongly complementary to ε-AdaBoost when the best results are considered. Ignoring ε-TabuBoost_v2 for now, Table 2 shows that the best results are achieved by TabuBoost for 12 datasets and by ε-AdaBoost for 8 datasets. This observation suggests that TabuBoost and ε-AdaBoost could work even better when used together; the next section presents the testing of this possible hybridization.

The increase in the classification performance of TabuBoost was likely possible due to two factors. The first factor is the increased number of weak classifiers used by TabuBoost compared to AdaBoost. Over all of the considered datasets, the total number of base classifiers is greater by 696 for TabuBoost than for AdaBoost; the average number of base classifiers per problem is 256 for AdaBoost and 291 for TabuBoost. Adding new base classifiers is more beneficial for TabuBoost, as they can be used more effectively. This effect can be observed in several datasets, e.g., Ionosphere (470 stumps for TabuBoost vs. 92 for AdaBoost) and QSAR (487 vs. 171), among others. Adding more weak learners to AdaBoost led to over-fitting, while TabuBoost, due to its less greedy nature, improved with the addition of more classifiers. This can be observed in Fig. 2, where the dependence of the mean test error on the number of decision stumps is presented for the QSAR dataset. The figure shows that adding new decision stumps does not improve the performance of AdaBoost, while the test error rate for TabuBoost keeps decreasing.

The second factor is TabuBoost's ability to achieve the same or better classification accuracy with a smaller number of base classifiers. By selecting the subsequent stumps in a less greedy way, TabuBoost can alter the order in which the features are selected, i.e., used by subsequent base classifiers. Similar or better performance with a smaller number of base classifiers was achieved by TabuBoost, for example, in the Wisconsin prognostic (42 stumps for TabuBoost vs. 422 for AdaBoost) and Sonar (236 vs. 415) datasets. A smaller number of weak learners also means a simpler and faster classifier, which can be important in some applications. Even when neither AdaBoost nor TabuBoost is able to make use of additional stumps, TabuBoost reaches its minimum earlier, and that minimum is still slightly lower than AdaBoost's.

There is no clear winner between ε-AdaBoost and TabuBoost; in 12 cases, TabuBoost achieved better results than ε-AdaBoost. A more thorough statistical comparison is presented in Section 4.1.3.
4.1.2. Results for hybridization of TabuBoost and ε-AdaBoost

The results analyzed in the previous section suggest that TabuBoost and ε-AdaBoost are highly complementary. Here we test the idea that using both types of regularization (a tabu list and a constant ε value) can further improve the boosting procedure. Two approaches to the hybridization of TabuBoost and ε-AdaBoost were tested. First, the values of the tabu length and ε that perform best when used separately (in TabuBoost and ε-AdaBoost) were selected for each classification problem; this approach is called ε-TabuBoost_v1. In the second approach, there was an extensive search for the best performing pair of the parameters, the length of the tabu list and the ε value; this approach is called ε-TabuBoost_v2. For all datasets except Musk, all possible lengths of the tabu list in a given problem were checked, paired with the ε values from the previously defined set. For the Musk dataset, given the huge number of possible pairs of tabu list lengths and ε values, only arbitrarily selected combinations were tested (575 pairs).

The results of these two hybrid approaches are presented in Table 2. The table shows that for ε-TabuBoost_v1, simply combining the best length of the tabu list for TabuBoost with the best performing value of ε for ε-AdaBoost does not improve the classification performance much. However, searching for the best pair of tabu length and ε results in the best performing variant, ε-TabuBoost_v2: comparing the results in Table 2, the mean test error rates are the best for ε-TabuBoost_v2 for all datasets except three. Given the overall good results of ε-TabuBoost_v2, these three exceptions do not change the final judgment.

4.1.3. Statistical comparison of the results

In order to analyze the results in a more statistically rigorous way, a set of statistical tests was used to compare the algorithms. The overall performance of the five algorithms (Table 2) on 20 datasets was compared with the Friedman test, which compares the average behavior of the algorithms. Only the mean results from cross-validation for each problem and each algorithm were considered in the Friedman test; however, all problems and all algorithms were considered at once in order to discover potential differences among the average behaviors of the algorithms over all of the problems. The null hypothesis in the Friedman test is that there are no differences among the compared methods when accounting for their averaged performances on the considered problems. The p-value delivered by the Friedman test is the probability of obtaining the observed results under the null hypothesis; thus, small p-values (usually smaller than 0.05 or 0.01) allow rejecting the null hypothesis.

A rejected null hypothesis only indicates that statistically significant differences exist somewhere among the algorithms. To find precisely which algorithms perform significantly better or worse, post-hoc pair-wise comparisons of the algorithms were executed with a proper test (the Shaffer test in this work). This two-step testing procedure is designed to prevent too many false positive test results between pairs of compared algorithms, i.e., Shaffer pair-wise comparisons are only executed when the Friedman test rejects its null hypothesis. A false positive occurs when a test incorrectly rejects the null hypothesis and reports differences between algorithms when there are in fact none.
The Friedman test also provides a ranking of the algorithms, which allows sorting them from the best performing (smaller rank values) to the worst (higher rank values). More details about the adopted testing procedure can be found in [26]. All tests were calculated using the KEEL software package [27,28]. Table 3 presents the ranks of each algorithm as assigned by the Friedman test. ε-TabuBoost_v2 is ranked first (the lowest rank value, i.e., the best algorithm), followed by TabuBoost and ε-AdaBoost.
Table 3
Friedman ranks for the algorithms. The p-value of the null hypothesis is 5.32E−09.

Algorithm         Friedman rank
ε-TabuBoost_v2    1.15
TabuBoost         2.925
ε-AdaBoost        3.075
ε-TabuBoost_v1    3.475
AdaBoost          4.375
Table 4
Adjusted p-values from the Shaffer pair-wise post-hoc procedures. Statistically significant results (at the 0.05 level) are marked with an asterisk.

                  TabuBoost   ε-AdaBoost   ε-TabuBoost_v1   AdaBoost
ε-TabuBoost_v2    0.002311*   0.000709*    0.00002*         0*
TabuBoost         –           0.847422     0.813996         0.02239*
ε-AdaBoost        –           –            0.847422         0.03729*
ε-TabuBoost_v1    –           –            –                0.287443
The original AdaBoost ranked worst. This result indicates that both presented ideas, using a tabu list and hybridizing it with ε-AdaBoost, are beneficial and justified. The fourth position of ε-TabuBoost_v1 indicates that finding the proper pair of the tabu list length and the ε value is not a straightforward procedure, and some computational effort has to be made in order to tune these hyperparameters.

The p-value of the Friedman test was 5.32E−09, which indicates a strong statistically significant difference in performance among the algorithms. Thus, post-hoc statistical tests (Shaffer post-hoc procedures) were performed on each pair of algorithms to find the differences in the pair-wise comparisons. The results are presented in Table 4. Assuming a significance level of 0.05, one can state that ε-TabuBoost_v2 performs significantly better than any of the other algorithms (as the corresponding p-values in Table 4 are smaller than 0.05). Both TabuBoost and ε-AdaBoost perform significantly better than AdaBoost. However, the plain TabuBoost algorithm is not proved to work significantly better than ε-AdaBoost, though it is still ranked higher by the Friedman test.

As the divisions into 10 folds in the cross-validation experiments were the same for all algorithms, the test results from each fold could also be compared using the Friedman test. In this case, there are 10 test results for each of the 20 datasets, giving a total of 200 test results for each algorithm. We performed this analysis, and all results are consistent with those of the Friedman test applied to the mean test error rates: the p-value of the Friedman test was 3.70E−11 and the ranking of the algorithms is the same, with slightly different rank values. Also, the post-hoc comparisons between pairs of algorithms gave exactly the same conclusions about the statistically significant differences between the considered methods.

4.1.4. The importance of the length of the tabu list

An interesting and important question is: how difficult is it to find the proper length of the tabu list in the proposed algorithm? Table 2 only presents the results for the best value of the tabu length parameter of TabuBoost; however, other values were also tested. For each dataset, each possible value of the tabu length (from 1 to the number of features minus 1) was evaluated in a separate 10-fold cross-validation run. Fig. 3 shows example results for the Sonar dataset. The vertical bars show the best mean test classification errors for different tabu list lengths; the vertical bar for length 0 shows the best error of AdaBoost (no tabu list). It can be observed that many values of the length of the tabu list cause improvement. Finding any regularities among the results is difficult, but it can be seen that many lengths of the tabu list improve the results of AdaBoost.
Fig. 3. Best mean test error rates for different lengths of the tabu list in an example run of TabuBoost for the Sonar dataset. AdaBoost is marked as ''0''.

Table 5
Numbers of different settings of the length of the tabu list in each UCI problem.

Dataset                    Number of different tabu list lengths   Number of lengths with improvement over AdaBoost
Pima                       7                                       3
Parkinson's                21                                      3
Sonar                      59                                      53
Ionosphere                 32                                      18
QSAR                       40                                      9
ILPD                       9                                       2
SPECTF                     43                                      39
Messidor                   17                                      6
Wisconsin diagnostic       29                                      25
Wisconsin original         8                                       7
Wisconsin prognostic       31                                      4
Banknote authentication    3                                       1
Planning-relax             11                                      5
Thoracic surgery           15                                      7
kc1                        20                                      18
kc3                        38                                      38
Spambase                   56                                      36
Ozone level 8 h            71                                      12
Hill-valley                99                                      25
Musk                       165                                     10
Total                      774                                     321 (41.5%)
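In practice, selecting the length of the tabu list amounts to a plain grid search scored by cross-validation. A minimal sketch follows, assuming a hypothetical cv_error helper that runs the 10-fold cross-validation of TabuBoost for a given configuration and returns the mean test error:

def select_tabu_length(X, y, cv_error, M=500):
    """Grid search over all admissible tabu lengths, 1 .. K-1."""
    K = X.shape[1]
    errors = {L: cv_error(X, y, M=M, tabu_len=L) for L in range(1, K)}
    return min(errors, key=errors.get)   # length with the smallest mean CV error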
However, in practice, the length of the tabu list must be treated as a hyperparameter that needs to be tuned by a procedure such as cross-validation. In order to further analyze this issue, Table 5 summarizes, for each dataset, the number of different tabu list lengths tested and the number of cases with improvements over AdaBoost in one example run of the experiment. There are classification problems in which almost any value of the length of the tabu list brings an improvement.

In summary, the ideas introduced in TabuBoost and its hybridized version improve AdaBoost in several different ways. In some cases, TabuBoost is able to utilize additional weak learners more efficiently, thus decreasing test errors. In other cases, TabuBoost achieves comparable results with a smaller number of weak learners, leading to simpler and faster classifiers. The version of TabuBoost hybridized with the constant-ε regularization achieved the best results, which were verified with the Friedman test as being significantly better than those of the original AdaBoost.

4.2. Face verification and gender recognition problems

This section presents the results of the proposed TabuBoost algorithm for face recognition problems (face verification and gender recognition). Compared to the classification problems from the previous section, face recognition represents a different problem domain (image recognition).
Note that the goal here was not to develop a new method of face recognition, nor to compare with existing methods, but rather to present the results of the proposed algorithm for a problem of higher dimensionality, i.e., with a larger number of features.

We define the face verification problem as follows: given two face images, classify them as representing the same person (positive class) or different persons (negative class). This is a two-class classification problem, even for many different persons in the database. In this section, we present the results for a subset of the widely known FERET face dataset [29]. This subset consisted of frontal face images; it included 3878 images of 1192 persons. All images were normalized by aligning them so that the eyes were always exactly on the horizontal line and at the same distance from the top of the image. The distances between the eyes were also the same for all images. All images were scaled to the same size of 100 × 120 pixels. No further preprocessing was done. Fig. 5 presents an example of a normalized image.

4.2.1. Gabor filters

There exist many image descriptors that have proved successful in different image recognition tasks. A family of local patterns has been widely used in recent years [30,31]. Another type of successful image descriptor is the Gabor filter, which can be utilized together with local patterns. More information on these image descriptors can be found in [30]. Here, we provide only the necessary information about the approaches used to describe the content of the images when testing the proposed TabuBoost.

A Gabor filter has a real and an imaginary part, W_re and W_im, respectively. They are defined as

W_re(x, y, θ, λ, φ, σ, γ) = exp(−(x′² + γ²y′²) / (2σ²)) · cos(2π x′/λ + φ)   (2)
W_im(x, y, θ, λ, φ, σ, γ) = exp(−(x′² + γ²y′²) / (2σ²)) · sin(2π x′/λ + φ)   (3)

where θ is the orientation of the 2D wave function, λ is the wavelength of the sinusoidal part of the filter, φ is the phase offset, σ is the width of the Gaussian function, γ represents the spatial aspect ratio responsible for the ellipticity of the filter, and

x′ = x cos(θ) + y sin(θ),   (4)
y′ = −x sin(θ) + y cos(θ).   (5)

Fig. 4 presents example Gabor filters (the real parts) with five different orientations and three wavelengths. Gabor filters make it possible to measure localized responses to a given frequency of the two-dimensional signal (an image), computed as a two-dimensional convolution of the image and the filter. Localization of the response is due to the Gaussian part of the filter. In practice, given the real and imaginary responses, the usual approach is to calculate the magnitude and the phase of the response for each pixel. Fig. 5 presents the Gabor magnitude and phase images for an example image. In some approaches, the Gabor responses are measured only at selected landmark locations on the face image. It is also possible to calculate the full Gabor magnitude and Gabor phase images and use them for further processing. The approach employed in this work was to use Gabor filtered images in conjunction with Local Binary Patterns (LBP), which are described below.

Fig. 4. Example Gabor filters (the real parts) used in this work.

Fig. 5. Example normalized face image (left), the Gabor magnitude image (middle) and the Gabor phase image (right).
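As an illustration, OpenCV exposes this filter family directly. The following sketch computes a Gabor magnitude and phase image for one orientation and wavelength (parameter values borrowed from Section 4.2.3; the file name face.png is a hypothetical placeholder, and the rest of the pipeline is assumed):

import cv2
import numpy as np

img = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

ksize, lam, theta, sigma, gamma = 41, 8.0, np.pi / 5, 41 / 5.0, 1.0
# Real part (phi = 0) and imaginary part (phi = pi/2, up to OpenCV's sign convention)
k_re = cv2.getGaborKernel((ksize, ksize), sigma, theta, lam, gamma, psi=0)
k_im = cv2.getGaborKernel((ksize, ksize), sigma, theta, lam, gamma, psi=np.pi / 2)

resp_re = cv2.filter2D(img, cv2.CV_32F, k_re)   # response of the real part (Eq. 2)
resp_im = cv2.filter2D(img, cv2.CV_32F, k_im)   # response of the imaginary part (Eq. 3)

magnitude = np.sqrt(resp_re**2 + resp_im**2)    # Gabor magnitude image (Fig. 5, middle)
phase = np.arctan2(resp_im, resp_re)            # Gabor phase image (Fig. 5, right)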
4.2.2. Local binary patterns

The Local Binary Pattern (LBP), in its original definition [30], is calculated for each pixel p_c of the image based on its surrounding 3 × 3 neighborhood. For each pixel p_c, its new value in the LBP image is calculated as

LBP(p_c) = Σ_{i=0..7} s(p_i − p_c) · 2^i   (6)

where

s(x) = 0 if x < 0,
     = 1 if x ≥ 0.   (7)
The relation of each neighboring pixel to p_c is thus encoded as a single bit. With exactly eight neighbors, it is possible to encode the eight bits as a single decimal value, which becomes the value of p_c in the LBP image. The upper part of Fig. 6 summarizes this procedure. The lower part of Fig. 6 presents the idea of preparing the final LBP descriptor for the image. After calculating the LBP image, it is divided into several disjoint regions. For each region, a histogram of the pixel values is calculated, and the final descriptor is composed as a concatenation of all local histograms. The length of the descriptor (i.e., the number of features describing the image) depends on the number of regions and the number of bins used for preparing each local histogram.

The LBP image can be calculated directly from the original image. It is also possible to apply this procedure to Gabor filtered images, for both magnitude and phase. With a set of Gabor filters, each Gabor filter is first used to produce the magnitude and phase images, then the LBP images are calculated, and all histograms are concatenated in order to prepare the final descriptor. This approach produces much longer descriptors (i.e., a higher number of features) than when LBP is applied only to the original image. For example, for Gabor filters with five different orientations and three wavelengths (15 filters in total), 15 Gabor magnitude and 15 phase images are calculated and each one is processed by LBP; the final descriptor is thus 30 times longer than when applying LBP only to the original image. However, in many tests, combining Gabor filtered images with LBP-histogram-based descriptors performs better than the classic LBP [30]. Fig. 7 shows the LBP images for the images in Fig. 5.
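A minimal NumPy sketch of Eqs. (6)-(7) and the regional-histogram descriptor of Fig. 6 follows (the neighbor ordering is one of several equivalent conventions, and the region/bin counts match Section 4.2.3):

import numpy as np

def lbp_image(img):
    """Basic 3x3 LBP (Eq. 6): each pixel becomes an 8-bit code built
    from sign comparisons with its eight neighbors."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]   # eight neighbors, one bit each
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy: h - 1 + dy, 1 + dx: w - 1 + dx]
        out |= (neighbor >= center).astype(np.uint8) << bit   # s(p_i - p_c) * 2^i
    return out

def lbp_histogram_descriptor(img, rows=4, cols=3, bins=32):
    """Concatenated regional histograms (Fig. 6, lower part)."""
    lbp = lbp_image(img)
    h, w = lbp.shape
    feats = []
    for r in range(rows):
        for c in range(cols):
            region = lbp[r * h // rows:(r + 1) * h // rows,
                         c * w // cols:(c + 1) * w // cols]
            hist, _ = np.histogram(region, bins=bins, range=(0, 256))
            feats.append(hist)
    return np.concatenate(feats)   # length = rows * cols * bins (12 * 32 = 384)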
Fig. 6. General idea of the LBP-histogram-based description of the face image.
feature vectors xi and xk , the feature vector zm describing these two images as a pair is calculated as zmj = ⏐xij − xkj ⏐ .
⏐
⏐
(8)
This means that each feature describing a pair of images is the absolute value of the difference between the values of the corresponding feature taken from both images.
Fig. 7. Example face image filtered with LBP (left), its Gabor magnitude image filtered with LBP (middle) and its Gabor phase image filtered with LBP (right).
4.2.3. Descriptor settings

In this set of experiments, three types of image descriptors are used as vectors of features available to the decision stumps in the AdaBoost and TabuBoost algorithms. The first descriptor is the widely known gradient direction descriptor, in which the gradient information is calculated with Sobel filters. The gradient direction of each pixel is used to build a histogram-based descriptor exactly as described for the LBP procedure. The original image is divided into 12 regions (4 rows and 3 columns) and each local histogram uses 32 bins, so the concatenated feature vector has length 384.

The second descriptor is produced by applying LBP to the original face image. The histogram-based feature vector is prepared with the same settings as for the gradient direction descriptor (12 regions, 32 bins) and also has 384 features.

The third descriptor results from applying the LBP procedure to Gabor-filtered images, for both magnitude and phase, and concatenating all histograms. A set of 15 Gabor filters with five orientations ($\theta_i = i\pi/5$, for $i = 0, \ldots, 4$) and three wavelengths ($\lambda_i = \lambda_{\min}\,(\sqrt{2})^{\,i}$, for $i = 0, 1, 2$) was used. After preliminary tests, the kernel size was set to 41 and $\lambda_{\min} = 8$; $\sigma$ was set to kernel size/5 and $\gamma = 1$. As the descriptors produced by this approach tend to be very long, the number of bins used for each regional histogram was limited to 4, giving 1440 features in total. Table 6 summarizes all three descriptors and their settings.

In the face verification problem, given two images, the classifier must decide whether both represent the same person. In order to use common classification algorithms, the pair must be described by a single feature vector. The approach assumed in this work is as follows. For each pair of images with feature vectors $x_i$ and $x_k$, the feature vector $z_m$ describing the pair is calculated as

$z_{mj} = \left| x_{ij} - x_{kj} \right|$   (8)

This means that each feature describing a pair of images is the absolute value of the difference between the corresponding features of the two images.
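As an illustration, below is a minimal sketch of the Gabor filter bank with these settings, using OpenCV's getGaborKernel, together with the pair-difference feature of Eq. (8); the phase offset ψ and the use of a quadrature pair to obtain magnitude and phase images are assumptions, as the paper does not specify them.

```python
import numpy as np
import cv2

# Gabor bank: 5 orientations x 3 wavelengths, 41x41 kernels,
# sigma = kernel size / 5, gamma = 1 (settings of Section 4.2.3).
ksize, lambda_min, gamma = 41, 8.0, 1.0
sigma = ksize / 5.0
bank = [cv2.getGaborKernel((ksize, ksize), sigma,
                           theta=i * np.pi / 5.0,
                           lambd=lambda_min * np.sqrt(2.0) ** j,
                           gamma=gamma, psi=0.0)  # psi is an assumption
        for i in range(5) for j in range(3)]
# Magnitude and phase images would be obtained by also filtering with a
# quadrature kernel (psi shifted by pi/2) and combining the two responses.

def pair_features(x_i, x_k):
    """Eq. (8): a pair of images is described by the absolute differences
    of their corresponding descriptor features."""
    return np.abs(x_i - x_k)
```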
4.2.4. Results for face verification

A subset of the FERET face database was used in this test [29]. The subset consisted of the frontal face images: 3878 images of 1192 persons, with a varying number of images per person. Pairing all images produced 7,517,503 pairs: 6,822 pairs representing the same person (the positive class) and 7,510,681 pairs representing different persons (the negative class). This is a highly imbalanced learning problem. In order to avoid bias toward the dominating class, the experimental setting was as follows. All pairs of the positive class were collected, and a random set of 100,000 pairs from the negative class was created. From these sets, a randomly generated training set containing 4000 examples from the positive class and 4000 examples from the negative class was drawn; the remaining examples constituted the test set. Given the large number of examples and the computational demands of the tested algorithms, the algorithms were compared by training them on the training set and testing them on the test set; the cross-validation procedure was not used.

It can be argued that face verification is a problem in which one class is of particular importance to classify correctly. This class, called the positive class, here represents pairs of images of the same person. A primary requirement of a suitable classifier is that it correctly classifies examples from the positive class, which is measured by its sensitivity, defined as

$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$   (9)
where TP is the number of true positives (positive examples correctly classified as positive) and FN is the number of false negatives (positive examples incorrectly classified as negative). Sensitivity measures the ability of the classifier to discover the positive cases (i.e., same-person pairs), provided that they really are positive. The ability of the classifier to classify examples as negative (i.e., different persons), provided that they are in fact negative, is described by its specificity, defined as

$\mathrm{Specificity} = \frac{TN}{TN + FP}$   (10)
Table 6
Summary of the image descriptors used in the experiments.

| Descriptor | Parameters of Gabor filters | Local histograms | Hist. bins | Features (total) |
|---|---|---|---|---|
| Gradient direction (Sobel filter) | – | 4 rows × 3 columns = 12 | 32 | 384 |
| Local Binary Patterns (LBP) | – | 4 rows × 3 columns = 12 | 32 | 384 |
| Gabor (magnitude and phase) + LBP | kernel size = 41, λmin = 8, No. of orientations = 5, No. of wavelengths = 3 | 4 rows × 3 columns = 12 | 4 | 1440 |
Table 7
Test AUC values in the face verification task. The best result for each descriptor is marked in bold.

| Descriptor | AdaBoost AUC | Stumps | TabuBoost AUC | Stumps | Tabu len. | ε-AdaBoost AUC | Stumps | ε | ε-TabuBoost_v2 AUC | Stumps | Tabu len. | ε |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gradient direction (Sobel filter) | 0.9648 | 1369 | 0.9659 | 1744 | 280 | 0.9663 | 1995 | 0.01 | **0.9675** | 573 | 120 | 0.05 |
| LBP | 0.9615 | 1285 | 0.9623 | 1351 | 120 | 0.9656 | 1982 | 0.01 | **0.9661** | 1935 | 80 | 0.01 |
| Gabor + LBP | 0.9897 | 1967 | 0.9903 | 1983 | 120 | **0.9909** | 1811 | 0.1 | 0.9905 | 1299 | 80 | 0.1 |
where TN is the number of true negatives (negative cases correctly classified as negative) and FP is the number of false positives (negative cases incorrectly classified as positive). The false alarm rate is defined as

$\mathrm{FalseAlarmRate} = 1 - \mathrm{Specificity}$   (11)
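For reference, a small sketch of these measures in Python, with labels encoded as 1 for the positive class and 0 for the negative class; the AUC computation with scikit-learn from continuous scores is noted at the end.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detector_metrics(y_true, y_pred):
    # Eqs. (9)-(11); y_true, y_pred are binary arrays (1 = positive).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, 1.0 - specificity  # false alarm rate

# The ROC curve and its AUC are computed from continuous scores
# (e.g., the weighted sum of the weak learners' outputs), not from
# hard 0/1 decisions:
# auc = roc_auc_score(y_true, scores)
```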
Using both sensitivity and specificity is a better way to compare classifiers that can be interpreted as detectors than using an error rate that does not distinguish the type of error (false negative or false positive). In this case, the primary goal is to detect pairs of images showing the same person. The usual approach is to analyze the ROC (Receiver Operating Characteristic) curve, constructed as the plot of sensitivity versus false alarm rate for a given classifier. It shows how effectively the classifier increases its sensitivity as a higher false alarm rate is allowed. An ideal classifier would reach the maximum sensitivity of 1 at a false alarm rate of 0; for real-world classifiers, sensitivity increases more slowly. If a quantitative measure is needed to compare two classification models based on ROC, it is common to calculate the AUC (Area Under Curve) of the ROC curve. For an ideal classifier it equals 1; for real-world classifiers it is lower. AUC values on the test set were used to compare the performance of the classifiers.

The following algorithms and settings were used in the face verification experiments: AdaBoost, TabuBoost, ε-AdaBoost and ε-TabuBoost_v2 were trained and tested. Given the complexity of the task, after preliminary computations, the number of weak learners (decision stumps) was set to 2000 for all algorithms. For the Gabor+LBP descriptor, only tabu lists with lengths of 40, 80 and 120 were used. For the shorter descriptors (gradient direction and LBP), the considered tabu lists had lengths of k · 40 for k = 1, ..., 9 and, additionally, 380. The considered values of ε were 1.0, 0.5, 0.1, 0.05 and 0.01.

Table 7 presents the best test AUC values for the three descriptors. The test AUC value for TabuBoost is higher than for AdaBoost for all three descriptor types, showing that TabuBoost outperforms AdaBoost in this problem as well. Noticeably, TabuBoost uses the additional base classifiers more efficiently. Fig. 8 shows, for AdaBoost and TabuBoost, the dependence of the test AUC value on the number of weak learners for the third descriptor (Gabor filters + LBP); TabuBoost used a tabu list of length 120. As in the previous tests, it can be observed that TabuBoost utilizes additional weak learners to further increase the performance measure (here, AUC) much more efficiently than AdaBoost does.

The hybridization of the TabuBoost algorithm with constant ε regularization is again beneficial. Additional computations are
Fig. 8. Dependence of the test AUC value on the number of weak learners in the face verification task for AdaBoost (‘‘0’’) and TabuBoost (‘‘120’’) with a tabu list of length 120. Gabor filters + LBP are used as the image descriptor.
necessary to find proper values of ε and the length of the tabu list; however, the results achieved by ε-TabuBoost_v2 are better for two of the descriptors. For the third descriptor, the test AUC for ε-TabuBoost_v2 is only slightly worse than for the best-performing ε-AdaBoost; in contrast, ε-TabuBoost_v2 provided its solution with a much smaller number of base classifiers (1299, compared to the 1811 needed by ε-AdaBoost).

4.2.5. Face-based gender recognition

As the final application example, this section presents the results of the proposed TabuBoost algorithm for the face-based gender recognition problem. From the well-known FERET database [29], 3878 frontal face images were taken: 2430 images of males and 1448 images of females, showing 1192 different persons in total. The classification task is defined as follows: given an image, classify it as containing either a male or a female person. Images were preprocessed and normalized exactly as in the face verification task. Only the image descriptor that performed best in the face verification task was used: each image was filtered by the set of Gabor filters described in the previous section, and LBP was applied to both the magnitude and phase images. The concatenated histograms resulted in a feature vector of length 1440.

The maximum number of weak learners (decision stumps) was set to 500. Different lengths of the tabu list were tried, starting from 20 and increasing in increments of 20 up to the maximum length of 500.
Table 8
Results of the gender recognition task.

| Algorithm | Test error [%] | Std [%] | Stumps | Tabu length | ε |
|---|---|---|---|---|---|
| AdaBoost | 8.87 | 0.52 | 448 | – | – |
| TabuBoost | 8.33 | 0.79 | 472 | 80 | – |
| ε-AdaBoost | 8.72 | 0.32 | 491 | – | 0.1 |
The total number of tabu list lengths tried was thus 25. In 15 of these cases, TabuBoost improved over the plain AdaBoost algorithm according to the best test values of the 5-fold cross-validation. The best test error of AdaBoost was 8.87% with std = 0.52% for 448 weak learners (decision stumps). For TabuBoost, the best test error was 8.33% with std = 0.79% for 472 weak learners and tabu_len = 80. While only slight, an improvement over AdaBoost was again possible. The ε-AdaBoost algorithm also improved over AdaBoost, but not by as much as TabuBoost, despite using more weak learners. Table 8 presents the detailed results.

Fig. 9 shows the dependence of the mean test error rate on the number of weak learners. The test error of TabuBoost initially decreases less quickly than that of AdaBoost; however, TabuBoost is again able to better utilize the growing number of weak learners.
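A sketch of the tuning loop described above, assuming a hypothetical scikit-learn-compatible TabuBoost estimator (the paper's implementation is not public) and X, y holding the 1440-dimensional Gabor+LBP descriptors and the gender labels:

```python
from sklearn.model_selection import cross_val_score

# `TabuBoost` is a hypothetical estimator wrapping the boosting procedure
# with a tabu list; tabu lengths 20, 40, ..., 500 as in the text.
results = {}
for tabu_len in range(20, 501, 20):
    clf = TabuBoost(n_estimators=500, tabu_len=tabu_len)  # assumed API
    scores = cross_val_score(clf, X, y, cv=5)              # 5-fold CV
    results[tabu_len] = (1.0 - scores.mean(), scores.std())

best_tabu_len = min(results, key=lambda k: results[k][0])
```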
5. Discussion

This section presents a short discussion of the possible roots of TabuBoost's success. As was pointed out in the introductory section of this work, introducing a list of forbidden solutions (the tabu list) makes a greedy search algorithm less greedy, allowing it to try other solutions in the search process and potentially escape local minima. As AdaBoost, like any other boosting algorithm, can be interpreted as a kind of greedy search process, introducing the tabu list was expected to be beneficial.

However, boosting is, in fact, a learning process. Each classifier learning algorithm can be interpreted as a process of optimizing a performance criterion, and observations specific to this kind of process can be made and interpreted in the context of classifier training. In the introductory section, a simple yet effective l2-norm method for regularization, e.g., of the neural network learning process, was cited (Eq. (1)). By adding to the minimized loss function a term that depends on the values of the connection weights in the net, it is possible to force the net to learn weights that cannot grow excessively compared to the other weights. This is beneficial because a weight with a high value forces the state of a given neuron to depend heavily on that particular input signal (a feature, for the neurons of the input layer). Regularization introduces a ''forgetting'' effect into the net's learning process: weights decay over time if not stimulated by the learning examples. This leads to a set of weights with closer absolute values, which benefits the net's generalization ability and leads to an expected decrease in test error.

A similar effect can be observed for the proposed TabuBoost. Fig. 10 presents the cumulative weights produced by both AdaBoost and TabuBoost for each feature in the SPECTF dataset. The maximum number of decision stumps was set to 500 and, for TabuBoost, the tabu_len parameter was set to 32 (the best-performing value for the SPECTF dataset). Here, the cumulative weight of a given feature is calculated as the sum of the coefficients αm, m = 1, ..., M (see Algorithms 1 and 5), over all decision stumps that perform their threshold test on that feature. It can be observed that TabuBoost is able to focus more on features ignored by AdaBoost, while the most useful features still receive high accumulated weights. Thus, TabuBoost utilizes all features without sacrificing the ability to discover which features are actually most important. The accumulated weights are more uniformly distributed for TabuBoost than for AdaBoost: TabuBoost does not allow the excessive weights produced by AdaBoost, an effect similar to preventing a neural net from assigning excessively large values to only some of its input weights, which is known to be disadvantageous. By forcing a more uniform usage of all available features, TabuBoost achieves the better results described in the previous section.

Having decision stumps as the base classifier makes it easy and straightforward to put features on the tabu list, as illustrated by the sketch below. The general approach could also be extended to boosting with other base learners, such as fully grown decision trees. However, the procedure of creating the tabu list would then be more complicated, as one would have to consider the use of a given feature at different levels of the decision tree.
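To make this concrete, below is a minimal sketch of boosting decision stumps with a tabu list of features; the threshold search and the α formula follow a standard discrete-AdaBoost formulation for labels in {−1, +1} and are meant to illustrate the mechanism, not to reproduce the paper's Algorithms 1 and 5 exactly. The sketch also tracks the accumulated feature weights plotted in Fig. 10.

```python
import numpy as np

def train_tabuboost(X, y, n_stumps, tabu_len):
    """Sketch: AdaBoost-style boosting of decision stumps with a tabu list
    of features. Assumes y in {-1, +1} and tabu_len < number of features."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)           # example weights
    tabu, ensemble = [], []
    cum_weight = np.zeros(d)          # accumulated alpha_m per feature (Fig. 10)
    for m in range(n_stumps):
        best = None
        for j in range(d):
            if j in tabu:             # tabu features are forbidden
                continue
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = max(err, 1e-12)                       # avoid division by zero
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)              # reweight the examples
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
        cum_weight[j] += alpha
        tabu.append(j)                # forbid this feature ...
        if len(tabu) > tabu_len:      # ... for the next tabu_len rounds
            tabu.pop(0)
    return ensemble, cum_weight
```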
6. Conclusions

In this paper, a novel modification of the basic AdaBoost algorithm was proposed. A tabu list of forbidden features with a fixed maximum length was introduced, forcing the boosting procedure to focus on the features currently available to the weak learner being trained. The proposed algorithm, TabuBoost, with decision stumps as the base classifiers (i.e., weak learners), was tested on several machine learning datasets from the UCI Machine Learning Repository, as well as on a face verification problem and a face-based gender recognition problem based on frontal images from the FERET database. Additionally, TabuBoost was hybridized with ε-AdaBoost. The proposed algorithm showed improvement over the original AdaBoost algorithm.

The origins of the positive effect of introducing the tabu list of features into AdaBoost can be partially explained by analyzing the statistics of the selection rates and accumulated weights of each feature in a given problem. Using the tabu list can have an effect similar to some regularization techniques in neural network learning: it allows TabuBoost to focus more evenly on all available features without sacrificing the ability to discover the most important ones.
Fig. 9. Dependence of the mean test error on the number of weak learners used in the gender recognition task for AdaBoost (''0'') and TabuBoost (''80'') with a tabu list of length 80.
Fig. 10. Accumulated weights for various features for AdaBoost and TabuBoost with a tabu list of length 32 (SPECTF database).
The length of the tabu list is a new parameter of the proposed boosting procedure. It is an essential factor and can be tuned experimentally, e.g., via cross-validation; setting it to an optimal value therefore requires additional computations. However, as the tests revealed, in most problems many different values of this parameter can be expected to improve TabuBoost over AdaBoost. Although the boosting procedure itself is inherently sequential and cannot be parallelized, the search for the best-performing tabu list length and/or ε value consists of independent runs that can be executed in parallel, so the proposed approach can be used to design simpler and better versions of boosting classifiers.

The proposed method has limitations. This work only demonstrated that the proposed modification works well with the AdaBoost.M1 algorithm with decision stumps as weak learners. Future work could explore further supervised tasks, such as multi-class classification or regression, as well as other base classifiers; the latter, however, would require more complicated data structures than a simple list of features to store the forbidden (tabu) solutions. Another direction would be to use existing extensions of TS, such as aspiration criteria, which can override the tabu status of elements on the tabu list.
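For example, using joblib, the independent runs over the meta-parameter grid can be dispatched to separate workers; `evaluate` is a hypothetical helper that trains and cross-validates TabuBoost for one (tabu length, ε) setting.

```python
from joblib import Parallel, delayed

# The boosting run itself is sequential; only the meta-parameter search
# is parallelized. `evaluate` is a hypothetical helper returning a
# cross-validated test error for one setting.
grid = [(t, e) for t in (40, 80, 120) for e in (1.0, 0.5, 0.1, 0.05, 0.01)]
errors = Parallel(n_jobs=-1)(delayed(evaluate)(t, e) for t, e in grid)
best_tabu_len, best_eps = grid[errors.index(min(errors))]
```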
Conflict of interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.04.003.

References

[1] T.J. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed., Springer Series in Statistics, 2009.
[2] R.E. Schapire, Y. Freund, Boosting: Foundations and Algorithms, The MIT Press, 2012.
[3] F. Glover, Tabu search—part I, ORSA J. Comput. 1 (1989) 190–206.
[4] F. Glover, Tabu search—part II, ORSA J. Comput. 2 (1990) 4–32.
[5] M. Lichman, UCI Machine Learning Repository, 2013. http://archive.ics.uci.edu/ml.
[6] R. Meir, G. Rätsch, An Introduction to Boosting and Leveraging, Springer-Verlag New York, Inc., New York, NY, USA, 2003, pp. 118–183.
[7] R.E. Schapire, Theoretical views of boosting and applications, in: O. Watanabe, T. Yokomori (Eds.), Algorithmic Learning Theory: 10th International Conference, ALT'99, Tokyo, Japan, December 6–8, 1999, Proceedings, Springer Berlin Heidelberg, 1999, pp. 13–25.
[8] L.G. Valiant, A theory of the learnable, Commun. ACM 27 (1984) 1134–1142.
[9] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors), Ann. Statist. 28 (2000) 337–407.
[10] W. Jiang, Is regularization unnecessary for boosting?, in: AISTATS, 2001.
[11] Y. Xi, Z. Xiang, P. Ramadge, R. Schapire, Speed and sparsity of regularized boosting, in: D. van Dyk, M. Welling (Eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR, Clearwater Beach, Florida, USA, 2009, pp. 615–622.
[12] P. Bühlmann, T. Hothorn, Boosting algorithms: regularization, prediction and model fitting, Statist. Sci. 22 (2007) 477–505.
[13] C. Shen, H. Li, On the dual formulation of boosting algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 2216–2231.
[14] L. Mason, J. Baxter, P.L. Bartlett, M. Frean, et al., Functional gradient techniques for combining hypotheses, Adv. Neural Inf. Process. Syst. (1999) 221–246.
[15] M.K. Warmuth, J. Liao, G. Rätsch, Totally corrective boosting algorithms that maximize the margin, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, New York, NY, USA, 2006, pp. 1001–1008.
[16] Y. Lin, Z. Bian, X. Liu, Developing a dynamic neighborhood structure for an adaptive hybrid simulated annealing – tabu search algorithm to solve the symmetrical traveling salesman problem, Appl. Soft Comput. 49 (2016) 937–952.
[17] M. Abdel-Basset, G. Manogaran, D. El-Shahat, S. Mirjalili, Integrating the whale algorithm with tabu search for quadratic assignment problem: a new approach for locating hospital departments, Appl. Soft Comput. 73 (2018) 530–546.
[18] H. Dong, T. Li, R. Ding, J. Sun, A novel hybrid genetic algorithm with granular information for feature selection and optimization, Appl. Soft Comput. 65 (2018) 33–46.
[19] M.A. Tahir, J.E. Smith, Feature selection for heterogeneous ensembles of nearest-neighbour classifiers using hybrid tabu search, in: P. Siarry, Z. Michalewicz (Eds.), Advances in Metaheuristics for Hard Optimization, Springer, Berlin Heidelberg, 2008, pp. 69–85.
[20] D. Korycinski, M.M. Crawford, J.W. Barnes, Adaptive feature selection for hyperspectral data analysis, in: Proc. SPIE 5238, Image and Signal Processing for Remote Sensing IX, 2004.
[21] H. Zhang, G. Sun, Feature selection using tabu search method, Pattern Recognit. 35 (2002) 701–711.
[22] I.O. Oduntan, M. Toulouse, R. Baumgartner, C. Bowman, R. Somorjai, T.G. Crainic, A multilevel tabu search algorithm for the feature selection problem in biomedical data, Comput. Math. Appl. 55 (2008) 1019–1033.
[23] S. Rosset, J. Zhu, T. Hastie, Boosting as a regularized path to a maximum margin classifier, J. Mach. Learn. Res. 5 (2004) 941–973.
[24] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2004) 137–154.
[25] A. Fernandes, A. Utkin, J. Eiras-Dias, J. Silvestre, J. Cunha, P. Melo-Pinto, Assessment of grapevine variety discrimination using stem hyperspectral data and AdaBoost of random weight neural networks, Appl. Soft Comput. 72 (2018) 140–155.
[26] J. Derrac, S. García, D. Molina, F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm Evol. Comput. 1 (2011) 3–18.
[27] J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernández, F. Herrera, KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Comput. 13 (2008) 307–318.
[28] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Mult.-Valued Log. Soft Comput. 17 (2011) 255–287.
[29] P.J. Phillips, H. Moon, S.A. Rizvi, P.J. Rauss, The FERET evaluation methodology for face-recognition algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 1090–1104.
[30] M. Bereta, W. Pedrycz, M. Reformat, Local descriptors and similarity measures for frontal face recognition: a comparative analysis, J. Vis. Commun. Image Represent. 24 (2013) 1213–1231.
[31] M. Bereta, P. Karczmarek, W. Pedrycz, M. Reformat, Local descriptors in application to the aging problem in face recognition, Pattern Recognit. 46 (2013) 2634–2646.