Neural Networks 59 (2014) 23–35
Model-wise and point-wise random sample consensus for robust regression and outlier detection
Moumen T. El-Melegy ∗
Department of Electrical Engineering, Assiut University, Assiut 71516, Egypt
Article history: Received 13 February 2013; Received in revised form 20 May 2014; Accepted 23 June 2014; Available online 7 July 2014.
Keywords: Multi-layered feed-forward neural networks; Training algorithm; Robust statistics; Regression; Outliers
Abstract: Popular regression techniques often suffer in the presence of data outliers. Most previous efforts to solve this problem have focused on using an estimation algorithm that minimizes a robust M-estimator based error criterion instead of the usual non-robust mean squared error. However, the robustness gained from M-estimators is still low. This paper addresses robust regression and outlier detection in a random sample consensus (RANSAC) framework. It studies the classical RANSAC framework and highlights its model-wise nature for processing the data. Furthermore, it introduces for the first time a point-wise strategy of RANSAC. New estimation algorithms are developed following both the model-wise and point-wise RANSAC concepts. The proposed algorithms' theoretical robustness and breakdown points are investigated in a novel probabilistic setting. While the proposed concepts and algorithms are generic and general enough to adopt many regression machineries, the paper focuses on multilayered feed-forward neural networks in solving regression problems. The algorithms are evaluated on synthetic and real data, contaminated with high degrees of outliers, and compared to existing neural network training algorithms. Furthermore, to improve the time performance, parallel implementations of the two algorithms are developed and assessed to utilize the multiple CPU cores available on today's computers.
1. Introduction
Regression analysis seeks to find the relationship between one or more independent variables and a dependent variable. Quite often, in order to find this relationship, a mean squared error (MSE) criterion is minimized. This criterion is sufficient (indeed optimal) for data corrupted by noise that follows a Gaussian model. However, in many real applications, data are not only noisy but also contain outliers, i.e., data that are in gross disagreement with a postulated model. It has been noted that the occurrence of outliers in routine data ranges from 1% to 15% (Hampel, Ronchetti, Rousseeuw, & Stahel, 1986). In applications where data are collected using fully automated techniques, it is not surprising that an outlier rate as high as 50% may be encountered (Ahmed & Farag, 2002; Meer, Mintz, Rosenfeld, & Kim, 1991; Zhang, 1998). These inevitable outliers can severely distort a fitting process so that the estimated parameters become useless. As a matter of fact, just one bad outlier can skew the results of any approach based on mean squared estimates.
∗ Tel.: +20 88 2411021. E-mail address: [email protected]. http://dx.doi.org/10.1016/j.neunet.2014.06.010
Because of this skewing of the result, trying to detect outliers by thresholding the residual errors will not work in general. Throwing away one datum at a time and applying mean squares to the remaining subset often does not work when more than one outlier is present. As such, the estimation problem becomes harder as it is actually two problems: classification of the data into inliers (valid data) and outliers, and fitting the model to the data in an optimal manner. In such circumstances the deployment of robust estimation methods is essential. Robust methods continue to recover meaningful descriptions of a statistical population even when the data contain outlying elements belonging to a different population. They are also able to perform when other assumptions underlying the estimation, say the noise model, are not wholly satisfied. Several robust techniques have been proposed in the field of robust statistics (Hampel et al., 1986; Huber, 1981; Rousseeuw & Roy, 1987). These methods include M-estimates (maximum likelihood estimates), L-estimates (linear combinations of order statistics), R-estimates (estimates based on rank transformations) and LMedS estimates (least median of squares). The RANdom SAmple Consensus (RANSAC) framework of Fischler and Bolles (1981) has become the standard way of dealing with outliers in the computer vision and image processing community (Haralick, 1986; Meer et al., 1991; Zhang, 1998; Zhuang, Wang, & Zhang, 1992). It is able to cope with as much as 50% of
outlying data. RANSAC searches for suitable solutions directly using the data, repeatedly constructing solutions from randomly sampled minimal subsets and then testing each solution for support from the complete set of data. In RANSAC the support is the number of data items within a distance threshold. The solution from the subset with the most support is deemed the robust outcome.
The main goal of this paper is to develop novel concepts and estimation algorithms in the RANSAC framework for the objective of robust regression and outlier detection. The paper carefully studies the classical RANSAC framework and highlights its model-wise nature for processing the data. Furthermore, it introduces for the first time a point-wise strategy of RANSAC. New algorithms are developed following both the model-wise and point-wise RANSAC concepts for estimating regression models in the presence of outliers. While the proposed concepts and algorithms are generic and general enough to adopt many regression machineries, such as regression trees and support vector machines (SVMs), this paper focuses on multilayered feed-forward neural networks (MFNNs) (Haykin, 2008; Zurada, 1992) due to their popularity in solving regression problems in many diverse and practical applications. MFNNs have demonstrated a great ability to solve outlier-free regression problems and to accurately approximate unknown functions. However, they suffer in the presence of outliers. Several robust training algorithms have been proposed for feed-forward neural networks, e.g., Chen and Jain (1994); Chuang, Su, and Hsiao (2000); El-Melegy, Essai, and Ali (2009); Liano (1996); Pernia-Espinoza, Ordieres-Mere, de Pison, and Gonzalez-Marcos (2005) and Rusiecki (2010). Almost all these methods rely on M-estimators (Hampel et al., 1986; Huber, 1981; Rousseeuw & Roy, 1987) to reduce the effect of outliers on the training outcome.
Unfortunately these estimators do not possess a high enough breakdown point, which is the smallest proportion of outliers that may force the value of the estimate to be arbitrarily wrong (Kim, Kim, Meer, Mintz, & Rosenfeld, 1989; Rousseeuw & Roy, 1987). Reported experiments (Chen & Jain, 1994; Chuang et al., 2000; Liano, 1996; Pernia-Espinoza et al., 2005) demonstrated breakdown points of 10%–15% at most. In fact, several researchers tested the robustness of their methods on synthetic data by introducing only a few outliers into the experimental data. El-Melegy et al. (2009) developed a training algorithm based on the more robust least-median-of-squares (LMedS) estimator, which theoretically has the maximum insensitivity to outliers as it can tolerate up to 50% of outlying data (Huber, 1981; Rousseeuw & Roy, 1987). Nevertheless, unlike the M-estimators, there is no straightforward method (explicit formula) to minimize the LMedS estimator. Therefore the authors (El-Melegy et al., 2009) resorted to the stochastic simulated annealing (SA) algorithm to minimize this estimator. Approaches based on an adaptive learning rate (Rusiecki, 2006) and the Least Trimmed Squares (LTS) estimator (Rusiecki, 2007) were also proposed. The concept of initial data analysis by the Minimum Covariance Determinant (MCD) estimator was also investigated (Rusiecki, 2008). An annealing dynamical learning algorithm combined with particle swarm optimization was also proposed to train wavelet neural networks for identifying nonlinear systems with outliers (Ko, 2012). Some robust learning algorithms have also been applied to radial basis function networks (Chuang, Jeng, & Lin, 2004; Lee, Chung, Tsai, & Chang, 1999). In spite of its popularity in the computer vision and image processing communities, the RANSAC framework has been less known in the neural network community.
The first key contribution of the paper is the presentation of the RANSAC framework to the neural network research community for the objective of robust regression and outlier detection. We are aware of only one research effort (Nishida & Kurita, 2008) that used a RANSAC-like method to improve the performance of SVMs on large-scale datasets by training a number of small SVMs for randomly selected subsets of data, while tuning their parameters to fit SVMs to the whole training
set. However, the possibility of outlying data was not considered, and the concept of robustness was not addressed at all. The current paper adapts the model-wise RANSAC framework, for the first time, for the robust training of MFNNs. In order to make the adaptation of this framework for this goal more efficient, several novel aspects are introduced into the standard framework. A stage following the standard RANSAC procedure is employed to improve the outlier/inlier data classification, and to increase the accuracy of the estimated model representing the inlying data. Furthermore, to enhance the sampling process of the RANSAC algorithm, a selection mechanism is proposed based on bucketing techniques (Zhang, Deriche, Faugeras, & Luong, 1995). This mechanism aims to avoid wasting computational time over useless sampled subsets. An early draft of these ideas appeared in our recent paper (El-Melegy, 2011a), upon which we build here.
A second key contribution of this paper is the development of another new RANSAC algorithm for robust regression. This algorithm proceeds in a way similar to the first RANSAC algorithm, but follows a different strategy as it keeps track of the support for a data point to be valid, instead of computing the support for the estimated model. The consensus of random subsets is taken point-wise rather than model-wise. This new point-wise RANSAC strategy has not been used before in the entire RANSAC literature. Both proposed algorithms are extensively evaluated for nonlinear regression on synthetic and real data, contaminated with varying degrees of outliers under different scenarios, as well as real high-dimensional data publicly available from the standardized benchmark collection for neural networks PROBEN1 (Prechelt, 1994) and the UCI repository of standard machine learning datasets. Both algorithms successfully outperform well-known algorithms in the literature based on M- and LMedS-estimators.
It is noteworthy that the algorithms' robustness can be further improved through integrating an M-estimator based criterion within the algorithms. The third contribution of this paper is the theoretical investigation of the algorithms' robustness and breakdown points in a novel probabilistic setting. To the best of our knowledge, this has not been reported before in the literature for any RANSAC-based algorithm. Most previous researchers have settled for the underlying reasoning behind the first RANSAC algorithm (Fischler & Bolles, 1981) or for some experimental validations (Chum & Matas, 2008; Meer et al., 1991; Torr & Zisserman, 2000; Zhuang et al., 1992), but no explicit formulas have been given. The fourth contribution of the paper deals with the time performance of the proposed algorithms. Due to its repetitive sampling and estimation nature, any RANSAC algorithm typically takes a prolonged period of time until convergence. The paper presents parallel implementations of the two proposed RANSAC algorithms and the assessment of their time performance on a multi-core desktop computer.
The rest of this paper is organized as follows. Section 2 first defines the regression problem, and then develops a model-wise RANSAC algorithm for the robust training of MFNNs. A new point-wise RANSAC algorithm is proposed in Section 3. The robustness of the two algorithms is theoretically analyzed in Section 4. Our experimental results are described in Section 5. Parallel implementations of the RANSAC algorithms are evaluated in Section 6. We give our conclusions and future work in Section 7.
2. RANSAC training of neural networks
Let us start first by stating the regression problem under concern in this paper. Let S = {(x_i, y_i), i = 1, …, n} be a training set of input–output pairs, where x_i ∈ R^m are m-dimensional input data points, and y_i ∈ R are the corresponding outputs. For some pairs {(x_k, y_k) ∈ S | k = 1, …, n′}, n′ < n, the inputs and outputs are related by an unknown (nonlinear) function f such that y_k = f(x_k) + ε_k. The errors ε_k are typically assumed independent and
identically distributed Gaussian noise. These n′ pairs represent the inlying data, whereas the remaining pairs are outliers that do not satisfy the previous relation, and no assumptions are made about their distribution. The classification of any given data pair as an inlier or an outlier is not known in advance. The goal is to find an estimator f̂ of f such that some metric of approximation error is minimized, and to detect the outliers in the entire dataset S.
Multilayered feed-forward neural networks (MFNNs) have demonstrated a great ability to solve the outlier-free regression problem and to accurately approximate unknown functions (Haykin, 2008; Zurada, 1992). However, they suffer in the presence of outliers. In this section we review the standard RANSAC framework (Fischler & Bolles, 1981; Meer et al., 1991), on which we build in this paper. We also adopt this framework to propose an algorithm for the robust training of MFNNs.
RANSAC starts by forming a random subset of the given data, possibly contaminated with outliers. The subset consists of a minimal number of data points sufficient to determine the model or function to be approximated by the neural network. The size of this subset, s, thus depends on the model/function and is a user-defined parameter. The network is trained on this subset using the traditional backpropagation algorithm or any of its variants (Haykin, 2008; Zurada, 1992) minimizing the usual MSE criterion. The support for the solution found after training is measured by the number of all data points that are close, within a distance threshold, to the estimated model. These points form what is referred to as the ''consensus set''. This procedure is repeated a sufficient number of times and the model with the maximum support is declared the robust solution. The points within the threshold distance are the inliers and the rest of the points are the outliers. Fig. 1 illustrates the inlier and outlier points for an estimated model.
It is important to stress that RANSAC here follows a model-wise strategy in which each estimated model is weighed up by the size of its consensus set among the entire dataset. This is to be contrasted with the strategy proposed in the next section.
The above procedure typically constitutes the complete estimation process in the standard RANSAC framework (Fischler & Bolles, 1981; Meer et al., 1991). However, if an MFNN is trained now on the consensus set corresponding to the model found with the most support, a better estimate of the model of the inlying data can be obtained. Moreover, there may be additional points that would now be classified as inliers if the distance threshold were applied to this newly-found model. Therefore, our proposed algorithm proceeds further beyond the standard RANSAC framework by introducing a subsequent stage, in which two more steps are iterated until the number of found inliers converges. The first is to re-estimate the model from all the data points identified as inliers so far, and the second is to re-identify the inliers based on the last found model. This stage improves the outlier/inlier data classification, and increases the accuracy of the estimated model representing the inlying data. Relevantly, some RANSAC variants (Hartley & Zisserman, 2004) employed a similar stage to iteratively increase the set of found inliers.
Two questions arise: What is the distance threshold? And how many random subsets are required? Although the distance threshold, t, can be taken empirically, we adopt a statistical method (Hartley & Zisserman, 2004) to find a suitable value for t. We would like to choose t such that a point is an inlier with probability α.
If it is assumed that the error in inlying data is Gaussian with zero mean and standard deviation σ, then the squared distance of a point from the model is a sum of squared Gaussian variables and follows a χ²_m distribution with m degrees of freedom, where m equals the codimension of the model (or the number of independent function variables). The probability that the value of a χ²_m random variable is less than k² is given by the cumulative chi-squared distribution
Fig. 1. Example demonstrating the RANSAC framework: (a) A dataset consisting of Gaussian-noised inliers (shown in red) and many true outliers (shown in blue crosses). (b) The function (shown in continuous blue line) robustly estimated from the dataset and its consensus set consisting of all the estimated inliers (shown in red), which are the points within a threshold distance from the function. The remaining points are assumed outliers (blue crosses). The dotted lines indicate the threshold distance. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
F_m(k²) = ∫₀^{k²} χ²_m(ζ) dζ. From the cumulative distribution, t is taken such that

t² = F_m^{-1}(α) σ².   (1)
In our implementation, α is chosen as 0.95, so that there is a 95% probability that the point is an inlier. For example, this leads to t = 1.96σ for a one-dimensional (1-D) function. As for the number of random subsets N, one wants this number small enough to make computation feasible, yet big enough to ensure with a probability p that at least one of the random subsets is completely free from outliers. So if one has a worst case estimate¹ ϵ of the contamination percentage of outliers in the whole set of data, then N should be at least (Rousseeuw & Roy, 1987)

N = log(1 − p) / log[1 − (1 − ϵ)^s].   (2)
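Equations (1) and (2) are easy to evaluate directly. The sketch below is a minimal illustration (the helper names are our own, not the paper's); for the 1-D case the chi-square quantile in (1) reduces to the squared standard normal quantile, so only the standard library is needed:

```python
import math
from statistics import NormalDist

def distance_threshold(sigma, alpha=0.95):
    # Eq. (1) for a 1-D function (m = 1): the chi-square(1) quantile is the
    # square of a standard normal quantile, so t = z_{(1+alpha)/2} * sigma.
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    return z * sigma

def num_subsets(eps, s, p=0.99):
    # Eq. (2): smallest N ensuring, with probability p, at least one
    # outlier-free minimal subset when the outlier fraction is eps.
    return math.ceil(math.log(1 - p) / math.log(1 - (1 - eps) ** s))

print(round(distance_threshold(0.05), 3))  # → 0.098, i.e., 1.96 * sigma
print(num_subsets(0.40, 7))                # → 163, matching the worked example
```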
Obviously, the smaller the subset size s needed to instantiate a model, the fewer subsets are required for a given level of confidence p. For instance, if we assume a fraction ϵ = 40% of outliers in the data, s = 7 and require the probability p to be 0.99, then N should be at least 163. However, one last concern remains regarding the random selection of the subset points. The s random members of the subset should reflect the function throughout its entire space. For example, if some points in the very same subset were very close to each other, the solution generated from the network trained on this subset would be useless and thus a waste of time. In order to avoid this and achieve higher efficiency, in another departure from
1 If the fraction of outlying data is unknown, as is often the case, a worst case estimate of the level of contamination must be made in order to determine the number of subsets to be taken. In the absence of any clue, take ϵ = 50%.
Note also that Algorithm 1 uses an early termination condition for the first loop: if a subset is luckily found to have a consensus set larger in size than a pre-defined threshold T, the model estimated from the inliers corresponding to that subset is then further improved upon during the following repeat-until loop. In our experiments, we use a rather high value for the threshold T (e.g., T = 0.85n).
Algorithm 1 Model-wise RANSAC Training algorithm.
Objective: Robust fit of a model (function) to a dataset S of size n, which may contain outliers.
User-supplied parameters: MFNN topology, believed outliers fraction ϵ, subset size s, and estimated noise level in inliers σ.
Fig. 2. Illustration of the new algorithm keeping track of the data points’ supports during loop iterations: (a) Data points with outliers, along with the true function (shown in continuous red line) that best fits the inliers. Four sample points are selected (each marked with ‘‘X’’). (b) The support of each point versus the algorithm loop iterations. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
the standard RANSAC framework, we adopt a regularly random selection method based on bucketing techniques (Zhang et al., 1995) (see Section 5.1 for some detail).
Implementation note: This bucketing technique may be rather difficult to implement when the number of independent function variables (neural network inputs) is rather large. Therefore in this case, this technique is replaced by a pre-test performed on any formed subset to check if any two member points are too close. The subset is not used at all to estimate the model if this pre-test turns out to be true, and another subset is then tried.
Finally, an outline of the model-wise RANSAC training algorithm is shown in Algorithm 1. It is important to stress that in the first stage of the algorithm, where subsets of size s are formed randomly, the neural network should have a smaller number of hidden units (i.e., be of reduced size) as compared to the fully-sized network to be trained in the final stage (repeat-until loop) of the algorithm. This is because the training set size cannot be smaller than the number of free parameters in the network (total number of network weights and biases). The choice of a proper network topology/architecture and size is a typical concern in MFNN applications, which depends on the specific problem/data being analyzed and on the user's experience. However, there are some well-founded methods and guidelines addressing this issue in the literature (e.g., Haykin, 2008), to which the interested reader is referred for more details.
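The closeness pre-test mentioned above can be sketched as follows. This is a minimal illustration rather than the paper's code, and the minimum-spacing threshold `min_gap` is an assumed parameter:

```python
import math

def subset_is_degenerate(subset, min_gap):
    # Reject a sampled subset if any two input points lie closer than
    # min_gap; such subsets poorly constrain the model and waste training time.
    for i in range(len(subset)):
        for j in range(i + 1, len(subset)):
            if math.dist(subset[i], subset[j]) < min_gap:
                return True
    return False

print(subset_is_degenerate([(0.0,), (0.5,), (1.0,)], 0.1))   # → False
print(subset_is_degenerate([(0.0,), (0.01,), (1.0,)], 0.1))  # → True
```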
1. Compute the distance threshold t from (1).
2. Compute the required number of subsets N from (2).
3. for i = 1 to N do
   (a) Randomly select a subset of s data points from S using the bucketing technique.
   (b) Train a smaller MFNN on this subset by the standard backpropagation algorithm minimizing the usual MSE.
   (c) Determine the points from S that are within distance t from the resultant model. The set S_i is the consensus set for this model.
   (d) If the size of S_i (number of inliers) is greater than some threshold T, set S_m = S_i and go to Step 5.
   end for
4. Select the largest consensus set S_m from all S_i.
5. repeat
   (a) Train an MFNN using all points in S_m by the standard backpropagation algorithm minimizing the usual MSE criterion.
   (b) Determine the points from S that are within distance t from the resultant model. Let these points form the new consensus set S_m.
   until almost no change in the size of S_m
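The control flow of Algorithm 1 can be sketched compactly. The following is a minimal illustration under simplifying assumptions, not the paper's implementation: a straight line fitted by closed-form least squares stands in for training an MFNN, the data are noise-free, and the helper names (`fit_line`, `consensus`, `ransac`) are our own:

```python
import random

def fit_line(pts):
    # Closed-form least-squares fit of y = a*x + b; stands in for
    # training a small MFNN on the sampled subset.
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def consensus(data, model, t):
    a, b = model
    return [(x, y) for x, y in data if abs(y - (a * x + b)) < t]

def ransac(data, s=2, N=100, t=0.1):
    # Stage 1 (Steps 3-4): model-wise search for the largest consensus set.
    best = []
    for _ in range(N):
        cs = consensus(data, fit_line(random.sample(data, s)), t)
        if len(cs) > len(best):
            best = cs
    # Stage 2 (Step 5): iterate re-estimation / re-classification until stable.
    while True:
        model = fit_line(best)
        new = consensus(data, model, t)
        if len(new) == len(best):
            return model, new
        best = new

random.seed(1)
inliers = [(i / 20, 2 * (i / 20) + 1) for i in range(20)]  # y = 2x + 1
outliers = [(1.0 + i / 10, -3.0) for i in range(10)]       # gross outliers
(a, b), found = ransac(inliers + outliers)
print(round(a, 3), round(b, 3), len(found))  # → 2.0 1.0 20
```

Any all-inlier sample recovers the exact line here, so with N = 100 draws the largest consensus set contains precisely the 20 inliers and the refinement stage converges immediately.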
3. Point-wise RANSAC algorithm
The RANSAC methodology, as described in the previous section, is to repetitively estimate the model from a random group of data points of minimal size, and then weigh up this model by the size of its consensus set among the entire dataset. The model-wise RANSAC philosophy here is that the best, robust model should have the biggest consensus set.
In this section, we propose a new RANSAC algorithm. The underlying idea here is that an actual inlier should appear more frequently within the consensus sets of the models found from the randomly-formed subsets. As such, the new algorithm proceeds in a way similar to the previous RANSAC algorithm by repeatedly constructing the random minimal subsets, estimating the model, and then finding the corresponding consensus set. The algorithm counts how many times each data point is considered an inlier by the models obtained from the subsets. After repeating this a sufficient number of times, those points with large enough counts are taken as the actual inliers, and the rest of the points are assumed to be outliers. This way, the new algorithm keeps track of the support for a data point to be valid, instead of computing the support for the estimated model. In other words, the consensus of the random subsets is taken on the data points (point-wise), being good or bad, not on the estimated models (model-wise). The robust model is eventually estimated from those inliers.
To shed more light on this algorithm, Fig. 2(a) shows a set of data points containing several outliers, along with the one-dimensional
function that best fits the true inlying points. To illustrate the rationale behind the algorithm, four sample points are selected (each marked with an ''X'' on the figure). The algorithm keeps a record of each point's support, i.e., how many times the point is considered a member of the found consensus sets. The supports of the four points marked in Fig. 2(a) are graphed in Fig. 2(b) versus the algorithm iterations. As point D is a true outlier, it has a small support, whereas points A and C accumulate significantly more support as both indeed represent inlying data. Point B receives less support than A and C, but more than D, since D is farther away from the true function. As such, by the end of the loop iterations, the algorithm has an idea about which points are more likely to truly represent the underlying model and are thus worthy of further consideration to estimate the model.
The new algorithm, dubbed Point-wise RANSAC (PRANSAC), is outlined in Algorithm 2. The algorithm has almost the same skeleton as Algorithm 1. The main difference is how the inlier set S_m is constructed (Step 5) before the final repeat-until stage. The threshold H determines the minimum number of occurrences of a point in the found consensus sets to be considered for this stage. This stage, as before, iterates model re-estimation from the inliers found so far, and inlier re-identification based on the last found model, until the number of found inliers converges. Note also that the early termination condition is omitted as one is no longer looking for the largest consensus set.
One question remains: How many random subsets (algorithm trials) are required for this point-wise RANSAC strategy? One wants this number, N′, small enough to make computation feasible, yet big enough to ensure with a probability p′ that a more-likely inlier point will accumulate sufficient support to move on to the final repeat-until stage.
We can derive a proper value for N ′ based on some statistical analysis as follows. In each trial of the algorithm, a model is estimated from a random data subset, and then each point is examined to be a member of the consensus set based on the distance threshold t. One can model the residual error e of a point with respect to a model (difference between network actual output and target value) as a mixture of a Gaussian distribution for the model’s inlying points and a uniform distribution for the outliers. Since no prior information is typically available about the outliers distribution, a uniform distribution for the outliers’ residuals is often assumed (Chum & Matas, 2008; Hartley & Zisserman, 2004; Torr & Zisserman, 2000). That is,
p(e) = (1 − ϵ)/(√(2π) σ) · exp(−e²/(2σ²)) + ϵ/V,   (3)
where σ is the noise level as described in the previous section. Here it is assumed that the outliers distribution is uniform over [−V/2, +V/2], where V represents the largest error possibly induced by the presence of outliers (an estimate of such quantity can easily be obtained from the ranges of the input data or the domain across which a model is estimated). Consequently, the probability that a point is considered to belong to the model's consensus set is β = ∫_{−t}^{t} p(e) de. Note β thus accounts for the possibility of some true inliers and some true outliers presumed in the consensus set. In each independent trial out of a total of N′, the algorithm checks each point as an inlier or not. As such, a point's support (the count of how many times the point is found in the consensus sets) can be described as a binomial distribution Bin(N′, β). Thus the probability p′ that a more-likely inlier point will accumulate a support no smaller than the threshold H is given by

p′ = Σ_{k=H}^{N′} (N′ choose k) β^k (1 − β)^{N′−k}.

Accordingly, for a given level of confidence p′, by requiring H to be a certain fraction of N′, a lower bound on N′ (the sole remaining unknown) can be estimated numerically from the Binomial
cumulative distribution function.² For example, in the 1-D function of Section 5.1, for ϵ = 40%, σ = 0.05, V = 1, and t as computed from (1), if we require the probability p′ to be 0.99 and H = 0.56N′, then β = 0.648 and N′ should be at least 169.
It is also important to note here that, as explained above, a point's support can be described as a Binomial random variable. The PRANSAC algorithm repeatedly estimates a model from a randomly-constructed data subset and updates the data points' supports, which can be used to estimate the probability parameter of the Binomial random variable for each data point corresponding to whether that point is an inlier or not. As soon as the variance on those Binomial parameters drops such that the most-probable inliers can be determined with high confidence, the algorithm's time-expensive loop (Step 4 in Algorithm 2) can be terminated. This can be used as a more reliable early termination criterion of the iterative loop, which highlights another advantage of PRANSAC over the RANSAC algorithm (Algorithm 1).
Algorithm 2 Point-wise RANSAC Training algorithm.
Objective: Robust fit of a model (function) to a data set S of size n, which may contain outliers.
User-supplied parameters: MFNN topology, believed outliers fraction ϵ, subset size s, and estimated noise level in inliers σ.
1. Compute the distance threshold t from (1).
2. Compute the required number of subsets N′ as described in the text.
3. Initialize count_j = 0, 1 ≤ j ≤ n.
4. for i = 1 to N′ do
   (a) Randomly select a subset of s data points from S using the bucketing technique.
   (b) Train a smaller MFNN on this subset by the standard backpropagation algorithm minimizing the usual MSE.
   (c) Determine the points from S that are within distance t from the resultant model. The set S_i is the consensus set for this model.
   (d) Increment the count for each point in S_i.
   end for
5. Form the set S_m from all data points having counts greater than or equal to a pre-defined threshold H.
6. repeat
   (a) Train an MFNN using all points in S_m by the standard backpropagation algorithm minimizing the usual MSE criterion.
   (b) Determine the points from S that are within distance t from the resultant model. Let these points form the new consensus set S_m.
   until almost no change in the size of S_m
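The derivation of β and the lower bound on N′ can be evaluated numerically. The sketch below is our own illustration (the helper names `beta_inlier_prob` and `min_trials` are not from the paper); because of the ceiling applied to H = 0.56N′, the exact bound it finds may differ slightly from the paper's 169:

```python
import math
from statistics import NormalDist

def beta_inlier_prob(eps, sigma, V, t):
    # Probability that a point falls in a model's consensus set under the
    # mixture of Eq. (3): Gaussian inliers plus uniform outliers on [-V/2, V/2].
    gauss_mass = 2 * NormalDist(0.0, sigma).cdf(t) - 1
    return (1 - eps) * gauss_mass + eps * (2 * t / V)

def min_trials(beta, frac=0.56, p_target=0.99, n_max=1000):
    # Smallest N' such that a likely inlier accumulates at least H = frac*N'
    # hits with probability p_target, using the exact binomial upper tail.
    for n in range(1, n_max + 1):
        h = math.ceil(frac * n)
        tail = sum(math.comb(n, k) * beta**k * (1 - beta)**(n - k)
                   for k in range(h, n + 1))
        if tail >= p_target:
            return n
    return None

t = 1.96 * 0.05                                 # from Eq. (1) with sigma = 0.05
beta = beta_inlier_prob(0.40, 0.05, 1.0, t)
print(round(beta, 3))   # → 0.648, matching the worked example
print(min_trials(beta)) # lower bound on N' (the paper reports at least 169)
```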
4. Robustness analysis Several tools have been developed in the field of statistics to assess the robustness of estimators (Hampel et al., 1986; Huber, 1981; Rousseeuw & Roy, 1987). Here we analyze the proposed algorithms in terms of the breakdown point (Huber, 1981; Rousseeuw & Roy, 1987) as a global measure of robustness. It indicates the maximum number of outliers an estimation algorithm can tolerate before complete failure. The proposed algorithms (Algorithm 1 and Algorithm 2) are formulated in the RANSAC framework, for which several experimental studies (e.g., Chum & Matas,
2 The well-known Chernoff bound (Mitzenmacher & Upfal, 2005) can be used here for the same purpose, but it yields a loose bound resulting in unnecessarily large values for N′.
Fig. 3. Proposed algorithms' theoretical robustness: (a) RANSAC: breakdown probability P_bd of Algorithm 1 versus outliers percentage ϵ for several values of N and s. (b) PRANSAC: breakdown probability P′_bd of Algorithm 2 versus outliers percentage ϵ for several values of N′ and H.
2008; Meer et al., 1991; Torr & Zisserman, 2000 and Zhuang et al., 1992) have reported a breakdown point of 50%, or even more (Hartley & Zisserman, 2004; Zhang, 1998), but no explicit formulas have been given. In this section, we provide further insight into this; more specifically, we derive theoretical formulas for the breakdown points of the two proposed algorithms. To the best of our knowledge, this has not been reported before in the literature for any RANSAC-based algorithm.

The first algorithm (Algorithm 1) repetitively estimates a model from a random group of data points of minimal size, then weighs this model by its support over the entire dataset. The robustness of the algorithm is contingent upon the existence of at least one subset of data that is free from outliers. As described in Section 2, the choice of the required number of subsets is formulated to increase this possibility. However, if this does not happen, the algorithm may break down. As such, the breakdown robustness of the algorithm is better characterized by the breakdown probability P_bd, which indicates when the algorithm is likely to break down. P_bd is simply the probability of not having at least one subset consisting entirely of ‘‘true’’ inliers in a sequence of N subsets when the proportion of outliers is ϵ. That is,

P_bd(ϵ; s, N) = [1 − (1 − ϵ)^s]^N.   (4)
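Equation (4) is easy to evaluate numerically. A minimal check, using s = 7 and ϵ = 0.5:

```python
def breakdown_probability(eps, s, N):
    """P_bd from (4): probability that none of N random s-subsets
    is free of outliers when the outlier fraction is eps."""
    return (1.0 - (1.0 - eps) ** s) ** N

p_163 = breakdown_probability(0.5, 7, 163)   # about 0.28
p_500 = breakdown_probability(0.5, 7, 500)   # drops to about 0.02
```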
Explicitly, P_bd gives the probability of the algorithm’s breakdown for a given ϵ as a function of the number of subsets N and the minimal subset size s. Fig. 3(a) plots P_bd versus ϵ for several values of N and s. For example, at s = 7 and N = 163, the probability of the algorithm’s breakdown is 0.28 at 50% contamination. P_bd tends to increase as the subset becomes bigger. Moreover, as N increases, it becomes more likely to form a clean subset, thus decreasing the breakdown probability (P_bd becomes as low as 0.02 for N = 500 in the previous example), but at the cost of increased computation and processing time. In principle, the proposed algorithm can thus be designed to provide a desired breakdown probability by selecting the required number of subsets (along the line of what (2) offers). Interestingly, in view of (4) and Fig. 3(a), the algorithm, at least in theory, can cope with outlier percentages higher than 50%. This justifies experimental observations made in some earlier reports (Hartley & Zisserman, 2004; Zhang, 1998) on RANSAC-based methods. However, in practice this could come at the price of an unreasonably large N leading to a huge amount of computation. Besides, the ability of the minority inliers in this case to faithfully represent the true model becomes questionable. The PRANSAC algorithm (Algorithm 2) is also formulated in the RANSAC framework, keeping track of the support for each data point, where the support is dictated by how many times a point is considered a member of the found consensus sets. Theoretically, the algorithm is more likely to break down if a true
outlier accumulates enough support to survive the threshold H and thus moves on to the final repeat-until stage of the algorithm. As such, the algorithm’s breakdown probability P′_bd becomes the probability of a true outlier having a support no smaller than H in a sequence of N′ trials. That is,

P′_bd(ϵ; H, N′) = \sum_{k=H}^{N′} \binom{N′}{k} γ^k (1 − γ)^{N′−k},   (5)

where

γ = \int_{−t}^{t} (ϵ/V) de = (2t/V) ϵ.   (6)
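The tail sum in (5) can be evaluated exactly with binomial coefficients. In the sketch below, t is taken as 1.96σ, an assumed form of the threshold rule (1), so the absolute numbers are illustrative; the qualitative behavior, a sharp drop in P′_bd as H grows, holds regardless:

```python
from math import comb

def pransac_breakdown(gamma, H, N):
    """P'_bd from (5): probability a true outlier collects at least H
    votes out of N Bernoulli(gamma) trials."""
    return sum(comb(N, k) * gamma**k * (1.0 - gamma)**(N - k)
               for k in range(H, N + 1))

sigma, V, eps, N = 0.05, 1.0, 0.6, 169
t = 1.96 * sigma               # assumed form of the threshold rule (1)
gamma = 2 * t * eps / V        # from (6)
p_H16 = pransac_breakdown(gamma, 16, N)
p_H29 = pransac_breakdown(gamma, 29, N)   # H made about 1.8 times bigger
```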
γ gives the probability of a true outlier being found in a consensus set and depends on the distance threshold t and the constant V, both of which are data-specific parameters. The control parameters that affect the breakdown probability P′_bd are the number of trials N′ and the threshold H. Fig. 3(b) plots P′_bd versus ϵ for several values of N′ and H at σ = 0.05, V = 1, and t as computed from (1). Generally speaking, as N′ is increased, all points tend to accumulate more support, leading to a higher breakdown probability P′_bd, unless H is also increased to filter out the outliers. The bigger H is made, the smaller the chance an outlier will survive, and thus the more robust the algorithm. For example, for N′ = 169 in Fig. 3(b), P′_bd is about 0.79 for H = 16 at 60% contamination, while it drops to 0.02 if H is made just 1.8 times as big. As such, by selecting a high threshold H, the algorithm can be made more robust, outperforming Algorithm 1 without the need for overwhelmingly more computational trials. However, if H is made too high, fewer inlier points may survive the threshold as well, thus possibly offering not enough information to obtain a meaningful model. It is interesting to note again that the algorithm can even theoretically cope with outlier percentages higher than 50%. However, in this latter case, the question about the ability of the minority inliers to represent the true model becomes legitimate. In practice, we use a value for N′ as estimated from our analysis in Section 3. We then take the threshold H as the median of the supports of all points, i.e., H = median_j count_j in Step 5 of Algorithm 2. This sets the size of the inliers set S_m to half the entire dataset before the beginning of the repeat-until stage. This way, a proper value for H is obtained over the practical range of 0%–50% outliers contamination.

5. Experimental results

In this section, the performance of the two RANSAC algorithms, implemented in Matlab® 7.8 (The Mathworks Inc., 2009), is demonstrated for nonlinear regression using several datasets. We first evaluate them on simulated one-dimensional (1-D) data, because the low dimensionality allows the easy visualization and
inspection of the resulting approximated models. Besides, the accuracy of these models is readily obtainable from the known ground-truth models. Furthermore, the same data have been used before by competing algorithms in the literature (e.g., Chuang et al., 2000; El-Melegy et al., 2009; Liano, 1996; Pernia-Espinoza et al., 2005 and Rusiecki, 2007, 2010), which eases the qualitative and quantitative comparison with them. We then proceed to assess the proposed algorithms on more challenging, real, high-dimensional, publicly available datasets, which serves to demonstrate the practical application of the two algorithms. The first one is a 14-D set from the standardized benchmark collection for neural networks PROBEN1 (Prechelt, 1994). This set does not have outlying data; therefore, outliers are artificially added in varying degrees. This helps examine the algorithms’ robustness under various outlier contamination levels. Afterwards, the two algorithms are evaluated on four high-dimensional medical datasets from the UCI repository of standard machine learning datasets, which are known to have several inevitable outliers. For all the real data, cross-validation using a validation set is used to select proper values for the MFNN topology (number of hidden units), subset size s, and estimated noise level σ in the inliers. We compare the performance of the two proposed algorithms with that of an M-estimator based algorithm (Liano, 1996; Rusiecki, 2010) that employs a neural network trained by the popular backpropagation algorithm minimizing the Least Mean Log Squares (LMLS) M-estimator. That algorithm was first introduced by Liano (1996), then recently improved upon by Rusiecki (2010). The so-called Tao-robust training algorithm (Pernia-Espinoza et al., 2005) (also based on M-estimators) was also tested, but we found its performance no better than that of the LMLS training algorithm. Therefore its results are not included in the results reported below.
In addition, we compare the performance of these robust neural networks with a neural network trained by the backpropagation algorithm minimizing the usual MSE criterion. Our implementation of the backpropagation algorithm needed for the various training tasks in all algorithms uses the function ‘‘traincgf’’ available from the Matlab neural networks toolbox (Demuth & Beale, 2000). This function is based on conjugate gradient backpropagation with Fletcher–Reeves updates. We also make a comparison on the 1-D simulated data with the SA-LMedS algorithm (El-Melegy et al., 2009), which minimizes the LMedS estimator using the simulated annealing algorithm. It is important to note that the performance of this algorithm depends on a number of user-defined parameters, such as start temperature, stop temperature, number of temperatures, and temperature reduction factor. Therefore, seeking fairness, we carefully tuned these parameters in order to give the algorithm a full opportunity to deliver its best performance.

All algorithms under comparison are analyzed under several scenarios and levels of outlying data. As is customary in the data analysis literature, we use the Root Mean Squared Error (RMSE) to measure the quality of function approximation from the various datasets. In each case, we report this RMSE score of the trained neural network on an outlier-free test dataset that is different from the data used for training. As such, this RMSE is an important characteristic that reflects the network’s generalization ability.

5.1. 1-D regression

The algorithms are tested to approximate the one-dimensional function:

y = |x|^{2/3}.   (7)
Fig. 4. Data points used for Case I of 1-D regression at ξ = 40%, along with the true function shown in continuous line.
This test function was first used in Liano (1996) and later in Chuang et al. (2000); El-Melegy et al. (2009); Pernia-Espinoza et al. (2005) and Rusiecki (2007, 2010). The experiment data consist of n = 501 points generated by sampling the independent variable in the range [−2, 2], and using (7) to calculate the dependent variable. The data points are then corrupted in both the x- and y-coordinates with Gaussian noise of zero mean and 0.1 standard deviation. Two scenarios are employed to introduce outliers into this dataset.

Case I. A variable percentage, ξ, of the data points is selected at random and substituted with background noise, uniformly distributed on the range [−2, 2]. Fig. 4 illustrates the data points used in our experiments at ξ = 40%, along with the true function shown as a continuous line. To compare the performances of the five training algorithms, we use the root mean square error (RMSE) of each resultant model with respect to the true model in (7), computed as:
RMSE = \sqrt{ (1/M) \sum_{i=1}^{M} (\hat{y}_i − y_i)^2 },   (8)
where ŷ_i is the actual neural network output, y_i is the true output, both when the input is x_i, and M is the number of test points. This test dataset, {(x_i, y_i), i = 1, . . . , M}, is designed to be different from the data used for training. The neural network architecture considered in all the algorithms is a MFNN with one hidden layer having ten neurons, which is the same network topology used by previous researchers (Chuang et al., 2000; El-Melegy et al., 2009; Liano, 1996; Pernia-Espinoza et al., 2005; Rusiecki, 2007, 2010) on this dataset. All networks are trained for 400 epochs. For the two RANSAC algorithms, we used s = 7, p = p′ = 0.99, ϵ = 0.4 and σ = 0.05 (i.e., different from the true values) without any further tuning regardless of the value of ξ. The outliers percentage ξ is changed from 0% to 50% with a step of 5. At each outliers percentage, the network is trained with the five algorithms and the RMSE error is calculated from the resultant network model. This procedure is repeated 50 times. For this experiment, the bucketing technique (Zhang et al., 1995) is implemented as follows: the space over which the function is to be approximated, [−2, 2], is evenly divided into b buckets (we use here b = 5). Attached to each bucket is the set of data points that fall in it. Buckets with no points attached are excluded. To generate a subset of s points, we randomly select s buckets, and then randomly choose a point from each
Fig. 5. A sample run of the first RANSAC algorithm at ξ = 40%: (a) The supports for the various models estimated from the random subsets before the repeat-until loop (a red ‘‘star’’ marks the model with the largest consensus set). (b) The model found with the largest support (will be further improved upon during the repeat-until loop). The inliers found so far are shown in red, and the remaining points are believed to be outliers (shown as blue crosses). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
selected bucket. As the number of points in one bucket may be quite different from that in another, a point belonging to a bucket having fewer points would, under uniform bucket selection, have a higher probability of being selected. It is thus preferred that a bucket having many points has a higher probability to be selected than a bucket having few points, in order that each point has almost the same probability to be selected. To realize this, we divide the interval [0, 1] into b intervals such that the width of the ith interval is equal to n_i / Σ_i n_i, where n_i is the number of points attached to the ith bucket. During the bucket selection procedure, a number produced by a [0, 1] uniform random generator falling in the ith interval implies that the ith bucket is selected. Figs. 5 through 7 illustrate a sample run of each of the two proposed RANSAC algorithms at ξ = 40%. The supports for the various models estimated from the random subsets for the first RANSAC algorithm (Algorithm 1) before the repeat-until loop are shown in Fig. 5(a). A ‘‘star’’ marks the model with the biggest support (largest consensus set). This winning model is graphed in Fig. 5(b). This model is further improved upon during the repeat-until loop, and the final model is shown in Fig. 8. Similarly, the supports for all the data points as found by the PRANSAC algorithm before the beginning of the repeat-until loop are shown in Fig. 6(a). All points with support no smaller than the shown H threshold
Fig. 6. A sample run of the PRANSAC algorithm at ξ = 40%: (a) The supports for all the data points before the repeat-until loop (a red horizontal line indicates the threshold H). (b) The model found from all points with support larger than the threshold indicated in (a) (will be further improved upon during the repeat-until loop). The inliers found so far are shown in red, and the remaining points are believed to be outliers (shown as blue crosses). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
are used to estimate the model graphed in Fig. 6(b). The repeat-until loop refines this solution iteratively, as illustrated in Fig. 7. In each iteration of this loop, the model is re-estimated, then the inliers are re-identified based on the last found model. As this happens, the support of the estimated model (the size of the consensus set S_m) increases, whereas the root mean square of the training error over the currently found inlying data decreases. This is repeated until the number of found inliers converges, typically after a few iterations, as shown in Fig. 7. The final estimated model was shown in Fig. 1 (and is shown again in Fig. 8). All these figures clearly demonstrate the additional benefit gained from the last stage (the repeat-until loop) of the proposed algorithms. To gain more insight into the performance of the new algorithms in comparison with the other methods, the models (the approximated functions) obtained by the two RANSAC algorithms and the MSE and LMLS algorithms, along with the true model, are graphed in Fig. 8 at ξ = 40%. For a better, quantitative assessment of the performance, the average RMSE score from (8) over the 50 runs for each algorithm is plotted versus ξ in Fig. 9. The vertical bars indicate the standard deviations of the obtained RMSE scores of the algorithms. It is clear from both figures that the network based on MSE has rather a good performance when the outliers percentage
Fig. 9. Comparison of the RMSE scores for the five training algorithms versus ξ for Case I of 1-D regression. Vertical bars indicate the error standard deviations.

Fig. 7. The support for the estimated model (in continuous line) and the root mean square of the training error over the inlying data (in dashed line) versus iterations during the repeat-until loop of the PRANSAC algorithm.
Fig. 10. Data points used for Case II of 1-D regression at ξ = 40% along with true function shown in continuous line.
Fig. 8. Functions approximated by the algorithms: MSE, LMLS, RANSAC and PRANSAC for Case I at ξ = 40% along with the true function.
is very low. This is quite expected, as the MSE criterion is optimal in the case of Gaussian noise with no outliers. However, its estimated model starts to deviate considerably and increasingly from the true model as ξ is increased. Better robustness can be noted in the network trained based on the LMLS estimator. We can clearly see the better performance of the SA-LMedS algorithm and the two RANSAC algorithms as ξ exceeds about 15%–20%, which continues even at higher outlier contamination rates. The SA-LMedS and RANSAC (Algorithm 1) algorithms have close performance, with a small advantage for the latter. However, the PRANSAC algorithm has indeed shown the smallest sensitivity to outlying data, and accordingly the most robustness, among the five algorithms. Even at ξ = 50%, the RMSE remains below 0.1. It is also important to note here that the last stage of the algorithm (the repeat-until loop) has contributed positively to the lower RMSE scores of the two proposed RANSAC algorithms. For example, it resulted in a decrease from 0.1295 to 0.1266 in the reported average RMSE of the RANSAC algorithm at ξ = 40%. Similarly, it reduced the average RMSE of the PRANSAC algorithm at ξ = 40% from 0.0898 to 0.0864.

Case II. In the second case, another series of experiments is carried out with the outliers introduced into the data of Fig. 4
using a different methodology. A variable percentage ξ of the data points is selected at random and substituted with severe outlying data generated from Gaussian distributions with different means and variances. These outliers are synthesized from the Gaussian mixture (Pernia-Espinoza et al., 2005) N (15, 2) + N (−20, 3) + N (30, 1.5) + N (−25, 3). Fig. 10 illustrates the data points used in our experiments at ξ = 40%, along with the true function shown in continuous line. A zoomed view on the true function in the range [0, 2] on the vertical axis would reveal the same function plotted in Fig. 4 in continuous line. The outliers percentage ξ is changed from 0% to 50%. At each outliers percentage, the network is trained with each of the five training algorithms under comparison, and the RMSE is calculated from the resultant network model. This procedure is repeated 50 times and the average obtained RMSE scores are plotted versus ξ in Fig. 11. The models approximated by the two RANSAC algorithms and the MSE and LMLS algorithms at ξ = 40% are graphed in Fig. 12. Both figures demonstrate trends rather similar to those in Case I. The five algorithms start with very close performance when the outliers percentage is rather low. As ξ is increased, the MSE algorithm suffers and exhibits the biggest error. The LMLS algorithm shows better performance in this case than in Case I. However, SA-LMedS and the two RANSAC algorithms offer better accuracy with increasing ξ . The proposed PRANSAC algorithm still yields the best overall robustness among all algorithms. At
Fig. 11. Comparison of RMSE for the five training algorithms versus ξ for Case II. Vertical bars indicate the error standard deviations.
Fig. 12. Functions approximated by the algorithms: MSE, LMLS, RANSAC and PRANSAC for Case II at ξ = 40%.
ξ = 50%, the algorithm’s RMSE is only 34% of that of the LMLS algorithm, and about 66% of the RANSAC algorithm’s RMSE.

5.2. Real high-dimensional dataset with artificial outliers

The proposed algorithms are also evaluated on a real, publicly available dataset from the standardized benchmark collection for neural networks PROBEN1 (Prechelt, 1994). PROBEN1 has also been considered in many articles, for example, Igel and Hüsken (2003); Prechelt (1994, 1998) and Tomandl and Schober (2001). The used dataset is the set ‘building’ with 4208 examples, where the value of hourly electricity consumption is regarded as a function of 14 input variables, such as the date, time of day, outside temperature, outside air humidity, solar radiation, and wind speed. Each variable is bounded in the range [−1, 1]. The available dataset examples are already divided into 3 sets: training of size 2104, validation of size 1052, and test of size 1052. The neural network architecture considered in this experiment is a MFNN with 14 input neurons and one hidden layer having 6 neurons. This network topology is selected after several cross-validation experiments using the given training and validation sets. For our purpose here, we use the training set as input to the algorithms, after introducing outliers by randomly replacing a varying percentage ξ of the examples with a large value (e.g., 10) in a randomly-selected one of the 14 variables. The already-prepared PROBEN1 validation and test sets are combined as one large test set (of size 2104) to test the generalization of the algorithms after being trained on the data contaminated with outliers.
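The outlier-injection scheme just described can be sketched as follows; the random matrix merely stands in for the actual ‘building’ training inputs, whose values are bounded in [−1, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2104, 14))   # stand-in for the training inputs
xi = 0.30                                 # outlier fraction

n_out = int(xi * X.shape[0])
rows = rng.choice(X.shape[0], size=n_out, replace=False)
cols = rng.integers(0, X.shape[1], size=n_out)   # one random variable per row
X_corrupt = X.copy()
X_corrupt[rows, cols] = 10.0              # large out-of-range value
```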
Fig. 13. Comparison of RMSE for the four training algorithms: MSE, LMLS, RANSAC and PRANSAC, for 14-D function approximation versus ξ . Vertical bars indicate the error standard deviations.
For the two RANSAC algorithms, we use s = 17, p = p′ = 0.96, ϵ = 0.3 and σ = 0.01, without further tuning regardless of the value of ξ. Since the 14-D input space is large, the bucketing technique is not employed here; see our implementation note in Section 2. Instead, a randomly-formed subset is rejected if the Euclidean distance between any two member data points is less than 0.1, and another subset is then tried. The outlier percentage ξ is varied within the range 0%–50%. The results plotted in Fig. 13 are the RMSE scores on the combined validation and test sets, defined analogously to (8), averaged over 30 runs for each of the four training algorithms: MSE, LMLS, RANSAC and PRANSAC, at various values of ξ. We had difficulty tuning the design parameters of the SA-LMedS algorithm (e.g., start/stop temperatures, number of temperatures, and temperature reduction factor) for its best performance on this dataset. Consequently, it was not included in this experiment. From Fig. 13, one can clearly observe that all algorithms start with very close performance on outlier-free data. However, the MSE algorithm breaks down severely right away as ξ increases. The LMLS algorithm shows a slightly better performance. While the two RANSAC algorithms demonstrate significantly better performance, the PRANSAC algorithm, in particular, yields a very notable performance on this high-dimensional dataset. At ξ = 50%, its RMSE is as low as 56% of that of MSE, and only 68% compared to LMLS. It offers an RMSE of about 80% of that of the RANSAC algorithm.

5.3. Medical high-dimensional datasets

The proposed algorithms are also extensively evaluated on real high-dimensional datasets. These datasets are publicly available from the UCI repository of standard machine learning datasets (http://archive.ics.uci.edu/ml/). Table 1 presents the four datasets considered in the present work along with their specifications.
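The distance-based subset rejection described in Section 5.2 can be sketched as follows (the function name, retry cap, and the random stand-in data are illustrative choices of ours):

```python
import numpy as np

def well_spread_subset(X, s, d_min=0.1, max_tries=1000, rng=None):
    """Redraw a random s-subset until no two member points are closer
    than d_min in Euclidean distance."""
    rng = rng or np.random.default_rng()
    for _ in range(max_tries):
        idx = rng.choice(X.shape[0], size=s, replace=False)
        P = X[idx]
        # Pairwise distances between the s selected points.
        d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
        if d[np.triu_indices(s, k=1)].min() >= d_min:
            return idx
    raise RuntimeError("no sufficiently spread subset found")

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 14))   # stand-in for 14-D input data
idx = well_spread_subset(X, s=17, rng=rng)
```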
All these datasets are in the medical and public health domains, representing patient records. The data have several inevitable outliers due to several reasons, such as abnormal patient conditions, instrumentation errors, or recording errors. Thus outlier detection is a very critical problem in this domain and requires a high degree of accuracy. A recent study (Srimani & Koti, 2012) has shown that these particular datasets do indeed contain about 20%–40% outlier contamination (also suggested by the results of Ciampi & Zhang, 2002). The proposed algorithms are used to find the functional relationship of the status (predicted) attribute of each dataset as a function of the remaining attributes. For the sake of comparison, the MSE and LMLS algorithms are used for the same purpose. The performance of each algorithm is evaluated using 10-fold cross-validation on each dataset. Due to the possibility
Table 1
Specifications of medical high-dimensional datasets.

Dataset                 #instances   #attributes
Bupa liver disorders    345          7
Cleveland heart         303          14
Thyroid                 215          6
Haberman                306          4
Table 4
Average runtime (in seconds) of the parallel versions of the RANSAC and PRANSAC algorithms for 1-D regression versus number of CPU cores.

#Cores     1       2       3       4
RANSAC     400.9   249.8   167.7   135.5
PRANSAC    403.5   250.7   169.4   136.0
Table 2
Accuracy performance on medical datasets.

Dataset      MSE     LMLS    RANSAC   PRANSAC   Net top.
Bupa liver   0.851   0.159   0.084    0.071     6-6-1
Heart        0.236   0.248   0.049    0.027     13-4-1
Thyroid      0.043   0.032   0.011    0.008     5-2-1
Haberman     0.709   0.207   0.021    0.011     3-1-1
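The accuracy measure reported in Table 2 (the root mean squared residual over validation points whose residuals are no greater than 2.5σ̂, with σ̂ a robust, median-absolute-deviation-based scale estimate, as described in Section 5.3) can be sketched as follows. The 1.4826 consistency factor is the standard MAD-to-σ constant for Gaussian data, an assumption on our part, and the synthetic residuals are illustrative:

```python
import numpy as np

def robust_rmse(residuals):
    """RMSE over residuals no greater than 2.5*sigma_hat, where sigma_hat
    is the MAD-based robust scale estimate of the residuals."""
    r = np.asarray(residuals, dtype=float)
    sigma_hat = 1.4826 * np.median(np.abs(r - np.median(r)))
    kept = r[np.abs(r) <= 2.5 * sigma_hat]
    return np.sqrt(np.mean(kept ** 2))

rng = np.random.default_rng(0)
r = rng.normal(0, 0.1, 300)          # well-fit validation residuals
r[:60] += rng.uniform(3, 5, 60)      # 20% outlying residuals
score = robust_rmse(r)               # close to the inlier noise level
```

Because the scale estimate ignores the outlying residuals, the reported accuracy reflects the fit on the clean majority rather than being inflated by the outliers.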
Table 3
Average runtime (in seconds) of each of the five training algorithms for 1-D regression.

MSE     LMLS    RANSAC   PRANSAC   SA-LMedS
19.8    30.0    400.9    403.5     481.4
of outlying data in the validation partition, the accuracy of an obtained model is assessed using the root mean squared residual error for data whose residuals are no greater than 2.5σ̂, where σ̂ is a robust estimate of the standard deviation (Hampel et al., 1986; Rousseeuw & Roy, 1987) of the residuals over all validation data based on the median absolute deviation. This procedure is repeated for a total of 10 trials, each time using a different partition for validation. The obtained accuracies, averaged over all the trials, are shown in Table 2 for all algorithms. The neural network architecture considered in each case is determined after several cross-validation experiments and reported in the last column of Table 2. Clearly from Table 2, the MSE algorithm exhibits the worst accuracy. The LMLS algorithm shows a better performance. The two RANSAC algorithms demonstrate significantly better performance, with considerable superiority for the new PRANSAC algorithm. This algorithm detected 100 outliers for the Liver dataset, 92 for Heart, 37 for Thyroid and 78 for Haberman.

6. Parallel RANSAC algorithms

Due to its repetitive sampling and estimation nature, any RANSAC algorithm takes a considerably long time. More specifically, the total runtime of the algorithm can be estimated as
τ ≈ k(t_M + t_V),   (9)
where k is the number of trials, t_M is the average time needed to find a model given a subset (train a MFNN on this subset), and t_V is the average time needed to evaluate the quality of the resultant model (find the consensus set and update the point/model support). According to Algorithm 1 (and 2), k equals the number N (N′) of drawn subsets plus a few iterations in the final repeat-until loop. Typically k is on the order of hundreds, leading to a rather long runtime τ. For example, Table 3 shows the average runtimes (in seconds) of the five algorithms over all runs and outlier percentages for Case I of the 1-D regression experiment on one core of an Intel 2.4 GHz Quad-Core PC with 4 GB RAM. Clearly the MSE training algorithm is the fastest, followed by the LMLS algorithm and then the two RANSAC algorithms, which have very close time performance. Finally, the simulated annealing-based SA-LMedS algorithm takes the longest time.
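The runtime model (9) can be sanity-checked against Table 3. The per-subset timings below are illustrative assumptions of ours, chosen only to show that a few hundred trials of a few seconds each reproduce the order of magnitude of the RANSAC runtime:

```python
# tau ≈ k (t_M + t_V), from (9)
k = 170      # assumed: N subsets plus a few repeat-until iterations
t_M = 2.2    # assumed average seconds to train a small MFNN on one subset
t_V = 0.15   # assumed average seconds to score the model over all of S
tau = k * (t_M + t_V)   # on the order of Table 3's ~400 s RANSAC runtime
```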
Fig. 14. Speed gain for the parallel PRANSAC algorithm for 1-D regression versus the number of cores. The gain is measured up to 4 cores (shown as red isolated markers). The gain scales up linearly (illustrated as a blue dotted line), and is extrapolated up to 8 cores.
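The linear-scaling extrapolation behind Fig. 14 can be reproduced directly from Table 4. The least-squares line fit below is our own way of formalizing the "scales up linearly" observation:

```python
import numpy as np

cores = np.array([1, 2, 3, 4])
runtime = np.array([403.5, 250.7, 169.4, 136.0])   # PRANSAC row of Table 4

gain = runtime[0] / runtime                 # speed gain relative to one core
slope, intercept = np.polyfit(cores, gain, 1)   # near-linear scaling
gain8 = slope * 8 + intercept               # extrapolated gain on 8 cores
t8 = runtime[0] / gain8                     # predicted runtime, about 71 s
```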
Nevertheless, the two proposed RANSAC algorithms can be sped up considerably by means of parallel computing, because the processing for each subset can be done independently. A careful examination reveals that each algorithm has the construct of a big for-loop, where each loop iteration takes a rather long time to execute, but can proceed independently of all other iterations. This can exploit the advances in CPU architecture that have made multiple CPU cores available on present-day PCs. In order to take advantage of the multiple CPU cores, parallel versions of the two RANSAC algorithms are implemented using the Parallel Computing Toolbox in Matlab (Sharma & Martin, 2009). To assess the gain from the parallel implementation, the complete experiment for Case I of 1-D regression is repeated at the 11 outlier percentages with 50 trials each. The average runtime of the two parallel algorithms over all runs and outlier percentages is recorded versus the number of CPU cores used in the experiment; see Table 4. The two algorithms showed very close performance with an increased number of cores. The runtime drops to about one third when using the four cores of the PC. The speed gain of the parallel PRANSAC versus the number of cores is plotted in Fig. 14. The speedup scales almost linearly with the number of cores, and can be extrapolated to predict the time performance on more than 4 cores. For example, PRANSAC is expected to take about 71 s (a gain of 5.7) on 8 cores. Additional speed gain can also be realized by further optimizing some other parts of the implementation code for speed. Nevertheless, this experiment clearly demonstrates the potential of utilizing multiple cores to improve the time performance of the proposed RANSAC algorithms.

7. Conclusions and future work

This paper has addressed the problem of fitting a model to data corrupted with outliers. Most previous efforts to solve this problem
have focused on using an estimation algorithm that minimizes a robust M-estimator based error criterion instead of the usual non-robust MSE measure. However, the robustness gained from M-estimators is still low. In this paper, we have addressed this problem in a RANSAC framework. We have studied the classical RANSAC framework and highlighted its model-wise nature. A regression estimation algorithm has been presented adopting this model-wise strategy. In order to make the adaptation of this framework to this goal more efficient, several novel aspects have been introduced into the standard framework. A stage following the standard RANSAC procedure is employed to improve the outlier/inlier classification of the data, and to increase the accuracy of the estimated model representing the inlying data. Furthermore, a bucketing technique is employed to enhance the sampling process of the RANSAC algorithm. A more suitable alternative is also devised for rather high-dimensional datasets. Furthermore, we have introduced a new point-wise RANSAC strategy, which keeps track of the support for a data point to be valid, instead of computing the support for the estimated model. The PRANSAC algorithm developed based on this point-wise strategy offers better robustness. The proposed algorithms’ theoretical robustness has been analyzed, resulting in a novel probabilistic setting for the algorithms’ breakdown points. To the best of our knowledge, this has not been reported before in the entire RANSAC literature. While the proposed concepts and algorithms are generic and general enough to adopt many regression machineries, the paper has focused on multilayered feed-forward neural networks in solving regression problems. In that context, the RANSAC framework has been less known, in spite of being widely used in the computer vision and image processing communities. More importantly, the new point-wise RANSAC strategy has not been used before in the computer vision and image processing communities either.
The two proposed algorithms have been extensively evaluated on synthetic and real data, under different scenarios and levels of outlier contamination. Our evaluation also included real high-dimensional data publicly available from the standardized benchmark collection for neural networks, PROBEN1 (Prechelt, 1994), and from the UCI repository. In all these experiments, the two proposed RANSAC algorithms have demonstrated better robustness than well-known, competing neural network training algorithms in the literature. In that regard, the new PRANSAC algorithm has consistently offered the highest accuracy and robustness. To improve the time performance of the two RANSAC algorithms, parallel implementations have been developed to utilize the multiple CPU cores available on today's computers. Our experiments have confirmed a speed gain that scales almost linearly with the number of cores; the two algorithms ran about three times faster on a 4-core CPU. Consequently, we believe the present paper has presented a rather comprehensive study of robust supervised training of MFNNs in a RANSAC framework, from the standpoint of both accuracy and time. To the best of our knowledge, such a study has not been reported before in the neural networks literature. Our current research is directed at further enhancing several aspects of the new point-wise RANSAC algorithm. Being part of the modeling process, the algorithm requires the user to specify several parameters. Some of these parameters are typical of neural network applications, such as the network topology, the number of hidden units, and the parameters of the backpropagation algorithm. Proper selection of these parameters may be guided by the user's experience or determined empirically through cross-validation (as in our experiments in Sections 5.2 and 5.3).
In addition, there are well-founded methods and guidelines addressing these issues in the literature (e.g., Haykin, 2008; Hunter, Yu, Pukish, Kolbusz, & Wilamowski, 2012), many of which can be safely used with the proposed algorithm. Other parameters pertain to the proposed algorithm itself, such as a believed (or worst-case) estimate of the outlier contamination percentage, and the subset size. All these parameters are application and data dependent. In addition, the algorithm needs some estimate of the noise level in the inlying data (a common requirement in the RANSAC framework) to determine the distance threshold t (we used cross-validation to estimate the noise level in our experiments with the real data). Although the algorithm was not critically dependent on the accuracy of this estimate (in many of our experiments the true noise level was twice the estimate used), we are currently investigating other ways to estimate the noise level automatically and robustly from the data, without any user intervention. Another direction under current investigation to improve the robustness of the algorithm is to integrate an M-estimator based criterion within the PRANSAC algorithm. Instead of minimizing an MSE measure during the final repeat-until loop (Step 6(a)), the algorithm can minimize an M-estimator based criterion. This can be useful to marginalize the influence of any outliers erroneously left in the consensus set Sm in the last stage of the algorithm. Moreover, to further improve the time performance of the PRANSAC algorithm, one can try to predict whether a randomly formed subset is likely to be "good" and deserves to be used for model estimation, or is more likely a "bad" subset deserving no further consideration. If this prediction strategy turns out to be successful most of the time, the algorithm could be considerably faster, especially when the percentage of outliers is large.
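As a hedged illustration of replacing the MSE criterion with an M-estimator, one common choice is Huber's ρ-function, which grows quadratically for small residuals but only linearly for large ones, bounding the influence of any outliers erroneously left in the consensus set. The specific ρ-function and the threshold `delta` below are our assumptions; the paper does not fix them here.

```python
import numpy as np

def huber_loss(residuals, delta=1.0):
    """Huber M-estimator rho-function: quadratic inside [-delta, delta],
    linear outside, so gross residuals have bounded influence."""
    r = np.abs(residuals)
    return np.where(r <= delta,
                    0.5 * r ** 2,               # small residuals: behaves like MSE
                    delta * (r - 0.5 * delta))  # large residuals: linear growth

# A single gross outlier dominates the MSE but not the Huber criterion.
resid = np.array([0.1, -0.2, 0.05, 8.0])
mse = np.mean(resid ** 2)
huber = np.mean(huber_loss(resid))
```

Since the Huber function is differentiable everywhere, minimizing the mean of this loss instead of the MSE in the final loop would be a drop-in change for a gradient-based trainer.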
We have already drafted some early thoughts (El-Melegy, 2011b, 2013) on integrating the M-estimator criterion and on using this prediction strategy, formalized in the framework of sequential decision-making, in our first RANSAC algorithm. Further investigation of these ideas and their application to the PRANSAC algorithm is underway.

Acknowledgments

The author would like to thank the anonymous reviewers for their detailed and constructive reviews, which have improved the technical content and presentation of the manuscript.

References

Ahmed, M., & Farag, A. (2002). A neural approach to zoom-lens camera calibration from data with outliers. Image and Vision Computing, 20, 619–630.
Chen, D., & Jain, R. (1994). A robust backpropagation learning algorithm for function approximation. IEEE Transactions on Neural Networks, 5(3), 467–479.
Chuang, C., Jeng, J., & Lin, P. (2004). Annealing robust radial basis function networks for function approximation with outliers. Neurocomputing, 56, 123–139.
Chuang, C., Su, S., & Hsiao, C. (2000). The annealing robust backpropagation (ARBP) learning algorithm. IEEE Transactions on Neural Networks, 11(5), 1067–1077.
Chum, O., & Matas, J. (2008). Optimal randomized RANSAC. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8), 1–11.
Ciampi, A., & Zhang, F. (2002). A new approach to training back-propagation artificial neural networks: empirical evaluation on ten data sets from clinical studies. Statistics in Medicine, 21(9), 1309–1330.
Demuth, H., & Beale, M. (2000). Matlab neural network toolbox user's guide, version 4.0. Tech. rep. The MathWorks Inc.
El-Melegy, M. T. (2011a). Random sampler M-estimator algorithm for robust function approximation via feed-forward neural networks. In The 2011 international joint conference on neural networks (pp. 3134–3140), July 31–August 5.
El-Melegy, M. T. (2011b). RANSAC algorithm with sequential probability ratio test for robust training of feed-forward neural networks.
In The 2011 international joint conference on neural networks (pp. 3256–3263), July 31–August 5.
El-Melegy, M. T. (2013). Random sampler M-estimator algorithm with sequential probability ratio test for robust function approximation via feed-forward neural networks. IEEE Transactions on Neural Networks and Learning Systems, 24(7), 1074–1085.
El-Melegy, M., Essai, M., & Ali, A. (2009). Robust training of artificial feedforward neural networks. In A. Hassanien, A. Abraham, A. Vasilakos, & W. Pedrycz (Eds.), Springer studies in computational intelligence: Vol. 1. Foundations of computational intelligence: learning and approximation (pp. 217–242).
Fischler, M., & Bolles, R. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395.
Hampel, F., Ronchetti, E., Rousseeuw, P., & Stahel, W. (1986). Robust statistics: the approach based on influence functions. John Wiley and Sons.
Haralick, R. (1986). Computer vision theory: the lack thereof. Computer Vision, Graphics and Image Processing, 36, 372–386.
Hartley, R., & Zisserman, A. (2004). Multiple view geometry in computer vision (2nd ed.). Cambridge University Press.
Haykin, S. (2008). Neural networks and learning machines (3rd ed.). Prentice Hall.
Huber, P. (1981). Robust statistics. John Wiley and Sons.
Hunter, D., Yu, H., Pukish, M., Kolbusz, J., & Wilamowski, B. (2012). Selection of proper neural network sizes and architectures—a comparative study. IEEE Transactions on Industrial Informatics, 8(2), 228–240.
Igel, C., & Hüsken, M. (2003). Empirical evaluation of the improved Rprop learning algorithm. Neurocomputing, 50, 105–123.
Kim, D., Kim, J., Meer, P., Mintz, D., & Rosenfeld, A. (1989). Robust computer vision: a least median of squares based approach. In Proc. DARPA image understanding workshop (pp. 1117–1134). Morgan Kaufmann Publishers.
Ko, C.-N. (2012). Identification of nonlinear systems with outliers using wavelet neural networks based on annealing dynamical learning algorithm. Engineering Applications of Artificial Intelligence, 25(3), 533–543.
Lee, C., Chung, P., Tsai, J., & Chang, C. (1999). Robust radial basis function neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 29(6), 674–685.
Liano, K. (1996). Robust error measure for supervised neural network learning with outliers. IEEE Transactions on Neural Networks, 7(1), 246–250.
Meer, P., Mintz, D., Rosenfeld, A., & Kim, D. (1991). Robust regression methods for computer vision: a review.
International Journal of Computer Vision, 6(1), 59–70.
Mitzenmacher, M., & Upfal, E. (2005). Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press.
Nishida, K., & Kurita, T. (2008). RANSAC-SVM for large-scale datasets. In 19th international conference on pattern recognition (pp. 1–4), December.
Pernia-Espinoza, A. V., Ordieres-Mere, J. B., de Pison, F. J. M., & Gonzalez-Marcos, A. (2005). TAO-robust backpropagation learning algorithm. Neural Networks, 1, 1–14.
Prechelt, L. (1994). PROBEN1—a set of benchmarks and benchmarking rules for neural network training algorithms. Tech. rep. 21/94. Fakultät für Informatik, Universität Karlsruhe. URL: http://digbib.ubka.uni-karlsruhe.de/volltexte/39794, 2011.
Prechelt, L. (1998). Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11, 761–767.
Rousseeuw, P., & Leroy, A. (1987). Robust regression and outlier detection. John Wiley and Sons.
Rusiecki, A. L. (2006). Robust learning algorithm with the variable learning rate. In ICAISC 2006, artificial intelligence and soft computing (pp. 83–90).
Rusiecki, A. L. (2007). Robust LTS backpropagation learning algorithm. In F. Sandoval, A. Prieto, J. Cabestany, & M. Graña (Eds.), LNCS: Vol. 4507. IWANN (pp. 102–109). Heidelberg: Springer.
Rusiecki, A. L. (2008). Robust MCD-based backpropagation learning algorithm. In LNAI: Vol. 5097. ICAISC 2008 (pp. 154–163). Berlin: Springer.
Rusiecki, A. L. (2010). Fast robust learning algorithm dedicated to LMLS criterion. In L. Rutkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh, & J. Zurada (Eds.), Lecture notes in computer science: Vol. 6114. Artificial intelligence and soft computing (pp. 96–103). Berlin, Heidelberg: Springer.
Sharma, G., & Martin, J. (2009). MATLAB: a language for parallel computing. International Journal of Parallel Programming, 37, 3–36.
Srimani, P., & Koti, M. (2012). Outlier mining in medical databases by using statistical methods. International Journal of Engineering Science and Technology (IJEST), 4(1), 239–246.
The MathWorks Inc. (2009). MATLAB 7.8 documentation. URL: http://www.mathworks.com.
Tomandl, D., & Schober, A. (2001). A modified general regression neural network (MGRNN) with new, efficient training algorithms as a robust 'black box'-tool for data analysis. Neural Networks, 14, 1023–1034.
Torr, P., & Zisserman, A. (2000). MLESAC: a new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1), 138–156.
Zhang, Z. (1998). Determining the epipolar geometry and its uncertainty: a review. International Journal of Computer Vision, 27(2), 161–195.
Zhang, Z., Deriche, R., Faugeras, O., & Luong, Q. (1995).
A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal, 78, 87–119.
Zhuang, X., Wang, T., & Zhang, P. (1992). A highly robust estimator through partially likelihood function modeling and its application in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 14(1), 19–34.
Zurada, J. M. (1992). Introduction to artificial neural systems. Boston: PWS.