Computational Statistics and Data Analysis 55 (2011) 168–183
doi:10.1016/j.csda.2010.06.014
Robust weighted kernel logistic regression in imbalanced and rare events data

Maher Maalouf ∗, Theodore B. Trafalis

School of Industrial Engineering, The University of Oklahoma, 212 West Boyd Street, Room 124, Norman, OK, 73019, United States

∗ Corresponding author. Tel.: +1 405 410 6365. E-mail addresses: [email protected] (M. Maalouf), [email protected] (T.B. Trafalis).

Article history: Received 26 December 2009; received in revised form 12 June 2010; accepted 12 June 2010; available online 25 June 2010.

Keywords: Classification; Endogenous sampling; Logistic regression; Kernel methods; Truncated Newton

Abstract

Recent developments in computing and technology, along with the availability of large amounts of raw data, have contributed to the creation of many effective techniques and algorithms in the fields of pattern recognition and machine learning. The main objectives for developing these algorithms include identifying patterns within the available data or making predictions, or both. Great success has been achieved with many classification techniques in real-life applications. With regard to binary data classification in particular, analysis of data containing rare events or disproportionate class distributions poses a great challenge to industry and to the machine learning community. This study examines rare events (REs) with binary dependent variables containing many more non-events (zeros) than events (ones). These variables are difficult to predict and to explain as has been evidenced in the literature. This research combines rare events corrections to Logistic Regression (LR) with truncated Newton methods and applies these techniques to Kernel Logistic Regression (KLR). The resulting model, Rare Event Weighted Kernel Logistic Regression (RE-WKLR), is a combination of weighting, regularization, approximate numerical methods, kernelization, bias correction, and efficient implementation, all of which are critical to enabling RE-WKLR to be an effective and powerful method for predicting rare events. Comparing RE-WKLR to SVM and TR-KLR, using non-linearly separable, small and large binary rare event datasets, we find that RE-WKLR is as fast as TR-KLR and much faster than SVM. In addition, according to the statistical significance test, RE-WKLR is more accurate than both SVM and TR-KLR.

1. Introduction

Rare events (REs), class imbalance, and rare classes are critical to prediction and hence human response in the field of data mining and particularly data classification. Examples of rare events include fraudulent credit card transactions (Chan and Stolfo, 1998), word mispronunciation (Busser and Daelemans, 1999), tornadoes (Trafalis et al., 2003), telecommunication equipment failures (Weiss and Hirsh, 2000), oil spills (Kubat et al., 1998), international conflicts (King and Zeng, 2001a), state failure (King and Zeng, 2001b), landslides (Eeckhaut et al., 2006; Bai et al., 2008), train derailments (Quigley et al., 2007), rare events in a series of queues (Tsoucas, 1992) and other rare events. By definition, rare events are occurrences that take place with a significantly lower frequency compared to more common events. Given their infrequency, rare events have an even greater value when correctly classified. However, the imbalanced distribution of classes calls for correct classification. The rare class presents several problems and challenges to existing classification algorithms (Weiss, 2004; King and Zeng, 2001c).





Sampling is undoubtedly one of the most important techniques in dealing with REs. The underlying objective of sampling is minimizing the effects of rareness by changing the distribution of the training instances. Sampling techniques can be either basic (random) or advanced (intelligent). Van-Hulse et al. (2007) provide a comprehensive survey on both random and intelligent data sampling techniques and their impact on various classification algorithms. Seiffert et al. (2007) observed that data sampling is very effective in alleviating the problems presented by rare events. Basic sampling methods consist of under-sampling and over-sampling. The former eliminates examples from the majority class, while the latter adds more training examples on behalf of the minority class. Over-sampling can thus increase processing time. In addition, over-sampling risks over-fitting, since it involves making identical copies of the minority class. Drummond and Holte (2003) found that under-sampling using C4.5 (a decision tree algorithm) is most effective for imbalanced data. Maloof (2003) showed, however, that under-sampling and over-sampling are almost equal in effect using Naive Bayes and C5.0 (a commercial successor to C4.5). Japkowicz (2000) came to a similar conclusion but found that under-sampling the majority class works better on large domains. Prati et al. (2004), without providing conclusive evidence, proposed over-sampling combined with data cleaning methods as a possible remedy for classifying REs.

The basic sampling strategy is known in econometrics and transportation studies as choice-based, state-based or endogenous sampling. In medical research it is known as case control. King and Zeng (2001c) advocate under-sampling of the majority class when statistical methods such as logistic regression are employed. They clearly demonstrated that such designs are consistent and efficient only with the appropriate corrections. Unfortunately, few researchers are aware of the fact that any kind of under-sampling is a form of choice-based sampling, which leads to biased estimates; they therefore proceed to solve likelihoods that are only appropriate for random sampling.

King and Zeng (2001c) state that the problems associated with REs stem from two sources. First, when probabilistic statistical methods, such as logistic regression, are used, they underestimate the probability of rare events, because they tend to be biased towards the majority class, which is the less important class. Second, commonly used data collection strategies are inefficient for rare events data. A trade-off exists between gathering more observations (instances) and including more informative, useful variables in the dataset. When one of the classes represents a rare event, researchers tend to collect very large numbers of observations with very few explanatory variables in order to include as many data as possible for the rare class. This in turn could significantly increase the data collection cost and not help much with the underestimated probability of detecting the rare class or the rare event.

Kernel Logistic Regression (KLR) (Canu and Smola, 2005; Jaakkola and Haussler, 1999), which is a kernel version of Logistic Regression (LR), has been proven to be a powerful classifier. Just like LR, KLR can naturally provide probabilities and extend to multi-class classification problems (Hastie et al., 2001; Karsmakers et al., 2007).
The advantages of using LR are that it has been extensively studied (Hosmer and Lemeshow, 2000) and that it has recently been improved through the use of truncated Newton methods (Komarek and Moore, 2005; Lin et al., 2007). The same has been shown recently for KLR by Maalouf and Trafalis (2008). Furthermore, LR and KLR do not make assumptions about the distribution of the independent variables, and they include the probabilities of occurrences as a natural extension. Moreover, LR and KLR can be extended to handle multi-class classification problems, and they require solving only unconstrained optimization problems. Hence, with the right algorithms, the computation time can be much less than that for other methods, such as Support Vector Machines (SVM) (Vapnik, 1995), which require solving a constrained quadratic optimization problem. In sum, King and Zeng (2001c) applied LR to REs data with the appropriate bias and probability corrections, Komarek and Moore (2005) implemented the truncated Newton method in LR (TR-IRLS), and Maalouf and Trafalis (2008) implemented the truncated Newton method in KLR (TR-KLR).

The focus of this study is the implementation of fast and robust adaptations of KLR in imbalanced and rare events data. The algorithm is termed Rare Event Weighted Kernel Logistic Regression (RE-WKLR). The ultimate objective is to gain significantly more accuracy in predicting REs with diminished bias and variance. Weighting, regularization, approximate numerical methods, kernelization, bias correction, and efficient implementation are critical to enabling RE-WKLR to be an effective and powerful method for predicting rare events. Our analysis involves the standard multivariate cases in finite dimensional spaces. Recent advances in Functional Data Analysis (FDA) (Ramsay and Silverman, 2005) and their extension to non-parametric functional data analysis (Ferraty and Vieu, 2006) allow for consideration of cases in which random variables take on infinite dimensional spaces (functional spaces).

In Section 2, we provide a brief description of sample selection bias. In Section 3, we give an overview of LR for rare events. Section 4 derives the KLR model for the rare events and imbalanced data problems. Section 5 describes the Rare Event Weighted Kernel Logistic Regression (RE-WKLR) algorithm. Numerical results are presented in Section 6, and Section 7 addresses the conclusions and future work.

2. Sample selection bias, endogenous sampling, and biased estimates

Following Zadrozny (2004), let s be a binary random variable, which takes the value of 1 if a sample is selected and 0 otherwise. Let X ∈ R^{n×d} be a data matrix, where n is the number of instances (examples) and d is the number of features (parameters or attributes), and let y be a binary outcomes vector. For every instance x_i ∈ R^d (a row vector in X), where i = 1, . . . , n, the outcome is either y_i = 1 or y_i = 0. Let the instances with outcomes of y_i = 1 belong to the positive class, and the instances with outcomes y_i = 0 belong to the negative class. The goal is to classify the instance x_i as positive or negative. An instance can be thought of as a Bernoulli trial with an expected value E[y_i] or probability p_i.


In addition, the joint probability P(x, y) = P(y|x)P(x) is the unbiased probability of observing an example (x, y) from a distribution D. Classification algorithms are then applied to estimate the joint probability P(x, y). Classifiers that approximate P(x, y) via P(y|x) are called "local" classifiers (e.g. logistic regression or Bayesian classifiers), and classifiers that approximate P(x, y) by P(y|x)P(x) are called "global" classifiers (e.g. SVM or naive Bayes). Zadrozny (2004) listed four cases as regards the dependence of s on the example (x, y) (the class-bias case is illustrated in the sketch after this list):

- If s is independent of both x and y, then there is no bias.
- If s is independent of y given x (P(s|x, y) = P(s|x)), then there is a feature bias.
- If s is independent of x given y (P(s|x, y) = P(s|y)), then there is a class bias.
- If s is dependent on both x and y, then there is a full bias.
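To make the class-bias case concrete, the short simulation below (an illustrative sketch; the population event rate, sample sizes, and variable names are placeholders, not values from the paper) selects on y by keeping all events and an equal number of non-events, so that the sample proportion of events no longer matches the population proportion, which is the situation that the corrections of Sections 3 and 4 are designed to undo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population with a rare event: P(y = 1) = tau = 0.03.
tau = 0.03
y_pop = rng.binomial(1, tau, size=100_000)

# Choice-based (endogenous) sampling: keep all events and an equal number
# of randomly chosen non-events (under-sampling of the majority class).
events = np.flatnonzero(y_pop == 1)
nonevents = rng.choice(np.flatnonzero(y_pop == 0), size=events.size, replace=False)
y_sample = y_pop[np.concatenate([events, nonevents])]

# Selection depended on y only, so P_s(y) != P(y): a class bias.
print("population event rate:", y_pop.mean())    # about 0.03
print("sample event rate:    ", y_sample.mean())  # about 0.50
```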

For local learners in general, and logistic regression in particular, Zadrozny (2004) found that when s is independent of y given x, the LR model is not affected by the sample selection bias. The results for the sample are nearly those obtained from random sampling. However, this is not the case when there is a class bias, or when sampling is of the response variable, which corresponds to the third of the four cases listed above. When one of the y classes is rare in the population, then random selection within values of y would save significant resources in data collection (King and Zeng, 2001c; Cramer, 2003).

There are several advantages associated with selection on the response variable. First, in conducting surveys, cost reduction and time saving can be achieved by using stratified samples instead of collecting random samples, especially when the event of interest is rare in the population. Second, greater computational efficiency can be reached, because there is no need to analyze massive datasets. Finally, the explanatory power of the logistic model can be enriched by making the proportions of events and non-events more balanced (King and Zeng, 2001c).

However, since the objective is to derive inferences about the population from the sample, the estimates obtained by means of the common likelihood under pure endogenous sampling are inconsistent. To see why this is so, note that under pure endogenous sampling the conditioning is on X rather than y (Cameron and Trivedi, 2005; Milgate et al., 1990), and the joint distribution of y and X in the sample is

$$ f_s(y, X \mid \beta) = P_s(X \mid y, \beta)\, P_s(y), \qquad (1) $$

where β is the unknown parameter vector to be estimated. Yet, since X is a matrix of exogenous variables, the conditional probability of X in the sample is equal to that in the population, or P_s(X|y, β) = P(X|y, β). However, the conditional probability in the population is

$$ P(X \mid y, \beta) = \frac{f(y, X \mid \beta)}{P(y)}, \qquad (2) $$

but

$$ f(y, X \mid \beta) = P(y \mid X, \beta)\, P(X), \qquad (3) $$

and, hence, substituting and rearranging yields

$$ f_s(y, X \mid \beta) = \frac{P_s(y)}{P(y)}\, P(y \mid X, \beta)\, P(X) \qquad (4) $$

$$ = \frac{H}{Q}\, P(y \mid X, \beta)\, P(X), \qquad (5) $$

where H/Q = P_s(y)/P(y), with H representing the proportions in the sample and Q the proportions in the population. The likelihood is then

$$ L_{\text{Endogenous}} = \prod_{i=1}^{n} \frac{H_i}{Q_i}\, P(y_i \mid x_i, \beta)\, P(x_i). \qquad (6) $$

Therefore, when dealing with REs and imbalanced data, it is the likelihood in (6) that needs to be maximized (Cameron and Trivedi, 2005; Amemiya, 1985; Xie and Manski, 1989; Manski and Lerman, 1977; Imbens and Lancaster, 1996). Several consistent estimators of this type of likelihood have been proposed in the literature. Amemiya (1985) and Ben-Akiva and Lerman (1985) provide an excellent survey of these methods.

3. Logistic regression and endogenous sampling

The logistic function commonly used to model each instance x_i with its expected outcome is given by the following formula (Hosmer and Lemeshow, 2000):

$$ E[y_i \mid x_i, \beta] = p_i = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}}, \qquad (7) $$


where β is the vector of parameters, with the assumption made that x_{i0} = 1 so that the intercept β_0 is a constant term. From now on, the assumption is that the intercept is included in the vector β. The logistic (logit) transformation is the logarithm of the odds of the positive response, and is defined as

$$ \eta_i = \ln\!\left(\frac{p_i}{1 - p_i}\right) = x_i \beta. \qquad (8) $$

In matrix form, the logit function is expressed as

$$ \eta = X\beta. \qquad (9) $$

The regularized log-likelihood is defined as

$$ \ln L(\beta) = \sum_{i=1}^{n} \big( y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \big) - \frac{\lambda}{2}\, \|\beta\|^2, \qquad (10) $$

where the regularization (penalty) term (λ/2)‖β‖² is added to obtain better generalization. For binary outputs, the loss function or the deviance DEV is the negative log-likelihood and is given by the formula (Hosmer and Lemeshow, 2000; Komarek, 2004)

$$ \mathrm{DEV}(\hat{\beta}) = -2 \ln L(\hat{\beta}). \qquad (11) $$

Minimizing the deviance DEV(β̂) given in (11) is equivalent to maximizing the log-likelihood given in (10) (Hosmer and Lemeshow, 2000). The deviance function above is non-linear in β, and minimizing it requires numerical methods to find the Maximum Likelihood Estimate (MLE) of β, which is β̂. Recent studies show that the Conjugate Gradient (CG) method provides better results for estimating β than any other numerical method (Malouf, 2002; Minka, 2003). The main advantage of the CG method is that it guarantees convergence in at most d steps (Lewis et al., 2006).

Now, with regard to REs, King and Zeng (2001a,c) argued that while endogenous sampling is the preferred method for dealing with REs and class imbalance data, certain corrections must be applied. First, consistent estimators are obtained by solving for the likelihood proposed by Manski and Lerman (1977),

$$ \ln L(\beta) = \sum_{i=1}^{n} \frac{Q_i}{H_i} \big( y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \big), \qquad (12) $$

where Q_i/H_i = (τ/ȳ) y_i + ((1 − τ)/(1 − ȳ))(1 − y_i), with ȳ and τ representing the proportion of events in the sample and in the population, respectively. Second, King and Zeng (2001c) developed the small-sample corrections, as described by McCullagh and Nelder (1989), to the weighted likelihood (12), which, even with endogenous sampling, can make a difference when the population probability of the event of interest is low. In fact, a recent comparative study by Maiti and Pradhan (2008) showed that this bias correction method gives the smallest Mean Squared Error (MSE) when applied to LR. The main advantage of the bias correction method as described by McCullagh and Nelder (1989) is that it reduces both the bias and the variance (King and Zeng, 2001c). The disadvantage of this bias correction method is that it is corrective and not preventive, since it is applied after the estimation is complete, and hence it does not protect against infinite parameter values that arise from perfect separation between the classes (Heinze and Schemper, 2001; Wang and Wang, 2001).

Applying the above corrections, along with the recommended sampling strategies, such as collecting all of the available events and only a matching proportion of non-events, can (1) significantly decrease the sample size under study, (2) cut data collection costs, (3) increase the rare event probability, and (4) enable researchers to focus more on analyzing the variables.

Despite all of the improvements made in the Logistic Regression method, its underlying assumption of linearity in x_i, as evident in its logit function in (9), is often violated (Hastie et al., 2001; Keele, 2008). With the advancement of kernel methods, the search for an effective non-parametric LR model has become possible.
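A minimal NumPy sketch of the weighted criterion in (12) is given below. The helper names, the simulated data, and the parameter values are illustrative assumptions; the snippet only shows how the weights Q_i/H_i enter the regularized log-likelihood of (10), and it is not the authors' implementation.

```python
import numpy as np

def kz_weights(y, tau):
    """Weights Q_i/H_i of Manski and Lerman (1977) / King and Zeng (2001c):
    tau/ybar for events and (1 - tau)/(1 - ybar) for non-events."""
    ybar = y.mean()                        # event fraction in the endogenous sample
    w1, w0 = tau / ybar, (1 - tau) / (1 - ybar)
    return w1 * y + w0 * (1 - y)

def weighted_loglik(beta, X, y, tau, lam):
    """Regularized weighted log-likelihood combining Eqs. (10) and (12)."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))    # logistic probabilities, Eq. (7)
    w = kz_weights(y, tau)
    ll = np.sum(w * (y * np.log(p) + (1 - y) * np.log(1 - p)))
    return ll - 0.5 * lam * (beta @ beta)

# Illustrative call with random placeholder data (not the paper's datasets).
rng = np.random.default_rng(1)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 3))])  # intercept in X
y = rng.binomial(1, 0.5, size=200)                             # balanced training sample
print(weighted_loglik(np.zeros(4), X, y, tau=0.05, lam=0.1))
```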

4. Kernel logistic regression in imbalanced and rare events data

The vector β can be expressed as a linear combination of the input vectors, such that

$$ \beta = X^{T}\alpha = \sum_{i=1}^{n} \alpha_i x_i, \qquad (13) $$

where the vector α is known as the dual variable with dimensions n × 1. Now, the logit vector η can be rewritten as

$$ \eta = X X^{T} \alpha = K\alpha, \qquad (14) $$

where

$$ K = X X^{T} \qquad (15) $$

is the symmetric positive semidefinite Gram matrix with n × n dimensions.
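The dual representation in (13)–(15) is easy to verify numerically; the short sketch below (random data with illustrative dimensions, not from the paper) checks that the primal logit Xβ with β = Xᵀα coincides with Kα for the linear-kernel Gram matrix K = XXᵀ.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 3
X = rng.normal(size=(n, d))
alpha = rng.normal(size=n)        # dual variables, one per training instance

K = X @ X.T                       # linear-kernel Gram matrix, n x n (Eq. (15))
beta = X.T @ alpha                # primal parameters recovered from the dual (Eq. (13))

# Both routes give the same logit vector eta (Eq. (14)).
print(np.allclose(X @ beta, K @ alpha))   # True
```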



Fig. 1. Mapping of non-linearly separable data from the input space to the feature space.

Consider again the logit link function shown in the previous section,

$$ \eta_i = x_i \beta = \beta_0 + x_{i1}\beta_1 + \cdots + x_{id}\beta_d, \qquad (16) $$

where the vector x_i is given by x_i = [1, x_{i1}, . . . , x_{id}] with i = 1, . . . , n. This linear function represents the simplest form of an identity mapping polynomial basis function φ of the feature space such that

$$ \phi(x_i) = \phi[(1, x_{i1}, \ldots, x_{id})] = x_i. \qquad (17) $$

Thus, the logit link function could be rewritten as

$$ \eta_i = \phi(x_i)\beta. \qquad (18) $$

In general, the function φ(·) maps the data from a lower dimensional space into a higher one (Fig. 1), such that

$$ \phi : x \in \mathbb{R}^{d} \rightarrow \phi(x) \in \mathcal{F} \subseteq \mathbb{R}^{\Lambda}, \qquad (19) $$

where Λ is the dimension of the feature space. The goal for choosing the mapping φ is to convert non-linear relations between the response variable and the independent variables into linear relations. However, the transformation φ(·) is often unknown, but the dot product in the feature space can be expressed in terms of the input vectors through the kernel function. In the case of KLR, the logit link function could then be expressed as

$$ \eta_i = \sum_{j=1}^{n} \alpha_j \langle \phi(x_i), \phi(x_j) \rangle \qquad (20) $$

$$ = \sum_{j=1}^{n} \alpha_j\, \kappa(x_i, x_j) \qquad (21) $$

$$ = k_i \alpha, \qquad (22) $$

where k_i is the ith row of the kernel matrix K, whose entries are K_{ij} = κ(x_i, x_j). The kernel is a transformation function that must satisfy Mercer's necessary and sufficient conditions, which state that a kernel function must be expressed as an inner product and must be positive semidefinite (Cristianini and Shawe-Taylor, 2000; Shawe-Taylor and Cristianini, 2004). Now,

$$ \eta_i = k_i \alpha = \ln\!\left(\frac{p_i}{1 - p_i}\right), \qquad (23) $$

which implies that

$$ p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} = \frac{e^{k_i \alpha}}{1 + e^{k_i \alpha}} = \frac{1}{1 + e^{-k_i \alpha}}, \qquad (24) $$

and, hence, the regularized log-likelihood can be rewritten with respect to α as

$$ \ln L(\alpha) = \sum_{i=1}^{n} \big( y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \big) - \frac{\lambda}{2}\, \alpha^{T} K \alpha, \qquad (25) $$

with a deviance

$$ \mathrm{DEV}(\alpha) = -2 \ln L(\alpha). \qquad (26) $$
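As an illustration of (24)–(26), the sketch below builds a Gram matrix with the Gaussian RBF kernel that is used later in Section 6 (Eq. (60)) and evaluates the kernel logistic probabilities and the regularized deviance. The data, the kernel width, and the regularization value are placeholders, not values from the paper.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """Gaussian RBF kernel matrix with entries exp(-||x - z||^2 / (2 sigma^2))."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2 * sigma**2))

def klr_probabilities(K, alpha):
    """p_i = 1 / (1 + exp(-k_i alpha)), as in Eq. (24)."""
    return 1.0 / (1.0 + np.exp(-K @ alpha))

def regularized_deviance(K, alpha, y, lam):
    """DEV(alpha) = -2 ln L(alpha) with the ridge penalty of Eq. (25)."""
    p = klr_probabilities(K, alpha)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) - 0.5 * lam * (alpha @ K @ alpha)
    return -2.0 * ll

# Illustrative use with random placeholder data.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = rng.binomial(1, 0.3, size=50)
K = rbf_kernel(X, X, sigma=2.0)
print(regularized_deviance(K, np.zeros(50), y, lam=0.1))
```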


Like the LR model, the commonly used maximum likelihood formulation of KLR is not appropriate for classifying imbalanced and rare events data, especially when endogenous sampling is performed. The full likelihood function needs to be stated. The likelihood function should then be

$$ L(\alpha \mid y, K) = f(y, K \mid \alpha) = \frac{H}{Q}\, P(y \mid K, \alpha)\, P(K) \qquad (27) $$

$$ = \prod_{i=1}^{n} \frac{H_i}{Q_i}\, P(y_i \mid k_i, \alpha)\, P(k_i). \qquad (28) $$

Now, following the same intuitive concept as Manski and Lerman (1977), choice-based sampling can easily be dealt with as long as knowledge of the population probability is available. The log-likelihood for KLR can then be rewritten as

$$ \ln L(\alpha \mid y, K) = \sum_{i=1}^{n} \frac{Q_i}{H_i} \ln P(y_i \mid k_i, \alpha) \qquad (29) $$

$$ = \sum_{i=1}^{n} \frac{Q_i}{H_i} \ln\!\left(\frac{e^{y_i k_i \alpha}}{1 + e^{k_i \alpha}}\right) \qquad (30) $$

$$ = \sum_{i=1}^{n} w_i \ln\!\left(\frac{e^{y_i k_i \alpha}}{1 + e^{k_i \alpha}}\right), \qquad (31) $$

where w_i = Q_i/H_i. As with LR, in order to obtain a consistent estimator, the log-likelihood is multiplied by the inverse of the H_i/Q_i fractions. This produces a Weighted Maximum Likelihood (WML), and the KLR model becomes a Weighted KLR (WKLR) model. The WML estimator for WKLR is consistent. From the regularity conditions, the WML estimator solves the first-order conditions

$$ \frac{Q}{H}\, \nabla_{\alpha} \ln P(y \mid K, \alpha) = 0. \qquad (32) $$

Now, taking the expectation of the score function with respect to the sample density yields

$$ E\!\left[\frac{Q}{H}\, \nabla_{\alpha} \ln P(y \mid K, \alpha)\right] = \int \frac{Q}{H}\, \nabla_{\alpha} \ln P(y \mid K, \alpha)\, \frac{H}{Q}\, P(y \mid K, \alpha)\, P(K)\, dK \qquad (33) $$

$$ = \int \nabla_{\alpha} \ln P(y \mid K, \alpha)\, P(y \mid K, \alpha)\, P(K)\, dK \qquad (34) $$

$$ = \int E\big[\nabla_{\alpha} \ln P(y \mid K, \alpha)\big]\, P(K)\, dK \qquad (35) $$

$$ = 0, \qquad (36) $$

and therefore the estimator is consistent. This estimator, however, like its LR counterpart, is not fully efficient, because the information matrix equality does not hold. This is demonstrated as

$$ -E\!\left[\frac{Q}{H}\, \nabla_{\alpha}^{2} \ln P(y \mid K, \alpha)\right] \neq E\!\left[\left(\frac{Q}{H}\, \nabla_{\alpha} \ln P(y \mid K, \alpha)\right)\left(\frac{Q}{H}\, \nabla_{\alpha} \ln P(y \mid K, \alpha)\right)^{T}\right], \qquad (37) $$

because

$$ -E\!\left[\sum_{i=1}^{n} \left(\frac{Q_i}{H_i}\right) p_i (1 - p_i)\, k_i^{T} k_i\right] \neq E\!\left[\sum_{i=1}^{n} \left(\frac{Q_i}{H_i}\right)^{2} p_i (1 - p_i)\, k_i^{T} k_i\right]. \qquad (38) $$

Now, as mentioned earlier, KLR regularization is used in the form of the ridge penalty (λ/2) αᵀKα, where λ > 0 is a regularization parameter. When regularization is introduced, none of the coefficients is set to zero (Park and Hastie, 2008), and hence the problem of infinite parameter values is avoided. In addition, the importance of the parameter λ lies in determining the bias–variance trade-off of an estimator (Cowan, 1998; Maimon and Rokach, 2005). When λ is very small, there is less bias but more variance. On the other hand, larger values of λ lead to more bias but less variance (Berk, 2008). Therefore, the inclusion of regularization in the WKLR model is very important for reducing any potential inefficiency. However, as regularization carries the risk of a non-negligible bias, even asymptotically (Berk, 2008), the need for bias correction becomes inevitable. In sum, the bias correction is needed to account for any bias resulting from regularization, small samples, and rare events.


4.1. Rare events and finite sample correction

Like the ordinary LR model, the method for computing the probability,

$$ \Pr(y_i = 1 \mid \alpha) = \hat{p}_i = \frac{1}{1 + e^{-k_i \hat{\alpha}}}, \qquad (39) $$

is affected by α̂, which is a biased estimate of α.

4.1.1. Bias adjustment and parameter estimation

Following McCullagh and Nelder (1989), the bias in large samples may be very small. However, for samples of smaller size, or for samples in which the number of parameters is large compared to the number of instances, the bias may not be so small. For the KLR model, the approximate bias vector can be written as

$$ b = E(\hat{\alpha} - \alpha), \qquad (40) $$

and by following the same methodology as was used by McCullagh and Nelder (1989), it can be shown that the approximate asymptotic covariance matrix of η is given by

$$ Q = K(K^{T} V K)^{-1} K, \qquad (41) $$

where V = diag(p_i(1 − p_i)) for i = 1, . . . , n. Replacing the kernel matrix K with the matrix K̃ = (K + δI), for a very small δ > 0, in order to make it invertible, enables (41) to be reduced to

$$ Q = V^{-1}. \qquad (42) $$

Let Q_ii be the ith diagonal element of Q, and

$$ \xi_i = -\frac{1}{2}\, \frac{p_i''}{p_i'}\, Q_{ii}, \qquad (43) $$

where p_i' = ∂p_i/∂η_i and p_i'' = ∂²p_i/∂η_i² are the derivatives of the KLR logit function; then the bias vector b can be written in matrix form as

$$ \mathrm{bias}(\hat{\alpha}) = (\tilde{K}^{T} V \tilde{K})^{-1} \tilde{K}^{T} V \xi, \qquad (44) $$

which is obtained as the vector of regression coefficients with ξ as a response vector. Applying now the formulation suggested by King and Zeng (2001c) for the weighted LR to the WKLR model in (31), the weighted likelihood can be rewritten as

$$ L_{W}(\alpha) = \prod_{i=1}^{n} (p_i)^{w_1 y_i} (1 - p_i)^{w_0 (1 - y_i)}, \qquad (45) $$

where w_1 = τ/ȳ and w_0 = (1 − τ)/(1 − ȳ). Now,

$$ p_i = E(y_i) = \left(\frac{1}{1 + e^{-\eta_i}}\right)^{w_1} \equiv p_i^{w_1}, \qquad (46) $$

and, hence,

$$ p_i' = w_1\, p_i^{w_1} (1 - p_i), \qquad (47) $$

and

$$ p_i'' = w_1\, p_i^{w_1} (1 - p_i)\big(w_1 - (1 + w_1)\, p_i\big). \qquad (48) $$

Finally, the bias vector for WKLR can now be rewritten as

$$ B(\hat{\alpha}) = (\tilde{K}^{T} D \tilde{K})^{-1} \tilde{K}^{T} D \xi, \qquad (49) $$

where the ith element of the vector ξ is now

$$ \xi_i = 0.5\, Q_{ii} \big((1 + w_1)\, p_i - w_1\big), \qquad (50) $$

with Q_ii as the diagonal elements of Q, and D = diag(v_i w_i) for i = 1, . . . , n. The bias-corrected estimator becomes

$$ \tilde{\alpha} = \hat{\alpha} - B(\hat{\alpha}). \qquad (51) $$
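The correction in (49)–(51) amounts to a single weighted least-squares solve with ξ as the response. The sketch below assumes the kernel matrix, the current estimate α̂, and the weights w1, w0 are available from the main fit; the jitter δ and the optional λ term follow Algorithm 1 and Eq. (59) rather than the unregularized form of (49), and the code is an illustration rather than the authors' implementation.

```python
import numpy as np

def bias_correct(K, alpha_hat, y, w1, w0, lam=0.0, delta=1e-8):
    """Return the bias vector B(alpha_hat) of Eq. (49) and the corrected
    estimate alpha_tilde = alpha_hat - B(alpha_hat) of Eq. (51)."""
    Kt = K + delta * np.eye(K.shape[0])           # K-tilde: jittered kernel matrix
    p = 1.0 / (1.0 + np.exp(-Kt @ alpha_hat))     # fitted probabilities, Eq. (39)
    v = p * (1 - p)                               # variances
    w = w1 * y + w0 * (1 - y)                     # rare-event weights
    D = np.diag(v * w)
    Qii = 1.0 / (v * w)                           # diagonal of Q, Eq. (42)
    xi = 0.5 * Qii * ((1 + w1) * p - w1)          # bias response, Eq. (50)
    A = Kt.T @ D @ Kt + lam * Kt                  # same matrix as the WLS subproblem
    bias = np.linalg.solve(A, Kt.T @ D @ xi)      # Eq. (49) (with lam = 0)
    return bias, alpha_hat - bias
```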


For WKLR, the gradient and Hessian are obtained by differentiating the regularized weighted log-likelihood,

$$ \ln L_{W}(\alpha) = \sum_{i=1}^{n} w_i \ln\!\left(\frac{e^{y_i k_i \alpha}}{1 + e^{k_i \alpha}}\right) - \frac{\lambda}{2}\, \alpha^{T} K \alpha, \qquad (52) $$

with respect to α. In matrix form, the gradient is

$$ \nabla_{\alpha} \ln L_{W}(\alpha) = \tilde{K}^{T} W (y - p) - \lambda \tilde{K} \alpha, \qquad (53) $$

where W = diag(w_i) and p is the probability vector whose elements are given in (39). The Hessian with respect to α is then

$$ \nabla_{\alpha}^{2} \ln L_{W}(\alpha) = -\tilde{K}^{T} D \tilde{K} - \lambda \tilde{K}. \qquad (54) $$

The Newton–Raphson update with respect to α on the (c + 1)th iteration is

$$ \hat{\alpha}^{(c+1)} = \hat{\alpha}^{(c)} + (\tilde{K}^{T} D \tilde{K} + \lambda \tilde{K})^{-1} \big(\tilde{K}^{T} W (y - p) - \lambda \tilde{K} \hat{\alpha}^{(c)}\big). \qquad (55) $$

Since α̂^(c) = (K̃ᵀDK̃ + λK̃)⁻¹(K̃ᵀDK̃ + λK̃) α̂^(c), (55) can be rewritten as

$$ \hat{\alpha}^{(c+1)} = (\tilde{K}^{T} D \tilde{K} + \lambda \tilde{K})^{-1} \tilde{K}^{T} D z^{(c)}, \qquad (56) $$

where z^(c) = K̃ α̂^(c) + D⁻¹W(y − p) is the adjusted dependent variable or the adjusted response.
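A single update of the form (55)–(56) can be written as one linear solve, as in the sketch below (a dense NumPy solve in place of the truncated CG of Section 5; the argument names and the use of the jittered kernel are illustrative assumptions).

```python
import numpy as np

def irls_step(K, alpha, y, w1, w0, lam):
    """One weighted-IRLS update: solve (K^T D K + lam K) alpha_new = K^T D z, Eq. (56)."""
    p = 1.0 / (1.0 + np.exp(-K @ alpha))    # current probabilities
    v = p * (1 - p)                         # variances
    w = w1 * y + w0 * (1 - y)               # rare-event weights
    D = np.diag(v * w)
    # Adjusted response z = K alpha + D^{-1} W (y - p), which per element is (y - p)/v.
    z = K @ alpha + (y - p) / v
    A = K.T @ D @ K + lam * K
    return np.linalg.solve(A, K.T @ D @ z)
```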

5. The RE-WKLR algorithm

For WKLR, the Weighted Least Squares (WLS) subproblem is

$$ (\tilde{K}^{T} D \tilde{K} + \lambda \tilde{K})\, \hat{\alpha}^{(c+1)} = \tilde{K}^{T} D z^{(c)}, \qquad (57) $$

which is a system of linear equations with a kernel matrix K̃, a vector of adjusted responses z, and a weight matrix D. Both the weights and the adjusted response vector depend on α̂^(c), the current estimate of the parameter vector. Therefore, starting from an initial estimate α̂^(0), the problem can be solved iteratively, giving a sequence of estimates that converges to the MLE of α̂. Each subproblem can be solved using the CG method, which is equivalent to minimizing the quadratic problem

$$ \tfrac{1}{2}\, \hat{\alpha}^{(c+1)T} (\tilde{K}^{T} D \tilde{K} + \lambda \tilde{K})\, \hat{\alpha}^{(c+1)} - \hat{\alpha}^{(c+1)T} (\tilde{K}^{T} D z^{(c)}). \qquad (58) $$

Similarly, the bias in (49) can be computed with CG by solving the quadratic problem

$$ \tfrac{1}{2}\, B(\hat{\alpha})^{(c+1)T} (\tilde{K}^{T} D \tilde{K} + \lambda \tilde{K})\, B(\hat{\alpha})^{(c+1)} - B(\hat{\alpha})^{(c+1)T} (\tilde{K}^{T} D \xi^{(c)}). \qquad (59) $$
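Algorithms 2 and 3 below solve these linear systems with a (truncated) linear conjugate gradient method; a compact NumPy version is sketched here, with the tolerance and iteration cap chosen to mirror the values quoted in the text (illustrative defaults, not a definitive implementation). In the RE-WKLR setting, A = K̃ᵀDK̃ + λK̃ and b is either K̃ᵀDz or K̃ᵀDξ.

```python
import numpy as np

def truncated_cg(A, b, x0=None, tol=5e-3, max_iter=200):
    """Approximately solve A x = b for a symmetric positive (semi)definite A by
    linear conjugate gradients, stopping early to obtain a truncated Newton direction."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                          # residual
    d = r.copy()                           # search direction
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Ad = A @ d
        s = (r @ r) / (d @ Ad)             # optimal step length
        x = x + s * d
        r_new = r - s * Ad                 # updated residual
        zeta = (r_new @ r_new) / (r @ r)   # A-conjugacy enforcer (Fletcher-Reeves)
        d = r_new + zeta * d
        r = r_new
    return x
```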

Now, as with the TR-KLR algorithm, in order to avoid the long computations from which CG may suffer, a limit can be placed on the number of CG iterations, thus creating an approximate or truncated Newton direction. As in the TR-KLR formulation (Maalouf and Trafalis, 2008), for Algorithm 1 the maximum number of IRLS iterations is set to 30, and for the relative difference of deviance threshold, ε1, the value 2.5 is found to be sufficient for reaching the desired accuracy while maintaining good convergence speed. By choosing the value 2.5 as a threshold, computational speed is improved without affecting accuracy. Should the algorithm reach a certain desired accuracy with a low threshold, then by slightly modifying the parameter values (e.g. σ, λ) with a larger threshold, the same accuracy can be reached; hence the robustness of the algorithm. As for Algorithms 2 and 3, the maximum number of CG iterations is set to 200. In addition, the CG convergence thresholds, ε2 and ε3, are set to 0.005. Furthermore, no more than three non-improving iterations are allowed in the CG algorithm.

6. Computational results and discussion

The performance of the RE-WKLR algorithm was examined using (1) seven benchmark binary class datasets (see Table 1) found on the UC Irvine Machine Learning Repository website (Asuncion and Newman, 2007) and (2) a real-life tornado dataset. Performance of the algorithm was then compared to those of SVM and TR-KLR. In this analysis, the Gaussian Radial Basis Function (RBF) kernel,

$$ \kappa(x_i, x_j) = e^{-\frac{1}{2\sigma^{2}}\, \|x_i - x_j\|^{2}} = e^{-\gamma \|x_i - x_j\|^{2}}, \qquad (60) $$

was used for all methods, where σ is the kernel parameter. The parameter values giving the best generalization were chosen from a range of different values (generally user defined) and were tuned using the bootstrap method (Efron and Tibshirani, 1994).


Algorithm 1: WKLR MLE using IRLS

Data: K̃, y, α̂^(0), w1, w0
Result: α̂, B(α̂), α̃, p̃_i

1   begin
2     c = 0
3     while |DEV^(c) − DEV^(c+1)| / DEV^(c+1) > ε1 and c ≤ Max IRLS Iterations do
4       for i ← 1 to n do
5         p̂_i = 1 / (1 + e^(−k_i α̂))                         /* Compute probabilities */
6         v_i = p̂_i (1 − p̂_i)                                /* Compute variance */
7         w_i = w1 y_i + w0 (1 − y_i)                         /* Compute weights */
8         z_i = k_i α̂^(c) + (y_i − p̂_i) / (p̂_i (1 − p̂_i))     /* Compute adjusted response */
9         Q_ii = 1 / (v_i w_i)                                /* Compute weighted logit elements */
10        ξ_i = (1/2) Q_ii ((1 + w1) p̂_i − w1)                /* Compute the bias response */
11      D = diag(v_i w_i)                                     /* Obtain the n × n diagonal weight matrix */
12      Solve (K̃ᵀ D K̃ + λ K̃) α̂^(c+1) = K̃ᵀ D z^(c)            /* Compute α̂ via Algorithm 2 */
13      Solve (K̃ᵀ D K̃ + λ K̃) B(α̂)^(c+1) = K̃ᵀ D ξ^(c)         /* Compute B(α̂) via Algorithm 3 */
14      c = c + 1
15    α̃ = α̂ − B(α̂)                                            /* Compute the unbiased α */
16    p̃_i = 1 / (1 + e^(−k_i α̃))                              /* Compute the optimal probabilities */
17  end

Algorithm 2: Linear CG for computing α̂, with A = K̃ᵀ D K̃ + λ K̃ and b = K̃ᵀ D z.

Data: A, b, α̂^(0)
Result: α̂ such that A α̂ = b

1   begin
2     r^(0) = b − A α̂^(0)                                     /* Initialize the residual */
3     c = 0
4     while ||r^(c)||₂ > ε2 and c ≤ Max CG Iterations do
5       if c = 0 then
6         ζ^(c) = 0
7       else
8         ζ^(c) = (r^(c)ᵀ r^(c)) / (r^(c−1)ᵀ r^(c−1))         /* Update A-conjugacy enforcer */
9       d^(c+1) = r^(c) + ζ^(c) d^(c)                         /* Update the search direction */
10      s^(c) = (r^(c)ᵀ r^(c)) / (d^(c+1)ᵀ A d^(c+1))         /* Compute the optimal step length */
11      α̂^(c+1) = α̂^(c) + s^(c) d^(c+1)                       /* Obtain an approximate solution */
12      r^(c+1) = r^(c) − s^(c) A d^(c+1)                     /* Update the residual */
13      c = c + 1
14  end

The bootstrap method was applied only to the testing sets. The idea behind the bootstrap method is to create hundreds or thousands of samples, called bootstrap samples, by re-sampling with replacement from the original sample. Each re-sample has the same size as the original sample (Efron and Tibshirani, 1994). For this study, the total number of bootstrap rounds (B) was set to 5000 on all of the datasets except for the Spam and Tornado datasets. The bootstrap accuracy (A) had at most a halfwidth of its 95% confidence interval equal to 0.25. The bootstrap sample size was chosen equal to the testing set size for all of the datasets. Due to the large size of both the Spam and Tornado datasets, 200 bootstrap rounds were found adequate to generate enough variation. The overall bootstrap accuracy (A*) was calculated as follows. A sequence of sample accuracies,

$$ a_1^{(1)}, \ldots, a_r^{(1)}, \ldots, a_B^{(1)}, \quad a_1^{(0)}, \ldots, a_r^{(0)}, \ldots, a_B^{(0)}, $$

was collected during the bootstrap procedure, where for a given round (r), a_r^{(1)} = TP/(TP + FN) for class 1 and a_r^{(0)} = TN/(TN + FP) for class 0. Here TP is the number of correctly classified positive instances (true positives), FN is the number of positive instances classified as negative (false negatives), FP is the number of negative instances classified as positive (false positives), and TN is the number of correctly classified negative instances (true negatives). After the bootstrap procedure was completed, the average accuracy of each class was computed.

Algorithm 3: Linear CG for computing the bias, with A = K̃ᵀ D K̃ + λ K̃ and b = K̃ᵀ D ξ.

Data: A, b, B(α̂)^(0)
Result: B(α̂) such that A B(α̂) = b

1   begin
2     r^(0) = b − A B(α̂)^(0)                                  /* Initialize the residual */
3     c = 0
4     while ||r^(c)||₂ > ε3 and c ≤ Max CG Iterations do
5       if c = 0 then
6         ζ^(c) = 0
7       else
8         ζ^(c) = (r^(c)ᵀ r^(c)) / (r^(c−1)ᵀ r^(c−1))         /* Update A-conjugacy enforcer */
9       d^(c+1) = r^(c) + ζ^(c) d^(c)                         /* Update the search direction */
10      s^(c) = (r^(c)ᵀ r^(c)) / (d^(c+1)ᵀ A d^(c+1))         /* Compute the optimal step length */
11      B(α̂)^(c+1) = B(α̂)^(c) + s^(c) d^(c+1)                 /* Obtain an approximate solution */
12      r^(c+1) = r^(c) − s^(c) A d^(c+1)                     /* Update the residual */
13      c = c + 1
14  end

Table 1
Datasets.

Dataset      Instances   Features   Class 0   Class 1   Rarity in testing set (%)
Ionosphere   351         34         225       126       5
Sonar        208         60         111       97        5
Liver        345         6          200       145       5
Survival     306         3          225       81        5
Diabetes     768         8          500       268       5
SPECT        267         44         212       55        8
Spam         4,601       57         2,788     1,813     5
Tornado      13,790      83         13,016    774       3

Then, for a given bootstrap procedure, the accuracy is

$$ A = \min\{a_{\mathrm{avg}}^{(1)},\, a_{\mathrm{avg}}^{(0)}\}. \qquad (61) $$

The overall accuracy reached, with different parameters, is considered to be A* = max{A}. The interval between the 2.5th and 97.5th percentiles of the bootstrap distribution of a statistic is the non-parametric 95% bootstrap confidence interval. In addition, statistical significance was established using a multiple-comparison paired t-test (Jensen and Cohen, 2000), single tailed with an adjusted α = 0.017. All of the datasets were preprocessed by normalization to a mean of 0 and standard deviation of 1. All of the computations for RE-WKLR and TR-KLR were carried out using MATLAB version 2007a on a computer with 3 GB of RAM. As for the SVM method, the MATLAB LIBSVM toolbox (Chang and Lin, 2001) was used.

6.1. Benchmark datasets

The benchmark datasets were the Ionosphere, Sonar, BUPA Liver Disorders, Haberman Survival, Pima Indians Diabetes, SPECT Heart Diagnosis, and Spam datasets. The Ionosphere dataset describes radar signals targeting electrons of two types in the ionosphere: those that show some structure (good) and those that do not (bad) (Sigillito et al., 1989). The Sonar dataset is composed of sonar signals detecting either mines or rocks (Gorman and Sejnowski, 1988). The BUPA Liver Disorders dataset consists of blood tests that are thought to be sensitive to liver disorders arising from excessive alcohol consumption in single males (PC/BEAGLE User's Guide). The Haberman Survival dataset is information on the survival of patients who had undergone surgery for breast cancer (Haberman, 1976). The Pima Indians Diabetes dataset describes the onset of diabetes among Pima Indian patients (Smith et al., 1988). SPECT Heart is data on cardiac Single Proton Emission Computed Tomography (SPECT) images; each patient is classified into one of two categories: normal and abnormal (Kurgan et al., 2001). Finally, the Spam dataset consists of email messages considered "spam" on the basis of certain features (Wang and Witten, 2002).

The datasets were divided into training and testing sets. Two schemes of sampling the training datasets were applied. In the first, the training datasets were equally divided into 40 instances in each class, chosen randomly, but the same instances were applied to all of the methods.
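The class-wise accuracy of (61) and the percentile interval can be computed as in the sketch below; the function names, the resampling loop, and the number of rounds are placeholders rather than the authors' code.

```python
import numpy as np

def classwise_accuracy(y_true, y_pred):
    """Per-class accuracies TP/(TP+FN) and TN/(TN+FP) used in Eq. (61)."""
    a1 = (y_pred[y_true == 1] == 1).mean()
    a0 = (y_pred[y_true == 0] == 0).mean()
    return a1, a0

def bootstrap_accuracy(y_true, y_pred, rounds=5000, seed=0):
    """Resample the test set with replacement, average each class accuracy over
    the rounds, and report A = min of the two averages (Eq. (61)) together with
    a non-parametric 95 percent percentile interval of the round-wise minima."""
    rng = np.random.default_rng(seed)
    n = y_true.size
    a1 = np.empty(rounds)
    a0 = np.empty(rounds)
    for r in range(rounds):
        idx = rng.integers(0, n, size=n)          # bootstrap resample of the test set
        a1[r], a0[r] = classwise_accuracy(y_true[idx], y_pred[idx])
    A = min(a1.mean(), a0.mean())
    ci = np.percentile(np.minimum(a1, a0), [2.5, 97.5])
    return A, ci
```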


Table 2
Optimal parameter values for the balanced training datasets. C is the SVM regularization parameter, and λ is the regularization parameter for both RE-WKLR and TR-KLR.

             RE-WKLR            SVM               TR-KLR
             σ       λ          σ        C        σ       λ
Ionosphere   2.5     0.07       3.0      10.0     6.0     0.04
Sonar        1.0     0.01       1.9      1.0      6.5     10
Liver        5.0     0.005      5.0      10.0     6.0     0.005
Survival     2.2     0.07       2.0      10.0     6.0     0.04
Diabetes     5.3     0.05       2.0      10.0     4.0     0.05
SPECTF       7.4     0.7        1.3      10.0     8.0     0.001
Spam         7.0     0.5        18.0     100.0    7.0     0.5
Tornado      2.0     1.0        8.0      10.0     1.2     0.1

Table 3
Bootstrap accuracy (%) using balanced training sets. Bold accuracy values indicate the highest accuracy reached by the algorithms being compared.

             RE-WKLR    SVM    TR-KLR
Ionosphere   89 (c)     94     78
Sonar        83 (a)     67     70
Liver        63 (b)     56     65
Survival     86 (a)     76     78
Diabetes     70 (a)     61     61
SPECT        72 (a)     69     51
Spam         88 (a)     86     85
Tornado      95 (a)     93     93

(a) Statistical significance using paired t-test with α = 0.017 over both SVM and TR-KLR.
(b) Statistical significance using paired t-test with α = 0.017 over SVM.
(c) Statistical significance using paired t-test with α = 0.017 over TR-KLR.

In the second scheme, the number of non-events remained 40 zeros, but the number of events was reduced to 15 instances. Due to the large size of the Spam dataset, a balanced training sample with 200 instances of each class and an imbalanced training sample with 200 non-events and 100 events were selected. To include rarity, the number of events (ones) in all the testing datasets, except for the SPECT Heart dataset, was randomly chosen and made 5% of the number of non-events (zeros). For the SPECT Heart dataset, the number of rare events remained unchanged, since the original dataset includes a rarity of 8% in the testing set.

6.2. Tornado data

The Tornado dataset consists of 83 attributes, 24 of which are derived from the Mesocyclone Detection Algorithm (MDA) data, measuring radar-derived velocity parameters that describe aspects of the mesocyclone, in addition to the month attribute. The rest of the attributes are from the Near Storm Environment (NSE) dataset (Lakshmanan et al., 2005), which describes the pre-storm environment on a broader scale than the MDA data. The attributes of the NSE data consist of wind speed, direction, wind shear, humidity lapse rate, and the predisposition of the atmosphere to explosively lift air over specific heights. In addition, the original Tornado dataset consists of a training set and a testing set. The training set has 387 tornado observations and 1144 non-tornado observations. As with the benchmark datasets, the original Tornado dataset is considered as the imbalanced training set, while a balanced training set was obtained by under-sampling 387 non-tornado instances (chosen randomly from the original training set) to match the 387 tornado instances. The testing set consists of 387 tornado observations and 11,872 non-tornado observations, and hence the rarity is 3%.

6.3. Balanced training data

For the balanced training datasets, Tables 2 and 3 summarize the computation results for the three methods, including their optimal parameters and accuracy, respectively. Table 3 shows that the RE-WKLR method scored much better than SVM and TR-KLR on all datasets except for the Ionosphere and Liver data. When RE-WKLR performed better, the difference in accuracy was large, as shown with the Sonar, Survival and Diabetes datasets. A visual comparison of statistical significance is provided in Fig. 2, which shows the accuracy and 95% confidence level obtained by each method on the benchmark datasets. It can be observed that the accuracy of RE-WKLR is noticeably better than that of SVM on the Sonar, Liver, and Survival datasets and only worse on the Ionosphere dataset. With respect to TR-KLR, the accuracy of RE-WKLR is better on the Sonar, Survival and SPECT datasets. What is more, the computational speed of the RE-WKLR algorithm, measured by CPU time as shown in Table 4, is distinctly faster than that of SVM, despite the fact that LIBSVM is written mainly in C++ while both the TR-KLR and RE-WKLR algorithms are coded purely in MATLAB. The time saving ranges between approximately 24% and 50%, as indicated by Table 4.


Fig. 2. Benchmark dataset accuracy comparison using balanced training sets with 95% confidence.

Table 4
Comparison of bootstrap times in seconds using balanced training sets.

             RE-WKLR    SVM    TR-KLR
Ionosphere   12         30     12
Sonar        5          21     6
Liver        7          16     7
Survival     10         14     8
Diabetes     27         42     27
SPECTF       14         41     13
Spam         37         34     37
Tornado      311        615    322

Table 5
Optimal parameter values for the imbalanced training datasets.

             RE-WKLR            SVM               TR-KLR
             σ       λ          σ        C        σ       λ
Ionosphere   9.0     0.007      5.0      10.0     4.0     0.005
Sonar        8.0     0.01       4.0      10.0     20.0    0.005
Liver        3.7     0.001      3.0      10.0     1.0     0.01
Survival     2.5     0.002      1.0      100.0    1.0     0.01
Diabetes     2.7     0.0002     6.0      10.0     2.0     0.01
SPECTF       7.0     0.06       4.0      10.0     5.0     0.2
Spam         3.2     0.03       18.0     10.0     5.0     0.1
Tornado      2.0     0.02       8.5      10.0     1.2     0.1

6.4. Imbalanced training data

In order to assess the robustness and stability of RE-WKLR, the number of events in the training set was reduced to only 15 instances, again chosen randomly. It should be noted here that this is not an under-sampling scheme but rather an assumption that only 15 rare-event instances were available. As mentioned previously, for the Spam dataset the imbalanced training set consisted of 200 non-spam instances and 100 spam instances, and the original Tornado training set was considered to be the imbalanced training set. Tables 5 and 6 summarize the results for the three methods with their optimal parameters and accuracy, respectively. Table 6 shows that RE-WKLR performed best on all of the datasets except for Sonar and SPECT, on which it achieved equal accuracy with TR-KLR. A comparison of statistical significance with imbalanced training data is provided in Fig. 3. As can be observed from Fig. 3, the accuracy of RE-WKLR is noticeably better than that of SVM on all of the datasets except for the Liver data, and better than that of TR-KLR on all the datasets except for the Sonar and SPECT data. Fig. 4 shows that, except for the Ionosphere dataset, despite the reduction, the RE-WKLR method retained almost the same level of accuracy as with the balanced training data, with some improvement on the Survival data. The accuracy of RE-WKLR reaches 100% on the Ionosphere dataset with the imbalanced training set.


Fig. 3. Benchmark dataset accuracy comparison using imbalanced training sets with 95% confidence level.

Table 6
Bootstrap accuracy (%) using imbalanced training sets.

             RE-WKLR    SVM    TR-KLR
Ionosphere   100 (a)    89     67
Sonar        83 (b)     67     83
Liver        63 (a)     62     50
Survival     88 (a)     76     77
Diabetes     70 (a)     52     61
SPECT        74 (b)     69     73
Spam         89 (a)     79     84
Tornado      95 (a)     92     93

(a) Statistical significance using paired t-test with α = 0.017 over both SVM and TR-KLR.
(b) Statistical significance using paired t-test with α = 0.017 over SVM.

Table 7
Comparison of bootstrap times in seconds using imbalanced training sets.

             RE-WKLR    SVM    TR-KLR
Ionosphere   8          16     8
Sonar        5          15     4
Liver        6          9      6
Survival     7          9      7
Diabetes     18         22     20
SPECTF       9          29     9
Spam         28         24     28
Tornado      616        807    632

In comparison, SVM accuracy improved on the Liver data upon the reduction but became worse on the Ionosphere, Diabetes and Spam datasets. The accuracy of TR-KLR, on the other hand, improved on the Sonar, SPECT and Spam datasets but degraded on both the Ionosphere and Liver datasets upon the reduction. As with the balanced training data, the computational speed of the RE-WKLR algorithm is also distinctly faster than that of SVM, as indicated by Table 7.

7. Conclusion and future work

We have presented the Rare Event Weighted Kernel Logistic Regression (RE-WKLR) algorithm and have demonstrated that it is relatively easy to implement and is robust when applied to imbalanced and rare event data. The algorithm combines several concepts from the fields of statistics, econometrics and machine learning. It was also shown that RE-WKLR is very powerful when applied to both small and large datasets. The RE-WKLR algorithm takes advantage of bias correction and the power of kernel methods, particularly when the datasets are neither balanced nor linearly separable. Another benefit of using RE-WKLR is its implementation of unconstrained optimization methods, whose algorithms are less complex than those of constrained optimization methods, such as SVM.


Fig. 4. Comparison of algorithms on benchmark datasets with balanced and imbalanced training sets.

Future studies may compare these methods using different kernels and algorithms and may implement them on imbalanced multi-class datasets. They may also try to improve the speed and sparsity of the algorithm. Future research may also extend the techniques provided in this study to Functional Data Analysis (FDA), where random variables take on infinite dimensional functional spaces. The excellent monograph on non-parametric FDA by Ferraty and Vieu (2006) provides the mathematical background necessary for such an extension.


Acknowledgements

The authors would like to thank Dr. S. Lakshmivarahan, Dr. Robin C. Gilbert, Dr. Kash Barker and Dr. Hillel Kumin of the University of Oklahoma for their valuable input, comments and suggestions.

References

Amemiya, T., 1985. Advanced Econometrics. Harvard University Press.
Asuncion, A., Newman, D.J., 2007. UCI machine learning repository, University of California, Irvine. School of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bai, S.B., Wang, J., Zhang, F.Y., Pozdnoukhov, A., Kanevski, M., 2008. Prediction of landslide susceptibility using logistic regression: a case study in Bailongjiang river basin, China. In: Fuzzy Systems and Knowledge Discovery, Fourth International Conference on, vol. 4, pp. 647–651.
Ben-Akiva, M., Lerman, S., 1985. Discrete Choice Analysis: Theory and Application to Travel Demand. The MIT Press.
Berk, R., 2008. Statistical Learning from a Regression Perspective, 1st edition. Springer.
Busser, B., Daelemans, W., Bosch, A., 1999. Machine learning of word pronunciation: the case against abstraction. In: Proceedings of the Sixth European Conference on Speech Communication and Technology, Eurospeech99, pp. 2123–2126.
Cameron, A.C., Trivedi, P.K., 2005. Microeconometrics: Methods and Applications. Cambridge University Press.
Canu, S., Smola, A.J., 2005. Kernel methods and the exponential family. In: ESANN, pp. 447–454.
Chan, P.K., Stolfo, S.J., 1998. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press, pp. 164–168.
Chang, C.-C., Lin, C.-J., 2001. LIBSVM: a library for support vector machines. Software available at: http://www.csie.ntu.edu.tw/cjlin/libsvm.
Cowan, G., 1998. Statistical Data Analysis. Oxford University Press.
Cramer, J.S., 2003. Logit Models from Economics and Other Fields. Cambridge University Press.
Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press.
Drummond, C., Holte, R.C., 2003. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, pp. 1–8.
Eeckhaut, M.V.D., Vanwalleghem, T., Poesen, J., Govers, G., Verstraeten, G., Vandekerckhove, L., 2006. Prediction of landslide susceptibility using rare events logistic regression: a case-study in the Flemish Ardennes (Belgium). Geomorphology 76 (3–4), 392–410.
Efron, B., Tibshirani, R.J., 1994. An Introduction to the Bootstrap. Chapman & Hall/CRC.
Ferraty, F., Vieu, P., 2006. Nonparametric Functional Data Analysis: Theory and Practice, 1st edition. Springer.
Gorman, R.P., Sejnowski, T.J., 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75–89.
Haberman, S.J., 1976. Generalized residuals for log-linear models. In: Proceedings of the 9th International Biometrics Conference, Boston.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer-Verlag.
Heinze, G., Schemper, M., 2001. A solution to the problem of monotone likelihood in Cox regression. Biometrics 57, 114–119.
Hosmer, D.W., Lemeshow, S., 2000. Applied Logistic Regression, 2nd edition. Wiley.
Imbens, G.W., Lancaster, T., 1996. Efficient estimation and stratified sampling. Journal of Econometrics 74, 289–318.
Jaakkola, T., Haussler, D., 1999. Probabilistic kernel regression models.
Japkowicz, N., 2000. Learning from imbalanced data sets: a comparison of various strategies. Tech. Rep. University of DalTech/Dalhousie.
Jensen, D., Cohen, P.R., 2000. Multiple comparison in induction algorithms. Machine Learning 38, 309–338.
Karsmakers, P., Pelckmans, K., Suykens, J.A.K., 2007. Multi-class kernel logistic regression: a fixed-size implementation. In: International Joint Conference on Neural Networks, pp. 1756–1761.
Keele, L.J., 2008. Semiparametric Regression for the Social Sciences. Wiley.
King, G., Zeng, L., 2001a. Explaining rare events in international relations. International Organization 55 (3), 693–715.
King, G., Zeng, L., 2001b. Improving forecast of state failure. World Politics 53 (4), 623–658.
King, G., Zeng, L., 2001c. Logistic regression in rare events data. Political Analysis 9, 137–163.
Komarek, P., 2004. Logistic regression for data mining and high-dimensional classification. Ph.D. Thesis. Carnegie Mellon University.
Komarek, P., Moore, A., 2005. Making logistic regression a core data mining tool: a practical investigation of accuracy, speed, and simplicity. Tech. Rep. Carnegie Mellon University.
Kubat, M., Holte, R.C., Matwin, S., 1998. Machine learning for the detection of oil spills in satellite radar images. In: Machine Learning, pp. 195–215.
Kurgan, L.A., Cios, K.J., Tadeusiewicz, R., Ogiela, M., Goodenday, L.S., 2001. Knowledge discovery approach to automated cardiac SPECT diagnosis. Artificial Intelligence in Medicine 32 (2), 149–169.
Lakshmanan, V., Stumpf, G., Witts, A., 2005. A neural network for detecting and diagnosing tornadic circulations using the mesocyclone detection and near storm environment algorithms. In: 21st International Conference on Information Processing Systems. American Meteorological Society, San Diego, CA, CD-ROM, J52.2.
Lewis, J.M., Lakshmivarahan, S., Dhall, S., 2006. Dynamic Data Assimilation: A Least Squares Approach. Cambridge University Press.
Lin, C.-J., Weng, R.C., Keerthi, S.S., 2007. Trust region Newton methods for large-scale logistic regression. In: ICML '07: Proceedings of the 24th International Conference on Machine Learning, Corvalis, Oregon. ACM, New York, NY, USA, pp. 561–568.
Maalouf, M., Trafalis, T.B., 2008. Kernel logistic regression using truncated Newton method. In: Dagli, C.H., Enke, D.L., Bryden, K.M., Ceylan, H., Gen, M. (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks, Vol. 18. ASME Press, New York, NY, USA, pp. 455–462.
Maimon, O., Rokach, L. (Eds.), 2005. Data Mining and Knowledge Discovery Handbook. Springer.
Maiti, T., Pradhan, V., 2008. A comparative study of the bias corrected estimates in logistic regression. Statistical Methods in Medical Research 17 (6), 621–634.
Maloof, M.A., 2003. Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML-2003 Workshop on Learning from Imbalanced Data Sets II.
Malouf, R., 2002. A comparison of algorithms for maximum entropy parameter estimation. In: Proceedings of the Conference on Natural Language Learning, vol. 6.
Manski, C.F., Lerman, S.R., 1977. The estimation of choice probabilities from choice based samples. Econometrica 45 (8), 1977–1988.
McCullagh, P., Nelder, J., 1989. Generalized Linear Models. Chapman and Hall/CRC.
Milgate, M., Eatwell, J., Newman, P.K. (Eds.), 1990. Econometrics. W. W. Norton & Company.
Minka, T.P., 2003. A comparison of numerical optimizers for logistic regression. Tech. Rep. Department of Statistics, Carnegie Mellon University.
Park, M.Y., Hastie, T., 2008. Penalized logistic regression for detecting gene interactions. Biostatistics 9 (1), 30–50.
Prati, R.C., Batista, G.E.A.P.A., Monard, M.C., 2004. Learning with class skews and small disjuncts. In: SBIA, pp. 296–306.
Quigley, J., Bedford, T., Walls, L., 2007. Estimating rate of occurrence of rare events with empirical Bayes: a railway application. Reliability Engineering & System Safety 92 (5), 619–627.
Ramsay, J., Silverman, B.W., 2005. Functional Data Analysis, 2nd edition. Springer.
Seiffert, C., Khoshgoftaar, T.M., Hulse, J.V., Napolitano, A., 2007. Mining data with rare events: a case study. In: ICTAI '07: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, vol. 2. IEEE Computer Society, Washington, DC, USA, pp. 132–139.
Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.


Sigillito, V.G., Wing, S.P., Hutton, L.V., Baker, K.B., 1989. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest 10, 262–266.
Smith, J.W., Everhart, J.E., Dickson, W.C., Johannes, R.S., 1988. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the Symposium on Computer Applications and Medical Care. IEEE Computer Society Press.
Trafalis, T.B., Ince, H., Richman, M.B., 2003. Tornado detection with support vector machines. In: International Conference on Computational Science, pp. 289–298.
Tsoucas, P., 1992. Rare events in series of queues. Journal of Applied Probability 29, 168–175.
Van-Hulse, J., Khoshgoftaar, T.M., Napolitano, A., 2007. Experimental perspectives on learning from imbalanced data. In: ICML '07: Proceedings of the 24th International Conference on Machine Learning. ACM, New York, NY, USA, pp. 935–942.
Vapnik, V., 1995. The Nature of Statistical Learning. Springer, NY.
Wang, S., Wang, T., 2001. Precision of Warm's weighted likelihood for a polytomous model in computerized adaptive testing. Applied Psychological Measurement 25 (4), 317–331.
Wang, Y., Witten, I.H., 2002. Modeling for optimal probability prediction. In: ICML, pp. 650–657.
Weiss, G.M., 2004. Mining with rarity: a unifying framework. SIGKDD Explorations Newsletter 6 (1), 7–19.
Weiss, G.M., Hirsh, H., 2000. Learning to predict extremely rare events. In: AAAI Workshop on Learning from Imbalanced Data Sets. AAAI Press, pp. 64–68.
Xie, Y., Manski, C.F., 1989. The logit model and response-based samples. Sociological Methods & Research 17, 283–302.
Zadrozny, B., 2004. Learning and evaluating classifiers under sample selection bias. In: ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York, NY, USA, p. 114.