Stretchy binary classification

Neural Networks 97 (2018) 74–91

Kar-Ann Toh^a,*, Zhiping Lin^b, Lei Sun^c, Zhengguo Li^d

^a School of Electrical and Electronic Engineering, Yonsei University, Seoul, 03722, Republic of Korea
^b School of Electrical and Electronic Engineering, Nanyang Technological University, 639798, Singapore
^c School of Information and Electronics, Beijing Institute of Technology, 100081, PR China
^d Institute for Infocomm Research, 1 Fusionopolis Way, #21-01, 138632, Singapore

Highlights

• Proposed a novel cost function for counting the samples that are misclassified.
• Conjectured an analytic solution to a constrained p-norm minimization problem.
• Linked the proposed formulation to two existing classifiers.
• Provided a variance analysis for the proposed analytic solution.
• Conducted extensive experiments with comparison to the state of the art.

Article info

Article history:
Received 12 January 2017
Received in revised form 8 August 2017
Accepted 28 September 2017
Available online 10 October 2017

Keywords: Pattern classification; Parameter learning; Sparse estimation

Abstract

In this article, we introduce an analytic formulation for compressive binary classification. The formulation seeks to solve the least ℓp-norm of the parameter vector subject to a classification error constraint. An analytic and stretchable estimation is conjectured where the estimation can be viewed as an extension of the pseudoinverse with left and right constructions. Our variance analysis indicates that the estimation based on the left pseudoinverse is unbiased and the estimation based on the right pseudoinverse is biased. Sparseness can be obtained for the biased estimation under certain mild conditions. The proposed estimation is investigated numerically using both synthetic and real-world data. © 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Pattern classification has a wide range of applications spanning fields in engineering, finance, and the social, medical and life sciences. The approaches to classifier design can generally be divided into those by deterministic means and those by probabilistic means, even though their underlying mechanisms can be linked theoretically (Duda, Hart, & Stork, 2001; Hastie, Tibshirani, & Friedman, 2001). Under the supervised learning paradigm, the deterministic approach (also known as the non-probabilistic approach) formulates the classification problem as minimizing the misclassification rate (or maximizing the classification accuracy) based on a given set of training samples. The adopted mathematical model for data learning can be treated as the problem of input–output data mapping under the paradigm of neural networks (Faris, Aljarah, & Mirjalili, 2016; Funahashi, 1989; Hornik, Stinchcombe, & White, 1990; Huang & Du, 1999, 2008; Sprecher, 1993; Zhang, Zhang,

* Corresponding author.

E-mail addresses: [email protected] (K.-A. Toh), [email protected] (Z. Lin), [email protected] (L. Sun), [email protected] (Z. Li).
https://doi.org/10.1016/j.neunet.2017.09.015
0893-6080/© 2017 Elsevier Ltd. All rights reserved.

Lok, & Lyu, 2007), which can be traced to the neural and information decision sciences (Huang & Jiang, 2012; Nickerson, 1972; Proctor & Cho, 2006). The probabilistic approach capitalizes on Bayesian decision theory, which quantifies the tradeoffs among various classification decisions using probability and the costs that accompany such decisions (Duda et al., 2001). Depending on the assumptions regarding the probability density of the data and the design model prior, various methods have been developed to solve the decision formulation. Maximum-likelihood estimation views the parameters of the data density function as fixed but unknown quantities, where the best estimate of their values is defined to be the one that maximizes the probability of obtaining the samples actually observed. Bayesian estimation views the parameters as random variables of a certain known type of distribution. Such information is converted to a posterior density by the inference of parameter values from observed data samples. Due to the often limited availability of data and its limited representativeness for learning, structural risk minimization adopts an inductive principle for learning model selection. Generally, the method adopts a model of capacity control and seeks a trade-off between the hypothesis space complexity and


the quality of fitting the training data. From Vapnik (1998, 1999), the hypothesis space complexity is known as the VC dimension. The quality of data fitting is called the empirical error.

Apart from the classification accuracy concern, an important consideration for classifier design is the computational cost. This is in view of the often limited computing and memory facilities in contrast to the large amount of data for processing. Based on the assumption that the desired signal content is intrinsically sparse, compressed sensing (also known as sparse estimation) addresses this concern by reducing the size of the estimation parameters while maintaining competitive accuracy. Grounded in the knowledge that the distance metric plays a critical role in shrinking the parameters, a typical parametric search is either based on a relevant distance metric or coupled with constraints that can produce the desired shrinkage behavior. From the Bayesian perspective, this can be viewed as adopting a parametric prior with shrinking capability upon accumulation of evidence.

In this article, we address the problem of network-based classifier learning by adopting the deterministic approach. Different from existing methods, we conjecture an analytic and stretchable learning solution without needing iterative search or likelihood maximization steps. Extensive numerical evaluations using synthetic and real-world data are presented to support the claim. The main contributions¹ include: (i) a novel classification error counting cost function for constraining the parametric search; (ii) a novel deterministic method for sparse network parameter estimation. This estimation is the first of its kind whereby classification network learning and compressed estimation are performed at the same time; (iii) an analysis of variance regarding the estimation bias; (iv) a linkage of the proposed formulation to two existing classifiers; and (v) an extensive numerical evaluation to illustrate the estimation mechanism and its validity. The conceivable impact of such estimation is evident from the vast application potential in real-world classification problems, particularly in this era of big data.

The organization of this article is as follows: Section 2 provides the relevant background material for subsequent development. In particular, learning based on the linear prediction network model is introduced, followed by a brief coverage of parameter regularization and shrinkage methods in the literature. Next, a novel error rate based classification cost function is proposed to pave the development of a classification network with stretchable parameters in Section 3. This is followed by a variance analysis in Section 4. In Section 5, a linkage of the proposed methodology to two well-known state-of-the-art classifiers is shown. A synthetic example is presented in Section 6 to illustrate the stretching behavior on representative scenarios. Subsequently, in Section 7, the proposed method is evaluated through experimentation on benchmark data sets from the literature. Finally, our concluding remarks are given in Section 8.

2. Preliminaries

2.1. Linear regression

The linear model for regression expresses its output as a linear combination of the input variables. By denoting an input vector using x ∈ R^d and a corresponding weight vector using α ∈ R^d, the linear regression model g(α, x) = x^T α can be used to learn a target output y ∈ R by minimizing the error of fit, e = y − x^T α. In the generalized form (Duda et al., 2001), the input vector x can be mapped onto a transformed space p(x) : R^d ↦ R^D, where the model output can be written as g(α, x) = p(x)^T α. This is also known as a linear network structure. Typically, the transformed dimension D is chosen to be larger than d such that a larger degree of freedom for weight vector (correspondingly α ∈ R^D) estimation is obtained. The transformed feature vector p(x) can be considered as a basis expansion function, with popular choices taking the form of Sigmoid (see e.g., Bishop, 1995; Guliyev & Ismailov, 2016), Gaussian (see e.g., Huang & Du, 2008; Poggio & Girosi, 1990; Rouhani & Javan, 2016), Polynomial (see e.g., Toh, Tran, & Srinivasan, 2004; Tong, 2016), and Random Projection (see e.g., Cao, Zhang, Luo, Yin, & Lai, 2016; Toh, 2008; Widrow, Greenblatt, Kim, & Park, 2013).

Consider a training set consisting of m samples. A popular cost function for predictor learning is the sum of squared errors (Duda et al., 2001; Hastie et al., 2001) given by

SSE = ∑_{i=1}^{m} e_i² = ∑_{i=1}^{m} (y_i − g(α, x_i))² = (y − Pα)^T (y − Pα),   (1)

where P = [p(x₁), …, p(x_m)]^T ∈ R^{m×D} packs the transformed input vectors in matrix form. When P has full rank, minimization of (1) gives a solution in analytic form:

α = (P^T P)^{−1} P^T y.   (2)

This solution, which is particularly useful when the system is over-determined (i.e., m > D), is often referred to as the least squares error (LSE) solution for regression applications.

¹ Some of the preliminary ideas in this paper have been presented at the IEEE Tenth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (Toh, 2015). The current manuscript extends beyond the preliminary work in both theoretical findings and experimental observations.

2.2. Regularization and restricting the feasible set

There are two common ways to deal with a possible singularity of the covariance matrix P^T P in (2). The first way, called regularization, is to constrain the parameter size (i.e., having ∥α∥₂² ⩽ t, t ∈ R₊, where ∥α∥₂² := ∑_{j=0}^{D−1} α_j² = α^T α) during minimization. Such a minimization problem is often posed in an unconstrained form with the inclusion of a regularization penalty factor λ ∈ R which controls the strength of the penalty:

SSE_ridge(α) = (1/2) ∑_{i=1}^{m} (y_i − p(x_i)^T α)² + (λ/2) ∥α∥₂².   (3)

Solving (3) results in an analytic solution given by

α = (P^T P + λI)^{−1} P^T y.   (4)

This solution comes with a scaled identity matrix λI which stabilizes the inverse operation. This minimization is commonly known as ridge regression in the literature (Hastie et al., 2001; Hoerl & Kennard, 1970; Tikhonov, 1963).

In the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996, 2011), an ℓ₁-norm is adopted as the metric for constraining the size of parameters, giving
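As a concrete illustration (not part of the paper's experiments), the LSE solution (2) can be sketched numerically with NumPy; the polynomial feature map and the data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D inputs mapped to a degree-2 polynomial basis p(x) = [1, x, x^2]
x = rng.uniform(-1.0, 1.0, size=20)           # m = 20 samples
P = np.stack([np.ones_like(x), x, x**2], 1)   # P in R^{m x D}, over-determined (m > D)
alpha_true = np.array([0.5, -1.0, 2.0])
y = P @ alpha_true                            # noise-free targets

# LSE solution (2): alpha = (P^T P)^{-1} P^T y
alpha = np.linalg.solve(P.T @ P, P.T @ y)
```

With noise-free targets and a full column rank P, this recovers the generating weight vector exactly (up to floating point error); `np.linalg.solve` on the normal equations is used here for transparency, though `np.linalg.lstsq` is the numerically preferred route.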

SSE_lasso(α) = ∑_{i=1}^{m} (y_i − p(x_i)^T α)² + λ∥α∥₁,   (5)

where ∥α∥₁ := ∑_{j=0}^{D−1} |α_j|. By replacing the penalty metric with a p-norm given by

ℓ_p :  ∥α∥_p := ( ∑_{i=0}^{D−1} |α_i|^p )^{1/p},   (6)

the minimization is called bridge regression (Frank & Friedman, 1993):

SSE_bridge(α) = ∑_{i=1}^{m} (y_i − p(x_i)^T α)² + λ∥α∥_p^p.   (7)
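The ridge solution (4) is the only member of this family with a closed form; a minimal sketch follows, where the data and the penalty strengths are hypothetical and serve only to show the shrinkage effect of λ.

```python
import numpy as np

rng = np.random.default_rng(1)
m, D = 50, 5
P = rng.standard_normal((m, D))
y = rng.standard_normal(m)

# Ridge solution (4): alpha = (P^T P + lambda I)^{-1} P^T y
lam = 0.1   # hypothetical penalty strength
alpha_ridge = np.linalg.solve(P.T @ P + lam * np.eye(D), P.T @ y)

# A much larger lambda shrinks the parameter norm further (the regularization effect)
alpha_big = np.linalg.solve(P.T @ P + 100.0 * np.eye(D), P.T @ y)
```

The LASSO (5) and bridge (7) penalties have no closed form for general p and would require an iterative solver instead.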


Here, 0 ⩽ p < 2 is of particular interest due to its compression capability. Apart from the popular Elastic-net (Zou & Hastie, 2005) which stretches between the LASSO and ridge regression, several extensions to such regularization can be found for treating the penalty norm term in various forms (see e.g., Mazumder & Hastie, 2012; Yuan & Lin, 2006; Zou & Hastie, 2005).

Besides regularization, another way to deal with singularity of P^T P is to restrict the feasible solution set (Madych, 1991) to a subspace. Consider an under-determined system where the number of samples (m) is smaller than the feature dimension (i.e., D for p and d for x). In this case, the feasible solution set can be restricted by a re-parameterization. Suppose the feasible solution is linearly mapped based on α = P^T β where β ∈ R^m. Then minimization of SSE = ∑_{i=1}^{m} e_i² = (y − PP^T β)^T (y − PP^T β) with respect to the new parameter β results in β = (PP^T)^{−1} y, which can be back substituted to give

α = P^T β = P^T (PP^T)^{−1} y.   (8)

Here, the outer product term replaces the inner product term of the regularization approach. Since m < D, the matrix PP^T in (8) is better conditioned than PT P in (2) or (4). The solution is analogous to the least-norm solution (Boyd & Vandenberghe, 2004), where a constrained optimization problem is posed as finding the α which minimizes ∥α∥₂² subject to y − Pα = 0. Similar to ridge regression, a regularization term can be included to deal with possible singularity of PP^T in (8), giving the dual ridge solution (Saunders, Gammerman, & Vovk, 1998):

α = P^T (PP^T + λI)^{−1} y.   (9)

Considering the case without regularization, the above constrained optimization problem can be generalized to minimizing ∥α∥_p^p subject to y − Pα = 0, which leads to solving an unconstrained minimization problem given by

min_α ∥α∥_p^p + β^T (y − Pα),   (10)

where different choices of p can result in various parametric stretching. This framework is well adopted in compressed sensing research (Baraniuk, 2007; Candès & Wakin, 2008; Donoho, 2006), where minimization of an ℓ₀-norm objective is adopted for parameter subset selection (Miller, 2002) and minimization of an ℓ₁-norm objective is adopted for an optimally sparse solution (Donoho & Elad, 2003).

Remark 1. The regularized least-squares (LS) solution in the primal space (4) and the least-norm solution in the dual space (9) can be linked via the matrix identity (I + AB)^{−1} A = A(I + BA)^{−1} (Petersen & Pedersen, 2006). For the non-regularized cases of (2) and (8), the existence of the pseudoinverse of B when A = B^T, where lim_{δ→0} (δ²I + B^T B)^{−1} B^T = lim_{δ→0} B^T (δ²I + BB^T)^{−1} (Theorem 3.4 in Albert (1972)), may be inferred to link the two solutions, particularly when A and B are square matrices. However, such full rank conditions seldom coexist in practice since A and B are often not square. Hence, it is common to see only one of the solutions (primal or dual) hold under an over-determined system or an under-determined system. The solution in the primal space (2), which minimizes the ℓ₂-norm of the curve-fitting error distance, is suitable for over-determined systems, while the solution in the dual space (8), which minimizes the ℓ₂-norm of the parameters under a curve-fitting constraint, is suitable for under-determined systems. ■

2.3. Error rate based classification cost function

In terms of classifying a given feature pattern, the classification error based counting goal is preferred over the SSE objective in regression (Duda et al., 2001). Consider a binary classification problem with labels y ∈ {−1, +1} (called negative-class and positive-class hereon) and a predictor g(x) ∈ R which makes a decision at threshold τ = 0. A feature pattern x is classified as positive-class if the predictor output gets above the threshold value (i.e., g(x) ⩾ τ), and it is classified as negative-class otherwise (i.e., g(x) < τ). The classification error rate for each category is the average sum of error counts from that category. These error rates are called the false negative rate (FNR) and false positive rate (FPR), which can be expressed mathematically as:
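A minimal numerical sketch of the least-norm solution (8) and its regularized variant (9) on a hypothetical under-determined system:

```python
import numpy as np

rng = np.random.default_rng(2)
m, D = 5, 20                      # under-determined: m < D
P = rng.standard_normal((m, D))
y = rng.standard_normal(m)

# Least-norm solution (8): alpha = P^T (P P^T)^{-1} y; it satisfies y - P alpha = 0 exactly
alpha = P.T @ np.linalg.solve(P @ P.T, y)

# Dual ridge solution (9); with a small hypothetical lambda it stays close to (8)
lam = 1e-8
alpha_reg = P.T @ np.linalg.solve(P @ P.T + lam * np.eye(m), y)
```

Note that only the m × m Gram matrix PP^T is inverted, never the D × D matrix P^T P, which is singular here since its rank is at most m < D.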

FNR = (1/m⁺) ∑_{i=1}^{m⁺} L(g(x_i⁺) < τ),   FPR = (1/m⁻) ∑_{j=1}^{m⁻} L(g(x_j⁻) ⩾ τ),   (11)

where L(·) is a zero–one step loss function for error counting and m⁺, m⁻ are the numbers of samples in each class. Here, L(g(x_i⁺) < τ) = ‘1’ whenever g(x_i⁺) < τ holds true and ‘0’ otherwise. Conversely, L(g(x_j⁻) ⩾ τ) = ‘1’ whenever g(x_j⁻) ⩾ τ holds true and ‘0’ otherwise. Collectively, by defining e_i⁺ := τ − g(x_i⁺) and e_j⁻ := g(x_j⁻) − τ for indexed samples, the above equations can be written in the total error rate (TER) form as

TER = (1/m⁺) ∑_{i=1}^{m⁺} L(e_i⁺) + (1/m⁻) ∑_{j=1}^{m⁻} L(e_j⁻).   (12)
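The counting in (11)–(12) amounts to simple boolean sums; a short sketch with hypothetical predictor outputs makes the bookkeeping concrete.

```python
import numpy as np

# Hypothetical predictor outputs g(x) and labels y in {-1, +1}; threshold tau = 0
g = np.array([0.8, -0.2, 0.3, -0.7, 0.1, -0.4])
y = np.array([+1,  +1,  -1,  -1,  +1,  -1])
tau = 0.0

pos, neg = (y == +1), (y == -1)
m_pos, m_neg = pos.sum(), neg.sum()

# Zero-one counting of (11): FNR counts positives below tau, FPR counts negatives at/above tau
FNR = np.sum(g[pos] < tau) / m_pos
FPR = np.sum(g[neg] >= tau) / m_neg
TER = FNR + FPR   # total error rate (12)
```

Here one positive sample (g = −0.2) and one negative sample (g = 0.3) are misclassified, so FNR = FPR = 1/3 and TER = 2/3.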

Since the error counting step function L(·) is not differentiable, an approximation is often adopted when optimizing an objective function involving L(·). This topic has been discussed in Toh (2008), where the key issue is to choose a monotonic function which penalizes an erroneous count more heavily than a correct count. Common choices for such approximation include the exponential, binomial, hinge, square and sigmoid functions (Hastie et al., 2001).

3. Proposed stretchy classification

In order to achieve a stretchable estimation, the ℓp-norm of the parameter vector is minimized subject to the classification error counting constraint. Both the ℓp-norm function and the classification error counting cost function are approximated by a differentiable function to pave our way towards an analytic estimation. These approximations shall be introduced in the following two subsections prior to dealing with the constrained minimization.

3.1. An approximation to the ℓp norm

Here, we consider an approximation to the ℓp-norm given by

|≀α≀|_k := ( ∑_{i=0}^{D−1} f(α_i)^k )^{1/k},   (13)

where f(·) = +√((·)² + ϵ), ϵ > 0, is a differentiable approximation to the absolute value function. The power term k replaces p in the ℓp-norm to indicate the approximation and to avoid confusion with the projection model term p(·). We note that (13) does not define a normed vector space since absolute homogeneity and the triangle inequality are violated. However, in the limiting case ϵ → 0, (13) approaches the ℓp-norm, i.e.,

∥α∥_k = lim_{ϵ→0} |≀α≀|_k.   (14)
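The smooth surrogate (13) and its limit (14) can be checked numerically; the vector and the stretch exponent below are hypothetical.

```python
import numpy as np

def smooth_norm(alpha, k, eps):
    """Approximation (13): (sum_i f(alpha_i)^k)^(1/k) with f(t) = sqrt(t^2 + eps)."""
    f = np.sqrt(alpha**2 + eps)
    return np.sum(f**k) ** (1.0 / k)

alpha = np.array([0.5, -1.5, 0.0, 2.0])
k = 1.3   # hypothetical stretch exponent

# As eps -> 0, (13) approaches the ordinary l_p norm with p = k, cf. (14)
exact = np.sum(np.abs(alpha) ** k) ** (1.0 / k)
approx = smooth_norm(alpha, k, eps=1e-12)
```

Since f(t) = √(t² + ϵ) ⩾ |t| for every t, the surrogate always upper-bounds the true ℓp-norm and tightens monotonically as ϵ shrinks.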


3.2. A novel classification cost function

In approximating the counting based classification error cost function, we make an uncommon and novel attempt where a symmetric function is utilized. That is, instead of using the common monotonic loss functions which penalize an incorrect decision more heavily than a correct decision, a non-monotonic and symmetric function given by f(·) = √((·)² + ϵ) > 0 is adopted for the error counting decision. In other words, f(·) replaces L(·) in (11) for error count penalty approximation.

To see how the symmetric function f(·) works for error counting, we consider the following two types of errors which arise from a binary classification decision. Suppose g(x⁺) > τ and g(x⁻) < τ give a correct classification for a decision threshold τ. The first type of error, e = τ − g(x⁺), arises when data from the positive class (x⁺) is estimated with a lower g(x⁺) value than τ. The second type of error is e = g(x⁻) − τ, which arises when data from the negative class (x⁻) gives a higher g(x⁻) value than τ. For both types of errors, e < 0 when a sample is classified correctly. Conversely, e > 0 corresponds to an incorrect classification. When a monotonically increasing loss function is adopted in the conventional way, e > 0 gives a higher cost and e < 0 gives a lower cost. However, this does not work for f(e), which has a high cost for both e > 0 and e < 0! To deal with this issue, we introduce an offset value η > 0 on a normalized e (e.g., |e| < η) so that (e ± η) always falls on a single arm of f. In other words, f(e − η) is monotonically decreasing when −η < e < η.

When τ = 0, we see that the error can be expressed in terms of the margin function, i.e., e = yg for y ∈ {−1, +1}. The margin function yg is an alternative way of classification error quantification. Here, a positive sign for yg indicates a correct classification whereas a negative sign indicates an incorrect classification. Fig. 1 shows the plots of f on the margin function yg at two ϵ values (see the two ‘‘v’’-shape curves which correspond to ϵ = 0 and ϵ = 0.1). Several conventional loss functions such as the step loss, sigmoid loss, square loss, exponential loss, binomial loss and SVM hinge loss are also included in the plot for comparison. As an example in Fig. 1, the solid ‘‘v’’-shape curve is plotted with a negative unit offset (i.e., f(yg − 1), where the curve is shifted one unit to the right); we see that only the left arm of f is utilized for data falling within the domain −1 < yg < 1.

Fig. 1. Loss functions for binary classification where the target is y = ±1, the prediction is g(x), and yg(x) is called the margin, which is an alternative way to quantify the classification error. The loss and its approximated functions are: step loss given by L(−yg(x)), exponential loss given by exp(−yg(x)), binomial deviance given by log(1 + exp(−2(yg(x) − 0.25))), SVM hinge loss given by (1 − yg(x))₊, square loss given by (yg(x) − 1)², sigmoid loss given by 1/(1 + exp(5yg(x))), and the proposed approximate absolute loss given by √((yg(x) − 1)² + ϵ).

With the above rationale behind the use of f(e ± η) for error counting, we propose an approximate total classification error rate given by

TER_approx = FNR_approx + FPR_approx
           = (1/m⁺) ∑_{i=1}^{m⁺} √((τ − g(x_i⁺) + η)² + ϵ) + (1/m⁻) ∑_{j=1}^{m⁻} √((g(x_j⁻) − τ + η)² + ϵ),   (15)

for our classification formulation. Putting y⁺ = τ + η and y⁻ = τ − η and noting that (a − b)² = (b − a)², the above can be rewritten as

TER_approx = (1/m⁺) ∑_{i=1}^{m⁺} [(y⁺ − g(x_i⁺))² + ϵ]^{1/2} + (1/m⁻) ∑_{j=1}^{m⁻} [(y⁻ − g(x_j⁻))² + ϵ]^{1/2}
           = (1/m⁺) ∑_{i=1}^{m⁺} f(e_i⁺) + (1/m⁻) ∑_{j=1}^{m⁻} f(e_j⁻),   (16)

where the error is now given by e_i⁺ = y⁺ − g(x_i⁺) and e_j⁻ = y⁻ − g(x_j⁻), with the offset η being absorbed into the learning target labels y⁺ and y⁻. To differentiate between the target labels of the two categories, i.e., y⁺ ≠ y⁻, it is required that η > 0 cannot take zero value, while the decision threshold τ ∈ R can take arbitrary real values including zero.

The above approximation of the classification error can be interpreted from the perspective of minimizing an ℓ₁-norm of the error vector e = y − g(x):

∥e∥₁ = ∥y − g(x)∥₁ = ∑_{i=1}^{m} |y_i − g(x_i)| = lim_{ϵ→0} ∑_{i=1}^{m} f(y_i − g(x_i))
     = lim_{ϵ→0} ∑_{i=1}^{m} √(y_i² + g(x_i)² − 2y_i g(x_i) + ϵ)
     = ∑_{i=1}^{m} √(1 + g(x_i)² − 2y_i g(x_i)).   (17)

Here, we note that y_i ∈ {y⁺, y⁻} (which is equal to {+1, −1} when τ = 0, η = 1) is compared with the prediction of each training sample, i.e., y_i − g(x_i) or y_i g(x_i), i = 1, …, m, where a correct classification gives −y_i g(x_i) ⩽ 0 and conversely −y_i g(x_i) > 0. Hence, according to (17), each wrongly classified sample contributes a higher ℓ₁-norm error than a correctly classified sample. This corresponds to an approximation to the classification error counting penalty. Effectively, when every sample error (i.e., e = (τ + η) − g(x_i⁺) for positive class data or e = (τ − η) − g(x_j⁻) for negative class data) is normalized within [0, η), only a single arm of the absolute function is activated. This means that the absolute function |e| gives a monotonic penalty under such an operating range. In other words, under appropriate normalization, the TER approximation can be simplified as

TER_approx = (1/m⁺) ∑_{i=1}^{m⁺} e_i + (1/m⁻) ∑_{j=1}^{m⁻} e_j = ∑_{i=1}^{m} w_i (y_i − g(x_i)),   (18)


where

w_i = (1 + y_i)/(2m⁺) + (1 − y_i)/(2m⁻),   (19)

and m = m⁺ + m⁻. This approximation for TER will be adopted in the forthcoming formulation for stretchy learning.

3.3. Stretchy classifier learning

Consider a single sample of data pair (x, y) for classification network learning, where the input feature vector is x = [x₁, …, x_d]^T ∈ R^d and the target output class label is given by y ∈ {−1, +1}. The data is projected to a high dimensional space p(x) ∈ R^D, where a generalized linear predictor model is written as g(α, x) = p(x)^T α = ∑_{j=0}^{D−1} α_j p_j(x). Under this scenario, the TER approximation (18) is given by we = w(y − pα), where w = (1 + y)/2m⁺ + (1 − y)/2m⁻ and p := p(x)^T ∈ R^{1×D} (defined as a row vector here). By denoting the elementwise operation using ∘, we write the elementwise power of a matrix/vector as A^{∘k}/b^{∘k}. Then, the following sample-wise constrained formulation can be minimized with respect to the parameter vector α ∈ R^D.

Theorem 1. Given y ∈ R and suppose p(p^T)^{∘1/(k−1)} ≠ 0. Then for all k > 1 and w ≠ 0,

α = (p^T w)^{∘1/(k−1)} ( p (p^T w)^{∘1/(k−1)} )^{−1} y   (20)

is the global minimizer of

min_α ∥α∥_k^k   subject to   w(y − pα) = 0.   (21)

Proof. Since

min_α ∥α∥_k^k = min_α ∑_{j=0}^{D−1} |α_j|^k = min_α lim_{ϵ→0} |≀α≀|_k^k = min_α lim_{ϵ→0} ∑_{j=0}^{D−1} (α_j² + ϵ)^{k/2},   (22)

problem (21) can be written in an unconstrained form as

min_α lim_{ϵ→0} ∑_{j=0}^{D−1} (α_j² + ϵ)^{k/2} + βw(y − pα),   (23)

where β ∈ R is a Lagrange multiplier. Taking the first partial derivative of the above cost function with respect to α and putting it to zero gives

lim_{ϵ→0} [ kα₀(α₀² + ϵ)^{k/2−1}, …, kα_{D−1}(α_{D−1}² + ϵ)^{k/2−1} ]^T − [ βwp₀, …, βwp_{D−1} ]^T = 0

⇒ α = [ (βwp₀/k)^{1/(k−1)}, …, (βwp_{D−1}/k)^{1/(k−1)} ]^T = (β/k)^{1/(k−1)} (wp^T)^{∘1/(k−1)},   (24)

for ⟨p, p^{∘1/(k−1)}⟩ ≠ 0. Next, multiplying p on both sides of the above equation and substituting y = pα, we have

y = (β/k)^{1/(k−1)} p (wp^T)^{∘1/(k−1)}  ⇒  (β/k)^{1/(k−1)} = y ( p (wp^T)^{∘1/(k−1)} )^{−1}.   (25)

Substituting (25) into (24), we have

α = (wp^T)^{∘1/(k−1)} y / ( ∑_{j=0}^{D−1} p_j (wp_j)^{1/(k−1)} ) = (p^T w)^{∘1/(k−1)} ( p (p^T w)^{∘1/(k−1)} )^{−1} y.   (26)

The global optimality comes from the convexity of the ℓp-norm for ϵ → 0 and k > 1 (Boyd & Vandenberghe, 2004) and the linearity of w(y − pα). ■

When k = 2, we arrive at α = (p^T w) ( p (p^T w) )^{−1} y, and we know that this solution can be accumulated for multiple training samples, giving

α = P^T W (PP^T W)^{−1} y,   (27)

where

W = diag(w₁, …, w_m) ∈ R^{m×m},   P = [p₁; …; p_m] ∈ R^{m×D},   y = [y₁, …, y_m]^T,   (28)

with w_i = (1 + y_i)/2m⁺ + (1 − y_i)/2m⁻, i = 1, …, m, under our TER setting. This is a weighted least-norm (ℓ2-norm) solution and is equivalent to (8) when W is replaced by an identity matrix. Since α is continuous in k, we conjecture that this accumulation can be extended beyond k = 2 (k > 1) for m training samples.

Conjecture 1. Suppose P(P^T W)^{∘1/(k−1)} ∈ R^{m×m} is of full rank. Then for all k > 1,

α = (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} y   (29)

is the global minimizer of

min_α ∥α∥_k^k   subject to   W(y − Pα) = 0.   (30)

Remark 2. While a complete proof for the case with multiple samples remains pending, we can let α = (P^T W)^{∘1/(k−1)} β and substitute it into the constraint given by W(y − Pα) = 0 with non-zero weighting elements of W to get

W(y − Pα) = 0  ⇒  y = Pα = P (P^T W)^{∘1/(k−1)} β  ⇒  β = [ P (P^T W)^{∘1/(k−1)} ]^{−1} y.   (31)
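The conjectured dual solution (29) with the class-balancing weights (19) reduces to elementwise powers and one m × m linear solve; the sketch below uses hypothetical data (positive features keep the elementwise power real-valued) and checks the interpolation constraint y = Pα.

```python
import numpy as np

rng = np.random.default_rng(3)
m, D, k = 6, 15, 1.5               # under-determined (m < D); hypothetical stretch k > 1
P = rng.uniform(0.5, 1.5, (m, D))  # positive entries keep the elementwise power real-valued
y = np.array([+1.0, -1.0, +1.0, +1.0, -1.0, -1.0])

# Class-balancing weights (19): w_i = (1+y_i)/(2 m+) + (1-y_i)/(2 m-)
m_pos, m_neg = np.sum(y > 0), np.sum(y < 0)
w = (1 + y) / (2 * m_pos) + (1 - y) / (2 * m_neg)
W = np.diag(w)

# Conjectured solution (29): alpha = (P^T W)^{o 1/(k-1)} [ P (P^T W)^{o 1/(k-1)} ]^{-1} y,
# where ^{o} denotes the elementwise power
A = (P.T @ W) ** (1.0 / (k - 1.0))
alpha = A @ np.linalg.solve(P @ A, y)
```

By construction Pα = PA (PA)^{−1} y = y, so the constraint in (30) is met exactly whenever P(P^T W)^{∘1/(k−1)} is invertible, mirroring the full-rank assumption of Conjecture 1.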


This β can be substituted back into α = (P^T W)^{∘1/(k−1)} β to give (29). This shows satisfaction of the conjectured solution to the constraint in (30) containing multiple samples. From the error or noise minimization point of view, the solution given by (29) satisfies (y − Pα)^T W^T W(y − Pα) = 0. This constitutes a solution under the restricted feasible set when the system is under-determined. An extensive study based on numerical evaluations using both synthetic data and real-world data shall be conducted to illustrate the usefulness of this conjectured solution. ■

Next, based on the above solution in the dual space, the estimation in its primal counterpart can be readily obtained as follows. We shall make use of the known results mentioned in Remark 1, where the dual solution is linked to its primal solution form via the matrix identity (δ²I + AB)^{−1} A = A(δ²I + BA)^{−1}. Since the limiting case (lim_{δ→0}) of the pseudoinverse always exists (Albert, 1972), putting A = (P^T W)^{∘1/(k−1)} and B = P, the estimation in the primal space (32) is readily obtained:

α = [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)} y.   (32)

Remark 3. Let P⌟ = (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} and P⌞ = [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)}. Then, we see that PP⌟ = I and P⌞P = I. This means that P⌟ and P⌞ can be considered extended right and left pseudoinverses of P, respectively. Noting that the primal solution form of (32) and the dual solution form of (29) seldom coexist in practice, in the experiments we can adopt (32) when dealing with over-determined systems and adopt (29) when dealing with under-determined systems. Moreover, noting that these two equations with their weight elements fixed at 1/m⁺ and 1/m⁻ are optimal for the ideal case without noise, a better and more practical solution is to tune the weighting matrix W in (32) and (29) according to the given data set. Two special cases would be choosing W = I throughout and choosing the weight with its elements fixed at 1/m⁺ and 1/m⁻ according to category size. In the experiments, we shall consider only these two special cases, where further tuning could be a topic for future work. ■

4. Variance analysis

In this section, we perform an analysis of the variance of the estimate and observe its essential properties. After that, the relationship between the conjectured estimation and two well-known existing classifiers is discussed in Section 5.

Property 1. The estimator given by (32) is unbiased while the estimator given by (29) is biased.

Proof. To analyze the performance of the estimator α̂, we assume that y = Pα + n, where n is a zero mean noise with covariance matrix C. Consider first the case of over-determined systems, i.e., (32). Let the size of α be D × 1 and that of P be m × D. For even-determined and over-determined systems, we have m ⩾ D. In such a case, the expectation of α̂ is given by

E[α̂] = E[ [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)} (Pα + n) ]
     = [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)} Pα
     = α.

Hence, α̂ is an unbiased estimator regardless of the value of k and for a general C. This may be considered a generalization of the LS estimator (with k = 2). Here, E[α̂] = α indicates that no parameter vanishes in the expectation. In other words, the analysis shows that estimation under the over-determined setting cannot be sparse.

Next, consider the under-determined systems, i.e., (29). Let the size of α be D × 1 and that of P be m × D. For under-determined systems, we have m < D. In such a case, the expectation of α̂ is given by

E[α̂] = E[ (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} (Pα + n) ]
     = (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} Pα.

In order for E[α̂] = α, it is required that the matrix M = (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} P be a D × D identity matrix, which is not possible since m < D, i.e., M is at most of rank m < D. Hence, α̂ is a biased estimator. This is similar to the LS estimator (with k = 2). This result shows that some parameters can vanish, causing E[α̂] ≠ α. ■

As for the variance analysis, for the case with under-determined systems, the covariance matrix of α̂ is

E[(α̂ − E[α̂])(α̂ − E[α̂])^T]
 = E[ { (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} n } { (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} n }^T ]
 = { (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} } E[nn^T] { (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} }^T
 = { (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} } C { (P^T W)^{∘1/(k−1)} [ P (P^T W)^{∘1/(k−1)} ]^{−1} }^T.

When k = 2, it becomes the ordinary LS estimator and the standard LS result can be applied to simplify the above expression.

For the case with over-determined systems, the covariance matrix of α̂ is

E[(α̂ − E[α̂])(α̂ − E[α̂])^T]
 = E[ { [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)} n } { [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)} n }^T ]
 = { [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)} } E[nn^T] { [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)} }^T
 = { [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)} } C { [ (P^T W)^{∘1/(k−1)} P ]^{−1} (P^T W)^{∘1/(k−1)} }^T.

80

K.-A. Toh et al. / Neural Networks 97 (2018) 74–91

Consider now some special cases. When k = 2, it becomes the ordinary LS estimator and standard LS result can be applied to simplify the above (even for the general C matrix). When k ̸ = 2, the situation becomes more complicated. Assume first that C = σ 2 ID . We have

ˆ − E [α] ˆ )(αˆ − E [α] ˆ )T ] E [ (α = σ2 ×

(PT W)

{[

= σ2 ×

{[

1 ◦ k− 1

(PT W)

{[

(PT W)

{[

(P W)

(PT W)

1

◦ k−1

]−1 (

P

1 ◦ k− 1

]−1 (

P

)◦ k−1 1

PT W

]−1 (

)◦ k−1 1

PT W

P

1 ◦ k− 1

1 ◦ k− 1

T

Since k ̸ = 2, (PT W)

[

1 ◦ k− 1

]−1 (

P

]−1 }T

P

1

(WP)◦ k−1

}

)◦ k−1 1

subject to

(P)◦ k−1 ̸= ID .

1

1 T ◦ k−1

and consequently, (P ) is approaching a rank 1 matrix even if P ◦ 1 ◦ 1 is non-singular. In both cases, (PT W) k−1 P and P(PT W) k−1 become rank 1 matrices also and it is hence non-invertible. We conjecture that k = 2 is the most stable case (in terms of matrix regularity) and it tends to singular when k is moving away from 2, in both directions.

5. Relationship with two existing classifiers

5.1. Relationship with FDA

From the primal solution given by (32), by rewriting the 'scatter' matrices as

S^k₊ := (1/m⁺)^{1/(k−1)} (P₊^T)^{∘1/(k−1)} P₊ and S^k₋ := (1/m⁻)^{1/(k−1)} (P₋^T)^{∘1/(k−1)} P₋,

as well as by rewriting the 'projected means' as

m^k₊ := (1/m⁺)^{1/(k−1)} (P₊^T)^{∘1/(k−1)} y₊ and m^k₋ := (1/m⁻)^{1/(k−1)} (P₋^T)^{∘1/(k−1)} y₋,

we see, when k = 2, the resemblance of (32) to the well-known Fisher's linear discriminant analysis (FDA) (Duda et al., 2001), which is given by

α = λ S_W^{−1} (m⁺ − m⁻),   (33)

where λ is a scaling factor and S_W = S⁺ + S⁻ with

S⁺ = Σ_{i=1}^{m⁺} (x_i − m⁺)(x_i − m⁺)^T,  S⁻ = Σ_{j=1}^{m⁻} (x_j − m⁻)(x_j − m⁻)^T,   (34)

m⁺ = (1/m⁺) Σ_{i=1}^{m⁺} x_i,  m⁻ = (1/m⁻) Σ_{j=1}^{m⁻} x_j.   (35)

When we put x_i = p_i (recall that p_i^T, i = 1, …, m, are the row vectors within P), then S_W corresponds to (P^T W)^{∘1/(k−1)} P and (m⁺ − m⁻) corresponds to (P^T W)^{∘1/(k−1)} y in (32). Here we see that the main difference between (32) and (33) lies in the normalization and the mean centering of the covariance matrix of P when k = 2.

5.2. Relationship with SVM and its variant

In the well-known Support Vector Machines (SVM) (Boser, Guyon, & Vapnik, 1992; Vapnik, 1998), the separation margin 2/∥α∥₂ is maximized with respect to α and b under the constraints that all data are projected to relevant discriminant values, i.e., y_i(α^T p_i + b) ⩾ 1, ∀(p_i, y_i), i = 1, …, m, with y_i ∈ {−1, 1}. The problem is often formulated in the equivalent form

minimize (1/2)∥α∥₂²  subject to  y_i(α^T p_i + b) ⩾ 1, ∀(p_i, y_i), i = 1, …, m.   (36)

The set of constraints y_i(α^T p_i + b) ⩾ 1 is actually derived from combining the projections of the separable training samples of each category:

α^T p_i + b ⩾ +1, for y_i = +1,   (37)
α^T p_j + b ⩽ −1, for y_j = −1.   (38)

This can be interpreted as negative samples being projected onto values at most −1 and positive samples being projected onto values at least +1. In other words, what matters most is just those supporting samples located at the decision boundaries of the projected space, for which the equality in (37) and (38) is satisfied:

min_{i∈{1,…,m⁺}} (α^T p_i + b) = +1,   (39)
max_{j∈{1,…,m⁻}} (α^T p_j + b) = −1.   (40)

Considering only these supporting vectors, Eqs. (39) and (40) can be combined as

min_{i∈{1,…,m⁺}} (α^T p_i + b) − max_{j∈{1,…,m⁻}} (α^T p_j + b) = 2,   (41)

so as to enforce the constraints in (36). Here, we see that SVM takes the form of k = 2 in (29) and uses only the boundary data as constraints. As mentioned, the proposed dual solution (29) corresponds to minimizing ∥α∥ₖᵏ subject to all data being constrained by W(y − Pα) = 0, where min_i(α^T p_i) and max_j(α^T p_j) in (41) are two sets of data points at the decision boundary within the classification mapping given by W(y − Pα) = 0 when W is an identity matrix.

Instead of using two parallel planes at the data boundaries ((39) and (40)), the Nonparallel Support Vector Machine (NPSVM) (Tian, Qi, Ju, Shi, & Liu, 2014) relaxes the parallelism constraint by solving two convex quadratic programming problems. Apart from adding flexibility to the orientation of the decision hyperplane with category-specific consideration, the underlying mechanism of minimizing the ℓ₂-norm with the boundary data as constraints remains unchanged.

6. Synthetic example

In order to understand the stretching behavior of the proposed method, we observe the decision contours on a synthetic example which consists of 18 data samples (see Fig. 2(a) for the red circles and the blue squares). Among the data samples, 6 samples (marked by red circles) are labeled '−1' (class-1) and the remaining 12 samples (marked by blue squares) are labeled


Fig. 2. (a) Decision boundaries (at threshold levels {−2.3, −2, −1, 0, 0.5, 1, 2, 3}) of a 3rd-order polynomial model learned from 18 data points at k = 1.15. (b) Estimated polynomial coefficient values based on 18 training data samples. Top panel: LSE-Poly; Middle panel: SBC1-Poly; Bottom panel: SBC2-Poly. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. (a) Decision boundaries (at threshold levels {−2.3, −2, −1, 0, 0.5, 1, 2, 3}) of a 10th-order polynomial model learned from 18 data points at k = 1.15. (b) Estimated polynomial coefficient values based on 18 training data samples. Top panel: LSE-Poly; Middle panel: SBC1-Poly; Bottom panel: SBC2-Poly.

as '+1' (class-2). An over-determined system is observed when a third-order bivariate polynomial model (which has 10 parameters) is adopted for learning. The system becomes under-determined when a 10th-order polynomial model (which has 66 parameters) is adopted. The proposed method is evaluated under three learning settings, namely: (i) LSE-Poly, which uses the least-squares solution (2) for over-determined systems and the least-norm solution (8) for under-determined systems; (ii) SBC1, which uses either (29) or (32) with both m⁺ and m⁻ fixed at unity; and (iii) SBC2, which uses either (29) or (32) with m⁺ and m⁻ set equal to the class sample sizes. In other words, SBC1 has m⁺ = m⁻ = 1 and SBC2 has m⁺ = 12, m⁻ = 6, adopting (29) for under-determined systems and (32) for over-determined systems.

Fig. 2(a) shows the contour plots for the third-order polynomial model learned at k = 2 (LSE-Poly) and at k = 1.15 (SBC1-Poly and SBC2-Poly). This is an over-determined system where (2) has been used for training LSE-Linear and LSE-Poly. The classification error counts for these two classifiers (see the red solid line and the blue dotted curve) are both 5, wherein 3 out of the 6 samples from class-1 are wrongly classified. For SBC1-Poly (green dashed contour lines) and SBC2-Poly (black solid contour lines), the primal solution form (32) has been adopted. The classification error counts for these two classifiers are both 6, but with different error counts within each category: SBC1-Poly makes 4 errors on the minority class whereas SBC2-Poly makes 2. This again shows the impact of per-class weighting in the proposed counting-based classification error function. Fig. 2(b) shows the estimated polynomial coefficients for LSE-Poly, SBC1-Poly and SBC2-Poly, where different estimation values are observed for all three methods.

Next, a bivariate polynomial model of 10th order is adopted for learning the 18 data samples. The bivariate polynomial model is given by g(α, x) = Σ_{i=0}^{10} Σ_{j=0}^{10−i} α_{ij} x₁^i x₂^j. Here, the total number of parameters (the α_{ij} terms) to be estimated is 66. Since there are only 18 data samples, the system becomes under-determined. Fig. 3(a) shows the decision contours for the above three methods. LSE-Poly is trained with zero error while LSE-Linear resulted in an error count of 5. For the minority class-1, SBC1-Poly makes 4 errors (out of 4 total errors over both classes) while SBC2-Poly makes 1 error (out of 3 total errors over both classes). Again, this shows the effect of the per-class weighting mechanism in stretchy classification. Fig. 3(b) shows the corresponding estimated parameters, where sparseness is observed for both stretchy methods (SBC1-Poly and SBC2-Poly), with eighteen non-zero parameter values. The effect of per-class weighting is also seen in the difference between the estimated parameters of the two stretchy methods.

To summarize, this example verifies not only the feasibility of sparse estimation, but also the biased estimation in the under-determined scenario.

7. Experiments

In this section, we apply the proposed Stretchy Binary Classifier (SBC) to high-dimensional benchmark data to validate its usefulness on real-world problems. The benchmark data sets consist of two groups: the gene expression data sets from Wille et al. (2004) and Yeung, Bumgarner, and Raftery (2005), and the NIPS feature selection challenge data sets from Guyon (2003) and Guyon, Hur, Gunn, and Dror (2004). Although the feature selection challenge was held in 2003, the data sets remain available for benchmarking, with the training and validation sets released to the public; a link can be found at the well-known UCI Machine Learning Repository (Lichman, 2013). By 2014, when all participating results were available, there had been over a thousand participants for most data sets, with the most popular data set attracting nearly two thousand participants.

7.1. Data for evaluation

(i) Gene expression data sets

The first group of data sets consists of three microarray measurements from gene expression research. The first data set concerns Arabidopsis thaliana, a small flowering plant which is important for understanding the genetic pathways of many plant traits. The data set obtained from Wille et al. (2004) contains 40 isoprenoid genes in the mevalonate (MVA) and non-mevalonate (MEP) pathways plus 795 additional genes from 56 downstream pathways. We use only the 40 data samples from the MVA and MEP pathways for our binary classification task. The second data set involves the monitoring of Leukemia by DNA microarrays based on 3501 human genes from 72 samples, published by Yeung et al. (2005). Among the 72 samples, 25 are classified as acute myeloid leukemia (AML) and 47 as acute lymphoblastic leukemia (ALL). The third data set (Yeung et al., 2005) consists of 3226 genes of primary breast tumors from both hereditary and sporadic cases, including 7 BRCA1-mutation-positive, 8 BRCA2-mutation-positive and 7 sporadic cases. For binary classification, we distinguish the sporadic cases from the hereditary cases by merging BRCA1 and BRCA2 into a single category.

In all experiments, we perform ten runs of two-fold tests on data formed by combining the training and test data sets, to deal with possible statistical bias in a single hold-out test. Table 1 summarizes these three gene expression data sets in terms of feature dimension and total sample size.

Table 1. Gene expression data sets.

  Data set                 | Domain          | Type  | # Feat. | Total # of samples
  Arabidopsis thaliana     | Gene expression | Dense | 118     | 40
  Leukemia                 | Gene expression | Dense | 3501    | 72
  Hereditary breast cancer | Gene expression | Dense | 3226    | 22

(ii) NIPS feature selection challenge data sets

There are five data sets available for classifier benchmarking in the NIPS feature selection challenge (Guyon, 2003; Guyon et al., 2004). Table 2 lists the attributes of these data sets, including the type and dimension of the features and the sizes of the training, validation and test sets. While the test set is withheld by the organizer, the training and validation sets are released for classifier learning and tuning. The challenge protocol adopted a single hold-out evaluation using the test set. In our experiments, to deal with possible statistical variation in the single hold-out test, we perform ten runs of two-fold tests on data formed by combining the training and validation data sets. We present the results at this level of detail with the aim of supporting future comparisons.

Table 2. NIPS feature selection challenge data sets.

  Data set | Domain          | Type          | # Feat. | # Train | # Valid. | # Test
  Arcene   | Mass spec.      | Dense         | 10 000  | 100     | 100      | 700
  Dexter   | Text categ.     | Sparse        | 20 000  | 300     | 300      | 2000
  Dorothea | Drug discov.    | Sparse binary | 100 000 | 800     | 350      | 800
  Gisette  | Digit recog.    | Dense         | 5 000   | 6000    | 1000     | 6500
  Madelon  | Artificial data | Dense         | 500     | 2000    | 600      | 1800

In the NIPS challenge, several classical evaluation metrics such as classification accuracy (ACC), balanced error rate (BER) and area under the ROC curve (AUC) (see Guyon (2003) and Guyon et al. (2004) for details) were utilized for performance comparison. In view of the intrinsic relationship among these performance metrics, we adopt only the BER in our experiments (including those on the gene expression data) in order not to overly clutter the plots.
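For reference, the balanced error rate averages the per-class error rates, so an imbalanced data set cannot be "solved" by always predicting the majority class. A minimal sketch (our illustration, assuming labels in {−1, +1}):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    # Mean of the per-class misclassification rates for the two classes.
    errs = [np.mean(y_pred[y_true == c] != c) for c in (-1, 1)]
    return float(np.mean(errs))
```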
In addition to this performance metric, we include the training CPU time for all compared algorithms, running on an Intel Core i7 3.6 GHz processor under the Matlab environment (The MathWorks, 2017), except for the Dorothea and Gisette data sets, where the baseline NPSVM was run on an Intel Xeon E5-v2 3.5 GHz processor with 256 GB of RAM.

7.2. Data standardization

Consider a stacked set of raw training input data given by X_raw = [x₁ x₂ ··· x_d]. A standardization x_j ← (x_j − μ_{x_j})/σ_{x_j}, j = 1, …, d, is first performed for each data column, i.e., a z-score normalization based on the statistics (μ_{x_j} and σ_{x_j}) of the training set. Then, an exponential function given by

x̆_j = exp(a x_j + b_j)   (42)

is adopted to map the standardized data into the first quadrant. The main purpose of this transformation is to ensure that no complex number arises from taking the power term 1/(k − 1) in (29) and in (32).
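The two standardization steps can be sketched as follows (illustrative; the guard against constant columns is our addition, not from the paper):

```python
import numpy as np

def standardize(X, a=-0.2, b=0.0):
    # z-score each column using the training statistics, then apply the
    # exponential map (42) so that every feature becomes strictly positive,
    # avoiding complex values from the elementwise power 1/(k-1).
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)   # our addition: guard constant columns
    Z = (X - mu) / sd
    return np.exp(a * Z + b)
```

The scaling parameters a and b correspond to those selected empirically in Section 7.3 (a = −0.2, b = 0).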


7.3. Algorithmic settings

For the proposed SBC, several algorithmic settings are evaluated to observe the parametric stretching behavior on data with different feature dimensions and sample sizes. Two versions of SBC adopting a linear model, namely SBC1 (with m⁺ and m⁻ both fixed at unity in (32) and (29)) and SBC2 (with m⁺ and m⁻ set according to the respective class sizes in (32) and (29)), are evaluated for various values of k ∈ {1.1, 1.25, 1.5, 1.75, 2}. All data has been standardized, with empirically selected scaling parameters (a = −0.2, b = 0 in (42)).

LASSO (Tibshirani, 1996) and Elastic-net (Zou & Hastie, 2005) are adopted as benchmarks in this evaluation. Under the various settings of the LASSO function in the Matlab toolbox (The MathWorks, 2017), the algorithm traverses between LASSO and Elastic-net behaviors: at Alpha=1 the function provides the LASSO output, and when 0 < Alpha < 1 it behaves as Elastic-net. Another relevant setting is NumLambda, the number of regularization penalties evaluated; when NumLambda>1, a two-pass training is performed where the best result is chosen from the range of evaluations. Here, we configure it at 1, 2 and 100 to observe its impact on accuracy and training CPU time. At NumLambda=1 and NumLambda=100, the impact of having the lowest lambda penalty can be compared with that of a large number of lambda penalties. The choice of NumLambda=2 is to observe its significantly larger training processing time compared with NumLambda=1. We abbreviate these settings as LASSO1, LASSO2 and LASSO100 for readability. For NumLambda>1, only the best accuracy in terms of the lowest mean square error is recorded.

Similar to LASSO, the proposed SBC (SBC1 and SBC2) adopts a two-pass training process wherein the top parameter values from the first pass are selected for the second pass according to a user-specified feature density. For instance, SBC2 at 0.01 feature density means that only the highest 1% of the parameters obtained from the first pass are selected for the second pass.
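The two-pass density mechanism can be sketched generically as follows (our illustration; `fit` stands for any single-pass estimator returning a coefficient vector, e.g. (29) or (32)):

```python
import numpy as np

def two_pass_fit(P, y, fit, density=0.01):
    # Pass 1: fit on all features; keep the top `density` fraction by |alpha|.
    alpha1 = fit(P, y)
    d = max(1, int(round(density * P.shape[1])))
    keep = np.argsort(-np.abs(alpha1))[:d]
    # Pass 2: refit using only the selected columns.
    alpha2 = fit(P[:, keep], y)
    full = np.zeros(P.shape[1])
    full[keep] = alpha2
    return full, keep
```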
To observe the impact of sparsity on the adopted features, we perform experiments at several feature density values, namely SBC1(0.5), SBC1(0.1) and SBC1(0.01), together with the original SBC1(1.0) using the full feature set. A similar set of density choices is adopted for SBC2.

Apart from LASSO and Elastic-net, an extended SVM variant called the Nonparallel Support Vector Machine (NPSVM) (Tian et al., 2014) is adopted for classification performance benchmarking. Owing to its sequential minimal optimization implementation and its flexible class-specific decision boundaries, NPSVM has been reported to offer good accuracy and training speed on large data sets compared with the conventional implementation of SVM and with other classification algorithms such as the sparse Bayesian relevance vector machine (RVM) (Tipping, 2000) and the multiple criteria linear programming (MCLP) method (Shi & Peng, 2002; Shi, Tian, Chen, & Zhang, 2009). To select the best parameters of the radial basis function kernel, a grid search was performed for each NPSVM parameter (C, Epsilon and RBFwidth) over {0.003906, 0.03125, 0.25, 1, 4, 32, 256}. In other words, the grid search covered 7 × 7 × 7 parametric combinations, and the best five parameter sets were adopted in NPSVM for performance benchmarking.
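The exhaustive search over the stated grid amounts to the following (our illustration; `evaluate` is a placeholder returning a validation error for one parameter triple):

```python
import itertools

GRID = [0.003906, 0.03125, 0.25, 1, 4, 32, 256]

def grid_search(evaluate):
    # Score all 7 x 7 x 7 combinations of (C, epsilon, RBF width) and
    # return the triple with the lowest validation error.
    return min(itertools.product(GRID, GRID, GRID),
               key=lambda t: evaluate(*t))
```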

7.4. Results

(i) Arabidopsis thaliana

Fig. 4(a)–(b) show respectively the BER and CPU performance, plotted over the k-values (k ∈ {1.1, 1.25, 1.5, 1.75, 2}) for the SBCs and over the Alpha-values (Alpha ∈ {1, 0.75, 0.5, 0.25, 0.0001}) for LASSO. Owing to the relatively low data dimension and the use of a nonlinear kernel, the NPSVM shows a clear BER advantage over the linear models of LASSO and SBC. Among the algorithms with compressing capability, the LASSOs with variable NumLambda values (LASSO100 and LASSO2) outperform the other compared methods (LASSO1, SBC1 and SBC2) in terms of the BER value. The next best performance goes to SBC1 and SBC2 at k = 1.1 and density 0.1. The varying performance across k values for the SBCs indicates the difference in the stretching of the α elements, which in turn affects the classification performance. The CPU results (Fig. 4(b)) indicate a relatively heavy computational cost for LASSO100, LASSO2 and NPSVM compared with LASSO1 and the SBCs. Fig. 4(c)–(d) show an example of the sorted estimated parameters for SBC, LASSO1 and NPSVM. In Fig. 4(d), only a fraction of the estimated parameters is shown for NPSVM due to the large parametric size incurred by its non-sparse estimation. The plot shows high sparsity for the LASSO estimation, whereas a smooth distribution of estimated parameters is seen for SBC.

(ii) Leukemia

Different from the above case, the results in Fig. 5(a) for the Leukemia data show comparable performance for all classifiers except LASSO1, particularly for the SBCs at high k values. This suggests that a less biased estimation is preferred for the SBCs here. The CPU results in Fig. 5(b), however, show similar behavior to that of the Arabidopsis data, apart from the much increased learning time for NPSVM. Fig. 5(c)–(d) show an example of the sorted estimated parameters for SBC, LASSO1 and NPSVM; the plot shows smooth but relatively sparse parameters for SBC at k = 1.1.
(iii) Hereditary breast cancer

Similar to the Leukemia case above, the Hereditary breast cancer data shows comparable accuracy-related performance for all compared classifiers in Fig. 6(a). The impact of the k value peaks at k = 1.25 for the SBCs. The CPU results (Fig. 6(b)) do not deviate much from those of the Leukemia data set. The sorted parameter plots in Fig. 6(c)–(d) show a sparseness of the SBC estimation similar to that observed on the Leukemia data set.

(iv) Arcene

From Fig. 7(a), we see that the BER values of SBC1 and SBC2 at several densities, namely 1.0, 0.5 and 0.1, outperform those of the LASSOs. As for the training CPU time (Fig. 7(b)), we observe a heavy dependency of LASSO on the NumLambda search process, whereas the cost of the proposed SBC depends mainly on the selected training feature dimension. A high learning cost is also observed for the NPSVM, which adopted the nonlinear radial basis kernel. Fig. 7(c)–(d) show an example of the estimated parameters sorted in ascending order for SBC, LASSO and NPSVM. LASSO achieves good sparseness with most parameters approaching zero, while SBC shows an estimation that is relatively stretched towards sparseness. The plot shows a relatively 'smooth' stretching of the estimated parameters for SBC, while LASSO gives a relatively 'discrete' distribution of its estimated parameters.

(v) Dexter

From Fig. 8(a), the BER trend for this example is similar to that of the Arcene example, where the lowest performance goes to the SBCs at density 0.01 and the higher performances go to the SBCs at densities 1.0, 0.5 and 0.1. This shows that at least 10% of the parameters are useful for estimation in SBC learning. The training


Fig. 4. Arabidopsis thaliana: (a)–(b) BER and CPU results. x-axis: for SBC, k ∈ {1.1, 1.25, 1.5, 1.75, 2}; for LASSO, Alpha ∈ {1, 0.75, 0.5, 0.25, 0.0001}. (c)–(d) Sorted parameter α estimation values for proposed SBC2 at k = 1.1 and LASSO at Alpha= 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Leukemia: (a)–(b) BER and CPU results. x-axis: for SBC, k ∈ {1.1, 1.25, 1.5, 1.75, 2}; for LASSO, Alpha ∈ {1, 0.75, 0.5, 0.25, 0.0001}. (c)–(d) Sorted parameter α estimation values for proposed SBC2 at k = 1.1 and LASSO at Alpha= 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 6. Hereditary breast cancer: (a)–(b) BER and CPU results. x-axis: for SBC, k ∈ {1.1, 1.25, 1.5, 1.75, 2}; for LASSO, Alpha ∈ {1, 0.75, 0.5, 0.25, 0.0001}. (c)–(d) Sorted parameter α estimation values for proposed SBC2 at k = 1.1 and LASSO at Alpha= 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 7. Arcene: (a)–(b) BER and CPU results. x-axis: for SBC, k ∈ {1.1, 1.25, 1.5, 1.75, 2}; for LASSO, Alpha ∈ {1, 0.75, 0.5, 0.25, 0.0001}. (c)–(d) Sorted parameter α estimation values for proposed SBC2 at k = 1.1 and LASSO at Alpha= 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 8. Dexter: (a)–(b) BER and CPU results. x-axis: for SBC, k ∈ {1.1, 1.25, 1.5, 1.75, 2}; for LASSO, Alpha ∈ {1, 0.75, 0.5, 0.25, 0.0001}. (c)–(d) Sorted parameter α estimation values for proposed SBC2 at k = 1.1 and LASSO at Alpha= 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

CPU time in Fig. 8(b) again shows the dependency of LASSO on the NumLambda setting, while SBC depends on the density setting. From Fig. 8(c)–(d), the trend of the sorted parameters is similar to that of the Arcene data, where LASSO shows a sparser and more discrete distribution of estimated parameters while SBC shows a smoother and less sparse estimation.

(vi) Dorothea

This example has the highest input feature dimension (100,000) among the studied examples. Different from the above two examples, Fig. 9(a) shows variation of the BER performance under different settings. In particular, several settings of SBC outperform LASSO, with good BER performance at density 0.01 for both SBC1 and SBC2. This variation reveals the impact of the highly imbalanced class distribution of the training data (one class has nearly 7.6 times the number of samples of the other) on the BER performance. The NPSVM does not appear to cope well in terms of BER, even though its experiment was conducted using an Intel Xeon processor with 256 GB of RAM. Due to the high dimensionality, the training CPU times (Fig. 9(b)) show a much heavier computational cost than in the previous examples. Fig. 9(c)–(d) again show the relatively smooth parameter distribution of SBC compared with that of LASSO.

(vii) Gisette

This data set has 5000 dimensions with 7000 samples when both the training set and the validation set are considered. For a two-fold cross-validation test, this data set forms an under-determined system which approaches the boundary between under-determined and over-determined systems. Fig. 10(a) shows the BER and CPU performances of SBC and LASSO, where we observe comparable performance for most settings. However, the variation in

performances is apparently large across the different k-values for SBC. In particular, the BER performance appears to be less stable at low k-values than at high k-values, which could be attributed to the near-critical case between over-determined and under-determined systems. The NPSVM shows good BER performance for this data set of moderate dimension. Fig. 10(c)–(d) show the estimated parameters, where we observe a nearly non-sparse parametric distribution for SBC and a highly sparse parametric distribution for LASSO.

(viii) Madelon

Different from the above data sets, which are under-determined, the Madelon data set has a dimension of 500 with 2000 training samples and 600 validation samples. This constitutes an over-determined system under our training protocol. Fig. 11(a)–(b) show the BER and CPU performances of SBC and LASSO, where the results are far from satisfactory (over 40% BER). According to the organizer, this data set was artificially constructed to illustrate a particular difficulty: selecting a feature set when no feature is informative by itself. To validate the need for a nonlinear decision hyperplane on relevant features, an additional experiment was carried out by expanding the top six features to a high dimension (10,000) using a random projection network (Toh, 2008). In other words, the first pass of SBC2 selects the six largest parameters in terms of absolute value, and the second pass projects these features to the high dimension. The projected results show a drastic improvement in BER (over 20%). This verifies the need for a reduced feature set with a nonlinear mapping capability, beyond the evaluated linear SBC and LASSO settings. Fig. 11(c)–(d) show the non-sparse parametric distributions of both SBC and LASSO.
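The random-projection expansion used for Madelon can be sketched as follows (an illustrative stand-in for the random projection network of Toh (2008): fixed random weights, a sigmoid nonlinearity, and a hypothetical `hidden` argument of our own; only the output weights would then be learned):

```python
import numpy as np

def random_projection_expand(X, hidden=10000, seed=0):
    # Map the selected input features through FIXED random weights and a
    # sigmoid; the expanded representation is then fed to a linear learner.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], hidden))
    b = rng.standard_normal(hidden)
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))
```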


Fig. 9. Dorothea: (a)–(b) BER and CPU results. x-axis: for SBC, k ∈ {1.1, 1.25, 1.5, 1.75, 2}; for LASSO, Alpha ∈ {1, 0.75, 0.5, 0.25, 0.0001}. (c)–(d) Sorted parameter α estimation values for proposed SBC2 at k = 1.1 and LASSO at Alpha= 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 10. Gisette: (a)–(b) BER and CPU results. x-axis: for SBC, k ∈ {1.2, 1.25, 1.5, 1.75, 2}; for LASSO, Alpha ∈ {1, 0.75, 0.5, 0.25, 0.0001}. (c)–(d) Sorted parameter α estimation values for proposed SBC2 at k = 1.2 and LASSO at Alpha= 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 11. Madelon: (a)–(b) BER and CPU results. x-axis: for SBC, k ∈ {1.1, 1.25, 1.5, 1.75, 2}; for LASSO, Alpha ∈ {1, 0.75, 0.5, 0.25, 0.0001}. (c)–(d) Sorted parameter α estimation values for proposed SBC2 at k = 1.1 and LASSO at Alpha= 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

7.5. Summary of results and discussion

To see whether the differences among the classifiers are statistically significant, we conduct a Friedman test using the best-performing average results. The obtained confidence levels for each comparison metric are: BER (p = 0.0176) and CPU (p = 1.9802 × 10⁻⁸). Next, from the Nemenyi test plot shown in Fig. 12(a) for the BER performance, we see an insignificant difference among the classifiers except for the LASSO1 setting, which has the lowest rank. These results not only verify the feasibility of the proposed SBC, but also show that its performance can be on par with LASSO, Elastic-net and NPSVM, if not better. From Fig. 12(b), we see a significant difference between the training CPU times of LASSO and SBC when a penalty search is involved in LASSO (as in LASSO100 and LASSO2). The near-bottom ranking of the CPU time for NPSVM (an optimistic estimate, since it excludes the grid search time and the Dorothea and Gisette runs used the Intel Xeon machine) is due to its heavy search procedure involving a dual set of support vectors compared with the conventional SVM.

In summary, the above experiments not only verify the numerical feasibility of the proposed solution, but also suggest performance comparable to LASSO, Elastic-net and NPSVM in terms of BER, with good training CPU performance. Examples (ii)–(vi) are clearly much under-determined, and they show clear sparsity of the parameter distributions when k approaches 1. Examples (i), (vii) and (viii) spread over the slightly under-determined, near-borderline and over-determined cases, where non-sparse estimation has been observed. This observation

Fig. 12. Nemenyi tests: (a) BER results; (b) CPU results. (Note: the ranking of the training CPU time for NPSVM is an optimistic estimation since the Dorothea and Gisette data sets were run on an Intel Xeon processor with 256 GB of RAM.)


Fig. 13. Comparing SBC and LASSO with all NIPS entries (2003–2014), AUC vs BER: (a) Arcene; (b) Dexter; (c) Dorothea; (d) Gisette; (e) Madelon. The solid horizontal and vertical lines are the mean results of the NIPS entries; the dashed horizontal and vertical lines correspond to the median results of the NIPS entries.

is in congruence with the variance analysis regarding the biased and unbiased estimations, respectively for under-determined and over-determined systems, in Section 4.
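The Friedman test used above ranks the classifiers within each data set and checks whether the mean ranks differ. A minimal NumPy sketch of the test statistic (with illustrative numbers of our own, not the paper's results; ties are not handled):

```python
import numpy as np

def friedman_statistic(scores):
    # scores: rows = data sets, columns = classifiers (lower is better).
    # Rank within each row, then compare mean ranks; under the null the
    # statistic is approximately chi-squared with (k - 1) degrees of freedom.
    n, k = scores.shape
    ranks = np.argsort(np.argsort(scores, axis=1), axis=1) + 1.0  # no ties assumed
    rbar = ranks.mean(axis=0)
    return 12.0 * n / (k * (k + 1)) * np.sum((rbar - (k + 1) / 2.0) ** 2)

# Illustrative BER table: 5 data sets x 3 classifiers.
ber = np.array([[0.12, 0.15, 0.30],
                [0.05, 0.06, 0.20],
                [0.18, 0.17, 0.35],
                [0.22, 0.25, 0.40],
                [0.30, 0.28, 0.45]])
stat = friedman_statistic(ber)
```

A statistic above the chi-squared critical value (5.991 at the 5% level for two degrees of freedom here) indicates that at least one classifier ranks differently; a post-hoc Nemenyi test then identifies which pairs differ.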

In Fig. 13, the Area Under the ROC Curve (AUC) and the BER results of SBC and LASSO are plotted together with all NIPS entries (2003–2014) to observe their performance standing relative to


other state-of-the-art methods. The numbers of participating algorithms are 1941, 1291, 887, 1265 and 1490 for Arcene, Dexter, Dorothea, Gisette and Madelon, respectively. For the Arcene, Dexter and Gisette data sets, several settings of the proposed SBC perform above the median and above the average of these challenge results. For the Dorothea data set, above-median and above-average AUC performance is obtained, with below-average BER performance. For the Madelon data set, both SBC and LASSO show well below-average performance, limited by the linear model adopted. A classifier with nonlinear mapping capability has been shown to be effective in this situation, with the implemented RPN network showing above-average and near-median performance here.

8. Conclusion

We proposed an approximation to the ℓp-norm and conjectured a solution that minimizes this parametric norm subject to a novel classification error constraint. The conjectured solution comes in primal and dual analytic forms, catering for over-determined and under-determined systems respectively. Our variance analysis showed that the primal solution form gives an unbiased estimation while the dual solution form leads to a biased but sparse estimation. Our numerical experiments on both synthetic and real-world data not only validated the feasibility of the implementation, but also showed performance of the proposed estimation comparable to the well-known state-of-the-art LASSO.

Acknowledgments

The authors are thankful to the editor and reviewers for their constructive comments, and to Dr. Bernd Burgstaller for proofreading the manuscript. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (Grant Number: NRF-2015R1D1A1A09061316).

References

Albert, A. (1972). Regression and the Moore–Penrose pseudoinverse, Vol. 94. New York: Academic Press, Inc.
Baraniuk, R. G. (2007). Compressive sensing.
IEEE Signal Processing Magazine, 24(4), 118–124.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press, Inc.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Fifth annual workshop on computational learning theory (pp. 144–152). Pittsburgh: ACM.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Candès, E. J., & Wakin, M. B. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2), 21–30.
Cao, J., Zhang, K., Luo, M., Yin, C., & Lai, X. (2016). Extreme learning machine and adaptive sparse representation for image classification. Neural Networks, 81, 91–102.
Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4), 1289–1306.
Donoho, D. L., & Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences of the United States of America, 100(5), 2197–2202.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: John Wiley & Sons, Inc.
Faris, H., Aljarah, I., & Mirjalili, S. (2016). Training feedforward neural networks using multi-verse optimizer for binary classification problems. Applied Intelligence, 45, 322–332.
Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109–148.
Funahashi, K.-I. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3), 183–192.
Guliyev, N. J., & Ismailov, V. E. (2016). A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. Neural Computation, 28(7), 1289–1304.

Guyon, I. (2003). Design of experiments of the NIPS 2003 variable selection benchmark. NIPS Feature Selection Challenge. http://www.nipsfsc.ecs.soton.ac.uk/papers/Datasets.pdf.
Guyon, I., Hur, A. B., Gunn, S., & Dror, G. (2004). Result analysis of the NIPS 2003 feature selection challenge. In Advances in neural information processing systems, Vol. 17 (pp. 545–552). MIT Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Hornik, K., Stinchcombe, M., & White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5), 551–560.
Huang, D.-S., & Du, J.-X. (1999). Radial basis probabilistic neural networks: Model and application. International Journal of Pattern Recognition and Artificial Intelligence, 1083–1101.
Huang, D.-S., & Du, J.-X. (2008). A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks. IEEE Transactions on Neural Networks, 2099–2115.
Huang, D.-S., & Jiang, W. (2012). A general CPL-AdS methodology for fixing dynamic parameters in dual environments. IEEE Transactions on Systems, Man, and Cybernetics–Part B: Cybernetics, 1489–1500.
Lichman, M. (2013). UCI machine learning repository.
Madych, W. R. (1991). Solutions of underdetermined systems of linear equations. In Lecture notes–monograph series: Vol. 20. Spatial statistics and imaging (pp. 227–238). Institute of Mathematical Statistics.
Mazumder, R., & Hastie, T. (2012). The graphical lasso: New insights and alternatives. Electronic Journal of Statistics, 6, 2125–2149.
Miller, A. (2002). Subset selection in regression. London: Chapman & Hall/CRC.
Nickerson, R. S. (1972). Binary-classification reaction time: A review of some studies of human information-processing capabilities. Psychonomic Monograph Supplements.
Petersen, K. B., & Pedersen, M. S. (2006). The Matrix Cookbook. http://www.mit.edu/~wingated/stuff_i_use/matrix_cookbook.pdf.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1481–1497.
Proctor, R. W., & Cho, Y. S. (2006). Polarity correspondence: A general principle for performance of speeded binary classification tasks. Psychological Bulletin, 416–442.
Rouhani, M., & Javan, D. S. (2016). Two fast and accurate heuristic RBF learning rules for data classification. Neural Networks, 75, 150–161.
Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algorithm in dual variables. In International conference on machine learning, ICML (pp. 515–521). Madison, WI.
Shi, Y., & Peng, Y. (2002). Data mining via multiple criteria linear programming: Applications in credit card portfolio management. Information Technology and Decision Making, 1(1), 145–166.
Shi, Y., Tian, Y., Chen, X., & Zhang, P. (2009). Regularized multiple criteria linear programs for classification. Science in China Series F: Information Sciences, 52(1), 1–9.
Sprecher, D. A. (1993). A universal mapping for Kolmogorov's superposition theorem. Neural Networks, 6(8), 1089–1094.
The MathWorks (2017). Matlab and Simulink. http://www.mathworks.com/.
Tian, Y., Qi, Z., Ju, X., Shi, Y., & Liu, X. (2014). Nonparallel support vector machines for pattern classification. IEEE Transactions on Cybernetics, 44(7), 1067–1079.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society, Series B, 73(3), 273–282.
Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method.
Doklady Akademii Nauk SSSR, 151, 501–504. Translated in Soviet Mathematics, 4, 1035–1038.
Tipping, M. E. (2000). The relevance vector machine. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, Vol. 12 (pp. 652–658).
Toh, K.-A. (2008). Deterministic neural classification. Neural Computation, 20(6), 1565–1595.
Toh, K.-A. (2015). Stretchy multivariate polynomial classification. In Proceedings of the IEEE tenth international conference on intelligent sensors, sensor networks and information processing (ISSNIP), Singapore (pp. 1–6).
Toh, K.-A., Tran, Q.-L., & Srinivasan, D. (2004). Benchmarking a reduced multivariate polynomial pattern classifier. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 740–755.
Tong, H. (2016). A note on support vector machines with polynomial kernels. Neural Computation, 28(1), 71–88.
Vapnik, V. N. (1998). Statistical learning theory. Wiley-Interscience.
Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.

Widrow, B., Greenblatt, A., Kim, Y., & Park, D. (2013). The no-prop algorithm: A new learning algorithm for multilayer neural networks. Neural Networks, 37, 182–188.
Wille, A., Zimmermann, P., Vranová, E., Fürholz, A., Laule, O., Bleuler, S., et al. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11), R92.
Yeung, K. Y., Bumgarner, R. E., & Raftery, A. E. (2005). Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21(10), 2394–2402.


Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1), 49–67.
Zhang, J.-R., Zhang, J., Lok, T.-M., & Lyu, M. R. (2007). A hybrid particle swarm optimization–back-propagation algorithm for feedforward neural network training. Applied Mathematics and Computation, 185, 1026–1037.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2), 301–320.