Tailoring density ratio weight for covariate shift adaptation


Neurocomputing 333 (2019) 135–144

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Sentao Chen∗, Xiaowei Yang

School of Software Engineering, South China University of Technology, Guangzhou, China

Article info

Article history: Received 6 March 2018; Revised 21 November 2018; Accepted 30 November 2018; Available online 31 December 2018. Communicated by Dr. Weike Pan.

Keywords: Covariate shift; Transfer learning; Density ratio; Structural risk minimization; Biconvex optimization

Abstract

In many real-world applications, the performance of machine learning models is significantly degraded by the covariate shift problem. Dealing with this problem appropriately remains an important challenge. A common approach to covariate shift adaptation is to estimate the density ratio weight from unlabeled source and target data and then learn a final hypothesis by directly minimizing the weighted empirical loss. This approach may result in a poor final hypothesis for the target domain because it overweights a few training examples that carry large loss. The problem stems from separating the estimation of the density ratio weight from the learning of the final hypothesis. In this paper, an Adaptively Weighted Structural Risk Minimization (AWSRM) framework is developed to address this problem. In the proposed framework, the raw weights are first estimated by any density ratio estimation technique, and then the tailored weights and the target model are obtained by simultaneously minimizing the weighted structural risk and the discrepancy between the tailored weights and the raw weights. Comprehensive experimental studies on both synthetic and real-world data sets demonstrate that the AWSRM framework outperforms existing state-of-the-art covariate shift adaptation methods in terms of prediction accuracy. © 2018 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (S. Chen).

https://doi.org/10.1016/j.neucom.2018.11.082 0925-2312/© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Covariate shift is a situation where the source and target domains share the same conditional labeling distribution $P(y \mid x)$, but the marginal distribution $P_S(x)$ in the source domain differs from the marginal distribution $P_T(x)$ in the target domain. It is very common in real-world applications such as 3D pose estimation [1], facial expression analysis [2], bioinformatics [3], brain-computer interfacing [4], and spam filtering [5]. The goal of covariate shift adaptation [6] is to exploit the labeled source data and unlabeled target data to learn a model for labeling the target data.

In a misspecified hypothesis space, learning by simply minimizing the empirical loss on the source training examples is biased due to the difference in marginal distributions. To overcome this drawback, some researchers suggest first estimating the density ratio weight $P_T(x)/P_S(x)$ and then minimizing a weighted empirical loss [6–8]. The density ratio weight for a source example indicates the importance of this example to the target domain. Under this mechanism, the examples that are important to the target domain are endowed with large weights, and minimizing the weighted loss naturally leads to a hypothesis that fits these examples well.

The density ratio weight therefore plays an important role in learning the final hypothesis for the target domain. Currently, there are mainly four lines of work on density ratio estimation [9]. The first line of work performs density ratio estimation by moment matching. As an outstanding representative, Kernel Mean Matching (KMM) [10] first maps the original features into a Reproducing Kernel Hilbert Space (RKHS) and then learns the density ratio weights for the source instances by minimizing the distance between the means of the source data and the target data. The output of KMM is a set of weights for the source data rather than a proxy of the density ratio function $P_T(x)/P_S(x)$, and thus it cannot perform out-of-sample estimation. The second line of work performs density ratio estimation through probabilistic classification. It assumes a source and target data generation model and transforms the density ratio estimation problem into a probabilistic classification problem by applying Bayes' rule [5,11]. In the third line of work, density ratio estimation is performed via density matching. This work first assumes a nonnegative density ratio model and then learns the model parameters by minimizing the KL divergence between the real target distribution $P_T(x)$ and the estimated target distribution [12,13]. In this way, the source distribution is reweighted to mimic the target distribution. The last line of work estimates the density ratio function by density ratio fitting, in which a density ratio model space is chosen in advance and the model parameters are learned by minimizing the mean square error between the true density ratio function and the assumed density ratio model [14–16].

Rather than developing new density ratio estimation techniques, some researchers focus on lowering the computational complexity of the existing techniques and scaling them to large


scale problems. For example, Wen et al. [17] applied the Frank-Wolfe algorithm to lower the computational complexity of KMM and the KL Importance Estimation Procedure [12]. In the ensemble KMM algorithm proposed by Miao et al. [18], the target instances are divided into smaller partitions and their local density ratio estimates are fused with a weighted sum. By contrast, Chandra et al. [19] obtained the density ratio estimates by taking fixed-size samples from the source and target data, performing independent computations on these samples, and then combining the results.

The estimate of the density ratio $P_T(x)/P_S(x)$ can be exploited to reweight the training loss and obtain an unbiased target model. In particular, based on the trick $P_T(x) = P_S(x)\,\frac{P_T(x)}{P_S(x)}$, traditional probabilistic and non-probabilistic learning frameworks can be adapted to the covariate shift scenario. In the probabilistic framework, a target domain regression model is learned by assuming that $Y \mid X = x$ obeys a Gaussian distribution and minimizing the KL divergence between conditional distributions [7]. Similarly, the Importance Weighted Least Square Probabilistic Classifier (IWLSPC) [20] is obtained by assuming that $Y \mid X = x$ obeys a multinomial distribution and minimizing the L2 probability distribution distance [21]. In the non-probabilistic framework, based on the traditional empirical risk minimization learning principle [22], the Importance Weighted Empirical Risk Minimization (IWERM) framework [6] is obtained by multiplying the loss function by $P_T(x)/P_S(x)$. Under this framework, Khalighi et al. [23] proposed the Importance Weighted Import Vector Machine (IWIVM), which minimizes the weighted logistic loss at the source samples and applies a subset search technique to find a sparse target model. Yamada et al. [1] proposed Importance Weighted Kernel Regression (IWKR) to correct the covariate shift in 3D human pose estimation.

In essence, these reweighting methods include two steps: (i) estimate the density ratio weights for the source instances, and (ii) solve a weighted variant of the traditional supervised learning problem. In addition to these two-step methods, there are other methods that learn the target model and the density ratio weights simultaneously in one step, such as KMM-LM [24], the Selective Transfer Machine (STM) [2] and Distribution Matching Machines (DMM) [25]. Considering that the reweighting technique may have high variance, Reddi et al. [26] utilized the unweighted model as a prior to reduce the estimator variance.
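The two-step recipe can be illustrated with a small sketch. Assuming the weights from step (i) are already available, step (ii) for a one-dimensional linear model with the square loss reduces to weighted least squares. The function name `weighted_line_fit` and the toy data are illustrative assumptions, not taken from the paper.

```python
def weighted_line_fit(xs, ys, weights):
    """Fit y ~ a + b*x by minimizing sum_i w_i * (y_i - a - b*x_i)**2."""
    sw = sum(weights)
    sx = sum(w * x for w, x in zip(weights, xs))
    sy = sum(w * y for w, y in zip(weights, ys))
    sxx = sum(w * x * x for w, x in zip(weights, xs))
    sxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    det = sw * sxx - sx * sx
    if det == 0:
        raise ValueError("need at least two distinct inputs with positive weight")
    # Solution of the 2x2 weighted normal equations.
    b = (sw * sxy - sx * sy) / det
    a = (sy - b * sx) / sw
    return a, b


# Toy data (hypothetical): the last point lies far off the line y = x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 10.0]

uniform_fit = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, 1.0])
# Giving the last point zero weight removes its influence on the fit.
reweighted_fit = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, 0.0])
```

The weights alone decide which examples the fitted model follows, which is why a large weight on an example with a large loss can dominate the result.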
Instead of estimating the density ratio weights for reweighting the training loss, some methods tackle the covariate shift problem by avoiding weight estimation and building models that are robust to potential shifts in the input or conditional output distribution [27,28]. Besides the reweighting techniques, there are also other methods for correcting the distribution mismatch between source and target domains, which usually go by the name of domain adaptation [29,30] rather than covariate shift adaptation. Domain adaptation generally assumes that the target joint distribution $P_T(x, y)$ is different from the source joint distribution $P_S(x, y)$ [29]. If the difference comes only from the distribution of the input variable, that is, $P_S(x) \neq P_T(x)$ while $P(y \mid x)$ is shared by both domains, then domain adaptation becomes covariate shift adaptation. Domain adaptation methods usually perform feature transformation to mitigate the distribution difference. In particular, most of them explicitly minimize a distribution distance such as the Maximum Mean Discrepancy (MMD) [31] between transformed source and target samples [32–35], or align the source and target covariance matrices to minimize the domain shift [36]. These methods mainly focus on minimizing the distance between marginal distributions. More recently, Joint Distribution Optimal Transport (JDOT) [37] was proposed, which minimizes the optimal transport loss between the source joint distribution and an estimated target joint distribution. In this work, we focus on studying the covariate shift problem. In particular, we follow the reweighting approach since it is

theoretically reliable [7]. From the above discussion, we can see that the two-step reweighting approach is simple but flawed. Because it fixes the density ratio weights before learning the final hypothesis, some examples with large training loss may also be endowed with large weights, which leads to an undesirable hypothesis. Although the one-step reweighting approach seems to avoid this problem, it cannot perform Importance Weighted Cross Validation (IWCV) [38], which is a sound hyperparameter selection strategy under covariate shift. The reason is that it relies on KMM for estimating the density ratio weights, and KMM cannot generate weights for unseen training examples.

To overcome the drawbacks of these two reweighting approaches, an Adaptively Weighted Structural Risk Minimization (AWSRM) framework is proposed for covariate shift adaptation. In the proposed framework, the raw density ratio weights are first estimated by any technique from the four lines of density ratio estimation work mentioned above, and then the tailored weights and the target model are obtained simultaneously by solving an optimization problem that jointly minimizes the weighted structural risk and the square loss between the tailored weights and the raw weights. Like the two-step IWERM, the proposed framework is non-probabilistic. It is more flexible than IWERM, however, because it can adjust the tailored weights adaptively to prevent large weights on the source samples that carry large training loss. Compared with the one-step reweighting approach, the proposed framework has the following advantages: (i) it does not depend on KMM alone, since any technique from the four lines of density ratio estimation work can be embedded into it to provide the raw weights, and (ii) IWCV can be performed by using the weights produced by that density ratio estimation technique.

The rest of the paper is organized as follows. We formalize the problem setting in Section 2. After that, we present our framework in Section 3. In Section 4, we design the learning algorithms for the regression and classification models under our framework. In Section 5, we experimentally compare the performance of our approach with existing methods on both synthetic and real-world data sets. Finally, we summarize our contributions in Section 6.

2. Problem formulation

Let $x \in \mathcal{X} \subseteq \mathbb{R}^d$ be an input variable and $y \in \mathcal{Y}$ be an output variable. $\mathcal{Y}$ is the real space $\mathbb{R}$ for a regression task or a set of discrete labels $\{1, 2, \ldots, C\}$ for a classification task. Suppose we are given a set of $m_S$ source examples $D_S = \{(x_i^S, y_i^S)\}_{i=1}^{m_S}$ and a set of $m_T$ unlabeled target instances $D_T = \{x_j^T\}_{j=1}^{m_T}$. The unlabeled source instances $\{x_i^S\}_{i=1}^{m_S}$ are first drawn independently from the source distribution $P_S(x)$, and then labeled by a conditional distribution of output given input $P(y \mid x)$. Alternatively, we can view $D_S$ as a data set generated by the joint probability distribution $P_S(x, y) = P_S(x) P(y \mid x)$ in the source domain. The unlabeled target instances $\{x_j^T\}_{j=1}^{m_T}$ are drawn independently from the target distribution $P_T(x)$, which is different from $P_S(x)$. The goal is to predict the labels $y_1^T, y_2^T, \ldots, y_{m_T}^T$ for the unlabeled target data set $\{x_j^T\}_{j=1}^{m_T}$. We assume that $y_1^T, y_2^T, \ldots, y_{m_T}^T$ are still drawn from $P(y \mid x)$ but are not given in advance. This setup indicates that the conditional probability distribution $P(y \mid x)$ remains unchanged in the source and target domains, but the marginal distribution varies from $P_S(x)$ to $P_T(x)$. Formally, learning under this covariate shift setting is often called covariate shift adaptation [6], or transductive transfer learning [39].

As a popular non-probabilistic framework for covariate shift adaptation, IWERM corrects the covariate shift by reweighting the loss at each training sample. Let $\mathcal{H}$ be a hypothesis space and $\ell : \mathcal{Y} \times \mathbb{R} \to [0, +\infty)$ a loss function. IWERM gives the

following optimization problem:

$$\min_{h \in \mathcal{H}} \; \frac{1}{m_S} \sum_{i=1}^{m_S} \frac{P_T(x_i^S)}{P_S(x_i^S)}\, \ell\big(y_i^S, h(x_i^S)\big), \tag{1}$$

where the density ratio weight $P_T(x)/P_S(x)$ for the loss at a single point $(x^S, y^S)$ is also called the importance. To avoid overfitting, a regularization term $\Omega(h)$ is added to optimization problem (1), which gives the importance weighted structural risk minimization (IWSRM) framework:

$$\min_{h \in \mathcal{H}} \; \frac{1}{m_S} \sum_{i=1}^{m_S} \frac{P_T(x_i^S)}{P_S(x_i^S)}\, \ell\big(y_i^S, h(x_i^S)\big) + \lambda \Omega(h), \tag{2}$$

where $\lambda$ is a regularization parameter. For a regression task, the square loss $\ell(y, h(x)) = (y - h(x))^2$ can be used; for a binary classification task where $\mathcal{Y} = \{-1, +1\}$, consistent surrogate loss functions such as the exponential loss, logistic loss and hinge loss are often adopted, and the loss function can be rewritten as $\ell(y, h(x)) = \phi(y h(x))$. The sign function is then applied to $h(x)$, as $\mathrm{sign}(h(x))$, to achieve binary classification.

3. Adaptively weighted structural risk minimization

In this section, we develop the AWSRM framework for covariate shift adaptation. For the target domain with joint probability distribution $P_T(x, y)$, we want to search for a final hypothesis $h^*$ in the hypothesis space $\mathcal{H}$ so that the generalization error of this hypothesis is minimized:

$$h^* = \operatorname*{argmin}_{h \in \mathcal{H}} \int_{(x,y)} \ell(y, h(x))\, P_T(x, y)\, dx\, dy. \tag{3}$$

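Under the covariate shift assumption, Eq. (3) can be rewritten as

$$\begin{aligned}
h^* &= \operatorname*{argmin}_{h \in \mathcal{H}} \int_{(x,y)} \ell(y, h(x))\, P_T(x, y)\, dx\, dy \\
    &= \operatorname*{argmin}_{h \in \mathcal{H}} \int_{(x,y)} \ell(y, h(x))\, \frac{P_T(x)}{P_S(x)}\, P_S(x, y)\, dx\, dy.
\end{aligned} \tag{4}$$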
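The step from Eq. (3) to Eq. (4) uses only the factorization of the joint densities and the shared conditional $P(y \mid x)$. Spelled out, and assuming $P_S(x) > 0$ wherever $P_T(x) > 0$ (a support condition that the rewriting needs but the text leaves implicit):

```latex
\begin{aligned}
P_T(x, y) &= P_T(x)\, P(y \mid x) \\
          &= \frac{P_T(x)}{P_S(x)}\, P_S(x)\, P(y \mid x) \\
          &= \frac{P_T(x)}{P_S(x)}\, P_S(x, y).
\end{aligned}
```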

For the unknown density ratio function PT (x)/PS (x), we first use a non-negative function r(x) to approximate it, and then solve the following joint optimization problem of h and r to obtain a final hypothesis h∗ for the target domain and a proxy r∗ for the true density ratio function:

$$(h^*, r^*) = \operatorname*{argmin}_{h, r} \int_{(x,y)} \ell(y, h(x))\, r(x)\, P_S(x, y)\, dx\, dy + \int_{x} \Big( r(x) - \frac{P_T(x)}{P_S(x)} \Big)^2 P_S(x)\, dx. \tag{5}$$

When the model space is chosen correctly, $r(x)$ can approximate $P_T(x)/P_S(x)$ arbitrarily closely. Under such circumstances, Eqs. (4) and (5) share the same final hypothesis for the target domain. Using the empirical mean to approximate the integrals in Eq. (5), we propose the following optimization problem:

$$\begin{aligned}
\min_{h, r} \;& \frac{1}{m_S} \sum_{i=1}^{m_S} r_i\, \ell\big(y_i^S, h(x_i^S)\big) + \lambda \Omega(h) + \frac{C}{m_S} \sum_{i=1}^{m_S} \Big( r_i - \frac{P_T(x_i^S)}{P_S(x_i^S)} \Big)^2 \\
\text{s.t.} \;& r_i \ge 0, \quad i = 1, 2, \ldots, m_S,
\end{aligned} \tag{6}$$

where $r = (r_1, \ldots, r_{m_S})^T$ is a non-negative vector whose entries are called the tailored density ratio weights, $\lambda \Omega(h)$ is an additional regularization term, and $C$ is a tradeoff parameter between the weighted structural risk and the mean square error. In this empirical approximation, we reduce the function $r(x)$ to a set of weights for the source examples, which makes the optimization problem easier to solve. Since the true density ratio weights $\{P_T(x_i^S)/P_S(x_i^S)\}_{i=1}^{m_S}$ are unknown, we propose to replace them with the weights obtained from a readily available density ratio estimation technique. Let $\beta_i$ ($1 \le i \le m_S$) denote the raw density ratio weights of the source examples. The proposed AWSRM framework takes the following form:

$$\begin{aligned}
\min_{h, r} \;& \frac{1}{m_S} \sum_{i=1}^{m_S} r_i\, \ell\big(y_i^S, h(x_i^S)\big) + \lambda \Omega(h) + \frac{C}{m_S} \sum_{i=1}^{m_S} (r_i - \beta_i)^2 \\
\text{s.t.} \;& r_i \ge 0, \quad i = 1, 2, \ldots, m_S.
\end{aligned} \tag{7}$$

This framework is flexible for covariate shift adaptation because it accepts any combination of loss function and density ratio estimation technique. Using optimization problem (7), we can adjust every density ratio weight starting from its raw weight $\beta_i$. On the one hand, the third term in optimization problem (7) requires the tailored weight to stay close to the raw weight so that their discrepancy is small. However, because of the first term in (7), a larger loss will probably lead to a smaller tailored weight (even as small as zero) whenever reducing the weighted loss decreases the whole objective function more than keeping the weight discrepancy small does. This effectively reduces the tailored weights of bad training samples. We demonstrate this weight refinement effect in a synthetic regression example in Section 5.1. On the other hand, most tailored weights will not be too far away from their raw weights, since the weight discrepancy between them is minimized. In short, since AWSRM simultaneously considers the training loss and the density ratio weights, it is able to make use of the high quality source examples for learning a better target model. This characteristic is an advantage over the IWSRM framework.

For hyperparameter selection, we use the raw weights to weight the validation error so that we can perform the unbiased IWCV procedure. This is also an advantage over the KMM based one-step joint optimization methods [2,24,25], which can only perform the biased cross validation under covariate shift.

For real-world applications, we should instantiate our abstract framework. Firstly, we choose our hypothesis space $\mathcal{H}$. Generally speaking, $\mathcal{H}$ can be a linear model space, an RKHS, a neural network hypothesis space and so on. In this paper, we choose two different hypothesis spaces. One is the linear model space $\mathcal{H} = \{h(x; \theta) = \theta^T x \mid \theta \in \mathbb{R}^{d+1}\}$. For convenience, we denote $x$ as $x = (1, x_1, \ldots, x_d)^T$ in this linear model space to account for the offset. The other one is the radial basis function (RBF) network [40] space $\mathcal{H} = \{h(x; \alpha) = \sum_{j=1}^{b} \alpha_j k(x, c_j) \mid \alpha \in \mathbb{R}^b\}$, where $k(x, c_j)$ is the $j$th radial basis function centered at a vector $c_j$. We use the Gaussian kernel function $k(x, c_j) = \exp\big(-\frac{\|x - c_j\|^2}{2\sigma^2}\big)$ as the radial basis function here. Secondly, we choose our loss functions for regression and classification tasks, respectively. For the regression task, we choose the square loss $\ell(y, h(x)) = (y - h(x))^2$. For the classification task, we choose the square hinge loss $\ell(y, h(x)) = \max(0, 1 - y h(x))^2$.

4. Learning algorithms for regression and classification

Notice that an RBF network and a linear model share a common characteristic: each is linear in its parameters. Therefore, this section focuses on learning the parameters of an RBF network $h(x; \alpha) = \sum_{j=1}^{b} \alpha_j k(x, c_j)$. Plugging the parametric form of $h(x)$ into optimization problem (7) and choosing the L2 regularizer, we obtain the following optimization problem:

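$$\begin{aligned}
\min_{\alpha, r} \;& \frac{1}{m_S} \sum_{i=1}^{m_S} r_i\, \ell\big(y_i^S, h(x_i^S; \alpha)\big) + \lambda \|\alpha\|_2^2 + \frac{C}{m_S} \sum_{i=1}^{m_S} (r_i - \beta_i)^2 \\
\text{s.t.} \;& r_i \ge 0, \quad i = 1, 2, \ldots, m_S.
\end{aligned} \tag{8}$$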
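To make the ingredients of problem (8) concrete, the following sketch builds the Gaussian RBF design matrix and evaluates the objective for a given pair $(\alpha, r)$. It is a plain-Python illustration under stated assumptions: the helper names `gaussian_kernel`, `design_matrix` and `objective` and the toy numbers are hypothetical and do not come from the paper.

```python
import math


def gaussian_kernel(x, c, sigma):
    """k(x, c) = exp(-||x - c||^2 / (2 * sigma^2)) for vectors given as sequences."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def design_matrix(X, centers, sigma):
    """K[i][j] = k(x_i, c_j), so that h(x_i; alpha) = sum_j K[i][j] * alpha[j]."""
    return [[gaussian_kernel(x, c, sigma) for c in centers] for x in X]


def objective(K, y, alpha, r, beta, lam, C, loss):
    """Value of the objective in problem (8) for a pointwise loss(y_i, h_i)."""
    m = len(y)
    preds = [sum(k * a for k, a in zip(row, alpha)) for row in K]
    weighted_loss = sum(ri * loss(yi, hi) for ri, yi, hi in zip(r, y, preds)) / m
    ridge = lam * sum(a * a for a in alpha)
    discrepancy = C * sum((ri - bi) ** 2 for ri, bi in zip(r, beta)) / m
    return weighted_loss + ridge + discrepancy


def square_loss(y, h):
    return (y - h) ** 2


# Toy usage (hypothetical numbers): two 1-D inputs, one center at the origin.
K = design_matrix([[0.0], [1.0]], [[0.0]], sigma=1.0)
J = objective(K, y=[1.0, 2.0], alpha=[0.0], r=[1.0, 1.0],
              beta=[1.0, 1.0], lam=0.1, C=1.0, loss=square_loss)
```

With `alpha` set to zero and `r` equal to `beta`, only the weighted loss term contributes, which gives a quick sanity check of the three terms.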

Algorithm 1 ACS for learning regression model.
Input: Data set $D_S = \{(x_i^S, y_i^S)\}_{i=1}^{m_S}$, raw weights $\{\beta_i\}_{i=1}^{m_S}$, centers $\{c_j\}_{j=1}^{b}$ for the RBF network, hyperparameters $\sigma$, $\lambda$ and $C$;
Output: Labels for each instance in $D_T$;
1 Randomly initialize $(\alpha, r)$;
2 Repeat until convergence
3     Fix $r$, update $\alpha$ by Eq. (10);
4     Fix $\alpha$, update $r$ by Eq. (11);
5 end
6 Return the learned parameter $\alpha^*$ for the RBF network;
7 Label each target instance $x_i^T$ as $h(x_i^T; \alpha^*) = \sum_{j=1}^{b} \alpha_j^* k(x_i^T, c_j)$
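Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation. The $\alpha$-step solves the linear system of Eq. (10). The $r$-step uses the fact that the quadratic in Eq. (11) is separable across coordinates, so each $r_i$ has the closed form $\max(0, \beta_i - \ell_i / (2C))$; this shortcut is an observation of this sketch, whereas the paper refers to gradient projection or interior-point solvers. The sketch also starts from the raw weights instead of a random point, and it assumes $\lambda > 0$ so that the linear system is nonsingular. All function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def update_alpha(K, y, r, lam):
    """Eq. (10): alpha = (lam * m * I + K^T R K)^{-1} K^T R y, with lam > 0."""
    m, b = len(K), len(K[0])
    A = [[sum(r[i] * K[i][p] * K[i][q] for i in range(m))
          + (lam * m if p == q else 0.0) for q in range(b)] for p in range(b)]
    rhs = [sum(r[i] * K[i][p] * y[i] for i in range(m)) for p in range(b)]
    return solve_linear(A, rhs)


def update_r(losses, beta, C):
    """Eq. (11) solved coordinate-wise: r_i = max(0, beta_i - loss_i / (2C))."""
    return [max(0.0, bi - li / (2.0 * C)) for li, bi in zip(losses, beta)]


def acs_regression(K, y, beta, lam, C, n_iter=50, tol=1e-10):
    """Alternate the two convex updates until the weights stop changing."""
    r = list(beta)  # assumption: initialize at the raw weights
    alpha = [0.0] * len(K[0])
    for _ in range(n_iter):
        alpha = update_alpha(K, y, r, lam)
        preds = [sum(k * a for k, a in zip(row, alpha)) for row in K]
        losses = [(yi - hi) ** 2 for yi, hi in zip(y, preds)]
        new_r = update_r(losses, beta, C)
        converged = max(abs(a - b) for a, b in zip(new_r, r)) < tol
        r = new_r
        if converged:
            break
    return alpha, r


# Toy usage (hypothetical data): linear model with an offset column; the
# last point lies far off the line y = x and all raw weights equal 1.
K = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.0, 1.0, 2.0, 10.0]
alpha, r = acs_regression(K, y, beta=[1.0, 1.0, 1.0, 1.0], lam=1e-6, C=0.5)
```

In this toy run the tailored weight of the outlying point is driven to zero while the other weights stay near their raw values, so the fit follows the three well-behaved points.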

The square loss and the square hinge loss are convex functions of the parameter $\alpha = (\alpha_1, \ldots, \alpha_b)^T$ by the affine composition calculus rule [41]. Denote the objective function in optimization problem (8) as $J(\alpha, r)$. It is easy to check that $J(\alpha, r)$ is a convex function of $\alpha$ when $r$ is fixed and a convex function of $r$ when $\alpha$ is fixed. Hence, it is a biconvex function. Furthermore, the constraint set is also a biconvex set. We conclude that (8) is a biconvex optimization problem. The Alternate Convex Search (ACS) algorithm [42] can be utilized to solve it. Basically, ACS works by alternately updating $\alpha$ and $r$, fixing one of them at a time and solving the corresponding convex optimization problem. It monotonically decreases the objective function and finally converges to a stationary point.

4.1. Learning algorithm for the regression model

We plug the square loss into optimization problem (8) and solve the following optimization problem for learning a regression model:

$$\begin{aligned}
\min_{\alpha, r} \;& \frac{1}{m_S} \sum_{i=1}^{m_S} r_i \big(y_i^S - h(x_i^S; \alpha)\big)^2 + \lambda \|\alpha\|_2^2 + \frac{C}{m_S} \sum_{i=1}^{m_S} (r_i - \beta_i)^2 \\
\text{s.t.} \;& r_i \ge 0, \quad i = 1, 2, \ldots, m_S.
\end{aligned} \tag{9}$$

By ACS, in the first step, we fix $r$ and minimize the objective function over $\alpha$. This gives a closed-form solution:

$$\alpha^* = (\lambda m_S I + K^T R K)^{-1} K^T R y, \tag{10}$$

where $I$ is the identity matrix of size $b \times b$, $[K]_{ij} = k(x_i^S, c_j)$, $R = \mathrm{diag}(r_1, \ldots, r_{m_S})$ and $y = (y_1^S, \ldots, y_{m_S}^S)^T$. In the second step, we fix $\alpha$ and solve the constrained quadratic optimization problem for $r$,

$$r^* = \operatorname*{argmin}_{r \ge 0} \; C\, r^T r + (\ell - 2C\beta)^T r, \tag{11}$$

where $\beta = (\beta_1, \ldots, \beta_{m_S})^T$, $0 = (0, \ldots, 0)^T$ and $\ell = (\ell_1, \ldots, \ell_{m_S})^T$ with $\ell_i = (y_i^S - h(x_i^S; \alpha))^2$. This optimization problem can be easily solved by the gradient projection method or interior-point methods [43]. We summarize the ACS for learning a regression model in Algorithm 1. We update $\alpha$ first and then $r$ in Algorithm 1. As noted for the original ACS algorithm, the order of updating $\alpha$ and $r$ can be permuted without affecting the result [42].

In the following, we analyze the computational complexity of Algorithm 1. Let $T$ be the number of ACS iterations. In each iteration, it takes $O(b (m_S)^2)$ to solve Eq. (10) if $b < m_S$, and $O((m_S)^3)$ to solve Eq. (11). Therefore, the overall computational complexity of Algorithm 1 is $O((m_S)^3)$.

4.2. Learning algorithm for the classification model

The derivation for learning a classification model is analogous to that of a regression model, and the main difference is the loss function. Using the square hinge loss, we solve the following optimization problem for learning a classification model:

$$\begin{aligned}
\min_{\alpha, r} \;& \frac{1}{m_S} \sum_{i=1}^{m_S} r_i \max\big(0, 1 - y_i^S h(x_i^S; \alpha)\big)^2 + \lambda \|\alpha\|_2^2 + \frac{C}{m_S} \sum_{i=1}^{m_S} (r_i - \beta_i)^2 \\
\text{s.t.} \;& r_i \ge 0, \quad i = 1, 2, \ldots, m_S.
\end{aligned} \tag{12}$$

Similarly, we solve this biconvex optimization problem via ACS. Firstly, we fix $r$ and minimize the objective function over $\alpha$. We resort to the batch gradient descent algorithm for learning the parameter $\alpha$. The update rule for $\alpha$ is

$$\alpha \leftarrow \alpha - \eta\, (\lambda m_S \alpha + K^T v), \tag{13}$$

where $\eta$ is the learning rate, $v = (v_1, \ldots, v_{m_S})^T$ with $v_i = -r_i y_i^S \max(0, 1 - y_i^S h(x_i^S; \alpha))$, and $K$ is the same as in the regression case. Secondly, we fix $\alpha$ and find the optimal value for $r$ by Eq. (11), but the loss for each source example should be modified as $\ell_i = \max(0, 1 - y_i^S h(x_i^S; \alpha))^2$ in this classification case. We summarize the ACS for learning a classification model in Algorithm 2. Since the computational complexity of gradient descent is $O(T_1 b m_S)$, where $T_1$ is the number of gradient descent iterations, and the computational complexity of solving Eq. (11) is $O((m_S)^3)$, the overall computational complexity of Algorithm 2 is $O((m_S)^3)$.

5. Experiments

In this section, we compare the experimental performance of the proposed framework and existing approaches on both synthetic and real-world data sets. We use a linear model space for the synthetic data experiments and an RBF network space for the real-world data experiments.

5.1. Artificial data sets

We consider both regression and classification tasks on synthetic data sets here. The main objective of these two synthetic experiments is to show the power of the proposed framework in adaptively adjusting the tailored weights from the raw weights provided by different kinds of density ratio estimation techniques. These techniques include KMM [10], logistic regression (LogReg) [11], unconstrained least squares importance fitting (uLSIF) [14] and relative unconstrained least squares importance fitting (RuLSIF) [16]. For each density ratio estimation technique, we compare the performance of traditional supervised learning with uniform weights, the IWSRM framework with raw weights, and our AWSRM framework with tailored weights. The regularization parameter $\lambda$ is set to 0 for both IWSRM and AWSRM.

Algorithm 2 ACS for learning classification model.
Input: Data set $D_S = \{(x_i^S, y_i^S)\}_{i=1}^{m_S}$, raw weights $\{\beta_i\}_{i=1}^{m_S}$ from any density ratio estimation technique, centers $\{c_j\}_{j=1}^{b}$ for the RBF network, hyperparameters $\eta$, $\sigma$, $\lambda$ and $C$;
Output: Labels for each instance in $D_T$;
1 Randomly initialize $(\alpha, r)$;
2 Repeat until convergence
3     Fix $r$;
4     Repeat until convergence
5         Update $\alpha$ by Eq. (13);
6     end
7     Fix $\alpha$, update $r$ by Eq. (11);
8 end
9 Return the learned parameter $\alpha^*$ for the RBF network;
10 Label each target instance $x_i^T$ as $\mathrm{sign}(h(x_i^T; \alpha^*)) = \mathrm{sign}\big(\sum_{j=1}^{b} \alpha_j^* k(x_i^T, c_j)\big)$

In the regression task, we adopt a data generation process similar to the polynomial regression example in [10]. Specifically, we consider learning the true regression model $f(x) = -x + x^3$. Let the source and target distributions be $P_S(x) = N(x; 0.5, 0.5^2)$ and $P_T(x) = N(x; 0, 0.3^2)$, respectively, where $N(x; \mu, \sigma^2)$ denotes a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Unlabeled instances are drawn from $P_S(x)$ and $P_T(x)$ with the same size $m_S = m_T = 100$. We create the noisy source output values as $y = f(x) + \varepsilon$, where the noise $\varepsilon$ has density $P(\varepsilon) = N(\varepsilon; 0, 0.3)$ when $x \notin (0.1, 0.15)$ and $P(\varepsilon) = N(\varepsilon; 2, 0.5)$ when $x \in (0.1, 0.15)$. With this setting, a small proportion of source data points that are similar to the target data points will have large loss. Target domain output values are created as $y = f(x) + \varepsilon$ where $P(\varepsilon) = N(\varepsilon; 0, 0.3)$. Note that they are only used to measure the performance of the target model.

Fig. 1. The performance of adaptively weighted linear regression from the AWSRM framework on toy data. (a) Comparison of LR, WLR, and AWLR. The number in parentheses following each method is the mean square error of the corresponding method. Red circle points are the target data and blue square points are the source data. (b) The variation of the biconvex objective function values of AWLR.

Firstly, we show the power of our method and the behavior of the ACS algorithm in a single run, adopting only RuLSIF as the density ratio estimation technique. The hyperparameter $\alpha$ for RuLSIF is set to 0.5, as recommended in [16], throughout all the experiments. Since the linear model space is adopted in these synthetic cases, the three methods compared here are linear regression (an instance of traditional supervised learning with uniform weights), weighted linear regression (an instance of IWSRM with raw weights), and adaptively weighted linear regression (an instance of AWSRM with tailored weights). We abbreviate linear regression, weighted linear regression and adaptively weighted linear regression to LR, WLR and AWLR, respectively. Fig. 1(a) shows the performance of LR, WLR, and AWLR. Our method AWLR fits the target data points better than the other two methods and achieves the lowest mean square error.

In the following, we analyze how AWLR avoids putting large weights on the three bad source samples with large square loss. From Fig. 1(a), we can see that on the x axis, the three bad source samples are in the dense region of the target domain, which means that they are important to the target domain. The density ratio estimation technique generates large raw weights (1.32, 1.31 and 1.30) for them. WLR directly reweights these three bad source samples by their raw weights, so it does not produce a good regression model (square error 0.163). By contrast, AWLR refines the weights from the raw weights through the training loss. Starting from the raw weights of the three bad source samples (1.32, 1.31 and 1.30), in every iteration of the ACS algorithm the tailored weights are lowered so that the objective function in optimization problem (9) decreases. After 10 iterations, the tailored weights for the three bad source samples are all 0. Thus, they are finally neglected and a better regression model (square error 0.120) is learned from the remaining high quality source samples. Fig. 1(b) reports the variation of the biconvex objective function values of AWLR. We see that the ACS algorithm converges quickly (in approximately 10 iterations).

Secondly, we demonstrate the effectiveness of our method in tailoring the raw weights provided by KMM, LogReg, uLSIF and RuLSIF. Fig. 2(a) reports the average mean square error of the target regressors learned by linear regression, weighted linear regression and adaptively weighted linear regression over 100 trials. From Fig. 2(a), we find that weighted linear regression has an advantage over linear regression regardless of which density ratio estimation technique provides the raw weights, and adaptively weighted linear regression further lowers the test error of weighted linear regression.

For the classification task, we generate a binary classification data set for covariate shift adaptation according to [38]. We define the posterior probability distribution of y as

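$$P(y = +1 \mid x) = \frac{1 + \tanh\big(x_1 + \min(0, x_2)\big)}{2}, \tag{14}$$

where $x = (x_1, x_2)^T$ and $P(y = -1 \mid x) = 1 - P(y = +1 \mid x)$. The optimal decision hyperplane is the set of $x$ such that $P(y = -1 \mid x) = P(y = +1 \mid x)$.

Fig. 2. Performance comparison on different raw weights. (a) Average mean square errors of three linear regressors over 100 trials. (b) Average misclassification error rates of three linear classifiers over 50 trials.

Let the source distribution and target distribution be

$$P_S(x) = \frac{1}{2} N\!\left(x; \begin{pmatrix} -2 \\ 3 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}\right) + \frac{1}{2} N\!\left(x; \begin{pmatrix} 2 \\ 3 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}\right), \tag{15}$$

$$P_T(x) = \frac{1}{2} N\!\left(x; \begin{pmatrix} 0 \\ -1 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right) + \frac{1}{2} N\!\left(x; \begin{pmatrix} 4 \\ -1 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right), \tag{16}$$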

where $N(x; \mu, \Sigma)$ is the multivariate Gaussian density with mean vector $\mu$ and covariance matrix $\Sigma$. We draw unlabeled instances from the source distribution $P_S(x)$ and the target distribution $P_T(x)$ with the same size $m_S = m_T = 1000$. The labels of the instances in both domains are drawn from $P(y|x)$, and the target data labels are used only for performance measurement. This is a typical example of covariate shift. Following the same protocol as in the regression task, we run the experiment 50 times and report in Fig. 2(b) the average misclassification error rates of the target classifiers learned by linear classification, weighted linear classification and adaptively weighted linear classification. The graph shows a tendency similar to Fig. 2(a): the adaptively weighted linear classification always works better than both the weighted linear classification and the linear classification.

5.2. Real world data sets

We employ publicly available regression and classification data sets from the LibSVM¹ and UCI² archives to evaluate our approach. We partition the source data and target data for each data set in two ways: synthetically created bias and naturally occurring bias. The number of examples and features, the amount of source and target data, and the type of bias for each data set are listed in Table 1. In these experiments, we adopt the density ratio estimation technique LogReg [11] to estimate the raw weights for AWSRM. The tradeoff parameter $C$ of AWSRM is chosen from $\{10^{-2}, 1, 10^{2}, 10^{4}, 10^{6}\}$. We compare AWSRM with the following state-of-the-art adaptation methods:

• Uniform (traditional supervised learning with uniform weights) is used as a baseline for the adaptation methods. It fits a model on the source data without any modification and predicts the target data labels.
• IWSRM(LogReg) minimizes the weighted structural risk to learn a classification or regression model. The density ratio weights are from LogReg.
• IWSRM(RuLSIF) minimizes the weighted structural risk to learn a classification or regression model. The density ratio weights are from RuLSIF [16].
• DoublyRobust [26] lowers the variance of the reweighting estimator by biased regularization. LogReg is adopted as the density ratio estimation technique here.
• STM [2] simultaneously learns the KMM density ratio weights and the target model. The tradeoff parameter that balances the risk and the distribution mismatch is chosen from $\{10^{-3}, 10^{-1}, 1, 10^{3}, 10^{5}\}$.
• CORAL [36] aligns the source and target covariance matrices to minimize the domain shift. A model is trained on the transformed source data and applied to label the target data.
• JDOT [37] minimizes the optimal transport loss between the source joint distribution and the estimated target joint distribution to directly learn the target model. The parameter $\alpha$ that balances the metric in the feature space and the loss is searched among $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$. The number of iterations of the block coordinate descent optimization algorithm is fixed at 10.

¹ LibSVM data sets: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
² UCI data sets: http://archive.ics.uci.edu/ml/datasets.html.
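Several of the methods listed above (IWSRM(LogReg), DoublyRobust, and the raw weights of AWSRM) rely on density ratio weights obtained from a logistic-regression domain classifier. The following is a minimal sketch of that general idea, not the implementation of [11]: a classifier is fitted to separate source from target instances, and the ratio $P_T(x)/P_S(x)$ is recovered as $(m_S/m_T)\exp(\theta^\top x + b)$. The function names, the plain batch gradient descent, the learning rate and the epoch count are all illustrative assumptions.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_domain_classifier(source, target, lr=0.5, epochs=300):
    """Fit P(domain = target | x) = sigmoid(theta . x + b) by batch gradient descent."""
    data = [(x, 0.0) for x in source] + [(x, 1.0) for x in target]
    dim = len(source[0])
    theta, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_theta, grad_b = [0.0] * dim, 0.0
        for x, label in data:
            err = sigmoid(sum(t * xi for t, xi in zip(theta, x)) + b) - label
            for j in range(dim):
                grad_theta[j] += err * x[j]
            grad_b += err
        theta = [t - lr * g / len(data) for t, g in zip(theta, grad_theta)]
        b -= lr * grad_b / len(data)
    return theta, b


def raw_density_ratio_weights(source, target):
    """Estimate w(x) = P_T(x) / P_S(x) at every source instance."""
    theta, b = fit_domain_classifier(source, target)
    prior_ratio = len(source) / len(target)  # corrects for unequal sample sizes
    return [
        prior_ratio * math.exp(sum(t * xi for t, xi in zip(theta, x)) + b)
        for x in source
    ]


# Toy check: source ~ N(0, 1), target ~ N(1, 1), so the true ratio grows with x.
rng = random.Random(0)
source = [[rng.gauss(0.0, 1.0)] for _ in range(300)]
target = [[rng.gauss(1.0, 1.0)] for _ in range(300)]
weights = raw_density_ratio_weights(source, target)
```

Weights of this kind would then enter the weighted structural risk directly (IWSRM) or serve as the raw weights that AWSRM tailors.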

For all the adaptation methods, the RBF network is chosen as the hypothesis space. The RBF kernel width $\sigma$ is searched from $\{0.1, 0.5, 1, 2, 5\}$ and the regularization parameter $\lambda$ from $\{10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}\}$. The optimal hyperparameters of AWSRM, IWSRM(LogReg), IWSRM(RuLSIF) and DoublyRobust are selected by IWCV [38]. For STM, CORAL, and JDOT, we use traditional cross validation for hyperparameter selection, which is biased under covariate shift.

5.2.1. Synthetically biased data

We first evaluate our method on data sets with artificially created bias between the source and target distributions. Specifically, we generate the source data and the target data from each data set $\{(x_k, y_k)\}_{k=1}^{m}$ following the strategy in [12]. We normalize all the features into [0, 1] and choose the target examples $\{(x_j^T, y_j^T)\}_{j=1}^{m_T}$ from the pool $\{(x_k, y_k)\}_{k=1}^{m}$ as follows. First, we randomly choose one example $(x_k, y_k)$ from the pool and accept it with probability $\min(1, 4(x^c)^2)$, where $x^c$ is the $c$th element of $x_k$ and $c$ is randomly determined and fixed in each trial of the experiment. Second, we remove $x_k$ from the pool regardless of whether it was accepted or rejected. Third, we repeat the above procedure until $m_T$ examples are accepted. Next, we choose the source samples $\{(x_i^S, y_i^S)\}_{i=1}^{m_S}$ uniformly from the remaining samples. Note that we only use $\{(x_i^S, y_i^S)\}_{i=1}^{m_S}$ and $\{x_j^T\}_{j=1}^{m_T}$ for training regressors or classifiers, and the target domain output values $\{y_j^T\}_{j=1}^{m_T}$ are only used for evaluating the


Table 1. Data sets and their experimental settings for empirical evaluation (data sets marked with ∗ are for the regression task).

| Data set | #Examples | #Feature | #Source | #Target | Bias setting |
|---|---|---|---|---|---|
| Abalone∗ | 4177 | 8 | 300 | 500 | Synthetic |
| Combined∗ | 9568 | 4 | 300 | 500 | Synthetic |
| Space_ga∗ | 3107 | 6 | 300 | 500 | Synthetic |
| Cpusmall∗ | 8192 | 12 | 500 | 600 | Synthetic |
| Housing∗ | 506 | 13 | 100 | 200 | Synthetic |
| Mg∗ | 1385 | 6 | 300 | 400 | Synthetic |
| Mpg∗ | 392 | 7 | 100 | 100 | Synthetic |
| Yacht∗ | 308 | 6 | 100 | 150 | Synthetic |
| Concrete∗ | 1030 | 8 | 300 | 500 | Synthetic |
| Australian | 690 | 14 | 200 | 300 | Synthetic |
| German_numer | 1000 | 24 | 300 | 300 | Synthetic |
| Heart | 270 | 13 | 100 | 100 | Synthetic |
| Phishing | 11,055 | 68 | 500 | 500 | Synthetic |
| Svmguide1 | 3089 | 4 | 300 | 500 | Synthetic |
| Svmguide3 | 1243 | 22 | 300 | 500 | Synthetic |
| Skin_nonskin | 245,057 | 3 | 500 | 500 | Synthetic |
| Diabetes | 768 | 8 | 200 | 400 | Synthetic |
| Breast_cancer | 683 | 10 | 100 | 100 | Synthetic |
| Parkinsons∗ | 5875 | 16 | 1578 | 4297 | Different age |
| WineQuality | 6497 | 11 | 1599 | 4898 | Different color |
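The "Synthetic" bias setting in Table 1 refers to the accept/reject selection of target examples described in Section 5.2.1. A minimal sketch of that selection is given below, assuming the features are already scaled to [0, 1]. The function name `split_with_synthetic_bias`, the list-of-tuples data layout and the uniform toy pool are illustrative assumptions, not the authors' code.

```python
import random


def split_with_synthetic_bias(pool, m_target, m_source, c, rng):
    """Split (x, y) pairs into source and target sets with a covariate shift.

    A candidate is accepted into the target set with probability
    min(1, 4 * x[c] ** 2); every drawn candidate leaves the pool, accepted or
    not. The source set is sampled uniformly from the candidates never drawn.
    """
    candidates = list(pool)
    rng.shuffle(candidates)  # popping a shuffled list = uniform draws without replacement
    target = []
    while len(target) < m_target:
        if not candidates:
            raise ValueError("pool exhausted before m_target examples were accepted")
        x, y = candidates.pop()  # removed from the pool regardless of the outcome
        if rng.random() < min(1.0, 4.0 * x[c] ** 2):
            target.append((x, y))
    if len(candidates) < m_source:
        raise ValueError("not enough remaining examples for the source set")
    source = rng.sample(candidates, m_source)
    return source, target


# Toy pool: two features in [0, 1]; the label slot holds an index for bookkeeping.
rng = random.Random(0)
pool = [((rng.random(), rng.random()), k) for k in range(2000)]
source, target = split_with_synthetic_bias(pool, m_target=300, m_source=300, c=0, rng=rng)
```

Because examples with a large value in feature `c` are accepted more often, the target set is shifted towards larger values of that feature, while the source set stays close to the pool distribution.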

Table 2. Target domain test error on 18 data sets for different adaptation methods over 10 trials (mean with variance in parentheses). All the error values are normalized so that the mean error of the baseline method (Uniform) is one. For each data set, the best mean error is shown in bold face.

| Data set | Uniform | IWSRM(LogReg) | IWSRM(RuLSIF) | DoublyRobust | STM | CORAL | JDOT | AWSRM |
|---|---|---|---|---|---|---|---|---|
| Abalone∗ | 1.00(0.03) | 1.05(0.08) | 0.95(0.03) | 0.93(0.09) | 0.97(0.03) | 1.00(0.03) | 0.98(0.31) | **0.86(0.01)** |
| Combined∗ | 1.00(0.02) | **0.70(0.01)** | 0.78(0.01) | 0.75(0.02) | 2.30(0.22) | 1.03(0.01) | 1.29(1.21) | **0.70(0.01)** |
| Space_ga∗ | 1.00(0.01) | **0.92(0.01)** | 0.99(0.01) | 1.10(0.04) | 1.01(0.01) | 1.16(0.04) | 1.29(0.04) | **0.92(0.01)** |
| Cpusmall∗ | 1.00(0.17) | 0.52(0.17) | 0.79(0.17) | 0.69(0.28) | 0.70(1.37) | 0.94(0.27) | 3.30(2.05) | **0.49(0.13)** |
| Housing∗ | 1.00(0.04) | 0.98(0.03) | 1.06(0.05) | 1.08(0.04) | 1.25(0.05) | 1.11(0.02) | 1.17(0.02) | **0.92(0.03)** |
| Mg∗ | 1.00(0.01) | 0.96(0.01) | 0.97(0.01) | **0.83(0.01)** | 1.04(0.02) | 0.88(0.01) | 1.07(0.02) | 0.93(0.01) |
| Mpg∗ | 1.00(0.07) | 0.63(0.11) | 0.70(0.12) | 0.81(0.03) | 0.81(0.21) | 1.00(0.02) | 2.24(0.85) | **0.60(0.09)** |
| Yacht∗ | 1.00(0.01) | 0.95(0.01) | 1.00(0.01) | 0.98(0.07) | 1.15(0.01) | 1.05(0.12) | 1.15(0.27) | **0.83(0.01)** |
| Concrete∗ | 1.00(0.01) | 0.94(0.01) | 1.01(0.01) | **0.92(0.01)** | 0.93(0.03) | 1.02(0.01) | 1.00(0.01) | 0.94(0.01) |
| Australian | 1.00(0.01) | 1.00(0.01) | 1.04(0.01) | 1.04(0.01) | 1.03(0.01) | 1.04(0.04) | 1.11(0.03) | **0.95(0.01)** |
| German_numer | 1.00(0.05) | 1.01(0.06) | 1.02(0.04) | 0.93(0.05) | 1.30(0.47) | **0.92(0.06)** | 1.33(0.03) | 0.94(0.04) |
| Heart | 1.00(0.02) | 0.91(0.03) | 0.93(0.02) | 0.94(0.01) | 0.96(0.03) | 0.89(0.01) | 1.01(0.04) | **0.88(0.02)** |
| Phishing | 1.00(0.10) | 0.80(0.01) | 0.93(0.05) | 0.84(0.02) | 0.87(0.02) | 0.86(0.02) | 0.90(0.12) | **0.78(0.01)** |
| Svmguide1 | 1.00(0.07) | 0.85(0.03) | 0.98(0.05) | **0.68(0.01)** | 1.06(0.03) | 0.78(0.01) | 0.95(0.42) | 0.85(0.03) |
| Svmguide3 | 1.00(0.01) | 0.98(0.01) | 1.00(0.01) | 0.98(0.01) | 1.02(0.01) | 1.04(0.01) | 1.23(0.01) | **0.96(0.01)** |
| Skin_nonskin | 1.00(0.06) | 0.94(0.07) | 1.02(0.04) | 1.10(0.15) | 0.91(0.66) | 1.52(0.69) | 2.18(1.13) | **0.90(0.04)** |
| Diabetes | 1.00(0.01) | 0.99(0.01) | 1.00(0.01) | 1.00(0.01) | 1.06(0.01) | 0.99(0.01) | 1.05(0.01) | **0.96(0.01)** |
| Breast_cancer | 1.00(0.19) | 0.86(0.18) | 0.82(0.25) | 1.01(0.34) | 2.21(3.35) | 0.90(0.23) | 3.03(2.88) | **0.67(0.17)** |

generalization performance. In this experiment, the centers of the RBF network are 100 randomly sampled instances from the target data. To evaluate the generalization performance, the square error is used for regression problems and the 0–1 misclassification error for classification problems. Each setup is repeated 10 times, and the average performance measures together with their variances are reported in Table 2. In particular, all the error values in each trial are divided by the mean error of Uniform, so that the mean error of Uniform is one. This operation is commonly adopted in covariate shift adaptation experiments [12,17,26].

From Table 2, we can see that AWSRM significantly outperforms Uniform, IWSRM(RuLSIF), STM and JDOT. On some data sets, we observe that AWSRM performs similarly to IWSRM(LogReg). The reason is that, on these data sets, the source samples that are important to the target domain do not carry large training loss, so all the tailored weights stay close to the raw weights, which consequently yields a target model similar to that of the non-adaptive IWSRM(LogReg).

In the following, we perform a statistical test to verify whether the performance of AWSRM is significantly better than that of IWSRM(LogReg), DoublyRobust and CORAL. In machine learning, the Wilcoxon signed-ranks test is commonly used to compare the performances of two learning machines on multiple data sets [44]. Therefore, we use it to perform three pairs of tests: AWSRM versus IWSRM(LogReg), AWSRM versus DoublyRobust, and AWSRM versus CORAL. Each pairwise test is carried out separately on the regression and the classification problems. The Wilcoxon signed-ranks test is a non-parametric statistical test that ranks the differences in performance of two learning machines over the data sets. The differences are ranked according to their absolute values: the smallest absolute value gets rank 1, the second smallest gets rank 2, and so on. In case of ties, average ranks are assigned. The statistic of the Wilcoxon signed-ranks test is [44]:

$$z(a, b) = \frac{T(a, b) - N(N + 1)/4}{\sqrt{N(N + 1)(2N + 1)/24}}, \tag{17}$$

where $T(a, b) = \min\{R^{+}(a, b), R^{-}(a, b)\}$ and $N$ is the number of data sets. $R^{+}(a, b)$ is the sum of ranks for the data sets on which learning machine $b$ outperforms learning machine $a$, and $R^{-}(a, b)$ is the sum of ranks for the opposite. They are defined as follows:

$$R^{+}(a, b) = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \tag{18}$$

Table 3. Comparison of mean square error for the regression task and misclassification error rate for the classification task. For each data set, the best result is shown in bold face.

| Data set | Uniform | IWSRM(LogReg) | IWSRM(RuLSIF) | DoublyRobust | STM | CORAL | JDOT | AWSRM |
|---|---|---|---|---|---|---|---|---|
| Parkinsons∗ | 130.906 | 129.069 | 154.142 | 129.981 | 139.012 | 135.655 | 140.125 | **125.255** |
| WineQuality | 0.334 | 0.336 | 0.353 | 0.345 | 0.333 | 0.342 | 0.351 | **0.319** |

Fig. 3. Hyperparameter sensitivity of AWSRM on the data sets Australian and Diabetes. (a)–(c) are mean error curves with standard deviations, showing the adaptation results for different values of $\sigma$, $\lambda$ and $C$, respectively. When one parameter is varied, the others are fixed at $\sigma = 1$, $\lambda = 10^{-2}$ and $C = 10^{5}$.

$$R^{-}(a, b) = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \tag{19}$$

where $d_i$ is the difference between the performance scores of the two learning machines on the $i$-th of the $N$ data sets, and $\operatorname{rank}(d_i)$ is the rank of $|d_i|$. Based on formulas (17)–(19), we compute $z(a, b)$ separately on the regression and the classification problems. In particular, we fix $b$ as AWSRM and set $a$ to IWSRM(LogReg), DoublyRobust, and CORAL in turn. For the regression problems, the $z$ values are −2.19, −2.07 and −2.54; for the classification problems, they are −2.66, −2.07 and −2.07. These values are all below the critical value −1.96, which shows that, at significance level 0.05, the performance of AWSRM is significantly better than that of IWSRM(LogReg), DoublyRobust and CORAL on both regression and classification problems.

5.2.2. Naturally biased data

To highlight the benefits of our method in practice, we conduct experiments under naturally occurring bias. The Parkinsons data set is a regression data set composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease, recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The goal is to predict the total UPDRS scores. We separate the data set into source and target data according to age range: the source data includes all examples of subjects whose age is below 59, while the target data includes the rest of the examples. The WineQuality data set contains information on red and white variants of the Portuguese "Vinho Verde" wine. The goal is to predict the wine quality score, whose discrete values range from 0 to 10. We label the instances with scores smaller than or equal to 5 as "bad wine" and the remaining instances as "good wine", which turns the task into binary classification. We treat the red wine examples as the source data and the white wine examples as the target data.
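The two natural splits described above can be sketched as follows. The record layout (dictionaries with `age`, `color`, `quality` and `features` keys), the ±1 label encoding and the tiny made-up records are assumptions for illustration only; the original data files may be organised differently.

```python
def split_parkinsons_by_age(records, age_threshold=59):
    """Source: subjects younger than the threshold; target: all other subjects."""
    source = [r for r in records if r["age"] < age_threshold]
    target = [r for r in records if r["age"] >= age_threshold]
    return source, target


def binarize_quality(score):
    """Map a 0-10 quality score to -1 ("bad wine", score <= 5) or +1 ("good wine")."""
    return -1 if score <= 5 else 1


def split_wine_by_color(records):
    """Source: red wines; target: white wines; labels are binarized quality scores."""
    source = [(r["features"], binarize_quality(r["quality"]))
              for r in records if r["color"] == "red"]
    target = [(r["features"], binarize_quality(r["quality"]))
              for r in records if r["color"] == "white"]
    return source, target


# Made-up records, only to show the expected shapes of the inputs and outputs.
wines = [
    {"color": "red", "quality": 5, "features": [7.4, 0.70]},
    {"color": "red", "quality": 7, "features": [7.3, 0.65]},
    {"color": "white", "quality": 6, "features": [7.0, 0.27]},
]
wine_source, wine_target = split_wine_by_color(wines)

subjects = [
    {"age": 58, "features": [0.1]},
    {"age": 59, "features": [0.2]},
    {"age": 72, "features": [0.3]},
]
park_source, park_target = split_parkinsons_by_age(subjects)
```

In both cases the target labels would be held out and used only for measuring the error reported in Table 3.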
A k-means RBF network hypothesis space is adopted in this part, whose 100 centers are obtained in advance by running the k-means clustering algorithm on the target data. The mean square error for the regression task and the misclassification error rate for the classification task are reported in Table 3. For the Parkinsons data set, AWSRM further improves the accuracy of IWSRM(LogReg), whose performance is almost the same as that of the baseline Uniform. The remaining methods, however, do not show a significant improvement over the baseline. For the WineQuality data set, AWSRM achieves the best performance among all the methods. Hence our method promises to be a valuable tool for covariate shift adaptation in real-world applications.

5.2.3. Hyperparameter sensitivity analysis

In this section, we study the hyperparameter sensitivity of AWSRM. Recall that AWSRM involves two hyperparameters: the regularization parameter $\lambda$ and the tradeoff parameter $C$. If we use a nonlinear model, say an RBF network, there is another hyperparameter, i.e., the Gaussian kernel width $\sigma$. Here we examine the sensitivity of the performance of AWSRM with respect to different choices of these three hyperparameters. In particular, following the experimental settings in Section 5.2.1, we conduct the experiments on two classification data sets, Australian and Diabetes, and illustrate the results in Fig. 3. From Fig. 3, we can see that, on the one hand, AWSRM is sensitive to the hyperparameters $\sigma$ and $\lambda$, and the optimal hyperparameter values differ significantly across data sets. On the other hand, AWSRM is insensitive to the hyperparameter $C$ when $C \geq 10^{1}$. As a general guideline for choosing hyperparameters, we suggest a large value for $C$. As for $\sigma$ and $\lambda$, we suggest using IWCV [38] to obtain their optimal values for each data set.

6. Conclusions

In this paper, the AWSRM framework is proposed for transductive learning under covariate shift. The framework takes raw weights from any density ratio estimation technique and learns the tailored weights jointly with minimizing the weighted structural risk. This is achieved by using the ACS algorithm to solve a biconvex optimization problem.
Owing to this joint optimization, the learning process of the hypothesis will not be misled by a few training examples with large loss. Under this framework, a regression learning model and a classification learning model are also presented. The experimental results on both artificial data sets and a range of real-world data sets show that, compared with the state-of-the-art covariate shift adaptation methods, the proposed method has higher accuracy for regression and classification tasks.

In this study, we have also conducted some experiments on data sets with more than 100 dimensions and found that the regression or classification model from the AWSRM framework does not perform very well. The reason may be that the raw weights from most density ratio estimation techniques are unreliable under such circumstances. In future work, we will tackle this issue. On the other hand, we have developed the AWSRM framework based on the assumption that there is only unlabeled data in the target domain. However, this framework can be further extended to the case where there is also some labeled data in the target domain. Finally, since deep neural networks have been successfully applied in a wide range of applications, an exciting future direction would be to extend our method to deep neural network models.

Acknowledgment

This work is partially supported by the National Natural Science Foundation of China (Grant Nos. 61273295, 61502175 and 11501219), the Guangdong Natural Science Funds (2015A030310298), and the Science and Technology Program of Guangzhou (201607010069).

References

[1] M. Yamada, L. Sigal, M. Raptis, Covariate shift adaptation for discriminative 3D pose estimation, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2) (2014) 235–247.
[2] W. Chu, F. Torre, J. Cohn, Selective transfer machine for personalized facial expression analysis, IEEE Trans. Pattern Anal. Mach. Intell. 39 (3) (2017) 529–545.
[3] G. Schweikert, C. Widmer, B. Schölkopf, G. Rätsch, An empirical analysis of domain adaptation algorithms for genomic sequence analysis, in: Proceedings of the 2008 Advances in Neural Information Processing Systems, 2008, pp. 1433–1440.
[4] H. Raza, H. Cecotti, Y. Li, G. Prasad, Learning with covariate shift-detection and adaptation in non-stationary environments: application to brain–computer interface, in: Proceedings of the 2015 International Joint Conference on Neural Networks, 2015, pp. 1–8.
[5] S. Bickel, T. Scheffer, Dirichlet-enhanced spam filtering based on biased samples, in: Proceedings of the 2006 Advances in Neural Information Processing Systems, 2006, pp. 161–168.
[6] M. Sugiyama, Learning under non-stationarity: covariate shift adaptation by importance weighting, in: J. Gentle, W. Härdle, Y. Mori (Eds.), Handbook of Computational Statistics, Springer, Berlin, 2012, pp. 927–952.
[7] H. Shimodaira, Improving predictive inference under covariate shift by weighting the loglikelihood function, J. Stat. Plan. Inference 90 (2) (2000) 227–244.
[8] B. Zadrozny, Learning and evaluating classifiers under sample selection bias, in: Proceedings of the 2004 International Conference on Machine Learning, 2004, pp. 903–910.
[9] M. Sugiyama, T. Suzuki, T. Kanamori, Density ratio matching under the Bregman divergence: a unified framework of density ratio estimation, Ann. Inst. Stat. Math. 64 (5) (2011) 1009–1044.
[10] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, B. Schölkopf, Covariate shift by kernel mean matching, in: J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence (Eds.), Dataset Shift in Machine Learning, The MIT Press, Cambridge, 2009, pp. 131–160.
[11] S. Bickel, M. Brückner, T. Scheffer, Discriminative learning under covariate shift, J. Mach. Learn. Res. 10 (5) (2009) 2137–2155.
[12] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, M. Kawanabe, Direct importance estimation with model selection and its application to covariate shift adaptation, in: Proceedings of the 2007 Advances in Neural Information Processing Systems, 2007, pp. 1433–1440.
[13] J. Tsuboi, H. Kashima, S. Hido, S. Bickel, M. Sugiyama, Direct density ratio estimation for large-scale covariate shift adaptation, in: Proceedings of the 2008 SIAM International Conference on Data Mining, 2008, pp. 443–454.
[14] T. Kanamori, S. Hido, M. Sugiyama, A least-squares approach to direct importance estimation, J. Mach. Learn. Res. 10 (3) (2009) 1391–1445.


[15] T. Kanamori, T. Suzuki, M. Sugiyama, Statistical analysis of kernel-based least-squares density-ratio estimation, Mach. Learn. 86 (3) (2011) 335–367.
[16] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, M. Sugiyama, Relative density-ratio estimation for robust distribution comparison, Neural Comput. 25 (5) (2013) 1324–1370.
[17] J. Wen, R. Greiner, D. Schuurmans, Correcting covariate shift with the Frank–Wolfe algorithm, in: Proceedings of the 2015 International Conference on Artificial Intelligence, 2015, pp. 1010–1016.
[18] Y. Miao, A. Farahat, M. Kamel, Ensemble kernel mean matching, in: Proceedings of the 2015 IEEE International Conference on Data Mining, 2015, pp. 330–338.
[19] S. Chandra, A. Haque, L. Khan, Efficient sampling-based kernel mean matching, in: Proceedings of the 2016 IEEE International Conference on Data Mining, 2016, pp. 811–816.
[20] H. Hachiya, M. Sugiyama, N. Ueda, Importance-weighted least squares probabilistic classifier for covariate shift adaptation with application to human activity recognition, Neurocomputing 80 (2) (2012) 93–101.
[21] M. Sugiyama, S. Liu, M. Yamanaka, M. Yamada, T. Suzuki, T. Kanamori, Divergence approximation between probability distributions and its applications in machine learning, J. Comput. Sci. Eng. 7 (2) (2013) 99–111.
[22] V.N. Vapnik, Statistical Learning Theory, first ed., Wiley, New York, 1998.
[23] S. Khalighi, B. Ribeiro, U. Nunes, Importance weighted import vector machine for unsupervised domain adaptation, IEEE Trans. Cybern. 47 (10) (2017) 3280–3292.
[24] Q. Tan, H. Deng, P. Yang, Kernel mean matching with a large margin, in: Proceedings of the 2012 International Conference on Advanced Data Mining and Applications, 2012, pp. 223–234.
[25] Y. Cao, M. Long, J. Wang, Unsupervised domain adaptation with distribution matching machines, in: Proceedings of the 2018 AAAI Conference on Artificial Intelligence, 2018, pp. 2795–2802.
[26] S. Reddi, B. Poczos, A. Smola, Doubly robust covariate shift correction, in: Proceedings of the 2015 AAAI Conference on Artificial Intelligence, 2015, pp. 2949–2955.
[27] J. Wen, C. Yu, R. Greiner, Robust learning under uncertain test distributions: relating covariate shift to model misspecification, in: Proceedings of the 2014 International Conference on Machine Learning, 2014, pp. 631–639.
[28] X. Chen, M. Monfort, A. Liu, B. Ziebart, Robust covariate shift regression, in: Proceedings of the 2016 International Conference on Artificial Intelligence and Statistics, 2016, pp. 1270–1279.
[29] J. Jiang, A Literature Survey on Domain Adaptation of Statistical Classifiers. http://sifaka.cs.uiuc.edu/jiang4/domainadaptation/survey, 2008.
[30] K. Zhang, B. Schölkopf, K. Muandet, Z. Wang, Domain adaptation under target and conditional shift, in: Proceedings of the 2013 International Conference on Machine Learning, 2013, pp. 819–827.
[31] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola, A kernel method for the two-sample-problem, in: Proceedings of the 2006 Advances in Neural Information Processing Systems, 2006, pp. 513–520.
[32] S. Pan, I. Tsang, J. Kwok, Q. Yang, Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw. 22 (2) (2011) 199–210.
[33] M. Baktashmotlagh, M. Harandi, M. Salzmann, Distribution matching embedding for visual domain adaptation, J. Mach. Learn. Res. 17 (108) (2016) 1–30.
[34] J. Zhang, W. Li, P. Ogunbona, Joint geometrical and statistical alignment for visual domain adaptation, in: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5150–5158.
[35] J. Liu, J. Li, K. Lu, Coupled local–global adaptation for multi-source transfer learning, Neurocomputing 275 (2018) 247–254.
[36] B. Sun, J. Feng, K. Saenko, Return of frustratingly easy domain adaptation, in: Proceedings of the 2016 AAAI Conference on Artificial Intelligence, 2016, pp. 2058–2065.
[37] N. Courty, R. Flamary, A. Habrard, A. Rakotomamonjy, Joint distribution optimal transportation for domain adaptation, in: Proceedings of the 2017 Advances in Neural Information Processing Systems, 2017, pp. 3733–3742.
[38] M. Sugiyama, M. Krauledat, K.-R. Müller, Covariate shift adaptation by importance weighted cross validation, J. Mach. Learn. Res. 8 (1) (2007) 985–1005.
[39] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1359.
[40] Q. Que, M. Belkin, Back to the future: radial basis function networks revisited, in: Proceedings of the 2016 International Conference on Artificial Intelligence and Statistics, 2016, pp. 1375–1383.
[41] S. Boyd, L. Vandenberghe, Convex Optimization, first ed., Cambridge University Press, Cambridge, 2004.
[42] J. Gorski, F. Pfeuffer, K. Klamroth, Biconvex sets and optimization with biconvex functions: a survey and extensions, Math. Methods Oper. Res. 66 (3) (2007) 373–407.
[43] J. Nocedal, S. Wright, Numerical Optimization, second ed., Springer, New York, 2006.
[44] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (1) (2006) 1–30.


Sentao Chen received the B.S. degree in statistics from Guangdong University of Technology, Guangzhou, China. He is currently pursuing the Ph.D. degree in software engineering in the School of Software Engineering, South China University of Technology. His current research interests include statistical machine learning, transfer learning, and covariate shift adaptation.

Xiaowei Yang received the B.S. degree in theoretical and applied mechanics, the M.Sc. degree in computational mechanics, and the Ph.D. degree in solid mechanics from Jilin University, Changchun, China, in 1991, 1996, and 2000, respectively. He is currently a full-time professor in the School of Software Engineering, South China University of Technology. His current research interests include the design and analysis of algorithms for large-scale pattern recognition, imbalanced learning, semi-supervised learning, support vector machines, tensor learning, and evolutionary computation.