Differential privacy preservation in regression analysis based on relevance


Maoguo Gong∗, Ke Pan, Yu Xie
School of Electronic Engineering, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an, Shaanxi Province 710071, China
∗ Corresponding author. E-mail address: [email protected] (M. Gong).

Article history: Received 17 October 2018; Received in revised form 22 February 2019; Accepted 22 February 2019; Available online xxxx.

Keywords: Differential privacy; Privacy preservation; Regression analysis; Relevance

Abstract

With the development of data release and data mining, protecting the sensitive information in data from being leaked has attracted considerable attention in the information security field. Differential privacy is an excellent paradigm for providing preservation against an adversary that attempts to infer the sensitive information of individuals. However, existing works show that the accuracy of differentially private regression models is less than satisfactory, since the amount of added noise is not well controlled. In this paper, we present a novel framework, PrivR, a differentially private regression analysis model based on relevance, which transforms the objective function into a polynomial form and perturbs the polynomial coefficients according to the magnitude of relevance between the input features and the model output. Specifically, we add less noise to the coefficients of the polynomial representation of the objective function that involve strongly relevant features, and vice versa. Experiments on the Adult dataset and the Banking dataset demonstrate that PrivR not only prevents the leakage of data privacy effectively but also retains the utility of the model.

1. Introduction

With the coming of the era of big data, data mining and machine learning are widely used to discover useful knowledge from large collections of data. However, such data mining and machine learning algorithms may reveal the sensitive information of certain individuals. On the one hand, when a publicly released dataset used by these algorithms contains personal information, it can cause the disclosure of the privacy of individuals. On the other hand, data mining and machine learning algorithms are prone to discover potential relationships within the data, which gives attackers an opportunity to acquire sensitive information. Thus, the preservation of sensitive information must be carefully managed.

In the field of data mining, various privacy-preserving data mining algorithms have been proposed. Motlagh et al. [1] presented an association rule (AR) hiding technique based on the genetic algorithm to realize privacy preservation, which not only decreases the side effects of the sensitive AR hiding process, but also effectively preserves the sensitive knowledge in the data. In addition, with the significant progress of machine learning, its applications have spread into various fields, including recommender systems [2–5], natural language processing [6–8], computer vision [9,10], autonomous driving [11,12], social network analysis [13,14] and many more. These applications also involve personal and highly sensitive information such as clinical records, genetic information, biomedical images, and financial status. In the case of multiple data sources, quite a few privacy preservation algorithms have been presented to avoid the disclosure of privacy during sharing between data sources. For instance, Karapiperis et al. [15] proposed a multi-party blocking approach for privacy-preserving record linkage based on the Locality-Sensitive Hashing technique [16]; the presented method aims to identify records that correspond to the same real-world entities across several datasets without disclosing any sensitive information about these entities. Moreover, there are also several privacy-preserving machine learning algorithms based on a single data source. However, the risk of disclosing the private information in a dataset increases if an appropriate privacy preservation technology is not adopted. Therefore, there is an urgent demand to prevent the established model from being exploited by attackers to gain sensitive information.

Differential privacy [17] is an effective privacy preservation mechanism, which provides preservation against an adversary who attempts to infer any information about any particular record, even if the adversary possesses all records except the target record. In practice, in order to meet the requirements of differential privacy preservation, there are different implementation mechanisms for different problems. The Laplace mechanism [17] and the exponential mechanism [18] are the two most basic differential privacy preservation mechanisms.


The former applies to analysis tasks that return numeric results, whereas the latter mainly targets tasks with categorical outputs. Nevertheless, neither the Laplace mechanism nor the exponential mechanism can be easily utilized for regression models, owing to the required sensitivity analysis: the sensitivity analysis quantifies the degree to which the model output varies as the model input data changes, and it is formidable for a regression model because its input and output have an intricate relationship.

In this paper, we propose a differential privacy preservation model based on the magnitude of relevance between the input features and the model output; the presented model fits linear regression and regularized linear regression. Linear regression is used for predicting the response feature of a dataset given the rest of the explanatory features. Specifically, it attempts to learn a function that forecasts through a linear combination of features, and the regression result is obtained by minimizing a loss function such as the mean square error. Although linear regression is a very common algorithm in practice, extensively used in stock prediction [19], medical research [20] and so on, there is only a limited selection of methods for differentially private regression analysis. Lei [21] presented a differentially private M-estimators (DPME) algorithm that perturbs the histogram of the input data and uses a synthetic dataset based on the perturbed histogram to train the model; deriving the sensitivity of the model output is unnecessary, since the preservation of data privacy is determined solely by the generated noisy histogram. However, DPME is restricted to low-dimensional datasets. Filter-Priority is an ϵ-differentially private [17] algorithm proposed by Cormode et al. [22], which transforms the original sensitive data into synthetic data and releases a compact summary of the noisy data; it maintains the preservation of privacy via the post-processing property of differential privacy. However, this method is limited to low-density datasets and often injects unnecessary noise. Zhang et al. [23] presented the Functional mechanism (FM), which protects sensitive information by perturbing the coefficients of the objective function and releasing the model parameters obtained by minimizing the perturbed objective function; FM adds noise of the same magnitude to every coefficient, which may harm accuracy. Based on covariance perturbation and output perturbation, a general noise reduction framework for regularized linear regression was proposed by Ligett et al. [24], which takes a series of privacy levels as input and outputs a sequence of hypotheses that satisfy ϵ-differential privacy; specifically, the noise reduction mechanism generates a list of private hypotheses by gradually increasing the value of ϵ, and each private hypothesis is calculated by optimizing the perturbed objective function. However, the existing differential privacy preservation models in regression analysis can hardly strike a balance between the utility of the model and the preservation of sensitive information. Therefore, a privacy preservation framework that significantly promotes privacy preservation while improving accuracy in regression analysis is urgently required.

Our contributions. Motivated by this, we present a novel framework, PrivR, a differentially private regression model based on relevance. The PrivR algorithm perturbs the coefficients of the polynomial form of the objective function according to the magnitude of relevance between the input features and the model output; it not only prevents the disclosure of data privacy, but also retains the utility of the model. In PrivR, more noise is added to the coefficients of the polynomial form of the objective function when the relevance between the corresponding input features and the model output is weak, and vice versa. Specifically, according to the magnitude of relevance between the input features and the model output, a threshold is set to divide the features into two groups: strongly relevant features and weakly relevant features.

We add noise of smaller magnitude to the coefficients of the polynomial form of the objective function that involve strongly relevant features, and noise of larger magnitude to the coefficients whose input features are less relevant to the model output. By controlling the amount of noise added to the polynomial form of the objective function, our algorithm provides preservation against the adversary while ensuring the accuracy of the model. Experiments on the Adult dataset and the Banking dataset demonstrate that the PrivR algorithm protects the training data effectively and retains the utility of the linear regression model and the regularized linear regression model.

The structure of the paper proceeds as follows. The next section reviews the related literature on differential privacy and the Functional mechanism. Section 3 explains our approach in detail. The experimental evaluations and results are described in Section 4. Finally, Section 5 concludes the paper.

2. Preliminaries and related works

In this section, we introduce the concepts of the differential privacy mechanism and the Functional mechanism. We are given a database D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} with d explanatory features X_1, X_2, ..., X_d and one response feature y. According to the magnitude of relevance between the input features and the model output, the explanatory features X_1, X_2, ..., X_d can be split into two groups: strongly relevant features {X_s} and weakly relevant features {X_w}. For each record (x_i, y_i) with x_i = (x_{i1}, x_{i2}, ..., x_{id}), we assume without loss of generality that

√(Σ_{j=1}^{d} x_{ij}²) ≤ 1, where x_{ij} > 0.

Table 1 summarizes the notations used throughout this paper.

Table 1
Notations.
D: database of n records
(x_i, y_i): the ith record in database D
ω: the parameter vector of the regression model
f(x_i, ω): the objective function on record (x_i, y_i)
f_D(ω): the objective function on database D
ω*: the optimized parameter of the objective function f_D(ω)
f̄_D(ω): the noisy version of f_D(ω)
ω̄: the parameter of the noisy objective function f̄_D(ω)
φ: a product of values in ω
Φ_q: the set of all possible φ of order q
λ_φ^{x_i}: the polynomial coefficient of φ in f(x_i, ω)
R_{x_{ij}}: the relevance between the input feature x_{ij} and the model output
R_j: the average relevance between the jth input feature and the model output
ϵ: the privacy budget for the regression model
ϵ_s, ϵ_w: the privacy budgets for strongly relevant features and weakly relevant features

Given the privacy budget ϵ, the privacy budget ratio α, the explanatory features X_1, X_2, ..., X_d, and the threshold γ, our objective is to classify the explanatory features based on their relevance to the model output and to perturb the objective function f_D(ω) in light of the magnitude of relevance between the explanatory features and the model output. The differentially private linear regression model and the differentially private regularized linear regression model are then constructed to forecast the response feature value y. Finally, the model parameter ω̄ that optimizes the differentially private regression function f̄_D(ω) is released, and the differentially private regression function outputs a prediction of y_i that is as accurate as possible. The definitions of linear regression and regularized linear regression are as follows.


Definition 1 (Linear Regression). Given the database D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, we assume that the response feature y ∈ [−1, 1]. A linear regression prediction function ρ(x_i) = x_i^T ω returns the optimized parameter ω* by minimizing the cost function f(x_i, ω) = (y_i − x_i^T ω)², i.e.,

ω* = arg min_ω Σ_{i=1}^{n} (y_i − x_i^T ω)²    (1)

Definition 2 (Regularized Linear Regression). Given the database D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, we assume that the response feature y ∈ [−1, 1]. A regularized linear regression prediction function ρ(x_i) = x_i^T ω returns the optimized parameter ω* by minimizing the cost function f(x_i, ω) = (y_i − x_i^T ω)² + λ∥ω∥₂², i.e.,

ω* = arg min_ω Σ_{i=1}^{n} (y_i − x_i^T ω)² + λ∥ω∥₂²    (2)

where λ is the regularization parameter.
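For concreteness, the non-private estimators in Definitions 1 and 2 can be computed in closed form. The following sketch is ours, not part of the original paper; the function name and the toy data are illustrative only.

```python
import numpy as np

def fit_linear_regression(X, y, lam=0.0):
    """Minimize sum_i (y_i - x_i^T w)^2 + lam * ||w||_2^2.

    Closed form: w* = (X^T X + lam * I)^{-1} X^T y.
    lam = 0 recovers Definition 1; lam > 0 gives Definition 2.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data respecting the paper's assumptions: x_ij > 0, ||x_i||_2 <= 1, y in [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0.01, 1.0, size=(100, 3))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
y = np.clip(X @ np.array([0.5, -0.2, 0.1]) + 0.01 * rng.normal(size=100), -1.0, 1.0)
w_star = fit_linear_regression(X, y, lam=0.1)
```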

2.1. Differential privacy

Differential privacy [17,25,26] provides a privacy guarantee for database records without significant loss of query accuracy. Its purpose is to ensure that the inclusion or exclusion of a record in the database cannot be inferred by the adversary. Several data analysis tasks have already adopted differential privacy as a preservation mechanism, such as online learning [27–29], principal component analysis [30,31] and clustering [32–34]. The definition of differential privacy is as follows.

Definition 3 (ϵ-Differential Privacy [17]). Given two adjacent databases D_1 and D_2, let M be a randomized algorithm and O any possible output of M. The randomized algorithm M satisfies ϵ-differential privacy if and only if, for D_1 and D_2,

Pr[M(D_1) = O] ≤ e^ϵ Pr[M(D_2) = O]    (3)

where the privacy budget ϵ controls the ratio of the probabilities that the algorithm M produces the same output on two adjacent databases. It reflects the level of privacy preservation that algorithm M can provide: the smaller the value of ϵ, the stronger the privacy preservation.

The Laplace mechanism is one of the methods that enables an algorithm to meet the requirements of differential privacy preservation. It is achieved by injecting noise into the output of a function f. The global sensitivity is the key parameter determining the amount of noise to add, and is defined as follows.

Definition 4 (Global Sensitivity [17]). Given the function f: D → R^d, the global sensitivity of f refers to the largest change in the query results caused by deleting any record in the database, which is defined as

GS_f(D) = max_{D_1,D_2} ∥f(D_1) − f(D_2)∥₁    (4)

where D_1 and D_2 are two neighboring databases, and ∥f(D_1) − f(D_2)∥₁ is the ℓ₁ distance between f(D_1) and f(D_2).

Definition 5 (Laplace Mechanism [17]). Given the database D, we assume that the global sensitivity of the function f: D → R^d is ∆. The randomized algorithm M(D) = f(D) + η provides ϵ-differential privacy preservation, where η ∼ Lap(∆/ϵ) is random noise with density

pdf(η) = (ϵ / (2 GS_f(D))) exp(−(ϵ / GS_f(D)) |η|)    (5)
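As a concrete illustration of Definitions 4 and 5 (our sketch, not from the paper), the snippet below releases a counting query under ϵ-differential privacy; a counting query changes by at most 1 when one record is removed, so its global sensitivity is 1. Smaller ϵ yields a larger noise scale ∆/ϵ and hence stronger privacy.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Definition 5: release f(D) + eta, with eta ~ Lap(sensitivity / epsilon)."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
y = rng.uniform(-1.0, 1.0, size=1000)
count = float(np.sum(y > 0))                  # query: how many y_i are positive?
noisy = laplace_mechanism(count, sensitivity=1.0, epsilon=0.5, rng=rng)
```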

2.2. Functional mechanism

The Functional mechanism [23] is a general framework to achieve ϵ-differential privacy. It extends the Laplace mechanism and is realized by transforming the objective function f_D(ω) into a polynomial form and injecting noise into the polynomial coefficients. In the end, the model parameter ω̄ that optimizes the noisy objective function f̄_D(ω) is released.

A product of the elements in the model parameter vector ω = {ω_1, ω_2, ..., ω_d} is denoted as φ(ω), namely, φ(ω) = ω_1^{c_1} · ω_2^{c_2} ··· ω_d^{c_d} for some c_1, ..., c_d ∈ N. Let Φ_q (q ∈ N) denote the set of all products of ω_1, ..., ω_d with degree q, i.e.,

Φ_q = {ω_1^{c_1} ω_2^{c_2} ··· ω_d^{c_d} | Σ_{l=1}^{d} c_l = q}    (6)

For example, Φ_0 = {1}, Φ_1 = {ω_1, ..., ω_d}, and Φ_2 = {ω_i ω_j | i, j ∈ [1, d]}. In light of the Stone–Weierstrass Theorem [35], the function f(x_i, ω) can be expressed as a polynomial in the parameter vector ω when it is continuous and differentiable. Consequently, we have

f(x_i, ω) = Σ_{q=0}^{Q} Σ_{φ∈Φ_q} λ_φ^{x_i} φ(ω)    (7)

where Q ∈ [0, ∞] and λ_φ^{x_i} ∈ R represents the coefficient of φ(ω). Furthermore, f_D(ω) can also be written as a polynomial in the parameter vector ω, i.e.,

f_D(ω) = Σ_{q=0}^{Q} Σ_{φ∈Φ_q} Σ_{x_i∈D} λ_φ^{x_i} φ(ω)    (8)

Given the polynomial representation of f_D(ω), the Functional mechanism achieves differential privacy by adding Laplace noise Lap(∆/ϵ) to the polynomial coefficients and releasing the model parameter ω̄ that optimizes the noisy objective function f̄_D(ω), where

∆ = 2 max_x Σ_{q=1}^{Q} Σ_{φ∈Φ_q} ∥λ_φ^x∥₁

which is derived from Lemma 1 below.
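Before stating the lemma, a minimal sketch of the Functional mechanism applied to the linear regression loss of Definition 1 may help. This is our illustration, not the authors' code, and it uses the sensitivity bound ∆ ≤ 2(2d + d²) derived later in Eq. (23); the degree-1 coefficients of the loss are −2 Σ_i y_i x_{ij} and the degree-2 coefficients are Σ_i x_{ij} x_{ih}.

```python
import numpy as np

def functional_mechanism_linreg(X, y, epsilon, rng):
    """Perturb the polynomial coefficients of sum_i (y_i - x_i^T w)^2 with
    Lap(Delta / epsilon) noise, then minimize the noisy quadratic."""
    n, d = X.shape
    delta = 2.0 * (2 * d + d ** 2)            # sensitivity bound, Eq. (23)
    lam1 = -2.0 * X.T @ y                     # coefficients of the monomials w_j
    lam2 = X.T @ X                            # coefficients of the monomials w_j * w_h
    lam1 = lam1 + rng.laplace(scale=delta / epsilon, size=lam1.shape)
    lam2 = lam2 + rng.laplace(scale=delta / epsilon, size=lam2.shape)
    lam2 = (lam2 + lam2.T) / 2.0              # keep the quadratic form symmetric
    # Stationary point of w^T lam2 w + lam1^T w satisfies 2 * lam2 * w = -lam1.
    # (The noisy lam2 may lose positive definiteness; practical code would
    # regularize or project it before solving.)
    return np.linalg.lstsq(2.0 * lam2, -lam1, rcond=None)[0]
```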

Lemma 1 ([23]). Given two adjacent databases D and D′, let f_D(ω) and f_{D′}(ω) denote the objective functions of the regression model, with polynomial representations

f_D(ω) = Σ_{q=1}^{Q} Σ_{φ∈Φ_q} Σ_{x_i∈D} λ_φ^{x_i} φ(ω)    (9)

f_{D′}(ω) = Σ_{q=1}^{Q} Σ_{φ∈Φ_q} Σ_{x_i′∈D′} λ_φ^{x_i′} φ(ω)    (10)

Then we have

∆ = Σ_{q=1}^{Q} Σ_{φ∈Φ_q} ∥Σ_{x_i∈D} λ_φ^{x_i} − Σ_{x_i′∈D′} λ_φ^{x_i′}∥₁ ≤ 2 max_x Σ_{q=1}^{Q} Σ_{φ∈Φ_q} ∥λ_φ^x∥₁    (11)

where x_i, x_i′, or x is an arbitrary record.

3. Methodology

We present a new framework to provide ϵ-differential privacy preservation for regression analysis models, which not only protects the sensitive information of individuals effectively, but also keeps the utility of the model by balancing the privacy budget ϵ between strongly relevant features and weakly relevant features.


Based on the Functional mechanism and the Layer-wise Relevance Propagation (LRP) algorithm, our method perturbs the polynomial coefficients of the objective function according to the magnitude of relevance between the input features and the model output. The main framework of the PrivR algorithm is shown in Algorithm 1, and its principal steps are as follows. First, the input features are divided into strongly relevant features {X_s} and weakly relevant features {X_w} according to the derived average relevance R_j(D), as shown in lines 1–9. Second, each product φ of the elements in ω is added to the subset Φ_s if it contains a strongly relevant feature parameter ω_s, and to the subset Φ_w otherwise, as shown in lines 10–19. Third, the privacy budgets ϵ_s and ϵ_w are calculated directly from the relevance between the input features and the model output, and the differentially private objective function f̄_D(ω) is derived by adding different Laplace noise to the polynomial coefficients of the objective function, as shown in lines 20–31. Finally, the optimized parameter ω̄ is acquired by minimizing the differentially private objective function f̄_D(ω), as shown in lines 32–34. In summary, our differentially private regression analysis model consists of two parts: the relevance analysis of the explanatory features based on LRP, and the objective perturbation based on that relevance analysis.

Algorithm 1 The main framework of the PrivR algorithm
Input: database D, loss function f_D(ω), privacy budget ϵ, privacy budget ratio α, threshold γ
Output: the optimal model parameter ω̄
1: Calculate the average relevance R_j(D)
2: Set {X_s} = {}, {X_w} = {}
3: for j ∈ [1, d] do
4:     if R_j(D) ≥ γ then
5:         Add X_j into {X_s}
6:     else
7:         Add X_j into {X_w}
8:     end if
9: end for
10: Set Φ_s = {}, Φ_w = {}
11: for q = 1 to Q do
12:     for each φ ∈ Φ_q do
13:         if φ contains ω_s for any strongly relevant feature then
14:             Add φ into Φ_s
15:         else
16:             Add φ into Φ_w
17:         end if
18:     end for
19: end for
20: Set ∆ = 2 max_x Σ_{q=1}^{Q} Σ_{φ∈Φ_q} ∥λ_φ^x∥₁
21: Set µ_s = (2 max_x Σ_{φ∈Φ_s} ∥λ_φ^x∥₁) / ∆, µ_w = (2 max_x Σ_{φ∈Φ_w} ∥λ_φ^x∥₁) / ∆
22: Set ϵ_s = ϵ / (µ_s + αµ_w), ϵ_w = αϵ / (µ_s + αµ_w)
23: for q = 1 to Q do
24:     for each φ ∈ Φ_q do
25:         if φ ∈ Φ_s then
26:             Set λ_φ = Σ_{x_i∈D} λ_φ^{x_i} + Lap(∆/ϵ_s)
27:         else
28:             Set λ_φ = Σ_{x_i∈D} λ_φ^{x_i} + Lap(∆/ϵ_w)
29:         end if
30:     end for
31: end for
32: Let f̄_D(ω) = Σ_{q=1}^{Q} Σ_{φ∈Φ_q} λ_φ φ(ω)
33: Compute ω̄ = arg min_ω f̄_D(ω)
34: Return ω̄

3.1. Relevance analysis

The relevance between the input features and the model output is derived with the Layer-wise Relevance Propagation (LRP) [36] algorithm, which decomposes the relevance based on messages derived for all neurons from the preceding layers. The message sent from a neuron m to its input neuron p via their connection is denoted as R_{p←m}^{(l−1,l)}(x_i).

Definition 6 (Relevance Decomposition). Let R_m^{(l)}(x_i) express the relevance between the neuron m at layer l and the model output f_{x_i}(ω). The total relevance of the upper-layer neurons determines the relevance of the lower-layer neurons, that is,

R_p^{(l−1)}(x_i) = Σ_{m∈h_l} R_{p←m}^{(l−1,l)}(x_i)    (12)

and the relevance decomposition can be computed by an explicit formula, defined as

R_{p←m}^{(l−1,l)}(x_i) = { (z_{pm} / (z_m + θ)) R_m^{(l)}(x_i),  z_m ≥ 0;  (z_{pm} / (z_m − θ)) R_m^{(l)}(x_i),  z_m < 0 }    (13)

where the parameter θ is a predefined stabilizer that eliminates the unboundedness of the relevance R_{p←m}^{(l−1,l)}(x_i). In addition, z_m is a linear projection onto neuron m, defined as

z_{pm} = a_p ω_{pm}    (14)

z_m = Σ_p z_{pm} + b_m    (15)

where a_p is the value of neuron p, ω_{pm} is the weight between neuron p and neuron m, and b_m is a bias term. Given l hidden layers h_1, ..., h_l, the relevance between each hidden-layer neuron and the model output can be calculated using Eqs. (12) and (13), and the same procedure yields the relevance between the input features and the model output.

In the case of the linear regression model and the regularized linear regression model, based on the LRP algorithm, we assume that R(x_i) expresses the relevance between the input features x_i and the model output f_{x_i}(ω). Given the output variable j, R(x_i) is defined as

R(x_i) = { (z_{ij} / (z_j + θ)) f_{x_i}(ω),  z_j ≥ 0;  (z_{ij} / (z_j − θ)) f_{x_i}(ω),  z_j < 0 }    (16)

where

z_{ij} = x_{ij} ω_j    (17)

z_j = Σ_i z_{ij} + b_j    (18)

Therefore, Eq. (16) can be written as R(x_i) = z_{ij}, and the relevance between the input features and the model output can be calculated using Eq. (16). As indicated in [36], we have

f_{x_i}(ω) = Σ_{x_{ij}∈x_i} R_{x_{ij}}(x_i)    (19)

where R_{x_{ij}}(x_i) is the relevance between the input feature x_{ij} and the model output f_{x_i}(ω). Then the average relevance R_j(D) of the jth input feature over all records can be acquired by

R_j(D) = (1/|D|) Σ_{x_i∈D} R_{x_{ij}}(x_i)    (20)

To guarantee R_j(D) ∈ [0, 1], each R_j(D) is normalized to (R_j(D) − κ)/(τ − κ), where τ and κ represent the maximum and minimum values in {R_1(D), ..., R_d(D)}.
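For the linear models considered here, Eqs. (16)–(20) reduce to simple vector operations: the relevance of feature j in record x_i is z_{ij} = x_{ij} ω_j, averaged over D and min–max normalized. The following sketch is ours; the weight vector w is assumed to come from a model trained beforehand.

```python
import numpy as np

def split_features_by_relevance(X, w, gamma=0.5):
    """Average per-feature relevance R_j(D) (Eqs. (17) and (20)), min-max
    normalized to [0, 1], then thresholded at gamma."""
    R = (X * w).mean(axis=0)                  # R_j(D) = (1/|D|) sum_i x_ij * w_j
    R = (R - R.min()) / (R.max() - R.min())   # normalize to [0, 1]
    strong = np.where(R >= gamma)[0]          # indices of {X_s}
    weak = np.where(R < gamma)[0]             # indices of {X_w}
    return strong, weak, R
```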


3.2. Objective perturbation based on relevance analysis

In order to realize an outstanding data privacy preservation while retaining the utility of the model, we propose the differentially private regression analysis model PrivR, which calculates the relevance between the input features and the model output based on the LRP algorithm and splits the input features into two groups according to the magnitude of that relevance. Specifically, a threshold γ is set to divide the input features into strongly relevant features {X_s} and weakly relevant features {X_w}. We add less noise to the coefficients of the polynomial form of the objective function that involve strongly relevant features, and more noise to the other coefficients. For each input feature X_j with j ∈ [1, d], we define:

X_j ∈ {X_s} if R_j(D) ≥ γ;  X_j ∈ {X_w} if R_j(D) < γ    (21)

where γ = 0.5. The setting of the parameter γ is based on a series of experiments in Section 4.2, whose results illustrate that the strongly relevant features and weakly relevant features are roughly uniformly distributed when γ = 0.5, whereas the distribution is uneven when the value of γ is too large or too small. In addition, we introduce the privacy budget ϵ_s for strongly relevant features and ϵ_w for weakly relevant features. Meanwhile, a privacy budget ratio α is set to satisfy the equation ϵ_w = αϵ_s, where 0 < α ≤ 1. The privacy budget ratio α guarantees that more Laplace noise is added to the coefficients of the polynomial form of the objective function that involve weakly relevant features, and vice versa.

3.2.1. Application to linear regression

We apply PrivR to linear regression, as given in Definition 1. The linear regression objective function in polynomial form is

f_D(ω) = Σ_{x_i∈D} (y_i − x_i^T ω)² = Σ_{x_i∈D} (y_i)² − Σ_{j=1}^{d} (2 Σ_{x_i∈D} y_i x_{ij}) ω_j + Σ_{1≤j,h≤d} (Σ_{x_i∈D} x_{ij} x_{ih}) ω_j ω_h    (22)

From the above equation, we find that f_D(ω) only involves monomials in Φ_0 = {1}, Φ_1 = {ω_1, ..., ω_d} and Φ_2 = {ω_i ω_j | i, j ∈ [1, d]}. According to Lemma 1, the sensitivity ∆ can be calculated as

∆ = 2 max_{(x,y)} Σ_{q=1}^{Q} Σ_{φ∈Φ_q} ∥λ_φ^x∥₁ ≤ 2 max_{(x,y)} (2 Σ_{j=1}^{d} y x_j + Σ_{1≤j,h≤d} x_j x_h) ≤ 2(2d + d²)    (23)

Based on the magnitude of relevance between the input features and the model output, f_D(ω) can be perturbed by adding different Laplace noise to its polynomial coefficients λ_φ. Specifically, we first split the monomials φ into two subsets Φ_s and Φ_w, as shown in lines 10–19 of Algorithm 1: we add φ to subset Φ_s if φ contains a strongly relevant feature parameter ω_s, and to Φ_w otherwise. Furthermore, we set the sensitivity ∆ on the basis of the maximum value of the coefficients λ_φ^x of φ(ω) and determine the privacy budgets in light of the given ϵ. We introduce the relevance ratios µ_s and µ_w, which express the parts of the contribution to the sensitivity ∆ corresponding to the elements in Φ_s and Φ_w, respectively:

µ_s = (2 max_x Σ_{φ∈Φ_s} ∥λ_φ^x∥₁) / ∆,  µ_w = (2 max_x Σ_{φ∈Φ_w} ∥λ_φ^x∥₁) / ∆    (24)

where µ_s + µ_w = 1, and the privacy budgets are given by

ϵ_s = ϵ / (µ_s + αµ_w),  ϵ_w = αϵ / (µ_s + αµ_w)    (25)

Finally, we add Laplace noise Lap(∆/ϵ_s) to the polynomial coefficients of the monomials φ ∈ Φ_s and Lap(∆/ϵ_w) to those of φ ∈ Φ_w to derive the differentially private objective function f̄_D(ω), and we compute the optimized ω̄ according to f̄_D(ω). Our algorithm achieves ϵ-differential privacy, as proved in the following Theorem 1.

Theorem 1. Algorithm 1 satisfies ϵ-differential privacy.

Proof. Let x_n and x_n′ be the last records in two adjacent databases D and D′, let ∆ = 2 max_x Σ_{q=1}^{Q} Σ_{φ∈Φ_q} ∥λ_φ^x∥₁, and let f̄_D(ω) = Σ_{q=1}^{Q} Σ_{φ∈Φ_q} λ_φ φ(ω). Φ_s denotes the set of monomials φ including a strongly relevant feature parameter ω_s, and Φ_w the set of monomials φ containing only weakly relevant feature parameters ω_w. Formally, we find

Pr{f̄_D(ω) | D} / Pr{f̄_{D′}(ω) | D′}
= Π_{φ∈Φ_s} [exp((ϵ_s/∆) ∥Σ_{x_i∈D} λ_φ^{x_i} − λ_φ∥₁) / exp((ϵ_s/∆) ∥Σ_{x_i′∈D′} λ_φ^{x_i′} − λ_φ∥₁)] · Π_{φ∈Φ_w} [exp((ϵ_w/∆) ∥Σ_{x_i∈D} λ_φ^{x_i} − λ_φ∥₁) / exp((ϵ_w/∆) ∥Σ_{x_i′∈D′} λ_φ^{x_i′} − λ_φ∥₁)]
≤ Π_{φ∈Φ_s} exp((ϵ_s/∆) ∥Σ_{x_i∈D} λ_φ^{x_i} − Σ_{x_i′∈D′} λ_φ^{x_i′}∥₁) · Π_{φ∈Φ_w} exp((ϵ_w/∆) ∥Σ_{x_i∈D} λ_φ^{x_i} − Σ_{x_i′∈D′} λ_φ^{x_i′}∥₁)
= Π_{φ∈Φ_s} exp((ϵ_s/∆) ∥λ_φ^{x_n} − λ_φ^{x_n′}∥₁) · Π_{φ∈Φ_w} exp((ϵ_w/∆) ∥λ_φ^{x_n} − λ_φ^{x_n′}∥₁)
≤ exp((ϵ_s/∆) · 2 max_x Σ_{φ∈Φ_s} ∥λ_φ^x∥₁) · exp((ϵ_w/∆) · 2 max_x Σ_{φ∈Φ_w} ∥λ_φ^x∥₁)    (26)
= exp(ϵ_s µ_s + ϵ_w µ_w)
= exp(µ_s ϵ / (µ_s + αµ_w) + αµ_w ϵ / (µ_s + αµ_w))
= exp(ϵ). □
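Putting lines 20–31 of Algorithm 1 together for the linear regression loss, the perturbation can be sketched as follows. This is our code, not the authors'; `strong` is the index set produced by the relevance analysis of Section 3.1, and µ_s is taken from Remark 1 below rather than recomputed from data.

```python
import numpy as np

def privr_linreg(X, y, strong, epsilon, alpha, rng):
    """Add Lap(Delta/eps_s) noise to coefficients of monomials touching a
    strongly relevant feature and Lap(Delta/eps_w) noise to the rest, then
    minimize the noisy quadratic (lines 20-34 of Algorithm 1)."""
    n, d = X.shape
    k = len(strong)
    delta = 2.0 * (2 * d + d ** 2)                    # Eq. (23)
    mu_s = (2 * k + k * d) / (2 * d + d ** 2)         # Remark 1
    mu_w = 1.0 - mu_s
    eps_s = epsilon / (mu_s + alpha * mu_w)           # Eq. (25)
    eps_w = alpha * eps_s                             # alpha <= 1, so more noise here
    is_strong = np.zeros(d, dtype=bool)
    is_strong[strong] = True
    # A monomial w_j (or w_j * w_h) lies in Phi_s iff it involves a strong feature.
    scale1 = np.where(is_strong, delta / eps_s, delta / eps_w)
    scale2 = np.where(is_strong[:, None] | is_strong[None, :],
                      delta / eps_s, delta / eps_w)
    lam1 = -2.0 * X.T @ y + rng.laplace(scale=scale1)
    lam2 = X.T @ X + rng.laplace(scale=scale2)
    lam2 = (lam2 + lam2.T) / 2.0
    return np.linalg.lstsq(2.0 * lam2, -lam1, rcond=None)[0]
```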


According to the derived sensitivity ∆ and the relevance ratios µ_s and µ_w, PrivR achieves differential privacy by adding noise of distinct scales to the coefficients involving strongly relevant features and weakly relevant features. Remark 1 states the derived result for linear regression.

Remark 1. For linear regression, assume that there are k strongly relevant features among the total d input features. In Algorithm 1, we find ∆ = 2(2d + d²), µ_s = (2k + kd)/(2d + d²), and µ_w = ((2 + d)(d − k))/(2d + d²).

Proof. According to Eq. (22), we know

∆ = 2 max_{(x,y)} Σ_{q=1}^{Q} Σ_{φ∈Φ_q} ∥λ_φ^x∥₁ ≤ 2 max_{(x,y)} (2 Σ_{j=1}^{d} y x_j + Σ_{1≤j,h≤d} x_j x_h) = 2(2d + d²)    (27)

where x_j represents the jth entry of the vector x. What is more, for the coefficients involving the k strongly relevant features, we have

2 max_x Σ_{q=1}^{Q} Σ_{φ∈Φ_s} ∥λ_φ^x∥₁ = 2(2k + kd)    (28)

Thus, we have

µ_s = 2(2k + kd) / (2(2d + d²)) = (2k + kd)/(2d + d²)    (29)

µ_w = 2(2(d − k) + (d − k)d) / (2(2d + d²)) = ((2 + d)(d − k))/(2d + d²)    (30)
□

3.2.2. Application to regularized linear regression

We apply PrivR to regularized linear regression, as given in Definition 2. The regularized linear regression objective function in polynomial form is given by

f_D(ω) = Σ_{x_i∈D} (y_i − x_i^T ω)² + λ∥ω∥₂² = Σ_{x_i∈D} (y_i)² − Σ_{j=1}^{d} (2 Σ_{x_i∈D} y_i x_{ij}) ω_j + Σ_{1≤j,h≤d} (Σ_{x_i∈D} x_{ij} x_{ih}) ω_j ω_h + λ∥ω∥₂²    (31)

According to Lemma 1, the sensitivity ∆ can be calculated as

∆ = 2 max_{(x,y)} Σ_{q=1}^{Q} Σ_{φ∈Φ_q} ∥λ_φ^x∥₁ ≤ 2 max_{(x,y)} (2 Σ_{j=1}^{d} y x_j + Σ_{1≤j,h≤d} x_j x_h) ≤ 2(2d + d²)    (32)

Because the regularization term does not depend on the dataset, the sensitivity of the regularized objective function is the same as that of the non-regularized objective function, and the amount of noise added to the regularized objective function remains the same as for the non-regularized one. The Laplace noise Lap(∆/ϵ_s) is added to the polynomial coefficients of φ ∈ Φ_s and Lap(∆/ϵ_w) to those of φ ∈ Φ_w to derive the regularized differentially private objective function f̄_D(ω).

4. Experiments

We evaluate PrivR on two well-known datasets. Adult [37] consists of the information of 48,842 individuals extracted from the 1994 US Census database; each record has 14 explanatory features, namely age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country, and one response feature, income. Banking [38] relates to the marketing campaigns of a banking institution; it includes 45,211 individuals, and each record has 17 features, namely age, job, marital, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome, and y. In our experiments, we focus on the linear regression model and the regularized linear regression model under differential privacy, using y as the response feature and the rest of the features as explanatory features, and we use the Adult dataset and the Banking dataset to investigate the effectiveness of Algorithm 1. The regression task for the Adult dataset is to forecast whether the annual earnings of an individual exceed 50K, and the regression task for the Banking dataset is to predict whether a client subscribes to a term deposit. All experiments are conducted on a PC with an Intel(R) Core(TM) 4-core 2.5 GHz CPU and 8 GB of memory. We normalize each feature and measure the accuracy of the linear regression model and the regularized linear regression model by the mean square error, i.e., (1/n) Σ_{i=1}^{n} (y_i − x_i^T ω)². In each experiment, we vary the dataset size, the privacy budget ϵ, and the privacy budget ratio α.

4.1. Competitive models

We compare the proposed PrivR algorithm with three approaches, namely DPME [21], FM [23], and NoPrivacy. Specifically, DPME is an outstanding ϵ-differentially private algorithm for regression models, and FM achieves differential privacy preservation in regression analysis by adding identical noise to all coefficients of the polynomial form of the objective function. NoPrivacy directly releases the parameters that minimize the objective function, without any privacy consideration.
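The accuracy comparisons below can be reproduced with a harness of the following shape. This is a schematic we add for concreteness: it reuses the `split_features_by_relevance` and `privr_linreg` sketches from Section 3 and substitutes synthetic data for the normalized Adult/Banking records.

```python
import numpy as np

def mse(X, y, w):
    """Accuracy measure of Section 4: (1/n) * sum_i (y_i - x_i^T w)^2."""
    return float(np.mean((y - X @ w) ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(0.01, 1.0, size=(2000, 14))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
y = np.clip(X @ rng.uniform(-0.3, 0.3, size=14) + 0.05 * rng.normal(size=2000), -1.0, 1.0)
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

w_np = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]        # NoPrivacy baseline
strong, weak, _ = split_features_by_relevance(X_tr, w_np, gamma=0.5)
for epsilon in (0.4, 0.8, 1.2, 1.6, 2.0):                # sweep as in Section 4.5
    w_priv = privr_linreg(X_tr, y_tr, strong, epsilon, alpha=0.25, rng=rng)
    print(f"eps={epsilon}: test MSE={mse(X_te, y_te, w_priv):.4f}")
```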

4.2. Evaluation on the choice of the parameter γ

Fig. 1. The distribution of strongly relevant features and weakly relevant features.

Fig. 2. The accuracy of PrivR as a function of the privacy budget ratio α for different values of the parameter γ.

Fig. 1 shows that the strongly relevant features and weakly relevant features of the Adult dataset and the Banking dataset are each roughly uniformly distributed when γ = 0.5; recall that each record in the Adult dataset has 14 explanatory features and each record in the Banking dataset has 16. Fig. 1 also illustrates that the distribution of the numbers of strongly and weakly relevant features in both datasets is uneven when the value of γ is too large or too small. Moreover, the accuracy of the model fluctuates when the value of γ is too large or too small, as shown in Fig. 2, which plots the accuracy of PrivR as a function of the privacy budget ratio α for different values of γ. In this experiment, we vary the parameter γ in {0.2, 0.5, 0.8}. Fig. 2 indicates that the accuracy of the PrivR algorithm fluctuates when the value of γ is too large or too small, whereas the accuracy of the model is relatively stable when γ = 0.5. Therefore, we set γ = 0.5 based on the above experiments; this setting leads to a roughly even split between strongly relevant and weakly relevant features.

4.3. Evaluation on dataset cardinality

Fig. 3. The prediction accuracy analysis as a function of dataset cardinality.

Fig. 3 demonstrates the accuracy of each algorithm for linear regression analysis as the dataset cardinality varies. In this experiment, we vary the sampling rate of the dataset in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. On both the Adult dataset and the Banking dataset, the accuracy gap between PrivR and NoPrivacy is large; however, the gap shrinks as the dataset cardinality increases, indicating that the accuracy of PrivR improves with the sampling rate, since more data carries more information. In addition, the accuracy of PrivR changes gently, while the fluctuation of DPME is slightly larger on both datasets. The accuracy of PrivR stays around 85% on the Adult dataset and around 89% on the Banking dataset once the sampling rate exceeds 0.5. The performance of DPME also improves as the data cardinality grows, in accordance with the theoretical results in the original paper. Nevertheless, PrivR outperforms DPME throughout, even when all records in the dataset are employed.

4.4. Evaluation on privacy budget ratio

Fig. 4. The prediction accuracy analysis as a function of privacy budget ratio α when ϵ = 1.

Fig. 4 shows the accuracy of each algorithm for linear regression analysis as the privacy budget ratio α varies, where 0 < α ≤ 1. In this experiment, we set ϵ = 1 and vary the privacy budget ratio α in {1, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01}. Note that the values of ϵ_s and ϵ_w can be easily derived from Remark 1.

According to Section 3, with the decrease of α, the amount of noise injected into the coefficients of the objective function that involve strongly relevant features decreases, and vice versa. The privacy budget ratio α influences the accuracy of the regression model: as Fig. 4 indicates, decreasing α increases the accuracy of the PrivR algorithm. Moreover, PrivR performs better than FM and DPME owing to the introduction of the privacy budget ratio α, which controls the amount of noise added to the polynomial coefficients based on the relevance between the input features and the model output, ensuring that more noise is injected into the coefficients of the objective function that involve weakly relevant features, and vice versa. The privacy budget ratio α thus balances the accuracy of PrivR against the preservation of individual information, whereas FM and DPME add noise of the same magnitude everywhere, which can harm model accuracy. However, the value of α must be greater than 0, because α = 0 would imply ϵ_w = 0 (recall that ϵ_w = αϵ_s). First, although ϵ_w = 0 would provide the strongest possible privacy guarantee for weakly relevant features, the model would then produce outputs with identical probability distributions for any adjacent datasets; such outputs reflect no useful information about the dataset and make it impossible to balance model utility against the preservation of private information. Second, exhausting ϵ_w would undermine the differential privacy mechanism and render the algorithm meaningless. Finally, ϵ_w = 0 is inapplicable for the Laplace mechanism, since the scale of the Laplace noise Lap(∆/ϵ_w) becomes infinite when the denominator is zero.

In addition, we investigate the sensitivities of different features while considering the relevance between different features and the model output in detail. Specifically, no matter whether the relevance between a sensitive feature and the model output is weak or not, the sensitive features are reclassified as weakly relevant features, which guarantees that more noise is added to the polynomial coefficients that involve the sensitive features and thus provides a more effective preservation for them. For simplicity, we consider the case of a single sensitive feature, with all remaining features non-sensitive. Fig. 5 demonstrates that although the accuracy of the model is limited by adding more noise to the coefficients that involve the sensitive feature, a more effective privacy guarantee is provided for it.

Fig. 5. The accuracy of PrivR as a function of the privacy budget ratio α when the features are reclassified.

4.5. Evaluation on privacy budget

Fig. 6 illustrates the accuracy of each algorithm on both the Adult dataset and the Banking dataset as the privacy budget ϵ varies. In this experiment, the input features are not divided into strongly relevant features and weakly relevant features, and the same amount of noise is added to the coefficients of the objective function. For all values of ϵ, the accuracy of NoPrivacy remains unchanged, since it does not enforce ϵ-differential privacy, whereas both PrivR and DPME are performed under ϵ-differential privacy. Smaller values of ϵ enforce a stronger privacy preservation for sensitive information and a worse model accuracy, because more noise is added. We can also derive this conclusion from Fig. 6.


Fig. 6. The prediction accuracy analysis as a function of privacy budget ϵ.

With the decrease of ϵ, the accuracy of DPME, PrivR, and Regularization PrivR all decline. However, the accuracy of PrivR remains steady at around 85% on the Adult dataset and around 90% on the Banking dataset while the privacy budget ϵ is above 0.8, and then falls slightly, whereas the accuracy of DPME drops dramatically almost everywhere. Therefore, PrivR outperforms DPME in all cases. Regularization PrivR still outperforms PrivR as ϵ decreases, owing to the independence of the regularization term from the dataset: the sensitivity of a regularized objective function remains the same as that of a non-regularized objective function, so the amount of added noise does not increase. In addition, regularization itself prevents overfitting and can improve the accuracy of the model.

In summary, PrivR is superior to DPME and FM in all experiments. The relevance between the various input features and the model output is clearly not identical, so injecting the same amount of noise into all polynomial coefficients can degrade the utility of the model. PrivR perturbs the coefficients of the polynomial form of the objective function based on the magnitude of relevance between the input features and the model output, and the experimental results demonstrate that PrivR is an effective differentially private regression analysis model.

5. Conclusion and future work

This paper proposes a novel framework, PrivR, for differentially private regression analysis. Our approach perturbs the coefficients of the objective function in its polynomial form based on the magnitude of relevance between the input features and the model output. Specifically, we rewrite the objective function as a polynomial and add noise of smaller magnitude to the polynomial coefficients that involve strongly relevant features, and vice versa. Experimental evaluations conducted on well-known datasets validate the effectiveness of our algorithm. In the future, our work can be extended in the following directions. First, the current approach only works for linear regression analysis and regularized linear regression analysis; it is worthwhile to apply our algorithm to deep learning models to preserve privacy while retaining model utility. Second, investigating the sensitivities of different features while considering the relevance between different features and the model output in more depth is a promising direction. Finally, we plan to investigate other mechanisms based on differential privacy to prevent the disclosure of sensitive information.

Acknowledgment

This work was supported by the National Key Research and Development Program of China (Grant no. 2017YFB0802200), the National Natural Science Foundation of China (Grant no. 61772393), and the Key Research and Development Program of Shaanxi Province (Grant no. 2018ZDXM-GY-045).

References

[1] F.N. Motlagh, H. Sajedi, MOSAR: A multi-objective strategy for hiding sensitive association rules using genetic algorithm, Appl. Artif. Intell. 30 (9) (2016) 823–843.
[2] J.L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, Y. Kluger, DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network, BMC Med. Res. Methodol. 18 (1) (2018) 24–35.
[3] C.-L. Liu, Y.-C. Chen, Background music recommendation based on latent factors and moods, Knowl. Based Syst. 159 (2018) 158–170.
[4] H. Yin, W. Wang, L. Chen, X. Du, Q.V.H. Nguyen, Z. Huang, MobiSAGE-RS: A sparse additive generative model-based mobile application recommender system, Knowl. Based Syst. 157 (2018) 68–80.
[5] F. Strub, R. Gaudel, Hybrid recommender system based on autoencoders, in: The Workshop on Deep Learning for Recommender Systems, 2016, pp. 11–16.
[6] T. Young, D. Hazarika, S. Poria, E. Cambria, Recent trends in deep learning based natural language processing, IEEE Comput. Intell. Mag. 13 (3) (2018) 55–75.
[7] B. Liu, I. Lane, Adversarial learning of task-oriented neural dialog models, in: Special Interest Group on Discourse and Dialogue Conference, Association for Computational Linguistics, 2018, pp. 350–359.
[8] Y. Li, H. Guo, Q. Zhang, M. Gao, J. Yang, Imbalanced text sentiment classification using universal and domain-specific knowledge, Knowl. Based Syst. 160 (2018) 1–15.
[9] R. Mehrizi, X. Peng, X. Xu, S. Zhang, D. Metaxas, K. Li, A computer vision based method for 3D posture estimation of symmetrical lifting, J. Biomech. 69 (2018) 40–46.
[10] Z. Cao, T. Simon, S.E. Wei, Y. Sheikh, Realtime multi-person 2D pose estimation using part affinity fields, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1302–1310.
[11] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149.
[12] F. Yang, W. Choi, Y. Lin, Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers, in: Computer Vision and Pattern Recognition, 2016, pp. 2129–2137.
[13] Z. Wang, Z. Li, G. Yuan, Y. Sun, X. Rui, X. Xiang, Tracking the evolution of overlapping communities in dynamic social networks, Knowl. Based Syst. 157 (2018) 81–97.
[14] L. Xu, X. Wei, J. Cao, P.S. Yu, On exploring semantic meanings of links for embedding social networks, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 479–488.
[15] D. Karapiperis, V.S. Verykios, An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage, IEEE Trans. Knowl. Data Eng. 27 (4) (2015) 909–921.


[16] A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proceedings of the 25th International Conference on Very Large Data Bases, 1999, pp. 518–529.
[17] C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, in: Theory of Cryptography Conference, 2006, pp. 265–284.
[18] F. McSherry, K. Talwar, Mechanism design via differential privacy, in: IEEE Symposium on Foundations of Computer Science, 2007, pp. 94–103.
[19] Y.E. Cakra, B.D. Trisedya, Stock price prediction using linear regression based on sentiment analysis, in: International Conference on Advanced Computer Science and Information Systems, 2016, pp. 147–154.
[20] Y. Yu, Z. Chao, W. Hong, X. Kuang, T. Gong, Y. Lu, C. Xu, How to conduct dose-response meta-analysis by using linear relation and piecewise linear regression model, J. Evid. Based Med. 16 (1) (2016) 111–114.
[21] J. Lei, Differentially private M-estimators, in: International Conference on Neural Information Processing Systems, 2011, pp. 361–369.
[22] G. Cormode, C. Procopiuc, D. Srivastava, T.T. Tran, Differentially private summaries for sparse data, in: Proceedings of the 15th International Conference on Database Theory, 2012, pp. 299–311.
[23] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, M. Winslett, Functional mechanism: regression analysis under differential privacy, Proc. VLDB Endow. 5 (11) (2012) 1364–1375.
[24] K. Ligett, S. Neel, A. Roth, B. Waggoner, S.Z. Wu, Accuracy first: Selecting a differential privacy level for accuracy constrained ERM, in: Advances in Neural Information Processing Systems, 2017, pp. 2566–2576.
[25] C. Dwork, A firm foundation for private data analysis, Commun. ACM 54 (1) (2011) 86–95.
[26] C. Dwork, A. Roth, The algorithmic foundations of differential privacy, Found. Trends Theor. Comput. Sci. 9 (3–4) (2014) 211–407.

[27] C. Li, P. Zhou, L. Xiong, Q. Wang, T. Wang, Differentially private distributed online learning, IEEE Trans. Knowl. Data Eng. 30 (8) (2018) 1440–1453.
[28] J. Zhu, C. Xu, J. Guan, D.O. Wu, Differentially private distributed online algorithms over time-varying directed networks, IEEE Trans. Signal Inform. Process. Netw. 4 (1) (2018) 4–17.
[29] P. Zhou, Y. Zhou, D. Wu, H. Jin, Differentially private online learning for cloud-based video recommendation with multimedia big data in social networks, IEEE Trans. Multimedia 18 (6) (2016) 1217–1229.
[30] H. Imtiaz, A.D. Sarwate, Differentially private distributed principal component analysis, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 2206–2210.
[31] H. Imtiaz, A.D. Sarwate, Symmetric matrix perturbation for differentially-private principal component analysis, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 2339–2343.
[32] M.F. Balcan, T. Dick, Y. Liang, W. Mou, H. Zhang, Differentially private clustering in high-dimensional Euclidean spaces, in: International Conference on Machine Learning, 2017, pp. 322–331.
[33] D. Su, J. Cao, N. Li, E. Bertino, M. Lyu, H. Jin, Differentially private k-means clustering and a hybrid approach to private optimization, ACM Trans. Priv. Secur. 20 (4) (2017) 1–33.
[34] Z. Huang, J. Liu, Optimal differentially private algorithms for k-means clustering, in: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, ACM, 2018, pp. 395–408.
[35] W. Rudin, Principles of Mathematical Analysis, McGraw-Hill, New York, 1976.
[36] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS One 10 (7) (2015) e0130140.
[37] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 1–27.
[38] A. Frank, A. Asuncion, UCI machine learning repository, 2010.
