
Research highlights


- RDT aims to protect sensitive information from being revealed by data mining methods.
- A watermark can be embedded into the original data by RDT.
- Compared with the existing algorithms, RDT has better knowledge reservation.
- Experimental results also show that RDT has a higher watermark payload.


A Reversible Data Transform Algorithm Using Integer Transform for Privacy-Preserving Data Mining

Chen-Yi Lin*

Department of Information Management, National Taichung University of Science and Technology, Taiwan

Abstract

In the cloud computing environment, data owners worry that private information in their data will be disclosed without permission, so they apply privacy-preserving techniques to the data while trying to retain the knowledge within it. In the past, data perturbation approaches were commonly used to modify the original data content, but they also distort the data and hence cause a significant loss of the knowledge it contains. To solve this problem, this study introduces the concept of reversible integer transformation from the image processing domain and develops a Reversible Data Transform (RDT) algorithm that can both perturb and restore data. The RDT algorithm uses an adjustable weighting mechanism to control the degree of data perturbation, which increases the flexibility of privacy preservation. In addition, it allows a watermark to be embedded in the data, so that tampering with the perturbed data can be detected. Experimental results show that, compared with existing algorithms, RDT preserves knowledge better and is more effective at reducing information loss and privacy disclosure risk. It also provides a higher watermark payload.

* Corresponding author: Chen-Yi Lin, Department of Information Management, National Taichung University of Science and Technology, Taiwan. Tel: +886-4-22196606; Fax: +886-4-22196311. E-mail address: [email protected]


Keywords: Cloud Computing, Privacy-Preserving, Reversible Data Hiding, Data Perturbation.

1. Introduction

In recent years, signal processing in the encrypted domain has attracted the interest of many researchers (Alattar, 2004; Bianchi et al., 2009; Pun and Choi, 2014). This is especially true in cloud computing and delegated computation, where data owners have to disclose or provide their original data to remote servers for processing (Bianchi et al., 2009; Hao et al., 2011; Sasikala and Banu, 2014). Because data owners might not trust these users or service providers, a privacy-preserving mechanism is applied to the original data. This line of research is called Privacy-Preserving Data Mining (PPDM) (Chun et al., 2013; Herranz et al., 2010; Karandikar and Deshpande, 2011; Sasikala and Banu, 2014). PPDM studies how to effectively protect private information while simultaneously preserving the knowledge in the original data (Fung and Mangasarian, 2013; Hajian et al., 2014; Lakshmi and Rani, 2013). The relevant literature can be divided into three types (Sasikala and Banu, 2014): (1) before the original data are disclosed or provided, swap (Li et al., 2012; Yang and Qiao, 2010; Zhu et al., 2009), update (Fung et al., 2007; Mateo-Sanz et al., 2005; Yun and Kim, 2015), and other operations are used to perturb the data; (2) the original data are distributed among two or more sites, and no individual site can learn the data held by the other sites; (3) when a classification model is used to classify the original data, only specific users learn the classification results. Among these, the first type of approach is the most widely adopted. In the first approach, k-anonymity, l-diversity, and randomization are well-known data perturbation methods (Li et al., 2012; Sasikala and Banu, 2014; Yang and Qiao, 2010; Yun and Kim, 2015; Zhu et al., 2009).

In practical data mining applications, the mined knowledge usually has to be cross-analyzed and compared with the original data to confirm the relevance between the knowledge and the data, which helps users verify the authenticity of the knowledge and make decisions (Chen et al., 2013; Zhu and Davidson, 2007). However, with these traditional data perturbation methods (Fung et al., 2007; Li et al., 2012; Sasikala and Banu, 2014; Yang and Qiao, 2010; Yun and Kim, 2015; Zhu et al., 2009), the knowledge cannot be verified because the original data cannot be restored, which causes knowledge uncertainty (Chen et al., 2013; Hong et al., 2010).

Take k-anonymity as an example. In data protected by k-anonymity, every tuple is indistinguishable from at least k - 1 other tuples, thereby hiding the private information in the original data (Li et al., 2012; Yang and Qiao, 2010; Zhu et al., 2009); the degree of privacy preservation is determined by the value of k. A larger k gives a higher degree of privacy preservation, but the data distortion also becomes more pronounced (Karandikar and Deshpande, 2011; Li et al., 2012; Yang and Qiao, 2010; Zhu et al., 2009).

Take the disease dataset in Table 1 as an example to illustrate the k-anonymity method. First, identifier attributes that directly identify individuals, such as ID and Name, are removed; the remaining attributes are partitioned into two categories (Li et al., 2012; Yang and Qiao, 2010; Zhu et al., 2009): (1) attributes that can indirectly identify a person from disclosed information through combination or comparison, such as Age, Cholesterol, and Triglyceride in Table 1, called Quasi Identifier (QI) attributes; and (2) sensitive attributes that may contain private information, such as the disease a patient is suffering from. To prevent interested parties from using the QI attributes to infer patients' sensitive data, k-anonymity perturbs the QI attributes.


Table 1 An original disease dataset (ID and Name are identifier attributes; Age, Cholesterol, and Triglyceride are QI attributes; Disease is the sensitive attribute)

ID       Name       Age  Cholesterol  Triglyceride  Disease
A12345   Alexander  22   165          115           Heart Disease
B12345   Alice      26   178          153           Heart Disease
C12345   Beatrice   23   173          148           Cancer
D12345   Randolph   35   189          127           Cancer
E12345   Mark       34   172          124           Hypertension
F12345   Amanda     37   177          131           Diabetes
G12345   Allen      46   232          165           Diabetes
H12345   Angela     42   226          157           Heart Disease
I12345   Matthew    48   215          182           Diabetes
J12345   Barbara    56   221          191           Hypertension
K12345   Steven     59   243          187           Cancer
L12345   Jennifer   57   234          197           Hypertension

Table 2 The perturbed dataset from Table 1 (Age, Cholesterol, and Triglyceride are QI attributes; Disease is the sensitive attribute)

Age      Cholesterol  Triglyceride  Disease
[20-29]  [160-179]    [110-159]     Heart Disease
[20-29]  [160-179]    [110-159]     Heart Disease
[20-29]  [160-179]    [110-159]     Cancer
[30-39]  [170-189]    [120-139]     Cancer
[30-39]  [170-189]    [120-139]     Hypertension
[30-39]  [170-189]    [120-139]     Diabetes
[40-49]  [210-239]    [150-189]     Diabetes
[40-49]  [210-239]    [150-189]     Heart Disease
[40-49]  [210-239]    [150-189]     Diabetes
[50-59]  [220-249]    [180-199]     Hypertension
[50-59]  [220-249]    [180-199]     Cancer
[50-59]  [220-249]    [180-199]     Hypertension

Assuming k is set to 3, the dataset perturbed by 3-anonymity is shown in Table 2. Although the patients' diseases can no longer be inferred from the perturbed dataset in Table 2, if data mining analysis is performed on it, the vast majority of the knowledge hidden in the original dataset is lost because of the perturbation. Moreover, in certain applications such as the medical domain, precise analysis must rely on the original data; in other words, the inaccurate analysis caused by data perturbation is unacceptable, so being able to recover the original data is of great importance and has become an urgent topic. This study therefore investigates how to protect private data while retaining the ability to restore the original data.
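To make the generalization step in this 3-anonymity example concrete, the following minimal sketch replaces each numeric QI value with the [min-max] range of its 3-record group. This is only an illustration of the idea; it produces tighter ranges than Table 2, which rounds the intervals outward, and it is not part of the RDT algorithm proposed in this paper.

```python
def generalize_groups(records, group_size=3):
    """Replace each numeric QI value with the [min-max] range of its group (k-anonymity style)."""
    out = []
    for i in range(0, len(records), group_size):
        group = records[i:i + group_size]
        ranges = [f"[{min(col)}-{max(col)}]" for col in zip(*group)]   # one range per QI column
        out += [ranges] * len(group)
    return out

# First three rows of Table 1: (Age, Cholesterol, Triglyceride)
print(generalize_groups([(22, 165, 115), (26, 178, 153), (23, 173, 148)]))
# -> [['[22-26]', '[165-178]', '[115-153]'], ...] (Table 2 widens these to rounder intervals)
```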

In image processing, the Reversible Data Hiding (RDH) technique (Alattar, 2004; Coltuc and Chassery, 2007; Hong and Chen, 2011; Peng et al., 2012; Pun and Choi, 2014; Zhang, 2012) makes it impossible for the naked eye to tell the difference between the original image and the image with the watermark embedded. Difference expansion methods (Alattar, 2004; Coltuc and Chassery, 2007; Peng et al., 2012; Pun and Choi, 2014) are often used in the RDH field. Their main idea works on the pixel sequence of the original image: adjacent pixels are collected into groups, and the difference values among the pixels in each group are expanded and used to modify the pixel values while embedding the watermark. The expanded difference values of the pixel groups can later be used to extract the watermark hidden inside the image and to restore the undistorted original image.
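For intuition, the classic two-pixel difference expansion that these methods build on can be sketched as follows; this is a simplified illustration (overflow and location-map handling omitted), not the generalized integer transform used later in this paper.

```python
def de_embed(x, y, bit):
    """Hide one bit in a pixel pair by expanding their difference (simplified difference expansion)."""
    avg, diff = (x + y) // 2, x - y
    diff = 2 * diff + bit                       # expand the difference and place the bit in its LSB
    return avg + (diff + 1) // 2, avg - diff // 2

def de_extract(x2, y2):
    """Recover the hidden bit and the original pixel pair from the expanded pair."""
    avg, diff = (x2 + y2) // 2, x2 - y2
    bit, diff = diff & 1, diff // 2
    return bit, (avg + (diff + 1) // 2, avg - diff // 2)

x2, y2 = de_embed(206, 201, 1)                  # -> (209, 198)
print(de_extract(x2, y2))                       # -> (1, (206, 201)): bit and pair are recovered exactly
```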

To protect individual privacy while overcoming the shortcoming that PPDM cannot revert to the original data, Chen et al. (2013) applied the concept of difference expansion from image processing and proposed the Privacy Difference Expansion (PDE) algorithm. The method pairs adjacent data values into groups and, based on parameters set by the user, determines the difference value within each group, thereby controlling the degree of data perturbation. However, the parameters in the PDE algorithm have no specific relationship with knowledge reservation, so setting appropriate parameters is not easy. In addition, the length of the watermark that can be hidden in the data is limited, which keeps the watermark payload small. Furthermore, Kao et al. (2015) applied the concept of contrast mapping from image processing and proposed the Reversible Privacy Contrast Mapping (RPCM) algorithm.

The reversible integer transformation method (Alattar, 2004; Peng et al., 2012; Pun and Choi, 2014) is a well-known difference expansion method. It takes the pixel triplets formed by the RGB color model as groups and assigns a weight to each pixel within a group, which makes the expansion ratio of the differences adjustable. Compared with other difference expansion methods (Coltuc and Chassery, 2007; Zhang, 2012), the reversible integer transformation method can embed more watermark bits and therefore achieves a higher payload. Therefore, in this paper, to overcome the disadvantages of PDE, we introduce the concept of reversible integer transformation and propose a data perturbation method that can restore the original data, called the Reversible Data Transform (RDT) algorithm. The RDT algorithm protects the private information in the original data and can restore the perturbed data back to the original data. In addition, through a mechanism that allows adjustable weight values, the degree of data perturbation can be tuned to increase the flexibility of privacy preservation. The method is described in detail in the following sections.

2. Reversible Data Transform (RDT) algorithm

Based on the concept of the reversible integer transformation method (Alattar, 2004; Peng et al., 2012; Pun and Choi, 2014), the RDT algorithm uses the differences between the data values within each data group and designs a novel method that can perturb and restore the original data. The RDT algorithm uses an adjustable weighting mechanism to determine the degree of disturbance of the original data, which increases the flexibility of privacy preservation. In addition to protecting private information, compared with the existing algorithms, the RDT algorithm can embed more watermark bits and thus achieves a higher payload. The RDT algorithm is divided into two phases: data perturbation and data recovery. The detailed operation process is as follows.


Data perturbation phase

Input: an original dataset D, QI attributes Q = (q_m, m = 1, 2, 3, …), an integer Seed, a group size g, a set of weights x_i (i ∈ [0, g-1]), and a watermark w.

Step 1. Let n = ⌊|D| / g⌋ - 1, and l = 1.

Step 2. For each q_m:

1) Let <q_{m,j}, q_{m,j+1}, q_{m,j+2}, …, q_{m,j+(g-1)}> be a group of g neighboring data values (j = 1, 1+(1×g), 1+(2×g), …, 1+(n×g)).

2) Use Eq. (1) and Eq. (2) to perform difference expansion on <q_{m,j}, q_{m,j+1}, q_{m,j+2}, …, q_{m,j+(g-1)}> and obtain <q̃_{m,j}, q̃_{m,j+1}, q̃_{m,j+2}, …, q̃_{m,j+(g-1)}>:

$$q'_{m,j} = \left\lfloor \frac{x_0 q_{m,j} + x_1 q_{m,j+1} + x_2 q_{m,j+2} + \cdots + x_{g-1} q_{m,j+(g-1)}}{x_0 + x_1 + x_2 + \cdots + x_{g-1}} \right\rfloor,\qquad q'_{m,j+i} = q_{m,j+i} - q_{m,j},\ \ i = 1, 2, \ldots, g-1. \tag{1}$$

$$\tilde{q}_{m,j} = q'_{m,j},\qquad \tilde{q}_{m,j+i} = 2 \times q'_{m,j+i},\ \ i = 1, 2, \ldots, g-1. \tag{2}$$

3) If l ≤ |w|, then embed the l-th, (l+1)-th, …, (l+(g-2))-th bits of the watermark w into q̃_{m,j+1}, q̃_{m,j+2}, …, q̃_{m,j+(g-1)}, respectively (each expanded value is even, so embedding a bit simply sets its least significant bit), and let l = l + (g-1).

4) Use Eq. (3) to generate the corresponding perturbed group <q̄_{m,j}, q̄_{m,j+1}, …, q̄_{m,j+(g-1)}>:

$$\bar{q}_{m,j} = \tilde{q}_{m,j} - \left\lfloor \frac{x_1 \tilde{q}_{m,j+1} + x_2 \tilde{q}_{m,j+2} + \cdots + x_{g-1} \tilde{q}_{m,j+(g-1)}}{x_0 + x_1 + x_2 + \cdots + x_{g-1}} \right\rfloor,\qquad \bar{q}_{m,j+i} = \tilde{q}_{m,j+i} + \bar{q}_{m,j},\ \ i = 1, 2, \ldots, g-1. \tag{3}$$

Step 3. Use the Rand(Seed) function to generate |D| random values. Based on these random values, the perturbed data are arranged in ascending or descending order to generate the perturbed dataset D̄.

Output: the perturbed dataset D̄.
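The data perturbation phase can be summarized in a short Python sketch for a single QI attribute. This is an illustrative reading of Eqs. (1)-(3) (watermark bits are embedded by setting the least significant bit of each expanded value, and the Seed-based reordering of Step 3 is omitted); the function and variable names are not from the paper.

```python
def perturb_attribute(values, weights, wbits):
    """Perturb one QI attribute with the RDT integer transform (illustrative sketch)."""
    g, s = len(weights), sum(weights)            # group size and weight sum
    out, l = [], 0
    for j in range(0, (len(values) // g) * g, g):
        q = values[j:j + g]
        # Eq. (1): floored weighted mean plus differences to the first value
        d = [sum(w * v for w, v in zip(weights, q)) // s] + [v - q[0] for v in q[1:]]
        # Eq. (2): expand the differences (all become even)
        e = [d[0]] + [2 * x for x in d[1:]]
        # Embed up to g-1 watermark bits by setting the LSB of each expanded value
        for i in range(1, g):
            if l < len(wbits):
                e[i] += wbits[l]
                l += 1
        # Eq. (3): produce the perturbed group
        p0 = e[0] - sum(w * v for w, v in zip(weights[1:], e[1:])) // s
        out += [p0] + [v + p0 for v in e[1:]]
    return out

# Age attribute of Table 1 with g = 4, weights <1, 2, 1, 2>, w = (101100011)2 (see Section 3)
ages = [22, 26, 23, 35, 34, 37, 46, 42, 48, 56, 59, 57]
print(perturb_attribute(ages, [1, 2, 1, 2], [1, 0, 1, 1, 0, 0, 0, 1, 1]))
# -> [15, 24, 17, 42, 28, 35, 52, 44, 40, 56, 63, 59], the Age column of Table 3
```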

Data recovery phase

Prior to restoring the perturbed dataset D̄, the user must know the QI attributes Q, the integer Seed, the group size g, the set of weights x_i, and the length |w| of the watermark w embedded in D̄, in order to properly restore the original dataset D and to extract the watermark w. In addition, the extracted w can be used to verify whether D̄ has been tampered with. The process flow of the data recovery phase is as follows:

Input: the perturbed dataset D̄, QI attributes Q = (q_m, m = 1, 2, 3, …), the integer Seed, the group size g, the set of weights x_i (i ∈ [0, g-1]), and the length |w| of the watermark w.

Step 1. Use the Rand(Seed) function to generate |D̄| random values and arrange them in ascending or descending order, as in the perturbation phase. Then use the order generated by these random values to restore the original arrangement of the data in D̄.

Step 2. Let n = ⌊|D̄| / g⌋ - 1, and l = 1.

Step 3. For each q_m:

1) Let <q̄_{m,j}, q̄_{m,j+1}, q̄_{m,j+2}, …, q̄_{m,j+(g-1)}> be a group of g neighboring data values (j = 1, 1+(1×g), 1+(2×g), …, 1+(n×g)).

2) Transform <q̄_{m,j}, q̄_{m,j+1}, …, q̄_{m,j+(g-1)}> into the corresponding group <q̃_{m,j}, q̃_{m,j+1}, …, q̃_{m,j+(g-1)}> defined by Eq. (4):

$$\tilde{q}_{m,j} = \left\lfloor \frac{x_0 \bar{q}_{m,j} + x_1 \bar{q}_{m,j+1} + x_2 \bar{q}_{m,j+2} + \cdots + x_{g-1} \bar{q}_{m,j+(g-1)}}{x_0 + x_1 + x_2 + \cdots + x_{g-1}} \right\rfloor,\qquad \tilde{q}_{m,j+i} = \bar{q}_{m,j+i} - \bar{q}_{m,j},\ \ i = 1, 2, \ldots, g-1. \tag{4}$$

3) If l ≤ |w|, extract the watermark bits w_l, w_{l+1}, …, w_{l+(g-2)} as LSB(q̃_{m,j+1}), LSB(q̃_{m,j+2}), …, LSB(q̃_{m,j+(g-1)}), and let l = l + (g-1).

4) Use Eq. (5) and Eq. (6) to restore <q̃_{m,j}, q̃_{m,j+1}, …, q̃_{m,j+(g-1)}> back to <q_{m,j}, q_{m,j+1}, …, q_{m,j+(g-1)}>:

$$q'_{m,j} = \tilde{q}_{m,j},\qquad q'_{m,j+i} = \left\lfloor \frac{\tilde{q}_{m,j+i}}{2} \right\rfloor,\ \ i = 1, 2, \ldots, g-1. \tag{5}$$

$$q_{m,j} = q'_{m,j} - \left\lfloor \frac{x_1 q'_{m,j+1} + x_2 q'_{m,j+2} + \cdots + x_{g-1} q'_{m,j+(g-1)}}{x_0 + x_1 + x_2 + \cdots + x_{g-1}} \right\rfloor,\qquad q_{m,j+i} = q'_{m,j+i} + q_{m,j},\ \ i = 1, 2, \ldots, g-1. \tag{6}$$

Output: the original dataset D and the watermark w.
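A matching sketch of the data recovery phase for a single attribute (again illustrative, with the Seed-based reordering omitted and names chosen only for this example):

```python
def recover_attribute(perturbed, weights, wlen):
    """Recover the original values and the watermark bits from one perturbed QI attribute (sketch)."""
    g, s = len(weights), sum(weights)
    orig, bits, l = [], [], 0
    for j in range(0, (len(perturbed) // g) * g, g):
        p = perturbed[j:j + g]
        # Eq. (4): recover the expanded group
        e = [sum(w * v for w, v in zip(weights, p)) // s] + [v - p[0] for v in p[1:]]
        # Extract the watermark bits from the least significant bits
        for i in range(1, g):
            if l < wlen:
                bits.append(e[i] & 1)
                l += 1
        # Eq. (5): undo the expansion
        d = [e[0]] + [v // 2 for v in e[1:]]
        # Eq. (6): restore the original group
        q0 = d[0] - sum(w * v for w, v in zip(weights[1:], d[1:])) // s
        orig += [q0] + [v + q0 for v in d[1:]]
    return orig, bits

print(recover_attribute([15, 24, 17, 42, 28, 35, 52, 44, 40, 56, 63, 59], [1, 2, 1, 2], 9))
# -> ([22, 26, 23, 35, 34, 37, 46, 42, 48, 56, 59, 57], [1, 0, 1, 1, 0, 0, 0, 1, 1])
```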

Let |D| and |QI| denote the number of instances and the number of QI attributes in the original dataset, respectively. In the data perturbation phase, every value of every QI attribute is perturbed, so the perturbation operation executes a total of |D| × |QI| times; thus, the time complexity of the data perturbation phase of the RDT algorithm is O(|D||QI|). Similarly, the time complexity of the data recovery phase is also O(|D||QI|). In addition, in Step 3 of the data perturbation phase, the perturbed data must be arranged in ascending or descending order according to the random values generated by Rand(Seed) in order to generate the perturbed dataset; consequently, the space complexity of the data perturbation phase is O(|D||QI|). The space complexity of the data recovery phase is likewise O(|D||QI|), because the data in the perturbed dataset must be rearranged.

3. An example of the RDT algorithm

In this section, we continue with the disease dataset in Table 1 to illustrate the data perturbation and data recovery phases of the RDT algorithm. We assume that the QI attributes of Table 1 are perturbed with the watermark w = (101100011)₂, the group size g = 4, and the weight set <1, 2, 1, 2>, and we use the Age attribute to illustrate the computation. During data perturbation, we first take every four values of the Age attribute as a group, dividing the attribute into a number of groups. We then apply Eq. (1) and Eq. (2) to group 1, <22, 26, 23, 35>, of the Age attribute to calculate the differences between the data values and perform difference expansion, obtaining the group <27, 8, 2, 26>. Next, apart from the first value of the group, the other three values are embedded with the first three watermark bits (101)₂ of w, giving <27, 9, 2, 27>. Lastly, Eq. (3) is used to generate the corresponding perturbed group <15, 24, 17, 42>. All data groups of the QI attributes are processed in the same way to generate the corresponding perturbed groups and the perturbed dataset shown in Table 3. Since the length of w is 9 and the group size g is 4, all Age values except the first, fifth, and ninth carry watermark bits; the values of the other QI attributes in Table 3 are only perturbed and carry no watermark bits. Finally, in Step 3, we use Seed to perturb the order of the data in the perturbed dataset.

Table 3 The dataset after the data perturbation phase (Age, Cholesterol, and Triglyceride are QI attributes; Disease is the sensitive attribute)

Age  Cholesterol  Triglyceride  Disease
15   151          93            Heart Disease
24   177          169           Heart Disease
17   167          159           Cancer
42   199          117           Cancer
28   142          104           Hypertension
35   152          118           Diabetes
52   262          186           Diabetes
44   250          170           Heart Disease
40   202          173           Diabetes
56   214          191           Hypertension
63   258          183           Cancer
59   240          203           Hypertension

The parameter configuration in the data recovery phase is the same as in the data perturbation phase. First, according to Rand(Seed), we restore the original ordering of the perturbed data. Then, we apply Eq. (4) to the first group <15, 24, 17, 42> of the Age attribute to obtain the difference-expanded group <27, 9, 2, 27>. We then take the watermark bits (101)₂ from the last three values of the group, and use Eq. (5) and Eq. (6) to restore the first original group <22, 26, 23, 35>. All data groups of the QI attributes are restored with the same method. In this way, we obtain the exact original dataset and the watermark w embedded in the data.
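Using the two sketch functions given after the algorithm description (and still ignoring the Seed-based reordering), this worked example can be checked end to end:

```python
# Round-trip check of the Section 3 example with the illustrative sketch functions
ages = [22, 26, 23, 35, 34, 37, 46, 42, 48, 56, 59, 57]
wbits = [1, 0, 1, 1, 0, 0, 0, 1, 1]                        # w = (101100011)2
perturbed = perturb_attribute(ages, [1, 2, 1, 2], wbits)
restored, extracted = recover_attribute(perturbed, [1, 2, 1, 2], len(wbits))
assert restored == ages and extracted == wbits              # exact recovery of data and watermark
```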

4. Measures

In this section, we first explain the assessment methods used to measure the effectiveness of PPDM (Chen et al., 2013; Herranz et al., 2010; Mateo-Sanz et al., 2005; Yun and Kim, 2015). In the next section, these assessment methods are used to compare the effectiveness of the proposed RDT algorithm with the existing PDE algorithm. The purpose of PPDM is to protect the private information in the original data while retaining the existing knowledge. In addition, the perturbed data must lower the disclosure risk of the private data (Chun et al., 2013; Fung and Mangasarian, 2013; Lakshmi and Rani, 2013; Hajian et al., 2014) and retain its value for data mining (Yang and Qiao, 2010; Zhu et al., 2009). However, data perturbation methods are likely to cause a large amount of information loss (Mateo-Sanz et al., 2005; Yun and Kim, 2015). Therefore, the most important basis for analyzing the effectiveness of PPDM protection is the balance between knowledge reservation, information loss, and Privacy Disclosure Risk (PDR).

In terms of knowledge reservation, we use classification, which is commonly applied in medicine, finance, and credit estimation (Dangare and Apte, 2012; Fung et al., 2007; Yang and Qiao, 2010; Zhu et al., 2009), as the basis for this assessment. We use the WEKA 3.6 tool, which contains many data mining techniques (Dangare and Apte, 2012; Hall et al., 2009), to analyze the test datasets, with three well-known classifiers, namely Decision Tree, Naive Bayes, and Support Vector Machine (SVM), and 10-fold cross validation (Chen et al., 2013; Dangare and Apte, 2012; Hall et al., 2009), to analyze the impact of RDT on knowledge reservation. Note that in all experiments the parameters in WEKA 3.6 are left at their default values.

In addition, we use the Probabilistic Information Loss (PIL) measure proposed by Mateo-Sanz et al. (2005) to assess the extent of information loss of the perturbed data. PIL standardizes the data and limits the range of the statistics used in the analysis to between 0 and 1. The mean, variance, covariance, Pearson's correlation, and quantiles before and after the perturbation are then calculated, and the differences before and after data perturbation are expressed as percentages; a smaller value indicates a better result. Lastly, PDR (Chen et al., 2013) combines the calculations of Interval Disclosure (ID) and Distance Linkage Disclosure (DLD) (Yun and Kim, 2015), both based on data similarity, to assess the risk that the private data in the perturbed dataset are exposed. ID calculates the proportion of attribute values in each record that remain within an interval around the corresponding attribute values of the original data. DLD uses Euclidean distance to check whether each perturbed record can still be linked to its original record, and records the proportion of cases where it can, representing the similarity between the perturbed data and the original data. In the experiments, the similarity results of ID and DLD are combined into the PDR value (PDR = 0.5 × ID + 0.5 × DLD) (Chen et al., 2013; Yun and Kim, 2015); a smaller value indicates a better result.
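The PDR combination can be sketched as follows. This is only a simplified reading of the ID and DLD descriptions above (the exact interval and linkage rules of Yun and Kim (2015) are not reproduced), with parameter names chosen for this illustration.

```python
import numpy as np

def pdr(original, perturbed, interval=0.1):
    """Simplified PDR = 0.5 * ID + 0.5 * DLD on numeric records (rows) and attributes (columns)."""
    orig = np.asarray(original, dtype=float)
    pert = np.asarray(perturbed, dtype=float)
    # ID: fraction of perturbed values lying within a relative interval of the original value
    id_score = np.mean(np.abs(pert - orig) <= interval * np.abs(orig))
    # DLD: fraction of perturbed records whose nearest original record (Euclidean) is their own original
    dists = np.linalg.norm(pert[:, None, :] - orig[None, :, :], axis=2)
    dld_score = np.mean(np.argmin(dists, axis=1) == np.arange(len(pert)))
    return 0.5 * id_score + 0.5 * dld_score
```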


5. Experiment results

In this experiment, we used six test datasets published in the UCI Machine Learning Repository (Frank and Asuncion, 2010), which are commonly used in data mining studies, to analyze the effectiveness of RDT. The test datasets are listed in Table 4.

Table 4 Test datasets

Datasets       Number of attributes  Number of instances  Number of classes
Abalone        8                     4,177                3
Breast         10                    699                  2
German credit  25                    1,001                2
Vehicle        19                    846                  4
Satimage       36                    4,435                7
KDD Cup        38                    4,000,000            23

A Decision Tree can rank the attributes of a dataset by importance (Chen et al., 2013; Fung et al., 2007; Hall et al., 2009). Accordingly, in the experiments we selected the top 3, top 5, and top 7 attributes as the QI attributes, in order to test the impact of RDT on knowledge reservation, PIL, and PDR under different numbers of QI attributes (Chen et al., 2013); the sketch below illustrates this selection step. In addition, we tested RDT with different group sizes g and found that the value of g has no large impact on knowledge reservation, PIL, or PDR; owing to space limitations, this section only shows the experimental results for g = 4.
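A hedged sketch of the top-k attribute selection, using scikit-learn feature importances rather than the WEKA ranking actually used in the paper:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def top_k_attributes(X, y, k):
    """Rank attributes by decision-tree importance and return the indices of the top k."""
    importances = DecisionTreeClassifier(random_state=0).fit(X, y).feature_importances_
    return np.argsort(importances)[::-1][:k]

# e.g. qi_indices = top_k_attributes(X, y, 3)   # candidate QI attributes for the "top 3" setting
```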

[Figure 1 Analysis of knowledge reservation. Three panels plot classification accuracy (%) on the six test datasets (Abalone, Breast, German credit, Vehicle, Satimage, KDD Cup) for the original data and for RDT-perturbed data with the top 3, top 5, and top 7 QI attributes: (a) Decision Tree; (b) Naive Bayes; (c) SVM.]

Figure 1 shows, for the Decision Tree, Naive Bayes, and SVM classifiers respectively, the classification accuracy of the test datasets after RDT perturbation, thereby analyzing the impact of the algorithm on knowledge reservation. The experimental results in Figure 1 show that, for all numbers of QI attributes, the classification accuracy of the test datasets perturbed by RDT comes very close to the classification accuracy of the original datasets. This means that datasets protected by RDT do not lose their original knowledge because of the perturbation, demonstrating that RDT can indeed achieve the objective of knowledge reservation.

Figure 2 presents the relationship between the number of QI attributes and information loss. The figure shows that as the number of QI attributes increases, the PIL value becomes higher, indicating more information loss. Among the six test datasets, the more significant information loss occurs when the top 7 attributes of the Abalone, Breast, and KDD Cup datasets are selected as the QI attributes; for the German credit, Vehicle, and Satimage datasets, the impact of RDT on information loss is small. Moreover, when the number of QI attributes was 3, 5, and 7, the average PIL over the six test datasets was 20.57%, 32.44%, and 42.24%, respectively, confirming that data perturbed by RDT suffers only a limited degree of information loss.

[Figure 2 Analysis of PIL values: PIL (%) on the six test datasets (Abalone, Breast, German credit, Vehicle, Satimage, KDD Cup) for the top 3, top 5, and top 7 QI attributes.]

Figure 3 presents the PDR values of each test dataset after RDT perturbation. Figure 3 shows that the PDR values of all test datasets do not change significantly with the number of QI attributes used. This indicates that RDT performs stably in terms of PDR and that, even without a large amount of perturbation, the risk of private information being disclosed is effectively reduced.

[Figure 3 Analysis of PDR values: PDR (%) on the six test datasets for the top 3, top 5, and top 7 QI attributes.]

6. The comparison of RDT and PDE

In this section, we compare the performance of RDT with the PDE algorithm proposed by Chen et al. (2013). Figures 4-6 use classification accuracy, PIL, and PDR to evaluate the two algorithms in terms of knowledge reservation, information loss, and privacy disclosure risk. Figure 4 shows the classification accuracy of the RDT and PDE algorithms averaged over Decision Tree, Naive Bayes, and SVM and over the top 3, top 5, and top 7 attribute settings. Figure 4 shows that, for the Abalone, Breast, German credit, Vehicle, and KDD Cup datasets, the classification accuracies of RDT are closer to the classification accuracy of the original datasets than those of PDE. This indicates that, compared with PDE, data perturbed by RDT retains more of the knowledge of the original datasets.

[Figure 4 The comparison of knowledge accuracy between RDT and PDE: average classification accuracy (%) of the original data, RDT, and PDE on the six test datasets.]

Figure 5 presents the average PIL obtained when RDT and PDE are used to perturb the top 3, top 5, and top 7 attributes of each test dataset. Figure 5 shows that, apart from the PIL values of RDT on the Breast dataset, which are significantly poorer than those of PDE, the remaining values are similar to or even better than those of PDE. This shows that RDT can indeed effectively reduce information loss.

[Figure 5 PIL comparison between RDT and PDE: average PIL (%) on the six test datasets.]

Figure 6 presents the average PDR obtained when RDT and PDE are used to perturb the top 3, top 5, and top 7 attributes of each test dataset. Figure 6 shows that, apart from the PDR values of RDT on the Vehicle dataset, which are poorer than those of PDE, the remaining values are similar to or even better than those of PDE. This shows that RDT can indeed effectively reduce PDR.

[Figure 6 PDR comparison between RDT and PDE: average PDR (%) on the six test datasets.]

In terms of watermark payload, PDE arranges two data values as a group to perturb the data and embed the watermark; only 1 bit of the watermark can be embedded per group, so the watermark payload is 1/2 bit per perturbed value. In contrast, with a group size g of 4, RDT embeds 3 bits into each group, so the watermark payload is 3/4 bit per perturbed value, 1.5 times that of PDE.

Using the top 3 attributes of each test dataset as the QI attributes, Table 5 lists the watermark payload of PDE and RDT. Table 5 shows that the watermark payload of RDT is indeed significantly larger than that of PDE.

Table 5 The comparison of watermark payload between RDT and PDE

               Watermark payload (bits)
Datasets       RDT        PDE
Abalone        9,396      6,264
Breast         1,566      1,047
German credit  2,250      1,500
Vehicle        1,899      1,269
Satimage       9,972      6,651
KDD Cup        9,000,000  6,000,000
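As a sanity check on these figures, assuming groups are formed per attribute and incomplete trailing groups are skipped, the payload is (g - 1) bits per full group per QI attribute; the short sketch below reproduces the RDT (g = 4) and PDE-style pair (g = 2) columns of Table 5.

```python
def payload_bits(instances, qi_attrs, group_size):
    """Watermark capacity: (group_size - 1) bits per full group, per QI attribute."""
    return (instances // group_size) * (group_size - 1) * qi_attrs

print(payload_bits(4177, 3, 4), payload_bits(4177, 3, 2))
# -> 9396 6264, the Abalone row of Table 5 (RDT with g = 4 vs. PDE-style pairs)
```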

Finally, the memory requirements and execution times of the two algorithms are shown in Table 6 and Table 7, respectively. Table 6 shows that the RDT algorithm needs slightly more memory space than the PDE algorithm. In addition, Table 7 shows that, apart from the execution time of the data perturbation phase of RDT on the Vehicle dataset, which is slightly longer than that of PDE, the remaining execution times are similar to or even much better than those of PDE.

Table 6 The comparison of memory space between RDT and PDE

               Memory space (K)
               Data perturbation phase    Data recovery phase
Datasets       RDT        PDE             RDT        PDE
Abalone        65,500     63,636          65,500     63,636
Breast         32,226     11,116          32,226     11,116
German credit  52,140     40,472          52,140     40,472
Vehicle        51,880     50,656          51,880     50,656
Satimage       161,604    154,464         161,604    154,464
KDD Cup        630,996    587,924         630,996    587,924

Table 7 The comparison of execution time between RDT and PDE

               Execution time (ms)
               Data perturbation phase    Data recovery phase
Datasets       RDT        PDE             RDT        PDE
Abalone        218        218             141        187
Breast         16         16              15         15
German credit  125        4,056           109        1,358
Vehicle        78         63              78         47
Satimage       1,421      3,694           1,232      1,685
KDD Cup        2,308,990  2,390,219       2,337,757  2,413,461

7. Conclusions and future work

In order to protect the private information in the original data while retaining its knowledge, this study introduced the concept of reversible integer transformation from the image processing domain and developed the RDT algorithm, which can perturb and restore data. In this algorithm, an adjustable weighting mechanism determines the degree of disturbance of the original data, which increases the flexibility of privacy preservation. In addition, the algorithm allows a watermark to be embedded in the original data, so that tampering with the perturbed data can be detected and data integrity ensured. The experimental results confirm that, compared with existing algorithms, RDT is better at protecting the existing knowledge, reducing information loss, and reducing PDR, and that it offers a higher watermark payload. Future work can investigate other reversible data hiding techniques to achieve a higher level of privacy preservation; methods for selecting appropriate QI attributes are also under development.

Acknowledgments

This work was supported partially by the Ministry of Science and Technology of the Republic of China under grants MOST 102-2218-E-025-001 and MOST 103-2221-E-025-006.


References

Alattar, A.M., 2004. Reversible watermark using the difference expansion of a generalized integer transform. IEEE Transactions on Image Processing 13 (8), 1147-1156.

Bianchi, T., Piva, A., Barni, M., 2009. On the implementation of the discrete Fourier transform in the encrypted domain. IEEE Transactions on Information Forensics and Security 4 (1), 86-97.

Coltuc, D., Chassery, J.M., 2007. Very fast watermarking by reversible contrast mapping. IEEE Signal Processing Letters 14 (4), 255-258.

Chen, T.S., Lee, W.B., Chen, J., Kao, Y.H., Hou, P.W., 2013. Reversible privacy preserving data mining: a combination of difference expansion and privacy preserving. Journal of Supercomputing 66 (2), 907-917.

Chun, J.Y., Hong, D., Jeong, I.R., Lee, D.H., 2013. Privacy-preserving disjunctive normal form operations on distributed sets. Information Sciences 231, 113-122.

Dangare, C.S., Apte, S.S., 2012. Improved study of heart disease prediction system using data mining classification techniques. International Journal of Computer Applications 47 (10), 44-48.

Frank, A., Asuncion, A., 2010. UCI machine learning repository. Available at http://archive.ics.uci.edu/ml/.

Fung, B.C.M., Wang, K., Yu, P.S., 2007. Anonymizing classification data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering 19 (5), 711-725.

Fung, G.M., Mangasarian, O.L., 2013. Privacy-preserving linear and nonlinear approximation via linear programming. Optimization Methods and Software 28 (1), 207-216.

Hajian, S., Domingo-Ferrer, J., Farràs, O., 2014. Generalization-based privacy preservation and discrimination prevention in data publishing and mining. Data Mining and Knowledge Discovery 28 (5-6), 1158-1188.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11 (1), 10-18.

Hao, Z., Zhong, S., Yu, N., 2011. A privacy-preserving remote data integrity checking protocol with data dynamics and public verifiability. IEEE Transactions on Knowledge and Data Engineering 23 (9), 1432-1437.

Herranz, J., Matwin, S., Nin, J., Torra, V., 2010. Classifying data from protected statistical datasets. Computers & Security 29 (8), 874-890.

Hong, T.P., Tseng, L.H., Chien, B.C., 2010. Mining from incomplete quantitative data by fuzzy rough sets. Expert Systems with Applications 37 (3), 2644-2653.

Hong, W., Chen, T.S., 2011. Reversible data embedding for high quality images using interpolation and reference pixel distribution mechanism. Journal of Visual Communication and Image Representation 22 (2), 131-140.

Kao, Y.H., Lee, W.B., Hsu, T.Y., Lin, C.Y., Tsai, H.F., Chen, T.S., 2015. Data perturbation method based on contrast mapping for reversible privacy preserving data mining. Journal of Medical and Biological Engineering 35 (6), 789-794.

Karandikar, P., Deshpande, S., 2011. Preserving privacy in data mining using data distortion approach. International Journal of Computer Engineering Science 1 (2), 24-31.

Lakshmi, M.N., Rani, K.S., 2013. SVD based data transformation methods for privacy preserving clustering. International Journal of Computer Applications 78 (3), 39-43.

Li, T., Li, N., Zhang, J., Molloy, I., 2012. Slicing: a new approach for privacy preserving data publishing. IEEE Transactions on Knowledge and Data Engineering 24 (3), 561-574.

Mateo-Sanz, J.M., Domingo-Ferrer, J., Sebé, F., 2005. Probabilistic information loss measures in confidentiality protection of continuous microdata. Data Mining and Knowledge Discovery 11 (2), 181-193.

Peng, F., Li, X., Yang, B., 2012. Adaptive reversible data hiding scheme based on integer transform. Signal Processing 92 (1), 54-62.

Pun, C.M., Choi, K.C., 2014. Generalized integer transform based reversible watermarking algorithm using efficient location map encoding and adaptive thresholding. Computing 96 (10), 951-973.

Sasikala, I.S., Banu, N., 2014. Privacy preserving data mining using piecewise vector quantization (PVQ). International Journal of Advanced Research in Computer Science & Technology 2 (3), 302-306.

Yang, W., Qiao, S., 2010. A novel anonymization algorithm: privacy protection and knowledge preservation. Expert Systems with Applications 37 (1), 756-766.

Yun, U., Kim, J., 2015. A fast perturbation algorithm using tree structure for privacy preserving utility mining. Expert Systems with Applications 42 (3), 1149-1165.

Zhang, X., 2012. Separable reversible data hiding in encrypted image. IEEE Transactions on Information Forensics and Security 7 (2), 826-832.

Zhu, D., Li, X.B., Wu, S., 2009. Identity disclosure protection: a data reconstruction approach for privacy-preserving data mining. Decision Support Systems 48 (1), 133-140.

Zhu, X., Davidson, I., 2007. Knowledge discovery and data mining: challenges and realities. Information Science Reference, Hershey.


Author biography

Chen-Yi Lin received the MS degree in information and computer education from National Taiwan Normal University, Taiwan, Republic of China, in 2004, and the PhD degree in computer science from National Tsing Hua University, Taiwan, Republic of China, in 2012. She is an assistant professor in the Department of Information Management, National Taichung University of Science and Technology. Her research interests include data mining, mobile computing, and social network analysis.