Journal Pre-proof
Novel trajectory privacy-preserving method based on clustering using differential privacy

Xiaodong Zhao, Dechang Pi, Junfu Chen

PII: S0957-4174(20)30067-1
DOI: https://doi.org/10.1016/j.eswa.2020.113241
Reference: ESWA 113241

To appear in: Expert Systems With Applications

Received date: 12 October 2019
Revised date: 5 January 2020
Accepted date: 23 January 2020

Please cite this article as: Xiaodong Zhao, Dechang Pi, Junfu Chen, Novel trajectory privacy-preserving method based on clustering using differential privacy, Expert Systems With Applications (2020), doi: https://doi.org/10.1016/j.eswa.2020.113241
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier Ltd.
HIGHLIGHTS

Differential privacy technology is applied to trajectory clustering.
Restricted noise is added to the location data to improve the clustering quality of the noisy data.
The paper gives three attack models in cluster analysis and the corresponding defense methods.
Novel trajectory privacy-preserving method based on clustering using differential privacy

Xiaodong Zhao*, Dechang Pi, Junfu Chen
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China

Author addresses:
[email protected] (Xiaodong Zhao); [email protected] (Dechang Pi); [email protected] (Junfu Chen)
Abstract- With the development of location-aware technology, large amounts of user location data are collected into trajectory databases. If these trajectory data are used directly for data mining without any processing, they pose a threat to users' personal privacy. Differential privacy is currently favored by experts and scholars because of its rigorous mathematical foundation, but how to apply differential privacy technology to trajectory clustering analysis is a difficult problem. To address the poor data availability of existing trajectory privacy-preserving models and their difficulty in resisting complex privacy attacks, we devise a novel trajectory privacy-preserving method based on clustering using differential privacy. More specifically, Laplacian noise is added to the count of trajectory locations in each cluster to resist continuous query attacks. Then, radius-constrained Laplacian noise is added to the trajectory location data in each cluster to prevent excessive noise from degrading the clustering quality. From the noisy location data and the noisy location count, the noisy clustering center of each cluster is obtained. Finally, considering that an attacker can associate a user's trajectory with other information to form a secret reasoning attack, the secret reasoning attack model is proposed, and we use differential privacy technology to provide the corresponding resistance. Experimental results on open data show that the proposed algorithm can not only effectively protect the private information in trajectory data but also preserve data availability in cluster analysis, and compared with other algorithms, our algorithm performs well on several evaluation indicators.

KEYWORDS
Trajectory data; cluster analysis; privacy protection; differential privacy
1 INTRODUCTION

With the popularity of location-aware devices, people's quality of life has improved, but the corresponding location data have also been collected into trajectory databases. Owing to the increasing power of database systems and the ever-decreasing cost of data storage, the collection of personal data is no longer the work of government and statistical departments alone; the financial sector, internet companies, medical institutions and other organizations also hold large amounts of personal data. Cluster analysis is one of the more effective data mining methods and now has more and more applications; it can be used in statistical data analysis, image processing, pattern recognition, bioinformatics and other fields. Similarly, the trajectory data stored in a trajectory database can be used for spatio-temporal data mining related to trajectory clustering, such as mining frequent positions in trajectories for user behavior analysis, detecting the operation of road vehicles to reasonably deploy traffic police, and understanding the travel patterns of urban residents to rationally build infrastructure. However, using these trajectory data directly for trajectory clustering analysis may leak private user information, such as hobbies, eating habits, home addresses and other sensitive information. Using this sensitive information, advertisers can push ads and criminals can commit criminal activities. Large-scale leakage of sensitive location information brings immeasurable losses and security threats to users, and citizen privacy protection has been increasingly valued worldwide.

How to protect data privacy and build an effective privacy-protecting data release model has become a hot topic of current research. The task of privacy-protecting data release is to ensure that the private information in the data is not leaked while also ensuring the availability of the published data; in other words, to achieve a balance between data privacy and availability (Liu & Li, 2016). At present, applied research on privacy protection methods mainly focuses on association rule mining and classification mining. The k-anonymity model and its extended protection models have been widely used in the domain of privacy protection. This approach hides one data record among at least k stored records to achieve the purpose of privacy protection. Researchers have improved k-anonymity, successively producing privacy protection technologies such as l-diversity (Shmueli & Tassa, 2015), t-proximity (Ni, Gu & Chen, 2016), (a, k)-anonymity (Trujillo & Domingo, 2013), and generalization and randomization (Zhang & Meng, 2014). These privacy protection technologies often do not model the attacker's background knowledge; therefore, as attackers master more and more background knowledge, the attack methods become more and more complex, and complex background-knowledge attacks such as joint attacks (Ganta, Kasiviswanathan & Smith, 2008) and consistency attacks (Liu & Li, 2016) are often difficult for these privacy protection models to cope with. Dwork et al. (2006) presented a novel protection model based on differential privacy in 2006, which is currently considered the most reliable model (Zhang & Meng, 2014): it can defend against any attack regardless of the attacker's background information.

At present, research combining differential privacy with clustering is not mature, and there are few studies. The DPK-means algorithm combining differential privacy with k-means was proposed by Blum et al. (2005); this was the first application of differential privacy to cluster analysis. The algorithm uses the administrator identity to introduce noise into the query response, which can protect the privacy of a single data record; however, the availability of its clustering results is not robust to noise. Dwork et al. (2010) further studied the algorithm and offered a k-means protection algorithm based on differential privacy. Ren et al. (2017) improved the DPK-means algorithm on the premise of satisfying differential privacy.
Their method randomly divides the data set into multiple subsets to obtain the initial center points, but the number of data set partitions is difficult to determine, which affects the stability of the clustering. The above methods generally suffer from poor clustering performance, poor data availability, difficulty in resisting complex attacks based on background knowledge, or suitability only for specific clustering algorithms.

Based on the above summary, we apply differential privacy technology to trajectory clustering. Differential privacy technology can ignore the background knowledge of the attacker and resist complex background-knowledge attacks and joint attacks. Moreover, this paper limits the amount of noise added to the data, ensuring the availability of the data while guaranteeing privacy protection. In addition, this paper proposes a general method for privacy protection in trajectory clustering analysis that is not tied to a particular clustering algorithm. To the best of our knowledge, we are the first to take into account the uncertainty of the correlation between the dimensions of trajectory data. More specifically, we add Laplace noise to the trajectory locations, the cluster centers and the location counts of each cluster to defend against attacks. Since it cannot be determined whether the different dimensions of the trajectory data are correlated, we represent the final noise by a linear combination of the noise added separately to each dimension and the noise added in the two-dimensional space. At the same time, considering that a trajectory may contain other information that leaks user privacy, this paper also adds noise to the corresponding data results.

The framework of the privacy protection system for trajectory cluster analysis is shown in Figure 1. The user sends a trajectory cluster analysis request to the system, the system adds noise to the relevant data according to the noise design, and the result after adding the noise is returned to the user. The overall process of the privacy protection approach for trajectory cluster analysis is shown in Figure 2. First, three attack modes in trajectory clustering analysis are identified, and corresponding defense methods are given for these three attacks. Next, the corresponding noise design is derived from the defense methods, the size of the noise to be added is calculated according to the relevant formulas of the noise design, the calculated noise is added to the corresponding data results, and the noisy data results are returned to the user.
Fig. 1. The framework of the privacy protection system for trajectory cluster analysis.

Fig. 2. The overall process of the privacy protection approach for trajectory cluster analysis.
Note that, in this paper, we focus on the privacy protection of trajectory data in cluster analysis, but our work can also be applied to privacy protection in cluster analysis in other fields with a few changes. In the following, we list the innovative work of this paper.
(1) Applying differential privacy technology to trajectory clustering, a novel trajectory privacy-preserving method based on clustering using differential privacy is proposed, which can ignore the background knowledge of the attacker and ensure data availability while guaranteeing privacy protection.
(2) This paper proposes a general method for privacy protection in trajectory clustering analysis that is not tied to a particular clustering algorithm.
(3) To resist the cluster location attack proposed in this paper, Laplacian noise is added to the trajectory location data and the cluster center of each cluster.
(4) To the best of our knowledge, we are the first to take into account the uncertainty of the correlation between trajectory data dimensions. Hence, the final noise is represented by a linear combination of noise added in the two-dimensional space and noise added independently in each one-dimensional space, and the size of the added noise is limited. Compared with the general way of adding noise, this method can effectively reduce the noise level and achieves good clustering performance.
(5) To combat continuous query attacks, Laplace noise is added to the location counts in the trajectory data.
(6) The secret reasoning attack is proposed, and Laplacian noise is added to the other data information in the trajectories of each cluster to defend against this attack.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 presents preliminaries and basic definitions. Section 4 describes our algorithm for trajectory privacy protection in cluster analysis. Experimental results are reported in Section 5. Finally, Section 6 concludes the paper. For clarity, the main notations used in the rest of this paper are summarized in Table 1.

Table 1 Summary of Notations

Notation                     Description
O                            A set of moving objects
O_i                          A moving object, O_i ∈ O
tr_i                         A trajectory of moving object O_i
(x_i^j, y_i^j, t_i^j)        The location of moving object O_i at time t_i^j is (x_i^j, y_i^j)
p_i^j                        Shorthand for (x_i^j, y_i^j, t_i^j)
(p_i^1, p_i^2, ..., p_i^m)   A trajectory of a moving object
pl_i^j                       Other information that may lead to user privacy disclosure
npl_i^j                      Information that does not reveal user privacy at this time
ct_i^n                       Number of trajectory points in the i-th cluster of the clustering result
(cx_i, cy_i)                 Coordinates of the clustering center of the i-th cluster of the clustering result
T, T'                        A pair of adjacent data sets in which only one record differs
ε                            Privacy protection budget parameter
A                            A random algorithm for ε-differential privacy protection
Range(A)                     The set of all possible outputs of A
SO                           An arbitrary output of algorithm A
Pr[·]                        The risk of disclosure of privacy
Δf                           Global sensitivity of the function f
λ                            Laplace mechanism parameter
p(x | λ)                     Probability density function under the Laplace mechanism
2 RELATED WORK

In recent years, many experts and scholars have studied the privacy protection of trajectories. This research mainly focuses on trajectory anonymity, and there is relatively little research on trajectory differential privacy. Trajectory anonymity technology mainly uses k-anonymity and its extended protection models to protect trajectories. Below, we review the current research on trajectory k-anonymity, trajectory differential privacy and differential privacy protection in cluster analysis.

Over the past few years, trajectory clustering and privacy protection have attracted numerous scholars and achieved considerable research results in different fields. Yang et al. (2020) proposed a new trajectory clustering algorithm called TAD, which extracts trajectory stays based on a spatio-temporal density analysis of the data. The algorithm defines two new metrics, the NMAST density function and the NT factor, and NT uses the characteristics of noise to dynamically assess and reduce the effects of noise. Yun et al. (2016) developed a new framework for monitoring of vehicle outliers based on clustering technology (MVOC), which can monitor vehicle outliers caused by complex vehicle conditions. To better analyze the vehicle information, they first cluster the vehicle data and then analyze the generated clusters using the vehicle outlier information caused by the complex correlation of vehicle components. Langari et al. (2020) presented a combined anonymization algorithm based on K-member fuzzy clustering and the Firefly algorithm (KFCFA), which can protect anonymized databases from identity leakage, attribute leakage, link leakage and similarity attacks while minimizing the loss of information; in addition, they introduced constrained multi-objective functions for privacy protection in social networks. Yun et al. (2015) proposed a fast perturbation algorithm based on a tree structure, which performs the database perturbation process faster to prevent sensitive information from being leaked. They used a two-table traversal technique to reduce the search space of PPUM, which improves the efficiency of the search, and the proposed algorithm scales well for databases with gradually increasing attributes and transaction characteristics. Polatidis et al. (2017) proposed a multi-level privacy protection method for collaborative filtering systems, implemented by perturbing each rating before it is submitted to the server. The algorithm not only protects user privacy in collaborative filtering but also maintains high recommendation precision; the perturbation is based on multiple levels, and the random values at different levels have different ranges. Zhao et al. (2019) developed a new structure called the sequence R (SR)-tree that satisfies differential privacy based on the R-tree, with differential privacy technology used to protect data privacy. Combined with the road network space, the similarity of the road sequences of the trajectory data replaces the minimum bounding rectangle structure of the R-tree to construct the SR-tree, and consistency processing is performed on the data of the noisy SR-tree. Mazeh & Shmueli (2020) proposed a novel architecture for recommender systems that overcomes two major limitations of existing recommender systems: it allows the recommendation system to utilize the rich data collected about a user to generate more accurate recommendations while allowing users to manage and control their data, enhancing privacy without sacrificing accuracy.
2.1 Trajectory K-Anonymity

The trajectory k-anonymity model ensures that an arbitrary anonymous region covers at least k trajectories and their corresponding sampling points, so that the probability of the attacker identifying the real trajectory of the attacked object is 1/k. Wu et al. (2013) devised a (k, δ, ∆)-anonymity model to prevent published trajectory data from secondary clustering attacks. The model guarantees that the quality of the published trajectory data is not lower than the quality threshold ∆, and it first performs intra-group hybridization and then inter-group disruption based on the clustering results obtained with the (k, δ)-model and related algorithms. The method addresses the problem that traditional clustering-based trajectory publishing algorithms focus only on the privacy of a single trajectory and neglect the protection of the group features of trajectory clusters. Zhang et al. (2019) proposed a trajectory privacy protection scheme based on a Trusted Anonymous Server (TAS). In snapshot queries, the TAS generates a group request that satisfies spatial k-anonymity for the user group, preventing the location-based service provider (LSP) from performing inference attacks; in continuous queries, the TAS determines whether the group request needs to be resent by detecting whether the user will leave its security zone, which reduces the probability that the LSP reconstructs the real trajectory of the user. In addition, Tu et al. (2018) designed an algorithm that generalizes trajectories to combat semantic attacks and re-identification attacks. The algorithm addresses the problem of personal private information being obtained through the semantic features of frequently visited locations and satisfies the requirements of k-anonymity, l-diversity and t-closeness. It provides privacy protection against semantic and re-identification attacks while preserving the data for efficient use, and because it is based on the points that need to be protected, it has strong pertinence. Aiming at the problem that some studies modify location information to achieve anonymity, leading to inaccurate data sets, Chiba et al. (2019) proposed an algorithm that reduces the modification distance of locations and allows time mismatches when acquiring location information within a certain range; the article also defines indicators that represent the distortion of position and time information. Zhou et al. (2019) designed a trajectory protection scheme based on fog computing and k-anonymity, which is suitable for offline trajectory data protection in trajectory release and for real-time trajectory privacy protection in queries such as continuous queries. Fog computing provides users with local storage and mobility to ensure physical control, and k-anonymity constructs a cloaking area for each snapshot based on time-dependent query probability and transition probability.
2.2 Trajectory Differential Privacy

The differential privacy model (DPM) has a rigorous mathematical background and has been favored by many scholars. This model was proposed by Dwork et al. (2006) in 2006. By adding differential privacy noise to the data, an attacker cannot judge whether a given record is in the database, which achieves the purpose of privacy protection. The differential privacy protection model, which has a strict mathematical basis, is a privacy protection model based on data perturbation. In the last few years, differential privacy has gradually been applied to privacy protection. Andrés et al. (2013), building on geo-indistinguishability, gave a formal definition of protecting a user's geographic location within a radius r and proposed a new perturbation technique that realizes geo-indistinguishability by adding controlled random noise to the user's location. Their work shows how the relevant parameters extend to location traces and how privacy decreases as the trace becomes longer, and it describes how their mechanism can enhance LBS applications with the guarantee of geo-indistinguishability. Deldar & Abadi (2018) introduced a new notion of differential privacy for personalized location trajectory databases, called PLDP-TD, presented a new strategy for node privacy-level allocation in a personalized noisy trajectory tree and for personal privacy budget allocation, and designed a new differential privacy algorithm. The algorithm uses the personalized noisy trajectory tree, constructed from the underlying trajectory database, to answer statistical queries in a differentially private way. For locations with different privacy protection requirements, the method guarantees non-uniform privacy, and it enforces consistency constraints to make the personalized noisy trajectory tree consistent in the best way. Liu et al. (2018) proposed a flexible trajectory privacy model of ω-event n²-block differential privacy that considers the temporal and spatial locality of trajectories. The model ensures that any trajectory occurring in an area of n × n blocks during ω successive timestamps is protected under ε-differential privacy. The solution takes into account the importance of spatial correlation and avoids the leakage of user trajectory privacy and the decline of data utility. Gu et al. (2018) devised a trajectory data protection mechanism based on differential privacy. The method chooses the protected points from the trajectory data and forms a polygon with the high-frequency points around them; it then adds noise to the centroid of the polygon, takes the noisy centroid as the protected point, and releases the new trajectory data. Xiao et al. (2015) proposed a systematic solution to protect location privacy with strict privacy guarantees. The article gives the definition of δ-location set based differential privacy to address the temporal correlation in location data. They then proposed a new concept called the sensitivity hull to capture geometric sensitivity in multidimensional space, which can bound the error of differential privacy. In addition, to obtain optimal utility, a planar isotropic mechanism (PIM) for location perturbation is proposed, which is the first mechanism to achieve the lower bound of differential privacy. Ou et al. (2018) took the first step of proposing a mathematically rigorous n-body Laplace framework that satisfies differential privacy, which effectively prevents social relationship inference through the mutual correlation between the n-node trajectories of two users. The article defines a trajectory relevance score to measure the social relationship between two users. Under the n-body Laplace framework, they proposed two Lagrange-multiplier-based differential privacy (LMDP) methods, called UD-LMDP and UC-LMDP, to optimize the privacy budget under given conditions. The methods are optimized by the data utility of location distance measurement and the data utility of location correlation measurement, and the article provides detailed privacy and data utility analysis as well as an analysis of LMDP's adversary knowledge. Wei et al. (2019) proposed a DP-based trajectory community recommendation (DPTCR) scheme to implement an effective trajectory community recommendation (TCR) service while protecting the user's trajectory privacy. The scheme is based on a private semantic expectation method that converts the positions of the actual trajectory into noisy feature positions, ensuring the semantic similarity between the actual positions and the noisy positions. DPTCR also uses a dedicated geographic distance method to construct a noisy trajectory with the smallest geographic distance from the actual trajectory. In addition, DPTCR uses a semantic geographic distance model to cluster the community, which ensures that the trajectory is highly similar to the constructed noisy trajectory in that community.
2.3 Differential Privacy Protection in Cluster Analysis

The purpose of clustering-based privacy-preserving algorithms is to protect personal sensitive information without losing clustering accuracy. Li et al. (2019) proposed a two-step differentially private method to publish cross-community clustering coefficients. Specifically, the DPLM algorithm improves the Louvain method to partition a network using the exponential mechanism, with the neighborhood sanitized according to the absolute gain rather than the relative gain used in the original algorithm. DPCC then draws the noisy distribution of clustering coefficients in the form of a histogram. Their method has been shown to provide valuable distribution results while ensuring users' privacy. Su et al. (2016) studied the effectiveness of interactive and non-interactive methods in k-means clustering and analyzed the empirical error behavior of both. They proposed a differentially private version of Lloyd's algorithm called DPLloyd, and in addition offered a non-interactive method called EUGkM, which publishes differentially private profiles for k-means clustering. Li et al. (2010) constructed a new model that provides a universal and practical microdata publishing technique satisfying differential privacy. They first analyze k-anonymization and explain that k-anonymization technology cannot provide sufficient protection and can be re-identified by an attacker; their method combines the k-anonymity algorithm with differential privacy protection to publish microdata, and they prove that the method can indeed provide privacy protection. Wang et al. (2015) established a differential privacy framework for subspace clustering and proposed two provably differentially private subspace clustering algorithms. The framework addresses privacy breaches involving sensitive data sets about human subjects. One of the proposed methods has formal privacy and utility guarantees, while the other performs well in practice while approximately preserving differential privacy. Ni et al. (2018) proposed a differentially private multi-core DBSCAN clustering (DP-MCDBSCAN) model based on the combination of differential privacy and DBSCAN, which can avoid user privacy leakage in the data mining process; evaluation of this architecture shows better efficiency than previous architectures.
3 PROBLEM STATEMENT

In this section, we first introduce some theory related to trajectories, then introduce several clustering-oriented attack models with specific examples, and finally introduce the concepts and advantages of differential privacy.
3.1 Attack Model

Definition 1 (Trajectory): O is a collection of moving objects. For every O_i ∈ O, the trajectory of moving object O_i is denoted tr_i, and a trajectory can be expressed as tr_i = ⟨(x_i^1, y_i^1, t_i^1), (x_i^2, y_i^2, t_i^2), ..., (x_i^m, y_i^m, t_i^m)⟩, where (x_i^j, y_i^j, t_i^j) represents that the location of moving object O_i at time t_i^j is (x_i^j, y_i^j), and the timestamps satisfy t_i^1 < t_i^2 < ... < t_i^m.
Definition 2 (Trajectory database (Zhao et al., 2019)): In a trajectory database of moving objects, besides the trajectory data, other data related to the moving objects are included. This information can be divided into information that can lead to privacy leaks and information that currently does not lead to privacy leaks. If a moving object is a car, the information that can lead to a privacy leak includes, besides the location information, the driving service area, gas stations and so on. The relevant information contained in the trajectory database can then be represented as d_i = ⟨p_i^1, p_i^2, ..., p_i^m : pl_i^1, pl_i^2, ..., pl_i^n : npl_i^1, npl_i^2, ..., npl_i^q⟩, where p_i^1 can be written as (x_i^1, y_i^1, t_i^1), ⟨pl_i^1, pl_i^2, ..., pl_i^n⟩ represents other information in the trajectory database that may lead to user privacy leaks, and ⟨npl_i^1, npl_i^2, ..., npl_i^q⟩ indicates information that currently does not disclose user privacy. This article only considers other information that may lead to user privacy disclosure.

The attacker may obtain private user information through cluster analysis of trajectory data. We provide corresponding defense methods for several attack models existing in trajectory cluster analysis. For example, we call one new attack method the cluster location attack. The cluster location attack is based on two adjacent clustering areas: the attacker can obtain further location information of a user by knowing the location information of the user and the coordinates of the clustering centers of the two clustering areas. Next, we introduce the cluster location attack in detail.
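To make Definitions 1 and 2 concrete, the following minimal Python sketch (our own illustration; the class and field names are ours, not from the paper) represents one database record d_i with its trajectory points and the pl/npl information:

```python
from dataclasses import dataclass
from typing import List, Tuple

# A trajectory point (x, y, t) as in Definition 1.
Point = Tuple[float, float, float]

@dataclass
class TrajectoryRecord:
    """One record d_i = <p_i^1..p_i^m : pl_i^1..pl_i^n : npl_i^1..npl_i^q> (Definition 2)."""
    points: List[Point]            # trajectory points, timestamps strictly increasing
    sensitive_info: List[str]      # pl: other info that may leak privacy (e.g., friend list)
    nonsensitive_info: List[str]   # npl: info that currently does not leak privacy

    def is_valid(self) -> bool:
        # Enforce t_i^1 < t_i^2 < ... < t_i^m from Definition 1.
        times = [t for (_, _, t) in self.points]
        return all(a < b for a, b in zip(times, times[1:]))

record = TrajectoryRecord(
    points=[(1.0, 2.0, 0.0), (1.5, 2.2, 1.0), (2.0, 2.9, 2.0)],
    sensitive_info=["friends: Sary, Tony"],
    nonsensitive_info=["vehicle color: blue"],
)
assert record.is_valid()
```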
Definition 3 (Cluster location attack): D denotes the database of trajectory points, ct_i^n represents the number of trajectory points in the i-th cluster, and (cx_i, cy_i) denotes the coordinates of the clustering center of the i-th cluster, which are the averages of all trajectory points in that cluster. If the attacker knows that the numbers of trajectory points and the clustering-center coordinates in the clustering results of two adjacent regions are ct_1^n, (cx_1, cy_1) and ct_2^n, (cx_2, cy_2), then the attacker can calculate the coordinates (x, y) of a trajectory location point of the moving object as (Wang et al., 2017):

x = ct_1^n · cx_1 − ct_2^n · cx_2
y = ct_1^n · cy_1 − ct_2^n · cy_2    (1)
The following example illustrates the cluster location attack. Tables 2 and 3 contain the trajectory data of two adjacent clustering regions, in which only one location point differs. From Table 2, the number of trajectory points in this region is 13, and the coordinate of the clustering center point is (Σ_{m=1}^{4} Σ_n x_m^n / 13, Σ_{m=1}^{4} Σ_n y_m^n / 13). From Table 3, the number of trajectory points in this region is 14, and the coordinate of the clustering center point is (Σ_{i=1}^{4} Σ_j x_i^j / 14, Σ_{i=1}^{4} Σ_j y_i^j / 14). By formula (1), a location point of the fourth trajectory can be obtained as follows:

(x, y) = 14 · Σ_{i=1}^{4} Σ_j (x_i^j, y_i^j) / 14 − 13 · Σ_{m=1}^{4} Σ_n (x_m^n, y_m^n) / 13 = (x_4^4, y_4^4)    (2)
In this way, the attacker learns a trajectory location of a moving object, thus revealing the user's trajectory privacy.

Table 2 Data of trajectory points in adjacent area D1

ID   Trajectory
1    (x_1^1, y_1^1, t_1^1), (x_1^2, y_1^2, t_1^2), (x_1^3, y_1^3, t_1^3)
2    (x_2^1, y_2^1, t_2^1), (x_2^2, y_2^2, t_2^2), (x_2^3, y_2^3, t_2^3), (x_2^4, y_2^4, t_2^4)
3    (x_3^1, y_3^1, t_3^1), (x_3^2, y_3^2, t_3^2), (x_3^3, y_3^3, t_3^3)
4    (x_4^1, y_4^1, t_4^1), (x_4^2, y_4^2, t_4^2), (x_4^3, y_4^3, t_4^3)
Table 3 Data of trajectory points in adjacent areas D2
ID
Trajectory
x , y ,t , x , y 1 1
1 2
1 1
2 1
x , y ,t , x , y 1 2
1 2
1 2
2 2
2 2
1 3
1 3
1 3
2 3
x , y ,t , x , y 1 4
1 4
1 4
2 4
2 4
2 1
, t12 , x13 , y13 , t13
, t22 , x23 , y23 , t23 , x24 , y24 , t24
x , y ,t , x
3 4
1 1
, y32 , t32 , x33 , y33 , t33
, t42 , x43 , y43 , t43 , x44 , y44 , t44
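The attack arithmetic of formulas (1)-(2) is easy to demonstrate; the sketch below (our own illustration, with made-up coordinates) recovers the single differing point from the counts and centers of two adjacent clusters:

```python
import numpy as np

rng = np.random.default_rng(0)

# 13 points in region D1; D2 contains the same points plus one extra (the secret point).
d1 = rng.uniform(0, 10, size=(13, 2))
secret = np.array([4.2, 7.7])
d2 = np.vstack([d1, secret])

ct1, center1 = len(d1), d1.mean(axis=0)   # released statistics for D1
ct2, center2 = len(d2), d2.mean(axis=0)   # released statistics for D2

# Cluster location attack: ct2*center2 - ct1*center1 recovers the differing point exactly.
recovered = ct2 * center2 - ct1 * center1
print(recovered, secret)                   # identical up to floating-point error
assert np.allclose(recovered, secret)
```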
In addition to the location information of a trajectory, other information may also reveal the user's personal privacy. This other information is special: a user's information can be inferred by querying the results of other information in the database.

Definition 4 (Secret reasoning attack): By grasping some trajectory sequences and some other information, an attacker can use reasoning, through association or other means, to obtain the user's privacy.

For example, if the attacker grasps part of the trajectory information and the friendship relations, the trajectory of a user can be estimated with high probability from the information of the user's friends, thereby obtaining the user's trajectory privacy and even threatening the user's security. A concrete example illustrates this attack model. Table 4 shows a simple trajectory database. Suppose the attacker knows that Lina's friends include Sary and Tony, and that Sary's and Tony's trajectories are ID1 and ID3. By comparing the trajectories of ID1 and ID3, the attacker finds that both contain the position points p_1^2 and p_1^4. From this, it can be inferred that their common friend Lina's trajectory also contains these two position points, and the trajectory in the database that matches Lina is ⟨p_2^1, p_1^2, p_2^3, p_1^4⟩, i.e., ID2.

Table 4 A trajectory database

ID   Trajectory information                           Friend information
1    p_1^1, p_1^2, p_1^3, p_1^4, p_1^5                Bob, Lily, Lina
2    p_2^1, p_1^2, p_2^3, p_1^4                       Sary, Abby, Tony
3    p_3^1, p_1^2, p_3^3, p_1^4, p_3^5, p_3^6         Abel, Lina, Bob
4    p_4^1, p_4^2, p_4^3                              Bruce, Eddie, Eric
3.2 Differential Privacy

Differential privacy technology (Dwork, 2006) randomly perturbs the published data, making it impossible for an attacker, regardless of background knowledge, to identify in the statistical sense whether a record is in the data set. Moreover, this technology has rigorous mathematical proofs that guarantee the statistical properties of the data.

Definition 5 (ε-differential privacy (Dwork, 2006)): Let A : T → R be a random algorithm, where the domain is T and the range is R. Range(A) is the set of all possible outputs of A, and SO is an element of Range(A). For any two adjacent data sets T and T' that differ in at most one record, if algorithm A satisfies equation (3) (Dwork, 2006),

Pr[A(T) ∈ SO] ≤ e^ε · Pr[A(T') ∈ SO]    (3)

then algorithm A is said to provide ε-differential privacy protection, where the probability Pr[·] indicates the risk of disclosure of privacy and is determined by the randomness of A. The parameter ε is called the privacy protection budget: the larger ε is, the smaller the disturbance added to the data and the lower the degree of privacy protection.

For ε-differential privacy, Dwork et al. proposed the Laplacian mechanism (Dwork, 2006), which is the earliest differential privacy method and is still the most widely used mechanism. Before presenting the Laplacian mechanism, we need to introduce global sensitivity (McSherry & Talwar, 2007), defined in Definition 6.
Definition 6 (Global sensitivity (McSherry & Talwar, 2007)): For a function f : T → R^d, where T is a data set, the input of f is a data set and the output of f is a d-dimensional vector. For two adjacent data sets T_1 and T_2, the sensitivity of f is (McSherry & Talwar, 2007):

Δf = max_{T_1, T_2} ‖ f(T_1) − f(T_2) ‖_1    (4)

where ‖ f(T_1) − f(T_2) ‖_1 is the 1-norm of their difference. Global sensitivity measures the maximum change in the output of a function caused by the change of any single record.

Definition 7 (Laplace mechanism (Dwork, 2006)): For a function f : D → R^d, algorithm A provides ε-differential privacy iff the following expression holds (Dwork, 2006):

A = f(D) + Laplace(Δf / ε)    (5)

where the probability density function of the noise is p(x | λ) = (1/(2λ)) e^{−|x|/λ}, and the Laplace parameter λ = Δf / ε is determined by the privacy protection budget ε and the global sensitivity Δf of the function.
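As a brief illustration of Definition 7 (a minimal sketch of ours, not code from the paper), the Laplace mechanism for a count query with Δf = 1 can be written as:

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng: np.random.Generator) -> float:
    """Release true_value + Laplace(Δf/ε) noise, as in formula (5)."""
    lam = sensitivity / epsilon          # λ = Δf / ε
    return true_value + rng.laplace(loc=0.0, scale=lam)

rng = np.random.default_rng(42)
true_count = 13                          # e.g., number of points in a cluster
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy_count)                       # the released, privacy-protected count
```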
4 PROPOSED APPROACH

In this section, we introduce the trajectory privacy-preserving method based on clustering using differential privacy. First, according to the attack models described in Section 3 and differential privacy technology, the corresponding defense methods are given. Then, according to differential privacy theory, the corresponding noise designs are given. Finally, the proposed methods are described in detail.
4.1 Attack Defense

Under continuous queries, attackers can deduce a user's location from location counts, thus revealing location privacy. An example illustrates the continuous query attack. Suppose Alice has two friends named John and Eric, John is ill, and Alice accompanies John to the hospital. Alice queries nearby friends by mobile phone and finds two. At this point, Alice can infer with certainty that Eric is also in the hospital, thus revealing Eric's location. Therefore, for clustering, Laplacian noise is added to the position counts in the trajectories to resist this attack.

For the cluster location attack proposed in Section 3.1, we first need to ensure that the location data of each cluster in the clustering results are not leaked, and second that the clustering center of each cluster is not leaked. To resist the cluster location attack, we first add noise to the location data in each cluster. Generally, trajectory data are three-dimensional, comprising two location dimensions and one time dimension; when adding noise, only the location data need to be considered. Since it cannot be determined whether the two location dimensions are correlated, we use a linear combination of the data with noise added in each one-dimensional space and the data with noise added directly in the two-dimensional space to represent the final noisy result. For clustering, to prevent the attacker from gaining user privacy through the center of a cluster, the noisy center is computed from the noisy location data and the noisy location count of each cluster, which prevents the attacker from leaking users' trajectory privacy through the center coordinates and the number of location points of the cluster.

For the secret reasoning attack proposed in Section 3.1, we use differential privacy technology to resist the attack, in which the attacker obtains the user's trajectory privacy by acquiring other information rather than trajectory locations. We use the user's friend relations as the other information obtained by the attacker to illustrate the defense. Table 5 shows the data of some moving objects during movement, including the locations and friend relations of the moving objects. If the user passes a location, the corresponding attribute value is 1, otherwise 0. Similarly, if an object is a friend of this user, the corresponding attribute value is 1, otherwise 0.

Table 5 Data information of some moving objects
ID   p1  p2  p3  p4  p5  p6  p7  p8  p9  p10  Ford  Cody  Frank  Carl  Alva
1    1   0   0   1   0   1   0   0   1   0    1     0     1      0     1
2    1   0   0   1   0   0   1   0   0   0    0     1     0      1     0
3    0   1   0   0   1   0   0   1   0   0    0     0     0      1     0
4    1   0   1   1   0   0   0   0   0   1    1     1     0      1     0
Now, the attacker wants to obtain Ford's trajectory. If the attacker knows that Ford's friends include Sary and Laly, and that Sary's and Laly's trajectories are ID1 and ID4 respectively, then both pass through two public locations, and from this information the attacker can infer that Ford's trajectory is ID2, which reveals Ford's trajectory privacy. Next, we elaborate the attacker's mode of operation in a secret reasoning attack. First, the attacker compares the trajectory information of Sary and Laly and concludes that both trajectories contain positions p1 and p4. Then the attribute values of the columns for p1 and p4 are summed; the sum of each column is 3, which means that, besides the trajectories of Sary and Laly, a third trajectory passing both positions belongs to Ford; that is, the attacker can infer Ford's trajectory in this way. Specifically, the trajectories of Sary and Laly are removed, the attribute values of the first n rows of columns p1 and p4 are summed, giving sum_1^n and sum_4^n, and then the attribute values of the first n+1 rows are summed, giving sum_1^{n+1} and sum_4^{n+1}. If, for both columns, the difference between the sums of the first n+1 rows and the first n rows equals 1, the attacker concludes that the trajectory in row n+1 belongs to Ford.

We can defend against this attack by adding Laplacian noise to the attacker's query results. For example, when the attacker sums all attribute values in the corresponding position columns to determine whether the secret reasoning attack is feasible, noise is added to the summation result: the actual result would be 3, but the result after adding noise is, say, 2.35, so the attacker cannot determine whether this type of attack is feasible. If the attacker still mounts the attack, we can likewise add Laplacian noise to every sum of attribute values, preventing the attacker from determining the user's trajectory.
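A minimal sketch of this count-noising defense (our own illustration; the column data come from Table 5) shows how Laplace noise on the column sums blurs the +1 difference the attacker looks for:

```python
import numpy as np

rng = np.random.default_rng(7)

# Columns p1 and p4 from Table 5 (rows = trajectories ID1..ID4).
p1 = np.array([1, 1, 0, 1])
p4 = np.array([1, 1, 0, 1])

def noisy_sum(column: np.ndarray, upto: int, epsilon: float) -> float:
    # Count queries have global sensitivity Δf = 1, so λ = 1/ε (formula (5)).
    return column[:upto].sum() + rng.laplace(scale=1.0 / epsilon)

# The attacker compares the sums of the first n and first n+1 rows; with noise,
# the difference is no longer reliably 0 or 1.
for n in range(1, 4):
    d1 = noisy_sum(p1, n + 1, 0.5) - noisy_sum(p1, n, 0.5)
    d4 = noisy_sum(p4, n + 1, 0.5) - noisy_sum(p4, n, 0.5)
    print(f"rows {n}->{n+1}: noisy differences {d1:.2f}, {d4:.2f}")
```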
4.2 Noise Design

Laplacian noise is added to the count of location data and to the results queried in a secret reasoning attack; from the definition of global sensitivity it is easily obtained that the global sensitivity in these circumstances is Δf_1 = 1. For the cluster location attack, we must ensure that the clustering center in the clustering results is not leaked. Assume that the trajectory location points of two adjacent clustering regions are (x_i^j, y_i^j) and (x_i^{j'}, y_i^{j'}), and that the numbers of trajectory location points in the two regions are ct^n and ct^{n'}. Combined with Definition 6 (McSherry & Talwar, 2007), the global sensitivity in the two-dimensional space can be obtained through our calculations as follows:
Δf_xy = ‖ Σ_{i,j} (x_i^j, y_i^j) / ct^n − Σ_{i,j} (x_i^{j'}, y_i^{j'}) / ct^{n'} ‖_1
      = ‖ ct^{n'} Σ_{i,j} (x_i^j, y_i^j) − ct^n Σ_{i,j} (x_i^{j'}, y_i^{j'}) ‖_1 / (ct^n ct^{n'})    (6)
Next, for adjacent regions, the numbers of trajectory location points differ by exactly one, that is:

ct^n − ct^{n'} = 1    (7)
Thus, from formulas (6) and (7), the global sensitivity can be further calculated as follows:

Δf_xy = ‖ (ct^n − 1) Σ_{i,j} (x_i^j, y_i^j) − ct^n Σ_{i,j} (x_i^{j'}, y_i^{j'}) ‖_1 / (ct^n (ct^n − 1))
      = ‖ (ct^n − 1) Σ_{i,j} (x_i^j, y_i^j) − ((ct^n − 1) + 1) Σ_{i,j} (x_i^{j'}, y_i^{j'}) ‖_1 / (ct^n (ct^n − 1))
      = ‖ (ct^n − 1) ( Σ_{i,j} (x_i^j, y_i^j) − Σ_{i,j} (x_i^{j'}, y_i^{j'}) ) − Σ_{i,j} (x_i^{j'}, y_i^{j'}) ‖_1 / (ct^n (ct^n − 1))    (8)
For the two adjacent clustering regions, the point (x_i^j, y_i^j) that maximizes the value of ‖ Σ_{i,j} (x_i^j, y_i^j) − Σ_{i,j} (x_i^{j'}, y_i^{j'}) ‖_1 is denoted as follows:

(x_i^j, y_i^j)_{z_max} = max ‖ Σ_{i,j} (x_i^j, y_i^j) − Σ_{i,j} (x_i^{j'}, y_i^{j'}) ‖_1    (9)
From the definition of global sensitivity and formula (9), the global sensitivity is calculated as follows:

Δf_xy = ‖ (ct^n − 1) (x_i^j, y_i^j)_{z_max} − Σ_{i,j} (x_i^{j'}, y_i^{j'}) ‖_1 / (ct^n (ct^n − 1))
      = ‖ (ct^n − 1) (x_i^j, y_i^j)_{z_max} − (ct^n − 1) (cx', cy') ‖_1 / (ct^n (ct^n − 1))
      = ‖ (x_i^j, y_i^j)_{z_max} − (cx', cy') ‖_1 / ct^n
      = ( | z_{max_x} − cx' | + | z_{max_y} − cy' | ) / ct^n    (10)

where z_max denotes the point in the adjacent clustering regions that maximizes the result value, (cx', cy') represents the coordinates of the clustering center of the region with fewer trajectory locations among the two adjacent clustering regions, and ct^n denotes the number of locations of the region with more trajectory locations among the two adjacent clustering regions. Similarly, in one-dimensional space, the global sensitivity in the X dimension is as follows:
Δf_x = | x_{ij_max} − cx' | / ct^n    (11)

where x_{ij_max} denotes the x-coordinate of the point in the two clustering regions that maximizes the result value, cx' represents the x-coordinate of the clustering center of the region with fewer trajectory points among the two adjacent clustering regions, and ct^n denotes the number of locations of the region with more trajectory locations among the two adjacent clustering regions. In one-dimensional space, the global sensitivity in the Y dimension is as follows:

Δf_y = | y_{ij_max} − cy' | / ct^n    (12)

where y_{ij_max} denotes the y-coordinate of the point in the two clustering regions that maximizes the result value, cy' represents the y-coordinate of the clustering center of the region with fewer trajectory points among the two adjacent clustering regions, and ct^n denotes the number of locations of the region with more trajectory locations among the two adjacent clustering regions.
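The sensitivities in formulas (10)-(12) are straightforward to compute for a given pair of adjacent clusters. The sketch below (our own illustration with made-up data, following our reading of those formulas) does exactly that:

```python
import numpy as np

def cluster_sensitivities(points: np.ndarray, center_small: np.ndarray, ct_n: int):
    """Global sensitivities per formulas (10)-(12).

    points:       points of the larger cluster (shape: [m, 2])
    center_small: center (cx', cy') of the smaller adjacent cluster
    ct_n:         number of locations in the larger cluster
    """
    # z_max: the point maximizing the L1 distance to (cx', cy').
    l1 = np.abs(points - center_small).sum(axis=1)
    z_max = points[l1.argmax()]
    df_xy = (abs(z_max[0] - center_small[0]) + abs(z_max[1] - center_small[1])) / ct_n  # (10)
    df_x = np.abs(points[:, 0] - center_small[0]).max() / ct_n                          # (11)
    df_y = np.abs(points[:, 1] - center_small[1]).max() / ct_n                          # (12)
    return df_xy, df_x, df_y

pts = np.array([[1.0, 2.0], [3.5, 0.5], [2.0, 4.0]])
print(cluster_sensitivities(pts, center_small=np.array([2.0, 2.0]), ct_n=14))
```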
4.3 Probability Distribution of Noise

According to the design principle of the Laplacian mechanism in differential privacy, the noise density function under the Laplacian mechanism is p(x | λ) = (1/(2λ)) e^{−|x|/λ}, where λ is the noise density parameter of the Laplacian mechanism, expressed as λ = Δf / ε; here Δf is the global sensitivity of the function f and ε is the privacy budget of differential privacy.

For the count of locations under continuous query and the count of other information under the secret reasoning attack, the noise probability density in both cases is defined as follows (Dwork, 2006), using Δf_1 = 1:

p_1(x) = (ε / (2Δf_1)) e^{−ε|x|/Δf_1} = (ε/2) e^{−ε|x|}    (13)
For the trajectory location noise in each cluster under the cluster location attack, the noise probability density function in the one-dimensional case is considered first. From formulas (11) and (12), the global sensitivities of the x dimension and the y dimension in the one-dimensional case are | x_{ij_max} − cx' | / ct^n and | y_{ij_max} − cy' | / ct^n respectively. Combined with the noise density function under the Laplacian mechanism, the probability density functions of the added noise in this case are obtained as follows:

p_{2x}(z) = (ε / (2Δf_x)) e^{−ε|z|/Δf_x} = ( ε ct^n / (2 | x_{ij_max} − cx' |) ) e^{−ε ct^n |z| / | x_{ij_max} − cx' |}
p_{2y}(z) = (ε / (2Δf_y)) e^{−ε|z|/Δf_y} = ( ε ct^n / (2 | y_{ij_max} − cy' |) ) e^{−ε ct^n |z| / | y_{ij_max} − cy' |}    (14)
where z represents the noise random variable. If the trajectory location data in the cluster is added too much noise, the location point after adding noise will be far away from the actual location point and deviate from the corresponding cluster. Therefore, the noise random variables need to be constrained. Considering the actual situation, the upper limit of z is set as follows:
x min y
, mean dis
zx min
max
cen _ x xi cx / ct n , means disx
zy
max
cen _ y yi cy / ct n
s
y
(15)
Where xmax represents the x-coordinate value of the location point with the largest x-coordinate distance from the clustering center point in the cluster, and means disx is the average distance between the s location points close to the clustering center and the clustering center point on the x 20
dimension.
ymax represents the y-coordinate value of the location point with the largest
y-coordinate distance from the clustering center point in the cluster, and means dis y
is the
average distance between the s location points close to the clustering center and the clustering center point on the y dimension. Because the range of values of
z x and z y has changed, the cumulative distribution
function of the updated variables is as follows: ct n
ct rx max xmax cx ' F (rx ) 1 e
1 e xmax cx' x ct n 2r 1 e xmax cx' x max
ct ry max ymax cy ' F (ry ) 1 e
1 e ymax cy' y ct n 2r 1 e ymax cy' y max
n
n
ct n
r
r
(16)
According to the above Laplace cumulative distribution function, the corresponding inverse distribution function is obtained to solve the added noise value. Firstly, a random number with uniform distribution in 0,1 interval is generated. Then, by substituting the random number into the inverse distribution function, the corresponding noise value can be obtained. The expression of the inverse distribution function is as follows: ct n rx max xmax cx' xmax cx' F m ln 1 m 1 e ct n 1 x
ct n ry max ymax cy ' ymax cy ' F m ln 1 m 1 e ct n 1 y
(17)
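A small sketch of the one-dimensional sampling step (our own illustration; parameter names follow formulas (16)-(17)):

```python
import numpy as np

def sample_truncated_laplace_radius(delta_f: float, epsilon: float, r_max: float,
                                    rng: np.random.Generator) -> float:
    """Inverse-CDF sampling of a one-sided Laplace radius truncated to [0, r_max].

    delta_f is the one-dimensional sensitivity, e.g. |x_max - cx'| / ct^n from
    formula (11); the exponent scale is then ε/Δf, matching formula (17).
    """
    m = rng.uniform(0.0, 1.0)
    scale = epsilon / delta_f
    return -(1.0 / scale) * np.log(1.0 - m * (1.0 - np.exp(-scale * r_max)))

rng = np.random.default_rng(1)
r = sample_truncated_laplace_radius(delta_f=0.2, epsilon=1.0, r_max=0.5, rng=rng)
assert 0.0 <= r <= 0.5   # m=0 gives 0, m=1 gives exactly r_max
print(r)
```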
Next, the probability density function in two-dimensional space is considered for the trajectory location data in a cluster. Generally, a trajectory point is three-dimensional, comprising a time dimension and two location dimensions. Considering that the two location dimensions may be correlated, and to avoid the excessive noise that results from adding noise to each dimension separately, we add noise directly to the two-dimensional trajectory data in the cluster in the two-dimensional space. Combining differential privacy theory, the global sensitivity in two-dimensional space obtained by formula (10), and the result on differential privacy in geo-indistinguishability (Andrés et al., 2013), with ε_f = ε / Δf_xy = ε ct^n / ( | z_{max_x} − cx' | + | z_{max_y} − cy' | ), the probability density in this case is as follows:

p = (ε_f² / (2π)) e^{−ε_f d(xy, x₀y₀)}    (18)

where xy and x₀y₀ denote location points in two-dimensional space: xy is a real trajectory location point, x₀y₀ represents the location after adding Laplace noise, and d(xy, x₀y₀) is the Euclidean distance between the two points. In the two-dimensional Cartesian coordinate system, the probability density of the planar Laplace noise is given by formula (18), but the corresponding calculation is difficult to carry out in Cartesian coordinates. To facilitate the calculation in practice, we transform the Cartesian coordinate system into the polar coordinate system. The probability density function in the polar coordinate system is given in formula (19):

p(r, θ) = (ε_f² / (2π)) r e^{−ε_f r}    (19)
where r represents the distance between the noisy trajectory location point and the original trajectory location point, with value range [0, ∞), and θ has value range [0, 2π). That is to say, the noisy location point is uniformly distributed on the circle of radius r centered at the original trajectory location point. The value of r may be relatively large, which would place the noisy location point far from the original location point and degrade the clustering quality. To ensure that the trajectory location data can still be clustered into the same cluster after adding noise, we set the upper limit of r as follows:

r_max = min( ‖ xy_max − cen ‖ − Σ_i ‖ xy_i − cen ‖ / ct^n , means_dis )    (20)

where xy_max is the point farthest from the clustering center, cen denotes the clustering center, and means_dis is the average distance between the s location points closest to the clustering center and the clustering center; the value range of r is then [0, r_max]. Because the range of r has changed, the above probability density function needs to be updated. The updated probability density function is as follows:

p(r, θ) = ε_f² r e^{−ε_f r} / ( 2π (1 − e^{−ε_f r_max} (1 + ε_f r_max)) )    (21)
To extract the random variables r and θ, their marginal probability density functions are obtained from the probability density function in formula (21) as follows:

p(r) = ∫₀^{2π} p(r, θ) dθ = ε_f² r e^{−ε_f r} / (1 − e^{−ε_f r_max} (1 + ε_f r_max))
p(θ) = ∫₀^{r_max} p(r, θ) dr = 1 / (2π)    (22)

From formulas (21) and (22) we obtain p(r, θ) = p(r) p(θ); that is, r and θ are independent of each other. Because the probability density of θ is a constant, it suffices to generate uniformly distributed random numbers in [0, 2π) for θ. To obtain the exact value of r, we first calculate the cumulative distribution function of r:

F(r) = ∫₀^r p(ρ) dρ = (1 − e^{−ε_f r} (1 + ε_f r)) / (1 − e^{−ε_f r_max} (1 + ε_f r_max))    (23)

The probability that the distance between the random point and the original location point falls in [0, r] is completely described by F(r). The inverse of the cumulative distribution function can be used to generate random variables r obeying this distribution. Let m be a random variable uniformly distributed in [0, 1]. Then r = F^{−1}(m), with the specific form:

F^{−1}(m) = −(1/ε_f) ( W_{−1}( ( m (1 − e^{−ε_f r_max} (1 + ε_f r_max)) − 1 ) / e ) + 1 )    (24)
where W_{−1} is the Lambert W function (−1 branch). According to the above expressions, we can obtain the values of r and θ. Then, in the Cartesian coordinate system, the coordinates (x', y') of the trajectory location point after adding noise are as follows (Cunha et al., 2019):

x' = x + r cos θ
y' = y + r sin θ    (25)
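The polar sampling procedure of formulas (24)-(25) matches the planar Laplace sampling used in geo-indistinguishability; a minimal sketch (ours, using SciPy's Lambert W) is:

```python
import numpy as np
from scipy.special import lambertw

def sample_planar_laplace(x: float, y: float, eps_f: float, r_max: float,
                          rng: np.random.Generator) -> tuple:
    """Sample a noisy location via the truncated polar Laplace of formulas (21)-(25)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)                     # θ is uniform (formula (22))
    m = rng.uniform(0.0, 1.0)
    c = 1.0 - np.exp(-eps_f * r_max) * (1.0 + eps_f * r_max)  # truncation constant
    # Inverse CDF via the -1 branch of the Lambert W function (formula (24)).
    r = -(1.0 / eps_f) * (lambertw((m * c - 1.0) / np.e, k=-1).real + 1.0)
    return x + r * np.cos(theta), y + r * np.sin(theta)       # formula (25)

rng = np.random.default_rng(3)
print(sample_planar_laplace(2.0, 3.0, eps_f=2.0, r_max=1.0, rng=rng))
```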
Given random numbers uniformly distributed on the interval [0, 1], we can calculate the noise radius in one-dimensional space and in two-dimensional space according to formulas (17) and (24) respectively. From the noise results in the two cases, we obtain the final noise results. The method of adding noise to the trajectory location data in a cluster is shown in Algorithm 1.

Algorithm 1 Trajectory Location Data Protection Algorithm (TLDP)
Input: trajectory location data set in cluster D_clu; linear combination parameters u, v; differential privacy budget ε
Output: trajectory location data set after privacy protection in cluster D_clu'
01. Calculate the global sensitivity of the cluster // formulas (10), (11) and (12)
02. Calculate the maximum noise radius of this cluster // formulas (15) and (20)
03. Generate random numbers m uniformly distributed over the interval [0, 1]
04. According to the generated random numbers, get the noise radius in the two cases // formulas (17) and (24)
05. Generate random numbers θ uniformly distributed over the interval [0, 2π)
06. i ← 1
07. WHILE i ≤ h DO
08.   FOR the j-th trajectory location point DO
09.     r_i^j ← CalculateNoiseR(u, v, θ)  // linear combination of the one-dimensional and two-dimensional noise radii
10.     x_i^j' ← CalculateNoiseX(r_i^j, θ)  // formula (25)
11.     y_i^j' ← CalculateNoiseY(r_i^j, θ)  // formula (25)
12.   END FOR
13.   i ← i + 1
14. END WHILE
15. RETURN D_clu'

Next, we analyze the time complexity of the TLDP algorithm. Steps 1 to 5 are all calculated according to specific formulas, so their time complexity is O(1). Steps 9 to 11 traverse all the location points in the cluster and calculate the corresponding results, so the time complexity is O(n),
where n denotes the number of location points. Hence, the time complexity of the TLDP algorithm is O(n). The time complexity of the location privacy pre-protection method (Tian et al., 2019) is O(n_4 (Σ_{i=1}^{n_1} n_3^i)^2) or O(n_4 (Σ_{i=1}^{n_3} n_2^i)^2), where n_1 is the average number of neighboring locations, n_2 denotes the total number of locations, n_3 represents the number of steps in the Markov process, and n_4 is the number of initial locations with a risk value. That algorithm needs to repeat its entire procedure for each location, the calculation of its Markov process is complicated, and exponential operations appear in its time complexity. In the multidimensional privacy protection algorithm (Peng et al., 2019), the time complexity for the user is O(n^2). For users, the algorithm's time is mainly spent in query issuing, where locations on the Hilbert curve must be encoded into Hilbert values and a space conversion must be performed; the space conversion requires multiple multiplications, which is an exponential operation, so the time complexity is relatively high. The time complexity of the map segmentation algorithm (Li et al., 2018) is O(n log n), where n is the number of location points; its time is spent constructing the Voronoi polygons by a plane-sweep method, which can be reduced to the problem of sorting n real numbers, hence O(n log n). Compared with these algorithms, the time complexity of the TLDP algorithm
The novel trajectory privacy-preserving method based on clustering using differential privacy is shown in Algorithm 2.

Algorithm 2: Trajectory privacy-preserving method based on clustering using differential privacy
Input: trajectory data set D_t; privacy budget ε
Output: privacy-preserved trajectory data set D_t'
01. The trajectory clustering algorithm is used to cluster the trajectory data set, and several clusters are obtained.
02. i ← 1
03. WHILE i ≤ h DO
04.   FOR data in the i-th cluster DO
05.     TLDP(D_clu^i, ε, u, v)
06.     IF the data belongs to trajectory location data THEN
07.       C_j^k(loc) ← trajectory location data count
08.       C_j(loc) ← Σ_k NoisyCount(C_j^k(loc), ε) // formulas (4) and (5)
09.     ELSE IF the data belongs to other information THEN
10.       C_j^k(atr) ← related results
11.       C_j(atr) ← Σ_k NoisyCount(C_j^k(atr), ε) // formulas (4) and (5)
12.     END IF
13.   END FOR
14.   count_n^i ← Σ C_j(loc) // sum of the noised location counts
15.   (cen_x^i, cen_y^i) ← (Σ_j x_j^k' / count_n^i, Σ_j y_j^k' / count_n^i)
16.   i ← i + 1
17. END WHILE
18. RETURN D_t'
The object processed by Algorithm 2 is the trajectory data set. First, the clustering algorithm is used to cluster the trajectory data set into several clusters (line 1). Then, for each piece of data in a cluster, we determine whether it is trajectory location data or other information data that may lead to user privacy leakage. If it is trajectory location data, Laplace noise is added to the location data, the location data are counted, and noise is added to the count value. If it is other information data, Laplacian noise is likewise added to the results derived from that information. The noised location counts are then summed, the sum is taken as the noisy number of location data in the cluster, and the quotient of the sum of the coordinate values of the noised location data and this count is used as the coordinate value of the noise clustering center point (lines 3 to 17; a sketch follows below). The trajectory privacy of moving objects based on clustering is effectively protected by the above method.
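The noisy count and noisy clustering center of lines 6 to 15 can be sketched as below; the helper names are ours, and we assume the standard Laplace mechanism with sensitivity 1 for the counting query.

```python
# Hedged sketch of Algorithm 2's count and center perturbation: Laplace noise
# on the in-cluster count, then the noisy center as the noised coordinate sums
# divided by the noisy count.
import numpy as np

def noisy_count(true_count, epsilon, rng):
    # A counting query has sensitivity 1, so the Laplace scale is 1 / epsilon.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

def noisy_center(noised_points, epsilon, rng):
    cnt = noisy_count(len(noised_points), epsilon, rng)
    sum_x = sum(p[0] for p in noised_points)
    sum_y = sum(p[1] for p in noised_points)
    return sum_x / cnt, sum_y / cnt   # noise clustering center (cen_x, cen_y)
```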
5 EXPERIMENTAL ANALYSIS
In order to verify the performance of the proposed algorithm, this section analyses the privacy intensity and the data availability, and compares the results with other algorithms to verify the superiority of the proposed algorithm. The experimental data used in this paper is the trajectory data set in the pedestrian database collected by the University of Edinburgh (FISHER, 2010). The database records pedestrian trajectory data captured by a monitoring probe. We selected 200 trajectories for the experiments, comprising 6919 trajectory location points. The experimental environment is a Windows operating system with 16 GB of memory and a 3.30 GHz Intel(R) Xeon(R) E3-1225 V5 CPU.
5.1 Privacy Intensity Evaluation
First, we analyse the privacy protection level of the proposed method. The differential privacy protection level is determined by the differential privacy budget: the larger the budget, the lower the degree of privacy protection; the smaller the budget, the greater the degree of privacy protection. Next, we prove that the proposed algorithm satisfies $\varepsilon$-differential privacy in the two-dimensional case; that is, if the noise variable added to the original data obeys the distribution

$$R \sim Laplace\!\left(\frac{\left|z_{\max\_x} - cen\_x'\right| + \left|z_{\max\_y} - cen\_y'\right|}{count_n \cdot \varepsilon}\right)$$

then $A$ satisfies $\varepsilon$-differential privacy.
Proof: Let $p_1 f(T)$ denote the probability distribution function of the original database $T$ and let $p_2 f(r)$ denote the probability distribution function of the noise distribution $R$. Then there exists

$$\frac{\Pr\left[Cen\_Clu(A(T)) \in SO\right]}{\Pr\left[Cen\_Clu(A(T')) \in SO\right]} = \frac{\Pr\left[cen + r = so\right]}{\Pr\left[cen' + r = so\right]} = \frac{p_2 f(so - cen)}{p_2 f(so - cen')}$$

where $r \in R$ and $so \in SO$. Because $R$ obeys the Laplacian distribution with scale $b = \frac{\left|z_{\max\_x} - cen\_x'\right| + \left|z_{\max\_y} - cen\_y'\right|}{count_n \cdot \varepsilon}$, there exists

$$\frac{p_2 f(so - cen)}{p_2 f(so - cen')} = \exp\!\left(\frac{\left|so - cen'\right| - \left|so - cen\right|}{b}\right) \le \exp\!\left(\frac{\left|cen - cen'\right|}{b}\right) \le e^{\varepsilon},$$

since the global sensitivity of the cluster center satisfies $\left|cen - cen'\right| \le \frac{\left|z_{\max\_x} - cen\_x'\right| + \left|z_{\max\_y} - cen\_y'\right|}{count_n}$. Hence the algorithm satisfies $\varepsilon$-differential privacy in the two-dimensional case; similarly, in the one-dimensional case, the algorithm also satisfies $\varepsilon$-differential privacy. From the above proofs, we can see that the algorithm satisfies $\varepsilon$-differential privacy.
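As a numeric complement to the proof, the following hedged check (with assumed illustrative values for the budget and sensitivity) confirms that the Laplace mechanism's density ratio between neighboring center values never exceeds $e^{\varepsilon}$.

```python
# Empirical sanity check of the epsilon-DP bound for the Laplace mechanism.
import numpy as np

def laplace_pdf(x, mu, b):
    return np.exp(-np.abs(x - mu) / b) / (2.0 * b)

epsilon, sensitivity = 1.0, 0.3            # assumed values for illustration
b = sensitivity / epsilon                  # Laplace scale = sensitivity / epsilon
cen, cen_prime = 1.0, 1.0 + sensitivity    # neighboring databases shift the center by at most `sensitivity`
outputs = np.linspace(-5.0, 5.0, 1001)
ratios = laplace_pdf(outputs, cen, b) / laplace_pdf(outputs, cen_prime, b)
assert np.all(ratios <= np.exp(epsilon) + 1e-12)
```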
In order to verify the privacy protection performance of our proposed algorithm, we compare its cumulative distribution function of the noise radius $r$ with those of the Geo-indistinguishability algorithm (Andrés et al., 2013) and the Cluster-indistinguishability algorithm (Wang et al., 2017). In this experiment, we test how the cumulative distribution function changes with the noise radius. The experiment is divided into four groups according to the privacy budget, with values 0.5, 1.0, 1.5 and 2.0 respectively. Each experiment is performed 10 times, and the average is taken as the final result. As shown in Fig.3, the noise radius ranges from 0.25 to 2.00; the X-axis is the noise radius and the Y-axis is the value of the corresponding cumulative distribution function. Figures 3(a) to 3(d) show how the cumulative distribution function varies with the noise radius when the privacy budget is 0.5, 1.0, 1.5 and 2.0, respectively. As can be seen from Fig.3, the cumulative distribution function increases with the noise radius in all cases. As expected, the cumulative distribution function values of the proposed algorithm are always larger than those of the other two algorithms. This shows that the noise radius of our algorithm is relatively small among the noise added to the original data set, thus avoiding a large distance between the noised and original data caused by excessive noise. Fig.4 shows how the cumulative distribution varies with the noise radius under different privacy budgets, with values 0.05, 0.10, 0.15 and 0.20 respectively. As can be seen from Fig.4, the smaller the privacy budget, the smaller the corresponding cumulative distribution function value; equivalently, under the same cumulative distribution value, the smaller the privacy budget, the larger the corresponding noise radius. This shows that the smaller the differential privacy budget, the larger the noise radius it can produce, and the greater the degree of privacy protection.
Fig.3. Cumulative distribution of different algorithms.
Fig.4. Cumulative distribution under different privacy budgets (panels (a) to (d): privacy budget 0.05, 0.10, 0.15 and 0.20).
5.2 Data availability
Privacy protection and data availability are two opposing aspects. For clustering, the greater the intensity of privacy protection, the greater the disturbance of the trajectory data; the perturbed data may then differ widely from the original data in cluster analysis, giving poor data availability. When the degree of privacy protection is too small, or privacy protection is absent, the trajectory data has greater availability but can easily reveal the user's privacy. Therefore, appropriate noise should be added to the original trajectory data, which can both protect the data and preserve its availability in the clustering process. We compare our algorithm with the Geo-indistinguishability algorithm (Andrés et al., 2013), the Cluster-indistinguishability algorithm (Wang et al., 2017), the trajectory differential privacy publishing algorithm (TDPPA for short) (Zhao et al., 2019) and the NLTR algorithm (Zhao et al., 2019; Wang et al., 2017). The NLTR algorithm is a combination of the Cluster-indistinguishability algorithm and TDPPA. In TDPPA, Laplacian noise is added to the moving-object count in the SR-tree node, and consistency processing is then performed. Our algorithm instead limits the maximum Laplace noise added to the location data and generates noise from the inverse of the new cumulative distribution function. Next, we compare the average error with the other algorithms to show that the data availability of our algorithm is higher. The formula for the average error is as follows:
$$average\_error = \frac{1}{n}\sum_{i=1}^{n} dis\!\left(p_{xy}^{i},\, p_{xy}^{i\prime}\right) \qquad (26)$$

where $p_{xy}^{i}$ represents the original trajectory location point, $p_{xy}^{i\prime}$ denotes the location point after adding noise, $dis(p_{xy}^{i}, p_{xy}^{i\prime})$ is the Euclidean distance between the noised and original location points, and $n$ denotes the number of trajectory location points. In this group of experiments, the average error of the different algorithms is tested as the privacy budget changes, with budget values 0.10, 0.125, 0.15, 0.175 and 0.20 respectively. Each experiment is executed 10 times, and the average is taken as the final result. As shown in Fig.5, the X-axis represents the privacy budget and the Y-axis the average error of the algorithm. Fig.5(a) compares our algorithm with the other algorithms; Fig.5(b) compares our algorithm with the Geo-indistinguishability algorithm. As shown in Fig.5(a) and Fig.5(b), the distance between the noised trajectory location point and the original location point decreases as the differential privacy budget increases,
because the intensity of privacy protection decreases as the budget increases. In addition, the average error of the proposed algorithm is smaller than that of the other algorithms; that is, the noise added by our algorithm is smaller. The algorithm can thus guarantee the privacy of the trajectory data while improving the availability of the data in the clustering process.
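A small sketch of how the average error of formula (26) can be computed from paired point sets follows; the array layout is our assumption.

```python
# Hedged sketch of formula (26): mean Euclidean distance between each original
# location and its noised counterpart.
import numpy as np

def average_error(original, noised):
    original, noised = np.asarray(original), np.asarray(noised)   # shape (n, 2)
    return float(np.mean(np.linalg.norm(original - noised, axis=1)))
```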
Fig.5. Average error vs. privacy budget ((a) algorithm comparison 1; (b) algorithm comparison 2).

Next, we use the average distance of the trajectory location data in the cluster after clustering the
trajectory data before and after adding noise to further illustrate the availability of data. The formula for calculating the average distance of the trajectory location data in the cluster is as follows:
$$average\_distance = \frac{1}{n}\sum_{i=1}^{n} dis\!\left(p_{xy}^{i},\, p_{cen}\right) \qquad (27)$$

where $p_{xy}^{i}$ represents the original trajectory location point, $p_{cen}$ represents the center point of the cluster, $dis(p_{xy}^{i}, p_{cen})$ is the Euclidean distance between the trajectory location point and the center point, and $n$ denotes the number of trajectory location points.
In this group of experiments, we calculate the average distance for each algorithm and for the original data set, and compare which algorithm's average distance is closest to that of the original data set under different privacy budgets. To better highlight the experimental results, the privacy budget values in this experiment are 0.01, 0.02, 0.03, 0.04 and 0.05. Each experiment is performed 10 times, and the average is taken as the final result. As shown in Fig.6, the X-axis represents the privacy budget and the Y-axis the average distance. As shown in Fig.6(a) and Fig.6(b), the difference between the average distance of the proposed algorithm and that of the original data set is the smallest among the four algorithms, which further illustrates that the proposed algorithm can not only protect data privacy but also ensure the clustering effect of the data, that is, the availability of the data.
Fig.6. Average distance of different algorithms ((a) algorithm comparison 1; (b) algorithm comparison 2).
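A comparable sketch for formula (27), again under our assumed array layout, computes the mean in-cluster distance so that the original and noised clusterings can be compared.

```python
# Hedged sketch of formula (27): mean Euclidean distance from a cluster's
# points to its center, evaluated on original and noised data alike.
import numpy as np

def average_distance(points, center):
    points = np.asarray(points)   # shape (n, 2)
    return float(np.mean(np.linalg.norm(points - np.asarray(center), axis=1)))
```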
5.3 Analysis of the noise adding mechanism
For the privacy protection of trajectory data, we must not only protect the users' private information but also consider the availability of the data. Data availability and data privacy protection are two opposing aspects, so we need to balance them. Differential privacy technology controls the degree of data privacy protection through the differential privacy parameter: the greater the parameter value, the lower the degree of privacy protection, and the greater the availability of the data. For the location data in the cluster, we limit the maximum noise added to the location data according to the actual situation, and give the cumulative distribution function of the noise in one and two dimensions, as shown in formulas (16) and (23). According to the TDPPA algorithm, the distribution function of the noise radius when Laplace noise is added directly is as follows:

$$F_1\!\left(r_x'\right) = 1 - e^{-\frac{ct_n\, \varepsilon}{\left|x_{\max} - c_x\right|}\, r_x'} \qquad (28)$$
Compared with formula (28), the value of the cumulative distribution function given by formula (16) of our algorithm is larger under the same noise radius. In other words, compared with the TDPPA algorithm, under the same differential privacy budget our algorithm adds less noise and offers greater data availability; with the same noise radius probability, the differential privacy budget required by our algorithm is smaller. The probability distribution functions of the Geo-indistinguishability algorithm and the Cluster-indistinguishability algorithm are shown in formulas (29) and (30):

$$p_2(r) = f^2\, r\, e^{-f r} \qquad (29)$$

$$p_2(r) = \varepsilon^2\, r\, e^{-\varepsilon r} \qquad (30)$$

Compared with these, the value of the probability density function given by formula (22) of our algorithm is larger under the same noise radius. In other words, compared with the Geo-indistinguishability and Cluster-indistinguishability algorithms, under the same differential privacy budget the added noise is smaller and the data availability is greater; with the same noise radius probability, the differential privacy budget required by our algorithm is smaller. In addition, the limitation of the maximum noise is given by formulas (15) and (20), from which it can be seen that the maximum noise value can be adjusted by the user according to the actual situation, further tuning the balance between data availability and the degree of privacy protection and giving the user a certain autonomy. Through this theoretical analysis, we can see that our algorithm outperforms the algorithms above.
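Under our reconstruction of the formulas above, the claim that the radius-constrained CDF dominates the plain planar Laplace CDF at every radius can be spot-checked numerically; the budget and radius values below are assumed for illustration.

```python
# Hedged numeric check: the truncated (radius-constrained) CDF is the plain
# planar Laplace CDF renormalized by its mass on [0, r_max], so it is always
# at least as large at the same radius.
import numpy as np

def planar_cdf(r, eps):
    # CDF of the planar Laplace radius, the integral of formulas (29)/(30).
    return 1.0 - np.exp(-eps * r) * (1.0 + eps * r)

def truncated_cdf(r, eps, r_max):
    return planar_cdf(np.minimum(r, r_max), eps) / planar_cdf(r_max, eps)

eps, r_max = 0.5, 2.0   # assumed illustrative values
for r in (0.25, 0.5, 1.0, 2.0):
    assert truncated_cdf(r, eps, r_max) >= planar_cdf(r, eps)
```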
6 SUMMARY AND OUTLOOK
The users' trajectory data collected by location-aware devices can be used for data mining tasks such as cluster analysis and anomaly detection. However, ensuring the users' privacy and security while performing clustering analysis has become a research hotspot. Some scholars implement privacy protection in trajectory clustering analysis through k-anonymity and its variants, but these models struggle to resist complex attacks, and current algorithms generally suffer from poor clustering effect or poor data availability. To solve these problems, this paper devises a novel trajectory privacy-preserving method based on clustering using differential privacy. More specifically, by adding Laplacian noise to the trajectory location data and clustering centers in the cluster, we can resist the clustering location attack; by adding Laplacian noise to the counts of the different trajectory location data, we can resist the continuous query attack. In addition, noise is added to each dimension separately and to the two-dimensional space as a whole, and the final noise result is a linear combination of the two. Considering that other information in a trajectory may also reveal users' privacy, noise is added to the results derived from that information, which further strengthens the protection of private information. Finally, using open data, the current mainstream related algorithms are compared with ours in a large number of experiments, which verifies the effectiveness of the proposed algorithm.
The following problems are worth studying in the future: (1) With the development of big data and database storage technology, the amount of data stored in current databases is large. How can the privacy of a large amount of trajectory data be protected efficiently, while also ensuring the availability of the data in different application scenarios? (2) There are many technologies in the field of data mining, such as cluster analysis, frequent pattern analysis and association pattern mining. These techniques are widely used, but when they are applied to trajectory data there is also a risk of leaking user privacy. How can the user's privacy be protected while performing other data mining techniques on trajectory data? (3) The protection of trajectory data has different considerations in different research contexts, and different attack models need special attention. Xiao et al. (2015) considered the temporal correlation of locations, and Qu et al. (2018) focused on protecting the social relations between users; both works address privacy issues brought by the relevance of users or locations. In future work, privacy protection of clustering combined with the relevance of users and locations is worthy of study.
ACKNOWLEDGMENTS
The research work is supported by the National Natural Science Foundation of China (U1433116), the Fundamental Research Funds for the Central Universities (NP2017208) and the Foundation of the Graduate Innovation Center at NUAA (Kfjj20191603).
AUTHOR CONTRIBUTIONS SECTION
Xiaodong Zhao: presented the novel trajectory privacy-preserving method based on clustering using differential privacy, wrote the manuscript and the revision responses, designed the experiments, and carried out the experiments.
Dechang Pi: guided the writing and organization of the entire manuscript and played an important role in addressing the revision comments.
Junfu Chen: participated in the formula derivation and the experiments of the manuscript.
CONFLICT OF INTEREST: The authors have declared that no conflict of interest exists.
CREDIT AUTHOR STATEMENT Xiaodong Zhao: Ph.D. candidate. His research interests include privacy and security issues about moving objects and data mining.
Dechang Pi: Ph.D. professor. His research interests include data mining and privacy-preserving techniques. Junfu Chen: Ph.D. candidate. His current research interests include privacy and security issues about moving objects and data mining.
REFERENCES
Andrés, M., et al. (2013). Geo-indistinguishability: differential privacy for location-based systems. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (CCS '13). ACM, 901-914. https://doi.org/10.1145/2508859.2516735
Blum, A., Dwork, C., McSherry, F., et al. (2005). Practical privacy: the SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 128-138. https://doi.org/10.1145/1065167.1065184
Chen, R., Fung, B. C. M., Desai, B. C., & Sossou, N. M. (2012). Differentially private transit data publication: a case study on the Montreal transportation system. In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '12), Beijing, China. ACM, 213-221. https://doi.org/10.1145/2339530.2339564
Chiba, T., Sei, Y., Tahara, Y., et al. (2019). Trajectory anonymization: balancing usefulness about position information and timestamp. In 2019 10th IFIP International Conference on New Technologies, Mobility and Security (NTMS). IEEE, 1-6. https://doi.org/10.1109/NTMS.2019.8763833
Cunha, M., Mendes, R., & Vilela, J. P. (2019). Clustering geo-indistinguishability for privacy of continuous location traces. In 2019 4th International Conference on Computing, Communications and Security (ICCCS). IEEE, 1-8. https://doi.org/10.1109/CCCS.2019.8888111
Deldar, F., & Abadi, M. (2018). PLDP-TD: personalized-location differentially private data analysis on trajectory databases. Pervasive and Mobile Computing, 49, 1-22. https://doi.org/10.1016/j.pmcj.2018.06.005
Dwork, C. (2006). Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP '06), Venice, Italy, 1-12. https://doi.org/10.1007/11787006_1
Dwork, C., Naor, M., Pitassi, T., et al. (2010). Pan-private streaming algorithms. In ICS 2010, 66-80.
FISHER, B. (2010). Edinburgh informatics forum pedestrian database. http://homepages.inf.ed.ac.uk/rbf/FORUMTRACKING
Ganta, S., Kasiviswanathan, S., & Smith, A. (2008). Composition attacks and auxiliary information in data privacy. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 265-273. https://doi.org/10.1145/1401890.1401926
Gu, K., Yang, L., Liu, Y., et al. (2018). Trajectory data privacy protection based on differential privacy mechanism. IOP Conference Series: Materials Science and Engineering, 351(1), 012017. https://doi.org/10.1088/1757-899X/351/1/012017
Langari, R. K., Sardar, S., Mousavi, S. A. A., et al. (2020). Combined fuzzy clustering and firefly algorithm for privacy preserving in social networks. Expert Systems with Applications, 141, 112968. https://doi.org/10.1016/j.eswa.2019.112968
Li, N., Qardaji, W., & Su, D. (2010). Provably private data anonymization: or, k-anonymity meets differential privacy. CoRR, abs/1101.2604, 32-33. https://doi.org/10.1145/2414456.2414474
Li, W., Ding, S., Meng, J., et al. (2018). Spatio-temporal aware privacy-preserving scheme in LBS. Journal on Communications, 39(5), 134-142. https://doi.org/10.11959/j.issn.1000-436x.2018084
Li, X., Yang, J., Sun, Z., et al. (2019). Differentially private release of the distribution of clustering coefficients across communities. Security and Communication Networks, 2019. https://doi.org/10.1155/2019/2518714
Liu, X., Guo, Y., Chen, Y., et al. (2018). Trajectory privacy protection on spatial streaming data with differential privacy. In 2018 IEEE Global Communications Conference (GLOBECOM). IEEE, 1-7. https://doi.org/10.1109/GLOCOM.2018.8647918
Liu, X., & Li, Q. (2016). Differentially private data release based on clustering anonymization. Journal on Communications, 37(5), 125-129. https://doi.org/10.11959/j.issn.1000-436x.2016100
Mazeh, I., & Shmueli, E. (2020). A personal data store approach for recommender systems: enhancing privacy without sacrificing accuracy. Expert Systems with Applications, 139, 112858. https://doi.org/10.1016/j.eswa.2019.112858
McSherry, F., & Talwar, K. (2007). Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07), Providence, Rhode Island, 94-103. https://doi.org/10.1109/FOCS.2007.66
Ni, L., Li, C., Wang, X., Jiang, H., & Yu, J. (2018). DP-MCDBSCAN: differential privacy preserving multi-core DBSCAN clustering for network user data. IEEE Access, 6, 21053-21063. https://doi.org/10.1109/ACCESS.2018.2824798
Ni, W., Gu, M., & Chen, X. (2016). Location privacy-preserving k nearest neighbor query under user's preference. Knowledge-Based Systems, 103, 19-27. https://doi.org/10.1016/j.knosys.2016.03.016
Ou, L., Qin, Z., Liao, S., et al. (2018). Releasing correlated trajectories: towards high utility and optimal differential privacy. IEEE Transactions on Dependable and Secure Computing. https://doi.org/10.1109/TDSC.2018.2853105
Peng, T., Liu, Q., Wang, G., et al. (2019). Multidimensional privacy preservation in location-based services. Future Generation Computer Systems, 93, 312-326. https://doi.org/10.1016/j.future.2018.10.025
Polatidis, N., Georgiadis, C. K., Pimenidis, E., et al. (2017). Privacy-preserving collaborative recommendations based on random perturbations. Expert Systems with Applications, 71, 18-25. https://doi.org/10.1016/j.eswa.2016.11.018
Ren, J., Xiong, J., Yao, Z., et al. (2017). DPLK-means: a novel differential privacy K-means mechanism. In 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC). IEEE, 133-139. https://doi.org/10.1109/dsc.2017.64
Shmueli, E., & Tassa, T. (2015). Privacy by diversity in sequential releases of databases. Information Sciences, 298, 344-372. https://doi.org/10.1016/j.ins.2014.11.005
Su, D., et al. (2016). Differentially private k-means clustering. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy (CODASPY '16). ACM, New York, NY, USA, 26-37. https://doi.org/10.1145/2857705.2857708
Tian, Y., Kaleemullah, M. M., Rodhaan, M. A., et al. (2019). A privacy preserving location service for cloud-of-things system. Journal of Parallel and Distributed Computing, 123, 215-222. https://doi.org/10.1016/j.jpdc.2018.09.005
Trujillo, R., & Domingo, J. (2013). On the privacy offered by (k, δ)-anonymity. Information Systems, 38(4), 491-494. https://doi.org/10.1016/j.is.2012.12.003
Tu, Z., Zhao, K., et al. (2018). Protecting trajectory from semantic attack considering k-anonymity, l-diversity, and t-closeness. IEEE Transactions on Network and Service Management, 16(1), 264-278. https://doi.org/10.1109/TNSM.2018.287790
Wang, Y., Wang, Y., & Singh, A. (2015). Differentially private subspace clustering. In Advances in Neural Information Processing Systems (NIPS), 1000-1008.
Wang, H., Xu, Z., & Jia, S. (2017). Cluster-indistinguishability: a practical differential privacy mechanism for trajectory clustering. Intelligent Data Analysis, 21(6), 1305-1326.
Wei, J., Lin, Y., Yao, X., et al. (2019). Differential privacy-based trajectory community recommendation in social network. Journal of Parallel and Distributed Computing, 133, 136-148. https://doi.org/10.1016/j.jpdc.2019.07.002
Wu, Y., et al. (2013). A clustering hybrid based algorithm for privacy preserving trajectory data publishing. Journal of Computer Research and Development, 50(3), 578-593.
Xiao, Y., & Xiong, L. (2015). Protecting locations with differential privacy under temporal correlations. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 1298-1309. https://doi.org/10.1145/2810103.2813640
Yang, Y., Cai, J., Yang, H., et al. (2020). TAD: a trajectory clustering algorithm based on spatial-temporal density analysis. Expert Systems with Applications, 139, 112846. https://doi.org/10.1016/j.eswa.2019.112846
Yun, U., & Kim, J. (2015). A fast perturbation algorithm using tree structure for privacy preserving utility mining. Expert Systems with Applications, 42(3), 1149-1165. https://doi.org/10.1016/j.eswa.2014.08.037
Yun, U., Ryang, H., & Kwon, O. C. (2016). Monitoring vehicle outliers based on clustering technique. Applied Soft Computing, 49, 845-860. https://doi.org/10.1016/j.asoc.2016.09.003
Zhang, L., Jin, C., Huang, H., et al. (2019). A trajectory privacy preserving scheme in the CANNQ service for IoT. Sensors, 19(9), 2190. https://doi.org/10.3390/s19092190
Zhang, X., & Meng, X. (2014). Differential privacy in data publication and analysis. Chinese Journal of Computers, 37(4), 927-949. https://doi.org/10.3724/SP.J.1016.2014.00927
Zhao, X., Dong, Y., & Pi, D. (2019). Novel trajectory data publishing method under differential privacy. Expert Systems with Applications, 138, 112791. https://doi.org/10.1016/j.eswa.2019.07.008
Zhou, K., & Wang, J. (2019). Trajectory protection scheme based on fog computing and k-anonymity in IoT. In 2019 20th Asia-Pacific Network Operations and Management Symposium (APNOMS). IEEE, 1-6. https://doi.org/10.23919/APNOMS.2019.8893014