Journal of Systems Architecture 80 (2017) 85–91
Using data visualization technique to detect sensitive information re-identification problem of real open dataset
Chiun-How Kao a,c, Chih-Hung Hsieh b,⁎, Yu-Feng Chu b, Yu-Ting Kuang b, Chuan-Kai Yang a

a Department of Information Management, National Taiwan University of Science and Technology, Taipei, Taiwan
b Institute for Information Industry, Taipei, Taiwan
c Institute of Statistical Science, Academia Sinica, Taipei, Taiwan

⁎ Corresponding author. E-mail addresses: [email protected] (C.-H. Kao), [email protected] (C.-H. Hsieh), [email protected] (Y.-F. Chu), [email protected] (Y.-T. Kuang), [email protected] (C.-K. Yang).

http://dx.doi.org/10.1016/j.sysarc.2017.09.009
Received 28 February 2017; received in revised form 12 September 2017; accepted 27 September 2017; available online 28 September 2017.
1383-7621/© 2017 Elsevier B.V. All rights reserved.
Keywords: Data de-identification; Data visualization; Privacy preserving; Personally identifiable information; Sensitive personal information

Abstract
With plenty of valuable information, open data are often deemed great assets to academia and industry. Although most data owners perform some de-identification processing before releasing their data, the more datasets are opened to the public, the more likely personal privacy will be exposed. Previous real case studies show that even when personally identifiable information has been de-identified, sensitive personal information can still be uncovered by heterogeneous or cross-domain data joining operations. The privacy re-identification processes involved are usually too complicated or obscure for data owners to recognize, and the problem only becomes more severe as the scale of data grows larger and larger. To prevent the leakage of sensitive information, this paper shows how to use a novel visualization analysis tool for open data de-identification (the ODD Visualizer) to verify whether a sensitive information leakage problem exists in the target datasets. The high effectiveness of the ODD Visualizer comes mainly from implementing a scalable computing platform as well as developing an efficient data visualization technique. Our demonstrations show that the ODD Visualizer can indeed uncover real vulnerabilities to record linkage attacks among open datasets available on the Internet.
1. Introduction

Releasing datasets as public "open data" provides valuable information to academia and industry, and in fact creates a potential market estimated at up to US$3 trillion across various domains [1]. Despite the significant potential that open data can offer, re-identification of Personally Identifiable Information (PII) and leakage of sensitive privacy are troublesome side effects of dataset releasing, and they are among the primary reasons why only about 10% of the datasets owned by governments worldwide have been released [2,3]. Even when a dataset has been de-identified, there is still a chance that its PII may be revealed by cross-joining different datasets [4,5]. For instance, L. Sweeney presented practical studies showing that (1) by correlating the National Association of Health Data Organizations (NAHDO) data with the voter registration list for Cambridge, Massachusetts via the attributes of birthday, gender, and ZIP code, six people shared Governor Weld's particular birth date; only three of them were men; and he was the only one in his 5-digit ZIP code, so Weld's health PII could subsequently be identified [4]; and (2) about 40% of the 1130 volunteers who donated their DNA data to the Personal Genome Project could be identified, likewise using ZIP code, date of birth, and gender [5]. Although their names did not appear, their profiles listed sensitive medical conditions including abortions, illegal drug use, alcoholism, depression, sexually transmitted diseases, medications, and their genome-related data.

To quantify the likelihood that PII or sensitive privacy may be re-identified, the k-anonymity model was proposed; it measures how well the PII of released data has been de-identified [6], as follows.

k-anonymity model: for any given dataset table T and one record D in T, we assume that the attributes {V1, V2, ...} contained in D can be partitioned into four groups according to their roles during the process of de-identification or re-identification; that is, D = (Explicit Identifier, Quasi Identifier, Sensitive Attributes, Non-Sensitive Attributes), where (1) Explicit Identifier (EID) is a set of attributes, such as name and social security number (SSN), containing information that explicitly identifies record owners; (2) Quasi Identifier (QID) is a set of attributes that could potentially identify record owners; (3) Sensitive Attributes
(SA) consist of sensitive person-specific information such as disease, salary, and disability status; and (4) Non-Sensitive Attributes (NSA) contain all attributes that do not fall into the above three categories. Assume qid is one existing value of one valid QID combination. For any qid of each QID in T, if there are at least k records sharing the same qid, then a table T satisfying this requirement is called k-anonymous. The probability of linking a victim to a specific record through this QID is therefore at most 1/k.

The time complexity of calculating the optimal k-anonymity value for a given dataset is known to be NP-hard [7]. This means that when a dataset grows to an extremely large scale, estimating whether it carries a high risk of privacy leakage can become intractable. In this paper, we show how to use a novel and efficiently scalable visualization tool, named the Open Data De-identification Visualizer (ODD Visualizer) [8], to estimate the likelihood, or risk, that the sensitive information in datasets will be re-identified. The core techniques in the ODD Visualizer, Matrix Visualization (MV) [9] and the Hierarchical Analysis and Clustering Tree (HACT), depict a concise k-anonymity distribution over different attribute subsets as well as an optimal alignment of attributes, providing the user with the most robust attribute subset. In addition, we implemented the ℓ-diversity model to detect the "attribute linkage attack" individually, and the visualization of the ℓ-diversity distribution shares the same interface as the one for the k-anonymity distribution.

The merits of the proposed ODD Visualizer are threefold. (1) A scalable database and computation platform are incorporated in the ODD Visualizer so that the k-anonymity of each attribute subset can be estimated rapidly. (2) Users can easily get a whole picture of the k-anonymity and ℓ-diversity distributions among different attribute subset combinations, and can thus learn where the current dataset is weak against PII re-identification. (3) Based on the optimal alignment sorted by HACT, users get suggestions for deciding which attribute subsets can be released. The efficiency and effectiveness were evaluated with one solid real case that uses the ODD Visualizer to detect a vulnerability to record linkage attacks between real medical center data and census registration information.

Details about the k-anonymity and ℓ-diversity models discussed in this paper can be found in Section 2. The architecture and implementation of the proposed ODD Visualizer are described in Section 3. Section 4 demonstrates the effectiveness of our proposed method using a benchmark and real open datasets available on the Internet, and Section 5 reports a user study with domain experts. Finally, Section 6 concludes this paper and gives some possible directions for future research.
Table 1. Examples illustrating record linkage re-identification and the k-anonymity model.

(a) Regional patients' diagnosis table

Job       Gender   Age   Disease
Engineer  Male     35    Hepatitis
Engineer  Male     38    Hepatitis
Lawyer    Male     38    HIV
Writer    Female   38    Flu
Writer    Female   30    HIV
Dancer    Female   30    HIV
Dancer    Female   30    HIV

(b) External residents table

Name    Job       Gender   Age
Alice   Writer    Female   30
Bob     Engineer  Male     35
Cathy   Writer    Female   30
Doug    Lawyer    Male     38
Emily   Dancer    Female   30
Fred    Engineer  Male     35
Glady   Dancer    Female   30
Henry   Lawyer    Male     39
Irene   Dancer    Female   32
2. Related work

2.1. Record linkage attack and k-anonymity model

The key feature and major service of the ODD Visualizer is information de-identification risk estimation and its visualization. The k-anonymity model [6] is adopted to detect the following "record linkage attack", whose relationship with the k-anonymity model can be described with an example. Suppose that a regional hospital is going to release patients' de-identified information (the regional patients' diagnosis table in Table 1(a)) for research purposes, and that there is another external table containing resident information of the same local region (the external table in Table 1(b)). Once a PII hacker has the privilege to access both public datasets, the hacker has a chance to identify the SA "Disease" via QID = {job, gender, age}. For example, a privacy leakage crisis falls on Doug, because he is the only person whose QID value (denoted qid) is {lawyer, male, 38} (the case of k = 1). Using such a qid value to link the two tables, the sensitive information of Doug's diagnosis result leaks out. On the contrary, the PII of Bob and Fred is much safer, as they share the same qid of {engineer, male, 35} (the case of k > 1). As a consequence, the k-anonymity model can be used to measure the likelihood of sensitive information leakage. In the default scenario, a dataset owner leverages the ODD Visualizer to detect the weaknesses of a target dataset.

2.2. Attribute linkage attack and ℓ-diversity model

Even when the k of a QID is larger than one, an attacker can still re-identify sensitive information from the correlation between the QID and the SA. For example, all 30-year-old female dancers in Table 1(a) have HIV, which is the sensitive information, and there are only two such persons (qid = {dancer, female, 30}) in Table 1(b). The attacker can therefore infer that Glady and Emily are HIV patients with 100% confidence. This situation is called an "attribute linkage attack". To prevent attribute linkage attacks, Machanavajjhala et al. [10] proposed the ℓ-diversity model, which makes use of the concept of entropy to estimate the diversity of the sensitive attribute in each qid group. The ℓ-diversity of every qid group is defined by the following equation:

log(ℓ) = − ∑_{s ∈ S} P(qid, s) log(P(qid, s))

where S is one of the sensitive attributes and P(qid, s) is the proportion of the sensitive value s in a qid group. The worst case is that all the values of the sensitive attribute in a qid group are the same, causing the entropy value log(ℓ) to equal 0, which means that the ℓ of that qid group takes the minimum value of 1. In our system we use the minimum of the ℓ values over all qid groups to define the ℓ of a QID. For example, there are 6 qid groups in Table 1(a), and the ℓ of qid = {dancer, female, 30} equals 1, which is the minimum, so the ℓ of the QID is set to 1.
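To make the two models concrete, the following is a minimal Python sketch (ours, not code from the ODD Visualizer) that computes k for a chosen QID and the entropy-based ℓ for a sensitive attribute over the records of Table 1(a):

```python
import math
from collections import Counter, defaultdict

# Records of Table 1(a): (job, gender, age, disease)
records = [
    ("Engineer", "Male",   35, "Hepatitis"),
    ("Engineer", "Male",   38, "Hepatitis"),
    ("Lawyer",   "Male",   38, "HIV"),
    ("Writer",   "Female", 38, "Flu"),
    ("Writer",   "Female", 30, "HIV"),
    ("Dancer",   "Female", 30, "HIV"),
    ("Dancer",   "Female", 30, "HIV"),
]

def k_anonymity(rows, qid_idx):
    """k = size of the smallest group of rows sharing the same qid value."""
    groups = Counter(tuple(r[i] for i in qid_idx) for r in rows)
    return min(groups.values())

def l_diversity(rows, qid_idx, sa_idx):
    """Entropy-based l: minimum over qid groups of exp(entropy of SA values)."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[i] for i in qid_idx)].append(r[sa_idx])
    l_values = []
    for sa_values in groups.values():
        n = len(sa_values)
        entropy = -sum((c / n) * math.log(c / n)
                       for c in Counter(sa_values).values())
        l_values.append(math.exp(entropy))  # log(l) = entropy, so l = exp(entropy)
    return min(l_values)

qid = (0, 1, 2)                      # QID = {job, gender, age}
print(k_anonymity(records, qid))     # 1   -> a record linkage attack is possible
print(l_diversity(records, qid, 3))  # 1.0 -> an attribute linkage attack is possible
```

Running it reports k = 1 (the {lawyer, male, 38} record, Doug's, is unique under this QID) and ℓ = 1 (the qid group {dancer, female, 30} contains a single disease value), matching the two attacks described above.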
2.3. Privacy preserving data visualization

Recently, privacy preserving data visualization has attracted increasing attention in the visualization community. In the information visualization literature, Dasgupta et al. [11] described opportunities and challenges for privacy preserving visualization in the realm of electronic health record (EHR) data. Andrienko et al. [12,13] discussed privacy issues in applying geospatial visual analytics methods to movement data. Dasgupta et al. [14,15] applied k-anonymity and ℓ-diversity to parallel coordinate plots and scatter plots. Chou et al. [16,17] integrated privacy protection into Sankey-diagram-like visualizations and node-link diagrams. These previous studies bundled sets of edges or merged sets of nodes to protect the private information in a graph instead of releasing the corresponding text data. In this paper, we focus on k-anonymity and ℓ-diversity for raw data de-identification evaluation and visualization.

3. System architecture and kernel components of the ODD Visualizer

Fig. 1 depicts the 3-layered structure of our ODD Visualizer system.

Fig. 1. System architecture of the ODD Visualizer.

A user (data owner) uploads his/her data to the ODD Visualizer server and gets the analysis results through a web browser; this left-most layer is the user interface. The ODD Visualizer server consists of the remaining two layers. One is the scalable service, depicted as the right-most layer in Fig. 1. Since the computation of the k-anonymity/ℓ-diversity models has the potential to be parallelized, we apply Spark for parallel cluster computing together with the Hadoop Distributed File System (HDFS) [18] to run the kernel components of the ODD Visualizer, Matrix Visualization (MV) and the Hierarchical Analysis and Clustering Tree (HACT) [8], which are introduced in the following subsections. The central layer works as a web server dedicated to stored account/dataset management and relays the required information between the other two layers using the Spring MVC framework [19].
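As an illustration of how such computations parallelize, the sketch below shows one plausible PySpark job for the pairwise k-anonymity values behind the MV cells; the file path, column handling, and DataFrame-based grouping are our assumptions for exposition, not the ODD Visualizer's actual implementation:

```python
from itertools import combinations
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("k-anonymity-sketch").getOrCreate()

# Illustrative input: a de-identified table stored on HDFS (path is a placeholder).
df = spark.read.csv("hdfs:///open_data/patients.csv", header=True).cache()

# k for every pair of attributes: the value rendered at cell (i, j) of the MV.
for qid in combinations(df.columns, 2):
    k = (df.groupBy(*qid).count()          # group size for each qid value
           .agg(F.min("count").alias("k")) # k = smallest group size
           .first()["k"])
    print(qid, k)
```

The same pattern covers the single-attribute cases (the matrix diagonal) and the full attribute set (the matrix borderline) by varying the grouping columns.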
3.1. Kernel component I: visualization of k-anonymity distribution

Matrix Visualization (MV) is implemented to depict a concise distribution of the estimated k-anonymity values for different combinations of attributes. These k-anonymity estimations intuitively work as a measurement of a whole dataset's robustness against the "record linkage attack". Fig. 2(b) shows the MV of k-anonymity in the ODD Visualizer.

Fig. 2. The modified matrix visualization to display the k-anonymity / ℓ-diversity distribution.

The color spectrum at the bottom of Fig. 2 indicates different k-anonymity values from the lowest to the highest. For the matrix itself, the row and column indices list the attributes (i.e., V1, V2, ...) constituting the input dataset, and the color shown at position (i, j) represents the k-anonymity value for QID = {Vi, Vj}. The diagonal positions stand for the cases where only single variables are considered. Moreover, the color of the matrix borderline represents the k-anonymity value of the entire dataset considering all attributes. It should be noted that only QID attributes are imported into the matrix; EID is not considered in MV, because by definition an EID always produces a k-anonymity of 1. All k-anonymity computations are executed on a Spark platform.

3.2. Kernel component II: visualization of ℓ-diversity distribution

As with the visualization of the k-anonymity distribution, MV is also implemented to depict a concise distribution of the estimated ℓ-diversity values for different combinations of attributes. The difference is that a user must first specify the target sensitive attributes. These ℓ-diversity estimations intuitively measure the robustness of the target sensitive attributes of a whole dataset against the "attribute linkage attack". Fig. 2(b) shows the MV of ℓ-diversity in the ODD Visualizer. For the matrix itself, the row and column indices list the attributes (i.e., V1, V2, ...) constituting the input dataset, and the color shown at position (i, j) represents the ℓ-diversity value for QID = {Vi, Vj}. The diagonal positions again stand for the single-variable cases, and the color of the matrix borderline represents the ℓ-diversity value of the entire dataset considering all attributes. Note that the attributes imported into the matrix are only QID, and the user needs to specify the target sensitive attribute. If there are multiple sensitive attributes, their ℓ-diversity distributions are visualized separately, and all ℓ-diversity computations are likewise executed on a Spark platform.

3.3. Kernel component III: suggestion for an optimized attribute releasing order

The Hierarchical Analysis and Clustering Tree (HACT) analyzes the k-anonymity/ℓ-diversity distribution after MV. As in Fig. 2(b), HACT first sorts the k-anonymity/ℓ-diversity values of all combinations of two attributes and uses a greedy strategy to merge the two variables (attributes) that result in the maximum k-anonymity/ℓ-diversity value. The two selected variables, Vi and Vj, are merged as a new cluster and treated as one "complex" variable, {Vi, Vj}. The clustering operation is executed iteratively until all attributes are merged into a single tree. The "uncle flipping" mechanism [20] is then adopted to determine the optimized order of attributes inside each subtree. Because HACT merges, arranges, and groups variables to find which subsets of the data are most robust against the "record linkage attack", after the HACT process, given any k-anonymity/ℓ-diversity threshold δ, a user gets a suggestion and guideline for determining which attribute subsets to release, based on the optimal alignment sorted by HACT (Fig. 2(c)). The following is the pseudo code of the HACT process:

Step 1. Excluding the diagonal, pick the position (i, j) in MV that produces the largest k-anonymity/ℓ-diversity value.
Step 2. Merge the two picked variables, Vi and Vj, into a new "complex" variable, {Vi, Vj}, and form a new clustering tree.
Step 3. Replace the indices Vi and Vj by {Vi, Vj}, and update the corresponding k-anonymity/ℓ-diversity values related to Vi and Vj.
Step 4. Go back to Step 1 until all subtrees are merged.
Step 5. Align all clustered variables based on the "uncle flipping" order-determining mechanism.
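A compact Python rendering of Steps 1-4 might look as follows; metric() is an assumed helper returning the k-anonymity or ℓ-diversity of a candidate QID (for example, built on the k_anonymity function sketched in Section 2), and the uncle-flipping alignment of Step 5 is omitted:

```python
from itertools import combinations

def hact(attributes, metric):
    """Greedy HACT merging (Steps 1-4): repeatedly merge the pair of
    clusters whose combined QID yields the largest metric value.
    `metric(qid)` is an assumed helper returning the k-anonymity or
    l-diversity of the attribute set `qid`."""
    clusters = [frozenset([a]) for a in attributes]
    tree = []  # records each merge together with its metric value
    while len(clusters) > 1:
        # Step 1: pick the pair with the largest metric value.
        best = max(combinations(clusters, 2),
                   key=lambda pair: metric(pair[0] | pair[1]))
        # Steps 2-3: merge the pair into one "complex" variable.
        merged = best[0] | best[1]
        tree.append((best, metric(merged)))
        clusters = [c for c in clusters if c not in best] + [merged]
    return tree  # Step 4 ends here; Step 5 (uncle flipping) would order the leaves
```

Reading the recorded metric values along the merge sequence against a threshold δ gives exactly the release/withhold guideline described above.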
4. Demonstration

4.1. Verification using an existing benchmark

To demonstrate the effectiveness of our proposed ODD Visualizer, an employee dataset from [21] was first downloaded as a benchmark. This dataset contains 10 attributes in total; one of them, the employee ID, serves as the EID and was excluded from import into the ODD Visualizer. The results of using the ODD Visualizer to display, analyze, and group the k-anonymity values of this benchmark are shown in Fig. 3. Fig. 3(a) displays the k-anonymity values not only for the whole employee dataset but also for its single attributes and attribute combinations. For instance, the attributes gender, race, and salary class have the highest k-anonymity values, while native-country has the lowest. The combination {salaryclass, gender} provides the highest k-anonymity value; however, PII may be re-identified via the combination {salaryclass, age}, whose corresponding k-anonymity value is too low. It can also be observed that the original matrix visualization appears in a multi-fragment manner. After the HACT processing, Fig. 3(b) shows the aligned matrix, where the attributes are grouped into several blocks. The resulting cluster, representing {gender, salaryclass, education, marital-status, race}, has more resistance against re-identification than the other attributes. The clustering tree on the right side of Fig. 3(b) indicates the order in which those blocks are merged. Moreover, the tree also reflects the de-identification robustness after each merge, from k = 1111 down to k = 1. This example successfully demonstrates that a user can take this visualization as a guideline to decide which parts of a dataset to expose.
Fig. 3. A demonstration of how the ODD Visualizer works on a benchmark dataset.
4.2. Detecting the information re-identification problem of real open data

Here we give another demonstration showing that our ODD Visualizer indeed helps identify "record linkage attack" weaknesses in real open datasets online. Four open datasets available on the Internet were downloaded and taken as illustrative examples. First, the names and the relevant attributes of the four datasets are introduced. (1) "2014 Doctor-Working-Hour Statistics of Taiwan Medical Centers" (1st dataset) contains the number of doctors and the doctors' accumulated working hours for each anonymous medical center in Taiwan in 2014. Note that in this case, the "Count of Doctors" is a QID, the anonymous medical center index is encoded as the EID, and the accumulated working hours of each medical center are the SA. Owing to concerns about the Labor Standards Law and employee rights, no medical center wants its name and the SA information to be linked together. (2) "2014 Count of Medical Personnel of Regions in Taiwan" (2nd dataset) records the total count of medical personnel (QID) in each region of Taiwan; each region in this dataset corresponds to a specific "Post Code" (another QID). (3) "Post Code & Region Name Mapping Table of Taiwan" (3rd dataset) is provided by the postal company of Taiwan, and both "Post Code" and "Region Name" are treated as two QIDs linked to the other tables. (4) "2011-2014 Evaluation Results of Taiwan's Hospitals" (4th dataset) lists the service quality evaluation results of all hospitals in Taiwan, together with their corresponding information, including "Address", "Region" (QID), and "Hospital Name" (EID). Fig. 4 shows the corresponding scenario of this real case with the following steps.
Step 1. "Count of Doctors", appearing in both the 1st and 2nd datasets, is a QID with k = 1, so the two datasets can be linked to establish that an "unknown medical center" is located in a known region.
Step 2. Using "Post Code", appearing in both the 2nd and 3rd datasets with a corresponding k = 1, the attacker obtains the region name of the "unknown medical center" derived in Step 1.
Step 3. Because the ODD Visualizer also identifies that "Region Name" in the 4th dataset still leads to k = 1, querying the 4th dataset with the "Region Name" of this "unknown medical center" derived in Steps 1 and 2 yields the critical finding that only one medical center in the 4th dataset is located in that region.
Step 4. The attacker can therefore decrypt the identity of this "unknown medical center" in the 1st dataset, and re-identify this medical center's sensitive doctor-working-hour statistics (SA), by linking in the "Hospital Name" provided by the 4th dataset.
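For illustration only, the following pandas sketch mimics the four steps as a chain of joins; every file name and column name is a hypothetical placeholder rather than the real datasets' schema:

```python
import pandas as pd

# Illustrative stand-ins for the four open datasets (columns are assumed).
d1 = pd.read_csv("doctor_hours_2014.csv")    # MedCenterID (EID), DoctorCount (QID), WorkHours (SA)
d2 = pd.read_csv("personnel_by_region.csv")  # DoctorCount (QID), PostCode (QID)
d3 = pd.read_csv("postcode_region.csv")      # PostCode (QID), RegionName (QID)
d4 = pd.read_csv("hospital_evaluation.csv")  # RegionName (QID), HospitalName (EID)

# Steps 1-3: chain the k = 1 QIDs across the datasets.
linked = (d1.merge(d2, on="DoctorCount")   # Step 1: unknown center -> post code
            .merge(d3, on="PostCode")      # Step 2: post code -> region name
            .merge(d4, on="RegionName"))   # Step 3: region name -> hospital name

# Step 4: if each join key is unique (k = 1), the SA is now tied to an EID.
print(linked[["HospitalName", "WorkHours"]])
```

The attack succeeds precisely because every join key above has k = 1, which is exactly the condition the MV highlights.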
Fig. 4. Using the ODD Visualizer to detect the re-identification problem of sensitive information among real open datasets.

5. Case study

Medical data usually carry sensitive personal information, so the correctness, completeness, and safety of the relevant personal information are important issues: a slight inadvertence can cause serious damage to a person's property and reputation. We cooperated with 20 experts on sensitive medical data from Chang Gung Memorial Hospital, who usually use the Statistical Analysis System (SAS) to process such data, and conducted a user study to evaluate our ODD Visualizer. Before starting the study, the ODD Visualizer was explained and demonstrated to the experts so that they understood how to use it. After that, each participant used the ODD Visualizer and SAS, respectively, to detect re-identifiable information in their own work. Finally, they reported their opinions on a questionnaire when they finished the experiment. The questions used five-level Likert items: strongly agree, agree, neither agree nor disagree, disagree, strongly disagree. The questionnaire is as follows:

Q1. It was simple to use the ODD Visualizer.
Q2. It was easy to learn to use the ODD Visualizer.
Q3. I can effectively complete my work using the ODD Visualizer.
Q4. It is easy to find the information I need from the ODD Visualizer.
Q5. The information provided by the ODD Visualizer is easy to understand.
Q6. Overall, I am satisfied with the ODD Visualizer.

Fig. 5. Stacked plots reporting the questionnaire results.

The questionnaire results are shown in Fig. 5, which presents the distribution of the answers. Most participants were satisfied with the ODD Visualizer, finding it easy to use and able to increase their efficiency. In addition, we asked the participants to compare the time spent with the ODD Visualizer and with SAS when looking for weak attributes of a target dataset. In general, using the ODD Visualizer saved about 80% of the time needed to detect the attributes with re-identification issues. With SAS, one must first work out the query syntax and test repeatedly until the right statements are found; the whole process takes about one hour and offers no visualization. The ODD Visualizer, on the other hand, provides a graphical interface presenting the k-anonymity and ℓ-diversity results, which allows a user to figure out the weak attributes of a dataset at a glance.

6. Conclusion

Open data analysis creates huge amounts of revenue across various domains and their related applications. Yet identifying whether any weakness against "record linkage attacks" or "attribute linkage attacks" exists among large-scale datasets is still not straightforward, which slows down data releasing. To overcome this bottleneck, two issues were properly handled in our proposed ODD Visualizer: (1) deploying the de-identification estimation models on a Spark platform to provide scalable and fast k-anonymity/ℓ-diversity computation, and (2) using efficient matrix visualization and the hierarchical analysis and clustering tree not only to rapidly depict the k-anonymity/ℓ-diversity distribution of a whole dataset, but also to construct optimized data releasing suggestions. A well-known benchmark collected from the Internet and a real case study demonstrate that our ODD Visualizer indeed helps detect datasets suffering from a high risk of sensitive information re-identification. Our future work includes extensions to other types of anonymity models and the design of their specific visualizations. In addition, there are other privacy-preserving operations, such as merge, blur, and delete, that we plan to implement in the future. Moreover, the ODD Visualizer can also be applied to valuable but sensitive datasets in other domains.

References

[1] J. Manyika, M. Chui, P. Groves, D. Farrell, S. Van Kuiken, E.A. Doshi, Open Data: Unlocking Innovation and Performance with Liquid Information, McKinsey Global Institute, 2013, p. 21.
[2] New Report Highlights Successes and Challenges of Worldwide Open Data Policies. http://techpresident.com/news/wegov/24480/new-report-highlights-successes-and-challenges-worldwide-open-data-policies.
[3] PM Speech at Open Government Partnership 2013 – Speeches – GOV.UK. https://www.gov.uk/government/speeches/pm-speech-at-open-government-partnership-2013.
[4] L. Sweeney, Simple demographics often identify people uniquely, Health (San Francisco) 671 (2000) 1–34.
[5] A. Tanner, Harvard Professor Re-Identifies Anonymous Volunteers in DNA Study. http://www.forbes.com/sites/adamtanner/2013/04/25/harvard-professor-reidentifies-anonymous-volunteers-in-dna-study/.
[6] L. Sweeney, k-anonymity: a model for protecting privacy, Int. J. Uncertainty Fuzziness Knowl. Based Syst. 10 (05) (2002) 557–570.
[7] A. Meyerson, R. Williams, On the complexity of optimal k-anonymity, Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM, 2004, pp. 223–228.
[8] C.-H. Kao, C.-H. Hsieh, C.-L. Hsu, Y.-F. Chu, Y.-T. Kuang, ODD Visualizer: scalable open data de-identification visualizer, Proceedings of the 2016 International Workshop on Computer Science and Engineering, 2016.
[9] H.-M. Wu, S. Tzeng, C.-h. Chen, Matrix visualization, Handbook of Data Visualization, Springer, 2008, pp. 681–708.
[10] A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, ℓ-diversity: privacy beyond k-anonymity, ACM Trans. Knowl. Discov. Data (TKDD) 1 (1) (2007) 3.
[11] A. Dasgupta, E. Maguire, A. Abdul-Rahman, M. Chen, Opportunities and challenges for privacy-preserving visualization of electronic health record data, Proceedings of the IEEE VIS 2014 Workshop on Visualization of Electronic Health Records, Paris, 2014.
[12] G. Andrienko, N. Andrienko, Privacy issues in geospatial visual analytics, Advances in Location-Based Services, Springer, 2012, pp. 239–246.
[13] N. Andrienko, G. Andrienko, G. Fuchs, Towards privacy-preserving semantic mobility analysis, Proceedings of EuroVA, International Workshop on Visual Analytics, EuroGraphics, Leipzig, 2013.
[14] A. Dasgupta, R. Kosara, Adaptive privacy-preserving visualization using parallel coordinates, IEEE Trans. Vis. Comput. Graph. 17 (12) (2011) 2241–2248.
[15] A. Dasgupta, M. Chen, R. Kosara, Measuring privacy and utility in privacy-preserving visualization, Computer Graphics Forum 32, Wiley Online Library, 2013, pp. 35–47.
[16] J.-K. Chou, Y. Wang, K.-L. Ma, Privacy preserving event sequence data visualization using a Sankey diagram-like representation, SIGGRAPH ASIA 2016 Symposium on Visualization, ACM, 2016, p. 1.
[17] J.-K. Chou, C. Bryan, K.-L. Ma, Privacy preserving visualization for social network data with ontology information, Proceedings of the IEEE Pacific Visualization Symposium (PacificVis), 2017.
[18] Apache Spark – Lightning-Fast Cluster Computing. http://spark.apache.org/.
[19] Serving Web Content with Spring MVC. https://spring.io/guides/gs/serving-web-content/.
[20] H.-M. Wu, Y.-J. Tien, C.-h. Chen, GAP: a graphical environment for matrix visualization and cluster analysis, Comput. Stat. Data Anal. 54 (3) (2010) 767–778.
[21] F. Prasser, F. Kohlmayer, Putting statistical disclosure control into practice: the ARX data anonymization tool, Medical Data Privacy Handbook, Springer, 2015, pp. 111–148.
Chiun-How Kao is currently a Ph.D. candidate in the Department of Information Management, National Taiwan University of Science and Technology. His research interests include big data, information visualization, and data analysis.
Chih-Hung Hsieh received the B.S. degree in computer science and information engineering from National Taiwan Normal University, Taipei, Taiwan, the M.S. degree in bioinformatics from National Chiao Tung University, Hsinchu, Taiwan, and the Ph.D. degree from the Institute of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, in 2004, 2006, and 2011, respectively. He is currently an engineer at the Institute for Information Industry working on cyber security data analysis. His research interests include data mining, machine learning, evolutionary algorithms, fuzzy systems, and bioinformatics.
Yu-Feng Chu received the M.S. degree in information management from Yuan Ze University, Taoyuan, Taiwan. He is currently an engineer at the Institute for Information Industry working on cyber security data analysis. His professional skills include system development and performance evaluation.
Chuan-Kai Yang received the B.S. and M.S. degrees in mathematics and computer science from National Taiwan University, Taipei, Taiwan, in 1991 and 1993, respectively, and the Ph.D. degree in computer science from Stony Brook University, Stony Brook, NY, USA, in 2002. He is a Professor with the Department of Information Management, National Taiwan University of Science and Technology, Taipei. His research interests include computer graphics, visualization, multimedia systems, and computational geometry.
Yu-Ting Kuang received the M.S. degree in information management from National Taiwan University of Science and Technology, Taipei, Taiwan. She is now a web front-end designer and engineer at the Institute for Information Industry working on cyber security data analysis. Her professional skills include web development and user interface design.