Accepted Manuscript

Title: Full Autonomy: A Novel Individualized Anonymity Model for Privacy Preserving
Authors: Junqing Le, Xiaofeng Liao, Bo Yang
PII: S0167-4048(16)30181-X
DOI: http://dx.doi.org/10.1016/j.cose.2016.12.010
Reference: COSE 1082
To appear in: Computers & Security
Received date: 24-4-2016; Revised date: 3-11-2016; Accepted date: 21-12-2016
Full Autonomy: A Novel Individualized Anonymity Model for Privacy Preserving

Junqing Le (a), Xiaofeng Liao (a, 1), Bo Yang (a)

(a) Chongqing Key Laboratory of Nonlinear Circuits and Intelligent Information Processing, College of Electronic and Information Engineering, Southwest University, Chongqing 400715, China
(1) Corresponding author. E-mail address: [email protected].
Junqing Le received the B.S. degree in software engineering from Southwest Jiaotong University, Chengdu, China, in 2014. He is currently pursuing a master's degree in signal and information processing at Southwest University, China. His research interests include cryptography, chaos, image and multimedia security, and privacy protection.

Xiaofeng Liao received the B.S. and M.S. degrees in mathematics from Sichuan University, Chengdu, China, in 1986 and 1992, respectively, and the Ph.D. degree in circuits and systems from the University of Electronic Science and Technology of China in 1997. From 1999 to 2012, he was a professor at Chongqing University. He is currently a professor at Southwest University and the Dean of the School of Electronic and Information Engineering, and a Yangtze River Scholar of the Ministry of Education of China. From November 1997 to April 1998, he was a research associate at the Chinese University of Hong Kong. From October 1999 to October 2000, he was a research associate at the City University of Hong Kong; from March 2001 to June 2001 and from March 2002 to June 2002, he was a senior research associate there; and from March 2006 to April 2007, he was a research fellow there. Professor Liao holds 4 patents and has published 4 books and over 300 international journal and conference papers. His current research interests include neural networks, nonlinear dynamical systems, bifurcation and chaos, and cryptography.

Bo Yang received the B.S. and M.S. degrees in electronic and information engineering from Southwest University, China. He is now pursuing the Ph.D. degree at Southwest University.
His research interests include chaos and network security.

Abstract

An important principle in privacy preservation is individualized privacy autonomy, which means that each individual has the freedom to decide and choose his/her own privacy constraints. Many individualized anonymity models that incorporate privacy autonomy have been proposed, and most of them focus on autonomy over the sensitive attributes. Since autonomy over the quasi-identifier (QI) attributes is neglected, these individualized models are not fully autonomous. In order to achieve full privacy autonomy, an individualized (α, ω)-anonymity model is proposed in this paper, where α and ω, which respectively represent the constraint values of the sensitive attributes and the QI attributes, are both set by the data providers. The model does not need to set the constraint value k. Furthermore, it combines granular computing with top-down local recoding to process the providers' data in different intervals, thus achieving differentiated protection for different granular spaces. The performance analysis shows that this model not only satisfies the individualized privacy requirements, but also achieves higher efficiency and lower information loss.

Keywords:
privacy preservation, privacy autonomy, individualized (α, ω)-anonymity model,
granular computing, local recoding

1. Introduction

With the development of computer technology, big data has become pervasive on the Internet, and data
security has become an important research topic in the information security field as data keeps accumulating. How to protect individual privacy is therefore a pressing problem. In the real world, large volumes of data are published; for some agencies, such as those conducting demographic and public health research, releasing information is even their main task. Moreover, as digital virtual assets gradually come into people's view, virtual currencies and virtual goods are closely related to individuals' lives, so that a vast amount of personal privacy is bundled with these assets. In order to prevent privacy disclosure, an individual's explicit identifiers (e.g., name and ID number) must be removed [1]. In addition, the possibility that certain attributes can be combined with external data to uniquely identify individuals must be considered. For example, some individuals can be re-identified by linking the released data with another public database on Age, Sex, and Zipcode. This way of re-identification is called the linking attack.
In order to resist the linking attack, Sweeney et al. proposed the k-anonymity model [2]. In a dataset satisfying k-anonymity, each record is indistinguishable from at least k-1 other records with respect to the QI attributes, so no individual can be identified with probability exceeding 1/k through linking attacks alone. To address different kinds of identity leakage and to process different types of data, many improved models based on k-anonymity have been proposed [3-12]. However, without a corresponding protection mechanism for the sensitive attributes, the k-anonymity model cannot resist the homogeneity attack and the background knowledge attack when the diversity of the sensitive attribute values is low. According to the frequency, distribution and diversity of the sensitive attribute values, researchers have proposed different kinds of constraints and designed many new anonymity models to strengthen privacy preservation. Machanavajjhala et al. [13] proposed the l-diversity model, which requires that the sensitive attribute in each equivalence class have at least l distinct values. Subsequently, Li et al. [14] proposed the t-closeness model, which requires the distribution of the sensitive attribute values in each equivalence class to be as close as possible to the global distribution. Wong et al. proposed the (α, k)-anonymity model, where the frequency of the sensitive attribute values is limited so that the frequency of any sensitive value in each equivalence class cannot exceed α [15]. Most of the above studies focus on a universal approach which exerts the same amount of protection on all individuals, and the parameters (such as k, l and α) are set by the publishers [16]. However, privacy requirements (also called privacy constraints, privacy expectations or privacy preferences) vary from person to person in practice. For example, some people consider disease sensitive information while others do not; some care much about privacy while others care less. To achieve privacy autonomy, which means that the individual has the right to determine his/her own privacy constraints, many individualized privacy-preserving models have been proposed in recent years. Xiao and Tao presented a generalization framework based on the concept of individualized privacy preservation in [17], where everybody can specify the degree of privacy protection for his/her sensitive attribute. Ye et al. proposed a personalized (α, k) model that allows everybody to decide his/her own privacy preferences [18]. However, both of these models rely on generalizing the sensitive attributes along their attribute taxonomies to achieve anonymity. The
above method generalizes the sensitive attribute information, which reduces the utility of the data. To avoid generalizing the sensitive attributes, a complete (α, k) model for the individualized preservation of sensitive values has been proposed [19], where α is a user-specified threshold for each sensitive value and should be set according to its sensitivity. Another personalized model without generalization of the sensitive attributes was proposed in [20]. In that model, each person's privacy-sensitive factor, which expresses the degree of attention paid to his/her sensitive values, takes the value 1, -1 or 0: a factor of 1 indicates that the individual does not want to disclose his/her sensitive values; a factor of -1 indicates that the individual pays no attention to them and allows them to be disclosed; and a factor of 0 indicates that the individual cannot determine the degree. The models proposed in [19, 20] impose certain limitations on the thresholds of the sensitive values, so a person can only express a rough expectation of privacy security. There are other personalized anonymity models [21, 22] that are similar to the models mentioned above.

1.1. Motivation
Most personalized anonymity models are based on the (α, k) model, such as the existing
models mentioned above, so our discussion in this paper is based on the personalized (α, k) model. However, the current personalized (α, k) models have several drawbacks. First, in some personalized anonymity models (such as [17, 18]), an individual has the freedom to express his/her own privacy preference with a precise value, but these models need to generalize the sensitive attributes in the anonymization process, which reduces the utility of the sensitive data. On the contrary, some personalized anonymity models (such as [19, 20]) that do not generalize the sensitive attributes cannot adequately express personal expectations of privacy security, because they only constrain the sensitive attributes with rough values. A model in which everybody has precise privacy preferences and the anonymization process does not generalize the sensitive attributes has not yet been put forward. Second, most current personalized anonymity models are based on the (α, k) model, and several are based on k-anonymity. In all these models, a constraint value k is set by the data publisher (or the system). The constraint value k imposes a unified constraint on all individuals, who in fact have different requirements. As a result, individual privacy protection is not
uniform, and more information is lost. Third, personal information consists of two parts: personal sensitive information and the individual's QI information. Most personalized anonymity models are only concerned with individuals' specific needs regarding the sensitive attributes, while ignoring that individuals also have different privacy requirements on the QI attributes, because not all individuals are willing to provide complete QI information. For example, some persons do not want to disclose their age, while others place no requirement on their age information. In other words, QI information is sometimes itself a kind of sensitive information. Therefore, personal constraints are needed not only on the sensitive attributes but also on the QI attributes, and to achieve full privacy autonomy the individual's QI constraints must be set by the provider himself/herself. If a model does not offer full privacy autonomy, there is a great possibility that personal privacy security does not meet the individual's requirements. Thus, full privacy autonomy is necessary for individual privacy preservation. Fourth, in order to satisfy the different requirements (different constraints) of different persons on the sensitive attributes, most models apply the same processing to all constraints during anonymization. The consequence is that excessive privacy protection may be applied to a subset of persons, and such excessive protection usually comes at the cost of information loss. For example, an equivalence class may contain a tuple with a relatively high expectation of privacy, say with constraint value 0.1. To meet this tuple's constraint without generalizing the sensitive attribute, the size k of the equivalence class must be at least 10. If no such tuple exists, the individual constraints may be satisfied with k less than 10, and a smaller k means less information loss.

1.2. Contributions
In this paper, to solve the above problems, we propose the individualized (α, ω)-anonymity
model. The core of our solution is that all individuals' privacy requirements, which are based on full privacy autonomy, are satisfied. To achieve full privacy autonomy, the individualized (α, ω)-anonymity model allows the data providers to set the desired privacy requirements for their own sensitive attributes and QI attributes, so that the constraints on both the sensitive attributes and the QI attributes are under the providers' autonomy. In the proposed model, the value ω is used
to describe the lowest generalized degree of the QI attributes and is the constraint value on the QI attributes set by the provider, while the constraint parameter α expresses the level of security the individual requires for his/her sensitive attributes. In the anonymization process of the individualized (α, ω)-anonymity model, we use granular computing: tuples with the same sensitive-attribute constraint value are divided into the same interval, and then each interval is processed with the intensity appropriate to its constraint value. In summary, the contributions of the proposed model are as follows:
1) In the individualized (α, ω)-anonymity model, everybody has the freedom to set precise privacy preferences for his/her sensitive attribute and QI attributes. That is, the constraints on all personal information are set by the provider himself, so the proposed model achieves full privacy autonomy.
2) In the anonymization process, the individualized (α, ω)-anonymity model utilizes granular computing and does not generalize the sensitive attributes, which reduces the information loss of the QI attributes and the anonymization time, and improves the utility of the sensitive attribute values.
3) The proposed model does not need a fixed value k to regulate the size of the equivalence classes, and the equivalence classes can be arbitrarily large. Because the model does not limit k, the anonymization operations become more flexible, and the efficiency of the anonymization process is clearly improved.
The rest of this paper is organized as follows. Section 2 describes the related definitions of the proposed model. The individualized (α, ω)-anonymity model is established in Section 3, including the construction of full privacy autonomy and the main algorithms of the model. In Section 4, several numerical simulations are conducted to verify the effectiveness of the individualized (α, ω)-anonymity model. Finally, conclusions and future work are given in Section 5.

2. Basic definitions
In this section, some basic concepts and definitions about anonymity are introduced. As
the (α, ω)-anonymity model is based on the personalized (α, k)-anonymity model, we also describe the knowledge related to the (α, k)-anonymity model. These definitions are motivated by the ideas of [1, 3, 12, 21, 24-26]. The anonymity model in this paper is studied on relational data tables, so some basic concepts about data tables are described first. Let the original data
table be T = {t_1, t_2, t_3, ..., t_n}, where t_i (1 ≤ i ≤ n) is a tuple, and let AT represent an anonymous data table. The quasi-identifier (QI) attribute set is QI = {A_1, A_2, A_3, ..., A_m}, where A_i (1 ≤ i ≤ m) represents an attribute. A quasi-identifier is a minimal set of attributes in table T which can be joined with external information to re-identify personal records. If an adversary cannot associate the identifier through an attribute A_j, then A_j is called a sensitive attribute, denoted as S; |S| represents the number of sensitive attribute values, and s_i (1 ≤ i ≤ |S|) represents one value in S. If t ∈ T, then t[A_i] and t[QI] denote the value of the tuple t on the attribute A_i and on the attribute set QI, respectively.

Definition 1 (Equivalence class) [2, 23]: An equivalence class EC is a set of tuples that have the same values on several attributes. If EC = {t_1, t_2, ..., t_k}, then t_1[QI] = t_2[QI] = ... = t_k[QI], and |EC| represents the number of tuples contained in the equivalence class.

The frequently used operations of anonymization algorithms include generalization, suppression and specialization. Generalization is the process in which a value is replaced by a less specific, more general value that is faithful to the original. For example, the birthdate 12/23/1991 can be generalized to 12/**/1991, and the code 331800 can be generalized to 3318**. Suppression hides some data items. Specialization is the inverse of generalization: it replaces abstract values with more specific related values; the code 331800 above is the result of specializing 3318**.

Definition 2 (Generalization) [1, 23, 24]: Sweeney et al. pointed out that each attribute has its own set of values, called the attribute domain and denoted as D. According to the different accuracies of the values, the same attribute may have multiple attribute domains, with a partial order relation ≺ among them; if D_i is a generalization of D_j, this is represented as D_j ≺ D_i. For a given attribute A, a function f: A→B expresses the generalization, and A_0 → A_1 → ⋯ → A_n (with generalization maps f_0, f_1, ..., f_{n-1}) is a generalization sequence, or functional generalization sequence. The partial order relation of the attribute A can be abstracted to construct the domain generalization hierarchy DGH_A and the value generalization hierarchy VGH_A, as shown in Figure 1. Usually, the attribute generalization hierarchy determines the rule of attribute
generalization. In the generalization hierarchy, attributes can be divided into two types: numeric and categorical attributes. Figure 1 shows the generalization hierarchy of a numeric attribute and Figure 2 shows that of a categorical attribute.

Definition 3 (k-anonymity) [2, 7, 23]: For a data table T, the QI set and a given parameter k, if T can be divided into a set of mutually exclusive equivalence classes satisfying T = EC_1 ∪ EC_2 ∪ ... ∪ EC_h, EC_i ∩ EC_j = ∅ and k ≤ |EC_i| for 1 ≤ i < j ≤ h, then it can be said
that T satisfies k-anonymity on QI.

Any anonymization algorithm needs a metric of information loss. In this paper, we use the normalized certainty penalty proposed by Xu et al. [25] as the metric standard.

Definition 4 (Normalized certainty penalty) [25]: The information loss of an attribute value x on a numeric attribute A is computed as NCP_A(x) = Interval(x) / Interval(A), where Interval(x) represents the range size of x and Interval(A) represents the range size of the attribute A. When the attribute A is categorical, NCP_A(x) = |Subnode(x)| / |Subnode(A)|, where |Subnode(x)| represents the number of leaf nodes in the subtree rooted at x, and |Subnode(A)| represents the total number of leaf nodes in the attribute generalization hierarchy.

In Figure 3(a), Interval(x) = 30 and Interval(A) = 60, so NCP_A(x) = 30/60, where x represents the range 1~30 and the attribute A represents age. That is to say, the information loss of the range 1~30 on the attribute age is 0.5. In Figure 3(b), if x represents individual enterprise, then |Subnode(x)| = 2, and the attribute A indicates the type of work-class, with |Subnode(A)| = 4, so the information loss of individual enterprise on work-class is 1/2.

The information loss of an attribute value is extended to a tuple t and to the entire data table T as NCP(t) = Σ_{i=1}^{n} NCP_{A_i}(t[A_i]) and NCP(T) = Σ_{t ∈ T} NCP(t).
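To make Definition 4 concrete, the following is a minimal sketch of the NCP computation (the dictionary-based hierarchy and the example values are illustrative assumptions, not the paper's implementation):

```python
# Minimal sketch of the normalized certainty penalty (NCP) of Definition 4.
# The hierarchy layout and the example values are illustrative assumptions.

def ncp_numeric(interval_x, interval_a):
    """NCP of a generalized numeric value: Interval(x) / Interval(A)."""
    return interval_x / interval_a

def leaf_count(node, hierarchy):
    """Number of leaf nodes under `node` in a dict-based value generalization hierarchy."""
    children = hierarchy.get(node, [])
    if not children:
        return 1
    return sum(leaf_count(child, hierarchy) for child in children)

def ncp_categorical(value, attr_root, hierarchy):
    """NCP of a generalized categorical value: |Subnode(x)| / |Subnode(A)|."""
    return leaf_count(value, hierarchy) / leaf_count(attr_root, hierarchy)

def ncp_tuple(per_attribute_ncps):
    """NCP(t): sum of the per-attribute NCP values of a tuple."""
    return sum(per_attribute_ncps)

def ncp_table(per_tuple_ncps):
    """NCP(T): sum of NCP(t) over all tuples of the table."""
    return sum(per_tuple_ncps)

# Example resembling Figure 3: age 1~30 inside the domain 1~60, and a
# hypothetical 4-leaf work-class hierarchy where "individual enterprise" covers 2 leaves.
work_class = {
    "work-class": ["individual enterprise", "other"],
    "individual enterprise": ["self-employed", "family business"],
    "other": ["government", "private company"],
}
print(ncp_numeric(30, 60))                                                   # 0.5
print(ncp_categorical("individual enterprise", "work-class", work_class))   # 0.5
```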
Definition 5 (α-non-correlating) [18]: Let freq(EC_i, s_j) be the number of occurrences of s_j in the equivalence class EC_i of the anonymous table AT. If every EC_i satisfies the condition freq(EC_i, s_j) / |EC_i| ≤ α (0 ≤ α ≤ 1), where α is the threshold set by the data publishers, then the table AT
on the quasi-identifier QI and the sensitive value s_j is called α-non-correlating.

Definition 6 ((α, k)-anonymity) [18]: If an anonymous table AT satisfies the k-anonymity model and each sensitive value in the sensitive attribute set S is α-non-correlating, then the anonymous table AT is called (α, k)-anonymous on QI and S.

Definition 7 (Individualized non-correlating constraints) [18]: Let t represent a tuple in the data set, and let the value set of the sensitive attribute be S = {s_1, s_2, ...}, where the sensitive value s is the attribute the provider wants to protect. α is the individualized non-correlating constraint, and for a tuple t it is represented as the pair (s_i, α_{s_i}). The constraint required by tuple t on the correlated sensitive attribute value s_i is freq(EC(t), s_i) / |EC(t)| ≤ α_{s_i}, where freq(EC(t), s_i) equals the number of tuples with the sensitive value s_i in the equivalence class EC(t) that contains tuple t. In other words, the correlation between any sensitive value s_i and a single tuple t in EC(t) must not exceed the α_{s_i} defined by the data provider.

The individualized constraint allows individuals to set a constraint on each sensitive value they want to protect, according to their expectations, during data collection. There are the following two extreme cases:

Case 1: α_{s_i} = 1 means that the tuple t may be fully correlated to the sensitive attribute value s_i, since the frequency of any sensitive value never exceeds 1.

Case 2: α_{s_i} = 0 indicates that the provider does not allow any correlation with the sensitive attribute value s_i.

Definition 8 (Individualized (α, k)-anonymity model) [18]: For a dataset T, let T′ be the table after anonymization. If T′ satisfies the k-anonymity constraints, and the tuples in T′ meet the individualized non-correlating constraints within their own equivalence classes, then T′ accords with the individualized (α, k)-anonymity model.

Definition 9 (Safety factor): The safety factor (sf) represents the personal privacy safety expectation value, which is set by the provider himself/herself and indicates what level of privacy protection should be achieved. The safety factor takes values in [0, 1]; the greater the value, the higher the security expectation the provider wants to achieve. There are two extreme cases:

Case 1: sf = 0 indicates that the sensitive information is not personal privacy and can be
released without any anonymization processing.

Case 2: sf = 1 means the sensitive information is absolutely private, and it will be hidden in the anonymization process.

Definition 10 (Generalized degree): The generalized degree reflects to what degree an original attribute value is generalized in the anonymization process; numerically, it equals the normalized certainty penalty of the generalized attribute value. If x is the value of the attribute A after generalization, then its generalized degree is g_A(x) = NCP_A(x). As a function f can express the generalization, we define a function f_A^g(x), which represents that the value x of the attribute A is generalized until its generalized degree equals g.

Definition 11 (Full privacy autonomy): In the process of data publishing, full privacy autonomy means that the data provider has the right to set accurate security expectation values for all of his/her information (including both the sensitive attributes and the QI attributes).
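The constraint checks behind Definitions 5-8 can be sketched as follows (a minimal illustration assuming equivalence classes are lists of records with a sensitive 'disease' field and a per-record constraint 'alpha'; the field names are hypothetical):

```python
from collections import Counter

def alpha_non_correlating(equiv_class, alpha, sensitive_attr="disease"):
    """Definition 5: every sensitive value occurs with frequency <= alpha in the class."""
    counts = Counter(rec[sensitive_attr] for rec in equiv_class)
    size = len(equiv_class)
    return all(count / size <= alpha for count in counts.values())

def meets_individual_constraints(equiv_class, sensitive_attr="disease", alpha_attr="alpha"):
    """Definition 7: for each tuple t, freq(EC(t), s_i) / |EC(t)| <= alpha_{s_i}."""
    counts = Counter(rec[sensitive_attr] for rec in equiv_class)
    size = len(equiv_class)
    return all(counts[rec[sensitive_attr]] / size <= rec[alpha_attr] for rec in equiv_class)

def individualized_alpha_k(partition, k, sensitive_attr="disease", alpha_attr="alpha"):
    """Definition 8: k-anonymity plus the individualized non-correlating constraints."""
    return all(len(ec) >= k and meets_individual_constraints(ec, sensitive_attr, alpha_attr)
               for ec in partition)

# Toy usage: one equivalence class with two records.
ec = [{"disease": "Flu", "alpha": 0.6}, {"disease": "Cancer", "alpha": 0.5}]
print(alpha_non_correlating(ec, 0.5))        # True: each value has frequency 0.5
print(meets_individual_constraints(ec))      # True: 0.5 <= 0.6 and 0.5 <= 0.5
```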
3. Individualized (α, ω)-anonymity model

3.1. Full privacy autonomy

Privacy autonomy is an important principle in privacy preservation. If there are any
constraints on individual information that are not set by the provider, or that are represented only with rough values, it is likely that one subset of people receives excessive protection while another subset receives insufficient protection, so the individuals' requirements cannot be satisfied well. Constraints set by the system are usually universal limits on all individuals and are not suitable for individual privacy requirements that vary from person to person. In order to protect everyone's privacy requirements appropriately, individuals need to have full autonomy over their own information. At present, most individualized (α, k)-anonymity models do not consider the data providers' privacy requirements for the QI attributes, impose a value k set by the publisher, and in some cases do not let the provider express personal privacy requirements with accurate constraint values. That is, most of the current individualized (α, k) models do not offer full privacy autonomy. Therefore, in this paper we propose an individualized (α, ω)-anonymity model based on full privacy autonomy. The process of achieving full privacy autonomy in the model is as follows. The provider's data is divided into two parts during the anonymization process of our (α, ω)-anonymity model: the sensitive attribute values and the quasi-identifier values. The parameters
α and ω are the constraints on the sensitive attributes and the quasi-identifiers respectively, and both are set by the provider himself/herself. The provider's privacy autonomy process is as follows: the provider sets a safety factor he/she wants, and this coefficient is represented by sf; simultaneously, the provider chooses to partially or completely hide the quasi-identifier values. The relation between the provider's autonomy process and the constraint parameters α and ω is as follows:

1. The degree of privacy protection does not really depend on the size of the equivalence classes for the QI attributes. Instead, it is determined by the number and distribution of the distinct sensitive values associated with each equivalence class. Under the background knowledge [26] that the attacker knows which IDs appear in the data table, the frequency of a sensitive value in an equivalence class equals the probability that the attacker correctly guesses the correlation between the ID and the sensitive value. Hence, the frequency of the sensitive values is related to the security level of privacy: the greater the frequency, the lower the security, and conversely the higher the security. In this paper, a relatively low threshold α is set for a high safety coefficient, with the constraint value α used as an upper bound on the frequency. That is, the safety coefficient is inversely related to the sensitive attribute threshold α, which is given by α = 1 − sf.

2. When the data is provided, the provider has often already processed it, so the data has a certain degree of generalization before the anonymization process. For example, an individual may use the range 20~30 to replace the true age 25 when providing information, or may leave the gender attribute blank. Such cases exist in many practical scenarios, and the processed information is a form of data with individual privacy constraints. This kind of pre-processing achieves privacy autonomy and protection on the quasi-identifier. The generalized degree of the initial attribute value is regarded as the lowest generalized degree of that attribute value; here, the QI attribute value that has already been processed is called the initial value. Numerically, the lowest generalized degree equals the information loss of the initial data, that is, ω_A(x) = NCP_A(x_0), where x_0 represents the initial value of the attribute value x on the attribute A. Every QI value has a corresponding ω value to express the personal privacy preference. In the anonymization process, the generalized degree of every attribute value must not be less than its lowest generalized degree, namely, for the applied generalization f_A^g(x), the degree g must satisfy g ≥ ω_A(x). A small sketch of how sf and the pre-generalized QI values translate into α and ω is given after this list.
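As a small illustration of the setup above (a hypothetical record layout, not the paper's data format), the following sketch shows how a provider's safety factor sf and a pre-generalized QI value translate into the constraints α and ω:

```python
# Hypothetical provider record: the QI values may already be pre-generalized by
# the provider, and the safety factor sf is chosen by the provider (Definition 9).
record = {
    "sf": 0.7,              # provider's safety expectation
    "age": (20, 30),        # pre-generalized range replacing the true age
    "sex": None,            # left blank, i.e., fully hidden by the provider
    "zipcode": "23453*",    # partially hidden zip code
}

def alpha_from_sf(sf):
    """Sensitive-attribute constraint from Section 3.1, point 1: alpha = 1 - sf."""
    return 1.0 - sf

def omega_numeric(value_range, attr_interval):
    """Lowest generalized degree of a pre-generalized numeric value, omega = NCP of
    the initial value; the range width is used as Interval(x), which is an assumption."""
    return (value_range[1] - value_range[0]) / attr_interval

print(alpha_from_sf(record["sf"]))        # 0.3
print(omega_numeric(record["age"], 60))   # ~0.17: any published generalization of this
                                          # age must have a degree of at least this value
```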
In the above procedure, all constraints can be represented with precise values. Moreover, to achieve full privacy autonomy with less information loss, we do not need to set a value k in the (α, ω)-anonymity model. Because there is no limit on k, the equivalence classes can be of any size and are not restricted to the range [k, 2k]. In the anonymization process, the size of each equivalence class can be determined according to the constraints given by the providers, which makes the process more flexible. To construct the individualized (α, ω)-anonymity model, we introduce an algorithm that combines the idea of granulation with top-down local recoding. The algorithm not only satisfies full privacy autonomy but also improves the efficiency of the anonymization process and reduces information loss compared with other personalized models. The details of the main algorithm are given in Section 3.2 and Section 3.3.

3.2. Main algorithm
The main algorithm in this paper consists of granular computing and top-down local recoding.
Granular computing pre-processes the original data table according to the tuples' safety factors and classifies the tuples into different granular spaces. Granular computing can reduce information loss, as described below, and can improve the efficiency of the model, as illustrated in the experiments. After granulation, the anonymized table is obtained with top-down local recoding. Finally, we describe the kernel function that implements the specialization with a greedy algorithm; the function chooses the scheme that produces less information loss.

3.2.1. Granular computing

Definition 12 (Granulation): Based on a certain standard or criterion, the domain U is divided into a number of subclasses; this process is called the granulation of the domain U. Each subclass derived from the granulation is called a granular space, and V_i represents the i-th granular space.

In this paper, granular computing uses the criterion that tuples with the same safety factor are placed in the same subclass, and performs granulation on the whole domain U. Table 1 is the original data table, and Table 2 shows the data table after granulation. Many anonymity models apply a common method and the same protection to all tuples, but this gives part of the information too much protection while another part does not get enough. The idea of granulation is to classify the tuples into the same
granular space and to apply differentiated protection to different granular spaces, so that each tuple can be protected at its desired safety value. The two extreme cases are processed by directly hiding the space whose safety factor equals 1 and by leaving the space whose safety factor equals 0 unprocessed, which simplifies the quantity of data. At the same time, granulating the original table greatly reduces information loss. The examples in Table 3 ~ Table 5 show these advantages. Table 3 shows the original data and Table 4 shows the result of global generalization performed on the entire data. The frequency of Cancer in Table 4 is 1/3, which is greater than 0.3, so the corresponding tuples need to be hidden (or deleted) to meet the requirement. However, if the original data is granulated first, the generalized data table does not need to delete any tuple and incurs less information loss, as shown in Table 5. Since the constraint value α of the sensitive attributes is set equal to (1 − sf) in this paper, it is necessary to ensure that the frequency of the sensitive values in each granular space is not greater than (1 − sf) before the anonymization process; otherwise, the subsequent processing is meaningless. For example, α equals 0.2 in the granular space V2 of Table 2, while the frequency of Aids and of Cancer is 0.5, greater than 0.2, so the granular space needs further processing. In order to satisfy the security requirements while keeping the sensitive value frequency of the granular space no greater than α, the method adopted in this paper is to change the value α of the granular space that does not meet the requirements. If the sensitive values are generalized to the top layer, the security of the sensitive attribute is improved, so that the value sf of the granular space becomes smaller, which means the constraint value α of the granular space becomes larger. For example, the sensitive values Aids and Cancer in granular space V2 can be generalized to Serious illness in Figure 2; then the probability that the attacker correctly guesses the tuple's sensitive value among Aids and Cancer is reduced by 0.5, and the security of granular space V2 is increased by 0.5. Finally, the safety value of V2 is changed to 0.3, and V2 is merged into the space with safety value 0.3. When the merged granular space still does not satisfy its sensitive attribute constraint, the merger is cancelled and the space V2 is hidden. The specific steps of the granulation algorithm are shown in Algorithm 1. In the granulation algorithm, size(V) is the number of granular spaces, freq(V_i) represents the frequency of the sensitive attribute values in the granular space V_i, and S is used to save the merging of granular
spaces. Granulating the original data table generates the table T′, which is then used in the anonymization process proposed in this paper. The anonymization method for the table T′ is shown in the next subsection.
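Since Algorithm 1 is not reproduced here, the following is a simplified sketch of the granulation step described above (records are assumed to be dictionaries with 'sf' and 'disease' fields; spaces that violate α = 1 − sf are only flagged for the generalization/merging treatment rather than handled in full):

```python
from collections import Counter, defaultdict

def granulate(table, sf_attr="sf"):
    """Definition 12: group the tuples by safety factor into granular spaces."""
    spaces = defaultdict(list)
    for rec in table:
        spaces[rec[sf_attr]].append(rec)
    return spaces

def max_sensitive_freq(space, sensitive_attr="disease"):
    counts = Counter(rec[sensitive_attr] for rec in space)
    return max(counts.values()) / len(space)

def preprocess_spaces(spaces, sensitive_attr="disease"):
    """Hide sf = 1 spaces, release sf = 0 spaces untouched, and check alpha = 1 - sf
    for the remaining spaces before the anonymization process."""
    hidden, released, ready, to_adjust = [], [], [], []
    for sf, space in spaces.items():
        if sf == 1.0:
            hidden.append(space)        # absolute privacy: hide directly
        elif sf == 0.0:
            released.append(space)      # no privacy requirement: publish as-is
        elif max_sensitive_freq(space, sensitive_attr) <= 1.0 - sf:
            ready.append((1.0 - sf, space))   # ready for top-down local recoding
        else:
            to_adjust.append(space)     # needs sensitive-value generalization / merging
    return hidden, released, ready, to_adjust
```

A space that ends up in to_adjust corresponds to V2 in the running example: its sensitive values would be generalized to the top layer, its safety value lowered accordingly, and the space merged into the matching granular space (or hidden if the merge still violates the constraint).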
3.2.2. Top-down local recoding algorithm

Definition 13 (Basic information loss at the lowest generalized degree): The information loss based on the lowest generalized degree is denoted by ILD. The basic information loss at the lowest generalized degree of the attribute value x on the attribute A is ILD_A(x) = ω_A(x). It is extended to a tuple t and to the entire table T as ILD(t) = ω(t) = Σ_{i=1}^{n} NCP_{A_i}(f_{A_i}^{ω}(t[A_i^0])) and ILD(T) = ω(T) = Σ_{t ∈ T} ILD(t), where A_i^0 represents the original data of the attribute value A_i.

Because of the QI constraints, the tuples must meet their minimum constraint requirements during generalization, and the algorithm should give priority to
the generalization of the QI values with strong constraints, in order to minimize the information loss. A top-down heuristic algorithm is proposed in this paper: it first generalizes all attributes of the tuples to their highest level, so that all tuples are stored in a single equivalence class; then, starting from this equivalence class, it gradually specializes downwards and generates new equivalence classes. In the process of specialization, the equivalence classes with a higher generalization level of QI are formed first, so the tuples with strong privacy constraints are preferentially partitioned into new equivalence classes. The equivalence classes must not only respect the lowest generalized degrees but also satisfy the constraint α of the granular space. Algorithm 2 is the basic framework.
In the algorithm, the tuples of a granular space are first placed into one equivalence class EC, and then the equivalence class EC is specialized on different QI attributes; this operation is step (6) of Algorithm 2. The function Specialize is the key step of the algorithm and determines how to specialize; it is described in the next section. The aim of the algorithm is to select the scheme with minimum information loss and save it in the set S. The specific process is shown in steps (7) and (8), where
NCP(EC_sub) = Σ_{EC' ∈ EC_sub} NCP(EC').
If the set S is still empty after iterating over the n QI attributes, the equivalence class EC cannot be specialized further. The algorithm terminates when no equivalence class can be specialized, and the equivalence classes that cannot be specialized are stored in AT_i. The (α, ω)-anonymity model presented in this paper, which combines the above two algorithms, can be summarized as follows: first, the whole table T is granulated, as shown in Algorithm 1; second, each granular space is anonymized separately as in Algorithm 2, yielding the anonymized AT_i; third, all AT_i are merged into the anonymous table
AT, which can be released; the table AT satisfies the individualized privacy autonomy.
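The overall flow just described can be sketched roughly as follows (a simplified outline rather than Algorithm 2 itself; granulate refers to the granulation sketch above, while ncp_of_class and specialize are placeholders for the NCP computation of Definition 4 and the Specialize function of Section 3.3):

```python
def anonymize_space(space, qi_attrs, alpha, specialize, ncp_of_class):
    """Top-down local recoding of one granular space: start with all tuples in a
    single fully generalized equivalence class and keep splitting while possible."""
    finished = []            # equivalence classes that cannot be specialized further
    work = [space]
    while work:
        ec = work.pop()
        best = None
        for attr in qi_attrs:                      # iterate over the n QI attributes
            sub = specialize(ec, attr, alpha)      # list of sub-classes, or None
            if sub is not None:
                cost = sum(ncp_of_class(c) for c in sub)
                if best is None or cost < best[0]:
                    best = (cost, sub)             # keep the scheme with minimum NCP
        if best is None:
            finished.append(ec)                    # no attribute admits a valid split
        else:
            work.extend(best[1])
    return finished

def anonymize_table(table, qi_attrs, specialize, ncp_of_class):
    """Granulate, anonymize each granular space with alpha = 1 - sf, then merge."""
    released = []
    for sf, space in granulate(table).items():
        if sf == 1.0:
            continue                               # this space is hidden entirely
        released.extend(anonymize_space(space, qi_attrs, 1.0 - sf,
                                        specialize, ncp_of_class))
    return released
```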
3.3. Algorithm details

Definition 14 (Information loss gain (IG)): The information loss of an anonymity model consists of two parts
: one part comes from the information loss at the lowest generalized degree, and the other part is the information loss introduced by the anonymization processing. The information loss gain is the difference between these two parts; when a tuple t is generalized to t′, the information loss gain is IG(t) = NCP(t′) − ILD(t).

This section mainly describes the function Specialize and its operation. The function determines how an equivalence class EC is further specialized on the attribute A_i to generate EC_sub; t_EC represents the tuples contained in EC. Before running the function Specialize, the following are given: the generalization hierarchy of the attribute A_i, the lowest generalized degree of each tuple on the attribute A_i, and the sensitive attribute constraint α_i of the equivalence class EC. Whether the equivalence class EC can be specialized further is judged from t_EC[A_i]. If t_EC[A_i] is a leaf node of the generalization hierarchy of A_i, or the tuples in t_EC are already at their minimum generalized degree, then obviously the equivalence class EC cannot be specialized further on the attribute A_i. Otherwise, t_EC[A_i] must have next-layer nodes in the generalization hierarchy of A_i, as shown in Figure 4. The node_0 stores the current value of t_EC[A_i], and node_1, node_2, node_3, ⋯, node_i indicate the sub-nodes over which the remaining tuples are divided by further specialization.
After the division of the nodes, several cases have to be considered.

Case 1: If all the sub-nodes are empty, that is, node_1, node_2, node_3, ⋯, node_i do not contain any tuples, then all tuples stay in node_0 because of the privacy constraint of the lowest generalized degree, so no further specialization is possible and the function Specialize returns ∅. Otherwise, we continue with the second case.

Case 2: There is at least one non-empty sub-node. We judge whether each non-empty node can meet the constraint, namely whether the maximum frequency of the sensitive attribute values in the node is greater than α_i. If the frequency of a non-empty node is greater than α_i, all tuples in that sub-node are returned to node_0. After all non-empty sub-nodes have been judged, we consider the following case.

Case 3: If, after the second step, no non-empty sub-node remains, then, similarly to the first case, the function Specialize returns ∅. Otherwise, the remaining non-empty sub-nodes are investigated further. Assume that j nodes, node_1, node_2, node_3, ⋯, node_j, remain and meet the constraint α_i.
1. If node_0 is empty, it means that no tuple stays in node_0 and all tuples in node_1, node_2, node_3, ⋯, node_j satisfy the privacy constraints, so the equivalence class EC can be specialized on the attribute A_i into EC_sub = {EC_1, EC_2, EC_3, ⋯, EC_j}. In this specialization, we only need to replace the current value on A_i with the values of its child nodes, while the other QI values remain unchanged.
2. If node_0 is non-empty but satisfies the constraint α_i, then the equivalence class can be specialized and its division is EC_sub = {EC_0, EC_1, EC_2, EC_3, ⋯, EC_j}. In this specialization, the attribute value of the equivalence class EC_0 remains unchanged, while the remaining values on A_i are replaced with the values of the corresponding child nodes and the other attribute values stay the same.
3. If node_0 is not empty and does not satisfy the constraint α_i, part of node_1, node_2, node_3, ⋯, node_j must be returned to node_0 so that node_0 satisfies the constraint α_i. The principle for rolling back nodes is to select the scheme with the minimum information loss. The following two strategies are used:

Strategy 1: Compute the information loss gain of each tuple in each node, then choose
the tuple with the minimum information loss gain and return it to node_0. If, after the tuple returns to node_0, the maximum frequency of the sensitive values in node_0 decreases and the child node from which the tuple came still meets the constraint α_i, then the rollback is feasible. Otherwise, the tuple is not rolled back, and we examine the tuple with the next smallest information loss gain. The strategy terminates when node_0 satisfies the constraint α_i or no rollback tuple meeting the requirements can be found.

Strategy 2: If node_0 still cannot satisfy the constraint α_i after Strategy 1, the following operation is performed. Each sub-node is taken as a whole and the sum of the information loss gains of its tuples is calculated; the sub-node with the smallest total information loss gain is rolled back at each step. If this reduces the maximum frequency of the sensitive values in node_0, the rollback is feasible; otherwise the sub-node is not returned. The strategy terminates when node_0 satisfies the constraint α_i. Since node_0 must satisfy the constraint α_i when all nodes are returned to it, the result after Strategy 2 will always meet the requirements of the specialization. After Strategies 1 and 2 have been applied, the attribute values of the tuples in node_0 remain unchanged, while the other tuples replace their current value on the attribute A_i with the value of their sub-node.

The result obtained after Strategies 1 and 2 is not necessarily the scheme with the minimum information loss. Ideally, the best procedure would enumerate all single-tuple and multi-tuple rollback operations that meet the requirements and find the minimum information loss scheme by comparison. However, this greatly increases the amount of computation on large data and cannot be completed in a reasonable time. Strategies 1 and 2 find a scheme with low information loss quickly, so they are a reasonable and practical choice.
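A compact sketch of the Strategy 1 rollback loop reads as follows (assuming node_0 and the sub-nodes are lists of records carrying a 'disease' value and a precomputed information loss gain 'ig'; it follows the description above rather than the authors' implementation):

```python
from collections import Counter

def max_freq(records, sensitive_attr="disease"):
    """Maximum relative frequency of any sensitive value among the records."""
    if not records:
        return 0.0
    counts = Counter(rec[sensitive_attr] for rec in records)
    return max(counts.values()) / len(records)

def strategy_one(node0, sub_nodes, alpha, sensitive_attr="disease"):
    """Strategy 1: roll back single tuples with the smallest information loss gain
    until node0 meets alpha or no feasible rollback remains."""
    candidates = [(rec["ig"], idx, rec)
                  for idx, node in enumerate(sub_nodes) for rec in node]
    candidates.sort(key=lambda item: item[0])      # smallest IG(t) first
    for _, idx, rec in candidates:
        if max_freq(node0, sensitive_attr) <= alpha:
            break                                  # node0 already satisfies alpha_i
        node = sub_nodes[idx]
        trial_node0 = node0 + [rec]
        trial_node = [r for r in node if r is not rec]
        # Feasible only if node0's maximum frequency decreases and the source
        # sub-node still satisfies alpha after losing the tuple.
        if (max_freq(trial_node0, sensitive_attr) < max_freq(node0, sensitive_attr)
                and (not trial_node or max_freq(trial_node, sensitive_attr) <= alpha)):
            node0, sub_nodes[idx] = trial_node0, trial_node
    return node0, sub_nodes, max_freq(node0, sensitive_attr) <= alpha
```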
4. Experimental analysis

The real data set selected for the experiments is part of the Adult census data set
from the UC Irvine machine learning repository; the QI values may be complete or incomplete. At the same time, the constraints on the sensitive attributes are set randomly, to simulate constraints set by the providers themselves. We select 40000 tuples as experimental data; the information on the QI and sensitive attributes is shown in Table 6.
4.1. Some properties of the individualized (α, ω)-anonymity model
The individualized (α, ω)-anonymity model proposed in this paper can satisfy full privacy autonomy. The ω of the QI attributes and the α of S influence the information loss and the running time. However, in practice the possible forms of ω are too complex to simulate, so we only analyze the effect of α on performance. Since α equals (1 − sf), the analysis of α can be replaced by an analysis of sf. We analyze the relationship between different sf distributions and information loss, and which kind of distribution the individualized (α, ω)-anonymity model handles best. In the simulation, the distribution of sf is divided into five situations: left interval, middle interval, right interval, both-ends interval and uniform distribution. These describe the area where the sf values are concentrated, namely [0.8 ~ 1.0], [0.3 ~ 0.7], [0 ~ 0.2], [0 ~ 0.2] together with [0.8 ~ 1.0], and [0 ~ 1.0], respectively. The scale of the datasets ranges from 5K to 40K. Figure 5 shows that the more small sf values a distribution contains, the less the information loss; the running times of the five situations are shown in Figure 6, where the uniform distribution has the best efficiency, which means the model is well suited to datasets whose sf values are uniformly distributed.

4.2. The necessity of full privacy autonomy
Most individualized (α, k)-anonymity models do not consider the privacy preference of
the provider for the QI attribute information, or they use a rough value to describe one's privacy preference. Is full privacy autonomy really necessary in privacy preservation? In the following experiment we simulate the three individualized (α, k)-anonymity models proposed in [18-20], none of which offers full privacy autonomy. The first model, proposed in [18], is the individualized (α, k)-anonymity model; to facilitate comparison, it does not generalize the sensitive attributes here. The second model, proposed in [19], is the complete (α, k)-anonymity model. The model in [20] is referred to as the personalized three-value (α, k)-anonymity model, because it uses only the three values -1, 1 and 0 to represent the individual expectations. In this experiment, 40% of the QI information carries privacy security requirements. From Figure 7, we can see that a certain proportion of individual requirements is still not satisfied after the anonymization process of these models, whereas all individuals' requirements are satisfied in the proposed model. Thus, in order to ensure the privacy security of all individuals, full privacy autonomy is necessary in individual privacy preservation.
4.3. Information loss and efficiency of the individualized (α, ω)-anonymity model

The individualized (α, ω)-anonymity model uses granular computing and the top-down local
recoding algorithm. In order to analyze the information loss and efficiency of the proposed model, we conducted three experiments involving five models: three traditional individualized models and two individualized (α, ω)-anonymity models, one of which does not use granular computing. In the first simulation, the anonymization process of the five models does not consider the privacy requirements on the QI attributes; that is, the experiment calculates the information loss of the five models without full privacy autonomy, and the result is shown in Figure 8. In the other two experiments, all five models satisfy full privacy autonomy: one experiment measures the information loss of the five models and the other measures their efficiency, with results shown in Figure 9 and Figure 10. The scale of the datasets again ranges from 5K to 40K, and the other conditions of the five models are kept consistent. In Figures 8 and 9, the information loss of the proposed model is less than that of the individualized (α, ω)-anonymity model without granular computing, which shows that granular computing reduces information loss both with and without full privacy autonomy. The proposed model also has the least information loss compared with the traditional individualized models in Figures 8 and 9, which shows that the proposed individualized (α, ω)-anonymity model reduces information loss in the anonymization process. Figure 10 shows the running times of the five models at different dataset scales. The efficiency of the proposed individualized (α, ω)-anonymity model is the highest, and the (α, ω)-anonymity model without granulation is also more efficient than the other individualized models. We can therefore conclude that both granular computing and the removal of the constraint on k increase the flexibility of the anonymization process. The proposed model satisfies full privacy autonomy and achieves all individuals' requirements, so it is practical for individual privacy preservation; the low information loss makes it suitable for data publishing, and the high efficiency shows that it is feasible and acceptable.

5. Conclusions
This paper has presented an individualized (α, ω)-anonymity model which can satisfy full
privacy autonomy, so that individuals can be completely free to set their own privacy preferences for all of their information. In our model, all the parameters of the privacy constraints are set by the providers, and the model's ultimate goal is to achieve the providers' security requirements. This paper effectively combines granular computing with top-down local recoding, and the experimental analysis has shown that our approach reduces the rate of information loss and improves efficiency; at the same time, the algorithm is superior in stability and practicability to the algorithms proposed for other anonymity models. In the future, multi-dimensional [27, 28] and full-domain [29] generalization paths may be introduced into the individualized (α, ω)-anonymity model to obtain a more optimal algorithm. Using better metrics to evaluate the information loss of the algorithm [30, 31] will also improve the utility. With the rise of the information age, online anonymity [32] may become part of the constitutional right to privacy [33], and privacy transactions [34, 35] are becoming more popular, which means privacy autonomy will be very important in the future. If information loss can be precisely calculated, the (α, ω)-anonymity model with full privacy autonomy will become even more suitable for privacy transactions, other online applications [36, 37], and various types of data publishing.

Acknowledgment

This work was supported in part by the National Key Research and Development Program of China under Grant 2016YFB0800601, in part by the National Nature Science Foundation of China under Grant 61472331, in part by the Fundamental Research Funds for the Central Universities under Grant XDJK2015C078, and in part by the Talents of Science and Technology Promote Plan, Chongqing Science & Technology Commission.

References
[1] P. Samarati, Protecting Respondents' Identities in Microdata Release, IEEE Transactions on Knowledge & Data Engineering 13.6 (2001) 1010-1027.
[2] L. Sweeney, k-anonymity: A model for protecting privacy, Int. J. Uncertainty, Fuzziness Knowl.-Based Syst. 10.5 (2002) 557-570.
[3] K. E. Emam and F. K. Dankar, Protecting Privacy Using k-Anonymity, Journal of the American Medical Informatics Association 15.5 (2008) 627-637.
[4] M. Ghasemzadeh, B. C. M. Fung, R. Chen, and A. Awasthi, Anonymizing trajectory data for passenger flow analysis, Transp. Res. C, Emerg. Technol. 39 (2014) 63-79.
[5] R. Chen, B. C. M. Fung, N. Mohammed, B. C. Desai, and K. Wang, Privacy-preserving trajectory data publishing by local suppression, Inf. Sci. 231 (2013) 83-97.
[6] G. Poulis, S. Skiadopoulos, G. Loukides, and A. Gkoulalas-Divanis, Distance-based (k, m)-anonymization of trajectory data, in Proc. IEEE 14th Int. Conf. Mobile Data Manage. (MDM), 2 (2013) 57-62.
[7] A. Friedman, W. Ran, and A. Schuster, Providing k-anonymity in data mining, VLDB Journal 17.4 (2008) 789-804.
[8] X. Sun, H. Wang, J. Li, and T. M. Truta, Enhanced P-Sensitive K-Anonymity Models for Privacy Preserving Data Publishing, Transactions on Data Privacy 1.2 (2008) 53-66.
[9] H. Wang, J. Han, J. Wang, L. Wang, (k, ε)-Anonymity: An anonymity model for thwarting similarity attack, 2013 IEEE International Conference on Granular Computing (GrC), IEEE Computer Society, (2013) 332-337.
[10] X. Huang, J. Liu, Z. Han, and J. Yang, A New Anonymity Model for Privacy-Preserving Data Publishing, China Communications 11.9 (2014) 47-59.
[11] M. E. Nergiz, M. Z. k, Hybrid k-anonymity, Computers & Security 44.2 (2014) 51-63.
[12] X. J. Lin, L. Sun, and H. Qu, Insecurity of an anonymous authentication for privacy-preserving IoT target-driven applications, Computers & Security 48.9 (2014) 142-149.
[13] A. Machanavajjhala, D. Kifer, and J. Gehrke, l-diversity: Privacy beyond k-anonymity, ICDE 1.1 (2006) 24.
[14] N. Li, T. Li, and S. Venkatasubramanian, t-Closeness: Privacy Beyond k-Anonymity and l-Diversity, Data Engineering, 2007, ICDE 2007, IEEE 23rd International Conference on, (2007) 106-115.
[15] R. Wong, J. Li, (α, k)-Anonymity: An Enhanced Anonymity Model for Privacy-Preserving Data Publishing, Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, 754-759.
[16] L. Xu, C. Jiang, J. Wang, J. Yuan and Y. Ren, Information Security in Big Data: Privacy and Data Mining, IEEE Access 2 (2014) 1149-1176.
[17] X. Xiao and Y. Tao, Personalized privacy preservation, in Proc. ACM SIGMOD Int. Conf. Manage. Data, (2006) 229-240.
[18] X. Ye, Y. Zhang and M. Liu, A Personalized (α, k)-Anonymity Model, Web-Age Information Management, 2008, WAIM '08, The Ninth International Conference on, Zhangjiajie, Hunan, (2008) 341-348.
[19] H. Jian-min, Y. Hui-qun, Y. Juan, and C. Ting-ting, A complete (α, k)-anonymity model for sensitive values individuation preservation, in Electronic Commerce and Security, 2008 International Symposium on, IEEE, (2008) 318-323.
[20] B. Wang and J. Yang, Personalized (α, k)-anonymity algorithm based on entropy classification, J. Comput. Inf. Syst. 8.1 (2012) 259-266.
[21] Y. Xu, X. Qin, Z. Yang, Y. Yang, and K. Li, A personalized k-anonymity privacy preserving method, J. Inf. Comput. Sci. 10.1 (2013) 139-155.
[22] K. Qing-Jiang, W. Xiao-Hao, and Z. Jun, The (p, α, k) anonymity model for privacy protection of personal information in the social networks, in Proc. 6th IEEE Joint Int. Inf. Technol. Artif. Intell. Conf. (ITAIC), 2 (2011) 420-423.
[23] L. Sweeney, Achieving k-anonymity privacy protection using generalization and suppression, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10.5 (2002) 571-588.
[24] J. Wei and C. Clifton, A secure distributed framework for achieving k-anonymity, VLDB Journal 15.4 (2006) 316-333.
[25] J. Xu, W. Wang, J. Pei, B. Shi, and W. C. Fu, Utility-based anonymization for privacy preservation with less information loss, ACM SIGKDD Explorations Newsletter 8.2 (2006) 21-30.
[26] A. Halevy, A. Rajaraman, and J. Ordille, Data integration: The teenage years, in Proc. 32nd Int. Conf. Very Large Data Bases (VLDB), (2006) 9-16.
[27] S. Kisilevich, L. Rokach, Y. Elovici, B. Shapira, Efficient Multidimensional Suppression for K-Anonymity, IEEE Transactions on Knowledge & Data Engineering 22.3 (2010) 334-347.
[28] K. LeFevre, D. J. DeWitt and R. Ramakrishnan, Mondrian Multidimensional K-Anonymity, Data Engineering, 2006, ICDE '06, Proceedings of the 22nd International Conference on, (2006) 25-25.
[29] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, Incognito: Efficient Full-Domain K-Anonymity, Proc. of SIGMOD (2005) 49-60.
[30] B. Kenig, T. Tassa, A practical approximation algorithm for optimal k-anonymity, Data Mining & Knowledge Discovery 25.1 (2012) 134-168.
[31] A. Gionis, T. Tassa, k-Anonymization with Minimal Loss of Information, IEEE Transactions on Knowledge & Data Engineering 21.2 (2009) 206-219.
[32] A. Arampatzis, P. S. Efraimidis, and G. Drosatos, A query scrambler for search privacy on the internet, Information Retrieval 16.6 (2013) 657-679.
[33] J. Kosseff, A New Legal Framework for Online Anonymity: California's Privacy-Based Approach, IEEE Security & Privacy Magazine 13.6 (2015) 66-70.
[34] N. Steinfeld, Trading with privacy: The price of personal information, Online Information Review 39.7 (2015) 923-938.
[35] K. Mcwana, Method and apparatus for mutual exchange of sensitive personal information between users of an introductory meeting website, US Patent Application US20090113006, 2009.
[36] Y. Li, Y. Li, Q. Yan, and R. H. Deng, Privacy leakage analysis in online social networks, Computers & Security 49 (2015) 239-254.
[37] Y. J. Maeng, A. Mohaisen, M. K. Lee, and D. H. Nyang, Transaction authentication using complementary colors, Computers & Security 48 (2015) 167-181.
Figure 1: (a) Domain generalization hierarchy; (b) Value generalization hierarchy.
Figure 2: Generalization hierarchy of the categorical attribute.
Figure 3: (a) Generalization hierarchy of age; (b) Generalization hierarchy of work-class.
Figure 4: A portion of the generalization hierarchy of the attribute Ai.
Figure 5: The information loss rate of the individualized (α, ω)-anonymity model under different distributions.
Figure 6: The time cost of the individualized (α, ω)-anonymity model under different distributions.
Figure 7: The rate of information that satisfies the individual expectations.
Figure 8: The information loss rate of different models on datasets without privacy requirements for QI attributes.
Figure 9: The information loss rate of different models that satisfy full privacy autonomy.
Figure 10: The information loss rate of different models.
Table 1: Original data.

ID  sex     age  Zipcode  Disease  safe
1   female  23   238901   Aids     0.8
2   male    12   23453*   Cancer   0.7
3   female  3*   45632*   Flu      0.4
4   **      34   4353**   Flu      0.3
5   female  55   564776   Cancer   0.8
6   male    **   657532   Aids     0.9
7   **      3*   576756   HIV      0.3
8   male    23   678***   Flu      0.4
Table 2: Data with the process of granulation.

ID  sex     age  Zipcode  Disease  safe  Granular space
6   male    **   657532   Aids     0.9   V1
1   female  23   238901   Aids     0.8   V2
5   female  55   564776   Cancer   0.8   V2
2   male    12   23453*   Cancer   0.7   V3
3   female  3*   45632*   Flu      0.4   V4
8   male    23   678***   Flu      0.4   V4
4   **      34   4353**   Flu      0.3   V5
7   **      3*   576756   HIV      0.3   V5
Table 3: Original data.

ID  Zipcode  Birth-date  Disease    α
t1  321800   1991/04/30  AIDS       0.6
t2  321841   1991/11/12  Flu        0.3
t3  321842   1989/12/23  Cancer     0.3
t4  321844   1991/05/07  Hepatitis  0.3
t5  321801   1989/04/21  Cancer     0.6
t6  321847   1991/06/24  HIV        0.3
Table 4: The globally generalized data.

ID  Zipcode  Birth-date  Disease    α
t1  3218**   19**/**/**  AIDS       0.6
t2  3218**   19**/**/**  Flu        0.3
t3  3218**   19**/**/**  Cancer     0.3
t4  3218**   19**/**/**  Hepatitis  0.3
t5  3218**   19**/**/**  Cancer     0.6
t6  3218**   19**/**/**  HIV        0.3
Table 5: Generalized data with granulation.

ID  Zipcode  Birth-date  Disease    α
t2  3218**   1991/**/**  Flu        0.3 (granular space V1)
t3  3218**   1991/**/**  Cancer     0.3 (granular space V1)
t4  3218**   1991/**/**  Hepatitis  0.3 (granular space V1)
t6  3218**   1991/**/**  HIV        0.3 (granular space V1)
t1  3218**   1989/04/**  AIDS       0.6 (granular space V2)
t5  3218**   1989/04/**  Cancer     0.6 (granular space V2)
Table 6: The information of QI and S.

No.  Attribute                      Type         Number of leaves  Number of levels
1    Age                            Numeric      60                3
2    Work-class                     Categorical  8                 4
3    Race                           Categorical  3                 2
4    Sex                            Categorical  2                 2
5    Disease (sensitive attribute)  Categorical  6                 3