Accepted Manuscript
CDPS: A Cryptographic Data Publishing System
Tong Li, Zheli Liu, Jin Li, Chunfu Jia, Kuan-Ching Li
PII: S0022-0000(16)30131-3
DOI: http://dx.doi.org/10.1016/j.jcss.2016.12.004
Reference: YJCSS 3046
To appear in: Journal of Computer and System Sciences
Received date: 1 March 2016
Revised date: 15 December 2016
Accepted date: 16 December 2016
Please cite this article in press as: T. Li et al., CDPS: A Cryptographic Data Publishing System, J. Comput. Syst. Sci. (2016), http://dx.doi.org/10.1016/j.jcss.2016.12.004
Highlights
• In previously proposed data publishing methods, cryptographic methods are not applied to provide privacy protection, because they destroy data utilization and make the original data unreadable. Our proposed data publishing system combines cryptography with data publishing for the first time. It adopts format-preserving encryption to ensure that the ciphertext has the same format as the plaintext.
• Compared with traditional data publishing methods, our system achieves the goal of data integrity without removing sensitive attributes, and it does not influence the data utility at all.
• Our system meets the following data publishing requirements: (1) keep all structures and referential integrity; (2) preserve the original format of data; (3) be widely applicable to publishing text data, relational databases, and NoSQL databases.
CDPS: A Cryptographic Data Publishing System
Tong Li a, Zheli Liu b,∗, Jin Li c, Chunfu Jia d, Kuan-Ching Li e
a College of Computer and Control Engineering, Nankai University, P.R. China
b College of Computer and Control Engineering, Nankai University, P.R. China
c School of Computer Science and Educational Software, Guangzhou University, Guangzhou, 510006, P.R. China
d College of Computer and Control Engineering, Nankai University, P.R. China
e Department of Computer Science and Information Engineering, Providence University, Taiwan
Abstract
Traditional data publishing methods remove sensitive attributes and generate redundant records to achieve privacy protection. In the big data environment, the requirements for utilizing data (e.g., data mining) become increasingly varied, which is beyond the scope of the traditional methods. This paper provides a cryptographic data publishing system that preserves data integrity (i.e., the original data structure is preserved) and achieves anonymity without deleting any attribute or using redundancy. The security analysis shows that our system is secure under our proposed security model.
Key words: Data publishing, data privacy, big data, format-preserving encryption
Preprint submitted to Journal of Computer and System Sciences, 21 December 2016
1 Introduction
In the era of big data, the digital information of governments, corporations, and individuals has great potential for decision support. The popularity of smart mobile terminals [1,2] facilitates data collection and publishing for such decision support, and the demand is greater than ever before. Imagine that a government wants to make use of massive GPS data on taxi movement. By analyzing the data, the government can improve urban transportation planning to ease congestion, or build a recommender system for drivers or passengers. Mobile operators want to use cellular network signaling logs to discover paging black holes caused by interference or signal screening. (Paging black holes are areas without normal mobile service, in which connection failures are more likely to appear.) They can then optimize the network to enhance the quality of service.
Usually, massive data needs to be published via a publishing system. To publicly solicit the best scheme for decision support, the data has to be made available to all candidates. After a scheme is chosen, the data must be available to the cooperating institutions to test the quality of the scheme. Data publication is therefore an unavoidable requirement in this process.
Detailed person-specific data in its original form often contains sensitive information about individuals, and publishing such data immediately breaks individual privacy. For example, if mobile operators publish cellular network signaling logs directly, the logs can be used to analyze people's whereabouts, with which criminals may carry out coordinated attacks on people.
The requirements for modern data publishing are becoming stricter. There are two traditional methods to preserve users' privacy, namely generalization and clustering. However, they are not suitable for the big data environment. The requirements are as follows.
• Preserve data integrity. Data integrity [3–5] is a common requirement in storage and transmission, and most attributes of the data, especially the identifiers and sensitive attributes, should not be deleted. There are two reasons for this. First, the mining task is unknown to the publisher, and in some cases multiple mining tasks are presented. Under these circumstances, every attribute has the potential to be the key attribute for mining. Second, referential integrity needs to be ensured. To do this, the sensitive attributes that are primary keys or foreign keys should be kept. For example, the telephone number in the cellular network signaling logs of mobile operators is part of the users' private information. However, since it often works as the key that associates different tables, it should not be deleted. Otherwise, the mining tasks would be greatly hindered, because the relationships between tables in the database could not be figured out.
• Preserve statistical characteristics. No generalization is allowed. Generalization replaces the original values of an attribute with generalized values. As shown in Tables 1 and 2, “Job” and “Age” in Table 1 are generalized in Table 2. If the generalized values of “Job” are used in statistics for classification or clustering, the result would inevitably be affected.
• Preserve the properties of data. Keeping the properties prevents the serviceability of the data from being harmed. For example, a social security number, when published, would be replaced by another social security number. Moreover, the partial ordering relation of the attributes should not be broken during publishing.

Table 1 Original data stored in database
Job      Sex     Age  Disease
Teacher  Female  36   HIV
Teacher  Female  37   HIV
Lawyer   Female  39   Flu
Dancer   Male    32   HIV
Writer   Male    33   Hepatitis

Table 2 Published data by generalization
Job           Sex     Age      Disease
Professional  Female  [35-40)  HIV
Professional  Female  [35-40)  HIV
Professional  Female  [35-40)  Flu
Artist        Male    [30-35)  HIV
Artist        Male    [30-35)  Hepatitis

∗ Corresponding author. Email address: [email protected] (Zheli Liu).

Contribution. In this paper, we present a cryptographic data publishing system named “CDPS”, which combines cryptographic methods with data publishing for the first time. It adopts cryptographic methods to keep the given database structure and the format of different types of data, and it also allows developers to add special methods according to data publishing requirements. Compared with traditional data publishing methods, CDPS will not influence the data utility at all, and it can meet most data publishing requirements for research or testing. It has the following features: (1) it can
meet most data publishing requirements, supporting multiple data mining tasks; (2) it keeps all structures and referential integrity, and preserves the original format of data; (3) it can be widely used to publish text data, relational databases, and NoSQL databases.
Organization. The rest of this paper proceeds as follows. In Section 2, we review work related to FPE and privacy-preserving data publishing methods. In Section 3, we describe the requirements of a data publishing system and define the cryptographic data publishing problem. In Section 4, we present the system model and architecture of CDPS, describe its security model, and analyze its security. In Section 5, we describe the data publishing module of CDPS in detail, and propose two special data publishing algorithms to satisfy different data publishing requirements. In Section 6, we present the implementation details of CDPS and evaluate the efficiency of its data publishing algorithms. Finally, we draw conclusions and outline future work in Section 7.
2 Related Work
2.1 Data Publishing Systems
Several data anonymization techniques have been proposed for privacy-preserving data publishing. The most popular ones are generalization [6,7] for k-anonymity [7] and bucketization [8–10] for l-diversity. In both approaches, attributes are partitioned into three categories: 1) some attributes are identifiers that can uniquely identify an individual, such as Name or Social Security Number; 2) some attributes are Quasi Identifiers (QI), which the adversary may already know (possibly from other publicly available databases) and which, when taken together, can potentially identify an individual, e.g., Birthdate, Sex, and Zipcode; 3) some attributes are Sensitive Attributes (SAs), which are unknown to the adversary and are considered sensitive, such as Disease and Salary. In 2012, Li [11] presented a novel technique called slicing, which partitions the data both horizontally and vertically. Li [11] showed that slicing preserves better utility than generalization and is more effective than bucketization in workloads involving the sensitive attribute. Later, in 2013, Sattar [12] designed a method integrating sampling and generalization, and showed that it provides sound semantic protection of individuals in data as well as higher data utility. Recently, some data publishing methods or applications were proposed in [13–16].
None of these schemes can ensure that data publishing will not affect the mining results. Moreover, cryptographic methods are not applied to provide privacy protection, because they destroy data utilization and make the original data unreadable. Traditional encryption methods such as AES and 3DES output ciphertext whose length is determined by the block size, so the ciphertext can no longer be stored in the original data field.
2.2 Cryptographic technologies
Format-preserving encryption (FPE). Since format-preserving encryption was first proposed in 1981, there has been plenty of research on the subject [17–20]. In 2002, Black and Rogaway [17] provided a series of FPE methods for enciphering integers, and suggested that such ciphers can be used to construct FPE schemes on arbitrary domains. In 2009, Bellare et al. [18] defined the rank-then-encipher approach (RtE for short), and suggested that any FPE scheme can be constructed from an integer FPE by building a bijection between the target domain and an integer domain. This idea is powerful because it reduces all FPE problems to the integer FPE problem, and it has been broadly used as a basic construction method. Some previous FPE schemes work on the message space $X = \mathbb{Z}_n = \{0, 1, \ldots, n-1\}$ for any desired $n$. Such schemes include Feistel-based schemes like FFSEM [19]. Among these existing works, the FFX mode, proposed in 2010, has the best generality [20]. FFX specifically aims at encrypting strings over an arbitrary alphabet $\Sigma$ (which of course includes character data), and works on the message space $X = \Sigma^n$ for any desired string length $n$. To resolve the FPE problem of “varchar” data stored in databases, [21] proposes the MR-FPE scheme, which preserves both length and storage size. Recently, FPE has been applied to protect data with format requirements. For example, [21] proposes a database encryption mechanism and applies FPE to build a system for encrypting data stored in databases. Another example is [22], which uses FPE to encrypt the curve index in its order-preserving encryption scheme.
Order-preserving encryption (OPE). OPE is a common encryption scheme which ensures that the order of the plaintexts is preserved in the ciphertexts. It is useful in a data publishing system.
The ideal security goal for an OPE scheme, IND-OCPA [23], is to reveal no additional information about the plaintext values besides their order (which is the minimum requirement for the order-preserving property). Until now, the only ideal-security OPE scheme is the mutable order-preserving encoding (mOPE) scheme [24], proposed by Popa et al. in 2013, in which the ciphertexts reveal nothing except the order of the plaintext values. mOPE works by building, on the database side, a balanced search tree containing all plaintext values encrypted by the application. It requires the encryption protocol to be interactive, and a small number of ciphertexts of already-encrypted values must change as new plaintext values are encrypted (e.g., it is straightforward to update a few ciphertexts stored in a database). These database-side operations can be implemented by user-defined functions (UDFs). However, such a scheme is not suitable for data publishing. Some OPE schemes [22,25–28] have been proposed after 2013. Liu et al. [22] construct an efficient and programmable OPE scheme for outsourced databases, which uses message space expansion and nonlinear space split to hide data distribution and frequency. It would be useful for data publishing.
Key generation. During data publishing, the system should generate keys for encryption according to specified methods. Traditional key assignment methods manage access control [29–31] in storage systems, ensuring that only authorized entities can access certain resources. There is much existing literature [32,33] in this field. In this paper, encryption keys in our system are related only to files and columns, and are not involved in authorization. Thus, we adopt a key dispersion algorithm that combines AES and 3DES.
3 Problem Definition
In general, a data publishing system should be capable of dealing with both text data, which may contain multiple files with certain relations, and relational databases. Let $T(x)$ be the mapping between original data and published data.
Requirement 1 Preserve referential integrity. Let $R(U)$ be a relation in a relational database, where $U = \{a_1, a_2, \ldots, a_n\}$ is the set of its attributes, $n$ is the number of elements in $U$, and $a_i$ is the $i$-th attribute. If there exist two relations $R(U_1)$ and $R(U_2)$ with $a_i = a_j$ ($a_i$ is an element of $U_1$ and $a_j$ is an element of $U_2$), and after publication $a_i$ is still equal to $a_j$ and neither of them is omitted, then $Req1(R, T(R))$ is true.
Example: $U_1 = \{Id, name\}$ describes the information of students. $U_2 = \{Id, course\}$ describes the courses assigned to the students. The two Id attributes in $U_1$ and $U_2$ are both student Ids, and neither should be omitted before or after the encryption.
Requirement 2 Data should not lose its original meaning after publication. Let $V = \{v_1, v_2, \ldots, v_n\}$ be the value set of an attribute $X$. A value $v_i$ ($1 \le i \le n$) in $V$ before publication should be transformed into a value $v_j$ ($1 \le j \le n$) that is still in $V$. Moreover, let $V' = \{v'_1, v'_2, \ldots, v'_m\}$ be a subset of $V$, and let $f(x)$ be a Boolean value indicating whether $x$ has a certain property, such that $f(v_i) = f(v_j)$; that is, a value in $V'$ should be transformed into an element of $V'$. No value in $V'$ should fall outside $V'$ after publication, and vice versa. Then $Req2(R, T(R))$ is true, where $R$ is the relation that contains $X$.
Example: $X$ is a disease attribute with value set $V = \{HIV, HDV, SARS, Myopia, LiverCancer\}$, and $f(x)$ indicates whether the disease is infectious. Then $V' = \{HIV, HDV, SARS\}$.
Definition 1 Format. The format of an attribute $X$ is denoted $\langle n, \langle S_1, K_1 \rangle, \langle S_2, K_2 \rangle, \ldots, \langle S_n, K_n \rangle \rangle$, where $n$ is the maximum number of digits of the values of attribute $X$, $S_i$ is the value set for the $i$-th digit, and $K_i$, whose value is 1 or 0, represents the dispensability of that digit.
Example: Attribute $X$ is a cellphone number. Its format is $\langle 14, \langle \{+\}, 0 \rangle, \langle \{0, \ldots, 9\}, 0 \rangle, \ldots, \langle \{0, \ldots, 9\}, 0 \rangle \rangle$, with thirteen entries $\langle \{0, \ldots, 9\}, 0 \rangle$. It means that a cellphone number must begin with a plus sign followed by 13 decimal digits, none of which is dispensable.
Definition 2 Data publication system. $D = \{R_1, R_2, \ldots, R_n\}$ and $D' = \{R'_1, R'_2, \ldots, R'_n\}$ are the input and output of the system, respectively. Then both $Req1(R_i, T(R_i))$ and $Req2(R_i, T(R_i))$ ($1 \le i \le n$) should be true.
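The format of Definition 1 can be checked mechanically. The following is a minimal sketch, not part of the original system: the names `Position`, `matches_format`, and `PHONE` are hypothetical, and it assumes that $K_i = 1$ marks a position that may be absent, which is consistent with the cellphone example where every $K_i$ is 0.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

DIGITS = frozenset("0123456789")


@dataclass(frozen=True)
class Position:
    symbols: FrozenSet[str]  # S_i: symbols allowed at this position
    dispensable: bool        # K_i: True (1) if the position may be absent


def matches_format(value: str, fmt: List[Position]) -> bool:
    """Return True if `value` can be produced by the format `fmt`."""
    # `reachable` holds every prefix length of `value` that the positions
    # processed so far can account for.
    reachable = {0}
    for pos in fmt:
        nxt = set()
        for j in reachable:
            if pos.dispensable:
                nxt.add(j)                      # skip this position
            if j < len(value) and value[j] in pos.symbols:
                nxt.add(j + 1)                  # consume one symbol
        reachable = nxt
    return len(value) in reachable


# Cellphone format from the example: '+' followed by 13 mandatory digits.
PHONE = [Position(frozenset("+"), False)] + [Position(DIGITS, False)] * 13

print(matches_format("+8613012345678", PHONE))  # True
print(matches_format("+861301234567", PHONE))   # False: one digit short
```

A publishing algorithm that preserves the format should map every value accepted by `matches_format` to another accepted value.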
4 System Architecture
Usually, there are two main data types in a publishing system: text data and database data. In our system, both text data and database data are processed as text files, since database data can be converted into text form. Moreover, our system can process unstructured data in a similar way.
4.1 System Model
A typical scenario for cryptographic data publishing is shown in Fig. 1. First, the data publisher collects data from various sources. Then, the data publisher releases the collected data to data recipients for data mining or public research.
[Figure labels: data source, data collection, data publisher, data publishing, CDPS, data recipient]
Fig. 1. Model of cryptographic data publishing system
A CDPS involves two different participant roles. One is the data publisher, who has valid data collected from, for example, log files, the Internet, GPS tracks, and databases. A publisher can be a government agency, a hospital, or an enterprise that has its own data sources and expects to contribute data appropriately for valuable mining tasks. The other is the data recipient, who wants to utilize the published data. The recipient is typically a research institution or a researcher. In the CDPS, we assume that the data publisher is fully honest. However, the data recipient is not trusted, since it could be curious about the sensitive information in the published data.
A real-world example is a hospital that collects data from patients and publishes the health records to a public medical center. Here, the hospital plays the role of the data publisher, the patients' records are the collected data, and the medical center is the data recipient. The operations conducted by the medical center can be specific data mining algorithms, such as building a decision tree or cluster analysis.
4.2 CDPS Architecture
We summarize the tasks conducted via our data publishing system as follows. First, the raw data (database data or unstructured data) is converted into a text file. Then, the data in text-file form is processed directly. Finally, the data publisher releases the privacy-preserving data to the data recipient. The architecture of CDPS is shown in Fig. 2. It consists of four stages: data pre-processing, specified rules, data publishing, and data post-processing. In particular, two tools are used in the data publishing stage: key dispersion and the publishing algorithm.
[Figure labels: data pre-processing, specified rules, data publishing (key dispersion, publishing algorithm), data post-processing]
Fig. 2. CDPS architecture
• Data Pre-processing. As mentioned above, in order to publish data from various data sources, a CDPS has to convert the input raw data into text form, which is the goal of the data pre-processing stage. In particular, for database data, data from the same table is imported into the same text file.
• Specified Rules. The specified rules stage makes the rules for publishing. A specified rule has the form (< column, algorithmID, specialrule >, ..., < column, algorithmID, specialrule >). Note that the data in each column has a corresponding algorithm identified by the algorithm ID. Similarly, each column has its own unique key, and the specialrule is empty unless a special rule is adopted. The pre-processing of the input raw data must depend on the specified rules.
• Data Publishing. The data publishing stage is the core of the CDPS, where two cryptographic tools are implemented. 1) Key Dispersion. Given the difficulty of large-scale key management, the CDPS assigns only one master key, from which all encryption keys are dispersed. This mechanism effectively facilitates key management and improves the security of the system. 2) Publishing Algorithms. The publishing algorithm is the most important component of CDPS. An algorithm ID in a rule is associated with one algorithm in the algorithm set. A publishing algorithm can be derived from order-preserving encryption (OPE), format-preserving encryption (FPE), or another feasible encryption method that meets the publishing requirements. For data publishing, the CDPS first disperses a key from the file name and the specified column, and then calls the publishing algorithm with the given algorithm ID to finish the encryption.
• Data Post-processing. The data post-processing stage further processes the ciphertexts generated in the previous stages and exports the final published data into the appropriate database.
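The interplay of specified rules, key dispersion, and algorithm lookup described in the stages above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of CDPS: `Rule`, `publish`, `ALGORITHMS`, the algorithm ID "toy", and the insecure placeholder cipher `toy_digit_shift` are all hypothetical, and the key derivation is passed in as a callable so that any key dispersion scheme can be plugged in.

```python
from typing import Callable, Dict, NamedTuple, Optional


class Rule(NamedTuple):
    column: str
    algorithm_id: str
    special_rule: Optional[str] = None  # empty unless a special rule applies


def toy_digit_shift(key: bytes, value: str, special_rule: Optional[str]) -> str:
    """Placeholder cipher (NOT secure): shifts each digit by a key-dependent offset."""
    shift = key[0] % 9 + 1
    return "".join(
        str((int(c) + shift) % 10) if c in "0123456789" else c for c in value
    )


# Hypothetical algorithm set; the ID "toy" is not one of the paper's codes.
ALGORITHMS: Dict[str, Callable[[bytes, str, Optional[str]], str]] = {
    "toy": toy_digit_shift,
}


def publish(filename, rows, rules, derive_key):
    """Apply each rule to its column; columns without a rule pass through."""
    published = []
    for row in rows:
        out = dict(row)
        for rule in rules:
            key = derive_key(filename, rule.column)   # one key per file and column
            encrypt = ALGORITHMS[rule.algorithm_id]   # algorithm chosen by its ID
            out[rule.column] = encrypt(key, row[rule.column], rule.special_rule)
        published.append(out)
    return published


rows = [{"phone": "+8613012345678", "disease": "Flu"}]
rules = [Rule("phone", "toy")]
derive = lambda f, c: (f + "|" + c).encode()  # stand-in for key dispersion
result = publish("signaling_log.txt", rows, rules, derive)
```

In this sketch the published phone number keeps its leading plus sign and its length, and the row structure is unchanged, which mirrors the requirement that the database structure and data format survive publishing.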
4.3 Threat Models
Unlike a cryptosystem that demands decryption, a data publishing system only needs to ensure the usability of the published data (e.g., its statistical properties). An attacker has access to the plaintext only via the published ciphertext. Thus, we assume that the publishing procedure is one-way: an attacker can only capture the ciphertext and guess the values and characteristics of the raw data. Therefore, the most reasonable attack in this scenario is the ciphertext-only attack.
Security Analysis. Under the ciphertext-only attack model, the attacker tries to reveal the statistical distribution of characters in the plaintext. However, the publisher in the system is honest, so the attacker cannot obtain enough knowledge. Moreover, the publisher can arbitrarily choose and output a subset of the data it possesses, which determines the statistics. Since the core of CDPS is the publishing algorithm and all publishing algorithms we adopt in Section 5 are provably secure, the CDPS is also secure.
5 Data Publishing Algorithms
In this section, we introduce the data publishing algorithms that form the core of a CDPS. To meet the publishing requirements, the data publishing algorithms mainly include traditional symmetric encryption algorithms, OPE algorithms, and FPE algorithms.
5.1 Overview
CDPS handles the selection and maintenance of algorithms via the data publishing algorithms management (DPAM) module, which undertakes the following two tasks. (1) Key Dispersion. The DPAM generates a specified key dispersed from a publishing master key. (2) Algorithm Management. The DPAM also maintains an encryption algorithm set (EAS), and provides functions for algorithm management such as addition, updating, and deletion. Each algorithm in an EAS has two default parts: encryption and decryption. Although decryption is not necessary for a CDPS, the system still keeps it as a shortcut so that the data publisher can view the operation results if necessary.
5.2 Key Dispersion Algorithm
In CDPS, the master key is a one-time key randomly chosen by the data publisher for each publishing. The encryption key for each column in a file is dispersed from this master key using the key dispersion algorithm (KDA). The KDA takes a master key MK, an encryption method E, a file name filename, and a column name column as input, and outputs a key for this column. The KDA, shown in Algorithm 1, has the form KDA(MK, E, filename, column) → key.
Algorithm 1 Key Dispersion Algorithm
Input: MK, E, filename, column
  factor ← bitcode(filename) || bitcode(column)
  len ← bitlength(factor)
  if len ≤ 128 then
    factor ← factor >> (128 − len)
  else
    factor ← factor & (2^128 − 1)
  end if
  if E = 3DES then
    (MK_L || MK_R) ← MK
    key_L ← 3DES_KEY(factor_L)
    key_R ← 3DES_KEY(factor_R)
    key ← key_L || key_R
  else
    key ← AES_KEY(factor)
  end if
Output: key
Note that the keys used for encrypting a primary key column and the corresponding foreign key column must be the same, in order to ensure referential integrity. Therefore, when the KDA generates a key for the foreign key column, its input is the same as when it runs for the primary key column.
Security Analysis. The proposed KDA is constructed from 3DES and AES, which are provably secure. Furthermore, an attacker cannot obtain enough information about the dispersed keys to guess the master key in a ciphertext-only attack. So, this algorithm is secure.
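The dispersion idea, one master key and one derived key per (file, column) pair, can be illustrated with the Python standard library. The sketch below is not the KDA itself: it swaps in HMAC-SHA256 as the keyed function because the standard library has no AES or 3DES, and the names `factor_128` and `disperse_key` are hypothetical. It also assumes that inputs shorter than 128 bits are padded by a left shift, which is one reading of the shift step in Algorithm 1.

```python
import hashlib
import hmac


def factor_128(filename: str, column: str) -> bytes:
    """Normalize bitcode(filename) || bitcode(column) to exactly 128 bits."""
    raw = filename.encode("utf-8") + column.encode("utf-8")
    value = int.from_bytes(raw, "big")
    bits = len(raw) * 8
    if bits <= 128:
        value <<= 128 - bits          # assumption: pad short inputs with zero bits
    else:
        value &= (1 << 128) - 1       # keep the low 128 bits
    return value.to_bytes(16, "big")


def disperse_key(master_key: bytes, method: str, filename: str, column: str) -> bytes:
    """Derive a column key; HMAC-SHA256 is a stand-in for the AES/3DES step."""
    length = 24 if method == "3DES" else 16   # 3DES keys are 24 bytes, AES-128 keys 16
    message = method.encode("utf-8") + b"|" + factor_128(filename, column)
    return hmac.new(master_key, message, hashlib.sha256).digest()[:length]


master = b"one-time master key"
# Referential integrity: the foreign key column reuses the primary key's input.
pk_key = disperse_key(master, "AES", "students", "Id")
fk_key = disperse_key(master, "AES", "students", "Id")
name_key = disperse_key(master, "AES", "students", "name")
```

Because truncation keeps only the low 128 bits, two long (file, column) names that end in the same 16 bytes would receive the same key in this sketch; a deployment would need to rule that out or hash the full name instead.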
[Figure: tree with the layers root, type, sub type, and algorithm; node labels include Symmetric Cipher, OPE, FPE, Special Methods, String, Numeric, FFSEM integer, FFSEM decimal, FFX, BPS, FOPE, Popa, AES, DES]
Fig. 3. Tree structure of EAS
5.3 Algorithm management
For efficient management, the CDPS uses a tree structure to organize the EAS. As shown in Fig. 3, each node has at most 10 children, numbered from 0 to 9. The tree has four layers: the root is on the first layer, algorithm types are on the second, subtypes are on the third, and algorithms are on the fourth. An algorithm is encoded by its path. For example, the code of the FFSEM algorithm is “000”.
In the system, there are four algorithm types: FPE, OPE, symmetric cipher, and special methods. The FPE algorithms deal with known data types. The OPE in our system mainly refers to numeric OPE algorithms. The symmetric ciphers include AES, DES, and so on. In addition, we introduce the special methods in Section 5.4; these are special algorithms chosen according to publishing requirements.
General methods that can be used directly as publishing algorithms are shown in Table 3. As shown there, numerical algorithms are divided into two categories (integer and decimal). We adopt FFSEM [19] to implement the numeric FPE, where decimal fractions are processed using FFSEM decimal. For OPE, we only take numerical data into consideration, and adopt the scheme in [22]. Besides, for encrypting string values stored in a database, FFX [20] is applied to nchar and nvarchar, whose formats depend on their lengths, and MR-FPE [21] is applied to char and varchar, whose formats depend on both their lengths and storage sizes.
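A path-coded algorithm set of this kind can be sketched with nested dictionaries. The class `EAS` below and its method names are hypothetical; the codes “000”, “300”, and “301” are the ones stated in the text, while “010” and “011” are read off Table 3 (type FPE(0), sub type String(1), numbers 0 and 1).

```python
class EAS:
    """Encryption algorithm set stored as a tree; a code is the path of child numbers."""

    def __init__(self):
        self._root = {}

    @staticmethod
    def _check(code: str) -> None:
        if len(code) != 3 or any(c not in "0123456789" for c in code):
            raise ValueError("code must be three digits: type, sub type, algorithm")

    def add(self, code: str, algorithm) -> None:
        """Add an algorithm, or update the one already stored under `code`."""
        self._check(code)
        t, s, a = code
        self._root.setdefault(t, {}).setdefault(s, {})[a] = algorithm

    def lookup(self, code: str):
        self._check(code)
        t, s, a = code
        return self._root[t][s][a]  # KeyError if the path does not exist

    def delete(self, code: str) -> None:
        self._check(code)
        t, s, a = code
        del self._root[t][s][a]


eas = EAS()
eas.add("000", "FFSEM integer")
eas.add("010", "FFX")
eas.add("011", "MR-FPE")
eas.add("300", "inner substitution method")
eas.add("301", "segment encryption")
```

Here the stored values are plain names for readability; in practice each leaf would hold the encryption and decryption routines of the algorithm. Restricting every level to a single digit is what limits each node to ten children.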
5.4 Special methods
Unfortunately, existing FPE algorithms can hardly meet all the requirements of the CDPS. In practice, processing a data type can be complicated, as the following two cases show:
Table 3 General data publishing methods
Type                  Sub type     Number  Algorithm
FPE(0)                Numeric(0)   0       FFSEM integer [19]
                                   0       FFSEM decimal [19]
                      String(1)    0       FFX [20]
                                   1       MR-FPE [21]
                      DateTime(2)  0       Liu [34]
OPE(0)                (0)          0       Integer OPE [22]
                                   1       Decimal OPE [22]
Symmetric cipher (0)  (0)          0       AES
                                   1       DES
(1) Data must be a specific element of a known dataset. Take medical data as an example. The disease of a patient must be a specific element of the disease set, such as HIV or HDV, rather than an arbitrary string. To meet this publishing requirement, the algorithm should adopt a substitution within the set, which preserves an element's format. We call such an algorithm the “inner substitution method”, whose code is “300”.
(2) Data needs to preserve its own segment characteristics. Usually, segment-level statistics rely on the fact that different segments have different functions. For instance, an ID number consists of an area code, a date of birth, and several check codes. If the statistical analysis concerns the area code and birth date, a publishing algorithm should preserve the original meaning of each segment. We call such an algorithm “segment encryption”, whose code is “301”.
Inner Substitution Method
The prefix method can be used to implement an inner substitution method. However, this method generates the ciphertexts of all elements in a set and stores a corresponding substitution table, which is not practical. Thus, we adopt the integer-FPE (i.e., FFSEM integer) algorithm to generate the index of the ciphertext, instead of using a substitution table. Generally, the method can be described as ISM = (Setup, Queryindex, Querydata, Encryption, Decryption).
• Setup(F, column) → (D, $key_D$): For the column of file F, Setup generates an inner attribute set D according to the order of elements. The number of elements is size = |D|. It also generates a publishing key using the key dispersion algorithm.
• Queryindex(D, str) → i: This algorithm queries the index of str in the inner attribute set D.
• Querydata(D, i) → str: This algorithm queries the data with index i in the inner attribute set D.
• Encryption(D, $key_D$, str) → $str'$: This algorithm generates the corresponding ciphertext of str in three steps:
  • obtain the index i ← Queryindex(D, str);
  • generate an integer ciphertext j ← integer-fpe.encryption(size, i, $key_D$);
  • obtain the result $str'$ ← Querydata(D, j).
• Decryption(D, $key_D$, $str'$) → str: This algorithm is the converse of the above encryption:
  • i ← Queryindex(D, $str'$);
  • j ← integer-fpe.decryption(size, i, $key_D$);
  • obtain the decryption result str ← Querydata(D, j).
Since the Encryption of ISM is based on integer-FPE, whose security has been proven in [19], ISM is secure.
Segment Encryption
In segment encryption, a string is first split into several segments, and then each segment is encrypted by the encryption algorithm whose algorithmID is specified in the special rule. A special rule is defined as < $endpos_1$, algorithmID >, ..., < $endpos_i$, algorithmID >, ..., < $endpos_n$, algorithmID >, where $endpos_i$ is the end offset of the i-th segment, and the start offset can be obtained from the previous segment. Obviously, $endpos_n$ = |str|.
Here, split is the segment split method; k is the number of segments; $segment_i$ is the i-th segment specified in the special rule rule; $key_D$ is the dispersed key generated for encrypting a column; and $ALG_i$ is the encryption algorithm of the i-th segment, whose code is $ALGID_i$. The procedure of segment encryption (SE) is shown in Algorithm 2. The decryption is the converse of the encryption.
Algorithm 2 Segment Encryption
Input: str, rule
  {segment_i}, i = 1..k ← split(str, rule)
  for i = 1 to k do
    if ALGID_i == “300” then
      m ← ISM.Queryindex(D, segment_i)
      n ← ISM.integer-FPE.encryption(size, m, key_D)
      y_i ← ISM.Querydata(D, n)
    else
      y_i ← ALG_i.encryption(key_D, segment_i)
    end if
  end for
  y ← y_1 || y_2 || ... || y_k
Output: y
The encryption of SE is based on publishing algorithms in the EAS, whose security has been proven. In other words, SE only involves existing algorithms without breaking their security. As a result, SE is secure.
5.5 Analysis
Traditional publishing algorithms do not adopt cryptographic methods to protect data privacy. They usually rely on data anonymization techniques, such as generalization and bucketization, which influence the data utility. In [12], the authors perform experiments to quantify the data utility of a published data set in terms of aggregated query accuracy and classification accuracy, and show that none of these techniques achieves an accuracy rate of 100%. In contrast, CDPS retains a high data utility because it adopts cryptographic methods.

Lemma 1 In CDPS, encryption may lead to a change of the mining result, but will not affect the implementation of the data mining algorithm.

Proof: We focus on the items in the source dataset and observe what happens to them after they are published. In our cryptographic data publishing system:
• For the items of sensitive attributes (such as name, address, etc., which are deleted from the dataset by traditional publishing algorithms), CDPS publishes them after encrypting them with data publishing algorithms based on cryptography. In this way, CDPS preserves the original database structure.
• For the other items, which are also published by traditional publishing algorithms, CDPS applies suitable algorithms to publish them.

From the data publishing algorithms above, we can see that they all preserve data usability. That is to say, they all meet the requirements described in Section 3.

(1) Assume U = {U_1, U_2, ..., U_n} denotes the attribute set of the source dataset, and R(U) denotes the relation in U. If the same key is used in the columns of two items in a relation R(U_i), 1 ≤ i ≤ n, the referential integrity will be preserved. In fact, CDPS first analyzes the database structure and makes sure that the same key is specified in the “Specified Rules” module. That is to say, when dispersing keys, CDPS inputs the same information for the two columns in a relation. So Req1(R, T(R)) is satisfied.

(2) For items with a special meaning, CDPS applies the special methods to preserve their original meaning after publication. The inner substitution method generates a ciphertext from the same dataset, and the segment encryption preserves the meaning of every segment of an item. So Req2(R, T(R)) is also satisfied.

In summary, if the right rules are specified for each column, CDPS will not affect the implementation of the data mining algorithm. However, it may lead to a change of the mining result. For example, when the original mining result is a data item, it will be changed to the corresponding encrypted one.

Lemma 2 In CDPS, if the original mining result is a data item, it can be decrypted by the data publisher.

Proof: In some data mining algorithms, the mining result is a data item of the source dataset. For example, in association rule mining the result consists of items of the source dataset. Since the encryption is deterministic, if the data publisher can retrieve the key, he/she will get the right data mining result after decrypting it. In CDPS, the data publisher can re-compute the key with the key dispersion algorithm using the saved master key.
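The argument of Lemmas 1 and 2 can be illustrated with a small Python sketch. It is not one of the CDPS algorithms: the keyed permutation below is a toy stand-in for a deterministic, invertible, format-preserving cipher, and the tables, column values, and function names are hypothetical. It shows that using the same key for a primary-key column and the foreign-key column that references it keeps the references valid, and that a mining result computed on the published data decrypts to the result on the source data.

```python
import hashlib
import hmac
from collections import Counter


def keyed_permutation(key, domain):
    """Deterministic bijection domain -> domain, ordered by HMAC tag.

    Toy stand-in for a deterministic format-preserving cipher.
    """
    ordered = sorted(domain)
    shuffled = sorted(
        ordered,
        key=lambda v: hmac.new(key, v.encode("utf-8"), hashlib.sha256).digest(),
    )
    return dict(zip(ordered, shuffled))


# Hypothetical source data: customers.id is referenced by orders.customer_id.
customers = ["u01", "u02", "u03"]
orders = [("o1", "u02"), ("o2", "u02"), ("o3", "u03")]

# The same dispersed key is used for both columns of the relation.
key_d = b"same dispersed key"
enc = keyed_permutation(key_d, customers)
dec = {cipher: plain for plain, cipher in enc.items()}

published_customers = [enc[u] for u in customers]
published_orders = [(o, enc[u]) for o, u in orders]


def most_frequent_customer(order_rows):
    """A minimal 'mining' step: the customer with the most orders."""
    return Counter(u for _order, u in order_rows).most_common(1)[0][0]


# The mining code runs unchanged on the published data (Lemma 1);
# its result is an encrypted item that the publisher can decrypt (Lemma 2).
encrypted_result = most_frequent_customer(published_orders)
recovered_result = dec[encrypted_result]
```

The mining function is identical for both datasets; only its output differs, and decrypting that output with the shared key yields the plaintext item.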
6 Evaluation
Section 5.5 shows that, provided CDPS does not break the properties of the original system, its effect on data availability is determined. In this section, we therefore aim at evaluating the efficiency of the publishing algorithms.
Table 4 Execute times of general data publishing methods

Type       Algorithm           Execute time (μs)
Numeric    FFSEM (integer)     26.1
           FFSEM (decimal)     51.7
           FFX                 3.01
           MR-FPE              4.23
DateTime   Liu                 3.51
OPE        Integer OPE         0
           Decimal OPE         0
           AES                 0.19
String     Symmetric cipher    6.1
6.1 Implementation Details
We conduct experiments to evaluate the efficiency of CDPS. We implement the data publishing algorithms behind an open kernel API, a C++ DLL for CDPS called “DPA DLL”. In this DLL, all publishing algorithms implement a common interface, which defines the following function prototypes:
• void init(unsigned char * keyD, unsigned char * extradata): initializes the algorithm. keyD denotes the key value, and extradata denotes the additional information for special data publishing algorithms, including special rules, filename, column, etc.
• unsigned char * encrypt(unsigned char * plaintext): encrypts the plaintext.
• unsigned char * decrypt(unsigned char * ciphertext): decrypts the ciphertext.
Developers can build a concrete data publishing system on top of this DLL through the public interface.

In each Feistel-based algorithm, such as FFSEM, MR-FPE, FFX, and FPE-DATETime, the round function adopts AES as the pseudo-random function to ensure security. In the implementation, we use the AES and large-integer code of the polarssl library, and set the AES key to be the algorithm key. Therefore, each key in the algorithms is 128 bits long. Besides, we assume that the round function runs AES-ECB only once in each experiment, i.e., each input is 128 bits or less.

All experiments are performed on a computer with an Intel(R) Core(TM) i5-3337U CPU @ 1.80 GHz running Windows 7. The basic computing overhead of each general data publishing algorithm is shown in Table 4.
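The shape of the common interface described above, together with one simple way to obtain a per-call execute time in microseconds, might be mirrored in Python as follows. This is a sketch under stated assumptions, not the DPA DLL: the class names, the repeating-key XOR placeholder (which is not secure and only exercises the interface), and the helper mean_time_us are hypothetical, and the measurement procedure used for Table 4 is not specified at this level of detail in the text.

```python
import time
from abc import ABC, abstractmethod


class PublishingAlgorithm(ABC):
    """Python mirror of the init / encrypt / decrypt prototypes."""

    @abstractmethod
    def init(self, key_d: bytes, extradata: bytes) -> None:
        """Store the key and any algorithm-specific extra data."""

    @abstractmethod
    def encrypt(self, plaintext: bytes) -> bytes:
        """Return the published (encrypted) form of plaintext."""

    @abstractmethod
    def decrypt(self, ciphertext: bytes) -> bytes:
        """Return the original form of ciphertext."""


class XorDemo(PublishingAlgorithm):
    """Placeholder algorithm (repeating-key XOR); NOT secure."""

    def init(self, key_d, extradata):
        if not key_d:
            raise ValueError("key must not be empty")
        self._key = key_d
        self._extradata = extradata

    def encrypt(self, plaintext):
        key = self._key
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(plaintext))

    def decrypt(self, ciphertext):
        # XOR with the same key stream is its own inverse.
        return self.encrypt(ciphertext)


def mean_time_us(algorithm, plaintext, runs=10000):
    """Average wall-clock time of one encrypt call, in microseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        algorithm.encrypt(plaintext)
    return (time.perf_counter() - start) / runs * 1e6


demo = XorDemo()
demo.init(b"0123456789abcdef", b"")
elapsed_us = mean_time_us(demo, b"20161216", runs=1000)
```

Because every algorithm exposes the same three calls, a benchmark or a publishing pipeline can be written once against the base class and reused for each concrete algorithm.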
• Publishing numeric data via FFSEM takes about 26.1 μs. Encrypting decimal data takes twice as long as encrypting integer data, because the decimal part of decimal data must also be encrypted as an integer.
• The execute times of FFX, MR-FPE, and the scheme in [22] are very close, at nearly 15 times that of AES. Each algorithm runs the round function 12 times, and MR-FPE also performs some additional ModuloPlus operations and memory operations. The other two algorithms have fewer additional operations.
• Although FFSEM is also based on Feistel, its numeric data processing is less efficient than that of the others. Due to its cycle-walking mechanism, an intermediate ciphertext that falls outside the specified range is encrypted once again, and this step repeats until the ciphertext is in the range. To avoid this situation and improve the efficiency when the integer range is not strict, we can use FFX with the character set {0, ..., 9} to generate a value of a given length.
• The OPE algorithm here is the same as the one proposed in [22].

6.2 Experimental Results
Key Dispersion. To test the performance of the key dispersion algorithm in Section 5.2, we take column sets with different numbers of tables as input and evaluate the effect of changes in the table number. The evaluation is mainly about Algorithm 1. Fig. 4 shows the relationship between the execute time and the table number. As shown in Fig. 4, the execute time of the key dispersion algorithm is very small (not more than 10 ms) even when the column number is 10 and the table number is 200. The execute time of key dispersion is roughly linear in the table number. There are two reasons: 1) the master key is randomly chosen by the data publisher for each table; 2) the more tables there are, the more keys must be generated.
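To make the idea of key dispersion concrete, the following Python sketch derives one 128-bit key per column from a master key. It is not Algorithm 1, which is defined in Section 5.2; it is a generic HMAC-based derivation, and the function names, the "table.column" info string, and the use of a single master key for all tables are assumptions made for brevity. It only illustrates two points from the text: columns linked by referential integrity receive the same input information and hence the same key, and the publisher can re-compute any key from the saved master key.

```python
import hashlib
import hmac


def dispersion_info(table, column, references=None):
    """Info string for a column; a foreign key reuses the referenced column's info."""
    target = references if references is not None else (table, column)
    return "%s.%s" % target


def disperse_key(master_key, info):
    """Derive a 128-bit column key from the master key and the info string."""
    return hmac.new(master_key, info.encode("utf-8"), hashlib.sha256).digest()[:16]


# Hypothetical schema: order.customer_id references customer.id.
master_key = bytes(range(16))
key_pk = disperse_key(master_key, dispersion_info("customer", "id"))
key_fk = disperse_key(
    master_key,
    dispersion_info("order", "customer_id", references=("customer", "id")),
)
key_other = disperse_key(master_key, dispersion_info("customer", "name"))
```

One derivation is performed per column, so the total work grows with the number of columns and tables.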
Fig. 4. Execute time of key dispersion. [Plot: execute time (ms) versus number of tables, with curves for 4, 6, 8, and 10 columns.]

Encryption and Decryption. Next, we test the performance of the methods in Section 5.4, paying particular attention to the overheads of encryption and decryption. To make the characteristics of each method clearer, each table has three columns of its own. In the experiment on the inner substitution method, the column types are integer, string, and string, respectively, and each column is encrypted with the appropriate algorithm. Moreover, the number of tables in a small database varies from 1 to 4. The number of records in each table is the same, and each table has a foreign key that is the primary key of another table. Fig. 5 shows the execute time of the two modules. The execute time of both encryption and decryption is roughly linear in the number of records. For example, on the database that contains 4 tables, the encryption time varies from 1.1174 ms to 99.7837 ms and the decryption time varies from 0.5865 ms to 62.3234 ms as the record number goes from 10 to 1000. There are two reasons: 1) the inner substitution method for the columns we set is based on FPE (i.e., integer-FPE and FFX); 2) the time spent on “cycle-walking” increases with the element number when the FPE is running.

Fig. 5. Execute time of inner substitution method: (a) execute time of encryption; (b) execute time of decryption. [Plots: execute time (ms) versus number of records, with curves for 1, 2, 3, and 4 tables.]

Another special method is the segment encryption, which chooses the publishing algorithm according to the segment. In the experiment on this method, the column types are integer, string, and datetime, respectively. Note that datetime is the segment type that needs to preserve its own segment characteristic. Thus, it should be encrypted by using Algorithm 2. The setting of the experiment is similar to the previous one. Fig. 6 shows the execute time of the two modules. As shown in Fig. 6, the execute time of each module is roughly linear in the number of records. For example, on the database that contains 4 tables, the encryption time varies from 1.0515 ms to 108.3454 ms and the decryption time varies from 0.7869 ms to 77.7768 ms as the record number goes from 10 to 1000. From the description of the method and of the scheme of Liu et al. [34], we can infer that the performance is roughly linear in the number of segments, which agrees with the experimental results.

Fig. 6. Execute time of segment encryption: (a) execute time of encryption; (b) execute time of decryption. [Plots: execute time (ms) versus number of records, with curves for 1, 2, 3, and 4 tables.]
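The cycle-walking mechanism mentioned for FFSEM and for the FPE-based inner substitution can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated above: HMAC-SHA256 replaces AES as the round function, the balanced Feistel layout and the function names are hypothetical, and only the 12 rounds are taken from the text. The sketch encrypts an integer in [0, size) to another integer in [0, size) and counts how many extra encryptions were needed because an intermediate value fell outside the range.

```python
import hashlib
import hmac


def _round(key, rnd, value, width_bits):
    """Round function: HMAC-SHA256 truncated to width_bits (stand-in for AES)."""
    msg = bytes([rnd]) + value.to_bytes(8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") & ((1 << width_bits) - 1)


def feistel(key, x, half_bits, rounds=12, inverse=False):
    """Balanced Feistel permutation on integers of 2 * half_bits bits."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    order = range(rounds - 1, -1, -1) if inverse else range(rounds)
    for rnd in order:
        if inverse:
            left, right = right ^ _round(key, rnd, left, half_bits), left
        else:
            left, right = right, left ^ _round(key, rnd, right, half_bits)
    return (left << half_bits) | right


def _half_bits(size):
    """Smallest half-width whose doubled width covers [0, size)."""
    return max(1, ((size - 1).bit_length() + 1) // 2)


def fpe_encrypt(key, x, size):
    """Return (ciphertext, walks); walks counts the extra encryptions."""
    if not 0 <= x < size:
        raise ValueError("x must lie in [0, size)")
    half = _half_bits(size)
    y, walks = feistel(key, x, half), 0
    while y >= size:                  # out of range: encrypt once again
        y = feistel(key, y, half)
        walks += 1
    return y, walks


def fpe_decrypt(key, y, size):
    """Inverse of fpe_encrypt: walk backwards until the value is in range."""
    if not 0 <= y < size:
        raise ValueError("y must lie in [0, size)")
    half = _half_bits(size)
    x = feistel(key, y, half, inverse=True)
    while x >= size:
        x = feistel(key, x, half, inverse=True)
    return x


key = b"0123456789abcdef"
size = 1000                            # e.g. three-digit decimal values
cipher, extra_walks = fpe_encrypt(key, 123, size)
```

The closer the range size is to the full Feistel domain, the fewer walks are needed on average, which is consistent with the suggestion above to use a digit-based cipher such as FFX when the integer range is not strict.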
7 Conclusion
In this paper, we present a cryptographic data publishing system, which adopts format-preserving encryption to make sure the ciphertext has the same format as the plaintext. The proposed system has the following characteristics: (1) it achieves the goal of data integrity without removing sensitive attributes, a feature that is necessary for datasets that can be mined with multiple algorithms, such as the Apriori algorithm, the k-means algorithm, naive Bayes, etc.; (2) it also preserves the statistical characteristics and properties of the data. In practical applications, CDPS is limited with respect to some complicated data publishing requirements. For example, some columns have a relationship that can be expressed by a mathematical formula, such as “salary = basic salary + bonus”. To resolve these limitations, we plan future work as follows: (1) extend the existing publishing algorithms in light of various publishing requirements, such as the limitation above; (2) explore the application of the homomorphic encryption technique in a data publishing scenario.
References

[1] A. Castiglione, G. Cattaneo, G. De Maio, F. Petagna, Secr3t: Secure end-to-end communication over 3g telecommunication networks, in: Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2011 Fifth International Conference on, IEEE, 2011, pp. 520–526.
[2] A. Castiglione, G. Cattaneo, A. De Santis, F. Petagna, U. F. Petrillo, Speech: Secure personal end-to-end communication with handheld, in: ISSE 2006 Securing Electronic Business Processes, Springer, 2006, pp. 287–297.
[3] J. Lim, I. Doh, K. Chae, Security system architecture for data integrity based on a virtual smart meter overlay in a smart grid system, Soft Computing 20 (5) (2016) 1829–1840.
[4] C. Yao, L. Xu, X. Huang, J. K. Liu, A secure remote data integrity checking cloud storage system from threshold encryption, Journal of Ambient Intelligence and Humanized Computing 5 (6) (2014) 857–865.
[5] J. Mao, Y. Zhang, P. Li, T. Li, Q. Wu, J. Liu, A position-aware merkle tree for dynamic cloud data integrity verification, Soft Computing (2015) 1–14.
[6] P. Samarati, Protecting respondent’s privacy in microdata release, IEEE Transactions on Knowledge and Data Engineering 13 (6).
[7] L. Sweeney, k-anonymity: A model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (05) (2002) 557–570.
[8] X. Xiao, Y. Tao, Anatomy: Simple and effective privacy preservation, in: Proceedings of the 32nd international conference on Very large data bases, VLDB Endowment, 2006, pp. 139–150.
[9] D. J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, J. Y. Halpern, Worst-case background knowledge for privacy-preserving data publishing, in: 2007 IEEE 23rd International Conference on Data Engineering, IEEE, 2007, pp. 126–135.
[10] Q. Zhang, N. Koudas, D. Srivastava, T. Yu, Aggregate query answering on anonymized tables, in: 2007 IEEE 23rd International Conference on Data Engineering, IEEE, 2007, pp. 116–125.
[11] T. Li, N. Li, J. Zhang, I. Molloy, Slicing: A new approach for privacy preserving data publishing, IEEE Transactions on Knowledge and Data Engineering 24 (3) (2012) 561–574.
[12] A. S. Sattar, J. Li, X. Ding, J. Liu, M. Vincent, A general framework for privacy preserving data publishing, Knowledge-Based Systems 54 (2013) 276–287.
[13] C. Esposito, M. Ficco, F. Palmieri, A. Castiglione, A knowledge-based platform for big data analytics based on publish/subscribe services and stream processing, Knowledge-Based Systems 79 (2015) 3–17.
[14] J.-J. Yang, J.-Q. Li, Y. Niu, A hybrid solution for privacy preserving medical data sharing in the cloud environment, Future Generation Computer Systems 43 (2015) 74–86.
[15] Y. Wang, X. Ma, A general scalable and elastic content-based publish/subscribe service, IEEE Transactions on Parallel and Distributed Systems 26 (8) (2015) 2100–2113.
[16] L. Chen, G. Cong, X. Cao, K.-L. Tan, Temporal spatial-keyword top-k publish/subscribe, in: 2015 IEEE 31st International Conference on Data Engineering, IEEE, 2015, pp. 255–266.
[17] J. Black, P. Rogaway, Ciphers with arbitrary finite domains, in: Cryptographers’ Track at the RSA Conference, Springer, 2002, pp. 114–130.
[18] B. Morris, P. Rogaway, T. Stegers, How to encipher messages on a small domain, in: Advances in Cryptology-CRYPTO 2009, Springer, 2009, pp. 286–302.
[19] T. Spies, Format preserving encryption, Unpublished white paper, www.voltage.com, Database and Network Journal (December 2008).
[20] M. Bellare, P. Rogaway, T. Spies, The ffx mode of operation for format-preserving encryption, NIST submission 20.
[21] J. Li, Z. Liu, X. Chen, F. Xhafa, X. Tan, D. S. Wong, L-encdb: A lightweight framework for privacy-preserving data queries in cloud computing, Knowledge-Based Systems 79 (2015) 18–26.
[22] Z. Liu, X. Chen, J. Yang, C. Jia, I. You, New order preserving encryption model for outsourced databases in cloud environments, Journal of Network and Computer Applications 59 (2016) 198–207.
[23] A. Boldyreva, N. Chenette, Y. Lee, A. O’Neill, Order-preserving symmetric encryption, in: Annual International Conference on the Theory and Applications of Cryptographic Techniques, Springer, 2009, pp. 224–241.
[24] R. A. Popa, F. H. Li, N. Zeldovich, An ideal-security protocol for order-preserving encoding, in: Security and Privacy (SP), 2013 IEEE Symposium on, IEEE, 2013, pp. 463–477.
[25] F. Kerschbaum, Frequency-hiding order-preserving encryption, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ACM, 2015, pp. 656–667.
[26] C. Mavroforakis, N. Chenette, A. O’Neill, G. Kollios, R. Canetti, Modular order-preserving encryption, revisited, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM, 2015, pp. 763–777.
[27] K. Li, W. Zhang, C. Yang, N. Yu, Security analysis on one-to-many order preserving encryption-based cloud data search, IEEE Transactions on Information Forensics and Security 10 (9) (2015) 1918–1926.
[28] D. Boneh, K. Lewi, M. Raykova, A. Sahai, M. Zhandry, J. Zimmerman, Semantically secure order-revealing encryption: Multi-input functional encryption without obfuscation, in: Annual International Conference on the Theory and Applications of Cryptographic Techniques, Springer, 2015, pp. 563–594.
[29] A. Castiglione, A. De Santis, B. Masucci, F. Palmieri, A. Castiglione, X. Huang, Cryptographic hierarchical access control for dynamic structures, IEEE Transactions on Information Forensics and Security 11 (10) (2016) 2349–2364.
[30] A. Castiglione, A. De Santis, B. Masucci, F. Palmieri, A. Castiglione, J. Li, X. Huang, Hierarchical and shared access control, IEEE Transactions on Information Forensics and Security 11 (4) (2016) 850–865.
[31] L. Ke, Z. Yi, Y. Ren, Improved broadcast encryption schemes with enhanced security, Journal of Ambient Intelligence and Humanized Computing 6 (1) (2015) 121–129.
[32] A. Castiglione, A. De Santis, B. Masucci, Key indistinguishability vs. strong key indistinguishability for hierarchical key assignment schemes.
[33] A. Castiglione, A. De Santis, B. Masucci, F. Palmieri, A. Castiglione, On the relations between security notions in hierarchical key assignment schemes for dynamic structures, in: Australasian Conference on Information Security and Privacy, Springer, 2016, pp. 37–54.
[34] Z. Liu, C. Jia, J. Li, X. Cheng, Format-preserving encryption for datetime, in: Intelligent Computing and Intelligent Systems (ICIS), 2010 IEEE International Conference on, Vol. 2, IEEE, 2010, pp. 201–205.