Journal of Parallel and Distributed Computing 137 (2020) 17–25
Variational approach for privacy funnel optimization on continuous data

Lihao Nan ∗,1, Dacheng Tao 1

UBTECH Sydney AI Centre, The School of IT, FEIT, The University of Sydney, NSW, Australia
Article history: Received 29 January 2019; Received in revised form 22 May 2019; Accepted 17 September 2019; Available online 21 September 2019.

Keywords: Privacy; Data security; Representation learning
Abstract

Here we consider a common data encryption problem encountered by users who want to disclose some data to gain utility but preserve their private information. Specifically, we consider the inference attack, in which an adversary conducts inference on the disclosed data to gain information about users' private data. Following the privacy funnel (Makhdoumi et al., 2014), assuming that the original data X is transformed into Z before disclosure and that the log loss is used for both the privacy and utility metrics, the problem can be modeled as finding a mapping X → Z that maximizes the mutual information between X and Z subject to the constraint that the mutual information between Z and the private data Y is smaller than a predefined threshold ϵ. In contrast to the original study (Makhdoumi et al., 2014), which only focused on discrete data, we consider the more general and practical setting of continuous and high-dimensional disclosed data (e.g., image data). Most previous work on privacy-preserving representation learning is based on adversarial learning or generative adversarial networks, which have been shown to suffer from the vanishing gradient problem, and it is experimentally difficult to eliminate the relationship with the private data Y when Z is constrained to retain more information about X. Here we propose a simple but effective variational approach that does not rely on adversarial training. Our experimental results show that our approach is stable and outperforms previous methods in terms of both downstream task accuracy and mutual information estimation.

© 2019 Elsevier Inc. All rights reserved.
1. Introduction

In this paper, we consider the data security problem faced by users who want to disclose their data to gain some utility. This is a common scenario in the commercial client-server model, in which users upload their data to a server; the server receives and processes the data and returns the desired utility to the client. This procedure contains an inherent data security issue. Consider a curious server that not only provides the desired utility to the user but also tries to infer some private information about the user from the disclosed data. For example, a speech recognition service provider can easily recognize the identity of the voice uploaded to the server, which might be against the user's consent. Further, this private information can be used by generative adversarial networks (GANs) [10] to synthesize voices that are indistinguishable from reality [27], providing a convenient route to telemarketing scams.
∗ Corresponding author. E-mail addresses: [email protected] (L. Nan), [email protected] (D. Tao).
1 L. Nan and D. Tao are with the UBTECH Sydney Artificial Intelligence Centre and the School of Computer Science, in the Faculty of Engineering and Information Technologies, at the University of Sydney, 6 Cleveland St, Darlington, NSW 2008, Australia.
To overcome these privacy issues, a natural solution is to add a sanitization mechanism to the data before it is uploaded to the server. Such a sanitization mechanism would take the users' raw data as input and output a sanitized version of the data for the server to process. Clearly, the sanitized data Z should be kept unrelated to the private variable Y while retaining as much information as possible about X, since Z will be processed by the server to provide the expected utility. However, it is hard to define "unrelated" when no assumption is made about the adversary. In this article, we consider a powerful adversary who owns sanitized data Ztrain and its corresponding private data Ytrain and can train a machine learning model to conduct inference for new Ztest. Following the work of [25], we refer to this attack as an inference attack and to the sanitization mechanism as the privatizer. To model the privacy disclosure and data distortion introduced by sanitization, [23] proposed a concept called the "privacy funnel" to represent the trade-off between data utility and data privacy. Briefly, the privacy funnel uses mutual information to evaluate both data distortion and privacy leakage. The mutual information of two random variables is a measure of the mutual dependence between them, quantifying the amount of information obtained about one random variable by observing the other. [23] showed that when the log loss is used for both the privacy and utility metrics, the problem can be
formulated as follows:

\max_{P(Z|X):\, I(Z;Y)\le\epsilon} I(Z;X)    (1)
where I(X ; Y ) denotes the mutual information between two random variables X and Y , and ϵ is a predefined threshold that describes the degree of privacy leakage. Following the work of [23], we use the privacy funnel as our target to obtain privacy-preserving mapping P(Z |X ). When the data distribution is known, the mutual information between two random variables can be calculated by:
I(X;Y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)p(y)}\,dx\,dy    (2)
However, the data distribution is often inaccessible and only a portion of sampled data is available, forcing us to use data-driven methods to optimize the privacy funnel. The original privacy funnel paper [23] proposed a greedy algorithm for discrete data, which is unsuitable for very complex data such as image or voice data. Thus, our challenge is to find a mapping P(Z|X) that optimizes Eq. (7) for a high-dimensional continuous random variable X using a data-driven method.

1.1. Related work

To deal with complex data when the data distribution is unavailable, [32] defended against an inference attack in a mobile sensing application by adding differential-privacy-guaranteed noise to data compressed by an autoencoder. However, although differential privacy provided strong protection against every entry being recognized, it failed to separate information related to the private data from information related to the utility data, so the added noise significantly compromised the data utility. [31] proposed a model that zeroed out the private information when the first stage of the desired information predictor was a linear operator. Although this model achieved a perfect utility-privacy trade-off when the utility and private data were orthogonal, it required knowledge of the model used to process the downstream utility task and even required the first part of that model to be linear. However, for situations such as data publishing, obtaining such information about the downstream model is impossible. Recently, adversarial privacy-preserving models have dominated the field [6,8,12,15,21]. Such models have appeared for natural language processing [21], facial recognition [6], and speech conversion [8], suggesting that this approach has real-world value and practicality. The core idea is to train an adversarial network that tries to infer the private data Y given the transformed data Z. To preserve data utility, an additional decoder is included that tries to optimize the reconstruction loss, or, when the utility label is available, the decoder is replaced with an additional classifier that aims to keep the transformed data informative about the utility label. When the utility label is not accessible and the reconstruction loss is used, the mutual information between the transformed data and the raw data can be bounded by the following inequality:

I(X;Z) \ge H_e(Z) - R_{e,d}(X)    (3)
where e and d refer to the encoder Pθ(Z|X) and the decoder Pθ(X|Z), respectively, He(Z) denotes the entropy of the latent variable produced by the encoder, and Re,d(X) denotes the reconstruction loss produced by the decoder and encoder together. Thus, models with a reconstruction loss provide some guarantee on the mutual information term I(X;Z). This method turns the problem into a minimax game between the privatizer and the adversary, and an alternating training strategy is used. This makes the model extremely sensitive to
hyper-parameter selection: setting a large coefficient for the reconstruction loss leads to a strict constraint on the latent space, which finally causes the discriminator to fail and the whole model to behave like an autoencoder; when a large coefficient is used for the discriminator, the model collapses easily and the privatizer degenerates to a simple constant mapping, ruining the utility of the data for downstream processing.

The variational fair autoencoder [22] is similar to our work, as it uses a similar graphical model and also adapts the VAE framework to learn a representation from raw data while certain known variable factors are purged. It achieves this by combining a VAE loss and a "maximum mean discrepancy" (MMD) [11] loss. However, that study focused on the semi-supervised domain adaptation problem and did not provide a solution for how to trade off between data distortion and privacy leaking, which is one of the aims of privacy funnel optimization. Further, calculating the MMD loss requires sampling data for each distinct private label, which is impossible when the private variable is continuous and also incurs a high computational cost when the possible private values are numerous. In contrast, we provide a convenient way to trade off between data distortion and privacy leaking regardless of whether the private variable is continuous or discrete. β-VAE [13] is an unsupervised approach to automatically factorize the features, i.e., tuning one factor of the latent code while keeping the other factors constant will only influence one property of the generated data. β-VAE achieves this by introducing an adjustable coefficient into the VAE framework, which balances the latent channel capacity and independence against the reconstruction accuracy. However, since this is an unsupervised method, it cannot be used to learn a privacy-preserving representation directly. Our method uses a similar idea to β-VAE but extends the algorithm to a supervised setting.

1.2. Our contributions

Our work, on the other hand, views the problem as fitting the observed data using a directed graphical model that generates the data from two disentangled latent variables. The sanitized data Z is then obtained by conducting inference on the graphical model when X and Y are observed. The whole model adapts the framework used in VAE, which contains an encoder that maps the original data into the latent variable and a decoder that tries to reconstruct the original data. Our proposed model has three main advantages:

1. We expose little private information from other users on the client, thus preventing inference attacks from the client. Practically speaking, the whole model is trained on the server side and only the encoder is deployed on the client. Compared to previous methods that viewed data with different private variables as belonging to different domains and used a GAN to synthesize data indistinguishable from other domains, our proposed model does not expose information from other domains on the client, thus defending the client from attack.
2. Our model does not rely on the adversarial training framework, which has been shown to suffer from the vanishing gradient and mode collapse problems. Further, our model takes advantage of the variational autoencoder (VAE), which is easy and stable to train.
3. Our model can easily be modified to achieve different privacy-utility trade-offs.
We achieve this by incorporating a loss term from the privacy funnel, which constrains I(X;Z) with Lagrange multipliers [20]. To evaluate our work, we take advantage of recent progress in estimating mutual information [4], so we can directly estimate the mutual information term in the privacy funnel. Further, we also test downstream tasks, as this is the usual use case (e.g., data publishing and uploading), and downstream task accuracy can to some extent illustrate the term I(X;Z). Additionally,
we use other metrics such as MS-SSIM [29] and NDM [5], as these metrics also reflect the mutual information term but are easier to estimate. Compared to previous data-driven, context-aware approaches for privacy funnel optimization, our work achieves a better privacy-utility trade-off by adding a module that can simply be plugged in. Our contributions can be summarized as follows:

1. We propose a model that improves the privacy-utility trade-off when the log loss is used as the metric for both the privacy and utility tasks, under the setting that the private variable can be high-dimensional and continuous.
2. We provide a convenient way to trade off between the utility and privacy tasks, and for different privacy leaking tolerances we find that the minimum data distortion is proportional to the tolerance.
3. We compare our model to others on a real image dataset in terms of both downstream task evaluation and other proxies that reflect mutual information.

2. Background

2.1. Mutual information

Mutual information (MI) is a measure of the amount of information shared by two random variables. Briefly, given two random variables X and Y, the mutual information I(X;Y) is defined as the KL-divergence between the joint distribution p(x,y) and the product of the marginals p(x)p(y):

I(X;Y) = D_{KL}(p(x,y)\,\|\,p(x)p(y)) = E_{x,y\sim p(x,y)}\log\frac{p(x,y)}{p(x)p(y)}    (4)
In contrast to correlation, MI can capture non-linear relationships and true dependency [18] and is therefore widely used by the information privacy community [9,23,28,30].
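As a quick illustration of Eq. (4) (our own example, not part of the original paper's material), the following snippet computes I(X;Y) for a small, made-up discrete joint distribution:

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y); rows index X, columns index Y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)

# I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ), cf. Eq. (4)
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(f"I(X;Y) = {mi:.4f} nats")  # ~0.1927 nats for this particular joint
```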
2.2. Privacy funnel

Privacy funnel [23] considers the setting in which the adversary can choose an optimal prior q over all possible distributions of Y and minimizes the inference cost of obtaining Y using this prior. Under the log loss C(y, q) = −log q(y), the minimum expected inference cost before observing Z is:

\min_q E_{p(y)}[-\log q(y)] = H(Y)    (5)

and after observing the published data Z:

\min_q E_{p(y,z)}[-\log q(y|z)] = H(Y|Z)    (6)

Thus, the expected gain from observing Z is H(Y) − H(Y|Z) = I(Y;Z). For data distortion, a similar assumption is used, which leads to the final privacy funnel optimization:

\min_{P(Z|X):\, I(X;Z)\ge R} I(Y;Z)    (7)

where R is a specific threshold that indicates the required degree of information preservation. We use the same setting as the privacy funnel, except that here we are dealing with high-dimensional continuous data such as images.

2.3. VAE

The variational autoencoder (VAE) [17] provides a convenient way to both generate X from the latent variable Z and perform inference when data X is observed. VAE tries to generate data through a latent variable Z subject to a prior distribution (e.g., N(µ, σ²)), and then tries to approximate the true distribution p(X) with the generated distribution q(x) by minimizing the KL-divergence between the two distributions:

\min\, D_{KL}(p(x)\,\|\,q(x)) = \min\, E_{x\sim p(x)}[-\log q(x)] - H(X)    (8)

The term E_{x\sim p(x)}\log q(x) can be rewritten as:

E_{x\sim p(x)}\log q(x) = E_{x\sim p(x)}\big[D_{KL}(p_\theta(z|x)\,\|\,p(z|x)) + L(\theta;x)\big]    (9)

where pθ(z|x) is known as a recognition model that aims to approximate the true intractable distribution p(z|x), and the term L(θ;x) is called the variational evidence lower bound, which can be rewritten as:

\log q(x) \ge L(\theta;x) = -D_{KL}(p_\theta(z|x)\,\|\,q_\theta(z)) + E_{z\sim p_\theta(z|x)}\log q_\theta(x|z)    (10)

VAE tries to maximize the term L(θ;x), which lower-bounds log q(x); maximizing it therefore reduces the KL-divergence between the true distribution p(x) and the approximate distribution q(x). From Eq. (9), we can see that when the gap between log q(x) and the variational evidence lower bound is driven to zero, the approximate distribution pθ(z|x) equals the real distribution p(z|x), and thus we can use pθ(z|x) to perform inference when x is observed. Although VAE is popular in generative models, there is no privacy-preserving model that uses VAE to achieve a privacy-utility trade-off. In this article, we illustrate the relationship between the loss function in VAE and the privacy funnel, and in doing so show how we can modify the loss term in VAE to trade off between data distortion and privacy leaking.

3. Problem formulation

Given a dataset X, its private label Y, and a certain degree of privacy leaking tolerance ϵ, the problem is to find a mapping P(Z|X) which optimizes the following privacy funnel:

\max_{P(Z|X):\, I(Y;Z)\le\epsilon} I(X;Z)    (11)

where I(X;Z) is the mutual information between variables X and Z.
4. Adversarial privacy preserving models

In this section, we detail the previous adversarial privacy-preserving models [6,8,12,15,21] and highlight the insights gained from them. Consider Eq. (11): our target is to find an optimal encryption p∗θ(z|x) that simultaneously minimizes the term I(Z;Y) and maximizes I(Z;X). Therefore, we can consider the minimization and maximization problems separately and finally use a Lagrange multiplier [20] to optimize Eq. (11). First, let us consider I(Z;Y). For the optimal encryption p∗θ(z|x), we have the following equations:

p^*_\theta(z|x) = \arg\min_{p_\theta(z|x)} I(Z;Y)    (12)
= \arg\min_{p_\theta(z|x)} -H(Y|Z)    (13)
= \arg\min_{p_\theta(z|x)} E_{y,z\sim p(y,z)}\log p(y|z)    (14)
= \arg\min_{p_\theta(z|x)} E_{y,z\sim p(y,z)}\log\frac{p(y|z)}{p_\theta(y|z)}\,p_\theta(y|z)    (15)
= \arg\min_{p_\theta(z|x)} E_{y,z\sim p(y,z)}\log p_\theta(y|z) + KL(p(y|z)\,\|\,p_\theta(y|z))    (16)
= \arg\min_{p_\theta(z|x)}\max_{p_\theta(y|z)} E_{y,z\sim p(y,z)}\log p_\theta(y|z)    (17)

From Eq. (12) to Eq. (13), we assume that H(Y) is a constant and can thus be ignored. From Eq. (16) to Eq. (17), since the term
KL(p(y|z)∥pθ(y|z)) is always non-negative, we can replace the true distribution p(y|z) with our parameterized distribution pθ(y|z) and add a maximization over it. We use multilayer perceptrons (MLPs) to parameterize pθ(y|z) and pθ(z|x). It is clear that Eq. (17) forms a minimax game. Similar to GAN [10], we refer to pθ(z|x) as the generator G and pθ(y|z) as the discriminator D. The loss used by both the generator and the discriminator to minimize I(Z;Y) is:

loss_y = E_{y,z\sim p(y,z)}\log p_\theta(y|z)    (18)
where the generator wants to minimize this loss and the discriminator wants to maximize it. For utility preservation, we need to maximize the term I(Z;X); similarly to I(Z;Y), we obtain the following loss term for I(Z;X):

loss_u = E_{x,z\sim p(x,z)}\log p_\theta(x|z)    (19)
Fig. 1. The directed graphical model we use to model the inference and generation process. A variable drawn in black indicates that it is observed. During the inference stage, Y and Z are not independent, but when X is unobserved they are marginally independent.
Combining Eqs. (18) and (19) with a Lagrange multiplier λ, we get the final target for training the generator and discriminator:

\min_{p_\theta(z|x),\,p_\theta(x|z)}\ \max_{p_\theta(y|z)}\ E_{y,z\sim p(y,z)}\log p_\theta(y|z) - \lambda\,E_{x,z\sim p(x,z)}\log p_\theta(x|z)    (20)
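To make the minimax objective in Eq. (20) concrete, the following is a minimal sketch of the alternating training commonly used for such adversarial privatizers. This is our own illustration in TensorFlow/Keras, not the implementation of [6,8,21]; the layer sizes, the flattened input dimension, and the six-class private label are hypothetical placeholders.

```python
import tensorflow as tf

# Hypothetical networks: privatizer p_theta(z|x), decoder p_theta(x|z),
# and discriminator p_theta(y|z) (outputs logits over private labels).
encoder = tf.keras.Sequential([tf.keras.layers.Dense(128, activation="relu"),
                               tf.keras.layers.Dense(64)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(128, activation="relu"),
                               tf.keras.layers.Dense(784)])
discriminator = tf.keras.Sequential([tf.keras.layers.Dense(128, activation="relu"),
                                     tf.keras.layers.Dense(6)])

opt_gen = tf.keras.optimizers.Adam(1e-4)
opt_disc = tf.keras.optimizers.Adam(1e-4)
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
lam = 1.0  # Lagrange multiplier lambda in Eq. (20)

@tf.function
def adversarial_step(x, y_private):
    # Discriminator step: maximize E[log p_theta(y|z)], i.e. minimize the cross-entropy.
    z = encoder(x)
    with tf.GradientTape() as tape:
        disc_loss = ce(y_private, discriminator(z))
    grads = tape.gradient(disc_loss, discriminator.trainable_variables)
    opt_disc.apply_gradients(zip(grads, discriminator.trainable_variables))

    # Generator step: minimize E[log p_theta(y|z)] - lambda * E[log p_theta(x|z)];
    # the second term is realized as a squared reconstruction error (a Gaussian
    # likelihood with identity covariance, cf. the discussion that follows).
    with tf.GradientTape() as tape:
        z = encoder(x)
        fool_term = -ce(y_private, discriminator(z))   # = E[log p_theta(y|z)] up to a constant
        rec_term = tf.reduce_mean(tf.square(x - decoder(z)))
        gen_loss = fool_term + lam * rec_term
    gen_vars = encoder.trainable_variables + decoder.trainable_variables
    opt_gen.apply_gradients(zip(tape.gradient(gen_loss, gen_vars), gen_vars))
    return disc_loss, gen_loss
```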
Practically, for a discrete variable Y, we may model pθ(y|z) with an exponential family distribution and use a softmax to calculate it, while for a continuous variable X, we may model pθ(x|z) with some tractable distribution such as a normal distribution. When the covariance matrix of such a distribution is fixed to the identity, we obtain the reconstruction loss:

loss_{rec} = \|x - g(z)\|^2    (21)
where g(z) parameterizes pθ(x|z) and predicts the mean value of the reconstructed x. To optimize Eq. (20), the most common approach is to train the discriminator and generator alternately. When the utility label is provided, the decoder may be replaced with a classifier that tries to predict the utility label, and the model performs well [6,8,21]. However, when the utility label is unavailable during training, the strict constraint added by the reconstruction error may easily cause the model to collapse to a constant function, or to ignore the discriminator so that the model finally becomes an autoencoder. This model does not utilize the prior knowledge that the private variable and the utility variable are naturally independent. Thus, a more suitable model that can incorporate this relationship between the latent and private variables should be considered.

5. Generative privacy filter

5.1. Motivation

Since the above adversarial privacy-preserving model closely resembles the generative adversarial network (GAN) [10], it is natural to consider a generative model instead of a discriminative one. Consider a data generation process that first samples a random variable Z from a prior distribution qθ(Z) and independently samples the private variable Y from another prior distribution qθ(Y); the original data X is then generated by qθ(X|Y,Z). Since Z and Y in the above generation process are independent, an ideal privatizer can be obtained by finding a mapping between X and Z. To characterize the generation process, we may use a directed graphical model. Finding the mapping between X and Z is equivalent to inferring Z when X is observed. In particular, we can model the generation and inference with the graphical model presented in Fig. 1. Note from Fig. 1 that when X is unobserved, the latent variables Z and Y are marginally independent. When X and Y are observed, Z and Y are generally not
independent. However, for a graphical model, doing inference is usually difficult, and for optimizing the privacy funnel we should find a way to trade off between privacy leaking and data distortion. To summarize, there are two additional requirements that the generative model needs to satisfy:

1. Inference should be easy to perform when X and Y are observed.
2. There should be some mechanism that can conveniently trade off between data distortion and privacy leaking, i.e., some hyperparameter that we can tune so that the privacy-utility curve can be traced out; more importantly, we need to know the relationship between our target Eq. (11) and the loss function used in our generative model.
5.2. Method

Finding a generative model that satisfies the above constraints can be challenging. Fortunately, the variational autoencoder (VAE) [17] provides a convenient way to model the above generation process and also supports efficient inference. From Section 2.3 we know that VAE minimizes the KL divergence between the real data distribution p(X) and our modeled distribution q(X). For the generative model in Fig. 1, we instead minimize the KL divergence between the observed joint distribution p(X,Y) and our modeled distribution q(X,Y). The corresponding variational evidence lower bound in Eq. (10) becomes:

\log q(x,y) \ge L(\theta;x,y)    (22)
= -D_{KL}(p_\theta(z|x,y)\,\|\,q_\theta(z)) + E_{z\sim p_\theta(z|x,y)}\log q_\theta(x|z,y) + c    (23)
where c is a constant, equal to H(Y|X), and can thus be ignored in the loss function. By maximizing Eq. (22), we obtain the privatizer pθ(z|x,y). When new data X and Y are given, we simply feed them into the privatizer to obtain the transformed data Z. The above directed graphical model generally encourages the model to factorize the data X into independent Y and Z, and we can use the privacy funnel to further analyze the data distortion and privacy leaking introduced by this graphical model. To show the relation between the loss in Eq. (22) and the privacy funnel in Eq. (7), we analyze the privacy leaking and data distortion produced by pθ(Z|X,Y); that is, after observing the real data X and Y, we obtain Z by sampling from the distribution pθ(Z|X,Y).
Fig. 2. The detailed model architecture for our proposed algorithm: the input data X is first fed into a stack of convolutional layers to extract high-level features, then the private variable Y is concatenated with the features (although a discrete Y is shown in the figure, a continuous Y is also acceptable) to further predict the parameters of the distribution of Z. Finally, a decoder is used to recover X from Z and Y. The loss is calculated from the reconstruction error ∥Xrec − X∥ and the weighted KL loss β DKL(N(µ, σ)∥N(0, I)).
First, we bound the privacy leaking I(Zθ;Y):

I(Z_\theta;Y) \le I(Z_\theta;X,Y)    (24)
= E_{x,y\sim p(x,y)}E_{z\sim p_\theta(z|x,y)}\log\frac{p(x,y,z_\theta)}{p(z_\theta)\,p(x,y)}    (25)
= E_{x,y\sim p(x,y)}E_{z\sim p_\theta(z|x,y)}\log\frac{p(z_\theta|x,y)}{q(z_\theta)}\cdot\frac{q(z_\theta)}{p(z_\theta)}    (26)
= -D_{KL}(p(z_\theta)\,\|\,q(z_\theta)) + E_{x,y\sim p(x,y)}D_{KL}(p_\theta(z_\theta|x,y)\,\|\,q(z_\theta))    (27)
\le E_{x,y\sim p(x,y)}D_{KL}(p_\theta(z_\theta|x,y)\,\|\,q(z_\theta))    (28)

where p(z_\theta) = \iint p(x,y,z_\theta)\,dx\,dy is the marginal distribution of zθ produced by the recognition model, and q(zθ) is a predefined tractable distribution (e.g., N(0,I)). Note that the term DKL(pθ(zθ|x,y)∥q(zθ)) appears as the first term of the lower bound in Eq. (23), which suggests that minimizing this term provides a bound on I(Zθ;Y), the quantity used to evaluate the privacy leaking. Similarly, we have I(Zθ;X) ≤ I(Zθ;X,Y), which indicates that minimizing DKL(pθ(zθ|x,y)∥q(zθ)) will increase the data distortion. Thus, we can trade off data distortion and privacy leaking by adding a stronger constraint on the learned zθ, and the objective for the generative model becomes:

\min\, D_{KL}(p(x,y)\,\|\,q(x,y)) \quad \text{subject to} \quad I(Z_\theta;X,Y) \le \epsilon    (29)
where ϵ indicates the degree of privacy leaking. Rewriting Eq. (29) as a Lagrangian under the KKT conditions [20] and combining it with Eq. (22), we get:

loss = D_{KL}(p(x,y)\,\|\,q(x,y)) + \beta\,(I(Z_\theta;X,Y) - \epsilon)    (30)
\le -L(\theta;x,y) + \beta\,(I(Z_\theta;X,Y) - \epsilon)    (31)
= -L(\theta;x,y) + \beta\,I(Z_\theta;X,Y) - \beta\epsilon    (32)
= -L(\theta,\beta;x,y) - \beta\epsilon    (33)
Although it is not clear that the primal-dual gap is zero, minimizing the upper bound in Eq. (33) is always a sufficient condition for optimizing Eq. (29), so we can minimize Eq. (33) instead. However, most of the time we do not know the exact thresholds for privacy leaking and data distortion; instead, we have a check function that takes a privatizer as input and determines whether the data distortion and privacy leaking satisfy our requirements. Thus, we first set β to zero and then maximize the new variational evidence lower bound L(θ,β;x,y) to obtain the data transformer. Note that when β is set to zero, the model reduces to an autoencoder, which gives a lower bound on the data distortion. If this model cannot pass the data distortion test, our algorithm fails. Otherwise, we increase β until we obtain a model that passes both the privacy leaking and data distortion tests. This leads to our algorithm, which we refer to as the Privacy-Preserving Variational Autoencoder (PPVAE); see Fig. 2 for the detailed model structure and Algorithm 1 for the full procedure.

Algorithm 1: Privacy-Preserving Variational Autoencoder (PPVAE)
Input: a function check(Pθ(Z|X,Y)) → uPassed, pPassed; training data X, Y
Output: a transformer Pθ(Z|X,Y) that satisfies both the data distortion and privacy leaking requirements, or failure to find such a transformer

β ← 0
while True do
    initialize the weights of Pθ(Z|X,Y) and Qθ(X|Z,Y); lr ← learning rate
    while not converged do
        Xbatch, Ybatch ← sample(X, Y)
        µ, σ ← Pθ(Z|X,Y)(Xbatch, Ybatch)
        Zsample ← N(µ, σ)
        Xrec ← Qθ(X|Z,Y)(Ybatch, Zsample)
        L ← β DKL(N(µ, σ) ∥ N(0, I)) + ∥Xbatch − Xrec∥
        Pθ(Z|X,Y) ← Pθ − lr ∂L/∂Pθ
        Qθ(X|Z,Y) ← Qθ − lr ∂L/∂Qθ
    uPassed, pPassed ← check(Pθ(Z|X,Y))
    if uPassed == False then
        return Fail
    else if pPassed == True then
        return Pθ(Z|X,Y)
    else
        increase β
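To make the inner loop of Algorithm 1 concrete, here is a minimal sketch of one PPVAE gradient step in TensorFlow. This is our own hedged illustration, not the original implementation; the helper networks encoder_net and decoder_net are assumed to return the Gaussian parameters of Z and the reconstruction, respectively, as in Fig. 2, and the input is assumed to be a 4-D image batch.

```python
import tensorflow as tf

def ppvae_step(encoder_net, decoder_net, optimizer, x, y_onehot, beta):
    """One gradient step of the PPVAE objective: beta * KL + reconstruction error."""
    with tf.GradientTape() as tape:
        # Encoder P_theta(Z|X, Y): predicts mean and log-variance of the latent code.
        mu, logvar = encoder_net([x, y_onehot])
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * logvar) * eps          # reparameterization trick

        # Decoder Q_theta(X|Z, Y): reconstruct X from the sampled Z and the private label.
        x_rec = decoder_net([z, y_onehot])

        # KL( N(mu, sigma^2) || N(0, I) ) in closed form, averaged over the batch.
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1.0 + logvar - tf.square(mu) - tf.exp(logvar), axis=1))
        # Squared reconstruction error, summed over image dimensions (4-D input assumed).
        rec = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_rec), axis=[1, 2, 3]))

        loss = beta * kl + rec                        # cf. the loss L in Algorithm 1
    variables = encoder_net.trainable_variables + decoder_net.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```

The outer loop of Algorithm 1 then increases β and retrains until the check function reports that both the data distortion and the privacy leaking requirements are satisfied.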
6. Evaluation

Estimating mutual information for continuous and high-dimensional data is an ongoing research topic [4,19]. In order to evaluate our algorithm, we adopt the framework from [4] and use it to estimate the mutual information terms I(Y;Z) and I(X;Z). In addition, since our algorithm is designed for deep learning, we also evaluate the trained model by feeding the encrypted data to downstream deep learning models. In the remainder of this section, we detail the evaluation setting and evaluation methods.

6.1. Mutual information

We adopt the framework from [4]. Basically, it is built upon the following theorem:

Theorem 1 (Donsker-Varadhan Representation). Given two distributions P and Q, the KL divergence between them can be represented in the dual form:

D_{KL}(P\,\|\,Q) = \sup_{T:\Omega\to\mathbb{R}} E_P[T] - \log(E_Q[e^T])    (34)

where the supremum is taken over all functions T for which the two expectations are finite.
Since there is the following relationship between KL divergence and mutual information:

I(X;Y) = D_{KL}(P_{XY}\,\|\,P_X\otimes P_Y)    (35)

substituting Eq. (35) into Eq. (34), we obtain the following corollary:

Corollary 1.1.

I(X;Y) \ge \sup_{T:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}} E_{x,y\sim p(x,y)}[T(x,y)] - \log\big(E_{x,y\sim p(x)p(y)}[e^{T(x,y)}]\big)    (36)

Thus, by approximating T with a multi-layer neural network, we can estimate I(X;Y). Using this formula directly as the loss function, we can calculate the gradient by:

\hat{G}_B = E_B[\nabla_\theta T_\theta] - \frac{E_B[\nabla_\theta T_\theta\,e^{T_\theta}]}{E_B[e^{T_\theta}]}    (37)

However, for a minibatch, the second term biases the estimate of the true gradient. To overcome this problem, a simple workaround is to replace the estimate in the denominator with an exponential moving average, which reduces the bias.
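The following sketch shows how the estimator of Corollary 1.1 and the moving-average correction described above can be implemented. It is our own illustration rather than the reference code of [4]; the statistics network T and its sizes are assumptions, and x and y are assumed to be flat feature vectors.

```python
import tensorflow as tf

# Statistics network T(x, y) -> R, cf. Corollary 1.1 (sizes are placeholders).
t_net = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(1e-4)
ema_et = tf.Variable(1.0, trainable=False)  # moving average of E[e^T] over minibatches

def mine_step(x, y, decay=0.99):
    """One ascent step on the Donsker-Varadhan lower bound of I(X;Y), Eq. (36)."""
    y_shuffled = tf.random.shuffle(y)         # pairs (x, y_shuffled) follow p(x)p(y)
    with tf.GradientTape() as tape:
        t_joint = t_net(tf.concat([x, y], axis=1))            # T on joint samples
        t_marg = t_net(tf.concat([x, y_shuffled], axis=1))    # T on product samples
        et = tf.reduce_mean(tf.exp(t_marg))
        mi_lb = tf.reduce_mean(t_joint) - tf.math.log(et)     # the bound in Eq. (36)
        # Bias-corrected surrogate: the gradient uses the moving average of E[e^T]
        # in the denominator instead of the minibatch estimate, cf. Eq. (37).
        loss = -(tf.reduce_mean(t_joint) - et / tf.stop_gradient(ema_et))
    grads = tape.gradient(loss, t_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, t_net.trainable_variables))
    ema_et.assign(decay * ema_et + (1.0 - decay) * et)
    return mi_lb
```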
6.2. Downstream task accuracy

In practice, the encrypted version of the original data is usually used to train a deep learning model. Since the ultimate purpose of the constraint that maximizes the mutual information between the original and encrypted data is to preserve data utility, we also test our proposed algorithm's downstream performance. Specifically, given original data X, its privacy label Y, and its utility label Yutility, we first train an encryption model M using the given data and its corresponding labels. After we obtain M, we calculate Z = M(X) to encrypt the whole dataset. Finally, we use Z and Yutility from the training set to train a classifier from scratch and evaluate its accuracy on the test set. Note that during the final classification stage, the original data X and the encryption mapping M are not available. However, when we train our encryption model M, we consider both settings, in which the utility label is either visible or invisible.

7. Implementation

7.1. Keras

Keras [7] is a high-level neural network API written in Python. Strictly speaking, Keras is not a stand-alone neural network framework but a wrapper built on top of several other frameworks such as TensorFlow [1], Theano [2], and CNTK [26]. However, it has the advantage of being user-friendly, modular, and easily extensible. We use Keras to implement our algorithm because it provides a convenient way to construct a neural network and a high-level abstraction of the model architecture. Additionally, we implement the different algorithms within the same framework so that all experiments can be conducted through a similar interface. In this way, more code is reused, and a new algorithm only needs to provide a few configuration files.

7.2. TensorFlow

TensorFlow is an open-source software library for high-performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs) and from desktops to clusters of servers to mobile and edge devices. Originally developed by researchers and engineers from the Google Brain team at Google AI, TensorFlow comes with strong machine learning and deep learning support, and its flexible numerical computation core is used across many other scientific domains. We use TensorFlow as the backend of Keras due to its powerful support for many GPU-accelerated libraries. Further, it provides a convenient way to automatically move data between CPUs and GPUs, avoiding manual intervention.

7.3. Model architecture

We first illustrate our model architecture in detail. Overall, our model includes an encoder and a decoder. The raw image is first processed by the encoder to obtain the mean and variance of the latent variable Z, and the latent variable is then sampled from a normal distribution according to the calculated mean and variance. The latent variable is then concatenated with the one-hot private variable and is finally fed into a decoder to obtain the reconstructed image. The encoder is composed of 5 small blocks, each containing a conv2d(kernel=(3,3), padding=same, channels = 256/2^(5−i)), a batch normalization layer, and a dropout layer with the dropout rate set to 0.2, where i indicates the index of the block, i.e., the actual channel counts of the conv2d blocks are 16, 32, 64, 128, and 256, and padding=same indicates that the image size is preserved within each block. For the decoder, the latent code is first mapped to a (16 × 16 × channel) feature map through a Dense layer, followed by five blocks, each containing a deconvolution layer, a batch normalization layer, and a ReLU layer. We use the Adam optimizer [16] with a fixed learning rate of 0.0003 and train the data transformer for 100 epochs, testing our model at convergence.
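For concreteness, the following Keras sketch mirrors the architecture described above. It is our own reconstruction: the strides, the 256×256 input resolution, the encoder activations, the way the private label is injected into the encoder (following Fig. 2), and the output head are assumptions where the text does not pin them down.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 128      # z-dim; Section 8.3 studies values from 32 to 512
NUM_IDS = 6           # one-hot private label (FERG has six characters)

def encoder_block(channels):
    # One of the five encoder blocks of Section 7.3:
    # Conv2D(3x3, padding='same'), batch normalization, dropout(0.2).
    # The stride of 2 used for downsampling and the ReLU activation are assumptions.
    return tf.keras.Sequential([
        layers.Conv2D(channels, (3, 3), strides=2, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
    ])

def decoder_block(channels, strides):
    # One of the five decoder blocks: deconvolution, batch normalization, ReLU.
    return tf.keras.Sequential([
        layers.Conv2DTranspose(channels, (3, 3), strides=strides, padding="same"),
        layers.BatchNormalization(),
        layers.ReLU(),
    ])

# Encoder P_theta(Z|X, Y): image features are concatenated with the one-hot
# private label before predicting the Gaussian parameters of Z (cf. Fig. 2).
image_in = layers.Input(shape=(256, 256, 3))   # input resolution is an assumption
y_in = layers.Input(shape=(NUM_IDS,))
h = image_in
for ch in [16, 32, 64, 128, 256]:
    h = encoder_block(ch)(h)
h = layers.Flatten()(h)
h = layers.Concatenate()([h, y_in])
mu = layers.Dense(LATENT_DIM)(h)
logvar = layers.Dense(LATENT_DIM)(h)
encoder_net = tf.keras.Model([image_in, y_in], [mu, logvar])

# Decoder Q_theta(X|Z, Y): the latent code, concatenated with Y, is mapped to a
# 16x16 feature map by a Dense layer and then upsampled by the five blocks.
z_in = layers.Input(shape=(LATENT_DIM,))
y_in2 = layers.Input(shape=(NUM_IDS,))
g = layers.Concatenate()([z_in, y_in2])
g = layers.Dense(16 * 16 * 32, activation="relu")(g)
g = layers.Reshape((16, 16, 32))(g)
for ch, s in [(128, 2), (64, 2), (32, 2), (16, 2), (16, 1)]:
    g = decoder_block(ch, s)(g)
x_rec = layers.Conv2D(3, (3, 3), padding="same", activation="sigmoid")(g)  # assumed output head
decoder_net = tf.keras.Model([z_in, y_in2], x_rec)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0003)  # as stated in Section 7.3
```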
Table 1
The best utility accuracy achieved under different privacy tolerances.

Alg              No restriction   0.6     0.5     0.4     0.3     0.2
Random guess     0.143            –       –       –       –       –
PPRL-VGAN [6]    1.0              0.425   0.389   0.343   0.285   0.247
AdAE [21]        –                0.407   0.309   0.284   0.233   0.215
VAE [17]         –                0.374   0.363   0.327   0.305   0.271
Ours             –                0.903   0.889   0.837   0.787   0.660

Table 2
Privacy and utility accuracy for the different algorithms when the models have converged.

Alg              Privacy   Utility
Random guess     0.167     0.147
PPRL-VGAN [6]    0.283     0.951
Ours             0.247     0.973
8. Experiment

8.1. Evaluation by downstream tasks

For evaluation using downstream tasks, we choose the Facial Expression Research Group Database (FERG-DB) [3], as it contains multiple labels for a single entry (facial expression and identity). The database contains 55,767 annotated face images of six stylized characters, which is large enough to test our algorithm. We split the database into a training set and a test set at a 0.85:0.15 ratio, train the data transformation model using solely data from the training set, and perform our evaluations on the test set. For FERG, we treat identity as the private label and expression as the utility label. We first vary the privacy tolerance and the dimension of the latent variable to obtain transformed data Z. Then, we train a multilayer perceptron (MLP) for 20 epochs to evaluate both privacy and utility accuracy. The results are shown in Table 1. It is clear that our algorithm outperforms previous adversarial-training-based algorithms by a large margin. We hypothesize that this is because the restriction added by the reconstruction error strongly influences the discriminator, making the learned mapping P(Z|X) either become a constant function or an autoencoder-like mapping that entirely ignores the privacy-preserving target.

To evaluate the algorithms when the utility label is available during privatizer training, we slightly modify the models. For the adversarial-training-based models, we replace the original decoder with an MLP containing 3 Dense layers of the same dimension as the latent variable, each followed by a ReLU layer. In our algorithm, we add the same MLP on top of the original decoder. We compare our algorithm with that presented in [6], which is based on adversarial training and the conditional generative adversarial network (CGAN) [24]. From Table 2, we can see that our simple algorithm, without special hyperparameter tuning, outperforms the previous algorithm when the utility label is used during privatizer training. This suggests that the adversarial training used in [6] might not be essential; our algorithm, even without a classifier directly on top of the transformed data Z, still outperforms the other adversarial-training-based algorithms.
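A sketch of this downstream evaluation protocol is given below (our own illustration; it reuses the encoder_net interface from the Section 7.3 sketch, and the classifier widths and the seven FERG expression classes are assumptions, the latter inferred from the random-guess accuracy of 0.143 in Table 1):

```python
import tensorflow as tf

def downstream_accuracy(encoder_net, x_train, y_train_onehot, util_train,
                        x_test, y_test_onehot, util_test, num_classes=7):
    """Train an MLP on the transformed data Z and report its test accuracy."""
    # Encrypt the whole dataset with the frozen privatizer (use the mean of Z).
    z_train, _ = encoder_net.predict([x_train, y_train_onehot], batch_size=256)
    z_test, _ = encoder_net.predict([x_test, y_test_onehot], batch_size=256)

    # MLP classifier trained from scratch for 20 epochs, as in Section 8.1;
    # the two hidden layers of latent-dimension width are our assumption.
    clf = tf.keras.Sequential([
        tf.keras.layers.Dense(z_train.shape[1], activation="relu"),
        tf.keras.layers.Dense(z_train.shape[1], activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
    clf.fit(z_train, util_train, epochs=20, batch_size=128, verbose=0)
    _, acc = clf.evaluate(z_test, util_test, verbose=0)
    return acc
```

The same routine can be used with the private labels in place of the utility labels to measure privacy accuracy.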
Table 3
I(X;Z) estimation using different proxies. Results for NDM and MS-SSIM are the model loss × 1000 (smaller is better). Results for MINE are direct estimates of the mutual information (larger is better).

Model            MINE    NDM     MS-SSIM
PPRL-VGAN [6]    6.74    7.22    17.97
AdAE [21]        6.62    7.24    18.76
Ours             6.78    6.79    13.65
8.2. Evaluation by mutual information and other proxies

In this set of experiments, we first train the privatizer using the different algorithms with hyperparameters that achieve a privacy leaking accuracy of less than 0.2. Then, we use the trained models to obtain the transformed data Z and perform different experiments to evaluate the dependence between X and Z. Following the work of [14], we evaluate this dependence with the following metrics:

1. MS-SSIM [29]: an additional decoder is used to predict X from Z, with |Xrec − X|² as the reconstruction error. This is a proxy for the total MI between the input and the representation and indicates the amount of encoded pixel-level information.
2. Mutual information neural estimation (MINE) [4]: the algorithm described in Section 6.1, which uses the dual form of the KL divergence and turns MI estimation into an optimization problem. This directly evaluates the privacy funnel.
3. Neural dependency measure (NDM) [5]: NDM trains a discriminator between Z and batch-wise shuffled data Zshuffled to evaluate DKL(Z∥Zshuffled).

For MS-SSIM, we add a decoder with a structure similar to our proposed decoder architecture in Section 7, except that the input is not the concatenated vector but only the transformed data Z. For MINE and NDM, we find that the models trained with these two algorithms are hard to converge, so we decrease the learning rate and use a larger batch size and more training epochs. Our results are shown in Table 3. Across the different algorithms, the results are consistent with those obtained by evaluating downstream tasks. Since MINE is a direct way to evaluate mutual information, we have reason to believe that our algorithm outperforms the others with respect to the privacy funnel.

8.3. Hyperparameter study

In this subsection, we study the influence of hyperparameters on our model. Specifically, we study the privacy-utility trade-off achieved by models with different dimensions of the encoded variable (z-dim) and different β. All experiments are performed under the setting of Section 8.1. Unless otherwise specified, the other training parameters are identical to Section 7. To assess the influence of z-dim, we set β to 300 and the other hyperparameters as in Section 7. Then, we vary z-dim from 32 to 512 and train for 100 epochs. We evaluate the privacy-utility trade-off at the end of each epoch and finally collect the best utility accuracy achieved for different privacy tolerances. Fig. 3 shows that increasing z-dim generally increases utility; however, for a large z-dim (e.g., 512), the model fails to achieve a privacy disclosure of less than 0.5. This might be because the increase in z-dim enlarges the entropy of the latent code, while our algorithm only provides a guarantee on H(Z|Y); thus, the term I(Z;Y) = H(Z) − H(Z|Y) will generally increase. Similarly, we study models with different β. We set z-dim to 128 and vary β from 30 to 500. Fig. 4 shows that β follows a similar trend to z-dim. Although a small β seems to achieve
a wider range of privacy disclosures, the utility distortion also increases compared to models with a higher β.

Fig. 3. The influence of z-dim for privacy disclosure tolerances varying from 0.25 to 0.5. The models are trained for 100 epochs with all other hyperparameters identical; specifically, β is set to 300. Increasing z-dim can help the model retain more utility, while showing more limitation at small privacy tolerances.

Fig. 4. The influence of β for privacy disclosure tolerances varying from 0.25 to 0.5. The models are trained for 100 epochs with all other hyperparameters identical; specifically, z-dim is set to 128. According to the bar chart, a small β gives the model a better chance of achieving a small privacy disclosure.

9. Conclusion and future work

In this article, we consider the problem of learning a data representation that trades off between privacy leaking and data distortion, which is helpful for data publishing. Specifically, we consider the setting in which the utility variable is either accessible or inaccessible when training the data transformer, and we explore the case in which the private variable can be continuous and high-dimensional. Our proposed simple but effective model outperforms previous adversarial-learning-based models, while not requiring an alternating training strategy, thus making the model more stable and less hyperparameter sensitive. However, since our model does not yet retain the optimal amount of information for a given privacy leaking tolerance, there is still room for improvement. Combining GAN [10] with our method might be a potential improvement; however, how to stabilize training and weaken the restriction that the reconstruction loss places on the latent representation also requires exploration.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.jpdc.2019.09.010.

Acknowledgments

This research was supported by Australian Research Council Projects FL-170100117, DP-180103424, LP-150100671. The authors would like to thank all the reviewers for their constructive review comments.

References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015, Software available from tensorflow.org. http://tensorflow.org/. [2] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, et al., Theano: A Python framework for fast computation of mathematical expressions, arXiv preprint, 2016. [3] D. Aneja, A. Colburn, G. Faigin, L. Shapiro, B. Mones, Modeling stylized character expressions via deep learning, in: Asian Conference on Computer Vision, Springer, 2016, pp. 136–153. [4] I. Belghazi, S. Rajeswar, A. Baratin, R.D. Hjelm, A. Courville, MINE: mutual information neural estimation, arXiv preprint arXiv:1801.04062, 2018. [5] P. Brakel, Y. Bengio, Learning independent features with adversarial nets for non-linear ICA, arXiv preprint arXiv:1710.05050, 2017. [6] J. Chen, J. Konrad, P. Ishwar, VGAN-based image representation learning for privacy-preserving facial expression recognition, arXiv preprint arXiv: 1803.07100, 2018. [7] F. Chollet, et al., Keras, 2015, https://keras.io. [8] J.-c. Chou, C.-c. Yeh, H.-y. Lee, L.-s. Lee, Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations, arXiv preprint arXiv:1804.02812, 2018. [9] P. Cuff, L. Yu, Differential privacy as a mutual information constraint, in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ACM, 2016, pp. 43–54. [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680. [11] A. Gretton, K.M. Borgwardt, M. Rasch, B. Schölkopf, A.J. Smola, A kernel method for the two-sample-problem, in: Advances in Neural Information Processing Systems, 2007, pp. 513–520. [12] J. Hamm, Minimax filter: learning to preserve privacy from inference attacks, J. Mach. Learn. Res. 18 (1) (2017) 4704–4734. [13] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, A. Lerchner, beta-vae: Learning basic visual concepts with a constrained variational framework, in: International Conference on Learning Representations, vol. 3, 2017. [14] R.D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, Y. Bengio, Learning deep representations by mutual information estimation and maximization, arXiv preprint arXiv:1808.06670, 2018. [15] C. Huang, P. Kairouz, X. Chen, L. Sankar, R. Rajagopal, Context-aware generative adversarial privacy, Entropy 19 (12) (2017) 656. [16] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014. [17] D.P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114, 2013. [18] J.B. Kinney, G.S. Atwal, Equitability, mutual information, and the maximal information coefficient, Proc. Natl. Acad. Sci. (2014) 201309933. [19] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69 (6) (2004) 066138. [20] H.W. Kuhn, A.W. 
Tucker, Nonlinear programming, in: Traces and Emergence of Nonlinear Programming, Springer, 2014, pp. 247–258. [21] Y. Li, T. Baldwin, T. Cohn, Towards robust and privacy-preserving text representations, arXiv preprint arXiv:1805.06093, 2018. [22] C. Louizos, K. Swersky, Y. Li, M. Welling, R. Zemel, The variational fair autoencoder, arXiv preprint arXiv:1511.00830, 2015.
L. Nan and D. Tao / Journal of Parallel and Distributed Computing 137 (2020) 17–25 [23] A. Makhdoumi, S. Salamatian, N. Fawaz, M. Médard, From the information bottleneck to the privacy funnel, in: Information Theory Workshop, ITW, 2014 IEEE, IEEE, 2014, pp. 501–505. [24] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784, 2014. [25] F. du Pin Calmon, N. Fawaz, Privacy against statistical inference, in: Communication, Control, and Computing, Allerton, 2012 50th Annual Allerton Conference on, IEEE, 2012, pp. 1401–1408. [26] F. Seide, A. Agarwal, CNTK: Microsoft’s open-source deep-learning toolkit, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, p. 2135. [27] J. Shen, R. Pang, R.J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., Natural tts synthesis by conditioning wavenet on mel spectrogram predictions, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, IEEE, 2018, pp. 4779–4783. [28] Y. Wang, J. Lee, D. Kifer, Differentially private hypothesis testing, revisited, ArXiv e-prints, 2015. [29] Z. Wang, E.P. Simoncelli, A.C. Bovik, Multiscale structural similarity for image quality assessment, in: The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2, Ieee, 2003, pp. 1398–1402. [30] W. Wang, L. Ying, J. Zhang, On the relation between identifiability, differential privacy, and mutual-information privacy, IEEE Trans. Inform. Theory 62 (9) (2016) 5018–5029. [31] K. Xu, T. Cao, S. Shah, C. Maung, H. Schweitzer, Cleaning the null space: A privacy mechanism for predictors, in: AAAI, 2017, pp. 2789–2795. [32] Y. Zhang, M. Ozay, Z. Sun, T. Okatani, Information potential auto-encoders, CoRR abs/1706.04635 (2017) http://arxiv.org/abs/1706.04635.
Lihao Nan received his B.S. degree in Computer Science and Technology from Hangzhou Dianzi University, Zhejiang, China, in 2017. He is currently pursuing the M.Phil. degree with the Computer Science Department, University of Sydney. His research focuses on data security, data encryption and machine learning. His previous research on data privacy and data security has been published in IEEE DSC.

Dacheng Tao (F'15) is Professor of Computer Science and ARC Laureate Fellow in the School of Computer Science and the Faculty of Engineering and Information Technologies, and the Inaugural Director of the UBTECH Sydney Artificial Intelligence Centre, at the University of Sydney. He mainly applies statistics and mathematics to Artificial Intelligence and Data Science. His research results have been expounded in one monograph and 200+ publications at prestigious journals and prominent conferences, such as IEEE T-PAMI, T-IP, T-NNLS, T-CYB, IJCV, JMLR, NIPS, ICML, CVPR, ICCV, ECCV, ICDM, and ACM SIGKDD, with several best paper awards, such as the best theory/algorithm paper runner-up award at IEEE ICDM'07, the best student paper award at IEEE ICDM'13, the 2014 ICDM 10-year highest-impact paper award, the 2017 IEEE Signal Processing Society Best Paper Award, and the distinguished paper award at the 2018 IJCAI. He received the 2015 Australian Scopus-Eureka Prize and the 2018 IEEE ICDM Research Contributions Award. He is a Fellow of the Australian Academy of Science, AAAS, IEEE, IAPR, OSA and SPIE.