Variational Autoencoders for Missing Data Imputation with Application to a Simulated Milling Circuit


Proceedings of the 5th IFAC Workshop on Mining, Mineral and Metal Processing, Shanghai, China, August 23-25, 2018


IFAC PapersOnLine 51-21 (2018) 141–146


John T. McCoy*, Steve Kroon+, Lidia Auret*

* Department of Process Engineering, Stellenbosch University, Private Bag X1, Matieland, 7602, South Africa (e-mail: [email protected]).
+ Computer Science Division, Stellenbosch University, Private Bag X1, Matieland, 7602, South Africa.

Abstract: Missing data values and differing sampling rates, particularly for important parameters such as particle size and stream composition, are a common problem in minerals processing plants. Missing data imputation is used to avoid information loss (due to downsampling or discarding incomplete records). A recent deep-learning technique, variational autoencoders (VAEs), has been used for missing data imputation in image data, and was compared here to imputation by mean replacement and by principal component analysis (PCA) imputation. The techniques were compared using a synthetic, nonlinear dataset, and a simulated milling circuit dataset, which included process disturbances, measurement noise, and feedback control. Each dataset was corrupted with missing values in 20% of records (lightly corrupted) and in 90% of records (heavily corrupted). For both lightly and heavily corrupted datasets, the root mean squared error of prediction for VAE imputation was lower than for the traditional methods. Possibilities for the extension of missing data imputation to inferential sensing are discussed.

Keywords: Missing Data Imputation, Machine Learning, Variational Autoencoder

1 INTRODUCTION

Large quantities of historical data are available for most industrial operations, which may contain valuable information to be used for developing monitoring methods, training soft sensors, and tuning controllers (He and Wang, 2017; Qin, 2014). However, industrial data records frequently contain missing values (due to equipment failures, especially in harsh minerals processing plants), and have differing sampling rates for variables, due to online and offline measurements, sampling delays or equipment limitations (Bergh, 2010).

The result is a reduction in the amount of information available, as data is frequently downsampled to match the measurement with the lowest frequency, or records with missing values are discarded (Kadlec et al., 2009). Missing value imputation is a strategy to replace missing values with plausible values, in order to reduce data loss (Kiers, 1997; Sedghi et al., 2017; Zeng et al., 2010). Imputation of missing values could result in more uptime for controllers, soft sensors and monitoring methods, and upsampling of low frequency measurements could result in faster control, improving plant performance.

Variational autoencoders (VAEs) are a recently proposed deep-learning-based generative model, which can be used for unsupervised learning of the distribution of a dataset, and for the generation of further samples which closely resemble the original data (Kingma and Welling, 2013; Rezende et al., 2014). These samples could be used for testing soft sensors, controllers and monitoring methods. VAEs have also been used for missing data imputation in image-based data, suggesting the potential for application to process data.

This paper compares missing data imputation by traditional methods with imputation by VAE. Section 2 gives the background for imputation and VAEs, and Section 3 describes the datasets and imputation methods. The results are shown in Section 4, followed by conclusions in Section 5.

2 BACKGROUND

2.1 Missing data imputation approaches

The three most common approaches to deal with missing values are deletion of records with missing values, single imputation (such as replacement by the mean) and multiple imputation, which makes use of iterative model-based methods (Kadlec et al., 2009). A common multiple imputation method is maximum likelihood iterative principal component analysis (PCA) imputation (Dray and Josse, 2015; Kiers, 1997). In the process/chemical engineering field, the method has been applied to imputation for a simulated flow network (Imtiaz et al., 2004).

A number of other machine learning methods have been used for model-based multiple imputation in process engineering applications. Supervised decision trees have been successfully applied as the regression model for multiple imputation of industrial maintenance databases (Lakshminarayan et al., 1999) and of missing values in ironmaking blast furnaces (Zeng and Gao, 2009). Mean replacement, PCA imputation, k-nearest neighbours regression imputation and neural network imputation were compared in the context of fault diagnosis of the Tennessee Eastman benchmark problem and an industrial gas sweetening plant (Askarian et al., 2016).

2.2 Variational autoencoders



A variational autoencoder (VAE) makes use of a similar architecture to a standard autoencoder, which is shown in Fig. 1a). Autoencoders learn a latent space representation of input data, typically for dimensionality reduction and feature extraction (Hinton and Salakhutdinov, 2006; Yan et al., 2017), or to reconstruct a denoised version of the input (Gondara and Wang, 2017; Vincent et al., 2008).
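For reference, a minimal standard autoencoder of the kind in Fig. 1a) might be written as below. This is an illustrative PyTorch sketch only; the layer sizes are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal standard autoencoder (cf. Fig. 1a): a deterministic encoder
    compresses the input to a low-dimensional latent code, and a decoder
    reconstructs the input from that code."""
    def __init__(self, n_features=4, n_latent=2, n_hidden=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_latent),      # latent space representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_features),    # reconstruction of the input
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimises a reconstruction loss, e.g. nn.MSELoss(), between
# the input x and its reconstruction.
```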

However, whereas an autoencoder learns a latent space representation of the data, a VAE learns a distribution in that latent space. The result is that samples can be drawn from the latent space distribution, input to the decoder network, and new samples of data generated. The underlying assumption of the VAE is that the data X is generated by some underlying distribution $p(X)$, and that this unknown distribution can be represented by the latent variable Z, which is itself generated by some distribution $p(Z)$. The joint distribution $p(X, Z)$ can then be represented as:

$p(X, Z) = p_\theta(X|Z)\,p(Z)$   (1)

In equation (1), the joint distribution can be generated by sampling from the distribution of Z, $p(Z)$, also known as the prior, and the distribution of X given Z, also known as the likelihood $p_\theta(X|Z)$, which is parameterised by $\theta$. The likelihood function, with parameters $\theta$, can be represented by a neural network, which is analogous to the decoder network in a standard autoencoder. The parameters $\theta$ must be learnt from the data, as will be described below.

The prior, $p(Z)$, is typically chosen to be the multidimensional zero-centred normal distribution with unit variance (Kingma and Welling, 2013), and thus has no parameters to learn.

The marginal likelihood of the model, $p_\theta(X)$, of the unknown distribution $p(X)$ can be obtained by integrating equation (1), specifically the product of the likelihood and the prior, over Z. The marginal likelihood (or evidence) specifies how likely the data X is under the model with parameters $\theta$:

$p_\theta(X) = \int p(X, Z)\,dZ = \int p_\theta(X|Z)\,p(Z)\,dZ$   (2)

From Bayes' rule, the marginal likelihood $p_\theta(X)$ can be related to the posterior distribution $p_\theta(Z|X)$:

$p_\theta(Z|X) = \dfrac{p_\theta(X|Z)\,p(Z)}{p_\theta(X)}$   (3)

The posterior is the distribution of the latent variable Z, given observations of the data X, but it cannot be determined analytically. The posterior is therefore approximated by $q_\phi(Z|X)$, typically a neural network with parameters $\phi$, which is analogous to the encoder network in a standard autoencoder. The best representation (in other words, the lowest reconstruction error) of the data X will be obtained by maximising the logarithm of the marginal likelihood (the log-evidence), $\ln p_\theta(X)$, by changing the parameters of the encoder network, $\phi$, and of the decoder network, $\theta$. It can be shown (refer to Kingma and Welling (2013) for details) that the log-evidence can be expressed in terms of the "evidence lower bound", or ELBO:

$\ln p_\theta(X) = \mathrm{ELBO} + \mathrm{KL}\left[q_\phi(Z|X)\,\|\,p_\theta(Z|X)\right]$   (4)

In equation (4), the second term on the right hand side is the Kullback-Leibler (KL) divergence, which describes the agreement between two distributions, in this case the approximation to the posterior distribution (the encoder network), $q_\phi(Z|X)$, and the true posterior distribution, $p_\theta(Z|X)$. The KL divergence is zero if the two distributions are identical (in other words, if the approximation of the posterior is perfect), and positive for nonidentical distributions. Since the KL divergence is always nonnegative, the log-evidence is always greater than or equal to the ELBO:

$\ln p_\theta(X) \geq \mathrm{ELBO}$   (5)

Thus, the log-evidence can be driven towards its maximum value by finding the parameters $\phi$ and $\theta$ which maximise the ELBO, defined in equation (6):

$\mathrm{ELBO} = \mathbb{E}_{q_\phi(Z|X)}\left[\ln p_\theta(X|Z)\right] - \mathrm{KL}\left[q_\phi(Z|X)\,\|\,p(Z)\right]$   (6)

In equation (6), the first term is the expectation of the log-likelihood of the decoder network, given the encoder network's output. Maximising this term corresponds to minimising the reconstruction error of a standard autoencoder.

The second term in equation (6) is the KL divergence between the approximated posterior distribution (the encoder network), $q_\phi(Z|X)$, and the prior distribution of the latent variable Z, $p(Z)$. Minimising this term brings the approximate posterior $q_\phi(Z|X)$ closer to the prior $p(Z)$, which has the effect of regularisation, preventing overfitting.
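For the Gaussian choices used in this work (a diagonal Gaussian approximate posterior and a standard normal prior), the KL term in equation (6) has a closed form, and the expectation term can be estimated by Monte Carlo sampling. A minimal numpy sketch of these two quantities (the function names are illustrative):

```python
import numpy as np

def kl_to_std_normal(mu, logvar):
    """Closed-form KL[q_phi(Z|X) || p(Z)] for a diagonal Gaussian
    q = N(mu, diag(exp(logvar))) against the prior p(Z) = N(0, I):
    0.5 * sum(mu^2 + sigma^2 - ln(sigma^2) - 1)."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0, axis=-1)

def gaussian_log_likelihood(x, mu, logvar):
    """ln p_theta(X|Z) for a diagonal Gaussian decoder output."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + logvar + (x - mu) ** 2 / np.exp(logvar), axis=-1
    )

# A one-sample Monte Carlo estimate of the ELBO in equation (6), for a
# datapoint x with z drawn from q_phi(Z|x):
#   elbo = gaussian_log_likelihood(x, mu_x, logvar_x) \
#          - kl_to_std_normal(mu_z, logvar_z)
```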

Fig. 1. Structure of autoencoders: a) Standard autoencoder; b) Variational autoencoder during training and imputation; c) Generative variational autoencoder.


2.3 VAE Structure

The structure of the VAE used here is shown in Fig. 1b). Given data X, the encoder network outputs the mean and variance of a multidimensional Gaussian which describes the latent space distribution, or approximate posterior distribution, $q_\phi(Z|X)$. Sampling from this distribution generates values for Z.

Given Z, the decoder network outputs the mean and variance of the multidimensional Gaussian describing the reconstructed data space, the likelihood $p_\theta(X|Z)$. Sampling from this distribution generates $\hat{X}$.

Due to the definition of the evidence lower bound in equation (6), training the VAE forces the encoder network's output distribution close to the prior distribution of Z, $p(Z)$, which is defined as $\mathcal{N}(0, I)$. Thus, it is possible to discard the encoder network of the VAE and generate new data by drawing a sample from $\mathcal{N}(0, I)$ and feeding it into the decoder network, as shown in Fig. 1c).
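A sketch of this generative use of the decoder, assuming a hypothetical `decoder` callable that maps latent vectors to the mean and log-variance of $p_\theta(X|Z)$:

```python
import numpy as np

def generate(decoder, n_samples, n_latent=10, seed=0):
    """Generate new data with the trained decoder only (Fig. 1c)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, n_latent))  # z ~ N(0, I), the prior p(Z)
    mu_x, logvar_x = decoder(z)                     # parameters of p_theta(X|Z)
    # Draw a sample from the reconstructed data distribution:
    return mu_x + rng.standard_normal(mu_x.shape) * np.exp(0.5 * logvar_x)
```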

3 METHODOLOGY

In order to demonstrate an application of variational autoencoders (VAEs), missing data imputation by VAE will be compared to single imputation and to multiple imputation by principal component analysis, using a synthetic dataset and a simulated milling circuit dataset.

3.1 Single imputation

Single imputation is achieved by the replacement of missing values with the mean of the observed values for each variable.

3.2 Multiple imputation by PCA

Multiple imputation using PCA is achieved using the approach summarised below (Dray and Josse, 2015; Kiers, 1997):

1. Replace missing values with random values;
2. Build a model of the data:
   a. Standardise the data;
   b. Find the principal components and eigenvalues by singular value decomposition;
   c. Find the retained variance from the eigenvalues;
   d. Determine the number of principal components which retains the required variance;
3. Reconstruct missing and observed values with the model:
   a. Reconstruct the data using the retained principal components;
   b. Convert the standardised data back to the original data space using the standardisation parameters from step 2.a;
4. Replace the missing values with the reconstructed values, leaving the observed values unchanged;
5. Compute the reconstruction error of the observed values;
6. If the reconstruction error is below a specified tolerance, or the iteration limit has been reached, end;
7. Otherwise, return to step 2.

Three hyperparameters must be defined for PCA imputation: the maximum number of iterations, the tolerance on reconstruction error, and the number of principal components to retain (specified here in terms of explained variance). A code sketch of the procedure follows.
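A numpy sketch of the iterative PCA imputation above. The convergence test on the change in observed-value reconstruction error is one reasonable reading of steps 5-6, not necessarily the authors' exact implementation:

```python
import numpy as np

def pca_impute(X, var_retained=0.99, max_iter=1000, tol=1e-6, seed=0):
    """Iterative PCA imputation (steps 1-7 above). X may contain NaNs."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    X_filled = X.copy()
    X_filled[miss] = rng.standard_normal(miss.sum())         # step 1
    prev_err = np.inf
    for _ in range(max_iter):
        mu, sd = X_filled.mean(axis=0), X_filled.std(axis=0)  # step 2a
        sd = np.where(sd == 0.0, 1.0, sd)
        Z = (X_filled - mu) / sd
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)      # step 2b
        explained = s**2 / np.sum(s**2)                       # step 2c
        k = int(np.searchsorted(np.cumsum(explained), var_retained)) + 1  # step 2d
        Z_hat = (U[:, :k] * s[:k]) @ Vt[:k]                   # step 3a
        X_hat = Z_hat * sd + mu                               # step 3b
        X_filled[miss] = X_hat[miss]                          # step 4
        err = np.sqrt(np.mean((X_hat[~miss] - X[~miss]) ** 2))  # step 5
        if abs(prev_err - err) < tol:                         # step 6
            break
        prev_err = err                                        # step 7
    return X_filled
```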

3.3 Multiple imputation by VAE

Missing data imputation is performed with a trained VAE using a Markov chain Monte Carlo method, which has been shown to converge to the true marginal distribution of the missing values, given the observed values (Rezende et al., 2014). In practice, the imputation procedure is similar to that for PCA:

1. Replace missing values in the data with random values;
2. Input the data to the trained VAE:
   a. Sample from the latent variable distribution (the output of the encoder) to generate Z, given X;
   b. Sample from the reconstructed data distribution (the output of the decoder) to generate $\hat{X}$, given Z;
3. Replace the missing values with the reconstructed values, leaving the observed values unchanged;
4. Compute the reconstruction error of the observed values;
5. If the reconstruction error is below a specified tolerance, or the iteration limit has been reached, end;
6. Otherwise, return to step 2.

The hyperparameters which must be specified include: network architecture and neuron activations (for the encoder and decoder networks), training algorithm (including optimiser and learning rate), the dimensionality of the latent space, the maximum training and imputation iterations, batch size, and the tolerance on reconstruction error. A code sketch of the imputation loop follows.
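A sketch of the imputation loop, with `encode_sample` and `decode_sample` as hypothetical interfaces to a trained VAE (each drawing one sample from the corresponding distribution):

```python
import numpy as np

def vae_impute(X, encode_sample, decode_sample, max_iter=25, tol=1e-6, seed=0):
    """MCMC-style imputation with a trained VAE (steps 1-6 above).
    encode_sample(X) draws Z ~ q_phi(Z|X); decode_sample(Z) draws
    X_hat ~ p_theta(X|Z). Both are assumed, illustrative interfaces."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    X_filled = np.where(miss, rng.standard_normal(X.shape), X)  # step 1
    prev_err = np.inf
    for _ in range(max_iter):
        Z = encode_sample(X_filled)                             # step 2a
        X_hat = decode_sample(Z)                                # step 2b
        X_filled[miss] = X_hat[miss]                            # step 3
        err = np.sqrt(np.mean((X_hat[~miss] - X[~miss]) ** 2))  # step 4
        if abs(prev_err - err) < tol:                           # step 5
            break
        prev_err = err                                          # step 6
    return X_filled
```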

3.4 Datasets

Missing data imputation is performed on a synthetic, nonlinear dataset, and on a simulated milling circuit dataset.

Synthetic dataset

The synthetic dataset consists of four variables, parameterised by a single variable. The function is adapted from the two-dimensional synthetic dataset used in an open source tutorial (available at https://github.com/oduerr/dl_tutorial/blob/master/tensorflow/vae/vae_demo-2D.ipynb). The function generates samples of the single parameter z from the uniform distribution between 0 and 1:

$z \sim \mathcal{U}(0, 1)$   (7)

Further parameters are derived from z:

$r = \sqrt{z}, \quad \eta = 0.25\,\pi\,z$   (8)

The mean values of each of the four variables are derived from these parameters as follows:

$x_{1,\mu} = r\cos\eta, \quad x_{2,\mu} = 5r\sin\eta, \quad x_{3,\mu} = 2r\tan\eta, \quad x_{4,\mu} = 3e^{-\eta}$   (9)

The four variables are then sampled from Gaussian distributions as follows:

$x_1 \sim \mathcal{N}(\mu = x_{1,\mu},\ \sigma^2 = 0.07\,z^2)$
$x_2 \sim \mathcal{N}(\mu = x_{2,\mu},\ \sigma^2 = 0.05\,z^2)$   (10)
$x_3 \sim \mathcal{N}(\mu = x_{3,\mu},\ \sigma^2 = 0.06\,z^2)$
$x_4 \sim \mathcal{N}(\mu = x_{4,\mu},\ \sigma^2 = 0.04\,z^2)$

10 000 datapoints were generated using the equations above.
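Equations (7)-(10) translate directly into a few lines of numpy; this sketch reproduces the generative procedure described above:

```python
import numpy as np

def synthetic_dataset(n=10_000, seed=0):
    """Generate the four-variable synthetic dataset of equations (7)-(10)."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, n)               # eq. (7)
    r, eta = np.sqrt(z), 0.25 * np.pi * z      # eq. (8)
    mu = np.column_stack([                     # eq. (9): per-variable means
        r * np.cos(eta),
        5 * r * np.sin(eta),
        2 * r * np.tan(eta),
        3 * np.exp(-eta),
    ])
    # eq. (10): variances 0.07, 0.05, 0.06, 0.04 times z^2
    var = np.array([0.07, 0.05, 0.06, 0.04]) * z[:, None] ** 2
    return mu + rng.standard_normal(mu.shape) * np.sqrt(var)

X = synthetic_dataset()   # 10 000 datapoints, as in the paper
```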

Simulated milling circuit dataset

The milling circuit data is based on the simulation described by le Roux et al. (2013), with disturbances added by Wakefield et al. (2018), and is shown in Fig. 2. The milling circuit has three control loops, and the circuit responds to stochastic variations (representing disturbances) in the simulated ore size distribution and grindability, as well as incorporating measurement error. The original data from the simulation consists of samples every 30 seconds, for 104 days. The data has been subsampled to one sample every 15 minutes, in order to introduce some independence between samples (a key assumption in most data imputation approaches). This resulted in a set of 9985 datapoints.

Fig. 2. Simulated milling circuit dataset.

3.5 Missing values

Missing data was introduced to the data in two manners, resulting in a lightly corrupted dataset and a heavily corrupted dataset. The uncorrupted datasets were used to calculate the accuracy of the imputation methods.

The lightly corrupted dataset was created by randomly selecting 20% of the rows of a dataset, and replacing a randomly selected variable in each of these rows with NaN (not a number). Further missing values were introduced by randomly selecting 50% of the rows of the lightly corrupted datasets, and replacing a randomly selected variable in each of these rows with NaN. This procedure was repeated three times, resulting in heavily corrupted datasets. As shown in Table 1, less than 10% of each dataset was complete (and therefore available for training), about 80% of the data had one or two missing values, and about 10% of the data had three or four missing values.
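A sketch of the corruption procedure, applied to X as generated above. The text leaves some details open (for example, whether a pass may hit a variable that is already missing), so this is one plausible reading:

```python
import numpy as np

def corrupt(X, frac, seed=0):
    """Set one randomly chosen variable to NaN in a random fraction of rows."""
    rng = np.random.default_rng(seed)
    Xc = np.asarray(X, dtype=float).copy()
    rows = rng.choice(len(Xc), size=int(frac * len(Xc)), replace=False)
    cols = rng.integers(0, Xc.shape[1], size=len(rows))
    Xc[rows, cols] = np.nan
    return Xc

lightly_corrupted = corrupt(X, 0.20)    # one NaN in 20% of rows
heavily_corrupted = lightly_corrupted
for i in range(3):                      # three further passes over 50% of rows
    heavily_corrupted = corrupt(heavily_corrupted, 0.50, seed=i + 1)
```

With three 50% passes, the expected fraction of complete rows is roughly 0.8 x 0.5^3 = 10%, consistent with the counts in Table 1.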

Table 1. Heavily corrupted synthetic and simulated milling circuit datasets: complete records and records with missing values

| Missing values per row | Rows in synthetic dataset | Rows in milling circuit dataset |
|------------------------|---------------------------|---------------------------------|
| 0                      | 980                       | 974                             |
| 1                      | 4299                      | 3830                            |
| 2                      | 3929                      | 3810                            |
| 3                      | 771                       | 1273                            |
| 4                      | 20                        | 97                              |

3.6 Software and hardware

All code was implemented in Python 3.6, on a consumer-grade laptop (Dell Latitude E5570). The variational autoencoder code was adapted from an open source implementation (available at https://github.com/altosaar/variational-autoencoder).

4 RESULTS

4.1 Hyperparameters

For PCA imputation, the number of retained principal components was defined in terms of the retained variance, which was specified at 99%. The maximum number of imputation iterations was set at 1000 to ensure convergence of PCA imputation.

For VAE imputation, the encoder and decoder were fully connected networks with two hidden layers of 20 neurons each, using rectified linear unit (ReLU) activation functions, and the latent space was 10-dimensional. Training was run for 10 000 iterations on batches of 250 samples, with the RMSProp optimiser and a learning rate of 0.001. The maximum number of imputation iterations was set at 25, as initial experiments indicated convergence within 25 iterations. A sketch of this configuration follows.
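The paper's code was adapted from a TensorFlow implementation (see Section 3.6); the PyTorch sketch below is not the authors' code and only mirrors the stated hyperparameters:

```python
import torch
import torch.nn as nn

class MillVAE(nn.Module):
    """Sketch of the Section 4.1 configuration: fully connected encoder and
    decoder, two hidden layers of 20 ReLU units each, and a 10-dimensional
    Gaussian latent space. Class and variable names are illustrative."""
    def __init__(self, n_features, n_latent=10, n_hidden=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 2 * n_latent),    # mean and log-variance of q(Z|X)
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 2 * n_features),  # mean and log-variance of p(X|Z)
        )

    def forward(self, x):
        mu_z, logvar_z = self.encoder(x).chunk(2, dim=-1)
        # Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
        mu_x, logvar_x = self.decoder(z).chunk(2, dim=-1)
        return mu_x, logvar_x, mu_z, logvar_z

model = MillVAE(n_features=4)
optimiser = torch.optim.RMSprop(model.parameters(), lr=0.001)  # as in Section 4.1
```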

4.2 VAE training

The variational autoencoder was trained on the complete records in each dataset, in order to determine the network weights which maximise the evidence lower bound (ELBO) in equation (6).


The convergence of the ELBO for the heavily corrupted synthetic dataset is shown in Fig. 3; similar (but less noisy) convergence was achieved for the lightly corrupted synthetic dataset. For the milling dataset, the ELBO converged to lower values than for the synthetic dataset, suggesting that the VAE was able to extract less structure from the simulated data, likely due to the effects of simulated measurement noise and process dynamics.

Fig. 3. Evidence lower bound (ELBO) for the heavily corrupted synthetic dataset. The ELBO converges after approximately 10 000 iterations.

Sample output of the input data and generated data for the lightly corrupted synthetic dataset, after 10 000 iterations, is shown in Fig. 4. The VAE successfully reproduces the non-Gaussian distributions and nonlinear relationships between the four variables; similar results were achieved for both datasets, at both levels of corruption.

Fig. 4. Comparison of lightly corrupted synthetic data (blue dots) with data generated by the VAE (orange plusses), after 10 000 training iterations. Variable histograms are shown on the diagonal, and scatter plots of variable pairs on the off-diagonals.

4.3 Imputation results

The root-mean-square error (RMSE) was calculated for the imputed values, with the uncorrupted datasets used as the ground truth. Each dataset was standardised with respect to the uncorrupted dataset, in order to ensure fair comparison of RMSE values across different levels of corruption. A sketch of this evaluation follows.
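A minimal numpy sketch of the metric: standardise with respect to the uncorrupted dataset, then compute RMSE over the imputed entries only:

```python
import numpy as np

def imputation_rmse(X_true, X_corrupted, X_imputed):
    """RMSE over the imputed entries only, in units standardised with
    respect to the uncorrupted dataset X_true."""
    sd = X_true.std(axis=0)
    miss = np.isnan(X_corrupted)
    err = (X_imputed - X_true) / sd   # the mean offset cancels in the difference
    return np.sqrt(np.mean(err[miss] ** 2))
```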

The results for the four datasets and three imputation methods are shown in Table 2.

Table 2. Root-mean-square error (RMSE) of imputation methods on the four datasets

| Dataset         | Corruption type | Mean  | PCA   | VAE    |
|-----------------|-----------------|-------|-------|--------|
| Synthetic       | Light           | 0.851 | 0.253 | 0.0772 |
| Synthetic       | Heavy           | 0.844 | 0.279 | 0.160  |
| Milling circuit | Light           | 0.825 | 1.52  | 0.750  |
| Milling circuit | Heavy           | 0.809 | 1.21  | 0.796  |

For the synthetic dataset, mean replacement appears to be a poor strategy; a significant reduction in RMSE is attained by PCA imputation, which takes into account more information about the data than simple mean replacement. However, VAE imputation achieves an order of magnitude reduction in RMSE for the lightly corrupted data, as it accounts for the nonlinearities in the dataset. Although the RMSE for VAE imputation doubles for the heavily corrupted dataset, it is still better than the other methods.

For the simulated milling circuit dataset, mean replacement appears to be a successful strategy; this is because the dataset represents relatively stable operation of a plant under feedback control, with some fluctuation about the mean. PCA imputation performs significantly worse than mean replacement, which is unexpected, as the variables should have approximately linear relationships (due to controlled operation around a set point). The impact of noise in the data, and of the variance retained, on PCA imputation may require more detailed investigation.

VAE imputation had the lowest RMSE for the lightly corrupted dataset, but was not much better than mean replacement for the heavily corrupted dataset. This result suggests that there is some structure in the data which can be learnt by the VAE, if sufficient data is available (as in the lightly corrupted set). Where limited data is available, as in the heavily corrupted dataset, simpler, lower-variance models such as mean imputation may be more appropriate than lower-bias (but higher-variance) approaches such as the VAE, or even PCA.

5 CONCLUSIONS

The results above demonstrated the imputation of missing data by three approaches: single imputation by mean replacement, and multiple imputation by principal component analysis and by variational autoencoder. The methods were compared using a synthetic nonlinear dataset, and a simulated milling circuit dataset which more closely resembles real industrial data.

The training of the VAE showed that the method can successfully learn distributions in latent space representations of complex, multimodal, nonlinear datasets in an unsupervised manner, even in the presence of heavy corruption (~10% complete records).


VAE imputation outperformed the other two methods, on both lightly and heavily corrupted datasets, as it can account for complex and nonlinear variable relationships and multimodal variable distributions. Further detailed investigation into the effects of noise and retained variance in PCA imputation should be conducted, to confirm the results presented here.

Inferential sensing is a promising area for future VAE applications, particularly in minerals processing, in which critical parameters (such as composition and particle size) are measured offline. Some process knowledge may be required in order to correctly structure the training data (for example, grab samples may be aggregated over shifts), but the extension of the demonstrated missing data imputation to inferential sensing should be relatively simple, and may be able to take advantage of more recent advances in the use of variational autoencoders in semi-supervised learning problems (Kingma et al., 2014).

6 ACKNOWLEDGEMENTS

The financial support of this research by Anglo American Platinum is acknowledged.

7 REFERENCES

Askarian, M., Escudero, G., Graells, M., Zarghami, R., Jalali-Farahani, F., Mostoufi, N., 2016. Fault diagnosis of chemical processes with incomplete observations: A comparative study. Comput. Chem. Eng. 84, 104–116. doi:10.1016/j.compchemeng.2015.08.018

Bergh, L.G., 2010. Restrictions on MMM Industrial Data to Build PCA Models. IFAC Proc. Vol. 43, 98–103. doi:10.3182/20100802-3-ZA-2014.00024

Dray, S., Josse, J., 2015. Principal component analysis with missing values: a comparative survey of methods. Plant Ecol. 216, 657–667. doi:10.1007/s11258-014-0406-z

Gondara, L., Wang, K., 2017. Multiple Imputation Using Deep Denoising Autoencoders. arXiv Prepr. arXiv1705.02737.

He, Q.P., Wang, J., 2017. Statistical process monitoring as a big data analytics tool for smart manufacturing. J. Process Control. doi:10.1016/j.jprocont.2017.06.012

Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313, 504–507. doi:10.1126/science.1127647

Imtiaz, S.A., Shah, S.L., Narasimhan, S., 2004. Missing Data Treatment Using Iterative PCA and Data Reconciliation. IFAC Proc. Vol. 37, 197–202. doi:10.1016/S1474-6670(17)31811-6

Kadlec, P., Gabrys, B., Strandt, S., 2009. Data-driven soft sensors in the process industry. Comput. Chem. Eng. 33, 795–814. doi:10.1016/j.compchemeng.2008.12.012

Kiers, H.A.L., 1997. Weighted least squares fitting using ordinary least squares algorithms. Psychometrika 62, 251–266. doi:10.1007/BF02295279

Kingma, D.P., Rezende, D.J., Mohamed, S., Welling, M., 2014. Semi-supervised Learning with Deep Generative Models. arXiv Prepr.

Kingma, D.P., Welling, M., 2013. Auto-Encoding Variational Bayes. arXiv Prepr. arXiv1312.6114.

Lakshminarayan, K., Harp, S.A., Samad, T., 1999. Imputation of Missing Data in Industrial Databases. Appl. Intell. 11, 259–275. doi:10.1023/A:1008334909089

le Roux, J.D., Craig, I.K., Hulbert, D.G., Hinde, A.L., 2013. Analysis and validation of a run-of-mine ore grinding mill circuit model for process control. Miner. Eng. 43, 121–134. doi:10.1016/j.mineng.2012.10.009

Qin, S.J., 2014. Process data analytics in the era of big data. AIChE J. 60, 3092–3100. doi:10.1002/aic.14523

Rezende, D.J., Mohamed, S., Wierstra, D., 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. arXiv Prepr. arXiv1401.4082.

Sedghi, S., Sadeghian, A., Huang, B., 2017. Mixture semi-supervised probabilistic principal component regression model with missing inputs. Comput. Chem. Eng. 103, 176–187. doi:10.1016/j.compchemeng.2017.03.015

Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A., 2008. Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning (ICML '08). ACM Press, Helsinki, Finland, pp. 1096–1103. doi:10.1145/1390156.1390294

Wakefield, B.J., Lindner, B.S., McCoy, J.T., Auret, L., 2018. Monitoring of a simulated milling circuit: Fault diagnosis and economic impact. Miner. Eng. 120, 132–151. doi:10.1016/j.mineng.2018.02.007

Yan, W., Tang, D., Lin, Y., 2017. A Data-Driven Soft Sensor Modeling Method Based on Deep Learning and its Application. IEEE Trans. Ind. Electron. 64, 4237–4245. doi:10.1109/TIE.2016.2622668

Zeng, J., Gao, C., 2009. Improvement of identification of blast furnace ironmaking process by outlier detection and missing value imputation. J. Process Control 19, 1519–1528. doi:10.1016/j.jprocont.2009.07.006

Zeng, J., Gao, C., Su, H., 2010. Data-driven predictive control for blast furnace ironmaking process. Comput. Chem. Eng. 34, 1854–1862. doi:10.1016/j.compchemeng.2010.01.005