Mario R. Eden, Marianthi Ierapetritou and Gavin P. Towler (Editors) Proceedings of the 13th International Symposium on Process Systems Engineering – PSE 2018 July 1-5, 2018, San Diego, California, USA © 2018 Elsevier B.V. All rights reserved. https://doi.org/10.1016/B978-0-444-64241-7.50377-3
Using Big Data in Industrial Milk Powder Process Systems
Irina Boiarkina (a,b), Nick Depree (b), Arrian Prince-Pike (b), Wei Yu (a,b), David I. Wilson (b,c), Brent R. Young (a,b,*)
(a) Chemical and Materials Engineering, University of Auckland, 2-6 Park Avenue, Grafton, Auckland 1023, New Zealand
(b) Industrial Information and Control Center, University of Auckland, 2-6 Park Avenue, Grafton, Auckland 1023, New Zealand
(c) Electrical and Electronic Engineering Department, Auckland University of Technology, 34 St Paul Street, Auckland Central 1010, New Zealand
[email protected]
Abstract
This work looks at the application of big data in the milk powder processing industry, where the focus is on improving product quality and preventing off-specification product. This results in an increased focus on the quality attributes, which may be infrequently measured and have low repeatability, thus challenging the veracity of the data. Because failures are also infrequent, the data set is ‘unbalanced’, making it difficult to analyse using normal black-box big data techniques. However, these challenges can be addressed by using tailored algorithms to elucidate the effects of specific parts of the process on the quality attributes of interest, and by using techniques such as bootstrapping with up- and down-sampling of data to address the issue of an unbalanced dataset.
Keywords: big data, dairy, processing, modelling
1. Introduction
The term “Big Data”, characterised by volume, velocity, variety and veracity (White, 2016), is rapidly gaining attention, both from vendors providing novel IT solutions and from manufacturers wondering how best to take advantage of these new offerings. While the application of big data is not unusual in the dairy industry, to date it has been primarily restricted to farm management (Wolfert, 2017), the management of dairy herds using genetics to increase milk production (Cole, 2012), and machine learning for establishing culling rules for herd management (McQueen, 1995). The application of big data mining techniques and analytics to dairy processing at the factory is still relatively new.
Milk powder is a complex powdered product with a wide range of chemical, microbiological, physical and functional properties set down by the customer (Sharma, 2012). Functional attributes, such as the various measures of rehydration and solubility performance, are important quality parameters for end use. However, they are not well understood, as they are defined by individually tailored testing procedures, with often complex links to the physical and chemical properties of the powder. This means that the effect of processing conditions on these properties is not well understood. At first glance this is precisely the type of application that big data analysis is suited to.
Big data approaches are many and varied, including principal component analysis, partial least squares regression, neural networks, random forests, and machine learning (Chiang, 2017). These can be linear or non-linear, and are all black box in nature, and therefore ignore any pre-existing knowledge of the process. However, a crucial distinction between herd and farm management and processing is that personnel in the latter domain wish to know “why” a particular phenomenon occurs. A ‘pure’ black-box model approach is unlikely to articulate this, which is one reason why engineers have traditionally been reluctant to embrace such approaches. In our case, it is important to know why the milk powder is occasionally off-specification for particular attributes, and therefore how to subsequently correct this. This paper explores the specific challenges for big data analytics in milk powder processing and presents two case studies for overcoming some of the challenges.
2. Challenges in the Milk Powder Industry
Of the four characteristic “V”s mentioned above, we argue that veracity and velocity are the key concerns for large scale powder production processes. For example, in our application what we really want to know is how the powder performs in terms of rehydration, such as how it disperses, wets and produces any sediment. As mentioned earlier, these attributes are measured using individually tailored tests, such as the ISO/TS 17758:2014 standard (IDF 87:2014) for the determination of the dispersibility and wettability of instant dried milk, and are often laborious. Therefore, unlike standard process measurements such as pressure, flow or temperature, this quality data is infrequent, somewhat subjective, often measured with a significant time lag, and difficult to trace through the process. This means that although the velocity of the process data is high, the quality of the data is low. In carrying out the work presented in Section 3, this resulted in the need for laborious data reconciliation before ‘big data analytics’ could be applied in this area. This is not unusual in the process industry; for example, Cauvin et al. (2008) created a unified database for information management for a petrochemical site, as the data storage within the institute was found to be highly heterogeneous and unsuitable for data mining.
This is especially important for short, infrequent processing events, such as line swaps during continuous operation, which rely on accurate traceability, something not handled well by the standard statistical algorithms. In Section 3.1 we describe the effect of powder re-blend on product quality, an infrequent operation in which powder manufactured during start-up is held back and then subsequently blended into powder made during stable production periods. This required a purpose-built, bespoke algorithm in order to isolate these periods.
This bespoke algorithm approach was also successfully used for analysing ingredient addition variation and off-spec results by Boiarkina (2017). This introduces the second challenge in the dairy industry: veracity. Powders are notoriously difficult to sample reliably and many of the functional tests have poor repeatability. The quality tests are designed with grading in mind, not for diagnosing the effect of process parameters. For example, the functional properties of dispersibility and slowly dissolving particles are quantised into a small number of discrete single-digit levels, seven and five levels respectively. This means the quality test results are not suitable for standard modern analytical algorithms to tease out any useful underlying correlations. Coupled with the fact that specification failures are rare, this means that even the large data set is information poor, which makes it difficult to generate accurate models for these rare but important events. Techniques such as up-sampling of off-spec data (Section 3.2) during model building therefore become important in overcoming these limitations, as the standard application of techniques such as PLS did not work.
3. Case Studies
3.1. Bespoke Algorithms for Process Data Analysis
The quality of milk powder produced cannot be guaranteed during start-up, shut-down and transitory periods (such as equipment line swaps). During this time, the milk powder is diverted into specific “start-stop” hoppers, from which the powder can either be re-blended at a low rate into the powder to be packed out, or be packed out as a lower grade powder. However, the effect of re-blend on the powder quality produced at an industrial plant was unknown. Although a dataset suitable for multivariate analysis had been created previously, it could not be used for this analysis as re-blend events were not demarcated. Furthermore, due to mixing in the powder storage hoppers, the presence of two start-stop hoppers and the use of different rates of re-blend, there was a fuzzy boundary around whether a quality sample was definitively associated with a specific re-blend event. For example, two bins could be used for packing out powder, but only one of them used for re-blending into, and information about which bin the quality sample came from was not available. This was a big data veracity challenge.
A bespoke algorithm was created to scan through a season's worth of powder hopper weight data and identify periods when ‘re-blending’ and ‘packing out’ were taking place. This was based on the rate of change in the bin level, with the cut-off between a re-blend event and packing out being set visually and through discussions with plant operators. The result was verified manually. The gradient detection algorithm had to be able to handle noisy data; straight differencing of the bin level alone was ineffective and erroneously picked up a significant number of non-events. An example of the re-blend data is shown in Figure 1.
Figure 1 - An example of hopper weight fluctuations during a production run showing re-blend, packing out and noise in the data that makes it difficult to extract events.
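The rate-of-change detection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm used at the plant: the moving-median smoothing, the lagged difference, the threshold values, the one-minute sample spacing and the function names `smooth`, `classify_rates` and `events` are all hypothetical, and it assumes that re-blending appears as a slow loss of hopper weight and packing out as a fast one.

```python
from statistics import median

def smooth(weights, window=5):
    """Centred moving median; damps isolated spikes that defeat straight differencing."""
    half = window // 2
    return [median(weights[max(0, i - half): i + half + 1])
            for i in range(len(weights))]

def classify_rates(weights, dt_min=1.0, lag=5, noise_floor=1.0, reblend_max=20.0):
    """Label each sample by the smoothed rate of weight loss (kg/min).

    Hypothetical cut-offs: a loss below noise_floor is 'idle', a loss up to
    reblend_max is 're-blend', and anything faster is 'pack-out'.
    """
    s = smooth(weights)
    labels = []
    for i in range(len(s)):
        j = max(0, i - lag)
        if i == j:                       # no earlier sample to difference against
            labels.append("idle")
            continue
        loss_rate = (s[j] - s[i]) / ((i - j) * dt_min)
        if loss_rate < noise_floor:
            labels.append("idle")
        elif loss_rate <= reblend_max:
            labels.append("re-blend")
        else:
            labels.append("pack-out")
    return labels

def events(labels, kind, min_len=5):
    """Return (start, end) index pairs for runs of `kind` lasting at least min_len samples."""
    runs, start = [], None
    for i, lab in enumerate(labels + [None]):   # sentinel closes a run at the end
        if lab == kind and start is None:
            start = i
        elif lab != kind and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs
```

The minimum run length in `events` plays the role of the manual verification step only loosely: it discards short transitional runs, but in practice any cut-off would still need to be checked against operator knowledge, as was done in the study.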
In this case, the interest was to determine if the quality of the powder produced during re-blend was lower than the quality of the powder produced without re-blend. An analysis of variance (ANOVA) was used to compare the means of the two groups (with and without re-blend), as well as to evaluate whether there was a difference in the quality of the powder produced with low and high rates of re-blend. A statistically significant difference was found for only two quality parameters, the sifter bulk density and a wettability test (shown in Figure 2). However, the difference was very small, and it was confirmed that the plant could continue to re-blend powder safely at a low rate without impacting the quality parameters studied.
Figure 2 - Effect of re-blend rate on the wettability of instant whole milk powder, with the number of samples analysed shown in the top right hand corner.
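For readers unfamiliar with the comparison, the F statistic of a one-way ANOVA can be computed directly from the group data. The sketch below is illustrative only: the function name `one_way_anova_f` is hypothetical and no plant data is reproduced. It returns the F statistic and its degrees of freedom; a p-value would then be read from the F distribution, for example with a statistics library such as SciPy (`scipy.stats.f_oneway` performs the whole test).

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on the given groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual results around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

Calling it with two groups (with and without re-blend) corresponds to the first comparison in the text, and with three groups (none, low rate, high rate) to the second.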
3.2. Down-sampling and Up-sampling for Information Poor Data
In many cases in industrial data, no matter how big the volume, there may be very little information in the key areas of interest, such as times when product quality is poor. In a well-controlled process, there may be so few occurrences of product failure that it is difficult or impossible to find predictive relationships by statistical regression, and hence to avoid the operational regions that cause product failure in future. This is particularly important in the case of milk powder, where the production rate is very high and quality tests are delayed, such that even rare occurrences result in a high cost of quality failure.
A key functional property studied involved a proprietary test, in which rehydrated powders were graded visually by hand on a scale where a rating of 5 or higher constitutes a quality failure. The scale was such that some powders received a 3 rating, almost all powders received a 4 rating, and very few (off-spec powders) were rated 5. Furthermore, this is an integer scale on which fractional values cannot exist, and hence categorical modelling was used rather than linear regression. The training data included no values of 1 or 2, or greater than 5, and hence these can never be predicted from this data; ideally, however, the plant would avoid process operations likely to lead to a score of 5.
Random forest models (James, 2013) were used to predict functional properties from a big data set covering a full season of production. The data set spanned approximately 30 plant process variables, including temperatures, pressures, differential pressures, and flowrates around key unit operations; expert domain knowledge was used to create this subset from the hundreds of measurements available.
This process data is available at 30-second intervals from the plant historian, but it must be aligned against physical property measurements of fat, protein, moisture, and bulk density, which are sampled hourly, and then against the desired functional tests, which occur 3-5 times per day. This results in a significant reduction of the original dataset: several hundred thousand rows of process data yield only 392 samples of the correct product type.
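The reduction from many process rows to few samples can be illustrated with a simple alignment sketch. The paper does not state how the alignment was performed, so everything here is an assumption for illustration: each functional test is paired with the mean of one process variable over a fixed window ending at the test time (optionally shifted by a transport lag), and the names `window_mean`, `align`, `window_s` and `lag_s` are hypothetical.

```python
from bisect import bisect_right

def window_mean(times, values, t_end, window_s):
    """Mean of values whose timestamp lies in (t_end - window_s, t_end]; None if empty.

    `times` must be sorted in ascending order (seconds).
    """
    lo = bisect_right(times, t_end - window_s)
    hi = bisect_right(times, t_end)
    if hi == lo:
        return None
    return sum(values[lo:hi]) / (hi - lo)

def align(process_t, process_v, test_t, test_grade, window_s=3600, lag_s=0):
    """Build one row per functional test: (test time, averaged process value, grade).

    Tests with no process data in their window are dropped, which is one way
    the usable sample count shrinks.
    """
    rows = []
    for t, grade in zip(test_t, test_grade):
        x = window_mean(process_t, process_v, t - lag_s, window_s)
        if x is not None:
            rows.append((t, x, grade))
    return rows
```

With 30-second process data and 3-5 tests per day, each output row summarises on the order of a hundred process rows under these assumed settings, so the row count is governed by the least frequent measurement.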
Figure 3 - Random forest prediction without resampling (points jittered for visibility)
Bootstrapping is used in the random forest modelling to improve the model fit (R²): the dataset is repeatedly randomly sampled and a new tree is grown each time, which shows how the model fit improves as more trees are used. The initial model results shown in Figure 3 appear very promising, with R² well over 0.8 after only a few trees. On closer inspection, however, it was clear that the model need only ever predict a (constant) value of 4 and it will be correct over 80% of the time, owing to the distribution of the input data. Predictions of any other value were few, with only 3 out of 50 values of 3 predicted correctly, and 0 out of 11 values of 5. This clearly illustrates the potential pitfall of modelling where the good/bad data distribution is unbalanced.
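The "over 80%" figure follows directly from the class counts quoted in this section (50 samples rated 3, 331 rated 4 and 11 rated 5). The short sketch below works through that arithmetic for a model that always predicts one class; the function name `constant_baseline` and the use of balanced accuracy (mean per-class recall) as a contrasting measure are additions for illustration, not part of the original analysis.

```python
def constant_baseline(counts, predicted):
    """Score a model that always predicts `predicted`, given {class: sample count}.

    Returns (accuracy, per-class recall, balanced accuracy).
    """
    total = sum(counts.values())
    accuracy = counts[predicted] / total
    # A constant model recalls every sample of its own class and none of the others.
    recall = {c: (1.0 if c == predicted else 0.0) for c in counts}
    balanced = sum(recall.values()) / len(recall)
    return accuracy, recall, balanced

# Class counts taken from the text: 50 + 331 + 11 = 392 samples.
accuracy, recall, balanced = constant_baseline({3: 50, 4: 331, 5: 11}, predicted=4)
print(f"accuracy = {accuracy:.3f}, balanced accuracy = {balanced:.3f}")
```

The overall accuracy is 331/392, while the recall for the off-specification class is zero, which is the failure mode that matters for this application.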
Figure 4 - Random forest prediction with resampling (points jittered for visibility)
Resampling was a key approach to improving the model in this case, where both down-sampling and up-sampling were used to reduce the class imbalances (Kuhn, 2013). The 50 existing samples rated 3 and the 331 samples rated 4 were each randomly down-sampled to 30, and the 11 existing samples rated 5 were randomly up-sampled with replacement until that class also had 30 occurrences. The model results in Figure 4 show a lower R² of around 0.75 after a similar number of trees; however, the prediction is significantly improved in practice. In this case, around 70% of each of the 3 and 4 values were correctly predicted, with some 3’s predicted as 4’s and 4’s predicted as 3’s. More importantly for the application of preventing off-specification powder, all 30 values of 5 were predicted correctly, and only 1 of all 90 points was incorrectly predicted as a 5, i.e. there were no missed alarms and only one false alarm.
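The resampling step described above can be sketched with the standard library. This is a minimal illustration rather than the authors' implementation: the function name `rebalance` and the row layout are hypothetical, and it assumes that an under-represented class is rebuilt entirely from draws with replacement (the alternative of keeping every original sample and adding extra draws would also fit the description).

```python
import random

def rebalance(rows, label_of, target, seed=0):
    """Return a class-balanced list with `target` rows per class.

    Classes with at least `target` rows are down-sampled without replacement;
    smaller classes are up-sampled with replacement.
    """
    rng = random.Random(seed)            # fixed seed keeps the example reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    balanced = []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) >= target:
            balanced.extend(rng.sample(members, target))     # down-sample
        else:
            balanced.extend(rng.choices(members, k=target))  # up-sample
    return balanced

# Synthetic rows (sample id, grade) with the class sizes quoted in the text.
rows = ([(i, 3) for i in range(50)]
        + [(i, 4) for i in range(50, 381)]
        + [(i, 5) for i in range(381, 392)])
balanced = rebalance(rows, label_of=lambda r: r[1], target=30)
```

The balanced set then contains 90 rows, 30 per grade, with the rows for grade 5 necessarily containing repeats of the 11 original samples.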
4. Conclusions
The utilisation of big data in the milk powder processing industry, or indeed any processing industry, has a very different purpose compared to many other big data applications. The focus in processing is on ‘why’ something is off-specification. This results in a larger focus on the quality attributes than on the process data. Given that quality attributes are collected at a lower rate than process data, and the tests themselves are often complex, with poor repeatability and coarsely quantised results, the veracity and velocity of this data become paramount. Furthermore, the data is often information poor, with few results of interest, partly of course because industries primarily produce on-spec product. This means that bespoke algorithms, and techniques such as bootstrapping and up- and down-sampling during modelling, become critical to extracting useful information.
5. Acknowledgements
The authors would like to acknowledge the Primary Growth Partnership program from the New Zealand Ministry of Primary Industries for funding the project and would also like to thank Fonterra staff, specifically Richard Croy, James Winchester, Nigel Russell and Steve Holroyd for providing resources and support throughout the project.
References
I. Boiarkina, N. Depree, W. Yu, D. I. Wilson, B. R. Young, 2017, Fault diagnosis of an industrial plant using a Monte Carlo analysis coupled with systematic troubleshooting, Food Control, 78, 247-255
S. Cauvin, M. Barbieux, L. Carrié, B. Celse, 2008, A Generic Scientific Information Management System for Process Engineering, Computer Aided Chemical Engineering, 25, 931-936
L. Chiang, B. Lu, I. Castillo, 2017, Big Data Analytics in Chemical Engineering, Annual Review of Chemical and Biomolecular Engineering, 8, 63-85
J. B. Cole, S. Newman, F. Foertter, I. Aguilar, M. Coffey, 2012, Breeding and Genetics Symposium: Really Big Data: Processing and Analysis of Very Large Data Sets, Journal of Animal Science, 90, 723-733
G. James, D. Witten, T. Hastie, R. Tibshirani, 2013, An Introduction to Statistical Learning, Springer, ISBN 978-1-4614-7138-7
M. Kuhn, K. Johnson, 2013, Applied Predictive Modeling, Springer, ISBN 978-1-4614-6848-6
R. McQueen, S. Garner, C. G. Nevill-Manning, I. H. Witten, 1995, Applying Machine Learning to Agricultural Data, Computers and Electronics in Agriculture, 12, 1
A. Sharma, A. H. Jana, R. S. Chavan, 2012, Functionality of Milk Powders and Milk Based Powders for End Use Applications – A Review, Comprehensive Reviews in Food Science and Food Safety, 11, 518-528
D. White, 2016, Big Data What is it?, CEP Magazine, March 2016, 32-35
S. Wolfert, L. Ge, C. Verdouw, M.-J. Bogaardt, 2017, Big Data in Smart Farming, Agricultural Systems, 153, 69-80