Food Control 18 (2007) 1512–1517 www.elsevier.com/locate/foodcont
Using data mining techniques to predict industrial wine problem fermentations
Alejandra Urtubia a, J. Ricardo Pérez-Correa a,*, Alvaro Soto b, Philippo Pszczólkowski c
a Departamento de Ingeniería Química y Bioprocesos, Facultad de Ingeniería, Pontificia Universidad Católica de Chile, Vicuña Mackenna 4860, Casilla 306, Santiago 22, Chile
b Departamento de Ciencia de la Computación, Facultad de Ingeniería, Pontificia Universidad Católica de Chile, Vicuña Mackenna 4860, Casilla 306, Santiago 22, Chile
c Departamento de Fruticultura y Enología, Facultad de Agronomía, Pontificia Universidad Católica de Chile, Vicuña Mackenna 4860, Casilla 306, Santiago 22, Chile
Received 19 September 2006; accepted 26 September 2006
Abstract
Winemakers currently lack the tools to identify early signs of undesirable fermentation behavior and so cannot take mitigating action in time. Data collected from 24 industrial fermentations of Cabernet sauvignon were used in this study to explore how useful data mining is for detecting anomalous behavior in advance. The database held periodic measurements of 29 components, including sugars, alcohols, organic acids and amino acids. Owing to the scale of the problem, we used a two-stage classification procedure. First, PCA was used to reduce system dimensionality while preserving metabolite interaction information. Cluster analysis (K-Means) was then performed on the lower-dimensional system to group fermentations into clusters of similar behavior. Several classifications were explored depending on the data used: initially data from just the first three days were assessed, and then the entire data set was used. Fermentation behavior during the first three days provides important clues about the final classification. We also found a strong association between problematic fermentations and specific patterns found by the data mining tools. In short, data from the first three days contain sufficient information to establish the likelihood of a fermentation finishing normally. Results from this study are most encouraging. However, data from many more fermentations and from different varieties need to be collected to develop a reliable and more broadly applicable diagnostic tool.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: PCA; Clustering; K-Means; Sluggish fermentations; Stuck fermentations
* Corresponding author. Tel.: +56 2 3544258; fax: +56 2 3545803. E-mail address: [email protected] (J.R. Pérez-Correa).
0956-7135/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.foodcont.2006.09.010

1. Introduction
Problematic wine fermentations directly affect both productivity and wine quality. Anticipating stuck or sluggish fermentations, or simply being able to foresee the progress of a given fermentation, would be extremely useful for enologists, who could then take corrective steps where necessary and ensure that vinifications conclude successfully. Just as viticulturists make decisions based on weather forecasts, wineries can benefit from improved analysis of fermentation data, on which engineering tools for reducing fermentation uncertainty could be built.

Numerous metabolites can now be measured at set intervals during a winemaking fermentation using current technology (Urtubia, Pérez-Correa, Meurens, & Agosin, 2004). These periodic measurements hold valuable information that should not be overlooked, as it could be the key to predicting fermentation behavior.

Data mining can extract useful information from large databases, which may then be employed for predicting, modeling or identifying inter-relationships. Its tools apply large numbers of numeric operations, based on a variety of methods such as clustering rules, neural networks and PCA (Wang, 1999). Many authors have employed data mining methods such as the mean hypothesis test (MHT), principal component analysis (PCA), decision trees, neural networks and cluster analysis to retrieve useful information from fermentation databases (Kamimura, Bicciato, Shimizu, Alford, & Stephanopoulos, 2000; Roger, Sablayrolles, Séller, & Bellon-Maurel, 2002; Stephanopoulos, Locher, Duff, Kamimura, & Stephanopoulos, 1997; Subramanian, Buck, & Block, 2001; Vlasides, Ferrier, & Block, 2001). Kamimura, Bicciato, Shimizu, Alford, and Stephanopoulos (2000), for example, applied MHT, PCA and cluster analysis to periodic measurements of relatively few metabolites to isolate behavior patterns associated with problem fermentations. That study was made possible, however, by the regular behavior encountered among the fermentations.

Wine fermentations differ in two important respects. First, vinification typically involves tracking many more metabolites. Second, behavior patterns are not as consistent from one wine fermentation to the next. Problems may arise from many causes, which makes direct detection of unusual behavior from measured variables difficult, except, of course, when it is too late and the yeasts are no longer able to consume the remaining sugar within a reasonable timeframe.

In this work we applied a two-stage procedure, using first PCA and then cluster analysis (K-Means), to explore data from regularly sampled measurements of 24 industrial vinifications of Cabernet sauvignon. Sugars, alcohols, organic acids and nitrogen-rich compounds were measured at set times in both normal and problematic fermentations.
A description of the methodology follows; results are then discussed and, finally, conclusions are drawn, together with some observations on how to extend this exploratory study.

2. Methodology
2.1. Measurements
Samples providing the data for the study were collected from 24 industrial fermentation tanks during vinifications of Cabernet sauvignon. The fermentations were conducted at a winery in Chile's Maipo Valley following the 2002 harvest. Between 30 and 35 samples were taken per fermentation, depending on the duration of the vinification. In all, 29 compounds were analyzed (sugars, alcohols, organic acids and nitrogen-rich compounds; see Table 1), which generated approximately 22 000 data points.

Compounds were analyzed using FT-IR Multispec (module FT-IR, AVATAR 360 NICOLET) equipped with a DTGS KBr detector having a spectral range of 200–740 nm and 1350–28 500 nm. Calibration, to an appropriate degree of precision for this study, was developed specially for analyzing sugars, alcohols and organic acids (Urtubia et al., 2004). For amino acid analysis, a calibration for an artificial must was used. Even though these measurements are likely to differ slightly from those of real musts, this should not prove a limitation, as only the relative behavior of different fermentations is compared in this study.

Details of the industrial fermentations of Cabernet sauvignon analyzed in this work are presented in Table 2. Initial sugar values in Table 2 were estimated from density curves (Oreglia, 1978); no such data were available for fermentations 8 and 23. It is worth mentioning that fermentation time was counted from the moment a tank was filled, although under normal cellar procedures must is left for a day prior to inoculation.

Table 1
Compounds monitored in industrial fermentations of Cabernet sauvignon

Group                     Compounds
Sugars (g/l)              Glucose, Fructose
Alcohols (g/l)            Glycerol, Alcohol (a)
Organic acids (g/l)       Lactic acid, Citric acid, Malic acid, Succinic acid, Tartaric acid
Nitrogen sources (mg/l)   Ammonium, Alanine, Arginine, Aspartic acid, Cysteine, Phenylalanine, Glycine, Glutamine, Glutamic acid, Histidine, Isoleucine, Leucine, Lysine, Proline, Serine, Tyrosine, Threonine, Tryptophan, Valine, Methionine

(a) (%) v/v.

Table 2
Initial and final conditions and classification of Cabernet sauvignon fermentations

Tank  Days  Initial density (g/l)  Initial sugar (g/l)  Final sugar (g/l)  Alcohol (% v/v)
1     8     1095                   223                  0                  10.7
2     8     1095                   223                  0.21               11.4
3     10    1093                   218                  0.81               11.8
4     12    1102                   242                  0                  13.9
5     13    1103                   244                  0.64               14.4
6     12    1097                   228                  0.7                11.1
7     13    1093                   218                  0                  12.3
8     13    –                      –                    0.45               13.3
9     13    1086                   199                  0.32               11.4
10    14    1104                   247                  0                  11.5
11    15    1103                   244                  0                  11.3
12    15    1100                   236                  0.21               12.8
13    15    1105                   250                  0.37               11.9
14    15    1104                   247                  1.93               12.7
15    17    1100                   236                  0                  13
16    17    1096                   226                  0.8                12.3
17    15    1103                   244                  0.83               10.4
18    18    1087                   202                  1.21               12.4
19    23    1104                   247                  0.46               12.5
20    9     1098                   231                  13.46              12.6
21    12    1091                   212                  7.06               11.1
22    13    1103                   244                  3.03               11
23    20    –                      –                    3.09               12
24    23    1095                   223                  2.98               12.5
2.2. Data mining
PCA is a commonly used statistical tool for reducing the dimensionality of a data set while retaining the relevant patterns hidden in it. PCA achieves this by mapping the original data set onto a reduced orthogonal space chosen so as to account for most of the original data set's variability (Martens & Naes, 1989).

Cluster analysis, on the other hand, aims to organize information into relatively homogeneous groups, or "clusters" (Visauta, 1998). Its purpose is to analyze and extract a dataset's structure: elements of the same group lie closer to each other than to elements of other groups. The decision whether to include an element in a given group is based on its degree of similarity to the group, that is, on how alike they are according to a suitable objective measure.

In this work, the K-Means clustering algorithm was used for classification. It is a common and robust clustering technique for data exploration, and has been employed in numerous fields including gene expression analysis (Fx, Zhang, & Kusalik, 2003; Yoshioka et al., 2002), forecasting atmospheric pollution (Jorquera, Pérez, Cipriano, & Acuña, 2004), analysis of health databases (Al-Harbi & Rayward-Smith, 2003) and industrial waste treatment plants (Albazzaz, Wang, & Marhoon, 2005). The method assumes that dataset $X$ contains $k$ clusters $X_i$, each of which can be represented by its central value $\mu_i$. The squared Euclidean distance between an observation and its cluster's center is used here as the measure of similarity:

$$d^2(x, \mu_i) = (x - \mu_i)^T (x - \mu_i) \qquad (1)$$

where subscript $i$ denotes the cluster or group, $\mu_i$ is the center of the cluster, and $d(x, \mu_i)$ is the distance from observation $x$ to that center. The method can be summarized as follows (Jorquera et al., 2004):

Step 1: Random placement of the initial centers.
Step 2: Assign each data point to its closest cluster. After all the assignments are completed, redefine the center of each cluster so as to minimize the function $J_i$:

$$J_i = \sum_{j=1}^{N_i} d(x_{ij}, \mu_i), \qquad x_{ij} \in X_i, \quad N_i = \#X_i \qquad (2)$$
Step 3: The new center positions are taken as the initial points for a new iteration starting at Step 2. The procedure is repeated until the positions of the centers vary only slightly or no longer change between successive iterations.

Using this procedure, convergence to a local minimum of function J is guaranteed. Here, we classified the data into five clusters.

2.3. Data analysis
Different classifications were performed depending on the initial data set selected. This enabled us to evaluate how similar the classifications obtained with data from the first three days of fermentation (77 samples; datasets A and E) were to those obtained with the entire data set (570 samples; datasets B and F). Treating glucose and fructose as a single variable (sugar) gave the same results as considering the two sugars as independent variables. Consequently, only the variable "sugar" was used in this study, reducing the total number of components to 28. We also studied how including or excluding ammonium and amino acid measurements affected classification. The first classification involved eight variables: "sugar", alcohols and organic acids (datasets A and B). The second included all 28 components, that is, the first group plus the nitrogen-rich compounds (datasets E and F). Fig. 1 describes the data mining analysis and the four datasets used.

As a data preprocessing step, we applied PCA to reduce the dimensionality of the data and thereby simplify the clustering analysis. Exactly the same results were obtained from normalized as from non-normalized data, so raw, non-normalized data were used. For each dataset (A, B, E and F), a three-dimensional space was identified that accounted for over 70% of the raw data's total variance. A space with eight principal components (PCs), accounting for over 90% of the variance, did not show significant differences in the subsequent clustering, although it considerably increased the complexity of the analysis. Therefore, including more than three PCs was not justified, since the additional information was not useful for classification.
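As a rough illustration of this dimensionality-reduction step, the sketch below computes principal components of a small data matrix and reports the cumulative fraction of variance they explain, which is the quantity behind the 70% and 90% figures quoted above. Everything specific here is an assumption made for illustration: the sample values are invented, the function names (`covariance`, `top_components`, `explained_variance`) are hypothetical, and power iteration with deflation is simply one convenient way to obtain the leading components without external libraries. The paper does not state which PCA implementation was used.

```python
import math
import random


def covariance(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return cov, centered


def mat_vec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(m_ab * v_b for m_ab, v_b in zip(row, v)) for row in m]


def top_components(cov, k, iters=500, seed=0):
    """Leading k (eigenvalue, eigenvector) pairs by power iteration with deflation."""
    rng = random.Random(seed)
    p = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, mat_vec(work, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        # Deflation: subtract the component just found before seeking the next one.
        for a in range(p):
            for b in range(p):
                work[a][b] -= lam * v[a] * v[b]
    return pairs


def explained_variance(rows, k):
    """Cumulative variance fraction of the first k PCs, plus the projected scores."""
    cov, centered = covariance(rows)
    total = sum(cov[j][j] for j in range(len(cov)))
    pairs = top_components(cov, k)
    cumulative, running = [], 0.0
    for lam, _ in pairs:
        running += lam
        cumulative.append(running / total)
    scores = [[sum(c_j * v_j for c_j, v_j in zip(c, v)) for _, v in pairs]
              for c in centered]
    return cumulative, scores


# Invented samples (not data from the study):
# sugar (g/l), alcohol (% v/v), glycerol (g/l), malic acid (g/l).
rows = [
    [220.0, 0.5, 1.0, 2.1],
    [180.0, 2.5, 2.0, 2.0],
    [140.0, 4.6, 3.1, 2.2],
    [95.0, 7.0, 4.2, 1.9],
    [50.0, 9.4, 5.0, 2.0],
    [10.0, 11.5, 5.9, 2.1],
]
cumulative, scores = explained_variance(rows, 3)
# Smallest number of PCs whose cumulative share reaches 70% of the variance.
n_pcs = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.70)
print([round(c, 4) for c in cumulative], n_pcs)
```

Because the data are left non-normalized, as in the study, the variable with the largest numeric range (sugar in this toy matrix) dominates the first component. The rows of `scores` are the reduced coordinates that a subsequent clustering step would operate on.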
[Fig. 1. Summary chart – data analysis procedure. Measurements of 8 compounds (first 3 days: dataset A; complete fermentation: dataset B) and of 28 compounds (first 3 days: dataset E; complete fermentation: dataset F) are reduced to principal components, which are then grouped into clusters.]
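The three K-Means steps of Section 2.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the toy points and the function names are invented, the initial centers are taken to be randomly chosen data points, and the objective is assumed to be the sum of the squared distances of Eq. (1), which is what the mean-based center update minimizes.

```python
import random


def sq_dist(x, mu):
    """Squared Euclidean distance (x - mu)^T (x - mu), as in Eq. (1)."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Return (labels, centers, cost) for a basic K-Means run."""
    rng = random.Random(seed)
    # Step 1: random placement of the initial centers (here: k distinct data points).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2a: assign each point to its closest center.
        labels = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
                  for p in points]
        # Step 2b: move each center to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)])
            else:
                new_centers.append(centers[i])  # leave an empty cluster's center alone
        # Step 3: repeat from Step 2 until the centers stop moving.
        shift = max(sq_dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    cost = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, cost


# Invented two-dimensional points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, cost = k_means(points, 2)
print(labels, centers, round(cost, 3))
```

The study itself used five clusters on the three-dimensional PCA scores; two clusters are used here only to keep the toy data readable. Because convergence is to a local minimum, different seeds can in general produce different partitions on less clearly separated data.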
After the dimensionality reduction, we applied the K-Means algorithm to detect the relevant clusters in the data. Finally, the identified clusters were analyzed to determine which normal fermentations behaved similarly and to explore any associations between these and problematic fermentations.

3. Results and discussion
Analyzing around 22 000 data points is far from trivial, and the task is complicated further by the search for underlying structures in the database that are associated with the behavior and evolution of wine fermentations. This section is divided into two parts for clarity. First, normal and problematic behaviors are discussed with respect to the fermentations set out in Table 2 (Section 2). We then present results from the data mining processes, analyze how effective classification is when using samples from the first three days of fermentation, and assess whether including nitrogen-rich compounds improves classification.
3.1. Stuck and sluggish wine fermentations
Fermentations with high residual sugar contents will probably not finish properly; they may stall or get stuck (Varela, Pizarro, & Agosin, 2004). The residual sugar comprises all the sugars remaining at the end of alcoholic fermentation, normally C5 sugars (1–2 g/l). In a stuck fermentation, the product will also contain C6 sugars (glucose and fructose) besides the C5 sugars. As "sugar" in our study consists of both glucose and fructose, the first condition we consider to indicate a problematic fermentation is a residual sugar content above 2 g/l. According to Table 2, sugar was not fully consumed (residual sugar > 2 g/l) in fermentations 20–24. In other words, these were stuck fermentations; the sugar remaining in tank 20 was 13 g/l.

Fermentation time is another pertinent factor. If we consider 13 days a reasonable cut-off point, any fermentation that takes longer is designated "sluggish", even if the residual sugar content eventually falls below 2 g/l. Under this criterion, fermentations 10–19 are sluggish, while fermentations 23 and 24 were sluggish and finally stuck. Pre-classifying the fermentations under study by residual sugar and time, we find 15 problem fermentations: 10 sluggish (42%), 3 stuck (13%) and 2 sluggish and stuck (8%). Fig. 2 clearly shows how differently a normal and a stuck fermentation behave.

Hence we can set certain classification criteria on the monitored data, but such data are only available once fermentation has concluded. The wider aim of this study is to predict or foresee problems based on real-time data from assaying samples taken during the course of the fermentation. The following section deals with a two-stage data mining application. The aim is to classify fermentations using data from the first three days, and then to analyze associations between the classification obtained and problem fermentations.

3.2. Classification analysis
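The cluster results in this section are judged against the pre-classification of Section 3.1. As a minimal sketch, those labels can be reproduced from the "Days" and "Final sugar" columns of Table 2 using the two thresholds stated there (residual sugar above 2 g/l, duration above 13 days). The constant, function and variable names are illustrative and do not come from the original study.

```python
from collections import Counter

SUGAR_LIMIT_G_L = 2.0  # residual sugar above this marks a stuck fermentation
DAY_LIMIT = 13         # fermentations lasting longer are called sluggish

# tank: (days, final sugar in g/l), transcribed from Table 2
TABLE_2 = {
    1: (8, 0.0), 2: (8, 0.21), 3: (10, 0.81), 4: (12, 0.0),
    5: (13, 0.64), 6: (12, 0.7), 7: (13, 0.0), 8: (13, 0.45),
    9: (13, 0.32), 10: (14, 0.0), 11: (15, 0.0), 12: (15, 0.21),
    13: (15, 0.37), 14: (15, 1.93), 15: (17, 0.0), 16: (17, 0.8),
    17: (15, 0.83), 18: (18, 1.21), 19: (23, 0.46), 20: (9, 13.46),
    21: (12, 7.06), 22: (13, 3.03), 23: (20, 3.09), 24: (23, 2.98),
}


def pre_classify(days, final_sugar):
    """Label one fermentation using the residual-sugar and duration criteria."""
    stuck = final_sugar > SUGAR_LIMIT_G_L
    sluggish = days > DAY_LIMIT
    if stuck and sluggish:
        return "sluggish and stuck"
    if stuck:
        return "stuck"
    if sluggish:
        return "sluggish"
    return "normal"


labels = {tank: pre_classify(*row) for tank, row in TABLE_2.items()}
counts = Counter(labels.values())
for name, n in counts.most_common():
    print(f"{name}: {n} of {len(labels)} ({100 * n / len(labels):.1f}%)")
```

Applied to Table 2, these rules give the 15 problem fermentations reported in Section 3.1 (tanks 10 to 24), which are the labels behind the "Problem fermentation (%)" columns of Table 3.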
To establish whether fermentations can be classified early, results from applying data mining to samples from the first three days (datasets A and E) were compared with those obtained with samples from the full fermentation (datasets B and F). Datasets A and B displayed similar cluster patterns that were essentially linked to sugar concentration. In addition, around 80% of fermentations were clustered similarly with datasets E and F. These classification results therefore support our initial claim that data taken during the first three days of fermentation (datasets A and E) should prove sufficient to classify fermentations early. From now on, therefore, we will confine the analysis to the first three days of fermentation (datasets A and E).

[Fig. 2. Typical normal (▲) and stuck (●) fermentations of Cabernet sauvignon. Axes: total sugars (g/L) versus time (h).]

In all cases studied, samples were classified into five clusters, arbitrarily named blue (B), red (R), pink (P), brown (Br) and green (G). Clustering analysis would be far simpler if all the samples from a given fermentation were confined to a single cluster. However, due in large part to the time-varying nature of the fermentation process, the samples of most fermentations were spread over two or three clusters. To simplify the analysis, we use the term "grouping" for a set of clusters that contains all of the samples from one or more fermentations. Next, we evaluate the effect of nitrogen compounds on the classification.

3.2.1. Analysis using dataset A (excluding nitrogen compounds)
Clustering yielded 13 groupings (Table 3): one single-cluster grouping, six two-cluster groupings and six three-cluster groupings. Eight of these groupings contained only problem fermentations: the single-cluster grouping (1), four of the two-cluster groupings (4, 5, 6 and 7) and three of the three-cluster groupings (11, 12 and 13). Therefore, if all the samples of a new fermentation fall within one of these groupings, we can presume that it will be a problematic fermentation; 11 of the 15 problematic fermentations included in this study fell within this classification. In addition, two groupings (3 and 10) contained only normal fermentations.

3.2.2. Analysis using dataset E (including nitrogen compounds)
Nitrogen deficiency has been widely reported to be an important factor in problem winemaking fermentations (Boulton, Bisson, Singleton, & Kunkee, 1996; Bisson, 1999, 2000; Pszczólkowski, Carriles, Cumsille, & Maklouf, 2001; Varela et al., 2004). Clustering with nitrogen compounds included produced 12 groupings (Table 3). Five groupings contained only problem fermentations: one single-cluster grouping (3), one two-cluster grouping (6) and three three-cluster groupings (9, 11 and 12), which together accounted for eight fermentations. On the other hand, five groupings (1, 2, 5, 8 and 10) contained only normal fermentations; seven fermentations were classified in these groupings. At first glance, dataset E does not seem to provide any additional information. If both classifications, A and E, are considered, however, only 5 of the 24 fermentations were not correctly classified. In each of these cases, the unclassified fermentation appeared in groupings that were neither 100% normal nor 100% problematic under both classifications.

Table 3
Classification results of K-Means clustering, using 5 clusters

Case A (8 compounds)
Grouping  Clusters (a)  Fermentations         Problem fermentations (%)
1         B             11                    100
2         R–P           3, 9, 24              33.3
3         B–Br          5                     0
4         R–Br          17, 18                100
5         G–Br          16, 22                100
6         P–Br          14                    100
7         B–R           19, 21                100
8         B–R–P         1, 2, 4, 7, 15, 23    33.3
9         R–G–Br        6, 13                 50
10        R–P–G         8                     0
11        R–P–Br        20                    100
12        B–P–Br        10                    100
13        B–R–Br        12                    100

Case E (28 compounds)
Grouping  Clusters (a)  Fermentations         Problem fermentations (%)
1         B             4                     0
2         R             5                     0
3         G             11                    100
4         R–G           6, 14, 16, 22         75
5         R–P           3, 9                  0
6         B–R           12, 19, 21            100
7         B–P           7, 15, 18, 23, 24     80
8         B–P–G         1, 2                  0
9         B–R–G         13, 17                100
10        B–R–P         8                     0
11        R–P–G         10                    100
12        R–G–Br        20                    100

(a) B: Blue, Br: Brown, G: Green, P: Pink, R: Red.

4. Conclusions
Collecting the large number of fermentation samples required posed a tremendous logistical problem. Around-the-clock sampling and chemical analysis of 29 compounds produced around 22 000 measurements. Despite this huge amount of data, this study has shown that data mining tools can detect relationships that allow winemaking fermentations to be classified early on. We found that measurements from the first three days of a fermentation contained sufficient information and gave results strikingly similar to those obtained with measurements from the entire vinification process. A classification using measurements of sugar, alcohols and organic acids (dataset A) held enough information to detect over 70% of the problem fermentations within 72 h. When nitrogen compounds were added to the classification database (dataset E), 63% of all fermentations were classified early as normal or problematic. Perhaps more importantly, if both classifications, A and E, are considered in the analysis, only 5 out of 24 fermentations were not classified properly. Classifications A and E can therefore be considered complementary. Using more sophisticated clustering or other data mining methods may improve the detection of problematic fermentations. However, to develop a method for early warning that is truly robust and reliable, a significantly larger
database needs to be used that includes past fermentations, vinifications of other grape varieties, and fermentations from other wineries and, eventually, from other countries. Nevertheless, the results from this study are most encouraging and show potential for considerable benefits for wineries.

Acknowledgements
One of the authors (A. Urtubia) thanks CONICYT and DIPUC for financial support. Part of this work was supported by FONDEF (grant number D00I10113). We also thank Alex Crawford for correcting the style of the text.

References
Albazzaz, H., Wang, Z., & Marhoon, F. (2005). Multidimensional visualization for process historical data analysis: a comparative study with multivariate statistical process control. Journal of Process Control, 15(3), 285–294.
Al-Harbi, S., & Rayward-Smith, V. (2003). The use of a supervised k-means algorithm on real-valued data with applications in health. Developments in Applied Artificial Intelligence, Lecture Notes in Artificial Intelligence, 2718, 575–581.
Bisson, L. (1999). Stuck and sluggish fermentations. American Journal of Enology and Viticulture, 50(1), 107–119.
Bisson, L., & Butzke, C. (2000). Diagnosis and rectification of stuck and sluggish fermentations. American Journal of Enology and Viticulture, 51(2), 159–177.
Boulton, R., Bisson, L., Singleton, V., & Kunkee, R. (1996). Principles and practices of winemaking. New York: Chapman Hall.
Fx, W., Zhang, W. J., & Kusalik, A. J. (2003). A genetic K-means clustering algorithm applied to gene expression data. Advances in Artificial Intelligence, Proceedings, Lecture Notes in Artificial Intelligence, 2671, 520–526.
Jorquera, H., Pérez, R., Cipriano, A., & Acuña, G. (2004). Short term forecasting of air pollution episodes. In Environmental sciences and environmental computing (pp. 17–33). The Enviro Comp Institute.
Kamimura, R., Bicciato, S., Shimizu, H., Alford, J., & Stephanopoulos, G. (2000). Mining of biological data I: identifying discriminating features via mean hypothesis testing. Metabolic Engineering, 2, 218–227.
Kamimura, R., Bicciato, S., Shimizu, H., Alford, J., & Stephanopoulos, G. (2000). Mining of biological data II: assessing data structure and class homogeneity by cluster analysis. Metabolic Engineering, 2, 228–238.
Martens, H., & Naes, T. (1989). Multivariate calibration. John Wiley & Sons.
Oreglia, F. (1978). Enología Teórico-Práctico. Instituto Salesiano de Artes Gráficas.
Pszczólkowski, P., Carriles, P., Cumsille, C., & Maklouf, M. (2001). Reflexiones sobre la Madurez de Cosecha y las Condiciones. Facultad de Agronomía. Pontificia Universidad Católica de Chile.
Roger, J., Sablayrolles, J., Séller, J., & Bellon-Maurel, V. (2002). Pattern analysis techniques to process fermentation curves: application to discrimination of enological alcoholic fermentations. Biotechnology and Bioengineering, 79(7), 806–815.
Stephanopoulos, G., Locher, G., Duff, M., Kamimura, R., & Stephanopoulos, G. (1997). Fermentation database mining by pattern recognition. Biotechnology and Bioengineering, 53(5), 443–452.
Subramanian, V., Buck, K., & Block, E. (2001). Use of decision tree analysis for determination of critical enological and viticulture processing parameters in historical databases. American Journal of Enology and Viticulture, 52(3), 175–184.
Urtubia, A., Pérez-Correa, R., Meurens, M., & Agosin, E. (2004). Monitoring large scale wine fermentations with infrared spectroscopy. Talanta, 64, 778–784.
Varela, C., Pizarro, F., & Agosin, E. (2004). Biomass content governs fermentation rate in nitrogen-deficient wine musts. Applied and Environmental Microbiology, 70(6), 3392–3400.
Visauta, B. (1998). Análisis estadístico con SPSS para windows. Estadística multivariante (II). McGraw Hill.
Vlasides, S., Ferrier, J., & Block, D. (2001). Using historical data for bioprocess optimization: modeling wine characteristics using artificial neural networks and archived process information. Biotechnology and Bioengineering, 73(1), 55–68.
Wang, X. (1999). Data mining and knowledge discovery for process monitoring and control. Advances in Industrial Control.
Yoshioka, T., Morioka, R., Kobayashi, K., Oba, S., Ogawsawara, N., & Ishii, S. (2002). Clustering of gene expression data by mixture of PCA models. Artificial Neural Networks – ICANN 2002, Lecture Notes in Computer Science, 2415, 522–527.