Spectrometric differentiation of yeast strains using minimum volume increase and minimum direction change clustering criteria

Pattern Recognition Letters 45 (2014) 55–61


Nuno Fachada a,*, Mário A.T. Figueiredo b, Vitor V. Lopes c, Rui C. Martins d, Agostinho C. Rosa a

a ISR – Institute for Systems and Robotics, Instituto Superior Técnico, Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
b IT – Instituto de Telecomunicações, Instituto Superior Técnico, Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
c LNEG – Laboratório Nacional de Energia e Geologia, Estrada do Paço do Lumiar, 22, 1649-038 Lisboa, Portugal
d ICVS – Life and Health Sciences Research Institute, School of Health Sciences, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal

Article history:
Received 6 July 2013
Available online 28 March 2014

Keywords:
Clustering
Minimum volume increase
Minimum direction change
Yeast
Spectroscopy

Abstract

This paper proposes new clustering criteria for distinguishing Saccharomyces cerevisiae (yeast) strains using their spectrometric signature. These criteria are introduced in an agglomerative hierarchical clustering context, and consist of: (a) minimizing the total volume of clusters, as given by their respective convex hulls; and, (b) minimizing the global variance in cluster directionality. The method is deterministic and produces dendrograms, which are important features for microbiologists. A set of experiments, performed on yeast spectrometric data and on synthetic data, shows that the new approach outperforms several well-known clustering algorithms, including techniques commonly used for microorganism differentiation.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Spectroscopy, together with statistical analysis of spectra, is frequently used as a rapid microbiological identification method. Rapid, simple and low-cost identification of microorganisms opens several possibilities. For example, for pathogens in general, it has been shown that fast classification has a major impact on morbidity, mortality, and duration of hospitalization [18]. For Saccharomyces cerevisiae (yeast), quick identification of different strains can yield significant economic advantages, as yeasts not only provide us with many distinctive types of aliment, but are also responsible for food spoilage and can be medically relevant [13]. Winemaking, a multibillion Euro industry, is a prime example, as it could prosper from rapid and comprehensive yeast identification and classification methods [9]. The international wine markets are constantly presenting new challenges, such as taste standardization or the production of different and novel wine types with particular characteristics, which can in turn benefit from developing these techniques [6]. Additionally, new species of yeast are continually discovered and explored [15], which requires the classification of a high number of isolates, a task for which a rapid, simple, low-cost identification method is important [13].

Both supervised and unsupervised statistical techniques have been used on spectrometric data with varying degrees of success [19]. Principal Component Analysis (PCA) [12] is one of the latter methods, often employed as a dimensionality reduction step in a broader analysis [6,18]. The majority of methods used for strain differentiation are based on agglomerative hierarchical clustering (AHC) with typical off-the-shelf implementations and parameters [5,11,13,18–20,24,26].

This paper introduces two new clustering criteria for AHC, based on minimizing: (a) the total volume of clusters, as given by their respective convex hulls; and, (b) the global variance in cluster directionality. These criteria are inspired by the data produced when applying PCA to spectrometric data, although they can be used generically in other problems. A set of experiments, performed on yeast spectrometric data and on synthetic data, shows that the new approach outperforms several well-known clustering algorithms, namely k-means [10], EM [7], Partitioning Around Medoids (PAM) [3] and AHC with common distance metrics and linkages [10].

The rest of the paper is organized as follows. First, in Section 2, previous work on spectroscopy as a fast identification method and on clustering with volume-based metrics is discussed. Next, Section 3 describes the data sets and the dimensionality reduction methods used in this work. The novel clustering metrics, as well as their integration in AHC, are presented in Section 4. Results, presented in Section 5, show that using AHC with the novel volume and direction-based metrics offers better discrimination capabilities when compared with the remaining tested algorithms. Section 6 provides a global outline of what was accomplished in this work.

---
This paper has been recommended for acceptance by Y. Liu.
* Corresponding author. Tel.: +351 21 8418273; fax: +351 21 8418291.
E-mail addresses: [email protected] (N. Fachada), [email protected] (M.A.T. Figueiredo), [email protected] (V.V. Lopes), [email protected] (R.C. Martins), [email protected] (A.C. Rosa).
http://dx.doi.org/10.1016/j.patrec.2014.03.008
0167-8655/© 2014 Elsevier B.V. All rights reserved.

2. Related work

2.1. Spectroscopy as a rapid identification method

Fourier transform infrared spectroscopy (FTIR) was one of the first spectroscopy techniques able to distinguish and identify microorganisms. Helm et al. [11] used this method to successfully group bacteria according to species, metabolite production, presence of outer membrane and antigenic structure. Several spectral windows were preselected based on their specific information content and discrimination power; similarity between them was measured using Pearson's correlation coefficient. Different combinations of spectral windows and their weights, as well as the use of the first or second derivative of the spectra, were systematically tested until the resulting classification mostly agreed with the desired grouping criteria. Grouping was performed using AHC with average and Ward linkages. Kümmerle et al. [13] performed a similar cluster analysis using food-borne yeasts, and concluded that FTIR spectroscopy is limited for taxonomic purposes, because spectra of different species of the same genus generally did not cluster. However, they could successfully identify 97.5% of 722 independent yeast isolates by comparing spectrum similarities against the reference library used in the taxonomic analysis (comprising 322 yeast strains), showing the potential of FTIR spectroscopy as an identification tool.

Maquelin et al. [18] aimed to identify Candida species from spectra obtained using confocal Raman microspectroscopy, applying a mixture of unsupervised and supervised techniques. PCA was first applied to the spectra to determine the respective principal components (PCs); a dendrogram was then generated with AHC using squared Euclidean distance and Ward's linkage on the most relevant PCs. Separate clusters were formed for the majority of species. The results of AHC were then used as a starting point for a supervised sequential species identification scheme based on Linear Discriminant Analysis (LDA).
This process was used on two data sets which differed in how the Candida samples were prepared; when applied to a single data set, 100% correct identification was achieved, and when applied to both data sets combined, 97.0% of samples were correctly identified. Thus, the authors concluded that pretreatment or culturing of strains before spectrum measurement did not significantly influence the accuracy of the devised method.

The potential of visible (VIS) and near-infrared (NIR) spectroscopy to discriminate and identify yeast strains was demonstrated by Cozzolino et al. [6]. In this study, the 2nd derivative of the yeast spectra was subjected to PCA. The resulting PCs were taken as independent variables of LDA, with the goal of classifying strains (with different deletion mutations) based on their metabolome. The observed differences were mostly consistent with the knowledge about specific yeast metabolic functions. Silva et al. [24] used UV–VIS (Ultraviolet–Visible) and VIS–SWNIR (Visible–Short-wave NIR) diffuse reflectance spectra in order to discriminate different yeasts and bacteria. Spectrum data was normalized to account for various integration times, and the growth media spectrum was subtracted from the microorganism spectra to increase spectral variance, since microorganisms were grown in distinct media. Finally, spectra were subjected to light scattering correction, and underwent a modified PCA, with emphasis on statistical robustness. AHC, using Euclidean distance and average linkage, was performed on the PCs of the UV–VIS and

VIS–SWNIR data, leading to the conclusion that VIS–SWNIR produces higher discrimination ratios for all the studied microorganisms. In a later study from the same group [5], the authors experimented with yeast metabolic state identification under different growth conditions, using a slightly broader spectral interval. Spectra were subjected to low-pass filtering to smooth the signal, followed by light scattering correction. The same modified PCA was applied to the 1st derivative of the spectra, and the resulting PCs underwent a similar clustering process. Results showed that spectroscopy has potential for yeast metabolic state identification, provided the spectral signatures of colonies differ from one another, in which case 100% correct classification is achievable.

2.2. Clustering with volume-based metrics

When using a minimum volume increase (MVI) criterion in AHC, the inter-cluster dissimilarity is equal to the increase of volume resulting from the merge of any given cluster pair. There are several ways of defining the volume of a cluster of points; when clusters are roughly shaped as convex sets, the volume of the corresponding convex hull or ellipsoid enclosure can be intuitively considered. To our knowledge, the convex hull volume has not been used before in an MVI clustering context, but some work exists regarding the use of minimum volume ellipsoids (MVE) for this purpose [2,14,17,23,27], comparing favorably with several off-the-shelf methods, such as k-means. MVE clustering presents several desirable features, such as scale-invariance due to the use of the Mahalanobis distance metric [27]. Further, data often follows a mixture of Gaussian distributions, which are shaped as ellipsoids, and are thus suitable for this type of clustering. However, if the number of points in a cluster is insufficient, or the cluster lacks an ellipsoidal shape, the use of convex hull volume minimization may be advantageous.

MVI presents two main problems when compared with other clustering criteria. First, volume computations incur higher computational costs, especially when used in a combinatorial context such as AHC. Second, during AHC initialization, in an m-dimensional problem, clusters with fewer than m + 1 points do not have volume. Thus, MVI must begin with clusters containing at least (m + 1)/2 points, so that all possible new clusters will have the minimum m + 1 observations necessary for volume (however, initial clusters are not required to have volume, i.e. they can have zero contribution in Eq. (1)). To address this issue, a divisive partitioning algorithm which keeps scale-invariance during MVE clustering is suggested by Kumar and Orlin [14]. However, it is not deterministic, and may lead to unbalanced clusters.
In general, any clustering algorithm can be used for initial clustering, provided that each initial cluster contains at least ðm þ 1Þ=2 observations, located close enough to ensure low volume.
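To make the convex-hull variant of this criterion concrete, the sketch below (our illustration, not the authors' implementation) computes the volume increase caused by a candidate merge, assuming SciPy's Qhull bindings are available. Clusters with too few points simply contribute zero volume, as discussed above.

```python
import numpy as np
from scipy.spatial import ConvexHull, QhullError

def hull_volume(points):
    """Convex hull volume of a set of m-dimensional points.

    Clusters with fewer than m + 1 points (or degenerate point sets)
    have no volume, so they contribute 0 to the dissimilarity.
    """
    points = np.asarray(points)
    if points.shape[0] < points.shape[1] + 1:
        return 0.0
    try:
        return ConvexHull(points).volume  # in 2-D, .volume is the area
    except QhullError:  # e.g. collinear points
        return 0.0

def mvi_dissimilarity(cluster_i, cluster_j):
    """Volume increase caused by merging two clusters."""
    merged = np.vstack((cluster_i, cluster_j))
    return hull_volume(merged) - hull_volume(cluster_i) - hull_volume(cluster_j)

# Two thin, nearly parallel 2-D clusters: merging across the gap creates
# a much larger hull than merging two halves of the same cluster.
rng = np.random.default_rng(0)
a = np.column_stack((rng.uniform(0, 1, 20), rng.uniform(0, 0.01, 20)))
b = a + np.array([0.0, 1.0])
print(mvi_dissimilarity(a, b) > mvi_dissimilarity(a[:10], a[10:]))
```

Caching `hull_volume` results per cluster, as the paper suggests for its own volume computations, avoids recomputing unchanged hulls at every AHC step.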

3. The data

3.1. Spectrometric data from yeast

The yeast spectra were obtained in a study by Fonseca et al. [9] with VIS–NIR reflectance spectroscopy (450–1000 nm), using yeast strains from different locations and environments. Strains were grown in 96-well (8 rows × 12 columns) microplates at 30 °C in YPD¹ medium for 72 h, so colonies would occupy the entire well. Spectra were obtained inside a biolog, a box designed to isolate the microplate containing the strains from environmental light while keeping the probe at a 90° angle with the plate. Twelve experiments were prepared, half of which containing nine strains with large genetic differences, and the remaining half containing nine isogenic strains isolated from spontaneous fermentations of the commercial strain VL1. Strains S288C and VL1, as well as blank YPD medium, were used as controls in all experiments, making up a total of twelve measured materials per microplate (Table 1). Different materials were distributed along the twelve microplate columns. In each column, replicas of the assigned material were placed in four wells, leaving the remaining four vacant. Five spectra were taken in each occupied well, making the total number of observations per experiment equal to 240 (12 materials × 4 replicas × 5 measurements). Each spectrum is a vector with the absorbance measured at wavelengths from roughly 450 to 1000 nm (VIS–NIR), spaced by 0.26 nm, i.e. 2189 values. The raw spectra were subject to: (a) normalization, because of different integration times; and, (b) correction of light scattering. The resulting data sets are labeled from D1 to D6 (different strains) and from I1 to I6 (isogenic strains).

Table 1
Measured materials.

#    Different strains    Isogenic strains
1    L558                 VL002
2    L709                 VL006
3    L718                 VL011
4    L724                 VL018
5    L734                 VL020
6    L739                 VL064
7    L748                 VL097
8    L752                 VL099
9    L756                 VL108
10   VL1 (a)              VL1 (a)
11   S288C (a)            S288C (a)
12   YPD (a)              YPD (a)

(a) Control material.

Several dimensionality reduction techniques were then applied to the processed spectra data sets, as well as to their first and second derivatives. These techniques can be divided into PCA and methods based on a dissimilarity matrix. PCA provided more insight on strain differentiation when directly applied to the processed spectra. First and second derivatives of the spectra yielded poor strain separability, and were thus discarded. For all data sets, the first PC explained between 92.12% and 98.05% of the variance, while the second varied between 1.29% and 6.81%. Fig. 1 shows the D1 data set and the associated score plot for the first two PCs. The score plot exhibits the distinct groups of strains scattered along a preferential direction, forming low volume clusters; also, different clusters have similar directions, i.e. they are nearly parallel with regard to one another. This is a recurrent layout, present not only in our data sets, but in many other cases where PCA is performed on spectrometric data [1,5,8,21,22].

Among the methods based on the dissimilarity matrix, the distance between any given pair of vectors (i.e. the spectra or their 1st or 2nd derivatives) was measured by two approaches. The first uses Pearson's correlation coefficient, as in [11]. In the second, the distance is given by the variance of the difference between vector pairs.

¹ Yeast Extract Peptone Dextrose, also often abbreviated as YPD, is a complete medium for yeast growth containing yeast extract, peptone, bidistilled water, and glucose or dextrose.

3.2. Synthetic data

A group of six synthetic data sets (Fig. 2) is also subjected to the same clustering analysis as the yeast spectrometric data, to assess the overall effectiveness of the different methods and to demonstrate the broader applicability of the proposed techniques. The synthetic data sets share common characteristics with the spectrometric data after PCA, i.e. low volume and similar direction clusters, but also present key differences regarding number of points, number of clusters, scale, and inter- and intra-cluster point proximity. The synthetic data sets are labeled from S1 to S6 (Fig. 2(a)–(f)), and are composed of a variable number of clusters and cluster elements. Of these, only S1 (Fig. 2(a)) contains non-overlapping clusters; the remaining have at least one overlapping cluster. To reflect the poor separation of some strains in the spectra PCA, the S5 data set (Fig. 2(e)) contains multiple mixed groups, making it impossible to differentiate them with 100% accuracy.

4. Minimum volume and minimum direction change clustering criteria

The minimum volume increase (MVI) and minimum direction change (MDC) criteria, described in Sections 4.1 and 4.2, respectively, are proposed as alternatives to commonly used inter-cluster dissimilarity measures in AHC, such as single linkage (nearest neighbor) or Ward's minimum variance linkage. MVI aims to minimize the total volume of clusters, while the goal of MDC consists of minimizing the global variance in cluster directionality. MDC is better suited for the later stages of AHC, when cluster directionality

Fig. 1. The D1 data set, corresponding to the processed spectra from the first of six ‘‘different strains’’ experiments (left), and the associated score plot for the first two PCs (right).
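The PCA step that produces such score plots can be sketched as follows (a minimal illustration via SVD on mean-centered data; the spectra matrix below is a synthetic stand-in, since the original 240 × 2189 data set is not reproduced here).

```python
import numpy as np

def pca_scores(spectra, n_components=2):
    """PCA via SVD on mean-centered data: returns the scores and the
    fraction of total variance explained by each retained component."""
    X = spectra - spectra.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    var_explained = S**2 / np.sum(S**2)  # singular values come sorted
    scores = X @ Vt[:n_components].T
    return scores, var_explained[:n_components]

# Stand-in for a 240-observation spectra matrix (here 200 wavelengths).
rng = np.random.default_rng(1)
base = np.sin(np.linspace(0, 3, 200))
spectra = base + rng.normal(0, 0.05, (240, 200))

scores, var = pca_scores(spectra)
print(scores.shape)  # (240, 2)
```

Plotting the two columns of `scores` against each other yields a score plot analogous to the right panel of Fig. 1.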


Fig. 2. Synthetic data sets.

patterns start to emerge; thus, MDC should ideally be preceded by a more generic criterion, increasing its influence during the clustering process. A form of combining MDC with MVI in such a fashion is proposed in Section 4.3.
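This gradual increase of MDC influence can be illustrated as a weight schedule (a sketch assuming the dynamic weight of Eq. (7), w^d = w^d_f (k/(g-1))^λ, where g is the current number of clusters and k the desired number; the exact schedule is formalized in Section 4.3).

```python
def mdc_weight(g, k, w_df, lam):
    """Dynamic MDC weight: grows from a small value towards the final
    weight w_df as the number of clusters g shrinks towards k."""
    return w_df * (k / (g - 1)) ** lam

# With k = 20 target clusters, starting from 100 clusters:
print(mdc_weight(100, 20, 0.5, 5))  # small early weight
print(mdc_weight(21, 20, 0.5, 5))   # reaches w_df = 0.5 at the last merge
```

Larger exponents keep the MDC contribution negligible for longer, deferring its influence to the final merges, which is exactly when directionality patterns are most visible.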

4.1. Minimum volume increase

In the MVI criterion, the inter-cluster dissimilarity, d^m, is equal to the increase in volume when merging two clusters, as shown in Eq. (1) for arbitrary clusters i and j. The dissimilarity between all cluster pairs (within g existing clusters) can be represented by the distance matrix D^m, as given in Eq. (2). The minimum value in the distance matrix indicates the pair of clusters to be merged, i.e. the least dissimilar cluster pair.

d^m_{i,j} = d^m_{j,i} = \mathrm{vol}(i, j) - \mathrm{vol}(i) - \mathrm{vol}(j)    (1)

D^m = \begin{pmatrix} 0 & d^m_{1,2} & \cdots & d^m_{1,g} \\ d^m_{2,1} & 0 & \cdots & d^m_{2,g} \\ \vdots & \vdots & \ddots & \vdots \\ d^m_{g,1} & d^m_{g,2} & \cdots & 0 \end{pmatrix}    (2)

If the volume of clusters is determined using the MVE, the method becomes similar to the hVolume algorithm described by Kumar and Orlin [14], except for the initial clustering step. The two main problems with MVI, described in Section 2.2, have been addressed as follows. First, the issue of computational cost was minimized by caching volume computations for possible new clusters, thus avoiding the computation of most cluster volumes in each step. Second, the initial clustering requirement was approached using two deterministic solutions:

1. A modified AHC (mAHC) algorithm which uses standardized Euclidean distance² with single linkage, and produces clusters with a predefined minimum size. This clustering prioritizes the merge of smaller clusters and proceeds until all clusters have size equal to or greater than a given value.
2. A modified Principal Direction Divisive Partitioning [4] (mPDDP) algorithm which divides each cluster using the hyperplane orthogonal to the cluster's first principal direction (FPD) computed by PCA. The implemented mPDDP always selects the largest cluster for division, with the algorithm proceeding while the division of a cluster yields sub-clusters with more than (m + 1)/2 observations.

² In standardized Euclidean distance, the component-wise difference between observations is scaled by dividing it by the standard deviation of the respective component.

4.2. Minimum direction change

In the MDC criterion, inter-cluster dissimilarity is equal to the size of the angle between the FPD of the new cluster candidate and the average major direction (AMD), as shown in Eq. (3) for two clusters i and j; here, f_{i,j} and v_{AMD} are vectors representing the FPD of the new cluster candidate (i.e. clusters i and j combined) and the AMD, respectively. The dissimilarity matrix is given by Eq. (4).

d^d_{i,j} = d^d_{j,i} = \arccos \frac{f_{i,j} \cdot v_{AMD}}{\|f_{i,j}\| \, \|v_{AMD}\|}    (3)

D^d = \begin{pmatrix} 0 & d^d_{1,2} & \cdots & d^d_{1,g} \\ d^d_{2,1} & 0 & \cdots & d^d_{2,g} \\ \vdots & \vdots & \ddots & \vdots \\ d^d_{g,1} & d^d_{g,2} & \cdots & 0 \end{pmatrix}    (4)

The AMD is defined using a weighted average of each cluster's FPD, as formalized in Eq. (5), where F is an m × g matrix of unit vectors representing the FPDs of the g existing clusters, and s is a g × 1 vector with the respective weights. This work proposes two weighting strategies: the use of the respective FPD eigenvalue or its square. These can be obtained by applying singular value decomposition (SVD) to the mean-centered cluster data or to its covariance matrix, respectively.

v_{AMD} = F s    (5)
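The MDC building blocks can be sketched as follows (our illustration, not the paper's code): FPD extraction via SVD, the angle-based dissimilarity of Eq. (3), and the weighted AMD of Eq. (5). Since the sign of an FPD returned by SVD is arbitrary, the absolute cosine is used below, which is a simplification relative to Eq. (3).

```python
import numpy as np

def first_principal_direction(points):
    """FPD of a cluster: first right singular vector of the
    mean-centered data, plus its singular value (used as weight)."""
    X = points - points.mean(axis=0)
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0], S[0]

def average_major_direction(clusters):
    """AMD: weighted sum F s of the clusters' unit FPDs, weighted
    here by the first singular value of the centered data."""
    dirs, weights = zip(*(first_principal_direction(c) for c in clusters))
    F = np.column_stack(dirs)  # m x g matrix of unit FPDs
    s = np.asarray(weights)    # g weights
    return F @ s

def mdc_dissimilarity(cluster_i, cluster_j, v_amd):
    """Angle between the merged candidate's FPD and the AMD.
    FPD signs are arbitrary, hence the absolute cosine."""
    f_ij, _ = first_principal_direction(np.vstack((cluster_i, cluster_j)))
    cos = np.dot(f_ij, v_amd) / (np.linalg.norm(f_ij) * np.linalg.norm(v_amd))
    return float(np.arccos(np.clip(abs(cos), 0.0, 1.0)))

# Two short, horizontal, nearly parallel clusters offset vertically.
rng = np.random.default_rng(2)
c1 = np.column_stack((rng.uniform(0, 0.3, 30), rng.normal(0, 0.01, 30)))
c2 = c1 + np.array([0.0, 1.0])
v_amd = average_major_direction([c1, c2])

within = mdc_dissimilarity(c1[:15], c1[15:], v_amd)  # merge along the trend
across = mdc_dissimilarity(c1, c2, v_amd)            # merge across the gap
print(within < across)
```

Merging two halves of the same elongated cluster preserves the common direction (small angle), whereas merging across the gap produces a candidate whose FPD is nearly orthogonal to the AMD (large angle), so MDC penalizes it.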

To our knowledge, the MDC criterion is a novel approach. It aims at exploiting a particular data feature shared by many spectrometric data sets after PCA, i.e. data containing clusters oriented along a common direction.

4.3. Integrating MVI and MDC

A possible form of combining MDC with MVI in order to calculate the dissimilarity matrix is given in Eq. (6), where w^d is the weight of the MDC criterion in the combined dissimilarity.

D = (1 - w^d) \, \frac{D^m}{\max_{i,j} d^m_{i,j}} + w^d \, \frac{D^d}{\max_{i,j} d^d_{i,j}}    (6)

As can be observed, matrices D^d and D^m are scaled to the same order of magnitude by dividing each by its maximum value. However, it is desirable to increase the MDC influence towards the final steps of the algorithm, when direction trends become clearer. One way is to make w^d dynamic, as described by Eq. (7):

w^d = w^d_f \left( \frac{k}{g - 1} \right)^{\lambda}    (7)

where w^d_f is the MDC final weight (the weight the MDC criterion will have on the last iteration), k is the desired number of clusters, g is the current number of clusters, and λ is the parameter which controls how the MDC weight increases during the clustering process. Fig. 3 illustrates how the MDC weight changes depending on λ. The integration of the MVI and MDC criteria in AHC, as described in this section, will be further referenced as AMVIDC.

Fig. 3. Dynamics of the MDC weight (w^d) for different values of λ, with k = 20, g_init = 100 and w^d_f = 0.5.

5. Experiments and results

The performance of the different clustering techniques is assessed using the known class labels of the observations. The F-score is used, as it weighs both precision and recall, and offers a simple reading on the quality of the results, from 0 (worst score) to 1 (perfect clustering) [16,25].

5.1. Computational test design of the clustering methods

In this work, the testing of the different clustering methods was divided into two groups, according to the nature of the data used by each method: (1) bi-dimensional data; and (2) dissimilarity matrices.

The data for group (1) comes from two sources: (a) the first and second principal components of the PCA applied to the spectrometric data; and, (b) the synthetic data. For group (2), only spectrometric data is used, and the dissimilarity matrices are computed following the description outlined in the last paragraph of Section 3.1.

The methods selected for group (1) are: (a) k-means; (b) EM; (c) AHC with typical distance and linkage criteria; and (d) AMVIDC. Both k-means and EM use random initialization of cluster centers. In k-means, two methods are tested for determining centroids: the mean and the median of the cluster points, both applied to the variable space (component-wise). For method (c), AHC is applied with different combinations of distance and linkage types. The distances used are the Euclidean, standardized Euclidean, Manhattan, Hamming, Chebyshev and Mahalanobis distances, as well as three dissimilarity distances based on the Pearson, Spearman and Jaccard coefficients. These distances were combined with the following linkage types: average, weighted average, single (nearest neighbor) and complete (farthest neighbor). Additionally, the Euclidean distance was combined with the centroid, weighted centroid (median) and Ward linkages. Method (d), AMVIDC, was tested with the following combination of parameters:

- Init. clustering: mAHC, mPDDP
- Cluster volume: convex hull, MVE
- Cluster direction weight: first principal singular value computed directly from the cluster centered data or from its covariance.
- MDC final weight: w^d_f = 0, 0.1, 0.2, ..., 1.0
- MDC weight rate: λ = 0, 1, 2, ..., 24

Group (2) comprises two methods, namely PAM and AHC. In the former, the cluster centers are initialized randomly. For AHC, the only changeable parameter is the linkage type, since distance is given by the different dissimilarity matrices described in Section 3.1. All linkage types used for method (c) in group (1) are also used here.

In all tests, the number of clusters of each data set is provided to the clustering algorithms as a parameter. EM, k-means and PAM use this value as the number of cluster centers to optimize, whereas the AHC algorithms use it as a termination condition.

5.2. Results

Table 2 shows the best F-scores for each method and data set, i.e. the highest value found for all combinations of the tested


Table 2
Best F-scores for each method and data set; values marked with * are the highest in each data set. Group (1) methods are applied to bi-dimensional data (i.e. the PCs of spectrometric data and the synthetic data), while group (2) methods are applied to dissimilarity matrices obtained from spectrometric data.

           Group (1)                                Group (2)
Data set   k-Means   EM        AHC typ.  AMVIDC    PAM       AHC typ.
D1         0.8836    0.8771    0.7886    0.9325*   0.7010    0.5686
D2         0.6718    0.6876    0.5852    0.7033*   0.6465    0.5211
D3         0.6657    0.6990    0.5697    0.7486*   0.6092    0.5093
D4         0.6243    0.6652*   0.5103    0.6528    0.5981    0.5143
D5         0.6520    0.6531    0.5735    0.7029*   0.5993    0.5084
D6         0.7138    0.7838    0.6227    0.7957*   0.6310    0.4835
I1         0.7310    0.7707    0.5960    0.8147*   0.6159    0.3914
I2         0.7042    0.7009    0.5629    0.7418*   0.5856    0.5041
I3         0.6041    0.6251    0.4844    0.7172*   0.5357    0.5089
I4         0.6369*   0.6194    0.5031    0.6135    0.5569    0.3855
I5         0.6768    0.6746    0.5462    0.6785*   0.5329    0.4239
I6         0.6535    0.7168    0.4916    0.7302*   0.6136    0.5665
S1         1.0000*   0.9960    1.0000*   1.0000*   N/A       N/A
S2         0.6452    0.8901    0.7607    0.9293*   N/A       N/A
S3         0.8292    0.9396    0.8033    0.9801*   N/A       N/A
S4         0.9531    0.9950*   0.9524    0.9900    N/A       N/A
S5         0.5376    0.8254*   0.4897    0.7715    N/A       N/A
S6         0.5841    0.9567*   0.7265    0.8310    N/A       N/A

algorithm parameters. In the case of k-means, EM and PAM (stochastic algorithms), it is the highest value in 1000 runs.
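The F-score for a clustering can be computed in several ways; the sketch below implements one common variant (per-class best-matching F, weighted by class size), which may differ in detail from the exact measure of [16,25].

```python
import numpy as np

def clustering_fscore(labels_true, labels_pred):
    """Clustering F-score: for each true class, take the best F
    (harmonic mean of precision and recall) over all clusters, then
    average weighted by class size. Ranges from 0 (worst) to 1
    (perfect clustering, up to cluster relabeling)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    total = 0.0
    for cls in np.unique(labels_true):
        in_cls = labels_true == cls
        best_f = 0.0
        for clu in np.unique(labels_pred):
            in_clu = labels_pred == clu
            inter = np.sum(in_cls & in_clu)
            if inter == 0:
                continue
            precision = inter / np.sum(in_clu)
            recall = inter / np.sum(in_cls)
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        total += np.sum(in_cls) / n * best_f
    return total

# Perfect clustering up to relabeling scores 1.0.
print(clustering_fscore([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because the measure is invariant to cluster labels, it can compare partitions produced by algorithms that number their clusters arbitrarily, such as k-means and AHC cuts.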

5.2.1. Algorithm performance

The methods of group (1) present the best clustering results, with AMVIDC yielding the highest F-scores for 72% of cases and values close to the best for the remaining. The most noticeable exception occurs in the S6 data set, for which EM displays superior clustering performance. This is because the S6 clusters present higher volumes and divergences of their FPDs. Although these conditions are less favorable for AMVIDC, it still yields a higher F-score when compared with the remaining methods. The EM algorithm produces solid results, close but generally inferior to the AMVIDC F-scores. K-means produces interesting but inconsistent results, e.g. it provides the highest F-score for the I4 data set, but displays poor performance in most synthetic data sets, yielding generally lower F-scores than EM and AMVIDC. AHC with typical metrics is not suited for these data sets, presenting the lowest F-scores within group (1) methods.

In group (2), PAM shows higher F-scores than AHC. Results for PAM are also generally superior to those of typical AHC in group (1), but inferior to k-means. Group (2) AHC produces poor results, further asserting that AHC is not appropriate for the spectrometric data used in this work. Distance matrices based on the variance with the original spectra account for most of PAM's higher F-scores. In the case of AHC, the use of spectral derivatives has a positive impact on performance; i.e. distances based on the 2nd derivative are responsible for 50% of the best results, followed by original spectra and 1st derivative dissimilarities, respectively.

5.2.2. Algorithm parameters

For k-means, mean-based centroids produce better results for 66% of the spectrometric data sets and for 50% of the synthetic data. Thus, there is no clear advantage in selecting mean-based over median-based centroids. In standard AHC, most of the highest F-scores are achieved using the Euclidean and Chebyshev distances combined with Ward's linkage.

AMVIDC parameters are analyzed in two ways: (i) discussing the parameter values of the highest F-score for each data set; and, (ii) using F-score statistics, i.e. maximum, minimum, mean and standard deviation. For each parameter, the statistics are calculated with the F-scores obtained from all possible combinations of the remaining parameters. The latter analysis is performed separately for the two different data types (spectrometric and synthetic), i.e. by averaging over the respective data sets.

5.2.2.1. Initial clustering. Analysis of the results shows that mAHC is the best performing method for initial clustering, except for the D4, I2, I4 and S4 data sets. Statistics show that mAHC outperforms mPDDP for both spectrometric and synthetic data, with maximum F-scores of 0.733 and 0.893, respectively, against 0.702 and 0.760 for mPDDP. It also presents a higher average F-score for both data types.

5.2.2.2. Cluster volume. Convex hull volume minimization is an effective method for clustering, since it accounts for 66% of the best F-scores in the spectrometric data and nearly all the results for the synthetic data sets. For the D4, I2, I4 and I5 data sets, the clustering is slightly better when using ellipsoidal volume. Additionally, statistics indicate that convex hull volume presents higher maximum and average F-scores.

5.2.2.3. Cluster direction weight. The tests showed that this selection is not important, since both approaches yielded either the same highest F-score (50% of the cases) or results with a negligible difference.

5.2.2.4. MDC final weight (w^d_f) and MDC weight rate (λ). The analysis of the results by data set shows that most of the high F-score values are achieved for 0.1 ≤ w^d_f ≤ 0.3. Among the spectrometric data sets, D4 was an exception, with w^d_f = 0.9. For the synthetic data sets S2 and S5, the optimal F-scores are found with w^d_f equal to 0.6 and 0.5, respectively. S3 stands out because of the large range of values which yield the highest F-score: 0.1 ≤ w^d_f ≤ 0.8. Results for the S4 data set are also interesting, since it was the only case where the highest F-score was also obtained without the MDC criterion, more specifically for 0.0 ≤ w^d_f ≤ 0.2. Regarding the impact of w^d_f on the clustering performance statistics, results show that maximum F-scores are attained with w^d_f = 0.1 for both data groups. Although the contribution of the MDC criterion is small, its influence is important to improve classification accuracy.

The influence of λ becomes clearer for values above 10, where the algorithm provides clusters with a low variability of the maximum, mean and minimum F-scores, independently of w^d_f. For λ < 10, there are good results for the maximum F-score, achieved with w^d_f equal to 0.1, 0.2 at most (especially for spectrometric data). However, the mean and minimum F-scores are lower in this region, which increases the variability of the algorithm results. The analysis suggests that the intervals 0.1 ≤ w^d_f ≤ 0.2 and 13 ≤ λ ≤ 15 provide conditions for the algorithm to produce good classification results with small variability; i.e. these intervals ensure robust algorithm performance. Overall, results suggest that MDC improves clustering results, especially for the spectrometric data sets, but its performance depends on both the w^d_f and λ parameters.

5.2.3. Performance on the spectrometric and synthetic data

Analyzing the results from the perspective of the data groups, it is possible to infer that AMVIDC works well with the spectrometric data when compared with the remaining tested algorithms. The tests also highlight the poor discrimination capabilities of the techniques most commonly used in microbiology, such as AHC with standard metrics. Independently of the algorithm used, there are significant differences in the clustering quality of the spectrometric data sets. For example, in D1, strains were distinguished with a highest F-score of 0.9325, whereas in I4 the best F-score found was only
0.6369. The most likely reason is related to the variability of: (i) the growing medium and microplate composition; and, (ii) the state of the colonies, which cannot be fully controlled. Additionally, the average of the highest F-scores for the different-strain data sets (D1–D6) was 0.7580, while for the isogenic strains (I1–I6) it was 0.7199. These results alone are not conclusive regarding the use of spectrometry for strain separability. Nonetheless, the best results found for the different (0.9325) and isogenic (0.8147) strains suggest that, under certain conditions, separation is possible, even for closely related strains.

For the synthetic data, both AMVIDC and EM yield good results. This demonstrates that AMVIDC is not limited to spectrometric data, maintaining a solid and consistent clustering performance when used on other data sources. AMVIDC, although not clearly superior to (the best result in 1000 runs of) EM, shows at least similar performance while being deterministic and able to produce dendrograms. These are useful features in many clustering problems, such as the differentiation of biological and chemical families. As such, AMVIDC can replace EM in these cases, without loss of clustering quality, while offering features EM does not have. The remaining techniques, k-means and AHC with standard metrics, show inconsistent performance, which is sometimes clearly inferior to AMVIDC and EM.

5.3. Further experimentation

An implementation of the AMVIDC algorithm, including source code and user manual, is available at .

6. Conclusions

In this paper we proposed AMVIDC, a clustering technique which dynamically weighs two criteria, MVI and MDC, in an AHC context. The algorithm, inspired by the PCA layout of spectrometric data, was used to distinguish yeast strains through their spectrometric signature. In order to demonstrate the broader applicability of AMVIDC, the technique was also tested on synthetic data.
When applied to yeast spectrometric data, AMVIDC was shown to outperform a set of well-known clustering algorithms, including techniques commonly used for microorganism differentiation. Tests performed on synthetic data highlighted the discrimination capabilities of both the AMVIDC and EM algorithms. However, only AMVIDC provided deterministic results and dendrograms, which are important features for the differentiation of microorganisms and for other clustering problems.

Acknowledgements

This work was supported by FEDER funds through Programa Operacional Factores de Competitividade – COMPETE, by national funds of the projects PEst-OE/EEI/LA0009/2013, PEst-OE/MAT/UI0152, PDTC/AGR-ALI/103392 and PDCTE/BIO/69310/2006 from the Fundação para a Ciência e a Tecnologia (FCT), and partially funded by Grant SFRH/BD/48310/2008, also from FCT.