Toward automated chromatographic fingerprinting: A non-alignment approach to gas chromatography mass spectrometry data


Accepted Manuscript

Authors: Jochen Vestner, Gilles de Revel, Sibylle Krieger-Weber, Doris Rauhut, Maret du Toit, André de Villiers

PII: S0003-2670(16)30090-3
DOI: 10.1016/j.aca.2016.01.020
Reference: ACA 234360
To appear in: Analytica Chimica Acta
Received Date: 27 October 2015
Revised Date: 14 January 2016
Accepted Date: 19 January 2016

Please cite this article as: J. Vestner, G. de Revel, S. Krieger-Weber, D. Rauhut, M. du Toit, A. de Villiers, Toward Automated Chromatographic Fingerprinting: A Non-Alignment Approach to Gas Chromatography Mass Spectrometry Data, Analytica Chimica Acta (2016), doi: 10.1016/j.aca.2016.01.020. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

[Graphical abstract: GC-MS data arranged as samples × segmented retention profile × mass spectra, XX^T transformations, and PARAFAC on an array of samples × samples × number of segments.]

Toward Automated Chromatographic Fingerprinting: A Non-Alignment Approach to Gas Chromatography Mass Spectrometry Data

Jochen Vestner a,b,c,∗, Gilles de Revel a,b, Sibylle Krieger-Weber d, Doris Rauhut c, Maret du Toit e, André de Villiers f

a Université de Bordeaux, ISVV, EA 4577, Unité de recherche Œnologie, 33882 Villenave d’Ornon, France.
b INRA, ISVV, USC 1366 Œnologie, 33882 Villenave d’Ornon, France.
c Department of Microbiology and Biochemistry, Hochschule Geisenheim University, Von-Lade-Straße 1, 65366 Geisenheim, Germany.
d Lallemand, In den Seiten 53, 70825 Korntal-Münchingen, Germany.
e Institute of Wine Biotechnology, Department of Viticulture and Oenology, Stellenbosch University, Private Bag X1, Matieland (Stellenbosch) 7602, South Africa.
f Department of Chemistry and Polymer Science, Stellenbosch University, Private Bag X1, Matieland (Stellenbosch) 7602, South Africa.

Abstract

In contrast to targeted analysis of volatile compounds, non-targeted approaches take information of known and unknown compounds into account, are inherently more comprehensive and give a more holistic representation of the sample composition. Although several non-targeted approaches have been developed, there is still a demand for automated data processing tools, especially for complex multi-way data such as chromatographic data obtained from multichannel detectors. This work was therefore aimed at developing a data processing procedure for gas chromatography mass spectrometry (GC-MS) data obtained from non-targeted analysis of volatile compounds. The developed approach uses basic matrix manipulation of segmented GC-MS chromatograms and PARAFAC multi-way modelling. The approach takes retention time shifts and peak shape deformations between samples into account and can be done with the freely available N-way toolbox for MATLAB. A demonstration of the new fingerprinting approach is presented using an artificial GC-MS data set and an experimental full-scan GC-MS data set obtained for a set of experimental wines.

Keywords: non-targeted analysis, gas chromatography, fingerprinting, multi-way analysis, metabolomics, non-alignment

∗ Corresponding author. Tel.: +49 6722 502 346; fax +49 6722 502 330 347. Email address: [email protected] (Jochen Vestner)

Preprint submitted to Analytica Chimica Acta, January 14, 2016

1. Introduction

Non-targeted analysis has increasingly gained importance in numerous domains of analytical chemistry such as life science, food science and especially the ‘-omics’ related sciences. In contrast to conventional targeted analysis, non-targeted analysis aims to gather qualitative and quantitative information on as many compounds as possible in the analysed samples in a short period of time, and thus to provide the researcher with a more holistic view of the composition of samples [1]. Holistic strategies benefit from the vast amount of information obtained from modern analytical instrumentation. However, the main challenges are data handling and full exploitation of the dimensionality of the acquired data.

The data generated by hyphenated chromatographic techniques such as GC-MS or LC-MS are especially information rich. Feature extraction methods such as peak integration in single ion chromatograms, total ion chromatograms or deconvoluted signals are the most common approaches to extract information from chromatographic data and result in relatively small data tables which are straightforward to analyse [2, 3, 4, 5, 6, 7, 8]. Although various peak integration algorithms and software packages have been developed [9, 10, 11, 12], automated peak integration remains troublesome due to coelution and potentially erroneous peak integration and/or assignment. Time-consuming manual correction of the results is often necessary. Moreover, relevant information from the raw data can be lost due to such feature extraction before modelling [13, 14]. Deconvoluting chromatographic signals can also be time-consuming in terms of model construction and evaluation of results [15, 16, 2, 17].

An alternative, more comprehensive approach aiming at the extraction of more information and underlying patterns in the data involves the use of the entire two dimensional raw data signal of each sample as a chromatographic fingerprint for modelling. Examples of holistic non-targeted analyses can be found in numerous reports [14, 18, 19, 20, 21, 22, 23, 24, 25], some of which also include the application of multi-way analysis methods such as TUCKER3, PARAFAC and N-PLS to hyphenated chromatographic data. When factor models are used on chromatographic data, challenges are associated with the increased size of data and the handling of shifts and peak shape deformation, which result in distortion of the bilinear/trilinear structure of the data. Several algorithms and software programmes have been developed for peak alignment [26, 27, 28, 29, 30]. Depending on the data, shift correction can, however, be difficult and time-consuming.

The above described problems of conventional data analysis approaches to non-targeted GC-MS analysis, in particular challenges with automated peak integration and retention time alignment of chromatograms, were the main motivation for the development of an alternative data analysis approach. The major consideration to overcome the peak integration issue was the direct modelling of the chromatographic raw data (without feature selection), including a reduction of the data. The main idea to master the distortion of the bilinear/trilinear structure of the data due to shifting peaks was a mathematical transformation of pieces (segments) of the chromatograms into matrices of sums of squares and cross products (SSCP matrices). SSCP matrices are square, symmetric and positive semi-definite, similar to variance-covariance matrices [31], and are utilised for instance in PARAFAC2, STATIS and the calculation of RV-coefficients [32, 19, 33, 34, 35]. In particular, the indirect fitting algorithm for PARAFAC2 [36] served as major inspiration for the development of the new approach. Moreover, for the sake of simplicity, another aim was to use a single model for the entire set of chromatograms of all samples to find systematic differences among samples and to identify important regions of the chromatograms which, if desired, can be further deconvoluted and investigated using e.g. PARAFAC2. A method using multiple PARAFAC2 models on segmented chromatograms has been reported recently [37]. This approach gives very detailed information on fully decomposed mass spectra and peak profiles, which are finally summarized using PCA. The new approach described here can be considered as a ‘segment pre-selection tool’ for subsequent deconvolution of only the important chromatogram segments. In this way, a significant amount of the time otherwise spent on the construction and evaluation of PARAFAC2 models can be saved.

This paper gives an overview of the algorithm of the new data analysis approach, including the theoretical background such as the calculation of SSCP matrices and all other mathematical transformations used. The approach is explained and tested on an artificial, well defined GC-MS data set with and without peak shifts. After the theoretical discussion, the approach is tested on a real GC-MS data set of experimental wines, and the results are confirmed using a reference data analysis method comprising PARAFAC2 deconvolution and peak integration of deconvoluted peak profiles of the entire segmented chromatograms with subsequent PCA on the obtained peak table.

2. Theory

2.1. Defined, artificial GC-MS data set

To demonstrate and verify the developed algorithm a defined, artificial GC-MS data set was created using an in-house developed MATLAB script. The data set consists of 20 chromatograms, each containing 9 to 10 Gaussian peaks with different mass spectra (m/z 35 to m/z 318) and different degrees of overlapping. The whole chromatogram can be divided into five segments. Segment one contains two peaks which perfectly overlap. Peaks three and four partially coelute in segment two, which is also the case for the peaks five, six and seven in segment three. Peak eight is in segment four and the last segment contains the last two peaks nine and ten, which also partially coelute (Figure 14 in Supporting Information). Peak sizes vary among chromatograms as indicated in Table 1; consequently, samples can be divided into four groups. Moreover, a small random variation was added to all peak sizes to simulate a natural deviation of measurements. To simulate baseline noise, random normally distributed noise was added to the whole data set. Each chromatogram can be considered as a matrix of dimensions 1100 scans × 283 masses, thus the entire data set can be considered as a three-way array (i × j × k), with the dimensions 20 samples × 1100 scans × 283 masses.

segment    peak no.    size difference     sample no.
1          2           only present in     14 & 15
2          4           0.7× higher in      1 to 5
5          9           3× higher in        1 to 10

Table 1: Differing peaks among samples in the defined, artificial GC-MS data set.
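The construction of such a data set can be illustrated with a short sketch. The original script was written in MATLAB and is not reproduced here; the NumPy version below is only an approximation. The function name `make_artificial_gcms`, the peak centres, the peak width, the random spectra and the noise levels are all assumptions chosen for illustration, and "0.7× higher" is read as a factor of 1.7 and "3× higher" as a factor of 3.

```python
import numpy as np

def make_artificial_gcms(n_samples=20, n_scans=1100, n_masses=283,
                         noise_sd=0.01, seed=0):
    """Return a (samples, scans, masses) array built from ten Gaussian peaks."""
    rng = np.random.default_rng(seed)
    scans = np.arange(n_scans)
    # Hypothetical peak centres (fractions of the run), laid out so that five
    # equal-width segments contain 2, 2, 3, 1 and 2 peaks; peaks 1 and 2 coincide.
    centres = np.array([0.10, 0.10, 0.28, 0.31, 0.47,
                        0.50, 0.53, 0.70, 0.88, 0.91]) * n_scans
    width = 0.01 * n_scans                               # assumed peak width
    elution = np.exp(-0.5 * ((scans[None, :] - centres[:, None]) / width) ** 2)
    spectra = rng.random((10, n_masses))                 # assumed random spectra
    sizes = np.ones((n_samples, 10))
    sizes[:, 1] = 0.0
    sizes[13:15, 1] = 1.0      # peak 2 only present in samples 14 and 15
    sizes[0:5, 3] = 1.7        # peak 4 larger in samples 1 to 5 (assumed factor)
    sizes[0:10, 8] = 3.0       # peak 9 larger in samples 1 to 10 (assumed factor)
    sizes *= 1.0 + 0.02 * rng.standard_normal(sizes.shape)   # small size variation
    data = np.einsum('ip,pn,pm->inm', sizes, elution, spectra)
    data += noise_sd * rng.standard_normal(data.shape)       # baseline noise
    return data

# Small demonstration; the defaults give the 20 x 1100 x 283 array of the text.
demo = make_artificial_gcms(n_scans=110, n_masses=12)
print(demo.shape)
```

The three grouping variables of Table 1 enter only through the `sizes` matrix, so the four sample groups can be changed without touching the peak shapes or spectra.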


2.2. A new non-alignment approach to non-targeted GC-MS data: Mathematical transformations of raw chromatograms

Using basic matrix algebra an SSCP matrix $XX^T$ is obtained by multiplication of a matrix $X$ with its transpose, as displayed in Equation 1.

$$
XX^T =
\begin{pmatrix}
\sum_{j=1}^{C} x_{1j}^2 & \sum_{j=1}^{C} x_{1j} x_{2j} & \cdots & \sum_{j=1}^{C} x_{1j} x_{Rj} \\
\sum_{j=1}^{C} x_{2j} x_{1j} & \sum_{j=1}^{C} x_{2j}^2 & \cdots & \sum_{j=1}^{C} x_{2j} x_{Rj} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{j=1}^{C} x_{Rj} x_{1j} & \sum_{j=1}^{C} x_{Rj} x_{2j} & \cdots & \sum_{j=1}^{C} x_{Rj}^2
\end{pmatrix},
\tag{1}
$$

where $X$ is an $R \times C$-matrix of elements $x_{ij}$, $i = 1, \ldots, R$, $j = 1, \ldots, C$. The matrix product $XX^T$ is the $R \times R$ matrix of Sums of Squares and Cross Products (SSCP matrix).

In detail, the diagonal of $XX^T$ contains the sums of squares with respect to a given row $i$ of $X$, namely $\sum_{j=1}^{C} x_{ij}^2$. Moreover, all off-diagonal elements represent cross products between two different rows $i$, $k$ of $X$, in particular $\sum_{j=1}^{C} x_{ij} x_{kj}$ for $i \neq k$. Consequently, the sums of squares are a measure of variation within a row, whereas the cross products are a measure of covariation between two rows. Note the similarity to the variance-covariance matrix: diagonal elements of the variance-covariance matrix are variances and all off-diagonal elements are covariances. For the sake of simplicity, the terms variation and variance as well as covariation and covariance are used interchangeably in the following (although this is not strictly mathematically correct).
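As a small numerical illustration of Equation 1, the sketch below computes the SSCP matrix of a made-up 2 × 3 matrix with NumPy and shows how it relates to the variance-covariance matrix once the rows are mean centred. The example values are arbitrary and serve only to make the arithmetic easy to follow.

```python
import numpy as np

# Arbitrary example matrix with R = 2 rows and C = 3 columns.
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

# SSCP matrix: sums of squares on the diagonal, cross products elsewhere.
sscp = X @ X.T
print(sscp)   # [[ 5.  2.]
              #  [ 2. 10.]]

# After centring each row, the SSCP matrix divided by (C - 1) is the
# variance-covariance matrix of the rows.
Xc = X - X.mean(axis=1, keepdims=True)
cov_from_sscp = Xc @ Xc.T / (X.shape[1] - 1)
print(np.allclose(cov_from_sscp, np.cov(X)))   # True
```

The only difference between the two matrices is therefore the centring and the division by C − 1, which is why the terms variation and variance can loosely be exchanged.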

PARAFAC2 is a powerful tool for the deconvolution of small chromatogram segments [35, 37, 38, 39]. The approach presented here is mainly inspired by the idea of the indirect fitting algorithm of the PARAFAC2 model, which, instead of directly modelling an array consisting of the matrices $X^{i}$ (spectral profile × elution profile for $I$ samples), considers a model of an array consisting of the SSCP matrices $X^{i}(X^{i})^T$ [40, 36]. In this manner, PARAFAC2 is suitable for deconvoluting chromatographic peaks with shift along the retention axis among samples. A disadvantage of PARAFAC2 is that a separate model has to be constructed and evaluated for each segment of the chromatogram.
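Why the SSCP matrix tolerates retention shifts can be checked numerically. In the sketch below, a segment is built from two hypothetical coeluting compounds (random spectra, Gaussian elution profiles; all values are illustrative assumptions). Shifting both peaks by the same number of scans changes the raw segment but leaves the mass channel × mass channel SSCP matrix unchanged, as long as the peaks stay inside the segment.

```python
import numpy as np

rng = np.random.default_rng(1)
n_masses, n_scans = 6, 60
scans = np.arange(n_scans)
spectra = rng.random((n_masses, 2))   # two hypothetical compounds

def gaussian(centre, width=3.0):
    """Gaussian elution profile sampled at the scan points."""
    return np.exp(-0.5 * ((scans - centre) / width) ** 2)

def segment(shift):
    """Mass channels x scans matrix with both peaks shifted by `shift` scans."""
    elution = np.vstack([gaussian(25 + shift), gaussian(31 + shift)])
    return spectra @ elution

A_ref = segment(0) @ segment(0).T      # SSCP of the unshifted segment
A_shift = segment(4) @ segment(4).T    # SSCP after a shift of four scans

print(np.allclose(segment(0), segment(4)))   # False: raw data differ
print(np.allclose(A_ref, A_shift))           # True: SSCP is unchanged
```

The invariance follows from $XX^T = S\,(EE^T)\,S^T$ for spectra $S$ and elution profiles $E$: only the inner products between elution profiles enter. If coeluting peaks shift relative to each other, or a peak crosses a segment boundary, $EE^T$ changes and the invariance is only approximate.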

The utilisation of SSCP matrices as a preprocessing step for multivariate modelling of whole chromatograms has also been reported before [19, 33]. If entire two dimensional chromatograms are used for the construction of SSCP matrices, information on the retention time of compounds is lost, complicating the identification of peaks contributing to the differentiation between samples by multivariate modelling.

However, by dividing all chromatograms along the retention axis into segments containing a small number of peaks and subsequent construction of SSCP matrices for each segment, information on the location of peaks in the chromatogram contributing to the differentiation of samples can be preserved. The SSCP matrices for each segment and each sample have dimensions number of mass channels × number of mass channels and contain information on the variation of each mass channel and covariation between all mass channels in each segment for the corresponding sample. For each segment the constructed SSCP matrices of all samples are vectorized and compiled into a new matrix. This step results in a compilation matrix for each segment with the dimensions number of samples × [(number of mass channels + 1) · number of mass channels / 2]. These compilation matrices are then also transformed into SSCP matrices with the dimensions of number of samples × number of samples, which contain information about the variation of the content of the compilation matrix for each sample and the covariation of the content of the compilation matrix between all samples in each segment. These SSCP matrices are finally compiled in a three-way array with the dimension (number of samples × number of samples) × number of segments.

The whole procedure is summarized in matrix notation in the following. Each two dimensional chromatogram (sample) is characterized by $M$ mass channels and $N$ scan points. $N$ is divided into $K$ segments, that is $N = \sum_{k=1}^{K} N_k$, where $N_k$ describes the number of scans in the $k$-th segment. In particular, we have altogether $I$ samples. First, we define an $I \times K$-matrix $X$ by

$$
X = (X^{ik})_{\substack{i=1,\ldots,I \\ k=1,\ldots,K}} =
\begin{pmatrix}
X^{11} & \cdots & X^{1K} \\
\vdots & \ddots & \vdots \\
X^{I1} & \cdots & X^{IK}
\end{pmatrix},
\tag{2}
$$

where $X^{ik}$ is an $M \times N_k$-matrix containing the data of the $i$-th sample and $k$-th segment, that is

$$
X^{ik} = (x^{ik}_{mn})_{\substack{m=1,\ldots,M \\ n=1,\ldots,N_k}} =
\begin{pmatrix}
x^{ik}_{11} & \cdots & x^{ik}_{1N_k} \\
\vdots & \ddots & \vdots \\
x^{ik}_{M1} & \cdots & x^{ik}_{MN_k}
\end{pmatrix}.
\tag{3}
$$

The SSCP matrix $A^{ik} = X^{ik}(X^{ik})^T$ containing information on the variation and covariation between all mass channels of the $i$-th sample and $k$-th segment is defined by

$$
A^{ik} = (a^{ik}_{rt})_{r,t=1,\ldots,M}
\tag{4}
$$

with

$$
a^{ik}_{rt} = \sum_{s=1}^{N_k} x^{ik}_{rs}\, x^{ik}_{ts} \qquad \forall\, r, t = 1, \ldots, M
\tag{5}
$$

and

$$
\dim(A^{ik}) = M \times M,
\tag{6}
$$

for all $i = 1, \ldots, I$ and $k = 1, \ldots, K$.

Subsequently only the upper triangular part of the symmetric SSCP matrix $A^{ik}$ is vectorised (unfolded) and concatenated into a new matrix $Y^{k}$. The vectorisation $\mathrm{vec}(A^{ik})$ of the upper triangular part of $A^{ik}$ is defined by [footnote 1]

$$
\mathrm{vec}(A^{ik}) = \alpha^{ik}_{1} \frown \alpha^{ik}_{2} \frown \cdots \frown \alpha^{ik}_{M},
\tag{7}
$$

where

$$
\alpha^{ik}_{l} = (a^{ik}_{l,l},\, a^{ik}_{l,(l+1)},\, \ldots,\, a^{ik}_{l,M}) \qquad \forall\, l = 1, \ldots, M,
\tag{8}
$$

for all $i = 1, \ldots, I$ and $k = 1, \ldots, K$.

Consequently, the vectorisation $\mathrm{vec}(A^{ik})$ has $J = \sum_{l=1}^{M} l = \frac{M(M+1)}{2}$ components. The $I \times J$-matrix $Y^{k}$ is constructed from the above row vectors $\mathrm{vec}(A^{1k}), \ldots, \mathrm{vec}(A^{Ik})$ as follows:

$$
Y^{k} =
\begin{pmatrix}
\mathrm{vec}(A^{1k}) \\
\vdots \\
\mathrm{vec}(A^{Ik})
\end{pmatrix},
\tag{9}
$$

for all $k = 1, \ldots, K$.

In the end, we form SSCP matrices $Z^{k} = Y^{k}(Y^{k})^T$, which contain information on the variation and covariation between all samples in the $k$-th segment with regard to the variation and the covariation between all mass channels of the $i$-th sample and $k$-th segment,

Footnote 1: The concatenation $\frown$ of two arbitrary row vectors $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$ is defined as $x \frown y = (x_1, \ldots, x_n, y_1, \ldots, y_n)$.

$$
Z^{k} = (Z^{k}_{rs})_{r,s=1,\ldots,I}
\tag{10}
$$

with

$$
Z^{k}_{rs} = \mathrm{vec}(A^{rk}) \cdot (\mathrm{vec}(A^{sk}))^T \qquad \forall\, r, s = 1, \ldots, I,
\tag{11}
$$

for all $k = 1, \ldots, K$. Finally, the matrices $Z^{k}$ are rearranged into the $(I \times I) \times K$-array $Z$:

$$
Z = \begin{pmatrix} Z^{1} & \cdots & Z^{K} \end{pmatrix}.
\tag{12}
$$

Prior to multi-way analysis the three-way array $Z$ is mean centered across the first and second mode and scaled to unit variance within the third mode. The term mode refers here to the dimension of the array.
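The transformation from segmented chromatograms to the array Z can be sketched in a few lines of NumPy. This is an illustrative reading of the equations in this section, not the authors' MATLAB code; the function name `sscp_fingerprint`, the input layout (samples × mass channels × scans) and the `boundaries` argument are assumptions introduced here. Preprocessing of Z is not included.

```python
import numpy as np

def sscp_fingerprint(data, boundaries):
    """Build the (I, I, K) array Z from data of shape (I, M, N).

    data[i] is one chromatogram with M mass channels (rows) and N scans;
    boundaries lists the K + 1 scan indices that delimit the K segments.
    """
    n_masses = data.shape[1]
    # Row-major upper triangle (including the diagonal): same element order
    # as concatenating the rows alpha_1, ..., alpha_M.
    rows, cols = np.triu_indices(n_masses)
    slabs = []
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        seg = data[:, :, start:stop]                  # all X^{ik} for segment k
        A = np.einsum('imn,ipn->imp', seg, seg)       # A^{ik} = X^{ik} (X^{ik})^T
        Y = A[:, rows, cols]                          # Y^k, shape (I, M(M+1)/2)
        slabs.append(Y @ Y.T)                         # Z^k = Y^k (Y^k)^T
    return np.stack(slabs, axis=2)

# Tiny random example: 4 samples, 3 mass channels, 10 scans, 2 segments.
rng = np.random.default_rng(0)
data = rng.random((4, 3, 10))
Z = sscp_fingerprint(data, [0, 4, 10])
print(Z.shape)   # (4, 4, 2)
```

An array stored as samples × scans × masses, as described for the artificial data set, would first have to be reordered, for example with `data.transpose(0, 2, 1)`. Because each slab is itself an SSCP matrix, Z is symmetric in its first two modes.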


2.3. PCA, TUCKER3 and PARAFAC

TE D

Principal component analysis (PCA) is a bilinear multivariate model searching for common patterns in a two dimensional data set. PCA can be understood as a projection method to find directions (components) that maximize the variance in a dataset. These directions, the loadings or latent variables, are constructed as linear combinations of the original variables. The projec-

EP

175

tions of each sample onto these directions are the score values. PARAFAC and TUCKER3, which can be understood as extension of PCA to multi-way data, are multi-linear decomposition methods decomposing a multi-way array into

AC C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65

RI PT

ACCEPTED MANUSCRIPT

sets of loadings. The loadings ideally describe the data in a more condensed

180

way, thereby facilitating the extraction of information. PARAFAC can be expressed as a constrained version of Tucker3, and Tucker3 a constrained version of two-way PCA [41]. For the matrix xij and the three-way array xijk the PCA model (Equation 13), TUCKER3 model (Equation 14) and PARAFAC model (Equation 15), respectively, are described as follows: 10

xij =

F X

aif bjf + eij

(13)

f =1

aif bjf ckf gf1 f2 f3 + eijk

f =1 f =1 f =1

F X

M AN U

xijk =

(14)

SC

xijk =

F3 F2 X F1 X X

aif bjf ckf + eijk

(15)

f =1

185

Where F is the number of factors (components), aif , bif and ckf are elements of the loading matrices A(I×F ) , B(J×F ) and C(K×F ) . gf1 f2 f3 are the elements of the TUCKER3 core array, and eij and eijk are elements in the residual matrix

TE D

E(I×J) and residual array E (I×J×K) , respectively. Note that in PCA A(I×F ) and B(J×F ) are called scores and loadings, while in multi-way analysis only the 190

term loadings is used[15].
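The structural parts of Equations 13 to 15 (without residuals) can be written compactly with `numpy.einsum`. The sketch below uses random loading matrices of arbitrary size and also illustrates the statement that PARAFAC is a constrained TUCKER3 model: with an equal number of components in all modes and a superdiagonal core array of ones, the TUCKER3 expression reduces to the PARAFAC expression. All array sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, F = 4, 5, 3, 2
A = rng.random((I, F))   # loadings (scores in PCA), first mode
B = rng.random((J, F))   # loadings, second mode
C = rng.random((K, F))   # loadings, third mode

# Equation 13 without residuals: x_ij = sum_f a_if b_jf
pca_part = A @ B.T

# Equation 15 without residuals: x_ijk = sum_f a_if b_jf c_kf
parafac_part = np.einsum('if,jf,kf->ijk', A, B, C)

# Equation 14 without residuals, using a superdiagonal core array G
G = np.zeros((F, F, F))
G[np.arange(F), np.arange(F), np.arange(F)] = 1.0
tucker_part = np.einsum('ip,jq,kr,pqr->ijk', A, B, C, G)

print(np.allclose(parafac_part, tucker_part))   # True
```

Fitting the loadings to data (for example by alternating least squares, as done in the N-way toolbox) is a separate step that is not shown here.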

3. Application of the new non-alignment approach to an artificial GC-MS data set

The artificial GC-MS data set was analysed using the new approach to show its validity. To prove theoretical considerations the new approach was first tested on the artificial GC-MS data set without noise and without any shifting peaks. Subsequently, the new approach was tested on the artificial GC-MS data set with noise and non-linear peak shifts to show that the new algorithm can accommodate peak shifts.

3.1. Artificial data set without retention shifts and noise

In the artificial GC-MS data set each of the three differences among samples (see Table 1) is caused by varying peaks in different segments. After segmentation and mathematical transformation the resulting three-way array contains information on the covariation among samples in terms of differences in their mass traces in each segment. The decomposition of this array using PARAFAC is therefore expected to give one component to explain each of the three differences among the four groups of samples. Noise was excluded from the artificial data set, as it is a source of random variation. Prior to multi-way analysis the three-way array Z was mean centered across the first and second mode to reduce offsets in these modes and scaled to unit variance within the third mode to give each segment the same weight. Preprocessing was done using the nprocess.m function of the N-way toolbox [42].
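One plausible reading of this preprocessing step is sketched below in NumPy. It is not a transcription of nprocess.m, whose exact conventions may differ; the function name `centre_and_scale` and the choice of the root mean square as the scaling factor are assumptions. Centering across a mode subtracts the mean taken over that mode's index, and scaling within the third mode divides each segment slab by a single number.

```python
import numpy as np

def centre_and_scale(Z):
    """Centre an (I, I, K) array across modes one and two, scale within mode three."""
    Zc = Z - Z.mean(axis=0, keepdims=True)      # centre across the first mode
    Zc = Zc - Zc.mean(axis=1, keepdims=True)    # centre across the second mode
    # One scale factor per segment slab: the root mean square of the slab.
    rms = np.sqrt((Zc ** 2).mean(axis=(0, 1), keepdims=True))
    return Zc / rms

# Small random example: three segment slabs, each an SSCP matrix of 6 samples.
rng = np.random.default_rng(0)
Ys = rng.random((3, 6, 10))
Z = np.stack([Y @ Y.T for Y in Ys], axis=2)     # shape (6, 6, 3)
Zp = centre_and_scale(Z)

print(np.allclose(Zp.mean(axis=0), 0.0))                    # True
print(np.allclose(Zp.mean(axis=1), 0.0))                    # True
print(np.allclose(np.sqrt((Zp ** 2).mean(axis=(0, 1))), 1.0))  # True
```

Because every slab has zero mean after centering, its root mean square equals its standard deviation, so each segment ends up with unit variance. Scaling a whole slab by one constant does not disturb the centering across the first two modes.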

In fact, a three component PARAFAC model explains the segmented and transformed three-way array perfectly. The proper number of components was determined by evaluating residuals, core consistency, convergence speed, and by assessing the interpretability of the solution. As no noise was introduced to the artificial GC-MS data set, 100 % of the variation is explained, evenly distributed over the three components. The loadings of the first (sample) and third (segment) mode are shown in Figure 1. Note that, due to the calculation of SSCP matrices included in the mathematical transformation, modes one and two are identical. Component one explains the differences between samples one to five and the other samples, which is caused by peak four in segment two as indicated by the loadings of mode three of this component. PARAFAC component two reflects the differences of the samples 14 and 15, which are the only samples that contain peak number two in segment one. Finally, the differences between the samples one to ten and eleven to 20 are shown by component three. Here segment five, which contains peak nine, is responsible for this separation.

[Figure 1, panels: (a) Mode 1: comp. 1 vs. comp. 2; (b) Mode 1: comp. 1 vs. comp. 3; (c) Mode 3: comp. 1 to comp. 3 (loading vs. segment). Components 1, 2 and 3 each explain 33.3 % of the variance.]

Figure 1: Loadings of the modes one and three of the PARAFAC model on the three-way array of the segmented and mathematically transformed artificial GC-MS dataset without noise and without shifted peaks. Note that mode one and two are identical. Samples are coloured according to Table 1.

3.2. Artificial data set with retention shift and noise

To prove the applicability of the new algorithm to shifted chromatograms, the artificial GC-MS data set with introduced peak shifts was analysed (Figure 15 in Supporting Information). After segmentation and mathematical transformation a four component PARAFAC model explaining 83.8 % of the total variation in the data was obtained. The proper number of components was determined by evaluating residuals, core consistency, iterations until convergence, and by assessing the interpretability of the solution. Component one, explaining 68.6 % of the total variation in the data, separates samples one to ten from samples eleven to 20 (Figure 2(a)). Segment five, which contains peak number 9, shows high loadings on this component (Figure 2(d)). Samples one to five differ from the other samples on component two, which explains 9.5 % of variation. The loadings of the segment mode (mode three) reveal that segment two containing peak four is responsible for this difference. Samples 14 and 15, the only samples containing peak number 2, are differentiated from the other samples on component three, explaining 5.5 % of variation (Figure 2(b)). Here segment one shows high loadings on this component. Furthermore, component four, explaining 3.5 % of variation, reflects unsystematic variation in the data (Figure 2(c)). This variation is related to noise, as PARAFAC on the transformed shifted artificial GC-MS data set without noise resulted in a three component model (model not shown). These results show that the developed approach extracts the same structural information on the differences among samples from both the non-shifted and the shifted artificial GC-MS data.

The three-way data array which is obtained after the segmentation and mathematical transformation can also be seen as a ‘stack’ of matrices. It therefore seems reasonable to evaluate multi-block methods for the analysis of this data type besides multi-way methods. Different multi-block methods were applied to the three-way array, in a manner such that each slab of the array corresponds to a block. The following methods were tested: PCA on concatenated matrices, Multiple Factorial Analysis (MFA) [43], Common Component and Specific Weights Analysis (CCSWA) [44], analysis of co-inertia with common components [45] and STATIS [34], using the SAISIR toolbox for MATLAB [46], freely available at www.chimiometrie.fr (July 2014). Of the tested models only CCSWA gave interpretable results, which are shown in Figure 3. A CCSWA model with 4 components revealed structural information in the data comparable to the results from PARAFAC (Figure 2). Common component one (90.8 % explained variance) separates the samples one to ten and eleven to 20, while segment five has the strongest influence on this component. Common component two (4.1 % explained variance) explains differences between the samples 14 and 15 and the other samples (Figure 3(a)). Segment two shows the highest weight on this component. The differences between the samples one to five and the other samples are explained by common component three (Figure 3(b)), on which segment two has a high salience value. Component four (Figure 3(c)) shows the same random variation reflecting noise in the data as component four of the PARAFAC model.

[Figure 2, panels: (a) Mode 1: comp. 1 vs. comp. 2; (b) Mode 1: comp. 1 vs. comp. 3; (c) Mode 1: comp. 1 vs. comp. 4; (d) Mode 3: comp. 1 to comp. 4 (loading vs. segment). Explained variance: component 1, 68.6 %; component 2, 9.5 %; component 3, 5.5 %; component 4, 3.5 %.]

Figure 2: Loadings of the modes one and three of the PARAFAC model on the three-way array of the segmented and mathematically transformed artificial GC-MS dataset with shifted peaks. Note that mode one and two are identical. Samples are coloured according to Table 1.

[Figure 3, panels: (a) Scores: q1 vs. q2; (b) Scores: q1 vs. q3; (c) Scores: q1 vs. q4; (d) Saliences: q1 to q4 (saliences vs. segment). Explained variance: common component 1, 90.8 %; common component 2, 4.1 %; common component 3, 1.4 %; common component 4, 0.9 %.]

Figure 3: Scores and saliences (weights of blocks/segments) of CCSWA on the three-way array of the segmented and mathematically transformed artificial GC-MS dataset with shifted peaks. Only common components one to four are shown. Samples are coloured according to Table 1.

4. Comparison of the new non-alignment approach and a reference method on experimental GC-MS data

Modern analytical instrumentation allows an enormous amount of data to be acquired in a short period of time. This is especially the case for chromatographic instrumentation coupled to multi-channel detectors. The extraction and full exploration of this abundance of information is still an important bottleneck in workflows of non-targeted strategies. The work presented here was instigated by the need for new data processing approaches which take into account the most important limiting factors regarding the processing and multivariate modelling of chromatographic data, namely feature extraction, peak shifts and peak shape changes. The approach developed in this study is compared to PARAFAC2 deconvolution of all chromatogram segments with subsequent PCA of deconvoluted peak area values, which is a very powerful deconvolution methodology previously described by Amigo et al. [37]. A brief summary of both approaches is provided in the supporting information.

4.1. Experimental

The data set explored in this study consists of solid phase microextraction (SPME) GC-MS analysis of Cabernet Sauvignon wines, which were fermented with different combinations of yeast and lactic acid bacteria using sequential inoculation and co-inoculation strategies.

4.1.1. Wine Samples

All wines were produced from the same Cabernet Sauvignon grapes from California of 2012 vintage. Fermentations were carried out using six combinations of yeast and lactic acid bacteria, which were selected according to their organoleptic properties indicated by the manufacturer. Three wines were made with the yeast Lalvin Clos and the lactic acid bacteria Enoferm Alpha, Enoferm Beta and Lalvin PN4; two wines were made with the yeast Uvaferm RBS and the lactic acid bacteria Lalvin VP41 and O-Mega; and one wine was made with the yeast Uvaferm VRB and the lactic acid bacteria Enoferm Alpha (all from Lallemand Inc., Canada). Moreover, for all of these six yeast/bacteria combinations, two different inoculation strategies were used: inoculation of lactic acid bacteria 24 hours after yeast inoculation (co-inoculation), and inoculation of lactic acid bacteria after the completion of alcoholic fermentation (sequential inoculation). In total, the volatile composition of 12 experimental wines was studied here.

4.1.2. SPME-GC-MS Analysis

Headspace solid phase microextraction (HS-SPME) sampling was carried out in randomized order using a 100 µm polydimethylsiloxane (PDMS) fibre and the following procedure: 5 mL of the wine sample was transferred to a 20 mL headspace crimp-top vial, two grams of sodium chloride (preheated to 250 °C and cooled to room temperature) was added and the vial was capped immediately using a PTFE-lined septum and aluminium cap. Each wine sample was submitted to HS-SPME sampling with agitation at 500 rpm for 30 min. Fibre blank and column blank analyses were carried out regularly to confirm that no sample carry-over occurred. A standard 12 % hydro-alcoholic solution containing some esters and alcohols commonly present in wine was regularly analysed to monitor the performance of the system.

For GC-MS analysis an Agilent 6890 GC (Agilent Technologies) coupled to a quadrupole mass spectrometer Agilent 5973 N (Agilent Technologies, Palo Alto, CA) was used applying electron impact ionisation (EI) at 70 eV. Full mass spectra were acquired in the range 35 u to 300 u at four spectra per second. The ion source temperature was set to 230 °C, and the detector voltage was 2105 V. Separation was carried out on a 30 m HP-5 MS column with an internal diameter (i.d.) of 0.25 mm and a film thickness of 0.25 µm. The following oven temperature program was used: 40 °C, kept for 5 min; ramped at 15 °C min−1 to 250 °C; and held for 5 min, resulting in a total run time of 25 min. Thermal desorption and injection were performed using a split/splitless injector, operated at 250 °C in the splitless mode, with a splitless time of 3 min. Helium was used as carrier gas at a constant flow of 1.0 mL min−1. Linear retention indices were calculated using a series of n-alkanes. Experimental retention indices were compared to literature values to confirm tentative peak identification based on mass spectra. All chromatographic analyses were performed in triplicate.

4.1.3. Data Treatment
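The linear retention indices mentioned in Section 4.1.2 can be illustrated with a short calculation. The text states only that a series of n-alkanes was used; the sketch below assumes the widely used van den Dool and Kratz interpolation for temperature-programmed runs, which the source does not confirm. The function name and the alkane retention times are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def linear_retention_index(t_x, alkanes):
    """Linear retention index of a peak eluting at t_x.

    alkanes: list of (carbon_number, retention_time) tuples sorted by
    retention time (hypothetical calibration data in this sketch).
    """
    times = [t for _, t in alkanes]
    if not times[0] <= t_x <= times[-1]:
        raise ValueError("peak elutes outside the n-alkane calibration range")
    # Index of the alkane eluting at or just before the peak,
    # clamped so that a later bracketing alkane always exists.
    i = min(bisect_right(times, t_x) - 1, len(alkanes) - 2)
    n, t_n = alkanes[i]
    n_next, t_next = alkanes[i + 1]
    # Linear interpolation between the two bracketing alkanes.
    return 100.0 * (n + (n_next - n) * (t_x - t_n) / (t_next - t_n))


# Hypothetical retention times (min) for C8 to C10; not taken from the study.
alkanes = [(8, 6.0), (9, 8.0), (10, 10.0)]
lri = linear_retention_index(7.0, alkanes)  # peak halfway between C8 and C9
```

An index obtained this way would then be compared with literature values for the same stationary phase, as described for the tentative identifications.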

All raw chromatograms were exported from Agilent Chemstation version D.03.00.611 (Agilent Technologies) as netCDF files and imported into MATLAB version 8.0 (R2012b) (The MathWorks Inc., Natick, MA, USA) using built-in functions. All further data processing was done in MATLAB utilizing the freely available N-way toolbox [42] and in-house written functions. Preprocessing of multi-way arrays was done using the nprocess.m function of the N-way toolbox [42]. Uninformative parts at the beginning and at the end of each chromatogram were removed. Each of the 36 GC-MS raw chromatograms was arranged as a matrix of size 3977 × 266 (elution profile × spectral profile). Deconvoluted mass spectra were exported as ASCII text files in NIST .msp format using an in-house written MATLAB function and imported into the NIST 08 spectral library [47].

4.2. Application of the new non-alignment approach to the experimental GC-MS data

The developed fingerprinting approach was applied to GC-MS data obtained for a set of twelve Cabernet Sauvignon wines fermented with different yeast/bacteria combinations using co-inoculation and sequential inoculation to study the impact of these factors on the volatile composition of the wines. SPME was chosen for sample preparation because of its simplicity for wine analysis in terms of full automation, speed and sensitivity [8, 48, 49]. A PDMS fibre was chosen, as all PDMS degradation products contain silicon, which facilitates the differentiation of analytes from artefacts by means of siloxane fragments present in the mass spectra of the latter. This is particularly important when performing non-targeted analysis. A fast temperature ramp was used in this study to provide relatively fast GC separation. Under these conditions some resolution is sacrificed. However, the data analysis approach reported here takes the entire mass dimension into account, and therefore complete separation of peaks is not needed provided that co-eluting compounds differ in terms of their mass spectra. During the analysis of all samples, the system stability was monitored using a hydro-alcoholic standard solution containing common wine volatiles.

4.2.1. PARAFAC on transformed raw chromatograms

Initially, all chromatograms were divided into 84 small segments based on visual examination of overlays of total ion chromatograms (TICs) of all samples and of overlays of single ion chromatograms of all mass channels of single samples. Special attention was paid to avoid the inclusion of too many peaks in one segment and the splitting of peaks into different segments. The latter is particularly important for segments containing peaks which shift between different samples. In this way, as few peaks as possible were included in each segment (one to five), and the dimensions of the segments ranged between 22 and 114 scans. Segments 15, 58 to 62, 72, 76, 77, 80, 81 and 83 were excluded from the data set as they contained either only baseline or artefacts in the chromatograms. Seventy-one small segments in total were kept for further analysis. To evaluate the effect of the number of segments, every two and every four neighbouring segments were combined, which resulted in 36 and 18 larger segments, respectively.

The outcome of the mathematical transformation (see section 2.2) of the segmented chromatographic raw data is a three-way array of size 36 × 36 × 71 (samples × samples × number of segments), 36 × 36 × 36 and 36 × 36 × 18, respectively. The array obtained from the smallest segments (total of 71 segments) was analysed using CCSWA, TUCKER3 and PARAFAC. While the TUCKER3 results were promising, although difficult to interpret due to the nature of the TUCKER3 model, CCSWA, contrary to expectation, did not show any interpretable results (not shown). The results of the PARAFAC model were, however, much more informative and easier to interpret, revealing information on systematic differences among samples. The two other three-way arrays with 36 and 18 segments were therefore only analysed using PARAFAC. The number of components of the PARAFAC models was determined using the core consistency diagnostic [50], by examination of residuals, and by evaluating captured variance and the number of iterations until the PARAFAC algorithm converged for models with one to 20 components. For the three-way array with 71 segments an eleven-component PARAFAC model was chosen, explaining 73.0 % of the total variation in the data set. The best PARAFAC models for the three-way arrays with 36 and 18 segments were a ten-component PARAFAC model explaining 83.0 % of the total variation and a nine-component PARAFAC model explaining 92.2 % of the total variation, respectively.
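The segment counts and array sizes quoted above follow from simple bookkeeping, sketched here under stated assumptions. The helper names are hypothetical, the segment labels are placeholders rather than the real segment numbers, and the text does not say how a leftover segment at the end of the list was handled; this sketch simply lets the last group be smaller.

```python
def combine_neighbours(segments, group_size):
    """Group consecutive segment labels into chunks of at most group_size."""
    return [segments[i:i + group_size]
            for i in range(0, len(segments), group_size)]


def array_shape(n_samples, n_segments):
    """Shape of the samples x samples x segments three-way array."""
    return (n_samples, n_samples, n_segments)


kept = list(range(1, 72))   # 71 placeholder labels for the retained segments
pairs = combine_neighbours(kept, 2)   # every two neighbouring segments
quads = combine_neighbours(kept, 4)   # every four neighbouring segments

n_samples = 12 * 3          # 12 wines, each analysed in triplicate

shapes = [array_shape(n_samples, len(groups)) for groups in (kept, pairs, quads)]
```

With these assumptions `shapes` reproduces the three array sizes given in the text, and the last group in each merged set contains fewer segments than the others.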

In general, PARAFAC loadings can be interpreted in the same way as PCA scores and loadings. In multi-way terminology, however, only the word ‘loading’ is used. For each mode of the analysed multi-way array a loading matrix is obtained. In the approach presented here, the first and second modes of the obtained PARAFAC model are identical, as the SSCP matrices from Equation 11, which were compiled into a three-way array in Equation 12, are symmetric.
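The symmetry argument can be checked numerically on a toy example. Equation 11 is not reproduced in this section, so the sketch below assumes only that each frontal slice is a sample-by-sample cross-product of the form X·Xᵀ; the matrix `X_k` and both function names are hypothetical.

```python
def cross_product(X):
    """Return X * X^T for a list-of-lists matrix X (rows = samples)."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in X]
            for row_i in X]


def is_symmetric(S, tol=1e-12):
    """True if S equals its transpose within the given tolerance."""
    n = len(S)
    return all(abs(S[i][j] - S[j][i]) <= tol
               for i in range(n) for j in range(n))


# Hypothetical data for one segment: 3 samples x 3 variables.
X_k = [[1.0, 2.0, 0.0],
       [0.5, 0.0, 3.0],
       [2.0, 1.0, 1.0]]

S_k = cross_product(X_k)        # 3 x 3 samples-by-samples slice
symmetric = is_symmetric(S_k)
```

Because every slice of the array is symmetric, swapping the first two modes leaves the array unchanged, which is consistent with the first and second loading matrices being identical.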

Congruence loadings were calculated for the third mode (segment mode), and each segment with a congruence loading value higher than 0.5 was considered as ‘high to medium correlated’ with the raw data. Depending on the aim of the study, this value can also be chosen higher (e.g. 0.75) if only highly correlated segments are of interest. A rather conservative value of 0.5 has been chosen here to ensure that the data set will be fully explored.
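The effect of the threshold choice can be sketched as a simple filter. The congruence values below are invented for illustration and are not results from the study; `select_segments` is a hypothetical helper name.

```python
def select_segments(congruence, threshold=0.5):
    """Return the segment labels whose congruence loading exceeds threshold."""
    return sorted(seg for seg, value in congruence.items() if value > threshold)


# Hypothetical congruence loadings of five segments on one component.
congruence_example = {9: 0.97, 20: 0.63, 21: 0.31, 32: 0.28, 73: 0.05}

medium_to_high = select_segments(congruence_example)         # threshold 0.5
only_high = select_segments(congruence_example, 0.75)        # stricter choice
```

Raising the threshold from 0.5 to 0.75 shrinks the selection, which is the trade-off described above between fully exploring the data set and focusing on highly correlated segments.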

The information content of the three PARAFAC models is discussed and compared in the following. Examination of the loadings of the sample modes (first and second modes) of the PARAFAC model of the 71 segments showed that five of the eleven components contained important information revealing systematic differences between wines made with different yeast starter cultures and inoculation scenarios (Figures 4, 5 and 6). The remaining six components mainly reflect unsystematic variations in the chromatograms, for instance component five shown in Figure 7. From the congruence loadings of the segment mode of this component in Figure 7(b) it is evident that only one segment, segment 73, is responsible for the discrepancy of samples on this component (Figure 7(a)). The overlay of the TICs of segment 73 of all samples in Figure 7(c) shows that component 5 reproduces the information in segment 73 very well. One injection each of the wine made with the yeast/bacteria combination Lalvin Clos/Lalvin PN4 sequentially inoculated (clos PN4) and of the wine made with the yeast/bacteria combination Uvaferm RBS/O-Mega sequentially inoculated (rbs 271) shows a much higher peak than all other samples in this segment. This pattern is exactly reproduced in the loadings of the sample mode of component 5. All other components containing redundant information are not further discussed here.

PARAFAC components three and eleven are displayed in Figure 4(a), showing the variation between wines fermented with different yeasts. Wines fermented with the yeast Uvaferm RBS (rbs) are separated from the wines fermented with the yeasts Lalvin Clos (clos) and Uvaferm VRB (vrb) on component three (7.8 % explained variation), whereas the wines fermented with the yeast Uvaferm VRB differ from the other wines on component eleven (2.3 % explained variation). The impact of each segment on components three and eleven is shown in the congruence loadings plot of the segment mode of these components in Figure 4(b). For component eleven, only segments 9 and 20 are responsible for the differences of the wines made with the yeast Uvaferm VRB compared to the wines made with the other two yeast starter cultures, considering a congruence loading value higher than 0.5. The differences between the wines fermented with the yeast starter culture Uvaferm RBS and all other wines described by component three are caused by segments 1, 4, 8, 11, 14, 22, 23, 24, 30, 31 and 38.

Figure 5 shows the PARAFAC results for components one and two. Component one (17.6 % explained variation) mainly explains the differences in the wine fermented with the yeast Uvaferm RBS and the lactic acid bacteria O-Mega sequentially inoculated (rbs 271), but this component also shows a difference between co-inoculated and sequentially inoculated wines. Component two (11.3 % explained variation) mainly describes the distinction of the wine fermented with the yeast/bacteria combination Lalvin Clos/Enoferm Beta sequentially inoculated (clos beta) compared to all other wines. Congruence loadings of the segment mode for components one and two are shown in Figure 5(b). Segments 4, 6, 11, 18, 28, 31, 33, 35, 36, 38, 41, 45, 46, 48, 49, 50, 53, 67, 74 and 75 had congruence loadings higher than 0.5 on component one, while on component two segments 28, 64, 65, 68, 69, 71 and 78 are important.

Component 4, explaining 6.9 % of the total variation in the data set, differentiates the wine fermented with the yeast Lalvin Clos and the lactic acid bacteria Lalvin PN4 co-inoculated (clos PN4) from the other wines (Figure 6(a)). Responsible for these differences are segments 41, 43, 51 and 63, as shown in the congruence loading plot of the segment mode of this component (Figure 6(b)).

[Figure 4 panels: (a) First mode (samples) loadings; (b) Third mode (segments) congruence loadings. Axes: Component 3: 7.8% expl. var.; Component 11: 2.3% expl. var. Legend: clos, vrb, rbs, co-inoculated, sequential.]

Figure 4: Loadings plot of PARAFAC components three vs. eleven (model with 71 segments); Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

The results of the PARAFAC model with only 36 segments (neighbouring segments were combined) are very similar to the results of the PARAFAC model with 71 segments and are discussed in the following. Component one of both PARAFAC models (Figure 5 and Figure 18 in Supporting Information) reflects the same information, namely the differences of the wine fermented with the yeast Uvaferm RBS and the lactic acid bacteria O-Mega sequentially inoculated (rbs 271), and the difference between co-inoculated and sequentially inoculated wines. Moreover, components three and two (Figures 18 and 17 in Supporting Information) of the PARAFAC model with 36 segments and components two and four (Figures 5 and 6) of the PARAFAC model with 71 segments show the same information on the differences of the wines made with the yeast/lactic acid bacteria combinations Lalvin Clos/Enoferm Beta (clos beta) sequentially inoculated and Lalvin Clos/Lalvin PN4 (clos PN4) co-inoculated, respectively. Components three and eleven (Figure 4) of the PARAFAC model with the smallest segments (71 segments) reveal the same information as components five and ten (Figure 16 in Supporting Information) of the PARAFAC model with 36 segments: systematic differences according to the different yeast starter cultures used.

[Figure 5 panels: (a) First mode (samples) loadings; (b) Third mode (segments) congruence loadings. Axes: Component 1: 17.6% expl. var.; Component 2: 11.3% expl. var. Legend: clos, vrb, rbs, co-inoculated, sequential.]

Figure 5: Loadings plots of PARAFAC components one vs. two (model with 71 segments); Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

[Figure 6 panels: (a) First mode (samples) loadings; (b) Third mode (segments) congruence loadings. Axes: Component 1: 17.6% expl. var.; Component 4: 6.9% expl. var. Legend: clos, vrb, rbs, co-inoculated, sequential.]

Figure 6: Loadings plots of PARAFAC components one vs. four (model with 71 segments); Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

[Figure 7 panels: (a) First mode (samples) loadings; (b) Third mode (segments) congruence loadings; (c) TICs of segment 73. Axes: Component 1: 17.6% expl. var.; Component 5: 5.9% expl. var.; panel (c): Retention time [min] vs. Abundance. Legend: clos, vrb, rbs, co-inoculated, sequential.]

Figure 7: Loadings plots of PARAFAC components one vs. five (model with 71 segments); Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

The results of the PARAFAC model where four neighbouring segments were combined (total of 18 segments) are, in contrast to the results of the PARAFAC model with 36 segments, not fully comparable to the results of the PARAFAC model with the smallest segments (71 segments). Only three components are comparable between these models. Component one (Figure 19 in Supporting Information) of the 18-segment PARAFAC model, reflecting the differences between the co-inoculated wine fermented with the yeast Lalvin Clos and the lactic acid bacteria Lalvin PN4 (clos PN4) and the other wines, shows the same information as component 4 of the PARAFAC model with 71 segments. Component two (Figure 19 in Supporting Information) of the PARAFAC model with the biggest segments (18 segments) is comparable with component one of the 71-segment PARAFAC model, mainly explaining the wine made with the yeast Uvaferm RBS and the lactic acid bacteria O-Mega (sequentially inoculated) and a tendency between co-inoculated and sequentially inoculated wines. Furthermore, component three of the PARAFAC model with 18 segments shows differences of the wine made with sequential inoculation of the yeast Lalvin Clos and the lactic acid bacteria Enoferm Beta (clos beta) and is comparable with the information obtained from component two of the PARAFAC model with 71 segments. Information on the systematic differences caused by the yeast strains, as obtained on components eleven and three (Figure 4) of the PARAFAC model with the smallest segments (71 segments) and on components ten and five (Figure 16 in Supporting Information) of the PARAFAC model with 36 segments, could not be observed.

In conclusion, the comparison of the results of the three PARAFAC models with different segment sizes shows that the size of the segments clearly has an influence on the information obtained from the PARAFAC model. While the models with small and medium segment size (71 and 36 segments, respectively) revealed the same information on systematic differences in the data, important information on systematic differences among the wines caused by the different yeast starter cultures could not be obtained from the PARAFAC model with the biggest segments (18 segments). These results demonstrate that a smaller segment size is beneficial. Another positive aspect of smaller segments is that they are easier to investigate after PARAFAC modelling. In this manner, peaks in segments which have been determined to be important for the differentiation of samples can be more easily deconvoluted and identified.
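The segment lists reported above for the five informative components of the 71-segment model overlap in places. A minimal sketch of how they can be pooled into one set of distinct segments is given below; the segment numbers are copied from the text, while the variable names are illustrative only.

```python
# Segments with congruence loadings > 0.5, keyed by PARAFAC component,
# as listed in the text for the 71-segment model.
important = {
    1: [4, 6, 11, 18, 28, 31, 33, 35, 36, 38, 41, 45, 46, 48, 49, 50,
        53, 67, 74, 75],
    2: [28, 64, 65, 68, 69, 71, 78],
    3: [1, 4, 8, 11, 14, 22, 23, 24, 30, 31, 38],
    4: [41, 43, 51, 63],
    11: [9, 20],
}

# Distinct segments across all five components.
distinct = sorted(set().union(*important.values()))

# Segments that are important on more than one component.
shared = {}
for seg in distinct:
    comps = [c for c, segs in important.items() if seg in segs]
    if len(comps) > 1:
        shared[seg] = comps

n_distinct = len(distinct)
```

Pooling the lists gives 38 distinct segments, since six segments (4, 11, 28, 31, 38 and 41) are listed for two components each.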

4.2.2. Deconvolution and identification of compounds in important chromatogram segments

From the discussion above it can be summarized that components one, two, three, four and eleven from the PARAFAC model with 71 segments are important to explain information on systematic differences between the wines. The segments with congruence loadings higher than 0.5, which can be considered as ‘medium to high correlated’ with the data, are segments 4, 6, 11, 18, 28, 31, 33, 35, 36, 38, 41, 45, 46, 48, 49, 50, 53, 67, 74 and 75 for component one; 28, 64, 65, 68, 69, 71 and 78 for component two; 1, 4, 8, 11, 14, 22, 23, 24, 30, 31 and 38 for component three; 41, 43, 51 and 63 for component four; and 9 and 20 for component eleven. To confirm the results from PARAFAC modelling of the segmented and transformed GC-MS chromatograms and to study the important chromatogram segments in more detail, all of these 38 segments were deconvoluted using PARAFAC2 on each of the segments. The number of factors for each of the PARAFAC2 models was first evaluated as described in [39] using the autochrom.m MATLAB function, which is kindly and freely provided on www.models.life.ku.dk (July 2014). The number of components of each model was then manually verified using the freely available N-way toolbox [42] for MATLAB. The number of factors was checked, and if needed corrected, by examining core consistency, number of iterations until the algorithm converged, residuals, and the interpretability of the loadings. Moreover, non-negativity constraints were applied in the spectra mode. After exporting all deconvoluted mass spectra using an in-house written MATLAB function, tentative identification of the deconvoluted peaks was performed based on comparison of deconvoluted mass spectra with the NIST 08 spectral library. Furthermore, linear retention indices (LRI) were calculated using a homologous series of n-alkanes and compared with literature values to confirm tentative identifications. Details on the PARAFAC2 models and the identified compounds are summarized in Table 2.

4.2.3. PCA on deconvoluted peak areas

To visualize the results summarized and discussed above, three different PCA models were constructed. All compounds in the segments which had high congruence loadings on PARAFAC components three and eleven, which distinguished all samples according to the yeast starter culture used, were included in the first PCA. A two-component PCA model was sufficient to separate the wines into three groups. The model was then improved by successively removing all compounds with low loadings on PC1 and PC2 (small impact on these two components). The wines fermented with the yeast starter culture Uvaferm RBS were separated from the other wines by PC1, which explains 67.4 % of the total variance (Figure 8(a)). The loadings in Figure 8(b) reveal that ethyl 2-methylbutyrate (1), iso-amyl iso-butyrate (8), ethyl-2-hexenoate (15), the unknowns 46 and 49 (both terpenoid-like mass spectra) and the two unknowns 48 and 65 are positively correlated with the wines made with the

Table 2: Summary of all segments showing high congruence loadings (> 0.5) on PARAFAC components one, two, three, four and eleven and details of PARAFAC2 model of each segment with corresponding compounds.

Columns: segment | congruence loadings of PARAFAC component (1, 2, 3, 4, 11) | PARAFAC2 component no. | no. | compound name | LRIa | MS match

857

900

861

852

0.85

1

1

butanoic acid, 2-methyl-, ethyl ester (ethyl

2-methylbutyrate)

2

2

butanoic acid, 3-methyl-, ethyl ester (ethyl

0.51

0.69

3

-

baseline

1

7

acetic acid, hexyl ester (hexyl acetate)

1005

931

2

8

propanoic acid, 3-methyl-, ethyl ester

1003

812


4


3-methylbutanoate)

6

0.66


(iso-amyl iso-butyrate)

3

9

unknown

999

1

12

unknown

1022

2

-

baseline

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

3 8

0.97

1

compound name


component no.


of PARAFAC component

LRIa

MS match

13

eucalyptol (1,8-cineole)

1025

877

15

2-hexenoic acid, ethyl ester

1048

860

(ethyl-2-hexenoate)

2

31 11

0.62

0.67

0.59

-

baseline

-

baseline

1

16

2

-

artefact (bleeding)

3

-

baseline


9


3

4

-

unknown

1051

1

19

propanoic acid 2-hydroxy-, 3-methylbutyl

1068

871

1070

880


unknown

1048

ester (isoamyl lactate) 2

20

1-octanol

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

3 4 14

0.63

1 2

20

21

unknown

1069

22

acetophenone

1066

29

unknown

1106

-

MS match

920

baseline

unknown

1112

4

31

unknown

1111

1

36

octanoic acid ethyl ester (ethyl octanoate)

1200

931

2

-

1

39

6-octen-1-ol, 3,7-dimethyl- (citronellol)

1231

888

2

40

unknown

1233

3

-


30

0.51

0.98


18

LRIa


3

compound name


component no.


of PARAFAC component


baseline

baseline

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

22

0.63

1

compound name


component no.


of PARAFAC component

42

hexanoic acid, 3-methylbutyl ester

LRIa

MS match

1252

930

1255

868

1250

852

(isopentyl hexanoate)

2

43

hexanoic acid, 2-methylbutyl ester (2-methylbutyl hexanoate)

33 23

0.54

44

benzeneacetic acid, ethyl ester (ethyl benzeneacetate)

4

45

unknown

1248

5

46

unknown (terpenoid-like MS)

1246


3

47

acetic acid, 2-phenylethyl ester

1262


1

(phenylethyl acetate)

2

-

baseline

3

-

artefact (bleeding)

961

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

4 5 24

0.89

1 2

0.57

0.51


30

0.63

-

unknown

49

unknown (terpenoid-like MS)

-

50 -

1

59

LRIa

MS match

artefact (bleeding)

48

4


28


3

compound name


component no.


of PARAFAC component

34


1263

artefact (bleeding) nonanoic acid

1270

843

1297

892

baseline nonanoic acid, ethyl ester (ethyl nonanoate)

2

60

unknown

1295

3

61

propyl octanoate

1294

1

65

unknown (succinic acid ester)

1331

2

66

unknown

1333

841

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

3 4 31

0.67

0.85

1

compound name


component no.


of PARAFAC component

67 -

68

unknown

LRIa

MS match

1330

baseline

octanoic acid, 2-methylpropyl ester

1350

890

(isobutyl octanoate)

35 35

0.79

0.93

69

unknown

1352

3

-

1

72

2

-


33


2

3

73

unknown

1368

4

74

naphthalene, 1,2-dihydro-1,1,6-trimethyl-

1363

870

1389

873


baseline decanoic acid

1369

910

baseline

(TDN) 1

78

ethyl trans-4-decenoate

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

2 3 36

0.64

1 2

0.57

decanoic acid, ethyl ester (ethyl decanoate)

1397

80

unknown

1408

baseline

unknown

1406

4

82

unknown

1410

5

83

unknown

1404

6

84

unknown

1403

87

octanoic acid, 3-methylbutyl ester (isoamyl

1449

0.89

79

-

1

MS match

baseline

81


38

-

LRIa


3

compound name


component no.


of PARAFAC component

36


942

octanoate)

2

88

unknown

1450

3

89

octanoic acid, 2-methylbutyl ester

1451

921

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

41

0.52

0.57

1 2 3

compound name


component no.


of PARAFAC component

-

LRIa

MS match

baseline

96

unknown

1490

97

decanoic acid, propyl ester (propyl

1492

857

decanoate)

0.91

1493

5

99

unknown

1489

1

102

unknown

1515

2

103

butylated hydroxytoluene (BHT)

1520

3

-

4

104

unknown

1521

5

105

unknown

1523

1

109

unknown

1547


45

0.99

unknown


4

43

98


37


baseline

959

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

2 3 46

0.73

1

compound name


component no.


of PARAFAC component

110 -

111

unknown

LRIa

112

baseline

1,6,10-dodecatrien-3-ol, 3,7,11-trimethyl-

1570

49

0.56

0.82

1571

3

-

baseline

4

-

artefact (bleeding)

5

113

unknown

1574

1

115

unknown

1583

2

-

baseline

3

-

artefact (bleeding)

1

116


48

unknown


2

MS match

1550

(cis,trans-nerolidol)

38


unknown

1588

915

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

2 50

0.95

1

compound name


component no.


of PARAFAC component

-

117

LRIa

MS match

1595

971

baseline

dodecanoic acid, ethyl ester (ethyl

-

baseline

-

baseline

2

-

artefact (bleeding)

3

118

unknown

1610

4

119

unknown

1612


dodecanoate)

2

121

pentadecanoic acid, 3-methylbutyl ester

1647

39 53

0.5

0.94

1


51

1


(iso-amyl decanoate)

2

122

3

-

unknown (long chain fatty acid ester) baseline

1650

936

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

4 63

0.54

1 2

67

0.67

0.96

130 -

131

LRIa

MS match

artefact (bleeding)

unknown (long chain fatty acid ester)

1783

baseline

tetradecanoic acid, ethyl ester (ethyl

1794

925

tetradecanoate)

2

-

baseline

1

-

baseline

2

-

artefact (bleeding)


65

1

-


0.66

3

132

unknown

1820

4

133

unknown

1824

1

135

dodecanoic acid, 3-methylbutyl ester

1847


64

compound name


component no.


of PARAFAC component

40


(isoamyl laurate)

891

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

2 3 68

0.51

1 2

0.67

136

unknown

1841

137

unknown (long chain fatty acid ester)

1859

138

compname

1851

-

139

2

-

artefact (bleeding)

3

-

baseline

142

unknown (long chain fatty acid ester)

pentadecanoic acid, ethyl ester (ethyl

1866

1896

pentadecanoate)

2

143

3

-

MS match

baseline

1

1

LRIa

baseline


71

0.57


69

-


3

compound name


component no.


of PARAFAC component

41


unknown baseline

1890

874

Table 2 – continued congruence loadings

segment

1

2

3

4

PARAFAC2

11

no.

74

0.83

1 2

75

0.57

1

compound name


component no.


of PARAFAC component

146 -

147

ethyl 9-hexadecenoate

LRIa

MS match

1976

917

1995

911

baseline

hexadecanoic acid, ethyl ester (ethyl hexadecanoate)

42 a experimentally

0.79

determined linear retention indices

-

baseline

1

148

2

-

baseline

3

-

baseline


78


2


unknown (long chain fatty acid ester)

2067

[Figure 8. Panels: (a) Scores; (b) Loadings. Axes: PC 1 (67.4 % expl. var.), PC 2 (20.8 % expl. var.).]

Figure 8: Scores and loadings plots of the PCA of compounds in segments which had high congruence loadings on components three and eleven of the PARAFAC model with 71 segments; Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

yeast Uvaferm RBS. Moreover, the grouping of the wines fermented with yeast Uvaferm VRB is explained by PC2 (20.8 % explained variance). Citronellol and the unknown compound 31 are positively correlated with these wines on PC2.

All compounds in the segments which had high congruence loadings on PARAFAC component one were included in the second PCA. A one-component model was sufficient to explain the differences between the co-inoculated wines and the sequentially inoculated wines. After successively removing all compounds with low loadings on PC1 (small impact on this component), a final one-component model was obtained explaining 59.7 % of the variance. Figure 9(a) shows the scores of PC1, which discriminate all samples according to the inoculation scenario (co-inoculation vs. sequential inoculation). The branched esters isoamyl iso-butyrate (8), isoamyl lactate (19), isoamyl octanoate (87), isoamyl decanoate (121), isoamyl laurate (135) as well as isobutyl octanoate (68) and octanoic acid, 2-methylbutyl ester (89), the straight chain fatty acid esters ethyl octanoate (36), ethyl nonanoate (59), ethyl decanoate (79), ethyl dodecanoate (117), propyl octanoate (61), the two unsaturated esters ethyl trans-4-decenoate (78) and ethyl 9-hexadecenoate (146), the fatty acid decanoic acid (72), the terpenoid nerolidol (111), the unknown long chain fatty acid ester 122 and the unknowns 12, 60, 88, 109, 110, 115, 116 all correlate positively with the co-inoculated wines.

[Figure 9. Panels: (a) Scores; (b) Loadings. Axes: PC 1, compound number.]

Figure 9: Scores and loadings plots of the PCA of compounds in segments which had high congruence loadings on component one of the PARAFAC model with 71 segments; Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

The third PCA included all compounds from segments which had high congruence loadings on components two and four (Figure 10). All compounds with low loadings (small impact on the model) were successively removed from the model. The wine made with the yeast Lalvin Clos and the lactic acid bacteria Enoferm Beta (sequentially inoculated) is separated from all other wines on PC1, which explains 52.9 % of the variance. Ethyl tetradecanoate (131) and the two unknown long chain fatty acid esters 137 and 139 correlate positively with PC1, while ethyl nonanoate (59), propyl octanoate (61) and unknown compound 60 correlate negatively with this component. Principal component two (26.4 % explained variance) shows the differentiation of the wine which was co-inoculated with the yeast Lalvin Clos and the lactic acid bacteria Lalvin PN4 as well as the wines made with the yeast/bacteria combinations Uvaferm RBS/O-Mega (co-inoculated), Lalvin Clos/Enoferm Alpha (sequentially inoculated) and Lalvin Clos/Enoferm Beta (sequentially inoculated). This difference is explained by propyl decanoate (97), BHT (103) and the unknown compound (118). BHT (103) and the unknown compound (118) are very likely artefact compounds not associated with wine.

[Figure 10. Panels: (a) Scores; (b) Loadings. Axes: PC 1 (52.9 % expl. var.), PC 2 (26.4 % expl. var.).]

Figure 10: Scores and loadings plots of the PCA of compounds in segments which had high congruence loadings on components two and four of the PARAFAC model with 71 segments; Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).
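The PCA procedure used in this section (autoscaling, followed by successive removal of compounds with low loadings) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation: the function names, the loading threshold of 0.1 and the rule of removing one variable per iteration are assumptions made for the example, since the text does not specify them.

```python
import numpy as np

def autoscale(X):
    """Mean-centre each column and scale it to unit standard deviation.
    Assumes no column is constant."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(X, n_components=2):
    """PCA via SVD of an already preprocessed matrix.
    Returns scores, loadings and the fraction of variance explained."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    return scores, loadings, explained[:n_components]

def prune_low_loadings(X, names, min_abs_loading=0.1, min_vars=2):
    """Drop the variable with the smallest |PC1 loading|, refit, and repeat
    until every remaining loading reaches the (assumed) threshold."""
    names = list(names)
    keep = np.arange(X.shape[1])
    while len(keep) > min_vars:
        _, P, _ = pca(autoscale(X[:, keep]), n_components=1)
        weakest = np.argmin(np.abs(P[:, 0]))
        if abs(P[weakest, 0]) >= min_abs_loading:
            break
        keep = np.delete(keep, weakest)
    return keep, [names[i] for i in keep]
```

Here `X` would be a peak table with samples in rows and compounds in columns. Refitting after each removal matters because the loadings of the remaining variables change once a variable is dropped.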

Several studies on the impact of the inoculation mode of malolactic fermentation and the yeast/lactic acid bacteria combination on the volatile composition of wine have been conducted, but no clear systematic changes have been reported [51, 52, 53, 54]. Some authors have observed higher amounts of some esters in co-inoculated wines [53, 54]. Higher levels of long chain fatty acid esters as well as unsaturated and branched species as a function of malolactic fermentation inoculation mode, as discussed above, have, however, not yet been reported. This is most likely because long chain fatty acid esters are normally not the focus of targeted methods for general wine aroma analysis. Nevertheless, these compounds were included in the non-targeted approach used here, although their relevance was not known a priori.

4.3. PARAFAC2 on all segments of the chromatogram of the experimental GC-MS data with subsequent PCA

As a reference method, PARAFAC2 was also applied to all segments which had not been considered in the new approach discussed above, and the area values of all integrated deconvoluted peak profiles were analysed using PCA, according to reference [37]. A total of 152 peak area values were obtained in this manner. Figures 11 and 12 show the scores and loadings plots of PC1 (25.0 % explained variance), PC2 (12.7 % explained variance) and PC3 (11.8 % explained variance) of the autoscaled peak table. Note that only a relatively small proportion of variance is explained, even when compounds with low loadings were successively removed (not shown). Some structural information is nevertheless revealed by the scores plots (Figures 11(a) and 12(a)), although the interpretation remains difficult.

[Figure 11. Panels: (a) Scores; (b) Loadings. Axes: PC 1 (25 % expl. var.), PC 2 (12.7 % expl. var.).]

Figure 11: Scores and loadings plots of PC1 and PC2 of the PCA on all autoscaled compounds of all deconvoluted segments; Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

PC1 shows, like component one of the PARAFAC model with 71 segments (Figure 5), a difference between most of the co-inoculated and sequentially inoculated wines. The co-inoculated wines fermented with the yeast starter culture Uvaferm RBS correlate most positively, while the wine made with the yeast starter culture Lalvin Clos sequentially inoculated with Enoferm Beta correlates most negatively with this PC. The compounds 8, 12, 19, 36, 59, 60, 61, 68, 72, 78, 79, 87, 88, 98, 109, 101, 111, 115, 116, 117, 121, 122, 135 and 146 show high positive loadings on PC1 (Figure 11(b)). These results are comparable to component one of the PARAFAC model with 71 segments (Figure 9). The compounds 131, 137 and 139, in contrast, correlate negatively with PC1, showing a pattern similar to that reflected in PARAFAC component two of the 71-segment model (Figure 10). Principal component two shows differentiation of the wines fermented with the yeast starter culture Uvaferm RBS and the wine made with the yeast/lactic acid bacteria combination Lalvin Clos/Lalvin PN4 (co-inoculated). This separation is, however, not very clear, and no valuable information can be extracted from the loadings plot (Figure 11(b)). A similar observation applies to PC3, which also explains differences of the wine made with the yeast/lactic acid bacteria combination Lalvin Clos/Lalvin PN4 (co-inoculated) and of the wines fermented with the yeast/lactic acid bacteria combination Uvaferm VRB/Enoferm Alpha (co-inoculated).

[Figure 12. Panels: (a) Scores; (b) Loadings. Axes: PC 1 (25 % expl. var.), PC 3 (11.8 % expl. var.).]

Figure 12: Scores and loadings plots of PC1 and PC3 of the PCA on all autoscaled compounds of all deconvoluted segments; Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

PCA on the autoscaled peak table was not suitable for detecting the same patterns among the samples as were obtained with the new approach presented here. Therefore, class centroid centering and scaling to intra-class variance was used, where classes were defined according to the three yeast starter cultures, with the aim of obtaining information on the differences among the wines made with the three different yeast starter cultures. Figure 13 shows the scores and loadings of PC1 (38.8 % explained variance) and PC3 (10.7 % explained variance) of this PCA. The three sample groups show a pattern very similar to that obtained for the PCA on the autoscaled compounds of segments with high congruence loadings on components three and eleven of the PARAFAC model with 71 segments (Figure 8).

[Figure 13. Panels: (a) Scores; (b) Loadings. Axes: PC 1 (38.8 % expl. var.), PC 3 (10.7 % expl. var.).]

Figure 13: Scores and loadings plots of PC1 and PC3 of the PCA on all compounds of all deconvoluted segments, where class centroid centering and scaling was applied; Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

The PCA on the autoscaled peak table showed some systematic differences among the samples, but is not suitable to fully explore the data set without any pre-selection of variables. The same information on the differences between co-inoculated and sequentially inoculated wines was obtained as from the new approach presented here. The interpretation of the loadings, however, is complicated by the presence of more noise in the data and the larger number of variables. The use of a supervised preprocessing method, class centroid centering and scaling to intra-class variance, helped to differentiate wines according to the yeast starter culture used, giving the same results as obtained using the new approach presented here. Overall, the results from the PCAs after the PARAFAC2 deconvolution of the 38 important segments and the results from PCA after PARAFAC2 modelling of all 71 segments are comparable, although the latter were more difficult to interpret and more sophisticated methods than PCA with autoscaling are needed, such as supervised methods or variable selection.
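The supervised preprocessing named above can be illustrated with a short Python/NumPy sketch. It rests on an assumed reading of the method, which the text does not spell out: the data are centred on the mean of the class means (so that classes of unequal size weigh equally) and each variable is divided by its pooled within-class standard deviation. The function name `class_centroid_scale` is hypothetical.

```python
import numpy as np

def class_centroid_scale(X, classes):
    """Centre on the mean of the class centroids and scale each variable
    by its pooled within-class standard deviation (assumed definition)."""
    X = np.asarray(X, dtype=float)
    classes = np.asarray(classes)
    labels = np.unique(classes)
    # One centroid (column means) per class.
    centroids = np.vstack([X[classes == c].mean(axis=0) for c in labels])
    centre = centroids.mean(axis=0)
    # Pooled within-class sum of squares, per variable.
    ss_within = np.zeros(X.shape[1])
    for c, centroid in zip(labels, centroids):
        ss_within += ((X[classes == c] - centroid) ** 2).sum(axis=0)
    pooled_sd = np.sqrt(ss_within / (len(X) - len(labels)))
    return (X - centre) / pooled_sd
```

With the three yeast starter cultures as class labels, variables that differ between classes but vary little within them are emphasised. This is why such preprocessing is supervised and should be interpreted with that in mind.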

The comparability of the results from the new approach, using PARAFAC on segmented and mathematically transformed chromatograms in combination with PARAFAC2 deconvolution of important segments with subsequent PCA, and the deconvolution of all segments using PARAFAC2 with subsequent PCA modelling proves the validity of the results of the new approach. Only 38 segments of the chromatogram turned out to be important for the differentiation of samples using the new approach. Consequently, only about half of the 71 segments had to be deconvoluted using PARAFAC2, which is a considerable time saving. In this study only segments with congruence loadings greater than 0.5 were considered as ‘medium to highly correlated’ with the raw data. If, depending on the aim of a study, a higher value is chosen, such as 0.75, which can be considered as ‘highly correlated’, even fewer PARAFAC2 models would have to be constructed and interpreted. The new approach can therefore be considered as a segment selection tool prior to deconvolution of segments of chromatograms. Furthermore, the information on systematic differences obtained from the PARAFAC model on the segmented and transformed chromatograms can be used to study the important segments separately: separate PCAs can be constructed on only those compounds from segments which are responsible for a certain grouping of samples. Peak tables obtained in this manner are much smaller than a global peak table of all compounds and contain less redundant information, making them easier to explore using, for instance, simple plotting (e.g. boxplots) or, as has been shown here, PCA on autoscaled data. The PCAs constructed on these smaller subsets of peak areas of the deconvoluted profiles are much easier to interpret, as has been shown above.
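The effect of the congruence-loading cut-off on the number of segments passed to PARAFAC2 can be sketched as below. The example is only an illustration in Python/NumPy: the function name `select_segments`, the array layout (segments in rows, PARAFAC components in columns, 0 where no loading is reported) and the toy values are assumptions, not data from Table 2.

```python
import numpy as np

def select_segments(congruence, threshold=0.5):
    """Return the segments whose congruence loading exceeds the threshold
    on at least one component, plus the selection per component."""
    congruence = np.asarray(congruence, dtype=float)
    hits = congruence > threshold
    selected = np.flatnonzero(hits.any(axis=1)).tolist()
    per_component = {
        k: np.flatnonzero(hits[:, k]).tolist()
        for k in range(congruence.shape[1])
    }
    return selected, per_component

# Toy congruence loadings for four segments on two components (illustrative only).
toy = [[0.85, 0.00],
       [0.51, 0.69],
       [0.30, 0.40],
       [0.00, 0.97]]
medium_or_high, _ = select_segments(toy, threshold=0.5)
high_only, _ = select_segments(toy, threshold=0.75)
```

Raising the cut-off from 0.5 to 0.75 shrinks the list of segments to deconvolute, at the cost of ignoring segments that are only moderately correlated with a component.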

5. Conclusions

In this study, the potential of the conversion of segmented two-dimensional GC-MS chromatograms into sums of squares and cross products (SSCP) matrices prior to PARAFAC modelling has been demonstrated as a powerful data treatment technique for non-targeted GC-MS analysis. The presented approach consists of three steps. First, all chromatograms are segmented and SSCP matrices are calculated for each segment and sample. This transformation of the chromatogram segments into SSCP matrices summarizes information on the variation and covariation of all mass channels in the segments for the corresponding sample and makes alignment of peaks unnecessary. The following step, the compilation of the vectorized SSCP matrices into a compilation matrix for all samples in each segment and their transformation into SSCP matrices, gives information on the variation and covariation between samples in each segment as a function of the variation and covariation among mass channels in each segment for the corresponding sample. In the final step these SSCP matrices are merged into a three-way array, which is then analysed using PARAFAC. In essence, only the segmentation of the chromatograms and the construction of the PARAFAC model have to be done manually. This makes this approach a fast, holistic and semi-automated method for GC-MS fingerprinting. A set of 36 chromatograms derived from triplicate SPME-GC-MS analyses of twelve Cabernet Sauvignon wines was used to demonstrate the performance of the data treatment methodology. Wines could be differentiated according to the yeast starter culture used and the inoculation mode of yeast and lactic acid bacteria. Compounds responsible for this discrimination could be tentatively identified after deconvoluting peaks in the important segments using PARAFAC2. Separate PCAs on the integrated deconvoluted signals of segments which are responsible for a certain grouping of samples in the PARAFAC model provide in-depth insights into the observed phenomena. The advantage of the novel GC-MS fingerprinting approach presented herein could be confirmed by comparing it with PCA on deconvoluted peak profiles of all chromatogram segments. The final results from the new approach could not be summarized by a single PCA on the autoscaled peak table from all compounds. The new approach can, therefore, also be seen as a segment pre-selection tool prior to deconvolution of chromatogram segments.
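The three steps summarized above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions rather than the authors' code: the function names are hypothetical, each chromatogram is assumed to be a scans × mass-channels matrix, segments are assumed to be given as scan index ranges, and the PARAFAC fit itself is left to any available implementation.

```python
import numpy as np

def sscp(X):
    """Sum of squares and cross products of the columns of X."""
    return X.T @ X

def fingerprint_array(chromatograms, boundaries):
    """Build a samples x samples x segments array from GC-MS data.

    chromatograms: list of (n_scans, n_mz) arrays, one per sample.
    boundaries: list of (start, stop) scan index pairs, one per segment.
    """
    slabs = []
    for start, stop in boundaries:
        # Step 1: one (n_mz x n_mz) SSCP matrix per sample. The scan axis is
        # summed out, so peak position inside the segment does not matter.
        rows = [sscp(c[start:stop]).ravel() for c in chromatograms]
        # Step 2: stack the vectorized SSCP matrices into a compilation
        # matrix (samples in rows) and form its sample-by-sample SSCP.
        Z = np.vstack(rows)
        slabs.append(Z @ Z.T)
    # Step 3: merge the per-segment matrices into a three-way array.
    return np.stack(slabs, axis=2)
```

Because the first SSCP sums over the scans of a segment, a retention time shift that keeps a peak inside its segment leaves the result unchanged, which is why no peak alignment is needed. The resulting array would then be decomposed with PARAFAC.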

6. Acknowledgements

Lallemand is thanked for partial funding, and Lallemand North America for the donation of wine samples. JV is supported through the Initiative d’Excellence (IdEx) Université de Bordeaux and the Hochschule Geisenheim University. Marie-Claire Perello and Laurent Riquier are thanked for assistance in the laboratory, and Rasmus Bro for his suggestions and comments. Julius Witte and Kimmo Sirén are thanked for discussions on matrix algebra.

References

730

RI PT

[1] C. De Vos, Y. Tikunov, A. Bovy, R. Hall, Flavour metabolomics: Holistic

versus targeted approaches in flavour research, in: Expression of Multidis-

ciplinary Flavour Science. Proceedings of the 12th Weurman Symposium. Interlaken, Switzerland: Z¨ urcher Hochschule f¨ ur Angewandte and Institut

SC

F¨ ur Chemie und Biologische Chemie, 2008, pp. 573–580.

[2] V. Behrends, G. D. Tredwell, J. G. Bundy, A software complement to AMDIS for processing GC-MS metabolomic data, Analytical biochemistry 415 (2) (2011) 206–208.

M AN U

735

[3] S. E. Stein, An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data, Journal of the American Society for Mass Spectrometry 10 (8) (1999) 770–781. 740

[4] R. Aggio, S. G. Villas, K. Ruggiero, Metab: an R package for high-

TE D

throughput analysis of metabolomics data generated by GC-MS, Bioinformatics 27 (16) (2011) 2316–2318. [5] E. Want, P. Masson, Processing and Analysis of GC/LC-MS-Based

745

EP

Metabolomics Data, in: T. O. Metz (Ed.), Metabolic Profiling, Vol. 708 of Methods in Molecular Biology, Humana Press, 2011, pp. 277–298. [6] A. Luedemann, K. Strassburg, A. Erban, J. Kopka, TagFinder for the

AC C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65

quantitative analysis of gas chromatography - mass spectrometry (GC-MS)based metabolite profiling experiments, Bioinformatics 24 (5) (2008) 732– 737.

750

[7] C. A. Smith, E. J. Want, G. O’Maille, R. Abagyan, G. Siuzdak, XCMS: processing mass spectrometry data for metabolite profiling using nonlinear

52

ACCEPTED MANUSCRIPT

peak alignment, matching, and identification, Analytical chemistry 78 (3)

RI PT

(2006) 779–787. [8] J. Vestner, S. Malherbe, M. Du Toit, H. H. Nieuwoudt, A. Mostafa, 755

T. G´ orecki, A. G. Tredoux, A. De Villiers, Investigation of the volatile composition of pinotage wines fermented with different malolactic starter

SC

cultures using comprehensive two-dimensional gas chromatography coupled to time-of-flight mass spectrometry (GC×GC-TOF-MS), Journal of agri-

760

M AN U

cultural and food chemistry 59 (24) (2011) 12732–12744.

[9] S. J. Dixon, R. G. Brereton, H. A. Soini, M. V. Novotny, D. J. Penn, An automated method for peak detection and matching in large gas chromatography-mass spectrometry data sets, Journal of chemometrics 20 (8-10) (2006) 325–340.

[10] S. Furbo, J. H. Christensen, Automated peak extraction and quantification in chromatography with multichannel detectors, Analytical chemistry

TE D

765

84 (5) (2012) 2211–2218.

[11] C. A. Hastings, S. M. Norton, S. Roy, New algorithms for processing and peak detection in liquid chromatography/mass spectrometry data, Rapid

770

EP

Communications in Mass Spectrometry 16 (5) (2002) 462–467. [12] G. Viv´ o-Truyols, J. Torres-Lapasi´o, A. Van Nederkassel, Y. Vander Heyden, D. Massart, Automatic program for peak detection and deconvolution of

AC C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65

multi-overlapped chromatographic signals: Part I: Peak detection, Journal of Chromatography A 1096 (1) (2005) 133–145.

[13] T. Skov, R. Bro, A new approach for modelling sensor based data, Sensors and Actuators B: Chemical 106 (2) (2005) 719–729.

[14] D. Ballabio, T. Skov, R. Leardi, R. Bro, Classification of GC-MS measurements of wines by combining data dimension reduction and variable selection techniques, Journal of chemometrics 22 (8) (2008) 457–463.

[15] R. Bro, PARAFAC. Tutorial and applications, Chemometrics and intelligent laboratory systems 38 (2) (1997) 149–171.

[16] M. C. Rodríguez, G. H. Sánchez, M. S. Sobrero, A. V. Schenone, N. R. Marsili, Determination of mycotoxins (aflatoxins and ochratoxin A) using fluorescence emission-excitation matrices and multivariate calibration, Microchemical Journal 110 (2013) 480–484.

[17] R. Tauler, Multivariate curve resolution applied to second order data, Chemometrics and Intelligent Laboratory Systems 30 (1) (1995) 133–146.

[18] N. A. Sinkov, J. J. Harynuk, Cluster resolution: A metric for automated, objective and optimized feature selection in chemometric modeling, Talanta 83 (4) (2011) 1079–1087.

[19] M. Daszykowski, R. Danielsson, B. Walczak, No-alignment-strategies for exploring a set of two-way data tables obtained from capillary electrophoresis–mass spectrometry, Journal of Chromatography A 1192 (1) (2008) 157–165.

[20] C. Durante, R. Bro, M. Cocchi, A classification tool for N-way array based on SIMCA methodology, Chemometrics and Intelligent Laboratory Systems 106 (1) (2011) 73–85.

[21] M. Cocchi, C. Durante, M. Grandi, D. Manzini, A. Marchetti, Three-way principal component analysis of the volatile fraction by HS-SPME/GC of aceto balsamico tradizionale of modena, Talanta 74 (4) (2008) 547–554.

[22] C. Durante, M. Cocchi, M. Grandi, A. Marchetti, R. Bro, Application of N-PLS to gas chromatographic and sensory data of traditional balsamic vinegars of Modena, Chemometrics and Intelligent Laboratory Systems 83 (1) (2006) 54–65.

[23] J. H. Christensen, J. Mortensen, A. B. Hansen, O. Andersen, Chromatographic preprocessing of GC–MS data for analysis of complex chemical mixtures, Journal of Chromatography A 1062 (1) (2005) 113–123.

[24] J. H. Christensen, A. B. Hansen, U. Karlson, J. Mortensen, O. Andersen, Multivariate statistical methods for evaluating biodegradation of mineral oil, Journal of Chromatography A 1090 (1) (2005) 133–145.

[25] J. H. Christensen, G. Tomasi, Practical aspects of chemometrics for oil spill fingerprinting, Journal of Chromatography A 1169 (1) (2007) 1–22.

[26] N.-P. V. Nielsen, J. M. Carstensen, J. Smedsgaard, Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping, Journal of Chromatography A 805 (1) (1998) 17–35.

[27] T. Skov, F. van den Berg, G. Tomasi, R. Bro, Automated alignment of chromatographic data, Journal of Chemometrics 20 (11-12) (2006) 484–497.

[28] G. Tomasi, F. van den Berg, C. Andersson, Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data, Journal of Chemometrics 18 (5) (2004) 231–241.

[29] E. Lange, C. Gröpl, O. Schulz-Trieglaff, A. Leinenbach, C. Huber, K. Reinert, A geometric approach for the alignment of liquid chromatography-mass spectrometry data, Bioinformatics 23 (13) (2007) i273–i281.

[30] N. A. Sinkov, B. M. Johnston, P. M. L. Sandercock, J. J. Harynuk, Automated optimization and construction of chemometric models based on highly variable raw chromatographic data, Analytica chimica acta 697 (1) (2011) 8–15.

[31] D. Lay, Linear Algebra and Its Applications, 3rd Edition, Addison Wesley, 2002.

[32] R. Danielsson, D. Bäckström, S. Ullsten, Rapid multivariate analysis of LC/GC/CE data (single or multiple channel detection) without prior peak alignment, Chemometrics and intelligent laboratory systems 84 (1) (2006) 33–39.

[33] M. Daszykowski, B. Walczak, Methods for the exploratory analysis of two-dimensional chromatographic signals, Talanta 83 (4) (2011) 1088–1097.

[34] I. Stanimirova, B. Walczak, D. Massart, V. Simeonov, C. Saby, E. Di Crescenzo, STATIS, a three-way method for data analysis. Application to environmental data, Chemometrics and Intelligent Laboratory Systems 73 (2) (2004) 219–233.

[35] R. Bro, C. A. Andersson, H. A. Kiers, PARAFAC2-Part II. Modeling chromatographic data with retention time shifts, Journal of Chemometrics 13 (3-4) (1999) 295–309.

[36] R. A. Harshman, Parafac2: Mathematical and technical notes, UCLA working papers in phonetics 22 (1972) 30–44.

[37] J. M. Amigo, M. J. Popielarz, R. M. Callej´on, M. L. Morales, A. M. Troncoso, M. A. Petersen, T. B. Toldam-Andersen, Comprehensive analysis of chromatographic data by using PARAFAC2 and principal components analysis, Journal of Chromatography A 1217 (26) (2010) 4422–4429.

[38] J. M. Amigo, T. Skov, R. Bro, J. Coello, S. Maspoch, Solving GC-MS problems with PARAFAC2, TrAC Trends in Analytical Chemistry 27 (8) (2008) 714–725.

[39] L. G. Johnsen, J. M. Amigo, T. Skov, R. Bro, Automated resolution of overlapping peaks in chromatographic data, Journal of Chemometrics 28 (2) (2014) 71–82.

[40] R. Bro, Multi-way analysis in the food industry: models, algorithms, and applications, Ph.D. thesis, Københavns Universitet, Faculty of Life Sciences, Department of Food Science, Quality & Technology (1998).

[41] H. A. Kiers, Hierarchical relations among three-way methods, Psychometrika 56 (3) (1991) 449–470.

[42] C. A. Andersson, R. Bro, The N-way Toolbox for MATLAB, Chemometrics and Intelligent Laboratory Systems 52 (1) (2000) 1–4.

[43] B. Escofier, J. Pagès, Analyses factorielles simples et multiples: objectifs, méthodes et interprétation, Dunod, 2008.

[44] G. Mazerolles, M. Hanafi, E. Dufour, D. Bertrand, E. Qannari, Common components and specific weights analysis: a chemometric method for dealing with complexity of food products, Chemometrics and Intelligent Laboratory Systems 81 (1) (2006) 41–49.

[45] D. Chessel, M. Hanafi, Analyses de la co-inertie de k nuages de points, Revue de Statistique Appliquée 44 (2) (1996) 35–60.

[46] C. B. Cordella, D. Bertrand, Saisir: a new general chemometric toolbox, TrAC Trends in Analytical Chemistry 54 (2014) 75–82.

[47] S. Stein, Y. Mirokhin, D. Tchekhovskoi, G. Mallard, NIST Mass Spectral Search Program, National Institute of Standards and Technology, Gaithersburg, MD (2008).

[48] S. Rocha, V. Ramalheira, A. Barros, I. Delgadillo, M. A. Coimbra, Headspace solid phase microextraction (SPME) analysis of flavor compounds in wines. Effect of the matrix volatile composition in the relative response factors in a wine model, Journal of Agricultural and Food Chemistry 49 (11) (2001) 5142–5151.

[49] G. Antalick, M.-C. Perello, G. de Revel, Development, validation and application of a specific method for the quantitative determination of wine esters by headspace-solid-phase microextraction-gas chromatography–mass spectrometry, Food chemistry 121 (4) (2010) 1236–1245.

[50] R. Bro, H. A. Kiers, A new efficient method for determining the number of components in PARAFAC models, Journal of chemometrics 17 (5) (2003) 274–286.

[51] G. Antalick, M. Perello, G. de Revel, Changes in wine secondary metabolite composition by the timing of inoculation with lactic acid bacteria: Impact on wine aroma, in: Proceedings of the 3rd International Symposium MACROWINE 2010 on Macromolecules and Secondary Metabolites in Grapevine and Wines, Universita di Torino, Torino, Italy, 2010, pp. 143–148.

[52] M. Gammacurta, S. Marchand, W. Albertin, V. Moine, G. de Revel, Impact of yeast strain on ester levels and fruity aroma persistence during aging of bordeaux red wines, Journal of agricultural and food chemistry 62 (23) (2014) 5378–5389.

[53] C. E. Abrahamse, E. J. Bartowsky, Timing of malolactic fermentation inoculation in Shiraz grape must and wine: influence on chemical composition, World Journal of Microbiology and Biotechnology 28 (1) (2012) 255–265.

[54] C. Knoll, S. Fritsch, S. Schnell, M. Grossmann, S. Krieger-Weber, M. du Toit, D. Rauhut, Impact of different malolactic fermentation inoculation scenarios on Riesling wine aroma, World Journal of Microbiology and Biotechnology 28 (3) (2012) 1143–1153.

Supporting Information

The approach developed in this study can be summarized (in abbreviated form) as follows:

• Segmentation of chromatograms along the retention axis

  – Calculation of SSCP matrices for every segment and sample

  – Concatenation of all vectorized SSCP matrices (only upper triangular part) of each segment into a compilation matrix

  – Calculation of SSCP matrices of each compilation matrix

  – Assembling of all SSCP matrices of each compilation matrix to a three-way array

  – PARAFAC on three-way array

  – Visual examination of loadings and selection of important segments

• Deconvolution of only important segments using PARAFAC2

• Integration of deconvoluted peak profiles and identification of compounds

• Multiple PCAs on selected compounds with consideration of PARAFAC results
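The SSCP steps of this procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the array layout (samples × scans × mass channels), the function names `segment_sscp` and `build_three_way_array`, the segment boundaries, and the random test data are all hypothetical. It is also assumed that the first SSCP is taken over the retention axis (giving a channels × channels matrix) and the second over the vectorized entries (giving a samples × samples matrix), which appears consistent with the samples × samples × segments array in the graphical abstract.

```python
import numpy as np

def segment_sscp(segment):
    """SSCP over the retention axis: (scans x channels) -> (channels x channels).
    Assumption: the cross products are summed over scans."""
    return segment.T @ segment

def build_three_way_array(data, boundaries):
    """data: (samples, scans, channels); boundaries: list of (start, stop) scans.
    Returns an array of shape (samples, samples, segments)."""
    n_samples, _, n_channels = data.shape
    iu = np.triu_indices(n_channels)  # upper triangular part, diagonal included
    slabs = []
    for start, stop in boundaries:
        # Compilation matrix: one row per sample, holding the vectorized
        # upper triangle of that sample's SSCP matrix for this segment.
        compilation = np.stack([
            segment_sscp(data[i, start:stop, :])[iu] for i in range(n_samples)
        ])
        # SSCP of the compilation matrix: (samples x samples).
        slabs.append(compilation @ compilation.T)
    return np.stack(slabs, axis=2)

# Hypothetical data: 6 samples, 40 scans, 5 mass channels, 3 segments.
rng = np.random.default_rng(0)
data = rng.random((6, 40, 5))
boundaries = [(0, 10), (10, 25), (25, 40)]
array3 = build_three_way_array(data, boundaries)

# Simulate a retention shift that stays inside the first segment of sample 0.
shifted = data.copy()
shifted[0, 0:10, :] = np.roll(data[0, 0:10, :], 3, axis=0)
array3_shifted = build_three_way_array(shifted, boundaries)
print(array3.shape, np.allclose(array3, array3_shifted))
```

Because the first SSCP sums over the scans of a segment, the three-way array does not change when a signal moves within its segment, which the final comparison illustrates with a circular shift. A shift that carries a peak across a segment border is not covered by this toy check. A PARAFAC model would then be fitted to `array3`, for example with the N-way Toolbox [42].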

Fitting multiple PARAFAC2 models on all segments of the chromatograms, followed by PCA [37], was used as a reference method in this study. This approach can be summarized as follows:

• Segmentation of the chromatograms along the retention axis

• Deconvolution of every segment using PARAFAC2

• Integration of deconvoluted peak profiles and identification of compounds

• PCA on peak area tables
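The last step of the reference method, PCA on a peak area table, can be sketched as follows. The table is random placeholder data, the function name `pca_scores_loadings` is hypothetical, and autoscaling (mean centering and scaling to unit variance) is an assumption made for this illustration; the preprocessing actually applied should be taken from the main text.

```python
import numpy as np

def pca_scores_loadings(areas, n_components=2):
    """PCA via SVD on an autoscaled (samples x compounds) peak area table.
    Autoscaling is an assumption of this sketch."""
    centered = areas - areas.mean(axis=0)
    scaled = centered / areas.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(scaled, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    loadings = vt[:n_components].T                    # compound contributions
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, loadings, explained

# Made-up peak areas: 12 samples, 8 integrated compounds.
rng = np.random.default_rng(1)
areas = rng.random((12, 8)) * 1e4
scores, loadings, explained = pca_scores_loadings(areas)
print(scores.shape, loadings.shape, explained)
```

The scores would be inspected for sample groupings and the loadings for the compounds driving them, in the same way as the PARAFAC sample and segment loadings are compared in the figures of this section.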

[Figure: full chromatogram (x-axis: scan number, y-axis: abundance) with enlarged insets for Peaks 1 & 2, Peaks 3 & 4, Peaks 5, 6 & 7, Peak 8, and Peaks 9 & 10.]

Figure 14: Overlay of all mass channels of one sample (sample no. 14) of the artificial GC-MS data set. Dotted lines show the segmentation of the chromatogram.

[Figure: overlaid chromatograms (x-axis: scan number, y-axis: abundance) with enlarged insets for Peaks 1 & 2, Peaks 3 & 4, Peaks 5, 6 & 7, Peak 8, and Peaks 9 & 10.]

Figure 15: Overlay of TICs of all samples of the artificial GC-MS data set with introduced shift. Dotted lines show the segmentation of the chromatogram.

[Figure: (a) First mode (samples) loadings; (b) Third mode (segments) congruence loadings. Axes: Component 5 (8.5% expl. var.) vs. Component 10 (3.1% expl. var.). Legend: clos, vrb, rbs; co-inoculated, sequential.]

Figure 16: Loadings plots of PARAFAC components five vs. ten (model with 36 segments); Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

[Figure: (a) First mode (samples) loadings; (b) Third mode (segments) congruence loadings. Axes: Component 1 (18% expl. var.) vs. Component 3 (10.4% expl. var.). Legend: clos, vrb, rbs; co-inoculated, sequential.]

Figure 17: Loadings plots of PARAFAC components one vs. three (model with 36 segments); Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

[Figure: (a) First mode (samples) loadings; (b) Third mode (segments) congruence loadings. Axes: Component 1 (18% expl. var.) vs. Component 2 (11.3% expl. var.). Legend: clos, vrb, rbs; co-inoculated, sequential.]

Figure 18: Loadings plots of PARAFAC components one vs. two (model with 36 segments); Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

[Figure: (a) First mode (samples) loadings; (b) Third mode (segments) congruence loadings. Axes: Component 2 (16.2% expl. var.) vs. Component 1 (19.1% expl. var.). Legend: clos, vrb, rbs; co-inoculated, sequential.]

Figure 19: Loadings plots of PARAFAC components 2 vs. 1 (model with 18 segments); Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).

[Figure: (a) First mode (samples) loadings; (b) Third mode (segments) congruence loadings. Axes: Component 2 (16.2% expl. var.) vs. Component 3 (13.9% expl. var.). Legend: clos, vrb, rbs; co-inoculated, sequential.]

Figure 20: Loadings plots of PARAFAC components 2 vs. 3 (model with 18 segments); Yeast starter cultures: Lalvin Clos (clos), Uvaferm RBS (rbs), Uvaferm VRB (vrb); Lactic acid bacteria starter cultures: Enoferm Alpha (alpha), Enoferm Beta (beta), Lalvin PN4 (PN4), Lalvin VP41 (41) and O-Mega (271).


Highlights


October 27, 2015


• A novel data processing procedure for non-targeted gas chromatography mass spectrometry (GC-MS) data is proposed.

• Basic matrix manipulation of segmented GC-MS chromatograms and PARAFAC multi-way modelling are used.

• Retention time shifts and peak shape deformations between samples are taken into account.

• The procedure is demonstrated on an artificial and an experimental full-scan GC-MS data set.