Accepted Manuscript Fluorescence spectroscopy as tool for the geographical discrimination of coffees produced in different regions of Minas Gerais State in Brazil
Bruno G. Botelho, Leandro S. Oliveira, Adriana S. Franca
PII: S0956-7135(17)30030-0
DOI: 10.1016/j.foodcont.2017.01.020
Reference: JFCO 5427
To appear in: Food Control
Received Date: 10 November 2016
Revised Date: 24 January 2017
Accepted Date: 25 January 2017
Please cite this article as: Bruno G. Botelho, Leandro S. Oliveira, Adriana S. Franca, Fluorescence spectroscopy as tool for the geographical discrimination of coffees produced in different regions of Minas Gerais State in Brazil, Food Control (2017), doi: 10.1016/j.foodcont.2017.01.020
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Fluorescence spectroscopy for geographical discrimination of green coffees.
Discrimination among coffees from the major producing regions in MG, Brazil.
PARAFAC, NPLS-DA and UPLS-DA were tested for model development.
UPLS-DA provided the most accurate models, with 100% precision for MM samples.
Fluorescence spectroscopy as tool for the geographical discrimination of coffees produced in different regions of Minas Gerais State in Brazil
Bruno G. Botelho a, Leandro S. Oliveira a,b, Adriana S. Franca a,b,*
a PPGCA, Universidade Federal de Minas Gerais, Av. Antônio Carlos, 6627, 31270-901, Belo Horizonte, MG, Brazil.
b DEMEC, Universidade Federal de Minas Gerais, Av. Antônio Carlos, 6627, 31270-901, Belo Horizonte, MG, Brazil.
* Corresponding author. DEMEC, Universidade Federal de Minas Gerais, Av. Antônio Carlos, 6627, 31270-901, Belo Horizonte, MG, Brazil. Tel.: +55 31 34093512. Fax: +55 31 34433783. E-mail: [email protected]
ABSTRACT
The designation of origin of high-value agricultural and food products has become increasingly relevant for producers, since it allows consumers to relate the singular characteristics of their preferred product to its provenance. Thus, coffee producers are seeking ways to certify the authenticity of their products with respect to provenance. Fluorescence spectroscopy was applied to develop a geographical discrimination model for coffees produced in Minas Gerais State, Brazil. PARAFAC, NPLS-DA and UPLS-DA were used to discriminate samples produced in four major production areas in Minas Gerais, namely Cerrado Mineiro (CM), Matas de Minas (MM), Norte de Minas (NM) and Sul de Minas (SM). UPLS-DA presented the best results, with f-scores for CM and SM higher than 0.8 for both training and test sets, which indicates good classification. The MM model presented a good f-score for the training set (1.000), but a poor result was obtained for the test set (0.250), mainly due to false negative samples. The NM model presented an intermediate result, with an f-score of 0.963 for the training set and 0.625 for the test set. The proposed method requires only a simple sample pre-treatment, is fast and can be used for the determination of the geographical origin of coffees produced in Minas Gerais State.
Keywords: Fluorescence spectroscopy; PARAFAC; NPLS-DA; UPLS-DA; Coffee; geographical origin
1. Introduction
According to the Brazilian Ministry of Agriculture, Livestock and Supply (MAPA), the geographical indication (GI) is a denomination given to products or services that have singular characteristics related to their provenance, and those characteristics bring a good reputation, high added value and a distinction from similar products. These products present unique quality due to natural conditions, such as climate, soil and vegetation, or to local know-how. The Brazilian agency responsible for granting the GI designation, INPI (Instituto Nacional de Propriedade Industrial) (MAPA, 2015), adopted two categories of GIs: Indication of Provenance (IP) and Denomination of Origin (DO). The latter, in addition to evidence of reputation associated with the name of the place of production, requires a percentage of the production in question to be carried out in situ.
Until 2013, 46 GIs had been registered in Brazil, 38 of them for national products/services, of which 30 were IPs and 8 were DOs. There were also eight international products registered, all DOs (Wilkinson, Cerdan, & Dorigon, 2015). Of the 30 products registered as IPs, only 3 are coffees: one from Paraná State (Norte Pioneiro Coffee) and two from Minas Gerais (MG) State (Cerrado Mineiro Coffee and Serra da Mantiqueira Coffee). Minas Gerais State is the largest coffee producer in Brazil, accounting for approximately 50% of the total production. The State is officially divided into four major producing regions: Sul de Minas (SM), in the southern part of the State (encompassing areas between 21° 13' to 22° 10' S and 44° 20' to 47° 20' W, with altitudes from 700 to 1080 m and climate of the B2 and B3 types, Köppen System); Cerrado Mineiro (CM), in the western part (encompassing areas between 16° 37' to 20° 13' S and 45° 20' to 49° 48' W, with altitudes from 820 to 1110 m and climate of the B1 type); Matas de Minas (MM), in the southeastern part (encompassing areas between 18° 35' to 21° 26' S and 40° 50' to 43° 36' W, with altitudes from 400 to 700 m and climate of the B1, B2, B4, C1 and C2 types); and Norte de Minas (NM), in the northern part (encompassing areas between 17° 05' to 18° 09' S and 40° 50' to 42° 40' W, with altitudes around 1099 m and climate of the C1 and D types) (Figure 1) (Barbosa et al., 2010; CONAB, 2015).
According to the Brazilian Coffee Industry Association (ABIC), the market prices of Cerrado Mineiro Coffee increased by 30 to 40% after it acquired the IP seal (ABIC, 2016). As consumers are willing to pay more for a product from a specific area, expecting to acquire a product of higher quality, some attempts have been made to create special labels for some specialty coffee production regions, such as those by the Cerrado Mineiro Organization (Cerrado Mineiro, 2016). However, more than labelling is needed to ensure coffee origin certification. Analytical methodologies using different techniques, such as near (Link, Lemes, Marquetti, dos Santos Scholz, & Bona, 2014a) and mid infrared spectroscopies (Link, Lemes, Marquetti, dos Santos Scholz, & Bona, 2014b), gas chromatography (Carrera, Leon-Camacho, Pablos, & Gonzalez, 1998; Costa Freitas & Mosca, 1999; Risticevic, Carasek, & Pawliszyn, 2008; Taveira et al., 2014; Toledo et al., 2016), ultra high pressure liquid chromatography coupled to mass spectrometry (Mehari et al., 2016), isotope ratio mass spectrometry (Weckerle, Richling, Heinrich, & Schreier, 2002; Rodrigues et al., 2009), multi-collector inductively coupled plasma mass spectrometry (Liu, You, Chen, Liu, & Chung, 2014) and inductively coupled plasma optical emission spectrometry (Muniz-Valencia, Jurado, Ceballos-Magana, Alcazar, & Hernandez-Diaz, 2014), have been developed in order to determine unequivocally the provenance of coffee samples. These techniques have been used to discriminate coffees from different continents, or even coffees produced in the same country, but no work has been done on the discrimination of coffees produced in different microregions of Minas Gerais State. Also, although fluorescence spectroscopy has been extensively used for the classification of food (Sádecká & Tóthová, 2007), no record has been found in the literature of its application to the discrimination of coffees.
In view of the aforementioned, the main objective of this paper was to develop a supervised classification method, using chemometric tools and fluorescence spectroscopy, capable of discriminating coffees produced in Minas Gerais according to their origin.
4. Materials and methods
4.1. Sample description
One hundred and ten samples of green Arabica coffee from the four regions of Minas Gerais previously cited (2012-2013, 2013-2014 and 2014-2015 crop years) were provided by the producers themselves or by cooperatives or associations of producers. Of the total, 34 samples were from Cerrado Mineiro (CM), 21 from Matas de Minas (MM), 20 from Norte de Minas (NM) and 35 from Sul de Minas (SM).
Approximately 150 g of coffee were ground using an MCF55 rotating disk mill (Arbel, Brazil). After grinding, the powder was sieved through a 20 mesh sieve (d < 1 mm). The samples were then stored in hermetically closed plastic bags until analysis.
The green coffee powder was submitted to aqueous extraction to allow acquisition of the fluorescence spectra. Three grams of each sample (previously ground and sieved) were placed in a 50 mL Falcon tube containing 20 mL of distilled water. The tube was mixed in a vortex mixer for 30 seconds and heated for 15 minutes at 80 °C in a water bath. After the 15-minute period, the samples were placed in an ice bath to lower their temperature and then centrifuged for 5 minutes at 3500 rpm. After centrifugation, the samples were filtered and the aqueous extract obtained was stored at -18 °C until analysis.
On the day of the analysis, the frozen extracts were thawed naturally until reaching thermal equilibrium with the surrounding environment (approximately 20 °C). Subsequently, the extracts were diluted tenfold with distilled water and submitted to fluorescence spectroscopy analysis.
4.2. Instrumentation
Fluorescence spectra were obtained with a Varian Cary Eclipse spectrofluorimeter, using a 10 mm quartz cuvette. All the excitation-emission matrices (EEM) were obtained in the excitation range from 250 to 500 nm (20 nm steps) and in the emission range from 350 to 600 nm (2 nm steps). The slit widths of the excitation and emission monochromators were both 10.0 nm and the scanning rate was 9600 nm min⁻¹.
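To make the size of the resulting data set concrete, the acquisition grid can be sketched in a few lines of Python with NumPy. This is only an illustration: it assumes that scanning starts at the lower limit of each range and stops at or below the upper limit (so the last excitation wavelength would be 490 nm, which the text does not state explicitly), and the array name `eem_cube` is hypothetical.

```python
import numpy as np

# Axes implied by the acquisition settings quoted above (assumption: the
# instrument steps from the lower limit and stops at or below the upper one).
excitation_nm = np.arange(250, 500 + 1, 20)   # 250, 270, ..., 490
emission_nm = np.arange(350, 600 + 1, 2)      # 350, 352, ..., 600

n_samples = 110  # number of coffee extracts reported in Section 4.1

# Hypothetical container for the whole data set: one EEM per sample.
eem_cube = np.zeros((n_samples, excitation_nm.size, emission_nm.size))

print(excitation_nm.size, emission_nm.size, eem_cube.shape)
```

Under these assumptions each EEM holds 13 excitation by 126 emission intensities, and the full data set is a samples x excitation x emission array.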
4.3. Statistical Analysis
PARAFAC (PARAllel FACtor analysis), NPLS-DA and UPLS-DA were employed for construction of the discrimination models. PARAFAC is a decomposition method that can be considered a generalization of PCA to higher-order data. It decomposes the data into triads, or trilinear components. Each component in a PARAFAC model is formed by one score vector (information related to the samples) and two loading vectors. In this paper, the loading vectors represent the excitation and the emission spectral data. The advantage of PARAFAC over bilinear methods is the uniqueness of the solution, which allows the extraction of the pure spectra of the analyzed species (Bro, 1996). NPLS-DA is the combination of NPLS (higher-order, N-way or multilinear Partial Least Squares) and discriminant analysis. NPLS is an extension of the two-dimensional PLS algorithm that handles independent data sets of order higher than two (cubic or four-dimensional arrays, for example). The combination of NPLS and discriminant analysis allows the classification of samples using these higher-order data, which usually increases sensitivity (Bro, 1996).
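The trilinear structure assumed by PARAFAC can be illustrated with a short NumPy sketch. It does not fit a model; it only shows how one score matrix and two loading matrices combine into a three-way array. The factor matrices `A`, `B` and `C`, their sizes and their random contents are illustrative assumptions, not values from this study.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_ex, n_em, n_comp = 5, 13, 126, 3  # toy sizes, chosen for illustration

# Hypothetical non-negative factors: scores (samples) and two loading modes.
A = rng.random((n_samples, n_comp))   # scores: relative amount of each component
B = rng.random((n_ex, n_comp))        # excitation loadings
C = rng.random((n_em, n_comp))        # emission loadings

# Trilinear model: X[i, j, k] = sum_f A[i, f] * B[j, f] * C[k, f]
X = np.einsum("if,jf,kf->ijk", A, B, C)

# The same array written as a sum of rank-one "triads", one per component.
X_triads = sum(
    np.multiply.outer(np.multiply.outer(A[:, f], B[:, f]), C[:, f])
    for f in range(n_comp)
)

print(X.shape, np.allclose(X, X_triads))
```

Fitting PARAFAC amounts to finding the `A`, `B` and `C` that best reproduce the measured array in the least-squares sense, which in this work was done with the PLS Toolbox routine.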
UPLS-DA (Unfolded Partial Least Squares Discriminant Analysis) is a variation of PLS in which higher-order data are unfolded. Unfolding consists in reducing the dimensionality of the data array (transforming a cubic, three-way array into a two-way array, i.e., a matrix). Although this reduction causes some loss of information about the samples and of model interpretability, UPLS-DA gains in simplicity and ease of use, because it uses the basic PLS algorithm (Olivieri & Escandar, 2014).
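As a minimal sketch of what unfolding means in practice, the following NumPy lines reshape a samples x excitation x emission array into a samples x variables matrix and back again. The array sizes follow the acquisition settings under the assumption of 13 excitation and 126 emission points, and the random data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
cube = rng.random((110, 13, 126))   # placeholder data: samples x excitation x emission

# Unfold: each sample's EEM becomes one row of length 13 * 126.
unfolded = cube.reshape(cube.shape[0], -1)

# Refold: restore the original three-way shape (used, e.g., to display results as maps).
refolded = unfolded.reshape(cube.shape)

print(unfolded.shape, np.array_equal(refolded, cube))
```

The unfolded matrix can be passed to any two-way method, at the cost of ignoring the fact that neighbouring columns belong to the same excitation or emission wavelength.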
Data were handled using MATLAB software, version 7.13 (The MathWorks, Natick, MA, USA). The PLSDA routine used for the UPLS-DA models and the NPLS and PARAFAC routines came from the PLS Toolbox, version 6.5 (Eigenvector Technologies, Manson, WA, USA).
5. Results and discussion
5.1. PARAFAC
Figure 2a shows a mean contour map of the EEM for all 110 green coffee samples. Colors represent variations in signal intensity, ranging from blue (low intensity) to yellow (high intensity). A high-intensity band clearly crosses the contour map diagonally. Such signals are not related to the samples; they are caused by light scattering (Rayleigh scatter), a physical phenomenon that occurs naturally in fluorescence analysis. Given that such signals can overlap with signals from the sample, scatter removal algorithms should be employed. In this study, we used the one proposed by Bahram et al. (2006). The mean contour map obtained after scatter removal can be seen in Figure 2b. In this map, one can observe four high-intensity regions corresponding to the following excitation/emission pairs: 370/440 nm, 400/500 nm, 390/540 nm and 420/440 nm. However, it is not possible to infer directly the relevance of such regions or to obtain a precise spectral attribution, given the high-interference region observed in the center of the contour map, probably associated with the spectral overlap of several fluorophores in the coffee extract.
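The idea of removing scatter by interpolation can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of Bahram et al. (2006): it treats only the first-order Rayleigh band (emission close to excitation), interpolates linearly along the emission axis, and uses a band half-width of 15 nm that is an assumption made for the example. The function name and the toy data are hypothetical.

```python
import numpy as np

def remove_first_order_rayleigh(eem, ex_nm, em_nm, half_width_nm=15.0):
    """Replace the band around emission == excitation by linear interpolation.

    eem has shape (n_excitation, n_emission). half_width_nm is an assumed,
    illustrative band width; it is not a value taken from the paper.
    """
    cleaned = eem.astype(float).copy()
    for i, ex in enumerate(ex_nm):
        in_band = np.abs(em_nm - ex) <= half_width_nm
        if not in_band.any() or in_band.all():
            continue  # nothing to fix, or nothing left to interpolate from
        # np.interp holds the nearest valid value at the edges of the emission range.
        cleaned[i, in_band] = np.interp(
            em_nm[in_band], em_nm[~in_band], cleaned[i, ~in_band]
        )
    return cleaned

# Toy EEM: flat background plus a sharp synthetic "scatter" ridge at em == ex.
ex_nm = np.arange(250, 501, 20).astype(float)
em_nm = np.arange(350, 601, 2).astype(float)
eem = np.ones((ex_nm.size, em_nm.size))
ridge = np.abs(em_nm[None, :] - ex_nm[:, None]) <= 4
eem[ridge] = 100.0

cleaned = remove_first_order_rayleigh(eem, ex_nm, em_nm)
print(eem.max(), cleaned.max())
```

In the toy example the ridge disappears and the flat background is recovered. With real spectra, the band width and the treatment of second-order and Raman scatter would have to be chosen by inspecting the data.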
In order to facilitate spectral interpretation, the PARAFAC curve resolution technique was employed. A three-component model, accounting for 97% of the X (spectral data) variance and presenting a core consistency value of 78, was created. No visible trends were found among the different groups when the scores of the three components were plotted against each other.
Even though PARAFAC was not able to indicate any tendency of sample grouping, its use is still relevant, because it performs spectral deconvolution and separates overlapping signals. Thus, pure spectra are obtained for each of the model components, indicating which fluorophores are present in the samples and improving spectral interpretation. Figure 3 shows the loadings obtained for the three components. Belay et al. (2015) studied the interaction between caffeic acid, chlorogenic acid (5-CQA) and caffeine and, by experimental methods, estimated the excitation and emission peaks for caffeic acid as 370 and 460 nm, respectively. These values are similar to the ones presented by the first component (Figure 3a), which indicates that this component might be due to the presence of caffeic acid in the extracts, a phenolic compound that usually occurs in coffee esterified to quinic acid, thus forming the chlorogenic acids (Farah & Donangelo, 2006).
The second component presented an excitation peak near 410 nm and an emission peak around 540 nm (Figure 3b), which is similar to the excitation/emission peaks of quercetin in a PBS solution (Nifli et al., 2007). Quercetin is a flavonoid present in green coffees (Mullen et al., 2013) and, together with caffeic acid, represents the major antioxidants in beverages containing caffeine (Woodward, 2008). The third component presented excitation/emission peaks that are consistent with the lipid fraction of coffee (Figure 3c), sometimes specifically attributed to the tocopherol present in it (Guzmán et al., 2015; Tanajura da Silva et al., 2015). Tocopherol is a major component of the unsaponifiable fraction of coffee oil (Speer & Kölling-Speer, 2006) and has been used as a marker for the detection and identification of adulteration of roasted and ground coffee with corn (Jham et al., 2007).
Since no natural clustering by origin was observed in the PARAFAC results, an N-way supervised classification method, NPLS-DA, was used to develop a discrimination model for the classification of the coffee samples.
5.2. NPLS-DA
An NPLS-DA model was built using the same 110 samples of green coffee described in Section 4.1. Samples from each region were separated into training (two thirds of the total samples) and test (remaining one third) sets using the Kennard-Stone (KS) algorithm. The data were unfolded into a two-dimensional array, since the KS algorithm is not suitable for cubic arrays. After the separation process, the data were refolded to their original shape.
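The splitting procedure described above can be sketched in Python as follows. The function is a plain NumPy version of the usual Kennard-Stone rule (start from the two most distant samples, then repeatedly add the sample farthest from those already chosen); it is an illustrative assumption about the procedure, not the routine actually used by the authors, and the random array stands in for the EEMs of one region.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    if n_select < 2 or n_select > X.shape[0]:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances computed from squared norms and the Gram matrix.
    sq = np.sum(X**2, axis=1)
    dist = np.sqrt(np.clip(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0, None))
    # Start with the two samples that are farthest apart.
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(X.shape[0]) if i not in selected]
    while len(selected) < n_select:
        # For each candidate, distance to its nearest already-selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the candidate that is farthest from the selected set.
        chosen = remaining[int(np.argmax(nearest))]
        selected.append(chosen)
        remaining.remove(chosen)
    return selected

rng = np.random.default_rng(2)
cube = rng.random((21, 13, 126))            # placeholder for the EEMs of one region
unfolded = cube.reshape(cube.shape[0], -1)  # KS works on a samples x variables matrix
train_idx = kennard_stone(unfolded, n_select=14)  # two thirds of 21
test_idx = [i for i in range(cube.shape[0]) if i not in train_idx]
train_cube, test_cube = cube[train_idx], cube[test_idx]  # back to three-way arrays
print(len(train_idx), len(test_idx), train_cube.shape)
```

Because the rule favours samples at the edges of the data cloud for the training set, the test set tends to contain the more central samples of each class.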
At the end of the process, the training set was composed of 75 samples (23 from CM, 14 from MM, 14 from NM and 24 from SM) and the test set was composed of 35 samples (11 from CM, 7 from MM, 6 from NM and 11 from SM). The number of latent variables (LV) was selected using random-subsets cross-validation, choosing the LV number that presented the smallest cross-validation classification error. The best models were built with 2 LVs, which accounted for 84.5% and 12.5% of the X and Y variance, respectively.
As can be seen in Figure 4, the majority of the CM samples were correctly classified in both the training and the test sets, but there was also a high number of false positive samples from every class, mostly from SM. The MM classification model was able to differentiate the MM samples from CM and SM, but not from NM, with a high number of false positives and false negatives between these two regions. The NM and SM models presented similar behavior, with a high number of misclassifications from all the other regions. In the SM model, for example, all of the CM samples were misclassified as false positives.
Table 1 summarizes all the classification results, based on the most probable class classification (Wise et al., 2006), organized so as to facilitate identification of the sources of misclassification among regions. The most noticeable misclassifications concern the SM samples. More than 50% of them were incorrectly classified as CM in the training set. A significant number of misclassifications (more than 60% of the samples) also occurred in the test set.
Table 1 also shows some qualitative Figures of Merit (FoM) that were estimated from the model results. The False Negative Rate (FNR) and the False Positive Rate (FPR) indicate how the classification errors are distributed, and whether a model is able to efficiently classify its positive samples (FNR) and correctly reject the samples that do not belong to it (FPR) (Christin et al., 2012):
$$\mathrm{FPR} = \frac{\text{false positive samples}}{\text{false positive samples} + \text{true negative samples}} \tag{1}$$
$$\mathrm{FNR} = \frac{\text{false negative samples}}{\text{false negative samples} + \text{true positive samples}} \tag{2}$$
Table 1 - Confusion matrix and qualitative figures of merit estimated for the developed NPLS-DA model.

Confusion matrix (rows: actual class; columns: predicted as)

Training      CM    MM    NM    SM
CM            20     0     0     3
MM             2     9     0     3
NM             2     3     8     1
SM            14     1     1     8

Test          CM    MM    NM    SM
CM            10     0     0     1
MM             1     5     0     1
NM             1     2     2     1
SM             7     1     0     3

Figures of merit

Set       Model  FPR    FNR    NMC    Precision  Recall  F1 Scores
Training  CM     0.296  0.115  0.411  0.488      0.870   0.625
Training  MM     0.092  0.263  0.355  0.600      0.643   0.621
Training  NM     0.240  0.429  0.667  0.333      0.574   0.421
Training  SM     0.123  0.410  0.533  0.500      0.304   0.378
Test      CM     0.273  0.083  0.356  0.526      0.909   0.667
Test      MM     0.097  0.222  0.319  0.625      0.714   0.667
Test      NM     0.000  0.400  0.400  1.000      0.333   0.500
Test      SM     0.111  0.421  0.532  0.500      0.273   0.353
Precision is directly related to FPR and indicates the fraction of correctly classified positive samples among all samples classified as positive (true positives and false positives):
$$\mathrm{Precision} = \frac{\text{true positive samples}}{\text{true positive samples} + \text{false positive samples}} \tag{3}$$
This FoM indicates how capable the method is of separating the samples belonging to one class from the ones that do not belong to it (Christin et al., 2012). In the training set, precision values ranged from 0.333 to 0.600, and in the test set they ranged from 0.500 to 1.000. The CM and SM models presented the smallest precision values. Although the CM model classified almost all of its samples correctly, it also presented a great number of false positive samples, which lowered its precision. The opposite happened with the SM model, which was not able to classify its own samples properly (only 8 out of 20). However, when compared with CM, it presented only a few false positive samples (7 in SM against 21 in CM).
The recall (also called sensitivity or true positive rate) represents the capacity of the model to correctly classify its positive samples, considering only the positive samples (Christin et al., 2012):
$$\mathrm{Recall} = \frac{\text{true positive samples}}{\text{true positive samples} + \text{false negative samples}} \tag{4}$$
It is interesting to notice that the CM and MM models obtained recall values distinctly higher than the NM and SM models, for both training and test sets. This indicates that the CM and MM models have a higher capacity of correctly classifying their positive samples. The Number of Misclassifications (NMC) and the f-score provide an overall evaluation of model quality:
$$\mathrm{NMC} = \mathrm{FPR} + \mathrm{FNR} \tag{5}$$
$$f\text{-score} = \frac{(\beta^{2} + 1) \times \mathrm{precision} \times \mathrm{recall}}{\beta^{2} \times \mathrm{precision} + \mathrm{recall}} \tag{6}$$
NMC estimates the total of misclassified samples (both false positives and false negatives) in a model, and the f-score provides a harmonic mean of precision and recall, which is more focused on how good a model is at correctly classifying its positive samples. Recall and precision are balanced in the f-score when the constant β is set to 1, and recall is favored when β > 1. In this paper, β was set to 1, because the main goal was to correctly classify all samples, which requires precision and recall to be taken into account to the same extent (Christin et al., 2012). NMC values ranged from 0.355 (MM) to 0.667 (NM) for the training set and from 0.319 (MM) to 0.532 (SM) for the test set. SM had the worst NMC for the test set and the second worst for the training set, which confirms the difficulty of the model in correctly classifying the samples from this specific region. CM and MM had the smallest NMC values, but these can still be considered high when compared to others published in the literature. Taveira et al. (2014) developed a model capable of classifying samples from three different locations in Minas Gerais using metabolomic profiles obtained by GC-Q/MS, obtaining 100% correct classification. Mehari et al. (2016) took a similar approach when developing a chemometric method to determine the geographical origin of Ethiopian coffees using UPLC-MS. The phenolic compound profile was used, and a 90% correct classification rate was obtained. These higher classification rates may be associated with the use of more sophisticated analytical methods, capable of detecting substances at much lower concentrations than fluorescence spectroscopy.
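As a worked check of the precision, recall and f-score definitions (Eqs. 3, 4 and 6 with β = 1), the sketch below recomputes them from the test-set confusion matrix of the NPLS-DA model in Table 1. It assumes that the rows of the matrix are the actual classes and the columns the predicted classes, which is the reading under which the row totals match the test-set sizes (11, 7, 6 and 11). The function name is hypothetical.

```python
import numpy as np

def precision_recall_fscore(confusion, class_index, beta=1.0):
    """Precision, recall and f-score for one class.

    confusion[i, j] counts samples of actual class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = confusion[class_index, class_index]
    fp = confusion[:, class_index].sum() - tp   # other classes predicted as this one
    fn = confusion[class_index, :].sum() - tp   # this class predicted as another one
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = beta**2 * precision + recall
    fscore = (beta**2 + 1) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, fscore

classes = ["CM", "MM", "NM", "SM"]
# Test-set confusion matrix of the NPLS-DA model (Table 1).
nplsda_test = [
    [10, 0, 0, 1],
    [1, 5, 0, 1],
    [1, 2, 2, 1],
    [7, 1, 0, 3],
]
for k, name in enumerate(classes):
    p, r, f = precision_recall_fscore(nplsda_test, k)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
```

For the CM class, for example, this gives a precision of 10/19 = 0.526, a recall of 10/11 = 0.909 and an f-score of 0.667, in agreement with the test-set values listed in Table 1.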
Link et al. (2014b) used mid-infrared spectroscopy to determine the geographical origin of Brazilian coffees produced in four different locations of Paraná State. The results presented by the authors were similar to the ones obtained by Taveira et al. (2014) and Mehari et al. (2016), with almost 90% correct classification. The factor that might have contributed the most to this good performance was the small sample variance: samples from only three locations, harvested in the same year, were used. In the present work, the samples came from more than 80 different locations, with harvest years ranging from 2012 to 2015. The natural variance of the samples due to both different locations and different harvest years represents a more realistic situation, because commercialized coffees may come from different crop years: producers and their cooperatives and associations usually store coffees from different crops in order to get better prices for the bags, given market price fluctuations. As this factor does not affect the Designation of Origin (DO), variations in sample crop years, ranging from 2012 to 2015, were taken into consideration to obtain a representative and robust multivariate model.
The two components used in the NPLS-DA model are depicted in Figure 5. These components are consistent with the second and third components of the PARAFAC model. This indicates that, even though caffeic acid was detected by fluorescence spectroscopy, it does not contribute to the classification of the samples.
Similar results were found by Taveira et al. (2014), for whom caffeic acid was not a significant metabolite in the classification of coffee samples from different locations of Minas Gerais State. As the NPLS-DA model performance was still considered unsatisfactory, another strategy was used, UPLS-DA, in which the trilinear data are unfolded into a two-dimensional array and a conventional PLS-DA algorithm is used (Olivieri & Escandar, 2014).
5.3. UPLS-DA
The same 110 samples used in the PARAFAC and NPLS-DA models were used to build the UPLS-DA models. The samples were divided into training and test sets in the same way as described for NPLS-DA, except that they were not refolded for model construction. A few pre-processing algorithms were tested in order to improve the model classification, and the best results were obtained with the combination of MSC (Multiplicative Scatter Correction) and OSC (Orthogonal Signal Correction). The number of latent variables (LV) was selected using random-subsets cross-validation, choosing the LV number that presented the smallest cross-validation classification error. The best models were built with 8 LVs, which accounted for 99% and 51% of the X and Y variance, respectively.
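Of the two pre-processing steps mentioned above, MSC is simple enough to sketch in a few lines. The version below follows the textbook formulation (regress each unfolded spectrum on a reference spectrum, then remove the fitted offset and slope); it is an illustrative assumption and not necessarily identical to the PLS Toolbox routine used in this work. OSC is not shown. The function name and the toy data are hypothetical.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction of the rows of `spectra`.

    Each row x is regressed on the reference r (x ~ a + b * r) and then
    corrected as (x - a) / b. By default the reference is the mean row.
    """
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        slope, intercept = np.polyfit(reference, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected, reference

# Toy data: one underlying spectrum distorted by sample-specific offset and gain.
rng = np.random.default_rng(3)
base = rng.random(200)
offsets = np.array([0.0, 0.5, -0.2, 1.0])
gains = np.array([1.0, 1.5, 0.7, 2.0])
spectra = offsets[:, None] + gains[:, None] * base[None, :]

corrected, reference = msc(spectra)
print(np.allclose(corrected, reference))
```

When new samples are predicted, the reference computed from the training set would normally be passed in through the `reference` argument, so that training and test spectra are corrected in the same way.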
As can be seen in Figure 6, the models built using UPLS-DA performed better overall than the ones built with NPLS-DA. The MM model was able to classify all of its samples correctly in the training set, followed by the NM model (only one misclassification), CM (two misclassifications) and SM (five misclassifications). SM misclassifications were similar to the ones in NPLS-DA, with all five samples being classified as CM. This confirms that there is indeed a similarity between the fluorophore compositions of these two regions.
Table 2 shows the confusion matrix and the qualitative FoM estimated for the UPLS-DA models. All the FoM indicate an improvement in the classification capacity of the models. The precision values range from 0.808 (CM) to 1.000 (NM and MM) for the training set and from 0.500 (NM) to 1.000 (MM) for the test set. The recall values also improved with the unfolding process, ranging from 0.792 (SM) to 1.000 (MM) for the training set and from 0.143 (MM) to 0.909 (CM and SM) for the test set. Although the MM model was able to classify all of its samples correctly in the training set, it correctly classified only 1 out of 7 samples in the test set, which resulted in poor recall and F1 scores. The precision was not affected, because the model did not generate any false positive samples. The improvement in the NMC parameter is also noticeable for all the models. NMC ranged from 0% (MM) to 22.8% (SM) for the training set and from 16% (CM and SM) to 46.2% (MM) for the test set. If the MM model is not considered, NMC would range from 16 to 29%, which is not far from the performance obtained by other classification methods employing more sophisticated techniques, such as UPLC-MS or GC-MS (Taveira et al., 2014; Mehari et al., 2016).
Table 2 - Confusion matrix and qualitative figures of merit estimated for the developed UPLS2-DA model

Confusion matrix (rows: actual class; columns: predicted as)

Training      CM    MM    NM    SM
CM            21     0     0     2
MM             0    14     0     0
NM             0     0    13     1
SM             5     0     0    19

Test          CM    MM    NM    SM
CM            10     0     0     1
MM             0     1     5     1
NM             1     0     5     0
SM             1     0     0    10

Figures of merit

Set       Model  FPR    FNR    NMC    Precision  Recall  F1 Scores
Training  CM     0.088  0.080  0.168  0.808      0.913   0.857
Training  MM     0.000  0.000  0.000  1.000      1.000   1.000
Training  NM     0.000  0.067  0.067  1.000      0.929   0.963
Training  SM     0.056  0.172  0.228  0.864      0.792   0.826
Test      CM     0.077  0.083  0.160  0.833      0.909   0.870
Test      MM     0.000  0.462  0.462  1.000      0.143   0.250
Test      NM     0.147  0.143  0.290  0.500      0.833   0.625
Test      SM     0.077  0.083  0.160  0.833      0.909   0.870
Table 3 shows a comparison between the F1 scores from UPLS-DA and NPLS-DA. This comparison makes clear that the unfolding process improved the classification of coffees from different regions of Minas Gerais State, except for the MM model in the test set, which presented a significant reduction in the F1 score (62.5%).
Table 3 – Comparison between NPLS-DA and UPLS-DA f-scores

f-score    Training                      Test
           CM     MM     NM     SM       CM     MM     NM     SM
NPLS-DA    0.625  0.621  0.250  0.378    0.667  0.667  0.500  0.353
UPLS-DA    0.857  1.000  0.963  0.826    0.870  0.250  0.625  0.870
The refolded VIP scores for the UPLS-DA models can be seen in Figure 7. The VIP scores of the CM and MM models clearly resemble the second and third components of the PARAFAC model (quercetin and tocopherol, respectively), while both NM and SM presented a very distinct peak that can be related to the first component of PARAFAC, associated with caffeic acid. Caffeic acid did not contribute to the classifications when using the NPLS-DA models, so the improvement in performance obtained with the UPLS-DA models may be attributed to the presence of this component in the VIP scores.
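For readers unfamiliar with VIP (Variable Importance in Projection) scores, the sketch below shows a commonly used definition for a PLS model with a single response, followed by the refolding of the scores into an excitation x emission map. Whether the PLS Toolbox computes VIP in exactly this way is not stated here, so the formula should be taken as an assumption; the random arrays stand in for the weights, scores and y-loadings of a fitted model, and all names are hypothetical.

```python
import numpy as np

def vip_scores(weights, scores, y_loadings):
    """VIP of each variable for a PLS model with a single response.

    weights:    (n_variables, n_components) PLS weight vectors w_a
    scores:     (n_samples, n_components) score vectors t_a
    y_loadings: (n_components,) regression of y on each score vector, q_a
    """
    W = np.asarray(weights, dtype=float)
    T = np.asarray(scores, dtype=float)
    q = np.asarray(y_loadings, dtype=float)
    n_variables = W.shape[0]
    # Amount of y variation explained by each latent variable.
    explained = q**2 * np.sum(T**2, axis=0)
    # Squared, length-normalised weights of every variable in every component.
    w_norm_sq = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(n_variables * (w_norm_sq @ explained) / explained.sum())

# Placeholder model quantities (random numbers, for illustration only).
rng = np.random.default_rng(4)
n_ex, n_em, n_lv, n_train = 13, 126, 8, 75
W = rng.normal(size=(n_ex * n_em, n_lv))
T = rng.normal(size=(n_train, n_lv))
q = rng.normal(size=n_lv)

vip = vip_scores(W, T, q)
vip_map = vip.reshape(n_ex, n_em)   # refold to an excitation x emission map
print(vip_map.shape, np.mean(vip**2))
```

With this definition the mean of the squared VIP scores equals 1, which is why values above 1 are usually read as variables of above-average importance.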
6. Conclusions
The models developed from fluorescence spectroscopy data coupled with unfolded partial least squares discriminant analysis (UPLS-DA) performed well in the classification of coffees produced in different regions of Minas Gerais State. Model performance was similar to that of other works published in the literature in which more sophisticated and time-consuming analytical techniques were employed, with the exception of the MM model. The poor performance of this model might be due to the selection of extreme samples in this class by the Kennard-Stone algorithm. Nonetheless, the presented technique provides a good alternative for assuring the origin of coffee produced in the IP areas of Minas Gerais.
Acknowledgements
The authors acknowledge financial support from the following Brazilian Government Agencies: CNPq (Grant # 475746/2013-9; 306139/2013-8) and FAPEMIG (Grant # PPM-00619-15; BPD-00670-14).
References
413
CONAB. Companhia Nacional de Abastecimento. Acompanhamento da safra brasileira de
414
café – Terceiro levantamento Setembro/2015. (2015). http://www.conab.gov.br/OlalaCMS/
415
uploads/arquivos/15_09_29_09_01_35_boletim_cafe_setembro_2015.pdf Accessed 20.04.16.
l19
ACCEPTED MANUSCRIPT 416
ABIC. Café do Cerrado está mais valorizado. Associação Brasileira da Industria do Café.
417
(2016).
418
Acessed 20.04.16.
419
Cerrado Mineiro. Denominação de origem. (2016).
420
index.php?pg=denominacaodeorigem#group1/ Acessed 20.04.16.
421
Bahram, M., Bro, R., Stedmon, C., & Afkhami, A. (2006). Handling of Rayleigh and Raman
422
scatter for PARAFAC modeling of fluorescence data using interpolation. Journal of
423
Chemometrics, 20, 99–105.
424
Barbosa, J.N., Borém, F.M., Alves, H.M.R., Volpato, M.M.L., Vieira, T.G.C. & Souza,
425
V.C.O. (2010). Spatial distribution of coffees from Minas Gerais State and their relation with
426
quality, Coffee Science 5, 237-250.
427
Belay, A., Kim, H. K., & Hwang, Y.-H. (2016). Binding of caffeine with caffeic acid and
428
chlorogenic acid using fluorescence quenching, UV/vis and FTIR spectroscopic techniques.
429
Luminescence, 31, 565–572.
430
Bro, R. (1996). Multiway calibration. Multilinear PLS. Journal of Chemometrics, 10, 47–61.
431
Bro, R. (1997). PARAFAC. Tutorial and applications, Chemometrics and Intelligent
432
Laboratory Systems, 38, 149-171.
433
(http://www.sciencedirect.com/science/article/pii/S0169743997000324)
434
Carrera, F., Leon-Camacho, M., Pablos, F., & Gonzalez, A. G. (1998). Authentication of
435
green coffee varieties according to their sterolic profile. Analytica Chimica Acta, 370, 131–
436
139.
http://abic.com.br/publique/cgi/cgilua.exe/sys/start.htm?sid=59&infoid=3662/
http://www.cerradomineiro.org/
l20
ACCEPTED MANUSCRIPT 437
Christin, C., Hoefsloot, H. C. J., Smilde, a. K., Hoekman, B., Suits, F., Bischoff, R., &
438
Horvatovich, P. (2013). A critical assessment of feature selection methods for biomarker
439
discovery in clinical proteomics. Molecular & Cellular Proteomics, 12, 263–276.
440
Costa Freitas, A. M., & Mosca, A. I. (1999). Coffee geographic origin - An aid to coffee
441
differentiation. Food Research International, 32, 565–573.
442
Farah, A. & Donangelo, C. M. (2006). Phenolic compounds in coffee. Brazilian Journal of
443
Plant Physiology, 18, 23–36.
444
Guzmán, E., Baeten, V., Pierna, J. A. F. & García-Mesa, J. A. (2015). Evaluation of the
445
overall quality of olive oil using fluorescence spectroscopy. Food Chemistry, 173, 927–934.
446
Jham, G. N., Winkler, J. K., Berhow, M. A. & Vaughn, S. F. (2007). γ-Tocopherol as a
447
Marker of Brazilian Coffee (Coffea arabica L.) Adulteration by Corn. Journal of Agricultural
448
and Food Chemistry, 55, 5995–5999.
449
Link, J. V., Lemes, A. L. G., Marquetti, I., dos Santos Scholz, M. B., & Bona, E. (2014a).
450
Geographical and genotypic segmentation of arabica coffee using self-organizing maps. Food
451
Research International, 59, 1–7.
452
Link, J. V., Lemes, A. L. G., Marquetti, I., dos Santos Scholz, M. B., & Bona, E. (2014b).
453
Geographical and genotypic classification of arabica coffee using Fourier transform infrared
454
spectroscopy and radial-basis function networks. Chemometrics and Intelligent Laboratory
455
Systems, 135, 150–156.
456
Liu, H. C., You, C. F., Chen, C. Y., Liu, Y. C., & Chung, M. T. (2014). Geographic
457
determination of coffee beans using multi-element analysis and isotope ratios of boron and l21
ACCEPTED MANUSCRIPT 458
strontium. Food Chemistry, 142, 439–445.
459
MAPA. Ministério da Agricultura, Pecuária e Abastecimento. Indicação Geográfica – IG.
460
(2015).
461
Accessed 20.04.16.
462
Mehari, B., Redi-Abshiro, M., Chandravanshi, B. S., Combrinck, S., Atlabachew, M., &
463
McCrindle, R. (2016). Profiling of phenolic compounds using UPLC-MS for determining the
464
geographical origin of green coffee beans from Ethiopia. Journal of Food Composition and
465
Analysis, 45, 16–25.
466
Mullen, W., Nemzer, B., Stalmach, A., Ali, S. and Combet, E. (2013). Polyphenolic and
467
Hydroxycinnamate Contents of Whole Coffee Fruits from China, India, and Mexico. Journal
468
of Agricultural and Food Chemistry, 61, 5298−5309.
469
Muniz-Valencia, R., Jurado, J. M., Ceballos-Magana, S. G., Alcazar, A., & Hernandez-Diaz,
470
J. (2014). Characterization of Mexican coffee according to mineral contents by means of
471
multilayer perceptrons artificial neural networks. Journal of Food Composition and Analysis,
472
34, 7–11.
473
Nifli, A., Theodoropoulos, P., Munier, S., Castagnino, C., Roussakis, E., Katerinopoulos, H.
474
E., Castanas, E. (2007). Quercetin Exhibits a Specific Fluorescence in Cellular Milieu : A
475
Valuable Tool for the Study of Its Intracellular Distribution. Journal of Agricultural and Food
476
Chemistry, 55, 2873–2878.
477
Olivieri, A. & Escandar, G. (2014). Practical Three Way Calibration. Philadelphia : Elsevier
478
(p. 330)
http://www.agricultura.gov.br/desenvolvimento-sustentavel/indicacao-geografica/
l22
ACCEPTED MANUSCRIPT 479
Risticevic, S., Carasek, E., & Pawliszyn, J. (2008). Headspace solid-phase microextraction-
480
gas chromatographic-time-of-flight mass spectrometric methodology for geographical origin
481
verification of coffee. Analytica Chimica Acta, 617, 72–84.
482
Rodrigues, C. I., Maia, R., Miranda, M., Ribeirinho, M., Nogueira, J. M. F., & Máguas, C.
483
(2009). Stable isotope analysis for green coffee bean: A possible method for geographic
484
origin discrimination. Journal of Food Composition and Analysis, 22, 463–471.
485
Sádecká, J., & Tóthová, J. (2007). Fluorescence Spectroscopy and Chemometrics in the Food
486
Classification: a Review. Czech Journal of Food Science, 25, 159–173.
487
Speer, K. & Kölling-Speer, I. (2006). The lipid fraction of the coffee bean. Brazilian Journal
488
of Plant Physiology, 18, 201–216.
489
Tanajura da Silva, C.E., Filardi, V.L., Pepe, I. M., Chaves, M. A. & Santos, C. M. S. (2015).
490
Classification of food vegetable oils by fluorimetry and artificial neural networks. Food
491
Control, 47, 86–91.
492
Taveira, J. H. D. S., Borém, F. M., Figueiredo, L. P., Reis, N., Franca, A. S., Harding, S. A.,
493
& Tsai, C. J. (2014). Potential markers of coffee genotypes grown in different Brazilian
494
regions: A metabolomics approach. Food Research International, 61, 75–82.
495
Toledo, P.R.A.B., Melo, M.M.R., Pezza, H.R., Toci, A.T., Pezza, L., & Silva, C.M. (2016).
496
Discriminant analysis for unveiling the origin of roasted coffee samples: A tool for quality
497
control of coffee related products. Food Control, In press, corrected proof.
498
Weckerle, B., Richling, E., Heinrich, S., & Schreier, P. (2002). Origin assessment of green
499
coffee (Coffea arabica) by multi-element stable isotope analysis of caffeine. Analytical and l23
ACCEPTED MANUSCRIPT 500
Bioanalytical Chemistry, 374, 886–890.
501
Wilkinson, J., Cerdan, C., & Dorigon, C. (2015). Geographical Indications and “Origin”
502
Products in Brazil - The Interplay of Institutions and Networks. World Development,
503
http://dx.doi.org/10.1016/j.worlddev.2015.05.003.
504
Wise, B. W., Gallagher, N. B., Bro, R., Shaver, J. M., Windig, W., & Koch, R. S. (2006).
505
PLS-Toolbox 4.0 for use with MatlabTM (Manual). Manson: EigenVector Research Inc.
506
Woodward, G. M. (2008). The potential effect of excessive coffee consumption on nicotine
507
metabolism: CYP2A6 inhibition by caffeic acid and quercetin. Bioscience Horizons, 1, 98–
508
103.
l24
ACCEPTED MANUSCRIPT 509
Figure Captions:
Figure 1 – Identification of the major coffee-producing regions on the map of Minas Gerais State.

Figure 2 – Mean contour map of the EEM matrix from coffee extracts (a) before and (b) after scattering removal.

Figure 3 – First (a), second (b) and third (c) PARAFAC components (blue lines – excitation; red lines – emission).

Figure 4 – Classification plots for the NPLS-DA models developed (classes: CM, MM, NM, SM).

Figure 5 – First (a) and second (b) components from the NPLS-DA model.

Figure 6 – Classification plots for the UPLS-DA models developed (classes: CM, MM, NM, SM).

Figure 7 – VIP scores from the UPLS-DA models: (a) CM, (b) MM, (c) NM, (d) SM.