Journal of Chromatography A, 1141 (2007) 106–116
Unsupervised parameter optimization for automated retention time alignment of severely shifted gas chromatographic data using the piecewise alignment algorithm
Karisa M. Pierce a, Bob W. Wright b, Robert E. Synovec a,∗
a Department of Chemistry, Box 351700, University of Washington, Seattle, WA 98195, USA
b Pacific Northwest National Laboratory, Battelle Boulevard, P.O. Box 999, Richland, WA, 99352, USA
Received 8 August 2006; received in revised form 28 November 2006; accepted 29 November 2006
Available online 18 December 2006
Abstract
Simulated chromatographic separations were used to study the performance of piecewise retention time alignment and to demonstrate automated unsupervised (without a training set) parameter optimization. The average correlation coefficient between the target chromatogram and all remaining chromatograms in the data set was used to optimize the alignment parameters. This approach frees the user from providing class information and makes the alignment algorithm applicable to classifying completely unknown data sets. The average peak in the raw simulated data set was shifted up to two peak-widths-at-base (average relative shift = 2.0) and after alignment the average relative shift was improved to 0.3. Piecewise alignment was applied to severely shifted GC separations of gasolines and reformate distillation fraction samples. The average relative shifts in the raw gasolines and reformates data were 4.7 and 1.5, respectively, but after alignment improved to 0.5 and 0.4, respectively. The effect of piecewise alignment on peak heights and peak areas is also reported. The average relative difference in peak height was −0.20%. The average absolute relative difference in area was 0.15%.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Unsupervised retention time alignment; Gas chromatography; Correlation coefficient; Piecewise alignment
1. Introduction
Many high-throughput and long-term studies produce large volumes of data that must be efficiently compressed while retaining the desired chemical information. Chemometric tools such as principal component analysis (PCA) can objectively reduce a large data set while retaining the essential information [1]. In this report, we aim to classify complex chromatographic fingerprints using PCA for simulated separations and “real” sample separations using gas chromatography (GC). However, while gathering GC data, uncontrollable fluctuations in temperature, pressure, flow rate, stationary phase degradation, and sample matrix effects cause analytes to elute at different retention times between runs. This retention time shifting impedes chemometric techniques [2–4]. Many techniques have been developed that improve retention time precision, and herein we study the effect
∗ Corresponding author. Tel.: +1 2066852328; fax: +1 2066858665. E-mail address: [email protected] (R.E. Synovec).
0021-9673/$ – see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.chroma.2006.11.101
of the piecewise retention time alignment algorithm applied to simulated chromatographic data and real GC data. Many of the objective retention time alignment algorithms that have been developed can be loosely classified into three main categories: linear correction, peak matching and local area alignment (including dynamic programming algorithms) [4–16]. Piecewise alignment falls into the objective local area alignment category. There is also a category of subjective alignment algorithms that are not objective because the user has to manually set certain peaks to be marker peaks to develop a shift correction function for each chromatogram [17]. The piecewise alignment algorithm was designed to be a robust alignment tool that can quickly and accurately correct retention time shifting, even when peaks are shifted significantly past neighboring peaks between runs. The piecewise alignment algorithm was previously reported to be objective, fast, and beneficial for PCA [18]. In that study, piecewise alignment was applied to GC separations of gasolines that were gathered over a five-day period. However, the shifting in that data was always less than the chromatographic peak width.
K.M. Pierce et al. / J. Chromatogr. A 1141 (2007) 106–116
This current report will evaluate how robust piecewise alignment is by using severely shifted simulated data and real data that contain peaks shifted past neighbor peaks between runs, with peak elution order preserved. When chromatograms contain peaks that have run-to-run shifting past neighboring peaks, the required complexity of the alignment algorithm increases. GC instrument software commonly offers peak-matching algorithms where each peak is forced to match the nearest peak in a target chromatogram [19]. When shifting is so severe that peaks are shifted past neighbor peaks, the traditional peak-matching algorithms will mismatch peaks. A more complex local area alignment algorithm can use the selective profile of numerous peaks in local regions to ascertain the best correction for that local region, even when the peaks in that local region were shifted past neighboring peaks in another chromatogram. The local area alignment algorithms also use the selective profile of numerous peaks in the local region to correctly align chromatographic regions that are chemically very different. This way, local area alignment algorithms succeed where peak-matching algorithms might fail. In other words, neighboring peaks help the local area alignment algorithm correct shifting for a peak that is present in one sample, but absent (or present at a relatively low concentration) in another type of sample. Herein, we also describe an unsupervised method for optimizing the alignment parameters, where unsupervised means no class information is required. The optimization method evaluates, and thus utilizes, the correlation of each entire chromatogram in the unknown data set to a target chromatogram as a function of the alignment parameters. We previously reported that alignment performance correlates with the value of the degree-of-class-separation (DCS) between training set clusters on a PCA scores plot [18].
In order to validate the results we obtain from the unsupervised correlation optimization method, we also analyze DCS for the data sets undergoing alignment to see if the optimal parameters chosen by the DCS method match the optimal parameters chosen by the correlation method. The DCS optimization method is a supervised method that requires a known training set. This report first demonstrates how piecewise alignment is optimally applied to a severely shifted simulated chromatographic data set that has an average relative shift equal to 2.0, where the relative shift of a peak is equal to four times the standard deviation of the peak’s retention time in all of the chromatograms, divided by the peak-width-at-base. Thus, an average relative shift equal to 2.0 means the average peak in the data set is shifted past one neighboring peak and overlaps a second neighboring peak between runs. More simulated chromatographic data sets were created with an average relative shift equal to 0.5, 0.8, 1.0, 1.5, or 3.0. All of the simulated data sets were submitted to alignment to study the alignment performance as a function of the retention time precision of the raw data. We also report the effect of piecewise alignment on severely shifted chromatographic separations of gasolines that were gathered using three different GC programs. Using three different GC programs caused the raw data set to have an average relative shift equal to 4.7. Finally, the benefit of applying piecewise alignment to
severely shifted GC separations of reformates from the crude oil distillation fraction is also presented.
2. Theory
The following subsections describe the algorithms used in this report.
2.1. Piecewise alignment algorithm
Piecewise alignment separates the sample chromatograms into windows and finds the optimal correction for each window by maximizing the correlation between the sample window and an arbitrarily chosen target [18]. Piecewise alignment has two parameters: the window size (W) is the length of the local region that the chromatograms are pieced into, and the maximum shift limit (L) is the maximum distance those windows are allowed to move. Corrections yielding the maximum correlation coefficient are assigned to the midpoint of each window and the corrections applied to all data points in between midpoints are linearly interpolated. A flow chart of the piecewise alignment algorithm is in Fig. 1.
Fig. 1. Flow chart of piecewise alignment algorithm.
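The windowing, shift search and interpolation steps described in Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function names (`pearson`, `read`, `piecewise_align`), the zero padding outside the trace, the tie-breaking in favour of the smaller shift, and holding the correction constant before the first and after the last window midpoint are all choices made here for the example. W and L are in data points, and the toy traces are hypothetical.

```python
from math import exp, floor
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def read(signal, pos):
    """Linearly interpolated value at (fractional) index pos; 0 outside the trace."""
    lo = floor(pos)
    frac = pos - lo
    a = signal[lo] if 0 <= lo < len(signal) else 0.0
    b = signal[lo + 1] if 0 <= lo + 1 < len(signal) else 0.0
    return a * (1.0 - frac) + b * frac


def piecewise_align(sample, target, W, L):
    """Align sample to target using window size W and shift limit L (data points)."""
    n = len(sample)
    if W <= 0 or L <= 0:
        return list(sample)  # assumption: W = 0 or L = 0 means "leave raw data as is"
    mids, lags = [], []
    for start in range(0, n, W):
        stop = min(start + W, n)
        t_win = target[start:stop]
        # Try every shift within +/-L and keep the one with the highest
        # correlation to the target window (ties: prefer the smaller shift).
        best = max(
            range(-L, L + 1),
            key=lambda lag: (
                pearson([read(sample, i + lag) for i in range(start, stop)], t_win),
                -abs(lag),
            ),
        )
        mids.append((start + stop - 1) / 2.0)
        lags.append(best)
    aligned = []
    for i in range(n):
        # Correction at point i: linear interpolation between window midpoints.
        if i <= mids[0]:
            lag = lags[0]
        elif i >= mids[-1]:
            lag = lags[-1]
        else:
            k = max(j for j in range(len(mids)) if mids[j] <= i)
            t = (i - mids[k]) / (mids[k + 1] - mids[k])
            lag = lags[k] + t * (lags[k + 1] - lags[k])
        aligned.append(read(sample, i + lag))
    return aligned


# Toy data: the sample is the target delayed by 6 data points.
centres = [60, 150, 260, 330]
target = [sum(exp(-((i - c) ** 2) / 32.0) for c in centres) for i in range(400)]
sample = [sum(exp(-((i - c - 6) ** 2) / 32.0) for c in centres) for i in range(400)]
aligned = piecewise_align(sample, target, W=100, L=10)
print(round(pearson(sample, target), 2), round(pearson(aligned, target), 2))
```

In this toy case every window finds the same correction, so the interpolation step is trivial; with window-dependent shifts the interpolated correction varies smoothly between midpoints.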
2.2. Correlation coefficient for optimization of alignment parameters
As the alignment improves between two chromatograms, the correlation coefficient between those two chromatograms increases. For a data set containing many chromatograms, one was arbitrarily designated the target. Then the average correlation coefficient between each entire sample chromatogram and the target chromatogram was recorded as a function of alignment parameter values. The parameters that yield the maximum average correlation coefficient with the smallest standard deviation are the optimal parameters. Thus, the average correlation coefficient can be used to optimize the alignment parameters in an unsupervised manner. The correlation coefficient algorithm was from the MATLAB toolbox (The Mathworks Inc.) and the unsupervised correlation optimization method is readily automated in MATLAB.
2.3. Degree-of-class-separation for validation of correlation coefficient method
It was previously reported that the DCS metric can be used to objectively optimize the alignment parameters for improved PCA clustering [18]. DCS is equal to the distance between centroids of two class clusters on a scores plot, divided by the sum of the within-class variation of each class cluster. Thus, DCS indicates how tightly scores are clustered and how far apart those clusters are from each other. In very general terms, since a mean value plus or minus two standard deviations captures over 95% of the distribution, a DCS value greater than approximately four implies that two clusters are not overlapping and classification is successful. An increase in DCS corresponds with improved alignment. In practice, analysis of DCS as a function of alignment parameters is performed using a training set (a subset of the data set of known classification) and the maximum DCS indicates the optimal parameters. The optimal parameters are then used to align the rest of the data set (unknown test set). Thus, the DCS method is supervised, requiring the user to know the classification of the training set chromatograms. The correlation method is not supervised: the user does not have to know the classification of the chromatograms in the raw data set. Degree-of-class-separation represents the Euclidean distance between scores, which is the scalar “distance” between entire chromatographic profiles. The Fisher criterion represents separations between individual signal values from each individual retention time data point in a set of chromatograms from different classes. Thus, the Fisher criterion is a point-by-point comparison of retention time data points so if there are 10,000 data points in the chromatograms, the Fisher criterion outputs a vector with 10,000 elements. DCS is a similarity metric of the entire sample profile (a single numerical value) representing how similar chromatograms are to each other. These criteria are related, but the analyst would want a single numerical value to indicate how similar chromatograms are after alignment with a variety of parameter values. The single numerical value of DCS provides this. One does not want a vector (that could be thousands of elements long) of Fisher criterion values produced after alignment with every single set of parameter values.
3. Experimental
As listed in Table 1, six types of data sets (I, II, III, IV, V, and VI) consisting of 200 chromatograms were simulated (each type of data set was simulated four times) in MATLAB to create 10 min chromatograms composed of 600 one-second-wide peaks. During simulation, an initial chromatogram was produced and the remaining 199 chromatograms of the data set were derived from that initial chromatogram using operations of random number generators to introduce retention time shifting and between-class variations in peak heights. There are 50 replicates of four samples (samples A, B, C, and D) in each data set. The initial sample A chromatogram was simulated to contain 600 component peaks, where each peak was modeled by the normal probability density function in MATLAB. Then this initial sample A chromatogram was used for generating the initial chromatograms for samples B, C, and D, by randomly scaling the peak heights (concentrations) of those 600 component peaks to vary across the samples. Then the 49 shifted replicates of each of the initial A, B, C, and D chromatograms were generated by using linear interpolation of the existing signals to shift them to new locations. The six types of simulated data sets contained declining retention time precision produced by incrementally increasing the magnitude of local area stretching and compression as well as incrementally increasing the magnitude of the random shift applied to individual peaks. Table 1 describes the shifting built into the six types of data sets. Data set I contained the most subtle retention time shifting. The average relative shift for data set I was 0.5, where the relative shift is defined as four times the standard deviation of a peak’s retention time in all 200 chromatograms divided by the peak-width-at-base. Thus, an average relative shift of 0.5 means the peak location was distributed across half of the peak-width-at-base in all 200 chromatograms.
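The relative shift defined above is a one-line calculation once the retention times of a peak have been tabulated across runs. The sketch below is illustrative only: the function names and the retention times are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def relative_shift(retention_times, peak_width_at_base):
    """Four standard deviations of a peak's retention time over its width at base."""
    return 4.0 * stdev(retention_times) / peak_width_at_base


def average_relative_shift(times_per_peak, peak_width_at_base):
    """Mean relative shift over all tracked peaks."""
    return mean(relative_shift(t, peak_width_at_base) for t in times_per_peak)


# Hypothetical retention times (s) of two peaks across five runs;
# both peaks are taken to be 1 s wide at base.
peak_times = [
    [120.0, 120.1, 119.9, 120.2, 119.8],  # well-controlled peak
    [300.0, 300.5, 299.5, 301.0, 299.0],  # peak that wanders past its neighbors
]
for times in peak_times:
    print(round(relative_shift(times, 1.0), 2))
print(round(average_relative_shift(peak_times, 1.0), 2))
```

A value below 1 means the peak stays within its own width from run to run; values above 1 indicate shifting past neighboring peaks.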
The average relative shift for data set II was 0.8 so shifting was less than the peak width for data set II. The average relative shift for data set III was 1.0 so the shifting was greater than the chromatographic peak width for the later eluting peaks. More local area stretching and shrinking was also built into data sets II and III to simulate pressure and temperature ramps occurring during the chromatographic run. The average relative shifts for data sets IV, V, and VI were 1.5, 2.0, and 3.0, respectively.

Table 1
Description of the shifting built into the simulated chromatographic data sets and total computation time for piecewise alignment to align all 200 chromatograms in the given data set

Data set    Average relative shift    Piecewise time (min)
I           0.5                       1.0
II          0.8                       1.3
III         1.0                       1.7
IV          1.5                       5
V           2.0                       11
VI          3.0                       10

For data sets IV, V, and VI, peaks were shifted progressively
further past neighbor peaks. For validating the relative shifts that we intended to create using the data simulator, we used a peak-finding algorithm in MATLAB that defined peaks as zero-crossings in the derivative of the chromatogram that also had a signal greater than the noise threshold. This yielded a list of peak locations in all the chromatograms and these lists were exported to Excel to manually determine the retention time precision and relative shifts. When the DCS parameter optimization method was used for validating the correlation optimization method, the training set members were half of the A replicates and half of the B replicates. The alignment target was always the first chromatogram in the data set for the simulated data. For the simulated data, a wide range of W and L values were searched for the optimal alignment parameter values. The W values that were tested were, in units of data points, W = 0, 60, 80, 100, 180, 200, 240, 250, 260, 300, 400, 500, 600, 700, 800, 900, 1000, 1250, 1500, 1750, 2000. At a data acquisition rate of 20 data points per second, this translates to possible parameter values of 0 min ≤ W ≤ 1.67 min. The L values that were tested were, in units of data points, L = 0, 20, 40, 50, 60, 80, 100, 150, 200. At a data acquisition rate of 20 data points per second, this translates to possible parameter values of 0 min ≤ L ≤ 0.17 min.
One “real” data set consisted of GC separations of four gasolines (gasoline types A, M, S, and T, collected at the pump from four different local gasoline stations in and around Seattle, WA, USA, 60 replicate separations each) collected using three different GC programs to induce severe retention time shifting. The chromatograms were acquired using an HP 6890A gas chromatograph. The column was 10 m × 0.1 mm, 0.4 μm DB5 stationary phase. The injection volume was 1 μl injected into a 250 °C inlet, with a 300:1 split ratio, with the electronic flow control on.
The first flow program was constant at 0.8 ml/min with a temperature ramp of 30 °C for 2 min then increased at 25 °C/min to 200 °C. The second flow program was 0.8 ml/min for 2 min then decreased to 0.4 ml/min over 7 min with the same temperature ramp as the first program. The third flow program was 0.7 ml/min for 2 min then decreased to 0.3 ml/min over 6 min. The temperature ramp for program three was 30 °C for 2 min then increased at 25 °C/min to 80 °C followed by an increase of 35 °C/min to 250 °C. Each chromatogram in the GC data set was individually smoothed (using a running mean smoother), baseline corrected by subtracting from the chromatogram a line fit through noise regions, and normalized to unit total area. The alignment target was the mean of the gasolines data set, and piecewise alignment using the mean is shown to improve retention time precision. However, in the general case a mean may not be the best choice and instead an arbitrarily chosen chromatogram should be used. The parameters were optimized using the unsupervised correlation optimization method. These parameter values were then validated using the DCS optimization method using a training set consisting of 30 type A and 30 type M chromatograms. The chromatograms were noisy and required analysis of variance (ANOVA) feature selection after alignment for successful PCA classification [3,18]. The ANOVA feature selection parameter (threshold) was optimized using the DCS method for a wide range of threshold values
(0 ≤ threshold ≤ 6000). As the ANOVA threshold increases, fewer features are retained. ANOVA feature selection calculates the Fisher ratio as a function of retention time, trained on a data set where the class identifications are known. Therefore, it only makes sense to apply ANOVA after alignment and not before. ANOVA applied to shifted data will always yield incorrect and meaningless feature selection results. For the gasolines data, the parameter values that were tested by the correlation coefficient optimization method ranged between 0 min and 0.75 min. The optimal W and L values were 0.25 min and 0.58 min, respectively. In another study of real GC separations, chromatograms of reformate distillation fraction samples were gathered from historical databases into a single data set. The data set was obtained from ChevronTexaco (Richmond, CA, USA). The data set consists of 27 total chromatograms and the fourth chromatogram was the alignment target. For the reformate data, the W values that were tested ranged between 0 points and 5000 points. The optimal W and L values were 2500 points and 200 points, respectively.
4. Results and discussion
4.1. Raw data set V with severe retention time shifting
Simulated data set V was chosen to illustrate the unsupervised parameter optimization method. A typical chromatogram from data set V is shown in Fig. 2A. The average relative shift in data set V was 2.0, meaning the average peak was shifted past two neighbors between chromatograms. In Fig. 2B, overlaid subregions from two raw chromatograms of the same sample type show that peaks are shifted well past neighbor peaks in data set V. Data set V was submitted to PCA but the scores plot in Fig. 2C failed to capture the class variations. Though the scores plot has a “horseshoe” structure, there is no apparent grouping into the four sample types.
4.1.1. Unsupervised piecewise alignment of simulated data
Since PCA of raw data set V failed to classify the four samples, this indicates data set V can benefit from piecewise retention time alignment, as suggested in Fig. 2B. As retention time precision improves, the correlation between a target chromatogram and a sample chromatogram increases. Thus, the average correlation coefficient between a target and all chromatograms in a data set is a metric for alignment algorithm performance. The actual correlation coefficients between the first chromatogram and the remaining 199 chromatograms are plotted in Fig. 3A–D for W = 0 min and L = 0 min (raw data), W = 0.15 min and L = 0.13 min, W = 0.42 min and L = 0.04 min, and W = 0.42 min and L = 0.13 min, respectively. The average correlation coefficient for raw data set V was 0.46 ± 0.25, but after optimal piecewise alignment (W = 0.42 min, L = 0.13 min) the average correlation coefficient increased to 0.93 ± 0.02. The average correlation coefficient of the 200 chromatograms in data set V is plotted as a function of W for 0 min ≤ W ≤ 1.67 min and L = 0.02 min, 0.04 min, and 0.13 min in Fig. 3E. The maximum average correlation coefficient (with the minimum standard
Fig. 2. Description of simulated data. (A) Typical simulated chromatogram of sample type A from data set V. (B) Overlaid subregion from two raw chromatograms. (C) PCA scores plot of raw data set V.
deviation) occurred for W = 0.42 min and L = 0.13 min. The optimal value of L was determined by the correlation coefficient method for 0 min ≤ L ≤ 0.17 min. The result was that values of L greater than 0.08 min yielded average correlation coefficients as a function of W that were superimposable on the L = 0.13 min plot in Fig. 3E because the worst shifting in the data set was less than 0.08 min while values of L less than 0.08 min yielded lower correlation coefficients because there was not enough slack in the alignment parameters to correct the shifting (not shown for clarity). Data set V was subdivided into a training set and a test set and the training set was submitted to the DCS parameter optimization method to validate the parameters chosen by the unsupervised correlation method. The training set was submitted to piecewise alignment using a variety of W values (0 min ≤ W ≤ 0.83 min). The DCS between the training set members was calculated for a range of W values and is plotted in Fig. 3F. The optimal W (W = 0.42 min) corresponds with the maximum DCS (DCS = 3.7). This is analogous to maximizing the average correlation coefficient as a function of W in Fig. 3E, except that the DCS method requires a training set of known classification while the correlation coefficient method does not. To clarify, the unsupervised correlation coefficient method yielded the same
optimal alignment parameter values as the supervised degree-of-class-separation optimization method. Thus, the unsupervised alignment parameter optimization method was in agreement with the previously reported supervised parameter optimization method [18]. The retention time precision of raw data set V was such that peaks had relative shifts ranging between 0.3 and 6.0. The average peak in data set V had a relative shift equal to 2.0. The relative shift for a single peak is illustrated for the two chromatograms shown in Fig. 2B. Fig. 4A shows the relative shift before and after piecewise alignment for peaks in all 200 chromatograms of data set V. After piecewise alignment (W = 0.42 min, L = 0.13 min), the relative shifts of the peaks were less than 0.5. Retention time precision improved for all peaks along the entire length of the retention time axis. The same chromatograms from Fig. 2B are overlaid in Fig. 4B after piecewise alignment. The overlaid subregions from two different sample types in Fig. 4C show that the chemical variations between samples were preserved after piecewise alignment. By comparing the precision of individual peaks before and after alignment or by looking at the change in the average correlation coefficient for all 200 chromatograms due to alignment, it is apparent the piecewise alignment algorithm corrected the retention time shifting in data set V.
Fig. 3. Alignment parameter optimization using unsupervised correlation optimization method. (A) Correlation coefficient between the first chromatogram and the remaining 199 chromatograms in data set V before alignment. Correlation coefficient between the first chromatogram and the remaining 199 chromatograms in data set V after alignment using (B) W = 0.15 min, L = 0.13 min, (C) W = 0.42 min, L = 0.04 min, and (D) W = 0.42 min, L = 0.13 min. (E) Average correlation coefficient as a function of W and L for data set V. The W and L conditions from (A–D) are indicated by circles. (F) DCS as a function of W using training set of data set V (L = 0.13 min).
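The parameter scan behind a plot such as Fig. 3E amounts to a grid search over W and L that needs no class labels. The following Python sketch shows one way to organise it, under several assumptions: `scan_parameters`, `rigid_align` and `trace` are hypothetical names, the ranking rule (highest mean, then smallest standard deviation) is one possible reading of "maximum average correlation coefficient with the smallest standard deviation", and `rigid_align` is only a stand-in aligner that applies a single rigid shift and ignores W. Any aligner with the same call signature could be passed in its place.

```python
from itertools import product
from math import exp
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def scan_parameters(chromatograms, target, align, W_values, L_values):
    """Mean and standard deviation of the correlation to the target for each (W, L)."""
    scores = {}
    for W, L in product(W_values, L_values):
        r = [pearson(align(c, target, W, L), target) for c in chromatograms]
        scores[(W, L)] = (mean(r), pstdev(r))
    # Assumed ranking: highest mean first, then smallest standard deviation.
    best = max(scores, key=lambda k: (scores[k][0], -scores[k][1]))
    return best, scores


def rigid_align(sample, target, W, L):
    """Stand-in aligner: one rigid shift within +/-L for the whole trace (W unused)."""
    n = len(sample)

    def moved(lag):
        return [sample[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]

    return max((moved(lag) for lag in range(-L, L + 1)),
               key=lambda s: pearson(s, target))


def trace(offset, n=200):
    """Hypothetical two-peak chromatogram delayed by `offset` data points."""
    return [exp(-((i - 50 - offset) ** 2) / 18.0)
            + 0.5 * exp(-((i - 120 - offset) ** 2) / 18.0) for i in range(n)]


target = trace(0)
data = [trace(3), trace(-4), trace(1)]
best, scores = scan_parameters(data, target, rigid_align,
                               W_values=[50], L_values=[0, 2, 5])
print(best, scores[best])
```

In this toy scan the largest shift limit wins because it is the only one wide enough to cover the worst shift, which mirrors the observation in the text that too small an L leaves insufficient slack to correct the shifting.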
Data set V was then submitted to chemometric analysis to provide additional validation of the alignment algorithm performance. PCA did benefit from unsupervised correlation-based parameter optimization and piecewise alignment of the raw data. After optimal alignment, the scores for data set V are tightly and accurately clustered according to sample type in Fig. 5 (W = 0.42 min, L = 0.13 min).
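Cluster tightness of this kind is what the DCS metric of Section 2.3 puts a number on. The sketch below computes DCS for two classes of two-dimensional scores. It is an illustration under an explicit assumption: the text defines DCS as the centroid distance divided by the sum of the within-class variations, but does not spell out how within-class variation is measured, so here it is taken to be the root-mean-square distance of the scores from their class centroid. The scores themselves are made up.

```python
from math import dist, sqrt


def centroid(points):
    """Mean position of a cluster of score vectors."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def within_class_variation(points):
    """Assumed definition: RMS distance of the scores from their class centroid."""
    c = centroid(points)
    return sqrt(sum(dist(p, c) ** 2 for p in points) / len(points))


def dcs(class_a, class_b):
    """Centroid distance divided by the summed within-class variations."""
    return dist(centroid(class_a), centroid(class_b)) / (
        within_class_variation(class_a) + within_class_variation(class_b)
    )


# Hypothetical PC 1 / PC 2 scores for two sample classes.
scores_a = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
scores_b = [(10.0, 0.0), (10.0, 2.0), (12.0, 0.0), (12.0, 2.0)]
print(round(dcs(scores_a, scores_b), 2))
```

Evaluating `dcs` on training-set scores after alignment with each candidate W, and keeping the W that maximises it, reproduces the supervised scan described in the text.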
The average peak in data set V was initially shifted as far as two neighbors away and the average relative shift for data set V was 2.0. Other data sets were simulated with different ranges of shifting. For example, data set I had an average relative shift of 0.5 and data set VI had an average relative shift of 3.0, as shown in Table 1. All six types of data sets were simulated four times. The four replicates were meant to statistically account for
Fig. 4. Piecewise alignment improves retention time precision. (A) Retention time precision of data set V in terms of relative shift before (+) and after (•) piecewise alignment. (B) Overlaid subregion of same two chromatograms from Fig. 2B after piecewise alignment. (C) Overlaid chromatographic subregions of two different sample types before and after piecewise alignment.
variations introduced by the random number generators in the simulation algorithm (so four data sets of 200 chromatograms with the simulation parameters set to yield an average relative shift of 0.5 were created for the data set I replicates). Then
all four replicates of each type of data set were submitted to correlation-based parameter optimization and piecewise alignment. The average correlation coefficient for each type of data set before and after alignment is shown in Fig. 6. Each data
Fig. 5. Piecewise alignment enhances chemometric analysis. PCA scores clustering of the four sample classes after optimal piecewise alignment (W = 0.42 min, L = 0.13 min).
Fig. 6. Average correlation coefficient between first chromatogram and remaining 199 chromatograms for all simulated chromatographic data sets. The retention time precision built into each simulated data set is described in Table 1. Each data set was simulated four times. Each data point plotted in Fig. 6 is the average of the “average correlation coefficient” from all four replicates.
Fig. 7. Description of severely shifted GC separations of gasolines. (A) Typical separation of Type A gasoline. (B) Retention time precision of gasolines data set in terms of relative shift before (+) and after (•) piecewise alignment.
point in Fig. 6 is the average value of the four replicate data sets. In every case piecewise alignment increased the average correlation coefficient of the data set. The average value and propagated error (±1 standard deviation) before alignment for each data set and its replicates was: data set I = 0.94 ± 0.05, data set II = 0.91 ± 0.09, data set III = 0.80 ± 0.21, data
set IV = 0.48 ± 0.54, data set V = 0.43 ± 0.72, data set VI = 0.39 ± 0.44. The average value and propagated error after alignment was: data set I = 0.96 ± 0.08, data set II = 0.94 ± 0.06, data set III = 0.94 ± 0.06, data set IV = 0.95 ± 0.05, data set V = 0.91 ± 0.09, data set VI = 0.91 ± 0.08.
Fig. 8. Unsupervised piecewise alignment applied to gasolines data set. (A) PCA scores plot of raw gasolines data set. PC 1 captured 49.0% of the variance and PC 2 captured 19.6% of the variance. (B) Alignment parameter optimization using correlation coefficient as a function of W for all 240 gasoline separations, L = 0.58 min. (C) PCA of the gasolines data set using optimal alignment and feature selection parameters (W = 0.25 min, L = 0.58 min, threshold = 2000). PC 1 captured 86.0% of the variance and PC 2 captured 7.5% of the variance.
Piecewise alignment computation times are reported in Table 1 for a computer with an AMD Athlon processor, 2.2 GHz, and 1 GB RAM.
4.1.2. Unsupervised piecewise alignment of gasolines data
Next, the application of piecewise alignment to a “real” data set consisting of 240 GC separations of gasolines was studied. There were four types of gasoline (A, M, S, and T, 60 replicates each) run under three different GC programs over five days to induce significant shifting between runs. The raw data contained peaks shifted past neighbor peaks and this raw data was submitted to piecewise alignment. A chromatogram of type A gasoline is shown in Fig. 7A. The retention time precision of 84 peaks present in all four types of gasoline is shown in Fig. 7B before and after optimal piecewise alignment. Two minutes into the chromatographic run time, the relative shift of the raw data is greater than one and peaks are shifted past neighbor peaks (average relative shift = 4.7), but piecewise alignment corrected this severe shifting and reduced the relative shifts (average relative shift = 0.5). The 240 raw gasoline chromatograms were submitted to PCA and the scores plot is shown in Fig. 8A. PCA captured the retention time variations rather than chemical class variations of the
sample types. The replicates were clustered into three groups corresponding to the three GC programs. The average correlation coefficient as a function of W for piecewise alignment of the gasolines data is shown in Fig. 8B. The average correlation coefficient and standard deviation for the 240 gasoline chromatograms was 0.63 ± 0.21 before alignment. The average correlation coefficient and standard deviation for the gasoline chromatograms after piecewise alignment was 0.89 ± 0.06 for W = 0.25 min and L = 0.58 min. The gasoline chromatograms were aligned with the optimal parameters (as per Fig. 8B) and submitted to analysis of variance (ANOVA) feature selection using a training set to discard extraneous chromatographic regions. After optimal alignment and feature selection (W = 0.25 min, L = 0.58 min, and threshold = 2000), PCA produced the scores plot shown in Fig. 8C with 87.7% variance captured by PC 1 and 6.8% variance captured by PC 2. Clustering of the four gasolines in Fig. 8C was obtained.
4.1.3. Unsupervised piecewise alignment of reformate data
Finally, we applied the automated correlation-based parameter selection to reformate distillation fractions. Gas chromatographic separations of the reformate distillation fraction
Fig. 9. Unsupervised piecewise alignment applied to severely shifted GC separations of reformate distillation fractions. (A) Typical separation. (B) Overlaid subregion from four chromatograms before alignment (raw). (C) Same overlaid subregion from the same four raw chromatograms after piecewise alignment. (D) Correlation coefficient between target and sample chromatograms before and after alignment, where the target was the fourth chromatogram in the data set.
from historical databases and current analyses were combined into a single data set. A typical separation is shown in Fig. 9A. The raw data set contained a substantial amount of retention time shifting where peaks were shifted past neighbor peaks between runs. The average relative shift for seven peaks in the 27 raw chromatograms was 1.5. The 27 chromatograms were then submitted to piecewise alignment (W = 2500 data points, L = 200 data points) and the average relative shift of the seven peaks for all 27 chromatograms decreased to 0.4. For example, an overlaid subregion of four raw reformate chromatograms is shown in Fig. 9B. Peaks are shifted past neighbor peaks between runs. The same overlaid subregion from the same four chromatograms after piecewise alignment is shown in Fig. 9C. Comparing Fig. 9B with Fig. 9C shows that piecewise alignment was able to correct significant retention time shifting for samples that were so chemically different that two peaks present in one sample type were absent in the rest of the samples. Piecewise alignment was robust against these significant chemical differences between the samples. The correlation coefficient between the target chromatogram and each individual sample chromatogram is shown in Fig. 9D. There are distinct groups or types of reformates according to the correlation coefficient analysis. Within the groups, the precision of the correlation coefficient improved after alignment, implying piecewise alignment improved retention time precision for all 27 chromatograms. The reformate results establish that the correlation coefficient increases with improved retention time precision for a real data set that contained severe shifting.

4.1.4. Preservation of quantitative information after piecewise alignment

The effect of piecewise alignment on preserving the original peak height and peak area was also studied. 
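As a hedged illustration of this kind of before-and-after comparison, the relative difference in peak height and peak area could be computed as in the Python/NumPy sketch below. The function names, the Gaussian test peak, the zero baseline, the trapezoidal integration, and the half-point interpolation used to mimic resampling are all assumptions made for the example, not the procedure used in this work.

```python
import numpy as np

def height_and_area(signal, time):
    """Peak height (maximum) and trapezoidal area, assuming a zero baseline."""
    height = float(signal.max())
    area = float(np.sum(0.5 * (signal[1:] + signal[:-1]) * np.diff(time)))
    return height, area

def relative_difference_percent(before, after):
    """Signed relative difference of `after` with respect to `before`, in percent."""
    return 100.0 * (after - before) / before

# Hypothetical test peak: unit-height Gaussian, sigma = 5 points, centered at t = 50.
time = np.arange(100, dtype=float)
raw = np.exp(-0.5 * ((time - 50.0) / 5.0) ** 2)

# Assumption: a half-point shift by linear interpolation stands in for the
# resampling an alignment step might apply; it is not the published procedure.
aligned = np.interp(time + 0.5, time, raw)

h_raw, a_raw = height_and_area(raw, time)
h_al, a_al = height_and_area(aligned, time)
height_diff = relative_difference_percent(h_raw, h_al)
area_diff = relative_difference_percent(a_raw, a_al)
print(f"height: {height_diff:+.2f}%  area: {area_diff:+.3f}%")
```

In this toy case the interpolated peak loses a little height while its area is essentially unchanged. Averaging `abs(area_diff)` over several peaks would give an average absolute relative difference of the kind reported for area.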
Quantitative results for five representative peaks before and after piecewise alignment were obtained by averaging 25 replicate sample injections (a sampling that was statistically representative of the population) from simulated data set V. The relative shift for the five peaks in the original data ranged from 0.39 to 2.85. Piecewise alignment improved the relative shift for all five peaks, to a range of 0.15 to 0.28. Piecewise alignment either left the original peak heights unchanged or decreased them by a small amount (−0.2%). The original peak areas slightly increased, slightly decreased, or did not change. The average absolute relative difference in area after piecewise alignment was 0.15% with a propagated uncertainty of ±0.91%. Piecewise alignment may affect peak area, but the change is within acceptable precision limits, given that, ideally, with the use of internal standards, peak area precision up to 1% RSD is acceptable [20].

5. Conclusion

Piecewise alignment was shown to be robust for severely shifted data where peaks are shifted past neighbor peaks between runs. Piecewise alignment was also shown to be robust for
significantly different sample types. Thus, piecewise alignment is a suitable pretreatment method for chromatograms prior to their multivariate data analysis based on increased class separation and interpretability of scores plots. The unsupervised correlation coefficient method yielded the same optimal alignment parameter values as the supervised degree-of-class-separation optimization method. Thus, the unsupervised alignment parameter optimization method was in agreement with the supervised parameter optimization method previously reported. This frees the user from providing class information to the alignment algorithm and makes the alignment algorithm more generally applicable to classifying completely unknown data sets. Given access to additional instruments, gathering the “real” data sets on numerous instruments using the same GC method would have more directly addressed the issue of calibration transfer. With this in mind, future studies and applications of the piecewise alignment algorithm will involve truly uncontrollable retention time fluctuations, rather than the fluctuations induced in the current data by changing the GC methods. The simulated data set V is now freely available under the research link at http://synoveclab.chem.washington.edu/.

Acknowledgements

We thank ChevronTexaco for providing the reformate data. This work was supported by the Internal Revenue Service through an interagency agreement with the US Department of Energy. The Pacific Northwest National Laboratory is operated by Battelle Memorial Institute for the US Department of Energy under contract DE-AC05-76RLO 1830. The views, opinions, or findings contained in this report are those of the authors and should not be construed as the official Internal Revenue Service position, policy, or decision unless designated by other documentation.

References

[1] R.G. Brereton, Chemometrics: Data Analysis for the Laboratory and Chemical Plant, Wiley, New York, 2003.
[2] R. Siuda, G. Balcerowska, D. Aberdam, Chemom. Intell. Lab. Syst. 40 (1998) 193.
[3] K.J. Johnson, R.E. Synovec, Chemom. Intell. Lab. Syst. 60 (2002) 225.
[4] G. Malmquist, R. Danielsson, J. Chromatogr. A 687 (1994) 71.
[5] R. Andersson, M.D. Hamalainen, Chemom. Intell. Lab. Syst. 22 (1994) 49.
[6] R.J.O. Torgrip, M. Aberg, B. Karlberg, S.P. Jacobsson, J. Chemom. 17 (2003) 573.
[7] J.H. Christensen, J. Mortensen, A.B. Hansen, O. Andersen, J. Chromatogr. A 1062 (2005) 113.
[8] B.K. Lavine, Anal. Chim. Acta 437 (2001) 233.
[9] J. Forshed, I. Schuppe-Koistinen, S.P. Jacobsson, Anal. Chim. Acta 487 (2003) 189.
[10] N.-P.V. Nielsen, J.M. Carstensen, J. Smedsgaard, J. Chromatogr. A 805 (1998) 17.
[11] G. Tomasi, F. van den Berg, C. Andersson, J. Chemom. 18 (2004) 231.
[12] K.J. Johnson, B.W. Wright, K.H. Jarman, R.E. Synovec, J. Chromatogr. A 996 (2003) 141.
[13] P.M.L. Sandercock, E.D. Pasquier, Forensic Sci. Int. 140 (2004) 43.
[14] D. Bylund, R. Danielsson, G. Malmquist, K.E. Markides, J. Chromatogr. A 961 (2002) 237.
[15] M.V. Nederkassel, M. Daszykowski, P.H.C. Eilers, Y.V. Heyden, J. Chromatogr. A 1118 (2006) 199.
[16] E. Reiner, L.E. Abbey, T.F. Moran, P. Papamichalis, R.W. Schafer, Biomed. Mass Spectrom. 6 (1979) 491.
[17] M.E. Parrish, B.W. Good, F.S. Hsu, F.W. Hatch, D.M. Ennis, D.R. Douglas, J.H. Shelton, D.C. Watson, C.N. Reilly, Anal. Chem. 53 (1981) 826.
[18] K.M. Pierce, J.L. Hope, K.J. Johnson, B.W. Wright, R.E. Synovec, J. Chromatogr. A 1096 (2005) 101.
[19] HPChemstation Manual, Understanding Your Chemstation, Hewlett Packard, 1994.
[20] D.A. Skoog, F.J. Holler, T.A. Nieman, Principles of Instrumental Analysis, Saunders College Publishing, Philadelphia, PA, 1998.