Determination of the number of significant components in liquid chromatography nuclear magnetic resonance spectroscopy

Chemometrics and Intelligent Laboratory Systems 72 (2004) 133–151
www.elsevier.com/locate/chemolab

Mohammad Wasim, Richard G. Brereton*
School of Chemistry, University of Bristol, Cantock's Close, Bristol BS8 1TS, UK
Received 19 October 2003; received in revised form 13 January 2004; accepted 14 January 2004. Available online 30 April 2004.

Abstract

In this paper, the effectiveness of methods for determining the number of significant components is evaluated in four simulated and four experimental liquid chromatography nuclear magnetic resonance (LC–NMR) spectrometric datasets. The following methods are tested: eigenvalues, log eigenvalues and eigenvalue ratios from principal component analysis (PCA) of the overall data; error indicator functions [residual sum of squares (RSS), residual standard deviation (RSD), ratio of successive residual standard deviations (RSDRatio), root mean square error (RMS), imbedded error (IE), factor indicator function, scree test and Exner function], together with their ratio of derivatives (ROD); F-tests (Malinowski, Faber–Kowalski and modified FK); cross-validation; morphological score (MS); purity-based approaches including the orthogonal projection approach (OPA) and SIMPLISMA; correlation and derivative plots; evolving PCA (EPCA) and evolving PC innovation analysis (EPCIA); and subspace comparison. Five sets of methods are selected as best, including several error indicator functions, their ratio of derivatives, the residual standard deviation ratio, OPA concentration profiles and evolving PCA using an expanding window (EW). Omitting the dataset with the highest noise level, RSS, Malinowski's F-test, concentration profiles using SIMPLISMA and subspace comparison with PCA scores also perform well.

© 2004 Published by Elsevier B.V.

Keywords: Liquid chromatography; Nuclear magnetic resonance; Principal component analysis; Noise level

1. Introduction

Coupled chromatographic methods such as liquid chromatography with diode array detection (LC–DAD), liquid chromatography with infrared spectroscopy (LC–IR), gas chromatography with IR (GC–IR) and gas chromatography with mass spectrometry (GC–MS) are increasingly common in the modern analytical laboratory and produce large quantities of multivariate data, which often have to be interpreted using chemometric methods. For a successful analysis of this kind of data, the first step involves the estimation of the correct rank, or number of significant components, in a multivariate data matrix. Ideally, this should equal the number of compounds in a chromatographic cluster, but in practice real data often contain artefacts, e.g. baseline problems, different types of noise, errors generated by data preprocessing (baseline correction, smoothing, etc.) and so on. A variety of tests have been introduced in the literature to find the number of components present in a mixture, and several intercomparisons have been made to check the applicability and limitations of these methods on different datasets [1–4]. Most of these methods produce excellent results when the noise distribution is white, normal and homoscedastic and the resolution is reasonable, but they may fail to yield easily interpretable results when these conditions are not fulfilled.

Most classical studies in chemometrics have been performed on LC–DAD data, for which the principles of resolution have been well established over the past decade. In contrast, on-flow high-performance liquid chromatography coupled with nuclear magnetic resonance (LC–NMR) produces a relatively high level of noise [5]. The lower sensitivity of NMR relative to UV/vis requires higher concentrations in LC–NMR than in LC–DAD, often requiring chromatographic columns to be overloaded and, in many situations, leading to overlapped peaks in the chromatographic direction. In addition, the main features of NMR spectra are quite different from those obtained using a DAD. NMR data usually consist of several Lorentzian-shaped peaks between which there is primarily noise, and, in most cases, the noise regions occupy a larger portion of the spectrum than the signals. Recently, we have reported studies on the resolution of such chromatograms using multivariate methods [6–8].

The present paper reports a comprehensive study of methods employed for the determination of the number of significant components in LC–NMR chromatographic peak clusters. We explore the effect of noise level and resolution on the application of various methods for the determination of rank. The performance is evaluated on both simulated and real LC–NMR datasets containing different numbers of components. The aim of this work is to identify those methods which are suitable for rank analysis of LC–NMR data, together with their advantages and disadvantages, and to group methods producing similar results. None of these methods employs a priori knowledge about the pure spectra of the main compounds or the instrumental noise.

* Corresponding author. Fax: +44-117925-1295. E-mail address: [email protected] (R.G. Brereton).
0169-7439/$ - see front matter © 2004 Published by Elsevier B.V. doi:10.1016/j.chemolab.2004.01.008

2. Experimental

The two-way data produced by LC–NMR consist of NMR spectra at several chemical shifts in one dimension (along rows) and concentration (or chromatographic) profiles in the other dimension (along columns). Two different groups of datasets are reported in this paper: one is simulated data of three components and the other contains experimental data. The notation used in this paper is presented in Table 1.

2.1. Simulated data

Four simulated datasets, each consisting of a three-component mixture, were created using previously measured NMR spectra of the compounds 2,6-dihydroxynaphthalene (I), 1,3-dihydroxynaphthalene (II) and 2,3-dihydroxynaphthalene (III), and simulating the chromatographic direction. In contrast to the NMR spectra, which are usually modelled as a sum of Lorentzian peaks, the chromatograms are best based on a Gaussian model.

2.1.1. Chromatography

Three chromatographic peaks were created as follows:

f(m) = exp{−[(m − μ)/σ]²}    (1)

where σ is related to the standard deviation of the peak, m is the chromatographic elution time and μ is the mean or centre of the chromatographic peak. The value of σ was calculated from peaks of real chromatograms of the three isomers of dihydroxynaphthalene. An average value of 20 s was used for σ, where points were sampled every s. Each chromatogram consisted of 252 spectra, resulting in a C matrix over 252 points in time. While the theoretical model of a Gaussian peak approaches zero asymptotically, to make the chromatographic profiles resemble real data more closely, i.e. with zero intensity at the start and end, all the profiles were set to zero at a height of 10% of the original Gaussian peakshape, and an offset was added to all the data points. Chromatographic points outside this window were set to zero for each compound. Three chromatographic profiles were created using equal heights.

2.1.2. Spectra

Real NMR spectra were recorded using on-flow LC–NMR for the three isomers of dihydroxynaphthalene, 100 mM in concentration each. Three spectra with maximum peak height were selected from the two-way datasets of each isomer. All regions of the NMR spectra where there was no resonance peak were replaced with zeros. This procedure resulted in 989 chemical shift values over the aromatic region. A matrix S of dimensions 3 × 989 was created, consisting of the spectra of the three compounds.

2.1.3. Bilinear data matrices

Bilinear data matrices X of dimensions 252 × 989 were created by multiplying C with S and adding normally distributed random noise as

X = CS + E    (2)

where E is the error matrix. Four simulated datasets were created using different peak positions and different noise levels as follows:

Dataset 1: Peak maxima at time 90 (I), 110 (II) and 130 (III); noise 1% of maximum signal height, corresponding to a resolution of 0.250.
Dataset 2: Peak maxima at time 90 (I), 110 (II) and 130 (III); noise 5% of maximum signal height, corresponding to a resolution of 0.250.
Dataset 3: Peak maxima at time 80 (I), 110 (II) and 140 (III); noise 1% of maximum signal height, corresponding to a resolution of 0.374.
Dataset 4: Peak maxima at time 80 (I), 110 (II) and 140 (III); noise 5% of maximum signal height, corresponding to a resolution of 0.374.
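The simulation of Section 2.1 can be sketched numerically. The following is an illustrative reconstruction, not the authors' Matlab code: the Gaussian profiles, σ = 20, peak positions and 1% noise level follow the description above, while the matrix S here is a random sparse stand-in for the measured dihydroxynaphthalene spectra, and the baseline offset step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_profile(t, mu, sigma=20.0):
    """Eq. (1): Gaussian elution profile, truncated below 10% of its height."""
    f = np.exp(-(((t - mu) / sigma) ** 2))
    f[f < 0.1] = 0.0          # zero intensity at the start and end of the peak
    return f

t = np.arange(252)            # 252 points in the chromatographic (time) direction
C = np.column_stack([gaussian_profile(t, mu) for mu in (90, 110, 130)])

# Stand-in for the 3 x 989 matrix S of measured NMR spectra (sparse positive peaks)
S = np.where(rng.random((3, 989)) > 0.95, rng.random((3, 989)), 0.0)

signal = C @ S
noise_level = 0.01 * signal.max()    # 1% of the maximum signal height (Dataset 1)
X = signal + rng.normal(0.0, noise_level, signal.shape)   # Eq. (2): X = CS + E
print(X.shape)
```

Changing `noise_level` to 5% and the peak centres to (80, 110, 140) reproduces the other three simulated configurations.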

2.2. Experimental data

A list of the compounds used for the experimental mixtures is presented in Table 2. The following four mixtures were created by combining different compounds:

Dataset 5: Three-component mixture, consisting of A, B and C, 125 mM each.
Dataset 6: Four-component mixture, consisting of D, E, F and G, 100 mM each.
Dataset 7: Seven-component mixture, consisting of B, C, D, G, H, I and J, 50 mM each.
Dataset 8: Eight-component mixture, consisting of B, D, G, H, I, J, K and L, 50 mM each.

Table 1
Notation used in the paper

ψ — Exner function
sd_m — Standard deviation of the mth row
σ — A parameter related to the standard deviation of a Gaussian peak
ν — Degrees of freedom
μ — Mean or centre of a chromatographic peak
z_mn — Normalized intensity at the mth row and nth column
Y_m — A matrix in OPA containing reference spectra and the mth row of X
x_m1 — First reference spectrum in Y_m
z̄ — Normalized average spectrum of X in OPA
x_mn — LC–NMR signal intensity at the mth row and nth column
x_m — The mth row vector of X
X — Data matrix containing LC–NMR signals
u_m — Weight value in SIMPLISMA
Trace — Mathematical trace function
t_mk — PCA score at the mth row for the kth component
T (superscript) — Transpose of a matrix
T — Scores matrix obtained by PCA
tr — Trace value of a matrix in subspace comparison
S — Matrix of spectral profiles
rssq — Residual sum of squares calculated in SIMPLISMA
RSS — Residual sum of squares, an error indicator function
RSD — Residual standard deviation
RPV — Residual percent variance
ROD — Ratio of derivatives of an error indicator function
RMS — Root mean square error
R_m — Matrix containing one or more reference spectra in SIMPLISMA
RSDRatio — Ratio of RSD function
h — A positive multiplier in the modified Faber–Kowalski F-test
r — Correlation coefficient
q_m — Ratio of standard deviation to corrected mean intensity
PRESS — Predicted residual error sum of squares
pu_m — Purity value in SIMPLISMA
P — Loadings matrix obtained by PCA
offset — An adjustable parameter in SIMPLISMA
N — Total number of variables (columns) in X
n — Index for columns
MS_nl — Morphological score of the noise level
MS — Morphological score
M — Total number of spectra (rows) in X
m — Index for rows
K — Total number of compounds present in a mixture
k — Index for number of compounds or factors
INF — Error indicator function
IND — Factor indicator function
IE — Imbedded error
g_k — Eigenvalue of the kth component
A — Matrix containing orthogonal spectra selected by a variable selection method
f( ) — Probability distribution function
F — The value of the F-test
B — Matrix containing orthogonal spectra selected by a variable selection method
e_mn — Residual error in data point x_mn after model fitting
Ē_MC[ ] — Expectation value calculated by Monte Carlo simulation
E — Error matrix
Ē[ ] — Expectation value
dw — Durbin–Watson statistic
d_m — Dissimilarity value in OPA for row m
det — Determinant of a matrix
D — Subspace discrepancy function
C — Matrix of chromatographic profiles
l — Vector containing the sum along the variables of the X matrix
κ — Overall average of the data matrix X
x̂_mn — Modelled value at time m and variable n for K PCs
x̄ — Vector of the mean signal along the elution time of matrix X
w̄_m — Mean signal height of the mth row
w — Vector containing all w̄_m values
||x_m|| — Magnitude of the mth row
l̂ — Least-squares approximation of the total intensity spectrum
dx_mn/dm — Derivative of x_mn with respect to m
θ — Largest principal angle in subspace comparison
δ — Chemical shift

2.2.1. Chromatography

All the datasets except dataset 6 were run on a Waters (Milford, MA, USA) HPLC system, which consisted of a 717 Plus Autosampler, a 600S Controller and a model 996 Diode Array Detector with a model 616 pump. Dataset 6 was run on an HPLC system consisting of a Jasco PU-980 pump and a UV-975 UV detector.

2.2.2. Spectra

Acetonitrile (HPLC grade, Rathburn Chemicals, Walkerburn, UK) and deuterated water (Goss Scientific Instruments, Great Baddow, Essex, UK) were used as solvent and mobile phase in the concentrations presented in Table 3. A few drops of tetramethylsilane (TMS; Goss Scientific) were added as a chemical shift reference. A 4 m PEEK tube of 0.005 in. width was used to carry the eluent from the HPLC instrument to a flow cell (300 μl) in the NMR probe of a 500 MHz NMR spectrometer (Jeol Alpha 500, Tokyo,

Table 2
Compound names, codes and suppliers

Label  Compound name               Purity and supplier
A      Phenyl acetate              99%, Aldrich Chemical, Milwaukee, WI, USA
B      Methyl p-toluenesulphonate  98%, Lancaster, Morecambe, UK
C      Methyl benzoate             99%, Lancaster
D      2,6-dihydroxynaphthalene    98%, Avocado Research Chemicals, Heysham, UK
E      1,6-dihydroxynaphthalene    99%, Lancaster
F      1,3-dihydroxynaphthalene    99%, Aldrich, Steinheim, Germany
G      2,3-dihydroxynaphthalene    98%, Acros Organics, Geel, Belgium
H      1,2-diethoxybenzene         98%, Lancaster
I      1,4-diethoxybenzene         98%, Lancaster
J      1,3-diethoxybenzene         95%, Lancaster
K      Diethyl maleate             98%, Avocado
L      Diethyl fumarate            97%, Avocado


Table 3
Experimental conditions for real LC–NMR data acquisition

           Column                                       Solvent/mobile  Flow rate   Injection    Acquisition  Pulse
                                                        phase (v/v)     (ml min⁻¹)  volume (μl)  time (s)     delay (s)
Dataset 5  C18 Waters Symmetry (150 × 2.1 mm, 5.0 μm)   50:50           0.1         20           1.4850       0.9
Dataset 6  C18 Phenomenex (150 × 4.6 mm, 5.0 μm)        60:40           0.5         20           1.1698       0.9
Dataset 7  C18 Waters Symmetry (100 × 4.6 mm, 3.5 μm)   80:20           0.5         50           1.1698       2
Dataset 8  C18 Waters Symmetry (100 × 4.6 mm, 3.5 μm)   80:20           0.2         50           1.1698       2

Japan). For each spectrum, the spectral width was 7002.8 Hz and a pulse angle of 90° was employed.

2.3. Software

The data analysis was performed by computer programs written in Matlab by the authors of this paper. Fourier transformation and preprocessing of the LC–NMR data were performed by in-house written software called LCNMR, which runs under Matlab.

3. Methods

3.1. Data preprocessing

No data preprocessing was required for the datasets, which were simulated in the frequency or spectral domain. The experimental data were acquired in the form of free induction decays (FIDs) during chromatographic elution, and the following processing was performed on each FID. Spectra were apodised and Fourier transformed. The first spectrum in time was removed as it contained artefacts; the spectra were aligned to the TMS reference and phase corrected. In order to remove errors due to quadrature detection, which results in a regular oscillation of intensity, a moving average of each frequency over every four points in time was performed. This procedure has been described in detail elsewhere [9].

In LC–NMR, many parts of the spectrum correspond primarily to noise channels, so it is important to select only those variables where the signal to noise ratio is acceptable (the noise level was defined by visual inspection of standard deviation plots). In addition, there may only be certain portions of the chromatographic (or time) direction where compounds elute, and other regions can be discarded. Variable selection is important because, in multivariate methods, if noise is dominant in the data, the results produced will be dominated by noise and become difficult to interpret. For this purpose, the standard deviation was first calculated over each variable (chemical shift), and only those variables were retained where the standard deviation was acceptable. The same procedure was repeated in the time dimension.

3.2. Chemometrics data analysis

Several methods have been proposed in the literature for the determination of chemical rank; they are described below.

3.2.1. Eigenvalue, logarithm of eigenvalues and eigenvalue ratio

Principal component analysis (PCA) [10,11], when performed on a data matrix X (M × N) with a suitable number K of components selected, decomposes the data matrix into a scores matrix T, a loadings matrix P and an error matrix E of dimensions M × K, K × N and M × N, respectively. Mathematically, it can be defined as

X = TP + E    (3)

The size of each PC is measured by its eigenvalue [12], which is calculated as

g_k = Σ_{m=1}^{M} t_mk²    (4)

where g_k is the eigenvalue of the kth component and t_mk is the score of the mth row and kth component. Eigenvalues that correspond to real factors should be considerably higher than later eigenvalues, which correspond to noise. Thus, the significant components can be identified easily in a plot of eigenvalues against the number of components. The logarithm of eigenvalues [13,14] and the eigenvalue ratio [15] have also been reported as useful methods for differentiating the significant components from the remaining components. In our study, the eigenvalues, logarithm of eigenvalues and eigenvalue ratio were plotted against the number of components, and the data rank was determined by finding a break in the plot for the first two methods and by locating a separate group of data points in the eigenvalue ratio method.

In the methods described below, the speed of calculation is influenced by the dimensions of the data matrix. Therefore, in the calculation of eigenvalues, the data matrix was formed in such a way that the number of columns was less than the number of rows: if the original data matrix had more columns than rows, it was transposed. Thus, in all of these methods, the maximum number of eigenvalues was equal to the number of columns. The data matrix is defined as X with M rows and N columns. No preprocessing, such as centring, was performed on the data matrix.
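The eigenvalue calculation of Eqs. (3) and (4) can be sketched numerically. This is a minimal illustration, not the authors' Matlab programs: the toy matrix and all names are assumptions, and the SVD of the uncentred matrix is used, for which the squared singular values equal the eigenvalues g_k.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bilinear data: 3 true components plus low-level noise (stand-in for LC-NMR data)
M, N, K_true = 120, 60, 3
X = rng.standard_normal((M, K_true)) @ rng.standard_normal((K_true, N))
X += rng.normal(0.0, 0.01, (M, N))

# Transpose so that the number of columns does not exceed the number of rows
if X.shape[1] > X.shape[0]:
    X = X.T

# Uncentred PCA via SVD: X = TP + E with T = U*s, so g_k = sum_m t_mk^2 = s_k^2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
g = s ** 2                       # eigenvalues, Eq. (4)
ratios = g[:-1] / g[1:]          # eigenvalue ratio; the largest jump marks the rank
rank = int(np.argmax(ratios)) + 1
print(rank)
```

Plotting `np.log10(g)` against the component number shows the break between the three signal eigenvalues and the noise eigenvalues described in the text.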

3.2.2. Error indicator functions (INF)

In the following subsections, some methods are defined collectively as error indicator functions, as all of them estimate some kind of noise present in the data; they will be denoted by INF.

3.2.2.1. Residual sum of squares plot (RSS). The RSS [16] gives the sum of squares of the error values after data reproduction using K eigenvalues. RSS is defined as

RSS(K) = Σ_{m=1}^{M} Σ_{n=1}^{N} x_mn² − Σ_{k=1}^{K} g_k = Σ_{m=1}^{M} Σ_{n=1}^{N} e_mn²    (5)

where x_mn is the signal intensity at time m and chemical shift n, g_k is the kth eigenvalue and e_mn is the error. The RSS plot provides information about the size of the residuals as an increasing number of eigenvalues is selected. If the data were noise free, the RSS value would drop to zero after selecting a suitable number of significant components, but with real data the RSS plot shows a change of slope. The number of true components then becomes equal to the K value after which this change is observed.

3.2.2.2. Residual standard deviation (RSD). The RSD [2] represents the difference between the raw data and the pure data with no experimental error, and it measures the lack of fit of a PC model. After selecting a suitable number of eigenvalues, the RSD should be equal to or lower than the experimental error. It is calculated as

RSD(K) = sqrt[ Σ_{k=K+1}^{N} g_k / (M(N − K)) ]    (6)

RSD(K) is plotted against K; when it reaches the value of the instrumental error, it levels off. At that point, the number of components becomes equal to K, and a flatness in the plot at and beyond K is observed visually.

3.2.2.3. Root mean square error (RMS). The RMS [2] measures the difference between the raw and regenerated data after selecting suitable eigenvalues. Although the RMS is closely related to the RSD, it produces lower values than the RSD. It is defined as

RMS(K) = sqrt[ Σ_{k=K+1}^{N} g_k / (MN) ] = sqrt[(N − K)/N] RSD(K)    (7)

RMS(K) is plotted against K, and the graph flattens when the correct number of factors is reached. The criterion for finding the number of components is the same as discussed for the RSD method.

3.2.2.4. Imbedded error (IE). The IE function [2,17] is defined by

IE(K) = sqrt[ K Σ_{k=K+1}^{N} g_k / (MN(N − K)) ] = sqrt(K/N) RSD(K)    (8)

It decreases until all the significant eigenvalues have been calculated, and then starts to increase. Theoretically, the number of significant factors is obtained when the minimum is reached. In practice, a steady increase is rarely observed because the errors are not uniformly distributed, and it becomes difficult to find a true minimum. In our study, we determined the number of components by calculating the minimum of the IE(K) function and finding a change in slope in the IE vs. K plot.

3.2.2.5. Factor indicator function (IND). The IND, proposed by Malinowski [2,17], is calculated as

IND(K) = sqrt[ Σ_{k=K+1}^{N} g_k / (M(N − K)) ] / (N − K)² = RSD(K)/(N − K)²    (9)

It reaches a minimum when the true number of components is employed and provides a more pronounced minimum than the IE function; the IND is much more sensitive than the IE in finding the true number of components [2]. The criteria for finding the rank of the data using the IND function are the same as defined for the IE function.

3.2.2.6. Residual percent variance (RPV; scree test). The scree test reported by Cattell [18] suggests that the residual variance should level off after a suitable number of factors have been computed. It provides information on the relative importance of each eigenvalue to the overall sum of squares of the data and is defined by

RPV(K) = 100 Σ_{k=K+1}^{N} g_k / Σ_{k=1}^{N} g_k    (10)

This function is plotted against K. The plot levels off at the point where the number of significant components has been found. This point is taken as the true dimensionality of the data.
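The indicator functions of Eqs. (6)–(10) all derive from the tail of the eigenvalue sequence, so they can be computed together. The sketch below is illustrative only (the function and variable names are not from the paper's software) and assumes `g` holds the N eigenvalues of an M × N matrix in decreasing order.

```python
import numpy as np

def indicator_functions(g, M):
    """RSD, RMS, IE, IND and RPV (Eqs. 6-10) for K = 1 .. N-1 components."""
    N = g.size
    out = {"K": [], "RSD": [], "RMS": [], "IE": [], "IND": [], "RPV": []}
    total = g.sum()
    for K in range(1, N):
        tail = g[K:].sum()                       # sum_{k=K+1}^{N} g_k
        rsd = np.sqrt(tail / (M * (N - K)))      # Eq. (6)
        out["K"].append(K)
        out["RSD"].append(rsd)
        out["RMS"].append(np.sqrt((N - K) / N) * rsd)   # Eq. (7)
        out["IE"].append(np.sqrt(K / N) * rsd)          # Eq. (8)
        out["IND"].append(rsd / (N - K) ** 2)           # Eq. (9)
        out["RPV"].append(100.0 * tail / total)         # Eq. (10)
    return {key: np.asarray(v) for key, v in out.items()}

# Toy example: 3 large eigenvalues followed by a flat noise tail
g = np.array([100.0, 40.0, 10.0] + [0.01] * 17)
res = indicator_functions(g, M=50)
rank_by_ind = int(res["K"][np.argmin(res["IND"])])  # minimum of IND marks the rank
print(rank_by_ind)  # → 3
```

With a noisy experimental eigenvalue tail, the minima and slope changes are less clear-cut, which is why the paper inspects the plots rather than relying on the raw minima alone.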


3.2.2.7. Exner function (ψ). The Exner function proposed by Kindsvater et al. [19] is an empirical function, denoted by the Greek letter ψ; it varies from zero to infinity. A ψ value of 1.0 is the upper limit of physical significance, but Exner [20] proposed 0.5 as the largest acceptable value. In our study, we used ψ(K) < 0.1 for determining the number of significant components. The function is defined as follows:

ψ(K) = sqrt[ MN Σ_{m=1}^{M} Σ_{n=1}^{N} (x_mn − x̂_mn)² / ( (MN − K) Σ_{m=1}^{M} Σ_{n=1}^{N} (x_mn − κ)² ) ]    (11)

where x̂_mn is the reproduced data point after K components have been calculated, and κ is the overall average of the data matrix. The criterion for determining the number of components is the same as defined for the scree test.

3.2.3. Ratio of derivatives of error indicator functions (ROD)

All of the error indicator functions defined above can be plotted against the number of components, and the estimation of the true number of components is based on the proper interpretation of the plots. Another way to find the data rank is to plot the derivative of any one of the error indicator functions (defined in Sections 3.2.2.1–3.2.2.7) and find the minimum or maximum in the plot. Elbergali et al. [21] suggested the second derivative, the third derivative and the ratio of the second to the third derivative for finding a break point in the plot. In our study, we selected only the ratio of the second to the third derivative, defined as

ROD(K) = [INF(K) − INF(K + 1)] / [INF(K + 1) − INF(K + 2)]    (12)

where INF is an error indicator function. The number of components present in a mixture can be identified either by locating a minimum in the INF function or a maximum in the ROD function. In Ref. [21], no further guidance is given on selection when the two criteria do not indicate the same number of components.

3.2.4. Ratio of RSD (RSDRatio)

The error indicator functions defined in Sections 3.2.2.1–3.2.2.7 can also be expressed as a ratio of the form

RSDRatio(K) = INF(K)/INF(K − 1)    (13)

The assumption is that the ratio changes smoothly over the number of components belonging to PCs that are not significant. The ratios arising from the components corresponding to noise will fall on a line and form a distinct group of points compared with those coming from significant components. We define the following ratio for the RSD function:

RSDRatio(K) = RSD(K)/RSD(K − 1) = sqrt[ (N − K + 1) Σ_{k=K+1}^{N} g_k / ( (N − K) Σ_{k=K}^{N} g_k ) ]    (14)

Although such a ratio could be computed for different error indicator functions, we restrict ourselves to the RSD in this paper, as similar results were obtained for the other indicator functions.

3.2.5. F-test

Malinowski [22,23] developed a test based on the Fisher variance ratio, called the F-test, to determine the rank of data. This calculates the ratio of two variances which come from independent pools of samples; the error in the data is assumed to have a normal distribution. In the application of the F-test, the variances are eigenvalues, which derive from independent sets of PCs. Mathematically, the F-test is defined by

F_K(ν1, ν2) = g_K Σ_{k=K+1}^{N} (M − k + 1)(N − k + 1) / [ (M − K + 1)(N − K + 1) Σ_{k=K+1}^{N} g_k ]    (15)

where ν1 = 1 and ν2 = N − K are the degrees of freedom. The calculated F-ratio is compared with the F-inverse value at a defined level of confidence (99% in our study). The number of significant components is set equal to K when the F-value exceeds the F-inverse value.

Faber and Kowalski [24,25] noticed that the degrees of freedom used in Malinowski's F-test were not consistent with the definition of the F-test. They defined an F-test based on Mandel's [26] degrees of freedom (called the FK F-test) as

F_K(ν1, ν2) = [ g_K / Σ_{k=K+1}^{N} g_k ] (ν2/ν1)    (16)

where ν1 and ν2 are the degrees of freedom calculated by

ν1 = Ē[g_K]    (17)

ν2 = (M − K + 1)(N − K + 1) − ν1    (18)

and Ē[g_K] is the expectation value of the Kth eigenvalue, which is based on the premise that the expected value of an eigenvalue corresponding to noise, divided by the appropriate degrees of freedom, is an unbiased estimate of the error variance σ², defined by

σ² = Ē[g_K]/ν_K    (19)

Ē[g_K] is estimated as the average over Monte Carlo simulations of several matrices of random values drawn from a normal distribution with variance σ² = 1. The degrees of freedom are then defined by

ν_K = Ē_MC[g_K]    (20)

where the subscript MC denotes values calculated by the Monte Carlo simulations. Following the Faber–Kowalski suggestions for the F-test, Malinowski [27] reported that the FK F-test is very sensitive and overestimates the rank; he suggested a modification in the calculation of the degrees of freedom as

ν_K = h Ē_MC[g_K]    (21)

where h is a positive number, which reduces the sensitivity of the test. In the present study, we used all of these variations and report their results at different significance levels.

3.2.6. Cross-validation

Cross-validation [28,29] is one of the most common chemometric methods for determining the number of significant components in a dataset. An implementation of cross-validation known as leave-one-out is usually applied to find the number of factors. In the leave-one-out method, a data matrix X is divided into submatrices of dimension (M − 1, N) by taking one row out of the matrix X, creating M submatrices. The deleted row is modelled by varying the number of PCs. The modelled rows give a predicted residual error sum of squares (PRESS) for a varying number of PCs. PRESS is defined by

PRESS(K) = Σ_{m=1}^{M} Σ_{n=1}^{N} (x̂_mn − x_mn)²    (22)

where x̂_mn is the modelled value at time m and variable n for K PCs. There are different ways to present the results, but a common approach is to compare PRESS using (K + 1) PCs with RSS using K PCs. If PRESS(K + 1)/RSS(K) is significantly greater than 1, then K PCs gives the rank of the data.

3.2.7. Morphological score (MS)

The morphological score was first presented in the chemometrics literature by Shen et al. [30]. The method is based on the fact that the ratio of the norm of a spectrum to the norm of its first difference is higher for the profile of a component than for a profile generated only by noise. Mathematically, it is defined by

MS(x) = ||x − x̄|| / ||MO(x − x̄)||    (23)

where MS(x) is the morphological score calculated for vector x, and MO(x) is the morphological operator, which calculates the difference between sequential values of the vector x. It has been shown that the score is scale invariant and is not affected by the magnitude of the noise; it is also independent of the baseline offset. The procedure utilises the significant key spectra (normalized to unit length), obtained from a method suitable for selecting key or pure variables, in the form of a matrix; pure variables are defined as those that best describe a single component in the mixture. PCA is then performed on that matrix, and the morphological score of each loadings vector, p_k, is calculated using Eq. (23). The morphological score of the noise level is calculated using the formula given in Eq. (24):

MS_nl = (1/2) sqrt[ (N − 1) F(N − 1, N − 2) / N ]    (24)

where N is the number of elements in the vector, and F is the F-test value at a certain level of confidence with (N − 1, N − 2) degrees of freedom. The rank of the data becomes equal to the number of vectors for which MS > MS_nl.

3.2.8. Orthogonal projection approach (OPA)

The orthogonal projection approach (OPA) [31] is a stepwise method for finding the pure variables in the data sequentially; each step involves identifying a new component in the data. The method can be applied in the time as well as the chemical shift direction. Here, the method is explained only for the chromatographic dimension. The first step involves the creation of a matrix Y_m for the mth spectrum:

Y_m = [z̄ x_m]    (25)

where z̄ is the normalized average spectrum of X, and x_m is the mth spectrum in X. After the first component is identified, the spectrum z̄ is replaced by a reference spectrum. A dissimilarity value, d_m, is calculated as

d_m = det(Y_m Y_m^T)    (26)

where det denotes the determinant, and Y_m^T is the transpose of the matrix Y_m. The value of d_m is plotted against elution time. The time corresponding to the maximum value of d_m in the plot indicates the most dissimilar spectrum in the data matrix compared with the average spectrum, and this spectrum is selected as the


first component. The spectrum at time m is selected as the first reference vector, which replaces the average spectrum in Eq. (25), so a new Ym matrix is obtained as

Ym = [xm1 xm]    (27)

where xm1 is the first reference spectrum, corresponding to the maximum dissimilarity value in the plot. Dissimilarity is calculated again using Eq. (26) and presented graphically. If a second reference spectrum, corresponding to maximum dissimilarity, is present, then a second vector, xm2, is added to the Ym matrix. In all subsequent steps, a new reference spectrum is added to Ym each time and the dissimilarity is recalculated. This procedure is repeated until the dissimilarity plot no longer contains an obvious peak, or only random noise is left in the plot. The rank of the data becomes equal to the number of reference spectra found in the whole process.

3.2.9. SIMPLISMA

SIMPLISMA was first reported by Windig et al. [32]. It determines the number of pure variables in the data and is based on a stepwise process similar to OPA. The first step is the calculation of the standard deviation sdm and mean intensity w̄m of each spectrum, which become part of a ratio, defined as

qm = sdm / [w̄m + (offset/100) max(w)]    (28)

where max(w) is the maximum of all the mean intensities of the M spectra, and the offset is an adjustable value. Each spectrum in the data matrix X is normalized to unit length

zmn = xmn / ‖xm‖    (29)

and the matrix Rm is created as

Rm = [zm]    (30)

and the weight, denoted by um, of each spectrum is calculated as

um = det(Rm^T Rm)    (31)

where Rm initially contains one normalized spectrum as given in Eq. (30). Eqs. (28) and (31) then provide a value for the purity pum, defined as

pum = um + qm    (32)

The value of pum is plotted as a function of time (which is called a purity plot). The purest spectrum will give the highest value of pum, and the corresponding time will be used to select the first reference spectrum. The next steps are similar to OPA: a reference spectrum is added to the Rm matrix and the process is repeated until the purity plot does not show any peak or exhibits only noise. The rank becomes equal to the number of reference spectra found.

The dissimilarity and purity data produced by OPA and SIMPLISMA can also be explored by other methods for determination of the rank of the data. In Sections 3.2.9.1 – 3.2.9.3, we describe three more procedures.

3.2.9.1. Residual sum of squares using SIMPLISMA. SIMPLISMA provides many tools for the identification of the true number of components. Among these, the residual sum of squares (rssq) [1] gives good results. The calculation is as follows. The total signal, l, is calculated over all variables; for example, it is possible to sum the chromatographic intensity at every spectral datapoint, so l becomes a vector as a function of chemical shift, each element corresponding to the summed elution profile at the corresponding spectral frequency. This is a linear combination of all the pure spectra, which is then projected onto the space of the spectra selected by SIMPLISMA, in a matrix S, and the contribution of each selected spectrum to the total spectrum is estimated. These values can then be used to calculate the least-squares approximation of the total intensity spectrum, l̂:

l̂ = S(S^T S)^−1 S^T l    (33)

When the proper number of spectra has been selected, l̂ should be equal to l within the experimental error. The difference between the two is measured by the residual sum of squares (rssq):

rssq = √[ Σn=1..N (ln − l̂n)² / Σn=1..N ln² ]    (34)
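The rssq calculation of Eqs. (33) and (34) can be sketched as follows; the dimensions, noise level and variable names here are hypothetical, not those of the experimental datasets:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 3 pure spectra mixed along 60 elution times.
S_true = rng.random((3, 120))                       # pure spectra (120 shift points)
C_true = rng.random((60, 3))                        # elution profiles
X = C_true @ S_true + 0.001 * rng.standard_normal((60, 120))

l = X.sum(axis=0)                                   # total signal summed over time

def rssq(S, l):
    # Eq. (33): least-squares approximation of l in the space of the
    # selected spectra (columns of S), then Eq. (34): normalised residual.
    l_hat = S @ np.linalg.solve(S.T @ S, S.T @ l)
    return np.sqrt(np.sum((l - l_hat) ** 2) / np.sum(l ** 2))

# With too few selected spectra the residual is large; with all three
# pure spectra it drops to the noise level.
print(rssq(S_true[:2].T, l) > 10 * rssq(S_true.T, l))   # True
```

In practice S would hold the spectra selected by SIMPLISMA rather than the true pure spectra used in this sketch.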

Ideally, rssq should not change significantly after the selection of the true number of components. A graph of rssq against K is plotted, and the rank becomes equal to the first value of K after which the graph shows no significant change.

3.2.9.2. Durbin–Watson statistics using OPA and SIMPLISMA. The Durbin–Watson (dw) [33 – 35] test assumes that the observations, and so the residuals, have a natural order. If each residual is an estimate of error and all of the errors are independent, the dw test checks for a sequential dependence in which each error is correlated with those before and after it in the sequence. In our study, we used the dissimilarity (dm) and purity (pm) values as residuals, and the rank was estimated by locating a large change in the dw values when plotted against the number of components (or reference spectra). The test, for OPA dissimilarity values, is defined as

dw = [ Σm=2..M (dm − dm−1)² ] / [ Σm=1..M dm² ]    (35)
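Eq. (35) can be sketched as follows; the residual sequences here are synthetic, purely for illustration:

```python
import numpy as np

def durbin_watson(d):
    # Eq. (35): squared successive differences of the dissimilarity
    # (or purity) values, normalised by their sum of squares.
    d = np.asarray(d, dtype=float)
    return np.sum(np.diff(d) ** 2) / np.sum(d ** 2)

rng = np.random.default_rng(2)
white = rng.standard_normal(200)   # uncorrelated residuals
trend = np.cumsum(white)           # strongly serially correlated residuals

# Uncorrelated residuals give dw near 2; positive serial correlation
# (structure remaining in the residuals) pulls dw towards 0.
print(1.5 < durbin_watson(white) < 2.5)   # True
print(durbin_watson(trend) < 1)           # True
```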

3.2.9.3. Concentration profiles using OPA and SIMPLISMA. Dissimilarity plots obtained by OPA, or purity plots obtained by SIMPLISMA, on the spectral dimension usually predict more components than the true rank. This is due to noise and to the fact that the profiles are not well aligned in the original data. This noise and drift generate dispersion in the NMR peak positions, which causes the incorrect estimation of a high rank. In this paper, we suggest a slight variation in the interpretation of purity plots by constructing ''concentration profiles''. NMR spectra, in contrast to UV spectra, consist of a series of several resonances, which result in quite distinct chromatographic profiles compared to HPLC – DAD. This property of NMR signals can be exploited in determining the true number of components. When OPA or SIMPLISMA is performed on the spectra, three types of concentration profiles can be observed. Single and pure profiles: these are retained. Multiple profiles belonging to the same compound: the repeated profiles are deleted. Impure profiles starting from one component and extending over two or more pure concentration profiles: some of these are retained and some are deleted. A visual inspection of these concentration profiles reveals the correct number of components very quickly.

3.2.10. Correlation plots

The correlation coefficient [16] is a measure of the similarity between two vectors. Between two spectral vectors at elution times m and m − 1, it is defined as

r(xm, xm−1) = Σn (xm,n − x̄m)(xm−1,n − x̄m−1) / { √[Σn (xm,n − x̄m)²] √[Σn (xm−1,n − x̄m−1)²] }    (36)

The correlation coefficient between two consecutive spectra, plotted against time, provides useful information about changes that take place in the data, which in turn relate to the number of compounds in the mixture. The correlation coefficient between two similar spectra will be close to 1, while it will be close to 0 for totally dissimilar spectra. A region in which the correlation coefficient is close to 1 indicates a selective chromatographic region.
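The consecutive-spectrum correlation of Eq. (36) can be sketched as follows (synthetic spectra, for illustration only):

```python
import numpy as np

def consecutive_correlation(X):
    # Eq. (36): Pearson correlation between each spectrum (row of X)
    # and the spectrum at the preceding elution time.
    Xc = X - X.mean(axis=1, keepdims=True)
    num = np.sum(Xc[1:] * Xc[:-1], axis=1)
    den = np.linalg.norm(Xc[1:], axis=1) * np.linalg.norm(Xc[:-1], axis=1)
    return num / den

rng = np.random.default_rng(3)
spec = rng.random(80)
# Five noisy copies of one spectrum: a selective region, so the
# consecutive correlation coefficients all stay close to 1.
X = np.vstack([spec + 0.01 * rng.standard_normal(80) for _ in range(5)])
print(np.all(consecutive_correlation(X) > 0.95))   # True
```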


Normally, the logarithm of the correlation coefficient is used for the vertical axis, provided the correlation coefficients are all positive, which is usual during regions of elution.

3.2.11. Derivative plots

Derivatives represent another approach for finding changes in the characteristic spectra with time. However, derivatives also magnify noise, so it is important that they are combined with smoothing functions. Savitzky–Golay (SG) filters [36 – 38] provide a facile combination of smoothing and derivative functions in one step. One such function, a first derivative with a five-point quadratic smoothing function, is defined as

dxmn/dm ≈ (−2xm−2,n − xm−1,n + xm+1,n + 2xm+2,n)/10    (37)
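Eq. (37) can be applied directly, for example as the following illustrative sketch:

```python
import numpy as np

def sg_first_derivative(x):
    # Eq. (37): five-point quadratic Savitzky-Golay first derivative.
    # np.convolve flips the kernel, so the coefficients are listed reversed.
    coeffs = np.array([2, 1, 0, -1, -2]) / 10.0
    return np.convolve(x, coeffs, mode="valid")

# On a straight line the smoothed first derivative recovers the slope exactly.
x = 3.0 * np.arange(20) + 7.0
print(np.allclose(sg_first_derivative(x), 3.0))   # True
```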

In the process of derivative calculation, the data are row scaled (over wavelengths), and the Savitzky–Golay first derivative (absolute value) is calculated and presented as a function of time. A full description of the method is reported elsewhere [39]. The minima in the plot represent points of highest purity, corresponding to derivatives close to 0; thus, the number of minima usually corresponds to the number of compounds in the mixture.

3.2.12. Evolving principal component analysis (EPCA)

Two-way chromatographic data, corresponding to a series of unimodal peaks in elution time, have properties that can be exploited for finding the row factors and concentration windows with the help of PCA. Evolving factor analysis belongs to the family of self-modelling curve resolution techniques [40] and can be implemented in two steps. The first step, sometimes called EPCA [41], involves creating a window in the data matrix and applying PCA to that window. The method can be implemented either with an expanding window (EPCA – EW) or as fixed size window evolving factor analysis (FSW – EFA). These methods are described briefly below.

EPCA with an expanding window is implemented in two steps. The first step is forward window factor analysis, which involves the formation of a submatrix from the data matrix X, and then applying PCA to calculate K principal components. This step produces K eigenvalues. In the next step, another spectrum (row) is added to the submatrix formed in the last step, and PCA is again performed for K principal components. The procedure is repeated until all the spectra (rows) in the matrix X are used. In backward window factor analysis (the second step), the same procedure is followed but the rows are selected from the end of the X matrix. Finally, the logarithm of the eigenvalues is plotted against the corresponding number of rows in the submatrices.
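The forward expanding-window step can be sketched as follows; the two-component data, peak positions and noise level are hypothetical:

```python
import numpy as np

def epca_expanding_window(X, K=3, min_rows=2):
    # Forward expanding-window EPCA: grow the submatrix one spectrum
    # (row) at a time and record log10 of the first K eigenvalues of
    # the submatrix (squared singular values).
    logs = []
    for m in range(min_rows, X.shape[0] + 1):
        s = np.linalg.svd(X[:m], compute_uv=False)
        ev = np.zeros(K)
        ev[: min(K, s.size)] = s[:K] ** 2
        logs.append(np.log10(ev + 1e-15))
    return np.array(logs)   # one row per window size; plot the columns

rng = np.random.default_rng(4)
# Hypothetical two-component data: the second compound elutes later.
t = np.arange(60)
C = np.column_stack([np.exp(-0.5 * ((t - 20) / 5.0) ** 2),
                     np.exp(-0.5 * ((t - 40) / 5.0) ** 2)])
X = C @ rng.random((2, 100)) + 1e-6 * rng.standard_normal((60, 100))

curves = epca_expanding_window(X, K=3)
# The second eigenvalue rises sharply once the second compound enters
# the expanding window.
print(curves[-1, 1] - curves[10, 1] > 3)   # True
```

The backward step is obtained in the same way by growing the window from the last row of X upwards.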
In the forward window, a rise in the eigenvalue curve indicates that a new compound is detected in the submatrix, and in the backward window, a fall in the eigenvalue curve indicates that the compound has eluted completely. If the number of


PCs calculated is more than the true number of compounds in the data matrix, the extra PCs will not show any significant rise or fall in their eigenvalues.

FSW – EFA [42 – 45] is also implemented in two steps, similar to EPCA – EW. In the forward step, a submatrix of fixed size is created and its eigenvalues are calculated. Next, the centre of the matrix is shifted forward by one row and PCA is performed again. This procedure is repeated until the submatrix includes the last row of X. In the backward step, the submatrix includes a fixed number of rows starting from the end of the data and moves backward. Finally, the logarithms of the first few eigenvalues (typically 4 or 5) are plotted against time, and the number of components is estimated from the plots.

3.2.13. Evolving principal components innovation analysis (EPCIA)

The EPCIA [46 – 49] approach is based on curve fitting using Kalman filters. The method assumes that two-dimensional chromatographic data can be modelled as

X = CS + E    (38)
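The bilinear model of Eq. (38) can be illustrated with the following sketch; the data are synthetic, and a batch least-squares fit is used here as a stand-in for the recursive Kalman-filter implementation described below:

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical bilinear data following Eq. (38): X = CS + E.
S = rng.random((3, 90))                       # pure spectra
C = rng.random((40, 3))                       # concentration profiles
X = C @ S + 0.001 * rng.standard_normal((40, 90))

def model_rms(X, C_sel):
    # Estimate spectra for the selected profiles by least squares and
    # take the RMS of the residual E = X - C_sel S_hat.
    S_hat = np.linalg.lstsq(C_sel, X, rcond=None)[0]
    return np.sqrt(np.mean((X - C_sel @ S_hat) ** 2))

# The RMS stays high while a component is missing from the model and
# drops to the noise level once all three profiles are included.
print(model_rms(X, C[:, :2]) > 10 * model_rms(X, C))   # True
```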

EPCIA is normally implemented in two stages: the first is the calculation of pure variables, which provides chromatographic profiles, and the second is the modelling of the data using these chromatograms. In the procedure, each chromatogram is projected onto the (N − 1) two-dimensional subspaces and fitted to a linear model recursively. Initially, a linear model consisting of one component is obtained and the prediction error is calculated. The prediction errors for the chromatogram at each spectral frequency are combined to give a root mean square error (RMS). A plot of RMS against time gives an indication of the number of components. If the RMS is high and shows peak-like structures, another component is required in the modelling. The procedure is repeated until the RMS becomes low and shows only random structure. The algorithm used to implement Kalman filters is described in more detail in Chapter 3 of Ref. [16].

3.2.14. Subspace comparison

Subspace comparison [50] compares two subspaces, each of which is described by a set of orthonormal vectors selected by a method suitable for variable selection, such as OPA or SIMPLISMA. Suppose two subspaces are defined as A = {a1, a2, . . ., aK} and B = {b1, b2, . . ., bK}. For example, A may be a matrix with 5 columns, each consisting of a profile obtained using SIMPLISMA for 5 pure components; B may be the corresponding matrix obtained from OPA. However, the components may be slightly different or in a different order, so these matrices will not necessarily be identical, although ideally they should contain similar information. Note that K must be the same for both matrices; the aim is to see whether, for example, the OPA construction using 3 components is similar to that obtained from SIMPLISMA; if not, this is likely to be the wrong number of components, so K is changed. The vectors in A and B are orthogonalized by the Schmidt procedure [35]. The next step is to calculate tr(K) as

tr(K) = Trace(A^T B B^T A)    (39)

where tr(K) varies between 0 and K; D(K), called the subspace discrepancy function, is calculated as follows:

D(K) = K − tr(K)    (40)

D(K) is a measure of that part of each subspace which lies in the orthogonal complement of the other; it becomes zero when the two subspaces are identical. Eigenvalues are also calculated on the (A^T B) matrix and utilised as

sin²(ϑK) = 1 − gK    (41)

where sin2(#K) is the largest principal angle. Ref. [50] compared D(K) and sin2(#K) as a measure of disagreement between the subspaces by plotting both values for K components. The true number of factors becomes equal to the largest value of K with D(K), or sin2(#K) close to zero.

4. Results and discussion

The results of the different methods are discussed in separate sections.

4.1. Eigenvalue, logarithm of eigenvalues and eigenvalue ratio

The methods using logarithms and eigenvalue ratios produced better results than the method using raw eigenvalues (see Section 3.2.1). The eigenvalue plots determined the correct rank in only three datasets (see Table 4). The plot of the logarithm of eigenvalues gave the correct rank for all datasets except datasets 5, 6 and 8, where it was difficult to find a true break between the values originating from the significant and nonsignificant factors. The eigenvalue ratio yielded similar results to the logarithm of eigenvalues, except for dataset 5 where it determined the correct rank. Fig. 1 is a plot of the logarithm of eigenvalues for dataset 5, where there is no indication of the data rank. For the same dataset, the results of the eigenvalue ratio method are shown in Fig. 2. The results for these three methods are compiled in Table 4.

4.2. Error indicator functions (INF)

It was explained in Section 3.2.2 that the rank was determined by plotting the error indicator functions against the number of components and then locating a break in the

Table 4
Results of the eigenvalues, logarithm of eigenvalues and eigenvalue ratio methods (components identified; – indicates that no clear break could be identified)

Dataset        1    2    3    4    5    6    7    8
True rank      3    3    3    3    3    4    7    8
Eigenvalue     3    2    3    2    2    3    7    5
log of EV      3    3    3    3    –    –    7    –
Ratio of EV    3    3    3    3    3    5    7    9

plots visually. Most of the error indicator functions yielded the correct rank. The RSD, RMS, IND, RPV (scree test) and Exner functions determined the number of components correctly in all datasets. The RSS function failed only for dataset 8, where there was no clear distinction between the last RSS calculated for a real component and the first RSS generated by noise. The plot of RSS against the number of components for dataset 8 is shown in Fig. 3. The IE function was least successful: it failed for datasets 2 and 4, where the noise level was high, identifying only two components in both datasets. The plots of the RSD and RMS functions against the number of components produced better discrimination between the significant and nonsignificant factors. Although the minima of the IND and IE functions should indicate the data rank, in our study the results based on the minimum function value were not correct: the IND function produced the correct rank at the function minimum only for the simulated datasets, while the IE function gave the correct rank by this criterion only for datasets 1 and 3. The results are summarised in Table 5.

Fig. 1. Plot of the logarithm of eigenvalues against the number of components for dataset 5. The data rank is not clear from this plot.

Fig. 2. Plot of the eigenvalue ratio against the number of components for dataset 5. A large change can be identified after component 3.

4.3. Ratio of derivatives of error indicator function (ROD)

The ratio of derivatives was calculated for the RSS, RSD, RMS, IE, IND, RPV and Exner functions. Although all of these functions except RSS and IE were successful in themselves, their ratios of derivatives were not as good. The ROD method failed for dataset 2 for all functions except IND; the results for all datasets are given in Table 6. The results were good in most cases. Apart from dataset 2, the other dataset that failed the test was dataset 5. In the majority of cases, the number of components was identified by a global maximum. For the IE and IND functions, this test showed a deviation for simulated and experimental data, which is apparent in Table 6. Overall, the ROD function gave good results, and it can be implemented as an automatic procedure and, in some cases, combined with a plot of the original function.

4.4. Ratio of RSD (RSDRatio)

Although most of the error indicator functions produced the correct rank, the rank-finding procedure is based on visual interpretation, which may include some personal bias. The ratio of RSD function (RSDRatio) was more sensitive in differentiating the significant and nonsignificant components in the plot than the error indicator functions. The result of this function is shown in Fig. 4 for dataset 8. It can be seen in the figure that the points of all nonsignificant factors lie almost on the same curve, on the right-hand side of the plot (indicated by crosses). The results of this function are summarised in Table 5.

4.5. F-test

The Malinowski F-test was performed on the data and the calculated F-values were compared with the F-inverse at the 99% confidence level. The test predicted correctly in all cases except for dataset 2, where it gave one component fewer. The Faber – Kowalski test gave correct results only with the simulated datasets, while with the experimental data it overestimated the number of components. It seems that the test is very sensitive and does not perform well in real situations, where there may be many sources of error and the error distribution is not normal. The modified Faber – Kowalski F-test was also used. Initially, the degrees of freedom were multiplied by 2

[h = 2 in Eq. (21)] and later with 10; in both cases, the results were not correct (Table 7).

4.6. Cross-validation

As with other methods, cross-validation can be applied first by inspection of plots, and then by comparison based on PRESS and RSS values. Visual inspection of the PRESS and RSS plots showed the true number of components except for dataset 8, but the results based on the ratio of PRESS and RSS gave the correct number of components only for the simulated datasets and predicted too many significant components in the experimental datasets. Therefore, cross-validation based on the visual inspection of PRESS and RSS plots leads to better results.

4.7. Morphological score (MS)

The morphological score was initially applied to the spectra obtained by OPA; the results were not correct because the OPA dissimilarity plots did not give the true number of components. It was also applied to the whole dataset, which excluded the OPA calculations. The results showed that the method worked only for the simulated data, although it provided good results when the number of components was estimated by visual inspection of the plot of morphological score against number of components, as shown in Figs. 5 and 6. Dataset 5 gave a correct result; the plot of dataset 6 showed the last two points, corresponding to real components, slightly above the MS values of noise. Dataset 7 gave six components: the first five MS values were well above the MS of the noise but the sixth was below the noise level; the seventh was higher than the noise level, thus estimating six components in total.

Fig. 3. RSS plot of dataset 8 (real dataset). The data rank is not clear in the plot.

Fig. 4. RSD(K)/RSD(K − 1) ratio of dataset 8 (real dataset). Significant factors are indicated by w.

4.8. Orthogonal projection approach (OPA)

OPA was applied on the object (time) and variable (chemical shift) dimensions. As discussed in Section 3.2.8, OPA produces dissimilarity plots, which can be utilised in different ways to identify the number of components. We discuss the results of these methods in Sections 4.8.1 – 4.8.3.

4.8.1. Dissimilarity plots

4.8.1.1. Object selection. All the simulated datasets provided correct results except dataset 1, which showed one more component. Of the real datasets, only datasets 5 and 6 produced the correct results, while datasets 7 and 8 gave more components than the true rank.

4.8.1.2. Variable selection. All the simulated datasets produced correct results except for dataset 2, in which one component fewer than the true rank was predicted. In all the real datasets, the number of significant components was overestimated. Most of the results using dissimilarity plots on the time dimension were not correct. Using the variable dimension, the results were totally wrong because, in the majority of cases, more components than expected were found, for the reasons explained in Section 3.2.9.3.

4.8.2. Durbin–Watson statistics (dw)

4.8.2.1. dw on object selection. The dw statistics produced correct results only with datasets 1 and 3, which are simulations. It found two components in datasets 2 and 4, which contained high noise levels. There was no clear indication of the number of components in the experimental datasets. Figs. 7 and 8 show the dw plots for dataset 8 using OPA data on the two different dimensions.

4.8.2.2. dw on variable selection. The values calculated by the Durbin–Watson statistics were plotted and the number of components determined from where there was a sharp increase in value. In most cases, fewer components were found than the true rank. The success of the dw test was high when it was used with dissimilarity values obtained using the variable dimension: it was successful in most cases, as shown in Table 7. There was no difference between simulated and real datasets from the success point of view.

4.8.3. Concentration profiles

Concentration plots provided good results in all cases. As OPA does not yield pure variables, some profiles

Table 5
Results of error indicator functions and the RSD ratio compiled together (components identified)

Dataset        1    2    3    4    5    6    7    8
True rank      3    3    3    3    3    4    7    8
RSS            3    3    3    3    3    4    7    7–8
RSD            3    3    3    3    3    4    7    8
RMS            3    3    3    3    3    4    7    8
IE             3    2    3    2    3    4    7    8
IND            3    3    3    3    3    4    7    8
RPV            3    3    3    3    3    4    7    8
Exner (ψ)      3    3    3    3    3    4    7    8
RSDRatio       3    3    3    3    3    4    7    8

appear that extend over two or more components. Fig. 9 shows the concentration profiles of dataset 1. The profiles at chemical shifts, δ, of 7.112 and 7.109 ppm belong to the fastest eluting compound, and similarly those at 6.722 and 6.726 ppm belong to the second eluting compound. The profile at 7.196 ppm arises from the slowest eluting compound, while the concentration plot obtained using 7.252 ppm starts with the middle eluting compound and ends with the slowest eluting compound; it can easily be identified as a combined profile originating from two close resonances (variables) of the middle and slowest eluting compounds together. After excluding the mixed profile, the presence of three components is confirmed in dataset 1. Datasets 2 and 3 show similar profiles but at different chemical shifts. Dataset 4 does not show any common resonance, and the concentration plots are very clean for the three compounds. In dataset 5, three components were correctly identified. For dataset 6, six concentration profiles were initially generated; the second variable (7.112 ppm) is not a pure variable and its concentration profile is indicative of two components, but it was not difficult to identify the true number of components correctly. Dataset 7 resulted in interesting plots: the first variable selected by OPA, at 1.337 ppm, is not a pure variable, and it belongs to the compounds eluting as the sixth and seventh components in the concentration plots (see Fig. 10, which shows the first nine concentration

Fig. 5. Morphological score (MS) on dataset 1 (three components). Three components can be easily discriminated (see horizontal line).

profiles). After deleting the first concentration profile (which corresponds to a combination of the sixth and seventh eluting compounds) and the last profile (which is the same as the third component), a total of seven components is suggested in the mixture. Dataset 8 exhibited similar behaviour: it also showed some variables corresponding to a combination of two pure components, but these were identified and the true dimensionality of the data was determined correctly. Thus, using concentration profiles following OPA was a successful approach for finding the true rank of the data in both simulated and real datasets.

4.9. SIMPLISMA

SIMPLISMA is very sensitive to noise if the noise level is high, as it often selects components from noise. The effect of noise can be reduced either by reducing the object dimension or by increasing the offset value. The latter procedure is not recommended because, in that case, SIMPLISMA will generate variables similar to OPA. The former procedure was therefore adopted, and chromatographic profiles were discarded at the start and at the end of all datasets. This significantly improved the results. SIMPLISMA was applied with an offset value of 15% on both dimensions.

Fig. 6. Morphological score (MS) on dataset 8 (eight components). Discrimination is not good; only seven components can be found (see horizontal line).


Fig. 7. Durbin–Watson statistics calculated on dissimilarity values obtained by OPA on the variable dimension using dataset 8 (eight components).

Like OPA, SIMPLISMA also makes use of several methods for determining the data rank; in our study, we used four such methods. The results of these methods are discussed in Sections 4.9.1 – 4.9.4.

4.9.1. Purity plots

4.9.1.1. Object selection. Datasets 1, 3, 5 and 6 gave the true rank, while datasets 2 and 4 gave one component fewer than the true rank. Datasets 7 and 8 did not provide any conclusive answers because the purity plots became very complex after six components were detected, so it was difficult to decide the real number of components.

4.9.1.2. Variable selection. From datasets 1 and 3, three components were predicted, and from datasets 2 and 4, two components were predicted. Datasets 5 and 6 produced the true rank. It was difficult to find the true number of components in datasets 7 and 8, where more components appeared than the true rank. Purity plots behaved in a similar manner to OPA dissimilarity plots; that is, both methods yielded an incorrect rank in the majority of datasets. Like OPA, purity plots also failed in the case of simulated datasets.

Fig. 9. Concentration plots of the first six variables selected by OPA using dataset 1 (simulated data).

4.9.2. Residual sum of squares using SIMPLISMA

4.9.2.1. Object selection. Datasets 1, 3, 5 and 6 gave the correct results. The remainder predicted a lower rank than the true one. These results suggest that rssq provides good results in most cases. In the present study, rssq plots based on variable selection gave better results than plots based on the object dimension.

4.9.2.2. Variable selection. The rssq gave wrong results for dataset 2, and dataset 6 did not produce good evidence for the three components; there was some indication of the existence of a fourth component. Similarly, for dataset 8, either seven or nine components were suggested; there was a slight decrease in the rssq value when the ninth component was selected. The rssq plot of dataset 8 can be seen in Fig. 11.

4.9.3. Durbin–Watson statistics (dw)

4.9.3.1. Object selection. Durbin–Watson gave correct results only for dataset 3; for the rest of the datasets, the results were incorrect.

Fig. 8. Durbin–Watson statistics calculated on dissimilarity values obtained by OPA on the time dimension using dataset 8 (eight components).

Fig. 10. Concentration plots of the first nine variables selected by OPA using dataset 7 (real data).


Fig. 11. Residual sum of squares (rssq) calculated by SIMPLISMA on the time dimension using dataset 8 (eight components).

4.9.3.2. Variable selection. Durbin–Watson was totally unable to provide an answer about the number of components. The plots were without minima and did not show any rise after the true number of components.

Fig. 13. Concentration plots of the first five variables selected by SIMPLISMA using dataset 5 (real data).

4.9.4. Concentration profiles

When OPA and SIMPLISMA are applied to the variable dimension, in most cases they find more components than the true number, as estimated by the dissimilarity or purity plots. In Section 4.8.3, it was shown that concentration profiles proved successful with OPA. The concentration profiles become complicated when the number of components is large, and require a careful inspection of the plots. Fig. 12 shows the concentration plots of the first four variables of dataset 1 obtained by SIMPLISMA. It is easy to identify that the fourth component is the same as the third component. Furthermore, this can be confirmed by calculating the correlation coefficient between profiles 3 and 4, which is greater than 0.9, suggesting that both profiles are very similar. Concentration profiles obtained by SIMPLISMA on the time dimension gave correct results in all cases except dataset 5, where four components were suggested, as shown in Fig. 13. It can be seen that the fourth component is approximately similar to the third component but, due to the high level of noise present in these profiles, this is not very clear; the fifth component is the same as the first component. Concentration profiles of dataset 7 are given in Fig. 14. Among these, the profiles obtained using chemical shifts, δ, at 7.177 and 7.182 ppm belong to the same component; the profiles from the chemical shifts at 2.450 and 2.464 ppm are also from a single compound, while the profile at 1.323 ppm is a combined profile of components 6 and 7. On deleting these profiles, seven components remain, which is the correct number for this dataset. Similarly, eight components were identified in dataset 8; Fig. 15 shows the concentration profiles for dataset 8. As the concentration profiles selected by SIMPLISMA contain more noise than those selected by OPA, the identification process is more difficult and, sometimes, an incorrect number of components is identified, as in the case of dataset 5.

Fig. 12. Concentration plots of the first four variables selected by SIMPLISMA using dataset 1 (simulated data).

Fig. 14. Concentration plots of the first ten variables selected by SIMPLISMA using dataset 7 (real data).


Fig. 16. EPCIA on dataset 6 (four components): the fourth plot shows a peak-like structure.

simulated datasets. This method did not perform well in one of our previous studies [5] of a three-component mixture, which could be due to a high level of noise.

Fig. 15. Concentration plots of the first eleven variables selected by SIMPLISMA using dataset 8 (real data).

4.10. Correlation plots In our present study, correlation plots did not provide a suitable approach in finding the true number of components, and were successful only in one case, which was dataset 6. It gave one component less than the true rank in the simulated datasets and datasets 5 and 7. In dataset 8, this method finds two components less than the true rank. Therefore, it suggests that this technique is not very sensitive and underestimates the true number of components. Correlation plots work well when chromatographic resolution is high, although the data-preprocessing step removes most of the noise but not completely. Noise is removed from those frequencies where there appears no peak from all the components under investigation. 4.11. Derivative plots In the same way as correlation plots, derivatives yielded incorrect results. Datasets 1, 3, 4 and 5 produced one component less, and dataset 2 showed two components less than the true rank. The derivatives in the present study were successful only in the case of dataset 6. The method found six components in datasets 7 and seven components in dataset 8. The presence of noise and low resolution appears to be the possible reason for the failure of the method.

4.13. Evolving principal component innovation analysis (EPCIA) In EPCIA, the number of components can be found by looking at the RMS plots, their shape and the changes in the RMS plots after selecting different numbers of components. In our study, we used OPA and SIMPLISMA data as an input to EPCIA. Datasets 1, 3 and 4 gave similar OPA and SIMPLISMA variables. Therefore, the results of EPCIA were same and correct for these samples. We discuss results of other datasets in Sections 4.13.1 and 4.13.2. 4.13.1. OPA selected variables Dataset 2 suggests four significant components, and, in dataset 5, there were three significant components. There was a small peak in the third plot of dataset 5, which could be assigned to a fourth component but this peak persisted even after the addition of fourth profile. Therefore, this method suggests that dataset 5 contains only three components. Dataset 6 showed four components by both methods (OPA and SIMPLISMA). It also behaved similarly to dataset 5 and the last plot did not show complete randomness as shown in Figs. 16 and 17. For dataset 7, 10 OPA variables were used as input; among those, four appeared to have small impact on the reduction of RMS values. If those variables were regarded as impure or insignificant, this suggests that dataset 7 consists of six significant compo-

4.12. Evolving principal component analysis (EPCA) EPCA – EW was another successful method. This method was not only successful in all cases but this method was the easiest in interpretation of its plots. The results were correct in all cases. The method of FSW – EFA was successful only in two cases, datasets 6 and 7. In all other cases, it produced one component less than the true rank, even with the

Fig. 17. EPCIA on dataset 6 (four components) after using five variables.

M. Wasim, R.G. Brereton / Chemometrics and Intelligent Laboratory Systems 72 (2004) 133–151

149

4.14. Subspace comparison

Fig. 18. Subspace discrepancy function D(K) (diamonds) and sin2(#K) (stars) for PCA scores and OPA variables using dataset 2.

nents. A similar observation was made for dataset 8, yielding seven components when 10 OPA variables were modelled. 4.13.2. SIMPLISMA selected variables Datasets 2, 5 and 6 gave correct results. EPCIA proved to be difficult when the number of components becomes large. It has been found that, in the analysis, some specific variables behave less effectively. One such variable can be located in dataset 7, which is the fourth variable. If this is removed and EPCIA is performed again with six variables, the results are almost the same when seven components were modelled. Thus, in this case, it can be concluded that there are six true components in this mixture. Dataset 8 produces randomness in the RMS plot after eight variables have been used. Again in this case, the fourth variable has very low influence on the overall reduction of the RMS value or the shape of RMS plot. If this is removed and EPCIA is performed again with seven variables, the results are the same. These observations lead to a conclusion that some variables, which show a small change in the reduction of RMS values, appear to be less effective and cannot be taken as a new component. Finally, a small peak-like structure was seen in the last RMS plot of all experimental datasets, which indicates that either the variables are not pure or the variables are not sufficient to model the data.

Subspace comparison produced promising results. PCA scores and variables from OPA and SIMPLISMA were used as input for this test. The three methods produced three combinations of variables, namely, PCA scores and SIMPLISMA variables, PCA scores and OPA variables, and SIMPLISMA variables and OPA variables. It is difficult to correctly interpret the results from the description given in reference [50] because the criterion is a qualitative one. Neither of the combinations could identify the correct number of components in dataset 2; there is some indication about the third component in the subspace discrepancy plot using scores from PCA scores and OPA variables, as shown in Fig. 18. A similar situation existed in other cases; for example, dataset 6 appears to consist of either three or four components (the third component has lower value than the fourth component) using variables from SIMPLISMA and OPA. Best results and better discrimination were obtained using PCA scores and OPA variables, and PCA scores and SIMPLISMA variables. Subspace comparison, although relying on other methods for variable selection, is simple to implement and fast to use.
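Subspace comparison of this kind can be sketched with principal angles between the subspace spanned by the PCA scores and that spanned by the selected variables. The snippet below uses the standard SVD route to principal angles; it is an illustrative stand-in, and the exact discrepancy function D(K) of reference [50] may be defined differently, so the function name and the use of sin² of the largest principal angle are assumptions.

```python
import numpy as np

def subspace_discrepancy(A, B):
    """sin^2 of the largest principal angle between span(A) and span(B):
    0 for identical subspaces, 1 when some direction is orthogonal."""
    Qa, _ = np.linalg.qr(A)  # orthonormal bases for the two subspaces
    Qb, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    c = np.clip(cosines.min(), -1.0, 1.0)  # cosine of the largest angle
    return 1.0 - c ** 2

rng = np.random.default_rng(0)
scores = rng.standard_normal((50, 3))  # e.g. the first K PCA score vectors
same_span = scores @ rng.standard_normal((3, 3))  # new basis, same span
print(subspace_discrepancy(scores, same_span))  # close to 0
```

Plotting such a discrepancy against K and looking for the value of K at which it rises sharply mirrors the qualitative criterion discussed above.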

5. Summary of results

The results of all the methods are compiled in Tables 4–7. Five methods can be selected as the best at providing the true rank: first, a set of error indicator functions (RSD, RMS, IND, RPV and the Exner function); second, ROD on the IND function; third, the RSD ratio; fourth, concentration plots using OPA; and, finally, EPCA–EW. However, all of these methods except ROD on the IND function rely on visual interpretation of various plots. Because the visual interpretation of plots may vary from person to person, the final results may be prone to some bias and are hard to automate. Dataset 2 has the highest level of noise and the lowest resolution.
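Since ROD on the IND function is the one readily automated criterion in this list, it is worth making the calculation explicit. The sketch below computes Malinowski's indicator function, IND(n) = RE(n)/(c − n)², with RE the real error obtained from the residual eigenvalues [2,17], and a ratio-of-derivatives sequence in the spirit of Elbergali et al. [21]. It is a generic numpy illustration, not the implementation used here, and the exact ROD convention in the paper may differ.

```python
import numpy as np

def ind_values(X):
    """Malinowski's indicator function IND(n), n = 1 .. c-1, from the
    eigenvalues of X^T X for an r x c data matrix (r >= c)."""
    r, c = X.shape
    g = np.linalg.svd(X, compute_uv=False) ** 2
    re = np.sqrt([g[n:].sum() / (r * (c - n)) for n in range(1, c)])
    return re / (c - np.arange(1, c)) ** 2

def rod(f):
    """Ratio of successive derivatives: ROD(n) = (f(n-1)-f(n)) / (f(n)-f(n+1))."""
    return (f[:-2] - f[1:-1]) / (f[1:-1] - f[2:])

# Rank-3 bilinear data plus a little noise: IND is smallest at n = 3.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 15))
X += 1e-4 * rng.standard_normal(X.shape)
ind = ind_values(X)
print(np.argmin(ind) + 1)  # -> 3
```

The rank estimate can then be read from the minimum of IND, or from extrema in the ROD sequence, without any visual inspection.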

Table 6
Results of ROD on different error indicator functions (components identified)

Function    Dataset 1  Dataset 2  Dataset 3  Dataset 4  Dataset 5  Dataset 6  Dataset 7  Dataset 8
True rank       3          3          3          3          3          4          7          8
RSS             3          2          3          3          2          4          7          8
RSD             3          2          3          3          3          4          7          8
RMS             3          2          3          3          3          4          7          8
IE              3a         2b         3a         2b         3b         4          7          8
IND             3b         3b         3b         3b         3          4          7          8
RPV             3          2          3          3          2          4          7          8
w               3          2          3          3          3          4          7          8

a First minimum in ROD plot and function minimum.
b Global minimum in ROD plot and function minimum.


Table 7
Results of all methods except eigenvalue-based methods (components identified)

Indicator                        Dataset 1  Dataset 2  Dataset 3  Dataset 4  Dataset 5  Dataset 6  Dataset 7  Dataset 8
True components                      3          3          3          3          3          4          7          8

Malinowski's F-test (α = 0.01)       3          2          3          3          3          4          7          8
Faber–Kowalski F-test                3          3          3          3          >3         >4         >7         >8
Modified FK F-test (h = 2)           3          3          3          3          >3         >4         >7         >8
Modified FK F-test (h = 10)          3          2          3          3          3          4          8          11
Morphological score                  3          3          3          3          3          4          6          7
Cross-validation                     3          3          3          3          6          5          8          10

OPA
Dissimilarity plots (time dim.)      4          3          3          3          4          4          9          10
Dissimilarity plots (var. dim.)      3          2          3          3          4          5          >7         >8
Durbin–Watson (time dim.)            3          2          3          2          3          2          –          –
Durbin–Watson (var. dim.)            3          3          4          3          2          4          7          8
Concentration profiles               3          3          3          3          3          4          7          8

SIMPLISMA
Purity plots (time dim.)             3          2          3          2          3          4          9          >8
Purity plots (var. dim.)             3          2          3          2          3          4          7–8        >8
Durbin–Watson (time dim.)            –          –          3          –          2          2          7          10
Durbin–Watson (var. dim.)            –          –          –          –          2          3          7          6
RSS (time dim.)                      3          2          3          2          3          4          6          6
RSS (var. dim.)                      3          2          3          3          3          3 or 4     7          6
Concentration profiles               3          3          3          3          4          4          7          8
Correlation plot                     2          2          2          2          2          4          6          6
Derivative plot                      2          1          2          1 or 2     2          4          6–7        6

EPCA
Expanding window                     3          3          3          3          3          4          7          8
Fixed size moving window             2          2          2          2          2          4          7          8

EPCIA
OPA variables                        3          4          3          3          3          4          6          7
SIMPLISMA variables                  3          3          3          3          3          4          6          7

Subspace comparison
SIMPLISMA and OPA                    3          2          3          2          3          3          7          8
PCA scores and SIMPLISMA             3          2          3          3          3          4          7          8
PCA scores and OPA                   3          2          3          3          3          4          7          8

If the results of this dataset are omitted, then RSS, Malinowski's F-test, concentration profiles using SIMPLISMA and subspace comparison with PCA scores also perform well. The methods that least often predicted the correct rank were the Durbin–Watson statistic on SIMPLISMA variables, correlation plots and derivative plots. The rank of dataset 3 was correctly determined by most of the methods, primarily because of its good resolution and well-defined noise distribution. The relative effectiveness of the methods for LC–NMR data differs quite strongly from that for LC–DAD because the spectra have different characteristics. In NMR there are often quite distinct and well-defined spectral peaks, whereas in UV/vis spectroscopy the peaks tend to be much broader and less well defined. In addition, the noise levels in NMR are often much higher than in UV/vis. In LC–NMR it is not easy to eliminate noise by preprocessing, because there are always regions that contain resonances in the spectrum of one compound and noise in the spectrum of another, so noise regions are retained even after variable selection. There can also be slight drifts in peak positions in LC–NMR during recording which, because of the relatively sharp nature of the peaks, influence the effectiveness of certain methods; this problem is not usually encountered in LC–DAD. Hence, methods such as correlation coefficients and derivatives, which are very effective in LC–DAD because of the smooth and relatively noise-free nature of the data, are less useful in LC–NMR, making the technique a challenge for the chemometrician. Automation is a desirable feature for any algorithm. Only two of the successful methods can readily be automated: Malinowski's F-test and ROD on the IND function. ROD was successful for all datasets using the IND function, while Malinowski's F-test failed for dataset 2, which had high noise and low resolution.
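Malinowski's F-test lends itself to automation because it reduces to comparing reduced eigenvalues, REV_n = g_n/((r − n + 1)(c − n + 1)), against a pooled error estimate with an F(1, s − n) statistic [22,27]. The following numpy/scipy sketch illustrates the usual formulation; details such as the stopping rule are assumptions and may differ from the exact variant used in this work.

```python
import numpy as np
from scipy.stats import f as f_dist

def malinowski_rank(X, alpha=0.01):
    """Estimate the number of significant components by F-testing
    reduced eigenvalues from the smallest factor upwards."""
    r, c = X.shape
    s = min(r, c)
    g = np.linalg.svd(X, compute_uv=False) ** 2
    j = np.arange(1, s + 1)
    rev = g / ((r - j + 1) * (c - j + 1))  # reduced eigenvalues
    for n in range(s - 1, 0, -1):  # compare factor n with the pool n+1 .. s
        F = rev[n - 1] / rev[n:].mean()
        if f_dist.sf(F, 1, s - n) < alpha:
            return n  # first factor significant against the error pool
    return 0

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 8))
X += 1e-3 * rng.standard_normal(X.shape)
print(malinowski_rank(X))
```

On well-behaved, roughly homoscedastic noise such as this synthetic rank-3 matrix the test recovers the true rank; as noted above, it can fail when the noise level is high, as for dataset 2.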


Because artefacts make it difficult to identify the correct rank of the data, all methods that depend on eigenvalues are strongly influenced by the presence of artefacts. Moreover, interpretation becomes difficult when the noise level is high. Successful methods that do not rely on eigenvalues are OPA dissimilarity plots, SIMPLISMA purity plots and concentration plots; among these, concentration plots using OPA variables were the most successful. Most of the methods estimated fewer components than the true rank, except the log of eigenvalues, eigenvalue ratios, the FK F-test, the modified FK test, cross-validation, OPA and SIMPLISMA, and subspace comparison (only with SIMPLISMA and OPA data), which gave more.

Acknowledgements M.W. wishes to thank the Ministry of Science and Technology, Government of Pakistan for their funding and the support of PAEC, Pakistan. We thank Drs. C. Airiau and M. Murray for help with the experimental work.

References

[1] F.C. Sanchez, B.V.D. Bogaert, S.C. Rutan, D.L. Massart, Chemom. Intell. Lab. Syst. 34 (1996) 139–171.
[2] E.R. Malinowski, Factor Analysis in Chemistry, third ed., Wiley, New York, 1991.
[3] M. Meloun, J. Capek, P. Miksik, R.G. Brereton, Anal. Chim. Acta 423 (2000) 51–68.
[4] J. Toft, Chemom. Intell. Lab. Syst. 29 (1995) 189–212.
[5] M. Wasim, M.S. Hassan, R.G. Brereton, Analyst 128 (2003) 1082–1090.
[6] H. Shen, C.Y. Airiau, R.G. Brereton, Chemom. Intell. Lab. Syst. 62 (2002) 61–78.
[7] H. Shen, C.Y. Airiau, R.G. Brereton, J. Chemom. 16 (2002) 165–175.
[8] H. Shen, C.Y. Airiau, R.G. Brereton, J. Chemom. 16 (2002) 469–481.
[9] J.C. Hoch, A.S. Stern, NMR Data Processing, Wiley-Liss, New York, 1996.
[10] S. Wold, K. Esbensen, P. Geladi, Chemom. Intell. Lab. Syst. 2 (1987) 37–52.
[11] H. Wold, in: J. Gani (Ed.), Soft Modelling by Latent Variables: the Nonlinear Iterative Partial Least Squares (NIPALS) Algorithm, Perspectives in Probability and Statistics, Academic Press, London, 1975, pp. 117–142.
[12] T.M. Rossi, I.M. Warner, Anal. Chem. 54 (1986) 810–815.
[13] N. Ohta, Anal. Chem. 45 (1973) 553–557.
[14] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 1986.
[15] R.F. Hirsch, G.L. Wu, Chemom. Intell. Lab. Syst. 1 (1987) 265–272.
[16] R.G. Brereton, Chemometrics: Data Analysis for the Laboratory and Chemical Plant, Wiley, Chichester, 2003.
[17] E.R. Malinowski, Anal. Chem. 49 (1977) 612–617.
[18] R.B. Cattell, Multivariate Behav. Res. 1 (1966) 245–276.
[19] H.H. Kindsvater, P.H. Weiner, T.J. Klingen, Anal. Chem. 46 (1974) 982–988.
[20] O. Exner, Collect. Czech. Chem. Commun. 31 (1966) 3222–3251.
[21] A. Elbergali, J. Nygren, M. Kubista, Anal. Chim. Acta 379 (1999) 143–158.
[22] E.R. Malinowski, J. Chemom. 3 (1988) 49–60.
[23] E.R. Malinowski, J. Chemom. 4 (1990) 102.
[24] K. Faber, B.R. Kowalski, Anal. Chim. Acta 337 (1997) 57–71.
[25] N.K.M. Faber, Comput. Chem. 23 (1999) 565–570.
[26] J. Mandel, Technometrics 13 (1971) 1–18.
[27] E.R. Malinowski, J. Chemom. 13 (1999) 69–81.
[28] S. Wold, Technometrics 20 (1978) 397–405.
[29] D.W. Osten, J. Chemom. 2 (1988) 39–48.
[30] H. Shen, L. Stordrange, R. Manne, O.M. Kvalheim, Y. Liang, Chemom. Intell. Lab. Syst. 50 (2000) 37–47.
[31] F.C. Sanchez, J. Toft, B. Van den Bogaert, D.L. Massart, Anal. Chem. 68 (1996) 79–85.
[32] W. Windig, C.E. Heckler, F.A. Agblevor, R.J. Evans, Chemom. Intell. Lab. Syst. 14 (1992) 195–207.
[33] N.R. Draper, H. Smith, Applied Regression Analysis, third ed., Wiley, New York, 1998.
[34] S. Gourvénec, D.L. Massart, D.N. Rutledge, Chemom. Intell. Lab. Syst. 61 (2002) 51–61.
[35] G. Strang, Linear Algebra and its Applications, third ed., Harcourt Brace Jovanovich, Orlando, 1998.
[36] A. Savitzky, M.J.E. Golay, Anal. Chem. 36 (1964) 1627–1639.
[37] J. Steinier, Anal. Chem. 44 (1972) 1906–1909.
[38] K. Kavianpour, R.G. Brereton, Analyst 123 (1998) 2035–2042.
[39] S. Dunkerley, R.G. Brereton, J. Crosby, Chemom. Intell. Lab. Syst. 48 (1999) 99–119.
[40] M. Maeder, Anal. Chem. 59 (1987) 527–530.
[41] B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. De Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Part B, Elsevier, Amsterdam, 1998, pp. 274–278.
[42] H.R. Keller, D.L. Massart, Anal. Chim. Acta 246 (1991) 379–390.
[43] H.R. Keller, D.L. Massart, J.O. De Beer, Anal. Chem. 65 (1993) 471–475.
[44] J. Toft, O.M. Kvalheim, Chemom. Intell. Lab. Syst. 19 (1993) 65–73.
[45] F. Cuesta Sanchez, J. Toft, O.M. Kvalheim, D.L. Massart, Anal. Chim. Acta 314 (1995) 131–139.
[46] S.C. Rutan, J. Chemom. 1 (1987) 7–18.
[47] G.H. Webster, T.L. Cecil, S.C. Rutan, J. Chemom. 3 (1988) 21–32.
[48] S.J. Vanslyke, P.D. Wentzell, Anal. Chem. 63 (1991) 2512–2519.
[49] P.D. Wentzell, S.G. Hughes, S.J. Vanslyke, Anal. Chim. Acta 307 (1995) 459–470.
[50] H. Shen, Y. Liang, O.M. Kvalheim, R. Manne, Chemom. Intell. Lab. Syst. 51 (2000) 49–59.