Analytica Chimica Acta, 263 (1992) 29-36
Elsevier Science Publishers B.V., Amsterdam
Evolving factor analysis in the presence of heteroscedastic noise
H. R. Keller and D. L. Massart *
Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels (Belgium)
Y. Z. Liang and O. M. Kvalheim
Department of Chemistry, University of Bergen, N-5007 Bergen (Norway)
(Received 13th December 1991)
Abstract
Evolving factor analysis (EFA) is a promising method for the analysis of multivariate data with an intrinsic order. When applying EFA for assessment of peak homogeneity in liquid chromatography, one has to be aware of instrumental and experimental difficulties. Heteroscedasticity is one of the most serious problems and leads to additional eigenvalues that may be misinterpreted as being due to an impurity. After appropriate data pretreatment, the fixed-size window EFA technique proved successful for peak purity control in liquid chromatography with photodiode-array detection. Less than 1% of a spectrally similar impurity could be detected for $R_s$ values as low as 0.3.
Keywords: Liquid chromatography; Evolving factor analysis; Heteroscedastic noise; Peak purity control
Ensuring peak purity in liquid chromatography (LC) is one of the most general and demanding problems. In pharmaceutical research, for instance, one has to make sure that the chromatographically separated peaks consist of only one analyte. In such a situation, chemically related substances with similar UV spectra, such as degradation products or isomers, can be considered as impurities. To assess peak homogeneity, less than 1% of such impurities needs to be detected even if the chromatographic separation is small. Different techniques have been developed for peak purity control in LC, the best known being the ratiogram method. Most of these methods rely on coupling LC with diode-array detection (DAD). Such an instrument produces what can be considered as a data table where the rows represent spectra taken at different times and the columns represent chromatograms measured at different wavelengths. Many techniques for assessment of peak purity in LC-DAD work on selected spectra or chromatograms only, but the most promising methods are based on latent variables and make use of the full spectral and chromatographic data. An example is evolving factor analysis (EFA) [1-3]. Although EFA seems to be encouraging, one has to be aware of some practical limitations. Sloping and non-zero baselines, calibration graph non-linearity and the DAD scan time may lead to problems, as discussed elsewhere [4,5]. Changes in the equilibrium of the mobile phase could change the spectra of the analytes, thus the separation system may also cause difficulties. The most serious problem in practice, however, seems to be heteroscedasticity, and this will be discussed here.
0003-2670/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved
HETEROSCEDASTICITY
Whereas for homoscedastic data the measurement error is independent of the signal, i.e., the absolute error is constant, heteroscedastic data are characterized by non-uniform variance along the calibration graph [6]. In the latter instance the absolute error is not constant but a function of the analytical signal. Heteroscedasticity occurs, for instance, in many spectroscopic techniques, where the variance of the signal increases as the signal itself becomes larger. This has often been noted in connection with calibration, where it sometimes leads to the use of weighted regression or of certain transformations that restore homoscedasticity. While the relative noise can be constant for some analytical techniques, there does not need to be a linear relationship between noise and signal or concentration. Assuming the relative rather than the absolute error to be constant, however, frequently yields better results.
To illustrate the phenomenon of heteroscedasticity, a solution of a pharmaceutical drug was measured at stopped flow with DAD and twenty spectra were recorded consecutively. The mean of these measurements estimates the spectrum of the drug while the standard deviation can be interpreted as the error (Fig. 1). If the data were homoscedastic, the noise should be represented as a more or less horizontal line. In the present instance, however, there is an almost linear relationship between noise and signal. Assuming that the individual diodes behave similarly, one can state that Fig. 1 proves the occurrence of heteroscedasticity for the given system. Additionally, one can observe some homoscedastic noise above 315 nm, where the analyte does not absorb.
Applying latent variable methods such as those based on EFA to heteroscedastic data, one will observe more significant principal components (PCs) than expected. This is illustrated in Fig. 2, where one substance was measured at two wavelengths with equal molar absorption coefficients and where the signals observed at the corresponding diodes, $y_1$ and $y_2$, are plotted against each other. With homoscedastic data, such a situation can be described with one PC, having a slope of about unity and passing close through the origin. For heteroscedastic data, however, two PCs are required for a complete description of such a system. One should remember that two PCs generally mean that there are two underlying factors (e.g., two chemical species), which is of course not true and which therefore leads to problems when applying techniques based on latent variables.
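The replicate experiment described above (twenty consecutive spectra, their mean taken as the signal and their standard deviation as the noise) can be mimicked numerically. The following is a minimal sketch under stated assumptions, not the authors' procedure: the Gaussian band shape, the wavelength grid and the two noise levels `REL_NOISE` and `ABS_NOISE` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical spectrum: a single Gaussian absorption band (assumption).
wavelengths = np.linspace(195.0, 365.0, 35)
spectrum = np.exp(-0.5 * ((wavelengths - 240.0) / 20.0) ** 2)

# Assumed noise model: a part proportional to the signal (heteroscedastic)
# plus a small constant part (homoscedastic). Both levels are made up.
REL_NOISE = 1e-3
ABS_NOISE = 1e-5
n_replicates = 20
noise_sd = REL_NOISE * spectrum + ABS_NOISE
replicates = spectrum + noise_sd * rng.standard_normal(
    (n_replicates, wavelengths.size)
)

# Mean estimates the signal, standard deviation estimates the noise.
mean_signal = replicates.mean(axis=0)
noise = replicates.std(axis=0, ddof=1)

# For homoscedastic data the noise would not follow the signal; here it does.
corr = np.corrcoef(mean_signal, noise)[0, 1]
print(f"correlation between signal and noise: {corr:.2f}")
```

A strong positive correlation between `mean_signal` and `noise` is the numerical counterpart of the almost linear relationship between noise and signal discussed for the measured data, while the residual noise where the band does not absorb plays the role of the homoscedastic part.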
Fig. 1. Mean, i.e., signal (solid line), and standard deviation, i.e., noise (dashed line), of twenty consecutively recorded spectra. Axes: wavelength [nm] (abscissa), noise (ordinate).
To study the effect of heteroscedasticity on latent variable methods for peak purity control in LC-DAD, the recently developed fixed-size window EFA technique (FSW) [3,7] was chosen, because it proved to be successful on simulated data. This method works as follows. A defined number of consecutive spectra are factor analysed and their eigenvalues (EVs) are determined. Suppose one first analyses spectra 1-7 and calculates the EVs of the window. Next, one will determine the EVs of spectra 2-8 and so on until the whole data table has been analysed. The EVs or, for reasons of presentation, more frequently their logarithms, log EV, are then plotted against analysis time. The points describing the first EV are connected with a line; analogously, those for the second are connected, and so on. For a simulated pure peak with homoscedastic noise this would result in Fig. 3, where the noise level corresponds to log EV = -5.2. The signal above this level is due to the underlying chemical species. In principle, the number of detectable substances is equal to the number of EVs above the noise level and vice versa. Theoretically, a direct determination of the number of substances in a certain time interval is possible. In peak purity control with LC-DAD, any second peak would therefore be thought to be due to an impurity. In practice, however, it has been found that for certain reasons the number of EVs above the noise can exceed the number of chemical species, and that results obtained with latent variable methods must be interpreted with care [5,8].
Fig. 2. (a) Homoscedastic and (b) heteroscedastic data. Adapted from [5]. Axes: $y_1$ (abscissa), $y_2$ (ordinate).
Fig. 3. Log EV plotted vs. analysis time (a) for a pure peak with homoscedastic and (b) for heteroscedastic data where some homoscedastic noise has been added, as obtained for simulated data.
As heteroscedasticity causes artefacts in methods such as EFA by introducing additional eigenvectors, the FSW technique as such will not work properly for heteroscedastic data, as illustrated in Fig. 3b. One observes that there are many signals even for a pure peak, while only the first EV is
due to an underlying compound. The additional signals can be seen as artefacts due to heteroscedasticity and should not be mistaken as being due to other chemical species. The difficulty is, of course, to distinguish between signals due to underlying chemical species and signals due to artefacts such as those caused by non-uniform variance.
A common means of coping with heteroscedastic data consists in transformation to homoscedasticity. The variance is frequently proportional either to the signal or to its square root; combinations of these possibilities can also be observed. With univariate data, the signals $y$ can be transformed into homoscedastic data $y^*$; the techniques to be applied depend on the relationship between the variance $s_y^2$ and $y$. If $s_y^2$ is proportional to $y^2$, the transformation
$$y^* = \log y \tag{1}$$
should be carried out, and if $s_y^2$ is proportional to $y$ one should use the transformation
$$y^* = \sqrt{y} \tag{2}$$
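The effect of Eqns. 1 and 2 on univariate data can be checked with a short simulation. This is only an illustrative sketch: the three signal levels, the noise constants and the number of draws are arbitrary assumptions, and the natural logarithm is used for Eqn. 1 (the base only rescales the result).

```python
import numpy as np

rng = np.random.default_rng(1)
levels = np.array([1.0, 10.0, 100.0])  # hypothetical signal levels
n_draws = 20000

def spread_after(transform, sd_of_level):
    """Standard deviation of transform(y) at each signal level."""
    result = []
    for level in levels:
        y = level + sd_of_level(level) * rng.standard_normal(n_draws)
        result.append(transform(y).std(ddof=1))
    return np.array(result)

def identity(y):
    return y

# Case 1: variance proportional to y^2 (standard deviation = 2 % of y).
sd_raw_prop = spread_after(identity, lambda m: 0.02 * m)
sd_log = spread_after(np.log, lambda m: 0.02 * m)            # Eqn. 1

# Case 2: variance proportional to y (standard deviation = 0.02 * sqrt(y)).
sd_raw_sqrt = spread_after(identity, lambda m: 0.02 * np.sqrt(m))
sd_sqrt = spread_after(np.sqrt, lambda m: 0.02 * np.sqrt(m))  # Eqn. 2

print("raw, variance ~ y^2 :", sd_raw_prop)
print("after log           :", sd_log)
print("raw, variance ~ y   :", sd_raw_sqrt)
print("after square root   :", sd_sqrt)
```

In both cases the spread of the raw data grows with the signal level, whereas the spread of the transformed data is nearly the same at all three levels, which is what transformation to homoscedasticity means here.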
Although these transformations can work well with univariate data, they cannot easily be applied to multivariate data such as those obtained from LC-DAD. Applying Eqn. 1, for instance, may transform data to homoscedasticity but it will at the same time introduce new artefacts. Suppose two noiseless spectra, $A$ and $B$, are taken from a pure peak. If Beer's law holds, the two spectra are proportional, i.e., $A = kB$. From the relationship $\log A = \log k + \log B$, it follows that $\log A$ is not proportional to $\log B$. Logarithmic transformation therefore introduces non-linearity and more than one PC will be necessary to describe such a system, which is not what one wants. One should remember that one PC should refer to one underlying chemical factor and vice versa. As $\sqrt{A} = \sqrt{kB} = \sqrt{k}\sqrt{B}$, it follows that $\sqrt{A}$ is proportional to $\sqrt{B}$ and Eqn. 2 can be expected to work also for multivariate data for situations where $s_y^2$ is proportional to $y$. For many analytical procedures, however, the noise is proportional to the signal rather than to its square root. As Eqn. 2 will only work for the latter case and Eqn. 1 cannot be applied, we utilized a method that affects all variables in the same way while keeping the system linear, as proposed by Liang et al. [9]. This technique consists in multiplication of all signals of a spectrum with a constant according to
$$y_{t\lambda}^{*} = y_{t\lambda}\left(\sum_{i=1}^{n} y_{ti}\right)^{-1} \quad \text{if } \sum_{i=1}^{n} y_{ti} \ge z \tag{3a}$$
and
$$y_{t\lambda}^{*} = y_{t\lambda} \quad \text{if } \sum_{i=1}^{n} y_{ti} < z \tag{3b}$$
where $y_{t\lambda}$ is the signal measured at time $t$ and wavelength $\lambda$, $n$ is the number of wavelengths and $y_{t\lambda}^{*}$ is the corrected signal. In order not to increase the noise at small absorbance values, such as in regions of the chromatogram where no substance elutes, only those spectra above threshold $z$ must be corrected. $\sum_{i=1}^{n} y_{ti}$ in the correction term is the total signal of one spectrum, which is proportional to $y_{t\lambda}$. For situations where the error is proportional to the signal, division of $y_{t\lambda}$ by the proportionality constant $\sum_{i=1}^{n} y_{ti}$ makes the noise of the standardized signals $y_{t\lambda}^{*}$ independent of the signal. In other words, Eqn. 3 transforms the data to homoscedasticity.
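The correction of Eqn. 3 and the linearity argument can be sketched in a few lines. This is an illustration under assumptions, not the original implementation (the authors' programs were written in BASIC): the function names `normalize_spectra` and `n_significant`, the five-point spectrum and the Gaussian elution profile are all hypothetical.

```python
import numpy as np

def normalize_spectra(data, z):
    """Eqn. 3: divide each spectrum (row) by its total signal if that
    total is at least z (3a); leave all other rows unchanged (3b)."""
    data = np.asarray(data, dtype=float)
    totals = data.sum(axis=1)
    corrected = data.copy()
    above = totals >= z
    corrected[above] = data[above] / totals[above, np.newaxis]
    return corrected

def n_significant(matrix, tol=1e-8):
    """Number of singular values above tol times the largest one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int((s > tol * s[0]).sum())

# Noiseless pure peak: every spectrum is a multiple of the same spectrum.
profile = np.exp(-0.5 * ((np.arange(60) - 30) / 6.0) ** 2)  # assumed shape
spectrum = np.array([0.2, 0.5, 1.0, 0.7, 0.3])              # assumed values
pure = 10.0 * np.outer(profile, spectrum)

corrected = normalize_spectra(pure, z=1.0)

# Rows well inside the peak (all strictly positive) for the linearity check.
peak = pure[20:41]
print("components, raw data     :", n_significant(peak))
print("components, square root  :", n_significant(np.sqrt(peak)))
print("components, logarithm    :", n_significant(np.log(peak)))
print("components, Eqn. 3       :", n_significant(normalize_spectra(peak, 1.0)))
```

For this noiseless example the raw, square-root-transformed and Eqn. 3-corrected data each need a single component, while the log-transformed data need two, in line with the argument that the logarithm destroys the proportionality between spectra.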
EXPERIMENTAL
The sample used to demonstrate heteroscedasticity was a pharmacologically active drug, which was dissolved in water (purified in a Millipore Milli-Q system) and introduced directly into the measurement cell. The spectra were recorded on a Perkin-Elmer LC235 DAD in the range of 195-365 nm (35 diodes) in intervals of 0.5 s. For the assessment of peak purity, the sample consisted of two isomers of another drug, used as available in the laboratory and dissolved directly in the mobile phase (Fig. 4).
Fig. 4. Spectra of the analysed drug isomers: solid line, analyte; dashed line, impurity. Abscissa: wavelength [nm].
The liquid chromatograph consisted of a Kontron Model 420 pump with a Rheodyne injection valve and a 20-µl sample loop. A 100 × 4.6 mm i.d. 5 µm RP-18 Brownlee Spheri-5 column was used at ambient temperature. The mobile phase was a degassed and filtered mixture of acetonitrile (LiChrosolv, Merck) and 0.1 M diammonium hydrogenphosphate (analytical-reagent grade, Merck) in water at pH 2.5 [adjusted with 85% phosphoric acid (analytical-reagent grade, Merck)]. The flow-rate was 2 ml min$^{-1}$. Spectra were recorded in the range 210-365 nm (32 diodes).
Data transfer to an IBM-compatible personal computer with math coprocessor and conversion to ASCII files were performed with Perkin-Elmer LC View software. All programs for simulation and data analysis were written and compiled in this laboratory using Microsoft BASIC 7.0. The FSW technique was based on a window of seven consecutive spectra. A threshold of z = 1 was used for correction according to Eqn. 3.
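The window procedure with seven consecutive spectra can be sketched as follows. This is a hedged illustration rather than the original BASIC program: the function name `fsw_efa` is hypothetical, the eigenvalues are taken as squared singular values of the uncentred window (the text does not state how the factor analysis was set up), and the simulated peak, spectrum and noise levels are invented.

```python
import numpy as np

def fsw_efa(data, window=7):
    """Eigenvalues of every block of `window` consecutive spectra (rows).

    Assumption: EVs are the squared singular values of the raw,
    uncentred block, returned in descending order for each window."""
    data = np.asarray(data, dtype=float)
    n_windows = data.shape[0] - window + 1
    eigenvalues = []
    for start in range(n_windows):
        block = data[start:start + window]
        singular_values = np.linalg.svd(block, compute_uv=False)
        eigenvalues.append(singular_values ** 2)
    return np.array(eigenvalues)

# Simulated pure peak (all shapes and noise levels are assumptions).
rng = np.random.default_rng(2)
times = np.arange(80)
profile = np.exp(-0.5 * ((times - 40) / 6.0) ** 2)
wavelengths = np.linspace(210.0, 365.0, 32)
spectrum = np.exp(-0.5 * ((wavelengths - 250.0) / 25.0) ** 2)
pure = np.outer(profile, spectrum)

baseline_noise = 1e-4
homo = pure + baseline_noise * rng.standard_normal(pure.shape)
hetero = (pure * (1.0 + 1e-2 * rng.standard_normal(pure.shape))
          + baseline_noise * rng.standard_normal(pure.shape))

log_ev_homo = np.log10(fsw_efa(homo))
log_ev_hetero = np.log10(fsw_efa(hetero))

def spread(log_ev, k):
    """Range of the k-th log EV trace over all windows."""
    return log_ev[:, k].max() - log_ev[:, k].min()

print("second log EV range, homoscedastic  :", round(spread(log_ev_homo, 1), 2))
print("second log EV range, heteroscedastic:", round(spread(log_ev_hetero, 1), 2))
```

Plotting each column of `log_ev_homo` or `log_ev_hetero` against the window position gives the kind of log EV vs. time graph discussed in the text. In this simulation the second trace stays close to the noise level for the homoscedastic data but rises under the peak for the heteroscedastic data, although only one compound is present.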
RESULTS AND DISCUSSION
Correction methods
The results obtained with FSW from experimental data for a pure peak and the two correction methods are given in Fig. 5. It can be seen that raw data lead to additional signals under the main peak. After square root transformation, interpretation of the results becomes difficult because the second EV looks more like a negative peak than a horizontal line, which one would normally expect for a pure sample. The reason for this is probably the more or less linear relationship between noise and signal as illustrated in Fig. 1. In such instances the square root transformation cannot be expected to correct heteroscedasticity. Application of Eqn. 3, however, works better: a more or less horizontal line can be observed at around log EV = -6, corresponding to the noise level. Even if there is still some fluctuation in the second EV, no signal that might be mistaken as being due to an impurity can be found any longer. As Eqn. 3 standardizes the spectra, part of the first EV results in a horizontal line (time 24-46).
One notices that in all instances the first EV at the end of the chromatogram does not correspond to the noise level, i.e., that the line describing the first EV is higher after the peak than before. This can be explained by investigating the shape of the elution profile.
Fig. 5. Log EV plotted vs. analysis time for an experimentally recorded pure peak, (a) obtained from raw data, (b) after square root transformation and (c) after correction according to Eqn. 3.
Fig. 6. (a) Chromatogram obtained at 240 nm for data in Fig. 5 and (b) spectrum taken at time 75.
Figure 6 shows that there is some peak tailing and that an acceptable spectrum of the substance
can still be measured at the end of the recorded chromatogram. As the number of spectra that can be recorded was limited, a small amount of the analyte was still present when the last spectra were taken and therefore the corresponding EV must be different from noise.
More informative results can be obtained when analysing a mixture of two isomers, as used for studying the performance of FSW for assessment of peak purity in LC-DAD. Adding 0.7% of its isomer to the substance analysed before and adjusting chromatographic separation to $R_s$ = 0.5 results in Fig. 7. Raw data again lead to multiple signals, which cannot easily be attributed to the chemical species or to certain sources of artefacts. A correct interpretation of such a graph is therefore difficult. As with a pure peak, the results obtained after square root transformation are not those which one would expect, i.e., there is no signal above the noise (log EV = -2.5) that would indicate an impurity. The proposed correction method, however, clearly represents the impurity as a second peak above the noise (log EV = -6.5) around time 20. Although the shape of the third to seventh EVs, which should ideally describe a horizontal line, is not perfect, the results are remarkable. One should remember that less than 1% of a spectrally similar impurity has been detected at low chromatographic separation, without making any assumption about the peak shape and without requiring the spectra to be known in advance. The reason why the correction method proposed here performs better than the square root transformation is probably that the noise is proportional to $y$ rather than to $\sqrt{y}$ for the given analytical system, as in many analytical techniques.
Fig. 7. Log EV plotted vs. analysis time for an experimentally recorded mixture of two isomers with $R_s$ = 0.5 and 0.7% of the minor compound, (a) obtained from raw data, (b) after square root transformation and (c) after correction according to Eqn. 3.
In order not to increase the noise when applying Eqn. 3, the threshold z should not be
too small; on the other hand, a too large value would standardize only a few spectra and therefore not correct heteroscedasticity. For the given system the optimum threshold was found to be z = 1.
Still, one should bear in mind that heteroscedasticity is not the only cause for artefacts in EFA-based methods. The DAD scan time, calibration graph non-linearity and non-zero or sloping baselines may also lead to problems. Heteroscedasticity seems, however, to be the most relevant source of such artefacts.
Peak purity control
With the aim of developing a method for detecting small amounts of spectrally similar impurities under a chromatographic peak even at low $R_s$ values, a systematic evaluation of the limits of detection for an impurity based on FSW was performed. The relative concentration of one isomer in the other was 10%, 2% and 0.7% and chromatographic separation was varied from 1.0 to 0.1 in steps of 0.1. Although FSW could not easily be applied to raw data because of heteroscedasticity, as shown above, its performance was remarkable after application of the correction proposed here. The results obtained are summarized in Fig. 8, where the line indicates the limits of detection for the tested impurity as a function of $R_s$ and relative concentration. In the area above and to the right of this line the second isomer will be detected. The best result in terms of relative concentration and chromatographic separation can be found at the point representing 0.7% of the impurity at $R_s$ = 0.3. It can be seen that the presented method permits the detection of a small amount of a spectrally similar impurity even for very small separations.
Conclusion
Although EFA-based methods are very powerful and promising for peak purity control in LC-DAD, one has to be aware of instrumental and experimental problems. The DAD scan time, calibration graph non-linearity and non-zero or sloping baselines lead to artefacts, but the most serious problem seems to be heteroscedasticity. After appropriate data pretreatment as described in this paper, FSW could successfully be applied for peak purity control in LC-DAD. The limits of detection of a spectrally similar impurity under a chromatographic peak are below 1% even for small $R_s$ values. A first comparison of FSW with the heuristic evolving latent projection method (HELP) [9,10], a recently developed technique that is based on latent variables and that can also be applied to peak purity control in LC-DAD, indicates that both methods perform equally well. A detailed evaluation of these techniques will be presented in the future.
The authors thank P. Kiechle and F. Erni for their collaboration and Sandoz Pharma and Shell Research Laboratories for financial assistance.
Fig. 8. Limits of detection for an impurity as a function of the relative amount and of the chromatographic separation, obtained with FSW after correction of heteroscedasticity according to Eqn. 3. Axes: resolution (abscissa), impurity (ordinate); the region above the line is labelled detectable.
REFERENCES
1 M. Maeder, Anal. Chem., 59 (1987) 527.
2 M. Maeder and A. Zilian, Chemometr. Intell. Lab. Syst., 3 (1988) 205.
3 H. R. Keller and D. L. Massart, Anal. Chim. Acta, 246 (1991) 379.
4 H. R. Keller, D. L. Massart, P I
6 D. L. Massart, B. G. M. Vandeginste, S. N. Deming, Y. Michotte and L. Kaufman, Chemometrics: a Textbook, Elsevier, Amsterdam, 1988, pp. 64-86.
7 J. Hobbs, K. O'Dea and B. Archer, poster presented at HPLC '91 Conference, June 3-7, 1991.
8 M. J. P. Gerritsen, N. M. Faber, M. van Rijn, B. G. M. Vandeginste and G. Kateman, Chemometr. Intell. Lab. Syst., 12 (1992) 257.
9 Y. Z. Liang, O. M. Kvalheim, H. R. Keller, D. L. Massart, P. Kiechle and F. Erni, Anal. Chem., 64 (1992) 946.
10 O. M. Kvalheim and Y. Z. Liang, Anal. Chem., 64 (1992) 936.