A computer system for analysis of chromatographic data


Original Research Paper


Chemometrics and Intelligent Laboratory Systems, 1 (1987) 285-295 Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands

A Computer System for Analysis of Chromatographic Data

R.J. MARSHALL *, A.J. BLEASBY, R. TURNER and E.H. COOPER

Unit for Cancer Research, University of Leeds, Leeds LS2 9JT (U.K.)

(Received 3 September 1986; accepted 11 December 1986)

ABSTRACT

Marshall, R.J., Bleasby, A.J., Turner, R. and Cooper, E.H., 1987. A computer system for analysis of chromatographic data. Chemometrics and Intelligent Laboratory Systems, 1: 285-295.

An interactive computer program (CHAS) for chromatogram processing is described. CHAS is a FORTRAN program which has three basic functions: (a) for data management, (b) for graphical display, and (c) for chromatogram analysis. The program is designed to run off-line by accessing data from a library of chromatograms. Various types of graphical displays are available and its analytical procedures incorporate new algorithms to detect chromatogram peaks, to remove baseline drift and to compute similarities between chromatograms. We present some illustrative uses of the program for data generated by high-pressure liquid chromatography.

INTRODUCTION

Computers are increasingly being used to analyse chromatographic data. Many commercial systems are available and others have been developed for various research applications [1-3]. Here we present a computer program called CHAS (an acronym for Chromatogram Handling and Analysis System) for off-line analysis of a library of chromatograms. CHAS offers various useful features. Its graphics facilities enable chromatograms to be displayed in a variety of ways; its data management facilities allow data reduction and insertion/deletion of items from a library of chromatograms; and its analytical procedures can be used to detect and quantify peaks and to assess the similarities between chromatograms using some new methods.

CHAS was designed to process chromatograms obtained from separation of proteins in human blood and urine. It was developed to deal with practical problems concerned with chromatogram analysis, but it is quite general and could be used to analyse various types of chromatogram or, indeed, any short duration waveform.

Some of the methods of analysis which CHAS uses have been described in refs. 4 and 5. In ref. 4 we used the idea of a “distance” between chromatogram patterns to identify classes of chromatogram patterns. Other authors [6-9] have done similar analyses. CHAS can compute various such measures. A new statistic based on the degree of alignment of chromatogram peaks is described here. Another important feature of CHAS, which is used to identify peaks in the aforementioned statistic, is that of peak analysis. In a previous paper [5] we have given the theoretical development of the peak detection algorithm that is insensitive to both noise and a drifting baseline.

* Address for correspondence: Dr. R.J. Marshall, Department of Community Health, University of Auckland, Private Bag, Auckland, New Zealand.

0169-7439/87/$03.50 © 1987 Elsevier Science Publishers B.V.

DATA ANALYSIS USING CHAS

CHAS is an interactive FORTRAN 77 program for compilation using IBM FORTVS. GHOST 80 FORTRAN procedures are used to generate graphics. An important feature of CHAS is a “master file” of chromatograms. New digital chromatogram files can be pre-processed and added to the master file and old entries can be deleted while within the CHAS environment. The master file is a stack of digital chromatograms; a particular chromatogram is located and extracted from the stack by direct access. The main facilities that are available with CHAS are shown in Fig. 1. We shall not dwell on the menu details; a manual exists for interested readers [11]. We shall, however, briefly describe some of the graphical facilities and give some theory for the analytical procedures.

[Figure 1: screen image of the main menu of the Chromatogram Handling and Analysis System (CHAS). Legible entries include listing, adding and erasing master file entries, easy and user-defined plots, newly digitised data, hexadecimal conversion, reduction, peak analysis and similarity analysis, followed by the prompt “Enter your selection (press RETURN to escape)”.]

Fig. 1. The main menu for CHAS. The program is interactive and accesses and handles data on a master file. In addition options 7, 8 and 9 take data from new source data files.

Graphical facilities

Both options 5 and 6 display chromatograms. With the “automatic” option 5 the user simply calls up a chromatogram for display by its identifier, while option 6 allows the user to control axes, legends, colours etc. and to display processed chromatograms, for example, after removing baseline drift. Chromatograms can be superimposed, or placed side by side, or above and below each other, or in a projective layout as in Fig. 6c. In addition option 6 allows peaks to be located and marked with their computed areas, and the method described in ref. 4 can be used to adjust for perturbations in peak retention times before plotting.

Data reduction by segmentation

If data are recorded at regular intervals the amount of data needed to represent a chromatogram adequately may be unnecessarily large. For example, where the signal is approximately linear some of the data points will be redundant. One CHAS option is to obtain a segmented representation of a chromatogram by omitting signal values which are redundant. Suppose that $Y(t)$, for $t = 0, T, 2T, 3T$, etc., represents a digital signal at regular time intervals of length $T$. The choice of values to retain or to omit is made by working forward as follows. Suppose that $t_1, \ldots, t_i$ have been retained and we wish to determine the next point $t_{i+1}$ to retain. The gradient at time $t_i$ is approximately

$$g(t_i) = \frac{Y(t_i) - Y(t_{i-1})}{t_i - t_{i-1}}$$

so that a predicted value of $Y(t)$ at time $t_i + Tj$, that is, $j$ time steps ahead of $t_i$, is $Y(t_i) + Tj\,g(t_i)$. We work forward from $t_i$ until a value of $j$ is found for which the predicted value differs from the actual value, $Y(t_i + Tj)$, by more than a prespecified amount $e$. Then we set $t_{i+1} = t_i + T(j - 1)$ and the procedure continues. The choice of $e$ determines the severity of the reduction. Similar procedures are described in ref. 12.

It may be noted that it is not strictly necessary that data handled by CHAS be in a segmented form, though a time registration for each data point is required even for regular interval data. We have found that a segmented representation of a chromatogram is extremely economical, despite the need for a time registration.

Peak analysis

The peak detection algorithm [5] is illustrated by Fig. 2. For a given threshold $z$, the primary peaks are found recursively. Menu option 10 gives a peak analysis and optionally generates a chromatogram plot with the peaks marked. The user can also “identify” each peak according to whether its retention time is within an anticipated reference interval. Peak areas are computed, with peak start and end times determined as in ref. 5, using a trapezoidal method for numerical integration. In addition a “threshold” plot can be produced. This gives the number of peaks exceeding a given threshold level; it is useful for distinguishing between “noisy” and “real” peaks [5]. As illustrated in Fig. 2, the peaks detected by this algorithm are used to remove baseline drift. It may be noted that the figures in ref. 5 were produced by CHAS.
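The forward-working rule of the data reduction subsection can be sketched in Python as follows. This is an illustrative sketch, not the CHAS FORTRAN code. The function name `segment`, the choice to seed the procedure with the first two samples, and the rule that at least one step is always taken (so that the procedure still advances when the very next sample already violates the tolerance) are assumptions that the text does not spell out.

```python
from typing import List, Sequence


def segment(times: Sequence[float], values: Sequence[float], e: float) -> List[int]:
    """Return the indices of the samples retained by forward linear prediction."""
    n = len(values)
    if n < 3:
        return list(range(n))
    # Assumption: the first two samples are retained to seed the first gradient.
    kept = [0, 1]
    while kept[-1] < n - 1:
        i, prev = kept[-1], kept[-2]
        # Gradient estimated from the last two retained points.
        g = (values[i] - values[prev]) / (times[i] - times[prev])
        j = 1
        # Step forward while the linear prediction stays within e of the data.
        while i + j < n:
            predicted = values[i] + (times[i + j] - times[i]) * g
            if abs(predicted - values[i + j]) > e:
                break
            j += 1
        # Retain the last sample that was still predicted within tolerance,
        # i.e. j - 1 steps ahead; always advance by at least one step (assumption).
        kept.append(i + max(j - 1, 1))
    return kept


# A straight line needs only its seed points and its final sample.
print(segment([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], e=0.1))
```

A larger `e` retains fewer points, which corresponds to the remark that the choice of e determines the severity of the reduction.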

Fig. 2. How a “peak” is defined. Peaks $p_{i-1}$, $p_i$ and $p_{i+1}$ exceed a height $z$ above their adjacent valleys $v_{i-1}$, $v_i$, etc. The shaded area indicates the area taken out by the method for baseline drift removal.
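As a rough illustration of the peak definition in Fig. 2, of the trapezoidal area calculation and of the “threshold” plot, the following Python sketch marks a local maximum as a peak when it rises more than z above both adjacent valleys. It is not the recursive algorithm of ref. 5. The function names, the use of the adjacent valleys as the peak start and end (CHAS determines these as in ref. 5) and the absence of any baseline subtraction are simplifying assumptions.

```python
from typing import List, Sequence, Tuple


def find_peaks(values: Sequence[float], z: float) -> List[Tuple[int, int, int]]:
    """Return (start, apex, end) indices of maxima more than z above both valleys."""
    peaks = []
    n = len(values)
    for k in range(1, n - 1):
        if values[k - 1] < values[k] >= values[k + 1]:
            # Walk downhill on each side to reach the adjacent valleys.
            left = k
            while left > 0 and values[left - 1] <= values[left]:
                left -= 1
            right = k
            while right < n - 1 and values[right + 1] <= values[right]:
                right += 1
            if values[k] - max(values[left], values[right]) > z:
                peaks.append((left, k, right))
    return peaks


def trapezoid_area(times: Sequence[float], values: Sequence[float],
                   start: int, end: int) -> float:
    """Trapezoidal integral of the signal between two sample indices."""
    return sum(0.5 * (values[k] + values[k + 1]) * (times[k + 1] - times[k])
               for k in range(start, end))


def threshold_counts(values: Sequence[float], thresholds: Sequence[float]) -> List[int]:
    """Number of peaks found at each threshold: the data behind a threshold plot."""
    return [len(find_peaks(values, z)) for z in thresholds]


times = [0, 1, 2, 3, 4, 5, 6, 7, 8]
signal = [0, 1, 4, 1, 0, 0.5, 0.2, 3, 0]
for start, apex, end in find_peaks(signal, z=1.0):
    print(times[apex], trapezoid_area(times, signal, start, end))
print(threshold_counts(signal, [0.1, 1.0, 3.0]))
```

In this toy signal the small bump at t = 5 is counted at a low threshold and disappears at a higher one, which is the kind of behaviour a threshold plot is meant to reveal.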

Similarity and distance measures

Various authors [6-9] have defined similarity/distance measures between two chromatograms X and Y in terms of a summary n-dimensional vector of peak heights or areas. An alternative approach [4], which avoids condensing information, is to take the continuum representation over a specified time interval $a < t < b$ and form a measure directly. For instance, CHAS will compute the Euclidean distance

$$d(X, Y) = \left\{ \int [X(t) - Y(t)]^2 \, dt \right\}^{1/2} \qquad (1)$$

and the correlation similarity function

$$r(X, Y) = \frac{\int x(t)\, y(t) \, dt}{\left[ \int y(t)^2 \, dt \int x(t)^2 \, dt \right]^{1/2}} \qquad (2)$$

where $y(t) = Y(t) - \bar{Y}$ and $x(t) = X(t) - \bar{X}$ and where $\bar{X}$ and $\bar{Y}$ are mean values of X and Y, for example, $\bar{Y} = \int Y(t)\,dt / (b - a)$. All integrations are over the time interval $a < t < b$ and are done as in ref. 4. Timing adjustments and weightings as in ref. 4 can also be incorporated.

Another measure based on the degree of peak alignment may be more suited to chromatography since the prime characteristic of a chromatogram is its peak positioning rather than its waveform shape. Referring to Fig. 3, two peaks of chromatograms X and Y may be said to be aligned if [...] and $N = N_Y$, then there is an excess of non-aligned peaks in X and accordingly $s(X, Y) = N_{XY}/N_X < 1$. Clearly the measure depends on how a peak and its start and end are defined. We use the algorithm [5] for this purpose. Thus peaks are set by a threshold $z$ and a lower threshold $z'$ is used to set start and end times. CHAS computes $s(X, Y)$ for any number of $z$ values for a fixed $z'$. An average is also computed as follows

[...] (3)

the summations being over values of $z$.

CHAS can generate a matrix of similarity/distance measures between a collection of chromatograms and the matrix can be used as input for some further analysis. For example, a cluster analysis could be done to obtain natural groupings of chromatographic patterns [4].

Fig. 3. When a peak of chromatogram X and a peak of chromatogram Y are considered to be aligned.

AN APPLICATION

We developed CHAS to analyse data generated by a high-performance liquid chromatography system (HPLC, Applied Chromatography Systems, Macclesfield, U.K.) and a fast protein liquid chromatography system (FPLC, Pharmacia Biotechnology, Uppsala, Sweden), though the program has sufficient flexibility to handle data generated by any means. Fig. 4 illustrates the interface between our chromatography systems and eventual analysis by CHAS. A British Broadcasting Corporation (BBC) microcomputer was used for data logging on to floppy disks and data files were subsequently uploaded to Leeds University’s AMDAHL computer for analysis by CHAS. Logging was done from the input terminals of a chart recorder and data were fed to an analog-to-digital (AD) channel of the BBC micro via a 741 operational amplifier. An elution gradient was recorded by a second AD channel of the BBC and a third analog channel was used to detect the start and end of a run and to cater for automatic recording of multiple runs. Software for the BBC micro was written in BBC [...]

Fig. 4. The computer system used to collect and to feed data to a mainframe AMDAHL computer for processing by CHAS.

Fig. 5. Chromatograms illustrating separation by Sephacryl 300. (a) Gel filtration chromatogram on Sephacryl 300 from which six fractions, A-F, were extracted for subsequent analysis. (b) Computer display of FPLC chromatograms of fractions A-F. (c) A computer rearrangement of Fig. 5b after rescaling absorption values and removal of 90% of baseline drift of A. A = SACRY1R; B = SACRY2R; C = SACRY3C; D = SACRY4C; E = SACRY5C; F = SACRY6C.
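The Euclidean distance and correlation measures defined in the section on similarity and distance measures can be illustrated with a short Python sketch. It assumes that both chromatograms are sampled at the same time points and that the integrals are approximated by the trapezoidal rule; the paper states only that the integrations are done as in ref. 4, so this quadrature, the function names and the toy data are assumptions. The timing adjustments and weightings mentioned in the text are not included.

```python
import math
from typing import List, Sequence


def integrate(times: Sequence[float], f: Sequence[float]) -> float:
    """Trapezoidal approximation to the integral of f over the sampled interval."""
    return sum(0.5 * (f[k] + f[k + 1]) * (times[k + 1] - times[k])
               for k in range(len(f) - 1))


def euclidean_distance(times: Sequence[float], x: Sequence[float],
                       y: Sequence[float]) -> float:
    """Square root of the integrated squared difference between two signals."""
    return math.sqrt(integrate(times, [(a - b) ** 2 for a, b in zip(x, y)]))


def correlation(times: Sequence[float], x: Sequence[float],
                y: Sequence[float]) -> float:
    """Correlation of the two signals after subtracting their time averages."""
    span = times[-1] - times[0]
    x_mean = integrate(times, x) / span
    y_mean = integrate(times, y) / span
    xc = [a - x_mean for a in x]
    yc = [b - y_mean for b in y]
    numerator = integrate(times, [a * b for a, b in zip(xc, yc)])
    denominator = math.sqrt(integrate(times, [a * a for a in xc])
                            * integrate(times, [b * b for b in yc]))
    # A constant signal makes the denominator zero; that case is not handled here.
    return numerator / denominator


def similarity_matrix(times: Sequence[float],
                      chromatograms: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise correlations, usable as input to a later cluster analysis."""
    return [[correlation(times, a, b) for b in chromatograms] for a in chromatograms]


times = [0, 1, 2, 3, 4]
x = [0, 1, 2, 1, 0]
y = [1, 3, 5, 3, 1]  # same shape as x, rescaled and offset
print(euclidean_distance(times, x, y), correlation(times, x, y))
```

The toy data show the difference between the two measures: a chromatogram that is a rescaled and offset copy of another is far away in Euclidean distance yet perfectly correlated.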

Example 1: Monitoring serum separation using Sephacryl 300

As a first step in a series of experiments a Sephacryl gel filtration medium was used to separate serum that had been depleted of albumin by absorption on Blue Sepharose. The separation was evaluated by taking six fractions, A-F, of the gel filtration chromatogram (Fig. 5a) and analysing each of these on an FPLC system using a Superose 6 gel filtration column to separate by molecular weight. The FPLC chromatograms were captured and reduced, as described above, and appended to a master file using CHAS. A computer generated plot of the FPLC chromatograms of fractions A-F is shown in Fig. 5b. A more informative layout is shown in Fig. 5c, where the chromatograms have been scaled to adjust for differing sample concentrations and some baseline of A has been removed. This shows each peak eluting at a later time than the one before. Since the Superose 6 medium separates by molecular weight, with larger molecules eluting earlier in gel filtration, the Sephacryl 300 fractions A-F contain substances of successively decreasing molecular weight.

Example 2: Binding of Immunoglobulin G on a Protein A column

A further fraction, G, of the Sephacryl 300 chromatogram in Fig. 5a was subsequently analysed by anion exchange on a Mono Q HR 5/5 column. This gave the full line chromatogram in Fig. 6a (with identifier SEPK1C). The immunoglobulin G (IgG) in the sample was then partially removed from this fraction by passing it through a protein A affinity column. The Mono Q chromatogram with the IgG removed is also shown in Fig. 6a (SEPN1C) and so too is the chromatogram for the fraction bound to the affinity column (SEPN2C). The first peak of SEPK1C represents the IgG in the sample; it is clear that this has been effectively removed by the affinity column, whilst that bound to the column is almost pure IgG. A clearer layout for these chromatograms, with areas marked, is shown in Fig. 6b. The high baselines of chromatograms SEPK1C and SEPN2C have been removed and they have been rescaled.

Table 1 gives the similarity matrices between these three chromatograms for the correlation measure r(X, Y) and the averaged peak alignment measure, eq. 3. The values concur with a subjective assessment of similarity on a zero-to-one scale. Another computer generated display is shown in Fig. 6c which, although no more informative in this example, demonstrates a feature of CHAS that may be of use. We have used this type of display to illustrate evolving urinary chromatogram patterns following burn injury [13] and the facility could be of value for chromatograms derived from multi-channel detectors.

TABLE 1
Similarity measures of the chromatograms in Fig. 6b evaluated over the time interval 2-23 minutes

Chromatogram    SEPN1C                         SEPN2C
identifier      Alignment *   Correlation **   Alignment *   Correlation **
SEPK1C          0.862         0.601            0.231         0.342
SEPN1C          1             1                0.179         -0.097

* Averaged, using eq. 3, over thresholds z = 0.005-0.03 in steps of 0.005 and taking z’ = 0.001.
** Using eq. 2.

Fig. 6. The binding of IgG by a Protein A column. (a) Chromatograms on a Mono Q column of the low molecular weight fraction G in Fig. 5a; SEPK1C = whole fraction (full line); SEPN1C = after passing through a Protein A column (- - -); SEPN2C = the sample bound to the Protein A column (------). (b) Another display of Fig. 6a after rescaling and baseline drift removal, with peaks detected and areas marked. A = SEPK1C; B = SEPN1C; C = SEPN2C. (c) A three-dimensional display of Fig. 6b.

CONCLUSION

There are a number of aspects of the CHAS system which are useful for processing chromatographic data, or any short duration electrophysical waveform, and which are not available together on other systems. The first is data management. The amount of data that may be generated to adequately represent a “library” of waveforms may be quite formidable and the availability of a reliable and efficient system to add, delete, access and display entries in the library is essential. CHAS is able to fulfill this role. In addition it possesses various procedures which are useful in chromatography and which are not found together in other systems. These include the capacity to detect, integrate and identify peaks, to remove baseline drift, to compute similarities between chromatograms, and to adjust chromatograms to correct for retention time perturbations so that peaks align with pre-set reference retention times. Also CHAS can be used to obtain a segmented chromatogram, a facility which, besides giving a compact representation, can be used as a smoothing filter.

The program does not, however, possess a feature for dealing explicitly with noise; for instance, there is no capability for signal-to-noise evaluation. It is assumed that signals to be analysed by CHAS will have been pre-processed to some extent to filter noise, for instance, by analogue filtering at data capture. The program accordingly analyses them as they are presented to it; it makes no allowance for imprecision, inaccuracy or error in the signal. However, its facilities can be selectively used to correct for errors, for instance, by corrections for baseline drift. In addition, the peak detection algorithm, which is an important feature of the system, is insensitive to signal noise and baseline drift and can be used to distinguish between “real” and “noisy” peaks.

The program’s graphical facilities are also useful. The ability to instantly obtain a terminal display of a chromatogram and to superimpose another is extremely useful for chromatogram comparisons. More elaborate displays of chromatograms can be done as the examples given above indicate.

ACKNOWLEDGEMENTS

RJM and EHC were supported by a grant from the Yorkshire Cancer Research Campaign. We are grateful to Professor M. Wells and other members of Leeds University Computer Service for their advice and encouragement.

REFERENCES

1 P. Tarroux and T. Rabilloud, Complete computer system for processing chromatographic data, Journal of Chromatography, 248 (1982) 249-262.
2 D.L. Gustine and J. McCulloch, Versatile microcomputer-controlled, automated gradient analytical high-performance liquid chromatography system, Journal of Chromatography, 316 (1984) 407-414.
3 P.W. Banda, M.S. Tuttle, L.E. Selmer, Y.T. Thatachari, A.E. Sherry and M.S. Blois, Data processing of urine chromatograms for clinical management of melanoma, Computers and Biomedical Research, 13 (1980) 549-566.
4 R.J. Marshall, R. Turner, H. Yu and E.H. Cooper, Cluster analysis of chromatographic profiles of urine proteins, Journal of Chromatography, 297 (1984) 235-244.
5 R.J. Marshall, The determination of peaks in biological waveforms, Computers and Biomedical Research, 19 (1986) 319-329.
6 M.L. McConnell, G. Rhodes, U. Watson and M. Novotny, Application of pattern recognition and feature extraction techniques to volatile constituent metabolic profiles obtained by capillary gas chromatography, Journal of Chromatography, 162 (1979) 495-506.
7 H.A. Scoble, J.L. Fasching and P.R. Brown, Chemometrics and liquid chromatography in the study of acute lymphocytic leukemia, Analytica Chimica Acta, 150 (1983) 171-181.
8 H.A. Scoble, M. Zakaria, P.R. Brown and H.F. Martin, Liquid chromatographic profile classification of acute and chronic leukemias, Computers and Biomedical Research, 16 (1983) 300-315.
9 M.E. Parrish, B.E. Good, M.A. Jeltema and F.S. Hsu, Pattern recognition and capillary gas chromatography in the analysis of the organic gas phase of cigarette smoke, Analytica Chimica Acta, 150 (1983) 163-170.
10 K.G. Beauchamp, Signal Processing using Analog and Digital Techniques, Allen-Unwin, London, 1973, p. 41.
11 R.J. Marshall, A manual for CHAS, Internal report of the Cancer Research Unit, University of Leeds, 1985.
12 J.H. Van Bemmel, Biological signal processing, in D. Ingram and R. Bloch (Editors), Mathematical Methods in Medicine, Part 1, Wiley, New York, 1984, pp. 225-272.
13 H. Yu, R.J. Marshall, E.H. Cooper and J. Settle, Tubular proteinuria after burn injury, in G. Lubec and V. Campese (Editors), Advances in Non-invasive Nephrology, John Libbey, London, Paris, 1985, pp. 187-190.