Multivariate statistical monitoring of batch processes: an industrial case study of fermentation supervision

Sarolta Albert* and Robert D. Kinley

This article describes the development of Multivariate Statistical Process Control (MSPC) procedures for monitoring batch processes and demonstrates their application to industrial tylosin biosynthesis. Currently, the main fermentation phase is monitored using univariate statistical process control principles implemented within the G2 real-time expert system package. This development addresses the integration of various process stages into a single monitoring system and the observation of interactions among individual variables through the use of multivariate projection methods. The benefits of this approach are discussed from an industrial perspective.

Eli Lilly and Company Limited, Speke Operations, Fleming Road, Liverpool, UK L24 9LN; e-mail: [email protected]*, [email protected]

The biosynthetic production of secondary metabolites has always posed challenges to scientists and engineers. Reducing unexplained variation in process performance potentially results in improved performance and quality1, as well as improved process understanding. The task of lowering variability was addressed through the development and application of advanced techniques, namely EXPERT SYSTEMS (see Glossary) complemented with data-based modelling approaches. The Expert System approach aims to replicate the reasoning of operating staff who have traditionally supervised the process and who, through their individual experiences and perceptions, have developed a decision-making practice that is essential for maintaining the pre-specified conditions. Such knowledge can be expressed in the form of 'if-then' rules, which are consciously used in operation. Recently, it was shown that it is possible to reproduce a correct and complete set of such rules through the use of a new Knowledge Acquisition Technique (KAT; Ref. 2). The development of a fermentation knowledge base is outside the scope of this article and is reported elsewhere3. Although rules are useful in detecting deviations of individual variables, the INTERACTIONS (see Glossary) between measured process variables are usually important, complex and not always fully understood. Simultaneous combined effects of variables might lead to variation in performance that remains hidden when univariate rule-based approaches are in place. Multivariate techniques are effective in detecting such deviations and in extracting information from the process data itself, thus providing an alternative to knowledge-based approaches. Furthermore, the only requirement for their use is historical data, which is often a widely available and under-utilized resource in today's industry.

Although multivariate methods have recently emerged as a leading-edge technology in the chemical industries, more traditional batch industries have not been able to adopt the approach because of the inherent process dynamics and nonlinearities associated with batch processing. As no commercial package was available for batch process applications, this research involved the development of a prototype comprehensive Batch MSPC tool that is capable of turning raw data into information that could potentially lead towards improved processes.

Process description and data availability

Tylosin fermentation was chosen as an example of a complex secondary metabolite production process. Tylosin production, as with most fermentation processes, involves various stages before and after the main fermentation, which has traditionally been the favoured area for improvements. However, deviations that influence final productivity might occur before the final stage, and it is therefore clearly beneficial to focus also on the operations preceding the fermentation. Tylosin production starts with mixing the raw materials, which are of natural origin and provide essential complex substrates. The pH of the medium is adjusted before it is transferred to the previously sterilized seed vessel. Following inoculation with the carefully prepared culture, the seed fermentation is grown until there are sufficient microorganisms to inoculate the main fermentor vessels. The medium for the main fermentors is prepared, sterilized and inoculated in a similar manner to the seed. Many factors are believed to influence this process, several of which are recorded off-line or on-line throughout the ~6-day fermentation. Productivity, the indicator of successful operation, is not available until the fermentation is terminated. Data from 144 fermentations were collected, including all stages of fermentation. Most stages were represented by off-line measurements, with the exception of the main fermentation, where a data historian stores several computer-logged variables throughout the batch duration, such as pH, temperature, respiratory data, pressure, agitation rate, airflow and dissolved oxygen, some of which are controlled around a setpoint.



Glossary

Expert Systems: A supervisory control system that makes use of expert knowledge in the form of rules to advise operators on process problems and control actions. The simple logic of associative decision rules is easily understood and accepted by humans; 'if-then' rules are therefore often used in this context, in the form 'If X is true and Y is false, conclude class 1'.

Interactions: Chemical and biological systems are often highly complex and present complex, non-linear and dynamic dependence structures among the parameters one can measure in order to observe such systems. Mathematically, these relationships can be approximated via kinetic expressions (if known) or, if no prior knowledge is available, inferred from data. The most commonly used measure of dependence between variables is the correlation coefficient, which measures linear interactions (collinearity). This assumption is adopted by using the covariance matrix of multivariable data when applying PCA.

Artificial neural networks (ANNs): A modelling methodology that attempts to construct approximations of process behaviour by integrating many processing units (neurons), which interact to provide a powerful means of approximation. ANNs 'learn' the approximation (process model) by repeated exposure to process data.

SCREE test: A simple visual method to determine the optimal number of principal components. It involves plotting the eigenvalues against the number of principal components and identifying the point at which the slopes of the lines joining the plotted points change from 'steep' on the left to 'not steep' on the right. This point is suggested as the optimal number of principal components.

Central limit theorem (CLT): The CLT implies that the sum of n independently distributed random variables is approximately normally distributed, regardless of the distributions of the individual variables.

Partial least squares regression (PLS): Conceptually, PLS is similar to PCA, with the difference that the process outputs (Y) are projected to a reduced space simultaneously with the process data (X). As PLS is primarily a regression technique, the aim is to find linear combinations of the input and output variables that describe the maximum amount of correlation between the inputs and outputs; that is, to explain not only the variation in X but the variation in X that is most predictive of Y.

G2: Real-time knowledge-based systems (RTKBS) are software tools that provide the opportunity to implement process knowledge in the form of rules and/or algorithmic procedures. They communicate with control systems in real time and provide advice to assist operation. The benefit of RTKBS comes from the ease with which information can be encoded. Major companies involved in fermentation have reported applications of the G2 RTKBS from Gensym, which allows information to be coded in English and thereby greatly eases implementation and long-term maintenance.

Before any modelling, a few outliers were removed and, in the case of the on-line logged variables, replaced with linearly interpolated values. Interpolation is not a viable option for missing off-line assays, so records with missing values were omitted from the database. Noise filtering was not carried out because MSPC models filter small variations from the data as a consequence of Principal Component Analysis (PCA; see next section). Productivity data were available for each batch and were used as an indicator of performance when the data were subgrouped before modelling.
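As an illustration of this pre-processing step, the following sketch replaces flagged points in an hourly logged variable with linearly interpolated values. The z-score cut-off used to flag outliers is a hypothetical choice for illustration only; the article does not specify the screening rule that was used.

```python
import numpy as np

def interpolate_outliers(y, z_cut=4.0):
    """Replace points deviating more than z_cut standard deviations
    from the series mean with linearly interpolated values."""
    y = np.asarray(y, dtype=float).copy()
    z = np.abs(y - y.mean()) / y.std()
    bad = z > z_cut                      # assumed outlier rule (illustrative)
    good = ~bad
    # np.interp needs increasing sample positions; index arrays satisfy this
    y[bad] = np.interp(np.flatnonzero(bad), np.flatnonzero(good), y[good])
    return y
```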

Principles of MSPC

The principles of MSPC are published widely4–6 and therefore only a brief summary is given here. PCA involves finding the eigenvalues of the sample covariance matrix, which are the variances of the principal components. For a normalized (mean-centred, variance-scaled) sample matrix X [n, m] with n samples and m variables, PCA finds m uncorrelated new variables, the variance of which decreases from first to last. Let the new variables be represented by t_i for a particular sample i as follows:

$t_i = \sum_{j=1}^{m} X_j \times p_{ji}$   (1)

The first principal component t_1 is found by selecting the p_i so that t_1 has the largest possible variance, subject to the condition shown in Eqn 2:

$\sum_{i=1}^{m} p_i^2 = 1$   (2)

The sample covariance matrix is given by Eqn 3:

$C = \mathrm{cov}(X) = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1m} \\ c_{21} & c_{22} & \cdots & c_{2m} \\ \vdots & \vdots & & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mm} \end{bmatrix}$   (3)

In Eqn 3, c_ij is the covariance between variables X_i and X_j, and the diagonal element c_ii is the variance of X_i. The variances of the individual principal components are the eigenvalues of the matrix C, and the sum of the eigenvalues is equal to the sum of the variances of the original variables. For m input variables there will be m principal components, some of which might be negligible if the original variables were correlated or collinear. By retaining only the first r principal components, the X matrix is approximated by Eqn 4:

$\hat{X} = \sum_{i=1}^{r} t_i \times p_i^T + E$   (4)

In Eqn 4, E is the residual matrix, p [m, r] are the loadings and t [n, r] are the scores; the rest of this article refers to scores as t [n, r] and loadings as p [m, r]. Ideally, the dimension r is chosen such that no significant information is left in E. The transformation gives the transformed data (scores) several desirable mathematical and statistical properties, enabling the derivation of statistical confidence limits. This is a very significant benefit, addressing the major shortcomings of univariate statistical process control, namely the ignoring of interactions between variables and the difficulty of simultaneously interpreting numerous control charts (m > 20). If the original variables were correlated, a reduced number of control charts can be achieved (r << m).
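Eqns 1–4 translate directly into a few lines of linear algebra. The sketch below (in Python/numpy, assumed here as the illustration language; the original work used Matlab) computes scores, loadings and residuals from the eigendecomposition of the covariance matrix:

```python
import numpy as np

def pca(X, r):
    """PCA of an [n, m] data matrix via the covariance matrix (Eqns 1-4)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # mean-centre, variance-scale
    C = np.cov(Xs, rowvar=False)                        # sample covariance (Eqn 3)
    eigvals, eigvecs = np.linalg.eigh(C)                # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]                   # sort by decreasing variance
    eigvals, p = eigvals[order][:r], eigvecs[:, order][:, :r]
    t = Xs @ p                                          # scores (Eqn 1)
    E = Xs - t @ p.T                                    # residual matrix (Eqn 4)
    return t, p, eigvals, E
```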
Fig. 1. Unfolding three-way data: (a) within-batch variation and (b) batch-to-batch variation. The three-way array X [batches (n) × variables (m) × time (k)] is unfolded either by stacking the time slices of each batch vertically (a) or by concatenating the time slices T(1)...T(k) of each batch horizontally, alongside off-line 'state' and quality data (b).

The choice of the model dimension r is governed by the bias/variance dilemma7: the optimal model complexity is a compromise between accurate model representation and not explaining idiosyncrasies present in the training data. This theory shows that resampling techniques provide the best approximation of optimal model complexity. The present application adopted the SCREE TEST4 (see Glossary) to assist the selection of principal components. For the dataset used here, and for simulated datasets, the SCREE test was no less effective and led to the same decision on model complexity as more time-consuming and computationally intensive methods. The performance of any model is primarily determined by the information available within the data used for model building.
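The SCREE test is usually applied visually, but a crude automated heuristic can be sketched as below; the 10% flatness threshold is an arbitrary illustrative choice, not part of the published procedure, and the eigenvalue plot should still be inspected by eye.

```python
import numpy as np

def scree_suggest(eigvals):
    """Suggest the number of PCs at the 'elbow' of the eigenvalue plot:
    the point where the drop between successive eigenvalues flattens.
    eigvals must be sorted in descending order."""
    drops = -np.diff(eigvals)                          # successive decreases
    # elbow: first index where the drop falls below 10% of the largest drop
    flat = np.flatnonzero(drops < 0.1 * drops.max())
    return int(flat[0]) + 1 if flat.size else len(eigvals)
```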

MSPC for batch processes

Batch data differ from continuous data in that the problem is now three-dimensional: the added dimension is time (k), besides the number of variables logged (m) and the number of batches (n). The systematic manipulation of batch data in this context originates from Nomikos and MacGregor8,9, who suggested viewing batch data as a 3D data matrix constructed from layers of batches stacked onto each other (X [n, m, k]) that can be unfolded into 2D arrays in two different ways (Fig. 1). Figure 1a illustrates one way of unfolding the 3D matrix, namely concatenating the original batch blocks vertically. This results in a 2D data matrix [nk, m] with the m process variables in the columns and the number of batches multiplied by the batch duration (nk) in the rows. The alternative approach, illustrated in Fig. 1b, creates vertical slices of the 3D matrix corresponding to a particular age for m variables and n batches. These blocks are then concatenated horizontally, each batch being one row of the resulting 2D matrix [n, mk]. This latter approach regards each batch as an independent representation of the process and each process variable as a different variable at each instant in time.
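Assuming the batch data are held as a numpy array of shape [n, m, k] (batches × variables × time), the two unfoldings of Fig. 1 are simple reshape operations; a minimal sketch:

```python
import numpy as np

def unfold_within_batch(X3):
    """Fig. 1a: stack time slices vertically -> [n*k, m].
    Each row is one time point of one batch."""
    n, m, k = X3.shape
    return X3.transpose(0, 2, 1).reshape(n * k, m)

def unfold_batch_to_batch(X3):
    """Fig. 1b: concatenate time slices horizontally -> [n, m*k].
    Each row is one complete batch history, time-slice by time-slice."""
    n, m, k = X3.shape
    return X3.transpose(0, 2, 1).reshape(n, k * m)
```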

The matrices resulting from the two unfolding approaches deliver different sets of information when submitted to PCA; using the unfolded matrices in a subsequent PCA is often referred to as multi-way PCA (MPCA). The unfolding offers the opportunity of combining on-line time-logged data with off-line data blocks, which widens the scope for modelling. The first approach (Fig. 1a) facilitates the development of on-line estimates for off-line assays that might be available at arbitrary intervals throughout the batch duration ('software sensors'). The second approach (Fig. 1b) enables off-line data records to be integrated simply with the unfolded on-line data. In addition, this approach is suitable for deriving estimators for final quality measures (e.g. productivity), which are usually not available until the batch terminates, from process data. Although this article considers only linear multivariate models, the potential offered by these unfolding approaches applies to any data-based model.

PCA itself produces scores and loadings that reflect the majority of the variation in the original data. However, it is the extension of statistical process control principles to these new variables that has enabled multivariate projection techniques to become established in the process industries as an advanced process-monitoring tool (MSPC). Adopting an approach similar to univariate charting methodology, nominal operating regions can be defined based on standard statistical distributional theory. The assumption behind these approximate confidence limits is that the underlying process exhibits a multivariate normal distribution with a population mean of zero (Ref. 9). This assumption might not necessarily be valid; however, because they are linear combinations of the measurement variables, the scores are at least approximately normally distributed as a consequence of the CENTRAL LIMIT THEOREM (see Glossary). Alternatively, more precise methods for detecting deviations from normality have been formulated10.

Fig. 2. Loading plots (loadings for PC#1, PC#2 and PC#3; variables include FV, Age, AF, AG, TE, PH, BP, CER, CO2, OUR, O2 and RQ) inferred from (a) high-yield runs and (b) low-yield runs.

In this application, the assumption of normality of the scores was adopted, resulting in contours of constant density for a multidimensional normal distribution. These confidence ellipsoids are defined by Eqn 5:

$t^T \times S^{-1} \times t = c^2$   (5)

In Eqn 5, t [r, 1] are the scores and S [r, r] is the covariance matrix of the scores. The term c provides the axes of the ellipsoid as $\pm c\sqrt{\lambda_i}$, where the λ_i are the ordered eigenvalues of the principal components. These confidence ellipsoids provide a basis for the calculation of the off-line control limits on the scores. An alternative statistic, which provides the user with the facility to identify a new event, is the squared prediction error (SPE), often referred to as the Q-statistic. Once a model representation of nominal operation has been developed, it can be used to calculate residuals as follows:

$SPE_i = e_i \times e_i^T = x_i (I - p \times p^T) x_i^T$   (6)

In Eqn 6, x_i is the ith sample in X; SPE_i is the SPE corresponding to x_i; e_i [1, m] is the residual for the ith sample; p is the loading matrix [m, r]; and I is the identity matrix. SPE is a measure of lack of fit with the established model. A model is inferred from a given data set, so new patterns that are not 'recognized' by the model (high SPE) indicate new events that were not represented in previous data sets. SPE is a quadratic form of the errors and can therefore be approximated by a weighted χ² distribution $(g\chi^2_h)$, where the weight (g) and the degrees of freedom (h) are both functions of the eigenvalues (λ) of the covariance matrix9. Given the eigenvalues – again assuming that the observations are part of a multivariate normal population – the approximate control limits for the SPE at significance level α are obtained using Eqn 7:

$SPE_\alpha = \frac{v}{2\bar{m}} \chi^2_{2\bar{m}^2/v,\,\alpha}$   (7)

In Eqn 7, $\chi^2_{2\bar{m}^2/v,\,\alpha}$ is the critical value of the χ² variable with $2\bar{m}^2/v$ degrees of freedom at significance level α, where m̄ and v are the mean and variance of the SPE calculated at each time interval.
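A sketch of Eqns 6 and 7, computing the SPE of new (scaled) samples against an established loading matrix and deriving the approximate control limit from the mean and variance of the reference-set SPE; scipy is assumed to be available for the χ² quantile:

```python
import numpy as np
from scipy.stats import chi2

def spe(Xs, p):
    """Squared prediction error for each scaled sample row (Eqn 6)."""
    E = Xs - (Xs @ p) @ p.T          # residual after projection onto the model
    return np.sum(E**2, axis=1)

def spe_limit(spe_ref, alpha=0.01):
    """Approximate SPE control limit (Eqn 7) from reference-set SPE values."""
    mbar, v = spe_ref.mean(), spe_ref.var(ddof=1)
    g, h = v / (2 * mbar), 2 * mbar**2 / v    # weighted chi-square: g * chi2_h
    return g * chi2.ppf(1 - alpha, h)
```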

Historical data analysis

Model building for fault detection and diagnosis

Process data from 144 batches were collected, comprising 17 on-line variables recorded hourly during the main fermentation (~140 hours) and 53 off-line variables recorded from operations preceding and during the fermentation stage (X). The final yield was used as the indicator of overall process performance (Y). The 65 batches with high yield were assumed to represent desirable (normal) operation and were used for model development. A further 44 batches resulted in low yield, although no abnormalities could be observed on univariate plots; once the directions of the principal components had been extracted from the normal data, the low-yield data could be superimposed in the new space and differences revealed. An additional group of 20 batches was created representing non-normal operation, namely batches in which unusual biological profiles were detected on the univariate plots. This data set was used to confirm that univariate SPC information is reflected on the multivariate plots. A few batches were left out of the analysis for validation purposes. Both unfolding approaches were implemented within this application; however, the batch-to-batch models will be discussed in greater detail owing to their superior information content and suitability for on-line control charting. It was found that five principal components were sufficient to capture the characteristics of the high-yielding batches; considerable dimension reduction could therefore be achieved, indicating a highly correlated system. For instance, the loading plots produced by the batch-to-batch model (Fig. 2) enable the visualization of the complex dynamic correlation structure between variables throughout the batch duration.

Fig. 3. Scores plot (scores for PC#1 vs PC#2 vs batch age, 0–140 h) from the within-batch-variation model (o, high yield; +, low yield).

The data in Fig. 2a confirm the underlying science: PCA distinguished the biological variables, which occupy the outer ranges of the plot and exhibit a well-defined pattern. These variables are uncontrolled or only partially controlled throughout the process and therefore exhibit higher variation. Differences in correlation structure can be detected by inspecting the loadings of the model inferred from the low-yielding runs (Fig. 2b): the respiratory data, which occupy the left side of the plot, show looser and altered interrelations, and the relationship between pH and OUR (oxygen uptake rate) is absent at the early stages, whereas it is observable in the high-yielding runs. These biological differences could not have been revealed on univariate plots because all the individual variable profiles appeared normal.

Some differences between high- and low-yielding batches are also revealed on the scores plot resulting from the within-batch-variation model, which shows the higher variation associated with low yield (Fig. 3). The scores plot illustrates the different metabolic phases of the fermentation: intense curvature at the early stages is followed by a steady direction in the multivariate space, indicating a settling process from 40–50 hours onwards. These higher variations were detected as out-of-control signals by the batch-to-batch models, mainly in the direction of the SPE statistic, which indicates that warnings were generated by new events rather than by changes in correlation structure. MSPC produced out-of-control signals in 38% of the low-yield batches, as opposed to 20% observability on the univariate plots. The univariate out-of-control signals were confirmed in 95% of cases. Investigation of the causes of these deviations is facilitated by calculating the contribution of each original variable to the scores and SPE corresponding to each sample, as follows:

$c_t(j) = x_j \times p_j$   (8)

In Eqn 8, c_t(j) is the contribution of the jth variable to the scores; x_j is the jth variable (mean centred and variance scaled); and p_j is the jth row of the loading matrix. The contribution of the jth variable to the SPE is given by Eqn 9:

$c_{SPE}(j) = (x_j - t \times p_j^T)^2$   (9)

Displaying these contributions provides easily interpretable diagnostics of particular behaviour detected on the scores or SPE plots (Fig. 4). The diagnostic information provided by the contribution plots coincided well with the previously identified 'non-normal' batches that presented unusual behaviour on univariate plots; it was therefore concluded that the univariate fault detection and diagnostic potential was confirmed.
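The contribution calculations of Eqns 8 and 9 are each a single line in numpy. A sketch for one scaled sample x of length m and loadings p [m, r]:

```python
import numpy as np

def score_contributions(x, p):
    """Eqn 8: contribution of each variable to each score, c_t(j) = x_j * p_j."""
    return x[:, None] * p            # [m, r]: variable j, component i

def spe_contributions(x, p):
    """Eqn 9: squared residual of each variable, c_SPE(j) = (x_j - t p_j^T)^2."""
    t = x @ p                        # scores of the sample
    return (x - t @ p.T) ** 2        # [m]
```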

Fig. 4. (a) Multivariate statistical process control (MSPC) control chart on overall behaviour: scores plot with 95% and 99% confidence limits, showing training and test batches, with batch 78 out of bounds. (b) Overall contribution for batch 78: contribution of each variable (FV, Age, AF, AG, TE, PH, BP, CER, CO2, OUR, O2, RQ) to each of the five principal components.

The MSPC diagnosis of out-of-control 'normal' runs was validated through discussions with process scientists. These discussions resulted in agreement in 70% of the runs, but the scientists could not confirm the multivariate diagnostics in the remaining 30% of the batches. Indeed, these 30% of runs were diagnosed as truly multivariate: the scientists could not spot the combined events because of unusual interactions.

Model building for performance estimation: Principal Component Regression

It is often of interest to relate readily available variables to process parameters that are difficult to measure, using predictive models to provide estimates. Although estimation algorithms have evolved rapidly over recent decades, they are only beginning to find their way into bioprocess applications, and traditional multiple linear regression techniques are often still in use despite their computational difficulties with correlated data11. These numerical problems are resolved by PCA itself: the projected data in the orthogonal space offer improved mathematical properties in a subsequent regression step over the original data4. This fact laid the foundations of projection to latent structures (PLS; see Glossary), a powerful iterative regression tool for multiple-input/multiple-output (MIMO) systems. If only a single output is concerned, PLS offers no benefit over a simple regression step on the principal component scores (principal component regression, PCR):

$\hat{Y} = t \times b + e$   (10)

In Eqn 10, Ŷ is the estimated output, t [n, r] are the scores, b are the regression coefficients and e is the residual vector. Multi-way PCR is thus suitable for providing estimates of either inferential variables on-line (within-batch-variation models) or other off-line parameters (batch-to-batch models). The main area of interest in the present application was explaining final performance, indicated by the final product concentration in the broth. For the purpose of modelling productivity, the same set of production runs was used as described previously. However, accurate models can only be produced if the reference data range is extended to comprise all possible process behaviour. Again, the variance/bias principles apply, and the use of resampling techniques will result in the best model performance. In this case, the available data were randomly shuffled to ensure that both high- and low-yielding batches were represented in the data used for predictive modelling. A frequently used term to quantify the performance of regression models is the 'squared multiple correlation coefficient' (the R² statistic), which quantifies the ability of the model to reproduce the original data. However, where the data are corrupted with high levels of uncertainty, this measure can be misleading because it is

impossible for any model to capture the noise present in such measurements. No model, however good, can explain the part of the variation in the original data that is caused by pure error, and an R² of 1.0 will never be achieved. In this light, the statistic indicated 58% explained variability (R² = 0.58) for the PCR model. The remaining variation is partly owing to the typically 30% measurement noise in the potency assay; it was therefore concluded that the 'missing' 12% of variation is attributable to information that is currently not gathered around the plant.
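A sketch of PCR (Eqn 10) with the R² diagnostic discussed above, assuming scores t from a previous PCA step and a yield vector y; because the scores are orthogonal, the least-squares step is numerically well conditioned:

```python
import numpy as np

def pcr_fit(t, y):
    """Inner regression coefficients b of Eqn 10 by least squares on scores."""
    b, *_ = np.linalg.lstsq(t, y, rcond=None)
    return b

def r_squared(y, y_hat):
    """Fraction of output variance explained by the model."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```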

Variable influences

Another frequently requested task of modelling exercises is to identify the variables that are the major process drivers. Industrial environments usually cannot deliver data suitable for understanding variable influences, because the variables are strictly controlled and their effects therefore cannot be investigated. The most appropriate knowledge about variable influences originates from fundamental process understanding or designed experimentation. In fact, prior knowledge concerning variable effects is a useful starting point for model building, as irrelevant information will probably deteriorate model performance12; the selection of correct input variables is therefore essential. Nevertheless, the issue has been addressed via data-based modelling: for instance, standard applied regression packages use statistical significance tests with various criteria (such as maximum R², the F-test and Mallow's Cp), which provide the basis for stepwise procedures. The assumptions of normality and linearity might not necessarily be valid; nevertheless, these techniques, coupled with pairwise correlations, are frequently used in industry for data exploration and as an aid to decision making. Some of the limitations are addressed as the level of sophistication in modelling increases13,14.

Although the loading plot provides information about the variability presented by various process variables, it does not provide information about the relationships of these variables to the process output. The present application set out to investigate variable influences on the output (yield) through the regression coefficients. The PCR coefficients established in Eqn 10 are not suitable for this purpose, because they relate the compressed data (scores) to the output variable; we shall refer to this term as the inner regression coefficient (b). It is possible to derive another set of coefficients that directly relate the process data (X) to the process output (Y) by substituting Eqn 1 into Eqn 10:

$\hat{Y} = X \times p \times b + e = X \times B + e$   (11)

In Eqn 11, B = p × b is another set of PCR coefficients, corresponding to the original process data; we shall describe this term as the outer coefficient.
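Folding the loadings into the regression (Eqn 11) is a single matrix product; a sketch continuing from the PCA and PCR steps above:

```python
import numpy as np

def outer_coefficients(p, b):
    """Eqn 11: B = p b maps the scaled process data X directly to the output,
    since Y_hat = t b = (X p) b = X (p b) = X B."""
    return p @ b          # [m] (or [m*k] for batch-to-batch models)
```

For a batch-to-batch model, p has one row per (variable, time) pair, so B assigns a time-varying coefficient to every process variable; this is the basis of the influence profiles discussed below.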

Fig. 5. The effect of unfolding on the regression model [multi-way principal component regression (PCR)]: (a) within-batch model and (b) batch-to-batch model. In both cases the inner coefficients are obtained as b = (tᵀt)⁻¹tᵀY and the outer coefficients as B = p × b; the batch-to-batch unfolding produces much larger loading matrices and hence larger B.

The use of a linear, steady-state technique such as PCR to model an essentially non-linear and dynamic process can be justified by examining the loadings. Figure 5 illustrates the effect of unfolding on the loadings, which essentially define the complexity and power of PCR. In the case of the within-batch models, the loading matrices are relatively small and the non-linear and dynamic features remain in the projected data, violating some of the mandatory statistical assumptions for multivariate charting of scores. The loading matrices resulting from the batch-to-batch models, however, are much larger because they include all the non-linear and dynamic process characteristics. Indeed, it is the rich information content of the loadings that makes this approach a powerful regression tool, as the size of the outer regression coefficients (B; Fig. 5) is primarily determined by the size of the loading matrices. The regression coefficients indicated the respiratory activities as being the most influential on yield, showing similar profiles with the exception of the oxygen concentration in the exhaust gas, which demonstrated an inverse effect on yield. These effects varied in sign and magnitude throughout the batch duration, indicating metabolic changes and their effect on productivity. This information was particularly welcomed by the process scientists, who could directly relate and interpret these effects against fundamental knowledge.

MSPC in on-line monitoring of batch processes

The off-line analysis of historical data provided a useful tool for learning from data and suggested that multivariate methods potentially outperform univariate process monitoring. Real benefit only results, however, if process deviations are detected and diagnosed in real time, through the development of on-line control limits on the scores and SPE statistics together with their contributions.

The on-line control limits encapsulate the natural process variation at each time interval, which is assumed to be normal, as batches are independent instances of the process (batch-to-batch model). Under the assumption of normality, the control limits at significance level α for each score at any given time interval are given by Eqn 12:

$\pm t_{n-1,\alpha/2} \times s_{ref} \times \sqrt{1 + \frac{1}{n}}$   (12)

In Eqn 12, t_{n−1,α/2} is the critical value of Student's t-statistic with n−1 degrees of freedom at significance level α; n is the number of observations; and s_ref is the estimated standard deviation of the sample of t scores at a given time15. SPE is a quadratic form of the errors associated with each time interval k; these errors were found to be well approximated by a χ² distribution, as previously given by Eqn 7. Simulation of the on-line control charts of historical batches confirmed that the previously identified deviations can be detected and diagnosed in real time. Figure 6 illustrates the on-line scores and SPE monitoring plots with the corresponding contribution plots at each time interval. The batch shown stayed in control for most of the run but, towards termination, the SPE shows an excursion into the warning and out-of-control regions because of biological events indicated by the symptom variables pH and respiratory data. Besides improved detection and diagnosis of undesirable process behaviour, one particular area of benefit was the timely detection of equipment malfunctions: pH probe and mass spectrometer failures resulted in warning signals that could be instantly corrected, preventing the collection of invalid data over long periods.
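A sketch of the per-interval score limits of Eqn 12, computed from the reference batches of the batch-to-batch model; scipy is assumed for the Student's-t quantile, and the limits are centred on zero in line with the mean-centred scores:

```python
import numpy as np
from scipy.stats import t as student_t

def score_limits(scores_ref, alpha=0.05):
    """Eqn 12: +/- control limits for one score at each time interval.
    scores_ref is [n_batches, k_intervals] for a single principal component."""
    n = scores_ref.shape[0]
    s_ref = scores_ref.std(axis=0, ddof=1)            # per-interval std dev
    half_width = student_t.ppf(1 - alpha / 2, n - 1) * s_ref * np.sqrt(1 + 1 / n)
    return -half_width, half_width
```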

Fig. 6. Multivariate statistical process control (MSPC) in on-line process monitoring and diagnostics: on-line charts of scores 1 and 2 and SPE against batch time (0–150 h) for batch 67, each with the corresponding per-variable contribution plot (FV, Age, AF, AG, TE, PH, BP, CER, CO2, OUR, O2, RQ).

The on-line contribution plots showed that the initial deviations were associated with operational constraints at inoculation, whereas those at later stages were caused by aeration problems. These effects had not previously been recognized as potentially harmful by production staff. Although MSPC models are not cause-effect models, the fact that these conditions persistently resulted in lower yield indicated that they should be avoided, or that their effect on yield should be confirmed by experimentation.

MSPC in performance forecasting

It would be of quantifiable benefit to estimate the eventual performance at an early stage of the fermentation, because no performance measure is available until the batch terminates. Nomikos and MacGregor8,9 proposed a methodology for using MPCA approaches to forecast final batch performance. These approaches are based on assuming certain process behaviour for the rest of the batch duration: (1) average behaviour of the runs used for model building; (2) persistent deviations from the average trajectories; or (3) no assumption, that is, estimating final performance based on current behaviour. The authors acknowledge that the different assumptions might be more or less valid for different processes; overall, however, they recommend the latter in

most cases. A further important assumption associated with all the above approaches is that normal operation is maintained throughout the batch. In this contribution, an additional assumption is proposed: that the unknown future behaviour will be in accordance with the existing PCA model. This approach projects the incoming measurements into the reduced space and estimates the final scores and SPE as:

$\hat{t}_K = (p_k^T \times p_k)^{-1} \times p_k^T \times x_{new,k}$   (13)

In Eqn 13, the batch duration runs from 1...k...K; t̂_K is the estimated final score vector; p is the loading matrix and p_k is the loading matrix up to time interval k; and x_{new,k} is the vector of incoming measurements up to the current interval k. The advantage of this approach originates in the rich information content of the loading matrix, which captures all the dynamic and non-linear features of the underlying correlation structure in the data. It is thus assumed that the correlation structure specified in the loadings will persist for the rest of the run; indeed, if the data were not in accordance with this structure, an out-of-control signal would be obtained.
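A sketch of Eqn 13: with measurements available only up to interval k, the final scores are estimated by least-squares projection onto the rows of the loading matrix seen so far. The flattened layout of x and p assumed here follows the batch-to-batch unfolding of Fig. 1b:

```python
import numpy as np

def forecast_final_scores(x_new_k, p, k, m):
    """Eqn 13: estimate final scores from a partially completed batch.
    x_new_k: measurements up to interval k, flattened to length m*k;
    p: full loading matrix [m*K, r] from the batch-to-batch model."""
    p_k = p[: m * k, :]                        # loadings seen so far
    # least squares is equivalent to (p_k^T p_k)^-1 p_k^T x_new_k
    t_hat, *_ = np.linalg.lstsq(p_k, x_new_k, rcond=None)
    return t_hat
```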


The different forecasting approaches were contrasted for the tylosin fermentation process, with the aim of identifying the most suitable approach for estimating the final scores and SPE at a relatively early stage. The findings suggested that making no assumptions [approach (3)] and the approach proposed in Eqn 13 outperformed the other two alternatives for the tylosin fermentation process, showing faster convergence towards the actual final performance. Equation 13 provided similar forecasting ability for runs that were in control on the multivariate charts; however, this approach was unsuitable for forecasting the outcomes of poorly performing batches, for which MacGregor's favoured approach [approach (3)] provided valid forecasts. The analysis of the two forecasting algorithms showed that the overall scores and final SPE were predictable by ~70 hours to within 10% accuracy. Furthermore, if the final scores can be forecast, they can serve as a basis for forecasting the actual final performance through a regression step on the forecasted scores. It was found that, based on 70 hours of data, it was possible to forecast the final yield with an R² of 0.52, as opposed to the 0.58 obtainable from the full on-line and off-line data set. These findings confirm a well-known phenomenon concerning fermentation, namely that once a fermentation has started it can be made worse but not better16; in other words, the process is somewhat 'self-determined' at a relatively early stage.

Implementation

A prototype off-line package was developed (in Matlab) to perform various data pre-processing tasks, such as screening the data for missing values, outliers and noise, and input selection. The graphical user interfaces allow the user to create, save and re-use multivariate models easily, and to present the models, the data and various statistics visually. The off-line tool includes a simulation of the on-line scenario, whereby the user can 'play back' the on-line multivariate monitoring charts of any particular batch, so that the capabilities of a potential model (monitoring system) can be investigated off-line. Various models might be built and tested off-line; once reasonable confidence is gained in future on-line performance, the model parameters can be saved and readily loaded into the on-line application. The on-line application is designed to assist the operators in routine fault detection and diagnosis, similarly to the conventional and familiar univariate SPC system. As the real-time G2 (see Glossary) knowledge-based system is currently in use for process monitoring around the plant, this platform was used to implement the on-line batch MSPC algorithms as an optional add-on module to the existing SPC scheme. G2 procedures are linked to the data source, calculating the scores, SPE and variable contributions, which are compared with the limits and directly initiate alarms within the G2 system.


Charting in G2 is similar to that illustrated in Fig. 6, with the intention of simplicity and similarity to the conventional control charts. As well as displaying the current batch status, the progression of a batch from the start is available, allowing retrospective analysis of alarm conditions and of the progression of malfunctions.

Discussion and industrial perspectives

Although the application of data-based techniques is appealing because of their relatively low resource requirements and rapid model development times, one of the main lessons learnt through this application is the significance of representative data. There appears to be a general feeling that raw information from industrial instrumentation does not provide sufficient insight into microbial behaviour. Although data quality is difficult to define, it is vital to the success of data-based technologies such as MSPC. Industrial data impose significant challenges on the modeller: routine data collection is tailored towards the needs of production staff, and data are often recorded in different places, or are missing, or are corrupted with noise, outliers and undesired shifts that may or may not be representative of true process events. Although the need for information is generally recognized and is being addressed in industry, as the appearance of data historians indicates, the quality of the gathered information often varies across the plant, posing limitations on integrated applications of advanced technologies.

Another important issue is process change resulting from continuous improvement efforts in R&D. Frequent minor changes necessitate the development of adaptive models, and major changes result in invalid models. These issues can be addressed through the combined use of knowledge-based and data-based approaches. Assuming that generic process knowledge is still valid, fundamental models or expert system approaches can be used while there is a lack of data for an essential model update; as data become available, new models can capture the idiosyncrasies associated with the modified process. Hence, there is certainly scope for integrated solutions, such as knowledge-based supervisory systems (KBS), exploiting both data and expert knowledge.

Regulatory authorities might pose a major challenge to the wider use of advanced technologies in pharmaceutical bulk manufacturing. Besides the main incentive of producing safe medicines, in today's competitive climate the speed of process development and scale-up for potential drug candidates is the major driver for pharmaceutical businesses. It is therefore often more beneficial to set up processes rapidly than to design optimal ones. Once such processes are in place, essential regulatory and safety procedures may remove the incentive for major process optimization efforts.


Acknowledgments
The authors would like to express their appreciation to S. Martin for sharing his knowledge about tylosin fermentation, D. Keates and D. Range for the on-line implementation, P. Mohan for managing this project, and all colleagues at Eli Lilly and Company Ltd who made this development possible.

References
1 Montgomery, D.C. (1996) Introduction to Statistical Quality Control, John Wiley & Sons
2 Duke, P. (1992) KAT – A Knowledge Acquisition Technique, Methodology Manual, CK Design, UK
3 Glassey, J. et al. (2000) Issues in industrial advisory system development. Trends Biotechnol. 18, 136–141
4 Jolliffe, I.T. (1986) Principal Component Analysis, Springer-Verlag
5 Wise, B.M. et al. (1990) A theoretical basis for the use of principal components models for monitoring multivariate processes. Process Control and Quality 1, 41–55
6 MacGregor, J.F. (1994) Statistical process control of multivariate processes. IFAC World Congress, Kyoto, Japan
7 Geman, S. et al. (1992) Neural networks and the bias/variance dilemma. Neural Computation 4, 1–58
8 Nomikos, P. and MacGregor, J.F. (1994) Monitoring batch processes using multiway principal component analysis. AIChE Journal 40, 1361–1375
9 Nomikos, P. and MacGregor, J.F. (1995) Multivariate SPC charts for monitoring batch processes. Technometrics 37, 41–59
10 Martin, E.B. and Morris, A.J. (1996) Non-parametric confidence bounds for process performance monitoring charts. J. Process Control 6, 349–358
11 Montague, G.A. (1997) Monitoring and Control of Fermenters, IChemE, UK
12 Morris, A.J. et al. (1994) Artificial neural networks: studies in process modelling and control. Transactions IChemE–A 72
13 Alsberg, B.K. et al. (1998) Variable selection in wavelet regression models. Analytica Chimica Acta 368, 29–41
14 Wagner, M.G. et al. (1998) Probabilistic data based modelling: a new technique for data analysis and process modelling. Proc. UKACC Int. Conf. on Control '98 1, 201–206
15 Hahn, G.J. and Meeker, W.Q. (1991) Statistical Intervals: A Guide for Practitioners, John Wiley
16 Ignova, M. et al. (1999) Quality analysis with self organising neural networks. Biotech. Bioeng. 64, 82–91
