Soft-sensor development for fed-batch bioreactors using support vector regression




Biochemical Engineering Journal 27 (2006) 225–239

Kiran Desai, Yogesh Badhe, Sanjeev S. Tambe*, Bhaskar D. Kulkarni

Chemical Engineering and Process Development Division, National Chemical Laboratory, Dr. Homi Bhabha Road, Pune 411008, India

Received 8 October 2004; received in revised form 9 June 2005; accepted 2 August 2005

* Corresponding author. Tel.: +91 20 25893095; fax: +91 20 25893041. E-mail address: [email protected] (S.S. Tambe).

1369-703X/$ – see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.bej.2005.08.002

Abstract

In the present paper, a state-of-the-art machine-learning based modeling formalism known as "support vector regression" (SVR) has been introduced for soft-sensor applications in fed-batch processes. The SVR method possesses a number of attractive properties such as a strong statistical basis, convergence to a unique global minimum and an improved generalization performance of the approximated function. Also, the structure and parameters of an SVR model can be interpreted in terms of the training data. The efficacy of the SVR formalism for the soft-sensor development task has been demonstrated by considering two simulated bio-processes, namely invertase and streptokinase. Additionally, the performance of the SVR based soft-sensors is rigorously compared with those developed using the multilayer perceptron and radial basis function neural networks. The results presented here clearly indicate that the SVR is an attractive alternative to artificial neural networks for the development of soft-sensors in bioprocesses.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Artificial neural networks; Bioreactor; Soft-sensors; Support vector regression; Multilayer perceptron; Radial basis function network

1. Introduction

The task of monitoring and controlling bioprocesses efficiently and robustly faces major difficulties such as the existence of significant uncertainties emanating from the frequently observed complex nonlinear process dynamics and the lack, in most cases, of reliable and robust hardware and/or bio-sensors for measuring the values of the process variables, including the product quality variables [1]. The first of these difficulties can be overcome by constructing an appropriate process model that is capable of describing the nonlinear process dynamics accurately. There exist two approaches, namely phenomenological and empirical, for developing such a process model. The former approach requires a detailed knowledge of the bio-chemical phenomena (kinetics, mass transfer processes, etc.) underlying the process.



In the case of a complex nonlinear bioprocess, the requisite knowledge is usually not available and, therefore, the empirical approach is used for the model development. The second significant difficulty in the monitoring and control of bioprocesses, i.e., the lack of reliable sensors, can be addressed by developing soft-sensors, which are capable of an accurate and real-time estimation of the values of the process and product quality variables. Soft-sensors are sophisticated software-based monitoring systems, which can relate the less accessible and infrequently measured process variables to those measured easily and frequently. In essence, soft-sensors correlate the unmeasured process and product quality variables with the frequently measured process variables and thus assist in making real-time predictions of the unmeasured variables. Consequently, soft-sensors can be gainfully employed in the control and monitoring of bioprocesses for which the values of the important process and product quality variables are not available continuously. The predictive performance of soft-sensors depends upon the reliable measurements of the easily accessible process variables and also on the mathematical and/or statistical techniques used in the interpretation and correlation of the process data.



Nomenclature

C        regularization constant
cinv     invertase concentration (kU/g)
D        training data set
e        ethanol concentration (g/l)
Et       termination criterion
Etrn, Etst   RMSEs pertaining to the training and test sets
F        feature space
J        number of nodes in the hidden layer of an MLP/RBF neural network
K        kernel function
L        loss function
la       lactic acid concentration (g/l)
M        number of nodes in the output layer of an MLP network
N        dimension of the input vector
NI       number of nodes in the input layer of an MLP network
p        number of example data patterns
P        probability function
q        feed rate of glucose (l/h)
Remp     empirical risk
Rreg     regression risk
s        substrate concentration (g/l)
s0       initial substrate concentration (g/l)
st       streptokinase concentration (g/l)
t        time
v        reactor volume (l)
v0       initial reactor volume (l)
w        weight (parameter) vector
x        vector of model input variables
xa       active (plasmid containing) biomass concentration (g/l)
xt       total biomass concentration (g/l)
xa0      active biomass concentration at time t = 0 (g/l)
xt0      total biomass concentration at time t = 0 (g/l)
yi       scalar output

Greek letters
αi, αi*  Lagrange multipliers
ε        precision parameter in SVR
εloss    loss function parameter in LIBSVM
εtol     tolerance for the termination criterion in LIBSVM
η        learning rate in the EBP algorithm
µEBP     momentum coefficient in the EBP algorithm
ξ, ξ*    slack variables
Φ        feature space function
σ        width of the RBF

A number of approaches have been illustrated for the development of soft-sensors in batch, fed-batch and continuous bioprocesses and these have also been reviewed extensively [2–5]. The proposed approaches are always based on either a phenomenological [6–10] or an empirical process model (see, e.g., [11]).

Often, the development of a soft-sensor based on a phenomenological process model becomes difficult or impossible owing to the unavailability of the requisite detailed mechanistic knowledge of the process. Also, in many instances the bioprocesses are described using simple models, which perform poorly due to reasons such as the occurrence of several sequential and simultaneous reactions, the adaptability of organisms during short periods of time and inevitable shifts in the operating conditions. Moreover, other process-influencing factors, for instance temperature or shearing, cannot be described entirely using a simple mathematical framework, thus requiring a rigorous mathematical treatment to account for their effects [12]. The principal difficulty with the other, i.e., the empirical, approach to soft-sensor development is that the form of the empirical data-fitting function needs to be specified before an estimation of the function parameters can be attempted. For nonlinear processes, a large number of functions can be devised as candidate solutions to the empirical data-fitting problem. Consequently, arriving at an accurate data-fitting function requires an extensive numerical effort. Despite expending such an effort, there is no guarantee that a correct data-fitting function can indeed be found in a finite number of trials. In the last decade, artificial neural networks (ANNs) have emerged as an efficient nonlinear modeling formalism [13–15]. The widely used ANN paradigm, namely the multilayer perceptron (MLP), possesses the following characteristics: (i) given an example set comprising model inputs and the corresponding outputs, an MLP network learns (via a suitable 'learning' algorithm) the nonlinear relationships existing between the inputs and the outputs to make quick predictions of the outputs, (ii) the input–output nonlinear relationships can be constructed solely from the historic process data, that is, detailed knowledge of the physico-chemical phenomena underlying a process is unnecessary for the model development, (iii) a properly trained ANN model possesses an excellent generalization ability owing to which it can accurately predict the outputs for a new input data set, and (iv) even multiple input–multiple output nonlinear relationships can be approximated simultaneously and easily. Owing to these attractive properties, ANNs have become a powerful formalism not only for constructing exclusively data-driven nonlinear empirical process models but also for developing soft-sensors [12,16–25]. ANNs are considered "black-box" models since their performance depends only on the quality and size of the data, and on the structure of the model. A drawback of ANNs, which is shared by all types of black-box models, is that the resultant model and its parameters are difficult to interpret from the viewpoint of gaining an insight into the data used in the model building exercise. More recently, a statistical/machine learning theory based formalism known as support vector regression (SVR) [26], which shares many of its features with ANNs but possesses some additional desirable characteristics, has been gaining widespread acceptance in data-driven nonlinear modeling applications.


Specifically, the SVR methodology possesses the following properties: good generalization ability of the regression function, robustness of the solution, sparseness of the regression and an automatic control of the solution complexity. Another attractive feature of the SVR is that it provides an explicit knowledge of the data points that are important in defining the regression function. This SVR feature allows an interpretation of the SVR-approximated model in terms of the training data. Despite being endowed with a number of attractive properties, the SVR, being a relatively new formalism, is yet to be explored widely in bio-engineering/technology applications. Hence, in the present paper, the SVR has been introduced for developing soft-sensors for bioprocesses. In particular, two simulated fed-batch processes, namely invertase and streptokinase, have been considered for the SVR-based soft-sensor development. It may be noted that, like ANNs, the SVR is a generic nonlinear modeling formalism and therefore it can also be used in a wide spectrum of bioengineering/technology tasks such as steady-state and dynamic modeling, process identification, model-based control, fault detection and diagnosis, and monitoring. Thus, in addition to comparing the performance of the SVR-based soft-sensors with those developed using ANNs, the present paper discusses a number of practical issues that are pertinent to the development of optimal SVR models. For comparing the performance of the SVR based soft-sensors, two standard ANN paradigms, namely the multilayer perceptron and the radial basis function network (RBFN), have been considered. The choice of RBFNs stems from the fact that the SVR and RBFNs both employ radial basis functions in their formulation. The results of the two case studies clearly indicate that the SVR is an attractive strategy for developing soft-sensors for bioprocesses. The remainder of this paper is organized as follows. In Section 2, the regression problem underlying the soft-sensor development and the theory of SVR are explained in detail. For the sake of brevity, Section 3 gives only a short description of the RBFNs. The two case studies illustrating the SVR-based soft-sensor development for the fed-batch invertase and streptokinase processes are reported in Section 4. This section also presents a comparison of the performance of the SVR, MLP and RBFN based soft-sensors. Finally, Section 5 provides concluding remarks.

2. Regression formulation

For understanding the working principles of the SVR formalism, we first formulate the general problem of regression estimation in the framework of statistical learning theory. Accordingly, consider a set of measurements (training data), $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{p}$, where $\mathbf{x}_i \in \mathbb{R}^{N}$ is a vector of the model inputs and $y_i \in \mathbb{R}$ represents the corresponding scalar output. The objective of the regression analysis is to determine a function, f(x), so as to predict accurately the desired (target) outputs, {y}, corresponding to a new set of input–output examples, {(x, y)}, which are drawn from the same underlying joint probability distribution, P(x, y), as the training set, D.


In essence, the given task is to find a function, f, that minimizes the expected risk, R[f], defined as

$$R[f] = \int L\big(f(\mathbf{x}) - y\big)\, dP(\mathbf{x}, y) \qquad (1)$$

where L denotes a loss function. For a given function, f(x), the expected risk (test error) is the possible average error committed by the function in predicting the output pertaining to an unknown example drawn randomly from the sample probability distribution, P(x, y). The loss function, L, indicates how this error is penalized. In practice, the true distribution, P(x, y), is not known and, therefore, Eq. (1) can not be evaluated. Thus, an inductive principle is used to minimize the expected risk. In this approach, a stochastic approximation to R[f], called the empirical risk (Remp), is computed (see Eq. (2)) by sampling the data and its minimization is performed:

$$R_{emp} = \frac{1}{p}\sum_{i=1}^{p} L\big(f(\mathbf{x}_i) - y_i\big). \qquad (2)$$

The empirical risk is a measure of the prediction error with respect to the training set, i.e., the difference between the outputs predicted by the function, f(xi), and the corresponding target outputs, yi (i = 1, 2, ..., p). It approaches the expected risk as the number of training samples goes to infinity, i.e.,

$$\lim_{p \to \infty} R_{emp}[f] = R[f]. \qquad (3)$$

This, however, implies that for a small-sized training set, minimization of Remp does not ensure minimization of R[f]. Thus, selecting an f(x) solely on the basis of empirical risk minimization does not guarantee a good generalization performance (i.e., the ability to predict accurately the outputs in respect of the test set inputs) by the regression function. The inability of a function to generalize originates from a phenomenon known as 'function over-fitting'. It occurs when the regression function attains a complex character and thereby fits not only the mechanism underlying the training data but also the noise contained therein. For overcoming the problem of over-fitting and thereby enhancing the generalization ability of the fitting function, f(x), it is necessary to implement what is known as capacity control. The "capacity" of a regression model is a measure of its complexity. For instance, a very high-degree polynomial that assumes a wiggly shape and fits the training data accurately, but does not generalize well outside the training data, has a high capacity [26]. In the SVR formalism, a capacity control term is included to overcome the problem of function over-fitting. The underlying idea is: if we choose a function (hypothesis) from a low capacity function space that yields a small empirical risk, then the true risk, R[f], is also likely to be small.
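The over-fitting behaviour described above is easy to reproduce numerically. The following short Python sketch (not taken from the present study; the underlying function, noise level and polynomial degrees are arbitrary illustrative assumptions) fits polynomials of increasing degree to a small noisy training set: the training RMSE keeps falling as the capacity grows, while the test RMSE eventually rises.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small noisy data set drawn from a smooth underlying function (assumed for illustration).
x_trn = np.linspace(0.0, 1.0, 15)
x_tst = np.linspace(0.0, 1.0, 50)
f = lambda x: np.sin(2.0 * np.pi * x)
y_trn = f(x_trn) + rng.normal(scale=0.15, size=x_trn.size)
y_tst = f(x_tst) + rng.normal(scale=0.15, size=x_tst.size)

for degree in (1, 3, 9, 14):
    coeffs = np.polyfit(x_trn, y_trn, deg=degree)       # empirical risk minimization only
    rmse_trn = np.sqrt(np.mean((np.polyval(coeffs, x_trn) - y_trn) ** 2))
    rmse_tst = np.sqrt(np.mean((np.polyval(coeffs, x_tst) - y_tst) ** 2))
    print(f"degree {degree:2d}: train RMSE = {rmse_trn:.3f}, test RMSE = {rmse_tst:.3f}")
```

The high-degree fits illustrate a high-capacity hypothesis: small empirical risk but large expected (test) risk.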



2.1. SVR-based function approximation

The support vector regression is an adaptation of a recently introduced statistical/machine-learning-theory based classification paradigm, namely support vector machines [27–31]. In SVR, the inputs are first nonlinearly mapped into a high dimensional feature space (F) wherein they are correlated linearly with the outputs. The SVR formulation follows the structural risk minimization (SRM) principle, as opposed to the empirical risk minimization (ERM) approach commonly employed by the conventional statistical/machine learning methods, and also by ANNs. In the ERM approach, a suitable measure of the prediction error, such as the root mean square error (RMSE) pertaining to the training set outputs, is minimized. Since the ERM is based exclusively on the training set error, it does not guarantee a good generalization performance by the resultant model. On the other hand, the SRM feature equips the SVR model with a greater potential to generalize the input–output relationship learnt during its training phase. This improved generalization performance stems from creating an optimized model such that the prediction error and the model complexity are minimized simultaneously. To solve a nonlinear regression problem, the SVR formalism considers the following linear estimation function:

$$f(\mathbf{x}) = (\mathbf{w} \cdot \Phi(\mathbf{x})) + b \qquad (4)$$

where w denotes the weight (parameter) vector, b is a constant, Φ(x) denotes the feature mapping function and (w · Φ(x)) describes the dot product in the feature space, F, such that Φ: x → F, w ∈ F. In SVR, the problem of nonlinear regression in the lower dimensional input space (x) is transformed into a linear regression problem in a high dimensional feature space, F. In other words, the original optimization problem involving a nonlinear regression is recast as searching for the flattest function in the feature space, F, and not in the input space, x. To avoid an over-fitting of the regression model, the SVR formalism minimizes the following regularized risk functional comprising the empirical risk and a complexity term, ‖w‖²:

$$R_{reg}[f] = R_{emp}[f] + \frac{1}{2}\|\mathbf{w}\|^{2} \qquad (5)$$

where Rreg denotes the regression risk and ‖·‖ is the Euclidean norm. The stated Rreg minimization leads to the penalization of the model complexity while simultaneously keeping the empirical risk small. The regularization term, (1/2)‖w‖², in Eq. (5) controls the trade-off between the complexity and the approximation accuracy of the regression model to ensure that the model possesses an improved generalization performance. Specifically, the complexity of the linear function is controlled by keeping ‖w‖ as small as possible. Eq. (5) is similar to the cost function augmented with a standard weight-decay term used in developing ANN models possessing good generalization ability. This approach decreases the complexity of an ANN model by limiting the growth of the network weights via a kind of weight decay.

Specifically, the weight-decay method prevents the weights from growing too large unless it is really necessary [32]. This is achieved by adding a term to the cost function that penalizes large weights. The resultant form of the cost function is [32,33]:

$$E = E_0 + \frac{\gamma}{2}\sum_{i,j} w_{ij}^{2} \qquad (6)$$

where E and E0 denote the modified and original cost functions, respectively, γ is a parameter governing how strongly the large weights are penalized, and wij are the weights on the connections between the ith and jth network nodes. The commonly used ANN training procedures, such as the error-back-propagation algorithm, minimize only E0, which in most cases represents the sum-squared-error (SSE) with respect to the training set. A comparison of the respective terms of Eqs. (5) and (6) indicates that the minimization of the regression risk attempted by the SVR is similar to the minimization conducted by the ANNs of a cost function comprising a weight-decay (penalty) term. However, the SVR and ANNs use conceptually different approaches for minimizing the respective cost functions. A number of loss functions such as the Laplacian, Huber's, Gaussian and ε-insensitive can be used in the SVR formulation. Among these, the robust ε-insensitive loss function (Lε) [28], given below, is commonly used:

$$L_{\varepsilon}\big(f(\mathbf{x}) - y\big) = \begin{cases} |f(\mathbf{x}) - y| - \varepsilon & \text{for } |f(\mathbf{x}) - y| \ge \varepsilon \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where ε is a precision parameter representing the radius of the tube located around the regression function, f(x) (see Fig. 1). The region enclosed by the tube is known as the 'ε-insensitive' zone since the loss function assumes a zero value in this region and, as a result, it does not penalize prediction errors with magnitudes smaller than ε. The empirical risk minimization using the symmetric loss function (Eq. (7)) is equivalent to adding the slack variables, ξi and ξi*, i = 1, 2, ..., p, into the functional R[f] with a set of linear constraints. The slack variables ξi and ξi* measure the deviation (yi − f(xi)) from the boundaries of the ε-insensitive zone. Accordingly, using the ε-insensitive loss function and introducing the regularization constant, C, the optimization problem in Eq. (5) can be written as

$$\text{Minimize:}\quad \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{p}\big(\xi_i + \xi_i^{*}\big) \qquad (8)$$

subject to

$$\begin{cases} (\mathbf{w} \cdot \Phi(\mathbf{x}_i)) + b - y_i \le \varepsilon + \xi_i^{*} \\ y_i - (\mathbf{w} \cdot \Phi(\mathbf{x}_i)) - b \le \varepsilon + \xi_i \\ \xi_i,\ \xi_i^{*} \ge 0 \quad \text{for } i = 1, \ldots, p. \end{cases} \qquad (9)$$

While conducting the stated minimization, the SVR optimizes the position of the ε-tube around the data as shown in Fig. 1.



Fig. 1. (a) A schematic representation of the SVR using ε-insensitive loss function, (b) the linear ε-insensitive loss function in which the parameter C determines the slope.

Specifically, the optimization criterion in Eq. (9) penalizes those training data points whose y values lie more than ε distance away from the fitted function, f(x). In Fig. 1, the stated excess positive and negative deviations are illustrated in terms of the slack variables, ξ and ξ*, respectively. These variables assume non-zero values outside the [ε, −ε] region. While fitting f(x) to the training data, the SVR minimizes the training set error by minimizing not only ξi and ξi*, but also ‖w‖², with the objective of increasing the flatness of the function, i.e., penalizing its over-complexity. This serves to avoid an under-fitting as well as an over-fitting of the training data. It was demonstrated by Vapnik [27,28] that the function defined below, possessing a finite number of parameters, can minimize the regularized risk functional in Eq. (8):

$$f(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\alpha}^{*}) = \sum_{i=1}^{p} (\alpha_i - \alpha_i^{*})\,\big(\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x})\big) + b \qquad (10)$$

where αi and αi* (both ≥ 0) are the coefficients (known as "Lagrange multipliers") pertaining to the input data vector, xi, and satisfying αi αi* = 0, i = 1, 2, ..., p. It can be noticed that both the optimization problem (Eqs. (8) and (9)) and its solution (Eq. (10)) involve the computation of a dot product in the feature space, F. These computations become time consuming and cumbersome when F is high-dimensional. It is, however, possible to use what is known as the "kernel trick" to avoid computations in the feature space. This trick uses Mercer's condition, which states that any positive semi-definite, symmetric kernel function, K, can be expressed as a dot product in the high-dimensional space. The advantage of a kernel function is that the dot product in the feature space can now be computed without actually mapping the vectors x and xi into that space. Thus, using a kernel function, all the necessary computations can be performed implicitly in the input space instead of the feature space. Although several choices for the kernel function, K, are available, the most widely used kernel function is the radial basis function (RBF) defined as

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(\frac{-\|\mathbf{x}_i - \mathbf{x}_j\|^{2}}{2\sigma^{2}}\right) \qquad (11)$$

where σ denotes the width of the RBF. A list of other possible kernel functions is given in Appendix A. Upon substituting the dot product in Eq. (10) with a kernel function, the general form of the SVR-based regression function can be written as

$$f(\mathbf{x}, \mathbf{w}) = f(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\alpha}^{*}) = \sum_{i=1}^{p} (\alpha_i - \alpha_i^{*})\, K(\mathbf{x}, \mathbf{x}_i) + b \qquad (12)$$

where the weight vector w is expressed in terms of the Lagrange multipliers α and α*. The values of these multipliers are obtained by solving the following convex quadratic programming (QP) problem:

$$\text{Maximize:}\quad R(\boldsymbol{\alpha}^{*}, \boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i,j=1}^{p} (\alpha_i^{*} - \alpha_i)(\alpha_j^{*} - \alpha_j)\, K(\mathbf{x}_i, \mathbf{x}_j) - \varepsilon \sum_{i=1}^{p} (\alpha_i^{*} + \alpha_i) + \sum_{i=1}^{p} y_i (\alpha_i^{*} - \alpha_i) \qquad (13)$$

subject to the constraints: 0 ≤ αi, αi* ≤ C, ∀i, and $\sum_{i=1}^{p}(\alpha_i^{*} - \alpha_i) = 0$. The bias parameter b in Eq. (12) can be computed as

$$b = \begin{cases} y_i - f(\mathbf{x}_i)\big|_{b=0} - \varepsilon & \text{for } \alpha_i \in (0, C) \\ y_i - f(\mathbf{x}_i)\big|_{b=0} + \varepsilon & \text{for } \alpha_i^{*} \in (0, C). \end{cases} \qquad (14)$$

2.2. Interpreting structure and coefficients of an SVR model

A significant feature of the SVR is that the regression model and its parameters can be interpreted geometrically.



In SVR, each training point has associated with it a pair of parameters (αi and αi*). The αi and αi* have an intuitive explanation as forces pushing and pulling the regression function, f(xi), towards the desired output, yi [34]. Owing to the specific character of the QP problem defined in Eq. (13), only some of the regression coefficients, (αi − αi*), assume non-zero values. The training input vectors, xi, with non-zero coefficients are termed "support vectors" (SVs). Alternatively, SVs are those training input–output data (xi, yi) for which |f(xi) − yi| ≥ ε. Since they are the only points determining the SVR-approximated function, the SVs are crucial data examples. In Fig. 1, the SVs are depicted as points lying on the surface and outside of the ε-tube. As the percentage of SVs decreases, a more general regression solution is obtained. Also, fewer computations are necessary to evaluate the output of a new and unknown input vector when the percentage of SVs becomes smaller. The data points lying inside the ε-tube are considered to be correctly approximated by the regression function. The training points with the corresponding αi and αi* equal to zero have no influence on the solution to the regression task. If these points are removed from the training set, the solution obtained would still be the same [35]. This characteristic, owing to which the final regression model can be defined as a combination of a relatively small number of input vectors, is known as the "sparseness" of the solution.

2.3. Tuning of SVR's algorithmic parameters

The prediction accuracy and generalization performance of an SVR-based model are controlled by two free parameters, namely C and ε. Hence, these parameters must be selected judiciously. The parameter C determines the extent to which prediction errors of magnitude beyond ±ε are tolerated. If the magnitude of C is too large (infinity), then the SVR minimizes only the empirical risk without regard to the model complexity. On the other hand, for a too small value of C, the SVR algorithm assigns an insufficient weightage to fitting the training data, thereby improving the chances of a better generalization performance by the model [36]. The tube width parameter ε inversely affects the number of support vectors used to construct the regression function. As ε decreases, the percentage of training points marked as SVs (hereafter denoted as %SV) increases, which in turn enhances the complexity of the SVR model. Associated with a complex model is the risk of an over-fitting of the training data and, consequently, a poor generalization performance by the model. On the other hand, a relatively better generalization performance is realized for large ε magnitudes, at the risk of obtaining a high training set error. It may be noted that in ANNs and traditional regression, ε is always zero and the data set is not mapped into a higher dimensional space. Thus, the SVR is a more general and flexible treatment of the regression problem [37]. A number of guidelines for the judicious selection of C and ε are provided by Cherkassky and Ma [38].
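As a concrete illustration of how C, ε and the RBF width interact, the sketch below trains an ε-SVR on synthetic data using scikit-learn's SVR class, which wraps the same LIBSVM engine employed later in this paper; the data set, parameter grid and gamma value are illustrative assumptions rather than the settings used by the authors.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Synthetic nonlinear input-output data (a stand-in for a bioprocess data set).
X = rng.uniform(0.0, 1.0, size=(150, 3))
y = np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 1] ** 2 - X[:, 2] + rng.normal(scale=0.05, size=150)

X_trn, y_trn = X[:120], y[:120]          # ~80% of the data for training
X_tst, y_tst = X[120:], y[120:]          # ~20% held out as a test set

best = None
for C in (1.0, 5.0, 20.0):
    for eps in (0.001, 0.01, 0.05):
        # gamma = 1/(2*sigma^2) plays the role of the RBF width parameter sigma.
        model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=5.0).fit(X_trn, y_trn)
        rmse_tst = mean_squared_error(y_tst, model.predict(X_tst)) ** 0.5
        pct_sv = 100.0 * model.support_.size / len(X_trn)
        if best is None or rmse_tst < best[0]:
            best = (rmse_tst, C, eps, pct_sv)

rmse_tst, C, eps, pct_sv = best
print(f"best model: C={C}, epsilon={eps}, test RMSE={rmse_tst:.3f}, %SV={pct_sv:.1f}")
```

Selecting the (C, ε) pair that minimizes the test-set RMSE mirrors the model-selection procedure used for the soft-sensors in Section 4.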

3. Radial basis function neural network (RBFN)

The architecture of an RBFN [14,39] comprises three layers of nodes, namely the input, hidden and output layers. The input layer nodes serve only as "fan-out" units to distribute the inputs to the J hidden layer nodes. Each hidden node represents a kernel function that implements a nonlinear transformation of an N-dimensional input vector. The commonly used kernel is the Gaussian RBF, whose response is typically limited to a small region of the input space where the function is centered. The Gaussian RBF is characterized by two parameters, namely the center (Cj) and the peak width (σj). While Cj represents an N-dimensional vector, σj is a scalar determining the portion of the input space where the jth (j = 1, 2, ..., J) RBF has a significant non-zero response. The centers are adjustable parameters and the nonlinear approximation and generalization characteristics of an RBFN depend critically on their magnitudes. Thus, the centers must be selected judiciously. On the other hand, the peak width parameter does not affect an RBFN's approximation and generalization performance significantly and thus it can be fixed heuristically. For a given input vector, xi, the output of the jth Gaussian hidden node can be calculated as

$$O_j = \Phi_j\big(\|\mathbf{x}_i - \mathbf{C}_j\|\big) = \exp\!\left(\frac{-\|\mathbf{x}_i - \mathbf{C}_j\|^{2}}{2\sigma_j^{2}}\right) \qquad (15)$$

where ‖xi − Cj‖ denotes the Euclidean distance between xi and Cj. The outputs of the Gaussian hidden nodes serve as inputs to the output nodes, and the output of each output node is computed using a linear function of its inputs as given below:

$$\hat{y}_m = \sum_{j=1}^{J} w_{jm}\, O_j + w_{m0}; \qquad m = 1, 2, \ldots, M \qquad (16)$$

where ŷm refers to the output of the mth output layer node, M denotes the number of output nodes, wm0 denotes the bias weight for the mth output node and wjm refers to the weight of the connection between the jth hidden node and the mth output layer node. Development of an RBFN based model involves selecting the centers, the peak widths, the number of hidden layer nodes (J) and the weights, wjm. The centers can be selected using a number of methods (see, e.g., Bishop [40]) such as random subset selection, K-means clustering [41], the orthogonal least-squares learning algorithm [42] and rival penalizing competitive learning [43]. The width parameter can either be chosen to be the same for all the hidden units or can be different for each unit. In this paper, the width parameter is chosen to be equal for all the hidden nodes. Once the centers and the widths of the RBFs are chosen, the weights, wjm, on the connections between the hidden and output nodes are adjusted using a standard least-squares procedure with the objective of minimizing a pre-specified error function such as the sum-squared-error.


Once trained, the magnitude of the response of each of the RBFs is a function of the distance between the network input (xi) and the RBF center, Cj. Finally, the output layer node combines these signals to produce the network output, ŷm. As can be noticed, both the SVR and the RBFN employ radial basis functions in their mathematical framework, and thus it is pertinent to point out the similarities and differences between the two nonlinear function approximation paradigms. The RBFN is equivalent to an SVR when all the width parameters, σj (see Eq. (15)), are set to the same value and the centers are set to be the support vectors. Thus, in this case, the number of hidden units (J) in the RBFN equals the number of support vectors. Similar to the SVR, it is a common practice in training RBFNs to center the basis functions on the training data. However, the two methods differ in their training approaches. Specifically, the training of an RBFN often involves the optimization of a non-constrained and non-convex problem comprising multiple local minima, while the training of an SVR model is a quadratic optimization problem possessing a unique global minimum. A possible advantage of the SVR is that the usage of the ε-insensitive loss function allows an automatic determination of the number of non-zero model coefficients (αi and αi*). In contrast, the number of hidden units as well as the centers of an RBFN must be chosen in advance [44]. More discussion on the relation between RBFNs and SVR can be found in ref. [45].
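The RBFN computations of Eqs. (15) and (16) can be made concrete with a few lines of numpy. In the sketch below, the centers are picked by random subset selection and a common width is used for all hidden nodes, in line with the description above; the data, the number of hidden nodes and the width value are illustrative assumptions.

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma):
    """Gaussian hidden-node outputs O_j of Eq. (15) for every input row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_rbfn(X, y, n_hidden=10, sigma=0.3, seed=0):
    """Pick centers by random subset selection; fit output weights by least squares (Eq. (16))."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_hidden, replace=False)]
    H = rbf_design_matrix(X, centers, sigma)
    H = np.hstack([H, np.ones((len(X), 1))])        # extra column for the bias weight w_m0
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return centers, sigma, w

def rbfn_predict(X, centers, sigma, w):
    H = rbf_design_matrix(X, centers, sigma)
    H = np.hstack([H, np.ones((len(X), 1))])
    return H @ w

# Illustrative usage on synthetic data.
rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.02, size=100)
centers, sigma, w = train_rbfn(X, y, n_hidden=15)
print("training RMSE:", np.sqrt(np.mean((rbfn_predict(X, centers, sigma, w) - y) ** 2)))
```

Only the linear output weights are solved for here; more elaborate schemes (K-means centers, per-node widths) follow the references cited above.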

4. Case studies

4.1. Soft-sensor for invertase process

The Saccharomyces carlsbergensis yeast shows a biphasic nature of growth whereby, in the event of glucose scarcity, it can also utilize ethanol (which is cheaper) [46,47]. The yeast is also known to possess a diauxic nature owing to which the extent of cellular growth and enzyme production depends upon the balance between the metabolic states of aerobic fermentation and respiratory growth. Accordingly, an estimation of the current metabolic state can be made from the changes in the concentrations of glucose and ethanol in the broth. The optimization of the biphasic growth of S. carlsbergensis in a fed-batch culture has been reported by Pyun et al. [48] and more recently by Sarkar and Modak [49]. The optimal feed profiles obtained in [49] exhibit an excellent match with those obtained by Pyun et al. [48]. To generate the process data for developing a soft-sensor for the invertase from S. carlsbergensis process, the phenomenological model proposed by Toda et al. [50] (also see Pyun et al. [48]) has been used. It may however be noted that this model (given in Appendix B) is used only for simulating the process and thereby generating the process data. In real practice, the data collected by running the process physically and under varying conditions should be used for developing the soft-sensors.


The fed-batch invertase process is described in terms of five operating variables, namely the glucose concentration (s, g/l), ethanol concentration (e, g/l), bioreactor volume (v, l), biomass concentration (xt, g/l) and invertase concentration (cinv, kU/g). This case study aims at developing an SVR-based soft-sensor for the prediction of cinv. For generating the process data, the set of five ordinary differential equations representing the process dynamics was integrated by varying the initial conditions in the following ranges: 0.2 < v0 < 0.4, 0.02 < s0 < 0.1 and 0.02 < xt0 < 0.04. A total of 15 batches were simulated in this manner over a fermentation duration of 13 h. The values of the four operating variables, i.e. s, v, xt and e, and that of the single product quality variable, namely cinv, computed at 1-h intervals formed the process data set. Real-world process data always contain some instrumental and measurement noise and, to mimic this scenario, Gaussian noise of strengths 3, 5 and 10% was introduced in each variable of the simulated process data. Each of the three noise-superimposed data sets thus formed is a three-way array of size 15 (number of batches) × 5 (process variables) × 12 (measurement intervals). Next, to examine the effect of noise on the prediction and generalization performance of the soft-sensor, three SVR-based optimal models were developed using the noise-superimposed data sets separately. An SVR implementation known as the "ε-SVR" in the LIBSVM software library [51] was employed for constructing the soft-sensor models. The LIBSVM library utilizes a fast and efficient implementation of the widely used sequential minimal optimization (SMO) method [52,53] for solving large quadratic programming problems and thereby estimating the parameters α, α* and b of the SVR's data-fitting function. In the SVR formulation, the RBF was used as the kernel function for avoiding the dot product calculations in the feature space, F. In this study, the performance of the SVR-based soft-sensors was rigorously compared with those developed using the standard MLP and RBF neural networks. Here, the MLP network contained a single hidden layer and it was trained using the EBP algorithm [54]. To construct an optimal MLP-based soft-sensor model, the effect of the network's structural parameter (i.e., the number of hidden nodes) as well as of two EBP-specific parameters, namely the learning rate (η) and the momentum coefficient (µEBP), was rigorously studied. Also, the effect of random weight initialization was examined to obtain an MLP model that corresponds to the global or the deepest local minimum on the model's nonlinear error surface. For developing the RBFN-based soft-sensors, the ANN toolbox in the Matlab package was used. The training and test data sets used for developing the MLP and RBFN based soft-sensors were the same as those used in the development of the SVR-based soft-sensors. The concentrations of glucose, biomass and ethanol, as well as the reactor volume, are the major indicators of the invertase activity and, therefore, it was necessary to also consider the history of these variables along with their current values for developing a soft-sensor model for cinv. Accordingly, the current and lagged values of s, e, xt and v were used as inputs to the soft-sensor predicting the current value of the invertase concentration (activity).
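The noise-superimposition step described above can be mimicked with a short numpy routine; the sketch below adds zero-mean Gaussian noise whose standard deviation equals the stated percentage of each measured value, which is one plausible reading of the 3, 5 and 10% noise strengths (the paper does not spell out the exact noise model), applied to a placeholder 15 × 5 × 12 data array.

```python
import numpy as np

def add_gaussian_noise(data, percent, seed=0):
    """Superimpose zero-mean Gaussian noise; std of each point = (percent/100) * |value|."""
    rng = np.random.default_rng(seed)
    return data + rng.normal(scale=np.abs(data) * percent / 100.0)

# data: 15 batches x 5 process variables x 12 measurement intervals (placeholder values).
data = np.random.default_rng(3).uniform(0.1, 1.0, size=(15, 5, 12))
noisy_sets = {p: add_gaussian_noise(data, p, seed=p) for p in (3, 5, 10)}
print({p: arr.shape for p, arr in noisy_sets.items()})
```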



Fig. 2. Invertase activity at 2, 7 and 13 h time duration as predicted by the SVR-based soft-sensor using data containing 5% noise.

In order to develop an SVR-based soft-sensor possessing good prediction and generalization performance, it is necessary to judiciously select the number of lagged values of the variables s, e, xt and v. Thus, multiple models with a varying number of lagged values of the stated variables were constructed. The generalization performance of these models was evaluated by using a test set comprising 20% of the process data; the remaining 80% of the data were used as the training set for building the SVR models. The model that yielded the smallest root mean square error (RMSE) magnitude (defined below) for the test set was chosen as the optimal model:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{p}\big(c_{inv}^{\,i}(t) - \hat{c}_{inv}^{\,i}(t)\big)^{2}}{p}} \qquad (17)$$

where i is the index of the input data pattern, t denotes the discrete time (Δt = 60 min) and c_inv^i(t) and ĉ_inv^i(t) are the target (desired) and the SVR-predicted invertase concentrations corresponding to the ith input pattern, respectively. All three optimal soft-sensor models obtained by following the above-described procedure have 10 inputs representing the current and the lagged values of the variables s, e, xt and v, and the form of the models is:

$$\hat{c}_{inv}(t) = f\big(s(t), s(t-1), e(t), e(t-2), x_t(t), x_t(t-2), v(t), v(t-1), v(t-2), v(t-3)\big) \qquad (18)$$

where x(t − k) denotes the value of a variable, x, lagged by k discrete time intervals. It can thus be seen that the SVR-based optimal soft-sensor models have used 1, 1, 1 and 3 lagged values of the model input variables s, e, xt and v, respectively. The numbers of support vectors (Nsv) used by the SVR algorithm for fitting the three invertase activity models were 82, 96 and 133, respectively, the corresponding magnitudes of the superimposed Gaussian noise in the data being 3, 5 and 10%. An illustrative comparison of the soft-sensor predicted and the corresponding target values of the invertase activity at batch times of 2, 7 and 13 h is depicted in Fig. 2. This soft-sensor was developed using the data containing 5% Gaussian noise.
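A minimal sketch of how the lagged-input patterns of Eq. (18) and the RMSE criterion of Eq. (17) can be assembled is given below; the array layout and helper names are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np

def make_lagged_patterns(batch, lags):
    """batch: dict of 1-D arrays over time; lags: dict variable -> list of lags (0 = current value).

    Returns an (n_patterns x n_inputs) input matrix X and the target vector y, following the
    input structure of Eq. (18): s(t), s(t-1), e(t), e(t-2), x_t(t), x_t(t-2),
    v(t), v(t-1), v(t-2), v(t-3) -> c_inv(t).
    """
    max_lag = max(max(l) for l in lags.values())
    T = len(batch["cinv"])
    X, y = [], []
    for t in range(max_lag, T):
        row = [batch[var][t - l] for var, var_lags in lags.items() for l in var_lags]
        X.append(row)
        y.append(batch["cinv"][t])
    return np.array(X), np.array(y)

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (17)."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Illustrative single simulated batch (13 hourly samples, placeholder values).
rng = np.random.default_rng(4)
batch = {v: rng.uniform(0.1, 1.0, 13) for v in ("s", "e", "xt", "v", "cinv")}
lags = {"s": [0, 1], "e": [0, 2], "xt": [0, 2], "v": [0, 1, 2, 3]}
X, y = make_lagged_patterns(batch, lags)
print(X.shape, y.shape)   # 10 input columns per pattern, as in Eq. (18)
```

In practice the patterns from all 15 batches would be pooled before splitting them 80/20 into training and test sets, as described above.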

The architectural details and/or the training algorithm specific parameter values of each of the three optimal soft-sensor models developed using the SVR, MLP and RBFN strategies are provided in Table 1. Also listed in this table are the values of R² (squared coefficient of correlation), the RMSEs with respect to the training and test sets (Etrn, Etst), and the average error (%) pertaining to the invertase activity predictions made by the SVR, MLP and RBFN based soft-sensors. As can be seen from the tabulated values, the R² magnitudes in respect of the predictions of the training set outputs made by the three SVR-based models are close to unity. This indicates an excellent match between the invertase activity values predicted by the SVR models and their target values. Also, the corresponding average error and RMSE magnitudes are sufficiently small. A close match between the respective R², RMSE and average error values pertaining to the training and test set outputs indicates that the SVR-based soft-sensors also possess an excellent generalization ability. A comparison of the R², RMSE and average error values corresponding to the cinv predictions made by the SVR, MLP and RBFN based soft-sensors reveals that the SVR-based models have consistently outperformed the MLP and RBFN based soft-sensors. An examination of the performance of the MLP and RBFN formalisms indicates that the RBFN-based soft-sensors trained on the data corrupted by 5 and 10% noise have exhibited a better generalization ability (lower magnitudes of Etst and average error, and higher R²) than the MLP based soft-sensors. On the other hand, the MLP-based soft-sensor has exhibited a better generalization performance in comparison to the RBFN-based soft-sensor when the data comprised 3% noise. Two additional sets of simulations were performed for examining the effect of variations in the tube width parameter (ε) and the Gaussian noise in the data, respectively, on the number of support vectors considered by the SVR-based soft-sensor models. Such an exercise helped in getting an insight into the structure of an SVR model. Accordingly, in the first set of simulations, the magnitude of ε was varied systematically (range: 0.0001–0.1) while fixing the values of the other SVR parameters as follows: C (regularization constant) = 5, σ (RBF width) = 0.091 and Et (termination criterion for the prediction error) = 10⁻⁵ (see Table 1, Model II). Fig. 3 shows the plot of the percentage of training vectors identified as support vectors (%SV) as a function of the tube width. It is seen in the figure that the %SV decreases as ε increases. This result can be explained by referring to Fig. 1, where it can be easily visualized that a progressive increase in the width of the ε-tube results in a corresponding increase in the number of data points enclosed by the tube. In other words, as ε increases, the percentage of training points left outside and on the surface of the ε-tube becomes progressively smaller. Since the SVR algorithm identifies such points as the support vectors, their percentage also decreases with increasing ε. As %SV decreases, the resultant model becomes progressively flatter, thus increasing the chances of an under-fitted model.



Table 1
Comparison of SVR, MLP and RBFN-based soft-sensors for invertase activity prediction

| Model | Noise strength (%) | Modeling formalism | Training algorithm parameters (a) | CPU time (s) | %SV (b) | R² (train) | Etrn | Avg. error, train (%) | R² (test) | Etst | Avg. error, test (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I | 3 | SVR | 16, 0.1, 0.002, 0.0004 | 0.562 | 60.7 | 0.997 | 0.020 | 1.381 | 0.965 | 0.117 | 1.340 |
| I | 3 | MLP | 10, 03, 01, 0.2, 0.05 | 38.8 | - | 0.979 | 0.085 | 2.662 | 0.953 | 0.142 | 4.858 |
| I | 3 | RBFN | 0.413 | 1.5 | - | 0.951 | 0.130 | 4.597 | 0.903 | 0.217 | 5.131 |
| II | 5 | SVR | 5, 0.091, 2.5E−3, 1.0E−5 | 0.48 | 71.1 | 0.995 | 0.011 | 1.622 | 0.946 | 0.146 | 1.433 |
| II | 5 | MLP | 10, 03, 01, 0.3, 0.05 | 33.9 | - | 0.941 | 0.085 | 2.345 | 0.846 | 0.148 | 3.574 |
| II | 5 | RBFN | 0.32 | 1.39 | - | 0.970 | 0.060 | 2.301 | 0.880 | 0.052 | 2.339 |
| III | 10 | SVR | 19, 0.6, 0.001, 1E−5 | 0.321 | 98.5 | 0.993 | 0.044 | 2.595 | 0.973 | 0.542 | 1.849 |
| III | 10 | MLP | 10, 05, 01, 0.3, 0.05 | 27.18 | - | 0.825 | 0.141 | 3.800 | 0.393 | 3.215 | 12.157 |
| III | 10 | RBFN | 1.351 | 1.71 | - | 0.946 | 0.076 | 3.013 | 0.932 | 0.132 | 4.111 |

(a) The tabulated values describe the following model-specific parameters: (i) SVR: C, σ, ε and Et; (ii) MLP: NI, J, M, η, µEBP; (iii) RBFN: width of the Gaussian RBF (σ).
(b) Computed as percentage of the training data size (133 patterns).

Such a model does not generalize well since it produces a nearly constant test set error, Etst. These conclusions can also be verified from Fig. 3, where it is observed that for small values of ε (i.e., 0.0001 and 0.001), a high percentage (≈100%) of the data points are marked as support vectors. Additionally, the respective SVR models are seen to produce relatively large Etst magnitudes, thus indicating that the models possess a poor generalization ability. In Fig. 3 it is also noticed that in the ε range of 0.001–0.1, the Etst magnitude goes through a minimum. Intuitively, the point of the Etst minimum refers to a model which is neither over-fitted (%SV large) nor under-fitted (%SV low), and which thus possesses an optimal generalization ability. Although the exact value of ε that minimized the Etst was not determined via extensive simulations, it is seen in the figure that for ε = 0.0025 the Etst magnitude was much smaller (=0.14) than the Etst magnitudes of 4.26 and 0.15 for ε = 0.001 and 0.01, respectively. At this point, the Etrn magnitude (=0.011) was also small, suggesting an excellent prediction accuracy of the SVR model. The above-described results clearly support the fact that there exists an optimum tube width and, correspondingly, an optimum number of SVs for obtaining an SVR model possessing good prediction accuracy and generalization performance.
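The qualitative trend of Fig. 3 can be reproduced with the sketch below, which sweeps the tube width ε at fixed C and RBF width and records the percentage of training points returned as support vectors together with the test-set RMSE; the synthetic data and parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = rng.uniform(size=(160, 4))
y = np.exp(-X[:, 0]) * X[:, 1] + 0.3 * X[:, 2] - X[:, 3] ** 2 + rng.normal(scale=0.05, size=160)
X_trn, y_trn, X_tst, y_tst = X[:128], y[:128], X[128:], y[128:]

for eps in (0.0001, 0.001, 0.0025, 0.01, 0.1):
    model = SVR(kernel="rbf", C=5.0, gamma=10.0, epsilon=eps).fit(X_trn, y_trn)
    pct_sv = 100.0 * model.support_.size / len(X_trn)      # %SV
    e_tst = np.sqrt(np.mean((model.predict(X_tst) - y_tst) ** 2))
    print(f"epsilon={eps:<7}  %SV={pct_sv:5.1f}  E_tst={e_tst:.3f}")
```

As ε grows, fewer points fall outside or on the tube, so %SV drops, while the test error typically passes through a minimum at an intermediate ε.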

In the second set of simulations, SVR models were developed for three ε values (0.01, 0.001 and 0.0025) using the training data comprising a varying percentage (3, 5 and 10%) of superimposed noise. The other SVR parameter values were the same as in the case of Model II (see Table 1). The results of these simulations (plotted in Fig. 4) indicate that, for a fixed ε, the %SV increases with the increasing strength of the noise in the training data. This result has a logical explanation in that, as the magnitude of the noise increases, the scatter in the data points also increases. Consequently, a progressively higher percentage of the data points remain outside and on the surface of the ε-tube. It is also observed that, for a fixed magnitude of the superimposed noise, the %SV increases as the tube width decreases. This behavior is similar to that observed in Fig. 3 and thus can be interpreted analogously. A comparison of the CPU times consumed by the Hewlett-Packard workstation (3 GHz dual CPU) in training the SVR, MLP and RBFN based soft-sensor models suggests that the SVR algorithm is numerically the most efficient.

Fig. 3. Percentage of support vectors as a function of the tube-width for the SVR-based Invertase models trained on the data comprising 5% noise. The bracketed quantities represent Etst values.

Fig. 4. Percentage of support vectors as a function of the strength of noise in the data for the SVR-based invertase soft-sensors. The bracketed quantities refer to Etst values.



It may however be noted that the higher numerical efficiency of the SVR algorithm vis-à-vis the other two methods cannot be generalized to all types of function approximation tasks, since it depends upon a number of factors such as the training set size, the desired prediction accuracy, etc.

4.2. Soft-sensors for streptokinase process

Genetically engineered microorganisms have become important vehicles for the production of valuable biomolecules. Owing to the "gene dosage effect," each cell can possess multiple copies of a plasmid up to a certain threshold level, thereby leading to an increased production of the recombinant product [55,56]. In recombinant technology, the host microorganism is forced to overproduce the desired metabolite by infecting it with multiple copies of the plasmids. However, cells always try to get rid of this burden [57] and, consequently, the fermentation may run away with plasmid-free cells [58,59]. Thus, it becomes necessary to strictly monitor and control these cells, which can be achieved by developing soft-sensors for the estimation of the active cell mass and the concentration of the recombinant protein. Accordingly, this study aims at developing SVR-based soft-sensors for predicting the values of the active biomass and recombinant protein concentrations in the streptokinase process. Similar to case study I, the streptokinase fed-batch process data were generated using the phenomenological process model [60,61] described in Appendix C. Specifically, the data for a number of batches were generated by integrating the set of six differential equations describing the dynamics of as many process variables, namely the reactor volume (v, l) and the concentrations of the total biomass (xt, g/l), substrate (s, g/l), lactic acid (la, g/l), active biomass (xa, g/l) and streptokinase (st, g/l). A total of 15 batches of 12 h duration were simulated by varying the process initial conditions in the following ranges: 60 ≤ s0 ≤ 80 and 0.5 ≤ xa0 ≤ 0.9. The values of the stated six variables computed at 1-h intervals formed the process data set. Next, for simulating the real-life process scenario, three data sets were created by superimposing 3, 5 and 10% Gaussian noise in each process variable of the data set. Each of these sets formed a three-way array of size 15 (number of batches) × 6 (process variables) × 11 (time intervals). Since changes in the reactor volume and the concentrations of the total biomass and substrate (glucose) are the major indicators of a change in the streptokinase and active biomass concentrations, the current as well as the lagged values of the variables v, xt and s were considered as inputs to the two soft-sensor models predicting the current concentrations of the streptokinase and active biomass. Although not considered here, a soft-sensor for predicting the lactic acid concentration can also be developed in a manner similar to the models for the streptokinase and active biomass. In this case study too, the performance of the SVR-based soft-sensors was compared with those developed using the MLP and RBF neural networks. Similarly, the effect of the tube width and the strength of the Gaussian noise in the training data on the percentage of support vectors was rigorously studied.

The number of lagged values of the variables v, s and xt, as well as the ε-SVR, MLP and RBFN specific parameters, were chosen via heuristic optimization such that the RMSE with respect to the test set was minimized. The optimal soft-sensor models for the streptokinase and active biomass, which minimized the Etst magnitude, have eight inputs defining the current and lagged values of the three operating variables, namely v, s and xt:

$$\hat{s}_t(t) = f_1\big(v(t), v(t-1), v(t-2), s(t), s(t-1), x_t(t), x_t(t-1)\big) \qquad (19)$$

$$\hat{x}_a(t) = f_2\big(v(t), v(t-1), v(t-2), s(t), s(t-1), x_t(t), x_t(t-1)\big) \qquad (20)$$

where ŝt(t) and x̂a(t), respectively, refer to the concentrations of streptokinase and active biomass at discrete time t, and x(t − k) denotes the value of a variable, x, lagged by k time intervals. The numbers of lagged values of the process variables v, s and xt in Eqs. (19) and (20) were determined by varying the number, k, systematically and choosing the optimal number that minimized the test set RMSE. During the development of the two soft-sensors, training sets of 132 patterns each were used to estimate the SVR model parameters α, α* and b, and the corresponding test sets comprised 33 patterns. The training and test sets used for developing the MLP and RBFN based soft-sensors were the same as those used in the development of the two SVR-based soft-sensors. A comparison of the prediction accuracy and generalization performance of the optimal SVR, MLP and RBFN based soft-sensors for the streptokinase and active biomass is given in Tables 2 and 3, respectively. These tables also list the parameter values of the SVR, MLP and RBFN training algorithms, the CPU time consumed in training the three types of models and the %SV (for the SVR-based soft-sensors only). It can be noticed from the magnitudes of the CPU time consumed in training the SVR, MLP and RBFN based models that the SVR method has consumed the least time (≤1.02 s). Representative plots of the SVR model predicted values of the streptokinase and active biomass concentrations at batch times of 2, 5, 7 and 12 h are depicted in Figs. 5 and 6, respectively. From the R² values in respect of the predictions of the streptokinase concentration in the training and test sets (see Table 2), it is observed that the SVR-based soft-sensors have fared consistently better (R² ≥ 0.993) than the MLP and RBFN based soft-sensors. Also, there exists a close match between the Etrn and Etst values, indicating an excellent generalization performance by the SVR-based soft-sensors. In Table 2, it is also noticed that the RMSE and average error values pertaining to the predictions of the training set outputs for all the MLP-based streptokinase soft-sensors are lower than the corresponding values for the SVR and RBFN based soft-sensors.



Table 2
Comparison of SVR, MLP and RBFN-based soft-sensors for streptokinase concentration

| Model | Noise strength (%) | Modeling formalism | Training algorithm parameters (a) | CPU time (s) | %SV (b) | R² (train) | Etrn | Avg. error, train (%) | R² (test) | Etst | Avg. error, test (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I | 3 | SVR | 3, 0.6, 0.006, 0.0005 | 0.362 | 58.3 | 0.993 | 0.056 | 3.404 | 0.994 | 0.056 | 3.274 |
| I | 3 | MLP | 10, 05, 01, 0.3, 0.05 | 38.8 | - | 0.984 | 0.008 | 0.471 | 0.966 | 0.126 | 7.572 |
| I | 3 | RBFN | 1.351 | 1.5 | - | 0.981 | 0.072 | 4.195 | 0.982 | 0.079 | 6.293 |
| II | 5 | SVR | 5, 0.4, 0.0065, 1.0E−5 | 0.48 | 66.7 | 0.998 | 0.029 | 1.809 | 0.998 | 0.030 | 1.689 |
| II | 5 | MLP | 7, 03, 01, 0.2, 0.05 | 33.9 | - | 0.983 | 0.008 | 0.513 | 0.960 | 0.044 | 3.887 |
| II | 5 | RBFN | 0.32 | 1.39 | - | 0.980 | 0.049 | 2.730 | 0.979 | 0.043 | 2.213 |
| III | 10 | SVR | 5, 0.5, 0.005, 5.1E−4 | 1.021 | 82.6 | 0.999 | 0.019 | 1.315 | 0.999 | 0.021 | 1.676 |
| III | 10 | MLP | 7, 03, 01, 0.5, 0.05 | 27.18 | - | 0.998 | 0.008 | 0.461 | 0.982 | 0.028 | 2.872 |
| III | 10 | RBFN | 0.413 | 1.71 | - | 0.992 | 0.020 | 1.357 | 0.988 | 0.023 | 1.948 |

(a) The tabulated values describe the following model-specific parameters: (i) SVR: C, σ, ε and Et; (ii) MLP: NI, J, M, η, µEBP; (iii) RBFN: width of the Gaussian RBF (σ).
(b) Computed as percentage of the training data size (132 patterns).

Table 3
Comparison of SVR, MLP and RBFN-based soft-sensors for active biomass concentration

| Model | Noise strength (%) | Modeling formalism | Training algorithm parameters (a) | CPU time (s) | %SV (b) | R² (train) | Etrn | Avg. error, train (%) | R² (test) | Etst | Avg. error, test (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I | 3 | SVR | 21, 0.2, .0005, 0.0005 | 0.562 | 71.9 | 0.992 | 0.150 | 2.519 | 0.992 | 0.147 | 3.065 |
| I | 3 | MLP | 7, 03, 01, 0.3, 0.05 | 38.8 | - | 0.994 | 0.016 | 0.140 | 0.985 | 0.272 | 7.957 |
| I | 3 | RBFN | 5.22 | 1.5 | - | 0.958 | 0.152 | 3.098 | 0.967 | 0.182 | 4.323 |
| II | 5 | SVR | 13, 0.3, 0.001, 5.0E−4 | 0.48 | 95.4 | 0.998 | 0.075 | 1.265 | 0.997 | 0.075 | 1.586 |
| II | 5 | MLP | 7, 02, 01, 0.2, 0.05 | 33.9 | - | 0.999 | 0.018 | 0.161 | 0.992 | 0.184 | 3.526 |
| II | 5 | RBFN | 4.861 | 1.39 | - | 0.998 | 0.096 | 2.791 | 0.981 | 0.092 | 2.808 |
| III | 10 | SVR | 22, 0.2, 0.003, 5.0E−4 | 1.021 | 100 | 0.999 | 0.043 | 0.884 | 0.999 | 0.059 | 1.358 |
| III | 10 | MLP | 7, 03, 01, 0.2, 0.05 | 27.18 | - | 0.999 | 0.008 | 0.163 | 0.992 | 0.233 | 7.801 |
| III | 10 | RBFN | 4.811 | 1.71 | - | 0.959 | 0.047 | 1.039 | 0.918 | 0.172 | 2.541 |

(a) The tabulated values describe the following model-specific parameters: (i) SVR: C, σ, ε and Et; (ii) MLP: NI, J, M, η, µEBP; (iii) RBFN: width of the Gaussian RBF (σ).
(b) Computed as percentage of the training data size (132 patterns).

Fig. 5. Streptokinase concentration at 2, 5, 7 and 12 h batch times as predicted by the SVR-based soft-sensor using data containing 5% noise.

Fig. 6. Active biomass concentration at 2, 5, 7 and 12 h batch times as predicted by the SVR-based soft-sensor using data containing 5% noise.



Fig. 7. Percentage of support vectors as a function of the tube width for the SVR-based streptokinase models trained on the data comprising 5% noise. The quantities in parentheses represent Etst values.

However, all these MLP models have exhibited an inferior generalization performance relative to the SVR and RBFN based models, as indicated by the lower R² and higher Etst and average error magnitudes in respect of the predictions of the test set outputs. Fig. 7 shows the effect of the tube width parameter ε on the number of SVs (expressed as the percentage of the training data, %SV) identified by the SVR based streptokinase soft-sensors. These results were obtained by developing optimal SVR models using the process data corrupted with 10% Gaussian noise. Here, the values of all the SVR parameters barring ε were fixed as given in Table 2 (Model III). Similar to Fig. 3, the results depicted in Fig. 7 support the geometrical interpretation of an SVR-based model in that the percentage of support vectors decreases as the tube width increases. Additionally, it is seen from the Etst values specified in the figure that Etst goes through a minimum. This feature is also identical to that in Fig. 3, which suggests that there exists an optimal value of the tube width and, correspondingly, of the %SV at which the Etst is minimum. The results of the SVR simulations examining the effect of noise in the training data on the percentage of SVs for the streptokinase soft-sensor are plotted in Fig. 8. These simulations were conducted using data comprising 3, 5 and 10% Gaussian noise. The effect of noise variations was studied separately for three values of ε (0.001, 0.0065 and 0.01).

As in Fig. 4, the results of the stated simulations indicate that: (i) for a constant ε value, the percentage of support vectors increases with the strength of the superimposed noise, and (ii) for a fixed strength of the noise, the percentage of support vectors increases as ε decreases. The results comparing the prediction and generalization performance of the SVR, MLP and RBFN based soft-sensors for the active biomass are tabulated in Table 3. In this case too, the SVR-based soft-sensors exhibited a better generalization performance than the MLP and RBFN-based soft-sensors. For all three noise magnitudes, the MLP based models performed better than the SVR models at "recalling" (i.e., predicting) the training set outputs. However, the much desired generalization performance of these MLP models was inferior to that of the SVR based models. Additional simulations were conducted for examining the effect of ε and of the strength of the noise in the data, respectively, on the percentage of support vectors identified by the SVR-based active biomass soft-sensors. The trends obtained thereby were found to be identical to those portrayed in Figs. 4 and 8.

5. Conclusions

In this paper, a novel machine learning theory based formalism known as "support vector regression" has been introduced for soft-sensor applications by considering two simulated fed-batch processes, namely invertase and streptokinase. Additionally, the performance of the SVR-based soft-sensors was compared with those developed using the MLP and RBF neural networks. The results of this study clearly indicate that the SVR is an attractive alternative to ANNs for soft-sensor applications in biotechnology/engineering. The significant benefit of the SVR is that the method solves a quadratic programming problem possessing a unique global minimum. This feature greatly reduces the numerical effort involved in locating the globally optimal solution. In contrast, ANNs solve a nonlinear optimization problem that may possess multiple local and global minima, thereby requiring multiple network training runs to obtain a globally optimal solution. The other attractive feature of the SVR is that the structure and parameters of the resultant model are amenable to interpretation in terms of the training data. It is, however, pertinent to add that the SVR may not outperform ANNs in all types of function approximation applications, and it is therefore advisable to also explore ANNs to obtain models with the desired prediction accuracy and generalization performance.
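For completeness, the quadratic program referred to above is the standard ε-SVR dual (see, e.g., [27,28]); written in the usual notation for N training pairs (xi, yi) and kernel K, rather than in any notation specific to this paper, it reads:

maximize over α, α*:  −(1/2) Σi Σj (αi − αi*)(αj − αj*) K(xi, xj) − ε Σi (αi + αi*) + Σi yi (αi − αi*)

subject to Σi (αi − αi*) = 0 and 0 ≤ αi, αi* ≤ C,

with the resulting regression function f(x) = Σi (αi − αi*) K(xi, x) + b. Because the kernel matrix is positive semi-definite, the objective is concave and the constraints are linear, so any local optimum of this problem is also the global one; the support vectors are the training points with non-zero (αi − αi*).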

Fig. 8. Percentage of support vectors as a function of the strength of noise in the data for the SVR-based streptokinase soft-sensors. The quantities in parentheses refer to the Etst values.

Acknowledgement

YB thanks the Council of Scientific and Industrial Research (CSIR), Government of India, New Delhi, for a Senior Research Fellowship.


Appendix A

List of possible kernel functions [62]:

1. Simple dot product: K(xi, xj) = (xi · xj)
2. Simple polynomial kernel (d is the degree of the polynomial): K(xi, xj) = ((xi · xj) + 1)^d
3. Vovk's real polynomial: K(xi, xj) = [1 − (xi · xj)^d]/[1 − (xi · xj)]
4. Radial basis function (λ is user defined): K(xi, xj) = exp(−λ |xi − xj|^2)
5. Two-layer neural network (b and c are user defined): K(xi, xj) = tanh(b (xi · xj) − c)
6. Linear splines: K(xi, xj) = Σk=1..n (xi^k · xj^k)
7. Semi-local kernel (d and σ are user defined): K(xi, xj) = ((xi · xj) + 1)^d exp(−|xi − xj|^2/σ^2)

Appendix B

The phenomenological model for biphasic growth of S. carlsbergensis and invertase production is given as [48]:

d(xt)/dt = (µG + µA) xt   (B.1)
d(s)/dt = −ψ s xt/v + sF F/v   (B.2)
d(e)/dt = (πA − πC) xt   (B.3)
d(v)/dt = F   (B.4)
d(cinv)/dt = (ηS µG + ηA µA) xt   (B.5)

where

xt, s and e: concentrations of cells, glucose and ethanol, respectively;
µG, µA, ψ: specific growth rates on glucose and on ethanol, and specific rate of glucose consumption, respectively;
πA and πC: specific rates of ethanol production and ethanol consumption, respectively;
ηS and ηA: ratios of the specific invertase synthesis rate to the specific growth rate on glucose and on ethanol, respectively;
cinv: invertase activity;
q (≡ F): volumetric feed rate of glucose;
sF: glucose feed concentration;
v: fermenter volume.

The details of the rate expressions and the kinetic model can be found in [48]. The following values of the model parameters and operating conditions are used in the simulations: ks = 0.021 g/l, µGmax = 0.39 h−1, µAmax = 0.11 h−1, kp = 0.014 g/l, Yx/sF = 0.52 g/g, Yx/sR = 0.15 g/g, Yx/pR = 0.67 g/g, Yp/s = 0.33 g/g, e0 = 0 g, (cinv)0 = 0 kU/g, v0 = 0.6 l, vmax = 1.5 l, sF = 0.5/[vmax − v0] g/l.

The following nonlinear feed rate profile that maximizes the invertase production [49] is used for the data generation:

q = 0.2 l/h for 0 ≤ t ≤ 0.58; q = 0 l/h for 0.58 ≤ t ≤ 2.28; q = qc l/h for 2.28 ≤ t ≤ 12.4; q = 0 l/h for 12.4 ≤ t ≤ 13

where

qc = [ψ xt v + xt v (−1.4892) s^1.2013 (cinv)^−0.1046]/(sF − s)   (B.6)

Appendix C

The phenomenological model of the streptokinase process using Streptococcus sp. in fed-batch fermentation [61] is given as:

dxt/dt = µ xa − (q/v) xt   (C.1)
dxa/dt = (µ − kd) xa − (q/v) xa   (C.2)
ds/dt = −µ xa/YX + (q/v)(sin − s)   (C.3)
dla/dt = YM µ xa − (q/v) la   (C.4)
dst/dt = YP (µ − kd) xa − kp st − (q/v) st   (C.5)
dv/dt = q(t)   (C.6)
µ = µm [s/(KS + s)] [KI^b/(KI^b + la^b)]   (C.7)

where

xt, xa: concentrations of total and active biomass, respectively;
s, st, la: concentrations of substrate, streptokinase and lactic acid, respectively;
v: volume of the reactor.

The following nonlinear feed rate profile that maximizes the streptokinase production is used for the data generation [61]:

q(t) = a0 + a1 (t/T) + a2 (t/T)^2 + a3 (t/T)^3   (C.8)

The parameter values used to simulate the model are: a0 = 0.9959 l/h, a1 = −0.3037 l/h, a2 = −1.3418 l/h, a3 = 0.6499 l/h, b = 2.39, kd = 0.020 h−1, kp = 0.0005 h−1, KI = 12.66 g/l, KS = 13.14 g/l, sin = 70.0 g/l, T = 12.0 h, YM = 4.80 g/g, YP = 0.44 g/g, YX = 0.15 g/g, µm = 0.74 h−1. The initial conditions used in the simulations are: xt(0) = 0.7 g/l, xa(0) = 0.7 g/l, s(0) = 70 g/l, la(0) = 0 g/l, st(0) = 0 g/l, v(0) = 5 l.
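As an illustration of the data-generation step, the sketch below integrates the model (C.1)–(C.7) together with the cubic feed profile (C.8) using the parameter values and initial conditions listed above. SciPy's solve_ivp is used here purely as a convenient integrator; the paper does not specify the numerical scheme, so this is a plausible reconstruction rather than the authors' code.

import numpy as np
from scipy.integrate import solve_ivp

# Parameter values from Appendix C (kd, kp and mu_m taken as h^-1)
a0, a1, a2, a3 = 0.9959, -0.3037, -1.3418, 0.6499   # feed profile coefficients, l/h
b, kd, kp = 2.39, 0.020, 0.0005
KI, KS, s_in, T = 12.66, 13.14, 70.0, 12.0
YM, YP, YX, mu_m = 4.80, 0.44, 0.15, 0.74

def q_feed(t):
    """Cubic feed-rate profile (C.8)."""
    tau = t / T
    return a0 + a1 * tau + a2 * tau ** 2 + a3 * tau ** 3

def rhs(t, state):
    xt, xa, s, la, st, v = state
    q = q_feed(t)
    mu = mu_m * (s / (KS + s)) * (KI ** b / (KI ** b + la ** b))   # (C.7)
    return [
        mu * xa - (q / v) * xt,                        # (C.1) total biomass
        (mu - kd) * xa - (q / v) * xa,                 # (C.2) active biomass
        -mu * xa / YX + (q / v) * (s_in - s),          # (C.3) substrate
        YM * mu * xa - (q / v) * la,                   # (C.4) lactic acid
        YP * (mu - kd) * xa - kp * st - (q / v) * st,  # (C.5) streptokinase
        q,                                             # (C.6) volume
    ]

y0 = [0.7, 0.7, 70.0, 0.0, 0.0, 5.0]   # xt, xa, s, la, st, v at t = 0
sol = solve_ivp(rhs, (0.0, T), y0, method="LSODA",
                t_eval=np.linspace(0.0, T, 121), rtol=1e-8)
print("final streptokinase concentration (g/l):", sol.y[4, -1])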

References [1] K. Shimizu, A tutorial review on bioprocess systems engineering, Comput. Chem. Eng. 20 (6/7) (1996) 915–941. [2] D. Dochain, M. Perrier, Dynamical modeling, analysis, monitoring and control design for nonlinear bioprocesses, Adv. Biochem. Eng. 56 (1997) 149–197. [3] G.A. Montague, Monitoring and control of fermenters, Institution of Chemical Engineers, Rugby, Warwickshire, UK, 1998. [4] S.C. James, H. Budman, R.L. Legge, On-line estimation in bioreactors: a review, Can. Rev. Chem. Eng. 16 (2000) 311–340. [5] J.D.A. Adilson, M. Rubens, Soft-sensors development for on-line bioreactor state estimation, Comp. Chem. Eng. 24 (2000) 1099–1103. [6] G. Bastin, D. Dichain, On-line Estimation and Adaptive control of Bioreactors, Elsevier, Amsterdam, 1990. [7] M. Meher, G. Roux, B. Dahlou, A method for estimating the state variables and parameters of fermentation systems, J. Chem. Technol. Biotechnol. 63 (2) (1995) 153–159. [8] Y. Zhao, Studies on modeling and control of continuous biochemical processes, Ph.D. Thesis, Norwegian University of Science and Technology, Norway, 1996. [9] G.A. Montague, A.J. Morris, A.R. Wright, M. Aynsley, A.C. Ward, Online estimation and adaptive control of penicillin fermentation, in: IEEE Proceedings 5, 1986, pp. 240–246. [10] M.N. Pons, A. Rajab, J.M. Flaus, J.M. Engasser, A. Cheruy, Comparison of estimation methods for biotechnological processes, Chem. Eng. Sci. 8 (1988) 1909–1914. [11] S. Albert, R.D. Kinley, Multivariate statistical monitoring of batch processes: an industrial case study of fermentation supervision, Trends Biotechnol. 2 (2001) 53–62. [12] S. James, R. Legge, H. Buddman, Comparative study of black box and hybrid estimation methods in fed batch fermentation, J. Process Control 12 (2002) 113–121. [13] J.A. Freeman, D.M. Skapura, Neural Networks: Algorithms, Applications and Programming Techniques, Addison-Wesley, Reading, MA, 1991. [14] S.S. Tambe, B.D. Kulkarni, P.B. Deshpande, Elements of Artificial Neural Networks with Selected Applications in Chemical Engineering, and Chemical & Biological Sciences, Simulations & Advanced Controls, Louisville, KY, 1996. [15] S. Nandi, S. Ghosh, S.S. Tambe, B.D. Kulkarni, Artificial neural network assisted stochastic process optimization strategies, AIChE J. 47 (2001) 126–141. [16] S. Linko, Y.-H. Zhu, P. Linko, Applying neural networks as software sensors for enzyme engineering, Tibtech 17 (1999) 155–162. [17] T. Eerik¨ainen, P. Linko, S. Linko, T. Simes, Y.H. Zhu, Fuzzy logic and neural network applications in food science and technology, Trends Food Sci. Technol. 4 (1993) 237–242. [18] G. Acura, E. Latrille, C. Beal, G. Corrieu, Static and dynamic neural network models for estimating biomass concentration during thermophilic lactic acid bacteria batch cultures, J. Ferment. Bioeng. 85 (1998) 615–622. [19] S. Feyo de Azevedo, B. Dahm, F.R. Oliveira, Hybrid modeling of biochemical processes: A comparison with the conventional approach, Comput. Chem. Eng. 21 (1997) 751–756. [20] R. Oliveira, Combining first principles modeling and artificial neural networks: a general framework, Comput. Chem. Eng. 28 (2004) 755–766.

[21] Y-H. Zhu, T. Rajalahti, S. Linko, Application of neural network to lysine production, Biochem. Eng. J. 62 (1996) 207–214. [22] M.N. Karim, S.L. Rivera, Comparison of feed-forward and recurrent neural networks for bioprocess state estimation, Comput. Chem. Eng. 16 (1992) 369–377. [23] D. Hodge, M.N. Karim, Nonlinear MPC for optimization of recombinant zymomonas mobilis fed-batch fermentation, in: Proceedings of American Control Conference 4, 2002, pp. 2879–2889. [24] L. Jose, G. Sanchez, W.C. Robinson, H. Budman, Developmental studies of an adaptive on-line softsensor for biological wastewater treatments, Can. J. Chem. Eng. 77 (1999) 707–717. [25] G.A. Montague, A.J. Morris, M.T. Tham, Enhancing bioprocess operability with generic software sensors, J. Biotechnol. 25 (1–2) (1992) 183–201. [26] V. Vapnik, S. Golowich, A. Smola, Support vector method for function approximation, regression estimation and signal processing, Adv. Neural Inform. Proces. Syst. 9 (1996) 281–287. [27] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, New York, 1995. [28] V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998. [29] C. Burges, A tutorial on support vector machines for pattern recognition, Data Min Knowl Disc 2 (1998) 1–47. [30] A. Smola, B. Sch¨olkopf, K.R. M¨uller, The connection between regularization operators and support vector kernels, Neural Networks 11 (1998) 637–649. [31] J.C. Sch¨olkopf, J. Platt, A.J. Shawe-Taylor, A. Smola, R.C. Williamson, Estimating support of a high-dimensional distribution, Neural Comput. 13 (2001) 1443–1471. [32] A. Krogh, J.A. Hertz, A simple weight decay can improve generalization, in: J.E. Moody, S.J. Hanson, R.P. Lipmann (Eds.), Advances in Neural Information Processing Systems, Vol. 4, Morgan-Kauffmann, San Mateo, CA, 1995, pp. 950– 957. [33] J. Hertz, A. Krogh, R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Redwood City, CA, 1991. [34] K.R. Muller, A. Smola, G. Ratsch, B. Sch¨olkopf, J. Kohlmorgen, V. Vapnik, Predicting time series with support vector machines, in: W. Gerstner, A. Germond, M. Hasler, J.D. Nicoud (Eds.), Artificial Neural Networks ICANN’97, 1327, Springer, Berlin, 1997, pp. 999–1004, Lecture Notes in Computer Science. [35] U. Thissen, R. Van Brakel, A.P. De Weijer, W.J. Melssen, L.M.C. Buydens, Using support vector machines for time series prediction, Chemom. Intell. Lab. Syst. 1–2 (2003) 35–49. [36] H. Drucker, C.J.C. Burges, L. Kaufman, A. Smola, V. Vapnik, Support vector regression machines, in: M.C. Mozer, M.I. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, Vol. 9, MIT Press, Cambridge, MA, 1997, pp. 155– 161. [37] B.J. Chen, M.W. Chang, C.J. Lin, Load forecasting using support vector machines: A study of EUNITE competition 2001. Report for EUNITE competition for smart adaptive systems (2001). Available at www.eunite.org. [38] V. Cherkassky, Y. Ma, Practical selection of SVM parameters and noise estimation for SVM regression, Neural Networks 17 (1) (2004) 113–126. [39] S. Haykins, Neural networks: a comprehensive foundation, second ed., Prentice Hall, New Jersey, 1999. [40] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995. [41] J. Moody, C.J. Darken, Fast learning in networks of locally-tuned processing units, Neural Comput. 1 (1989) 281–294. [42] S. Chen, C.F.N. Cowan, P.M. Grant, Orthogonal least-squares learning algorithm for RBFNs, IEEE Trans. Neural Networks 2 (2) (1991) 302–309. [43] L. Xu, A. Krzyzak, E. 
Oja, Rival penalized competitive learning for clustering analysis, RBF net, and curve detection, IEEE Trans. Neural Networks 4 (1993) 636–648.

K. Desai et al. / Biochemical Engineering Journal 27 (2006) 225–239 [44] M.W. Chang, C.J. Lin, R.C. Weng, Analysis of switching dynamics with competing support vector machines, IEEE Trans. Neural Networks 15 (2004) 720–727. [45] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vectors machines, Adv. Comput. Math. 13 (2000) 1–50. [46] G.V. Dedem, M.A. Moo-Young, Model for diauxic growth, Biotechnol. Bioeng. 17 (1975) 1301–1312. [47] H.E. Beejherk, R.J.A. Hall, Mechanistic model of aerobic growth of Saccharomyces cerevisiae, Biotechnol. Bioeng. 19 (1977) 267– 296. [48] Y.R. Pyun, J.M. Modak, Y.K. Chang, H.C. Lim, Optimization of biphasic growth of Saccharomyces carsbergensis in fed-batch culture, Biotechnol. Bioeng. 33 (1989) 1–10. [49] D. Sarkar, J.M. Modak, Optimization of fed-batch bioreactor using genetic algorithm, Chem. Eng. Sci. 58 (2003) 2283–2296. [50] K. Toda, I. Yabe, T. Yamagata, Kinetics of biphasic growth of yeast in continuous and fed batch culture, Biotechnol. Bioeng. 22 (1980) 1805. [51] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu. tw/∼cjlin/libsvm. [52] T. Joachims, Making large-scale SVM learning practical, in: B. Sch¨olkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1998. [53] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Sch¨olkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1998.


[54] D. Rumelhart, G. Hinton, R. Williams, Learning representations by backpropagating errors, Nature 323 (1986) 533–536. [55] W. Ryan, S.J. Parulekar, B.C. Stark, Expression of ␤-lactamase by recombinant E. coli strains containing plasmid of different sizeseffect of pH, phosphate and dissolved oxygen, Biotechnol. Bioeng. 34 (1989) 309–319. [56] G. Georgiou, J.J. Chalmers, M. Shuler, D.B. Wilson, Continuous immobilized recombinant protein production from E. coli capable of selective protein excretion: A feasibility study, Biotechnol. Prog. 1 (1985) 75–79. [57] W. Ryan, S.J. Parulekar, Recombinant protein synthesis and plasmid instability in continuous cultures of Escherichia coli JM103 harboring high copy number plasmid, Biotechnol. Bioeng. 37 (1991) 415–429. [58] S.B. Lee, J.E. Baily, Analysis of growth rate effects on productivity of recombinant E. coli populations using molecular mechanisms model, Biotechnol. Bioeng. 26 (1984) 66–73. [59] J.H. Seo, J.E. Baily, Effects of recombinant plasmid contains on growth properties and cloned gene product formation in E. coli, Biotechnol. Bioeng. 27 (1985) 1668–1674. [60] P.R. Patnaik, A heuristic approach to fed-batch optimization streptokinase fermentation, Bioprocess Eng. 13 (1995) 109–112. [61] P.R. Patnaik, Improvement of the microbial production of streptokinase by controlled filtering of process noise, Process Biochem. 35 (1999) 309–315. [62] Y.B. Dibike, S. Velickov, D. Solomatine, Support vector machines: Review and applications in civil engineering, in: Proceedings of the 2nd Joint Workshop on Application of AI in Civil Engineering, Cottbus, Germany, March, 2000.