Journal of Biotechnology 91 (2001) 35 – 47 www.elsevier.com/locate/jbiotec
Review article
Parallel cascade identification and its application to protein family prediction

Michael J. Korenberg a,*, Robert David b, Ian W. Hunter b, Jerry E. Solomon c

a Department of Electrical and Computer Engineering, Queen's University, Kingston, Ont., Canada K7L 3N6
b Room 3-154, Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
c Center for Computational Biology, The Beckman Institute, California Institute of Technology, Pasadena, CA 91125, USA
Received 20 October 2000; received in revised form 30 April 2001; accepted 18 May 2001
Abstract

Parallel cascade identification is a method for modeling dynamic systems with possibly high order nonlinearities and lengthy memory, given only input/output data for the system gathered in an experiment. While the method was originally proposed for nonlinear system identification, two recent papers have illustrated its utility for protein family prediction. One strength of this approach is the capability of training effective parallel cascade classifiers from very little training data. Indeed, when the number of training exemplars is limited, and when distinctions between a small number of categories suffice, parallel cascade identification can outperform some state-of-the-art techniques. Moreover, the unusual approach taken by this method enables it to be effectively combined with other techniques to significantly improve accuracy. In this paper, parallel cascade identification is first reviewed, and its use in a variety of different fields is surveyed. Then protein family prediction via this method is considered in detail, and some particularly useful applications are pointed out. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Protein family prediction; Nonlinear system identification; Parallel cascade identification
1. Introduction Classic problems in nonlinear system theory are to find general models for representing the relation between inputs and outputs of a physical
* Corresponding author. Tel.: +1-613-533-2931; fax: +1-613-533-6615. E-mail address: [email protected] (M.J. Korenberg).
system, and methods for constructing such models, given only input/output data gathered in an identification experiment. This area of enquiry is frequently referred to as the domain of ‘black box’ techniques, and the system models considered here include the well-known Volterra (1913, 1959) and Wiener (1958) series (or functional expansions). These series can be employed to approximate the input/output relation of a wide class of nonlinear systems, especially a system that is dynamic (i.e. has memory), where the present
0168-1656/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0168-1656(01)00292-9
value of the output depends on past as well as the present value of the input. Use of such series has the advantage of not requiring knowledge of the internal structure of the nonlinear system to be approximated. Indeed, the system output or outputs are approximated via an explicit expansion of the system input or inputs, in terms of a series of integrals of increasing order, involving weighted input crossproducts. For example, a Volterra series where x(t) is the single input and y(t) is the single output has the form:

y(t) = Σ_{j=0}^∞ ∫_0^∞ ··· ∫_0^∞ h_j(τ_1, …, τ_j) x(t − τ_1) ··· x(t − τ_j) dτ_1 ··· dτ_j    (1)

where the weighting function h_j is called the j-th order Volterra kernel. To use a Volterra or Wiener expansion to obtain an approximation of a system output, the kernels in the appropriate expansion must be measured. When 'black box' techniques can be applied, they effectively dispense with the need to conduct a detailed analysis of the system to be modeled, and the requirement to measure internal signals or parameters. Their disadvantage is that the resulting model may simply mimic the input/output relation of the system without affording any insight into the mechanism underlying the system's behavior. However, in some applications, all that is needed is an effective method of copying this behavior. Indeed, as discussed below, the system to be mimicked may not really exist but may have been created, for example, simply to solve some classification task.

The Volterra and Wiener series are merely two of the better known general models for approximating the behavior of nonlinear systems. Their use in modeling is sometimes referred to as a 'nonparametric' approach, because it does not involve a detailed analysis of the given nonlinear system to determine its structure and then the fitting of parametric models, such as differential equations having appropriate structure, to the measured data. Frechet (1910) was the first to show the suitability of Volterra series to uniformly approximate finite-memory nonlinear systems belonging to a wide class, and this work has been extended to systems with fading memory by Boyd and Chua (1985). For the discrete-time case, there is a similar capability of Volterra series to approximate nonlinear systems belonging to a broad class (e.g. Palm, 1979), and further work was carried out by Sandberg (1983). The finite-order finite-memory discrete-time Volterra series is the same as a multidimensional polynomial function of present and delayed input values, and the Volterra kernels correspond to the coefficients of this polynomial. For example, if x(i) is the single input and y(i) is the single output, then the finite-order finite-memory discrete-time Volterra series has the form:

y(i) = Σ_{d=0}^D Σ_{j_1=0}^R ··· Σ_{j_d=0}^R h_d(j_1, …, j_d) x(i − j_1) ··· x(i − j_d)    (2)
where D is the order of the series and R is the maximum input lag on which the output depends. While the above-mentioned approximation results have great theoretical relevance, the use of Volterra and Wiener series to model nonlinear systems has typically been limited to systems whose significant nonlinearities are of low-order, say, of third order or less (see, for example, Stark, 1969; Ewen and Weiner, 1980; Boyd et al., 1983; Sakai and Naka, 1987a,b). Attempting to approximate a system with higher-order nonlinearities using a Volterra or Wiener series necessitates estimating higher-order kernels, with the concomitant computational requirement typically increasing almost exponentially with assumed order. In special cases, for example when the input can be a complete binary M sequence (Sutter, 1987, 1992), high-order kernels can be rapidly measured, but the luxury of input selection is not available in the application to protein family prediction considered below. Instead we use a fundamentally different model structure, known as a parallel cascade, which is particularly adapted to model systems with high-order nonlinearities.
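As a sketch of Eq. (2), a doubly finite discrete-time Volterra series can be evaluated directly as a multidimensional polynomial in present and delayed input values. The function name and the zero-initial-input convention below are illustrative assumptions, not part of the original paper:

```python
import numpy as np

def volterra_output(x, kernels, R):
    """Evaluate a finite-order, finite-memory discrete-time Volterra series.

    `kernels` is a list [h0, h1, h2, ...] where h0 is a scalar and each hd
    (d >= 1) is a d-dimensional array of shape (R+1,)*d giving the d-th
    order kernel h_d(j1, ..., jd).  The output at time i sums weighted
    products of delayed inputs x(i-j1)...x(i-jd), as in Eq. (2).
    """
    x = np.asarray(x, dtype=float)
    I = len(x)
    y = np.zeros(I)
    for i in range(I):
        # delayed inputs x(i), x(i-1), ..., x(i-R); taken as 0 before i = 0
        xd = np.array([x[i - j] if i - j >= 0 else 0.0 for j in range(R + 1)])
        total = 0.0
        for d, h in enumerate(kernels):
            if d == 0:
                total += float(h)
            else:
                # build x(i-j1)...x(i-jd) as a d-fold outer product,
                # then contract against the kernel h_d
                prod = xd
                for _ in range(d - 1):
                    prod = np.multiply.outer(prod, xd)
                total += np.sum(np.asarray(h) * prod)
        y[i] = total
    return y
```

For example, the second-order system y(i) = 2 + x(i) + 0.5 x(i) x(i−1) corresponds to h0 = 2, h1 = (1, 0) and a second-order kernel whose only nonzero entry is h2(0, 1) = 0.5.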
2. Parallel cascade models To model discrete-time dynamic nonlinear systems, Palm (1979) considered structures that were
a finite sum of cascades of alternating dynamic linear (L) and static nonlinear (N) elements. He showed that any discrete-time nonlinear system having finite-memory and anticipation, whose output was a continuous mapping of its input (so that ‘small’ changes in the input result in ‘small’ changes in the output), could be uniformly approximated by a finite sum of LNL cascades. Thus, each cascade path in Palm’s model consisted of a dynamic linear element followed by a static nonlinearity that was in turn followed by a dynamic linear element. The static nonlinearities here were exponential and logarithmic functions. Earlier, Kolmogoroff (1957, 1963) had shown that any nonlinear finite-memory system that was continuous (in the sense explained immediately above) could be exactly represented by a finite sum of NLNL cascades, but the first nonlinearities in the cascades were unusual in that they were not continuously differentiable. Note that, for generality, Palm (1979) had included nonlinear systems having finite anticipation as well as memory. However, for convenience, we restrict ourselves below to causal (i.e. physically realizable) systems, where the system output can depend only on present and past input values, but not future values. While Palm’s model was a very important contribution, he did not suggest a method for identifying such a model, given only a system’s input and output measured in an identification experiment. Later, to model any discrete-time finite memory nonlinear system having a Wiener series representation, Korenberg (1982) suggested using a structure that was a finite sum of LN elements (Fig. 1), where each static nonlinearity N was a polynomial. The use of polynomial nonlinearities affords both theoretical and practical advantages. 
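As a concrete illustration, the output of a single LN path (a dynamic linear element followed by a polynomial static nonlinearity, as in Fig. 1) can be computed as below. The function name and the zero-initial-condition convention are illustrative assumptions:

```python
import numpy as np

def ln_cascade_output(x, h, coeffs):
    """Output of one LN cascade path.

    The dynamic linear element has impulse response h (memory length
    len(h)); it is followed by a static polynomial nonlinearity with
    coefficient coeffs[d] on u**d.
    """
    x = np.asarray(x, dtype=float)
    R = len(h) - 1
    # intermediate signal u(i) = sum_j h(j) x(i-j), with x(i) = 0 for i < 0
    u = np.array([sum(h[j] * x[i - j] for j in range(R + 1) if i - j >= 0)
                  for i in range(len(x))])
    # static nonlinearity z(i) = sum_d coeffs[d] * u(i)**d
    return sum(c * u ** d for d, c in enumerate(coeffs))
```

With h = (1, 2) and polynomial 1 + 2u², the input (1, 0, 0) gives u = (1, 2, 0) and output (3, 9, 1).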
For example, it can be shown (Korenberg, 1991) that any causal, finite-memory, finite-order (sometimes called ‘doubly finite’) discrete-time Volterra series can be exactly expressed as a finite sum of such LN polynomial cascades. In addition, a general approach, called parallel cascade identification (PCI) was proposed (Korenberg, 1982, 1991) to build a parallel array approximation of a dynamic nonlinear system, given only its inputs and outputs, and the use of polynomial static
nonlinearities in the cascades resulted in extremely rapid convergence. This work was part of a set of nonlinear system identification techniques that Korenberg developed starting with the identification of single cascade paths in the early 1970’s. In more detail, suppose that, in the single-input, single-output case, x(i ), y(i ),…, i= 0,…, I, represent the input and output respectively of the dynamic, causal discrete-time nonlinear system for which the parallel cascade approximation is to be constructed. In addition, suppose that the output y(i ) depends on input values x(i ), x(i− 1),…,x(i− R), so that (R + 1) is the system’s memory length, and let D be the assumed order of nonlinearity. Then the parallel cascade approximation to the system can be constructed as follows, where we suppose that each cascade is of LN structure, as in Fig. 1. The extension to cascades having further alternating L and N elements is simple (Korenberg, 1982, 1991), and is not needed for the present discussion. A first cascade of a dynamic linear element and a polynomial static nonlinearity is found to approximate the given nonlinear system. In doing this, the dynamic linear element is chosen to have memory length (R+ 1). Moreover, the polynomial is best-fit, so as to minimize the mean-square error (MSE) between the system output and the polynomial’s output, which is also the cascade’s output. The residual, i.e. the difference between
Fig. 1. The parallel cascade model used to classify protein sequences. Each L is a dynamic linear element, and each N is a polynomial static nonlinearity. Adapted from Fig. 1 on page 18 of Korenberg et al. (2000a); reproduced by permission of Springer – Verlag, copyright Springer – Verlag 2000.
the system and the cascade outputs, is treated as the output of a new nonlinear system, still having the same input. A further cascade is found to approximate the new nonlinear system, a new residual is computed, and so on. It can be shown (Korenberg, 1982, 1991) that the cascades can be selected so that any given discrete-time system having a Volterra or a Wiener series representation can be approximated to an arbitrary accuracy by a sum of the cascades. Suppose that z_k(i) denotes the output of the k-th cascade, and that y_k(i) is the residual remaining after k cascades have been added to the model. Then y_0(i) = y(i) and

y_k(i) = y_{k−1}(i) − z_k(i),  k = 1, 2, …    (3)
In constructing the k-th cascade, note that the current residual is y_{k−1}(i). There are many effective ways to choose the impulse response h_k(j) for the dynamic linear element beginning the k-th cascade. One simple alternative (Korenberg, 1982, 1991) is to set it equal to either the first-order crosscorrelation, or to a slice of the second- or higher-order crosscorrelation, of the input with the current residual. When higher-order crosscorrelations are employed, discrete impulses are added or subtracted at diagonal values. Thus, for j = 0, …, R set

h_k(j) = φ_{xy_{k−1}}(j)    (4)

if the first-order crosscorrelation φ_{xy_{k−1}} is used, or

h_k(j) = φ_{xxy_{k−1}}(j, A) ± C δ(j − A)    (5)

if the second-order crosscorrelation φ_{xxy_{k−1}} is instead employed, etc. Which crosscorrelation function is chosen to define h_k(j) can be decided either randomly or deterministically, for example by sequentially switching from the first- to the second- to the third- and higher-order crosscorrelations in defining consecutive cascades. If Eq. (5) is used, then A is chosen from one of 0, …, R and the sign of the δ-term must also be chosen; again these choices may be made at random or following a logical progression in defining consecutive cascades. What is important is that there must be a finite probability of making any of the possible choices. In Eq. (5), C is adjusted to tend to zero as the mean-square of the current residual tends
to zero. Similarly, if the third-order crosscorrelation φ_{xxxy_{k−1}} is chosen, then for j = 0, …, R set

h_k(j) = φ_{xxxy_{k−1}}(j, A_1, A_2) ± C_1 δ(j − A_1) ± C_2 δ(j − A_2)    (6)

where A_1, A_2, C_1, C_2 are defined analogously to A, C in Eq. (5). Slices of higher-order crosscorrelations (up to the assumed order D of nonlinearity of the system) can be used analogously to define h_k(j). Once the impulse response of the dynamic linear element beginning the cascade has been defined, calculate its output

u_k(i) = Σ_{j=0}^R h_k(j) x(i − j)    (7)
Next, the static nonlinearity, in the form of a polynomial,

z_k(i) = Σ_{d=0}^D a_{kd} u_k^d(i)    (8)

is best-fit, in the least-square sense, to the current residual y_{k−1}(i). The new residual is then given by Eq. (3), and because the coefficients a_{kd} were determined by least-square fitting, it follows that the mean-square of the new residual is

mean[y_k²(i)] = mean[y_{k−1}²(i)] − mean[z_k²(i)]    (9)

In Eq. (9), mean[·] denotes the mean (analogous to a time-average) operation. Hence, the addition of a further cascade to the model reduces the mean-square of the residual by an amount equal to the mean-square of the cascade output. This by itself does not imply that the mean-square of the residual will be driven to zero by adding a sufficient number of cascades. In fact, however, this conclusion does follow (Korenberg, 1982, 1991) in the noise-free case, because of the way the impulse responses h_k(j) were defined. If there are K cascades in the final model, then the parallel cascade has output

z(i) = Σ_{k=1}^K z_k(i)    (10)
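The identification loop of Eqs. (3)-(10) can be sketched in a minimal single-input form. This is only one simple variant: it uses just the first-order crosscorrelation candidate of Eq. (4) (a full implementation would also try higher-order slices with diagonal impulses, and apply an acceptance threshold), and the function name and averaging conventions are assumptions:

```python
import numpy as np

def identify_parallel_cascade(x, y, R=2, D=3, K=5):
    """Minimal PCI sketch: each cascade's linear element is set to the
    first-order crosscorrelation of the input with the current residual
    (Eq. 4), a degree-D polynomial is least-squares fit to the residual
    (Eq. 8), and the residual is updated (Eq. 3).  Returns the cascades,
    the model output (Eq. 10), and the final residual."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    I = len(x)
    # delayed-input matrix: column j holds x(i-j), zero-padded for i < j
    X = np.column_stack([np.concatenate([np.zeros(j), x[:I - j]])
                         for j in range(R + 1)])
    residual = y.copy()
    z_total = np.zeros(I)
    cascades = []
    for _ in range(K):
        h = X.T @ residual / I            # crosscorrelation slice (Eq. 4)
        u = X @ h                         # linear element output (Eq. 7)
        U = np.vander(u, D + 1, increasing=True)
        a, *_ = np.linalg.lstsq(U, residual, rcond=None)
        z = U @ a                         # cascade output (Eq. 8)
        # the least-squares fit guarantees the identity of Eq. (9)
        assert abs(np.mean((residual - z) ** 2)
                   - (np.mean(residual ** 2) - np.mean(z ** 2))) < 1e-8
        residual = residual - z           # Eq. (3)
        z_total = z_total + z             # running sum for Eq. (10)
        cascades.append((h, a))
    return cascades, z_total, residual
```

By construction the model output plus the final residual reproduces y exactly, and Eq. (9) ensures the residual mean-square never increases as cascades are added.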
It should be noted that PCI is not restricted to use of slices of crosscorrelations to define the
dynamic linear elements beginning the cascades, and there are many other efficacious strategies (Korenberg, 1991, 2000) for doing so. Finally, while the procedure outlined above assumed, for convenience, that the system to be approximated was single-input single-output, parallel cascade identification of multi-input multi-output systems has also been described (Korenberg, 1991). A key aspect of the parallel cascade architecture is that all its memory resides in the dynamic linear elements, while its nonlinearities are localized in static elements. Hence, approximating a system with higher-order nonlinearities simply requires fitting higher-degree polynomials in the cascade paths of Fig. 1. This is much faster, and numerically more robust, than attempting to fit a highorder Volterra or Wiener series approximation to the given nonlinear system. The latter would involve estimating high-order kernels, which is normally a time-consuming and inaccurate procedure, except in fortunate cases when special inputs, such as complete binary M-sequences (Sutter, 1987, 1992), can be used. Above, we have reviewed the parallel cascade model, together with a simple method for building up the model to approximate the behavior of a dynamic nonlinear system, given only its input and output. Several years ago, Korenberg invented the application of PCI to bioinformatics, including class prediction in proteomics and genomics, and two recent papers have concerned use of PCI for protein structure/function classification. Before describing this work, we next briefly survey how the PCI method has been used to characterize or model other biomedical and physical systems.
3. Using parallel cascade identification for kernel estimation, modeling, prediction, and classification As noted above and proven earlier (Korenberg, 1991), any finite-memory finite-order discrete-time Volterra series can be expressed exactly as a finite sum of dynamic linear/static nonlinear cascades (as in Fig. 1). The converse is also true, and has led to an accurate method of estimating the kernels of a nonlinear system, even one with lengthy
memory or high-order nonlinearity. Thus, a parallel cascade model is first built from the input/output data to approximate the nonlinear system. The model can then be collapsed into a corresponding Volterra series to directly reveal its kernels (Korenberg, 1991). This approach has been used to estimate the first and second order kernels of fly photoreceptor cells (French et al., 1993), and to study the input– output relationship of the synapse between fly photoreceptors and large monopolar neurons (Juusola et al., 1995). More recently, PCI was employed to model the nonlinear behavior of the front end of an aircraft pilot flight control system (Eklund and Korenberg, 2000). The resulting models were sufficiently accurate to meet the objective test requirements of the US Federal Aviation Administration (FAA) for certifying full flight simulators, despite the significant hysteresis exhibited by the system in question. Depending on the desired precision (beyond the FAA requirements), between two and four parallel cascades were required in the model of Fig. 1. For example, one of the models had two parallel cascade paths, each with a dynamic linear element of memory length 87 and a polynomial static nonlinearity of 15th degree. Cost savings of using this ‘black box’ approach stem from avoiding the traditional analysis of the component parts of the system to be modeled, and significantly reducing the number of signals that must be recorded from the real aircraft flight control system. The above applications use the ability of PCI to either model or characterize (e.g. through kernel estimation) a real system. Alternatively, a fictitious system can sometimes be defined so that its input output behavior would accomplish a desired task, such as forecasting a time series. Although only its input and output define the fictitious system, a parallel cascade model can then be found to approximate the system, and thus carry out the desired task. 
For example, if the task is to predict the signal x(i ) by m instants, then we regard x(i ) as the input and define the desired output to be y(i )= x(i + m). Hence, we build a parallel cascade model to approximate the defined input output relation; the resulting model can then be used to predict the signal x. This approach was used by Korenberg et al. (1996) to
predict turbulence-induced wavefront distortion, which is significant for an adaptive optics system for a ground-based astronomical observatory. The parallel cascade model equaled the accuracy of a standard neural network trained by backpropagation, but trained at least 50 times faster.

Parallel cascade modeling of fictitious systems can also be applied to solve classification problems. This idea was originally suggested a number of years ago (Korenberg, 1985), and has had some interesting applications more recently. For example, consider training a parallel cascade model to distinguish amongst myoelectric signals corresponding to different contraction types, such as forearm pronation, supination, and elbow flexion. A fictitious nonlinear system can be defined that receives the myoelectric signals as inputs, and assumes different output values for the different contraction types. Building a parallel cascade approximation to the fictitious system created the desired classifier for the myoelectric signals (Korenberg and Morin, 1997). A similar approach was employed to rapidly distinguish between the DTMF (dual tone multi-frequency) signals used to encode the digits on a telephone keypad (Korenberg and Doherty, 1996). The trained parallel cascade models could flawlessly distinguish between the 16 families of DTMF signals (for the digits on military telephone keypads) using half or less of the samples required by the most efficient method in current use. This is significant for increasing the service capacity of data entry systems. In these applications, PCI was used to perform featureless classification, i.e. the trained models could differentiate between the raw signals corresponding to the different types of interest. The next section reviews how this ability of PCI can be exploited in a radically different context, namely the problem of protein family prediction.
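The fictitious-system framing used for prediction earlier in this section can be sketched directly: to predict a signal m instants ahead, the 'input' is x(i) and the desired 'output' of the fictitious system is y(i) = x(i + m). The function name below is an illustrative assumption; any identification method could then be trained on the returned pair:

```python
import numpy as np

def prediction_training_pair(x, m):
    """Frame m-step-ahead prediction as system identification.

    Returns aligned (input, output) training records for the fictitious
    system whose output is y(i) = x(i + m)."""
    x = np.asarray(x, dtype=float)
    return x[:-m], x[m:]
```

A parallel cascade model identified from this input/output pair then serves as an m-step-ahead predictor of x.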
4. Protein family prediction via parallel cascade identification A number of years ago, Korenberg realized that his algorithms were well suited to solving proteomic and genomic classification problems, such as protein family prediction and the making of analogous class distinctions with DNA sequences.
However, the first application of parallel cascade identification to protein family prediction has only been published recently (Korenberg et al., 2000a). In that study, a single representative from each of the globin, calcium-binding, and kinase families was used to create inputs for training parallel cascade models to function as binary (i.e. two-way) classifiers. First, each sequence of amino acids was mapped into a corresponding hydrophobicity profile according to the Rose scale (Rose et al., 1985; Cornette et al., 1987), which was not normalized. Then, to build a parallel cascade model intended to distinguish, say, globin from kinase proteins, the hydrophobicity profiles for the globin and kinase representatives were spliced together to form the training input. The corresponding training output was defined to be − 1 over the globin portion, and 1 over the kinase portion, of the training input (Fig. 2a). The fictitious nonlinear system that could map the training input into the output would function as a classifier, and at least would be able to distinguish between the globin and the kinase representatives. Nothing is known about this nonlinear system except its input and output, which is why resort must be made to a ‘black box’ identification technique. The particular identification technique used, parallel cascade identification, has the ability to rapidly model systems having high-order nonlinearities. However, the approach described (Korenberg et al., 2000a) was not limited to use of parallel cascade identification, and other nonlinear system identification techniques can instead be employed. In order to identify a parallel cascade model, certain parameters related to its structure had to be determined. 
These were (i) the memory length of the dynamic linear element beginning the cascade; (ii) the degree of the polynomial defining the static nonlinearity that followed; (iii) the maximum number of cascades that could be introduced into a model; and (iv) a threshold concerning the minimum reduction in MSE required before a candidate cascade could be accepted into the model. To be accepted, a candidate’s reduction of the MSE, divided by the mean-square of the current residual, had to exceed the threshold divided by the number of
Fig. 2. (a) The training input x(i ) and output y(i ) used in the first study (Korenberg et al., 2000a) to identify the parallel cascade model for distinguishing globin from kinase sequences. The input (dashed line) was formed by splicing together the (Rose scale) hydrophobicity profiles for one globin (Brookhaven designation: 1HDS) and one kinase (Brookhaven designation: 1PFK) sequence. The corresponding output (solid line) was defined to be − 1 over the globin and 1 over the kinase portions of the input. (b) The training output (solid line) y(i ), and the actual output z(i ) (dashed line) calculated when the training input of (a) was fed through the identified parallel cascade model.
output points used in the identification. A ‘default’ set of parameters did not suffice for all classifiers, i.e. the same parameter values could not be used for globin versus calcium-binding, globin versus kinase, and calcium-binding versus kinase classifiers. Instead, these parameters were set for each classification task by judging the
effectiveness of the resulting parallel cascade models in classifying protein sequences in a small evaluation set. For example, for trial values of the four parameters, parallel cascade models were trained on the input/output data in Fig. 2a. Then the models’ correct classification rates were measured
on globin and kinase sequences from the small evaluation set to determine an effective set of parameters. In all cases, the choices for memory length, polynomial degree, and maximum number of allowed cascades were such that the number of variables introduced into a parallel cascade model was less than the number of output points used in identifying the model. In fact, not all the variables introduced are independent; for example all constant terms for the polynomials in the cascades can be replaced by a single constant. However, to guard against overfitting the model, it is convenient to restrict the total number of variables introduced, since this is an upper bound on the number of independent variables. Once a parallel cascade model had been built, it was used to classify sequences in the following way (Korenberg et al., 2000a). The protein sequence, in the form of its hydrophobicity profile, was fed to the model, say for globin versus kinase, and the mean of the resulting output was calculated. If the mean was less than zero, the sequence was classified as globin, and if greater as kinase. Certainly this criterion effectively distinguishes between the training globin and kinase sequences. As shown in Fig. 2b, if the training input of Fig. 2a is fed to the identified model, then the resulting output is indeed predominately negative over the globin portion, and positive over the kinase portion, of the training input. With the parameter values determined from the evaluation set, the models for globin versus calcium-binding, globin versus kinase, and calciumbinding versus kinase were then tested on a larger set of 253 sequences. Correct classification rates for these classifiers were, respectively, about 91, 81, and 68%. Note that these success rates (Korenberg et al., 2000a) were obtained with training inputs constructed using only one representative from each family. 
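The training-signal construction (Fig. 2a) and the sign-of-mean decision rule just described can be sketched as follows. The function names are illustrative, and the hydrophobicity values shown are placeholders for demonstration only — the actual study used the full 20-residue Rose scale:

```python
import numpy as np

# Placeholder hydrophobicity values for illustration only; the study
# used the complete (unnormalized) Rose scale, whose values span 0.52-0.91.
ROSE_DEMO = {'A': 0.74, 'G': 0.72, 'K': 0.52, 'C': 0.91}

def training_signals(exemplar_a, exemplar_b, scale):
    """Splice the hydrophobicity profiles of one exemplar from each family
    into a training input, with output -1 over the first exemplar and +1
    over the second (as in Fig. 2a)."""
    pa = [scale[r] for r in exemplar_a]
    pb = [scale[r] for r in exemplar_b]
    x = np.array(pa + pb)
    y = np.array([-1.0] * len(pa) + [1.0] * len(pb))
    return x, y

def classify_by_mean(model_output):
    """Sign-of-mean decision: a negative output mean assigns the sequence
    to the first family (e.g. globin), a positive mean to the second."""
    return 'first' if np.mean(model_output) < 0 else 'second'
```

In the study itself, the full hydrophobicity profile of a test sequence was fed through the identified parallel cascade model, and this decision rule was applied to the resulting output.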
While the small evaluation set had been used to set the values for the four parameters mentioned above, this set had not been used to create the training inputs. Once these parameters had been determined, increased accuracy might well have been obtained using, additionally, all sequences from the evaluation set for training. For example, multiple representatives from each
family could be concatenated to form the training inputs. Alternatively, multiple parallel cascade models could be trained for each classification task using the additional sequences, with the final decision determined by voting (Korenberg et al., 2000a). Unless these additional data are used in training, the PCI success rates cannot be compared with those for other classifiers trained on both the three original representative hydrophobicity profiles and all profiles in the evaluation set. In a follow-up paper, Korenberg’s method was tested over a much larger test set, comprised of over 16 000 sequences from the globin, calciumbinding, and kinase families (Korenberg et al., 2000b). Only the same three sequences from the first study (Korenberg et al., 2000a) were used to form the training inputs for identifying the parallel cascade classifiers. However, a different method was used to encode the amino acids in the protein sequences. While the Rose scale hydrophobicity profiles used in the first study effectively captured structure/function information, there were a few disadvantages. These included representing multiple amino acids, for example phenylalanine and isoleucine, by the same value, so that there is a loss of information in mapping the primary amino acid sequence into the corresponding hydrophobicity profile. In addition, the Rose scale covers a narrow range of values, from 0.52 to 0.91, while still weighting some amino acids more heavily than others. These characteristics caused the resulting hydrophobicity profiles to be relatively poor inputs for nonlinear system identification. The above disadvantages could be avoided through use of binary codes to represent the amino acids (Korenberg et al., 2000b). The first scale in Table 1, SARAH1, uses five digit codes in which the two non-zero entries are both 1 or both − 1. Similarly, in the SARAH2 scale of Table 1, all three non-zero entries of each code equal 1 or all equal − 1. 
For both scales, the amino acids were ranked according to the Rose scale (breaking ties) and then the codes were assigned in descending order according to the binary numbers corresponding to the codes. In each scale, the bottom half is the negative mirror image of the top half.
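The single-input use of the SARAH1 scale (concatenating the 5-digit codes to form a profile five times the sequence length) can be sketched as below. Only a few residues from Table 1 are shown, and the function name is an illustrative assumption:

```python
import numpy as np

# A few SARAH1 codes from Table 1: each residue maps to a 5-digit code
# whose two nonzero entries are both +1 or both -1.
SARAH1 = {
    'C': (1, 1, 0, 0, 0),
    'F': (1, 0, 1, 0, 0),
    'A': (0, 0, 0, 1, 1),
    'K': (-1, -1, 0, 0, 0),
}

def sarah1_profile(sequence):
    """Single-input SARAH1 encoding: concatenate the 5-tuple codes,
    giving a numerical profile five times the sequence length."""
    return np.array([v for residue in sequence for v in SARAH1[residue]],
                    dtype=float)
```

The alternative, multi-input use would instead stack the five code digits as five parallel signals of the original sequence length.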
Table 1. Binary code representations for amino acids

Amino acid   SARAH1 scale     SARAH2 scale
C            1,1,0,0,0        1,1,1,0,0
F            1,0,1,0,0        1,1,0,1,0
I            1,0,0,1,0        1,1,0,0,1
V            1,0,0,0,1        1,0,1,1,0
L            0,1,1,0,0        1,0,1,0,1
W            0,1,0,1,0        1,0,0,1,1
M            0,1,0,0,1        0,1,1,1,0
H            0,0,1,1,0        0,1,1,0,1
Y            0,0,1,0,1        0,1,0,1,1
A            0,0,0,1,1        0,0,1,1,1
G            0,0,0,−1,−1      0,0,−1,−1,−1
T            0,0,−1,0,−1      0,−1,0,−1,−1
S            0,0,−1,−1,0      0,−1,−1,0,−1
R            0,−1,0,0,−1      0,−1,−1,−1,0
P            0,−1,0,−1,0      −1,0,0,−1,−1
N            0,−1,−1,0,0      −1,0,−1,0,−1
D            −1,0,0,0,−1      −1,0,−1,−1,0
Q            −1,0,0,−1,0      −1,−1,0,0,−1
E            −1,0,−1,0,0      −1,−1,0,−1,0
K            −1,−1,0,0,0      −1,−1,−1,0,0

Either scale can be used in two ways to represent a protein sequence. The codes for the amino acids can simply be concatenated to form a numerical profile five times longer than the original sequence. Alternatively, the codes can be used to create a profile of five signals, each of the same length as the original sequence. The latter approach leads to the use of five-input parallel cascade models, while the former approach leads to use of single-input models. While both alternative uses of the codes were set out in the paper (Korenberg et al., 2000b), test results were shown only for the single-input form. Unlike the Rose scale, where the 20 amino acids are mapped into 14 values, on the SARAH scales each amino acid is uniquely represented and equally weighted. Moreover, the codes caused the resulting numerical profiles for the protein sequences to form better inputs for system identification, which resulted in higher classification accuracy. To illustrate the difference in amino acid representation, suppose that the single-input use is made of the SARAH1 scale. Then Fig. 3a shows the input formed by splicing together the resulting numerical profiles for the same training
globin and kinase representatives used in Fig. 2a. The training output is again defined to be − 1 over the globin, and 1 over the kinase, portions of the input. One advantage of using the SARAH scales is that model identification was more robust and a single set of default parameters sufficed for all classifiers. Thus, an effective classifier could be built for any of the tasks using only the relevant pair of training exemplars and there was no need to adjust parameters using an evaluation set. Fig. 3b shows that, when the identified parallel cascade model is stimulated by the training input, the resulting output is again predominately negative over the globin portion, and positive over the kinase portion, of the input. To classify a new protein sequence, the corresponding numerical profile, created using the SARAH1 scale, was fed into the identified model, and the resulting output calculated. Instead of a simple test involving the sign of the output mean (Korenberg et al., 2000a), a more sophisticated decision criterion was used (Korenberg et al., 2000b). A first ratio was computed involving the MSE of the output from − 1, relative to that for the training globin exemplar. A second ratio involved the MSE from 1, relative to that for the training kinase exemplar. If the first ratio was less than the second, then the sequence was classified as globin, otherwise as kinase. Analogous MSE ratio tests were used for globin versus calciumbinding and for calcium-binding versus kinase classifications. The use of SARAH scales with PCI boosted two-way classification accuracy to about 85%, compared with about 79% over the same test set in the first study, where the Rose scale had instead been used. In addition, the PCI classifiers were also tested over the large set of over 16,000 sequences, which had been located by an exhaustive keyword search of the NCBI (National Center for Biotechnology Information, at http:// www.ncbi.nlm.nih.gov) database. 
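The MSE-ratio decision criterion just described can be sketched as follows. The function name is an illustrative assumption; the normalizing values are the MSEs obtained when the training exemplars themselves are fed through the identified model:

```python
import numpy as np

def mse_ratio_decision(z, mse_a_train, mse_b_train):
    """MSE-ratio criterion of the second study: compare the output's MSE
    from -1 (normalized by the value for training exemplar A) with its
    MSE from +1 (normalized by the value for training exemplar B); the
    smaller ratio decides the family."""
    z = np.asarray(z, dtype=float)
    ratio_a = np.mean((z + 1.0) ** 2) / mse_a_train
    ratio_b = np.mean((z - 1.0) ** 2) / mse_b_train
    return 'A' if ratio_a < ratio_b else 'B'
```

For a globin-versus-kinase classifier, family A would be globin (training output −1) and family B kinase (training output +1).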
Fig. 3. (a) The training input x(i) and output y(i) used in the second study (Korenberg et al., 2000b) to identify the parallel cascade model for distinguishing globin from kinase sequences. The same globin and kinase training sequences were used as in Fig. 2, but this time were mapped into numerical profiles via the SARAH1 scale, before being spliced to form the training input (dashed line). The training output (solid line) was again defined to be −1 over the globin portion, and 1 over the kinase portion, of the input. (b) The training output (solid line) y(i), and the actual output z(i) (dashed line) calculated when the training input of (a) was fed through the identified parallel cascade model.

The PCI performance was compared with that of hidden Markov models (HMMs) (Krogh et al., 1994; Hughey et al., 1999) over this large set. Here, SARAH1 encoding caused the PCI models to average about 79% correct in two-way classifications (Korenberg et al., 2000b), versus about 75% for HMMs trained using the same three protein sequences. This difference between PCI and HMM classification rates is significant on a chi-square test at better than the 0.001 level. Moreover, each of the PCI models showed reasonable two-way classification in both directions (average accuracy ranged from 66 to 97%), whereas the HMMs averaged less than 45% in one direction on two of the three two-way classification tasks (average accuracy ranged from 43 to 100%). Since the PCI and HMM approaches are markedly different, improved classification can be anticipated from using them together, and indeed the two approaches could be combined to increase accuracy to 82%.

As already noted, the above results were achieved using the SARAH scales in a single-input form, where the 5-tuple codes for the amino acids were concatenated to form a numerical profile five times longer than the original sequence. This format contains ambiguities, in that the numerical profile only 'makes sense' if read with knowledge of the correct end point for each 5-tuple. As the numerical profile is fed through the parallel cascade model, output is generated at points in between, as well as at, these end points. To obtain the above results, these 'in-between' points were not excluded from the output used in classifying the sequence. A future paper will investigate whether classification can be improved by avoiding such 'in-between' points of the output. The paper will also measure the improvement afforded by the multi-input form (Korenberg et al., 2000b) of the SARAH scales, which avoids such ambiguities in interpreting the numerical profiles.

In the above papers (Korenberg et al., 2000a,b), only hydrophobicity was used as the basis for discriminating between protein sequences from different families. That is, hydrophobicity was the only residue property employed in mapping the protein sequences into the numerical profiles that were to be classified.
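With the single-input SARAH encoding described above, each residue contributes five consecutive profile values, so only every fifth model output sample aligns with a 5-tuple end point. Restricting classification to those end points (which the text leaves as future work) could be sketched as a simple stride over the output; the function name is illustrative.

```python
# Sketch: keep only the model output samples that coincide with
# 5-tuple end points, discarding the ambiguous 'in-between' samples.

def endpoint_samples(output, code_length=5):
    """Return every code_length-th sample, starting at the first
    end point (index code_length - 1)."""
    return output[code_length - 1::code_length]
```
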
The justification for focusing on hydrophobicity is the recognition that this property is a major driving force in protein folding (Dill, 1990). Indeed, both the first protein classification paper (Korenberg et al., 2000a) that used PCI, and an earlier study (Regelson, 1997) that employed left-to-right hidden Markov models, found that Rose-scale hydrophobicity profiles store significant information concerning structure and/or function. However, the later protein classification paper (Korenberg et al., 2000b) also considered future use of additional chemical or physical properties of the amino acids, such as polarity, charge, α-helical preference, and residue volume, in order to enhance performance.

Another point to note in the above protein classification papers (Korenberg et al., 2000a,b) is that the reported results stemmed from using only three protein sequences, one from each family to be distinguished, in creating the training inputs. Indeed, the good results obtained from use of a single exemplar of each relevant family in the training inputs highlight an important advantage of the PCI approach: namely, the ability to build accurate classifiers based on very little training data. However, as discussed above and in both of the protein classification papers, additional exemplars of each family can be concatenated to form improved training inputs, or employed to train additional classifiers to be used in concert, for example in voting schemes.

In the first study (Korenberg et al., 2000a), which used Rose scale hydrophobicity values to represent the amino acids, classification accuracy ranged from about 68 to about 91%, for the calcium-binding versus kinase and the globin versus calcium-binding classifiers, respectively.

An important issue in judging the difficulty of the classification tasks is the degree of similarity of the test sequences with those used for training. The question of whether the training and test sequences were independent was investigated by David (2000). He computed the distance between a test and a training sequence using the sum of squares of the differences in Rose hydrophobicity values.
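The voting scheme mentioned above, in which several classifiers trained on different exemplar pairs act in concert, can be sketched as a simple majority vote; the classifier interface assumed here is illustrative, not taken from the papers.

```python
# Sketch of a majority-voting scheme over several trained classifiers.
# Each classifier is assumed to be a function mapping a sequence to a
# family label (an assumed interface, for illustration only).
from collections import Counter

def vote(classifiers, sequence):
    """Return the family label chosen by the most classifiers."""
    labels = [clf(sequence) for clf in classifiers]
    winner, _ = Counter(labels).most_common(1)[0]
    return winner
```
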
When the sequences compared were of different lengths, David (2000) checked all possible alignments (with no insertions or deletions), and found the minimum distance. Using this distance measure to classify the test sequences, he found that two-way accuracy averaged 53%. This is far worse than the accuracy of the PCI classifiers, and shows that the classification tasks they carried out were nontrivial.
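The distance measure described above can be sketched as follows: the sum of squared differences in per-residue hydrophobicity values, minimized over all ungapped alignments of the shorter profile against the longer one. The function name is illustrative.

```python
# Sketch of the minimum-distance measure of David (2000): slide the
# shorter hydrophobicity profile along the longer one (no insertions
# or deletions) and take the smallest sum-of-squares difference.

def min_alignment_distance(a, b):
    """Minimum sum-of-squares distance over all ungapped alignments."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    n, m = len(short), len(long_)
    best = float('inf')
    for offset in range(m - n + 1):
        d = sum((short[i] - long_[offset + i]) ** 2 for i in range(n))
        best = min(best, d)
    return best
```

Classifying each test sequence by its nearest training sequence under this measure gives the 53% baseline quoted above.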
As discussed above, the first protein classification paper (Korenberg et al., 2000a) required determining individual parameter set values for each of the three pairwise classification tasks. This requirement was removed in the second paper (Korenberg et al., 2000b), where the SARAH encoding scales were used, and in any event it is not peculiar to the PCI approach. Indeed, other methods, for example those using proximity correlation matrices to detect protein fold similarity (Grigoriev and Kim, 1999), sometimes require different parameter values for different protein families.
5. Conclusion

Parallel cascade identification and its application in diverse areas were reviewed, including recent use in protein sequence classification. As the approach differs considerably from other techniques, such as hidden Markov modeling, there are clear benefits from combining PCI with other methods. Aside from the protein family prediction reviewed above, PCI has a richer role to play in bioinformatics. Indeed, some interesting applications of PCI include searching for active sites on proteins, e.g. phosphorylation and ATP-binding sites. Two other important applications concern interpretation of gene expression profiles, and establishing genotype/phenotype correlations; inventions for these and related applications of nonlinear system identification to class prediction in bioinformatics are included in patent applications filed by Korenberg.
Acknowledgements

We thank the referees for perceptive comments on the manuscript.
References

Boyd, S., Chua, L.O., 1985. Fading memory and the problem of approximating non-linear operators with Volterra series. IEEE Trans. Circ. Sys. 32, 1150–1160.
Boyd, S., Tang, Y.S., Chua, L.O., 1983. Measuring Volterra kernels. IEEE Trans. Circ. Sys. 30, 571–577.
Cornette, J.L., Cease, K.B., Margalit, H., Spouge, J.L., Berzofsky, J.A., DeLisi, C., 1987. Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J. Mol. Biol. 195, 659–685.
David, R., 2000. Applications of nonlinear system identification to protein structural prediction. S.M. Thesis, Massachusetts Institute of Technology, MA.
Dill, K.A., 1990. Dominant forces in protein folding. Biochemistry 29, 7133–7155.
Eklund, J.M., Korenberg, M.J., 2000. Simulation of aircraft pilot flight controls using nonlinear system identification. Simulation 75, 72–81.
Ewen, E.J., Weiner, D.D., 1980. Identification of weakly nonlinear systems using input and output measurements. IEEE Trans. Circ. Sys. 27, 1255–1261.
Fréchet, M., 1910. Sur les fonctionnelles continues. Annales Scientifiques de l'École Normale Supérieure 27, 193–219.
French, A.S., Korenberg, M.J., Jarvilehto, M., Kouvalainen, E., Juusola, M., Weckstrom, M., 1993. The dynamic nonlinear behaviour of fly photoreceptors evoked by a wide range of light intensities. Biophys. J. 65, 832–839.
Grigoriev, I.V., Kim, S.-H., 1999. Detection of protein fold similarity based on correlation of amino acid properties. Proc. Natl. Acad. Sci. USA 96, 14318–14323.
Hughey, R., Karplus, K., Krogh, A., 1999. Sequence alignment and modeling software system. http://www.cse.ucsc.edu/research/compbio/sam.html
Juusola, M., Weckstrom, M., Uusitalo, R.O., Korenberg, M.J., French, A.S., 1995. Nonlinear models of the first synapse in the light-adapted fly retina. J. Neurophysiol. 74, 2538–2547.
Kolmogoroff, A.N., 1957. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk. SSSR 114, 953–956.
Kolmogoroff, A.N., 1963. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. AMS Transl. 2, 55–59.
Korenberg, M.J., 1982. Statistical identification of parallel cascades of linear and nonlinear systems. Proceedings 6th IFAC Symposium Ident. Sys. Param. Est. 1, pp. 580–585.
Korenberg, M.J., 1985. Pattern recognition via nonlinear system identification. Proceedings 28th Midwest Symposium Circuits Sys. 1, pp. 96–99.
Korenberg, M.J., 1991. Parallel cascade identification and kernel estimation for nonlinear systems. Ann. Biomed. Eng. 19, 429–455.
Korenberg, M.J., 2000. Identifying systems with high-order nonlinearities or lengthy memory via parallel cascade identification. Proceedings 20th Biennial Symp. Communications, Queen's University, Kingston, Ontario, Canada, pp. 253–257.
Korenberg, M.J., Doherty, P.W., 1996. Rapid DTMF signal classification via parallel cascade identification. Electron. Lett. 32, 1862–1863.
Korenberg, M.J., Morin, E.L., 1997. Automatic discrimination of myoelectric signals via parallel cascade identification. Ann. Biomed. Eng. 25, 708–712.
Korenberg, M.J., McGaughey, D., Aitken, G.J.M., 1996. Parallel cascade prediction of turbulence induced wavefront tilt. Electron. Lett. 32, 1315–1316.
Korenberg, M.J., Solomon, J.E., Regelson, M.E., 2000a. Parallel cascade identification as a means for automatically classifying protein sequences into structure/function groups. Biol. Cybern. 82, 15–21 (on-line publication December 1999).
Korenberg, M.J., David, R., Hunter, I.W., Solomon, J.E., 2000b. Automatic classification of protein sequences into structure/function groups via parallel cascade identification: a feasibility study. Ann. Biomed. Eng. 28, 803–811.
Krogh, A., Brown, M., Mian, I.S., Sjolander, K., Haussler, D., 1994. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235, 1501–1531.
Palm, G., 1979. On representation and approximation of nonlinear systems. Part II: Discrete time. Biol. Cybern. 34, 49–52.
Regelson, M.E., 1997. Protein structure/function classification using hidden Markov models. Ph.D. Thesis, The Beckman Institute, California Institute of Technology, Pasadena, CA.
Rose, G.D., Geselowitz, A.R., Lesser, G.J., Lee, R.H., Zehfus, M.H., 1985. Hydrophobicity of amino-acid residues in globular proteins. Science 229, 834–838.
Sakai, H.M., Naka, K.-I., 1987a. Signal transmission in the catfish retina. IV. Transmission to ganglion cells. J. Neurophysiol. 58, 1307–1328.
Sakai, H.M., Naka, K.-I., 1987b. Signal transmission in the catfish retina. V. Sensitivity and circuit. J. Neurophysiol. 58, 1329–1350.
Sandberg, I.W., 1983. Expansions for discrete-time nonlinear systems. Circuit Sys. Signal Process. 2, 179–192.
Stark, L.W., 1969. The pupillary control system: its nonlinear adaptive and stochastic engineering design characteristics. Automatica 5, 655–676.
Sutter, E.E., 1987. A practical non-stochastic approach to nonlinear time-domain analysis. In: Marmarelis, V.Z. (Ed.), Advanced Methods of Physiological System Modeling, vol. 1. Biomedical Simulations Resource, University of Southern California, Los Angeles, CA, pp. 303–315.
Sutter, E.E., 1992. A deterministic approach to nonlinear systems analysis. In: Pinter, R.B., Nabet, B. (Eds.), Nonlinear Vision: Determination of Neural Receptive Fields, Function, and Networks. CRC Press, Boca Raton, FL, pp. 171–220.
Volterra, V., 1913. Leçons sur les fonctions de lignes. Gauthier-Villars, Paris.
Volterra, V., 1959. Theory of Functionals and of Integral and Integro-Differential Equations. Dover, New York, NY.
Wiener, N., 1958. Nonlinear Problems in Random Theory. MIT Press, Cambridge, MA.