Computer Methods and Programs in Biomedicine 82 (2006) 31–37
journal homepage: www.intl.elsevierhealth.com/journals/cmpb
Robust regression for high throughput drug screening

Igor Fomenko*, Mark Durst, David Balaban

Systems Informatics Department, Amgen, One Amgen Center Drive, MS 34-2-A, Thousand Oaks, CA 91320-1799, USA

* Corresponding author. Tel.: +1 805 447 9961. E-mail address: [email protected] (I. Fomenko).

doi:10.1016/j.cmpb.2006.01.008
Article info

Article history:
Received 1 July 2005
Received in revised form 17 January 2006
Accepted 26 January 2006

Keywords:
Dose–response curve
Agonist activity
HTS
Robust regression
M-estimation
IRLS

Abstract

Effective analysis of high throughput screening (HTS) data requires automation of dose–response curve fitting for large numbers of datasets. Datasets with outliers are not handled well by standard non-linear least squares methods, and manual outlier removal after visual inspection is tedious and potentially biased. We propose robust non-linear regression via M-estimation as a statistical technique for automated implementation. The approach of finding M-estimates by Iteratively Reweighted Least Squares (IRLS) and the resulting optimization problem are described. Initial parameter estimates for iterative methods are important, so self-starting methods for our model are presented. We outline the software implementation, done in Matlab and deployed as an Excel application via the Matlab Excel Builder Toolkit. Results of M-estimation are compared with least squares estimates before and after manual editing.

© 2006 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
Automation of curve fitting for massive numbers of datasets has become an important element of data analysis for high throughput screening (HTS) of drug candidates. Standard non-linear regression algorithms often fail when outliers are present. Manual checking of poor datasets to filter out outliers is laborious and prone to subjectivity; it slows the screening process and adds potential for bias and inconsistency. In Section 8.4.3 of ref. [1], a practical issue with automated curve fitting procedures for large datasets is discussed for the example of a screening campaign for agonist activity. The author poses the question of how to obtain an empirical estimate of potency (as a guide for further, more detailed testing) when there are least squares regression issues for some of the datasets for which “one or more outliers lead the curve fitting procedure to pass over a possibly valuable agonist activity”. Robust regression provides us with methods that are “resistant to wild values” [2]. Automation of curve fitting based on robust statistical techniques eliminates visual inspection
for outliers. By comparison with some ad hoc automatic rules with outlier rejection capability [3], robust regression methods have the advantage that all data points are taken into account, and that outlying observations are progressively downweighted. The main focus of this paper is to develop an effective robust non-linear regression algorithm in order to automate the fitting process for HTS. Our dose–response curves are described by the Four-Parameter Logistic (4PL) model; we develop M-estimates [4] for the parameters of this model. We choose the Tukey Biweight function as the M-estimation penalty function, and the Median Absolute Deviation from the median (MAD) as the M-estimation scale. Iteratively Reweighted Least Squares (IRLS) is used to find M-estimates for the parameters. The optimization problem which results from the IRLS algorithm is formulated. Convergence proofs and other rigorous theoretical analysis are not possible without unsupported distributional assumptions; however, IRLS works well in practice [4] and is frequently used in the computational statistics community [5,6]. A different approach to M-estimation for the 4PL
model based on Huber’s loss function in the case where the variances of the responses increase with their expected values is presented in [7]. A technique for determining starting estimates must always be developed for non-linear regression [8]; for successful automation of HTS, such a technique is crucial. In addition to the task of finding values from which convergence is quickly obtained, convergence to the correct solution in a heuristic sense – the solution which downweights the actual outliers – is needed. It is known that the standard nonlinear least squares solution can provide a very poor start. As mentioned in [4] for the general regression case, the L1 estimate (which is the analogue of the median) usually provides a good start, but it is harder to compute, and our experiments showed that use of standard L1 optimization algorithms for the 4PL model slows computation significantly. Our self-starting method for robust non-linear regression for the 4PL model is proposed by analogy with the self-starting models for S-Plus from [8]. It is based on the idea of refining the estimate of each parameter by finding its M-estimate while holding all other parameters fixed at their current values. Software implementation of the IRLS algorithm was done in Matlab. We deployed this function as an ActiveX object in an Excel application written using Visual Basic for Applications (VBA) and the Matlab Excel Builder Toolkit [9]. The Excel application has shown good performance in processing HTS batches, and results are consistent with heuristics used by HTS analysts for outlier identification.
2. Computational methods and theory

2.1. Four-Parameter Logistic model for dose–response curves

Typical HTS dose–responses are shown in Fig. 1 for enzyme-, protein-, antibody- or cell-based assays, and in Fig. 2 for in vivo experiments. (In all figures of this paper a logarithmic scale (base 10) is used for concentration; everywhere below, only the dose–response behavior of Fig. 1 is considered, and all methods and algorithms apply to the response in Fig. 2 with trivial modifications.) We mention in passing that "concentration–response curve" is a more specific and technically correct term for a dose–response curve done in vitro [1]. The most commonly used model for the sigmoidal shape shown in Figs. 1 and 2 is the Four-Parameter Logistic:

   F(x; Top, Bot, log IC50, Sl) = Bot + (Top − Bot) / (1 + 10^{(log IC50 − log10(x))·Sl})    (1)

[Fig. 1 – Dose–response in vitro.]

[Fig. 2 – Dose–response in vivo.]

The number of data points for a typical HTS dataset varies from 10 to 20, with possible replicates. Operational models in pharmacology [1,10] provide some theoretical support for formula (1) and give meaning to its parameters. Probably the most important parameter reported during a campaign is the IC50: the concentration of drug required for 50% inhibition in an assay. A geometrical description of the parameters Bot, Top, log IC50 and Sl is given in Fig. 1: Top is the maximal response, Bot the minimal response, and the parameter Sl corresponds to the slope of the tangent to the curve at the concentration IC50 (whose base-10 logarithm is log IC50). The traditional approach to estimating these parameters is non-linear least squares, which we now outline.
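As a concrete illustration, formula (1) translates directly into a small Matlab function. This is our sketch, not the published program; the parameter ordering [Top Slope log10(IC50) Bot] follows the ParamInit convention used in Section 3.1.

   function F = fourPL(x, beta)
   % Four-Parameter Logistic response of Eq. (1).
   % beta = [Top Slope logIC50 Bot], matching ParamInit in Section 3.1.
   Top = beta(1); Sl = beta(2); logIC50 = beta(3); Bot = beta(4);
   F = Bot + (Top - Bot) ./ (1 + 10.^((logIC50 - log10(x)) .* Sl));
   end

For example, fourPL(logspace(-2, 2, 10)', [100 1 0 0]) evaluates a POC-style curve with IC50 = 1 at ten concentrations.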
Suppose we have a vector of concentrations x = (x_1, ..., x_n)^T and responses y = (y_1, ..., y_n)^T. The Four-Parameter Logistic model with parameter vector β = (Bot, Top, IC50, Sl)^T and response function F(x, β) = F(x; Top, Bot, IC50, Sl) defined by (1) relates the response data to the predictor data by

   y = F(x; β),

with β an unknown parameter vector. The result of the fitting process is an estimate of the "true" parameter vector β_true of the 4PL model. To obtain the parameter estimates, the least squares method minimizes the summed square of residuals. The residual (sometimes called the "error") at the ith predictor point x_i is the difference between the observed response value y_i and the fitted response value ŷ_i = F(x_i, β):

   r_i = y_i − ŷ_i.

The summed square of residuals,

   S = Σ_{i=1}^{n} r_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − F(x_i, β))²,

is sometimes called the "residual squared error". The non-linear least squares (NLS) problem is

   min_β Σ_{i=1}^{n} (y_i − F(x_i, β))²    (2)

or, more generally, with weighting (scale) factors σ_i > 0, i = 1, ..., n,

   min_β Σ_{i=1}^{n} ((y_i − F(x_i, β))/σ_i)².    (3)
The main disadvantage of least squares fitting is its sensitivity to outliers. Outliers have a large influence on the fit because squaring the residuals magnifies the effect of extreme data points. To minimize the influence of outliers, one can fit the data using robust regression techniques.
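For contrast with the robust procedure developed next, a plain least squares fit of problem (2) can be sketched in a few lines; fminsearch here merely stands in for a dedicated NLS solver, and fourPL is the sketch from Section 2.1.

   function beta = nls4pl(x, y, beta0)
   % Ordinary non-linear least squares for problem (2): minimize S.
   % Every residual is squared, so a single outlier can dominate the fit.
   sse = @(b) sum((y - fourPL(x, b)).^2);   % summed square of residuals S
   beta = fminsearch(sse, beta0);           % unconstrained minimization of S
   end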
2.2. Non-linear M-estimation

The generalization of the data-fitting problem (3) that we will solve is

   Minimize    Σ_{i=1}^{n} ρ(r_i(β)/σ_i)
   Subject to  β_i = β_i^0  for i ∈ I    (4)

where β is an m-vector of parameters, m ≤ 4; ρ : R → R+ is the penalty function; r_i : R^m → R (i = 1, ..., n) is the non-linear function giving the ith residual; σ_i > 0 are the scale parameters; and I is a set of indices i for which the parameter β_i is fixed at β_i^0. We introduce the possibility of constrained optimization by defining the set of indices I. This general approach is needed for the self-starting 4PL model described below, and it is useful when three-, two- or one-parameter submodels of the 4PL model are considered. For example, in HTS data analysis the parameters Top and Bot are sometimes fixed, or the parameter Sl is fixed to 1 [1].

The functions r_i(β) are defined by

   r_i(β) = y_i − F(x_i, β),  i = 1, ..., n.    (5)

The penalty function for the classical least squares method is

   ρ(t) = t²/2.

The main idea of M-estimates is to modify the least squares method by choosing the penalty function in such a way that the fitting process becomes robust with respect to data contamination by outliers [4]. The scales σ_i are needed because the penalty function is defined on some standard scale.

2.2.1. Choice of penalty function and scale

One of the important problems of robust statistics is to choose a modification of the penalty function ρ(t) = t²/2 that makes the fitting process robust with respect to outliers in the data. Different penalty functions have been proposed in the course of robust statistics development over the last 30 years. In our work we used the Tukey Biweight function [2]

   ρ(t) = (1/6)[1 − (1 − t²)³]   if |t| ≤ 1,
   ρ(t) = 1/6                    if |t| > 1.    (6)

The least squares penalty function is compared with the Tukey Biweight function in Fig. 3.

[Fig. 3 – Weighting for Tukey function and least squares t²/2 penalty function.]

There are strong heuristic arguments [4] in favor of using studentized residuals to adjust the scale pointwise in order to deal with moderate and extreme leverage points. This means that in (4) we choose

   σ_i = σ √(1 − h_i(x, y, β)),  i = 1, ..., n    (7)

where the definition of the functions h_i(x, y, β), i = 1, ..., n is given in Appendix A, and σ (yet to be defined) is the scale of the batch of residuals.

The Median Absolute Deviation has emerged as the most useful ancillary estimate of scale, according to [4]. It is defined as the median of the absolute deviations from the median; for a vector x = (x_1, ..., x_n)^T ∈ R^n it is given by

   MAD(x) = med{|x_i − M|},  i = 1, ..., n,  where  M = med{x_i}.
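For reference, the Tukey penalty (6) and the MAD scale transcribe directly into Matlab. The function names are ours, not those of the published program (madScale avoids shadowing the Statistics Toolbox function mad):

   function rho = tukeyRho(t)
   % Tukey Biweight penalty function, Eq. (6).
   rho = ones(size(t)) / 6;                 % constant value 1/6 for |t| > 1
   inner = abs(t) <= 1;
   rho(inner) = (1 - (1 - t(inner).^2).^3) / 6;
   end

   function s = madScale(x)
   % Median Absolute Deviation from the median, MAD(x) = med{|x_i - M|}.
   s = median(abs(x - median(x)));
   end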
Another possible estimate of scale, analogous to MAD and suggested in [11] for small sample sizes, is the normalized MAD (NMAD), defined for our model as

   NMAD(r(β)) = med{largest n − 3 of the |r_i(β)|, i = 1, ..., n}.

A roughly unbiased estimate σ̂ of the standard deviation of the residuals under the normal distribution for large n is

   σ̂ = NMAD(r(β))/c1,

where c1 = 0.6745 and the vector r(β) = (r_1(β), ..., r_n(β))^T is defined by (5). The scale σ in (7) is taken as

   σ = c2 · NMAD(r(β))/c1,

where c2 = 0.46581 is a tuning constant. The choice of the tuning constant c2 involves a trade-off between efficiency and robustness; the numerical value chosen here yields 95% efficiency for the location estimator of the normal distribution [2,11]. The constant c = c2/c1 is used below.

2.2.2. The optimization problem and IRLS

Putting formulae (4)–(7) together gives the optimization problem resulting from M-estimation regression. For any β_M let us calculate

   σ_i(β_M) = c · NMAD(r(β_M)) √(1 − h_i(x, y, β_M)),  i = 1, ..., n,

and consider

   min_β Σ_{i=1}^{n} ρ(r_i(β)/σ_i(β_M)),  subject to β_i = β_i^0 for i ∈ I.    (8)

Following [4], we call β = β_M an M-estimate if it is a solution of problem (8). Let us introduce the derivative of ρ,

   ψ(t) = ρ′(t),    (9)

and the weight function

   W(t) = ψ(t)/t.    (10)

Consider

   Φ(β) = Σ_{i=1}^{n} ρ(r_i(β)/σ_i(β_M)).    (11)

The solution of (8) satisfies

   ∇Φ(β) = 0,    (12)

where the gradient operator ∇ is understood to operate only on those components of β which are allowed to vary (i.e., those with indices i ∉ I). From (9) and (11), this gradient is

   ∇Φ(β) = Σ_{i=1}^{n} ψ(r_i(β)/σ_i(β_M)) ∇r_i(β)/σ_i(β_M).

Using (10), from (12) and (5) it follows that the solution of (8) is a solution of

   Σ_{i=1}^{n} W((y_i − F(x_i, β))/σ_i(β_M)) · ((y_i − F(x_i, β))/σ_i(β_M)²) · ∇F(x_i, β) = 0.    (13)

From the above derivation it follows that β = β_M is an M-estimate if it satisfies Eq. (13). Now, if we have an old approximation β_M^old of the M-estimate, we may determine a new, better approximation β = β_M^new as the solution of an equation similar to (13):

   Σ_{i=1}^{n} W((y_i − F(x_i, β_M^old))/σ_i(β_M^old)) · ((y_i − F(x_i, β))/σ_i(β_M^old)²) · ∇F(x_i, β) = 0.

Because the argument of the function W does not depend on β in this equation, it can be solved as a simple non-linear least squares problem of the type of (3), and M-estimates may be found by iterating this process. The resulting algorithm is called Iteratively Reweighted Least Squares [5,11,12] and is described by the following steps:

IRLS1. Begin with the initial estimate β = β^st, where β_i^st = β_i^0 for i ∈ I and β_i^st is determined as in Section 2.3 for i ∉ I.

IRLS2. Form the residuals

   r_i(β) = y_i − F(x_i, β),  i = 1, ..., n.

IRLS3. Calculate the scalar weights

   w_i = W(r_i(β)/σ_i),  where  σ_i = c · NMAD(r(β)) √(1 − h_i(x, y, β^st)).    (14)
IRLS4. Solve the following non-linear least squares problem of the type of (3), with the weights w_i and scales σ_i held fixed:

   min_β Σ_{i=1}^{n} w_i (r_i(β)/σ_i)²,  subject to β_i = β_i^0 for i ∈ I.

IRLS5. Iterate steps 2–4 until convergence.

In practice, various schemes for recalculating the scales σ_i are used [4]; sometimes they are kept constant, defined with respect to a good initial estimate of β [11]. We propose to recalculate the scales at IRLS step 2, but with fixed values of the hat coefficients h_i used in (14) (see Appendix A for their definition). The damped Gauss–Newton algorithm is used to solve the optimization problem at step 4; it was implemented with the step size chosen according to the Armijo–Goldstein step length principle [13].

2.3. Self-starting models for 4PL non-linear robust IRLS

Initial estimates of the parameters are important in any non-linear regression problem, and automation of this ad hoc process depends very much on the model. A useful collection of self-starting non-linear regression models used in S-Plus is described in [8]. Special methods are needed for robust regression models because, as mentioned in Section 1, the non-linear least squares solution can provide a poor start.

For HTS automation in the case when the response is the so-called "Percentage of Control" (POC), the Top and Bot values are expected to be close to Top^0 = 100 and Bot^0 = 0, and Sl is expected to vary from 0.5 to 2.5. In the derivation of some operational models the parameter Sl is the so-called "Hill coefficient", assumed to be near 1, so we typically either set Sl^0 = 1 or even impose the constraint Sl = 1 throughout the process. The following steps for determining the starting values β^st are proposed when there are no constraints on the parameters (i.e., the constraint set of indices I is empty):

S1. Use the midpoint of the range of the predictor on the log scale as the initial value log IC50^0.

S2. Find β^init by IRLS1–IRLS5 estimation, where the initial estimate at step IRLS1 is β^0 = (Top^0, Bot^0, log IC50^0, 1)^T and the constraint set of indices at step IRLS4 is I = {1, 2}; that is, the values of the parameters Top and Bot are fixed.

The resulting β^init is the initial estimate for the IRLS algorithm, i.e., we set β^st = β^init. A possible modification of this model, useful for non-POC responses, is to define Top^0 and Bot^0 in S1 and S2 as the maximum and minimum, respectively, of the components of the response vector y = (y_1, ..., y_n)^T:

   Top^0 = max_i {y_i},   Bot^0 = min_i {y_i}.
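Before turning to the implementation, the pieces above can be assembled into a compact Matlab sketch of the whole procedure. This is a hedged illustration, not the published code: fminsearch stands in for the damped Gauss–Newton solver with Armijo–Goldstein step control, the leverage factors √(1 − h_i) of (7) are omitted, and fourPL is the sketch from Section 2.1.

   function beta = irls4pl(x, y, beta0, mask, maxIter)
   % Minimal IRLS sketch (steps IRLS1-IRLS5) for column vectors x and y.
   % mask(j) = 1 holds beta(j) at beta0(j), as ConstraintMask does in Section 3.1.
   c = 0.46581 / 0.6745;                         % c = c2/c1, Section 2.2.1
   free = find(~mask);
   beta = beta0;                                 % IRLS1: initial estimate
   for k = 1:maxIter                             % IRLS5: iterate steps 2-4
       r = y - fourPL(x, beta);                  % IRLS2: residuals
       rs = sort(abs(r));
       sigma = max(c * median(rs(4:end)), eps);  % NMAD-based scale (n > 3 assumed)
       w = (1 - min((r / sigma).^2, 1)).^2;      % IRLS3: Tukey biweight weights W
       obj = @(p) sum(w .* ((y - fourPL(x, setFree(beta, free, p))) / sigma).^2);
       beta(free) = fminsearch(obj, beta(free)); % IRLS4: weighted NLS step
   end
   end

   function b = setFree(b, free, p)
   b(free) = p;        % place the free parameters into the full parameter vector
   end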
3. Program description

3.1. Matlab implementation

The automated curve fitting described in the previous section was implemented in Matlab. The main function for M-estimation, which contains steps IRLS1–IRLS4 and the self-starting model steps S1 and S2, is

   function [FitRes, Param, bw] = RobDR(xdata, ydata, ParamInit, ConstraintMask, fmodel)

Input arguments:
   xdata - vector of x-values.
   ydata - vector of y-values.
   ParamInit = [Top Slope log10(IC50) Bot] - initial parameters of the fit; for HTS processing, ParamInit = [100 1 median(xdata) 0] is often used.
   ConstraintMask - vector defined if constrained optimization is to be done; ConstraintMask = [Mask(Top) Mask(Slope) Mask(log10(IC50)) Mask(Bot)], where Mask(P) is 0 if parameter P is not fixed and 1 otherwise.
   fmodel - reserved for the use of different models in future releases; equal to 1 for this release.
Return parameters:
   FitRes - y-values of the fit at the data points of the vector xdata.
   Param = [Top Slope log10(IC50) Bot] - the result of the fit; the parameter values which give the best curve fit.
   bw - the values of the weights of the residuals at the data points of the vector xdata.

Function RobDR first calls the function

   function [FitRes, Param, iter, witer] = NonRegW(xdata, ydata, ParamInit, Weight, Constraint, Constraint1, fmodel)

Function NonRegW implements steps S1 and S2 and calls the function
   [FitRes, ParamOutC, iter] = NonReg(xdata, ydata, ParamInit, Weight, verbosity, jcalc, constraint, fmodel)

The main function RobDR calls NonReg as well; both functions call NonReg whenever step IRLS4 is needed. Function NonReg implements the damped Gauss–Newton algorithm with the step size chosen according to the Armijo–Goldstein step length principle [13]. The described modularity of the program is useful because the function NonReg could implement a different least squares algorithm, and the function NonRegW could implement a different self-starting model.
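For illustration, the self-starting steps S1 and S2 that NonRegW implements might look as follows in terms of the irls4pl sketch above; this is our hedged sketch for POC responses, not the published code:

   function beta_st = selfStart4pl(x, y)
   % S1: midpoint of the predictor range on the log scale as the initial log IC50.
   logIC50_0 = (min(log10(x)) + max(log10(x))) / 2;
   beta0 = [100 1 logIC50_0 0];     % POC starting values, [Top Slope logIC50 Bot]
   % S2: a short IRLS run with Top and Bot held fixed (constraint set I = {Top, Bot}).
   beta_st = irls4pl(x, y, beta0, [1 0 0 1], 10);
   end

The returned beta_st would then seed the unconstrained run, e.g. irls4pl(x, y, beta_st, [0 0 0 0], 20).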
3.2. Integration with Excel

Matlab Excel Builder was used to integrate the function RobDR with Microsoft Excel; different stand-alone Excel applications can use the COM object created with Matlab Excel Builder. We designed a spreadsheet with predefined input rows corresponding to the input vectors xdata, ydata, ParamInit, and ConstraintMask described in Section 3.1, and a plot of the y-data versus the x-data is made on the spreadsheet. The Excel macro was written in Visual Basic for Applications (VBA); it calls the COM object with the method corresponding to the Matlab function RobDR. After execution, the output vectors FitRes, Param, and bw described in Section 3.1 are displayed in predefined rows, and the fitted curve is displayed on the plot together with the outliers, which are identified as observations with weights smaller than 0.7.
4. Samples of typical program runs

The implemented software has shown good performance on different batches of HTS data: it takes around 20 s to process one batch of 80 datasets in Excel. The number of iterations for IRLS4 is typically 4–6, and the number of IRLS5 iterations is typically 5–11.

We present samples of processing actual HTS datasets with the robust regression software in Figs. 4–7. These datasets were analyzed by humans, and all were judged to have outliers, indicated by diamond points in the figures.

[Fig. 4 – Sample 1.]

[Fig. 5 – Sample 2.]

[Fig. 6 – Sample 3.]

[Fig. 7 – Sample 4.]

For sample 1, the orange line is the result of the fit when the scientist identified the outliers; the program identified these points as well (red crosses), and the result of the automated fit is displayed as the blue line. For samples 2 and 3, the fits with human identification of outliers and the program fits are indistinguishable in Figs. 5 and 6. Fig. 7 displays three fits for sample 4: the first obtained by least squares (black curve), the second obtained after the scientist identified the outliers, deleted them, and applied least squares (yellow curve), and the third produced by the program (blue curve). These examples show that the automated results are in agreement with the heuristics
used by scientists for outlier identification, while providing consistent, unbiased fitting.
5. Hardware and software specifications
Any computer which has either Matlab or Excel can be used to run the program. If one wishes to use the COM object created by Matlab Excel Builder for other Excel applications, the specification of the Matlab Excel Builder Calling Conventions [9] is needed.
6. Mode of availability

A copy of the programs can be obtained by contacting the authors (fax: +1 805 375 8519, e-mail: [email protected]). Please provide a full postal address and, if possible, a contact fax number and e-mail address.
7. Future work
These programs can be enhanced by the calculation of robust confidence intervals. For datasets of more than 11 points, the Tukey and Mosteller recommendation from p. 208 of [14] can easily be implemented; exploration of bootstrapping techniques for small datasets is a future goal.
Appendix A

For the linear regression problem, studentized residuals are defined via the diagonal elements of the hat matrix H [15]. The generalization of studentized residuals to the case of non-linear regression is the following. Consider the Jacobian J(x, β) of F(x, β), with elements

   J_im(x_i, β) = ∂F(x_i, β)/∂β_m.

Define Q(x, β) as the Q-matrix obtained in the QR-factorization [13] of J(x, β). Then

   h_i(x, β) = min(0.9999, (Q(β) Q^T(β))_ii).

The purpose of taking the minimum with the constant 0.9999 in this formula is to achieve numerical stability. Our procedure of fixing h_i during the IRLS iteration assumes that, with a good initial start, the leverages of the different data points are well identified [4]. Computer experiments suggest that fixing h_i makes little difference and eliminates the need for expensive calculations at step IRLS3.
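In Matlab, the leverages described above can be obtained directly from an economy-size QR factorization; a minimal sketch, assuming J holds the n-by-m Jacobian of F at the current parameters:

   [Q, R] = qr(J, 0);                % economy-size QR factorization of the Jacobian
   h = min(0.9999, sum(Q.^2, 2));    % h_i = (Q Q^T)_ii, capped for numerical stability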
References
[1] T. Kenakin, A Pharmacology Primer, Elsevier Academic Press, Amsterdam, 2004.
[2] D.C. Hoaglin, F. Mosteller, J.W. Tukey, Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, Inc., New York, 2000.
[3] F.J. Hawker, G.S. Challand, Effect of outlying standard points on curve fitting in radioimmunoassay, Clin. Chem. 27 (1) (1981) 14–17.
[4] P. Huber, Robust Statistics, John Wiley & Sons, Inc., New York, 1981.
[5] W. DuMouchel, F. O'Brien, Integrating a robust option into a multiple regression computing environment, in: K. Berk, L. Malone (Eds.), Computing Science and Statistics: Proceedings of the 21st Symposium on the Interface, American Statistical Association, Alexandria, VA, 1989, pp. 297–301 (reprinted in: A. Buja, P. Tukey (Eds.), Computing and Graphics in Statistics, Springer-Verlag, 1990, pp. 41–48).
[6] D. Rocke, Constructive statistics: estimators, algorithms, and asymptotics, Comput. Sci. Stat. 30 (1998) 3–14.
[7] D. Normolle, An algorithm for robust non-linear analysis of radioimmunoassays and other bioassays, Stat. Med. 12 (1993) 2025–2042.
[8] J.C. Pinheiro, D.M. Bates, Mixed-Effects Models in S and S-Plus, Springer, New York, 2000.
[9] Matlab Software, Matlab Help for Excel Builder Toolbox, The MathWorks, Inc., 2005.
[10] H. Motulsky, A. Christopoulos, Fitting Models to Biological Data Using Linear and Nonlinear Regression: A Practical Guide to Curve Fitting, GraphPad Software Inc., San Diego, CA, 2003, http://www.graphpad.com.
[11] P.W. Holland, R.E. Welsch, Robust regression using iteratively reweighted least-squares, Commun. Stat. Theor. Methods 6 (1977) 813–827.
[12] J.O. Street, R.J. Carroll, D. Ruppert, A note on computing robust regression estimates via iteratively reweighted least squares, Am. Statistician 42 (2) (1988) 152.
[13] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
[14] F. Mosteller, J.W. Tukey, Data Analysis and Regression, Addison-Wesley, Reading, MA, 1977.
[15] N.R. Draper, H. Smith, Applied Regression Analysis, John Wiley & Sons, Inc., New York, 1998.