ANALYTICAL BIOCHEMISTRY Analytical Biochemistry 319 (2003) 258–262 www.elsevier.com/locate/yabio
Robust regression-based analysis of drug–nucleic acid binding
David E. Booth a,* and Kidong Lee b
a Graduate School of Management, Kent State University, Kent, OH 44242, USA
b College of Business, University of Incheon, Incheon 402-749, South Korea
Received 6 February 2003
Abstract
Outlier detection can be very important in analyzing data from Scatchard plots. In this study, a robust (outlier-resistant) regression procedure was used in conjunction with a Scatchard plot to study the binding of the methylphenazinium cation with double-stranded DNA. The procedures, their results, and their advantages are discussed. © 2003 Elsevier Science (USA). All rights reserved.
The binding of biologically active compounds to nucleic acids is an important biological event. Many of the compounds that bind nucleic acids are antitumor drugs, antibiotics, mutagens, and carcinogens, in addition to other compounds of biological interest [1]. It is also known that the binding of some drugs to nucleic acids can increase the drug's pharmaceutical effectiveness [2]. Clearly, the study of the interactions of such compounds with nucleic acids is of considerable importance.

Many experimental techniques can be used to study the interaction of these compounds with nucleic acids and with biopolymers in general [1,3,4]. One data analysis technique that can be used with many different types of experimental study is the Scatchard plot [4–8]. The purpose of this plot is to determine the association constant for complex formation, $K$, and the maximum number of binding sites per mononucleotide, $n$. The constant $K$ gives the strength of binding of the ligand to the nucleic acid. As Raguin et al. [4] and others point out, care must be taken in the use of these plots. For drug–nucleic acid (DNA) binding, the Scatchard plot is often biphasic because the ligand has two main modes of binding: intercalation (i.e., fitting in between two adjacent base pairs) and binding to the DNA sugar–phosphate backbone. Thus, the Scatchard plot often consists of two intersecting straight lines, both with negative slopes and, of course, experimental error. Sometimes, as has been pointed out [4,9], a much more complicated Scatchard plot can result. In this work, we consider the biphasic straight-line Scatchard plot.

The standard method of determining the two binding constants ($K_I$ for intercalation and $K_{II}$ for sugar–phosphate binding) is by calculation of the slopes of the two lines of the Scatchard plot, generally by ordinary least squares (OLS)¹ regression. There are two potential problems with this approach. First, we are observing both the tight binding process (intercalation) and the weak binding process (backbone binding) at the same time. The point at which the weak binding becomes more important than the strong binding (or vice versa) is called a change point. It is thus important to determine, at least approximately, where this change point is located. The use of change point models in binding studies has previously been suggested as one way to approach this problem [1]. The second potential problem is that outliers in the data set can adversely affect the OLS regression by pulling the fitted regression equation toward them and away from the majority of the data [10,11], thus causing the calculation of spurious binding constants [10–12].

The purpose of the present paper is to show that the statistical technique of M-estimator robust regression provides a reasonable solution to both of these potential problems. In the following sections, the standard Scatchard approach and the proposed robust regression approach are described. The robust regression approach is then demonstrated by considering the binding of phenazine methosulfate with double-stranded DNA. Finally, the advantages of the robust regression approach are discussed.

* Corresponding author. Fax: 330-672-2448. E-mail address: [email protected] (D.E. Booth).
¹ Abbreviation used: OLS, ordinary least squares.
0003-2697/03/$ - see front matter © 2003 Elsevier Science (USA). All rights reserved. doi:10.1016/S0003-2697(03)00290-2
D.E. Booth, K. Lee / Analytical Biochemistry 319 (2003) 258–262
Standard Scatchard plot

We follow the descriptions given by Ishizu et al. [5] for the use of the plot with optical spectrophotometry. The Scatchard plot itself is a plot of $B/F$ versus $B$, where $B$ is the average number of bound ligands per nucleotide and $F$ is the molar concentration of free ligand. If all binding sites are equivalent and noninteracting, such plots are approximately straight lines, according to the expression $B/F = K(n - B)$. If optical spectrophotometry is used to study the complex formation, then the fraction of total ligand bound, $\alpha$, is given by $\alpha = (A_f - A)/(A_f - A_b)$, where $A_f$ and $A_b$ are the absorbance of ligand when it is present entirely in the free or bound form, respectively, and $A$ is the observed absorbance of the mixture. $B$ and $F$ can then be determined from the relations $B = (T - F)/P$ and $F = T(1 - \alpha)$, where $T$ is the total molar concentration of ligand and $P$ is the molar concentration of the DNA nucleotide. Once $B/F$ and $B$ are determined from the above relations, a Scatchard plot is made, and then $K_I$ and $K_{II}$ are computed from the slopes of the two lines with the aid of standard least squares regression.

M-estimator robust regression

M-estimator robust regression can be considered a modification of both OLS regression and maximum likelihood estimation [10,11] that diminishes the effects of outlying observations on the regression estimates. Further, as the effects of the outlying observations are being diminished, the outlying observations themselves are identified. As we will see, both of these results are of value when using a robust regression to analyze Scatchard plots. We now consider the combined regression approach for the linear statistical model (where by linear we mean linear in the regression coefficients),

\[ Y_i = \sum_{j=0}^{p} (\beta_j X_{ij}) + \varepsilon_i, \tag{1} \]

where the $\beta_j$ are the unknown regression coefficients and $\varepsilon$ is a random variable with the usual OLS assumptions [13,14]. Because we want Eq. (1) to contain a constant term, $X_{i0}$ will equal 1. We denote the sample estimate of the regression coefficient $\beta_j$ by $\hat{\beta}_j$. We now consider how one might compute $\hat{\beta}_j$. Let $\rho(X)$ be a function of $X$. To estimate the regression coefficients, $\beta_j$, we calculate the $\hat{\beta}_j$ that minimize

\[ \mathrm{ssq} = \sum_{i=1}^{n} \rho\left( Y_i - \sum_{j=0}^{p} (\hat{\beta}_j X_{ij}) \right). \tag{2} \]

If

\[ \rho(X) = X^2, \tag{3} \]

then this calculation results in the OLS $\hat{\beta}$s. Here we are concerned with possible outliers in the data. To get more outlier-resistant estimates, we vary the function $\rho(X)$, as we discuss in the following paragraphs. This robust approach is outlined here. Full details can be found elsewhere [10,11,15]. To make the estimates scale invariant, we introduce a robust "standard deviation," $d$. The usual standard deviation is very sensitive to outliers and thus it will not do. Introducing $d$ means that we now want to find the $\hat{\beta}_j$s that minimize

\[ \mathrm{ssq} = \sum_{i=1}^{n} \rho\left( \frac{Y_i - \sum_{j=0}^{p} (\hat{\beta}_j X_{ij})}{d} \right). \tag{4} \]

Let $D_i$ be the numerator in Eq. (4) and let $d$ be as shown in [10,11]:

\[ d = \frac{\operatorname{median}\,|D_i| \ (\text{nonzero values})}{0.6745}. \tag{5} \]

Doing the minimization by differentiating Eq. (4) with respect to $\hat{\beta}_j$, setting the result equal to zero, and letting

\[ \psi(t) = \frac{d\rho(t)}{dt} \tag{6} \]

gives the following system of equations:

\[ \sum_i \psi\left(\frac{D_i}{d}\right) X_{ij} = 0, \quad j = 1, 2, 3, \ldots, p. \tag{7} \]
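To illustrate why the usual standard deviation "will not do," the following minimal sketch compares it with the robust scale of Eq. (5) on a small set of made-up residuals containing one gross outlier. The function name `robust_scale` and the residual values are illustrative assumptions; they are not taken from the program in [10].

```python
import statistics


def robust_scale(residuals):
    """Robust 'standard deviation' d of Eq. (5):
    median of the nonzero |D_i| divided by 0.6745."""
    nonzero = [abs(r) for r in residuals if r != 0]
    if not nonzero:
        raise ValueError("all residuals are zero; d is undefined")
    return statistics.median(nonzero) / 0.6745


# Hypothetical residuals D_i: five small values and one gross outlier.
residuals = [0.10, -0.20, 0.15, -0.10, 0.05, 10.0]

d = robust_scale(residuals)        # determined by the bulk of the data
s = statistics.stdev(residuals)    # inflated by the single outlier
print(f"robust d = {d:.3f}, sample standard deviation = {s:.3f}")
```

In this example the sample standard deviation is dominated by the single large residual, whereas $d$ stays on the scale of the other five values.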
There are many ways to solve this system. If we choose iterative weighting as the calculation scheme and perform several steps of algebra [10], we get

\[ \sum_i \frac{\psi(D_i/d)}{(D_i/d)} D_i X_{ij} = 0. \tag{8} \]

Letting

\[ W_i = \frac{\psi(D_i/d)}{(D_i/d)}, \tag{9} \]

which we may consider to be a set of regression weights, yields a set of weighted regression normal equations:

\[ \sum_i W_i D_i X_{ij} = 0, \quad j = 1, 2, \ldots, p. \tag{10} \]
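The algebra linking Eq. (7) to Eqs. (8)–(10) is not spelled out in the text. One way to see it, offered here as a sketch and not necessarily the derivation given in [10], is to multiply and divide each term by $D_i/d$ and then clear the common factor $1/d$. The last line is an added consistency check: for the OLS choice $\rho(t) = t^2$ every weight is the same constant, so Eq. (10) reduces to the ordinary normal equations.

```latex
\begin{align*}
\sum_i \psi\!\left(\frac{D_i}{d}\right) X_{ij}
  &= \sum_i \frac{\psi(D_i/d)}{(D_i/d)} \cdot \frac{D_i}{d}\, X_{ij}
   = \frac{1}{d} \sum_i \frac{\psi(D_i/d)}{(D_i/d)}\, D_i X_{ij} = 0
  && (D_i \neq 0,\ d > 0) \\
\Longrightarrow\quad
\sum_i W_i D_i X_{ij} &= 0,
  \qquad W_i = \frac{\psi(D_i/d)}{(D_i/d)} \\
\rho(t) = t^2 \ \Rightarrow\ \psi(t) = 2t
  &\ \Rightarrow\ W_i = 2 \ \text{for all } i .
\end{align*}
```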
If we let $W$ be the diagonal matrix with $W_i$ as the $i$th diagonal element, we may write Eq. (10) as the usual weighted regression matrix equation [10,11,13]:

\[ \hat{\beta} = (X'WX)^{-1} X'WY. \tag{11} \]

See [10,13] for full details. We must now choose the function

\[ \psi(t) = \frac{d\rho(t)}{dt} \tag{12} \]

to replace the OLS function

\[ \rho(t) = t^2. \tag{13} \]

The most commonly used choice is (letting $\varphi(x) = \psi(x)$)

\[ \varphi(x) = \begin{cases} -k, & x < -k \\ x, & -k \le x \le k \\ k, & x > k \end{cases} \tag{14} \]

called the Huber function, or

\[ \varphi(x) = \begin{cases} \sin(x/a), & |x| \le a\pi \\ 0, & |x| \ge a\pi \end{cases} \tag{15} \]

called the Andrews function, where $k$ and $a$ are constants chosen as described [10,11]. Because this is an iterative calculation, we must have starting values for the estimated regression coefficients. It has been shown that starting with the OLS estimates, followed by the estimates generated by the Huber function and then by those from the Andrews function, is a suitable procedure [10]. If the starting values are not more robust than the OLS estimates, the Andrews function-based calculation may not converge correctly. Using the OLS estimates followed by a Huber function calculation and then by an Andrews function calculation avoids this problem [10,17]. A computer program to perform this calculation, originally programmed by R. Lenth, is also given in [10,17]. This program was used in this research with a value of cc = 1 in line 150 of the program, which sets the c and a of the program for Eqs. (14) and (15). This means that $0 \le W_i \le 1$ for all $i$ after the final Andrews weighting calculation. The effect of the weights is to decrease the effect of outlying observations on the estimated regression coefficients. Further, the smaller the weight, the more of an outlier the observation is, thus giving a method of outlier identification.

Simulation studies, such as the famous Princeton robustness study [18], have been carried out to determine the properties of the robust estimators described herein vis-à-vis OLS. These studies and those on other chemical applications [10,16,17] have shown that these robust estimates are much to be preferred over OLS in the presence of outliers and that our choices of values in the various functions are reasonable. The result of the calculation is, first, a set of weights that identify any outliers in the data and, second, regression coefficients that have not been unduly influenced by outliers, as the OLS coefficients alone would have been. Recall that the lower the weight value, the more of an outlier the observation is with respect to the majority of the data. We will now use the program from [10] to do an analysis of Scatchard plot data.

Data

The data for this work were determined in a binding study of the methylphenazinium cation with double-stranded DNA conducted by Ishizu et al. [5]. The data used in this research were calculated directly from the UV absorption spectra under the reported conditions as given by Ishizu et al. [5]. The results are summarized in Table 1, where $r = [\mathrm{DNA}]/[\mathrm{MP}^+]$.

Results

Because the Scatchard plot reported by Ishizu et al. [5] was biphasic linear (see Fig. 1), we ran the robust regression using $B/F$ as the dependent variable ($Y$) and $B$ as the independent variable ($X$) in the linear statistical model

\[ Y = \beta_0 + \beta_1 X + \varepsilon. \tag{16} \]
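The iterative weighting scheme of Eqs. (5) and (9)–(11), started from OLS and run first with Huber weights and then with Andrews weights, can be sketched for the straight-line model of Eq. (16) as follows. This is a minimal illustration and not the Lenth program of [10,17]. The function names, the made-up data, and the tuning constants k = 1.345 and a = 1.339 (common textbook defaults) are all assumptions; the paper's own constants are set through cc = 1 in that program.

```python
import numpy as np


def robust_scale(resid):
    """Eq. (5): median of the nonzero |D_i| divided by 0.6745."""
    nonzero = np.abs(resid[resid != 0])
    return np.median(nonzero) / 0.6745 if nonzero.size else 1.0


def huber_weight(t, k=1.345):
    """W = psi(t)/t for the Huber function of Eq. (14); k is an assumed default."""
    return k / np.maximum(np.abs(t), k)


def andrews_weight(t, a=1.339):
    """Weight for the Andrews function of Eq. (15); a is an assumed default.

    Uses sin(t/a)/(t/a), which differs from psi(t)/t = sin(t/a)/t only by
    the constant factor a. A constant factor does not change the weighted
    least squares solution, and this form keeps 0 <= W <= 1.
    """
    w = np.sinc(t / (a * np.pi))  # np.sinc(z) = sin(pi*z)/(pi*z)
    return np.where(np.abs(t) <= a * np.pi, w, 0.0)


def weighted_fit(X, y, w):
    """Eq. (11): solve (X'WX) beta = X'Wy for diagonal W."""
    XtW = X.T * w
    return np.linalg.lstsq(XtW @ X, XtW @ y, rcond=None)[0]


def irls(X, y, weight_fn, beta, max_iter=100, tol=1e-10):
    """Repeat residuals -> scale -> weights -> weighted fit until stable."""
    for _ in range(max_iter):
        resid = y - X @ beta
        w = weight_fn(resid / robust_scale(resid))
        new_beta = weighted_fit(X, y, w)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    resid = y - X @ beta
    return beta, weight_fn(resid / robust_scale(resid))


def robust_line(x, y):
    """OLS start, then Huber, then Andrews, for Y = beta0 + beta1*X."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    beta_ols = weighted_fit(X, y, np.ones_like(y))
    beta_huber, _ = irls(X, y, huber_weight, beta_ols)
    beta, weights = irls(X, y, andrews_weight, beta_huber)
    return beta_ols, beta, weights


# Made-up data: a line with slope -3, small noise, and one gross outlier.
x = np.arange(10.0)
noise = 0.05 * np.array([1, -2, 1.5, -1, 0.5, 2, -1.5, -0.5, 1, 0])
y = 2.0 - 3.0 * x + noise
y[9] += 30.0

beta_ols, beta_robust, weights = robust_line(x, y)
print("OLS slope:   ", beta_ols[1])
print("robust slope:", beta_robust[1])
print("weights:     ", np.round(weights, 3))
```

In this synthetic example the outlying point receives a final weight of zero, so the robust slope is determined by the remaining nine points, while the OLS slope is pulled toward the outlier. The Huber stage matters because the Andrews weights vanish beyond $a\pi$: a poor starting fit could assign zero weight to good observations.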
The robust regression results are given in Table 2, where the weights are those from the final iteration with the Andrews function. The weights reported in Table 2 indicate that observations 1, 9, 10, and 11 are outliers. The fact that the Scatchard plot is biphasic leads us to believe that the observations with high weights are the data points that best represent the intercalation (tight) mode of binding, while those with low weights represent the sugar–phosphate (less tight) binding. Thus the slope estimate computed using the weights of Table 2 should be an estimate of the intercalation binding constant, $K_I = 0.31 \times 10^{6}$ M.
Table 1
Computed binding data

         r     Absorbance   α        B        B/F (10^5 M)
Bound    30    0.423        1        0.0333   1
         15    0.442        0.960    0.064    0.444
         10    0.485        0.869    0.0869   0.184
         8     0.523        0.789    0.0986   0.130
         7     0.549        0.734    0.105    0.109
         6     0.583        0.662    0.110    0.0902
         5     0.613        0.599    0.120    0.0833
         4     0.673        0.473    0.118    0.0621
         3     0.722        0.369    0.123    0.0542
         2     0.772        0.264    0.132    0.0498
         1.5   0.812        0.179    0.119    0.0402
         0.5   0.855        0.0886   0.178    0.0543
Free     0     0.897        0.000    0.000    0.000
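As a worked illustration of the relations in the "Standard Scatchard plot" section, the sketch below recomputes α and B for three rows of Table 1 from the tabulated absorbances. It assumes that the "Bound" and "Free" rows supply $A_b = 0.423$ and $A_f = 0.897$, and it uses $B = \alpha/r$, which follows from $B = (T - F)/P$ and $F = T(1 - \alpha)$ if $r = P/T$. $B/F$ is not recomputed because the total ligand concentration $T$ is not stated in the text. The function names are illustrative.

```python
A_BOUND = 0.423  # absorbance with ligand fully bound ("Bound" row of Table 1)
A_FREE = 0.897   # absorbance with ligand fully free ("Free" row of Table 1)


def fraction_bound(absorbance, a_free=A_FREE, a_bound=A_BOUND):
    """alpha = (A_f - A) / (A_f - A_b)."""
    return (a_free - absorbance) / (a_free - a_bound)


def bound_per_nucleotide(alpha, r):
    """B = alpha / r, from B = (T - F)/P and F = T(1 - alpha), assuming r = P/T."""
    return alpha / r


# (r, absorbance) pairs copied from Table 1
rows = [(15, 0.442), (10, 0.485), (8, 0.523)]
for r, absorbance in rows:
    alpha = fraction_bound(absorbance)
    B = bound_per_nucleotide(alpha, r)
    print(f"r = {r:>2}: alpha = {alpha:.3f}, B = {B:.4f}")
```

The printed values can be compared directly with the α and B columns of Table 1 for the same rows.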
Fig. 1. Scatchard plot of Table 1 data.

Table 2
Robust regression results with Table 1 data

Observation   Weight
1             0.000
2             0.862
3             0.993
4             0.988
5             0.963
6             0.808
7             0.925
8             0.995
9             0.716
10            0.499
11            0.000

Observations 9, 10, and 11 were thus outliers because they were from the sugar–phosphate binding. Observations 9, 10, and 11 were then run in a separate robust regression, giving the weight values shown in Table 3. These results indicate clearly that observations 9, 10, and 11 are no longer outliers and therefore represent the second phase of the biphasic Scatchard plot, the binding to the sugar–phosphate backbone, with binding constant $K_{II} = 0.202 \times 10^{4}$ M. The computer programs and output are available from the authors. A final robust regression was performed with observations 1–8 only. The results are given in Table 4. The Table 4 results give a $K_I$ value of $0.494 \times 10^{6}$ M, our final value for the intercalation binding constant.

Table 3
Robust regression results with observations 9–11

Observation   Weight
9             0.927
10            0.958
11            0.997

Table 4
Robust regression with observations 1–8

Observation   Weight
1             0.000
2             0.846
3             0.914
4             0.976
5             0.964
6             0.000
7             0.984
8             0.760

Discussion

The results of this study show clearly the advantages of using robust regression in the analysis of binding data by a Scatchard plot. First, the identification of outlying observations by the robust regression allows the separation of the intercalation data points from the sugar–phosphate binding data points, thus yielding more accurate estimates of $K_I$ and $K_{II}$. Second, downweighting the outlying observations in the two separate robust regression runs also yields more accurate estimates of $K_I$ and $K_{II}$. Several other things must still be kept in mind in analyzing Scatchard plot data. As Raguin et al. [4], Booth [9], and others have pointed out, nonlinear effects may be important in some cases. Further, as Raguin et al. [4] and others point out, it is very easy to make incorrect interpretations of Scatchard plots. We believe that robust regression will help in making correct interpretations. Should nonlinear statistical models be required, there are robust procedures available for use [12,15]. Further work, in the spirit of Raguin et al. [4], is in progress.
References

[1] D.E. Booth, K. Lee, Change point methods for the analysis of the binding of small molecules to biopolymers, Chemist 78 (4) (2001) 11–17.
[2] J.M. Jamison, K. Krabill, K.A. Allen, S.H. Stuart, C.C. Tsai, RNA-intercalating agent interactions; in vitro antiviral activity studies, Antiviral Chem. Chemother. 1 (6) (1990) 333–347.
[3] E.A. Mikhailova, S.J.H. Ashcroft, M.V. Mikhailov, A novel simple method for the investigation of drug binding to the KATP channel sulfonylurea receptor, Anal. Biochem. 307 (2002) 383–385.
[4] O. Raguin, A. Gruaz-Guyon, J. Barbet, Equilibrium expert: an add-in to Microsoft Excel for multiple binding equilibrium simulations and parameter estimations, Anal. Biochem. 310 (2002) 1–14.
[5] K. Ishizu, H.H. Dearman, M.T. Huang, J.R. White, Interaction of the 5-methylphenazinium cation radical with deoxyribonucleic acid, Biochemistry 8 (1969) 1238–1246.
[6] T.L. Lai, L. Zhang, Statistical analysis of ligand binding experiments, Biometrics 50 (1994) 782–797.
[7] N.C. Price, R.A. Dwek, Principles and Problems in Physical Chemistry for Biochemists, Clarendon Press, Oxford, 1979.
[8] G. Scatchard, The attractions of proteins for small molecules and ions, Ann. NY Acad. Sci. 51 (1949) 660–672.
[9] D.E. Booth, Determining the binding constants of biologically active compound and nucleic acids: a nonlinear regression approach, Ind. Math. 45 (2) (1995) 123–132.
[10] D.E. Booth, Regression modules and problem banks, a unit to demonstrate an application of data analysis in UMAP Modules: Tools for Teaching 1985, COMAP, Inc., Arlington, MA, 1986.
[11] R.V. Hogg, An introduction to robust estimation, in: R.L. Launer, G.N. Wilkinson (Eds.), Robustness in Statistics, Academic Press, New York, 1979.
[12] R. Dutter, P.J. Huber, Numerical methods for the nonlinear robust regression problem, J. Stat. Comput. Simulation 13 (1981) 79–114.
[13] J. Neter, M.H. Kutner, C.J. Nachtsheim, W. Wasserman, Applied Linear Statistical Models, fourth ed., Irwin, Chicago, 1996.
[14] L. Ott, An Introduction to Statistical Methods and Data Analysis, fourth ed., Wadsworth, Belmont, CA, 1993.
[15] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987.
[16] G.R. Phillips, E.M. Eyring, Comparison of conventional and robust regression in analysis of chemical data, Anal. Chem. 55 (1983) 1134.
[17] R.V. Lenth, A computational procedure for robust multiple regression, Technical Report No. 53, Dept. of Statistics, The University of Iowa, Iowa City, IA 52242, 1976.
[18] D.F. Andrews, P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, J.W. Tukey, Robust Estimates of Location: Survey and Advances, Princeton University Press, Princeton, NJ, 1972.