Computational Biology and Chemistry 32 (2008) 307–310
Contents lists available at ScienceDirect
Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem
Algorithm note
HELFIT: Helix fitting by a total least squares method Purevjav Enkhbayar a,b,c , Sodov Damdinsuren a , Mitsuru Osaki b , Norio Matsushima c,∗ a
Department of Biophysics, Faculty of Biology, National University of Mongolia, Ulaanbaatar 210646/377, Mongolia Division of Agricultural Science, Hokkaido University Graduate School of Agriculture, Sapporo, Hokkaido 060-0810, Japan c Division of Biophysics, Sapporo Medical University School of Health Sciences, Sapporo, Hokkaido 060-8556, Japan b
a r t i c l e
i n f o
Article history: Received 6 October 2007 Received in revised form 20 March 2008 Accepted 24 March 2008 Keywords: HELFIT Helix fitting Total least squares Helix parameters
a b s t r a c t The problem of fitting a helix to data arises in analysis of protein structure, in nuclear physics, and in engineering. A continuous helix is described by five parameters: helix axis, helix radius, and helix pitch. One of these helix parameters is frequently predefined in the helix fitting. Other algorithms find only the helix axis or determine separately the helix axis, the helix radius, or the helix pitch. Here we describe a total least squares method, HELFIT, for helix fitting. HELFIT enables one to calculate simultaneously all five of the helix parameters with high accuracy. The minimum number of data points required for the analysis is only four. HELFIT is very insensitive to noise even in short helices. HELFIT also calculates a parameter, p = rmsd/(N − 1)1/2 , which estimates the regularity of helical structures independent of the number of data points, where rmsd is the root mean square distance from the best-fit helix to data points and N is the number of data points. It should become a basic tool of structural bioinformatics. © 2008 Elsevier Ltd. All rights reserved.
1. Introduction The problem of fitting a helix to data arises in analysis of protein structure, in nuclear physics, and in engineering. A helix is described by five parameters: the helix axis, the helix radius, and the helix pitch. Most programs have predefined one of these helix parameters. For example, in nuclear physics there is an extended Riemann circle fit to a helix in 3D space; this is used for fitting observations to a helical track of particles in a magnetic field ¨ (Fruhwirth et al., 2002). In a cylindrical detector the helix radius is predefined and in a plane detector the helix axis is predefined. Alternative algorithms mainly concentrate on only the helix axis; they include parametric least squares (“parlsq”) (Christopher et al., 1996), moment matrix (“eigenfit”) (Kahn, 1989a), cylindrical fit ˚ (“Aqvist”) (Aqvist, 1986), cross product of triad bisectors (“Kahn”) (Kahn, 1989b), and rotational least squares (“rotfit”) (McLachlan, 1979). Recently, Nievergelt (1997) developed a method to fit helices to data. It first minimizes a sum of squares distance from data points to a cylinder for determination of both the helix axis and the helix radius. After that, the helix pitch is determined. Consequently, the helix radius and the helix pitch are determined separately. This means that the helix axis is coincident with the axis of the cylinder. Because it is not assured that the data points lie on a real
∗ Corresponding author. Tel.: +81 11 611 2111; fax: +81 11 612 3617. E-mail address:
[email protected] (N. Matsushima). 1476-9271/$ – see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2008.03.012
helix on the surface of a cylinder, such separate determinations of the helix radius, axis, and pitch are accompanied by an ambiguity. Here we develop a genuinely total least squares algorithm, HELFIT (http://www.sapmed.ac.jp/˜matusima/HELFIT/HELFIT.htm), which simultaneously determines to high accuracy all of the helical parameters—axis, radius, pitch, and handedness. 2. Algorithm We use a multi-step process to calculate the helical parameters by: Step 1 Fitting the axis and the radius of the cylinder. Step 2 Fitting the pitch of the helix. Step 3 Perform a full 3D minimization search over all the parameters. We are fitting a set of consecutive data points, {xi }, i = 1, 2, . . ., N. The data points comprise the sequence (1st, 2nd, 3rd, 4th, . . ., Nth). Other notations (Fig. 1): xi = (xi , yi , zi ) xhi = (xih , yih , zih ) o = (ox , oy , oz ) s = (sx , sy , sz ) a = (ax , ay , az ) v w r P
a vector from the origin to the ith data point a vector from the origin to the ith data point on the helix the perpendicular vector from the origin to the helix axis the position vector at the start of the helix the direction vector of the helix axis a unit vector perpendicular to the axis vector, a a unit vector perpendicular to both a and v the radius of the helix the pitch of the helix
308
P. Enkhbayar et al. / Computational Biology and Chemistry 32 (2008) 307–310
The initial axis of the cylinder in Step 1 was determined by the Kahn method (Kahn, 1989b). In the Kahn method, points which lie on the axis are found by vector algebra. The angle between three consecutive data points is bisected, resulting in a vector, V1 , that points towards and is perpendicular to the helix axis. Repetition of this process with a different set of data points gives a second vector perpendicular to the helix axis, V2 . As V1 and V2 are both perpendicular to the helix axis, the cross-product V1 × V2 gives the direction vector. The Kahn method, therefore, needs only four data points. The initial radius of the cylinder was calculated by 3D circle fitting using the coordinates projected onto the plane perpendicular to the cylindrical axis. The 3D circle fitting was performed using algorithms already available (Shakarji, 1998; Enkhbayar et al., ˚ 2004). Step 1 corresponds to the Aqvist method when the radius of the cylinder is constant. Step 2. This step determines the helix pitch, P and the sign of its pitch (handedness). The procedure uses total least squares to estimate the linear relation between the displacement along the axis and the angle of rotation around the axis. Step 1 and Step 2 calculate the initial sets of P, r, a, and o. Step 3. A point xh on the helix is given by xh = o + aPt + r(v cos t + w sin t)
(6)
where t, angle of rotation around the axis of the helix, is an independent variable. The vector v is directed from the starting point of the helix to the first data point. In addition to a, v and w are also unit vectors. These unit vectors satisfy a · v = 0, v · w = 0, and w · a = 0. The ith data point (xhi ) on the helix is obtained by the minimization of |xi − xhi |. Thus, |xi − xhi |(= d(xi ) = dih ) indicates the closest distances from xi to the helix (Fig. 1). Jh , given by
J h (P, r, a, o, t0 ) = Fig. 1. Geometry of the helix fitting by HELFIT. Notations are as follows: xi = (xi , yi , zi ): a vector from the origin to the ith data point; xhi = (xih , yih , zih ): a vector from the origin to the ith data point on the helix; o = (ox , oy , oz ): the perpendicular vector from the origin to the helix axis; s = (sx , sy , sz ): the position vector at the start of the helix; a = (ax , ay , az ): the direction vector of the helix axis; v: a unit vector perpendicular to the axis vector, a; w: a unit vector perpendicular to both a and v; r: the radius of the helix; P: the pitch of the helix. s is the position vector from the origin to the start of the helix, where the start is defined by the projection of the first data point onto the helix axis.
r i = xi − o − (xi · a)a
(1)
Thus the orthogonal distance (di ) from the single ith point to the cylindrical surface is given by di = |r i | − r
(2)
where r is the radius of the cylinder. Jc is the sum of the squares of di . J c (r, a, o) =
di2 =
(|r i | − r)2
(3)
[d(xi )]
2
(7)
is minimized, subject to the constraints of Eqs. (4) and (5), where t0 is the first data point. This minimization determines simultaneously all parameters—P, r, a, o, and t0 . Seven of these nine parameters are free variables because of the constraints of Eqs. (4) and (5). The root mean square distance from the best-fit helix to data points (rmsdh ) is then given by
h
s is the position vector from the origin to the start of the helix, where the start is defined by the projection of the first data point onto the helix axis. Step 1. The orthogonal distance from the axis to the ith data point is the length of the vector ri , given by
rmsd =
the minimum of
(dih ) N
2
1/2 .
(8)
Using this rmsdh we propose a parameter, p, which estimates the regularity of helix independent of the length of helix, p=
rmsdh (N − 1)1/2
(9)
The minimizations were performed using two functions, fminsearch and fmincon, in the optimization toolbox in the MatLab program package (The Mathworks, 2001). The orthogonal distance from data point was calculated using fminsearch. The minimizations used the two functions, fminsearch and fmincon, do not need seven data points for the helix fitting. The minimizations determine all nine parameters – P, r, a, o, and t0 – with seven free variables, as noted. Five of these nine parameters: helix axis (a), helix radius, and helix pitch, are important for the analysis of the helical structure.
The quantity Jc is minimized, subject to the constraints a·o=0
(4)
|a| = 1
(5)
This minimization determines seven parameters of r, a, and o.
3. Application For testing HELFIT, we used three examples: data used to evaluate the Nievergelt method (1997), an ideal 310 -helix, and 310 -helices with random noise.
P. Enkhbayar et al. / Computational Biology and Chemistry 32 (2008) 307–310 Table 1 Comparison of helical parameters calculated by HELFIT and by the Nievergelt method using the Nievergelt data pointsa
Direction vector of the helix axis (a)b Helix radius (r) Helix pitch (P) Handednessc rmsdc d rmsdh d
HELFIT
Nievergelt
(0.32, 0.73, 0.61) 178 660 −1 − 0.59
(0.33, 0.67, 0.67) 195 594 −1 0.00 0.72
a Data points used are 10; (12, 102, 198), (48, 138, 180), (65, 163, 169), (77, 187, 157), (85, 209, 149), (94, 266, 128), (93, 288, 120), (89, 316, 112), (82, 347, 107) and (62, 397, 103) (Nievergelt, 1997). The data lie exactly on the right-circular cylinder with radius 195 and with axis parallel to (1/3, 2/3, 2/3); however, the data do not lie exactly on any helix. Total rotational angle around the axis of helix is around 90◦ . b |a| = 1. c Handedness of the helix is determined as +1 for a right handed helix. d rmsdh is the root mean squares distances from the best-fit helix to the data points. rmsdc is the root mean squares distances from the best-fit cylinder to the data points.
3.1. Nievergelt’s Data The Nievergelt method (1997) contains only Step 1 and Step 2 but not Step 3. In his method Step 1 is performed using total least squares and single value decomposition. To illustrate his method, Nievergelt (1997) considered the problem of fitting in 3D space a right-handed circular helix to the data points shown in Table 1. The data lie exactly on the right-circular cylinder with radius 195 and with axis parallel to (1/3, 2/3, 2/3); however, the data do not lie exactly on any helix. Thus, the data provide an interesting example to address some differences between the Nievergelt method and HELFIT. We first sorted the Nievergelt’s data in ascending order. After that, HELFIT was performed. The Nievergelt method was also substituted by the procedures of Step 1 and Step 2. The helical parameters calculated by both methods are summarized in Table 1. As expected, the helical parameters, a, P and r calculated by Step 1 and Step 2 are consistent with respective values given by Nievergelt (1997). The best-fit helix yields 0 for the root mean square distance from the best-fit cylinder to the data points (rmsdc ). However, the root mean square distance from the best-fit helix to the data points (rmsdh ) is larger than that calculated by HELFIT involving Step 3. Correspondingly, a, r, and P are different between the Nievergelt method and HELFIT. The data points lie on a short nonplanar curve within a quarter of a turn around the cylinder. Thus it is possible that the helix axis is different from the axis of the cylinder. The parameter, p, estimates the regularity of the helix independent of the number of data points, as noted. The p-value for the best-fit helix by HELFIT is clearly smaller than that for the helix generated by the Nievergelt method. Even for data points that do not form one full turn, HELFIT is applicable. 3.2. Ideal 310 -Helix The second example is an ideal 310 -helix. The ideal right handed ˚ n (the residue number per 310 -helix is characterized by P = 5.80 A, turn) = 3.0, and r = 1.89 A˚ (Perutz, 1951). 310 -Helices with N = 4–9 were generated using Eq. (6). The helix fitting by HELFIT was performed using the generated data points. The results indicate that the relative deviations are less than 0.13% for P, and 0.03% for n and 0.10% for r. HELFIT also gave similar results for ideal ␣-helices (Pauling et al., 1951). Thus, HELFIT gives satisfactory results even for a short helix of only four contiguous atoms; while, the Nievergelt method presumably needs at least 10 data points.
309
3.3. 310 -Helices with Noise The third example is 310 -helices with noise. The 310 -helices were made as follows. Ideal 310 -helices with N = 4–9 were generated using Eq. (6). Random noise drawn from a standard random number generator with various RMS deviations was added to the coordinates of the ideal helices. The RMS of total added random noise (Et ) refers to the standard deviation of the random noise that was added to coordinates (about a mean of 0.0). The Et is given by 2
2
2
(Et ) = (Er ) + (Ez ) + (Eϕ )
2
(10)
where Er , Ez , and Eϕ are added random noise in the radial, axial, and tangential directions, respectively. The RMS’s of total ˚ At each added random noise, Et , were changed from 0.1 A˚ to 1.0 A. of the given Et values, we generated two of the three components, Er and Ez , using the random number generator. The remaining Eϕ was calculated by (Eϕ )2 = (Et )2 − (Er )2 − (Ez )2 . Plots of three helical parameters – P, r and n – are shown in Fig. 2A as a function of random noise, Et , and in Fig. 2B as a function of helix length (the number of data points). All of the helix parameters are near constant in both cases. It can be concluded that HELFIT is very insensitive to noises even in short helices. The pvalues increase with increasing added noise (Fig. 2C). This behavior is reasonable. 4. Discussion The Nievergelt method (Nievergelt, 1997) performs Step 1 and Step 2 for helix fitting; although, the algorithm is different from those used here. The most significant change in HELFIT is the addition of Step 3, which performs a full 3D minimization search over all the parameters for the helix fitting. Thus, HELFIT is a genuinely total least squares method. HELFIT enables one to calculate simultaneously all of the helix parameters and a parameter, p = rmsd/(N − 1)1/2 which estimates the regularity of helical structures independent of the number of data points (helix length). Only four data points are required for the analysis. HELFIT is very insensitive to noise, even in short helices. Five methods – “parlsq”, “eigenfit”, “Aqvist”, “Kahn”, and “rotfit” – are available for finding the helix axis. In order to determine the accuracy of these five methods, Christopher et al. (1996) calculated the helical direction vector and the RMS deviations using the coordinates of ideal ␣-helices with N = 8, 10, 12, 14, 16, and 18, in which random noise are added. They indicated that the accuracy of all methods increases with increasing helix length and the “rotfit” method is relatively insensitive to added noise. However, the error increases with increasing of added noise in all the five methods. The error for N = 8 and 10 helices reaches 10–25◦ and 5–30◦ , ˚ In contrast, the helical respectively, at rmsd of added noise = 1.0 A. parameters – P, r and n – calculated by HELFIT do not vary with added random noise even with short, N = 4–9, helices (Fig. 2). The program, HELANAL, characterizes overall geometry of a long ␣-helix as being linear, curved or kinked (Kumar and Bansal, 1996; Bansal et al., 2000). HELANAL computes local helical twist, virtual torsion angle, local helix origins, and local helix axes for each turn the helix (four contiguous atoms in ␣-helix), using the procedure of Sugeta and Miyazawa (1967). HELANAL needs at least the coordinates of nine atoms (Bansal et al., 2000). On the other hand, HELFIT needs only four data points. HELFIT using data points that all lie on a plane in 3D space is the same as a 3D circle fitting, meaning that the helix pitch P is zero. We utilized HELFIT for the analysis of 1774 310 -helices in 689 proteins of which 99% are shorter than nine residues in length
310
P. Enkhbayar et al. / Computational Biology and Chemistry 32 (2008) 307–310
(Enkhbayar et al., 2006). We proposed a new term, parahelix, in order to explain the structural feature of 310 -helices in proteins. We are now applying HELFIT to the analyses of helical structures involving ␣-helix and -helix in proteins. 5. Conclusions For helix fitting we developed a new algorithm, HELFIT. HELFIT calculates simultaneously all the helix parameters with high accuracy. The minimum number of data points required for the analysis is only four. HELFIT is very insensitive to noise even in short helices and determines p = rmsdh /(N − 1)1/2 which indicates the regularity of helical structures independent of the number of data points. It is a genuinely total least squares method and thus gives the best helical parameters. HELFIT should become a basic tool of structural bioinformatics. Acknowledgement We thank Dr. Robert H. Kretsinger of University of Virginia for his valuable suggestion and comments. References
Fig. 2. Helical parameters of 310 -helices with added random noise calculated by HELFIT. (A) The helix pitch (P), the residue number per turn (n), and the helix radius (r) as a function of added noise. (B) The helix pitch (P), the residue number per turn (n), and the helix radius (r) as a function of helix length. (C) The p parameter as ˚ In panels (A) and (C), N = 4, ; a function of added noise. The unit of P and r is A. N = 5, 䊉; N = 6, ; N = 7, ; N = 8, ; N = 9, . In panel (B), The calculated P, n and r for 310 -helices with Et = 1.0 A˚ for total added random noise are indicated. Dotted lines ˚ n = 3.0, and r = 1.89 A˚ for an ideal 310 -helix. indicate P = 5.80 A,
˚ Aqvist, J., 1986. A simple way to calculate the axis of an ␣-helix. Comput. Chem. 10, 97–99. Bansal, M., Kumar, S., Velavan, R., 2000. HELANAL: a program to characterize helix geometry in proteins. J. Biomol. Struct. Dyn. 17, 811–819. Christopher, J.A., Swanson, R., Balwin, T.O., 1996. Algorithms for finding the axis of a helix: fast rotational and parametric least-squares methods. Comput. Chem. 20, 339–345. Enkhbayar, P., Kamiya, M., Osaki, M., Matsumoto, T., Matsushima, N., 2004. Structural principles of leucine-rich repeat (LRR) proteins. Proteins 54, 394–403. Enkhbayar, P., Hikichi, K., Osaki, M., Kretsinger, R.H., Matsushima, N., 2006. 310 Helices in proteins are parahelices. Proteins 64, 691–699. ¨ Fruhwirth, R., Strandlie, A., Waltenberger, W., 2002. Helix fitting by an extended Riemann fit. Nucl. Instrum. Methods A 490, 366–378. Kahn, P.C., 1989a. Simple methods for computing the least squares line in three dimensions. Comput. Chem. 13, 191–195. Kahn, P.C., 1989b. Defining the axis of a helix. Comput. Chem. 13, 185–189. Kumar, S., Bansal, M., 1996. Structural and sequence characteristics of long ␣-helices in globular proteins. Biophys. J. 71, 1574–1586. McLachlan, A.D., 1979. Gene duplication in the structural evolution of chymotrypsin. J. Mol. Biol. 128, 49–79. Nievergelt, Y., 1997. Fitting helices to data by total least squares. Comput. Aided Geom. Des. 14, 707–718. Pauling, L., Corey, R.B., Branson, H.R., 1951. The structure of proteins: two hydrogenbonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. U.S.A. 37,235–240. Perutz, M.F., 1951. New X-ray evidence on the configuration of polypeptide chains: Polypeptide chains in poly-␥-benzyl-l-glutamate, keratin and hemoglobin. Nature (Lond.) 167, 053–1054. Shakarji, C.M., 1998. Least-squares fitting algorithms of the NIST algorithm testing system. J. Res. Natl. Stand. Technol. 103, 633–641. Sugeta, H., Miyazawa, T., 1967. General method for calculating helical parameters of polymer chains from bond lengths, bond angles and internal-rotation angles. Biopolymers 5, 673–679. The Mathworks, Inc., 2001.