Physica A 387 (2008) 5159–5168
Physica A journal homepage: www.elsevier.com/locate/physa
Long range correlation and possible electron conduction through DNA sequences
Sheng-Cheng Wang, Ping-Cheng Li ∗, Hsen-Che Tseng
Department of Physics, National Chung-Hsing University, 250 Guo-Kuang Road, Taichung 402, Taiwan, ROC
Article info
Article history: Received 5 October 2007; received in revised form 8 April 2008; available online 25 April 2008
Keywords: DNA sequence; Long range correlation; Conductivity; Saccharomyces cerevisiae genome
Abstract
Long range correlation analysis and charge conductivity investigation are applied to the sequences of the 16 chromosomes in the Saccharomyces cerevisiae genome. DNA sequence data are analyzed via Hurst's analysis and Detrended Fluctuation Analysis (DFA). The super diffusive nature of the mapped sequences is evident, with the measured Hurst exponent H around 0.60 for all sequences in the 16 chromosomes. The DFA result is consistent with the result from the Hurst analysis. Tight binding models are applied to investigate charge conduction through DNA sequences. The overall averaged transmission coefficients, $\langle T_N \rangle_{av}$, calculated from the sixteen chromosomes are shown to be significantly different from values calculated from random as well as periodic sequences. Sequences from the S. cerevisiae genome promise better charge conduction ability than random sequences. Finally, delocalized electronic wave function patterns are also shown through calculations using the tight binding model. Slightly delocalized electronic wave functions are seen on sequences in the sixteen chromosomes, as compared with those obtained from random sequences at the same eigenenergies. © 2008 Elsevier B.V. All rights reserved.
1. Introduction
DNA sequences, which carry genetic information, are important sequential data in life science. The sequences that we can acquire and study today are those that have evolved over a geological timescale. Understanding their properties in terms of statistics and physics may pave the way to clarifying their biological significance. Many interesting findings based on both physical and biochemical analyses have been published and discussed [1–8]. Among these valuable and inspirational outcomes, two features are worthy of further exploration. One is the well-known long-range correlation characteristic, which was demonstrated by Peng et al. [1] and can be related to Lévy type statistics [2]. The other is the possible charge conduction through DNA sequences. The second feature is especially alluring since it may reveal the possibility of molecular wires suitable for advancing nanoelectronic devices. Biologically, a precise understanding of the charge conduction properties of DNA sequences might lead to substantiated descriptions of the damage recognition process, protein binding, and processes of engineering biological types. With a view to future applications in nanoelectronic technology, as well as a deeper understanding of life phenomena on earth, it is necessary to explore the essential characteristics of DNA sequences both physically and biologically. Many efforts have been devoted to studying how DNA wires behave when they are put under an electrical potential difference, possibly at different temperatures. The DNA conductivity measurements reported by different groups are so diverse that Endres et al. [9] felt it necessary to write a review article in the hope of finding consistencies between them. According to this article, charge-transfer reactions and conductivity measurements show a large variety of
∗ Corresponding author: P.-C. Li. doi:10.1016/j.physa.2008.04.029
S.-C. Wang et al. / Physica A 387 (2008) 5159–5168
possible electronic behaviors, ranging from Anderson and band-gap insulators to effective molecular wires and induced superconductors. As remarked by R.G. Endres, the wide spectrum of observations on DNA conduction is no surprise, since understanding a complicated polyelectrolytic aperiodic system is itself a difficult physical problem. Localized electronic wave functions are generally anticipated in this kind of aperiodic system. However, in 2002 Pedro Carpena et al. [10] demonstrated in a unifying way that different degrees of spatial correlation in DNA lead to different electron conductivity. Segments of DNA with stronger long range correlations are characterized by wave functions extended over several hundreds or even thousands of nucleotides. Their results raise an interesting question about the DNA sequences of different species that exist in nature: how do they perform in the model suggested by Pedro Carpena et al.? The model suggested in Ref. [10] trims away the complications of counterions, temperature, backbone conduction, etc., but provides a doorway for statistical study. Transmission coefficients calculated through this model can be averaged to obtain an overall averaged quantity $\langle T_N \rangle_{av}$. By comparing $\langle T_N \rangle_{av}$ with that calculated from random sequences without long range correlations, useful information can be extracted. Over an evolutionary process of millions of years, genetic information passed on from one generation to the next confronts the risk of errors due to damage to the genetic code. Sequences with a better ability for error correction secure a better chance of survival in the history of nature. A better ability for correction, as understood at present, might mean prompt detection of a damage site on a sequence, and such detection is most effective when it proceeds via electronic signals.
This possible option, as we contemplate the evolution of life forms on earth, means that the DNA sequences we see today should be those that survived fatal damage and should bear a better charge conduction ability. On the other hand, in terms of the statistical properties mentioned before, DNA sequences started out in a random fashion and were gradually modified and tailored through the course of natural evolution in order to be better adjusted to a harsh environment. DNA sequences inscribed in this way carry the history of their passage and display long-range correlation characteristics in statistical measurements and analyses. In the light of the foregoing hypothesis, the long-range correlation seen in statistical analyses should not be dissociated from observations of electrical conductivity. These two features and their relation are the main concerns of this paper. We describe our tactics, which are mainly borrowed from forerunners in this field, and apply them to the sequences of the 16 chromosomes in the S. cerevisiae genome, since they have been thoroughly studied in many aspects and, as eukaryotes, share the same quintessence with Homo sapiens.
2. Long range correlation in DNA
Long range correlations in DNA sequences manifest themselves in the DNA walk mapping analysis as a power law function with super diffusive nature [1–3]. Adapting C.K. Peng's convention [1], we assign a step down [u(i) = −1] for purines (A, G) and a step up [u(i) = +1] for pyrimidines (C, T). A DNA sequence mapped in this way resembles a walker staggering randomly along a one dimensional path, whereupon the second moment of the fluctuations of the random walk can be calculated. The super diffusive nature of the mapped sequences can be observed thereby. As suggested by Stephan Roche et al.
[11], Hurst's analysis [12] is more reliable for determining precise rescaling coefficients [13]; we therefore follow the same treatment in exploring the long range correlation feature in this article. Hurst's analysis diminishes the "patchiness" structure [1,2] in a much more sophisticated fashion. To define the rescaled range function R(n) we follow the forerunners' prescription [11,12]. For a given sequence of size N, the net displacement x(n) of the random walker after n consecutive steps is

\[ x(n) = \sum_{i=1}^{n} u(i), \qquad 1 \le n \le N. \]

The net displacement difference between the positions m and (m + k) of the random walker is defined as $\Delta x(m,k) = x(m+k) - x(m)$. Rescaled variables X(m, k) can then be defined as

\[ X(m,k) = \Delta x(m,k) - \frac{k}{n}\,\Delta x(m,n), \qquad 1 \le k \le n. \tag{1} \]
By this overall subtraction, the general trend due to an asymmetrical concentration of purines and pyrimidines is greatly diminished. The next step is to find the maximum and minimum values of the rescaled variable X(m, k) within the range 1 ≤ k ≤ n and calculate the difference, S(m, n), between them:

\[ S(m,n) = \max_{1 \le k \le n} X(m,k) - \min_{1 \le k \le n} X(m,k). \tag{2} \]
The foregoing procedure greatly reduces the "patchiness" effect [1,2] usually confronted in this type of analysis. The average of these differences S(m, n) over 1 ≤ m ≤ N − n is calculated for different window lengths n:

\[ \langle S(n) \rangle = \frac{1}{N-n} \sum_{m=1}^{N-n} S(m,n). \tag{3} \]
Finally, the Hurst exponent H is defined through Eq. (4),

\[ R(n) = \frac{\langle S(n) \rangle}{\sigma(n)} \propto n^{H}, \tag{4} \]

where σ(n) is the standard deviation of u(i) over windows of n steps and R(n) is known as the rescaled range function.
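The steps of Eqs. (1)–(4) can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' code: the function names `dna_walk`, `rescaled_range` and `hurst_exponent` are hypothetical, and σ(n) is assumed here to be the standard deviation of u(i) inside each window, averaged over all windows, which is one possible reading of the definition above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Mapping used in the text: u = -1 for purines (A, G), u = +1 for pyrimidines (C, T).
STEP = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}

def dna_walk(seq):
    """Map a DNA string onto the step sequence u(i)."""
    return np.array([STEP[base] for base in seq.upper()])

def rescaled_range(u, n):
    """R(n) = <S(n)> / sigma(n) for window length n, following Eqs. (1)-(4)."""
    u = np.asarray(u, dtype=float)
    x = np.concatenate(([0.0], np.cumsum(u)))    # x[j] = x(j), with x(0) = 0
    # Row m-1 of `win` holds x(m), x(m+1), ..., x(m+n) for m = 1 .. N-n.
    win = sliding_window_view(x, n + 1)[1:]
    dx = win[:, 1:] - win[:, :1]                 # Delta x(m, k), k = 1 .. n
    k = np.arange(1, n + 1)
    X = dx - (k / n) * dx[:, -1:]                # Eq. (1)
    S = X.max(axis=1) - X.min(axis=1)            # Eq. (2)
    # Assumption: sigma(n) is the per-window standard deviation of u, averaged over windows.
    sigma = sliding_window_view(u, n)[1:].std(axis=1).mean()
    return S.mean() / sigma                      # Eqs. (3) and (4)

def hurst_exponent(u, window_lengths):
    """Slope of log R(n) versus log n, i.e. the exponent H in Eq. (4)."""
    R = [rescaled_range(u, n) for n in window_lengths]
    slope, _ = np.polyfit(np.log(window_lengths), np.log(R), 1)
    return slope
```

For example, `hurst_exponent(dna_walk(seq), [32, 64, 128, 256])` returns an estimate of H for a sequence string `seq`. The sketch keeps all N − n windows in memory at once, so for chromosome-length sequences the loop over m would need to be chunked or subsampled.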
Fig. 1. Rescaled range function R(n) versus n. The measured Hurst exponents are 0.61, 0.52, and 0.03 respectively for sequences taken from the first chromosome of S. cerevisiae, a random sequence and a periodic array. Both the random sequence and the periodic sequence are of the length 100 000 bp. In this figure the curve for S. cerevisiae is drawn in pink color, the curve for the random sequence in black color, and the curve for the periodic sequence in blue color. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 1
Hurst exponent and DFA-2 exponent calculated from S.C. chromosomes.

I.D. number of S.C. chromosome   Length (bp)   Hurst exponent   DFA-2 exponent
 1                                 230 208      0.61             0.62
 2                                 813 178      0.62             0.62
 3                                 316 616      0.60             0.62
 4                               1 531 916      0.63             0.63
 5                                 576 869      0.60             0.62
 6                                 270 148      0.62             0.63
 7                               1 090 946      0.63             0.63
 8                                 562 642      0.60             0.62
 9                                 439 885      0.59             0.63
10                                 745 666      0.61             0.62
11                                 666 454      0.63             0.63
12                               1 078 174      0.61             0.63
13                                 924 429      0.63             0.63
14                                 784 331      0.61             0.63
15                               1 091 287      0.61             0.63
16                                 948 062      0.60             0.63
The Hurst exponent as defined in Eq. (4) is a characteristic measure of a random sequence. Sequences generated from ordinary Brownian motion show H = 0.5, a typical value for sequences without long-range correlations, whereas H > 0.5 is expected for a long range correlated sequence. In Fig. 1 we present the analysis of a random sequence of length 100 000 bp. The same analysis, with the same result, is found in Ref. [11]. The Hurst exponent is measured to be H = 0.52, a characteristic value for a Brownian type sequence. For further comparison we also conduct the same measurement on a periodic sequence of the same length. The nearly flat curve shown in the figure, with H = 0.03, reflects the strict periodic order of that sequence. We have measured 16 Hurst exponent values from the sequences of the 16 S. cerevisiae chromosomes. All the measured values are listed in Table 1. Different chromosomes have different lengths, ranging from about 200 kbp to 1.5 Mbp as can be seen from the table. The Hurst exponents of all sequences in the S. cerevisiae genome are measured to be near 0.6, a characteristic value of long range correlation. A typical full scale measurement of the rescaled range function R(n), through the R(n) versus n diagram, is shown in Fig. 2(a) for the first chromosome. The nonlinearity in the region of window lengths less than 30 is caused by a lack of statistics. When the range with n less than 30 is neglected, all the data points fit the line very well. Besides Hurst analysis, Detrended Fluctuation Analysis (DFA) has been developed in recent years for the purpose of studying DNA sequences and other nonstationary biological and physiological data in which polynomial trends are embedded [13–15]. By properly filtering out trends of "clustering sequences" it can more accurately quantify the degree of correlation in the sequential data.
To compare the results with those obtained from the Hurst analysis we performed DFA analysis, following the procedure of Ref. [15], to quantify correlations in the S. cerevisiae genome. In the DFA analysis, DNA sequences are mapped into sequential signals with u(i) = −1 for purines and u(i) = 1 for pyrimidines. At first sequential
Fig. 2(a). A full scale typical measurement of the rescaled range function R(n) for the first chromosome of S. cerevisiae. Below n = 30 a nonlinear region is observed. This may be due to the lack of statistics. This region is not included in the determination of Hurst exponents.
signals u(i) are used to obtain differential sequential signals y(i),

\[ y(i) = \sum_{j=1}^{i} \left[ u(j) - \bar{u} \right], \tag{5} \]

where

\[ \bar{u} = \frac{1}{N} \sum_{j=1}^{N} u(j) \tag{6} \]
is the mean calculated over a DNA sequence of length N. The complete profile of differential sequential signals y(i) is then divided into boxes of equal length n. In each box of length n, a polynomial function $y_n(i)$, which represents the local trend in that particular box, is fitted to y(i). An analysis using a polynomial of order l is designated DFA-l. In our study, we employ the DFA-2 analysis since it seems to be the popular choice. In the following step the complete profile is detrended by subtracting the local trend $y_n(i)$ in each box of length n:

\[ Y_n(i) = y(i) - y_n(i). \tag{7} \]
Finally, the rms fluctuation F(n) of the detrended signals is calculated for each box length n:

\[ F(n) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ Y_n(i) \right]^2}. \tag{8} \]
The whole process is repeated for different box lengths n. The characteristic behavior of F(n) over a wide range of scales can thus be obtained. Different degrees of correlation can be identified through the DFA exponent α of the power law

\[ F(n) \sim n^{\alpha}. \tag{9} \]
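The DFA procedure of Eqs. (5)–(9) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used for Table 1: the function name `dfa` is hypothetical, the boxes are taken as non-overlapping, and samples left over after the last complete box are simply dropped, so the average in Eq. (8) runs over the samples covered by whole boxes.

```python
import numpy as np

def dfa(u, box_lengths, order=2):
    """Return (alpha, F) where F[j] is the rms fluctuation for box_lengths[j]."""
    u = np.asarray(u, dtype=float)
    y = np.cumsum(u - u.mean())                  # Eqs. (5) and (6)
    F = []
    for n in box_lengths:
        n_boxes = y.size // n                    # assumption: trailing samples are dropped
        boxes = y[:n_boxes * n].reshape(n_boxes, n).T   # one column per box
        i = np.arange(n)
        # Least-squares polynomial of the given order, fitted to every box at once.
        coeffs = np.polyfit(i, boxes, order)     # shape (order + 1, n_boxes)
        trend = np.vander(i, order + 1) @ coeffs # local trend y_n(i) in each box
        detrended = boxes - trend                # Eq. (7)
        F.append(np.sqrt(np.mean(detrended ** 2)))      # Eq. (8)
    F = np.array(F)
    alpha, _ = np.polyfit(np.log(box_lengths), np.log(F), 1)   # Eq. (9)
    return alpha, F
```

With `order=2` this corresponds to a DFA-2 analysis; an uncorrelated ±1 signal should give α close to 0.5, while an integrated random walk used as input gives a much larger exponent.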
If α = 0.5, the signal is uncorrelated. If α > 0.5, the sequential signal has long range correlation. A typical fit of the DFA-2 analysis for the first chromosome is shown in Fig. 2(b). It can be compared with the Hurst analysis in Fig. 2(a), where a nonlinear portion in the low n segment is apparent. We have measured 16 DFA-2 exponent values from the sequences of the S. cerevisiae chromosomes. They are listed in Table 1. The DFA-2 exponent values are seen to be more stable than those obtained from the Hurst analysis. Long range correlation is thus confirmed in the DNA sequences extracted from the genome of S. cerevisiae. Long range correlation in existing genomes does not in itself provide new information, since it has been a well known phenomenon for a long time. Whether existing DNA sequences in the genomes of various species possess electronic conductivity is the main issue of our exploration. Some pioneering reports [9–11,16,17] show promising signs of charge conduction in some existing DNA sequences. Long range correlation in part plays an important role in the charge conduction in these sequences. It is advisable to study in some detail how these matters are interwoven, using the data at hand from sequences of living cells. We chose to study the 16 sequences from the chromosomes of S. cerevisiae since they have been thoroughly studied by both bio-scientists and physicists. Another reason for choosing S. cerevisiae is the eukaryotic nature of its cells. Homo sapiens inherit a similar cell structure from some presumed common origin in the history of natural evolution.
Fig. 2(b). A full scale typical measurement of the DFA-2 analysis is shown. “Clustering sequences” are filtered out by the second order polynomial so that a more appropriate exponent value can be determined.
Table 2
The rms of the second moment, Dave, and its standard deviation (σ).

Data      Dave [bp]              σ [bp]                 Dave [bp]              σ [bp]
          binary site energies   binary site energies   four site energies     four site energies
S.C.1     21.10                  15.36                  20.02                  15.23
S.C.2     21.14                  15.29                  20.23                  15.15
S.C.3     20.87                  15.06                  20.15                  15.10
S.C.4     21.19                  15.38                  20.21                  15.17
S.C.5     20.98                  15.18                  19.99                  15.08
S.C.6     21.17                  15.29                  20.15                  15.09
S.C.7     21.10                  15.28                  20.18                  15.15
S.C.8     21.00                  15.16                  20.09                  15.06
S.C.9     21.15                  15.30                  20.19                  15.21
S.C.10    21.09                  15.32                  20.10                  15.15
S.C.11    21.24                  15.36                  20.22                  15.14
S.C.12    21.15                  15.32                  20.18                  15.17
S.C.13    21.16                  15.31                  20.19                  15.13
S.C.14    21.02                  15.18                  20.11                  15.10
S.C.15    21.09                  15.24                  20.17                  15.10
S.C.16    21.16                  15.28                  20.21                  15.15
Random    19.52                  14.31                  18.49                  14.74
Period    86.39                   3.37                  85.25                   7.64
Values in the second and third column are calculated using binary site energy, whereas values in the fourth and the fifth column are calculated using four different site energies. According to the data shown in this table, long range correlated sequences from living cells indeed bear a slightly better ability of charge conduction than a random sequence. A periodic sequence is expected to be the best conductor in this case.
3. Charge transfer in DNA
Charge conduction through a DNA chain is a complicated phenomenon, and its measurement is a delicate matter. Many factors can significantly affect experimental observations. The double-stranded helical structure of a DNA molecule fastened between two nano-electrodes for an experiment is under mechanical stress. This factor may significantly alter helical geometries and affect experimental observations. The order of complexity, the local density distributions of different nucleotides, etc., are also factors that should not be overlooked. In addition, water molecules, counter ions [9], temperature, etc. contribute to a random electronic environment and surely affect charge conduction in a DNA chain residing in the nucleus of a living cell. In this sense, we are not able to construct a physical model at this stage that explains the real conduction mechanism; rather, we provide various presentations of data analysis according to the tight-binding model, which has been extensively utilized by many authors [10,11,18]. This type of research should be regarded as a gateway to the real physics. Many averaged quantities obtained through the application of the tight-binding model can in principle be regarded as characteristic quantities of various DNA sequences. Comparisons between dissimilar types of sequences with different orders of complexity can thus be made, and further studies can be anticipated. In the tight-binding model the simplest effective Hamiltonian describing the propagation of a hole in the DNA chain is [11,19,20]

\[ H = \sum_{n} \varepsilon_n |n\rangle\langle n| + t \sum_{n} \left[ |n\rangle\langle n+1| + |n\rangle\langle n-1| \right]. \tag{10} \]
Fig. 3. (a) Transmission coefficients TN (E) as the function of energy E are calculated from two sequence chains taken from the first chromosome of S. cerevisiae. Their lengths are N = 60 and 120. (b) Transmission coefficients TN (E) as the function of energy E calculated from a random sequence with N = 60 and 120 are shown here for comparison.
We make the same choice as Ref. [11] for the hole site energies $\varepsilon_n$, which are the energies required to excite an electron from the corresponding nucleotide bases: εA = 8.24 eV, εT = 9.14 eV, εC = 8.87 eV, and εG = 7.75 eV (A = adenine, T = thymine, C = cytosine, and G = guanine). The Hamiltonian given in Eq. (10) was first used for realistic DNA sequences by Stephan Roche et al. in 2003 [21]. The hopping integral t, which simulates the π–π stacking between adjacent nucleotides and is the hopping probability amplitude between neighboring sites, is taken to be 1 eV in all calculations. It is known that the value of the hopping integral t affects the characteristic quantities discussed below but does not change the overall physical content. As pointed out in Ref. [11], the choice of t = 1 eV is meant to reduce the backscattering of holes at the contact electrodes, so that a larger transmission spectrum allows better characterization of conductivity in the DNA chain. The two electrode leads and the DNA sequence chain under calculation together form a chain of infinitely many sites. Sites i = 1 to N are associated with the sequence under study, whereas the two electrode leads are the sites belonging to [−∞, 0] ∪ [N + 1, +∞]. In carrying out the calculations, the time independent Schrödinger equation is projected onto a localized basis by properly accounting for the boundary conditions [10]. An electronic eigenstate $|\psi\rangle$ of eigenenergy E is a linear combination of the basis states $\{|n\rangle\}$, $|\psi\rangle = \sum_n a_n |n\rangle$, where $|n\rangle$ denotes an electronic hole state on site n. The eigenstate $|\psi\rangle$ satisfies the time independent Schrödinger equation $H|\psi\rangle = E|\psi\rangle$. By proper rearrangement the Schrödinger equation can be transformed into the following equation.
\[ \varepsilon_n \psi_n + t\,\psi_{n+1} + t\,\psi_{n-1} - E\,\psi_n = 0. \tag{11} \]
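The rearrangement leading to Eq. (11) can be spelled out as a short worked step. The sketch below assumes an orthonormal site basis, $\langle n|m\rangle = \delta_{nm}$, and writes $\psi_n = \langle n|\psi\rangle$; the last line, obtained by solving for $\psi_{n+1}$, is the two-term recursion on which a transfer matrix treatment rests.

```latex
\begin{align*}
\langle n|H|\psi\rangle
  &= \varepsilon_n \langle n|\psi\rangle
     + t\left(\langle n+1|\psi\rangle + \langle n-1|\psi\rangle\right)
   = \varepsilon_n \psi_n + t\,\psi_{n+1} + t\,\psi_{n-1}, \\
\langle n|E|\psi\rangle &= E\,\psi_n, \\
\Rightarrow\quad
  & \varepsilon_n \psi_n + t\,\psi_{n+1} + t\,\psi_{n-1} - E\,\psi_n = 0, \\
\Rightarrow\quad
  & \psi_{n+1} = \frac{E-\varepsilon_n}{t}\,\psi_n - \psi_{n-1}.
\end{align*}
```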
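The function $\psi_n$ is the projection of an eigenstate wave function ψ on the nth site and can be written as $\psi_n = \langle n|\psi\rangle$. These projected functions are related through the following equation.

\[ \begin{pmatrix} \psi_{n+1} \\ \psi_n \end{pmatrix} = M_n \begin{pmatrix} \psi_n \\ \psi_{n-1} \end{pmatrix} = M_n \cdots M_1 \begin{pmatrix} \psi_1 \\ \psi_0 \end{pmatrix}. \tag{12} \]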
Here the $M_n$ are 2 × 2 matrices with $M_n(1,1) = (E - \varepsilon_n)/t$, $M_n(1,2) = -1$, $M_n(2,1) = 1$, and $M_n(2,2) = 0$. Applying the same algorithm as in Ref. [11], we first compute the transmission coefficients $T_N(E)$ from the transfer matrix formalism and then evaluate some other characteristic quantities for further comparison. The transmission coefficient $T_N(E)$, which gives the probability of electrons tunneling through the N-site DNA chain, is defined as

\[ T_N(E) = \frac{4 - \dfrac{(E-\varepsilon_m)^2}{t^2}}{-\dfrac{(E-\varepsilon_m)^2}{t^2}\,(P_{12}P_{21}+1) + \dfrac{(E-\varepsilon_m)}{t}\,(P_{11}-P_{22})(P_{12}-P_{21}) + \sum_{i,j=1,2} P_{ij}^2 + 2} \tag{13} \]
with $P = M_N M_{N-1} \cdots M_1$, where the $M_n$ are the 2 × 2 matrices defined with Eq. (12). The energy $\varepsilon_m$ is the boundary energy assumed at the contact points of the two electrodes. It is taken to be the ionization energy of the guanine base, $\varepsilon_m = \varepsilon_G$. This choice simulates a resonance with the G-HOMO energy level within the DNA chain under study. Two DNA sequence chains with different lengths, N = 60 and 120, taken from the first chromosome of S. cerevisiae are used for demonstration. The calculated transmission coefficients $T_N(E)$ are shown in Fig. 3. It is evident that the longer sequence chain shows poorer transmittivity. When a chain taken from a real DNA sequence is compared with one from a random sequence, the real sequence shows better transmittivity. The calculated values of $T_N(E)$ depend on the sequence order, and it is advisable to define an averaged characteristic quantity based on $T_N(E)$ for further discussion. We analyzed the sequences of the 16 chromosomes in the cell nucleus of S. cerevisiae. For
Fig. 4. Averaged quantity $\langle T_N \rangle_{av}$ calculated from the 16 chromosomes of S. cerevisiae. SC1 and SC9 are data from the first and ninth chromosomes. The fourteen others are stacked together and labeled as SC others. Data from the mitochondrial DNA sequence are shown as "×" symbols. Results from a random sequence are shown as red triangles. Much better conductivity of all sixteen chromosomal DNA chains is observed over a wide range of N when compared with values calculated from a random sequence. The result calculated from the DNA sequence of E. coli is also included. Its conductivity is expected to be slightly better than that of a random sequence. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
each sequence of length L we calculate $T_N(E, i)$ for every segment chain of width N starting from site i. Calculations were carried out from i = 1 to (L − N + 1). In a sense the whole DNA sequence on a chromosome is taken to be a complete statistical ensemble. Afterward an integrated value, $S[T_N(E, i)]$, over the energy range E = 5.75–9.75 eV (ΔE = 4 eV) is computed [18]:

\[ S[T_N(E,i)] = \int_{5.75}^{9.75} T_N(E,i)\, dE. \tag{14} \]
An averaged characteristic quantity can thus be defined as

\[ \langle T_N \rangle_{av} = \frac{1}{M} \sum_{i=1}^{M} S[T_N(E,i)] \tag{15} \]
where M = L − N + 1. The value $\langle T_N \rangle_{av}$ defined in Eq. (15) is an aggregated average over a well-defined energy range for a complete DNA sequence from a chromosome. All sixteen calculated values of $\langle T_N \rangle_{av}$ are shown in Fig. 4. Much better conductivity of all sixteen chromosomal DNA chains is observed over a wide range of N when compared with values calculated from a random sequence. Aside from the sequences from the first and the ninth chromosomes, the others show similar behavior, as their data points all cluster together. Sequence chains from the first and ninth chromosomes have better conductivity at longer segment lengths (N ≥ 300). It is interesting to point out that segment chains taken from the mitochondrion of S. cerevisiae behave in the same way as those from the chromosomes. This kind of result is expected from the pioneering work of Pedro Carpena et al. [10] and Stephan Roche et al. [11]. In terms of the tight binding model, sequence chains with long range correlations show better conductivity than those without long range correlations. In this article we present the data analyzed from the complete S. cerevisiae genome, and the result is shown to be consistent with their research. The mechanisms underlying the phenomenon seen in this study and others should have an intimate connection to the natural evolution of life forms. This idea deserves attention for further exploration.
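The transfer matrix calculation of Eqs. (12)–(15) can be sketched as below. This is a minimal illustration and not the authors' program: the names `transmission` and `averaged_transmission` are hypothetical, and the energy integral of Eq. (14) is approximated with a simple midpoint rule on `n_energy` points, which is an assumption made here because the text does not state the quadrature used. The midpoint grid also avoids the exact band edges, where the numerator of Eq. (13) vanishes.

```python
import numpy as np

# Hole site energies in eV, as quoted in the text.
SITE_ENERGY = {"A": 8.24, "T": 9.14, "C": 8.87, "G": 7.75}

def transmission(seq, E, t=1.0, eps_m=SITE_ENERGY["G"]):
    """T_N(E) of Eq. (13) for the chain `seq`, evaluated on an array of energies E."""
    E = np.atleast_1d(np.asarray(E, dtype=float))
    P = np.broadcast_to(np.eye(2), (E.size, 2, 2)).copy()
    for base in seq.upper():
        M = np.zeros((E.size, 2, 2))
        M[:, 0, 0] = (E - SITE_ENERGY[base]) / t
        M[:, 0, 1] = -1.0
        M[:, 1, 0] = 1.0
        P = M @ P                                  # builds P = M_N ... M_1
    x = (E - eps_m) / t
    denominator = (-x ** 2 * (P[:, 0, 1] * P[:, 1, 0] + 1.0)
                   + x * (P[:, 0, 0] - P[:, 1, 1]) * (P[:, 0, 1] - P[:, 1, 0])
                   + (P ** 2).sum(axis=(1, 2)) + 2.0)
    return (4.0 - x ** 2) / denominator

def averaged_transmission(seq, N, n_energy=200, e_min=5.75, e_max=9.75):
    """<T_N>_av of Eq. (15): average over all windows of width N of the integral in Eq. (14)."""
    dE = (e_max - e_min) / n_energy
    E = e_min + (np.arange(n_energy) + 0.5) * dE   # midpoint rule (assumption)
    M = len(seq) - N + 1
    total = 0.0
    for i in range(M):
        total += transmission(seq[i:i + N], E).sum() * dE   # Eq. (14) for window i
    return total / M
```

A useful sanity check of the formula is a uniform guanine chain, whose site energy matches the boundary energy: it transmits perfectly inside the band, so the averaged value equals the width of the energy window, 4 eV.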
4. Electronic wave functions
The tight binding model has been shown to be a promising approach for discussing charge conduction in DNA chains. It is then worthwhile to reveal in some detail the patterns of the electronic wave functions, which might provide indications or clues for further interpretation. To simplify our calculation, we take the same strategy as adopted in Ref. [10]. Site energies $\varepsilon_n$ are assigned 0.5 eV for purines and −0.5 eV for pyrimidines. This simplification trims away detailed energy variations on the sites along a sequence chain but retains the complexity of the sequential order of the two different types of nucleotide bases. In order to calculate electronic wave functions in a DNA sequence of length N bp we transform Eq. (10) into a matrix equation
Fig. 5(a). Example patterns of electronic eigenstate wave functions ψN for two different eigenenergies, E = −1.271 eV and E = −0.033 eV, are shown in this figure. A sequence chain of length N = 300 is taken from the first chromosome for the calculation of ψN .
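\[
\begin{pmatrix}
\varepsilon_1 & 1 & 0 & \cdots & 0 \\
1 & \varepsilon_2 & 1 & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & 1 & \varepsilon_{N-1} & 1 \\
0 & \cdots & 0 & 1 & \varepsilon_N
\end{pmatrix}
\begin{pmatrix} \psi_1 \\ \psi_2 \\ \vdots \\ \psi_{N-1} \\ \psi_N \end{pmatrix}
=
\begin{pmatrix}
E_1 & 0 & \cdots & \cdots & 0 \\
0 & E_2 & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & E_{N-1} & 0 \\
0 & \cdots & \cdots & 0 & E_N
\end{pmatrix}
\begin{pmatrix} \psi_1 \\ \psi_2 \\ \vdots \\ \psi_{N-1} \\ \psi_N \end{pmatrix}
\tag{16}
\]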
and work out the eigenvalue equation. Electrons are assumed to be bound in the sequence, so that $\psi_0 = \psi_{N+1} = 0$ is set on the two boundary end sites. All component values of the wave function ψ on all sites of a sequence chain of length N are thus obtained. Typical results are shown in Figs. 5(a) and 5(b). In Fig. 5(a) a sequence chain of length N = 300 is taken from the first chromosome. The eigenstate wave function patterns are seen to be quite sensitive to the eigenenergy. By computing the absolute square of $\psi_n$ on all sites, we obtain the electronic probability distribution $P(n) = |\psi_n|^2$, which characterizes charge conduction through the chain. These probability distributions are so diverse in appearance that averaged characteristic quantities are needed to compare segment chains from different DNA sequences. We calculate probability distributions for segments of length N = 300 along a DNA sequence by running the starting site number from i = 1 to i = L − 300, where L is the total length of the whole sequence. The whole task is completed by calculating probability distributions for all energies and all sequences. For each probability distribution obtained with an eigenenergy E we define a mean radius $\langle r_P \rangle_E = \sum_{n=1}^{N} n\,P(n,E)$ to show the overall central position of the distribution function P(n, E). The function P(n, E) is assumed to be normalized. This quantity is then used to calculate the second moment of the distribution. The second moment $D^2(E)$ is defined from the wave functions calculated from Eq. (16) and measures the dispersion of the distribution, which characterizes the conductivity behavior along the chain in an eigenstate of energy E. More dispersed distributions denote more delocalized wave functions and imply a better ability for electronic conduction.

\[ D^2(E) = \sum_{n=1}^{N} P(n,E)\,\left(n - \langle r_P \rangle_E\right)^2. \tag{17} \]
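The eigenvalue problem of Eq. (16) and the dispersion of Eq. (17) can be sketched for a single segment as follows. This is an illustrative sketch with hypothetical function names (`binary_site_energies`, `dispersion`), restricted to the binary site energies of ±0.5 eV; hard-wall boundaries ($\psi_0 = \psi_{N+1} = 0$) are implied by diagonalizing the finite N × N matrix.

```python
import numpy as np

def binary_site_energies(seq):
    """+0.5 eV for purines (A, G) and -0.5 eV for pyrimidines (C, T)."""
    return np.array([0.5 if base in "AG" else -0.5 for base in seq.upper()])

def dispersion(site_energies, t=1.0):
    """Return (E, D): eigenenergies and the root of the second moment D(E) of Eq. (17)."""
    eps = np.asarray(site_energies, dtype=float)
    N = eps.size
    # Tridiagonal Hamiltonian of Eq. (16): site energies on the diagonal, t off the diagonal.
    H = np.diag(eps) + t * (np.eye(N, k=1) + np.eye(N, k=-1))
    energies, psi = np.linalg.eigh(H)        # columns of psi are normalized eigenvectors
    P = psi ** 2                             # P[n-1, j] = |psi_n|^2 for eigenstate j
    n = np.arange(1, N + 1)[:, None]
    r_mean = (n * P).sum(axis=0)             # mean radius <r_P>_E
    D2 = (P * (n - r_mean) ** 2).sum(axis=0) # Eq. (17)
    return energies, np.sqrt(D2)
```

Averaging the returned `D` over all eigenstates of a segment, and then over all segments of a sequence, gives a quantity of the kind compared between chromosomal, random and periodic sequences; a strictly alternating purine–pyrimidine chain yields much larger values than a random one.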
The square roots D(E) of the second moment are averaged over all eigenenergies and all segment chains of a complete DNA sequence to define an averaged value Dave. The standard deviations σ for all chromosomes are also calculated. Table 2 presents the averaged values Dave calculated from the complete sequences of the sixteen chromosomes. These Dave values are nearly the same, roughly 21.0 base pairs. Roughly speaking, the sequences from the sixteen different chromosomes have a similar ability for charge conduction. The almost identical values of the probability distribution dispersion (Dave) indicate similar delocalization behavior of the electronic wave functions in these systems. The same measurements are carried out on a random sequence and on a periodic sequence. The measured Dave value of 19.5 bp from the random sequence is slightly smaller than that calculated from the sequences of living cells. This difference is not large, but it is statistically significant enough to indicate that long range correlated sequences from living cells indeed bear a slightly better ability for charge conduction. A periodic sequence of purine–pyrimidine combinations is used for further comparison. The calculated Dave value is 86.4 bp (with σ = 3.4 bp), which is more than four times larger than those of the chromosomes. This result is expected, since periodic chains bear a much better ability for conduction than disordered sequence chains. It is helpful to present the dispersion of wave functions in terms of binary site energies. However, since a DNA sequence comprises four different nucleotides, it is more appropriate to explore wave function dispersion in terms of the hole site energies addressed previously, εA = 8.24 eV, εT = 9.14 eV, εC = 8.87 eV, and εG = 7.75 eV.
By assuming t = 1 eV for the hopping integral and assigning hole site energies with exactly the same order of DNA sequences
Fig. 5(b). Electronic wave functions ψN and probability distributions |ψN|2 from two sequence chains with eigenenergy E = −1.57 eV. The one labeled SC1 is taken from the first chromosome of S. cerevisiae and the one labeled "rand" is taken from a random sequence. The wave function, and likewise the probability distribution, calculated from the sequence chain labeled SC1 is slightly more delocalized.
to Eq. (16), the eigenenergies and eigen-wavefunctions can be obtained. Repeating the whole procedure described above, we calculate the square root of the second moment D(E) and obtain an averaged value Dave by running through all eigenenergies and all segment chains of a complete DNA sequence. All 16 Dave values calculated from the S. cerevisiae genome are listed in Table 2, along with the Dave values calculated from a random sequence and a periodic one. They are listed together with their standard deviations (σ). The delocalization of wave functions on the DNA sequences of the S. cerevisiae genome is again revealed when compared with that obtained from a random sequence. The Dave values obtained from the DNA sequences are roughly 20.0 bp, as opposed to 18.5 bp from a random sequence. The result calculated from a periodic sequence, 85.3 bp, is also listed for comparison. It should be emphasized that the wave function dispersion presented in this study is not meant to describe the real physics of charge conduction through the DNA sequences under study. To describe the real physics of charge conduction through a DNA sequence, one has to accommodate quite a few factors, including environmental temperature, counter ions, mechanical stress on a sample sequence, and so forth. The choice of a hopping integral t = 1 eV is itself a controversial matter, since some authors [22,23] have suggested a smaller t value (<1 eV). A smaller t value would tend to wash out the results we report here. It is nevertheless reasonable to take the quantities mentioned above as useful characteristic quantities that capture statistical aspects of DNA sequences through the gateway of the one dimensional tight-binding model. In this spirit, this study and others [10] provide an interesting glimpse of the close connection between long range correlation and the dispersive nature of electronic wave functions in DNA sequences.
5. Conclusions
Complete DNA sequence data from living cells of different species attract great attention across scientific communities of different disciplines. All living life forms on earth today are basically regulated by the genetic information defined in DNA sequences. These sequences are encoded with only four different types of nucleotides, but are arranged by nature in orders of various complexities. More revelations about the features of DNA sequences would greatly improve our understanding of life phenomena, as well as of the natural evolution of life forms on earth. It is therefore worthwhile to design as many methodologies as possible to probe DNA. Long range correlations and possible charge conduction through
hopping mechanism on DNA sequences have long been recognized as important topics of academic interest. In this article, following the work of pioneers, we have investigated these two features in DNA sequences from the S. cerevisiae genome. Long-range correlations are confirmed in these sequences by the Hurst exponent as well as by the DFA-2 exponent. The averaged transmission coefficients ⟨T_N⟩_av show that DNA sequences from the S. cerevisiae genome do indeed have a statistically better charge conduction ability than a random sequence. A recent study by Shih et al. [24] has shown that DNA point mutations disrupt long range correlations. This effect yields unprecedented changes in the transmission coefficient, and the authors suggest that charge transport could play a significant role in the DNA-repair deficiency that leads to carcinogenesis. Better charge conduction ability is further revealed by the delocalization patterns of the electronic wave functions. Although the difference between the overall averaged dispersion of wave functions calculated from real DNA sequences and that calculated from a random sequence is not large, it is significant enough to indicate that DNA sequences enclosed in a living S. cerevisiae cell nucleus have a better charge conduction ability. In other words, the possible good conductivity discussed here in terms of the tight binding model is mainly due to the delocalization of wave functions. This was first studied and pointed out by Carpena et al. in 2002 [10]. The data presented in this article are from real DNA sequences that have survived a harsh natural evolution process. Charge conduction ability is strongly affected by the order in which the four types of nucleotide bases are arranged. Better charge conductivity promises more efficient damage recognition and possibly a better chance of survival in a harsh natural environment.
This interpretation complies with arguments of previous works [10,11].

References

[1] C.K. Peng, et al., Nature (London) 356 (1992) 168.
[2] Sergey V. Buldyrev, Ary L. Goldberger, Shlomo Havlin, Chung-Kang Peng, Michael Simons, H. Eugene Stanley, Phys. Rev. E 47 (1993) 4514; Yuan-Yen Tai, Ping-Cheng Li, Hsen-Che Tseng, Physica A 369 (2006) 688.
[3] W. Li, K. Kaneko, Europhys. Lett. 17 (1992) 655.
[4] R.F. Voss, Phys. Rev. Lett. 68 (1992) 3805.
[5] A. Arneodo, et al., Phys. Rev. Lett. 74 (1995) 3293.
[6] S.V. Buldyrev, et al., Phys. Rev. E 51 (1995) 5084.
[7] H. Herzel, I. Grosse, Phys. Rev. E 55 (1997) 800.
[8] D. Holste, I. Grosse, Phys. Rev. E 67 (2003) 061913.
[9] R.G. Endres, D.L. Cox, R.R.P. Singh, Rev. Modern Phys. 76 (2004) 195.
[10] Pedro Carpena, Pedro Bernaola-Galván, Plamen Ch. Ivanov, H. Eugene Stanley, Nature 418 (2002) 955.
[11] Stephan Roche, Dominique Bicout, Enrique Maciá, Efim Kats, Phys. Rev. Lett. 91 (2003) 228101.
[12] H.E. Hurst, Proc. Am. Soc. Civ. Eng. 76 (1950) 1.
[13] Kun Hu, Plamen Ch. Ivanov, Zhi Chen, Pedro Carpena, H. Eugene Stanley, Phys. Rev. E 64 (2001) 011114.
[14] Zhi Chen, Plamen Ch. Ivanov, Kun Hu, H. Eugene Stanley, Phys. Rev. E 65 (2002) 041107.
[15] Limei Xu, Plamen Ch. Ivanov, Kun Hu, Zhi Chen, Anna Carbone, H. Eugene Stanley, Phys. Rev. E 71 (2005) 051101.
[16] A. Montagnini, et al., Phys. Lett. A 244 (1998) 237.
[17] Danny Porath, Alexey Bezryadin, Simon de Vries, Cees Dekker, Nature 403 (2000) 635.
[18] C.T. Shih, Phys. Rev. E 74 (2006) 010903.
[19] Y.A. Berlin, A.L. Burin, M.A. Ratner, Superlattices Microstruct. 28 (2000) 241.
[20] A. Voityuk, et al., J. Chem. Phys. 114 (2001) 5614.
[21] Stephan Roche, Phys. Rev. Lett. 91 (2003) 108101.
[22] Gianaurelio Cuniberti, Luis Craco, Danny Porath, Cees Dekker, Phys. Rev. B 65 (2002) 241314(R).
[23] Y.A. Berlin, A.L. Burin, M.A. Ratner, Chem. Phys. 275 (2002) 61.
[24] Chi-Tin Shih, Stephan Roche, Rudolf A. Römer, Phys. Rev. Lett. 100 (2008) 018105.