A biological inspired fuzzy adaptive window median filter (FAWMF) for enhancing DNA signal processing


Computer Methods and Programs in Biomedicine 149 (2017) 11–17

Contents lists available at ScienceDirect

Computer Methods and Programs in Biomedicine journal homepage: www.elsevier.com/locate/cmpb

Muneer Ahmad a,∗, Low Tan Jung b, Al-Amin Bhuiyan a

a College of Computer Sciences, King Faisal University, Saudi Arabia
b Department of Computer Sciences, University Technology PETRONAS, Malaysia

Article info

Article history: Received 15 April 2016; Revised 29 May 2017; Accepted 23 June 2017

Keywords: Window filter; 1/f noise; Fuzzy adaptive filter; 3-base periodicity; Digital signal processing

Abstract

Background and Objective: Digital signal processing techniques commonly employ fixed-length window filters to process signal contents. DNA signals differ in character from common digital signals because their contents are nucleotides. Nucleotides carry genetic code context and exhibit fuzzy behavior owing to their structure and order in the DNA strand. Applying conventional fixed-length window filters to DNA signals produces spectral leakage and hence signal noise. A biological-context-aware adaptive window filter is therefore required to process DNA signals.

Methods: This paper introduces a biologically inspired fuzzy adaptive window median filter (FAWMF) that computes the fuzzy membership strength of nucleotides in each slide of the window and filters nucleotides by median filtering with a combination of s-shaped and z-shaped filters. Coding regions exhibit 3-base periodicity caused by an unbalanced nucleotide distribution, which produces a relatively high bias in nucleotide usage; FAWMF exploits this fundamental characteristic to suppress signal noise.

Results: Along with the adaptive response of FAWMF, a strong correlation was observed between median nucleotides and the Π-shaped filter, which enhanced the discrimination between coding and non-coding regions compared with conventional fixed-length window filters. The proposed FAWMF improves coding region identification by 40% to 125% relative to conventional window filters, tested over more than 250 benchmarked and randomly taken DNA datasets of different organisms.

Conclusion: This study shows that conventional fixed-length window filters applied to DNA signals do not achieve significant results because nucleotides carry genetic code context. The proposed FAWMF algorithm is adaptive and significantly outperforms them in processing DNA signal contents. Applied to a variety of DNA datasets, the algorithm produced noteworthy discrimination between coding and non-coding regions, in contrast to conventional fixed-window-length filters.

© 2017 Elsevier B.V. All rights reserved.

Abbreviations: bp, base pair; SNR, signal to noise ratio; DNA, deoxyribonucleic acid; DSP, digital signal processing.
∗ Corresponding author. E-mail addresses: [email protected] (M. Ahmad), [email protected] (L.T. Jung), [email protected] (A.-A. Bhuiyan).
http://dx.doi.org/10.1016/j.cmpb.2017.06.021
0169-2607/© 2017 Elsevier B.V. All rights reserved.

1. Introduction

DNA is regarded as the repository of the hereditary information of organisms [1,2]. This genetic information is encoded in the DNA sequence by four chemical bases, Adenine, Thymine, Guanine and Cytosine (abbreviated A, T, G and C and also known as nucleotide bases) [3,4]. A DNA sequence is composed of these four letters arranged in a specific order [5,6]. Commonly, digital signals are convolved with fixed-length window filters for signal analysis, but DNA signals differ in nature and characteristics from other signals because of their nucleotide contents. DNA signals formed from DNA sequences contain nucleotides in a specific order with certain frequencies, and mostly exhibit an unbalanced nucleotide distribution. Interestingly, the nucleotides of a DNA sequence also give rise to 3-base periodicity in protein-coding sequence, which is further evidence of the biological context of DNA signals for coding region identification. Here, the coding regions (exons) are sequences of nucleotides that actually code for protein, while non-coding regions (introns) do not [7,8]. Coding region identification is tightly coupled with 1/f background noise, which diffuses the boundaries between the two regions so that reliable discrimination of coding from non-coding regions is severely hindered.
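The 3-base periodicity mentioned above is conventionally made visible by sliding a fixed-length window over a numerically encoded sequence and recording the spectral power at frequency 1/3. The following is a minimal sketch of that conventional baseline, not of FAWMF. It assumes a binary indicator (Voss-style) encoding of the four bases and a rectangular window; the function name `period3_power` and its parameters are hypothetical and chosen for illustration only.

```python
import cmath


def period3_power(sequence, window=351, step=3):
    """Slide a fixed-length rectangular window over a DNA string and return
    (start_index, power) pairs, where power is the summed squared DFT
    magnitude at frequency 1/3 over the four base-indicator sequences.
    The indicator encoding is an assumption of this sketch."""
    sequence = sequence.upper()
    # e^{-j*2*pi*n/3} takes only three distinct values, so precompute them.
    twiddle = [cmath.exp(-2j * cmath.pi * n / 3) for n in range(3)]
    result = []
    for start in range(0, len(sequence) - window + 1, step):
        segment = sequence[start:start + window]
        power = 0.0
        for base in "ACGT":
            # DFT coefficient at frequency 1/3 of the binary indicator of `base`.
            coeff = sum(twiddle[n % 3] for n, b in enumerate(segment) if b == base)
            power += abs(coeff) ** 2
        result.append((start, power))
    return result


periodic = period3_power("ATG" * 117)   # strongly 3-periodic toy sequence
flat = period3_power("A" * 351)         # toy sequence without codon structure
print(periodic[0][1] > flat[0][1])      # the periodic sequence gives the larger value
```

Because the window length is fixed in this sketch, exons much shorter or much longer than the window are handled poorly, which is the limitation the paper sets out to address.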


Table 1
Conventional window filters used for coding regions identification.

Author(s)                      Window filter used   Proposed window length
Ahmad et al. [12]              Kaiser               351 bp
Singha Roy and Barman [13]     Blackman             100 bp
Marhon and Kremer [14]         Wavelet              150 bp, 1500 bp, 6000 bp
Zhang et al. [15]              Gaussian             90 bp
Ahmad [16]                     Kaiser               351 bp
Sahu and Panda [17]            Rectangular          351 bp
Shakya et al. [18]             Bartlett             351 bp
Hota and Srivastava [19]       Rectangular          351 bp
Chavan et al. [20]             Kaiser               351 bp
Bergen and Antoniou [21]       Rectangular          351 bp
Andreas [22]                   Kaiser               351 bp
Hota and Srivastava [23]       Rectangular          351 bp
Oppenheim and Schafer [24]     Kaiser               351 bp
Tiwari et al. [25]             Rectangular          351 bp
Nair and Sreenadhan [26]       Kaiser               351 bp
Anastassiou [27]               Rectangular          351 bp
Kotlar and Lavner [28]         Rectangular          351 bp
Akhtar et al. [29]             Rectangular          351 bp
Gunawan [30]                   Bartlett             351 bp
Datta and Asif [31]            Rectangular          351 bp
Kakumani et al. [32]           Rectangular          351 bp
Tuqan and Rushdi [33]          Rectangular          234 bp
Datta and Asif [34]            Bartlett             351 bp
Akhtar et al. [35]             Rectangular          351 bp
Mena-Chalco et al. [36]        Gaussian             351 bp
George and Thomas [37]         Rectangular          351 bp
Abbasi et al. [38]             Hamming              351 bp

Reliable identification of protein coding regions depends strongly on applying an appropriate window filter that enhances the identification and suppresses signal noise. The literature shows that conventional window filters [12–39] have mainly been used for protein coding region identification in digital signal processing approaches. We reviewed the literature for the window filters commonly used in DNA signal processing for identifying protein coding regions. A series of published papers on coding region identification report the use of conventional window filters of some fixed length. In contrast, we could not find satisfactory literature on window filters based on the genetic context of the code [9] and on the unbalanced nucleotide distribution that produces a high bias in nucleotide usage in coding regions [10,11]. Table 1 reviews the window filters, with their proposed lengths, employed for DNA signal processing, i.e. protein coding region identification. Rectangular, Kaiser and Bartlett windows have been used most often, with a fixed window length of 351 base pairs.

Window filters of a suitable length play a very important role in digital-processing-based approaches to coding region identification. A comprehensive analysis of conventional window filters for coding region identification was described in [39] using the benchmark DNA sequence AF099922 [18,19,37,38] at different window lengths. This analysis showed that conventional window functions with different window lengths identify coding regions quite differently from one another. Yin and Yau [40] observed that a small window size produces more statistical oscillations, which result in prediction errors, while large window sizes may miss small coding and non-coding regions. We observed that smaller window lengths (e.g. 120 bp) do not suppress 1/f noise significantly, so that either the relative peak of a coding region is very low or the non-coding regions express themselves more strongly than the coding regions. A window length of 240 bp gives better results than 120 bp since it suppresses the noise and reveals the peaks of coding regions somewhat better. In contrast, window filters with a length of 351 bp identify coding regions to a better extent by suppressing 1/f noise.

Conventional window filters have mostly been employed with a variety of digital signals, but a DNA signal contains biologically inspired nucleotide data in which each nucleotide holds a special genetic code context and its distribution is highly biased in coding regions. These special characteristics of nucleotides in codons imply that a conventional window filter (especially one with a fixed window size) does not suppress the 1/f background noise to a significant extent, which results in feeble discrimination between coding and non-coding regions.

2. Methodology

We propose a novel fuzzy adaptive window median filter (FAWMF) that embodies genetically meaningful characteristics of nucleotides in codons, i.e. nucleotide density distribution, the specific positions of nucleotides in codons and nucleotide usage in terms of their distribution [12]. A codon is a tri-nucleotide structure in which each nucleotide carries a specific genetic context that differentiates it from other nucleotides. Because of this characteristic, all codons differ from each other. Further, nucleotides, as constituents of a host codon, exhibit a density distribution, a position and an associated nucleotide usage. Such fundamental features of nucleotides can be exploited to design more meaningful solutions for DNA signal processing, i.e. protein coding region identification. For instance, a codon is a tri-nucleotide structure based on Adenine, Guanine, Thymine and Cytosine. Naturally, codons exhibit fuzzy behavior since the membership values of nucleotides in codons differ depending on the density of nucleotides. Some nucleotides express themselves more in one codon cluster while the same nucleotide may have weaker or no strength in other clusters.
Such variations can only be described by z-shaped membership in cluster space, which captures the similarity association between heterogeneous and disjoint clusters. This reflects the nature of codon clusters: some clusters are totally disjoint, while all the others share some common density distribution. We can define membership values for clusters that share certain densities. For instance, clusters with nucleotides contributing twice achieve a membership value of "2/3", while those with a single contribution achieve a membership of "1/3". A nucleotide that makes no physical contribution to a cluster receives a membership value of "0".

The strongest motivation for introducing a new fuzzy window median filter is that, within the DNA sequence, nucleotides are arranged at specific positions and in a specific order in codons, and 3-base periodicity is caused by an unbalanced nucleotide distribution that produces a relatively high bias in nucleotide usage in coding regions. FAWMF exploits this fundamental characteristic to suppress signal noise to a significant extent. Secondly, since exons are diffused in high 1/f noise caused by long-range introns, fixed-length window filters cannot guarantee enhanced identification of exons in certain regions of a DNA sequence. For instance, a conventional window filter of fixed length 351 bp moved over a DNA sequence may miss some short exons; likewise, a window filter of fixed length 120 bp may miss some long exons that are longer than the window. With FAWMF, a change in segment length does not change the uniformity of the membership strength of nucleotides or the spectral response of the segment. The 1/f background noise in a DNA sequence arises from the strong diffusion of coding regions with non-coding regions, which ultimately results in spectral leakage


Fig. 1. Fuzzy adaptive window with median filtering (FAWMF). Flowchart steps:

Start
Take a window (W) of size L consisting of X_m (m = 1, 2, …, N) data points such that 1 ≤ m ≤ L.
Calculate the standard deviation (σ), mean (μ), minimum value (m1) and maximum value (m2) of W.
Calculate the loop step size (d) such that d = (m2 − m1)/L.
Repeat from m1 to m2 with step size (d) and set the variable j = ∑ X_m.
Decision: is d ≤ j? (No / Yes)
If yes, calculate the fuzzy membership value using the s-shaped and z-shaped functions combined by relation | |.
Update vector T(m) with the new membership value and increment m.
Update the step size d with d = d + m1.
Vector T is the required FAWMF.
End

G, C}) can be defined as the ratio of the number of codons with segment value $d_m$ to the total number of data points in the window, $p(r_m) = d_m / N$, where $r_m$ is the mth segment, $d_m$ is the number of data items with that string value, $N$ is the total number of data items in the window and $\sum_m p(r_m) = 1$.
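As a small illustration of the ratio p(r_m) = d_m / N, the sketch below counts the symbols in a window and normalizes by the window size. It is a simplified reading of the definition: it treats each single nucleotide as a segment, and the function name `segment_probabilities` is hypothetical.

```python
from collections import Counter


def segment_probabilities(window):
    """Return {symbol: d_m / N} for every symbol present in `window`,
    where d_m is the count of the symbol and N is the window size."""
    counts = Counter(window.upper())
    total = len(window)
    return {symbol: count / total for symbol, count in counts.items()}


probs = segment_probabilities("ATACCCA")   # nucleotide excerpt listed in Table 2
print(probs)
print(sum(probs.values()))                 # sums to 1 up to floating-point rounding
```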

The proposed fuzzy median filter is organized with fuzzy rules that determine the strength of a signal at any sampling instant from the neighborhood of that point. The filter is designed on the following notions:

1. The L sampling points are stored in descending or ascending order. They are determined from the amplitude of the signal at the neighboring points X_m.
2. For each data point X_m, a fuzzy membership value is computed through a Π-shaped membership function. The Π-shaped membership function employed for the fuzzy median filter possesses the following characteristics:
   i. The maximum and minimum amplitude values are assigned a membership degree of 0.
   ii. The mean value of the data points is assigned a membership degree of 1.

The membership function is a Π-shaped curve that defines how each data point in the input space is mapped to a degree of membership between 0 and 1. The Π-shaped membership function is constructed from s-shaped and z-shaped curves, respectively expressed by:

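\[
s(x_m; x_l, x_r) =
\begin{cases}
0, & x_m \le x_l \\
2\left(\dfrac{x_m - x_l}{x_r - x_l}\right)^2, & x_l \le x_m \le \dfrac{x_l + x_r}{2} \\
1 - 2\left(\dfrac{x_m - x_r}{x_r - x_l}\right)^2, & \dfrac{x_l + x_r}{2} \le x_m \le x_r \\
1, & x_m \ge x_r
\end{cases}
\tag{2.1}
\]

\[
z(x_m; x_l, x_r) =
\begin{cases}
1, & x_m \le x_l \\
1 - 2\left(\dfrac{x_m - x_l}{x_r - x_l}\right)^2, & x_l \le x_m \le \dfrac{x_l + x_r}{2} \\
2\left(\dfrac{x_m - x_r}{x_r - x_l}\right)^2, & \dfrac{x_l + x_r}{2} \le x_m \le x_r \\
0, & x_m \ge x_r
\end{cases}
\tag{2.2}
\]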

where the parameters xl and xr locate the extremes of the sloped region of the curve representing the left and right breakpoints, respectively.
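The two curves in Eqs. (2.1) and (2.2) can be sketched in a few lines of code. The s-shaped and z-shaped functions below follow the equations directly. The way they are combined into a Π-shaped function is an assumption of this sketch: it uses the common product form, rising from the minimum to the mean and falling from the mean to the maximum, since the exact combining relation is not legible in the text. The names `s_shaped`, `z_shaped` and `pi_shaped` are hypothetical.

```python
def s_shaped(x, xl, xr):
    """s-shaped membership of Eq. (2.1): 0 at or below xl, 1 at or above xr."""
    if x <= xl:
        return 0.0
    if x >= xr:
        return 1.0
    if x <= (xl + xr) / 2:
        return 2 * ((x - xl) / (xr - xl)) ** 2
    return 1 - 2 * ((x - xr) / (xr - xl)) ** 2


def z_shaped(x, xl, xr):
    """z-shaped membership of Eq. (2.2), the mirror image of the s-shaped curve."""
    return 1.0 - s_shaped(x, xl, xr)


def pi_shaped(x, lowest, peak, highest):
    """Assumed product form: rises from `lowest` to `peak`, falls to `highest`."""
    return s_shaped(x, lowest, peak) * z_shaped(x, peak, highest)


# Minimum, mean and maximum taken from the worked example in Section 2.1.
m1, mu, m2 = 0.6874, 0.850083333, 1.2299
print(pi_shaped(m1, m1, mu, m2))   # membership 0 at the minimum
print(pi_shaped(mu, m1, mu, m2))   # membership 1 at the mean
print(pi_shaped(m2, m1, mu, m2))   # membership 0 at the maximum
```

With this construction the extremes of the window receive membership 0 and the mean receives membership 1, matching characteristics (i) and (ii) listed above.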

Fig. 2. Structure of nucleotides' data (window W of length L = 2m + 1 over data points 1, 2, …, m, m+1, m+2, …, N).

in the frequency analysis of the signal. This phenomenon shows a correlation between 1/f noise and spectral leakage.

2.1. Fuzzy adaptive window median filter (FAWMF)

In the proposed FAWMF, fuzzy rules are applied along with median filtering to suppress signal noise and hence achieve enhanced identification of coding regions. The flowchart of FAWMF is shown in Fig. 1. Let us consider a window W of size L with data points X_m (m = 1, 2, …, N) such that 1 ≤ m ≤ L, as shown in Fig. 2. The probabilities p(r_m) of codons containing segments r_m (where r_m ∈ {A, T,

3. Assume 2 × k + 1 data points (k ≤ L/2), where k is the range of the data set, that is, the number of candidate data points in the median calculation of the list (the median value together with the k preceding and k following entries of the sorted list).
4. Determine the amplitude that provides the highest membership value and take it as the output.

We describe here the simulation of the FAWMF algorithm over the Homo sapiens mitochondrion gene, which contains 16,000 bp and 13 coding regions. The nucleotide bases in this gene sequence are encoded using the fuzzy encoding sequence [12] to form a vector containing the digital contents of the signal, as shown in Table 2. We initially select a window length (L) of 401 points and move this window over the encoded signal. In the first iteration of the algorithm, the mean value (μ) is 0.850083333, the standard deviation (σ) is 0.133729429, and the minimum value (m1) and maximum value (m2) are 0.6874 and 1.2299, respectively. The loop step size (d) is 0.0135625 and the fuzzy membership value for this iteration is 0.053349331. In the second iteration, the loop step size is incremented and we obtain a new membership value of 0.053769913. The resultant vector (T) contains the membership strengths of all the data points that correspond to these high membership values over the entire window length of the input signal.
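The four notions of the fuzzy median filter (sort the window, assign Π-shaped memberships, keep the 2k + 1 entries around the median, output the amplitude with the highest membership) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Π-shaped function is taken as the product of the s-shaped and z-shaped curves, window edges are handled by truncation, the window length is fixed rather than adaptive, and all function names are hypothetical.

```python
from statistics import mean


def _s(x, xl, xr):
    """s-shaped curve of Eq. (2.1)."""
    if x <= xl:
        return 0.0
    if x >= xr:
        return 1.0
    if x <= (xl + xr) / 2:
        return 2 * ((x - xl) / (xr - xl)) ** 2
    return 1 - 2 * ((x - xr) / (xr - xl)) ** 2


def pi_membership(x, lowest, peak, highest):
    """Assumed product form: 0 at the extremes, 1 at `peak` (the window mean)."""
    return _s(x, lowest, peak) * (1.0 - _s(x, peak, highest))


def fuzzy_median(window, k=1):
    """Return the candidate near the median with the highest membership."""
    ordered = sorted(window)                          # notion 1: sorted samples
    lowest, highest, mu = ordered[0], ordered[-1], mean(ordered)
    mid = len(ordered) // 2
    candidates = ordered[max(0, mid - k):mid + k + 1]  # notion 3: 2k + 1 candidates
    # Notions 2 and 4: score each candidate and keep the best one.
    return max(candidates, key=lambda x: pi_membership(x, lowest, mu, highest))


def fuzzy_median_filter(signal, length=5, k=1):
    """Slide a fixed-length window over `signal` (edges truncated)."""
    half = length // 2
    return [fuzzy_median(signal[max(0, i - half):i + half + 1], k)
            for i in range(len(signal))]


encoded = [0.8957, 0.6874, 0.8957, 0.8672, 0.8672, 0.8672, 0.8957]  # Table 2 excerpt
print(fuzzy_median(encoded, k=1))
print(fuzzy_median_filter([1, 1, 1, 9, 1, 1, 1], length=5, k=1))    # isolated spike
```

Restricting the choice to entries near the median keeps the outlier rejection of a median filter, while the membership score prefers the candidate closest to the window mean.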

Table 2
Encoded sequence along with window contents.

Nucleotide sequence   a       t       a       c       c       c       a       –
Encoded sequence      0.8957  0.6874  0.8957  0.8672  0.8672  0.8672  0.8957  –
Window (W)            0.8957  0.6874  0.8957  0.8672  0.8672  0.8672  0.8957  –

Fig. 3. Frame segments of windows and PSD estimation.

Further, we convolve the window with the encoded signal and calculate the segmented frames of the signal. Fig. 3(A) shows 21 frames of window segments (the total number of frames depends on the adaptive window size) resulting from convolution with the signal, together with the power spectral density (PSD) estimation of the frames. Fig. 3(B) presents another 21 frames of PSD generated by convolution of the window with the signal. This helps to determine which regions of the signal are likely to contain coding regions. Fig. 4 presents the correct identification of 13 exons in the complete gene. The peaks E1 to E13 mark the locations of the exons identified in the signal.

Fig. 4. PSD estimation of Homo sapiens mitochondrion gene.

3. Results

The performance of the different window filters for coding region identification was evaluated at the nucleotide level. The following evaluation measures were employed:

\[
\text{Sensitivity } (Sn) = \frac{TP}{TP + FN}
\tag{1}
\]

\[
\text{Specificity } (Sp) = \frac{TP}{TP + FP}
\tag{2}
\]

\[
\text{Prediction accuracy } (P) = \frac{TP + TN}{TP + FP + TN + FN}
\tag{3}
\]

\[
\text{Approximate correlation } (AC) = (ACP - 0.5) \times 2
\tag{4}
\]

where

\[
ACP = \frac{1}{4}\left[\frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FN} + \frac{TN}{TN + FP}\right]
\tag{5}
\]
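The evaluation measures of Eqs. (1) to (5) can be computed directly from nucleotide-level counts, as in the short sketch below. The function name `evaluation_measures` and the example counts are hypothetical and serve only to show the arithmetic; the sketch assumes every denominator is non-zero.

```python
def evaluation_measures(tp, fp, tn, fn):
    """Compute Sn, Sp, P, ACP and AC from nucleotide-level counts,
    following Eqs. (1) to (5). Assumes all denominators are non-zero."""
    sn = tp / (tp + fn)                         # Eq. (1)
    sp = tp / (tp + fp)                         # Eq. (2)
    p = (tp + tn) / (tp + fp + tn + fn)         # Eq. (3)
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp)
                  + tn / (tn + fn) + tn / (tn + fp))   # Eq. (5)
    ac = (acp - 0.5) * 2                        # Eq. (4)
    return {"Sn": sn, "Sp": sp, "P": p, "ACP": acp, "AC": ac}


# Illustrative counts only (not taken from the paper's datasets).
scores = evaluation_measures(tp=80, fp=20, tn=880, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

AC rescales ACP from the interval [0, 1] to [−1, 1], so a perfect prediction gives AC = 1.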

Sensitivity (Sn), also called the true positive rate, measures the proportion of coding regions (exons) that are correctly identified, while Specificity (Sp), as defined in Eq. (2), measures the proportion of regions predicted as coding that are actually coding [17,18,36,38]. Prediction accuracy (P) is another useful evaluation measure that combines Sensitivity (Sn) and Specificity (Sp) [38], while approximate correlation (AC) is also employed because the Prediction accuracy (P) may sometimes fail to discriminate coding from non-coding regions when Sensitivity (Sn) is high and Specificity (Sp) is low, or vice versa. Table 3 describes the datasets used for the performance evaluation of the different window filters. The benchmarked DNA datasets used for the experiments are HMR195, S. cerevisiae chromosome III [18,19,37,38] and HUMHBB (human beta globin) [36]; the remaining datasets are randomly taken DNA sequences of organisms. The performance analysis can be seen in Table 4, which presents the evaluation of the window filters in terms of specificity (Sp), prediction accuracy (P), approximate correlation (AC), false positive rate (FP) and signal to noise ratio (SNR) at different window lengths.

Table 3
Description of datasets used for performance evaluation.

Organism                         No. of sequences              Average sequence length (bp)
Homo sapiens                     15                            4500
Serinus canaria                  20                            6250
Nicotiana sylvestris             17                            3700
Yersinia pestis                  1                             4000
Limulus polyphemus               20                            7880
Felis catus                      23                            4925
Vicugna pacos                    18                            2120
Sus scrofa mitochondrion         1                             8000
Cricetulus griseus               15                            5812
Tursiops truncates               18                            3460
Ornithorhynchus anatinus         20                            3933
Mus musculus domesticus          1                             7700
Meleagris gallopavo              25                            2732
Canis lupus mitochondrion        1                             7800
Galeopterus variegatus           18                            2267
Nicotiana tomentosiformis        20                            4200
S. cerevisiae chromosome III     1                             8000
Human, Mouse and Rat (HMR195)    103, 82 and 10 respectively   7096
Human (Beta Globin HUMHBB)       1                             73,326

Table 4
Performance evaluation at different window lengths (60 bp, 120 bp, 240 bp, 351 bp, 460 bp and 600 bp). Upper panel: mean specificity μ(Sp), prediction accuracy μ(P) and approximate correlation μ(AC). Lower panel: mean false positive rate μ(FP) and signal to noise ratio μ(SNR). Vertical axis: extent. Window filters compared: Bartlett, Blackman, Rectangular, Hamming, Hann, Taylorwin, Triangular, Kaiser and FAWMF.

4. Discussions

Protein coding regions are diffused with non-coding regions, and reliable identification of such regions suffers from exon-intron mixed signal noise [13,14,41]. The suppression of signal noise correlates with improved identification of coding regions. An optimal digital filter is convolved with the DNA signal and significantly discriminates the boundaries of coding and non-coding regions [12]. Conventional window filters employed for enhancing protein coding regions lack the representation and implementation of biologically inspired nucleotide data [39]. Yin and Yau [40] observed that a small window size produces more statistical oscillations, which result in prediction errors, while large window sizes may miss small coding and non-coding regions. Smaller windows (i.e. 60 bp, 120 bp and 240 bp) give lower values of the performance parameters. The extent of identification gradually increases with window length and reaches a maximum at a window length of 351 bp [12,39]. Similarly, we found a decrease in prediction accuracy beyond 351 bp for fixed-length windows. A notable enhancement in identification was observed with FAWMF owing to its adaptability, which minimizes spectral leakage and signal noise.

Using randomly taken datasets, we calculated the mean signal to noise ratio (SNR) of each window filter at different window sizes to reveal the noise suppression tendency of the window filters. We noticed only a slight variation in SNR among the existing window filters at window lengths of 60 bp and 120 bp, but a clear difference in comparison with the FAWMF window filter. At a window length of 351 bp (chosen by most researchers for coding region identification) [16], the SNR of the Kaiser window filter was the most prominent among the existing window filters, while a remarkable gain in SNR was observed with the proposed FAWMF window filter compared with the Kaiser window filter. We observed that the signal to noise ratio increases from smaller to relatively larger window lengths, with the maximum SNR achieved at a window length of 351 bp, and gradually decreases beyond 351 bp. The same phenomenon was observed for the false positive rate. The false positive rate decreases with increasing window size and the optimal false positive rate is achieved around 351 bp. The same rate decreases for larger window sizes for conventional window filters since they lack the biological aspect of nucleotides being represented as contents of the window filter. It is therefore more appropriate to state that a window filter based on the genetic code context of nucleotides would ensure significant noise suppression compared with conventional window filters.

Further, we performed an ANOVA test to analyze the variance in the results for the different evaluation parameters (i.e. prediction accuracy, specificity, approximate correlation, false positive rate and signal to noise ratio). We observed p-values lower than 0.05 for the F-statistic of the ANOVA for the evaluation parameters. This indicates that one or more window filters differ significantly from the others (rejecting the null hypothesis that all window filters achieve the same performance). Since ANOVA alone cannot show which window filters differ significantly, we further performed a post-hoc Tukey HSD test to identify them. The FAWMF window filter achieved significant p-values (p < 0.01) in comparison with the conventional window filters.

Table 5
Fuzzy membership strength and spectral response of window segments. Left panel: fuzzy membership values of nucleotides (magnitude) for a window segment of length 351 bp. Right panel: spectral response (spectral density) of a window segment of length 351 bp. Horizontal axis: data points in the window segment.

Another very important aspect of a window filter is its adaptability to variations in its length [39]. The proposed FAWMF window filter is highly adaptive in its convolution with the DNA signal because of its biological structure. FAWMF is based on genetic code context and outperforms at all window lengths. Table 5 presents the fuzzy membership strength of nucleotides and the corresponding spectral response of window segments of different lengths. The red line approximates a uniform normal distribution of the nucleotides' strength and the spectral estimate of the segments. The fuzzy membership distribution of nucleotides in window segments is highly correlated with the spectral response of the segments. A uniform smooth fuzzy distribution yields the same spectral response at different segment sizes. Further, a change in segment length does not change the uniformity of the membership strength of nucleotides or the spectral response of the segment. A small window size produces more statistical


oscillations, which result in prediction errors, while large window sizes may miss small coding and non-coding regions [40]; the FAWMF algorithm, however, performs well at different window sizes, revealing a uniform membership distribution with the same smooth spectral response.

5. Conclusion

Conventional window filters applied to DNA signals do not achieve significant results. This paper presented a novel biologically inspired fuzzy adaptive window median filter (FAWMF) based on the genetic code context of nucleotides. We applied FAWMF to long noisy DNA sequences to enhance coding region identification. The FAWMF algorithm computes the fuzzy membership strength of nucleotides and filters nucleotides by median filtering with a combination of s-shaped and z-shaped filters. The algorithm proved very useful for tracing both short-range and long-range coding regions in a variety of noisy sequences owing to its adaptive response. More than 250 benchmarked and randomly taken DNA datasets of different organisms were employed for the performance evaluation of the different window filters. The proposed window filter outperformed the conventional fixed-window-length filters and produced significant discrimination between coding and non-coding regions.

References

[1] D. Anastassiou, Genomic signal processing, IEEE Signal Process. Mag. 18 (4) (2001) 8–20.
[2] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, Molecular Biology of the Cell, Garland Science, New York, 2002.
[3] Z. Ignatova, I. Martínez-Pérez, K.H. Zimmermann, DNA Computing Models, Springer Science & Business Media, 2008.
[4] F. Brueckner, K.J. Armache, A. Cheung, G.E. Damsma, H. Kettenberger, E. Lehmann, J. Sydow, P. Cramer, Structure–function studies of the RNA polymerase II elongation complex, Acta Crystallogr. Sect. D: Biol. Crystallogr. 65 (2) (2009) 112–120.
[5] M. Long, E. Betrán, K. Thornton, W. Wang, The origin of new genes: glimpses from the young and old, Nat. Rev. Genet. 4 (11) (2003) 865–875.
[6] A.A. Turanov, A.V. Lobanov, D.E. Fomenko, H.G. Morrison, M.L. Sogin, L.A. Klobutcher, D.L. Hatfield, V.N. Gladyshev, Genetic code supports targeted insertion of two amino acids by one codon, Science 323 (5911) (2009) 259–261.
[7] E. Coward, Equivalence of two Fourier methods for biological sequences, J. Math. Biol. 36 (1) (1997) 64–70.
[8] Z. Wang, Y. Chen, Y. Li, A brief review of computational gene prediction methods, Genomics Proteomics Bioinf. 2 (4) (2004) 216–221.
[9] I. Wasito, I. Veritawati, Fractal dimension approach for clustering of DNA sequences based on internucleotide distance, in: 2013 International Conference of Information and Communication Technology (ICoICT), IEEE, 2013, March, pp. 82–87.
[10] J.W. Fickett, Recognition of protein coding regions in DNA sequences, Nucleic Acids Res. 10 (17) (1982) 5303–5318.
[11] C. Yin, S.S.T. Yau, A Fourier characteristic of coding sequences: origins and a non-Fourier approximation, J. Comput. Biol. 12 (9) (2005) 1153–1165.
[12] M. Ahmad, L.T. Jung, M.A.-A. Bhuiyan, On fuzzy semantic similarity measure for DNA coding, Comput. Biol. Med. 69 (2016) 144–151.
[13] S. Singha Roy, S. Barman, Polyphase filtering with variable mapping rule in protein coding region prediction, Microsyst. Technol. 22 (167) (2016) 1–11.
[14] S.A. Marhon, S.C. Kremer, Prediction of protein coding regions using a wide-range wavelet window method, IEEE/ACM Trans. Comput. Biol. Bioinf. 13 (4) (2016) 742–753.
[15] X. Zhang, Z. Shen, G. Zhang, Y. Shen, M. Chen, J. Zhao, R. Wu, Short exon detection via wavelet transform modulus maxima, PLOS ONE 11 (9) (2016) e0163088.
[16] M. Ahmad, A biologically-inspired computational solution for protein coding regions identification in noisy DNA sequences, in: Biologically-Inspired Energy Harvesting through Wireless Sensor Technologies, IGI Global, 2016, pp. 201–216.
[17] S.S. Sahu, G. Panda, Identification of protein-coding regions in DNA sequences using a time-frequency filtering approach, Genomics Proteomics Bioinf. 9 (1) (2011) 45–55.
[18] D.K. Shakya, R. Saxena, S.N. Sharma, A DSP-based approach for gene prediction in eukaryotic genes, Int. J. Electr. Eng. Inf. 3 (4) (2011) 480–487.
[19] M.K. Hota, V.K. Srivastava, DSP technique for gene and exon prediction taking EIIP indicator sequence, in: Proceedings of the Second International Conference on Information Processing, 2008, January, pp. 117–123.
[20] M.S. Chavan, R.A. Agarwala, M.D. Uplane, Use of Kaiser window for ECG processing, in: Proceedings of the 5th WSEAS International Conference on Signal Processing, Robotics and Automation, Madrid, Spain, 2006, February.
[21] S.W. Bergen, A. Antoniou, Application of parametric window functions to the STDFT method for gene prediction, in: Proceedings on Communication, Computers and Signal Processing (IEEE-PACRIM05), 2005, pp. 324–327.
[22] A. Andreas, Digital Signal Processing: Signals, Systems, and Filters, McGraw-Hill, New York, 2006, ISBN 10: 0070636338.
[23] M.K. Hota, V.K. Srivastava, Performance analysis of different DNA to numerical mapping techniques for identification of protein coding regions using tapered window based short-time discrete Fourier transform, in: 2010 International Conference on Power, Control and Embedded Systems (ICPCES), IEEE, 2010, November, pp. 1–4.
[24] A.V. Oppenheim, R.W. Schafer, Discrete-time Signal Processing, Pearson Higher Education, 2010.
[25] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, R. Ramaswamy, Prediction of probable genes by Fourier analysis of genomic sequences, Comput. Appl. Biosci.: CABIOS 13 (3) (1997) 263–270.
[26] A.S. Nair, S.P. Sreenadhan, A coding measure scheme employing electron-ion interaction pseudopotential (EIIP), Bioinformation 1 (6) (2006) 197–202.
[27] D. Anastassiou, Frequency-domain analysis of biomolecular sequences, Bioinformatics 16 (12) (2000) 1073–1081.
[28] D. Kotlar, Y. Lavner, Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions, Genome Res. 13 (8) (2003) 1930–1937.
[29] M. Akhtar, J. Epps, E. Ambikairajah, Signal processing in sequence analysis: advances in eukaryotic gene prediction, IEEE J. Sel. Top. Signal Process. 2 (3) (2008) 310–321.
[30] T.S. Gunawan, On the optimal window shape for genomic signal processing, in: International Conference on Computer and Communication Engineering, 2008 (ICCCE 2008), IEEE, 2008, May, pp. 252–255.
[31] S. Datta, A. Asif, A fast DFT based gene prediction algorithm for identification of protein coding regions, in: ICASSP (5), 2005, March, pp. 653–656.
[32] R. Kakumani, V. Devabhaktuni, M.O. Ahmad, Prediction of protein-coding regions in DNA sequences using a model-based approach, in: 2008 IEEE International Symposium on Circuits and Systems, IEEE, 2008, May, pp. 1918–1921.
[33] J. Tuqan, A. Rushdi, A DSP approach for finding the codon bias in DNA sequences, IEEE J. Sel. Top. Signal Process. 2 (3) (2008) 343–356.
[34] S. Datta, A. Asif, DFT based DNA splicing algorithms for prediction of protein coding regions, in: Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, 2004, vol. 1, IEEE, 2004, November, pp. 45–49.
[35] M. Akhtar, J. Epps, E. Ambikairajah, On DNA numerical representations for period-3 based exon prediction, in: 2007 IEEE International Workshop on Genomic Signal Processing and Statistics, IEEE, 2007, June, pp. 1–4.
[36] J. Mena-Chalco, H. Carrer, Y. Zana, R.M. Cesar Jr., Identification of protein coding regions using the modified Gabor-wavelet transform, IEEE/ACM Trans. Comput. Biol. Bioinf. 5 (2) (2008) 198–207.
[37] T.P. George, T. Thomas, Discrete wavelet transform de-noising in eukaryotic gene splicing, BMC Bioinf. 11 (1) (2010) 1.
[38] O. Abbasi, A. Rostami, G. Karimian, Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform, BMC Bioinf. 12 (1) (2011) 1.
[39] M. Ahmad, L.T. Jung, A.A. Bhuiyan, From DNA to protein: why genetic code context of nucleotides for DNA signal processing? A review, Biomed. Signal Process. Control 34 (2017) 44–63.
[40] C. Yin, S.S.T. Yau, Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence, J. Theor. Biol. 247 (4) (2007) 687–694.
[41] G. Liu, Y. Luan, Identification of protein coding regions in the eukaryotic DNA sequences based on Marple algorithm and wavelet packets transform, Abstr. Appl. Anal. 2014 (2014, July).