A biological inspired fuzzy adaptive window median filter (FAWMF) for enhancing DNA signal processing

Computer Methods and Programs in Biomedicine 149 (2017) 11–17 Contents lists available at ScienceDirect Computer Methods and Programs in Biomedicine...


Computer Methods and Programs in Biomedicine journal homepage: www.elsevier.com/locate/cmpb

Muneer Ahmad a,∗, Low Tan Jung b, Al-Amin Bhuiyan a

a College of Computer Sciences, King Faisal University, Saudi Arabia
b Department of Computer Sciences, University Technology PETRONAS, Malaysia

Article info

Article history: Received 15 April 2016; Revised 29 May 2017; Accepted 23 June 2017

Keywords: Window filter; 1/f noise; Fuzzy adaptive filter; 3-base periodicity; Digital signal processing

Abstract

Background and Objective: Digital signal processing techniques commonly employ fixed-length window filters to process signal contents. DNA signals differ in characteristics from common digital signals because their contents are nucleotides. The nucleotides carry genetic code context and exhibit fuzzy behavior due to their particular structure and order in the DNA strand. Employing conventional fixed-length window filters for DNA signal processing produces spectral leakage and hence results in signal noise. A biological-context-aware adaptive window filter is required to process DNA signals.

Methods: This paper introduces a biologically inspired fuzzy adaptive window median filter (FAWMF) which computes the fuzzy membership strength of nucleotides in each slide of the window and filters nucleotides based on median filtering with a combination of s-shaped and z-shaped filters. Since coding regions cause 3-base periodicity through an unbalanced nucleotide distribution that produces a relatively high bias in nucleotide usage, this fundamental characteristic of nucleotides has been exploited in FAWMF to suppress the signal noise.

Results: Along with the adaptive response of FAWMF, a strong correlation between median nucleotides and the shaped filter was observed, which produced enhanced discrimination between coding and non-coding regions in contrast to conventional fixed-length window filters. The proposed FAWMF attains a significant enhancement in coding region identification, i.e. 40% to 125%, compared with other conventional window filters, tested over more than 250 benchmarked and randomly taken DNA datasets of different organisms.

Conclusion: This study shows that conventional fixed-length window filters applied to DNA signals do not achieve significant results, since the nucleotides carry genetic code context. The proposed FAWMF algorithm is adaptive and performs significantly better in processing DNA signal contents. Applied to a variety of DNA datasets, the algorithm produced noteworthy discrimination between coding and non-coding regions in contrast to conventional filters of fixed window length.

© 2017 Elsevier B.V. All rights reserved.

Abbreviations: bp, base pair; SNR, signal to noise ratio; DNA, deoxyribonucleic acid; DSP, digital signal processing.
∗ Corresponding author. E-mail addresses: [email protected] (M. Ahmad), [email protected] (L.T. Jung), [email protected] (A.-A. Bhuiyan).
http://dx.doi.org/10.1016/j.cmpb.2017.06.021
0169-2607/© 2017 Elsevier B.V. All rights reserved.

1. Introduction

DNA is regarded as a repository that carries the hereditary information of organisms [1,2]. This genetic information is encoded in the DNA sequence in the form of four chemical bases called Adenine, Thymine, Guanine and Cytosine (abbreviated A, T, G and C, and also known as nucleotide bases) [3,4]. A DNA sequence is composed of these four letters arranged in a specific order [5,6]. Commonly, digital signals are convolved with fixed-length window filters for signal analysis, but DNA signals differ in nature and characteristics from other signals because of their nucleotide contents. DNA signals formed from DNA sequences contain a specific order of nucleotides with certain frequencies and mostly show an unbalanced nucleotide distribution. Interestingly, the nucleotides of a DNA sequence also cause 3-base periodicity while forming a protein sequence, which is further evidence of the biological context of DNA signals in terms of coding region identification. Here, the coding regions (exons) are sequences of nucleotides that actually code for protein, while non-coding regions (introns) do not code for protein [7,8]. Coding region identification is tightly coupled with 1/f background noise, which diffuses the boundaries of the two regions in such a way that viable discernment of coding regions from non-coding regions is severely hindered.
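The conventional approach mentioned above, a fixed-length window moved along the sequence while the strength of the 3-base periodicity is measured, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the binary indicator mapping of the four bases, the rectangular window, and the function names `period3_power` and `sliding_period3_power` are choices made for this example, and the default length of 351 bp is simply the value most often quoted in the text.

```python
import cmath


def period3_power(segment):
    """Spectral power at frequency 1/3 (DFT bin k = N/3) of one segment.

    Assumption: each base is mapped to a binary indicator sequence
    (1 where the base occurs, 0 elsewhere). The paper itself uses a
    fuzzy encoding [12] that is not reproduced here.
    """
    power = 0.0
    for base in "ATGC":
        # DFT coefficient at k = N/3: sum of exp(-2*pi*i*n/3) over the
        # positions n at which this base occurs.
        coeff = sum(cmath.exp(-2j * cmath.pi * n / 3)
                    for n, symbol in enumerate(segment) if symbol == base)
        power += abs(coeff) ** 2
    return power


def sliding_period3_power(sequence, length=351, step=1):
    """Slide a fixed-length rectangular window and record the period-3 power."""
    sequence = sequence.upper()
    return [period3_power(sequence[start:start + length])
            for start in range(0, len(sequence) - length + 1, step)]
```

A perfectly periodic toy sequence such as `"ATG" * 50` gives a large value in every window, whereas a homopolymer such as `"A" * 150` gives a value close to zero. On real sequences the profile is noisy, which is the situation the fixed-length filters reviewed in this paper try to handle.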


Table 1 Conventional window filters used for coding regions identification.

Author(s)                       Window filter used   Proposed window length
Ahmad et al. [12]               Kaiser               351 bp
Singha Roy and Barman [13]      Blackman             100 bp
Marhon and Kremer [14]          Wavelet              150 bp, 1500 bp, 6000 bp
Zhang et al. [15]               Gaussian             90 bp
Ahmad [16]                      Kaiser               351 bp
Sahu and Panda [17]             Rectangular          351 bp
Shakya et al. [18]              Bartlett             351 bp
Hota and Srivastava [19]        Rectangular          351 bp
Chavan et al. [20]              Kaiser               351 bp
Bergen and Antoniou [21]        Rectangular          351 bp
Andreas [22]                    Kaiser               351 bp
Hota and Srivastava [23]        Rectangular          351 bp
Oppenheim and Schafer [24]      Kaiser               351 bp
Tiwari et al. [25]              Rectangular          351 bp
Nair and Sreenadhan [26]        Kaiser               351 bp
Anastassiou [27]                Rectangular          351 bp
Kotlar and Lavner [28]          Rectangular          351 bp
Akhtar et al. [29]              Rectangular          351 bp
Gunawan [30]                    Bartlett             351 bp
Datta and Asif [31]             Rectangular          351 bp
Kakumani et al. [32]            Rectangular          351 bp
Tuqan and Rushdi [33]           Rectangular          234 bp
Datta and Asif [34]             Bartlett             351 bp
Akhtar et al. [35]              Rectangular          351 bp
Mena-Chalco et al. [36]         Gaussian             351 bp
George and Thomas [37]          Rectangular          351 bp
Abbasi et al. [38]              Hamming              351 bp

Significant identification of protein coding regions depends on applying an appropriate window filter that enhances the identification and suppresses signal noise. The literature shows that conventional window filters [12–39] have been widely used for protein coding region identification in digital signal processing approaches. We reviewed the literature to find the window filters commonly used for DNA signal processing in the context of identifying protein coding regions. A series of published papers addressing coding region identification report the use of conventional window filters of some fixed length. In contrast, we could not find satisfactory literature on window filters based on the genetic context of the code [9] and on the unbalanced nucleotide distribution that produces a high bias in nucleotide usage in coding regions [10,11]. Table 1 presents a review of window filters, with the proposed window lengths, employed for DNA signal processing, i.e. protein coding region identification. Rectangular, Kaiser and Bartlett windows have been used most often, typically with a fixed window length of 351 base pairs.

Window filters (with a suitable window length) play a very important role in digital-signal-processing-based approaches to coding region identification. A comprehensive analysis of conventional window filters employed for coding region identification was given in [39], using the benchmark DNA sequence AF099922 [18,19,37,38] at different window lengths. This analysis showed that conventional window functions with different window lengths identify coding regions very differently from each other. Yin and Yau [40] observed that a small window size produces more statistical oscillations, which result in prediction errors, while large window sizes may miss small coding and non-coding regions. We observed that smaller window lengths (e.g. 120 bp) do not suppress 1/f noise significantly, so that either the relative peak of a coding region is very low or the non-coding regions express themselves more than the coding regions. A window length of 240 bp gives better results than 120 bp, since it suppresses more of the noise and shows the peaks of coding regions somewhat better. In contrast, window filters with a length of 351 bp identify coding regions to a better extent by suppressing 1/f noise.

Conventional window filters have mostly been employed with a variety of digital signals, but a DNA signal contains nucleotide data in which each nucleotide holds a special genetic code context and whose distribution is highly biased in coding regions. These special characteristics of nucleotides in codons imply that a conventional window filter (especially one with a fixed window size) does not suppress the 1/f background noise to a significant extent, which results in feeble discrimination between coding and non-coding regions.

2. Methodology

We propose a novel fuzzy adaptive window median filter (FAWMF) that incorporates genetically meaningful characteristics of nucleotides in codons, i.e. the density distribution of nucleotides, the specific positions of nucleotides in codons and nucleotide usage in terms of their distribution [12]. A codon is a tri-nucleotide structure, built from Adenine, Guanine, Thymine and Cytosine, in which each nucleotide carries a specific genetic context that differentiates it from other nucleotides. Because of this characteristic, all codons differ from each other. Further, nucleotides, being constituents of a host codon, exhibit a density distribution, a position and an associated usage. Such fundamental features of nucleotides can be exploited to design more meaningful solutions for DNA signal processing, i.e. protein coding region identification.

Naturally, codons show fuzzy behavior, since the membership values of nucleotides in codons differ depending on the density of nucleotides. A nucleotide may express itself strongly in one codon cluster while having weaker or no strength in other clusters.
Such variations can only be described by z-shaped membership in cluster space, which addresses the similarity association between heterogeneous and disjoint clusters. This reflects the nature of codon clusters: some clusters are totally disjoint, while all the others share some common density distribution. We can define membership values for clusters that share certain densities. For instance, clusters in which a nucleotide contributes twice receive a membership value of "2/3", while those in which it contributes once receive a membership value of "1/3". A nucleotide that makes no physical contribution to a cluster receives a membership value of "0".

The strongest motivation for introducing a new fuzzy window median filter is that, within the DNA sequence, the nucleotides are arranged at specific positions and in specific orders in codons, and 3-base periodicity is caused by an unbalanced nucleotide distribution that produces a relatively high bias in nucleotide usage in coding regions. This fundamental characteristic of nucleotides has been exploited in FAWMF to suppress the signal noise to a significant extent. Secondly, since exons are diffused in high 1/f noise caused by long-range introns, fixed-length window filters cannot guarantee enhanced identification of exons in certain regions of a DNA sequence. For instance, a conventional window filter of fixed length 351 bp moved over a DNA sequence may miss some short exons; likewise, a window filter of fixed length 120 bp may miss some long exons that are greater than the window length. With FAWMF, it has been noticed that a change in segment length does not change the uniformity of the membership strength of nucleotides or the spectral response of the segment. The 1/f background noise in a DNA sequence arises from the strong diffusion of coding regions with non-coding regions, which ultimately results in spectral leakage
in the frequency analysis of the signal. This phenomenon shows a correlation between 1/f noise and spectral leakage.

2.1. Fuzzy adaptive window median filter (FAWMF)

In the proposed FAWMF, fuzzy rules are applied along with median filtering to suppress signal noise and hence achieve enhanced identification of coding regions. The flowchart of the FAWMF is shown in Fig. 1.

Fig. 1. Fuzzy adaptive window with median filtering (FAWMF). Flowchart steps, in order:
Start.
Take a window (W) of size L consisting of X_m (m = 1, 2, …, N) data points such that 1 ≤ m ≤ L.
Calculate the standard deviation (σ), mean (μ), minimum value (m1) and maximum value (m2) of W.
Calculate the loop step size (d) such that d = (m2 − m1)/L.
Repeat from m1 to m2 with step size (d) and set the variable j = ∑ X_m.
Decision: is d ≤ j? (Yes/No).
On "Yes", calculate the fuzzy membership value using the s-shaped and z-shaped functions, combined by the relation given in the figure.
Update vector T(m) with the new membership value and increment m.
Update the step size d with d = d + m1.
Vector T is the required FAWMF.
End.

Fig. 2. Structure of nucleotides' data. [A nucleotide sequence (… A T G A C T C A C T G A G T C …) indexed 1, 2, …, m, m+1, m+2, …, N, with a window W of length L = 2m + 1.]

Let us consider a window W of size L with X_m (m = 1, 2, …, N) data points such that 1 ≤ m ≤ L, as shown in Fig. 2. The probabilities p(r_m) of codons containing segments r_m (where r_m ∈ {A, T, G, C}) can be defined as the ratio of the number of codons with segment value d_m to the total number of data points in the window,

$$ p(r_m) = \frac{d_m}{N}, $$

where r_m is the mth segment, d_m is the number of data items with that string value, N is the total number of data items in the window and $\sum_m p(r_m) = 1$.

The proposed fuzzy median filter is organized with different fuzzy rules to determine the strength of a signal at any sampling instant from the neighborhood of that point. The filter is designed with the following notions:

1. The L sampling points are sorted in descending or ascending order. These are determined from the amplitude of the signal at the vicinity points X_m.

2. For each piece of data X_m, a fuzzy membership value is computed through a Π-shaped membership function. The Π-shaped membership function employed for the fuzzy median filter possesses the following characteristics:
   i. The maximum and minimum amplitude values are assigned the membership degree 0.
   ii. The mean value of the data points is assigned the membership degree 1.
   The membership function is a Π-shaped curve that defines how each data point in the input space is mapped to a degree of membership between 0 and 1. The Π-shaped membership function is constructed from s-shaped and z-shaped curves, respectively expressed by:

$$
s(x_m; x_l, x_r) =
\begin{cases}
0, & x_m \le x_l \\
2\left(\dfrac{x_m - x_l}{x_r - x_l}\right)^2, & x_l \le x_m \le \dfrac{x_l + x_r}{2} \\
1 - 2\left(\dfrac{x_m - x_r}{x_r - x_l}\right)^2, & \dfrac{x_l + x_r}{2} \le x_m \le x_r \\
1, & x_m \ge x_r
\end{cases}
\tag{2.1}
$$

$$
z(x_m; x_l, x_r) =
\begin{cases}
1, & x_m \le x_l \\
1 - 2\left(\dfrac{x_m - x_l}{x_r - x_l}\right)^2, & x_l \le x_m \le \dfrac{x_l + x_r}{2} \\
2\left(\dfrac{x_m - x_r}{x_r - x_l}\right)^2, & \dfrac{x_l + x_r}{2} \le x_m \le x_r \\
0, & x_m \ge x_r
\end{cases}
\tag{2.2}
$$

   where the parameters x_l and x_r locate the extremes of the sloped region of the curve, representing the left and right breakpoints, respectively.

3. Assume 2 × k + 1 data points (k ≤ L/2), where k is the range of the data set, that is, the number of candidate data in the median calculation of the list (the median value together with the k preceding and k following values of the sorted list).

4. Determine the amplitude that provides the highest membership value and take it as the output.

We describe here the simulation of the FAWMF algorithm over the Homo sapiens mitochondrion gene, which contains 16,000 bp and 13 coding regions. The nucleotide bases in this gene sequence are encoded using the fuzzy encoding sequence [12] to form a vector containing the digital contents of the signal, as shown in Table 2. We initially select a window length (L) of 401 points and move this window over the encoded signal. In the first iteration of the algorithm, the mean value (μ) is calculated as 0.850083333. Similarly, the standard deviation (σ) is 0.133729429, and the minimum value (m1) and maximum value (m2) are 0.6874 and 1.2299, respectively. The loop step size (d) is 0.0135625 and the fuzzy membership value for this iteration is 0.053349331. In the second iteration, the loop step size is incremented and we obtain a new membership value of 0.053769913. The resultant vector (T) contains the membership strengths of all those data points that correspond to these high membership values for an entire window length of the input signal.
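The four design notions listed above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: it follows the listed notions (sort, Π-shaped membership with degree 0 at the window extremes and 1 at the mean, 2k + 1 candidates around the median, output the candidate with the highest membership) and does not reproduce the adaptive loop of Fig. 1. The names `s_shaped`, `z_shaped`, `pi_shaped`, `fuzzy_median` and `fuzzy_median_filter` are hypothetical, as is the choice to shrink the window at the signal boundaries.

```python
from statistics import mean


def s_shaped(x, xl, xr):
    """S-shaped membership of Eq. (2.1): rises from 0 at xl to 1 at xr."""
    if xr <= xl:                       # degenerate breakpoints: treat as a step
        return 1.0 if x >= xr else 0.0
    if x <= xl:
        return 0.0
    if x >= xr:
        return 1.0
    if x <= (xl + xr) / 2:
        return 2 * ((x - xl) / (xr - xl)) ** 2
    return 1 - 2 * ((x - xr) / (xr - xl)) ** 2


def z_shaped(x, xl, xr):
    """Z-shaped membership of Eq. (2.2); piecewise it equals 1 - s(x; xl, xr)."""
    return 1.0 - s_shaped(x, xl, xr)


def pi_shaped(x, low, peak, high):
    """Assumed combination: s-curve from low up to peak, z-curve from peak down to high."""
    if x <= peak:
        return s_shaped(x, low, peak)
    return z_shaped(x, peak, high)


def fuzzy_median(window, k):
    """Pick, among the 2k + 1 values around the median, the one with highest membership."""
    ordered = sorted(window)                           # notion 1
    low, high, peak = ordered[0], ordered[-1], mean(ordered)
    mid = len(ordered) // 2
    candidates = ordered[max(0, mid - k):mid + k + 1]  # notion 3
    return max(candidates,                             # notions 2 and 4
               key=lambda x: pi_shaped(x, low, peak, high))


def fuzzy_median_filter(signal, length, k):
    """Apply fuzzy_median over a sliding window (window shrinks at the edges)."""
    half = length // 2
    return [fuzzy_median(signal[max(0, i - half):i + half + 1], k)
            for i in range(len(signal))]
```

With `k = 0` the sketch reduces to an ordinary median filter. Larger `k` lets the output move from the median toward the candidate closest to the window mean, which is how the membership function influences the result in this reading of the notions.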

Table 2 Encoded sequence along with window contents.

Nucleotide sequence   a        t        a        c        c        c        a        –
Encoded sequence      0.8957   0.6874   0.8957   0.8672   0.8672   0.8672   0.8957   –
Window (W)            0.8957   0.6874   0.8957   0.8672   0.8672   0.8672   0.8957   –
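The window statistics used by the algorithm (mean, standard deviation, minimum m1, maximum m2 and step size d = (m2 − m1)/L, as in Fig. 1) can be checked on the seven encoded values visible in Table 2. The sketch below is illustrative only: the helper names are hypothetical, the population standard deviation is an assumption (the text does not say which estimator was used), and the seven values are just the visible part of the 401-point window. The divisor is left as a parameter because the step size quoted in the text, 0.0135625, equals (1.2299 − 0.6874)/40.

```python
from collections import Counter
from statistics import mean, pstdev

# The seven encoded values shown in Table 2 (a, t, a, c, c, c, a).
TABLE2_WINDOW = [0.8957, 0.6874, 0.8957, 0.8672, 0.8672, 0.8672, 0.8957]


def window_statistics(window, divisor=None):
    """Mean, standard deviation, extremes and step size d = (m2 - m1) / divisor."""
    m1, m2 = min(window), max(window)
    if divisor is None:
        divisor = len(window)          # default: the window length L
    return {
        "mean": mean(window),
        "std": pstdev(window),         # assumption: population standard deviation
        "m1": m1,
        "m2": m2,
        "d": (m2 - m1) / divisor,
    }


def value_probabilities(window):
    """Relative frequency p(r_m) = d_m / N of each distinct value in the window."""
    counts = Counter(window)
    n = len(window)
    return {value: count / n for value, count in counts.items()}


stats = window_statistics(TABLE2_WINDOW)
probabilities = value_probabilities(TABLE2_WINDOW)
```

The probabilities returned by `value_probabilities` sum to 1, matching the normalization stated for p(r_m).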

Further, we convolve the window with the encoded signal and calculate the segmented frames of the signal. Fig. 3(A) shows 21 frames (out of a number of frames that depends on the adaptive window size) of window segments resulting from convolution with the signal, together with the power spectral density (PSD) estimation of the frames. Fig. 3(B) presents another 21 frames of PSD generated by convolution of the window with the signal. This helps to determine which regions of the signal are likely to contain coding regions. Fig. 4 presents the correct identification of 13 exons in the complete gene. The peaks E1 to E13 mark the locations of the exons identified in the signal.

Fig. 3. Frame segments of windows and PSD estimation.

Fig. 4. PSD estimation of Homo sapiens mitochondrion gene.

3. Results

Performance evaluation of the different window filters for coding region identification has been performed at the nucleotide level. In this context, the following evaluation measures have been employed:

$$ \text{Sensitivity } (Sn) = \frac{TP}{TP + FN} \tag{1} $$

$$ \text{Specificity } (Sp) = \frac{TP}{TP + FP} \tag{2} $$

$$ \text{Prediction accuracy } (P) = \frac{TP + TN}{TP + FP + TN + FN} \tag{3} $$

$$ \text{Approximate correlation } (AC) = (ACP - 0.5) \times 2 \tag{4} $$

where

$$ ACP = \frac{1}{4}\left(\frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FN} + \frac{TN}{TN + FP}\right) \tag{5} $$
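Eqs. (1) to (5) translate directly into code. The sketch below is a minimal example with a hypothetical function name, `evaluation_measures`; it assumes nucleotide-level counts TP, FP, TN and FN for which every denominator is non-zero, and the counts used in the usage line are made up for illustration.

```python
def evaluation_measures(tp, fp, tn, fn):
    """Nucleotide-level measures of Eqs. (1)-(5) from confusion counts.

    Assumes every denominator is non-zero.
    """
    sn = tp / (tp + fn)                              # Eq. (1)
    sp = tp / (tp + fp)                              # Eq. (2), as defined in the paper
    p = (tp + tn) / (tp + fp + tn + fn)              # Eq. (3)
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp)
                  + tn / (tn + fn) + tn / (tn + fp)) # Eq. (5)
    ac = (acp - 0.5) * 2                             # Eq. (4)
    return {"Sn": sn, "Sp": sp, "P": p, "ACP": acp, "AC": ac}


# Illustrative counts only (not taken from the paper).
example = evaluation_measures(tp=30, fp=10, tn=50, fn=10)
```

For these counts the sketch gives Sn = 0.75, Sp = 0.75 and P = 0.8. AC rescales ACP from the range [0, 1] to [−1, 1], so a perfect prediction yields AC = 1.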

Sensitivity (Sn), also called the true positive rate, measures the proportion of actual coding nucleotides (exons) that are correctly identified as coding. Specificity (Sp), as defined in Eq. (2), measures the proportion of nucleotides predicted as coding that are truly coding [17,18,36,38]; this is the convention commonly used in gene prediction, and it differs from the true negative rate TN/(TN + FP). Prediction accuracy (P) is another useful evaluation measure, taken as a combination of sensitivity and specificity [38], while approximate correlation (AC) is employed as well, since prediction accuracy alone may not discriminate coding regions from non-coding regions properly when high sensitivity is paired with low specificity or vice versa. Further, Table 3 describes the datasets used for performance evaluation of the different window filters. The benchmarked DNA datasets
used for experiments are HMR195, S. cerevisiae chromosome III [18,19,37,38] and HUMHBB (human beta globin) [36]; the remaining datasets are randomly taken DNA sequences of organisms. The performance analysis of the different parameters can be observed in Table 4, which presents the performance evaluation of the window filters in terms of specificity (Sp), prediction accuracy (P), approximate correlation (AC), false positive rate (FP) and signal to noise ratio (SNR) at different window lengths.

Table 3 Description of datasets used for performance evaluation.

Organism                          No. of sequences               Average sequence length (bp)
Homo sapiens                      15                             4500
Serinus canaria                   20                             6250
Nicotiana sylvestris              17                             3700
Yersinia pestis                   1                              4000
Limulus polyphemus                20                             7880
Felis catus                       23                             4925
Vicugna pacos                     18                             2120
Sus scrofa mitochondrion          1                              8000
Cricetulus griseus                15                             5812
Tursiops truncatus                18                             3460
Ornithorhynchus anatinus          20                             3933
Mus musculus domesticus           1                              7700
Meleagris gallopavo               25                             2732
Canis lupus mitochondrion         1                              7800
Galeopterus variegatus            18                             2267
Nicotiana tomentosiformis         20                             4200
S. cerevisiae chromosome III      1                              8000
Human, Mouse and Rat (HMR195)     103, 82 and 10 respectively    7096
Human (Beta Globin HUMHBB)        1                              73,326

Table 4 Performance evaluation at different window lengths. [Two bar charts. Upper panel: mean specificity μ(Sp), prediction accuracy μ(P) and approximate correlation μ(AC) for the Bartlett, Blackman, Rectangular, Hamming, Hann, Taylorwin, Triangular, Kaiser and FAWMF window filters at window lengths of 60, 120, 240, 351, 460 and 600 bp; vertical axis "Extent", 0 to 1. Lower panel: mean false positive rate μ(FP) and signal to noise ratio μ(SNR) for the same filters and window lengths; vertical axis "Extent".]

Table 5 Fuzzy membership strength and spectral response of window segments. [Two plots for a window segment of length 351 bp: fuzzy membership values of nucleotides (magnitude) and spectral density, each plotted against the data points in the window segment.]

4. Discussions

Protein coding regions are diffused with non-coding regions, and viable identification of such regions suffers from exon-intron mixed signal noise [13,14,41]. The suppression of signal noise correlates with enhanced identification of coding regions. An optimal digital filter is convolved with the DNA signal and significantly discriminates the boundaries of coding and non-coding regions [12]. Conventional window filters employed for enhancing protein coding regions lack the representation and implementation of biologically inspired nucleotide data [39]. Yin and Yau [40] observed that a small window size produces more statistical oscillations, which result in prediction errors, while large window sizes may miss small coding and non-coding regions. Windows of smaller size (i.e. 60 bp, 120 bp and 240 bp) show lower values of the performance parameters. The extent of identification gradually increases from lower to higher window lengths, and maximum identification is achieved at a window length of 351 bp [12,39]. Similarly, we found a decrease in prediction accuracy beyond 351 bp for windows of fixed length. A notable enhancement in identification was observed with FAWMF owing to its adaptability, which minimizes spectral leakage and signal noise.

Employing randomly taken datasets, we calculated the mean signal to noise ratio (SNR) for each window filter at different window sizes to reveal the noise suppression tendency of the window filters. We noticed only a slight variation in SNR at window lengths of 60 bp and 120 bp among the existing window filters, but a clear difference in comparison with the FAWMF window filter. At a window length of 351 bp (which has been chosen by most researchers for coding region identification) [16], the SNR of the Kaiser window filter was the most prominent among the existing window filters, while a remarkable gain in SNR was observed with the proposed FAWMF window filter compared with the SNR of the Kaiser window filter. We observed that the signal to noise ratio increases from smaller to relatively larger window lengths, that the maximum SNR is achieved at a window length of 351 bp, and that SNR gradually decreases beyond 351 bp. The same phenomenon was observed for the false positive rate. The false positive rate decreases with increasing window size, and the optimal false positive rate is achieved around 351 bp. The rate deteriorates again at larger window sizes for conventional window filters, since they lack the biological aspect of nucleotides being represented as contents of the window filter. It is therefore appropriate to state that a window filter based on the genetic code context of nucleotides ensures significant noise suppression compared with conventional window filters.

Further, we performed an ANOVA test for analysis of variance in the results obtained for the different evaluation parameters (i.e. prediction accuracy, specificity, approximate correlation, false positive rate and signal to noise ratio). We observed p-values lower than 0.05 corresponding to the F-statistic of ANOVA for the evaluation parameters. This indicates that one or more window filters differ significantly from the others (rejecting the null hypothesis that all window filters achieve the same performance). Since ANOVA alone cannot show which window filters are significantly different, we further performed a post-hoc Tukey HSD test to identify them. The FAWMF window filter achieved significant p-values (p < 0.01) in comparison with the conventional window filters.

Another very important aspect of a window filter is its adaptability to variations in its length [39]. Notably, the proposed FAWMF window filter is highly adaptive in its use for better convolution with the DNA signal because of its biological structure.
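The F-statistic behind the one-way ANOVA mentioned in this section can be computed by hand, which makes explicit what is being compared: the variance between the per-filter group means against the variance within each group. The sketch below is illustrative only; the function name `one_way_anova_f` is hypothetical, the three groups are made-up numbers rather than results from the study, and the p-value and the Tukey HSD post-hoc comparison (which require the F and studentized range distributions) are not computed here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F-statistic of a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)                                   # number of groups (e.g. filters)
    n = sum(len(group) for group in groups)           # total number of observations
    grand_mean = sum(sum(group) for group in groups) / n
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2
                     for group in groups)
    ss_within = sum((x - mean(group)) ** 2
                    for group in groups for x in group)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Made-up scores for three hypothetical filters (not data from the paper).
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F-value relative to the F distribution with (k − 1, n − k) degrees of freedom corresponds to a small p-value, which is the criterion (p < 0.05) reported in the text.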
FAWMF is based on genetic code context and outperforms the other filters at all window lengths. Table 5 presents the fuzzy membership strength of nucleotides and the corresponding spectral response of window segments of different lengths. The red line approximates a uniform normal distribution of the nucleotides' strength and the spectral estimate of the segments. The fuzzy membership distribution of nucleotides in window segments is highly correlated with the spectral response of the segments. A uniform, smooth fuzzy distribution gives the same spectral response at different segment sizes. Further, it has been noticed that a change in segment length does not change the uniformity of the membership strength of nucleotides or the spectral response of the segment. A small window size produces more statistical oscillations, which result in prediction errors, while large window sizes may miss small coding and non-coding regions [40], but the FAWMF algorithm performs well at different window sizes, revealing a uniform membership distribution with the same smooth spectral response.

5. Conclusion

Conventional window filters applied to DNA signals do not achieve significant results. This paper presented a novel biologically inspired fuzzy adaptive window median filter (FAWMF) based on the genetic code context of nucleotides. We applied FAWMF to long, noisy DNA sequences to enhance coding region identification. The FAWMF algorithm computed the fuzzy membership strength of nucleotides and filtered nucleotides based on median filtering with a combination of s-shaped and z-shaped filters. The algorithm was found to be very useful for tracing both short and long coding regions in a variety of noisy sequences, owing to its adaptive response. More than 250 benchmarked and randomly taken DNA datasets of different organisms were employed for performance evaluation of the different window filters. The proposed window filter outperformed the conventional filters of fixed window length and produced significant discrimination between coding and non-coding regions.

References

[1] D. Anastassiou, Genomic signal processing, IEEE Signal Process. Mag. 18 (4) (2001) 8–20.
[2] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, Molecular Biology of the Cell, Garland Science, New York, 2002.
[3] Z. Ignatova, I. Martínez-Pérez, K.H. Zimmermann, DNA Computing Models, Springer Science & Business Media, 2008.
[4] F. Brueckner, K.J. Armache, A. Cheung, G.E. Damsma, H. Kettenberger, E. Lehmann, J. Sydow, P. Cramer, Structure–function studies of the RNA polymerase II elongation complex, Acta Crystallogr. Sect. D: Biol. Crystallogr. 65 (2) (2009) 112–120.
[5] M. Long, E. Betrán, K. Thornton, W. Wang, The origin of new genes: glimpses from the young and old, Nat. Rev. Genet. 4 (11) (2003) 865–875.
[6] A.A. Turanov, A.V. Lobanov, D.E. Fomenko, H.G. Morrison, M.L. Sogin, L.A. Klobutcher, D.L. Hatfield, V.N. Gladyshev, Genetic code supports targeted insertion of two amino acids by one codon, Science 323 (5911) (2009) 259–261.
[7] E. Coward, Equivalence of two Fourier methods for biological sequences, J. Math. Biol. 36 (1) (1997) 64–70.
[8] Z. Wang, Y. Chen, Y. Li, A brief review of computational gene prediction methods, Genomics Proteomics Bioinf. 2 (4) (2004) 216–221.
[9] I. Wasito, I. Veritawati, Fractal dimension approach for clustering of DNA sequences based on internucleotide distance, in: 2013 International Conference of Information and Communication Technology (ICoICT), IEEE, 2013, March, pp. 82–87.
[10] J.W. Fickett, Recognition of protein coding regions in DNA sequences, Nucleic Acids Res. 10 (17) (1982) 5303–5318.
[11] C. Yin, S.S.T. Yau, A Fourier characteristic of coding sequences: origins and a non-Fourier approximation, J. Comput. Biol. 12 (9) (2005) 1153–1165.
[12] M. Ahmad, L.T. Jung, M.A.-A. Bhuiyan, On fuzzy semantic similarity measure for DNA coding, Comput. Biol. Med. 69 (2016) 144–151.
[13] S. Singha Roy, S. Barman, Polyphase filtering with variable mapping rule in protein coding region prediction, Microsyst. Technol. 22 (167) (2016) 1–11.
[14] S.A. Marhon, S.C. Kremer, Prediction of protein coding regions using a wide-range wavelet window method, IEEE/ACM Trans. Comput. Biol. Bioinf. 13 (4) (2016) 742–753.
[15] X. Zhang, Z. Shen, G. Zhang, Y. Shen, M. Chen, J. Zhao, R. Wu, Short Exon detection via Wavelet transform Modulus Maxima, PLOS ONE 11 (9) (2016) e0163088.
[16] M. Ahmad, A biologically-inspired computational solution for protein coding regions identification in noisy DNA sequences, in: Biologically-Inspired Energy Harvesting through Wireless Sensor Technologies, IGI Global, 2016, pp. 201–216.
[17] S.S. Sahu, G. Panda, Identification of protein-coding regions in DNA sequences using a time-frequency filtering approach, Genomics, Proteomics Bioinf. 9 (1) (2011) 45–55.
[18] D.K. Shakya, R. Saxena, S.N. Sharma, A DSP-based approach for gene prediction in eukaryotic genes, Int. J. Electr. Eng. Inf. 3 (4) (2011) 480–487.
[19] M.K. Hota, V.K. Srivastava, DSP technique for gene and exon prediction taking EIIP indicator sequence, in: Proceedings of the Second International Conference on Information Processing, 2008, January, pp. 117–123.
[20] M.S. Chavan, R.A. Agarwala, M.D. Uplane, Use of Kaiser window for ECG processing, in: Proceedings of the 5th WSEAS International Conference on Signal Processing, Robotics and Automation, Madrid, Spain, 2006, February.
[21] S.W. Bergen, A. Antoniou, Application of parametric window functions to the STDFT method for gene prediction, in: Proceedings on Communication, Computers and Signal Processing (IEEE-PACRIM05), 2005, pp. 324–327.
[22] A. Andreas, Digital signal processing: Signals, systems, and filters, McGraw-Hill, New York, 2006, ISBN 10: 0070636338.
[23] M.K. Hota, V.K. Srivastava, Performance analysis of different DNA to numerical mapping techniques for identification of protein coding regions using tapered window based short-time discrete Fourier transform, in: 2010 International Conference on Power, Control and Embedded Systems (ICPCES), IEEE, 2010, November, pp. 1–4.
[24] A.V. Oppenheim, R.W. Schafer, Discrete-time Signal Processing, Pearson Higher Education, 2010.
[25] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, R. Ramaswamy, Prediction of probable genes by Fourier analysis of genomic sequences, Comput. Appl. Biosci.: CABIOS 13 (3) (1997) 263–270.
[26] A.S. Nair, S.P. Sreenadhan, A coding measure scheme employing electron-ion interaction pseudopotential (EIIP), Bioinformation 1 (6) (2006) 197–202.
[27] D. Anastassiou, Frequency-domain analysis of biomolecular sequences, Bioinformatics 16 (12) (2000) 1073–1081.
[28] D. Kotlar, Y. Lavner, Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions, Genome Res. 13 (8) (2003) 1930–1937.
[29] M. Akhtar, J. Epps, E. Ambikairajah, Signal processing in sequence analysis: advances in eukaryotic gene prediction, IEEE J. Sel. Top. Signal Process. 2 (3) (2008) 310–321.
[30] T.S. Gunawan, On the optimal window shape for genomic signal processing, in: International Conference on Computer and Communication Engineering, 2008. ICCCE 2008, IEEE, 2008, May, pp. 252–255.
[31] S. Datta, A. Asif, A fast DFT based gene prediction algorithm for identification of protein coding regions, in: ICASSP (5), 2005, March, pp. 653–656.
[32] R. Kakumani, V. Devabhaktuni, M.O. Ahmad, Prediction of protein-coding regions in DNA sequences using a model-based approach, in: 2008 IEEE International Symposium on Circuits and Systems, IEEE, 2008, May, pp. 1918–1921.
[33] J. Tuqan, A. Rushdi, A DSP approach for finding the codon bias in DNA sequences, IEEE J. Sel. Top. Signal Process. 2 (3) (2008) 343–356.
[34] S. Datta, A. Asif, DFT based DNA splicing algorithms for prediction of protein coding regions, in: Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, 2004, 1, IEEE, 2004, November, pp. 45–49.
[35] M. Akhtar, J. Epps, E. Ambikairajah, On DNA numerical representations for period-3 based exon prediction, in: 2007 IEEE International Workshop on Genomic Signal Processing and Statistics, IEEE, 2007, June, pp. 1–4.
[36] J. Mena-Chalco, H. Carrer, Y. Zana, M.R. Cesar Jr., Identification of protein coding regions using the modified Gabor-wavelet transform, IEEE/ACM Trans. Comput. Biol. Bioinf. 5 (2) (2008) 198–207.
[37] T.P. George, T. Thomas, Discrete wavelet transform de-noising in eukaryotic gene splicing, BMC Bioinf. 11 (1) (2010) 1.
[38] O. Abbasi, A. Rostami, G. Karimian, Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform, BMC Bioinf. 12 (1) (2011) 1.
[39] M. Ahmad, L.T. Jung, A.A. Bhuiyan, From DNA to protein: why genetic code context of nucleotides for DNA signal processing? A review, Biomed. Signal Process. Control 34 (2017) 44–63.
[40] C. Yin, S.S.T. Yau, Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence, J. Theor. Biol. 247 (4) (2007) 687–694.
[41] G. Liu, Y. Luan, Identification of protein coding regions in the eukaryotic DNA sequences based on Marple algorithm and wavelet packets transform, Abstract and Applied Analysis 2014 (2014, July).
