A combination dimensionality reduction approach to codon position patterns of eubacteria based on their complete genomes

A combination dimensionality reduction approach to codon position patterns of eubacteria based on their complete genomes

Journal of Theoretical Biology 272 (2011) 26–34 Contents lists available at ScienceDirect Journal of Theoretical Biology journal homepage: www.elsev...

2MB Sizes 2 Downloads 40 Views

Journal of Theoretical Biology 272 (2011) 26–34

Contents lists available at ScienceDirect

Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

A combination dimensionality reduction approach to codon position patterns of eubacteria based on their complete genomes Zhao-Hui Qi n, Ruo-Yan Wei College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei 050043, People’s Republic of China

a r t i c l e i n f o

a b s t r a c t

Article history: Received 1 October 2010 Received in revised form 8 December 2010 Accepted 8 December 2010 Available online 14 December 2010

Graphical techniques have become powerful tools for the visualization and analysis of complicated biological systems. However, we cannot give such a graphical representation in a 2D/3D space when the dimensions of the represented data are more than three dimensions. The proposed method, a combination dimensionality reduction approach (CDR), consists of two parts: (i) principal component analysis (PCA) with a newly defined parameter r and (ii) locally linear embedding (LLE) with a proposed graphical selection for its optional parameter k. The CDR approach with r and k not only avoids loss of principal information, but also sufficiently well preserves the global high-dimensional structures in lowdimensional space such as 2D or 3D. The applications of the CDR on characteristic analysis at different codon positions in genome show that the method is a useful tool by which biologists could find useful biological knowledge. & 2010 Elsevier Ltd. All rights reserved.

Keywords: Eubacteria Codon usage Genome Graphical representation

1. Introduction With the rapid reporting of complete genomes derived with automated DNA sequencing techniques, the problem of processing such information became acute. Graphical methods have emerged as a powerful solution for the visualization and analysis of complicated biological systems. Many graphical approaches have been used to deal with a great number of biological problems. For example, various graphic schemes have been successfully used to study enzyme-catalyzed system (Chou, 1980; Chou and Forsen, 1980, 1981; Chou and Liu, 1981; Chou, 1989, 1990; Andraos, 2008), codon usage (Chou and Zhang, 1992; Zhang and Chou, 1994), protein folding kinetics (Chou, 1990, 1993), drug metabolism systems (Chou, 2010), alignment (Randicet al., 2006; Qi et al., 2010), protein subcellular location (Xiao et al., 2006a, b), protein structural classes (Xiao et al., 2008) and G-protein-coupled receptor functional classes (Xiao et al., 2009). Recently, graphic approaches were also used to examine the similarities/dissimilarities among the coding sequences of different species (Randicet al., 2003a, 2003b; Randic, 2006; Yao et al., 2006, 2008, 2010; Qi et al., 2007; Qi and Qi, 2007, 2009; Qi and Fan, 2007), study cellular signaling networks (Diao et al., 2007), and analyze the network structure of the amino acid metabolism (Shikata et al., 2007). Another useful graphic method, radar chart, has been used to illustrate the differences in amino acid compositions to predict protein subcellular localization (Chou and Elrod, 1999). Also, radar charts have been applied in a similar manner

n

Corresponding author. E-mail address: [email protected] (Z.-H. Qi).

0022-5193/$ - see front matter & 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.jtbi.2010.12.014

for classifying organisms (Sorimachi and Okayasu, 2004, 2008a, 2008b; Okayasu and Sorimachi, 2009). Graphical approaches are useful for making visual inspection of data, and helpful for recognizing, organizing and analyzing the complicated biological problems. However, people may be confronted with a new issue for graphical representation in a two- or three-dimensional (2D/3D) space. Because of the limitation of dimensionality, we cannot give such a representation with more than three dimensions in a 2D/3D space. It is a reasonable scheme to reduce the data with more than three dimensions to 2D/3D. Dimensionality reduction is a useful operation for data clustering and pattern recognition. Because the high-dimensional data can include lots of redundancies and also hide some important relationships, we use dimensionality reduction approach to eliminate the redundancies and extract the useful information. As for biologists, they may also be confronted with a great deal of high-dimensional data when they discuss and analyze complex biological problems. Because of the complexity of the complete genome, there may exist a great deal of redundant information within the high-dimensional data with linear and nonlinear structures. Dimensionality reduction methods have been successfully applied to resolve some biological problems (Wang et al., 2008; Costa et al., 2009; Du et al., 2006; Qi et al., 2009). Since the patterns of genome can be hard to find in high-dimensional space, where graphical representation is not available, the possibility of grouping the variability in few variables is an important step to visualize and discover the hidden information. Here, we propose a combination dimensionality reduction (CDR) approach based on the classical linear method, Principal Components Analysis (PCA) (Jackson and Wiley, 1991), and the important nonlinear method, Locally Linear

Z.-H. Qi, R.-Y. Wei / Journal of Theoretical Biology 272 (2011) 26–34

Embedding (LLE) (Roweis and Saul, 2000). The CDR method consists of two parts: (i) PCA with its optional parameter r and (ii) LLE with its optional parameter k. Then the new method is used for mining and analyzing codon position characteristics in genome with linear and nonlinear structures. There are several applications for codon position, amino acid usage and codon usage analysis. In each application, the 2D graphical representation by CDR approach is used to investigate the classification patterns among different organisms.

2. Methods There are two major types of methods for dimensionality reduction: linear method and nonlinear method. Linear methods are more suitable for data reduction with linear structure. A key purpose of nonlinear methods is to preserve distances when mapping data to a low-dimensional space. This is to say that the points close in the input space must also be close in the output space. The high-dimensional data derived from genome include not only linear structure but also nonlinear structure. Here, the proposed CDR method can first reduce the linear structure by the linear method PCA with an optional parameter r. The output data after PCA are further deduced by the nonlinear method LLE to preserve distances when mapping data to a 2D space. Then we simply introduce the PCA and LLE approaches in the following. 2.1. Principal components analysis (PCA) and its optional parameter r PCA (Jackson and Wiley, 1991) is a projection method used to analyze data set and reduce it from high-dimensional space to few hidden variables while keeping information on its variability. Now, we give the simple description of the algorithm. Assume that the P mean of sample X ¼ fxi gni¼ 1 of space RD is x ¼ 1=n ni¼ 1 xi . Write the singular value decomposition of covariance matrix S as S ¼ U4UT, where S ¼ EðxxÞðxxÞT . Matrix U is an orthogonal matrix. Diagonal matrix 4 is made up of the eigenvalues of S, where 4 ¼ diag(l1,y,lD) and l1 Z    Z lD . The principle component transformation is U T ðXXÞ. Then a new data set Y ¼ fyi gni¼ 1 is obtained by Y ¼ U T ðXXÞ. The mean and covariance matrix of Y are 0 and diagonal matrix 4, respectively. Now we ignore the components of lesser significance and leave out some important components. Then the final data set will have fewer dimensions than the original. During the process of the dimensionality reduction, we do lose some information. However if the ignored eigenvalues are small, we do not lose much. To be precise, if we originally have n dimensions in our data, we can obtain the final set with p dimensions by choosing only the first p eigenvalues. It is well known that the high-dimensional data from complete genome are complex. It is a useful scheme to reduce the highdimensional data to 2D/3D data set. The 2D/3D graphical representation according to the 2D/3D data can provide an intuitive impression and help people gain useful insights. If the high-dimensional data with n dimensions are always reduced to 2D/3D data, we perhaps lose much information. In order to keep the significant components, we introduce a parameter r to specify the degree of dimensionality reduction: , m D X X r¼ lj li j¼1

Pm

i¼1

The j ¼ 1 lj in the equation is the sum of the first m eigenvalues (l1 Z    Z lm ). Then the final data set has only m dimensions. If we set a big r such as r Z0.9, we will ignore the components of lesser significance and do not lose much. However, when m 43, we cannot facilitate the graphical representation of the reduced data

27

set to intuitively obtain useful insights hidden in the original. Otherwise, we perhaps lose more information. In order to overcome the drawback, we further reduce the m-dimensional data set to fewer dimensions (2D or 3D) by the important nonlinear method LLE. 2.2. Locally linear embedding (LLE) and the selection of its optional parameter k Roweis and Saul (2000) introduced Locally Linear Embedding (LLE), a nonlinear method that computes low-dimensional, neighborhoodpreserving embeddings of high-dimensional inputs. The LLE algorithm assumes that a high-dimensional data set lies on, or near to, a smooth low-dimensional manifold. Small patches of the manifold, each containing a fraction of the data set, can be equipped with individual local coordinates. The high-dimensional coordinates of each patch can be mapped into corresponding local coordinates by means of an essentially linear transformation. LLE attempts to find a global transformation of the high-dimensional coordinates into low-dimensional ones by exploiting adjacency information about closely located data points, this information being a form of summarization of the local transformations between the high- and low-dimensional coordinates (Chojnacki and Brooks, 2009). LLE constructs a neighborhood-preserving mapping. In LLE, ! each high-dimensional observation X i is mapped to a low-dimen! sional vector Y i representing global internal coordinates on the ! manifold. This is done by choosing d-dimensional coordinates Y i to P ! P ! 2 minimize the embedding cost function fðYÞ ¼ i 9 Y i  j Wij Y j 9 . This cost function is based on locally linear reconstruction errors. One should fix the weights Wij while optimizing the coordinates ! Y i . Now, we give the simple description of the algorithm (Roweis and Saul, 2000). As an input, the algorithm requires n real-valued vectors X i in D-dimensional space. As an output, then it can give n real-valued vectors Y i in d-dimensional space, where d oD. (a) For each X i find its k nearest neighbors. (b) Reconstruct each data point from its neighbors. Reconstruction errors are measured by the cost function e, where  2 X! X !   e ¼ min  X i  Wij X j   i  j The weights Wij summarize the contribution of the jth data point to the ith reconstruction. To compute the weights Wij, one minimizes the cost function subject to two constraints: P ! j Wij ¼ 1 and Wij ¼0 for points that are not neighbors of X i . (c) Low-dimensional embeddings are found that best preserve high-dimensional neighborhood geometry represented by the weights Wij. This is done by choosing d-dimensional coordi! nates Y i to minimize the embedding cost function f(Y), where

fðYÞ ¼

X! X ! 2 9 Y i  j Wij Y j 9 i

Then, one should fix the weights Wij while optimizing the ! coordinates Y i . The embedding cost function f(Y) is subject to two P! ! P! ! constraints: i Y i ¼ 0 , and N1 i Y i  Y i ¼ I, where I is the d  d identity matrix. The bottom d (non-zero) eigenvectors of the matrix (I  W)T(I W) provide the solution of the embedding cost function f(Y) as shown in Roweis and Saul (2000). These eigenvectors form rows of the matrix Y. As for LLE, the parameter k (k nearest neighbors) is to be predefined. In the original paper (Roweis and Saul, 2000) authors

28

Z.-H. Qi, R.-Y. Wei / Journal of Theoretical Biology 272 (2011) 26–34

According to the above analysis, we can draw a feasible scheme to find an approximately optimal choice. The scheme is described in detail as follows:

do not explain how to make the best choice. Generally speaking, a choice for a large k perhaps causes smoothing or elimination of smallscale structures in the manifold. In contrast, too small neighborhoods can falsely divide the continuous manifold into disjoint sub-manifolds. However, in the original paper authors directly give their k value, k¼20. In order to investigate the reason for the choice we see again the steps of the LLE algorithm. A close look at the steps (a) and (b) of the LLE algorithm discovers that there is a minimal reconstruction error e for each k. We can draw an e-curve with the change in k by connecting all the dots (e(k),k) one by one. Now we experiment with the data sets (Fig. 1(a)) sampled from the typical manifold introduced in Roweis and Saul (2000). Fig. 1(b) shows a plot of e(k) for k ranging from 1 to 50. A close look at the figure reveals that the reconstruction error e is dynamically changed with k. The existence of a global minimum for k¼7 is caused by the fact that for many points their first few neighbors are all indeed close to them and adding a new neighbor decreases the reconstruction error every time. However, as the parameter k grows, the error e begins to rise or fall, because the Euclidean distance becomes an indicator for proximity. As for each kA[1,50], we draw the output data in a 2D space. In order to simplify the description, we only list the graphs in Fig. 1 when k¼7 and 20, respectively. Observing all graphs for each kA[1,50], the worst effect of dimensionality reduction is found in the figure when k¼7. As for each kZ18, the output data in a 2D space approximately maintain the stability of cluster. In the original paper, the choice for k¼20 indeed is an optimal choice because the reconstruction error e reaches its minimal value when kZ18, as shown in Fig. 1(b).

(a) Select kmax as the maximal possible value of k. P ! P ! 2 (b) Calculate e for each k, where e ¼ min i  X i  j Wij X j  and kA[1,kmax]. (c) Draw the e-curve with the change in k by connecting all the dots (e(k),k) one by one. (d) Find the ranges where the absolute value of the slope of k is small and stable from the e-curve. (e) Within the ranges we draw the output data in a 2D space. Observe the represented figures and find the figures in which the patterns maintain stable. (f) Take the figure with the least reconstruction error and the corresponding k as the final choice.

2.3. Combination dimensionality reduction (CDR) approach In this section, we describe the construction of CDR, a combination dimensionality reduction approach. The high-dimensional data set derived from the genome is a mixture set with linear and nonlinear structures because of the complexity of genome. It is a challenge for biologists how to find the principal information hidden in the mixture data set. People have used the linear method such as PCA (Du et al., 2006; Qi et al., 2009) to deal with some biological problems by

4.5

15

4

10

3.5

5  (k)

0 -5

3 2.5 2

-10

1.5

-15 40 20 0

0 -10

10

20

1

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 k

0.1 0.1 0.08

0.08 k=7

0.06

0.06

0.04

0.04

0.02

0.02

0

0

-0.02

-0.02

-0.04

-0.04 -0.06

-0.06

-0.08

-0.08

-0.1 -0.05 -0.04 -0.03 -0.02 -0.01

k = 20

0

0.01 0.02 0.03 0.04 0.05

-0.1 -0.05 -0.04 -0.03 -0.02 -0.01

0

0.01

0.02

0.03

0.04

0.05

Fig. 1. (a) Data sets sampled from the typical manifold in Roweis and Saul (2000). (b) Plot of e(k) for k ranging from 1 to 50. (c) Output data in a 2D space when k¼ 7. (d) Output data in a 2D space when k ¼20.

Z.-H. Qi, R.-Y. Wei / Journal of Theoretical Biology 272 (2011) 26–34

29

graphical representation based on the reduced data set in 2D/3D space. However, we may lose much information if the high-dimensional data with n dimensions are always reduced to 2D/3D data. Hence we introduce a parameter s described in Section 2.1 to specify the degree of dimensionality reduction. Then we further reduce the output data set to fewer dimensions (2D or 3D) by the important nonlinear method LLE with the optional parameter k described in Section 2.2. Now, we give the detailed description of CDR algorithm.

As for each k in the e-curve, we draw the output data in a 2D space. In order to simplify the description, we only list the graphs in Fig. 2(b) when k¼29. Observing all graphs for each kAK, the output data in a 2D space maintain the approximate stability of cluster when kZ29. Then we get k¼29 as the final choice. The final output data set is shown in Fig. 2(b).

(a) Assume that a set of vectors X ¼ fxi gni¼ 1 is a sample in space RD. By PCA algorithm there are D eigenvalues: l1 ,    , lD , where l1 Z    Z lD . (b) Choose the first m eigenvalues (l1 Z    Z lm ). Let the paraP meter r Z r0 (for example, r0 ¼0.9), where r ¼ m j ¼ 1 lj = PD l . Then we can obtain the output data set in an mi¼1 i dimensional space. If m¼2, the following step (c) will be skipped. (c) Take the m-dimensional data as the input of LLE algorithm. Take the figure with the least reconstruction error and the corresponding k as the final choice by the approximately optimal choice scheme in Section 2.2. (d) Visualize the final output data in a 2D space.

In this section, the proposed CDR approach is compared with the PCA and LLE approaches to reduce the 12D data to the 2D data, respectively. In Section 2.3, we use the proposed CDR approach to analyze the 12D data derived from Helicobacter pylori J99 (AE001439.1) genome. The final output data set is shown in Fig. 2(b). Here, we use the PCA and LLE approaches to reduce the same data set with 12D to the 2D data, respectively. The final output data sets from different methods are shown in Fig. 3(a)–(c) for a convenient observation. A close look at the figures shows that the clustering feature of Fig. 3(b) is very similar to Fig. 3(c). We easily observe two cluster centers although the other clustering is small. However, we can merely find out one cluster center in Fig. 3(a), and other dots surrounding the center shows a band state without cluster center. These results intuitively show that the LLE and CDR approaches obtain almost the same dimension reduction results. The PCA approach loses the other small cluster center. Fig. 3(d) gives a reasonable explanation. According to PCA algorithm, the first two eigenvalues are chosen in order to reduce the 12D data to the 2D data. LLE algorithm reserves all eigenvalues. The proposed CDR chooses the first six eigenvalues. Obviously, from the r(j)-curve of Fig. 3(d) we can see that the first two eigenvalues of PCA algorithm are merely about 60% of all eigenvalues. The CDR algorithm reserves about 89% of all eigenvalues. The lost 40% of all eigenvalues almost make PCA algorithm lose the other small cluster. LLE and CDR with 89% of all eigenvalues obtain almost the same dimension reduction results. However, the two approaches pay more time. The above results are obtained under the following computing environment: CPU (AMD 5200+, 2.7 MHZ), RAM (2 G), windows XP and Matlab 7.0. The LLE and CDR approaches approximately consume 3327 and 1886 s, respectively. The PCA algorithm only pays less than 10 s. In a word, the proposed CDR approach provides a compromise between information mining and time cost. Finding a balance between features and cost by adjusting the parameter r of CDR approach, one can mine as much feature information as possible under his existing computing environment.

Here, we give a practical example to demonstrate the validity of the proposed CDR approach. The experiment data are derived from Helicobacter pylori J99 (AE001439.1) genome. As for each gene, a 12D vector P12D is generated to describe the characteristics of the codon position in the gene, where P12D ¼ fp1A ,p1T ,    ,p3G ,p3C g. The components of the 12D vector denote the proportions of the four nucleotides, A, T, G or C, in different codon positions such as the first, the second and the third. For example, the component p1A denotes the proportion of the nucleotide A in the first codon position, and the component p3G denotes the proportion of the nucleotide G in the third codon position. Then we can obtain a 12D data set. Now we use PCA with r to reduce the 12D data set to acceptable outputs in a low-dimensional space, where r ¼0.89. Of course, someone can choose a different r. Then we obtain a 6D data set as the outputs. We only lose less than 11% of the original information. Then we use LLE with k to reduce the 6D data set to the outputs in a 2D space. In order to choose a suitable k, we draw the e-curve with the change in k. Fig. 2(a) shows a plot of e(k) for k ranging from 1 to 90. A close look at the figure reveals that the reconstruction error e is dynamically changed with k.

6

x 10-3

2.4. CDR approach compared with the PCA and LLE approaches

0.1

5.5

0.08

5

0.06 0.04

4.5

0.02

4

0

3.5

-0.02

3

-0.04

2.5

-0.06

2 1.5

-0.08 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 k

-0.1 -0.1

-0.08 -0.06 -0.04 -0.02

0

0.02

0.04

k = 29

Fig. 2. (a) Plot of e(k) for k ranging from 1 to 90. (b) Final output data set by CDR with PCA by r ¼ 0.89 and LLE by k¼29.

0.06

0.08

0.1

30

Z.-H. Qi, R.-Y. Wei / Journal of Theoretical Biology 272 (2011) 26–34

0.1

0.1

0.08

0.08

0.06

0.06

0.04

0.04

0.02

0.02

0

0

-0.02

-0.02

-0.04

-0.04

-0.06

-0.06

-0.08

-0.08

-0.1 -0.1 -0.08 -0.06 -0.04 -0.02

0

0.02 0.04 0.06 0.08

-0.1 -0.1 -0.08 -0.06 -0.04 -0.02

0.1

0

0.02 0.04 0.06 0.08

0.1

1

0.1 0.08

0.9

0.06 0.04

0.8  (j)

0.02 0 -0.02

0.7 0.6

-0.04 -0.06

0.5

-0.08 -0.1 -0.1 -0.08 -0.06 -0.04 -0.02

0

0.02 0.04 0.06 0.08

0.1

0.4

1

2

3

4

5

6

7

8

9

10

11

12

j Fig. 3. Final output data set by (a) PCA, (b) LLE and (c) CDR. (d) Plot of r(j) for j ranging from 1 to 12 (see Section 2.3, rðjÞ ¼

3. Applications 3.1. Genomic data used for this study Complete genome sequences were downloaded from NCBI Genbank (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). Here, the analysis of genomes was performed by some bacteria consisting of 11 Gram-positive and 12 Gram-negative bacteria (Sorimachi and Okayasu, 2004). To simplify the description, we list only the strain name and its accession number. The genome sequences used for this study are summarized as follows: Staphylococcus aureus Mu50 (BA000017.4), Streptococcus pyogenes M1 (AE004092.1), Bacillus subtilis (AL009126.2), Clostridium perfringens 13 (BA000016.3), Listeria monocytogenes (AL591824.1), Mycoplasma pulmonis (AL445566.1), Mycoplasma genitalium (L43967.2), Mycoplasma pneumoniae (U00089.2), Ureaplasma urealyticum (CP001184.1), Mycobacterium tuberculosis (AE000516.2), Mycobacterium leprae (AL450380.1), Rickettsia prowazekii (AJ235269.1), Borrelia burgdorferi (AE000783.1), Campylobacter jejuni (CP000538.1), Helicobacteripylori 26695 (AE000511.1), Helicobacter pylori J99 (AE001439.1), Escherichia coli (U00096.2), Salmonella typhi (AL513382.1), Vibrio cholerae (AE003852.1, AE003853.1), Yersinia pestis (AL590842.1), Neisseria meningitides (AL157959.1), Haemophilus influenzae (L42023.1) and Treponema pallidum (AE000520.1).

Pm

j¼1

lj =

PD

i¼1

li ).

study of the patterns. The four nucleotide frequencies in human (Zhang and Chou, 1993, 1996) and E. coli genes (Zhang and Chou, 1994) were graphically represented by points in a 3D space. Similar approaches were used to analyze HIV (Chou and Zhang, 1992) and anti-sense proteins (Chou et al., 1996). Okayasu and Sorimachi (2009) used GC frequencies at the three codon positions to classify 145 organisms into two types: ‘‘GC-type’’ and ‘‘AT-type’’. Because of the limitation of the graphical representation in a 2D space, the global frequency in the ‘‘GC-type’’ and ‘‘AT-type’’ may cover some detailed characteristics hidden in genomes. Here, we investigate the codon position characteristics in genome from the four nucleotide frequencies at the first codon position, the second codon position, the third codon position and the three codon positions, respectively. First, for a given gene we construct a multiple component vector with the four nucleotide frequencies at the corresponding codon position to describe its codon position characteristics. As for genome of an organism, there are thousands of genes. Then we will obtain a high-dimensional data set consisting of the corresponding vectors derived from genes. In order to discover more details hidden in all genes, we use the presented CDR method to investigate the codon position characteristics in genome.

3.3. Investigation of codon position characteristics by CDR method 3.2. Codon position characteristics Codon usage patterns were used to analyze a wide range of biological problems. People get some important insights from the

The complete genome of a species consists of thousands of genes. In order to visualize and uncover the information hidden in the highdimensional data set derived from the complete genome of the species,

Z.-H. Qi, R.-Y. Wei / Journal of Theoretical Biology 272 (2011) 26–34

we reduce the data set to two-dimension by the CDR method. Then we give the 2D graphical representation of the patterns of nucleotide compositions and utilize the representation to intuitively observe the evolution patterns among different organisms. As for the genomic data used for this study shown in Section 3.1, the detailed investigations of codon position characteristics are as follows. 3.3.1. Characteristics at the third codon position The complete genomes of the bacteria in Section 3.1 are classified into two groups, ‘‘S-type’’ represented by Staphylococcus aureus Mu50

31

(BA000017.4) and ‘‘E-type’’ represented by Escherichia coli (U00096.2) (Sorimachi and Okayasu, 2004; Qi et al., 2009) though authors use their different methods. Here, we first discover more details at the third codon position of each type by the graphical representation in a 2D space. In the actual process of dimensionality reduction, we find that the graphical representation of the final output has different shapes among different genomes. In order to easily distinguish one genome from the other, in each graph we represent two different data 2 sets derived from different genomes. Then we need to draw C23 graphs according to the Permutation and Combination theory. To

Fig. 4. 2D dot-matrix graphs of nucleotide compositions derived from the third codon position of the complete genomes of various bacteria in Section 3.1. As shown in the figure, red notation ‘‘dot’’ gives the representation of S. aureus Mu50. Blue notation ‘‘plus’’ represents the pattern of the comparative genome. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

32

Z.-H. Qi, R.-Y. Wei / Journal of Theoretical Biology 272 (2011) 26–34

simplify the description, we choose the representative of the ‘‘S-type’’, Staphylococcus aureus Mu50 (BA000017.4), as the basic comparison genome. By the CDR method, the patterns of nucleotide compositions derived from the third codon position of the complete genomes of various bacteria in Section 3.1 are the 2D dot-cluster graphs, as shown in Fig. 4. Here, the parameter r within CDR is 0.85. The parameter k is determined by the algorithm described in Section 2.2. In Fig. 4, the coordinate range in some graphs such as Fig. 4 (2) and (5) is [ 0.4, 0.4] when the original 4D data can be reduced to 2D by the sub-algorithm of CDR (PCA with r ¼0.85). If the original 4D data by the sub-algorithm is reduced to an intermediate data set with more than 2D, the other sub-algorithm of CDR (LLE with k) is used to reduce the intermediate data to the final 2D data set. The coordinate range in the graphs based on the final 2D data such as Fig. 4 (1) and (3) is [  0.1, 0.1]. The difference between two coordinate ranges results from the different scales of the final 2D data from different dimensionality reduction approaches. It is important for us to derive the patterns from the final 2D data set instead of the coordinate ranges. A close look at Fig. 4, we find that the dot-clusters in graphs are mainly grouped into three types: (i) one overlapping cluster such as Fig. 4 (1) and (3), (ii) two clusters closely adjacent to each other such as Fig. 4(16) and (17) and (iii) two clusters completely separate from each other such as Fig. 4 (9) and (10). According to the patterns in graphs, we can quickly determine the relationship between two different genomes in their nucleotide compositions of the third codon position. This is a general method if someone wants to investigate the detailed characteristics of the third codon position among different genomes. 3.3.2. Characteristics at the first codon position, at the second codon position and at the three codon positions Similarly, by the CDR method we can also investigate the nucleotide composition patterns of the complete genomes of

various bacteria at the first codon position, at the second codon position and at the three codon positions. To simplify the description, we no longer list all graphs for the specific codon position. Here, we choose several representative genomes such as ‘‘S. aureus Mu50 vs. S. pyogenes M1’’ (one overlapping cluster), ‘‘S. aureus Mu50 vs. E. coli’’ (two closely adjacent clusters) and ‘‘S. aureus Mu50 vs. M. tuberculosis’’ (two completely separate clusters) to demonstrate the characteristics at different codon positions. A 4D data set is reduced to 2D by the proposed CDR method when we investigate the characteristics at the first codon position, at the second codon position and at the third codon position. A 12D data set is reduced to 2D by the CDR method when investigating the characteristics at the three codon positions. Results of the survey are demonstrated in Fig. 5. From Fig. 5, the pattern in each type of representative genome is so similar that we could not easily identify the differences among them. A close look at the relative direction in dot-matrix graph shows that the patterns of three graphs are different the others. One is the pattern of the three codon positions from the first line of Fig. 5. The others are the patterns of the third codon position and the three codon positions from the second line of Fig. 5. Here, the change of relative direction in dot-matrix graph does not affect the type of division. The LLE algorithm in Section 2.2 gives the related reasons. The first step of the algorithm is to find the k nearest neighbors of each vector in high-dimensional space. Then the low-dimensional embeddings are found that best preserve high-dimensional neighborhoods geometry by minimizing the related embedding cost function. As a result, the relative direction in 2D dot-matrix graph can change with the different parameter k. However, the geometry relationship among the points in low-dimensional space maintains the neighbor information of the high-dimensional data set. So if the relative direction in 2D dot-matrix graph is ignored, the nucleotide composition patterns within different codon positions are almost identical to each other. In a word, the proposed CDR method is a useful tool to discover the patterns within the high-dimensional data. The same patterns within

Fig. 5. Comparison of the characteristics among several representative genomes at the first codon position, at the second codon position, at the third codon position and at the three codon positions.

Z.-H. Qi, R.-Y. Wei / Journal of Theoretical Biology 272 (2011) 26–34

the high-dimensional data can be discovered although the dimensionalities of the original data are different from each other.

3.3.3. Relationship between the types of eubacteria based on the proposed CDR method and the GC contents of their genomes As shown in Figs. 4 and 5, the proportion of the bases A, T, G and C at three codon positions has reduced the dimensions from 4D or 12D data to 2D data. The 2D data in the graphical representation are grouped into three types: (i) one overlapping cluster such as Fig. 4(1) and (3), (ii) two clusters closely adjacent to each other such as Fig. 4 (16) and (17) and (iii) two clusters completely separate from each other such as Fig. 4 (9) and (10). Then we continue to investigate the GC contents of each genome. The statistic results are as follows.

(i) The first type with one overlapping cluster includes these eubacteria: S. aureus Mu50 (33.57%), S. pyogenes M1 (39.14%), B. subtilis (44.22%), C. perfringens 13 (29.47%), L. monocytogenes (38.44%), M. pulmonis (27.29%), M. genitalium (31.55%), M. pneumoniae (40.74%), U. urealyticum (25.99%), R. prowazekii (30.41%), B. burgdorferi (28.82%), C. jejuni (30.94%), H. pylori 26695 (39.56%), H. pylori J99 (39.90%) and H. influenzae (38.82%). (ii) The second type with two clusters closely adjacent to each other is the following eubacteria: E. coli (51.82%), S. typhi (53.25%), V. Cholerae (48.29%), Y. pestis (48.90%), N. meningitidis (53.41%) and T. pallidum (52.61%). (iii) The third type with two clusters completely separate from each other is the following: M. tuberculosis (65.77%) and M. leprae (60.12%).

It is well known that the GC contents of genomes are closely related to the types of organisms. According to the above statistic results about GC contents, we easily discover the close relationship between the types of eubacteria by the proposed CDR method and the GC contents of their genomes. The S. aureus Mu50 is a eubacterium with a low GC content (33.57%). The 2D dot-matrix graph shows longer distance when the genome compared with S. aureus Mu50 has a higher GC content. In addition to the close relationship between distances and GC contents, the 2D dot-matrix graphs still show colorful patterns of clustering.

33

3.4. Investigation of the amino acid usage and codon usage characteristics in genomes by CDR method In Section 3.3, we investigate the characteristics at the first codon position, at the second codon position and at the third codon position, and the characteristics at the three codon positions. The proposed CDR method reduces the dimensions from 4D and 12D data to 2D data. In this section, we continue to investigate the characteristics about the amino acid usage and codon usage characteristics in genomes. It is well known that there are 20 different amino acids and 61 different codons except for 3 stop codons. In order to investigate the usage characteristics, we should reduce the dimensions from 20D and 61D data to 2D data by the proposed CDR method. To simplify the description, we no longer list all graphs for the specific genomes. Here, we choose the representative genomes ‘‘S. aureus Mu50 vs. S. pyogenes M1’’ (one overlapping cluster). The final output data set is shown in Fig. 6. Linking among the first four figures for ‘‘S. aureus Mu50 vs. S. pyogenes M1’’ in Fig. 5, and the two figures of Fig. 6, we obtain almost the same dimension reduction results although the dimensionalities within different data sets range from 4D to 61D. The graphical results further show that the proposed CDR method is a powerful tool to discover the patterns within the high-dimensional data. The same patterns within the high-dimensional data can be really discovered although the dimensionalities of the original data are highly different from each other.

3.5. Software package To facilitate biologists, a software package ‘Programs 1.0’ has been developed and is very easy to use. The software package consists of two sub-packages: ‘PERL 1.0’ and ‘CDR 1.0’. In the subpackage ‘PERL 1.0’, there are four Perl codes to derive the percentage of nucleotide compositions at different codon positions from genome. Then one can obtain a high-dimensional data such as 4D or 12D. In order to investigate the evolution patterns hidden in different species, one can utilize the other sub-package ‘CDR 1.0’ to reduce the 4D or 12D data to 2D, and then the dot-cluster graph based on the final 2D data is drawn by Matlab. The details about the above steps are described in the file ‘Readme.txt’ within the software package ‘Programs 1.0’. The software packages are freely available for academic use by contacting the corresponding author.

Fig. 6. Green notation ‘‘dot’’ gives the representation of S. aureus Mu50. Blue notation ‘‘plus’’ represents the pattern of S. pyogenes M1. (a) Final 2D dot-matrix graphs of the 20D vectors for amino acid usage. (b) Final 2D dot-matrix graphs of the 61D vectors for codon usage. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

34

Z.-H. Qi, R.-Y. Wei / Journal of Theoretical Biology 272 (2011) 26–34

4. Conclusion In this paper, we propose a combination dimensionality reduction approach CDR for linear and nonlinear dimensionality reduction and apply it to the visualization of high-dimensional data in a 2D embedded space. The CDR method includes two parts: (i) PCA with a parameter r and (ii) LLE with a proposed selection for its optional parameter k (k nearest neighbors). Then we apply the CDR method to the analysis of codon position characteristics in genome. The results indicate that the method is feasible and can be used in investigating the classification patterns among different organisms. It can also discover the essential patterns hidden in the high-dimensional data, though the dimensionalities are possibly different from each other. The corresponding software packages have been developed, and will be helpful to biologists.

Conflict of interest The authors declare that they have no conflict of interest.

Acknowledgments We thank the anonymous reviewers for their valuable comments to improve this paper. This work is supported by Natural Science Foundation of Hebei Province Education Department (Project no. 2010276).

Appendix A. Supplementary material Supplementary material associated with this article can be found in the online version at doi:10.1016/j.jtbi.2010.12.014.

References Andraos, J., 2008. Kinetic plasticity and the determination of product ratios forkinetic schemes leading to multiple products without rate laws: new methods based on directed graphs. Canadian Journal of Chemistry 86, 342–357. Chojnacki, W., Brooks, M.J., 2009. A note on the locally linear embedding algorithm. International Journal of Pattern Recognition and Artificial Intelligence 23, 1739–1752. Chou, K.C., 1980. A new schematic method in enzyme kinetics. European Journal of Biochemistry 113, 195–198. Chou, K.C., 2010. Graphic rule for drug metabolism systems. Current Drug Metabolism 11 (4), 369–378. Chou, K.C., Forsen, S., 1980. Graphical rules for enzyme-catalyzed rate laws. Biochemical Journal 187, 829–835. Chou, K.C., Forsen, S., 1981. Graphical rules of steady-state reaction systems. Canadian Journal of Chemistry 59, 737–755. Chou, K.C., Liu, W.M., 1981. Graphical rules for non-steady state enzyme kinetics. Journal of Theoretical Biology 91, 637–654. Chou, K.C., 1989. Graphical rules in steady and non-steady enzyme kinetics. Journal of Biological Chemistry 264, 12074–12079. Chou, K.C., 1990. Review: applications of graph theory to enzyme kinetics and protein folding kinetics. Steady and non-steady state systems. Biophysical Chemistry 35, 1–24. Chou, K.C., 1993. Graphic rule for non-steady-state enzyme kinetics and protein folding kinetics. Journal of Mathematical Chemistry 12, 97–108. Chou, K.C., Elrod, D.W., 1999. Protein subcellular location prediction. Protein Engineering 12, 107–118. Chou, K.C., Zhang, C.T., 1992. Diagramatization of codon usage in 339 HIV proteins and its biological implication. AIDS Research and Human Retroviruses 8, 1967–1976. Chou, K.C., Zhang, C.T., Elrod, D.W., 1996. Do antisense proteins exist? Journal of Protein Chemistry 15, 59–61.

Costa, J.C., Alves, M.M., Ferreira, E.C., 2009. Principal component analysis and quantitative image analysis to predict effects of toxics in anaerobic granular sludge. Bioresource Technology 100, 1180–1185. Diao, Y., Li, M., Feng, Z., Yin, J., Pan, Y., 2007. The community structure of human cellular signaling network. Journal of Theoretical Biology 247, 608–615. Du, Q.S., Jiang, Z.Q., He, W.Z., Li, D.P., Chou, K.C., 2006. Amino acid principal component analysis (AAPCA) and its applications in protein structural class prediction. Journal of Biomolecular Structure and Dynamics 23, 635–640. Jackson, J.E., Wiley, J.W., 1991. A User’s Guide to Principle Components. WileyInterscience, New York. Okayasu, T., Sorimachi, K., 2009. Organisms can essentially be classified according to two codon patterns. Amino Acids 36 (2), 261–271. Qi, X.Q., Wen, J., Qi, Z.H., 2007. New 3D graphical representation of DNA sequence based on dual nucleotides. Journal of Theoretical Biology 249, 681–690. Qi, Z.H., Qi, X.Q., 2007. Novel 2D graphical representation of DNA sequence based on dual nucleotides. Chemical Physics Letters 440, 139–144. Qi, Z.H., Fan, T.R., 2007. PN-curve: a 3D graphical representation of DNA sequences and their numerical characterization. Chemical Physics Letters 442, 434–440. Qi, Z.H., Qi, X.Q., 2009. Numerical characterization of DNA sequences based on digital signal method. Computers in Biology and Medicine 39, 388–391. Qi, Z.H., Wang, J.M., Qi, X.Q., 2009. Classification analysis of dual nucleotides using dimension reduction. Journal of Theoretical Biology 260, 104–109. Qi, Z.H., Qi, X.Q., Liu, C.C., 2010. New method for global alignment of 2 DNA sequences by the tree data structure. Journal of Theoretical Biology 263, 227–236. Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326. Randic, M., Vracko, M., Lers, N., Plavsic, D., 2003a. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation. Chemical Physics Letters 371, 202–207. Randic, M., Vracko, M., Lers, N., Plavsic, D., 2003b. Novel 2-D graphical representation of DNA sequences and their numerical characterization. Chemical Physics Letters 368, 1–6. Randic, 2006. Spectrum-like graphical representation of DNA based on codons. Acta Chimica Slovenica 53, 477–485. Randic, M., Zupan, J., Drazen, V.T., Plavsic, D., 2006. A novel unexpected use of a graphical representation of DNA: graphical alignment of DNA sequences. Chemical Physics Letters 431, 375–379. Shikata, N., Maki, Y., Noguchi, Y., et al., 2007. Multi-layered network structure of amino acid (AA) metabolism characterized by each essential AA-deficient condition. Amino Acids 33, 113–121. Sorimachi, K., Okayasu, T., 2004. Classification of eubacteria based on their complete genome: where does Mycoplasmataceae belong? Biology Letters 271, S127–S130. Sorimachi, K., Okayasu, T., 2008a. Universal rules governing genome evolution expressed by linear formulas. Open Genomics Journal 1, 33–43. Sorimachi, K., Okayasu, T., 2008b. Codon evolution is governed by linear formulas. Amino Acids 34 (4), 661–668. Wang, T., Yang, J., Shen, H.B., Chou, K.C., 2008. Predicting membrane protein types by the LLDA algorithm. Protein & Peptide Letters 15, 915–921. Xiao, X., Shao, S.H., Ding, Y.S., Huang, Z.D., Chou, K.C., 2006a. Using cellular automata images and pseudo amino acid composition to predict protein subcellular location. Amino Acids 30, 49–54. Xiao, X., Shao, S.H., Chou, K.C., 2006b. A probability cellular automaton model for hepatitis B viral infections. Biochemical and Biophysical Research Communications 342, 605–610. Xiao, X., Wang, P., Chou, K.C., 2008. Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image. Journal of Theoretical Biology 254, 691–696. Xiao, X., Wang, P., Chou, K.C., 2009. GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes. Journal of Computational Chemistry 30, 1414–1423. Yao, Y.H., Nan, X.Y., Wang, T.M., 2006. A new 2D graphical representation classification curve and the analysis of similarity/dissimilarity of DNA sequences. Journal of Molecular Structure: Theochem 764, 101–108. Yao, Y.H., Dai, Q., Nan, X.Y., et al., 2008. Analysis of similarity/dissimilarity of DNA sequences based on a class of 2D graphical representation. Journal of Computational Chemistry 2, 1632–1639. Yao, Y.H., Dai, Q., Li, L., et al., 2010. Similarity/dissimilarity studies of protein sequences based on a New 2D graphical representation. Journal of Computational Chemistry 31, 1045–1052. Zhang, C.T., Chou, K.C., 1993. Graphic analysis of codon usage strategy in 1490 human proteins. Journal of Protein Chemistry 12, 329–335. Zhang, C.T., Chou, K.C., 1994. Analysis of codon usage in 1562 E. coli protein coding sequences. Journal of Molecular Biology 238, 1–8. Zhang, C.T., Chou, K.C., 1996. An analysis of base frequencies in the anti-sense strands corresponding to the 180 human protein coding sequences. Amino Acids 10, 253–262.