Commun Nonlinear Sci Numer Simulat 14 (2009) 2602–2618
Entropy embedding and fluctuation analysis in genomic manifolds
Enrico Capobianco
CRS4 Bioinformatics Laboratory, Technology Park of Sardinia, Localita' Piscinamanna, 09010 Pula, Cagliari, Italy
Article info
Article history: Received 12 July 2007; Received in revised form 22 July 2008; Accepted 16 September 2008; Available online 27 September 2008
PACS: 02.50.Fz; 02.50.Tt; 02.60.-x; 05.10.-a
Keywords: Biological manifolds; Dimensionality reduction; Independent component analysis; Embedding; Entropy fluctuations
Abstract
The complexity of typically high-dimensional genomic data requires computational work aimed at integrating different biological information sources through efficient model solutions. Usually, one step involves dimensionality reduction (DR), which requires projecting the input data onto a low-dimensional subspace and often leads to an embedding. Thus, DR should be able to filter out the uninformative dimensions and recover the original variables. This step is of crucial relevance for any reverse engineering and statistical inference attempt to reconstruct the dynamics underlying the biological system under study, i.e. the interactions between its genes or proteins. DR has become almost a standard practice, following the pre-processing steps typically applied to the experimental measurements (mining, normalization, etc.). In this work, the data for the analysis reflect expression values of genes whose dynamics are affected by perturbation experiments. In particular, the aims are to monitor the response of genes involved in a certain pathway, and then to isolate their biological variability from any possible external influence. Last, it is of interest to control the stability of the system; in this regard, we look at dynamical aspects of the data through embedding theory and entropy fluctuation analysis. We demonstrate that a redundant biological system can in principle be reduced to a minimal number of almost independent components. In particular, such structures detect the higher-order statistical dependencies in the training data in addition to the correlations. Two popular DR techniques are compared in relation to their ability to extract the most salient features, allow gene selection, and minimize the various interferences due to algorithmic approximation errors and variable noise covers.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction

A paradigmatic example of a biological application field where the design and implementation of computationally efficient methods leverages statistical and machine learning tools is represented by gene microarrays. Due to the large genomes examined and the relatively few samples obtained from the available experimental conditions, it is common to refer to this dimensionality problem as the Large p-Small n microarray paradigm. Consequently, dimensionality reduction (DR) is usually required before applying any inference model step, with the aim of eliminating non-informative data features, including noise. However, it is likely that, given the available samples, a certain amount of bias affects the outcomes of standard statistical methods. There is an inherent complexity in the noisy microarray data available nowadays, which is in part due to their empirical distributions, mostly revealing non-Gaussian behavior. We thus adopt blind source separation methods, whose aim is to separate a set of unknown signal sources mixed together in the observed data (i.e. mixtures or mixed signals).
Under such premises, models with latent variables can play a role that becomes clear when both the source signals and the mixing parameters are retrieved (by suitable identification and estimation). The two techniques that we investigate here in more detail are Principal Component Analysis [36], or PCA,1 and Independent Component Analysis (ICA) [12-14,17]. Experiments are run with the Escherichia coli bacterium as the model organism. The usual and natural attempt is to identify which genes significantly change expression values in response to induced perturbations. Because some of the genes are co-expressed (they show strongly correlated expression patterns) and possibly co-regulated (they have a common transcription factor in their promoter regions), one would like to group them according to a characterization based on qualitative aspects such as specific biological functions, molecular processes, and pathways. The paper is organized as follows. Section 2 presents background concepts on manifolds and source separation. Section 3 describes experimental and numerical computations with microarray data. Section 4 refers to entropy, embedding, predictability and fluctuations. Section 5 concludes.

2. Background on theory and methods

Complex systems usually feature convoluted dynamics, redundancies, noise, and high-order dependencies. However, a complex and high-dimensional system might have just a few latent variables or dimensions which encode the data and explain their dynamics. This implies that most of the degrees of freedom can refer to noise and artifacts, which should be filtered out. A first action is thus to try to reduce the complexity, starting with dimensionality. The common way to proceed with its identification is through the evaluation of eigenspectra from data-based covariance matrices. The underlying idea is that the convergence to a minimal or intrinsic dimensionality (ID) is driven by a certain (small) number of components that represent linear combinations of the original variables. Once such components are identified, the observed high-dimensional input space is split into decorrelated (via PCA) or statistically independent (via ICA) structures (see the Appendix for a diagram of relationships among these methods). Thus, instead of considering the embedding, i.e. the number of variables or attributes, and only a representational dimensionality for the problem, searching for the ID delivers the real number of dimensions in which the points can be encoded while preserving their mutual distances (i.e. an average local dimensionality, due to inherent clustering for instance). See the work of [45] for further analysis. In general, the ID of a data manifold refers to the least number of functionally independent structures that identify the observations from the function class characterizing the manifold to which they belong. With high dimensionality, a DR method projects the inputs onto a feature space and ideally allows approximation of the input system by a small number of degrees of freedom. As a result, a limited number of dimensions may be used to infer or reconstruct the internal system's dynamics, select the features of interest, and obtain reliable predictions for the target variables. In the present study, the variables of interest are genes, and we aim to reduce the genome dimensionality and select subsets of genes which are functionally linked (i.e. they belong to a specific pathway, for instance) and together characterize a certain component.
Both PCA and ICA enable DR and consequently noise removal, and are particularly effective at gene selection. We illustrate their comparative performance, and then merge them into a more comprehensive methodological context. With the goal of achieving a regularized solution, we suggest using shrinkage for eliminating isolated gene pattern discontinuities, probably linked to non-informative genes. Thus, relatively few genes are selected and considered informative, i.e. those that the identified components assign to maximally independent or least dependent groups.

2.1. Classical blind source separation: SVD and PCA

Classical techniques such as SVD or PCA seek a subspace of R^N where, given the data Y and an orthogonal projection operator P(Y), a total square distance is minimized:
D = ‖Y − P[Y]‖² = Σ_{j=1}^{N} ∫₀ᵀ ‖Y_j(t) − P[Y_j(t)]‖² dt.    (1)
From the correlation matrix Cor_Y = Σ_{j=1}^{N} ∫₀ᵀ Y_j(t) Y_j′(t) dt, by looking at the eigenvalue spectrum (i.e. the sorted eigenvalues λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_N ≥ 0), the minimum value of D over all possible k (< N) dimensional subspaces is given by Σ_{j=k+1}^{N} λ_j. Correspondingly, the minimizing invariant subspace refers to the eigenvalues λ₁, ..., λ_k. This basically means that a given statistical ensemble can be represented by a minimal number of modes reflecting the number of degrees of freedom in the system. In classical SVD terms, a decomposition is obtained such that Y = UΛV′ = Σ_{i=1}^{N} λ_i u_i v_i′, where Y ∈ R^{N×M}, U ∈ R^{N×N} is an orthogonal matrix of left singular vectors u_i ∈ R^N, i.e. the eigenvectors of YY′, V ∈ R^{M×M} is an orthogonal matrix of right singular vectors v_i ∈ R^M, i.e. the eigenvectors of Y′Y, and Λ ∈ R^{N×M} is a diagonal matrix of singular values, ordered such that λ₁ ≥ ⋯ ≥ λ_k ≥ 0, with k = min(N, M).
1 While Singular Value Decomposition (SVD) (from [2,21,27,29]) applies to the raw data, one can equivalently consider the decomposition of covariance or correlation functions via PCA.
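As a concrete illustration of the truncated decomposition just described, the following minimal sketch computes the SVD of a (genes × conditions) matrix with NumPy and retains the k dominant modes; the matrix dimensions, the value of k, and the random data are illustrative assumptions, not values taken from the experiments reported here.

```python
import numpy as np

# Illustrative dimensions only: N genes observed under M conditions.
rng = np.random.default_rng(0)
N, M, k = 2330, 6, 4
Y = rng.standard_normal((N, M))          # stand-in for a gene expression matrix

# SVD: Y = U diag(lam) V', with singular values sorted in decreasing order.
U, lam, Vt = np.linalg.svd(Y, full_matrices=False)

# Rank-k reconstruction: keep only the k dominant modes.
Y_k = U[:, :k] @ np.diag(lam[:k]) @ Vt[:k, :]

# The squared Frobenius residual equals the sum of the discarded squared
# singular values, mirroring the minimum of D over k-dimensional subspaces.
residual = np.linalg.norm(Y - Y_k) ** 2
print(residual, np.sum(lam[k:] ** 2))
```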
If the recorded signals are linearly independent, all these structures are needed for a perfect reconstruction. Vice versa, under some degree of linear dependence, the full reconstruction power can be given by much fewer structures. Under Gaussianity and orthogonality of the decomposition structures, the decomposition leads to a decorrelated system, i.e. one where the linear dependence carried by the first two distributional moments has been removed. Thus, Cor_Y = C(y_i, y_j) = C_{ij} = λ_i δ_{ij}, with δ_{ij} (the Kronecker delta) equal to 0 or 1 for i ≠ j or i = j, respectively. In the case of non-Gaussian probabilistic hypotheses, the decorrelation property holds only approximately. One can miss important structure from higher-order dependence, as a substantial loss of information can occur, due to the difference between the estimated and true linear correlation, L_{ij} = Ĉ_{ij} − λ_i δ_{ij} ≠ 0. When this gap is minimized through a minimal number of components, we may assume to have approached the ID of the system under study, and thus a representation in a minimal number of functional approximating structures.

2.2. A modern blind source separation approach via ICA

The weakness of PCA is well-known: the components are valuable feature detectors under Gaussianity, thus with respect to covariance-based (second-order) statistics. However, the latter do not suffice to detect localized structure. Instead of just decorrelating the data, ICA exploits the higher-order moments of the data distribution, thus the statistical information needed to detect stronger forms of dependencies beyond the covariance function. While PCA separates the signal and noise subspaces, and sorts the former by decreasing variability, ICA is mainly designed to extract latent sources from observed data mixtures,2 according to different principles. Its goal is exploring the non-Gaussianity in the data (ICA is thus linked to projection pursuit regression, [20,23,24,31,37]). The simplest ICA representation is the linear one, and in general the model is given by X = AS + ε, where the mixture X consists of k-dimensional vectors x_i, i = 1, ..., n, factorized by independent non-Gaussian information sources (or components) S (of k-dimensional vectors s_j, j = 1, ..., m), mixed linearly according to a mixing matrix A regulating the components. For stochastic systems generating complex and convoluted observed dynamics, ICA can approximate both S and A, given S ∈ R^m, X ∈ R^n, and A ∈ R^{n×m}. For example, in microarray applications we can set the observed gene expression matrix G according to a mixture of source signals S modulated by coefficients in A and additive noise ε. The available sample size n equals the number of experimental conditions through which the gene expression measures are obtained. The independent non-Gaussian sources have dimensionality according to the number of genes, and are bounded in number by the number of conditions, i.e. they approach a number m which is less than or equal to the rank of the gene matrix. With ICA, the original dimensionality is reduced so as to reflect either the number of column vectors of the estimated mixing matrix (i.e. the directions of the data projections along which one expects to find structure), or equivalently the number of the components that are identified. ICA is a convenient DR method for at least the following reasons:
- The high-dimensionality of the observed data manifold can be absorbed with good approximation accuracy by a linear approximating subspace.
- While the number of dominant singular values approaches the ID, some of the smaller singular values might not be negligibly different from zero; their selection is complicated by the possible convolution of dependence and noise.
- The decomposition leads to a few dominant modes which are, if not statistically independent, at least minimally dependent.
- Even without imposing a non-linear structure on the system, the linear ICA representation is not a limiting structural property, as the built-in numerical optimization step involves a non-linear (locally optimal) solution search performed after rotating from the projected space.3

Recent proposals extend ICA towards methodological advances, such as non-linear ICA [1,34,38], for which it is also quite hard to find unique solutions without suitable regularization. Linear and non-linear ICA (see Ref. [42] for an example of application to microarray data) can in principle be compared, but we showed previously [9] alternative ways to control whether the linearity assumptions of the ICA model come with acceptable penalties or not (i.e. by comparing correlation and mutual information surfaces). New methods [46] have also referred to self-organizing maps [40], and to ensemble learning approaches using multilayer perceptron networks, state-space modelling [16], and natural gradient methods [52]. Last, the application of kernel methods [26] has been proposed, as with Kernel-ICA [5], which maps the data to a high-dimensional feature space where the effective subspace is then retrieved and the so-called kernel trick is exploited.4
2 The model representation can then be set to deal with linear or non-linear variable relationships, following parametric statistical inference or semiparametric approaches (for instance, via estimating equations theory). For a review of methodological concepts linked to ICA see, among others, [3,33,34].
3 The price to pay for such an extra step is that no unique optimal solution is usually found, and several numerical solutions have been proposed based on entropic functionals, fixed point solutions, etc.
4 The kernel trick transforms algorithms that depend on the dot product between two vectors, which is replaced with the kernel function. Thus, a linear algorithm can easily be transformed into a non-linear algorithm. However, because kernels are used, the original functions are never explicitly computed. This is desirable especially in high-dimensional spaces.
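To make the linear mixing model concrete, the following sketch simulates a small system X = AS + ε with non-Gaussian sources; the sizes, source distribution, and noise level are illustrative assumptions, not the experimental setting of this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, genes = 4, 6, 2330          # sources, conditions, genes (illustrative)

# Non-Gaussian sources: heavy-tailed (Laplace) gene-wise profiles.
S = rng.laplace(size=(m, genes))

# Mixing matrix: one column of weights per source, one row per condition.
A = rng.standard_normal((n, m))

# Observed mixtures (a conditions-by-genes expression matrix) plus additive noise.
eps = 0.1 * rng.standard_normal((n, genes))
X = A @ S + eps
print(X.shape)   # (6, 2330)
```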
2.3. ICA implementation aspects

Several computational algorithms can be efficiently applied for this purpose; the Joint Approximate Diagonalization of Eigenmatrices (for Real signals), or JadeR [14], and fastICA [32,33] are probably the most popular ones. The first possible step, as said, involves the estimation of the mixing matrix. The components can then be obtained by applying either the inverse or the pseudo-inverse matrix,5 which leads to the de-mixing matrix B. Then, one needs to approximate the unknown S through Y = BX. Only under conditions of perfect separation does the recovery Y = BAS = S make sense; in general, the solutions hold only approximately, and in particular up to permutation P and scaling D matrices, i.e. Y = DPS is what can be computed.6 ICA induces a suitable system decomposition in just a few steps. It starts with a whitening transform, A = URVᵗ, with U ∈ R^{n×m} and V ∈ R^{m×m} matrices with orthonormal columns, and R an m-diagonal matrix. Then it applies a rotational step, i.e. W = QR⁻¹Uᵗ, with Q orthogonal under ICA and Q = I under PCA. The element Uᵗ projects the data into an m-dimensional source space, and R⁻¹ scales the projections to unit variance. The final step refers to optimization in the projected solution space, according to statistical or information theory criteria, and involves numerical or iterative algorithms. Before looking at other algorithmic features, it is important to start from what theoretical studies have stated: if the number of observed mixtures (n) is greater than or equal to the number of sources (m), then it is possible to separate statistically independent sources provided that at most one of them is Gaussian [12,33]. This particular property is required for uniqueness and stability of the solutions to be guaranteed. Accordingly, source recovery is possible only under non-Gaussianity; otherwise any rotation or orthogonal transformation would allow new components to be found and the original sources could not be uniquely extracted. Intuitively, non-Gaussian directions in the projected space are considered the most interesting ones, as in this case the structure deviates maximally from randomness and thus delivers relevant information. The other characterizing assumption is that the sources are statistically independent; thus, by assigning to S a multidimensional probability density function (PDF) P(S), statistical independence leads to the following joint source density factorization (at time t):
P[S(t)] = ∏_{i=1}^{m} p[s_i(t)].    (2)
The independence condition is hard to verify in practice, as we can only approximate the unknown variables and their distributions up to a certain degree, and noise and source interference can justify variable degrees of deviation from a perfect source separation. However, such a statistical assumption plays an important role for identification and estimation purposes, and it leads quite naturally to a non-parametric representation. The estimation algorithm adopted here is the previously mentioned JadeR. It works according to a sequence of conservative steps: after whitening and projection onto the signal subspace, there is a shift from data to statistics provided by cumulant matrices (encoding the full set of fourth-order cumulants) to be jointly diagonalized. Then, both a permutation (from the most energetic component) and a normalization (to unit variance) of the extracted source signals allow for a sorted unmixing of the observations, i.e. in order of decreasing energy. JadeR can be sketched as a Jacobi algorithm (due to the optimization step):
- Initialization: a whitening (sphering) matrix W is applied to the data X to obtain Z = WX;
- Statistics: estimate a maximal set of cumulant matrices CuM_i^z;
- Contrast optimization: find the rotation R̂ that maximally diagonalizes the cumulant matrices, i.e. R̂ = argmin_R Σ_i off-diag(R′[CuM_i^z]R);
- Separation: estimate the mixing matrix V̂ = W⁻¹R̂ and the sources Ŝ = V̂⁻¹X = R̂′Z.
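The whitening-plus-rotation pipeline sketched above can be exercised with any ICA implementation; since JadeR is not assumed to be available as a standard library call, the following hedged sketch uses scikit-learn's FastICA as a stand-in to recover sources from the synthetic mixture X of the previous snippet, up to the usual permutation and sign/scale ambiguity.

```python
import numpy as np
from sklearn.decomposition import FastICA

# X (conditions x genes) and S (sources x genes) come from the previous sketch.
# ICA is run on the transpose, so each gene is an observation and each condition a variable.
ica = FastICA(n_components=4, random_state=0)
S_hat = ica.fit_transform(X.T).T      # (components x genes): gene-wise source profiles
A_hat = ica.mixing_                   # (conditions x components): estimated mixing matrix

# Sources are recovered only up to permutation and sign/scale, so compare each
# estimated component with its best-matching true source by absolute correlation.
match = np.abs(np.corrcoef(np.vstack([S_hat, S]))[:4, 4:]).max(axis=1)
print(A_hat.shape, match.round(2))
```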
3. Experimental work and numerical results

3.1. Materials and data pre-processing

The organism we study is Escherichia coli, whose genome is experimentally perturbed to monitor the SOS pathway response to DNA damage.7 The data refer to an antibiotic time-course experiment, where DNA damage was induced with 10 μg/ml of norfloxacin, and samples were taken repeatedly after drug addition, for an outcome of six measurements in a time frame of 1 h. Thus,
5 When there is not a unique inverse, owing to the existence of an infinite number of independent components which are solutions of the linear problem, the latter is under-determined.
6 These are well-known model structure ambiguities, which can be fixed. The scale of the sources is usually normalized to a fixed variance, and can always be adjusted by ad hoc normalizing factors inserted in the mixing matrix. The order of the sources can also be fixed according to some criterion (for instance, by sorting by energy content, as in the JadeR algorithm).
7 The experimental data are included in a larger set of data (see [22]) which has been submitted to Gene Expression Omnibus (GEO), the NCBI microarray database (http://ncbi.nlm.nih.gov/geo/). Raw and normalized data for all 445 microarrays are available at M3D (http://m3d.bu.edu).
time-course observations are collected every 12 min, i.e. t = 0, 12, 24, 36, 48, 60, including controls through untreated cells present at t = 24, 60, and until steady state is reached. The two extra conditions refer to the no-drug states, and are therefore ignored in the analysis. The microarray data have been analyzed with various procedures, including Affy MAS 5.0 and Robust Multichip Average, or RMA (see Refs. [8,35]).8 As these two popular techniques follow different principles, we have chosen to consider both of them. It is common practice to perform some pre-processing steps aimed at delivering a reduced set of interesting genes. First, a preliminary gene screening based on t and rank tests eliminates genes showing random behavior or no variation at all. Then, with a reduced genome of 2330 genes (approximately half of the original size), a log-transformation is applied to the expression values so as to trim the strong fluctuations. Last, normalization at the initial time is implemented to limit the impact of different gene fold changes, and instead allow a comparable expression change level for all genes (or equivalently, re-scaling is performed by conditioning the absolute intensity shifts recorded gene-wise and caused by the perturbation experiments on the initial values).

3.2. Gene selection via thresholding

Consider a gene matrix G consisting of genes by the rows (in large number, according to the genome size) and experimental conditions by the columns (here time points), which together generate time-course gene expression values. By looking at rows and columns, respectively, we can observe either short-time gene signals or time profiles across the whole genome. Naturally enough, we can compute the correlation matrix and turn to an eigensystem decomposition, where the eigenvalues are the energies of the modes and the eigenvectors, say c, are determined by maximizing the energy in each mode. Thus, we are ready to explore PCA and ICA features. The rationale for such steps in genomic applications9 is to identify features f ∈ F through a mapping f : G → R. We can interpret the feature set as the result of projecting each gene profile over a particular template expression pattern that quantifies how each gene correlates (positively or negatively) with it, with a low or high degree depending on the relationships between the examined genes and the involved biological processes or pathways. Due to the mapping, we transform the gene matrix according to G = Fc′. Via a few (k) dominant eigenvalues we obtain a more compact representation, Ĝ = Σ_{i=1}^{k} f_i c_i′.10 As each eigenvalue refers to a component, which can be characterized by the action of a defined subset of genes, we need to discriminate the specifically inherent biological characterization of the components according to various gene groups, after filtering for noise-dependent and thus uninformative dynamics. Thresholding is an established technique (see for instance Ref. [19]) for obtaining a feasible discrimination between informative and noise-covered signals. It enables sparsity in redundant systems in a straightforward way, by simply keeping or killing terms depending on the magnitude of their values relative to an established cut-off, and it also induces regularization.11 In our application setting, we exploit the discriminatory power of each estimated component to enable statistical thresholding.
The gene expression values significantly deviating from an average expression value may support, with a certain confidence level, the rejection of the null hypothesis of no outlying effects. These genes can be considered outliers, i.e. they have a differentially expressed value which a statistical test establishes to be significant. The selection can now be made via fixed intervals by taking a scalar c times the standard deviation from the mean (approximately the 95% or 99% confidence levels), and then keeping those genes with values outside the computed interval. A simple shrinkage algorithm works for both PCA and ICA in a semi-automatic way (i.e. it can be repeated for each estimated component through simple steps) as follows:
- a pivot component j (with j = 1, ..., m) is considered;
- a confidence interval is centered on a statistic and a related measure of deviation is computed from the available sample, given a pre-fixed critical level cl;
- a thresholding step is applied gene-wise over the k-dimensional genomic profile which is attached to each component.
With regard to the confidence intervals, when the mean and standard deviation are computed for the k-dimensional gene expression values s_{jk} (i.e. for source j), it is straightforward to find the following:
m(cl)_{jk} = mean[s_{jk}] ± cl · std[s_{jk}].    (3)
8 Other common normalization algorithms, such as GCRMA and Dchip, were also run with the default parameters.
9 See recently published work on microarray data, in particular gene feature detection and classification [6,9,15,30,42,43], while the most recent applications have been in cancer genomics [25,44,48,53].
10 This type of system decomposition results in a generalized coordinate system defined by the eigenvectors of the gene matrix and by the mean squared error optimality principle, which leads to an optimal estimate of the parameters f_i via min ‖G − Σ_{i=1}^{k} f_i c_i′‖².
11 One applies through thresholding a sort of regularized inference (see [49]) that, under the premise of shrinking the data to eliminate noise, reduces the risk of overfitting and controls the complexity of the model.
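A minimal sketch of the mean ± cl·std thresholding of Eq. (3), applied to the estimated components from the earlier snippets; the multipliers (1.96 and 2.58 for roughly the 95% and 99% levels) and the variable names are illustrative assumptions.

```python
import numpy as np

def select_outlying_genes(component, cl=1.96):
    """Keep genes whose expression along a component falls outside mean +/- cl*std."""
    m, s = component.mean(), component.std()
    lower, upper = m - cl * s, m + cl * s
    return np.where((component < lower) | (component > upper))[0]

# S_hat (components x genes) comes from the ICA sketch above.
selected_95 = [select_outlying_genes(c, cl=1.96) for c in S_hat]
selected_99 = [select_outlying_genes(c, cl=2.58) for c in S_hat]
for j, (g95, g99) in enumerate(zip(selected_95, selected_99), start=1):
    print(f"IC{j}: {len(g95)} genes at ~95%, {len(g99)} genes at ~99%")
```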
Table 1
Jarque-Bera normality test (α = 5%)

Gaussianization           unG       lpG       hpG
Sample conditions (t)
  t = 1                   –         ns        ns
  t = 2                   95.86     94.72     ns
  t = 3                   113.18    85.01     23.47
  t = 4                   200.83    202.57    11.25
  t = 5                   441.68    371.98    52.34
  t = 6                   783.72    540.49    59.43
Independent components
  IC1                     92.5      112.5     14.22
  IC2                     43,146    10,040    220.38
  IC3                     806.14    31.24     ns
  IC4                     10,692    13.10     11.53
Principal components
  PC1                     349.91    238.76    38.27
  PC2                     4267      1066      ns
  PC3                     1386      62.98     ns
  PC4                     11,800    43.88     ns
  PC5                     5078      ns        ns

Top: genomic profiles at each sample condition (time point) for the unperturbed genome (unG), and the genome under low (lpG) and high (hpG) perturbation. Middle: same for independent components. Bottom: same for principal components. Not significant test values are indicated by "ns".
As m(cl)_{jk} delivers lower and upper limits l and u, respectively, the outlying genes are those with values out_j ∉ (l, u). We expect that more or less stringent intervals can help characterize the dependence structure of genes, through clusters and components collecting interacting elements. Large outlying expression values might be confirmatory, on one hand, of some underlying strong evidence of differential gene behavior, but they can also be inflated (i.e. false positives). Different intervals can be constructed to bypass the lack of statistical robustness of the mean statistic. One can use other statistics, such as the median, or consider another possibility offered by building the intervals sequentially: such intervals remain centered on the mean, but allow extraction of the genes iteratively (one at a time) and require an interval update each time.12

3.3. Separation and interference

Separation refers to the components that we are able to identify, because they exist, and then extract. The decorrelated or independent information content attached to them leads to optimal or sub-optimal mixture disentanglement. Under an orthogonal or independent component separation, some links appear among gene subsets within each component (i.e. groups of pathway-specific genes, for instance, with strong mutual dependence and no links with genes in other components). Vice versa, a certain degree of overlap is expected from other genes appearing in different components which could not be uniquely assigned, thus representing the genes more widely connected at the biological level (because they affect several biological processes or pathways). We raise the issue of how to handle the interference that might emerge from a non-orthogonal or non-independent decomposition. Two of the important questions that we want to investigate are: to what extent is the feature selection power of the two chosen decomposition techniques influenced by noise? Can we determine which decomposition technique is more/less noise-resistant/sensitive, i.e. more/less robust to perturbations? In terms of methodological approach, we follow a perturbation strategy, and induce Gaussianization of the observed genome by the simple process of making it gradually more Gaussian. Thus, the low-to-high perturbations needed are provided by injecting zero-mean Gaussian noise with variable magnitude (i.e. we choose variance σ = 0.1 or 1) on the original genomic data, G_p = G + ε, where ε ∼ N(0, σ). Clearly enough, the different characterization and statistical properties of PCA and ICA are expected to modulate the impact of noise. In other terms, PCA is a global method targeted to Gaussian data and orthogonal decompositions, while ICA is a local method able to extract statistically independent non-Gaussian sources; thus we expect to observe a diversified reaction and/or sensitivity to noise, which in turn would imply correspondingly different robustness levels of the decomposed system. Table 1 reports the Jarque-Bera normality test, which is used to check the hypothesis that a given sample x comes from a Normal random variable with unknown mean and dispersion.13 In our case, it is computed for the gene expression values under different perturbation degrees, from non-perturbation (unG) to low (lpG) and high (hpG) levels, which also define the intensity of the induced Gaussianization.

12 Both types of intervals have been proposed in [9,11].
13 This test is based on the fact that the skewness and excess kurtosis of a normal distribution equal zero. Therefore, the absolute value of these parameters can be a measure of deviation of the distribution from normality.
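The Gaussianization-and-testing loop just described (with the Jarque-Bera statistic defined formally below, Eq. (4)) can be sketched as follows; the stand-in data matrix, noise levels, and use of SciPy's jarque_bera are illustrative assumptions.

```python
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(2)
G = rng.laplace(size=(2330, 6))            # stand-in for a (genes x conditions) matrix

# Noise magnitudes loosely mirroring the low/high perturbations discussed in the text.
for sigma, label in [(0.0, "unG"), (0.1, "lpG"), (1.0, "hpG")]:
    Gp = G + rng.normal(scale=sigma, size=G.shape) if sigma > 0 else G
    for t in range(Gp.shape[1]):
        stat, pval = jarque_bera(Gp[:, t])
        flag = "ns" if pval > 0.05 else f"{stat:.2f}"
        print(f"{label} t={t + 1}: JB = {flag}")
```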
The sample Jarque-Bera statistic (it should be noted that, as n increases, the JB statistic converges to a chi-square distribution with two degrees of freedom) is

JB(x) = (n/6) [ Skew(x)² + (Kurt(x) − 3)² / 4 ].    (4)
Here n is the sample size, and the p-value is computed using a table of distribution quantiles. Fig. 1 reports the histograms of the components for unperturbed (first column) and perturbed (other columns) expression values, with both IC and PC distributional patterns. The effect of Gaussianization is clear when monitoring the shape changes in the plots from left to right. Gaussianity takes over by effect of noise induction; consequently, the test is highly significant for the unperturbed genome, but gradually loses significance. The computed intervals are more or less stringent, and as a consequence the groups of genes vary in size. In terms of strength of links between the genes in each group, when maximally independent groups are formed under perfect separation of the components, the dependence within the groups is maximized too. If the groups are least dependent, the problem we have is how to interpret the residual degree of interference remaining between components.14 However, as shown in the simulated datasets, a diffusive behavior, with the components more and more separated, is even worse due to the dominant presence of noise that masks the underlying dynamics. We present simulations with Gaussianization results referring to the components. They are given in Table 1 and in three other summaries (Table 2 for the impact of mean-based intervals, Table 3 for the impact of sequential intervals, and Table 4 for relative comparisons in both absolute and percentage values).

3.4. Biological relevance

The choice of fixed or sequential intervals is more or less conservative, thus affecting the overall gene selection quality. Building ad hoc confidence intervals is thus very important with regard to the final set of genes which remains as the most significantly differential in the expression changes; likewise, it is key to check the noise sensitivity of the results. Fig. 2A and B report the change occurring in the number of genes detected by each component of both PCA and ICA under variable Gaussianization. By just looking at the first two components, we see that PC1 is less sensitive to noise, an expected effect due to the high variability allowed within the main identified eigenvector. More than twice the number of genes would instead be selected by IC1 with more noise in the system, thus showing the different sensitivity of PCA and ICA to noise induction, and correspondingly to incremental Gaussianization. Then, given the conditions or time points, the estimated temporal patterns of the mixing matrix are shown,15 computed for every component in both the original (Fig. 2C and D) and the moderately perturbed (Fig. 2E and F) genome. It appears that under growing perturbation the system misleadingly identifies extra components in both cases. It is just hard to find out whether redundancy appears between components by looking at the D-F plot pair, for instance. One would notice that, apart from the change of IC1 due to the sign-switch endogenous to ICA, and a more fluctuating IC4, the interesting case is IC3 without perturbations, and IC3-IC5 under stronger perturbation, which could both be accepted. One question that might be legitimately asked is: what confidence can we put in the extracted features? Some specific work has been proposed (see for instance [28]) that describes how to test ICA stability. This approach might be particularly useful under conditions of redundancy, i.e.
when quite a large number of components could be generated, and thus their aggregation according to some similarity or distance criterion might be worthwhile. Also, one can bootstrap the components and get a calibrated estimate of the inherent features of each identified component (see [9], where no substantial changes were observed). We believe that some level of control of the redundancy in the system, due to the presence of unnecessary (i.e. more than minimal) components, would indeed be useful for understanding the system's dynamics and filtering out noise effects. At a more microscopic level, gene co-expression dynamics should be clearly distinguished from purely random effects. Numerical experiments have been conducted to test what happens when four or five components are considered, as we identify through eigenspectrum analysis four components more clearly than five. This way we allow for marginal redundancy in the system compared to what could already be indicated as the optimal dimensionality. We report descriptive and diagnostic evidence from the estimated components,16 and the numbers refer to the size of the gene groups selected by thresholding in each estimated component. Together with these values, we have also computed z-scores from the same components. z-Scores are used for normalization purposes, and here they are computed across components, in order to add variability gene-wise (note that, by definition, the variability here discussed, among components, is inhibited by the decomposition algorithm).
14 This type of interference refers to signal processing, and therefore to overlapping signal sources at various frequencies.
15 This matrix refers to the mixing mechanism of the components, which is unknown and has to be estimated.
16 In this case, the reported results refer to the Affy MAS 5.0 algorithm. The RMA method is by construction more robust, in statistical terms, and we prefer to be methodologically neutral so as not to constrain the system's robustness by construction.
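As a complement to the fixed intervals of Eq. (3), here is a hedged sketch of the sequential variant mentioned in Sections 3.2 and 3.4 (one gene extracted per iteration, interval recomputed each time); the stopping rule and the multiplier are illustrative assumptions.

```python
import numpy as np

def sequential_selection(component, cl=1.96, max_iter=None):
    """Iteratively extract the most extreme gene outside mean +/- cl*std,
    recomputing the interval on the remaining genes after each extraction."""
    remaining = list(range(len(component)))
    selected = []
    while remaining and (max_iter is None or len(selected) < max_iter):
        values = component[remaining]
        m, s = values.mean(), values.std()
        dev = np.abs(values - m)
        idx = int(np.argmax(dev))
        if dev[idx] <= cl * s:          # no gene left outside the interval
            break
        selected.append(remaining.pop(idx))
    return selected

# S_hat (components x genes) comes from the ICA sketch above.
for j, comp in enumerate(S_hat, start=1):
    print(f"IC{j}: {len(sequential_selection(comp, cl=1.96))} genes (sequential, ~95%)")
```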
Fig. 1. Histograms under Gaussianization. The first four IC (A) and PC (B) components are reported subject to increasing (left-to-right) perturbation degree: original genome (left), perturbed genome with std = 0.3 (middle), and perturbed genome with std = 1 (right).
Table 2
About 95% mean-based intervals

unG         PC1   PC2   PC3   PC4   IC1   IC2   IC3   IC4
PC1          –     31    21    24    59    31    23    14
PC2          31    –     28    17    17    43    46    13
PC3          21    28    –     25    10    29    66    26
PC4          24    17    25    –     13    29    19    75
IC1          59    17    10    13    –     7     15    8
IC2          31    43    30    29    7     –     22    25
IC3          23    46    66    19    15    22    –     20
IC4          14    13    27    75    8     25    20    –
Dimension    121   105   120   99    106   79    124   104

lpG         PC1   PC2   PC3   PC4   IC1   IC2   IC3   IC4
PC1          –     24    15    17    50    35    13    11
PC2          24    –     21    11    11    39    27    11
PC3          15    21    –     9     7     16    38    11
PC4          17    11    9     –     10    8     9     23
IC1          50    11    7     10    –     9     11    5
IC2          35    39    16    8     9     –     10    8
IC3          13    27    38    9     11    10    –     9
IC4          11    11    11    23    5     8     9     –
Dimension    125   106   122   120   114   77    113   107

hpG         PC1   PC2   PC3   PC4   IC1   IC2   IC3   IC4
PC1          –     14    7     11    25    13    11    7
PC2          14    –     3     9     18    7     8     3
PC3          7     3     –     5     5     7     2     3
PC4          11    9     5     –     6     7     5     4
IC1          25    18    5     6     –     4     8     5
IC2          13    7     7     7     4     –     9     3
IC3          11    8     2     5     8     9     –     5
IC4          7     3     3     4     5     3     5     –
Dimension    112   104   114   111   108   102   113   118

Separation and interference among components. Unperturbed genome (unG, top); genome under low perturbation (std = 0.3, lpG); genome under high perturbation (std = 1, hpG).
In our context the z-scores measure the log-probability that a gene expression value computed by each component would be obtained by chance rather than according to the inherent detection power. The z-scores are computed for each of the genes in the k-dimensional genome profile encoded by the sources s_{jk}, j = 1, ..., m. They are thus expressed as follows:

s̃_{jk} = (s_{jk} − mean[s_{jk}]) / std[s_{jk}].    (5)
Because this data transform re-assigns a membership probability to each gene with respect to the estimated component, what is done is basically similar to a fuzzy step. In other terms, the gene expression values are re-shuffled between the estimated components according to certain weights given by the z-scores. Overall, we want to account for differentially expressed genes that overlap the groups because they are selected by different estimated components. We believe that these genes have more chances of being co-regulated by different biological factors. The list summary for the mean-based intervals computed in a 4-component system is as follows:
IC1: 106 selected genes (cl = 95%), 16 selected genes (cl = 99%);
IC2: 79 selected genes (cl = 95%), 26 selected genes (cl = 99%);
IC3: 124 selected genes (cl = 95%), 40 selected genes (cl = 99%);
IC4: 104 selected genes (cl = 95%), 34 selected genes (cl = 99%).
As said, we consider the impact of redundancy on the system by allowing for the presence of one extra component added to the others. This basically means that we include one extra eigenvalue and then test the impact of its corresponding component. Interestingly, from the z-scores computed in the redundant system there seems to be evidence of aggregation among genes, as a result of the induced standardization, unlike what has been observed with just four components. In a 5-component system the results are:
IC1: 112 selected genes (cl = 95%), 46 selected genes (cl = 99%); z-scores: 17 selected genes (cl = 95%);
IC2: 79 selected genes (cl = 95%), 66 selected genes (cl = 99%); z-scores: 24 selected genes (cl = 95%);
IC3: 120 selected genes (cl = 95%), 19 selected genes (cl = 99%); z-scores: 41 selected genes (cl = 95%);
Table 3
About 95% sequential intervals

unG         PC1    PC2    PC3    PC4    IC1    IC2    IC3    IC4
PC1          –      327    329    342    651    306    275    310
PC2          327    –      188    177    225    258    235    186
PC3          329    188    –      214    217    212    348    243
PC4          342    177    214    –      233    177    189    462
IC1          651    225    205    223    –      165    164    199
IC2          306    258    199    167    165    –      177    174
IC3          274    234    314    177    164    177    –      186
IC4          310    186    223    430    199    174    186    –
Dimension    1182   538    545    577    679    457    460    577

lpG         PC1    PC2    PC3    PC4    IC1    IC2    IC3    IC4
PC1          –      187    179    152    212    192    222    128
PC2          187    –      117    102    185    221    102    103
PC3          179    117    –      105    114    112    237    108
PC4          152    102    105    –      70     84     152    70
IC1          212    185    104    70     –      65     85     59
IC2          192    221    107    84     65     –      93     68
IC3          222    102    217    152    85     93     –      65
IC4          128    103    107    70     59     68     65     –
Dimension    759    523    464    448    351    378    456    357

hpG         PC1    PC2    PC3    PC4    IC1    IC2    IC3    IC4
PC1          –      80     62     61     67     158    110    59
PC2          80     –      75     69     69     164    104    120
PC3          62     75     –      63     50     87     50     113
PC4          61     69     63     –      72     76     59     139
IC1          67     68     50     71     –      57     36     38
IC2          158    151    87     74     57     –      37     64
IC3          110    100    50     57     36     37     –      58
IC4          59     116    113    128    38     64     58     –
Dimension    372    375    399    407    283    375    293    397

Separation and interference among components. Unperturbed genome (unG, top); genome under low perturbation (std = 0.3, lpG); genome under high perturbation (std = 1, hpG).
IC4: 119 selected genes (cl = 95%), 29 selected genes (cl = 99%); z-scores: 49 selected genes (cl = 95%);
IC5: 100 selected genes (cl = 95%), 55 selected genes (cl = 99%); z-scores: 35 selected genes (cl = 95%).

The reasons for some genes to interfere among components might be the presence of noise, or artifacts from approximation errors, but also biological reasons (i.e. genes that are strong regulators and thus appear co-expressed at a wider level than a small compartmentalized set of genes). We have identified the frequency with which genes overlap or cross-talk between four, three, or two groups. In a 4-component system, and at the 95% confidence level, there is only one gene in common to all the ICs, while 13 genes appear in three ICs, and 85 genes appear in two ICs. At the 99% confidence level there are no genes detected by all the identified components, only one gene in three ICs, and 9 genes in two ICs. These numbers grow in a 5-component system, where at the 95% confidence level there are three genes in all the components, 20 genes in three ICs, and 142 genes in two ICs. At the 99% confidence level, no genes are found in all the components, three genes appear in three ICs, and 21 genes in two ICs. Overall, more or less stringency in selecting differentially expressed genes can be adopted, and it seems that, in order to check the sort of interference or cross-links occurring between component-related genes, we can allow for a bit more redundancy with 95% intervals, therefore remaining not so conservative as to lose information. With regard to the qualitative information content of the results (i.e. biological validation) obtained from the detected gene groups, note that among the genes reported by the first component at the 95% confidence level there are well identified genes involved with the SOS response system for DNA repair, such as dinA, ybfE, uvrB, sulA, dinI, yebG, recN, recX, recA, dinD, and uvrA. The second component adds one relevant gene for DNA repair, nth, at both confidence levels. The third and fourth components collect other genes, but none specifically involved in SOS. With the 99% confidence level intervals some of these genes are again detected, as they are highly expressed: ybfE, sulA, recN, recX, and recA. Note that the higher confidence set suggests that some genes are highly significant, but this may not be a sign of biological relevance. It might be the case that regulation mechanisms start beyond a certain threshold of expression, not necessarily fixed at a very high positive/negative level, such that pathway-related dynamics occur anyway. Concerning the overlapping genes, there are no SOS-related genes among them (the full annotation table of the genes in this pathway can be found in [9]); they appear instead to be assigned to individual components, and thus to depend specifically on biological influences that are as independent as they can be according to the identified IC.
Table 4
Amount and percentage of genes uniquely detected by each component under variable Gaussianization degree

[mean 95%]   unG    %     lpG    %     hpG    %
IC1           23     22    45     39    56     52
IC2           6      8     15     19    57     56
IC3           18     14    44     39    68     60
IC4           9      9     54     50    88     75
PC1           23     19    29     23    36     32
PC2           18     17    33     31    59     57
PC3           18     15    58     47    86     75
PC4           5      5     66     55    75     67

[mean 99%]   unG    %     lpG    %     hpG    %
IC1           13     82    5      100   16     100
IC2           18     70    20     90    11     100
IC3           35     88    24     92    10     100
IC4           27     80    17     94    9      100
PC1           13     57    17     68    8      80
PC2           18     55    20     65    7      77
PC3           20     58    11     69    6      85
PC4           24     67    10     84    9      90

[seq 95%]    unG    %     lpG    %     hpG    %
IC1           0      0     18     6     67     24
IC2           8      2     10     3     40     11
IC3           7      2     25     6     42     15
IC4           26     11    24     7     41     11
PC1           213    19    163    22    60     17
PC2           29     6     78     15    39     11
PC3           19     4     53     12    129    33
PC4           11     1     115    26    108    27

[seq 99%]    unG    %     lpG    %     hpG    %
IC1           1      6     9      28    7      59
IC2           6      10    12     56    10     50
IC3           14     21    10     77    9      90
IC4           10     13    7      59    6      50
PC1           6      17    10     30    5      42
PC2           8      15    9      23    8      54
PC3           10     16    7      37    6      75
PC4           3      56    13     63    4      86

Mean-based (top two sub-tables) and sequential (bottom two sub-tables) intervals, at the 95% and 99% confidence levels.
4. Entropy-related issues

4.1. Extensivity and non-extensivity

The entropy of the joint distribution of an observed sample can be decomposed into a noisy component (error) and an additional term useful to represent the model or statistical complexity. While the former component characterizes the extensive part of entropy, the latter refers to the predictive information in the data, and also characterizes the non-extensive part of entropy. In particular, the so-called sub-extensive entropy is linked to the mutual information (MI), as explained for instance by [7]. Data sets might often be too complex in their inherent dynamics, due to the presence of a noise cover that reduces the signal quality at the desired resolution or detail in the distribution. In such cases, the difference between extensive and sub-extensive entropy components tends to become negligible, decreasing the prediction power of the data. One might attempt to retrieve information from the data through fragmentation, i.e. by looking at a set of causal states that can be defined through trajectories or histories encoding the system's dynamics in a local fashion. Interesting results in this direction have been recently reported [39], and have established the role of dynamical correlations for non-extensive entropy.17 The latter is thought to represent a feature of the subspace defined by strongly correlated clusters in the phase space of the system.
17 Statistical mechanics interprets energy as an extensive variable, as the total energy of a system is proportional to its size. Entropy too is supposed to be extensive. Generally, the short-range nature of the interactions among the system's elements can justify the assumption, but with long-range interactions a non-extensive characterization of the variables is possible. Tsallis's theory [50] proposes to replace the usual Gibbs entropy with a new, non-extensive Tsallis entropy, and to maximize that, subject to constraints. There exists a whole infinite family of Tsallis entropies, indexed by a real-valued parameter q, which quantifies the degree of departure from extensivity; the usual entropy is recovered for q = 1.
Fig. 2. Top: PC (A) vs. IC (B) genome distribution under incremental Gaussianization (from left to right, σ = 0.1-1). Middle: PC (C) vs. IC (D) mixing matrix temporal patterns for the unperturbed genome. Bottom: same for the perturbed genome (mid, σ = 0.1), with PC (E) vs. IC (F).
Given the ICA decomposition proposed in this work, we can naturally verify whether the identified independent components reveal predictive power within the gene structure they represent. To find evidence for predictability in the estimated independent components, we refer to embedding theory (see [10]). In order to deal with dependent structure in the variables and avoid the limitation of a possibly infinite number of linear transforms yielding decorrelation, ICA adopts several types of contrast functions to separate the sources through optimization steps. In general, the central limit theorem states that a linear mixture of independent signals becomes asymptotically Gaussian; conversely, the pursuit of non-Gaussianity can act as a separating process, and since the Gaussian distribution under fixed variance maximizes the entropy, one can consider that minimizing the entropy leads to source separation.
Correspondingly, we can consider the residual dependence between the recovered sources and measure it through their MI.18 It is common to define the MI in terms of entropies that measure the average amount of information that the estimated S yield. The differential entropy of an m-dimensional random variable S with PDF P(S) is defined as E[S] = −∫ P(S) log P(S) dS. With a two-subset source partition S = [S, S̃], the conditional entropy of S given S̃ thus becomes:

E[S|S̃] = −∫ P(S, S̃) log P(S|S̃) dS dS̃.    (6)

The MI can be obtained as

MI[S; S̃] = E[S] + E[S̃] − E[S, S̃] = E[S] − E[S|S̃].    (7)
For a random vector Z one finds MI(Z) = Σ_j E(z_j) − E(Z). Since the MI represents the difference in information obtained by observing the random vector separately (component-wise) or jointly, it becomes zero if and only if S and S̃ are independent, i.e. P(S, S̃) = P(S)P(S̃). This quantity is non-negative, as more information comes from separately rather than jointly observed variables. When applied to the estimated components, the MI can also be associated with redundancy. If the recovered sources are independent, the MI is zero and the source separation is successful. Therefore, approaching this value is an indication that the data are indeed a mixture of components and that the separation algorithm works well in estimating the sources.19 Furthermore, since the entropy E(Z) remains constant under a transformation given by the product of whitening and orthogonal matrices, MI(Z) varies with the sum of the marginal entropies. We thus start from these quantities, estimate them, and then turn to embedding theory for understanding their relationships with predictability.

4.2. Embedding

The Sample Approximate Entropy (SampEn, [47], and also [41]) is an ad hoc statistic quantifying the degree of unpredictability or randomness in a sequence of observations, and it can be used as a plug-in empirical estimate of the entropy functional and as a process of sampling information about regularity in data. SampEn is the negative natural logarithm of the conditional probability that subseries (epochs) of length m that match pointwise within a tolerance r also match at the next point. The outputs are the sample entropies of the input, for all epoch lengths from 1 to a specified maximum length m. For example, two sequences may be similar (within tolerance d) for a certain number of points, and remain similar also at the next point. Thus, with B^m(d) and A^{m+1}(d) the probabilities that two sequences match for m and m + 1 points, respectively,
the statistic becomes SampEn = −ln [A^{m+1}(d) / B^m(d)]. The basic aspect is that a sequence of values becomes more predictable if patterns of fluctuations are repeated; the SampEn values reflect the chance that similar patterns are observed or not. When many repeated patterns are present, the SampEn estimate is relatively small, and this makes the sequence more predictable. Conversely, a more complex and random sequence delivers a bigger SampEn value. A key parameter is the one indicating the epochs or sub-sequences which are used to test the similarities of patterns, and more specifically their length. The algorithmic procedure resembles the time delay embedding technique; a time lag s determines the sampling rate, while the embedding dimension m governs the length of the filter. Consequently, a sequence of data can be represented in phase space by a set of delayed vectors x_{s,m}(k) = [x_{k−s}, ..., x_{k−ms}]. If the term ms is too small, then the signal variation within the vector is dominated by noise; thus, an adjustment is required. For instance, given s, m can be set to different values so as to find a good compromise between the scale of the hidden dynamics and the inherent complexity of the sequences of interest, in our case the genes encoded in a certain component. The phase space representation should be independent of s for an infinite amount of data, but this is not the case in applications. Thus, one might either estimate both parameter values or consider conditioning. Therefore, we fix the sampling rate to s = 1 and search across the range m = 1, ..., 5 for the value that identifies a near-optimal embedding for a 4- or 5-component system, thus achieving a state of minimal differential entropy (or minimal disorder). As we are in a linear setting by model construction, the embedding localizes the linear structure and lets the changes occur smoothly over the phase space. However, the role of embedding is also useful when switching between the two systems, as we might expect that the modes or eigenvectors vary not so smoothly across the points, particularly due to the degeneracy of some eigenvalues and their noise sensitivity. The extra structure inserted in the approximating model may bring smoothness sufficient to balance some potential instability. By measuring the information with a redundant system, a loss of stability can occur when moving away from the desired minimum.
18 This is sometimes considered a generalized correlation, compared to the classical Pearson linear correlation coefficient.
19 It can be shown that ICA thus finds a separating transform, or matrix under linear mixing, which minimizes the MI between the estimated sources.
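A hedged sketch of the SampEn statistic defined above, written directly from the −ln[A^{m+1}(d)/B^m(d)] formula; the tolerance choice (0.2 times the standard deviation), the use of an estimated component as input, and the simplified template counting (which differs in minor edge conventions from reference implementations) are assumptions for illustration only.

```python
import numpy as np

def sample_entropy(x, m, d):
    """SampEn = -ln(A/B), with B (A) the number of pairs of length-m (length-(m+1))
    templates whose Chebyshev distance stays below the tolerance d (self-matches excluded)."""
    x = np.asarray(x, dtype=float)

    def count_matches(length):
        templates = np.array([x[i:i + length] for i in range(len(x) - length + 1)])
        count = 0
        for i in range(len(templates)):
            dist = np.max(np.abs(templates - templates[i]), axis=1)
            count += np.sum(dist < d) - 1          # exclude the self-match
        return count

    B, A = count_matches(m), count_matches(m + 1)
    return np.inf if A == 0 or B == 0 else -np.log(A / B)

# Illustrative use on one estimated component (S_hat from the ICA sketch above);
# the tolerance follows a common heuristic, 0.2 times the standard deviation of the series.
comp = S_hat[0]
for m in range(1, 6):
    print(m, round(sample_entropy(comp, m, d=0.2 * comp.std()), 3))
```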
Fig. 3. Entropy fluctuations. Estimated entropy level (Y axis) under variable embedding dimension m = 1, ..., 5 (X axis). 4-component system (top), with z-scores on the right; 5-component system (bottom), with z-scores on the right. Fluctuations refer to the components and originate at m = 1 with a computed entropy level whose value (from maximum to minimum) leads to the following sequences. Top left plot: IC1, IC3, IC4, IC2. Top right plot: ICz4, ICz3, ICz1, ICz2. Bottom left plot: IC1, IC3, IC5, IC2, IC4. Bottom right plot: ICz3, ICz1, ICz2, ICz4, ICz5.
4.3. Fluctuations One of the problems is to be able to discriminate between informative and non-informative sources. [18] have shown that the extracted sources can be ordered by increasing value of the differential entropy (or uncertainty) as P E(s1) 6 6 E(sp) 6 6 E(sn). If sp is not Gaussian, then the objective function E ¼ pi¼1 Eðsi Þ under a fixed covariance structure (Cov(S) = I) can be maximized by sources with the smallest entropy (the least uncertain sources set in the mixture). In turn, this maximization corresponds to the one involving the Kullback–Leibler divergence (relative entropy) of the outR iÞ dsi . Overall, put marginal probability density functions p(si) from the Gaussian ones p(gi), i.e. KL½pðsi Þjpðg i ¼ pðsi Þ log pðs pðg i from the given hypotheses and due to source ordering in relation to increasing entropy, we naturally end up with minimum entropy contrasts and the search for non-Gaussian least entropic sources. The plots in Fig. 3 report the estimated entropy, which reflects an order induced by the links established with predictability through embedding. By looking at the SampEn pattern, its reduction might be due to an increase in the degree of regularity. The SampEn pattern is here observed more at a macroscopic level, as its value is computed at each embedding dimension from an ensemble of sequences.20 In particular, to understand whether redundancy drives the systems towards more or less predictability, we explore entropy fluctuations. Then, despite the inevitable presence of noise propagation, we make an attempt to distinguish these selfpropelling dynamics from more system-dependent ones. Without knowledge of the topological structure or the dynamics of the system, this task might appear speculative. A clear separation of internal and external dynamics (as in Ref. [4]) is hard to achieve in our case. The suggestion of letting a random displacement of a number of non-interacting walkers on the network visit the nodes with observable frequencies, then repeat M times the experiment and measure the frequency fluctuations so as to control the effect of the built-in diffusion process, is no longer informative when the number of walkers can change at each replicate. In this case, it is hard to verify whether fluctuations are diffusive or structural. In our example, we measure random fluctuations (through embedding) of sequences of estimated entropies. With various sources of independent information derived from an original signal, as in the case of ICA, the feature gene sets that are extracted may contain redundant information (due to collinear genes, for instance), which reduces the signal-to-noise ratio, 20
20 These refer to systems with a certain number of components and correspond to a range of values likely to detect essential data structure.
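As a hedged numerical illustration of the entropy-based ordering discussed above (a sketch under assumed plug-in estimators, not the procedure of Ref. [18]; the helper names and the toy components are hypothetical), each component can be rescaled to unit variance, its differential entropy estimated from a histogram, and its distance from Gaussianity taken as the gap with respect to a Gaussian of equal variance:

```python
# Rank estimated components by (plug-in) differential entropy; the least entropic,
# most non-Gaussian components come first, mirroring minimum entropy contrasts.
import numpy as np

def hist_entropy(x, bins=50):
    """Plug-in differential entropy (nats) from a normalized histogram."""
    p, edges = np.histogram(x, bins=bins, density=True)
    w = np.diff(edges)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]) * w[mask])

def gaussian_entropy(x):
    """Entropy of a Gaussian with the same variance as x."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

def order_by_entropy(S_hat):
    H, J = [], []
    for s in S_hat.T:
        s = (s - s.mean()) / s.std()       # mimic the fixed-covariance constraint
        h = hist_entropy(s)
        H.append(h)
        J.append(gaussian_entropy(s) - h)  # negentropy-like gap, ~KL from the Gaussian
    H, J = np.array(H), np.array(J)
    return np.argsort(H), H, J             # least entropic first

rng = np.random.default_rng(1)
S_hat = np.column_stack([rng.laplace(size=5000), rng.normal(size=5000)])  # toy components
order, H, J = order_by_entropy(S_hat)
print("entropy:", np.round(H, 3), "negentropy proxy:", np.round(J, 3), "order:", order)
```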
With various sources of independent information derived from an original signal, as in the case of ICA, the feature gene sets that are extracted may contain redundant information (due to collinear genes, for instance), which reduces the signal-to-noise ratio and, in turn, also the global predictability. Thus, low redundancy is preferred when the task is inferring the predictability degree inherent to each individual source. We look at entropy fluctuations to find the phase space representation that identifies a near-optimal embedding endowed with minimal differential entropy, and thus maximal predictability. Entropy is an extensive quantity, as for a time series it is asymptotically proportional to its duration; without an infinite sample, we create surrogates of finite size and monitor their behavior under a variable embedding parameter. The SampEn instrument is ad hoc for investigating this matter. The evidence we have found is a regular downturn pattern for all the components across m, apart from m = 5. Thus, as embedding refers to the length of the filter, after a certain dimensionality threshold a passage from sub-extensive to extensive entropy might be implied. It must be noted that the common patterns among the surrogates may be in part influenced by spurious correlation due to the presence of the same genes in the components, i.e. the genes that interfere. However, this factor combines with the fact that groups of genes tend to move together, due to co-expression or even co-regulation dynamics. IC1 presents a stably decreasing SampEn pattern, thus gaining predictability by approaching the dimensionality endowed with the minimal noise cover. A deviation from the optimum thus has the effect of increasing the differential entropy. Also a slightly redundant system approaches near-optimality with a similar pattern. When instead the sequences are more random, because transformed via z-scores, a decay is again observed, but not as pronounced as before. It was previously mentioned that the z-scores for the system with four or five components address a different degree of variability; we validate that graphical evidence with numerical computation of dispersion and empirical entropy measures. Fig. 3 shows the entropy fluctuations, which represent patterns of entropy levels across the system's dimensionality for each extracted component (for systems with 4 and 5 components) and under two hypotheses of cross-component variability (thus, z-scored components are also shown). We condition on sampling rate s = 1. It seems even harder to detect a system switch of regime by looking at the sample variances. We found the paths (see the summary below) of the SampEn variance by using an empirical estimator for $\mathrm{var}(H) = \mathrm{var}[\log p(x)] = E[(\log p(x))^2] - (E[\log p(x)])^2$ (see [51]). For both the 4-component and the 5-component systems, three variability levels come up; the lowest level involves IC1, the second one IC2 and IC3, and the highest one IC4. Then, more instability occurs up to the fifth component, when the value drops down. This is also confirmed by looking at z-scores (we report results only for m = 1).
4-component system: IC1: 3.48, IC2: 4.38, IC3: 4.27, IC4: 5.15
5-component system: IC1: 3.34, IC2: 4.40, IC3: 4.42, IC4: 4.99, IC5: 4.86
4-component system – z-scores: IC1z: 3.78, IC2z: 3.61, IC3z: 3.76, IC4z: 4.21
5-component system – z-scores: IC1z: 3.39, IC2z: 4.26, IC3z: 4.21, IC4z: 4.38, IC5z: 3.76
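The following is a minimal sketch of the fluctuation experiment, assuming the standard SampEn(m, r) estimator of Richman and Moorman [47] with tolerance r = 0.2 times the series standard deviation; the surrogate series are toy stand-ins for the estimated components, and the same computation would be applied to their z-scored counterparts as a negative control:

```python
# SampEn(m, r) evaluated at embedding dimensions m = 1..5 for two toy component-like
# series; the tolerance and the surrogates are illustrative assumptions.
import numpy as np

def sample_entropy(x, m, r):
    """SampEn = -log(A/B): template matches at length m+1 (A) versus length m (B)."""
    x = np.asarray(x, dtype=float)
    n = len(x)

    def count_matches(dim):
        # use the first n - m templates so lengths m and m + 1 are compared fairly
        templ = np.array([x[i:i + dim] for i in range(n - m)])
        count = 0
        for i in range(len(templ) - 1):
            dist = np.max(np.abs(templ[i + 1:] - templ[i]), axis=1)  # Chebyshev distance
            count += np.sum(dist < r)
        return count

    B, A = count_matches(m), count_matches(m + 1)
    return np.inf if A == 0 or B == 0 else -np.log(A / B)

rng = np.random.default_rng(2)
surrogates = {"IC1-like": np.cumsum(rng.normal(size=400)),   # smoother, more regular
              "IC2-like": rng.normal(size=400)}              # closer to white noise
for name, s in surrogates.items():
    ent = [sample_entropy(s, m, 0.2 * s.std()) for m in range(1, 6)]
    print(name, np.round(ent, 3))
```

A roughly decreasing SampEn profile across m, as reported above for the ICs, signals increasing regularity and hence predictability.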
5. Conclusions

This paper addresses methodological issues based on experimental and simulation data obtained from time-course E. coli genome perturbation. A comparison between two established methods of dimensionality reduction, PCA and ICA, is proposed. The aim is to monitor the organism's SOS pathway in its repair action in response to DNA damage, by looking at the changes in gene expression values under more or less noisy conditions. The main questions are the following: how many genes should be monitored in order to know what effects the perturbations had on the data? The whole genome or a reduced part of it? What genes are most likely affected on biological grounds rather than by extraneous conditions such as noise? Is there a concise and most informative (least noisy) representation of the data through which gene associations can be induced over the space with reduced dimensionality? The gene selection task is performed by looking at the components which can be identified and extracted with either technique. Determining their number requires the typical analysis of the eigenspectrum significance, and although even small eigenvalues might be informative about extra dimensions of the problem, it is usually considered relatively safe to keep the largest elements of the spectrum for representing the most valuable components. Overall, we verify that ICA achieves high-quality performance in gene selection when compared to its natural counterpart, PCA. ICA tends to outperform PCA in clearly non-Gaussian contexts, and captures the latent dynamics more clearly than PCA, especially when these are localized (i.e. localized sub-networks). In order to isolate outlying (i.e. informative) genes, we threshold the gene profile in each estimated component, followed by significance tests built from simple IC-based statistics. At different confidence levels, we want to extract groups of genes with common biological links. Even in its simplest linear form, and with relatively low algorithmic complexity, ICA appears endowed with a great power of approximating the complex gene network dynamics through a very limited number of parameters which are estimated in the mixing matrix. Thus, ICA's capacity of decomposing the high-dimensional and likely non-linear system into just a few modes influencing each other almost independently appears more informative than relying on a simple orthogonal decomposition. These results are tested under Gaussianization or incremental noise cover, i.e. for different noise injection sizes. ICA shows more sensitivity than PCA, due to a more emphasized local structure. Correspondingly, PCA is more robust due to its global Gaussian nature, which makes it noise-resistant but less localized. Last, it has been shown how the intrinsic dimensionality can be approached with quite good reliability even in these ill-posed application problems where few samples and many variables are present.
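As a hedged illustration of the thresholding-based selection step described in this section (the normal-reference interval, the function name and the toy loadings are assumptions, not the paper's exact IC-based statistics), genes can be flagged as outlying when their loading in an estimated component falls outside a mean plus or minus z_alpha standard deviations interval at a chosen confidence level:

```python
# Sketch of confidence-level thresholding on the gene profile of one component.
import numpy as np
from scipy import stats

def select_genes(loadings, gene_ids, alpha=0.05):
    """Return genes whose component loading is extreme under a normal-reference interval."""
    z = stats.norm.ppf(1 - alpha / 2)            # two-sided critical value
    mu, sd = loadings.mean(), loadings.std()
    mask = np.abs(loadings - mu) > z * sd
    return [g for g, keep in zip(gene_ids, mask) if keep]

rng = np.random.default_rng(3)
genes = [f"gene_{i}" for i in range(1000)]       # hypothetical gene labels
ic1 = rng.normal(size=1000)                      # toy loadings of one estimated IC
ic1[:15] += 4.0                                  # a few strongly responding genes
print("95% level:", len(select_genes(ic1, genes, alpha=0.05)),
      "genes; 99% level:", len(select_genes(ic1, genes, alpha=0.01)), "genes")
```

Raising the confidence level from 95% to 99% keeps only the strongest responders, in the spirit of the confidence levels used for the appendix tables.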
Fig. 4. Relationships between Principal, Factor and Independent Component Analysis. PCA performs optimal variable decorrelation under Gaussianity. FA is also optimal under Gaussian assumptions, and maps the interdependencies among variables on a reduced set of factors which can also be heteroskedastic. ICA works under non-Gaussian conditions for optimal and stable identification, and extracts statistically independent variables, when they exist.
Entropy-based fluctuation analysis naturally links to embedding aspects, and it can be adapted to use disaggregated (via components) information about a complex system for exploring its innermost dimensionality and (the control of) its stability properties. ICA dimensionality is monitored by numerical (perturbation) analysis and by entropy-based fluctuation analysis; these strategies allow for a more rigorous definition of the intrinsic dimensionality, by strengthening the cut-off value suggested by the eigenvalue spectrum. These are novel contributions. By fixing the s parameter (i.e. we do not monitor more than one sampling rate, based on the idea that extra refinements would require priors which we do not have), and considering a variable embedding degree through the parameter m extended to a value covering the maximum system dimensionality allowed in our study, we verify what results in terms of SampEn for both a 4- and a 5-component system. These values are then matched to their z-scored IC counterparts, which are meant to work as a form of negative control and from which a more random behavior is expected. Last, our results suggest that it is worth considering classical dynamical systems tools such as embedding theory and entropy fluctuations, whose analysis appears to play a role in disentangling from the identified components both extensive and non-extensive aspects of the estimated entropies. In turn, the interest may thus shift from global dimensionality to local predictability inherent to the system through the fluctuation and independence degrees of its main modes.

Acknowledgement

The author thanks Sardegna Ricerche for support. This work has been in part conducted at Boston University, Biomedical Engineering Department, where the experimental data have been generated. Two referees have substantially contributed to improve the presentation of this work, and to better define its structure.

Appendix

Tables 2 and 3 show a confusion matrix21 for combinations of components with simultaneously identified genes. Fig. 4 shows the relationships between Principal, Factor and Independent Component Analysis. The first row of Table 2 reports 31 genes belonging to PC1 and PC2; then 21 genes are detected by PC1 and PC3; also, 59 genes appear from PC1 and IC1, and so on. The dimension of each component is indicated at the bottom of each sub-table. Note that in Table 4 the various gene groups assigned to each component are those distinctly specific to them, and thus refer to the separation power of PCA and ICA under different Gaussianization degrees. Nevertheless, one may notice a few interesting aspects, here summarized. Table 2 (TOP). With mean-based intervals, the two most important components, PC1 and IC1, share 59 genes, which belong to the 121 genes from PC1 and to the 106 genes from IC1. Then, PC2, PC3 and PC4 have respectively 17, 13 and 10 genes in common with IC1, while IC2, IC3 and IC4 have respectively 31, 23 and 14 genes overlapping with PC1.22 Table 2 (MIDDLE). In general, under some degree of perturbation, the overlaps tend to reduce, thus leading to less interference between the components due to the extra noise separating them. Clearly enough, components start to be forced to separate. Tables 3 and 4. With sequential intervals, the order of magnitude of PC1 is quite different from that of IC1, and this affects the overlapping degree (in particular, PC1 with regard to IC1). This effect is evident from Table 4, where IC1 at the 95% confidence level reports no distinctly detected genes due to the overwhelming abundance of PC1.
Similar effects occur with the other ICs too. Table 4 (TOP and MIDDLE). The number and percentage of genes uniquely detected by each component under variable Gaussianization degree are reported. There is a shift in IC1 from 22% to 39% under low perturbation, while in PC1 the shift is only from 19% to 23%. This behavior is also observed under stronger perturbation.
21 This term is usually adopted to show the agreement or disagreement between two classifications.
22 Note that some of these overlapping genes may be present with a different frequency compared to others.
For the other components, a more homogeneous behavior is observed. When looking at the strong outliers, i.e. those genes with the highest differential expression value at the 99% confidence level, IC1 is still more distinctly separated than PC1, and more noise-sensitive (as at the 95% confidence level).
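For completeness, a minimal sketch (with placeholder gene sets, not the paper's selections) of how a confusion-style overlap matrix such as those of Tables 2 and 3 can be assembled, by counting genes detected simultaneously by pairs of components:

```python
# Overlap counts between PC-based (rows) and IC-based (columns) gene selections.
import numpy as np

def overlap_matrix(sets_a, sets_b):
    """Counts of genes detected by both component i (rows) and component j (columns)."""
    return np.array([[len(sa & sb) for sb in sets_b] for sa in sets_a])

# Placeholder gene sets standing in for the per-component selections.
pc_sets = [set(f"g{i}" for i in range(0, 120)), set(f"g{i}" for i in range(80, 200))]
ic_sets = [set(f"g{i}" for i in range(40, 150)), set(f"g{i}" for i in range(150, 260))]
print(overlap_matrix(pc_sets, ic_sets))
```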
References

[1] Almeida LB. MISEP – linear and nonlinear ICA based on mutual information. J Mach Learn Res 2003;4:1297–318.
[2] Alter O, Brown P, Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. PNAS 2000;97(18):10101–6.
[3] Amari S, Cardoso J. Blind source separation – semiparametric statistical approach. IEEE Trans Sign Proc 1997;45:2692–700.
[4] Argollo de Menezes M, Barabasi AL. Separating internal and external dynamics of complex systems. Phys Rev Lett 2004;93(6):068701–4.
[5] Bach FR, Jordan MI. Kernel independent component analysis. J Mach Learn Res 2002;3:1–48.
[6] Berger JA, Hautaniemi S, Edgren H, Monni O, Mitra SK, Yli-Harja O, et al. Identifying underlying factors in breast cancer using independent component analysis. Proc IEEE Neur Net Sign Proc 2003:81–90.
[7] Bialek W, Nemenman I, Tishby N. Complexity through nonextensivity. Phys A 2001;302:89–99.
[8] Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics 2003;19(2):185–93.
[9] Capobianco E. Mining time-dependent gene features. J Bioinf Computat Biol 2005;3(5):1191–205.
[10] Capobianco E. Statistical embedding in complex biosystems. J Integr Bioinf 2006;3(2):30.
[11] Capobianco E. Model validation for gene selection and regulation maps. Funct Integr Genom 2007;8(2):87–99.
[12] Cardoso J. Source separation using higher order moments. Proc ASSP 1989:2109–12.
[13] Cardoso J. Dependence, correlation and gaussianity in independent component analysis. J Mach Learn Res 2003;4:1177–203.
[14] Cardoso J, Souloumiac A. Blind beamforming for non-Gaussian signals. IEEE Proc F 1993;140(6):771–4.
[15] Chiappetta P, Roubaud MC, Torresani S. Blind source separation and the analysis of microarray data. J Computat Biol 2004;11(6):1090–109.
[16] Cichocki A, Zhang L, Choi S, Amari SI. Nonlinear dynamic ICA using state-space and neural network models. Proc ICA'99 1999:99–104.
[17] Comon P. Independent component analysis – a new concept? Signal Processing 1994;36(3):287–314.
[18] Cruces S, Cichocki A, Amari S. The minimum entropy and cumulant based contrast functions for blind source extraction. In: IWANN 2001 Proceedings, Part II, LNCS 2085. Heidelberg: Springer-Verlag; 2001. p. 786–93.
[19] Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm Pure Appl Math 2004;57:1413–57.
[20] Diaconis P, Friedman JH. Asymptotics of graphical projection pursuit. Ann Stat 1984;12(3):793–815.
[21] Eckart C, Young G. The approximation of one matrix by another of lower rank. Psychometrika 1936;1:211–8.
[22] Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 2007;5:54–66.
[23] Friedman JH. Exploratory projection pursuit. J Am Stat Assoc 1987;82(397):249–66.
[24] Friedman JH, Tukey JW. A projection pursuit algorithm for exploratory data analysis. IEEE Trans Comp 1974;C-23(9):881–90.
[25] Frigyesi A, Veerla S, Lindgren D, Hoglund M. Independent component analysis reveals new and biologically significant structures in micro-array data. BMC Bioinform 2006;7:290.
[26] Gretton A, Herbrich R, Smola A, Bousquet O, Scholkopf B. Kernel methods for measuring independence. J Mach Learn Res 2005;6:2075–129.
[27] Golub GH, Van Loan CF. Matrix computations. 3rd ed. Baltimore: Johns Hopkins University Press; 1986.
[28] Himberg J, Hyvarinen A, Esposito F. Validating the independent components of neuroimaging time-series via clustering and visualization. Neuroimage 2004;22(3):1214–22.
[29] Holter N, Maritan A, Cieplak M, Fedoroff N, Banavar J. Dynamic modeling of gene expression data. PNAS 2001;98(4):1693–8.
[30] Hori G, Inoue M, Nishimura S, Nakahara H. Blind gene classification. An application of a signal separation method. Gen Inform 2001;12:255–6.
[31] Huber PJ. Projection pursuit (with discussion). Ann Stat 1985;13:435–525.
[32] Hyvarinen A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neur Net 1999;10(3):626–34.
[33] Hyvarinen A, Oja E. A fast fixed-point algorithm for independent component analysis. Neur Comput 1997;9(7):1483–92.
[34] Hyvärinen A, Pajunen P. Nonlinear independent component analysis: existence and uniqueness results. Neur Net 1999;12(3):429–39.
[35] Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003;4(2):249–64.
[36] Jolliffe IT. Principal component analysis. New York: Springer; 1986.
[37] Jones MC, Sibson R. What is projection pursuit? (with discussion). J R Stat Soc A 1987;150:1–36.
[38] Karhunen J. Nonlinear ICA. In: Roberts S, Everson R, editors. Independent component analysis: principles and practice. Cambridge University Press; 2001. p. 113–4 [chapter 4].
[39] Kodama T, Elze HT, Aguiar CE, Koide T. Dynamical correlations as origin of nonextensive entropy. Europhys Lett 2005;70(4):439–45.
[40] Kohonen T. Self-organized formation of topologically correct feature maps. Biol Cybern 1982;43:59–69.
[41] Lake DE, Richman JS, Griffin MP, Moorman JR. Sample entropy analysis of neonatal heart rate variability. Am J Physiol 2002;283:R789–97.
[42] Lee S, Batzoglou S. Application of independent component analysis to microarrays. Gen Biol 2003;4:R76-1–21.
[43] Liebermeister W. Linear modes of gene expression determined by independent component analysis. Bioinformatics 2002;18:51–60.
[44] Martoglio AM, Miskin JW, Smith S, MacKay DJC. A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics 2002;18(12):1617–24.
[45] Murtagh F. Ultrametricity in data: identifying and exploiting local and global hierarchical structure; 2006. Available from:
[46] Pajunen P, Hyvarinen A, Karhunen J. Non-linear blind source separation by self-organizing maps. Springer; 1996.
[47] Richman JS, Moorman JR. Physiological time-series analysis using approximate entropy and sample entropy. Am J Physiol 2000;278:H2039–49.
[48] Saidi SA, Holland CM, Kreil DP, MacKay DJC, Charnock-Jones DS, Print CG, et al. Independent component analysis of microarray data in the study of endometrial cancer. Oncogene 2004;23:6677–83.
[49] Schafer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 2005;4: Article 32.
[50] Tsallis C. Possible generalization of Boltzmann–Gibbs statistics. J Stat Phys 1988;52:479–87.
[51] Wyner AJ, Foster D. On the lower limits of entropy estimation. TR, Department of Statistics, Wharton School, University of Pennsylvania; 2003.
[52] Yang H, Amari SI, Cichocki A. Information theoretic approach to blind separation of sources in non-linear mixture. Signal Proc 1998;64:291–300.
[53] Zhang XW, Yap YL, Wei D, Chen F, Danchin A. Molecular diagnosis of human cancer type by gene expression profiles and independent component analysis. Eur J Hum Genet 2005:1–9.