Organization principles of biological networks: An explorative study

Organization principles of biological networks: An explorative study

Accepted Manuscript Title: Organization principles of Biological Networks: an explorative study Author: Havva Kohestani Alessandro Giuliani PII: DOI: ...

500KB Sizes 0 Downloads 27 Views

Accepted Manuscript Title: Organization principles of Biological Networks: an explorative study Author: Havva Kohestani Alessandro Giuliani PII: DOI: Reference:

S0303-2647(16)30004-1 http://dx.doi.org/doi:10.1016/j.biosystems.2016.01.004 BIO 3633

To appear in:

BioSystems

Received date: Revised date: Accepted date:

7-7-2015 12-1-2016 29-1-2016

Please cite this article as: Kohestani, H., Giuliani, A.,Organization principles of Biological Networks: an explorative study, BioSystems (2016), http://dx.doi.org/10.1016/j.biosystems.2016.01.004 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Organization principles of Biological Networks: an explorative study. Havva Kohestania, Alessandro Giulianib,*

ip t

a) Department of Biology, Faculty of Natural Science, University of Tabriz, Tabriz 5166616471, Iran , email: ha [email protected].

Corresponding author: Alessandro Giuliani, email: [email protected]

us



cr

b) Environment and Health Dept., Istituto Superiore di Sanità, Viale Regina Elena 299, , Roma, Italy. email: [email protected]

Abstract

Ac ce p

te

d

M

an

The definition of general topological principles allowing for graph characterization is an important pre-requisite for investigating structure-function relationships in biological networks. Here we approached the problem by means of an explorative, data-driven strategy, building upon a size-balanced data set made of around 200 distinct biological networks from seven functional classesand simulated networks coming from three mathematical graph models. A clear link between topological structure and biological function did emerge in terms of class membership prediction (Average 67% of correct predictions, p<0.0001) with a varying degree of ‘peculiarity’ across classes going from a very low (25%) recognition efficiency for neural and brain networks to the extremely high (80%) peculiarity of aminoacid – aminoacid interaction (AAI) networks. Werecognized four main dimensions (principal components) as main organization principles of biological networks. These components allowed for an efficient description of network architectures and for the identification of ‘not-physiological’ (in this case cancer metabolic networks acting as test set) wiring patterns. We highlighted as well the need of developing new theoretical generative models for biological networks overcoming the limitations of present mathematical graph idealizations. Keywords: Graph Theory, Systems Biology, Topology, Biocomplexity, Principal Component Analysis.

1. Introduction

In 1952 the Dutch electrical engineer Bernard Tellegen (Tellegen, 1952) developed a theorem whosegeneral importance in science has largely been underestimated (Mikulecky, 2001). . Tellegen's theorem gives a simple relation between magnitudes that satisfy Kirchhoff's laws of electrical circuit theory.

Page 1 of 25

cr

ip t

The Tellegen theorem is applicable to a multitude of network systems. The basic assumptions for the systems are the conservation of flow of extensive quantities (Kirchhoff's current law, KCL) and the uniqueness of the potentials at the network nodes (Kirchhoff's voltage law, KVL). Tellegen’s theorem is a conservation principle of both potential and flux. The flux does not need to be an electrical current and the same holds for the potential. A flux of matter (in terms of relative concentrations of reagents and products) traverses a metabolic network and the free energy of the relative reaction is the potential. Similar reasoning holds for protein contact networks where the molecular motion flows between neighboring residues (nodes) and an interaction potential (intramolecular bonds) establishes between them (Di Paola L. and Giuliani A. 2015).

us

As aptly stressed by Mikulecki(2001) the theorem opens the way to a ‘network thermodynamics’ strictly dependent from wiring architecture while largely independent of the constitutive laws governing the single elements.

Ac ce p

te

d

M

an

The most spectacular empirical proof of Tellegen’s theorem was probably in 1956, when Grunthard and Primas (1956) realized that the matrix used in the classical Huckel method for deriving the energy of π molecular orbitals is a simple function of the adjacency matrix of the molecular graph. Renewed interest in the study of complex networks (Albert, Barabasi 2002: Barabasi et al. 2004), of ‘biological systems as networks’boosted out in the beginning of this century. In the present work, inspired by Tellegen theorem, we look for ‘general principles’ of network wiring architecture in the realm of biological networks.We are fully aware of the fact that flux networks (e.g. metabolic networks) and interaction networks (e.g. gene regulation networks) are drastically different as for their operational properties (Huang, 2004). Strictly speaking Tellegen’s theorem only applies to flux networks, notwithstanding that, at a coarsegrain level, we are convinced the recognition of organization principles discriminating different kinds of networks could be ofgeneral interest. The existence of properties shared by biological and non-biological networks was already faced (Girvan and Newman, 2004; Alm and Arkin, 2004; Palla et al. 2005). These studies, at least to our knowledge, adopted a theoretical attitude focusing on features like community structure (Girvan and Newman, 2004; Palla et al. 2005) or scaling (Barabasi and Albert 1999).Other scientists focused on detailed analysis of the changes induced in the network wiring by disease conditions (Ideker and Krogan, 2012). Here we adopted a somewhat different perspective: instead of focusing on ‘average’ properties of network classes, we adopted a bottom-up classification approach somewhat similar to Quantitative Structure-Activity Relationships (QSAR, Cronin, 2010) and epidemiological exploratory studies (Price et al. 2006). Here the focus is no longer on the characterization of an ‘ideal model’ but on the actual discrimination of networks pertaining to different categories. This discrimination is obtained through a bottom-up strategy relying upon the empirical correlations holding between the data set descriptors. This approach allows for the derivation of ‘general principles’ in the form of principal components of the among descriptors correlation matrix (Benigni Giuliani 1984, Preisendorfer, 1988) acting as order parameters of the data set. Principal Component Analysis (PCA) of topological descriptors generated a very clear correlation structure indicating three different layers of network organization and an unexpected ‘network scalar’. These ‘principles’ allowed us to both discriminate different network classes (so pointing to peculiar organization principles for each functional class) and to predict a ‘pathological

Page 2 of 25

network character’ of cancer regulation network. It is worth stressing that while components are each other orthogonal by construction on the entire data set, they have a rich local (within functional classes) correlation structure: any network class is a peculiar relational model in the topological component space.

ip t

2. Material and Methods 2.1 Data set

us

cr

Eleventopological network descriptors and one derivedfeature defined as transmission load were computed for 222networks pertaining to 10classes in turn subdivided into 7 functional classes (the network are classified by their biological function) and 3 theoretical classes solely based on their wiring architecture. We also investigated29 cancer metabolic networks pertaining to different cancer types that were used as ‘test set’ to check the ability of the extracted component space derived from the 222 networks (training set) to get rid of (in a totally data-driven way) the peculiar features of cancer networks.

an

2.1.1 Topological descriptors

M

Number of nodes: Nodes arethe key building blocks of every network, also known as vertices in graph theory. A network is a system of nodes connected by links (or edges) (Strogatz, 2001). The number of nodes is a measure of the size of the network.

te

d

Shortest PathNumber:The shortest path between two nodes is the path connecting them with the minimum number of links to traverse and corresponds to the shortest distance between two specified nodes in a network. The total number of shortest paths in a network is thus dependent on both network size and wiring architecture.

Ac ce p

AverageShortest Path (ASP): or Characteristic Length is the average of shortest path lengths over all pairs of nodes in the network.ASP participates in the emergence of diverse large-scale behaviors of network systems. This is particularly evident in AAI (Amino Acid Interaction Networks), those networks having aminoacid residues as nodes and their contact in 3D space as edges (Zhou et al. 2014, Di Paola et al. 2012). In AAI, ASP minimization is demonstrated to be at the basis of an efficient allosteric behavior of protein molecules (Tasdighian et al. 2013, del Sol et al.2006). Average degree:The degree of a node is the number of nodes connected directly to that node also known as the number ofdirect connections of a node to other nodes via edges. For an undirected network including N nodes and L links,=2L/Ndenotes the average degree of network. Network density:The ratio of actual connections between nodes of a network to the maximum potential connections,is defined as the density of that network. For a network by N nodes potential connections is given by N*(N-1)/2. This measure acquires values in the domain of [0, 1]. Clustering coefficient:This descriptor has to do with the tendency of the nodes to cluster together constructing dense neighborhoods. This index builds upon the frequency of ‘triads’ satisfying the relation ‘if A is directly connected to B and B directly connected to C, even A

Page 3 of 25

is directly connected with C’. For an undirected network, clustering coefficient of a node n is defined as Cn = 2en/ (kn (kn-1)), where kn is the number of all neighbors of node n and en is the number of connected pairs between all neighbors of node (Shannon et al. 2003).We considered the average of clustering coefficient over all the nodes of each network.

an

us

cr

ip t

Network centrality:Centrality indices have different definitions depending upon the metrics adopted to define the ‘importance’ of a node. The simplest way (closeness centrality) to define the centrality of a node i is the inverse of the distances (in terms of edges to be traverse) from node i to all the other nodes of the network. Here we adopted the Eigen Vector Centrality (EVC) that corresponds to the eigenvalue of the principal eigenvector of the adjacency matrix of a network. It is a continuous measure that depends on more than just the node itself, but also on the surrounding neighborhood (as opposed to degree centrality that is purely local). EVC is a measure of network well-connectedness (Canright, Engø-Monsen, 2006; Carreras et al. 2007). EVC is a proxy of the spreading power of a single node, or a collection of nodes, by the most central node or by the average EVC of all the nodes in the collection

M

Network diameter: The longest distance between two nodes corresponds to network diameter in terms of number of edges to be traversed (Shannon et al. 2003). It is a measure of the topological width of the network.

te

d

Network heterogeneity:Network heterogeneity corresponds to node degree variance. Although thisdescriptor can be computed for all network types, heterogeneity is one of the most important featuresof scale-free networks. Scale-free networks havefew highly connected nodes (i.e. ‘hubs’) and a great majority of low-degree nodes. This feature is a result of power law degree distribution in scale-free networks.

Ac ce p

Modularity:Modularity corresponds tothe number of links between residues of the same module minus the expected number linksin the same module if the links of that network were distributed randomly. Modularity has values in the range of [-1/2,1](Newman, 2006). Transmission load:Transmission load is the ratio of average shortest path length to number of nodes in network TL = ASP/N. It builds upon the recognized correlation between number of nodes and ASP so giving a normalized measure of ASP linearly independent of the size of the network in terms of number of nodes. The higher the transmission load, the less optimal is the wiring of network in terms of transmission efficiency.

2.1.2 Network classes

We used Cytoscapev_2.8.1 and Cytoscape v_3.1.1 (Shannon et al. 2003)to perform network parametric analysis and importing several interaction types of networks via online data resources (Csermely et al. 2013). ReactomeFIPlugin (Wu et al., 2010)was our main software application for modularity measurements. We have limited our collection to networks including 100-600 nodes. Figure 1 reports representative elements of each network classes, with different clusters by colors.

Page 4 of 25

ip t cr us an

RLN: Regular Lattice Networks(Wolfram, 2012);All the nodes of RLN have the same degree thus giving rise to periodically ordered pattern.We analyzed 23 regular lattices networks generated by Mathematica v.9.0 software.

-

BAN: Barabasi-Albert Networks(Shannon et al. 2003); Most of the observed natural, biological and social networks follow power law degree distributions P(k) = γ-k with P being the probability for a node to have k edges. The Barabasi-Albert algorithm is the first proposed generative model scale-free networks acting by the rule of preferential attachment: each new vertex has an higher probability to be linked to a vertex that already has a large number of connections (Barabasi and Albert, 1999). We inserted 30 BAN in our trainingset as representatives of mathematical scale-free networks.

Ac ce p

te

d

M

-

-

ERN: Erdos-Renyi Networks(Shannon et al. 2003);ERNs are random graphs in which node degree follows a Poisson distribution. Paul Erdos and AlfredRenyi introduced random graph theory in 1961(Erdos and Renyi 1961). In their model all the nodes have equal probability to set an edge in the process of network growth.30 ERN are included in this work.

-

SON: Social networks(Strogatz, 2001, SNAP, 2014);A social network is a graph whose nodes are a set of social actors (such as individuals or organizations) and the edges social relation (friendship, mutual citations, work ties) between them.17 collaboration networks and internet peer-to-peer networks are our representatives for this functional class.

Page 5 of 25

NBN: Neural and Brain Networks(Xia et al. 2013, Ur Alon Lab, 2014);Due to importance of neural networks and brain structural and functional mapping, investigation of multi-scale properties of brain graphs and imaging has attracted much attention. Several computational tools are freely accessible for detailed analysis of these networks(Van Essen et al. 2013, Halchencko and Hanke, 2012). 14 networks including Human and Macaque brain networksand C.elegans neural circuits constitute our neural data set. Nodes are the functional areas of brain (Human and Macaque) while for C. elegans the nodes are single neurons and links are neural circuits or experimentally proven physiological connections among them.

-

MET: Metabolic Networks(Ma and Zeng, 2003);Metabolic processes play crucial role in supplying cellular energy by the degradation of nutrients (catabolism) and the building up of the materials necessary to life (anabolism). We have selected 23 networks of different human metabolic pathways having metabolites as nodes linked by enzyme catalyzed biochemical reactions between the corresponding nodes.

-

PPI: Protein-Protein Interaction Networks(Stelzl et al. 2005, Vidal et al. 2004); Proteins cooperate to make cell life possible via the generation of ordered metabolic pathways and signal transmission throughout the cell. This cooperation happens through the mutual interactions of protein molecules.We selected 21 protein interaction networks of different organisms like Arabidopsis thaliana and Drosophila melanogaster.PPI nodes correspond to proteins, links to (estimated) physical interactions between them.

Ac ce p

te

d

M

an

us

cr

ip t

-

-

AAI: Aminoacid – Aminoacid Interaction Networks(Zhou et al. 2014, Di Paola et al. 2012).Protein molecules physiological role is deeply dependent upon amino acids mutual interactions generated by folding process and giving rise to the protein native 3D structure. We took into consideration 22 AAI networks. Amino acid residues are the nodes of this class of networks while linkscorrespond to the intermolecular noncovalent contacts of residues in the protein 3D structure. A link is set when two i and j aminoacid residues have a mutual distance ranging from 4 to 8 Angstrom in 3D structure (Di Paola et al. 2012). These networks are also known as Protein Contact Networks (PCN).

-

GPI: Gene Protein Interaction networks(Wu et al. 2010);The nodes of the networks pertaining to this class are both genes and proteins. At odds with PPI, here the links do not correspond to putative physical interactions but to functional interactions of different nature (co-expression, physical interactions, involvement in the same pathway..) derived from literature data and curated data bases (Wu et al.

Page 6 of 25

2010). We selected 21 GPI networks related to different biological processes like DNA replication and repair, immune system and so forth.

REG: Gene Regulatory Networks(Verfaille et al. 2014,Wu et al. 2010);Gene regulatory networks have genes as nodes. A link is established if gene i is a target of the action of gene j by means of different functional relations (co-expression, inhibition..). Here we deal with the so called ‘targetomes’: each candidate gene is linked with other genes whose expression could be affected by the candidate gene. This imposes a peculiar structure to this kind of networks characterized by high degree central nodes (candidate genes) with links to target genes. 21 networks are included in this class.

us

cr

ip t

-

Ac ce p

- 2.1.3 Statistical Methods

te

d

M

an

- CAN: Cancer metabolic networks(Agren et al. 2012);Different types of cancer metabolic networks are selected. CAN class did not participate to the generation of principal components. (training set). They werea posterioriadded as a ‘test set’ in order to check the ability of the model to detect the specificity (if any) of these pathological networks.Analysis of structure and behavioral properties of this class of networks could be inspiring in network targeting medical approaches and drug design. In addition, cancer networks provide an opportunity for better perception about architecture and dynamic of pathological networks. 29 networks from different types of cancer like breast, cervical, ovarian, skin, endometrial and liver are selected. Nodes and links follow the same identification of MET, as nodes are metabolites and links reactions between them.

The data set of the 222 networks (training set) defined by the above described topological descriptors was analyzed by means of Principal Component Analysis (PCA) that in turn gave rise to a four component solution explaining 89% of total variance (see Results and Discussion section). The extracted components correspond to the eigenvectors of the between descriptors correlation matrix and are each other orthogonal by construction (Preisendorfer , 1988). The mutual independence of components holds for the global data set, so each component is an ‘independent wiring principle’. The global orthogonality of components goes hand-in-hand with their interdependence inside each class (within class correlation): each network class corresponds to a specific correlation pattern in the component space (Supplementary File).The loading (Pearson correlation coefficients between original descriptors and components) pattern allows to assign a meaning to each component. Linear Discriminant Analysis (LDA) allowed the recognition of different network classes in the space of components. The degree of recognition (peculiarity) of each

Page 7 of 25

ip t

single class corresponds to the number of correctly classified networks for each category.

3. Results and Discussion

an

us

cr

In order to get a general view of the mutual relations linking the above defined topological descriptors on the control data set, we computed their pairwise Pearson correlation coefficients. The considered networks have widely different wiring architectures, thus theirglobal correlation structure only catchesthe general statistical resemblance of topological descriptors. Nothwithstanding that, some emerging pairwise correlations deserve a brief comment.

M

1. General ‘size effects’: Increasing size in terms of number of nodes is correlated with an increase of short paths (r (Nodes, Shortest Path Number) = 0.87), analogously Network Diameter and Average Shortest Path Length (ASP) are strongly correlated (r = 0.92). It is worth noting Shortest Path Number scales with size as expressed by number of nodes while ASP scales with the spatial extension of networks in terms of number of edges (indirectly measured by Network Diameter).

Ac ce p

te

d

2. General ‘connectivity effects’: The amount of connectivity is measured by both Average Degree and Network Density (r (Network Density, Average Degree) = 0.92). The strong correlation (r=0.80) between Network Heterogeneity and Network Centrality (r=0.804) is of interest: the possibility of defining a different role for the nodes in terms of widely different node degree (Network Heterogeneity) implies the possibility to define network central nodes as opposed to periphery ones and could point to different functional roles for the different classes.Modularity is strongly correlated to both Diameter (r=0.80) and Average Shortest Path Length (r=0.82), this puts in evidence the basic network wiring optimization conundrum: the generation of communities, crucial for system robustness (Csete and Doyle 2002), provokesan increase in average shortest path length and consequently, at equal number of edges, a decrease in communication efficiency. The above relations are the image in light of a ‘maximal size’ after which a network splits into modules. The case of AAIs of protein molecules is a very clear paradigm of this feature (Zbilut et al. 2007).

The correlation matrix was the basis for the generation of principal component space: principal components correspond to the eigenvectors of the correlation matrix extracted in decreasing order of their eigenvalues(Preisendorfer, 1988). The components are the image in light of the different architectural principles, involving the different topological descriptors, shaping network wiring. Given components are each other independent by construction, we can safely consider the profile of component scores of each network as a sort of topological recipe in which independent ingredients (components) must be mixed up in different

Page 8 of 25

proportions to obtain the desired architecture. The different ‘recipes’ give rise to different local (within class) correlations between factors (Supplementary File).

us

cr

ip t

A four components solution explained the 89% of the total variance of the topological descriptor space and was considered as bona fide signal space (Preisendorfer , 1988) in which the 222 networks were projected. The loadings pattern (loadings correspond to the correlation coefficients between initial variables and principal components)is reported in Table 1 (bolded values correspond to the variables most important for component interpretation):

Factor2

Factor3

Factor4

Number of nodes Network density -0.61677

0.24268-0.18700 0.70634

0.05969

0.91973

0.15644

Clustering coefficient -0.407760.68074

0.048660.38984

Network diameter 0.85842

-0.026680.34797

0.22648 -0.53712

-0.282620.42726

Shortest path number0.24709

-0.16072

0.87425

d

Network centrality -0.61482

Network heterogeneity-0.38102

-0.60265

-0.71610

Transmission load0.451510.18803

0.13785

-0.04558

0.31348

0.67455

0.23264

0.13196

-0.141320.49613

0.87896

0.24021

-0.77700

0.06390

Ac ce p

Modularity

0.12253

te

Average shortest path0.90837 Average degree

0.10088

M

Factor1

an

Table 1

-0.06239

0.13179

The percentage of explained variance was: 37%, 22%, 21% and 8% for Factor1-Factor4 respectively.The interpretation of the extracted factors relies upon their loading profile, i.e. the correlation pattern between the components (hidden order parameters of networks organization) and topological descriptors. Factor1 scales with mesoscopic descriptors. The component describes the alternative between highly modular, wide and consequently high characteristic length (the descriptors positively correlated with Factor1) networks, and other architectures endowed with a high number of edges (Average degree, Network Density) at the negative end of Factor1. We can safely define Factor 1 as ‘Modularity’. Factor1 is a pure, size independent, ‘architectural’ componenthighlighting the two basic strategies for the construction of a connected network system:

Page 9 of 25

a) Reducing the wiring expenditures (each link implies energy cost) by means of modularity at the expense of a less efficient communication (high values of average shortest path, modularity and net diameter).

ip t

b) Ameliorating the efficiency of communication by increasing the number of links (high values of density and average degree).

Architectures more oriented toward a) strategies have high values of Factor1, while architectures with a more marked choice toward b) show low values of the first component.

us

cr

It is worth noting Scheffer et al. (2012) recognized these two wiring strategies as corresponding to correspond to two very different behavior in terms of continuous stress response. Type a) network have a continuous decay along the stress coordinate, while type b) networks present sharp transitions (Scheffer et al. 2012).

M

an

Factor2 spans the space from very heterogeneous graphs with nodes with different connectivity values (low values of Factor2, negative loading of Network Heterogeneity) to highly compact, highly connected and homogeneous systems (high loading values of clustering coefficient and average degree). We can define Factor2 as ‘Homogeneity’.

te

d

Factor3, can be defined as‘Network Size’. The negative value of Transmission Load (corresponding to normalized Characteristic Length) loading with Factor 3 points to an aspecific ‘optimization’ principle: characteristic length increases in a sub-linear way with size and its derivative with respect to number of nodes (roughly correspondent to transmission load) thus scales negatively with size.

Ac ce p

Having roughly characterized Factor1 to 3 as ‘Modularity’, ‘Homogeneity’ and ‘Size’ we need a specific insight as for describing Factor4. Factor 4 explains a minor portion of among networks variability, it has all positive loadings so it does not point to any ‘cline’ of alternative network architectures like the above described three major components. Factor4 looks like a ‘scalar’ network number corresponding to a ‘range of existence’ of proper designed networks. We will go more in depth into this point in the analysis of partial correlation coefficients among descriptors. Principal components are each other independent by construction, so each original descriptorcan be expressed in terms of a weighted sum of component scores. This implies any direct Pearson correlation coefficient between any two topological indexes can be considered as a weighted sum of mutually independent contributions to correlation from the different components. Partial correlation coefficient corresponds to the correlation between two variables X and Y depurated by the effect of one or more variables Z that eventually mediate their relation (De la Fuente et al. 2004). Thus computing the correlation between selected topological descriptors, partialled out of the effect of Factor1, Factor2, Factor3 and comparing it with simple correlation coefficient will give us an idea of the meaning of Factor4 in terms of ‘how the relation between original variables appears along Factor4 dimension’. If the interpretation of Factor4 as a scalar invariant shared by all networks is correct, we

Page 10 of 25

ip t

expect that topological variables having a negative simple correlation coefficient will turn to a positive correlation when partialled out of the influence of the major factors (Factor1 to Factor3). This positive correlation is the image in light of the character of ‘common scalar invariant’ of Factor 4.This was actually the case as easily understandable by the below results reporting the change of sign from negative to positive of the correlation of net diameter, modularity, asp with net centrality when partialled out of the Factor1-Factor3 effects and consequently restricted to Factor4 dimension :

us

Simple Pearson r (modularity, net centrality) = -0.578 Partial r (modularity, net centrality | Factor1-Factor3) = 0.360

cr

Simple Pearson r (net diameter, net centrality) = -0.486 Partial r (net diameter, net centrality | Factor1-Factor3) = 0.675

an

Simple Pearson r (average shortest path, net centrality) = -0.492 Partial r (average short path, net centrality | Factor1-Factor3) = 0.598

te

d

M

All in all the network phase space has four independent dimensions: two ‘shape’ factors relative to distinct scales of observation (mesoscopic for Factor1, microscopic for Factor2), a size (Factor3) component and a scale-independent dimension we roughly define as ‘network number’ (Factor4). These independent wiring principles are differently correlated inside each class (see Supplementary File). We checked the robustness of the PCA solution by the analysis of 119 independent samples (individual networks) coming from the same functional classes. The components extracted by this new sample were near to unity correlated with the original data set components.

Ac ce p

Having identified the general architectural principles of network construction in terms of principal components of topological descriptors, we tested the ability of such principles to predict the functional class of the different architectures. It is worth reminding that calling a graph a ‘metabolic network’ and another graph a ‘social network’ does not necessarily implies they correspond to two different wiring architectures. If we are able to discriminate, on the sole basis of the topological information, the different functional classes, this indicates the presence of a relevant structure/function link between topological information and network function. Our main focus is not to get a ‘perfect’ identification of a network functional class by means of its peculiar topological descriptors profile, but to check for the occupation of different areas of the topological space of the different functional classes. This is a much weaker constraint that only implies a statistically significant discrimination among groups. In our case this result is made more cogent by the insertion in the data base of three ‘abstract’ architectures corresponding to the network models routinely used for network classification, namely regular lattices (RLN), Erdos-Renyi random graphs (ERN) and Barabasi-Albert small world networks (BAN).Being abstract models, these architectures do not correspond to any specific function,their role being the reference templates to describe actual network architectures. If these three models are reliable idealizations of actual biological networks,

Page 11 of 25

we do expect they work as ‘attractors’ in the topological component space with each different real network class being markedly more similar to only one of these models.

ip t

Before going in depth into the general ability of the component space of discriminating different classes of network, we will check the relative ability of the original topological descriptors to differentiate the network functional categories in statistical terms. This can be done by computing the F value (F = Between-Class Variability / Within-Class Variability) for the different descriptors. In Fig.2, F value is reported together with the loadings on Factor4.

us

cr

F-values scale pretty well with Factor4 loadings (Pearson r=0.65, p<0.03), keeping in mind that components were computed with no explicit reference to network functional categories, this result reinforces the interpretation of Factor4 as a ‘network invariant’ whose (relatively minor) changes have important effect in determining network architecture. The link with ‘functional separation’ ability (expressed by F-value) is a first proof of the presence of a general structure-function relation present in our data set.

Ac ce p

te

d

M

an

Fig.2.

Page 12 of 25

ip t cr us an M d te Ac ce p The F-values (Y-axis of Fig.2)have a neat bimodality: comparatively low discrimination for size-related terms (low-left corner of the figure), high discrimination ability for mesoscopic

Page 13 of 25

structure descriptors (high-right corner). It is worth noting the singular position of modularity. This descriptor has a discrimination ability much higher than what expected by its loading on Factor4. This implies the amount of modularity is a very important parameter for determining functionality. The F values relative to the extracted components are:

ip t

Factor1 = 63.52; Factor2 = 22.13; Factor3 = 5.26; Factor4 = 40.85.

us

cr

The very low discrimination power of Factor3 (network size) is a consequence of our inclusion criteria that equated size range across different categories. The huge statistical significance of Factor1 (Modularity) tells us that the different functional categories have marked differences as for their ‘wiring optimization strategy’ (and indirectly tells us that Factor1 is not only the most informative parameter for single graphs discrimination but even for between functional categories differences). Microscopic structuring (Factor2, Homogeneity) is less important than mesoscopic organization for functional classification.

an

We already commented on Factor4, it is worth noting this factor accounts for only 8% of total variance (Factor1 accounts for 37%) thus pointing to the extreme sensitivity of what we call ‘network number’ to functional categories discrimination.

d

M

In the following we will further check the hypothesis of the existence of a meaningful topology/function relationabandoning a pure population statistics perspective (how the categories are different on average) and assuming a pattern recognition focus (how we can predict the functional class of each single graph). This approach will allow us to test the suitability of mathematical graph architectures as ideal models of biological networks.

Ac ce p

te

Linear Discriminant Analysis (LDA) algorithm is sub-optimal in terms of prediction power when the target classes are greater than two (Lachenbruch, Goldstein, 1979). Notwithstanding that, LDA allows to get a reliable and easy to interpret measure of the general discriminant power of the studied space, even with a greater number of classes (here n = 10). The power of LDA is in turn greatly enhanced the algorithm is applied to mutually orthogonal variables as in the case for factor scores. The discriminant power corresponds to the proportion of ‘correct hits’ (graphs correctly assigned to their functional class). The proportion of ‘correct hits’ for each target class is a straightforward measure of the topological peculiarity of each class, i.e. of the extent each single functional class is mirrored by a peculiar wiring architecture, and thus less prone to be confused with graphs pertaining to other categories. LDA gave rise to a global proportion of correct hits on the entire data set equal to 0.67, this value is statistically significantly (p<0.0001, Chi-Square test) and much higher than the expected 0.10 under the null hypothesis of a purely random assignment. This implies the existence of a meaningful relation between network function and topology confirming, at the single graph level, the hypothesis of a structure/function relation we already proved at the population level (F-values). On the contrary,the hypothesis of the suitability of the basic graph mathematical models to interpret real biological networks was not consistent with our experimental results. Figure 3reports the ‘Peculiarity’ of each network class expressed as the proportion of correct class prediction by LDA It is immediate to note abstract graph are the most ‘peculiar’, i.e.

Page 14 of 25

paradoxically, the least suited for acting as ‘paradigmatic classes’ of real biological systems. If the ideal models should efficiently act as paradigms for the real networks their peculiarity should be minimal: they should be interspersed among real networks.

M

an

us

cr

ip t

Figure 3.

d

The percentage of correct hits was 93%, 90% and 78% for mathematical graphs BAN, ERN and RLN respectively. This is consistent with the fact these networks are artificial constructs generated by explicitly different generation rules.

Ac ce p

te

It is worth noting the extremely high peculiarity of AAI (77.5%). AAI are the only biological network class describing a physical object (protein molecules 3D structures) and are shaped by very stringent physical constraints strictly related to protein physiology (Di Paola, Giuliani 2015, Del Sol et al. 2006). This generates a very finely controlled ‘optimal wiring’, giving rise to a strictly invariant network cartography (Guimerà, Amaral 2005, Cumbo et al. 2014), in which the different residues occupy largely invariant positions in terms of ‘betweenmodules’ and ‘within module’ contacts (Di Paola et al. 2012). The stringency of these physical constraints make AAI almost as peculiar as the mathematical constructs. The fact component space is a proper Euclidean space (the components are mutually orthogonal) allows to get an immediate perception of the mutual similarities between classes in terms of distances. The between networks squared distance matrix (computed between the centers of mass of the different categories)is reported in Table 2. The first row of the matrix reports the strong similarities between AAI-NBN and AAI-SON. BAN group has the minimum distance to PPI and ERN groups while MET networks show the maximumsimilarity to NBN networks. Another closely similar network pairis RLN-SON. Similarity between AAI and NBN is suggestive of their common feature as sensing and responding processing system for producing proper responses to environmental and internal stimulations. It is worth noting that architecture more similar to BAN is another mathematical model (ERN). This result is

Page 15 of 25

ip t

indicative of the lack of ability of theoretical models to get rid of the multiplicity of actual biological networks.

Table 2 BAN

ERN

GPI

MET

PPI

REG

RLN

3.3167

11.1418 44.2055 5.0632

3.5432

6.0931

3.7929

32.0585 4.8779

4.3643

AAI

0

BAN

13.3751

ERN

28.2507 3.6242

GPI

19.6220 23.1507 33.1759

MET

5.9346

4.1662

14.5487 18.8470

NBN

3.3167

6.0931

17.3676 19.3247 0.64055

0

PPI

11.1418 3.7929

11.8207 11.3745 2.3485

4.5147

REG

44.2055 32.0585 43.9761 19.1423 28.9409 35.6890 17.2518

RLN

5.0632

4.8779

13.3332 27.7315 2.8082

1.6837

8.2819

47.9039

SON

3.5432

4.3643

14.5439 24.6694 1.8960

1.3328

6.4131

40.4147 0.7706

3.6242

28.9409 2.8082

1.8960

4.5147

35.6890 1.6837

1.3328

0

17.2518 8.2819

6.4131

33.1759 14.5487 17.3676 11.8207 43.9761 13.3332 14.5439 18.8470 19.3247 11.3745 19.1423 27.7315 24.6694 0

0.64055 2.3485

an

0

M

0

23.1507 4.1662

0

47.9039 40.4147 0

0.7706 0

d

0

SON

us

13.3751 28.2507 19.6220 5.9346

NBN

cr

Class ID AAI

te

Table 2. Squared distance between different network classes. Most similar classes in terms of topology are bolded.

Ac ce p

Figure 4reports a snapshot of the networks in the two major factors space.

Page 16 of 25

ip t cr us an M d te Ac ce p The top-panel of Figure 4 reports the entire data set, where it is evident the ‘extreme’ character of REG and GPI classes that display an opposite behavior in terms of intra-class relation between mesoscopic (Factor1) and microscopic (Factor2) wiring (see Supplementary File). The bottom panel zooms on a subset of the component space ranging between -2 and

Page 17 of 25

+2 (the components are expressed in SD units), where it is evident the presence of a GPI and AAI peculiar zones.

cr

ip t

The first two components space allows for an immediate graphical appreciation of both the high peculiarity and low ‘modeling’ ability of ideal models. Fig.5 reports the positions of the three theoretical models in the space. The comparison with Fig.4 tells us that the theoretical models occupy a too narrow range of the topological space for being of use as ‘ideal templates’ (they accommodate almost perfectly in the -1; +1 range). Moreover the mathematical network classes are perfectly discriminated among them, consistently with their different generative models so indirectly confirming the soundness of the bottom-up representation we adopted.

Ac ce p

te

d

M

an

us

Figure 5

Having stated the principal features of the network data set, we move now to the last section of the work: the classification of 29 cancer metabolic networks (CAN, test set) by means of the component space generated by the 222 networks data set (training set). Table 3 reports the average values of Factor1-Factor4 for different classes: it is immediate to note the outlier position of CAN with respect to all the other ‘physiological’ classes. With the only exception of Factor2, CAN has extreme values for all the factors, Looking at the Cytoscape sketches reported in Figure 1, it is worth noting how CAN graphs display a unique feature: stellar sub-graphs with a central node linking a lot of

Page 18 of 25

different nodes interrupting the usual metabolic network architecture. This singularity is caught by the topological descriptors that posit CAN outside the realm of ‘normal’ networks space. Table 3

No. Networks

Factor 1

Factor 2

Factor 3

Factor 4

ip t

Class ID

22 0.9194441 0.6777081 0.3660125 1.0053955 30 0.0669507 -0.3869288 0.4019512 -0.5551713 29 -0.6234328 -1.9410122 2.3303402 2.9788036 ERN 30 -0.4243968 -0.1707787 0.6830369 -1.5381717 GPI 21 -1.3636680 1.5044612 0.0226561 0.7259251 MET 23 0.4965320 -0.1756013 -0.6593203 0.0733107 NBN 14 0.6852072 0.2412535 -0.6073730 0.1609290 PPI 21 -0.2835969 -0.1718923 -0.4194178 -0.0016648 REG 21 -1.5507545 -1.3858293 -0.5101824 1.0736840 RLN 23 0.9394426 0.2418140 -0.0919527 -0.2135738 SON 17 0.8843354 -0.1153132 0.2486121 0.2292824 Table 3. Mean values of factors for different class IDs including cancer networks.

d

M

an

us

cr

AAI BAN CAN

Fig.6

Ac ce p

te

From Table 3 it is evident that the singular properties of CAN networks is particularlyevident as for Factor1 and Factor4, Figure 6 reports the space spanned by these two components:

Page 19 of 25

ip t cr us an M d

te

As evident from Figure 6, CAN group escapes from the control data set component space along a line along which mesoscopic organization (Factor1) and ‘network number’ (Factor4) are positively correlated (Pearson r = 0.959, p < 0.0001).

Ac ce p

Net Diameter and ASP are the two topological descriptors most diverse in CAN with respect to the Control set. Group

ASP (mean) ASP (Std.Dev.) Net Diameter (mean) Net Diameter (Std. dev.)

Control set

3.83

1.66

8.09

4.61

CAN set

10.14

7.39

30.44

18.54

This change goes hand in hand with a dramatic change in connectivity (Average Degree) that for the Control Set has an average of 12.47 and for CAN set has an average of 2.63. The above data allow us to rationalize the ‘topological nature’ of cancer pathology. The presence of ‘stellar subgraphs’ dramatically decreases the network connectivity, this has an effect on network signal transmission increasing network characteristic length (asp) and diameter. The above anomalies spread over the entire topological structure making uncorrelated topological properties on the control set to correlate in the CAN graphs. On a more global perspective while the first factor of the control set accounts for the 37% of variance, the first factor of the CAN data set accounts for 70% of variance. In other words,

Page 20 of 25

the entire cancer metabolic networks topological profile is ‘slaved’ to the singular mesoscopic organization. 4. Conclusions.

ip t

By a purely data driven statistical approach we demonstrated the presence of two sizeindependent main organization principles of biological networks: a) the high modularity – low contact density / low modularity – high contact density axis and b) the High Heterogeneity – low local clustering / Low Heterogeneity – high local clustering axes.

an

us

cr

The different biological networks are scattered along these main topological dimensions that in turn are related to the different physiological functions of the corresponding systems. In addition to these organizing principles we singled out a network invariant (Factor4) identifying an ‘existence space’ for physiological networks in terms of range of ‘allowed values’ for the different descriptors. Cancer metabolic networks were demonstrated to have a very eccentric position along this factor so allowing us to sketch the hypothesis of a sort of ‘network number’ discriminating pathological and physiological networks. It is worth noting that CAN networks could be severely biased by the focusing on ‘hot-spots’, i.e. crucial metabolites that become largely artefactual hubs of the corresponding networks.

d

M

In the present work we cannot go in depth into the reasons of this result, what is importantis the demonstration of the feasibility of structure/function studies in biological networks. In this respect, the exploration of functional properties as systems resilience can largely benefit by a topological approach as reported in (Scheffer et al. 2012). It is worth noting the authors in Fig.1 of their work report our emergent ‘wiring principles’ (Homogeneity-Heterogeneity, Connectivity-Modularity) as the main drivers of the response of network systems to external changes.

Ac ce p

te

Ideal graph architectures appear not to be particularly useful for describing biological networks that ask for different generative and interpretative models with respect to those coming from mathematical graph theory. As aptly remarked in (Csermely et al. 2013) the graph ‘mathematical representation’ of biological networks is a gross over-simplification. Network nodes are not ‘exact points’ as in graph theory but have in turn a rich structure for themselves. This is why we must keep in mind the essence of Tellegen theorem we quoted in the Introduction: the behavior of any system depends on BOTH ‘wiring principles’ (graph based regularities independent of the actual nature of nodes and edges) AND from the constitutive laws governing the single nodes. The two layers are largely independent so we can look for ‘graph structural principles’ common to networks whose nodes are totally heterogeneous like metabolic or social networks. This was the inspiring principle of this ‘coarse grain’ statistical exploration that allowed to highlight evident differences in wiring architectures of different functional classes of biological networks asking for a more refined analysisand for a theoretical effort in finding generative models different from the classical mathematical graphs.

Page 21 of 25

References:

ip t

5.

cr

Agren R, Bordel S, Mardinoglu A, Pornputtapong N, Nookaew I, Nielsen J. (2012) Reconstruction of Genome-Scale Active Metabolic Networks for 69 Human Cell Types and 16 Cancer Types Using INIT.PLoS Comput Biol. 8(5), e1002518.

us

Albert, R., and Barabási, A. L. (2002). Statistical mechanics of complex networks. Rev. of Mod. Phys. , 74(1), 2-47. Alm, E., and Arkin, A. P. (2003). Biological Networks. Current Opinion in Struct. Biol.13(2), 193-202.

an

Barabási, A.L, Albert R. (1999) Emergence of scaling in random networks. Science 286, 509-512.

M

Barabási, A.L., Oltvai, Z.N. (2004) Network biology: understanding the cell's functional organization. Nat Rev Genet 5 101-113.

d

Benigni, R., Giuliani, A. (1994). Quantitative modeling and biology: the multivariate approach. Am. Journ. of Physiol.-Regulatory, Integrative and Comparative Physiology, 266(5), R1697-R1704.

te

Canright, G. S., and Engø-Monsen, K. (2006). Spreading on networks: a topographic view. Complexus, 3(1-3), 131-146.

Ac ce p

Carreras, I., Miorandi, D., Canright, G. S., Engø-Monsen, K. (2007). Eigenvector centrality in highly partitioned mobile networks: Principles and applications. In Advances in biologically inspired information systems (pp. 123-145). Springer Berlin Heidelberg. Cronin, M. T. (2010). Quantitative structure–activity relationships (QSARs)–applications and methodology. In Recent Advances in QSAR Studies (pp. 3-11). Springer Netherlands.). Csermely P., Korcsmáros T., Huba J.M. , Kiss G.L.,Nussinov R. (2013) Structure and dynamics of molecular networks: A novel paradigm of drug discovery A comprehensive review..Pharmacology & Therapeutics 138, 333–408 Csete, M. E., Doyle, J. C. (2002). Reverse engineering of biological complexity. Science, 295(5560), 1664-1669. Cumbo, F., Paci P.,Santoni D., Di Paola L., Giuliani A. (2014) GIANT: A Cytoscape Plug-in for Modular Networks. PLoS ONE, 9 (10): e105001. De La Fuente, A., Bing, N., Hoeschele, I., Mendes, P. (2004). Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics, 20(18), 3565-3574.

Page 22 of 25

Del Sol, A., Fujihashi, H., Amoros, D., Nussinov, R. (2006). Residues crucial for maintaining short paths in network communication mediate signaling in proteins. Molecular Systems Biology, 2(1). Di Paola, L., De Ruvo, M., Paci, P., Santoni, D., Giuliani, A. (2012). Protein contact networks: an emerging paradigm in chemistry. Chem. Rev,.113(3), 1598-1613.

ip t

Di Paola L. and Giuliani A. (2015) Protein contact network topology: a natural language for allostery. Current Opinion in Structural Biology 31, 43-48.

cr

Erdős, P., and Rényi, A. (1961). On the strength of connectedness of a random graph. Acta Mathematica Hungarica, 12(1-2), 261-267.

us

Girvan, M., Newman M.E.J. (2002) Community Structure in social and biological networks. Proc.Natl.Acad.Sci. USA 99 (12): 7821-7826.

an

Günthard, H. H., Primas, H. (1956). Zusammenhang von Graphentheorie und MO‐Theorie

M

von Molekeln mit Systemen konjugierter Bindungen. Helvetica Chimica Acta, 39(6), 16451653. Guimera, R., Amaral, L.A. (2005). Cartography of complex networks: modules and universal roles. Journal of Statistical Mechanics: Theory and Experiment, 2005(02), P02001.

te

d

Halchenko, Y. O. ,Hanke, M. (2012). Open is not enough. Let’s take the next step: An integrated, community-driven computing platform for neuroscience. Frontiers in Neuroinformatics, 6:22.

Ac ce p

Han, J. D. J., Bertin, N., Hao, T., Goldberg, D. S., Berriz, G. F., Zhang, L. V., Vidal, M. (2004). Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature, 430(6995), 88-93. Huang, S. (2004). Back to the biology in systems biology: What can we learn from biomolecular networks?. Briefings in Functional Genomics and Proteomics, 2(4), 279-297. Ideker, T., and Krogan, N. J. (2012). Differential Network Biology. Mol. Syst. Biol., 8(1), 565. Lachenbruch, P. A., Goldstein, M. (1979). Discriminant analysis. Biometrics, 35, 69-85. Ma, H., Zeng, A. P. (2003). Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics, 19(2), 270-277. Mikulecky, D. (2001) Network Thermodynamics and Complexity: A Transition in Relational Systems Theory. Computers and Chemistry 25, 369-391. Newman, M. E. J. (2006). Modularity and community structure in networks. Proc.Natl.Acad. Sci. USA103 (23), 8577–8696. Palla, G., Derenyi I., Farkas, I., Vicsek, T. (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 814-818.

Page 23 of 25

Preisendorfer, R. W. (1988). Principal component analysis in meteorology and oceanography (Vol. 17). C. D. Mobley (Ed.). Amsterdam: Elsevier Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics38(8), 904-909.

ip t

Scheffer M., Carpenter SR, Lenton TM, Bascompte J., Brock W., Dalos V., van de Koppel J., van de Leemput IA., Levin SA., van Nes EH., Pascual M., Vandermeer J. (2012) Anticipating Critical Transitions. Science (338),344-348.

cr

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13 (11), 2498-504.

us

SNAP Datasets, Stanford large network dataset collection. http://snap.stanford.edu/data. Jun, 2014.

an

Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F. H., Goehler, H, Wanker, E. E. (2005). A human protein-protein interaction network: a resource for annotating the proteome. Cell, 122(6), 957-968. Strogatz, S. H. (2001). Exploring complex networks. Nature, 410 6825, 268-276.

M

Tasdighian, S., Di Paola, L., De Ruvo, M., Paci, P., Santoni, D., Palumbo, P., Giuliani, A. (2013). Modules identification in protein structures: the topological and geometrical solutions. Journal of Chem. Inf. and Model. 54(1), 159-168.

te

d

Tellegen BA (1952) General Network Theorem with Applications. Phillips Res. Report. (7): 259-269. http://weizmann.ac.il/mcb/U

Ac ce p

Uri AlonLab, collection of complex networks. riAlon/download/collection-complex-networks. July, 2014.

Van Essen, D.C., Smith, S.M., Barch D.M., Behrens E.J., Yacoub, E., Ugurbil, K.(2013). The WU-Minn Human Connectome Project: An overview. NeuroImage 80, 62-79. Verfaillie, A., Imrichová, H., Van de Sande, B., Standaert, L., Christiaens, V., Hulselmans, G., Aerts, S. (2014). iRegulon: from a gene list to a gene regulatory network using large motif and track collections.PLoS Comput. Biol. e1003731. Wolfram Research, Inc., Mathematica, Version 9.0, Champaign, IL (2012). Wu, G., Xin F., Stein L. (2010). A human functional protein interaction network and its application to cancer data analysis. Genome Biol 11: R53. Xia M, Wang J, He Y (2013) BrainNet Viewer: A Network Visualization Tool for Human Brain Connectomics. PLoS ONE 8: e68910 Zbilut J.P., Chua G.H., Krishnan A., Bossa C., Rother K., Webber C.L., Giuliani A. (2007) A topologically related singularity suggests a maximum preferred size in protein domains. PROTEINS Struct. Funct. Bioinf. 66: 621-629.

Page 24 of 25

Ac ce p

te

d

M

an

us

cr

ip t

Zhou, J., Yan, W., Hu, G., Shen, B. (2014). Amino acid network for the discrimination of native protein structures from decoys. Current Protein and Peptide Science, 15(6), 522-528.

Page 25 of 25