Journal of Genetics and Genomics 46 (2019) 397e404
Contents lists available at ScienceDirect
Journal of Genetics and Genomics Journal homepage: www.journals.elsevier.com/journal-of-geneticsand-genomics/
Original research
The expression tractability of biological traits shaped by natural selection Li Liu a, 1, Jianguo Wang a, 1, Jian-Rong Yang b, Feng Wang a, Xionglei He a, * a b
State Key Laboratory of Bio-control, School of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, China Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China
a r t i c l e i n f o
a b s t r a c t
Article history: Received 29 May 2019 Received in revised form 31 July 2019 Accepted 1 August 2019 Available online 13 August 2019
Understanding how gene expression is translated to phenotype is central to modern molecular biology, and the success is contingent on the intrinsic tractability of the specific traits under examination. However, an a priori estimate of trait tractability from the perspective of gene expression is unavailable. Motivated by the concept of entropy in a thermodynamic system, we here propose such an estimate (ST) by gauging the number (N) of expression states that underlie the same trait abnormality, with large ST corresponding to large N. By analyzing over 200 yeast morphological traits, we show that ST predicts the tractability of an expression-trait relationship. We further show that ST is ultimately determined by natural selection, which builds co-regulated gene modules to minimize possible expression states. Copyright © 2019, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, and Genetics Society of China. Published by Elsevier Limited and Science Press. All rights reserved.
Keywords: Natural selection Tractability Expression state Complex trait
1. Introduction A complex system is characterized by the microscopic configuration of its constituents (or microstate) and the macroscopic property of the system (or macrostate) (Ladyman et al., 2013). Cell is a typical complex system in which the expression of genes represents microstate and cellular phenotype represents macrostate (Komili and Silver, 2008). Disentangling the relationships between expression microstate and phenotypic macrostate has been the aim of numerous studies in molecular cell biology, with a focus recently on the production of large-scale data and the development of sophisticated analyzing methods (Janes et al., 2005; Segal et al., 2005; Lee et al., 2008; Zhu et al., 2008; Ayroles et al., 2009; Chen et al., 2009; Civelek and Lusis, 2014; Gamazon et al., 2015; Ritchie et al., 2015; Gusev et al., 2016). For example, Chen et al. developed a method to predict the response of yeast to 94 drugs by integrating genotype and gene expression data (Chen et al., 2009); Gamazon et al. proposed a gene-based association method to detect known and new genes associated with diseases (Gamazon et al., 2015). Despite enormous achievements, the availability of many methods per se indicates the limitation of the data and the lack of generality of
* Corresponding author. E-mail address:
[email protected] (X. He). 1 These authors contribute equally to this work.
the methods (Amin et al., 2014; Lloyd et al., 2015; Volm and Efferth, 2015), highlighting the fact that the success of the extrinsic research efforts is contingent on the intrinsic tractability of the focal relationships. The importance of knowing a priori the tractability of a question to be addressed is widely recognized in mathematics, physics, and chemistry (Ayoub, 1982; Feldman and Crutchfield, 1998; Allu and Oprea, 2005; Lopezruiz et al., 2010). For example, the competition for finding a general formula for solving equations of degree five had lasted ~300 years since the successes in equations of degree three and degree four in the 16th century, and ended in the 19th century when the impossibility of such a general formula was proved; the subsequently developed group theory that gives an effective criterion for the solvability of all polynomial equations formed the basis of modern algebra (Ayoub, 1982; Rosen, 2016). However, biology as a branch of science dominated by empirical solutions has seldom been guided by such wisdoms. 2. Results We reason that here a critical issue is, for a given phenotypic macrostate, how many expression microstates there could be. Similar to the situation in thermodynamic systems, a large number (N) of possible microstates would suggest a strong uncertainty underlying the given macrostate, understanding of which may rely on stochastic thinking more than the prevailing deterministic thinking in biology. In contrast, straightforward understandings would be readily
https://doi.org/10.1016/j.jgg.2019.08.001 1673-8527/Copyright © 2019, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, and Genetics Society of China. Published by Elsevier Limited and Science Press. All rights reserved.
398
L. Liu et al. / Journal of Genetics and Genomics 46 (2019) 397e404
available if the N is small. In research practice each time often a particular trait, which represents a single dimension of the multidimensional macrostate of the system, is examined. For a trait considered qualitatively (e.g., normal versus abnormal), the information flow from genotypes to the same phenotype could be mediated by either the same or different expression microstates (Fig.1A). Here, the expression state of an individual gene can be categorized as 0, 1, and 1 for no expression difference, up-regulation, and downregulation relative to a reference level, respectively. If the trait abnormality is described quantitatively, individuals with similar trait values can be treated as qualitatively the same.
size, bud growth direction, area of nucleus region, etc.) of 1345 yeast Saccharomyces cerevisiae gene-deletion mutants whose genome-wide expression profiles are also available (Table S1) (Materials and methods), permitting us to study the expression microstates of a large number of traits simultaneously (Ohya et al., 2005; Kemmeren et al., 2014). For each morphological trait, we ranked the 1345 yeast mutants in an ascending order according to the focal trait values (Fig. 1B). Because of the large number of mutants assessed, neighbouring mutants often showed highly similar trait values (Fig. S1). We then computed the average expression divergence (D) between two neighbouring mutants using the formula:
2.1. Measure the divergence of the expression programs underlying a trait We propose a measure (ST) of the number of different expression states that underlie the same trait abnormality. Previous studies measured quantitatively 216 morphological traits (e.g., cell
D¼
n1 X
EðMi ; Miþ1 Þ=ðn 1Þ
(1)
i¼1
where n (¼1345) is the total number of mutants, Mi and Miþ1
Fig. 1. Proposition of a model for understanding expression-trait relationships. A: Two models of the information flow from genotype to phenotype that differ in the number of the intermediate expression microstates. B: The workflow of calculating ST using the 1345 yeast mutants.
L. Liu et al. / Journal of Genetics and Genomics 46 (2019) 397e404
represent the mutants ranked i and i þ 1, respectively, and E is the Euclidean distance of the expression profiles of the two mutants (Materials and methods). Note that here neighbouring mutants had commonality only in one dimension (i.e., the focal trait) of the phenotypic macrostate. To gauge the expression divergences that represent dimensions independent from the focal trait, we computed Do, which is the D of the n mutants that are randomly sorted. By repeating the randomization of the mutants 1000 times, we obtained the mean (Do ) and standard deviation (s) of Do, and then had:
ST ¼
D Do
s
(2)
Thus, by controlling the expression divergence unrelated to the focal trait, ST measured specifically the divergence of the expression programs underlying the (nearly) same trait alterations. Using this formula, we obtained ST for each of the 216 yeast traits (Table S2). It varied substantially among traits, ranging from ~ 7 to ~2 (Fig. 2A). Because ST was effectively the Z-score transformation of D in the distribution of Do, 2 < ST < 2 meant no significant difference between D and Do at the confidence level of P z 0.05 (Materials and methods). Thus, 2 < ST < 2 suggests no higher than expected expression convergence underlying the same/similar trait alterations; in contrast, ST < 2 suggests that the same/similar trait alterations of different mutants should be mediated by shared expression programs. Notably, the observation 2 < ST < 2 cannot be explained by strong trait dissimilarity between neighbouring mutants (Fig. S2). The acrosstrait variation of ST was insensitive to the number of genes examined (Fig. S3), as well as the number and the identity of the mutants examined (Fig. 2B), suggesting that ST represents an intrinsic feature of the traits (Fig. 3). 2.2. ST predicts the tractability of an expression-trait relationship Recurrent patterns are the prerequisite for revealing rules from empirical data (Ohnomachado, 2001; Veer and Bernards, 2008). Since ST effectively measures the titer of recurrent patterns of an expression-trait relationship, it may predict how much we could
399
possibly learn from an expression-trait relationship. To make a more general demonstration, we considered machine learning, which is often more capable than humans in dealing with complex relationships. To ensure that the learning is mathematically tractable, we studied the multiple linear regression model and gauged the learning performance using the correlation between observation and prediction. For every trait, we used the same set of mutants as training data, ran the same learning processes without artificial feature selection, and tested the learned models in the same testing mutants (Materials and methods). This design made the prediction performance for different traits comparable. We found that the varied prediction performance of the different traits by linear models was well explained by ST (Figs. 3A and S4). We also tested a non-linear support vector regression (SVR) model, and found that the prediction performance was also highly correlated with ST (Figs. 3B and S4). Although the predictions could be improved with more sophisticated models that can capture highorder gene interactions or with the inclusion of pre-existing knowledge, the above strong correlations suggest that ST can be a good measure of the relative titer of recurrent patterns, which determine the success of any learning practice based on empirical data.
2.3. ST is shaped by natural selection that builds co-regulated gene modules It is interesting to know which biological factors may modulate the ST of a complex trait. We noted that ST is analogous to the entropy of a thermodynamic system. The second law of thermodynamics predicts that the entropy of a thermodynamic system always increases over time unless external forces are applied to the system. We reasoned that natural selection is the only possible external force to constrain a biological system. We simulated how selection could shape an expression network that underlies traits by considering a previously proposed gene-trait architecture (Chen et al., 2016) (Materials and methods). The results suggest two insights: 1) Co-regulated gene modules can dramatically reduce ST; 2) such co-regulations tend to evolve among genes responsible for evolutionarily more important traits (Fig. S5). This is because that
Fig. 2. Calculation of ST for a large number of yeast traits. A: Distribution of ST of the 216 yeast morphologic traits. Approximately 40% of the traits show 2 < ST < 2, a range indicating no higher expression similarity than expected by chance between mutants of similar trait values. B: The relative ST of the 216 traits is insensitive to the reduced number of mutants included. Pearson's R between ST_all based on all 1345 mutants and ST_N based on randomly sampled N mutants, where N ¼ 1200, 1000, 800, 600, 400, and 200, respectively, is shown. Ten rounds of random sampling are considered for each N, and the mean Pearson's R is plotted with the error bar showing one standard deviation. The comparison of STall and ST-200 derived from one random sampling is shown as an inset, with the grey zone showing 95% confidence region of the regression line.
400
L. Liu et al. / Journal of Genetics and Genomics 46 (2019) 397e404
Fig. 3. ST underlies the tractability of a complex trait. Tractability is assessed by the prediction performance of linear model (A) and SVR model (B), two commonly used machine learning models. All yeast traits (Pearson's R ¼ 0.78 or 0.72, n ¼ 216, P < 1016) and the exemplary traits (inset) (Pearson's R ¼ 0.73 or 0.67, n ¼ 31, P < 105 and P < 104, respectively) are shown. The prediction performance is measured by Pearson's correlation coefficient between observed and predicted trait values of the testing mutants. The grey zone shows 95% confidence region of the regression line.
the number of possible expression states could be astronomical if the involved genes lack co-regulation. For example, there would be up to 250 or ~1015 possible expression states for 50 genes each with two states (on and off) if every gene expresses independently. Provided that distinct expression states can lead to similar trait alterations, the likelihood of observing similar expression states for similar trait alterations would be very low. However, the number of possible expression states will reduce dramatically if the 50 genes are subject to co-regulation (the number would be 2 for perfect coregulation). As a consequence, the likelihood of observing similar expression states for similar trait alterations would be high. Because co-regulated gene modules must be built and/or maintained by selection, genes responsible for an unimportant trait subject to little selection are unlikely to evolve co-regulation. Although the simulation was conducted with arbitrary settings, it provided clues on how natural selection might be involved in shaping ST. An immediate prediction is that the yeast traits with ST < 2 should be evolutionarily more important than those of 2 < ST < 2. Following previous studies (Ho and Zhang, 2014; Chen et al., 2016), we used cell growth rate as a proxy of evolutionary importance (or fitness), which was reasonable for the single-celled yeast, and calculated the correlation with fitness for each of the 216 traits (Table S2) (Materials and methods). Traits that are strongly correlated with fitness can be regarded as evolutionarily important. In support of the prediction, the ST of a trait was well explained by the trait importance (Pearson's R ¼ 0.78, n ¼ 216, P < 1016; Fig. 4A); the pattern held when 31 exemplary traits that are less related with each other were considered (Pearson's R ¼ 0.80, n ¼ 31, P < 107; inset of Fig. 4A) (Materials and methods). The result cannot be explained by the varied genetic complexity of the traits (Fig. S6), which is defined by the fraction of genes whose deletion effects are statistically significant, or the varied measuring repeatability (Fig. S7). According to the simulation and reasoning, natural selection reduces ST through promoting co-regulated gene modules to reduce the number of possible expression states that lead to the same phenotypic outputs. We examined the yeast genes forming protein complexes, which are known to be co-regulated (Teichmann and Babu, 2002). We considered 78 core yeast protein complexes each encoded by at least five different genes
(Materials and methods) (Table S3). Because the traits examined here represent nearly all dimensions of the yeast morphology, it is likely that, for at least some of the protein complexes, their coregulations are due to the selective constraints on these traits. For each protein complex, we computed complex-specific ST (c-ST) by using the same formula but considering only the few genes encoding the focal protein complex (Materials and methods). Consistently, c-ST < 2 suggests similar expression profiles of the protein complex genes given the same/similar trait alterations. If the co-regulation within a protein complex is built/maintained by selection irrelevant to the morphological traits, no correlation would be expected between c-ST and the strength of selection on these traits. However, we found that, for many of the protein complexes, c-ST tends to be significant (<2 or even < 4) in important traits (Fig. 4B and C). This result suggests that natural selection acting on these morphological traits accounts for the coregulations of the protein complex genes. Notably, to achieve this, the protein complexes must causally influence the traits instead of being reactive to the traits, because selection on a trait is unable to shape the genes just reactive to the trait. There was a different spectrum of significant c-ST even among the important traits (Fig. 4B), and also the number of exemplary traits that are presumably influenced by each protein complex varied substantially (Fig. 4D). This heterogeneity suggested that the significant c-ST cannot be explained by a single factor, such as the response to growth reduction. Thus, by associating known protein complexes with many morphological traits, we revealed both common and trait-specific mechanistic insights into the yeast traits (Table S4). For example, the Isw1 complex involved in modifying chromatin structure appeared to underlie C102_C, a trait measuring the contour length of budding cell, and the TRAPP complex involved in transporting vesicles from ER to plasma membrane appeared to affect D109_C, a trait measuring the distance from neck to budding cell's nucleus. These understandings provide rich information for future studies on cell morphology. To better explain the analysis of this part, using the protein phosphatase type 2A complex as an example, we plotted the expression profiles of the 7 member genes of this complex in the top 5 mutants of an important trait D109_C and an unimportant trait D197_C, respectively (Fig. 4B).
L. Liu et al. / Journal of Genetics and Genomics 46 (2019) 397e404
401
Fig. 4. Evidence for selection in shaping the ST of a complex trait. A: ST is explained by the trait's evolutionary importance (Pearson's R ¼ 0.78, n ¼ 216, P < 1016). A similar pattern is shown as an inset for exemplary traits that are largely independent from each other (R ¼ 0.8, n ¼ 31, P < 107). The grey zone shows 95% confidence regions of the regression line. B: The c-ST of 78 protein complexes in the 31 exemplary traits that are sorted by trait importance. Different traits show a different spectrum of significant c-ST (<2 or 4), suggesting trait-specific effects provided by the protein complexes. The expression profiles of an example protein complex in the top 5 mutants of the traits D109_C and D197_C are shown in the right panel, respectively; shared expression profiles are found in neighbouring mutants of D109_C (c-ST < 4), the evolutionarily more important trait that measures the distance from neck to bud's nucleus. C: More significant c-ST are found for important traits, suggesting that the selection for maintaining the expression coordination of the protein complexes can be mediated through these morphological traits. D: Frequency distribution of the numbers of exemplary traits that are affected by each of the 78 protein complex (c-ST < 2).
3. Discussion
4. Materials and methods
The central question addressed in this study is about the number (N) of different expression states that underlies a given trait abnormality. We developed a simple statistic (ST) for describing such an N-to-one expression-trait relationship. Although expression levels are categorized in the calculation of ST, using continuous gene expression levels leads to essentially the same results (Fig. S8). Using yeast data, we showed that ST is constrained by natural selection, which helps build/maintain co-regulated gene modules to reduce N, and that ST well predicts the relative tractability of over 200 expression-trait relationships. Considering the similarity to the entropy of a thermodynamic system, ST can be viewed as the expression entropy of a biological trait. Accordingly, stochastic thinking might be necessary to deal with the strong expression uncertainty of a trait with ST 0, a proposition reminding biologists of the paradigm shift in modern physics from classical mechanics to statistical mechanics. Because the definition of ST is independent of the yeast system and natural selection as the ultimate determinant of ST applies to all organisms, the main conclusions of this yeastbased study are likely to be of general meaning, although further work is certainly required to test the findings in other organisms.
4.1. Data The microscopic images of triple-stained yeast cells were generated and analyzed by Ohya et al. (2005), characterizing quantitatively 501 morphological traits for 4718 different yeast mutants each lacking a non-essential gene (SCMD). A proportion of the morphological traits (220 traits) with information of individual cells were considered by Ho and Zhang (2014). We considered just 216 traits that showed consistent information between the two previous studies in this analysis. The 216 morphological traits were also characterized for 122 wild-type yeast cell populations by Ohya et al. (2005). For each trait, the mean and standard variation of the 122 wild-type values were computed, and the raw trait value of a mutant was then scaled into Z-score using the mean and standard deviation of the wild types. In addition, the cell growth rate of 4449 yeast gene-deletion mutants were measured by Qian et al. (2012), and the microarray-based expression profile of 1484 mutants were generated by Kemmeren et al. (2014). Because there were 6123 genes covered in the microarray, the cutoff of defining expression up-regulation or down-regulation relative to wild type was set to
402
L. Liu et al. / Journal of Genetics and Genomics 46 (2019) 397e404
P < 0.0001 (the number of expected expression changes by chance is ~0.6). With manual corrections for the annotation of the different datasets, there were 1345 mutants each with all information of 216 morphological traits, growth rate, and expression profile. A total of 409 yeast core protein complexes were determined by Benschop et al. (2010), and 78 encoded by at least five different genes were considered in this study. 4.2. Define the terms or concepts used in this study ST and c-ST: The Euclidean distance of the expression profiles of two neighbouring mutants (EðMi ; Miþ1 Þ) was calculated as Eq. (3) for ST, where ei,j was the expression state of gene j in mutant i. Ztest was performed to measure the difference between D and D0. ST values between 2 and 2 were regarded as no significant difference between D and Do. Because standardized D0 followed standard normal distribution with mean and standard deviation equal to 0 and 1, respectively, here the “2 to 2” range corresponded to ~95% confidence interval.
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u6123 uX 2 EðMi ; Miþ1 Þ ¼ t ei;j eiþ1;j
(3)
j¼1
underlying the trait. To build the initial network structure, we assigned regulatorworker interactions by assuming equal probability for all regulator-worker pairs so that on average a regulator controls 10 workers. The regulatory strength of a regulator-worker interaction considered two parameters: a regulator-specific trans-coefficient q, and a cis-coefficient r that was specific for each regulator-worker interaction. All q and r values were randomly chosen from standard normal distribution. The relative expression of worker j was then given by Eq. (5), where c ¼ 2 was a constant baseline expression level for every worker, R was the number of regulators that control worker j. R P 0
gj ¼ ce i¼1
qi þrij
(5)
Assuming a simple resource allocation model for expressions, the absolute expression level of worker j was given by Eq. (6), where C ¼ 100 was the total expression capacity (only workers’ expressions are considered). 0
gj ¼ Cg j
50 .X
0
gj
(6)
j¼1
The c-ST was computed similarly to ST but considering only the few genes encoding the focal protein complex and using a modified Euclidean distance (EðMci ;Mciþ1 Þ) as Eq. (4) to controls for the gene number (Nc).
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi, u Nc uX 2 EðMci ; Mciþ1 Þ ¼ t ei;j eiþ1;j Nc
(4)
j¼1
Exemplary traits: The 216 traits were not completely independent from each other, and 31 exemplary traits were derived by clustering analysis (R package, ‘apcluster’, negDistMat, r ¼ 2, Bodenhofer et al., 2011) of the 1345 mutants. Trait evolutionary importance: The Pearson's R between the absolute Z-score of a morphological trait and cell growth rate was computed among the 1345 yeast mutants. Because cell growth rate is a reasonable proxy of fitness for the single-celled yeast, morphological traits with a stronger correlation with fitness (larger R) were regarded as evolutionarily more important than those with a smaller R. Trait dissimilarity: To make the different traits more comparable, for each trait the average jZi e Ziþ1j/(jZij þ jZiþ1j) was used to represent the trait dissimilarity between neighbouring mutants, where Zi and Ziþ1 were the Z-score trait value of the sorted mutant i and i þ 1, respectively (i ¼ 1, 2, …, 1344). Genetic complexity: The genetic complexity of a trait was measured by the fraction of non-essential yeast genes that affect the trait (fgenes), which was obtained from Ho et al. (2014). Trait measuring repeatability: Often a few hundred cells of a mutant were examined in Ohya et al. (2005). We randomly divided the individual cells of each mutant into two equal halves and computed the traits for each half separately. For each trait, the Pearson's R between the two halves among the 1345 mutants was considered as the measuring repeatability of the focal trait. 4.3. Simulation of the evolution of a gene network We supposed a network consisting of 50 worker genes regulated by four regulators and four traits each determined by a nonoverlapping group of ten worker genes (remaining ten workers not associated with any trait). The value of a trait was the geometric average of the absolute expression levels of the specific ten workers
With this network structure unaltered, we simulated 100 scenarios each with different q and r values to derive the initial trait values. The average of the 100 values was used as the expected initial value t’ of a trait. We defined the optimal value of trait k in environment l by Eq. (7), where d acted as the “direction” factor to determine the direction of selection for optimality, and s was a random number of the uniform distribution between 1 and 1, which provided additional variability for the optimal trait values in different environments. 0
t *kl ¼ t des
(7)
A total of ten environments were considered, and, for each trait, d equaled to 100 in five environments and to 0.01 in the other five environments. In addition, for the ten environments considered, the trans-coefficient q of a regulator may vary across different environments (for each regulator there were different q values drawn from standard normal distribution in the initial network), but the cis-coefficient r of a regulator-worker interaction remained constant across the environments. The fitness of a specific network in environment l was calculated as Eq. (8), where wk was the fitness weight for trait k (w1 ¼ w2 ¼ 0.1, w3 ¼ w4 ¼ 10), and tkl was the value of trait k in environment l. 4 P
fl ¼ e k¼1
wk jlogðtkl =t *kl Þj=4
(8)
The combined fitness of a network was the harmonic average of the ten environment-specific fitness. To simulate the evolution of the network we considered a population of 100 individuals. For each individual, four kinds of mutations were applied independently each with 10% probability: 1) adding a random number drawn from the uniform distribution between 0.1 and 0.1 to a random qil; 2) adding a random number drawn from the uniform distribution between 0.1 and 0.1 to a random rij; 3) creating an interaction for a random regulatorworker pair; and 4) deleting an random regulator-worker interaction. After the mutations were applied, the new fitness of each individual/genotype was calculated. The 100 individuals/genotypes of the next generation were obtained by drawing with replacements randomly from the current 100 individuals/genotypes
L. Liu et al. / Journal of Genetics and Genomics 46 (2019) 397e404
according to the new relative fitness of the individuals/genotypes. This process contained mutation, selection and drift, and was repeated for 10,000 generations.
Supervised machine learning was conducted to infer an expression-trait function for each of the 216 traits using the 1345 mutants. The mutants were divided into a training set and a testing set with a ratio of 4:1. For each trait, the input features were all genes' expression levels and the output was the corresponding trait value. A linear model and the non-linear SVR model were considered. A ten-fold cross validation for the linear model or a five-fold cross validation for the SVR model was first carried out in the training set to tune the hyper-parameters; normal learning processes were then run in the same training set. The trained models were tested in the testing set. For each trait the performance of the trained models was evaluated by Pearson's R between observed and predicted trait values, normalized root mean squared error (NRMSE), and mean absolute error (MAE), respectively. Below were the details of the two models. Multiple linear model: The hypothesis of multiple linear model was represented as Eq. (9), where X was an m by n matrix of expression levels, y was an m by 1 vector of trait values (m was the number of samples considered and n was the number of features), b was the parameters to be learned from training set and was defined as the minimizer of the cost function of Eq. (9), as shown by Eq. (10), where l was the controller of contribution on the cost function by the regularization term.
b ¼ argminn
(9) 1
b2R 2m
2
2
kX b yk2 þ lkbk2
(10)
Gradient descent algorithm written in conventional MATLAB code was used to estimate b under the optimal l determined by cross validation. Then performance of the learned model with estimated b was tested in the testing set. SVR model: Epsilon-support vector regression is a non-linear regression model through kernel transformation of original features (Pedregosa et al., 2011). The hypothesis of SVR was represented as Eq. (11) with Gaussian (RBF) kernel shown as Eq. (12), where N, X and xi represented the number of support vectors, original m by n feature matrix and a support vector, respectively.
y ¼ wT FðXÞ þ b þ ε ¼
N X
wi Kðxi ; XÞ þ b þ ε
(11)
i
2 K xi ; xj ¼ exp gxi xj
(12)
SVG solved a primal optimization problem given as Eq. (13). m X 1 T w wþC di þ d*i subject to w;b;di ;d 2 i¼1
min
* i
yi wT Fðxi Þ b ε þ di ; wT Fðxi Þ þ b yi ε þ
parameters, C and g were tuned by grid search approach combined with a five-fold cross validation. Learned model in training set was evaluated in testing set. Acknowledgments
4.4. Empirical performance of machine learning
y ¼ Xb þ ε
403
(13)
d*i ;
di ; d*i 0; i ¼ 1; :::; m The SVR function with sequential minimal optimization (SMO) algorithm was embedded in the cikit-learn package of Python, and achieved optimization process according to Eq. (13). Two hyper-
This work was supported by research grants from National Natural Science Foundation of China (Nos. 31630042 and 91731302). We are grateful to the discussions with members of He lab. Supplementary data Supplementary data to this article can be found online at https://doi.org/10.1016/j.jgg.2019.08.001. References Allu, T.K., Oprea, T.I., 2005. Rapid evaluation of synthetic and molecular complexity for in silico chemistry. ChemInform 45, 1237e1243. Amin, S.B., Yip, W., Minvielle, S., Broyl, A., Li, Y., Hanlon, B.M., Swanson, D., Shah, P.K., Moreau, P., Der Holt, B.V., 2014. Gene expression profile alone is inadequate in predicting complete response in multiple myeloma. Leukemia 28, 2229e2234. Ayoub, R.G., 1982. On the nonsolvability of the general polynomial. Am. Math. Mon. 89, 397. Ayroles, J.F., Carbone, M.A., Stone, E.A., Jordan, K.W., Lyman, R.F., Magwire, M.M., Rollmann, S.M., Duncan, L., Lawrence, F., Anholt, R.R.H., 2009. Systems genetics of complex traits in Drosophila melanogaster. Nat. Genet. 41, 299e307. Benschop, J.J., Brabers, N., Van Leenen, D., Bakker, L., Van Deutekom, H.W.M., Van Berkum, N.L., Apweiler, E., Lijnzaad, P., Holstege, F.C.P., Kemmeren, P., 2010. A Consensus of Core Protein Complex Compositions for Saccharomyces cerevisiae. Mol. Cell 38, 916e928. Bodenhofer, U., Kothmeier, A., Hochreiter, S., 2011. APCluster: an R package for affinity propagation clustering. Bioinformatics 27, 2463e2464. Chen, B., Causton, H.C., Mancenido, D., Goddard, N.L., Perlstein, E.O., Peer, D., 2009. Harnessing gene expression to identify the genetic basis of drug resistance. Mol. Syst. Biol. 5, 310. Chen, H., Wu, C.-I., He, X., 2016. The regulator-executor-phenotype architecture shaped by natural selection. BioRxiv. https://doi.org/10.1101/026443. Civelek, M., Lusis, A.J., 2014. Systems genetics approaches to understand complex traits. Nat. Rev. Genet. 15, 34e48. Feldman, D.P., Crutchfield, J.P., 1998. Measures of statistical complexity: why? Phys. Lett. A 238, 244e252. Gamazon, E.R., Wheeler, H.E., Shah, K.P., Mozaffari, S.V., Aquinomichaels, K., Carroll, R.J., Eyler, A.E., Denny, J.C., Nicolae, D.L., Cox, N.J., 2015. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091e1098. Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B.W.J.H., Jansen, R., De Geus, E.J.C., Boomsma, D.I., Wright, F.A., 2016. Integrative approaches for largescale transcriptome-wide association studies. Nat. Genet. 48, 245. Ho, W.C., Zhang, J., 2014. The genotype-phenotype map of yeast complex traits: basic parameters and the role of natural selection. Mol. Biol. Evol. 31, 1568e1580. Janes, K.A., Albeck, J.G., Gaudet, S., Sorger, P.K., Lauffenburger, D.A., Yaffe, M.B., 2005. A systems model of signaling identifies a molecular basis set for cytokineinduced apoptosis. Science 310, 1646e1653. Kemmeren, P., Sameith, K., De Pasch, L.V., Benschop, J.J., Lenstra, T.L., Margaritis, T., Duibhir, E.O., Apweiler, E., Van Wageningen, S., Ko, C.W., 2014. Large-scale genetic perturbations reveal regulatory networks and an abundance of genespecific repressors. Cell 157, 740e752. Komili, S., Silver, P.A., 2008. Coupling and coordination in gene expression processes: a systems biology view. Nat. Rev. Genet. 9, 38e48. Ladyman, J., Lambert, J., Wiesner, K., 2013. What is a complex system? Eur. J. Philos. Sci. 3, 33e67. Lee, I., Lehner, B., Crombie, C., Wong, W., Fraser, A.G., Marcotte, E.M., 2008. A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat. Genet. 40, 181e188. Lloyd, K., Cree, I.A., Savage, R.S., 2015. Prediction of resistance to chemotherapy in ovarian cancer: a systematic review. BMC Cancer 15, 117. Lopezruiz, R., Mancini, H.L., Calbet, X., 2010. A statistical measure of complexity. Phys. Lett. A 209, 321e326. Ohnomachado, L., 2001. Modeling medical prognosis: survival analysis techniques. Comput. Biomed. Res. 34, 428e439. Ohya, Y., Sese, J., Yukawa, M., Sano, F., Nakatani, Y., Saito, T., Saka, A., Fukuda, T., Ishihara, S., Oka, S., 2005. High-dimensional and large-scale phenotyping of yeast mutants. Proc. Natl. Acad. Sci. U. S. A. 102, 19015e19020. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
404
L. Liu et al. / Journal of Genetics and Genomics 46 (2019) 397e404
Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn Res. 12, 2825e2830. Qian, W., Ma, D., Xiao, C., Wang, Z., Zhang, J., 2012. The Genomic Landscape and Evolutionary Resolution of Antagonistic Pleiotropy in Yeast. Cell Rep. 2, 1399e1410. Ritchie, M.D., Holzinger, E.R., Li, R., Pendergrass, S.A., Kim, D., 2015. Methods of integrating data to uncover genotype-phenotype interactions. Nat. Rev. Genet. 16, 85e97. Rosen, M., 2016. Niels Hendrik Abel and equations of the fifth degree. Am. Math. Mon. 102, 495e505. Segal, E., Friedman, N., Kaminski, N., Regev, A., Koller, D., 2005. From signatures to
models: understanding cancer using microarrays. Nat. Genet. 37, S38eS45. Teichmann, S.A., Babu, M.M., 2002. Conservation of gene co-regulation in prokaryotes and eukaryotes. Trends Biotechnol. 20, 407e410. Veer, L.J.V.T., Bernards, R., 2008. Enabling personalized cancer medicine through analysis of gene-expression patterns. Nature 452, 564e570. Volm, M., Efferth, T., 2015. Prediction of cancer drug resistance and implications for personalized medicine. Front. Oncol. 5, 282. Zhu, J., Zhang, B., Smith, E.N., Drees, B., Brem, R.B., Kruglyak, L., Bumgarner, R.E., Schadt, E.E., 2008. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat. Genet. 40, 854e861.