CHAPTER 11
Drug Repurposing From Transcriptome Data: Methods and Applications
Daniel Toro-Domínguez*,†, Marta E. Alarcón-Riquelme†,‡, Pedro Carmona-Sáez*
*Bioinformatics Unit, GENYO: Centre for Genomics and Oncological Research, Pfizer/University of Granada/Andalusian Regional Government, Granada, Spain
†Medical Genomics, GENYO: Centre for Genomics and Oncological Research, Pfizer/University of Granada/Andalusian Regional Government, Granada, Spain
‡Unit of Inflammatory Chronic Diseases, Institute of Environmental Medicine, Karolinska Institutet, Solna, Sweden
1 INTRODUCTION
For decades, the main strategy for drug development has been the high-throughput screening of thousands of molecules simultaneously in order to identify compounds that show activity against therapeutic targets (Iorio, Rittman, Ge, Menden, & Saez-Rodriguez, 2013). The more drugs are tested, the more likely it is that an encouraging result will be found, but these approaches carry huge costs and low efficiency, that is, a very low efficiency/cost ratio. This makes the traditional technique unviable as a standard in the field, and cheaper and more efficient alternatives are therefore needed. Using previous knowledge of the system, for example, information about specific targets, the range of potential drugs for testing can be reduced, increasing the efficiency/cost ratio. Nevertheless, in this scenario potential hits will not be evaluated if they have not previously been established as target-related. In addition, the efficiency of target-oriented approaches is low for polygenic and complex diseases. This is because these diseases often alter the biological system at different points and in different pathways, and alternative networks may be activated when blocking specific molecules or pathways that would be potential biomarkers. In addition, genetic heterogeneity can contribute to a candidate drug being ineffective in most
In Silico Drug Design. https://doi.org/10.1016/B978-0-12-816125-8.00011-0
© 2019 Elsevier Inc. All rights reserved.
cases. Therefore, population heterogeneity should be taken into consideration, something rarely done at the early stages of classical in-vitro tests. In addition, drug candidates may show adverse side effects at later phases of the drug-discovery process. This is because primary studies usually take a local point of view, analyzing the effect on specific targets or signaling pathways, while the effect of the drug at the system level cannot be evaluated from a molecular perspective. In this context, drug repurposing is a potential alternative for drug discovery that addresses some of these drawbacks. Drug repurposing, or drug repositioning, consists of finding new medical uses for existing drugs (Ashburn & Thor, 2004). This is a cost-effective process because new therapeutic options arise from approved drugs, which drastically reduces time and cost by skipping phase I clinical trials (Chong & Sullivan Jr., 2007), and because the pharmacokinetic, pharmacodynamic, and toxicity profiles of the drugs are already known. Drug repositioning has historically been driven by serendipity or by a deeper understanding of the mechanism of action (MOA) (Li et al., 2016). A classic example is sildenafil, currently used to treat erectile dysfunction but initially studied to treat hypertension (Morales, Gingell, Collins, Wicker, & Osterloh, 1998). In recent years, the development of high-throughput experimental techniques and the accumulation of large volumes of -omics data have allowed researchers to develop new computational approaches for effective drug-repurposing or drug-discovery analysis. Most of them use previous knowledge to explore shared properties of compounds, for example, chemical structure data (King, Long, Pfalmer, Andersen, & McDougal, 2018; Ma et al., 2013), information about side effects (Campillos, Kuhn, Gavin, Jensen, & Bork, 2008), or information related to mechanisms of action or target pathways (Iorio et al., 2010; Pan, Cheng, Wang, & Bryant, 2014).
Nevertheless, these methods are primarily focused on finding new potential compounds that share some properties with a reference drug, but they are unable to connect drugs and diseases through the molecular properties of the disease phenotype. In this context, methods based on transcriptomics data and functional genomics have emerged as very promising approaches for drug repurposing. The leading idea is that a given drug induces a specific gene-expression signature, that is, a set of genes that are overexpressed or underexpressed (Itadani, Mizuarai, & Kotani, 2008), that reflects the effect of the drug in the cell. Therefore, comparing gene-expression signatures allows us to establish connections between phenotypes (drug- or disease-induced gene-expression signatures). Starting from a disease gene-expression signature, that is, the set of differentially expressed genes in the disease state with respect to the normal phenotype, we can mine a database of gene-expression signatures from many compounds and compute a correlation measurement between the disease signature and each of them. In this way we can discover drugs with inverse correlations, that is, drugs for which the genes upregulated in the disease are downregulated by the drug and vice versa. We can hypothesize that if a drug causes a transcriptional signature opposite to the gene-expression signature of the disease, it can reverse the signature of the disease and therefore the pathogenic phenotype, beyond concrete targets and biological pathways (Sirota et al., 2011). On the other hand, positive correlations can provide information about common pathways or targets, which can be useful if, for example, the molecular basis of the drug is well characterized, and may provide clues about the pathogenesis of the disease. A positive correlation might also implicate the particular drug as a potential inducer of the disease, with potential applications in generating in-vitro or in-vivo models for mechanistic experimentation.
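The signature-reversal idea described above can be illustrated with a minimal sketch. It assumes that each signature is available as a mapping from gene symbol to log fold-change and uses the Spearman rank correlation as the correlation measurement; the gene names, drug names, and values are hypothetical, and the function names are ours rather than part of any published tool.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rank_drugs(disease, drugs):
    """Sort drugs from most negatively to most positively correlated."""
    scored = []
    for name, signature in drugs.items():
        genes = sorted(set(disease) & set(signature))  # shared genes only
        if len(genes) < 2:
            continue  # not enough overlap to compute a correlation
        rho = spearman([disease[g] for g in genes], [signature[g] for g in genes])
        scored.append((name, rho))
    return sorted(scored, key=lambda item: item[1])

# Hypothetical log fold-changes (disease vs. control, treated vs. untreated).
disease = {"G1": 2.0, "G2": 1.5, "G3": 0.1, "G4": -1.2, "G5": -2.5}
drugs = {
    "drug_reverse": {"G1": -1.8, "G2": -1.0, "G3": 0.0, "G4": 0.9, "G5": 2.2},
    "drug_mimic": {"G1": 1.9, "G2": 1.1, "G3": 0.2, "G4": -0.8, "G5": -2.0},
    "drug_unrelated": {"G1": 0.3, "G2": -0.4, "G3": 1.0, "G4": 0.2, "G5": -0.1},
}
ranking = rank_drugs(disease, drugs)
for name, rho in ranking:
    print(f"{name}: rho = {rho:+.2f}")
```

Drugs at the top of the list (most negative correlation) would be the repurposing candidates under the reversal hypothesis, whereas drugs at the bottom share the disease pattern and may point to common pathways.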
2. THEORETICAL BACKGROUND AND METHODOLOGIES
In this context, Hughes et al. published a pioneering work generating a gene-expression compendium in Saccharomyces cerevisiae, showing the potential of exploring similarities between gene-expression signatures as an alternative to high-throughput screening of chemical libraries (Hughes et al., 2000). Some years later, Lamb (2006) developed the connectivity map (CMap), a reference collection of gene-expression profiles from cultured human cells treated with bioactive small molecules that opened a totally new scenario for drug-discovery research. Since the publication of these breakthrough works, the widespread use of transcriptomics techniques, mainly DNA microarrays and next-generation sequencing (NGS), has prompted the exponential growth of public gene-expression repositories (Barrett et al., 2013) that can be used for drug-repurposing applications. In addition, the NIH Library of Integrated Network-based Cellular Signatures (LINCS) initiative (Keenan et al., 2018) has recently released the L1000 database (Subramanian et al., 2017), the evolution of the CMap, with 1.3 million profiles from 42,080 genetic and small-molecule perturbations profiled across a large number of cell types. This is an invaluable resource that will guide drug-discovery research and personalized medicine applications in coming years. It is also remarkable that the number of articles in PubMed retrieved with the search “drug repurposing[Title/Abstract] OR drug repositioning[Title/Abstract]” has increased from 15 articles published in 2009 to 377 published in 2017. Comparing gene-expression signatures from different drugs can be used to establish drug-drug connections. Drugs that induce a similar set of over- and underexpressed genes are expected to share common mechanisms and targets or to affect similar pathways in the cell.
In this way, we can generate new hypotheses about mechanisms of action for a new compound if it shows strong correlation with a well-characterized compound (Iorio et al., 2013). The mechanism of action is defined as the set of targets (genes, proteins, metabolites, etc.) necessary to produce its pharmacological effect and the characterization of mechanisms of action (MOA) is a major challenge in drug discovery. For drug-repurposing applications it is reasonable to hypothesize that if two drugs share a similar gene-expression signature they could share a similar therapeutic application. In the present chapter, we will review the most popular methods for drug-repurposing from transcriptomic data, as well as the main applications and available resources and databases. Fig. 1 shows a summary of the main drug-repurposing pipelines and applications.
2 METHODS
Drug-repurposing approaches can be classified according to different criteria. Based on, for example, the scale of information and knowledge that can be managed, methods can be grouped into drug-focused, disease-focused, and treatment-focused (Jin & Wong, 2014). Another classification scheme is based on methodological or algorithmic similarity (Li et al., 2016). In this review we only take into account methods that use transcriptomic data, directly or indirectly. In this context, we have grouped the different algorithms into two main groups, that is, similarity-based and machine-learning methods, where we also include network-based approaches.
FIG. 1 Summary of drug-repurposing analysis pipelines. The image is divided into three sections. At the top, the type of data and the available databases from which they can be extracted. In the middle, the methodological variants that can be applied based on the starting data, mainly differentiating the methods based on the gene similarity of signatures and machine learning-based approaches. At the bottom, a summary of the different applications is shown.
[Figure labels. Data: drug signature database; disease signature; target database; drug structure; drug indications; side effects; pathway information. Similarity-based methods: GSEA; PGSEA; XSum; SGSE; co-expressed genes based approach. Machine learning-based methods: network analysis; matrix factorization; BLM. Applications: drug-disease connection; disease-disease similarities; drug-drug similarities; drug-target connection; drug combination. Outcomes: new drug candidates; extrapolation from the treatment of one disease to another; drugs with similar MoA; drugs with similar targets.]
2.1 Similarity-Based Methods
Similarity-based methods rely on the direct comparison of gene-expression signatures derived from different processes or contexts in order to measure the degree of relationship or similarity between them. The degree and significance of the correlation between two gene signatures reveal whether the signatures have common patterns, for example, sets of shared upregulated genes; inverse patterns, such as groups of genes upregulated in one signature and downregulated in the other; or no relationship at all. The first approaches were based on computing correlation coefficients between global gene-expression patterns (Hughes et al., 2000), but several other methods have been proposed and refined in the last decade.
2.1.1 Gene Set Enrichment Analysis
Gene set enrichment analysis (GSEA) is a nonparametric method based on the Kolmogorov-Smirnov (KS) statistic. It was initially developed for functional analysis applications in order to decipher the main biological pathways or gene sets related to sets of over- or underexpressed genes (Subramanian et al., 2005). This method was originally implemented in the CMap application and was adopted in many studies (Hassane et al., 2008; Liu et al., 2015). Briefly, GSEA consists of the following steps:
(1) Generation of a sorted list of genes associated with a given phenotype or condition. Genes are arranged from highest to lowest based on a given metric, which usually reflects the differences in their expression levels between two conditions; for example, genes can be ordered by fold-change or by significance values.
(2) Calculation of an enrichment score for a gene set S that reflects the degree to which the genes in the set are overrepresented at the extremes (top or bottom) of the entire ranked list. The score is calculated by walking down the list, increasing a running-sum statistic when a gene from the gene set is encountered and decreasing it when a gene outside the set is encountered. In this way, the score takes high positive values when the genes from the gene set are concentrated at the top of the ranked list and strongly negative values when they are concentrated at the bottom.
(3) Estimation of the significance value for each gene set by permutation.
In CMap, the query gene signature is composed of two gene sets, one with the overexpressed genes and another with the underexpressed genes. GSEA is applied to each gene set against each drug signature (a sorted list from the comparison between treated and untreated cells). A similarity metric ranging from −1 to 1 between the query gene signature and each drug-derived signature is computed, which is called the “connectivity score.” Positive values denote that the query and the drug signature induce and repress a similar set of genes, while negative values (i.e., negative correlation) mean that the genes overexpressed in the query are underexpressed by the drug and vice versa. The greater the absolute value of the connectivity score, the greater the correlation, positive or negative, between the query and the drug signature. There are several variations of GSEA. One of interest is parametric gene set enrichment analysis (PGSEA) (Kim & Volsky, 2005). It is based on a parametric statistical model of the
traditional gene-set enrichment analysis that improves the computational efficiency and the analysis of minimally changed gene-expression profiles.
2.1.2 Co-Expressed Gene-Set Enrichment Analysis
Gene co-expression analysis is an approach to extract sets of genes whose expression is highly correlated across different samples, that is, genes that work in concert and can therefore be considered regulatory modules. These modules provide an alternative space to gene-based comparisons that can offer significant advantages for establishing connections between gene-expression signatures and phenotypes. In drug-repurposing analysis this concept was implemented in Cogena (Jia et al., 2016), and it is divided into three main steps. First, genes are clustered to find groups of co-expressed genes. Second, an enrichment analysis is performed for each cluster to identify the pathways and gene sets enriched in it. In this way, clusters of genes are translated into a functional space. Finally, a second enrichment analysis is performed, this time using the significant functional gene sets as input data and the gene signatures of a set of drugs as a reference. This analysis yields a list of drugs that are strongly related to the significant functional pathways, which in turn are the pathways related to the co-expressed gene clusters from the condition under study. The result is a set of candidate drugs acting on the biological functions that could be playing an important role in the regulatory mechanisms of the studied condition.
2.1.3 Other Metrics
The standard KS statistic implemented in GSEA-related methods is by far the most commonly used metric for drug repurposing from gene-expression data. Nevertheless, in the last few years, several alternative metrics have been proposed.
For example, weighted signed statistics that increase the score for those compounds that have many replicates in the database have been proposed to reduce false positives while increasing sensitivity (Zhang & Gant, 2008). Cheng et al. (2012) proposed the eXtreme cosine (XCos) metric and showed that it outperforms the standard KS similarity statistics. The same group also published a systematic evaluation of different metrics in a curated dataset (Cheng, Yang, Kumar, & Agarwal, 2014) and showed that a simple scoring algorithm called eXtreme Sum (XSum) performs better than the standard KS or XCos. These scores are also reviewed in Musa et al. (2017).
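To make the scores discussed in this section more concrete, the following sketch implements an unweighted KS-like running-sum enrichment score, a simplified connectivity score, and an XSum-style score. It is an illustration under stated assumptions, not the CMap or Cheng et al. implementation: the connectivity score here is simply half the difference between the two enrichment scores (set to zero when both have the same sign) without the normalization across drug instances used in CMap, and the XSum function follows our reading of the metric as the sum of drug-induced changes over query-up genes minus the sum over query-down genes, restricted to the most extreme genes of the drug profile. All gene names and values are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Signed maximum deviation from zero of an unweighted running sum."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down on any other gene.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def connectivity_score(ranked_genes, up_genes, down_genes):
    """Simplified score in [-1, 1]; negative values suggest reversal."""
    es_up = enrichment_score(ranked_genes, up_genes)
    es_down = enrichment_score(ranked_genes, down_genes)
    if es_up * es_down > 0:  # both query sets enriched at the same extreme
        return 0.0
    return (es_up - es_down) / 2.0

def xsum(drug_profile, up_genes, down_genes, n_extreme):
    """XSum-style score restricted to the n_extreme top and bottom genes."""
    ordered = sorted(drug_profile, key=drug_profile.get, reverse=True)
    extreme = set(ordered[:n_extreme]) | set(ordered[-n_extreme:])
    up_total = sum(drug_profile[g] for g in up_genes if g in extreme)
    down_total = sum(drug_profile[g] for g in down_genes if g in extreme)
    return up_total - down_total

# Hypothetical drug profile (treated vs. untreated), highest values first.
drug_profile = {"A": 3.0, "B": 2.0, "C": 0.5, "D": -0.2, "E": -1.5, "F": -2.5}
ranked = sorted(drug_profile, key=drug_profile.get, reverse=True)

# Hypothetical disease query that the drug reverses.
query_up, query_down = {"E", "F"}, {"A", "B"}

cs_reverse = connectivity_score(ranked, query_up, query_down)
cs_mimic = connectivity_score(ranked, query_down, query_up)
xs_reverse = xsum(drug_profile, query_up, query_down, n_extreme=2)
print(cs_reverse, cs_mimic, xs_reverse)
```

In this toy case the genes upregulated in the query sit at the bottom of the drug-ranked list, so both the connectivity score and the XSum-style score are negative. Significance would still need to be estimated, for example by permutation as in step (3) of GSEA.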
2.2 Machine-Learning Algorithms
Machine learning-based algorithms are innovative techniques that work with similarity measures. Their aim is to construct a set of classification rules, and then to learn each rule, in order to distinguish true from false node associations or relations between two entities (for example, drug-disease, target-drug, or drug-drug associations, among others) (Vanhaelen et al., 2017). Machine-learning algorithms are a powerful way to predict new associations and drug indications, but previous knowledge is generally necessary. In the next sections we describe some of the main methods that have been applied to drug repurposing.
2.2.1 Network-Based Approaches
Because complex diseases cause multifactorial perturbations, cellular-network approaches are expected to yield results closer to reality than traditional studies in which only one target or a set of unconnected targets is studied. These observations have led to a recent paradigm shift from target-centered to systems-driven drug discovery (Cichonska, Rousu, & Aittokallio, 2015). A network can be depicted as a graph of connected nodes, where each node represents an individual molecular entity, for instance, a drug or a gene, whereas an edge represents either a direct or indirect interaction between two nodes. Different models have been developed, such as power graphs or the information flow model, but the most widely used are Bayesian networks because of their capacity for expressing causal relationships and learning from incomplete datasets (see a review in Pujol, Mosca, Farres, & Aloy, 2010). Wu, Wang, and Chen (2013) reviewed many applications of network-based approaches in the field of drug repurposing and grouped them into three main types: (i) approaches that are based on information about disease-gene associations and drug-target associations, (ii) those that use molecular activity information, mainly from -omics datasets, such as gene-expression profiles, and (iii) methods based on information about drug-induced phenotypes, such as drug toxicity and side effects. Many network-based approaches for drug repurposing are based on previously established information about DNA-protein, protein-protein, or drug-target interactions (see a review in Lotfi Shahreza, Ghadiri, Mousavi, Varshosaz, & Green, 2017), but many others integrate -omics data with networks or signaling pathways. A major focus of research is the integration of heterogeneous data sources for drug repurposing or drug-target prediction. For example, Luo et al.
introduced a novel network integration pipeline called DTINet (Luo et al., 2017) with applications for drug-target prediction and drug repurposing. Although in the original work they used information from drugs, proteins, diseases, and side effects, DTINet is a scalable framework where other information, such as gene-expression data or pathways, can be easily integrated. Nascimento, Prudêncio, and Costa (2016) proposed a kernel-based method to combine multiple heterogeneous sources and networks of arbitrary size to predict drug-target interactions. The algorithm automatically selects the most relevant kernels by returning weights that indicate their importance and predictive quality for a given drug-target interaction. Random walk is another algorithm widely used to extract network similarities (Chen, Liu, & Yan, 2012; Liu et al., 2016). Basically, random walk measures the similarities of two or more networks by traversing them at random, taking into account the position of each node and the length and direction of the edges that link neighboring nodes. Chemical and genomic information sources can be included here to construct drug-drug, drug-target, protein-protein, and target-target transcriptomic interaction networks.
2.2.2 Matrix Factorization Models
Matrix factorization techniques are widely used in the analysis of large datasets for dimensionality reduction and feature extraction. Matrix factorization is an unsupervised method that decomposes an original matrix $A_{n \times m}$, with n rows and m columns, as a product of two matrices of lower rank, $M_{n \times k}$ and $N_{k \times m}$. These matrices capture relevant patterns from the data: one describes the structure between row elements and the other the structure between
column elements from the original matrix. They have been widely used for gene-expression data analysis to uncover gene or sample relationships and, more recently, they have been applied to drug-repurposing analysis from transcriptomic data. Yang, Li, Fan, and Cheng (2014) proposed a causal inference-probabilistic matrix factorization approach to predict and classify drug-disease relationships. Matrix factorization was applied to extract shared and related patterns between drugs and diseases using transcriptomic data, and these patterns can be used to construct drug-disease relationship networks. More recently, Dai et al. (2015) proposed a matrix factorization model that integrates feature vectors of genes, diseases, and drugs to detect potential drug-disease associations and predict novel drug indications. Although this approach was based on drug-disease, drug-gene, and disease-gene interaction data, transcriptomics data have been used to establish such interactions.
2.2.3 Supervised Classifiers
The bipartite local model (BLM) is a supervised inference method used to predict new drug-target interactions that has the advantage that it can integrate data from different sources, such as genetic data, interactions in gene-expression data, and biochemical interactions (Bleakley & Yamanishi, 2009). This method integrates a kernel-based method with a supervised learning algorithm (Yamanishi, Araki, Gutteridge, Honda, & Kanehisa, 2008). BLM is based on measures of similarity or interaction in the form of kernels, or in other words, interaction groups, and it proceeds in two main steps. First, a training matrix is generated that contains all the known targets with drug interactions used for the disease or study condition. These interactions can be assigned to a single class or to several classes, for example, specific functional groups in which the targets act.
Interactions can be defined from different sources, such as transcriptomic or biochemical relationships, or a combination of both. Then, a new class is created with the remaining targets, which have no interaction with the drugs according to a priori knowledge. Given a training matrix, a support vector machine (SVM) is applied to learn the features that best discriminate between classes. This method is used in classification and regression problems and, more recently, for drug discovery (Heikamp & Bajorath, 2013). Given a set of training data and defined classes, a model is constructed that predicts the class of a new sample. Intuitively, SVM represents the samples as points in space (in this case, the targets of the drugs) and separates the classes by means of a hyperplane chosen to maximize the margin between them. SVM can handle vectorial as well as nonvectorial data through the so-called kernel trick. This means that instead of encoding the biological information about a drug or target as a vector, SVM only needs the definition of a positive semidefinite kernel between any two vertices derived from biological information (Mei, Kwoh, Yang, Li, & Zheng, 2013). Finally, the models generated are applied to the targets without known interaction with the drugs in order to predict new target candidates, by assigning these targets within the kernels generated in the training phase, as well as new indications for the drugs.
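The bipartite local model logic can be sketched as follows. This is a simplified illustration and not the implementation of Bleakley and Yamanishi (2009): to keep the example free of external dependencies, the SVM is replaced by a plain kernel-weighted vote, which is an assumption of ours, and the interaction matrix and the drug and target similarity kernels are hypothetical values (in practice the kernels could be derived, for instance, from correlations between gene-expression signatures).

```python
def local_score(labels, similarities):
    """Kernel-weighted vote standing in for the SVM of the original method.

    labels: +1/-1 for each training item; similarities: kernel value between
    the query item and each training item.
    """
    if not labels:
        return 0.0
    return sum(s * y for s, y in zip(similarities, labels)) / len(labels)

def blm_score(d, t, Y, K_drug, K_target):
    """Score a candidate drug-target pair (d, t) with two local models."""
    n_drugs, n_targets = len(Y), len(Y[0])

    # Drug-side model: label the other targets by whether drug d hits them,
    # then score target t through its similarity to those targets.
    other_targets = [j for j in range(n_targets) if j != t]
    drug_side = local_score(
        [1 if Y[d][j] else -1 for j in other_targets],
        [K_target[t][j] for j in other_targets],
    )

    # Target-side model: label the other drugs by whether they hit target t,
    # then score drug d through its similarity to those drugs.
    other_drugs = [i for i in range(n_drugs) if i != d]
    target_side = local_score(
        [1 if Y[i][t] else -1 for i in other_drugs],
        [K_drug[d][i] for i in other_drugs],
    )
    return max(drug_side, target_side)

# Hypothetical known interactions: rows are drugs, columns are targets.
Y = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
# Hypothetical symmetric similarity kernels.
K_drug = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
K_target = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

# Rank every pair without a known interaction.
candidates = [(i, j) for i in range(3) for j in range(3) if Y[i][j] == 0]
ranked_pairs = sorted(
    candidates,
    key=lambda pair: blm_score(pair[0], pair[1], Y, K_drug, K_target),
    reverse=True,
)
top_pair = ranked_pairs[0]
top_score = blm_score(top_pair[0], top_pair[1], Y, K_drug, K_target)
print(top_pair, round(top_score, 2))
```

Here the top-ranked candidate pairs the second drug with the second target, because that drug is very similar to a drug known to hit the target and the target is very similar to one the drug already hits. Swapping the vote for a kernel SVM trained on the same labels would bring the sketch closer to the method described above.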
3 APPLICATIONS
In the next sections we provide an overview of the main approaches, applications, and objectives that are usually sought in drug repositioning analysis based on transcriptomic data.
3.1 Drug-Disease Connections
As mentioned previously, the introduction of CMap opened new ways for drug repurposing from gene-expression data, allowing its use in many contexts. The approach requires at least a robust and representative gene-expression signature of the disease and a set of different drug signatures. In the work of Toro-Domínguez, Carmona-Sáez, and Alarcón-Riquelme (2017), a drug-repurposing analysis for systemic lupus erythematosus was performed integrating 10 different public datasets from NCBI GEO. Samples from different cell populations, array platforms, and cohorts were integrated with the aim of selecting drug candidates that were shared in all cases. Phosphoinositol 3-kinase inhibitors and mTOR inhibitors stood out as the best results, affecting a biological pathway that is potentially altered in lupus patients. Siavelis, Bourdakou, Athanasiadis, Spyrou, and Nikita (2016) analyzed drug candidates for Alzheimer’s disease. In this case, they selected five different datasets and extracted the disease gene signature jointly using three different algorithms, thus obtaining three gene signatures for the disease, one for each algorithm used to compute the differentially expressed genes. Then they queried each signature in CMap (Lamb, 2006), SPIEDw (Williams, 2013), sscMap (Zhang & Gant, 2008), and the CMap-linked user environment (CLUE) (Subramanian et al., 2017), four different tools for drug repurposing. Finally, they generated two scores for each drug: the first was extracted by a mathematical model that considered the results of the three signatures in three of the four tools, and the second was obtained from CLUE for the drugs that were significant using at least two of the signatures. They then merged the lists of significant drugs obtained using both approaches, obtaining a list of 27 candidate drugs to treat Alzheimer’s disease.
This idea of applying CMap analysis to multiple signatures and combining drug-repurposing results for the same disease was also implemented in the approach proposed by Fortney et al. (2015). Additionally, many other studies are described in two extensive reviews (Musa et al., 2017; Qu & Rajpal, 2012).
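The general idea of keeping only the drug candidates that are consistent across several disease signatures can be sketched as below. This is a generic illustration and not the procedure used in any of the cited studies: it assumes that a connectivity-like score (negative meaning reversal) has already been computed for each drug against each dataset, and the dataset names, drug names, scores, and the zero threshold are all hypothetical.

```python
from statistics import mean

def consistent_reversers(results, threshold=0.0):
    """Keep drugs scored in every dataset whose score is always below threshold.

    results: {dataset_name: {drug_name: score}}.
    Returns (drug, mean score) pairs sorted from most to least negative.
    """
    # Only drugs that were scored in every dataset can be "shared in all cases".
    shared = set.intersection(*(set(scores) for scores in results.values()))
    kept = {}
    for drug in shared:
        scores = [dataset_scores[drug] for dataset_scores in results.values()]
        if all(score < threshold for score in scores):
            kept[drug] = mean(scores)
    return sorted(kept.items(), key=lambda item: item[1])

# Hypothetical connectivity-like scores from three disease datasets.
results = {
    "dataset_1": {"drugA": -0.8, "drugB": -0.3, "drugC": 0.4},
    "dataset_2": {"drugA": -0.6, "drugB": 0.2, "drugC": 0.5},
    "dataset_3": {"drugA": -0.7, "drugB": -0.5},
}
shared_candidates = consistent_reversers(results)
print(shared_candidates)
```

Requiring agreement across cohorts, platforms, and cell types is a conservative choice: it discards drugs that reverse the signature in only some datasets, at the cost of possibly missing subgroup-specific candidates.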
3.2 Disease-Disease Similarities
Searching for relationships between diseases is an indirect approach to drug repurposing based on the assumption that, if two different diseases share gene-expression patterns, the drugs used to treat one of them can be extrapolated for use in the other. Martínez, Sorzano, Pascual-Montano, and Carazo (2017) used NFFinder (Setoain et al., 2015) to find diseases similar to gene signatures extracted from malignant nerve cells, showing that the most similar diseases were other types of cancer, among which they found solid tumors such as prostate and breast cancer, leukemia and lymphoma, and premalignant neoplasias of epithelial tissue in the endometrium and kidney; pulmonary and muscular diseases; and neuronal conditions. A related approach was implemented by Liu et al. (2014) in DiseaseConnect, a web server that integrates gene-expression data, genotypes, and literature data for the analysis of common molecular mechanisms shared by diseases. It also incorporates known drug-disease relationships, information that can be used for new therapeutic strategies and drug-repurposing analysis.
In addition to disease signatures, a similar approach can be applied to mine gene signatures from biological pathways and associate them with specific diseases. For example, if a drug acts on a certain pathway, other similar drugs that alter the same biological pathway can be searched for. This concept of searching for drugs that are associated with biological pathways is implemented in the Cogena web tool (see the Tools and Databases section).
3.3 Drug-Drug Connections
Transcriptome data can also be analyzed to discover correlations between different drugs based on similarities in their gene-expression signatures. This can be useful in several contexts. For example, signatures of unknown drugs can be compared with signatures of known drugs under the hypothesis that two drugs with similar or positively correlated gene-expression signatures might act in a similar way. That is, the mechanism of action (MOA) of the unknown drug can be predicted based on the MOA of known drugs. Similarly, the search for drugs with similar patterns can be used to assign new indications to the drug studied, based on the known indications of drugs that are positively correlated. Several works have used similarity-based methods to compare drug signatures and establish connections among MOA (Gheeya et al., 2010; Keenan et al., 2018; Lamb, 2006). In a recent work, Sirci et al. (2017) found 258 compounds with similar gene-expression signatures associated with processes of regulation of autophagy. The authors performed an analysis in which they integrated gene signatures and structural data with the aim of testing whether drugs with different structures but similar gene-expression signatures shared the same MOA. The integration of these different sources of information allowed the authors to remove weak and noisy transcriptional responses in the analysis of structurally similar drugs sharing the same MOA. In an interesting work, Woo et al. (2015) proposed a network-based approach. While similarity-based methods focus on comparing gene-expression profiles, network-based methods perform integrative analyses over interacting pathways and are more suited to de novo MOA elucidation.
They introduce Detecting Mechanism of Action by Network Dysregulation (DeMAND), an approach that elucidates compound MOA by interrogating tissue-specific regulatory networks using small-size, gene-expression profiles from compound perturbations. Another utility for drug-drug connections is to search for drugs with a reverse gene-expression signature, or that are negatively correlated, as a way to establish contraindications.
3.4 Drug-Target Connections
Drug-target connection searches are widely used for two main objectives:
(1) Assigning targets to unknown drugs based on described drug-target relations. This approach is conceptually similar to the drug-drug connection methods, again assuming the hypothesis that two drugs with very similar gene signatures can act on the same target or on closely related targets.
(2) Discovering new, unknown drug-target relations or off-target processes.
Earlier we mentioned some methods that are used to predict drug-target associations, such as BLM. The CLUE web tool has a section including gene signatures caused by knock-in and knock-out single-gene experiments, that is, gene-expression signatures derived from the change of expression of a single gene. We can treat these single genes as gene targets and search for correlations between gene-target signatures and drug signatures; gene targets whose signature is strongly positively correlated with that of a drug can then be proposed as potential targets of that drug.
3.5 Drug-Drug Combinations Rather than using a single compound, combining different drugs can provide greater benefit when treating some conditions or diseases by exploiting synergistic effects between the drugs. Several studies provide evidence that combined treatments show higher efficacy, fewer side effects, and less toxicity than single-drug treatments (Gupta & Ito, 2002; Kim et al., 2017; Sun, Vilar, & Tatonetti, 2013). Accordingly, a variation of the drug-repurposing techniques described in this chapter focuses on prioritizing drug pairs in order to find the combination that best reverses the disease signature (Sun, Sanderson, & Zheng, 2016). Several studies have used data from CMap to establish drug combinations for many diseases (reviewed in Musa et al., 2017). One approach is to extract the biological pathways affected in the disease from its gene signature and then look for drugs that act on specific pathways. In this way, the selected combination is the pair of drugs that best reverses two of the main pathogenic pathways. Another approach is taken in the studies by Jin and Wong (2014) and Zhong et al. (2013), who generated a score similar to that obtained using GSEA but computed for pairs of drugs, and selected the combination with the most negative score, that is, the combination that best reverses the disease signature. Jin and Wong identified the combination of Trolox C and Cytisine for the treatment of type-2 diabetes, while Zhong et al. suggested that the combination of an ACE inhibitor and a histone deacetylase inhibitor could have therapeutic potential for various kidney diseases.
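The idea of prioritizing drug pairs by how well they reverse a disease signature can be sketched as follows. This is a deliberately simplified illustration and not the scoring scheme of Jin and Wong (2014) or Zhong et al. (2013): it assumes that the effects of two drugs on each gene are additive, and it replaces the GSEA-like score with a plain difference of means. All gene names, drug names, and values are hypothetical.

```python
from itertools import combinations

def reversal_score(effect, up_genes, down_genes):
    """Mean effect on disease-up genes minus mean effect on disease-down genes.
    Negative values mean the profile opposes the disease signature."""
    up = sum(effect.get(g, 0.0) for g in up_genes) / len(up_genes)
    down = sum(effect.get(g, 0.0) for g in down_genes) / len(down_genes)
    return up - down

def combine(effect_a, effect_b):
    """Combined profile of two drugs, ASSUMING additive per-gene effects."""
    genes = set(effect_a) | set(effect_b)
    return {g: effect_a.get(g, 0.0) + effect_b.get(g, 0.0) for g in genes}

def rank_pairs(drugs, up_genes, down_genes):
    """Score every drug pair and sort from most to least negative score."""
    scored = []
    for a, b in combinations(sorted(drugs), 2):
        score = reversal_score(combine(drugs[a], drugs[b]), up_genes, down_genes)
        scored.append(((a, b), score))
    return sorted(scored, key=lambda item: item[1])

# Toy disease signature and drug-induced log fold changes (hypothetical).
disease_up = ["G1", "G2"]
disease_down = ["G3", "G4"]
drugs = {
    "d1": {"G1": -2.0},               # lowers a disease-up gene
    "d2": {"G3": 2.0, "G4": 1.5},     # raises the disease-down genes
    "d3": {"G1": 1.0, "G2": 1.0, "G3": -1.0},  # mimics the disease
}

ranking = rank_pairs(drugs, disease_up, disease_down)
best_pair, best_score = ranking[0]
```

In this toy example the best-scoring pair combines two drugs that each reverse a different part of the signature, which mirrors the pathway-complementarity argument made above.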
4 DATABASES AND TOOLS In the last few years, several tools have been developed to facilitate drug-repurposing analysis. Sam and Athri (2017) published a review of web-based tools that can be used for many applications in drug-repurposing analysis, which can be categorized into three main groups: tools for predicting drug-target interactions, tools for linking drugs to diseases, and tools that use drug-induced gene expression to predict new connections. Here we focus on databases and tools specifically designed to exploit gene-expression data, providing a deeper analysis of these applications. The databases are summarized in Table 1 and the tools in Table 2.
TABLE 1 Available Drug Databases

| Database | Data Type | Storage | Description | References |
| --- | --- | --- | --- | --- |
| NCBI GEO | Gene expression data | 94,000 datasets/2,500,000 samples | NCBI GEO is a database repository of high-throughput gene-expression data, hybridization arrays, chips, and microarrays. | Barrett et al. (2013) |
| ArrayExpress | Gene expression data | 70,000 datasets/2,200,000 samples | ArrayExpress stores data from high-throughput functional genomics experiments. | Brazma et al. (2003) |
| LINCS | Drug-induced gene expression data | 28,000 drugs/500,000 induced signatures | The LINCS program aims to create a network-based understanding of biology by cataloging changes in gene expression and other cellular processes that occur when cells are exposed to a variety of perturbing agents. | Keenan et al. (2018) |
| DSigDB | Drug-induced gene expression data | Around 22,500 gene sets from 17,000 drugs | DSigDB is a resource that relates drugs/compounds and their target genes. | Yoo et al. (2015) |
| PharmGKB | Drug-target annotation | 641 drugs and 130 curated pathways | PharmGKB is a database that contains information on gene variants and how each one affects drug metabolism. | Barbarino et al. (2018) |
| DrugMatrix | Drug-induced gene-expression data/drug structures/target annotations | 637 drugs | DrugMatrix is a database encompassed in the National Toxicology Program that contains gene profiles obtained by microarray experiments. | Ganter et al. (2005) |
| CTD | Drug-target annotation/disease-target annotations | 1,700,000 drug-target interactions/165,000 phenotype-based interactions/23,000,000 gene-disease interactions/2,500,000 drug-disease connections | Database to query a gene signature of interest and search correlations or similarity measures with signatures from diseases, drugs, or pathways. | Davis et al. (2017) |
| CDA | Drug-target annotation/drug-induced gene-expression data | 1309 drug profiles from CMap | Database and tool for multisignaling-pathway-targeting combinatorial drug discovery based on transcriptomic data. | Lee et al. (2012) |

The table summarizes the databases currently maintained and contains information about the type and amount of data contained in each database.
TABLE 2 Summary of Drug Repurposing Tools

| Tool | Type of Analysis | Input | Type of Tool | Approach | Reference |
| --- | --- | --- | --- | --- | --- |
| CMap | Drug-disease, drug-drug | Disease/drug signature | Web tool/database | Similarity | Lamb (2006) |
| Clue | Drug-disease, drug-drug, drug-target, disease-target | Disease/drug signature | Web tool/database | Similarity | Subramanian et al. (2017) |
| Mantra | Drug-drug | Microarray/drug signature | Web tool | Network | Carrella et al. (2014) |
| MarQ | Drug-drug, drug-disease, disease-disease | Microarray from GEO | Web tool | Similarity | Vazquez et al. (2010) |
| NFFinder | Drug-drug, drug-disease, disease-disease | Any signature | Web tool | Similarity | Setoain et al. (2015) |
| Cogena | Disease-disease, disease-drug, drug-drug | Microarrays | R package | Co-expressed genes similarity | Jia et al. (2016) |
| ksRepo | Disease-disease, disease-drug, drug-drug | Microarrays | R package | Individual datasets integration; similarity | Brown et al. (2016) |
| GOpredict | Drug-drug | Activity matrix | Web tool/database | Network | Louhimo et al. (2016) |
| Integrity | Drug-target-disease | Disease/drug signature | Web tool/database | Network | Emig et al. (2013) |
| Gene2Drug | Disease-disease, disease-drug, drug-drug | Disease/drug signature | Web tool/database | Similarity | Napolitano et al. (2018) |
| GeneExpressionSignature | Any signature connection | Microarrays/gene signatures | R package | Similarity | Li et al. (2013) |
| DvD | Disease-disease, drug-disease, drug-drug | Microarray from GEO/ArrayExpress | R package | Similarity | Pacini et al. (2012) |
| DeSigN | Drug-disease | Disease signature | Web tool | Similarity | Lee et al. (2017) |
| PDOD | Drug-target-disease | KEGG annotation/disease signature | Web tool | Network | Yu et al. (2016) |

The table summarizes some characteristics of the available tools for drug-repurposing analysis: the type of analysis or application supported by each tool, its input, the type of tool (web tool, desktop tool, database, or R package), and the type of approach applied internally.
4.1 NCBI GEO The National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) is a public repository hosted by the NCBI that contains more than 94,000 datasets and over 2 million samples. It is an invaluable resource that provides users with access to most of the published gene-expression studies (Barrett & Edgar, 2006). For drug-repurposing analysis, the datasets can be used to extract gene signatures of a condition of interest or to extract signatures to use as a reference (for example, the data of the NIH LINCS Consortium are stored in GEO under the identifier GSE92742).
4.2 ArrayExpress This is a resource from the European Bioinformatics Institute that, together with NCBI GEO, is one of the repositories recommended by major scientific journals to archive functional genomics data from microarray and sequencing platforms in order to support reproducible research. This database contains around 70,000 datasets and over 2 million samples, and the number of datasets grows daily (Brazma et al., 2003; Rustici et al., 2013). Most of the datasets are also available in NCBI GEO; however, ArrayExpress applies some additional requirements when data are published, with the aim of improving data organization and quality. Submitted datasets must comply with the minimum information about a microarray experiment (MIAME) and the minimum information about a sequencing experiment (MINSEQE) guidelines.
4.3 LINCS As described above regarding the CLUE web tool, the L1000 database is part of the Library of Integrated Network-based Cellular Signatures (LINCS) program. It stores 1,319,138 gene-expression profiles derived from 42,080 genetic and drug perturbations in more than 15 different human tissues, covering different drug doses and time points and generated from a total of 5178 different drugs.
4.4 DSigDB The drug signatures database (DSigDB) is a database of drug gene-expression signatures (Yoo et al., 2015). It contains more than 22,000 gene sets from different drugs, and these gene sets or signatures can be used for drug-repurposing analysis with some of the approaches described in the Methods section, such as those based on signature similarity. DSigDB also contains additional information about each drug, such as molecular structure, targets, or active sites.
4.5 PharmGKB The Pharmacogenomics Knowledge Base (PharmGKB) (Barbarino, Whirl-Carrillo, Altman, & Klein, 2018) is a database that integrates curated annotations about genotype, molecular data and clinical knowledge together with pathway information. It currently
contains information about 100 clinical dosing guidelines, 498 drug labels, 3753 clinical annotations, 130 pathways, 65 very important pharmacogenes (VIPs), which are genes involved in the metabolism of, or response to, one or several drugs, and more than 20,000 variant annotations, which are curated associations between a variant (e.g., a SNP or an indel) and a drug-related phenotype. It is therefore a very important resource for drug discovery that can help to interpret results from high-throughput analyses and support research in precision medicine, as it contains information about the interaction between genotype and drug response.
4.6 DrugMatrix DrugMatrix is a database encompassed in the National Toxicology Program that contains gene profiles obtained with microarray experiments derived from 637 different compounds tested in seven tissues for a total of more than 4100 drug-dose-time-tissue combinations (Ganter et al., 2005).
4.7 CTD The Comparative Toxicogenomics Database (CTD) (Davis et al., 2017; Ganter et al., 2005) is a web tool and database storing around 1,700,000 chemical-gene interactions, 165,000 phenotype-based interactions, 23,000,000 gene-disease interactions, and around 2,500,000 chemical-disease connections. The tool allows researchers to query a gene signature of interest and search correlations or similarity measures with signatures from diseases, drugs, or pathways.
4.8 CDA The Combinatorial Drug Assembler (CDA) is a tool for combinatorial drug discovery that targets multiple signaling pathways, based on transcriptomic data. The authors hypothesize that, because complex diseases affect the transcriptome in multiple dimensions, for example, through dysregulation of several functional pathways, balanced multicomponent therapies might be better than a single therapeutic component (Lee et al., 2012). As reference signatures, the tool uses the gene-expression data of 1309 different molecules from the second release of CMap. It accepts a query signature and performs GSEA comparing the query with all drug signatures to obtain a similarity score, in a similar way to CMap, but it also lists the best pattern matches for single drugs and combinatorial drug pairs across the signaling pathways related to the input gene set, producing a final sorted list of candidate drug-combination therapies.
4.9 CMap As we have mentioned previously, the CMap was a pioneer application introduced in 2006 by Lamb (2006) that has had a deep impact on the field of drug discovery and has opened up new lines of scientific research in the fields of drug repurposing, lead discovery, and characterization of mechanisms of action (Lamb, 2006; Qu & Rajpal, 2012).
This database was originally released with the gene-expression profiles of 164 drugs and was later expanded to 1309 FDA-approved small molecules. Gene-expression profiles were generated for each molecule by comparing treated and untreated samples at different doses in a panel of five human cell lines. A great breakthrough was the provision of the database with a web-based application that was able to query the gene-expression profiles and establish connections between a query signature and the drug signatures. This web tool adopted the GSEA methodology to calculate a similarity score between the query signature and each of the gene-expression signatures from the database. The query signature is compared to each of the database signatures to determine whether they share upregulated and downregulated genes (“positive connectivity”) or whether genes upregulated in one signature are downregulated in the other and vice versa (“negative connectivity”). Signatures from the database that are strongly anticorrelated with the query are expected to revert the query phenotype, whereas strongly correlated signatures can be used to search for phenotypes that induce similar expression programs. Several versions of CMap have been published in order to reduce noise and strengthen the reliability of the methods. In this context, an interesting approach is to combine signatures for the same disease obtained from different studies in order to get more robust and less noisy patterns (Fortney et al., 2015; Toro-Domı´nguez et al., 2017).
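The positive and negative connectivity described above can be illustrated with a Kolmogorov-Smirnov-like enrichment statistic in the spirit of the original CMap publication (Lamb, 2006). The sketch below is a simplified illustration under stated assumptions: the reference profile is a list of genes ordered from most upregulated to most downregulated by the drug, the gene names are invented, and the normalization of scores across reference instances is omitted.

```python
def ks_score(tag_genes, ranked_genes):
    """KS-like enrichment of a gene set within a ranked reference profile.
    Positive: the set is concentrated near the top (upregulated by the drug);
    negative: it is concentrated near the bottom."""
    n = len(ranked_genes)
    position = {gene: i + 1 for i, gene in enumerate(ranked_genes)}
    v = sorted(position[g] for g in tag_genes if g in position)
    t = len(v)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v[j] / n for j in range(t))
    b = max(v[j] / n - j / t for j in range(t))
    return a if a > b else -b

def connectivity(query_up, query_down, ranked_genes):
    """Raw connectivity score: zero when the up and down sets are enriched
    in the same direction, otherwise ks_up - ks_down."""
    ks_up = ks_score(query_up, ranked_genes)
    ks_down = ks_score(query_down, ranked_genes)
    if ks_up * ks_down > 0:
        return 0.0
    return ks_up - ks_down

# Hypothetical drug profile: g1 is the most upregulated gene, g10 the most downregulated.
drug_profile = ["g%d" % i for i in range(1, 11)]

similar = connectivity(["g1", "g2"], ["g9", "g10"], drug_profile)   # positive connectivity
reverse = connectivity(["g9", "g10"], ["g1", "g2"], drug_profile)   # negative connectivity
```

A query whose upregulated genes sit at the top of the drug profile and whose downregulated genes sit at the bottom yields a positive score; swapping the two sets yields the same magnitude with a negative sign, which is the pattern sought when looking for drugs that may revert a disease signature.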
4.10 Clue The LINCS program is an NIH-funded initiative that “aims to create a network-based understanding of human biology by cataloging changes in gene and protein expression, signaling processes, cell morphology, and epigenetic states, which occur when cells are exposed to a variety of perturbing agents” (Keenan et al., 2018). As part of this consortium, Subramanian et al. (2017) published a paper describing the L1000 database and the associated exploratory tool Clue (https://clue.io). This is a modification of the CMap concept in which the authors used a low-cost gene-expression profiling technology (termed L1000). The underlying idea is to capture most of the information about any cellular state by measuring only a subset of the transcriptome. The analysis of 2031 previously published Affymetrix gene-expression profiles allowed the authors to establish a set of 1000 landmark transcripts and release a database of 1,319,138 L1000 profiles from 42,080 genetic and small-molecule perturbations, that is, including chemical compounds as well as expression profiles from knock-out or knock-in experiments. Together with this database, they released CLUE, a cloud-based software platform that allows users to query the database for a range of applications, such as searching for the off-target effects of a compound, repurposing drugs for a given disease, predicting the mechanism of action of a drug based on its similarity with other drugs, or identifying the functional pathways altered by a compound, among others.
4.11 MANTRA MANTRA (Mode of Action by NeTwoRk Analysis) is a computational tool for the analysis of the MOA of novel drugs and the identification of known and approved candidates for
“drug repositioning” from gene expression data (Carrella et al., 2014). This web tool integrates similarity and network drug-repurposing approaches. It builds on previous work by Iorio et al. (2010), in which a ranked list of differentially expressed genes is generated for each drug from CMap data collected across multiple cell lines and doses, summarizing in this way the transcriptional response or transcriptional signature of the drug. These signatures were used to calculate the distance between pairs of compounds, using the top 250 over- and underregulated genes. GSEA is applied to compare the signature of each compound against the complete signatures of all remaining compounds, and a similarity score is obtained from the pairwise comparisons. Finally, a network was constructed in which nodes represent compounds and edges represent similarities between drugs, and the affinity propagation algorithm was used to cluster the network and extract groups of highly similar compounds (Iorio et al., 2010). The authors hypothesize that compounds belonging to the same cluster have a similar MOA based on a shared transcriptional signature, without initially considering other characteristics such as compound structure or specific targets. MANTRA allows users to upload new microarray gene-expression data from compound-treated vs. control samples, extract the gene-expression signature of the compound, and predict its MOA.
4.12 MarQ MarQ (Microarray Rank Query) was one of the first applications that systematically processed all gene-expression datasets available in the NCBI GEO database in order to search for similar (or opposite) signatures given a query signature (Vazquez et al., 2010). The tool extended the concept of CMap to gene-expression signatures from public repositories. The authors derived gene-expression signatures from GEO datasets by performing all pairwise comparisons across the experimental conditions in each dataset. A query signature is submitted to the database, and the GSEA algorithm is used to calculate a similarity score and to retrieve and sort gene signatures. The application also incorporates text mining and enrichment methods that allow users to explore keywords that are enriched in the list of most relevant signatures. It is very useful for similarity analysis between disease-disease, drug-drug, and disease-drug signatures.
4.13 NFFinder NFFinder is a web tool for identifying potentially useful drugs based on transcriptomic relationships between drugs, diseases, and phenotypes (Setoain et al., 2015). Internally, NFFinder is based on the methodology implemented in MarQ, but it is focused on drug-repurposing analysis and integrates data from NCBI GEO, CMap, and DrugMatrix (Ganter et al., 2005). It also contains annotations for signatures with terms related to drugs, diseases, and expert scientists. As a query, NFFinder uses two lists of over- and underexpressed genes (matched to Gene Symbol identifiers) obtained by differential expression analysis between two conditions, and it also allows users to include in the query a list of microRNAs expressed in the case under study.
4.14 Cogena Cogena (Jia et al., 2016), an acronym of “co-expressed gene-set enrichment analysis,” starts from the premise that analysis of the complete transcriptome can mask significant patterns, because it is difficult to discern cause and effect in a phenotype among thousands of differentially expressed genes, and extends this idea to drug-repurposing analysis from transcriptome data. The analysis aims to discover more local but highly correlated patterns. The drug-repurposing approach starts with a clustering algorithm that groups the differentially expressed genes of a disease condition. Then, enrichment analysis of pathways and drugs is performed for each group of genes. KEGG annotations are used for pathway analysis, and drug-related gene sets are defined from the CMap. Finally, the known or inferred drug MOA is obtained from the pathway analysis of the same cluster.
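The per-cluster enrichment step can be sketched with a one-sided hypergeometric test. This is a minimal illustration that assumes the clusters of co-expressed genes have already been obtained; the cluster contents, the drug gene set, and the 20-gene universe are hypothetical, and the function names are not part of the Cogena package.

```python
from math import comb

def hypergeom_pvalue(overlap, cluster_size, set_size, universe_size):
    """P(X >= overlap) when cluster_size genes are drawn without replacement
    from a universe that contains set_size annotated genes."""
    upper = min(cluster_size, set_size)
    total = comb(universe_size, cluster_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def enrich_clusters(clusters, gene_sets, universe):
    """Test every (cluster, gene set) pair; all sets are assumed to be
    subsets of the universe."""
    results = {}
    for cluster_name, cluster_genes in clusters.items():
        for set_name, set_genes in gene_sets.items():
            overlap = len(cluster_genes & set_genes)
            results[(cluster_name, set_name)] = hypergeom_pvalue(
                overlap, len(cluster_genes), len(set_genes), len(universe)
            )
    return results

# Hypothetical universe of 20 genes, two co-expression clusters, one drug gene set.
universe = {"g%d" % i for i in range(1, 21)}
clusters = {
    "cluster_1": {"g1", "g2", "g3", "g4", "g5"},
    "cluster_2": {"g6", "g7", "g8", "g9", "g10"},
}
gene_sets = {"drug_X_down": {"g1", "g2", "g3", "g11"}}

pvalues = enrich_clusters(clusters, gene_sets, universe)
```

Here the drug gene set is enriched only in the first cluster, so the drug would be linked to the pathways enriched in that same cluster rather than to the disease signature as a whole.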
4.15 ksRepo This is an R package for gene expression-based drug repurposing that uses the Kolmogorov-Smirnov (KS) statistic (Brown, Kong, Kohane, & Patel, 2016). ksRepo works internally with the Comparative Toxicogenomics Database (CTD) (Davis et al., 2017) as the reference for drug gene-expression signatures and with NCBI GEO for the disease or drug data to test. A main advantage of ksRepo is that several different datasets can be used together as a query in an integrative way, which allows users to take a meta-analysis-like approach for more robust and meaningful results. Briefly, ksRepo extracts the gene-expression signature from each dataset and then performs KS enrichment analysis between each independent signature and the drug signatures to obtain a ranked list of drugs. Bootstrapping is applied to compute P-values, which are adjusted with the false discovery rate approach to deal with multiple testing. Finally, a common significance list can be obtained by combining P-values across results.
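The last two steps, multiple-testing adjustment and combination of P-values across datasets, can be sketched as follows. The Benjamini-Hochberg procedure is used here as the false discovery rate adjustment, and Fisher's method is used to combine P-values; the latter is an assumption made for illustration, since the text does not specify how ksRepo combines them. The P-values are invented and must be strictly greater than zero.

```python
from math import exp, log

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom, whose upper tail has a closed form."""
    k = len(pvalues)
    half_stat = -sum(log(p) for p in pvalues)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total

# Hypothetical bootstrap P-values for four drugs in one dataset.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])

# Hypothetical P-values for the same drug obtained from two independent datasets.
combined = fisher_combine([0.05, 0.05])
```

With two datasets each giving P = 0.05 for the same drug, the combined P-value is smaller than either individual value, which is the kind of gain in robustness expected from a meta-analysis-like query.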
4.16 GOpredict GOpredict (Louhimo et al., 2016) is oriented to precision medicine, allowing researchers to stratify cancer patients based on drug responses. It integrates information from several databases, including KEGG (Kanehisa, Furumichi, Tanabe, Sato, & Morishima, 2017), TCGA (Cancer Genome Atlas Research Network et al., 2013), DrugBank (Wishart & Wu, 2016), and Gene Ontology (Ashburner et al., 2000). The input of the tool is a gene-by-sample activity matrix whose entries are 0 if the gene is inactive or 1 if it is active. This matrix is constructed from gene-expression, gene copy-number, or mutation data, or a combination of all three. The matrix is used for sample stratification, and drug prioritization is carried out for each sample group taking into account the number of targets of a drug, the relevance or rank of the drug targets in the database, and the activity of the gene target in the groups. The final result is a list of drugs for each sample group, that is, a sample stratification based on the best drug candidates to treat each group.
4.17 Integrity Integrity is a tool that implements an integrated network-based approach for drug-repurposing analysis and drug-target prediction (Emig et al., 2013). The authors hypothesize that drug targets are located in close overall proximity to the differentially expressed genes, and drug-target prediction for a disease is performed by applying network-based prioritization methods using expression signature genes as an input. They used MetaBase (Bureeva, Zvereva, Romanov, & Serebryiskaya, 2009) as a resource to build the network. In this framework, all molecules are represented as network objects. The physical interactions between them (protein-protein, transcription factor-target gene, or RNA-metabolite interactions) are summarized as interactions between pairs of network objects, together with information about their directionality and effects, such as activation or inhibition. The disease gene signature is mapped onto the molecular network by applying a number of local and global network-based prioritization methods (random walk, interconnectivity, neighborhood scoring, and network propagation). The predictions from these methods are combined using a logistic regression model, resulting in a set of prioritized gene targets for the disease that overlap with drug targets. The tool thus assigns a list of drugs to treat specific targets of the disease. To support Integrity, the authors developed a curated database of drugs with information about their respective targets and the diseases associated with their use, cataloging them as “validated,” “candidate,” or “exploratory” drugs. In this way, if the drug target is already used for a different indication, it can be readily evaluated as a candidate for the disease of interest.
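One of the prioritization methods named above, the random walk, can be sketched as a random walk with restart that starts from the signature genes and scores every other node by its steady-state visiting probability. This is a generic illustration and not the implementation used by Integrity: the network is assumed to be undirected and unweighted, every node is assumed to have at least one neighbor, and the node names, restart probability, and tolerance are hypothetical choices.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence,
    where W spreads the probability of a node evenly over its neighbors."""
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                continue
            share = (1.0 - restart) * p[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Hypothetical network: S1 and S2 are signature genes, T1 and T2 are candidate targets.
adjacency = {
    "S1": ["T1"],
    "S2": ["T1"],
    "T1": ["S1", "S2", "X"],
    "X": ["T1", "T2"],
    "T2": ["X"],
}
scores = random_walk_with_restart(adjacency, {"S1", "S2"})
ranked_targets = sorted(["T1", "T2"], key=lambda t: scores[t], reverse=True)
```

The candidate target that lies closer to the signature genes receives the higher score, which is the proximity hypothesis stated by the authors expressed in its simplest form.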
4.18 Gene2Drug Gene2Drug (Napolitano et al., 2018) is a tool that integrates drug gene-expression signatures and information from pathway databases. It can be used to prioritize drugs by assessing their impact on the pathways that involve the target gene. The method relies on a drug set enrichment analysis (DSEA) method previously developed by the same authors (Napolitano, Sirci, Carrella, & di Bernardo, 2016). It uses annotated pathways and gene-expression data from CMap to create a pathway-drug matrix. Briefly, control-treatment fold-change values from CMap are computed and converted into ranks. Replicated compounds are merged across different experimental conditions, producing a matrix with unique drugs in columns and ranked genes in rows. This matrix is then transformed into a drug-biological pathway matrix by ranking Gene Ontology and KEGG pathways according to how much the expression of the genes annotated to each pathway changes after drug treatment. The ranking phase is performed column-wise in Gene2Drug but row-wise in DSEA, which implies that the two methods have different applications: DSEA uses a set of drugs as its input and predicts a common MOA, whereas Gene2Drug uses a set of pathways as its input and predicts drugs that are able to target them. Gene2Drug applies GSEA to compute an enrichment score for each drug, but for a set of pathways rather than a set of genes, and outputs the list of top-ranked drugs. It also accepts single genes as input and generates the subset of pathways in which the gene is included.
4.19 GeneExpressionSignature GeneExpressionSignature is an R package that allows users to compare gene-expression signatures in order to establish functional connections (Li et al., 2013). As input it uses gene-expression matrices or lists of genes with fold changes, and it combines these into a single ranked gene list using one of two approaches: an equally weighted method, where all ranked lists with the same biological state are treated with equal importance, or an adaptively weighted method, where the lists are weighted differently. The tool implements two methods to measure the similarity between the query signature and the reference signatures, GSEA and PGSEA, described above in the Methods section of this chapter.
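The difference between equal and unequal weighting when merging ranked lists can be illustrated with a simple weighted mean of positions. This is a stand-in chosen for brevity and is not the aggregation algorithm implemented in the R package; the function name, the gene names, and the weights are hypothetical, and every list is assumed to contain the same genes.

```python
def merge_ranked_lists(rank_lists, weights=None):
    """Merge ranked gene lists by the weighted mean of each gene's position.
    With weights=None every list counts equally."""
    if weights is None:
        weights = [1.0] * len(rank_lists)
    total_weight = sum(weights)
    positions = [{gene: i for i, gene in enumerate(lst)} for lst in rank_lists]
    mean_position = {
        gene: sum(w * pos[gene] for w, pos in zip(weights, positions)) / total_weight
        for gene in rank_lists[0]
    }
    # Ties are broken alphabetically so that the result is deterministic.
    return sorted(rank_lists[0], key=lambda g: (mean_position[g], g))

# Three hypothetical ranked lists for the same biological state.
lists = [
    ["A", "B", "C", "D"],
    ["B", "A", "C", "D"],
    ["A", "C", "B", "D"],
]

equal = merge_ranked_lists(lists)
weighted = merge_ranked_lists(lists, weights=[1.0, 5.0, 1.0])
```

Giving the second list five times the weight of the others moves gene B ahead of gene A in the merged ranking, which shows how a weighting scheme can change the final signature.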
4.20 DvD Drug versus Disease (DvD) is an R package and Cytoscape plug-in that implements a guided pipeline to perform drug-repurposing analysis from public gene-expression repositories (Pacini et al., 2012). It uses as input gene-expression datasets from NCBI GEO (Barrett & Edgar, 2006) or ArrayExpress (Brazma et al., 2003). DvD performs a differential expression analysis between two selected conditions (e.g., healthy and case samples, or treated and nontreated samples) and defines a gene-expression signature that is used as the query. GSEA is then applied to compare the query signature with drug signatures from the CMap database, and the output contains drugs ranked by similarity score, clusters based on similarity, and connection network plots.
4.21 DeSigN This is a web tool conceptually similar to CMap for predicting drug efficacy in cancer (Lee et al., 2017). The tool stores as reference a set of gene signatures derived from 140 compounds used in cancer, and it takes a gene signature as input. The application calculates a score based on the KS statistic that reflects the similarity between the input signature and each of the reference drug signatures. As in CMap, a negative score is obtained for inverse signature profiles. It was initially designed to look for drugs that best revert specific signatures of cancer cell lines.
4.22 PDOD Prediction of Drugs having Opposite effects on Disease genes (PDOD) (Yu et al., 2016) is a tool that identifies drugs having opposite effects on the expression alterations of disease-related genes. Briefly, PDOD first constructs a network of gene-gene interactions from the KEGG database, including information about the relationship between nodes (inhibition or activation). Known gene targets of drugs are then marked in the network as start points, and disease-associated genes are labeled in the same way. The algorithm generates the possible connections between drug targets and disease-altered genes and, finally, a score is calculated based on the distance of each path.
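The idea of propagating a drug effect through a signed network can be sketched as follows. This is a loose illustration and not the PDOD scoring function: it follows only one shortest path per disease gene (the first one found by breadth-first search), multiplies the edge signs along that path, and weights each gene by 1/(distance + 1). The network, the drug action, and the disease changes are hypothetical.

```python
from collections import deque

def signed_shortest_path(edges, source, target):
    """Breadth-first search over directed signed edges (+1 activation,
    -1 inhibition). Returns (length, sign product) of the first shortest
    path found, or None if the target cannot be reached."""
    graph = {}
    for u, v, sign in edges:
        graph.setdefault(u, []).append((v, sign))
    queue = deque([(source, 0, 1)])
    seen = {source}
    while queue:
        node, dist, sign = queue.popleft()
        if node == target:
            return dist, sign
        for nxt, edge_sign in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1, sign * edge_sign))
    return None

def opposition_score(edges, drug_target, drug_effect, disease_changes):
    """Add 1/(distance + 1) when the predicted drug effect opposes the
    disease change of a gene, subtract it when it goes in the same direction."""
    score = 0.0
    for gene, change in disease_changes.items():
        found = signed_shortest_path(edges, drug_target, gene)
        if found is None:
            continue
        dist, sign = found
        predicted = drug_effect * sign
        weight = 1.0 / (dist + 1)
        score += weight if predicted == -change else -weight
    return score

# Hypothetical network: T activates A and C, and A inhibits B.
edges = [("T", "A", 1), ("A", "B", -1), ("T", "C", 1)]
# The drug inhibits T (-1); in the disease A is up (+1), B and C are down (-1).
score = opposition_score(edges, "T", -1, {"A": 1, "B": -1, "C": -1})
```

In this toy case the drug is predicted to counteract the changes in A and B but to aggravate the change in C, so the contributions partly cancel and the final score is modest.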
5 CONCLUSIONS In-silico drug-repurposing analysis provides a series of advantages over traditional methods, such as shorter development times and lower costs. Since the publication of CMap, applications for this type of analysis have grown exponentially in several contexts and diseases. In this chapter we have reviewed several cases where different methods of drug repurposing have been used satisfactorily, obtaining drug candidates that act on pathogenic pathways or confirming efficacy through in-vitro analysis. Transcriptomic data have been successfully used in many applications for drug repurposing, identifying MOA, discovering new therapeutic targets, or more generally achieving a better characterization of the molecular mechanisms underlying cellular perturbations or disease phenotypes. The scientific community is making important efforts to develop new bioinformatics pipelines and methods to analyze large collections of gene-expression data in a more comprehensive way. The CMap (Lamb, 2006) was the pioneer application and database in this field and, although a large number of studies analyze the gene-expression program induced by small molecules, the CMap concept has guided many drug-repurposing applications and the development of bioinformatics methods and tools. As we have discussed in this chapter, many versions of algorithms have been proposed to analyze drug and disease gene-expression signatures, but most of them can be classified into similarity-based or machine-learning approaches. In the latter, we have included network-based methods, which are gaining increasing interest, in part because of their potential to decipher mechanistic regulatory networks and actionable models. The LINCS initiative, as an evolution of CMap, has provided the scientific community with an unprecedented amount of data, which is attracting the interest of many researchers working on drug discovery.
Although most of the works based on transcriptomic data remain in the field of scientific research, data from the Quantitative Structure Transcriptional Activity Project (QSTAR) have shown that the analysis of gene-expression data is able to detect biologically relevant signals that help prioritize compounds and that it is a very useful approach in the early stages of the drug-development process, especially for the detection of off-target effects (Verbist et al., 2015). Subramanian et al. (2017) also showed that the L1000 data could recover up to 63% of known small-molecule MOA; the unsuccessful cases were attributed to ambiguities in the annotation of compounds, limitations of the L1000 platform, or the inability of transcriptional profiling to recover certain types of connections, among others. Despite several successful use cases, there are still many important challenges in the field. First, the analysis of gene-expression data allows us to interrogate one layer of the molecular mechanisms, but it cannot be used to detect changes in metabolites or protein levels, protein modifications, etc., and information about drugs that induce perturbations at these levels must be taken into account to get a complete overview of the mechanism of the drugs. Second, although some small molecules may yield universal signatures across cell types, many others yield highly cell-type-selective gene-expression signatures (Subramanian et al., 2017). Most of the available resources, including CMap and LINCS, provide gene-expression signatures from cancer cells and, although they can be used in other contexts,
special attention should be paid to this area. One potential way to deal with this issue is to identify non-cell-specific signatures by integrating results from several studies in a meta-analysis-like approach (Brown et al., 2016; Toro-Domínguez et al., 2017). The lack of a structured gold standard for drug repositioning is another open problem in the field, and it has made it hard to compare and evaluate the performance of computational methods (Li et al., 2016). Despite these limitations, computational methods and gene-expression data are of great value for accelerating the drug-discovery process, whether by establishing new uses for existing drugs, characterizing MOA, or defining new targets. Advances in high-throughput technologies are increasing the amount of available data, which, together with the development of appropriate bioinformatics pipelines, will guide the drug-development process toward new discoveries and breakthroughs in the next decade.
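As a rough illustration of the meta-analysis-like integration mentioned above, the sketch below combines per-study differential-expression z-scores for the same drug into a consensus signature, so that genes responding consistently across cell types are retained and cell-type-specific responses tend to cancel out. The combination rule is unweighted Stouffer's method, chosen here only for simplicity; it is an assumption of this example and not necessarily the procedure used in the cited studies. The function names, the threshold of 2.0, and the toy values are hypothetical.

```python
from math import sqrt


def stouffer_combine(z_by_study):
    """Combine per-gene z-scores across studies (unweighted Stouffer's method).

    z_by_study: list of dicts mapping gene -> z-score, one dict per study.
    Only genes measured in every study are combined.
    """
    if not z_by_study:
        raise ValueError("at least one study is required")
    shared = set(z_by_study[0])
    for study in z_by_study[1:]:
        shared &= set(study)
    k = len(z_by_study)
    return {g: sum(study[g] for study in z_by_study) / sqrt(k) for g in shared}


def consensus_signature(z_by_study, threshold=2.0):
    """Return (up, down) gene lists whose combined z-score passes the threshold."""
    combined = stouffer_combine(z_by_study)
    up = sorted(g for g, z in combined.items() if z >= threshold)
    down = sorted(g for g, z in combined.items() if z <= -threshold)
    return up, down


# Toy z-scores for one drug in three cell lines (invented values).
studies = [
    {"GENE_A": 2.0, "GENE_B": -1.5, "GENE_C": 2.5, "GENE_D": 0.2},
    {"GENE_A": 1.8, "GENE_B": -2.0, "GENE_C": -2.4, "GENE_D": -0.1},
    {"GENE_A": 2.2, "GENE_B": -1.7, "GENE_C": 0.1, "GENE_D": 0.3},
]

up_genes, down_genes = consensus_signature(studies)
print("consistently up:", up_genes)
print("consistently down:", down_genes)
# GENE_A is retained as up and GENE_B as down; GENE_C, which changes
# direction between cell lines, is excluded from the consensus.
```

GENE_C shows why such integration matters: it is strongly regulated in two of the three toy cell lines but in opposite directions, so a signature taken from either cell line alone would be misleading in another context.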
References
Ashburn, T. T., & Thor, K. B. (2004). Drug repositioning: identifying and developing new uses for existing drugs. Nature Reviews Drug Discovery, 3(8), 673–683.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., et al. (2000). Gene ontology: tool for the unification of biology. The gene ontology consortium. Nature Genetics, 25(1), 25–29.
Barbarino, J. M., Whirl-Carrillo, M., Altman, R. B., & Klein, T. E. (2018). PharmGKB: a worldwide resource for pharmacogenomic information. Wiley Interdisciplinary Reviews. Systems Biology and Medicine, 10(4), e1417.
Barrett, T., & Edgar, R. (2006). Gene expression omnibus: microarray data storage, submission, retrieval, and analysis. Methods in Enzymology, 411, 352–369.
Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., Marshall, K. A., et al. (2013). NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Research, 41(Database issue), D991–D995.
Bleakley, K., & Yamanishi, Y. (2009). Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics, 25(18), 2397–2403.
Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., Holloway, E., et al. (2003). ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68–71.
Brown, A. S., Kong, S. W., Kohane, I. S., & Patel, C. J. (2016). ksRepo: a generalized platform for computational drug repositioning. BMC Bioinformatics, 17, 78.
Bureeva, S., Zvereva, S., Romanov, V., & Serebryiskaya, T. (2009). Manual annotation of protein interactions. In Methods in molecular biology (pp. 75–95). Humana Press.
Campillos, M., Kuhn, M., Gavin, A. -C., Jensen, L. J., & Bork, P. (2008). Drug target identification using side-effect similarity. Science, 321(5886), 263–266.
Cancer Genome Atlas Research Network, Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., et al. (2013). The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10), 1113–1120.
Carrella, D., Napolitano, F., Rispoli, R., Miglietta, M., Carissimo, A., Cutillo, L., Sirci, F., et al. (2014). Mantra 2.0: an online collaborative resource for drug mode of action and repurposing by network analysis. Bioinformatics, 30(12), 1787–1788.
Chen, X., Liu, M. -X., & Yan, G. -Y. (2012). Drug-target interaction prediction by random walk on the heterogeneous network. Molecular BioSystems, 8(7), 1970–1978.
Cheng, J., Xie, Q., Kumar, V., Hurle, M., Freudenberg, J. M., Yang, L., & Agarwal, P. (2012). Evaluation of analytical methods for connectivity map data. Biocomputing 2013. https://dx.doi.org/10.1142/9789814447973_0002.
Cheng, J., Yang, L., Kumar, V., & Agarwal, P. (2014). Systematic evaluation of connectivity map for disease indications. Genome Medicine, 6(12). https://dx.doi.org/10.1186/s13073-014-0095-1.
Chong, C. R., & Sullivan, D. J., Jr. (2007). New uses for old drugs. Nature, 448(7154), 645–646.
Cichonska, A., Rousu, J., & Aittokallio, T. (2015). Identification of drug candidates and repurposing opportunities through compound-target interaction networks. Expert Opinion on Drug Discovery, 10(12), 1333–1345.
Dai, W., Liu, X., Gao, Y., Chen, L., Song, J., Chen, D., Gao, K., et al. (2015). Matrix factorization-based prediction of novel drug indications by integrating genomic space. Computational and Mathematical Methods in Medicine, 2015, 275045.
Davis, A. P., Grondin, C. J., Johnson, R. J., Sciaky, D., King, B. L., McMorran, R., Wiegers, J., et al. (2017). The comparative toxicogenomics database: update 2017. Nucleic Acids Research, 45(D1), D972–D978.
Emig, D., Ivliev, A., Pustovalova, O., Lancashire, L., Bureeva, S., Nikolsky, Y., & Bessarabova, M. (2013). Drug target prediction and repositioning using an integrated network-based approach. PLoS One, 8(4), e60618.
Fortney, K., Griesman, J., Kotlyar, M., Pastrello, C., Angeli, M., Sound-Tsao, M., & Jurisica, I. (2015). Prioritizing therapeutics for lung cancer: an integrative meta-analysis of cancer gene signatures and chemogenomic data. PLoS Computational Biology, 11(3), e1004068.
Ganter, B., Tugendreich, S., Pearson, C. I., Ayanoglu, E., Baumhueter, S., Bostian, K. A., Brady, L., et al. (2005). Development of a large-scale chemogenomics database to improve drug candidate selection and to understand mechanisms of chemical toxicity and action. Journal of Biotechnology, 119(3), 219–244.
Gheeya, J., Johansson, P., Chen, Q. -R., Dexheimer, T., Metaferia, B., Song, Y. K., Wei, J. S., et al. (2010). Expression profiling identifies epoxy anthraquinone derivative as a DNA topoisomerase inhibitor. Cancer Letters, 293(1), 124–131.
Gupta, E. K., & Ito, M. K. (2002). Lovastatin and extended-release niacin combination product: the first drug combination for the management of hyperlipidemia. Heart Disease, 4(2), 124–137.
Hassane, D. C., Guzman, M. L., Corbett, C., Li, X., Abboud, R., Young, F., Liesveld, J. L., et al. (2008). Discovery of agents that eradicate leukemia stem cells using an in silico screen of public gene expression data. Blood, 111(12), 5654–5662.
Heikamp, K., & Bajorath, J. (2013). Support vector machines for drug discovery. Expert Opinion on Drug Discovery, 9(1), 93–104.
Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., Bennett, H. A., et al. (2000). Functional discovery via a compendium of expression profiles. Cell, 102(1), 109–126.
Iorio, F., Bosotti, R., Scacheri, E., Belcastro, V., Mithbaokar, P., Ferriero, R., Murino, L., et al. (2010). Discovery of drug mode of action and drug repositioning from transcriptional responses. Proceedings of the National Academy of Sciences of the United States of America, 107(33), 14621–14626.
Iorio, F., Rittman, T., Ge, H., Menden, M., & Saez-Rodriguez, J. (2013). Transcriptional data: a new gateway to drug repositioning? Drug Discovery Today, 18(7–8), 350–357.
Itadani, H., Mizuarai, S., & Kotani, H. (2008). Can systems biology understand pathway activation? Gene expression signatures as surrogate markers for understanding the complexity of pathway activation. Current Genomics, 9(5), 349–360.
Jia, Z., Liu, Y., Guan, N., Bo, X., Luo, Z., & Barnes, M. R. (2016). Cogena, a novel tool for co-expressed gene-set enrichment analysis, applied to drug repositioning and drug mode of action discovery. BMC Genomics, 17, 414.
Jin, G., & Wong, S. T. C. (2014). Toward better drug repositioning: prioritizing and integrating existing methods into efficient pipelines. Drug Discovery Today, 19(5), 637–644.
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), D353–D361.
Keenan, A. B., Jenkins, S. L., Jagodnik, K. M., Koplev, S., He, E., Torre, D., Wang, Z., et al. (2018). The library of integrated network-based cellular signatures NIH program: system-level cataloging of human cells response to perturbations. Cell Systems, 6(1), 13–24.
Kim, J. E., Patel, M. A., Mangraviti, A., Kim, E. S., Theodros, D., Velarde, E., Liu, A., et al. (2017). Combination therapy with anti-PD-1, anti-TIM-3, and focal radiation results in regression of murine gliomas. Clinical Cancer Research: An Official Journal of the American Association for Cancer Research, 23(1), 124–136.
Kim, S. -Y., & Volsky, D. J. (2005). PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics, 6, 144.
King, M. D., Long, T., Pfalmer, D. L., Andersen, T. L., & McDougal, O. M. (2018). SPIDR: small-molecule peptide-influenced drug repurposing. BMC Bioinformatics, 19(1), 138.
Lamb, J. (2006). The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 313(5795), 1929–1935.
Lee, B. K. B., Tiong, K. H., Chang, J. K., Liew, C. S., Abdul Rahman, Z. A., Tan, A. C., Khang, T. F., et al. (2017). DeSigN: connecting gene expression with therapeutics for drug repurposing and development. BMC Genomics, 18(Suppl 1), 934.
Lee, J. -H., Kim, D. G., Bae, T. J., Rho, K., Kim, J. -T., Lee, J. -J., Jang, Y., et al. (2012). CDA: combinatorial drug discovery using transcriptional response modules. PLoS One, 7(8), e42573.
Li, F., Cao, Y., Han, L., Cui, X., Xie, D., Wang, S., & Bo, X. (2013). GeneExpressionSignature: an R package for discovering functional connections using gene expression signatures. Omics: A Journal of Integrative Biology, 17(2), 116–118.
Li, J., Zheng, S., Chen, B., Butte, A. J., Swamidass, S. J., & Lu, Z. (2016). A survey of current trends in computational drug repositioning. Briefings in Bioinformatics, 17(1), 2–12.
Liu, C. -C., Tseng, Y. -T., Li, W., Wu, C. -Y., Mayzus, I., Rzhetsky, A., Sun, F., et al. (2014). DiseaseConnect: a comprehensive web server for mechanism-based disease-disease connections. Nucleic Acids Research, 42(Web Server issue), W137–W146.
Liu, H., Song, Y., Guan, J., Luo, L., & Zhuang, Z. (2016). Inferring new indications for approved drugs via random walk on drug-disease heterogenous networks. BMC Bioinformatics, 17(Suppl 17), 539.
Liu, X., Yang, X., Chen, X., Zhang, Y., Pan, X., Wang, G., & Ye, Y. (2015). Expression profiling identifies bezafibrate as potential therapeutic drug for lung adenocarcinoma. Journal of Cancer, 6(12), 1214–1221.
Lotfi Shahreza, M., Ghadiri, N., Mousavi, S. R., Varshosaz, J., & Green, J. R. (2017). A review of network-based approaches to drug repositioning. Briefings in Bioinformatics, 19(5), 878–892. https://dx.doi.org/10.1093/bib/bbx017.
Louhimo, R., Laakso, M., Belitskin, D., Klefström, J., Lehtonen, R., & Hautaniemi, S. (2016). Data integration to prioritize drugs using genomics and curated data. BioData Mining, 9, 21.
Luo, Y., Zhao, X., Zhou, J., Yang, J., Zhang, Y., Kuang, W., Peng, J., et al. (2017). A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nature Communications, 8(1), 573.
Ma, D. -L., Chan, D. S. -H., & Leung, C. -H. (2013). Drug repositioning by structure-based virtual screening. Chemical Society Reviews, 42(5), 2130.
Martínez, M., Sorzano, C. O. S., Pascual-Montano, A., & Carazo, J. M. (2017). Gene signature associated with benign neurofibroma transformation to malignant peripheral nerve sheath tumors. PLoS One, 12(5), e0178316.
Mei, J. -P., Kwoh, C. -K., Yang, P., Li, X. -L., & Zheng, J. (2013). Drug-target interaction prediction by learning from local information and neighbors. Bioinformatics, 29(2), 238–245.
Morales, A., Gingell, C., Collins, M., Wicker, P. A., & Osterloh, I. H. (1998). Clinical safety of oral sildenafil citrate (VIAGRA™) in the treatment of erectile dysfunction. International Journal of Impotence Research, 10(2), 69–73.
Musa, A., Ghoraie, L. S., Zhang, S. -D., Glazko, G., Yli-Harja, O., Dehmer, M., Haibe-Kains, B., et al. (2017). A review of connectivity map and computational approaches in pharmacogenomics. Briefings in Bioinformatics, 18(5), 903.
Napolitano, F., Carrella, D., Mandriani, B., Pisonero-Vaquero, S., Sirci, F., Medina, D. L., Brunetti-Pierri, N., et al. (2018). gene2drug: a computational tool for pathway-based rational drug repositioning. Bioinformatics, 34(9), 1498–1505.
Napolitano, F., Sirci, F., Carrella, D., & di Bernardo, D. (2016). Drug-set enrichment analysis: a novel tool to investigate drug mode of action. Bioinformatics, 32(2), 235–241.
Nascimento, A. C. A., Prudêncio, R. B. C., & Costa, I. G. (2016). A multiple kernel learning algorithm for drug-target interaction prediction. BMC Bioinformatics, 17, 46.
Pacini, C., Iorio, F., Gonçalves, E., Iskar, M., Klabunde, T., Bork, P., & Saez-Rodriguez, J. (2012). DvD: An R/Cytoscape pipeline for drug repurposing using public repositories of gene expression data. Bioinformatics, 29(1), 132–134.
Pan, Y., Cheng, T., Wang, Y., & Bryant, S. H. (2014). Pathway analysis for drug repositioning based on public database mining. Journal of Chemical Information and Modeling, 54(2), 407–418.
Pujol, A., Mosca, R., Farres, J., & Aloy, P. (2010). Unveiling the role of network and systems biology in drug discovery. Trends in Pharmacological Sciences, 31(3), 115–123.
Qu, X. A., & Rajpal, D. K. (2012). Applications of connectivity map in drug discovery and development. Drug Discovery Today, 17(23–24), 1289–1298.
Rustici, G., Kolesnikov, N., Brandizi, M., Burdett, T., Dylag, M., Emam, I., Farne, A., et al. (2013). ArrayExpress update—trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(Database issue), D987–D990.
Sam, E., & Athri, P. (2017). Web-based drug repurposing tools: a survey. Briefings in Bioinformatics. https://dx.doi.org/10.1093/bib/bbx125.
Setoain, J., Franch, M., Martínez, M., Tabas-Madrid, D., Sorzano, C. O. S., Bakker, A., Gonzalez-Couto, E., et al. (2015). NFFinder: an online bioinformatics tool for searching similar transcriptomics experiments in the context of drug repositioning. Nucleic Acids Research, 43(W1), W193–W199.
Siavelis, J. C., Bourdakou, M. M., Athanasiadis, E. I., Spyrou, G. M., & Nikita, K. S. (2016). Bioinformatics methods in drug repurposing for Alzheimer’s disease. Briefings in Bioinformatics, 17(2), 322–335.
Sirci, F., Napolitano, F., Vaquero, S. P., Carrella, D., Medina, D. L., & di Bernardo, D. (2017). Integrated Structure-Transcription analysis of small molecules reveals widespread noise in drug-induced transcriptional responses and a transcriptional signature for drug-induced phospholipidosis. bioRxiv, 119990. https://dx.doi.org/10.1101/119990.
Sirota, M., Dudley, J. T., Kim, J., Chiang, A. P., Morgan, A. A., Sweet-Cordero, A., Sage, J., et al. (2011). Discovery and preclinical validation of drug indications using compendia of public gene expression data. Science Translational Medicine, 3(96), 96ra77.
Subramanian, A., Narayan, R., Corsello, S. M., Peck, D. D., Natoli, T. E., Lu, X., Gould, J., et al. (2017). A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171(6), 1437–1452.e17.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550.
Sun, W., Sanderson, P. E., & Zheng, W. (2016). Drug combination therapy increases successful drug repositioning. Drug Discovery Today, 21(7), 1189–1195.
Sun, X., Vilar, S., & Tatonetti, N. P. (2013). High-throughput methods for combinatorial drug discovery. Science Translational Medicine, 5(205), 205rv1.
Toro-Domínguez, D., Carmona-Sáez, P., & Alarcón-Riquelme, M. E. (2017). Support for phosphoinositol 3 kinase and mTOR inhibitors as treatment for lupus using in-silico drug-repurposing analysis. Arthritis Research & Therapy, 19(1), 54.
Vanhaelen, Q., Mamoshina, P., Aliper, A. M., Artemov, A., Lezhnina, K., Ozerov, I., Labat, I., et al. (2017). Design of efficient computational workflows for in silico drug repurposing. Drug Discovery Today, 22(2), 210–222.
Vazquez, M., Nogales-Cadenas, R., Arroyo, J., Botías, P., García, R., Carazo, J. M., Tirado, F., et al. (2010). MARQ: an online tool to mine GEO for experiments with similar or opposite gene expression signatures. Nucleic Acids Research, 38(Web Server issue), W228–W232.
Verbist, B., Klambauer, G., Vervoort, L., Talloen, W., QSTAR Consortium, Shkedy, Z., Thas, O., et al. (2015). Using transcriptomics to guide lead optimization in drug discovery projects: lessons learned from the QSTAR project. Drug Discovery Today, 20(5), 505–513.
Williams, G. (2013). SPIEDw: a searchable platform-independent expression database web tool. BMC Genomics, 14, 765.
Wishart, D. S., & Wu, A. (2016). Using drug bank for in silico drug exploration and discovery. Current Protocols in Bioinformatics, 54, 14.4.1–14.4.31.
Woo, J. H., Shimoni, Y., Yang, W. S., Subramaniam, P., Iyer, A., Nicoletti, P., Rodríguez Martínez, M., et al. (2015). Elucidating compound mechanism of action by network perturbation analysis. Cell, 162(2), 441–451.
Wu, Z., Wang, Y., & Chen, L. (2013). Network-based drug repositioning. Molecular BioSystems, 9(6), 1268.
Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., & Kanehisa, M. (2008). Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13), i232–i240.
Yang, J., Li, Z., Fan, X., & Cheng, Y. (2014). Drug-disease association and drug-repositioning predictions in complex diseases using causal inference-probabilistic matrix factorization. Journal of Chemical Information and Modeling, 54(9), 2562–2569.
Yoo, M., Shin, J., Kim, J., Ryall, K. A., Lee, K., Lee, S., Jeon, M., et al. (2015). DSigDB: drug signatures database for gene set analysis. Bioinformatics, 31(18), 3069–3071.
Yu, H., Choo, S., Park, J., Jung, J., Kang, Y., & Lee, D. (2016). Prediction of drugs having opposite effects on disease genes in a directed network. BMC Systems Biology, 10(Suppl 1), 2.
Zhang, S. -D., & Gant, T. W. (2008). A simple and robust method for connecting small-molecule drugs using gene-expression signatures. BMC Bioinformatics, 9, 258.
Zhong, Y., Chen, E. Y., Liu, R., Chuang, P. Y., Mallipattu, S. K., Tan, C. M., Clark, N. R., et al. (2013). Renoprotective effect of combined inhibition of angiotensin-converting enzyme and histone deacetylase. Journal of the American Society of Nephrology: JASN, 24(5), 801–811.