Data in Brief 15 (2017) 933–940
Data Article
Application of bi-clustering of gene expression data and gene set enrichment analysis methods to identify potentially disease causing nanomaterials
Andrew Williams ⁎, Sabina Halappanavar
Environmental Health Science and Research Bureau, Health Canada, Ottawa, Ontario, Canada K1A 0K9
Article info
Article history: Received 19 September 2017; Accepted 23 October 2017; Available online 26 October 2017
Keywords: Nanomaterials; Toxicogenomics; Predictive toxicology; Bi-clustering
⁎ Corresponding author. E-mail address: [email protected] (A. Williams).
https://doi.org/10.1016/j.dib.2017.10.060
2352-3409/Crown Copyright © 2017 Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Abstract
This article contains data related to the research article ‘Application of bi-clustering of gene expression data and gene set enrichment analysis methods to identify potentially disease causing nanomaterials’ (Williams and Halappanavar, 2015) [1]. The presence of diverse types of nanomaterials (NMs) in commerce has grown significantly in the past decade and, as a result, human exposure to these materials in the environment is inevitable. Traditional toxicity testing approaches that rely on animals are both time- and cost-intensive, so they cannot complete the challenging task of assessing the safety of the NMs currently on the market in a timely manner. Thus, there is an urgent need for a comprehensive understanding of the biological behavior of NMs, and for efficient toxicity screening tools that will enable the development of predictive toxicology paradigms suited to rapidly assessing the human health impacts of exposure to NMs. In an effort to predict the long-term health impacts of acute exposure to NMs, in Williams and Halappanavar (2015) [1] we applied bi-clustering and gene set enrichment analysis methods to derive essential features of the altered lung transcriptome following exposure to NMs that are associated with lung-specific diseases. Several datasets from public microarray repositories describing pulmonary diseases in mouse models following exposure to a variety of substances were examined, and functionally related bi-clusters showing similar gene expression profiles were identified. The identified bi-clusters were then used to conduct a gene set enrichment analysis on lung gene expression profiles derived from mice exposed to nano-titanium dioxide, carbon black or carbon nanotubes (nano-TiO2, CB and CNTs) to determine the disease significance of these data-driven gene sets. The results of the analysis correctly identified all NMs to be inflammogenic, and only CB and CNTs as potentially fibrogenic. Here, we elaborate on the details of the statistical methods and algorithms used to derive the disease-relevant gene signatures. These details will enable other investigators to use the gene signatures in future gene set enrichment analysis studies involving NMs, or as features for clustering and classifying NMs of diverse properties.
Specifications Table
Organism/cell line/tissue: Mus musculus/Lung
Sequencer or array type: Agilent-028005 SurePrint G3 Mouse GE 8×60K Microarray
Data format: Raw: TXT files; normalized data: TXT files
Experimental factors: Exposures to a variety of nanomaterials (nano-titanium dioxide, carbon black, carbon nanotubes)
Experimental features: Bi-cluster analysis was conducted on publicly available data obtained from the Gene Expression Omnibus (GEO) describing specific lung diseases to identify functionally related gene sets. DAVID analysis was conducted on each of these gene sets to identify its functional representation. Gene set enrichment analysis was then conducted on nine toxicogenomic gene expression studies examining the toxicity induced by a variety of nanomaterials to determine the disease significance of the altered gene expression profiles following exposure to NMs.
Sample source location: Ottawa, Ontario, Canada
Data accessibility: National Centre for Biotechnology Information, GEO database. Accession: GSE35193, GSE41041, GSE47000, GSE60801, GSE61366
Value of the data
The results enabled a deeper mechanistic understanding of NM-induced lung toxicity.
The data enabled the development of a database with toxicity fingerprints that are specific to lung diseases.
Using the statistical tools and algorithms established, it may be possible to predict the toxicities of new NMs that have yet to undergo experimental testing.
The data were integral to identifying previously unknown gene sets associated with lung pathology.
The gene sets identified could serve as features for clustering and classifying NMs of diverse properties.
1. Data description
We anticipate that the importance of toxicogenomics studies in chemical risk assessment will continue to increase in the coming years. However, the success of toxicogenomics will depend on 1) accurate and prompt reporting of the data, 2) ensuring public availability of the datasets and 3) sharing of the training sets and tools such as those described here. As the public repository of toxicogenomics datasets for individual NMs is populated, and the underlying mechanisms of NM-induced toxicity are revealed, the data can be used for routine categorization of NMs and their prioritization for further testing.
Fig. 1. Workflow employed for the discovery phase of the analysis.
2. Experimental design, materials and methods
2.1. Experimental design
The discovery phase was the first step in this study. It involved obtaining publicly available data from the Gene Expression Omnibus (GEO) describing specific mouse lung diseases. Raw data files were downloaded, processed, normalized and standardized to the control samples. The data from these studies were then merged and bi-cluster analysis was conducted. Fig. 1 outlines the processing steps taken for the gene set discovery phase of the analysis. DAVID analysis was then conducted on each bi-cluster to interpret the possible functions of the gene set. Gene set enrichment analysis was then conducted using the bi-clusters on a series of toxicogenomics studies of three NMs (nano-TiO2, CB and CNTs) to determine the disease significance of these data-driven gene sets. The specific details of these analyses are outlined below.
Table 1
Studies used for the Gene Set Exploration Phase.
GEO accession | Phenotype/model | Microarray platform (GEO GPL ID) | Reference
GSE4231 | Lung inflammation | UCSF 10Mm Mouse v.2 Oligo Array (GPL1089); UCSF GS Operon Mouse v.2 Oligo Array (GPL3330); UCSF 11Mm Mouse v.2 Oligo Array (GPL3331); UCSF 7Mm Mouse v.2 Oligo Array (GPL3359) | [2]
GSE6116 | Lung tumors | Affymetrix Mouse Genome 430 2.0 Array (GPL1261) | [3]
GSE6858 | Asthma | Affymetrix Mouse Genome 430 2.0 Array (GPL1261) | [4]
GSE8790 | Emphysema | Affymetrix Mouse Genome 430 2.0 Array (GPL1261) | [5]
GSE11037 | Emphysema | Agilent-011978 Mouse Microarray G4121A (GPL891) | [6]
GSE18534 | Small cell lung cancer | Affymetrix Mouse Genome 430 2.0 Array (GPL1261) | [7]
GSE19605 | Lung carcinogenesis | Illumina MouseRef-8 v2.0 Expression Beadchip (GPL6885) | [8]
GSE25640 | Pulmonary fibrosis | Affymetrix Mouse Genome 430 2.0 Array (GPL1261) | [9]
GSE31013 | Spontaneous lung tumors | Affymetrix Mouse Genome 430 2.0 Array (GPL1261) | [10]
GSE40151 | Idiopathic pulmonary fibrosis | Affymetrix Mouse Genome 430 2.0 Array (GPL1261) | [11]
GSE42233 | Lung cancer | Illumina Mouse WG-6 v2.0 expression beadchip (GPL6887) | [12]
GSE52509 | Chronic obstructive pulmonary disease (COPD) | Illumina MouseRef-8 v2.0 Expression Beadchip (GPL6885) | [13]
2.2. Lung disease models
The data used in the discovery phase of novel gene sets relating to lung disease models and lung injury outcomes were obtained from GEO. The accession numbers for these studies [2–13] are presented in Table 1. Lung disease models or lung injury outcomes addressed in this analysis included lung inflammation, emphysema, chronic obstructive pulmonary disease, and lung cancer and tumors. These studies utilized several different microarray platforms, including the Affymetrix GeneChip® and the Illumina Expression Beadchip.
2.3. Data processing and normalization
The log2 transformation was applied to all signal intensity measurements. For the two-colour microarray studies, the LOWESS normalization method [14] was applied using the R statistical software environment [15]. For studies using Affymetrix GeneChips®, RMA normalization was applied using the justRMA() function in the affy R package [16]. For studies that utilized the Illumina Expression Beadchip, quantile normalization was applied using the lumiN() function in the lumi R package [17]. Probes with technical replicates were averaged using the median. The data for each study were then merged with the appropriate annotation file to obtain the gene symbol, and probes with the same gene symbol were averaged using the median. Experimental conditions with biological replicates were also averaged using the median. The average for each experimental condition was normalized to the appropriate control samples, resulting in the log2 fold change for each experimental condition. The control samples were then removed from further analysis. All studies were merged together using the gene symbol. The resulting dataset consisted of 8752 gene symbols.
2.4. Bi-clustering
The bi-clustering data analysis was conducted in R using the biclust package [18]. The Bimax method [19] was selected for this analysis. Bimax uses a simple data model that assumes two possible states for each expression level: no change, and change with respect to a control experiment. For this analysis, two binary matrices of zeros and ones were constructed. In the first matrix, the ones indicate genes that were 2-fold up-regulated; in the second matrix, the ones indicate genes that were 2-fold down-regulated. These two matrices were analyzed independently.
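The median averaging, control normalization and construction of the two binary matrices described above can be illustrated with a minimal Python sketch. This is not the authors' R code: the gene names, condition names and intensity values are hypothetical toy data, and treating a log2 fold change of exactly ±1 as a 2-fold change (an inclusive threshold) is an assumption.

```python
from statistics import median

# Hypothetical toy data: log2 signal intensities for two genes, grouped by
# experimental condition (each list holds biological replicates).
log2_signal = {
    "GeneA": {"control": [8.0, 8.2, 7.9], "low": [8.1, 8.3, 8.0], "high": [9.4, 9.1, 9.3]},
    "GeneB": {"control": [10.0, 10.1, 9.9], "low": [9.7, 9.8, 9.6], "high": [8.6, 8.9, 8.8]},
}

def log2_fold_change(groups, control="control"):
    """Median of each condition minus the control median; the control is dropped."""
    base = median(groups[control])
    return {cond: median(vals) - base for cond, vals in groups.items() if cond != control}

def binarize(log2fc, threshold=1.0):
    """Build the up- and down-regulation 0/1 matrices (rows: genes, columns: conditions).

    A threshold of 1.0 on the log2 scale corresponds to a 2-fold change;
    the inclusive comparison is an assumption of this sketch.
    """
    genes = sorted(log2fc)
    conditions = sorted(next(iter(log2fc.values())))
    up = [[int(log2fc[g][c] >= threshold) for c in conditions] for g in genes]
    down = [[int(log2fc[g][c] <= -threshold) for c in conditions] for g in genes]
    return genes, conditions, up, down

log2fc = {gene: log2_fold_change(groups) for gene, groups in log2_signal.items()}
genes, conditions, up, down = binarize(log2fc)
print(conditions, up, down)
```

Each of the two matrices would then be passed separately to a bi-clustering routine, mirroring the independent analysis of up- and down-regulated genes described in the text.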
Table 2
NM studies used for the Gene Set Enrichment Analysis.
GEO accession | Nanomaterial | Doses | Time points | Reference
GSE29042 | CNT: MWCNT-7 | 10 µg, 20 µg, 40 µg and 80 µg | 1, 3, 28 and 56 days | [23–26]
GSE35193 | CB: Printex 90 | 18 µg, 54 µg and 162 µg | 1, 3 and 28 days | [27]
GSE41041 | TiO2: UV-Titan L181 | 18 µg, 54 µg and 162 µg | 1, 3 and 28 days | [28]
GSE47000 | CNT: Mitsui7 | 18 µg, 54 µg and 162 µg | 1 and 28 days | [29]
GSE60801 | TiO2: NRCWE-025, NRCWE-030 | 18 µg, 54 µg and 162 µg | 1, 3 and 28 days | [30]
GSE60801 | TiO2 Sanding dust: Indoor-R, IndoornanoTiO2 | 18 µg, 54 µg and 162 µg | 1 and 28 days | [31]
GSE60801 | TiO2: Sanding dust NRCWE-032, Sanding dust NRCWE-033 | 18 µg, 54 µg and 162 µg | 1 and 28 days | [31]
GSE60801 | TiO2: NRCWE 001 (neutral), NRCWE 002 (positively charged) | 18 µg, 54 µg and 162 µg | 1 and 28 days | [31]
GSE61366 | CNT: NRCWE-26, NM-401 | 18 µg, 54 µg and 162 µg | 1, 3 and 28 days | [31]
For the Bimax method, the minimum number of rows was set at 15, the minimum number of columns (which represent the experimental conditions) was set at 5, and the maximum number of columns was set at 15. This resulted in 8 bi-clusters from the binary matrix representing the up-regulated genes and 2 bi-clusters from the matrix representing the down-regulated genes.
2.5. DAVID analysis
Gene lists from each bi-cluster were submitted to DAVID (https://david-d.ncifcrf.gov/) for functional annotation [20,21]. Gene lists were pasted into the web application and “Official_Gene_Symbol” was selected as the gene identifier. Mus musculus was selected as the species and used as the background for the analysis. Default settings were used for the functional annotation clustering.
2.6. NM-induced lung response data sets
Datasets examining differential gene expression in mouse lung exposed to nano-titanium dioxide, carbon black or carbon nanotubes (nano-TiO2, CB and CNTs) were compiled from GEO. The GEO accession numbers for these studies are presented in Table 2. These studies used two-colour Agilent microarrays (GPL7202 Agilent-014868 Whole Mouse Genome Microarray 4×44K G4122F, and GPL10787 Agilent-028005 SurePrint G3 Mouse GE 8×60K Microarray for GSE61366) in a reference design [22]. The data were LOWESS normalized and probes with technical replicates were averaged. The annotation file containing the gene symbol was merged with the expression data, and probes sharing a gene symbol were averaged using the median expression.
2.7. Gene set enrichment
As the NM-induced lung response data sets contained multiple doses, the test statistic from the Attract approach was used [32]. Using this method, the overall F-statistic for the dose effect was estimated for each gene. Since large F-statistics are indicative of a strong dose effect, a bi-cluster whose distribution of F-statistics is skewed towards larger values represents an enrichment of that gene set. A two-sample t-test assuming unequal variances was then conducted, comparing the mean of the log2 F-statistics within the bi-cluster to the mean of the log2 F-statistics for all genes. For bi-cluster 7, a graphical representation of the group medians for GSE61366 is presented as Fig. 2. The p-values for NRCWE-26 were 0.6091, 0.0412 and < 0.0001 for days 1, 3 and 28, respectively; similarly, the p-values for NM-401 were 0.5391, 0.0005 and < 0.0001. These results were graphically reported in Williams and Halappanavar [1].
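The enrichment test described above can be sketched in a few lines of Python using only the standard library. This is an illustration, not the Attract package or the authors' R code: the expression values and gene names are hypothetical, the F-statistic is computed as a plain one-way ANOVA across dose groups (assumed here to stand in for the "overall F-statistic for the dose effect"), and the comparison group is taken to be all genes including those in the bi-cluster. Only the Welch t statistic and its degrees of freedom are computed; a p-value would require a t-distribution function from a statistics library.

```python
from math import log2, sqrt
from statistics import mean, variance

def anova_f(groups):
    """One-way ANOVA F-statistic across dose groups (each group is a list of values)."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def welch_t(a, b):
    """Two-sample t statistic assuming unequal variances, with Welch-Satterthwaite df."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical log2 ratios per gene: one list of replicates per dose group.
expression = {
    "G1": [[0.0, 0.1, -0.1], [1.0, 1.2, 0.9], [2.1, 1.9, 2.0]],
    "G2": [[0.1, 0.0, -0.1], [0.9, 1.1, 1.0], [1.8, 2.2, 2.0]],
    "G3": [[0.0, 0.2, -0.2], [0.1, -0.1, 0.3], [0.2, 0.0, -0.1]],
    "G4": [[0.3, -0.2, 0.1], [0.0, 0.2, -0.3], [-0.1, 0.1, 0.2]],
    "G5": [[-0.2, 0.1, 0.0], [0.2, -0.1, 0.1], [0.0, 0.3, -0.2]],
}
bicluster = ["G1", "G2"]  # hypothetical gene set

log2_f = {gene: log2(anova_f(groups)) for gene, groups in expression.items()}
t_stat, df = welch_t([log2_f[g] for g in bicluster], list(log2_f.values()))
print(t_stat, df)
```

A positive t statistic indicates that the bi-cluster's log2 F-statistics are shifted towards larger values than those of the genes overall, which is the direction interpreted as enrichment in the text.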
Fig. 2. Heatmap of bi-cluster 7 for GSE61366. Biological replicates were averaged using the median. Group medians were clustered using average linkage with a 1 − correlation dissimilarity metric, where the correlations were Spearman correlations.
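The dissimilarity metric named in the Fig. 2 caption can be illustrated with a short Python sketch. It computes 1 minus the Spearman correlation between two profiles of group medians by ranking the values (ties receive average ranks) and taking the Pearson correlation of the ranks. The function names are hypothetical, the average-linkage clustering step itself is not shown, and this is not the code used to produce the figure.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_dissimilarity(x, y):
    """1 - Spearman correlation: 0 for identical rank order, 2 for reversed order."""
    return 1.0 - pearson(ranks(x), ranks(y))

# Hypothetical group-median profiles for two genes across four conditions.
print(spearman_dissimilarity([0.1, 0.5, 1.2, 2.0], [0.0, 0.4, 0.9, 3.1]))
```

A matrix of such pairwise dissimilarities is the kind of input an average-linkage hierarchical clustering routine would take.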
The results identified all the NMs to be inflammogenic and only CB and CNTs as potentially fibrogenic. In addition to identifying several previously defined, functionally relevant gene sets, the study also identified two novel genes associated with pulmonary fibrosis and reactive oxygen species. These results demonstrate the advantages of using a data-driven approach to identify novel, functionally related gene sets.
Acknowledgements
The authors are thankful to Dr. Marc Beal and Dr. Francina Webster for reviewing the manuscript and for their helpful comments.
Transparency document. Supplementary material
Transparency document associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.dib.2017.10.060.
Appendix A. Supporting information
Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.dib.2017.10.060.
References
[1] A. Williams, S. Halappanavar, Application of biclustering of gene expression data and gene set enrichment analysis methods to identify potentially disease causing nanomaterials, Beilstein J. Nanotechnol. 6 (2015) 2438–2448.
[2] C.C. Lewis, J.Y.H. Yang, X. Huang, S.K. Banerjee, M.R. Blackburn, P. Baluk, D.M. McDonald, T.S. Blackwell, V. Nagabhushanam, W. Peters, D. Voehringer, D.J. Erle, Disease-specific gene expression profiling in multiple models of lung disease, Am. J. Respir. Crit. Care Med. 177 (2008) 376–387.
[3] R.S. Thomas, L. Pluta, L. Yang, T.A. Halsey, Application of genomic biomarkers to predict increased lung tumor incidence in 2-year rodent cancer bioassays, Toxicol. Sci. J. Soc. Toxicol. 97 (2007) 55–64.
[4] X. Lu, V.V. Jain, P.W. Finn, D.L. Perkins, Hubs in biological interaction networks exhibit low changes in expression in experimental asthma, Mol. Syst. Biol. 3 (2007) 98.
[5] T. Rangasamy, V. Misra, L. Zhen, C.G. Tankersley, R.M. Tuder, S. Biswal, Cigarette smoke-induced emphysema in A/J mice is associated with pulmonary oxidative stress, apoptosis of lung cells, and global alterations in gene expression, Am. J. Physiol. Lung Cell. Mol. Physiol. 296 (2009) L888–L900.
[6] E.M. Thomson, A. Williams, C.L. Yauk, R. Vincent, Overexpression of tumor necrosis factor-α in the lungs alters immune response, matrix remodeling, and repair and maintenance pathways, Am. J. Pathol. 180 (4) (2012) 1413–1430.
[7] B.E. Schaffer, K.S. Park, G. Yiu, J.F. Conklin, C. Lin, D.L. Burkhart, A.N. Karnezis, E.A. Sweet-Cordero, J. Sage, Loss of p130 accelerates tumor development in a mouse model for human small-cell lung carcinoma, Cancer Res. 70 (2010) 3877–3883.
[8] C.E. Ochoa, S.G. Mirabolfathinejad, V.A. Ruiz, S.E. Evans, M. Gagea, C.M. Evans, B.F. Dickey, S.J. Moghaddam, Interleukin 6, but not T helper 2 cytokines, promotes lung carcinogenesis, Cancer Prev. Res. 4 (2011) 51–64.
[9] T. Liu, H.A. Baek, H. Yu, H.J. Lee, B.H. Park, M. Ullenbruch, J. Liu, T. Nakashima, Y.Y. Choi, G.D. Wu, M.J. Chung, S.H. Phan, FIZZ2/RELM-β induction and role in pulmonary fibrosis, J. Immunol. 187 (2011) 450–461.
[10] A.R. Pandiri, R.C. Sills, V. Ziglioli, T.V. Ton, H.H. Hong, S.A. Lahousse, K.E. Gerrish, S.S. Auerbach, K.R. Shockley, P.R. Bushel, S.D. Peddada, K.J. Hoenerhoff, Differential transcriptomic analysis of spontaneous lung tumors in B6C3F1 mice: comparison to human non-small cell lung cancer, Toxicol. Pathol. 40 (2012) 1141–1159.
[11] R. Peng, S. Sridhar, G. Tyagi, J.E. Phillips, R. Garrido, P. Harris, L. Burns, L. Renteria, J. Woods, L. Chen, J. Allard, P. Ravindran, H. Bitter, Z. Liang, C.M. Hogaboam, C. Kitson, D.C. Budd, J.S. Fine, C.M. Bauer, C.S. Stevenson, Bleomycin induces molecular changes directly relevant to idiopathic pulmonary fibrosis: a model for “Active” disease, PLoS One 8 (2013) e59348.
[12] O. Delgado, K.G. Batten, J.A. Richardson, X.J. Xie, A.F. Gazdar, A.A. Kaisani, L. Girard, C. Behrens, M. Suraokar, G. Fasciani, W.E. Wright, M.D. Story, I.I. Wistuba, J.D. Minna, J.W. Shay, Radiation-enhanced lung cancer progression in a transgenic mouse model of lung cancer is predictive of outcomes in human lung and breast cancer, Clin. Cancer Res. J. Am. Assoc. Cancer Res. 20 (2014) 1610–1622.
[13] G. John-Schuster, K. Hager, T.M. Conlon, M. Irmler, J. Beckers, O. Eickelberg, A.Ö. Yildirim, Cigarette smoke-induced iBALT mediates macrophage activation in a B cell-dependent manner in COPD, Am. J. Physiol. Lung Cell. Mol. Physiol. 307 (2014) L692–L706.
[14] Y.H. Yang, S. Dudoit, P. Luu, D.M. Lin, V. Peng, J. Ngai, T.P. Speed, Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, Nucleic Acids Res. 30 (2002) e15.
[15] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2014.
[16] L. Gautier, L. Cope, B.M. Bolstad, R.A. Irizarry, Affy–analysis of affymetrix genechip data at the probe level, Bioinformatics 20 (2004) 307–315.
[17] P. Du, W.A. Kibbe, S.M. Lin, Lumi: a pipeline for processing illumina microarray, Bioinformatics 24 (2008) 1547–1548.
[18] S. Kaiser, R. Santamaria, T. Khamiakova, M. Sill, R. Theron, L. Quintales, F. Leisch, E. De Troyer, Biclust: BiCluster Algorithms, R package, version 1.2.0, 2015.
[19] A. Prelić, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, E.A. Zitzler, Systematic comparison and evaluation of biclustering methods for gene expression data, Bioinformatics 22 (2006) 1122–1129.
[20] D.W. Huang, B.T. Sherman, Q. Tan, J.R. Collins, W.G. Alvord, J. Roayaei, R. Stephens, M.W. Baseler, H.C. Lane, R.A. Lempicki, The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists, Genome Biol. 8 (9) (2007) R183.
[21] D.W. Huang, B.T. Sherman, Q. Tan, J. Kir, D. Liu, D. Bryant, Y. Guo, R. Stephens, M.W. Baseler, H.C. Lane, R.A. Lempicki, DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists, Nucleic Acids Res. 35 (2007).
[22] M.K. Kerr, G.A. Churchill, Statistical design and the analysis of gene expression microarray data, Genet. Res. 77 (2001) 123–128.
[23] N.L. Guo, Y.W. Wan, J. Denvir, D.W. Porter, M. Pacurari, M.G. Wolfarth, V. Castranova, Y. Qian, Multiwalled carbon nanotube-induced gene signatures in the mouse lung: potential predictive value for human lung cancer risk and prognosis, J. Toxicol. Environ. Health A 75 (18) (2012) 1129–1153.
[24] J. Dymacek, B.N. Snyder-Talkington, D.W. Porter, R.R. Mercer, M.G. Wolfarth, V. Castranova, Y. Qian, N.L. Guo, mRNA and miRNA regulatory networks reflective of multi-walled carbon nanotube-induced lung inflammatory and fibrotic pathologies in mice, Toxicol. Sci. 144 (1) (2015) 51–64.
[25] J. Dymacek, N.L. Guo, Integrated miRNA and mRNA analysis of time series microarray data, ACM BCB (2014) 122–127.
[26] B.N. Snyder-Talkington, C. Dong, X. Zhao, J. Dymacek, D.W. Porter, M.G. Wolfarth, V. Castranova, Y. Qian, N.L. Guo, Multiwalled carbon nanotube-induced gene expression in vitro: concordance with in vivo studies, Toxicology 328 (2015) 66–74.
[27] J.A. Bourdon, S. Halappanavar, A.T. Saber, N.R. Jacobsen, A. Williams, H. Wallin, U. Vogel, C.L. Yauk, Hepatic and pulmonary toxicogenomic profiles in mice intratracheally instilled with carbon black nanoparticles reveal pulmonary inflammation, acute phase response, and alterations in lipid homeostasis, Toxicol. Sci. 127 (2) (2012) 474–484.
[28] M. Husain, A.T. Saber, C. Guo, N.R. Jacobsen, K.A. Jensen, C. Yauk, A. Williams, U. Vogel, H. Wallin, S. Halappanavar, Pulmonary instillation of low doses of titanium dioxide nanoparticles in mice leads to particle retention and gene expression changes in the absence of inflammation, Toxicol. Appl. Pharmacol. 269 (3) (2013) 250–262.
[29] S. Søs Poulsen, N.R. Jacobsen, S. Labib, D. Wu, M. Husain, A. Williams, J.P. Bøgelund, O. Andersen, C. Købler, K. Mølhave, Z.O. Kyjovska, A.T. Saber, H. Wallin, C.L. Yauk, U. Vogel, S. Halappanavar, Transcriptomic analysis reveals novel mechanistic insight into murine biological responses to multi-walled carbon nanotubes in lungs and cultured lung epithelial cells, PLoS One 8 (2013) 11.
[30] S. Halappanavar, A.T. Saber, N. Decan, K.A. Jensen, D. Wu, N.R. Jacobsen, C. Guo, J. Rogowski, I.K. Koponen, M. Levin, A.M. Madsen, R. Atluri, V. Snitka, R.K. Birkedal, D. Rickerby, A. Williams, H. Wallin, C.L. Yauk, U. Vogel, Transcriptional profiling identifies physicochemical properties of nanomaterials that are determinants of the in vivo pulmonary response, Environ. Mol. Mutagen. 56 (2015) 245–264.
[31] S. Søs Poulsen, A.T. Saber, A. Mortensen, J. Szarek, D. Wu, A. Williams, O. Andersen, N.R. Jacobsen, C.L. Yauk, H. Wallin, S. Halappanavar, U. Vogel, Changes in cholesterol homeostasis and acute phase response link pulmonary exposure to multi-walled carbon nanotubes to risk of cardiovascular disease, Toxicol. Appl. Pharmacol. 283 (3) (2015) 210–222.
[32] J.C. Mar, N.A. Matigian, J. Quackenbush, C.A. Wells, Attract: a method for identifying core pathways that define cellular phenotypes, PLoS One 6 (2011) e25445.