Systems Glycobiology: Integrating glycogenomics, glycoproteomics, glycomics and other ‘omics data sets to characterize cellular glycosylation processes
Sandra V. Bennun, Deniz Baycin Hizal, Kelley Heffner, Ozge Can, Hui Zhang, Michael J. Betenbaugh
PII: S0022-2836(16)30250-9
DOI: doi: 10.1016/j.jmb.2016.07.005
Reference: YJMBI 65143
To appear in: Journal of Molecular Biology
Received date: 6 February 2016
Revised date: 5 July 2016
Accepted date: 7 July 2016
Please cite this article as: Bennun, S.V., Hizal, D.B., Heffner, K., Can, O., Zhang, H. & Betenbaugh, M.J., Systems Glycobiology: Integrating glycogenomics, glycoproteomics, glycomics and other ‘omics data sets to characterize cellular glycosylation processes, Journal of Molecular Biology (2016), doi: 10.1016/j.jmb.2016.07.005
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Systems Glycobiology: Integrating glycogenomics, glycoproteomics, glycomics and other ‘omics data sets to characterize cellular glycosylation processes
Sandra V. Bennun1,4, Deniz Baycin Hizal1, Kelley Heffner1, Ozge Can2, Hui Zhang3, Michael J. Betenbaugh1*
1 Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, Maryland, USA
2 Department of Medical Engineering, Acibadem University, Istanbul, Turkey
3 Johns Hopkins University School of Medicine, Maryland, USA
4 Current Address: Regeneron Pharmaceuticals, Tarrytown, New York, USA
*To whom correspondence should be addressed:
[email protected]
Keywords: Glycoinformatics, Systems Biology, N-glycosylation, Glycans, Chinese hamster ovary, Automatic glycan annotation, Cancer Biomarkers
Abstract
The number of proteins encoded in the human genome has been estimated at between
20,000 and 25,000 despite estimates that the entire proteome contains more than a million
proteins. One reason for this difference is the many protein post-translational
modifications that contribute to proteome complexity. Among them, glycosylation is of particular relevance because it serves to modify a large number of cellular proteins.
Glycogenomics, glycoproteomics, glycomics and glycoinformatics are helping to accelerate our
understanding of the cellular events involved in generating the glycoproteome, the variety of glycan structures possible, and the importance glycans play in therapeutics and disease. Indeed, interest in glycosylation has expanded rapidly over the past decade as large amounts of
experimental ‘omics data relevant to glycosylation processing have accumulated. Furthermore, new and more sophisticated glycoinformatics tools and databases are now available for glycan
and glycosylation pathway analysis. Here, we summarize some of the recent advances in both experimental profiling and analytical methods involving N- and O-linked glycosylation processing for biotechnological and medically-relevant cells together with the unique opportunities and challenges associated with interrogating and assimilating multiple disparate high-throughput glycosylation data sets. This emerging era of advanced glycomics will lead to the discovery of key glycan biomarkers linked to diseases and help establish a better understanding of physiology and improved control of glycosylation processing in diverse cells and tissues important to disease and production of recombinant therapeutics. Furthermore, methodologies that facilitate the integration of glycomics measurements together with other ‘omics data sets will lead to a deeper understanding and greater insights into the nature of glycosylation as a complex cellular process.
1. Introduction
Glycans are structurally complex carbohydrate chains found on proteins and other
molecules that play an important role in health and disease. Since abnormalities in glycosylation
are linked to cancer and many other diseases [1-5], they provide opportunities for diagnosis and treatment [5-8]. In the production of biotherapeutics, glycans are important because they can
influence the therapeutic properties of the proteins and their immune responses. Considering that
glycosylation is one of the most common post-translational modifications and that more than half of proteins undergo glycosylation [6, 7], a better knowledge of glycans offers opportunities to
improve or develop new biomarkers for cancer [8-12] and improve the glycan profiles of
biotherapeutics.
Two widely studied forms of glycosylation modifications made to proteins are N- and O-type, which are defined by the linkage between the polypeptide backbone and the glycan structure. N-linked glycans are attached to proteins at the nitrogen atom (“N”) of the amide group of an asparagine amino acid residue [13, 14]. O-linked glycans are attached to the oxygen atom (“O”) of serine (mainly) or threonine [15, 16]. Other types of polysaccharides present in cells are glycosaminoglycans (GAGs). GAGs are unbranched polysaccharides built from repeating disaccharide units, each consisting of an amino sugar and a uronic acid or galactose. Glycans are also attached to lipids in the form of glycolipids. While this paper focuses primarily on N-glycosylation, a widely studied form of protein glycosylation, many of the approaches and findings are also being applied to other types of glycosylated molecules.
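The N-linkage described above can be illustrated with a minimal sketch of how candidate N-glycosylation sites are often screened in a protein sequence. It assumes the widely used consensus sequon Asn-X-Ser/Thr (X being any residue except proline), which is a well-established motif but is not spelled out in this review; the function name and the toy peptide are hypothetical. A match marks only a potential site, and occupancy still has to be confirmed experimentally.

```python
import re

# Assumed consensus sequon: N-X-S/T with X != P.
# The lookahead lets overlapping sequons be reported as well.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of asparagines that start an N-X-S/T sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_sequence = "MKNGTLLNPSAANVSQ"  # hypothetical peptide, not from the paper
    print(find_sequons(toy_sequence))
```

The asparagine followed by proline in the toy peptide is skipped, which is the main reason for using a pattern rather than a plain search for "N".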
Since glycan patterns are exposed on cell surfaces, they are readily amenable to profiling using new high-throughput technologies [17, 18]. Indeed, advances in high-throughput
technologies significantly benefit glycobiology and allow for fast screening of cells and
generation of extensive glycomics data sets. Furthermore, the development of sophisticated analytical techniques [19-23] and data analysis tools [24-31] render increasing opportunities for
improvement of high-throughput screening for glycans as disease markers and structure
classification in therapeutic proteins. However, significant challenges remain in understanding the cellular glycosylation transformations and in using this information to develop
improved diagnostics and glycan characterization. Glycogene microarrays, lectin chips and RNA sequencing tools are widely used to analyze the whole glycogenome and the changes in the
glycosylation enzymes during pathological conditions [31-35]. In addition to these tools, recent advancements in mass spectrometry-based technologies [36] allow analysis of glycan, glycosite, glycopeptide and intact glycoproteins both at the qualitative and quantitative levels. In this review, we summarize existing glycogenomics, glycomics, glycoproteomics, and glycoinformatics tools to support analysis of glycosylation, and provide examples of these approaches that have led to a better understanding of glycosylation. Finally, we discuss some of the challenges associated with comprehensive characterization of protein glycosylation and the role that integrative glycoinformatics and systems biology tools will likely play in addressing these questions.
2. Advances in Glycogenomics
Protein glycosylation is one of the most diverse and complicated post-translational modifications due to the nature of glycans on the glycosites and their biosynthesis. In order to
better comprehend the effect of glycosylation on pathological conditions, a knowledgebase including glycan, glycopeptide, and glycosite information should be implemented. A key step
toward establishing this knowledgebase is to decipher the glycogenome to understand the genes
and enzymes involved in the glycosylation pathways of the species or pathological conditions [37].
Glycosylation levels and compositions significantly affect the functional activity and
half-life of therapeutic proteins in the circulatory system as well as the immune response of the human body. Different species are characterized by distinct glycosylation pathways as genes
expressed in some species are suppressed in others. For example, CHO cells, which are widely used for production of protein therapeutics because of glycosylation compatible with human
immune systems, often provide simpler glycosylation patterns than those from human cells. Typical pathways for N-glycosylation are shown in Figure 1 for CHO cells and Figure 2 for
human cells. A complete CHO glycogenome analysis was performed when the CHO genome was sequenced in 2011 [38]. Of the 300 glycosylation genes in human, only three (UDP-N-acetylglucosamine transferase ALG13 and sulfotransferases CHST7 and CHST13) lacked homologs in the CHO genome. However, RNA sequencing showed expression of only half of the predicted glycan synthesis and degradation genes. Statistical analysis showed that sulfotransferases, fucosyltransferases and N-acetylgalactosamine (GalNAc) transferases were significantly depleted relative to the other genes, with a p-value below 0.06. Some other glycogenes repressed in CHO-K1 cells include the bisecting GlcNAc transferase III (GnTIII, Mgat3), α(1,2), α(1,3) and α(1,4)-linked fucosyltransferases, and ST6Gal [38]. Subsequently, North et al., using a variety of data types including higher mass MS (up to 11,000 Daltons), MS/MS and GC/MS, have shown the presence of some bisecting GlcNAc
structures in Pro−5 wild type CHO [39]. Unpublished results using mass spectra modeling and linkage analysis from our group also showed that the Mgat3 gene for generating bisecting
GlcNAc is not completely silent in the Pro−5 line of CHO cells. Because certain CHO cell lines
may exhibit activation of the Mgat3 gene, which provides GnTIII activity, we included the Mgat3 gene in Table 1, which summarizes the main enzymes and genes in the N-glycosylation
pathway of CHO cells. Similarly, Table 2 is included to indicate the main glycosylation
enzymes and genes for human cells.
In order to correlate glycan dynamics to glycosyltransferase expression profiles, glycogene
microarrays, including glycosidases, glycosyltransferases and sugar transporters, as well as lectin chips, are widely used to understand the mechanism of action for glycans in disease states. For
instance, to investigate the role of glycans in tumor metastasis, two cell lines showing high metastatic potential (HCCLM3) and low metastatic potential (Hep3B) were
compared. Genes such as ST3GalI, FUT8, β3GalT5, MGAT3 and MGAT5, which play a role in glycolipid, N-glycan, and sialyl Lewis antigen biosynthesis, were differentially expressed [40]. In a recent experiment, a knockout screening of pivotal glycosyltransferases for CHO glycosylation control was conducted [41]. The approach used genome editing of these CHO cells to design glycoproteins with specific engineered glycoform profiles [41]. Carbohydrate groups are known to be essential mediators of cellular and molecular interactions. In summary, genomics, transcriptomics, and glycogene microarrays can be applied and coupled with other tools such as mass spectrometry-based glycan data to discover novel glyco-biomarkers, as described in later sections.
3. Advances in Glycoproteomics
Glycoproteomics is a field that evaluates glycosylated proteins and their glycosylation sites [42]. It usually involves glycoprotein enrichment of the samples of healthy and/or disease
states that can be compared to find differentially expressed glycoproteins potentially playing
important roles in certain diseases or disease states. Such an approach requires sophisticated comparative proteomics methods, advanced mass spectrometry techniques, and powerful
bioinformatics tools to identify biomarkers for early prediction of diseases that can eventually be
used for disease prognosis. Label-free quantification [43], stable isotope labeling by amino acids in cell culture (SILAC) [44], isobaric tags for relative and absolute quantitation (iTRAQ) [45] and tandem mass tags (TMT) [46] are some of the methods used to interpret the differentially expressed proteins between samples for biomarker and target discovery.
Hydrazide chemistry and solid phase extraction of glycosylated peptides (SPEG) provide means to identify and quantify N-linked glycoproteins. In this method, a protein mixture is
equilibrated with hydrazide resin, which binds to carbohydrate moieties on the glycoproteins upon oxidation; the captured N-linked glycopeptides are then enzymatically released by PNGaseF and analyzed by LC-MS [47]. Using this technique, Yang et al. evaluated the differences in glycoproteomic profiles of dyssynchronous heart failure and cardiac resynchronization therapy [48]. Relative changes in the level of glycoproteins were determined by iTRAQ and verified by label-free LC-MS. The levels of several glycoproteins reverted to normal after therapy. This matters because glycoprotein changes that respond to therapy are of great interest as candidates with prognostic power. Tian et al. [14] used the SPEG technique to enrich the glycoproteins from ovarian tumors and adjacent normal ovary tissues. The enriched glycopeptides from the normal ovary, clear-cell carcinoma, high-grade endometrioid carcinoma, high-grade serous carcinoma, low-grade endometrioid carcinoma, low-grade serous carcinoma, mucinous carcinoma, and
transitional carcinoma samples were labeled with iTRAQ reagents and relative protein quantitation was performed based on iTRAQ labeling and MS/MS spectra. It was possible to
identify both the proteins showing differential expression in ovarian tumors versus normal
tissues, as well as uniquely overexpressed proteins specific to each ovarian tumor. Further, western blot analysis supported the proteomics results and showed elevated levels of
carcinoembryonic antigen-related cell adhesion molecules 5 and 6 (CEA5 and CEA6) in ovarian
mucinous carcinoma. The same technique has been used to identify underlying immunological activation resulting from HIV infection, HIV elite suppressors, and antiretroviral therapy [49].
These findings revealed that HIV elite suppressors significantly affected the immunologically relevant glycoproteins as a consequence of antiviral immunity.
Using sialoglycoproteome enrichment and isotope labeling methods, differentially expressed proteins in breast cancer were identified. Further western blot and lectin analyses
confirmed that versican is one of the most highly differentially expressed sialoglycoproteins in breast cancer [13]. In addition, coupling sialoglycoprotein enrichment methods with selected reaction monitoring (SRM) techniques indicated the upregulation of sialylated prostate specific antigen (PSA) in prostate cancer tissues [50]. Organ-specific glycosylated and sialylated proteins, such as PSA, can be used to increase the accuracy of cancer prognosis and diagnosis. Hydrazide chemistry, lectins, multilectin affinity chromatography, and metabolic incorporation of sugar analogs for glycoprotein isolation are all commonly used methods for biomarker and target discovery aimed at early detection and therapy of different cancer types [12].
4. Advances in Glycomics
Glycans are critical biomolecules for a number of diseases including cancer [7], immune disorders [51], cardiovascular disease [48], and HIV [52]; they also play major roles in
the potency of monoclonal and bispecific antibodies. However, due to the technical challenges in the
structural analysis of glycans, global glycomics has been hampered. A variety of analytical platforms, including capillary electrophoresis and liquid chromatography, are widely used after
derivatizing the glycans with permethylation or carbodiimide coupling. Introduction of
fluorescence tags such as 2-aminobenzamide enabled the quantitation of glycans with a fluorescence detector. Recent advancements in mass spectrometry technologies have enabled the
determination of glycan composition and quantitation, which can also include solid phase immobilization and isobaric labeling [53]. Figure 3 shows an overview of common glycan
analysis methods that are often used including lectin microarrays, UPLC, and LC/MS/MS. Isobaric tag approaches, such as TMT and iTRAQ, have been frequently used for peptide
quantification to discover biomarkers or targets in various medical conditions. However, there has been limited success in glycan quantification with the use of isobaric tags, such as aminoxyTMT and iART [54], due to their tertiary amine structure. A novel mass spectrometry-based technology called quaternary amine-containing isobaric tag for glycans (QUANTITY) [54] was developed recently to achieve complete labeling of the glycans and to enhance the reporter ion intensity upon MS2 fragmentation. Four-plex QUANTITY reagents include a reactive glycan conjugation site, a balancer that compensates for molecular mass changes, and a reporter that generates reporter ions ranging from 176 to 179 Daltons upon MS2 fragmentation for quantification purposes. Recently, the QUANTITY labeling approach was coupled with solid phase immobilization techniques for glycomic comparison of CHO cells engineered with
glycosyltransferases. As shown in Figure 4, a number of samples from different tissues or cells can be denatured and immobilized on the AminoLink resin. To stabilize the sialic acid groups, p-toluidine can be used with carbodiimide coupling reagent and PNGaseF can be used to release the
N-glycans from the solid support. Next, the aldehyde group of the N-acetylglucosamine (GlcNAc) at the reducing end of glycans from each sample can be labeled with QUANTITY
followed by an analysis with LC/MS/MS. A global proteomics analysis can also be conducted by
performing on-bead digestion [54].
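The reporter-ion readout described above lends itself to a short illustration. The following is a minimal sketch, not the published QUANTITY data-processing pipeline: it assumes the four channels can be addressed by their nominal reporter masses (176 to 179, as stated in the text), and the function name and intensity values are hypothetical. A real workflow would extract reporter ions at accurate mass and correct for isotopic impurities before normalizing.

```python
# Nominal four-plex reporter masses taken from the text (176 to 179 Da).
REPORTER_MZ = (176, 177, 178, 179)


def relative_abundance(intensities):
    """Normalize the reporter-ion intensities of one glycan so channels sum to 1.

    `intensities` maps a nominal reporter m/z to its MS2 intensity;
    channels that are missing are treated as zero.
    """
    values = {mz: float(intensities.get(mz, 0.0)) for mz in REPORTER_MZ}
    total = sum(values.values())
    if total == 0:
        raise ValueError("no reporter-ion signal for this glycan")
    return {mz: value / total for mz, value in values.items()}


if __name__ == "__main__":
    # Hypothetical MS2 reporter intensities for a single glycan precursor.
    example = {176: 100.0, 177: 200.0, 178: 300.0, 179: 400.0}
    print(relative_abundance(example))
```

Because all four samples share one precursor and one MS2 spectrum, the per-channel fractions can be compared directly across samples for that glycan.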
Site-specific glycan occupancy and alterations in glycoproteins are also significantly
important in pathological conditions. Until recently, glycosites, glycopeptides, and glycans have been studied separately due to difficulties with simultaneous analysis. A method called solid
phase extraction of N-linked glycans and glycosite-containing peptides (NGAG) [55] was developed for comprehensive analysis of glycans, glycosites, and glycopeptides from complex
samples. This method was applied to a single protein, bovine fetuin, and to a complex cell line (OVCAR-3) for evaluation. In this method, the peptides were immobilized on an aldehyde-functionalized solid support. PNGaseF and Asp-N digestions allowed for release of the N-glycans and N-glycopeptides, respectively. After the mass spectrometry analysis, a sample-specific intact glycopeptide database was established containing all possible glycosites and glycans. At the same time, intact glycopeptides from OVCAR-3 cells were isolated and run by MS. The glycan oxonium ions were used as a signature to pick the spectra for intact glycopeptides. These spectra were mapped to the OVCAR-3 glycosylation-specific database using GPQuest software [55,56].
Due to the absence of a known endoglycosidase, it is challenging to study O-glycomics. Recently, however, some analytical methods have been developed to analyze and quantify O-glycans. A microwave (MW)-assisted β-elimination procedure in the presence of pyrazolone analogues (BEP) was optimized to analyze the O-glycans from cells, tissues, serum and FFPE tissues [57]. In a variety of human disorders, especially in tumors, aberrant expression of the truncated O-glycan Tn (GalNAcα1-Ser/Thr) and its sialylated version sialyl-Tn (STn) (Neu5Acα2,6GalNAcα1-Ser/Thr) has been demonstrated. For this reason both Tn and STn are known tumor carbohydrate markers [58].
5. Advances in Glycoinformatics
The availability of high-throughput technologies has led to huge increases in the amounts of glycosylation data. Unfortunately, the development of high-throughput glycan analysis
workflows and glycoinformatics tools has been limited by the diversity of glycans and the
complexity of the glycosylation processing. Therefore, resources that provide an integrated
approach for analysis are still under development. Current high-throughput processing of glycomics data offers opportunities to organize, analyze, and integrate experimental data in order to obtain valuable insights. Consequently, high-throughput processing methodologies with automated pipelines and extensive glycoinformatics support and infrastructure are just as important as the experimental tools for glycosylation analysis. Toward that end, computational methods, databases, and tools are being developed, and a collection of glycoinformatics platforms is publicly available [26, 27, 59-62]. Some of these platforms serve as repositories of glycan structures, glycogenes, enzymes, and experimental glycan data. Other platforms permit analysis of glycans from diverse perspectives and interpret different types of experimental data. For example, there are publicly available databases that provide glycoproteomic entries and tools to predict glycosylation sites. UniPep lists 9651 peptides derived from 6027 proteins that form a representative library of various tissues mapped to theoretical glycopeptides (www.unipep.org).
Building on previously successful databases such as GlycoSuiteDB [63] and EUROCarbDB [27], the UniCarb KnowledgeBase (UniCarbKB; http://unicarbkb.org) was established to provide open access to a curated database of glycan structures of glycoproteins and to integrate functional and experimental data collections [64]. Hosted by Expasy, GlycoMod predicts the possible oligosaccharide structures that occur on proteins from their experimentally determined masses (http://web.expasy.org/glycomod/), whereas SugarBind provides information on known carbohydrate sequences to which pathogenic organisms bind (http://sugarbind.expasy.org/).
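The idea of predicting possible oligosaccharide compositions from a measured mass can be sketched as a brute-force enumeration. This is an illustration of the general principle only and not GlycoMod's actual algorithm; it assumes a neutral, underivatized free glycan built from four residue types, uses standard monoisotopic residue masses, and the function name, tolerance, and count limit are hypothetical choices.

```python
from itertools import product

# Monoisotopic residue masses in Da (monosaccharide minus water).
RESIDUE_MASS = {
    "Hex": 162.05282,     # e.g. mannose, galactose
    "HexNAc": 203.07937,  # e.g. GlcNAc, GalNAc
    "dHex": 146.05791,    # e.g. fucose
    "NeuAc": 291.09542,   # N-acetylneuraminic acid
}
WATER = 18.01056  # added once for the free reducing end


def candidate_compositions(observed_mass, tol=0.02, max_count=10):
    """List residue compositions whose neutral mass is within `tol` Da."""
    names = list(RESIDUE_MASS)
    hits = []
    for counts in product(range(max_count + 1), repeat=len(names)):
        mass = WATER + sum(n * RESIDUE_MASS[r] for n, r in zip(counts, names))
        if abs(mass - observed_mass) <= tol:
            hits.append(dict(zip(names, counts)))
    return hits


if __name__ == "__main__":
    # 1234.4334 Da corresponds to Hex5HexNAc2 under the assumptions above.
    for composition in candidate_compositions(1234.4334):
        print(composition)
```

A mass alone yields compositions rather than structures; several isomeric structures usually share one composition, which is why the text speaks of "possible" oligosaccharide structures.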
One challenge is the lack of consensus on a common computer code for database exchange that represents structures for monosaccharide residues and glycan sequences [65] since
codes for graphical representation of glycans already exist. Many initiatives, such as the Kyoto
Encyclopedia of Genes and Genomes (KEGG) [66,67] and the Consortium for Functional
Glycomics (CFG) [59], developed unique structural code representations for glycans and independent databases to store them. This has made combining information from more than one database highly challenging. While there has been agreement toward the sequence format GLYDE-II as a general exchange format for glycan structures [68], many of the databases and glycomics tools still use unique sequence codes. GlycomeDB [24, 69] is a single resource database for glycan structures that was established as an effort to integrate entries from the major databases including CFG, GLYCOSCIENCES.de, KEGG, CarbBank and the Protein DataBank (PDB). GlycomeDB stores glycan structures in the sequence format GlycoCT [70]. In addition, GlycomeDB maintains the biological source information from the integrated databases and the reference IDs of the original database entries. Databases not yet integrated into GlycomeDB include UniCarb-DB (http://unicarb-db.biomedicine.gu.se/) and UniCarbKB (http://unicarbkb.org/). A glycan registry (GlyTouCan, https://glytoucan.org/) was created as a
new effort not only to integrate the structures from GlycomeDB and other databases but also to allow users to register their own structures. Table 3 describes the main databases that store
carbohydrate structures, experimental data, and many other resources.
Table 4 highlights glycoinformatics resources and tools for glycan analysis and
interpretation including web applications, stand-alone applications, and web-based resource sites. A very important platform for the registration and discovery of glycoinformatics tools is the
GlycomicsPortal (http://glycomics.ccrc.uga.edu/GlycomicsPortal). This web-based search engine
is regularly updated and currently stores 35 databases, 39 web services, 33 software tools, and a workflow. Most of the tools mentioned in Tables 3 and 4 of this publication and other
publications are registered in the GlycomicsPortal, expediting the search for glycobiology
resources. RINGS [71] is another website for resources, which provides algorithmic and data
mining tools. Additional resource tools for glycan analysis and interpretation are included in Table 4, with their corresponding literature references. Considering the massive collection of data and resources, a description of all the glycoinformatics tools is beyond the scope of this review. Resources and applications for molecular modeling of glycan structures are described in additional references [21, 72-74].
6. Systems Glycobiology and Integration of ‘Omics Datasets
Integrative glycoinformatics and systems glycobiology developments based on a holistic understanding of the complex glycosylation process and the relations among its components can provide a more complete analysis that is based not solely on glycan annotation but also on other aspects, such as enzymatic levels, glycan abundances, biosynthetic pathways, and complementary ‘omics datasets (Figure 5). However, most of the available glycoinformatics tools do not consider the integration and the complex relationships among the different
components of the glycosylation process (e.g., enzymes, glycans, sugar nucleotides, transporters), in which the glycan structures are defined as a result of the action of many
enzymes. Instead, these tools are based primarily on standard bioinformatics approaches mostly
developed for proteomics and genomics, which have limitations in their application to glycomics. These tools may not work properly for glycans, given that glycans are not directly
encoded in the genome and differ from proteins in that they are assembled from the
interconnected action of several enzymes. For that reason and because of the adoption of traditional bioinformatics approaches that do not consider the complexity of glycosylation,
methods and tools for analysis of glycans have lagged and most glycoinformatics tools available are specialized for the analysis of one type of data [75-78]. For example, a common approach for
mass spectrometry-based glycoprofiling involves a one-to-one database matching of particular MS measurements to specific glycans from a known glycan library in order to annotate the
individual peaks of the mass spectrum separately [79,80]. Methods that consider the complexity of glycosylation would require that all glycan structures used for the annotation of the spectrum be generated by the enzymatic machinery of the studied organism [81]. This alignment will ensure consistency among enzyme activities and those structures assigned to each peak in the same spectrum.
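The contrast between peak-by-peak database matching and enzyme-consistent annotation can be made concrete with a small sketch. It is illustrative only and is not the method of references [79-81]: the library, the peak list, and the set of "reachable" structures are hypothetical toy data, and a real tool would derive the reachable set from a model of the organism's glycosylation enzymes rather than from a hand-written set.

```python
def annotate_peaks(peaks, library, ppm=20.0):
    """Match each observed m/z independently against a glycan mass library.

    Returns a dict mapping every peak to the sorted names of all library
    entries within the given ppm tolerance (possibly none, possibly several).
    """
    annotations = {}
    for mz in peaks:
        window = mz * ppm / 1e6
        annotations[mz] = sorted(
            name for name, mass in library.items() if abs(mass - mz) <= window
        )
    return annotations


def restrict_to_reachable(annotations, reachable):
    """Drop candidates that the assumed enzyme repertoire cannot produce."""
    return {
        mz: [name for name in names if name in reachable]
        for mz, names in annotations.items()
    }


if __name__ == "__main__":
    toy_library = {"glycan_A": 1000.00, "glycan_B": 1000.01, "glycan_C": 1500.00}
    toy_peaks = [1000.005, 1500.0]
    matched = annotate_peaks(toy_peaks, toy_library)
    # Suppose a pathway model says glycan_B cannot be made by this cell line.
    print(restrict_to_reachable(matched, {"glycan_A", "glycan_C"}))
```

The first function treats every peak in isolation, so the ambiguous peak keeps two candidates; the second step applies one shared constraint across the whole spectrum, which is the kind of consistency the paragraph above argues for.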
Current bioinformatics techniques that attempt to integrate diverse data sets are still in early development [23, 82, 83] despite substantial progress on several ‘omics’ fronts. For example, statistical database-driven approaches to relate gene expression levels to the abundance of specific glycan linkages did not provide quantitative predictions of detailed glycan distributions [82, 83]. This reflects the need for integrative glycoinformatics and systems tools to identify glycan structural data and also to link these with gene expression data of glycosylation
enzymes that produce these glycan structures. Mathematical modeling of glycosylation may represent a promising approach to start understanding how mRNA levels relate to the actual
amount and distribution of glycans found within healthy or diseased cells [23, 30, 81].
An approach that considers data integration could be highly effective to reduce variability (false positives and negatives) in the analytical high-throughput experimental platform. Results
obtained with integrative glycoinformatics and systems glycobiology tools that are confirmed by
different experimental data (e.g., glycogenes expression and mass spectra profile) will increase confidence in predictions and recommendations for biomarkers. Moreover, integrated
glycoinformatics tools enable the analysis and comparison of multiple studies with multiple platforms, which can reveal limitations in analytical sensitivities.
One approach has been to develop and implement tools for optimal analysis and effective validation of mass spectra and gene expression data sets to understand cancer glycosylation. A
comprehensive simulation framework was implemented [23, 30] to integrate information across a broad spectrum of mass spectral data for leukemia cell types as compared to normal cell types [30]. The method was also used to integrate mass spectral data and mRNA datasets to identify glycan patterns associated with prostate cancer types [23]. These two examples provide a deeper understanding of the interrelation among the complex cellular processes leading to changes in cancer glycosylation. In both cases, a glycosylation model for mammalian cells that uses MALDI TOF mass spectra has been developed to completely characterize a measured glycan mass spectrum in terms of a relatively small number of enzyme activities [30]. Automatic annotation of the mass spectrum in terms of glycan structures is produced, with every peak being assigned a full range of alternative glycan structures and abundances as shown in the MS profile at the top of Figure
6. The method was initially applied to mass spectral data of normal human monocytes and monocytic leukemia cells, and it provided insights into the relevant glycosylation pathways that
differentiate normal and diseased cells [30]. This is an important advance toward integrative
systems glycobiology that connects the enzymatic activities of the glycosyltransferases, the complete bioprocessing pathways, and the resulting glycan structures.
A more advanced implementation of this method involves the integration of mass spectral
and gene expression data as illustrated in Figure 6. The power of this novel method was demonstrated and applied to low- and high-passage Lymph node carcinoma of the prostate
(LNCaP) cancer cells, which correspond to androgen-dependent and the more metastatic androgen-independent cell stages [23]. The novel method identified and quantified glycan
(Figure 6).
TE
D
structural details not typically derived from single-stage mass spectral or gene expression data Differences between the cell types uncovered include increases in the more
metastatic androgen-independent cells of H type II and Lewis-y glycan structures characteristic of blood groups, together with a correspondingly greater activity of a fucosyltransferase (FUT1). The model further elucidated limitations in the two analytical platforms, including a defect in the microarray for detecting the GnTV (MGAT5) enzyme. The results demonstrate the potential of integrative systems glycobiology tools for elucidating key glycan biomarkers and potential therapeutic targets along with specifying limitations in the analytical platforms. The integration of multiple data sets demonstrates how a systems biology approach can provide a better understanding of complex cellular processes and lead to the elucidation of glycan signatures representative of potential biomarkers for cancer and other diseases.
Other pioneering studies have combined glycan analysis with multiple ‘omics analytical tools to gain insights into glycosylation differences in human populations and disease states as
14
ACCEPTED MANUSCRIPT well as the genetic control of these changes [84].
In one approach, a genome-wide association study (GWAS) was combined with high-throughput HPLC analysis of plasma proteins of 2,705 individuals to reveal polymorphisms in the fucosyltransferase genes FUT6 and FUT8 as well as hepatocyte nuclear factor 1α (HNF1α) [85]. Furthermore, HNF1α and HNF4α were found to regulate expression of fucosyltransferase and fucose biosynthetic genes. The analysis was then extended to 3,533 individuals to identify polymorphisms in the genes for N-acetylglucosamine transferase (MGAT5), glucuronyltransferase (B3GAT1) and the proton pump SLC9A9 based on up to 45 glycan traits in the plasma glycome of tested individuals [86]. Other efforts involve relating the epigenome with the glycome [85]. In one study, HNF1α silencing through methylation of CpG sites was associated with changes in the plasma glycome of a population of 810 individuals [87]. Another study showed that global changes in the DNA methylation of ovarian cancer epithelial cells (OVCAR3) can cause differential alterations in glycan structures, such as reduced core fucosylation, increased branching, and enhanced sialylation. These changes were related to alterations in the expression of the GMDS and FX genes involved in fucose biosynthesis and the expression of MGAT5, which affects the branching and sialylation of secreted glycans [88]. These studies demonstrate how genomic and epigenetic analysis, when combined with glycan structural analysis, can yield powerful insights into the role of genes and pathways in glycosylation, metastasis and cancer progression [88]. Another study profiled both N- and O-glycan structures and the corresponding glycosylation machinery genes isolated from mouse embryonic stem cells (ES), embryoid bodies (EBs) and extraembryonic endodermal cells (ExE). By using pathway mapping, researchers showed a significant correlation between the transcriptional regulation and glycan expression. Increased polysialylation and α-Gal termination were observed in the differentiated cell types
whereas α-Gal-capped glycans were more abundant in ExE cells [89]. Another integration study mapped miRNA regulators onto the glycan biosynthetic pathways by incorporating glycomics data. By using lectin microarrays together with mimics and inhibitors of certain groups of miRNAs, microRNA regulators of high mannose, fucose and β-GalNAc networks were determined [90]. Researchers have also profiled N-glycan and glycogene expression in the epithelial-to-mesenchymal transition (EMT) process, which occurs following transformation of the normal mouse mammary gland epithelial (NMuMG) cell model induced by transforming growth factor-β1 (TGFβ1). Using a systems glycobiology approach, the effects of TGFβ induction on the levels of high-mannose, antennary and bisecting GlcNAc N-glycans as well as fucosylation were demonstrated. Fucosylation and bisecting GlcNAc glycans were significantly decreased while high-mannose type N-glycans were increased [31]. As a result, the integration of glycomics approaches together with other ‘omics tools has led to a much greater understanding of the control of the expression of glycosylation genes at the genomic, transcriptomic, and epigenetic levels and its impact on the glycan profiles of populations and in cellular transformations and disease states.
7. Biomarkers and Glycan-based Diseases
The discovery of glycan-based biomarkers requires efficient extraction and analysis of glycan structures. Indeed, the identification of appropriate glycan-based biomarkers of different cancer types is challenging and has been hindered by a number of factors, notably: specific glycans may not be present in current databases; cancer-associated glycans may be at low levels; a collective pattern of glycan structures rather than a specific glycan may be more representative of the cancer state; there is no unified and standard format for glycan notation; and novel algorithms are required to handle glycan complexity. Most importantly, differences in glycan profiles between cancer and normal cells may involve subtle differences in amounts rather than on and off changes.
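The point about subtle quantitative differences can be illustrated with a minimal sketch. It assumes each glycan profile is available as a mapping from a glycan label to a raw signal intensity; the labels, the intensities, and the pseudocount are hypothetical and chosen only for illustration. Each profile is scaled to relative abundances and compared by log2 ratio, so that graded shifts remain visible rather than only presence or absence.

```python
import math

def relative_abundance(profile):
    """Scale raw intensities so that they sum to 1 within one profile."""
    total = sum(profile.values())
    return {glycan: value / total for glycan, value in profile.items()}

def log2_fold_changes(normal, cancer, pseudocount=1e-6):
    """Return log2(cancer / normal) of relative abundance for every glycan
    seen in either profile. The pseudocount (an assumption of this sketch)
    avoids division by zero for glycans absent from one profile."""
    normal_rel = relative_abundance(normal)
    cancer_rel = relative_abundance(cancer)
    glycans = sorted(set(normal_rel) | set(cancer_rel))
    return {
        g: math.log2((cancer_rel.get(g, 0.0) + pseudocount)
                     / (normal_rel.get(g, 0.0) + pseudocount))
        for g in glycans
    }

# Hypothetical intensities, for illustration only.
normal = {"Man5": 50.0, "FA2G2": 30.0, "FA2G2S1": 20.0}
cancer = {"Man5": 50.0, "FA2G2": 20.0, "FA2G2S1": 30.0}
changes = log2_fold_changes(normal, cancer)
```

In this toy case no glycan appears or disappears; only the relative amounts of two structures shift, which is the kind of change a presence/absence comparison would miss.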
Fortunately, advances in ‘omics tools including RNA sequencing and mass spectrometry-based technologies have initiated a better understanding of the glycogenome, glycoproteome, and glycan changes in cells or pathological conditions. In addition to cancer, immunological and cardiovascular disorders, congenital disorders of glycosylation (CDGs) are one of the largest classes of diseases affected by glycosyltransferases and other glycogenes [91]. Understanding the glycogenome is a first key step towards identifying glycan or glycoprotein markers. Various sequencing technologies or arrays, including whole exome sequencing (WES) [92], have been used to identify functionally deleterious mutations. A website called GlyMAP (http://glymap.glycomics.ku.dk) was constructed to provide a global map of glycogenome genetic stability [91]. In addition, glycoproteomics is also widely used for finding novel biomarkers from serum or tissues. Lectin affinity chromatography, hydrazide chemistry, titanium dioxide affinity chromatography, reductive amination chemistry, and boronic acid chemistry are the most widely used enrichment techniques for finding glycosites and glycopeptides [93]. By coupling these enrichment techniques with MS labeling technologies and global proteomics approaches, novel glycoprotein markers can be found and validated [94]. Examples of some studied glycan biomarkers in cancer are given in Table 5. In addition, the subtle differences that may exist between cells and tissues may be addressed through the systems biology integration described in the previous section. Differences in glycan profiles and in enzyme expression profiles offer the potential to be used as valuable biomarkers by indicating a transformation from a normal to a cancerous state [30] or from a less malignant to a more malignant cancer state [23]. One goal involves obtaining new epitopes based on differential glycan patterns observed between normal and diseased cells and tissues. Algorithms based on machine learning methods [95-98], frequent sub-tree mining [97, 99, 100],
and mathematical modeling [23, 30] have been developed to predict glycan biomarkers or glycan biomarker patterns. Some of these algorithms can directly compare two glycan profiles, each consisting of thousands of structures, to find those substructures or combinations of substructures that most definitively characterize the differences between the two profiles.
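As a rough illustration of substructure-based profile comparison, the sketch below counts the smallest possible substructures, parent-child residue pairs, weighted by glycan abundance, and ranks them by how much they differ between two profiles. This is a deliberately simplified stand-in and not the tree-kernel, q-gram, or frequent sub-tree mining algorithms cited above; the nested-tuple glycan encoding, the toy structures (linkage positions omitted), and the abundances are all assumptions made for the example.

```python
from collections import Counter

# Assumed encoding: a glycan is (residue_name, [child glycans]).

def parent_child_pairs(glycan):
    """Yield every 'parent>child' residue pair in a glycan tree."""
    residue, children = glycan
    for child in children:
        yield f"{residue}>{child[0]}"
        yield from parent_child_pairs(child)

def weighted_pair_counts(profile):
    """Sum abundance-weighted pair counts over a profile,
    given as a list of (glycan, abundance) tuples."""
    counts = Counter()
    for glycan, abundance in profile:
        for pair in parent_child_pairs(glycan):
            counts[pair] += abundance
    return counts

def rank_differences(profile_a, profile_b):
    """Rank pairs by absolute difference (b minus a) in weighted count."""
    a = weighted_pair_counts(profile_a)
    b = weighted_pair_counts(profile_b)
    diffs = {pair: b[pair] - a[pair] for pair in set(a) | set(b)}
    return sorted(diffs.items(), key=lambda item: (-abs(item[1]), item[0]))

# Hypothetical, heavily simplified structures and abundances.
core = ("GlcNAc", [("GlcNAc", [("Man", [])])])
fucosylated = ("GlcNAc", [("Fuc", []), ("GlcNAc", [("Man", [])])])
profile_a = [(core, 0.8), (fucosylated, 0.2)]
profile_b = [(core, 0.4), (fucosylated, 0.6)]
ranked = rank_differences(profile_a, profile_b)
```

Here the top-ranked pair is the core fucose attachment, since it is the only substructure whose weighted frequency changes between the two toy profiles. Real methods consider larger subtrees and linkage information and assess statistical significance.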
In conclusion, glycomics represents a challenging but highly promising field of study for gaining new insights in biomarker discovery and biotherapeutics development. Glycomics serves as one of the key initial tools to dig deeper and gain a better understanding of cellular physiology, particularly when used in concert with genomics, epigenomics, transcriptomics, and proteomics data. An emerging opportunity in processing diverse glycosylation data sets is the development of adequate tools that allow an integrative analysis perspective to interrogate, interpret, and gain insights from diverse ‘omics data sets. While important progress is being achieved on this end, future advances will focus on the implementation of integrative glycoinformatics and systems glycobiology approaches to help characterize glycosylation and the impact of changes or differences at the genetic, transcriptional, epigenetic, and translational level. This work will require the development of methods to integrate disparate datasets, perform network analysis, visualize glycosylation processing differences, and create unified computational platforms.
Funding
This work was supported by the National Cancer Institute: Awards 5R41CA127885-02 and R01CA112314, the Consortium of Functional Glycomics: Bridging Grant Number U54GM062116-10, and NIH/NIGMS funding the National Center for Glycomics and Glycoproteomics (8P41GM103490). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute and the Consortium of Functional Glycomics.
Conflict of interest
The authors have declared no conflict of interest.
References
1. Brockhausen, I., Mucin-type O-glycans in human colon and breast cancer: glycodynamics and functions. Embo Reports, 2006. 7(6): p. 599-604.
2. Brockhausen, I., J. Schutzbach, and W. Kuhns, Glycoproteins and their relationship to human disease. Acta Anatomica, 1998. 161(1-4): p. 36-78.
3. Hakomori, S., Glycosylation defining cancer malignancy: new wine in an old bottle. Proc Natl Acad Sci USA, 2002. 99(16): p. 10231-10233.
4. Kim, Y.J. and A. Varki, Perspectives on the significance of altered glycosylation of glycoproteins in cancer. Glycoconj J, 1997. 14(5): p. 569-576.
5. Buskas, T., P. Thompson, and G.J. Boons, Immunotherapy for cancer: synthetic carbohydrate-based vaccines. Chem Commun (Camb), 2009(36): p. 5335-5349.
6. Tong, L., et al., Glycosylation changes as markers for the diagnosis and treatment of human disease. Biotechnol Gen Eng Rev, 2003. 20: p. 199-244.
7. Adamczyk, B., T. Tharmalingam, and P.M. Rudd, Glycans as cancer biomarkers. Biochim Biophys Acta, 2012. 1820(9): p. 1347-1353.
8. Fuster, M.M. and J.D. Esko, The sweet and sour of cancer: Glycans as novel therapeutic targets. Nat Rev Cancer, 2005. 5(7): p. 526-542.
9. Furukawa, K. and A. Kobata, Protein glycosylation. Curr Opin Biotechnol, 1992. 3(5): p. 554-559.
10. Apweiler, R., H. Hermjakob, and N. Sharon, On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochim Biophys Acta, 1999. 1473(1): p. 4-8.
11. Arnold, J.N., et al., Novel glycan biomarkers for the detection of lung cancer. J Prot Res, 2011. 10(4): p. 1755-1764.
12. Tian, Y. and H. Zhang, Characterization of disease-associated N-linked glycoproteins. Proteomics, 2013. 13(3-4): p. 504-511.
13. Tian, Y., et al., Altered expression of sialylated glycoproteins in breast cancer using hydrazide chemistry and mass spectrometry. Mol Cell Prot, 2012. 11(6): p. M111011403.
14. Tian, Y., et al., Identification of glycoproteins associated with different histological subtypes of ovarian tumors using quantitative glycoproteomics. Proteomics, 2011. 11(24): p. 4677-4687.
15. Rakus, J.F. and L.K. Mahal, New technologies for glycomic analysis: toward a systematic understanding of the glycome. Ann Rev Anal Chem, 2011. 4: p. 367-392.
16. Ito, H., et al., Strategy for glycoproteomics: identification of glyco-alteration using multiple glycan profiling tools. J Prot Res, 2009. 8(3): p. 1358-1367.
17. Tousi, F., et al., Technologies and strategies for glycoproteomics and glycomics and their application to clinical biomarker research. Anal Met, 2011. 3(1): p. 195-203.
18. Zhang, Y., H. Yin, and H. Lu, Recent progress in quantitative glycoproteomics. Glycoconj J, 2012. 29(5-6): p. 249-258.
19. Ito, S., K. Hayama, and J. Hirabayashi, Enrichment strategies for glycopeptides. Met Mol Biol, 2009. 534: p. 195-203.
20. Hua, S. and H.J. An, Glycoscience aids in biomarker discovery. BMB Rep, 2012. 45(6): p. 323-330.
21. Furukawa, J., N. Fujitani, and Y. Shinohara, Recent advances in cellular glycomic analyses. Biomolecules, 2013. 3: p. 198-225.
22. GlycomicsPortal, http://glycomics.ccrc.uga.edu/GlycomicsPortal.
23. Bennun, S.V., et al., Integration of the transcriptome and glycome for identification of glycan cell signatures. PLoS Comput Biol, 2013. 9(1): e1002813. doi:10.1371/journal.pcbi.1002813.
24. Ranzinger, R., et al., GlycomeDB-a unified database for carbohydrate structures. Nuc Acids Res, 2011. 39: p. D373-D376.
25. Taniguchi, N. and J.C. Paulson, Frontiers in glycomics: bioinformatics and biomarkers in disease. Proteomics, 2007. 7(9): p. 1360-1363.
26. Raman, R., et al., Glycomics: an integrated systems approach to structure-function relationships of glycans. Nat Met, 2005. 2(11): p. 817-824.
27. von der Lieth, C.W., et al., EUROCarbDB: An open-access platform for glycoinformatics. Glycobiol, 2011. 21(4): p. 493-502.
28. Akune, Y., et al., The RINGS resource for glycome informatics analysis and data mining on the Web. Omics, 2010. 14(4): p. 475-486.
29. Lutteke, T., et al., GLYCOSCIENCES.de: an internet portal to support glycomics and glycobiology research. Glycobiol, 2006. 16(5): p. 71R-81R.
30. Krambeck, F.J., et al., A mathematical model to derive N-glycan structures and cellular enzyme activities from mass spectrometric data. Glycobiol, 2009. 19(11): p. 1163-1175.
31. Tan, Z., et al., Altered N-glycan expression profile in epithelial-to-mesenchymal transition of NMuMG cells revealed by an integrated strategy using mass spectrometry and glycogene and lectin microarray analysis. J Proteome Res, 2014. 13(6): p. 2783-2795.
32. Comelli, E.M., et al., A focused microarray approach to functional glycomics: transcriptional regulation of the glycome. Glycobiol, 2006. 16(2): p. 117-131.
33. Hirabayashi, J., Lectin-based structural glycomics: glycoproteomics and glycan profiling. Glycoconj J, 2004. 21(1-2): p. 35-40.
34. Kaji, H., et al., Lectin affinity capture, isotope-coded tagging and mass spectrometry to identify N-linked glycoproteins. Nat Biotechnol, 2003. 21(6): p. 667-672.
35. Nairn, A.V., et al., Regulation of glycan structures in animal tissues: transcript profiling of glycan-related genes. J Biol Chem, 2008. 283(25): p. 17298-17313.
36. Yang, S. and H. Zhang, Glycomic analysis of glycans released from glycoproteins using chemical immobilization and mass spectrometry. Curr Protoc Chem Biol, 2014. 6(3): p. 191-208.
37. Mickum, M., et al., Deciphering the glycogenome of schistosomes. Front Genet, 2014. 5: p. 262.
38. Xu, X., et al., The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat Biotechnol, 2011. 29: p. 735-741.
39. North, S.J., et al., Glycomics profiling of Chinese hamster ovary cell glycosylation mutants reveals N-glycans of a novel size and complexity. J Biol Chem, 2010. 285(8): p. 5759-5775.
40. Kang, X., et al., Glycan-related gene expression signatures in human metastatic hepatocellular carcinoma cells. Exp Ther Med, 2012. 3(3): p. 415-422.
41. Yang, Z., et al., Engineered CHO cells for production of diverse, homogeneous glycoproteins. Nat Biotech, 2015. 33(8): p. 842-844.
42. Tian, Y. and H. Zhang, Glycoproteomics and clinical applications. Proteomics, 2010. 4(2): p. 124-132.
43. Megger, D.A., T. Bracht, H.E. Meyer, and B. Sitek, Label-free quantification in clinical proteomics. Biochim Biophys Acta, 2013. 1834(8): p. 1581-1590.
44. Kashyap, M.K., et al., SILAC-based quantitative proteomic approach to identify potential biomarkers from the esophageal squamous cell carcinoma secretome. Cancer Biol Ther, 2010. 10(8): p. 796-810.
45. Tian, Y., G.S. Bova, and H. Zhang, Quantitative glycoproteomic analysis of optimal cutting temperature-embedded frozen tissues identifying glycoproteins associated with aggressive prostate cancer. Anal Chem, 2011. 83(18): p. 7013-7019.
46. Raso, C., et al., Characterization of breast cancer interstitial fluids by TMT labeling, LTQ-Orbitrap velos mass spectrometry, and pathway analysis. J Prot Res, 2012. 11(6): p. 3199-3210.
47. Zhang, H., X.J. Li, D.B. Martin, and R. Aebersold, Identification and quantification of N-linked glycoproteins using hydrazide chemistry, stable isotope labeling and mass spectrometry. Nat Biotechnol, 2003. 21(6): p. 660-666.
48. Yang, S., et al., Glycoproteins identified from heart failure and treatment models. Proteomics, 2015. 15(2-3): p. 567-579.
49. Yang, W., et al., Glycoproteomic study reveals altered plasma proteins associated with HIV elite suppressors. Theranostics, 2014. 4(12): p. 1153-1163.
50. Li, Y., et al., Simultaneous analysis of glycosylated and sialylated prostate-specific antigen revealing differential distribution of glycosylated prostate-specific antigen isoforms in prostate cancer tissues. Anal Chem, 2011. 83(1): p. 240-245.
51. Ząbczyńska, M. and E. Pocheć, The role of protein glycosylation in immune system. Postepy Biochem, 2015. 61(2): p. 129-137.
52. Garces, F., et al., Affinity maturation of a potent family of HIV antibodies is primarily focused on accommodating or avoiding glycans. Immunity, 2015. 43(6): p. 1053-1063.
53. Yang, S., A. Rubin, S.T. Eshghi, and H. Zhang, Chemoenzymatic method for glycomics: isolation, identification, and quantitation. Proteomics, 2016. 16(2): p. 241-256.
54. Yang, S., et al., QUANTITY: an isobaric tag for quantitative glycomics. Sci Rep, 2015. 5: p. 17585.
55. Sun, S., et al., Comprehensive analysis of protein glycosylation by solid-phase extraction of N-linked glycans and glycosite-containing peptides. Nat Biotechnol, 2016. 34(1): p. 84-88.
56. Eshghi, S., et al., GPQuest: a spectral library matching algorithm for site-specific assignment of tandem mass spectra to intact N-glycopeptides. Anal Chem, 2015. 87(10): p. 5181-5188.
57. Furukawa, J., et al., Quantitative O-glycomics by microwave-assisted β-elimination in the presence of pyrazolone analogues. Anal Chem, 2015. 87(15): p. 7524-7528.
58. Ju, T., et al., Tn and sialyl-Tn antigens, aberrant O-glycomics as human disease markers. Proteomics Clin Appl, 2013. 7(9-10): p. 618-631.
59. CFG. The Consortium for Functional Glycomics. http://www.functionalglycomics.org.
60. Raman, R., et al., Advancing glycomics: implementation strategies at the consortium for functional glycomics. Glycobiol, 2006. 16(5): p. 82R-90R.
61. Yoshida, K., A. Suzuki, and N. Taniguchi, Japan consortium for glycobiology and glycotechnology; toward establishment of international network and systems glycobiology. Prot Nuc Acid Enzyme, 2004. 49(15 Suppl): p. 2313-2318.
62. Hayes, C.A., et al., UniCarb-DB: a database resource for glycomic discovery. Bioinform, 2011. 27(9): p. 1343-1344.
63. Cooper, C.A., et al., GlycoSuiteDB: a new curated relational database of glycoprotein glycan structures and their biological sources. Nuc Acids Res, 2001. 29(1): p. 332-335.
64. Campbell, M.P., et al., UniCarbKB: building a knowledgebase platform for glycoproteomics. Nuc Acids Res, 2014. 42: p. D215-D221.
65. Ranzinger, R. and W.S. York, Glyco-Bioinformatics today (August 2011) – solutions and problems. http://www.beilstein-institut.de/glycobioinf2011/Proceedings, 2011.
66. Hashimoto, K., et al., KEGG as a glycome informatics resource. Glycobiol, 2006. 16(5): p. 63R-70R.
67. Kanehisa, M., et al., The KEGG resource for deciphering the genome. Nuc Acids Res, 2004. 32: p. D277-D280.
68. Packer, N.H., et al., Frontiers in glycomics: bioinformatics and biomarkers in disease. An NIH white paper prepared from discussions by the focus groups at a workshop on the NIH campus, Bethesda MD (September 11-13, 2006). Proteomics, 2008. 8(1): p. 8-20.
69. Ranzinger, R., et al., GlycomeDB - integration of open-access carbohydrate structure databases. BMC Bioinform, 2008. 9: p. 384.
70. Herget, S., et al., GlycoCT-a unifying sequence format for carbohydrates. Carbohydr Res, 2008. 343(12): p. 2162-2171.
71. Akune, Y., M. Hosoda, S. Kaiya, D. Shinmachi, and K.F. Aoki-Kinoshita, The RINGS resource for glycome informatics analysis and data mining on the web. OMICS, 2010. 14(4): p. 475-486.
72. DeMarco, M.L. and R.J. Woods, Structural glycobiology: a game of snakes and ladders. Glycobiol, 2008. 18(6): p. 426-440.
73. von der Lieth, C.W., et al., Bioinformatics for glycomics: status, methods, requirements and perspectives. Brief Bioinform, 2004. 5(2): p. 164-178.
74. Frank, M. and S. Schloissnig, Bioinformatics and molecular modeling in glycobiology. Cell Molecular Life Sci, 2010. 67(16): p. 2749-2772.
75. Ceroni, A., et al., GlycoWorkbench: a tool for the computer-assisted annotation of mass spectra of glycans. J Prot Res, 2008. 7(4): p. 1650-1659.
76. Maass, K., et al., "Glyco-peakfinder" - de novo composition analysis of glycoconjugates. Proteomics, 2007. 7(24): p. 4435-4444.
77. Goldberg, D., et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra. Proteomics, 2005. 5(4): p. 865-875.
78. Kawano, S., et al., Prediction of glycan structures from DNA microarray data. Glycobiol, 2004. 14(11): p. 1204-1204.
79. Packer, N.H., et al., Frontiers in glycomics: bioinformatics and biomarkers in disease - An NIH White Paper prepared from discussions by the focus groups at a workshop on the NIH campus, Bethesda MD (September 11-13, 2006). Proteomics, 2008. 8(1): p. 8-20.
80. Joshi, H.J., et al., Development of a mass fingerprinting tool for automated interpretation of oligosaccharide fragmentation data. Proteomics, 2004. 4(6): p. 1650-1664.
81. Krambeck, F.J. and M.J. Betenbaugh, A mathematical model of N-linked glycosylation. Biotechnol Bioeng, 2005. 92(6): p. 711-728.
82. Kawano, S., et al., Prediction of glycan structures from gene expression data based on glycosyltransferase reactions. Bioinform, 2005. 21(21): p. 3976-3982.
83. Suga, A., et al., An improved scoring scheme for predicting glycan structures from gene expression data. Genome Informatics, 2007. 18: p. 237-246.
84. Zoldos, V., T. Horvat, and G. Lauc, Glycomics meets genomics, epigenomics and other high throughput omics for system biology studies. Curr Opin Chem Biol, 2013. 17(1): p. 33-40.
85. Lauc, G., et al., Genomics meet glycomics – the first GWAS study of human N-glycome identifies HNF1α as a master regulator of plasma protein fucosylation. PLoS Genet, 2010. 6(12): e1001256.
86. Huffman, J.E., et al., Polymorphisms in B3GAT1, SLC9A9 and MGAT5 are associated with variation within the human plasma N-glycome. Hum Mol Genet, 2011. 20: p. 5000-5011.
87. Zoldos, V., et al., Epigenetic silencing of HNF1A associates with changes in the composition of the human plasma N-glycome. Epigenetics, 2012. 7(2): p. 164-172.
88. Saldova, R., et al., 5-AZA-2’-deoxycytidine induced demethylation influences N-glycosylation of secreted glycoproteins in ovarian cancer. Epigenetics, 2011. 6(11): p. 1362-1372.
89. Nairn, A.V., et al., Combined transcript profiling glycan-related genes and glycan structural analysis. J Biol Chem, 2012. 287(45): p. 37835-37856.
90. Agrawal, P., et al., Mapping posttranscriptional regulation of the human glycome uncovers microRNA defining the glycocode. Proc Natl Acad Sci, 2014. 111(11): p. 4338-4343.
91. Hansen, L., et al., A glycogene mutation map for discovery of diseases of glycosylation. Glycobiol, 2015. 25(2): p. 211-224.
92. Rabbani, B., M. Tekin, and N. Mahdieh, The promise of whole-exome sequencing in medical genetics. J Human Genet, 2014. 59: p. 5-15.
93. Zhang, Y., J. Jiao, P. Yang, and H. Lu, Mass spectrometry-based N-glycoproteomics for cancer biomarker discovery. Clin Proteomics, 2014. 11(18): doi:10.1186/1559-0275-11-18.
94. Ahn, J.M., et al., Integrated glycoproteomics demonstrates fucosylated serum paraoxonase 1 alterations in small cell lung cancer. Mol Cell Proteomics, 2014. 13(1): p. 30-48.
95. Hizukuri, Y., et al., Extraction of leukemia specific glycan motifs in humans by computational glycomics. Carbohydr Res, 2005. 340(14): p. 2270-2278.
96. Kuboyama, T., et al., A gram distribution kernel applied to glycan classification and motif extraction. Gen Inform, 2006. 17(2): p. 25-34.
97. Yamanishi, Y., F. Bach, and J.P. Vert, Glycan classification with tree kernels. Bioinform, 2007. 23(10): p. 1211-1216.
98. Li, L., et al., A weighted q-gram method for glycan structure classification. BMC Bioinform, 2010. 11 Suppl 1: p. S33.
99. Aoki-Kinoshita, K.F., Mining frequent subtrees in glycan data using the RINGS glycan miner tool. Met Molecular Biol, 2013. 939: p. 87-95.
100. Hashimoto, K., et al., Mining significant tree patterns in carbohydrate sugar chains. Bioinform, 2008. 24(16): p. i167-i173.
101. Doubet, S., et al., The Complex Carbohydrate Structure Database. Trends in Biochemical Sciences, 1989. 14(12): p. 475-477.
102. van Kuik, J.A., K. Hard, and J.F. Vliegenthart, A 1H NMR database computer program for the analysis of the primary structure of complex carbohydrates. Carbohydr Res, 1992. 235: p. 53-68.
103. Schomburg, I., et al., BRENDA, the enzyme database: updates and major new developments. Nucleic acids research, 2004. 32(Database issue): p. D431-3.
104. Coutinho, P.M., et al., An evolving hierarchical family classification for glycosyltransferases. Journal of Molecular Biology, 2003. 328(2): p. 307-317.
105. Tomiya, N., et al., Analyses of N-Linked Oligosaccharides Using a Two-Dimensional Mapping Technique. Analytical Biochemistry, 1988. 171(1): p. 73-90.
106. Bairoch, A., The ENZYME database in 2000. Nucleic acids research, 2000. 28(1): p. 304-5.
107. Yoshida, K., A. Suzuki, and N. Taniguchi, [Japan consortium for glycobiology and glycotechnology; toward establishment of international network and systems glycobiology]. Tanpakushitsu kakusan koso. Protein, nucleic acid, enzyme, 2004. 49(15 Suppl): p. 2313-8.
108. Campbell, M.P., et al., GlycoBase and autoGU: tools for HPLC-based glycan analysis. Bioinformatics, 2008. 24(9): p. 1214-1216.
109. Cooper, C.A., E. Gasteiger, and N.H. Packer, GlycoMod--a software tool for determining glycosylation compositions from mass spectrometric data. Proteomics, 2001. 1(2): p. 340-9.
110. Damodaran, D., et al., CancerLectinDB: a database of lectins relevant to cancer. Glycoconjugate Journal, 2008. 25(3): p. 191-198.
111. Goldman, R., et al., Detection of hepatocellular carcinoma using glycomic analysis. Clin Cancer Res, 2009. 15(5): p. 1808-1813.
112. Arnold, J.N., et al., Novel glycan biomarkers for the detection of lung cancer. J Proteome Res, 2011. 10(4): p. 1755-1764.
113. Kronewitter, S.R., et al., The glycolyzer: automated glycan annotation software for high performance mass spectrometry and its application to ovarian cancer glycan biomarker discovery. Proteomics, 2012. 12(15-16): p. 2523-2538.
Figure and Table Legends
Figure 1. Simplified view of principal N-glycosylation pathways of CHO cells.
Figure 2. Simplified view of principal N-glycosylation pathways in human cells.
Figure 3. Schematic overview of several glycan analysis methods used in the literature including lectin microarrays, UPLC, and LC/MS/MS.
Figure 4. Quantitative glycome analysis using solid phase immobilization and quaternary amine-containing isobaric tag for glycan labeling.
Figure 5. Emerging systems glycobiology paradigm in which multiple glycomics data sets are combined with other ‘omics inputs using glycoinformatics in order to elucidate insights about glycosylation processing from multiple data sets.
Figure 6. Integration of glycomics information from glycosylation models with transcriptomics data for elucidating biomarkers of prostate cancer LNCaP cells. This integrative glycoinformatics approach allows the simultaneous analysis of gene expression (pink) and mass spectra (blue) data using a glycosylation model to elucidate glycan substructures and enzyme/gene levels that differentiate prostate cancer types. Increases in the abundances of H type II and Lewis Y glycans in more malignant androgen-independent prostate cancer tumor cells versus less malignant cells were detected and show agreement for both MS (dark blue vs. light blue) and gene expression data (dark pink vs. light pink).
Table 1. Main enzymes and corresponding genes for N-Glycosylation pathway of CHO cells.
Table 2. Main enzymes and corresponding genes for N-Glycosylation pathway of human cells.
Table 3. Listing of some of the Web-based glycan resources.
Table 4. Directory of main links available for some glycan analysis databases and tools.
Table 5. Examples of documented glycan biomarkers, associated diseases, method of detection and reference source.
Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Table 1
NCBI Gene ID | Enzyme | Enzyme name | EC number | Glycogenes
--- | --- | --- | --- | ---
130074 | ManI | α2-mannosidase I | 3.2.1.113 |
4124, 4122 | ManII | α3/6-mannosidase II | 3.2.1.114 | MAN2A1, MAN2A2
2530 | a6FucT | α6-Fuc-transferase | 2.4.1.68 | FUT8
4245 | GnTI | β2-GlcNAc-transferase I | 2.4.1.101 | MGAT1
4247 | GnTII | β2-GlcNAc-transferase II | 2.4.1.143 | MGAT2
4248 | GnTIII | β4-GlcNAc-transferase III | 2.4.1.144 | MGAT3
11320, 11282 | GnTIV | β4-GlcNAc-transferase IV | 2.4.1.145 | MGAT4A, MGAT4B
4249, 146664 | GnTV | β6-GlcNAc-transferase V | 2.4.1.155 | MGAT5, MGAT5B
10678, 10331, 79369 | iGnT | Blood group i β3-GlcNAc-transferase | 2.4.1.149 | B3GNT2, B3GNT3, B3GNT4
2683, 8704, 8703, 9334 | b4GalT | β4-Gal-transferase | 2.4.1.38 | B4GALT1, B4GALT2, B4GALT3, B4GALT5
2526, 2527, 2528, 2529, 10690 | a3FucT | α3-Fuc-transferase | 2.4.1.152 | FUT4, FUT5, FUT6, FUT7, FUT9
6487, 6484, 10402 | a3SiaT | α3-Sialyltransferase | 2.4.99.6 | ST3GAL3, ST3GAL4, ST3GAL6
Table 2
NCBI Gene ID | Enzyme | Enzyme name | EC number | Glycogenes
--- | --- | --- | --- | ---
130074 | ManI | α2-mannosidase I | 3.2.1.113 |
4124, 4122 | ManII | α3/6-mannosidase II | 3.2.1.114 | MAN2A1, MAN2A2
2530 | a6FucT | α6-Fuc-transferase | 2.4.1.68 | FUT8
4245 | GnTI | β2-GlcNAc-transferase I | 2.4.1.101 | MGAT1
4247 | GnTII | β2-GlcNAc-transferase II | 2.4.1.143 | MGAT2
4248 | GnTIII | β4-GlcNAc-transferase III | 2.4.1.144 | MGAT3
11320, 11282 | GnTIV | β4-GlcNAc-transferase IV | 2.4.1.145 | MGAT4A, MGAT4B
4249, 146664 | GnTV | β6-GlcNAc-transferase V | 2.4.1.155 | MGAT5, MGAT5B
10678, 10331, 79369 | iGnT | Blood group i β3-GlcNAc-transferase | 2.4.1.149 | B3GNT2, B3GNT3, B3GNT4
2683, 8704, 8703, 9334 | b4GalT | β4-Gal-transferase | 2.4.1.38 | B4GALT1, B4GALT2, B4GALT3, B4GALT5
2526, 2527, 2528, 2529, 10690 | a3FucT | α3-Fuc-transferase | 2.4.1.152 | FUT4, FUT5, FUT6, FUT7, FUT9
6487, 6484, 10402 | a3SiaT | α3-Sialyltransferase | 2.4.99.6 | ST3GAL3, ST3GAL4, ST3GAL6
2651 | IGnT | Blood group I β6-GlcNAc-transferase | 2.4.1.150 | GCNT2
6480 | a6SiaT | α6-sialyltransferase | 2.4.99.1 | ST6GAL1
8708, 8707, 10317 | b3GalT | β3-Gal-transferase | | B3GALT1, B3GALT2, B3GALT5
2525 | FucTLe | α3/4-Fuc-transferase III | 2.4.1.65 | FUT3
2523, 2524 | FucTH | α2-Fuc-transferase, Se, H | 2.4.1.69 | FUT1, FUT2
28 | GalNAcT-A | Blood group A α3-GalNAc-transferase | 2.4.1.40 | ABO
28 | GalT-B | Blood group B α3-Gal-transferase | 2.4.1.37 | ABO
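When transcript data are related to enzyme activities, an enzyme-to-gene mapping such as the one in Tables 1 and 2 is typically needed, because several enzymes are encoded by more than one gene. The sketch below encodes a subset of that mapping and aggregates gene-level expression values per enzyme. Summing the expression of isoenzyme genes is an assumption made here for illustration, not a method stated in the text, and the expression values are hypothetical.

```python
# Subset of the enzyme-to-gene mapping listed in Tables 1 and 2.
ENZYME_GENES = {
    "ManII": ["MAN2A1", "MAN2A2"],
    "a6FucT": ["FUT8"],
    "GnTI": ["MGAT1"],
    "GnTV": ["MGAT5", "MGAT5B"],
    "a3SiaT": ["ST3GAL3", "ST3GAL4", "ST3GAL6"],
}

def enzyme_expression(gene_expression, enzyme_genes=ENZYME_GENES):
    """Aggregate gene-level expression into one value per enzyme.

    Assumption of this sketch: expression of genes encoding the same
    enzyme activity is summed. Genes absent from the input are treated
    as not measured; an enzyme with no measured gene maps to None.
    """
    totals = {}
    for enzyme, genes in enzyme_genes.items():
        measured = [gene_expression[g] for g in genes if g in gene_expression]
        totals[enzyme] = sum(measured) if measured else None
    return totals

# Hypothetical expression values, for illustration only.
sample = {"MAN2A1": 10.0, "MAN2A2": 5.0, "FUT8": 7.5, "MGAT5": 2.0}
totals = enzyme_expression(sample)
```

Returning None for enzymes with no measured gene keeps "not measured" distinct from "measured at zero", which matters when a platform fails to detect a particular transcript.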
Table 3
Glycan Databases | Description | Web address
--- | --- | ---
CarbBank (CCSD), the Complex Carbohydrate Structure Database [101] | CarbBank was the first publicly available glycan structure database, with approximately 50,000 entries extracted from the literature. The database contains glycan structures, publications, biological source information, information about the experimental technique, and data about the attached glycan. | http://glycomics.ccrc.uga.edu/GlycomicsPortal/showEntry.action?id=46
Consortium for Functional Glycomics [26, 60] | The CFG glycan database includes structures from CFG's mammalian glycan array and structures extracted from other CFG databases (CarbBank and GlycoMinds Ltd). Glycan structures are linked with the CFG glycan binding protein database, which includes gene and protein sequences, biological functions, and binding specificities based on the glycan array data. CFG also provides a reagent bank, a glycosyltransferase database, knock-out mouse phenotype data, glycan profile data from MALDI-TOF mass spectra, and enzyme transcriptomic data for various tissues and cells. | http://www.functionalglycomics.org/
EUROCarbDB [27] | EUROCarbDB is a repository of carbohydrate structures and experimental evidence. It provides glycoinformatics tools and databases for interpreting and storing glycan structures and glycan experimental data, including tandem MS, HPLC and NMR. EUROCarbDB also includes a suite of tools: GlycoPeakFinder, AutoGU and Gasper. Other efforts and spin-offs are GlycoWorkbench, MonosaccharideDB, GLYCOSCIENCES.de, and GlycomicsPortal. | http://relax.organ.su.se/eurocarb/home.action
GlycomeDB [24, 69] | A carbohydrate structure metadatabase in which carbohydrate sequences from all freely available databases (CFG, KEGG, GLYCOSCIENCES.de, BCSDB and CarbBank) have been translated to GlycoCT and integrated into a new database (GlycomeDB) containing all structures and annotations. GlycomeDB provides cross-references to the integrated databases. | http://www.glycome-db.org
GLYCOSCIENCES.de [29] | The site provides databases and bioinformatics tools for glycobiology and glycomics. Glycan structure data are mainly extracted from CarbBank, the PDB (Protein DataBank) and literature research. It includes biological source information, NMR data and literature references. A glycan 3D structure generation tool [47] is integrated with the structure database. | http://www.glycosciences.de
GlycoSuiteDB [63] | Contains literature-based glycan data that have been experimentally verified. The current database has been integrated into the UniCarbKB database (www.unicarbkb.org). | http://unicarbkb.org/
JCGGDB, Japan Consortium for Glycobiology and Glycotechnology database [61] | A portal to search across major bioinformatics databases in Japan. Integrated resources include tumor marker references, glycoepitope databases, glycodisease gene databases, experiment support, glycan profile data from mass spectra, lectin array data, glycoprotein data, glycogene data and tools for glycan structure analysis. | http://jcggdb.jp/
KEGG, Kyoto Encyclopedia of Genes [66, 67] | The KEGG glycan structure database contains entries from experimentally determined glycan structures, from CarbBank, from publications, and structures linked with the genes and pathways provided by other KEGG resources. Many resources are available at KEGG, including glycan structures, glycosyltransferases, glycogene information, glycan binding protein data, glycosyltransferase and glycosylhydrolase reactions, and glycosylation pathways. | http://www.genome.jp/kegg/glycan/
SUGABASE [102] | SUGABASE is a carbohydrate-NMR database that combines CarbBank Complex Carbohydrate Structure Data (CCSD) with proton and carbon chemical shift values. | http://glycomics.ccrc.uga.edu/GlycomicsPortal/showEntry.action?id=48
UniCarb-DB [62] | Database with LC-MS/MS data of glycan structures. For each structure, the database provides the biological source, LC-MS/MS data, and publications. | http://unicarb-db.biomedicine.gu.se/
33
ACCEPTED MANUSCRIPT A curated resource with experimental data, linked with glyco-related UniProtKB data. In addition to glycan structures, the database http://www.unicarbkb.org/ provides biological source information and publications.
AC CE P
TE
D
MA
NU
SC
RI
PT
UniCarbKB [64]
34
Table 4

Resource: BRENDA [103]
Content: Repository of comprehensive enzyme information.
Web address: http://www.brenda-enzymes.info/

Resource: CAZY [104]
Content: Database for display and analysis of genomic, structural, and biochemical information on Carbohydrate-Active enzymes.
Web address: http://www.cazy.org/

Resource: Elution Coordinate Database [105]
Content: Contains more than 400 pyridylaminated N-glycan structures, their code numbers, and their elution positions on Shimpack CLC-ODS and Amide-80 columns. The elution coordinate database of this site is necessary for the 2-D/3-D sugar mapping techniques.
Web address: http://www.gak.co.jp/ECD/Hpg_eng/hpg_eng.htm

Resource: Enzyme [106]
Content: Repository of information relative to the nomenclature of enzymes.
Web address: http://enzyme.expasy.org/

Resource: GGDB, Glycogene database [107]
Content: Database containing information for the analysis of glycogenes, including genes associated with glycan synthesis, transport, and nucleotide synthesis. First database to store information on substrate specificity. Genes are stored in XML format.
Web address: http://riodb.ibase.aist.go.jp/rcmg/ggdb/

Resource: GlycoBase [108]
Content: GlycoBase 3.2.4 is an HPLC/UPLC resource of elution positions for N-linked and O-linked glycan structures determined by a combination of HPLC, UPLC, exoglycosidase sequencing and mass spectrometry (MALDI-MS, ESI-MS, ESI-MS/MS, LC-MS, LC-ESI-MS/MS). It provides glycoinformatics and visualization functionalities for the data mining of glycans. AutoGU, an application that assists the interpretation and assignment of HPLC glycan profiles, was integrated into the database.
Web address: https://glycobase.nibrt.ie/glycobase/about.action

Resource: GlycoMod [109]
Content: A tool that can differentiate oligosaccharide structures on proteins from their experimentally determined masses. It can be used for free or derivatized oligosaccharides and for glycopeptides.
Web address: http://web.expasy.org/glycomod/

Resource: Glyco-Peakfinder [76]
Content: A tool for determining glycan compositions from their mass signals, for rapid annotation of MS and MSn data with different types of ions.
Web address: http://www.glyco-peakfinder.org/

Resource: GlycoPep ID
Content: A tool for identifying the peptide moiety of glycopeptides generated using a nonspecific enzyme.
Web address: http://hexose.chem.ku.edu/predictiontable.php

Resource: GlycoWorkbench [75]
Content: GlycoWorkbench is a standalone suite of software tools for rapid drawing of glycan structures and for assisting the process of structure determination from MS data and annotation using structures from other databases.
Web address: http://code.google.com/p/glycoworkbench/

Resource: LectinDB [110]
Content: This database provides structure and sequence information on plant lectins. The database has been completely manually annotated. Lectin data are integrated into a common framework together with analytical tools and extensive links. Data for each lectin cover taxonomic, biochemical, domain architecture, molecular sequence, and structural details, as well as carbohydrate and blood group specificities.
Web address: http://proline.physics.iisc.ernet.in/lectindb/

Resource: RINGS [28]
Content: Web resource site with algorithmic and data mining tools for glycan structures.
Web address: http://rings.t.soka.ac.jp
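Several of the tools listed in Table 4 (for example GlycoMod and Glyco-Peakfinder) are described as assigning glycan compositions from measured masses. The sketch below illustrates that general idea with a brute-force search over monosaccharide counts. It is a minimal illustration under stated assumptions, not the algorithm used by any of these tools: the function names, the four-residue alphabet, the mass tolerance and the count limit are all hypothetical choices, and the search assumes a free, underivatized, neutral glycan (monoisotopic residue masses plus one water).

```python
from itertools import product

# Monoisotopic residue masses (Da) of common monosaccharides.
RESIDUE_MASS = {
    "Hex": 162.05282,     # hexose, e.g. mannose or galactose
    "HexNAc": 203.07937,  # N-acetylhexosamine, e.g. GlcNAc
    "dHex": 146.05791,    # deoxyhexose, e.g. fucose
    "NeuAc": 291.09542,   # N-acetylneuraminic acid
}
WATER = 18.01056  # one water is added for the free reducing end


def composition_mass(comp):
    """Neutral monoisotopic mass of a free glycan with the given residue counts."""
    return WATER + sum(RESIDUE_MASS[name] * n for name, n in comp.items())


def find_compositions(target_mass, tol=0.01, max_count=12):
    """Return all compositions whose mass lies within tol (Da) of target_mass.

    Hypothetical helper: exhaustively tries 0..max_count copies of each residue.
    """
    names = list(RESIDUE_MASS)
    hits = []
    for counts in product(range(max_count + 1), repeat=len(names)):
        comp = dict(zip(names, counts))
        if abs(composition_mass(comp) - target_mass) <= tol:
            # Report only the residues that are actually present.
            hits.append({name: n for name, n in comp.items() if n})
    return hits


if __name__ == "__main__":
    # Neutral mass of a high-mannose glycan with five hexoses and two HexNAc.
    for hit in find_compositions(1234.4334):
        print(hit)
```

A mass alone constrains only the composition, not the linkage or branching, which is why such assignments are usually combined with fragmentation data, exoglycosidase sequencing or database lookups of the kind listed in Tables 3 and 4.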
Table 5

Disease: Breast cancer
Method used: HILIC-HPLC-2AB labeling
Glycan biomarker: High sialylation of antenna fucosylated glycans
Reference: [88]

Disease: Hepatocellular carcinoma
Method used: MALDI-TOF on permethylated glycans
Glycan biomarker: Decrease in tri- and tetraantennary glycan
Reference: [111]

Disease: Lung cancer
Method used: HILIC and WAX-HPLC
Glycan biomarker: Increase in SLeX, mono-antennary, tri-sialylated N-glycans
Reference: [112]

Disease: Ovarian cancer
Method used: MALDI-FTICR-MS of glycans
Glycan biomarker: Increase in sialylated glycans and decrease in neutral and high mannose type glycans
Reference: [113]
Graphical abstract