Systems Glycobiology: Integrating Glycogenomics, Glycoproteomics, Glycomics, and Other ‘Omics Data Sets to Characterize Cellular Glycosylation Processes


Sandra V. Bennun, Deniz Baycin Hizal, Kelley Heffner, Ozge Can, Hui Zhang, Michael J. Betenbaugh

PII: S0022-2836(16)30250-9
DOI: 10.1016/j.jmb.2016.07.005
Reference: YJMBI 65143

To appear in: Journal of Molecular Biology

Received date: 6 February 2016
Revised date: 5 July 2016
Accepted date: 7 July 2016

Please cite this article as: Bennun, S.V., Hizal, D.B., Heffner, K., Can, O., Zhang, H. & Betenbaugh, M.J., Systems Glycobiology: Integrating glycogenomics, glycoproteomics, glycomics and other ‘omics data sets to characterize cellular glycosylation processes, Journal of Molecular Biology (2016), doi: 10.1016/j.jmb.2016.07.005

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT


Sandra V. Bennun1,4, Deniz Baycin Hizal1, Kelley Heffner1, Ozge Can2, Hui Zhang3, Michael J. Betenbaugh1*

1 Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, Maryland, USA
2 Department of Medical Engineering, Acibadem University, Istanbul, Turkey
3 Johns Hopkins University School of Medicine, Maryland, USA
4 Current Address: Regeneron Pharmaceuticals, Tarrytown, New York, USA

*To whom correspondence should be addressed: [email protected]


Keywords: Glycoinformatics, Systems Biology, N-glycosylation, Glycans, Chinese hamster ovary, Automatic glycan annotation, Cancer Biomarkers

Abstract

The number of proteins encoded in the human genome has been estimated at between 20,000 and 25,000, despite estimates that the entire proteome contains more than a million proteins. One reason for this difference is the many protein post-translational modifications that contribute to proteome complexity. Among them, glycosylation is of particular relevance because it serves to modify a large number of cellular proteins. Glycogenomics, glycoproteomics, glycomics and glycoinformatics are helping to accelerate our understanding of the cellular events involved in generating the glycoproteome, the variety of glycan structures possible, and the importance glycans play in therapeutics and disease. Indeed, interest in glycosylation has expanded rapidly over the past decade as large amounts of experimental ‘omics data relevant to glycosylation processing have accumulated. Furthermore, new and more sophisticated glycoinformatics tools and databases are now available for glycan and glycosylation pathway analysis. Here, we summarize some of the recent advances in both experimental profiling and analytical methods involving N- and O-linked glycosylation processing for biotechnological and medically relevant cells, together with the unique opportunities and challenges associated with interrogating and assimilating multiple disparate high-throughput glycosylation data sets. This emerging era of advanced glycomics will lead to the discovery of key glycan biomarkers linked to diseases and help establish a better understanding of physiology and improved control of glycosylation processing in diverse cells and tissues important to disease and production of recombinant therapeutics. Furthermore, methodologies that facilitate the integration of glycomics measurements with other ‘omics data sets will lead to a deeper understanding of glycosylation as a complex cellular process.

1. Introduction

Glycans are structurally complex carbohydrate chains found on proteins and other molecules that play an important role in health and disease. Since abnormalities in glycosylation are linked to cancer and many other diseases [1-5], they provide opportunities for diagnosis and treatment [5-8]. In the production of biotherapeutics, glycans are important because they can influence the therapeutic properties of the proteins and their immune responses. Considering that glycosylation is one of the most common post-translational modifications and that more than half of proteins undergo glycosylation [6, 7], a better knowledge of glycans offers opportunities to improve or develop new biomarkers for cancer [8-12] and to improve the glycan profiles of biotherapeutics.

Two widely studied forms of glycosylation modifications made to proteins are N- and O-type, which are defined by the linkage between the polypeptide backbone and the glycan structure. N-linked glycans are attached to proteins at the nitrogen atom (“N”) of the amide group of an asparagine amino acid residue [13, 14]. O-linked glycans are attached to the oxygen atom (“O”) of serine (mainly) or threonine [15, 16]. Other types of polysaccharides present in cells are glycosaminoglycans (GAGs). GAGs are unbranched polysaccharides built from repeated disaccharide units consisting of an amino sugar with a uronic sugar or galactose. Glycans are also attached to lipids in the form of glycolipids. While this paper focuses primarily on N-glycosylation, a widely studied form of protein glycosylation, many of the approaches and findings are also being applied to other types of glycosylated molecules.

Since glycan patterns are exposed on cell surfaces, they are readily amenable to profiling using new high-throughput technologies [17, 18]. Indeed, advances in high-throughput technologies significantly benefit glycobiology and allow for fast screening of cells and generation of extensive glycomics data sets. Furthermore, the development of sophisticated analytical techniques [19-23] and data analysis tools [24-31] offers increasing opportunities to improve high-throughput screening for glycans as disease markers and structure classification in therapeutic proteins. However, significant challenges remain in understanding cellular glycosylation transformations and in using this information to develop improved diagnostics and glycan characterization. Glycogene microarrays, lectin chips and RNA sequencing tools are widely used to analyze the whole glycogenome and the changes in the glycosylation enzymes during pathological conditions [31-35]. In addition to these tools, recent advancements in mass spectrometry-based technologies [36] allow analysis of glycans, glycosites, glycopeptides and intact glycoproteins at both the qualitative and quantitative levels. In this review, we summarize existing glycogenomics, glycomics, glycoproteomics, and glycoinformatics tools to support analysis of glycosylation, and provide examples of these approaches that have led to a better understanding of glycosylation. Finally, we discuss some of the challenges associated with comprehensive characterization of protein glycosylation and the role that integrative glycoinformatics and systems biology tools will likely play in addressing these questions.

2. Advances in Glycogenomics

Protein glycosylation is one of the most diverse and complicated post-translational modifications due to the nature of glycans on the glycosites and their biosynthesis. To better comprehend the effect of glycosylation on pathological conditions, a knowledgebase including glycan, glycopeptide, and glycosite information should be implemented. A key step toward establishing this knowledgebase is to decipher the glycogenome to understand the genes and enzymes involved in the glycosylation pathways of the species or pathological conditions [37].

Glycosylation levels and compositions significantly affect the functional activity and half-life of therapeutic proteins in the circulatory system as well as the immune response of the human body. Different species are characterized by distinct glycosylation pathways, as genes expressed in some species are suppressed in others. For example, CHO cells, which are widely used for production of protein therapeutics because their glycosylation is compatible with human immune systems, often provide simpler glycosylation patterns than those from human cells. Typical pathways for N-glycosylation are shown in Figure 1 for CHO cells and Figure 2 for human cells. A complete CHO glycogenome analysis was performed when the CHO genome was sequenced in 2011 [38]. Of 300 human glycosylation genes, only three (UDP-N-acetylglucosamine transferase ALG13 and sulfotransferases CHST7 and CHST13) lacked homologs in the CHO genome. However, RNA sequencing showed expression of only half of the predicted glycan synthesis and degradation genes. Statistical analysis showed that sulfotransferases, fucosyltransferases and N-acetylgalactosamine (GalNAc) transferases were significantly depleted relative to the other genes, with a p-value below 0.06. Other glycogenes repressed in CHO-K1 cells include the bisecting GlcNAc transferase III (GnTIII, Mgat3), α(1,2)-, α(1,3)- and α(1,4)-linked fucosyltransferases, and ST6Gal [38]. Subsequently, North et al., using a variety of data types including higher-mass MS (up to 11,000 Daltons), MS/MS and GC/MS, showed the presence of some bisecting GlcNAc structures in Pro−5 wild-type CHO [39]. Unpublished results from our group using mass spectra modeling and linkage analysis also showed that the Mgat3 gene for generating bisecting GlcNAc is not completely silent in the Pro−5 line of CHO cells. Because certain CHO cell lines may exhibit activation of the Mgat3 gene, which provides GnTIII activity, we included the Mgat3 gene in Table 1, which summarizes the main enzymes and genes in the N-glycosylation pathway of CHO cells. Similarly, Table 2 indicates the main glycosylation enzymes and genes for human cells.

In order to correlate glycan dynamics with glycosyltransferase expression profiles, glycogene microarrays, which cover glycosidases, glycosyltransferases and sugar transporters, as well as lectin chips, are widely used to understand the mechanism of action of glycans in disease states. For instance, to examine the role of glycans in tumor metastasis, two cell lines showing high metastatic potential (HCCLM3) and low metastatic potential (Hep3B) were compared. Genes such as ST3GalI, FUT8, β3GalT5, MGAT3 and MGAT5, which play a role in glycolipid, N-glycan, and sialyl Lewis antigen biosynthesis, were differentially expressed [40]. In a recent experiment, a knockout screen of pivotal glycosyltransferases for CHO glycosylation control was conducted [41]. The approach used genome editing of these CHO cells to design glycoproteins with specific engineered glycoform profiles [41]. Carbohydrate groups are known to be essential mediators of cellular and molecular interactions. In summary, genomics, transcriptomics, and glycogene microarrays can be applied and coupled with other tools, such as mass spectrometry-based glycan data, to discover novel glyco-biomarkers, as described in later sections.

3. Advances in Glycoproteomics

Glycoproteomics is a field that evaluates glycosylated proteins and their glycosylation sites [42]. It usually involves glycoprotein enrichment from samples of healthy and/or disease states, which can then be compared to find differentially expressed glycoproteins potentially playing important roles in certain diseases or disease states. Such an approach requires sophisticated comparative proteomics methods, advanced mass spectrometry techniques, and powerful bioinformatics tools to identify biomarkers for early prediction of diseases that can eventually be used for disease prognosis. Label-free quantification [43], stable isotope labeling by amino acids in cell culture (SILAC) [44], isobaric tags for relative and absolute quantitation (iTRAQ) [45] and tandem mass tags (TMT) [46] are some of the methods used to identify differentially expressed proteins between samples for biomarker and target discovery.

Hydrazide chemistry and solid phase extraction of glycosylated peptides (SPEG) provide means to identify and quantify N-linked glycoproteins. In this method, a protein mixture is equilibrated with hydrazide resin, which binds to carbohydrate moieties on the glycoproteins upon oxidation; the captured N-linked glycopeptides are then enzymatically released by PNGase F and analyzed by LC-MS [47]. Using this technique, Yang et al. evaluated the differences in glycoproteomic profiles of dyssynchronous heart failure and cardiac resynchronization therapy [48]. Relative changes in the level of glycoproteins were determined by iTRAQ and verified by label-free LC-MS. The levels of several glycoproteins reverted to normal after therapy. This is important because identifying changes with prognostic power after therapy is of great interest. Tian et al. [14] used the SPEG technique to enrich the glycoproteins from ovarian tumors and adjacent normal ovary tissues. The enriched glycopeptides from the normal ovary, clear-cell carcinoma, high-grade endometrioid carcinoma, high-grade serous carcinoma, low-grade endometrioid carcinoma, low-grade serous carcinoma, mucinous carcinoma, and transitional carcinoma samples were labeled with iTRAQ reagents, and relative protein quantitation was performed based on iTRAQ labeling and MS/MS spectra. It was possible to identify both the proteins showing differential expression in ovarian tumors versus normal tissues and the uniquely overexpressed proteins specific to each ovarian tumor. Further, western blot analysis supported the proteomics results and showed elevated levels of carcinoembryonic antigen-related cell adhesion molecules 5 and 6 (CEA5 and CEA6) in ovarian mucinous carcinoma. The same technique has been used to identify underlying immunological activation resulting from HIV infection, HIV elite suppressors, and antiretroviral therapy [49]. These findings revealed that HIV elite suppressors significantly affected the immunologically relevant glycoproteins as a consequence of antiviral immunity.

Using sialoglycoproteome enrichment and isotope labeling methods, differentially expressed proteins in breast cancer were identified. Further western blot and lectin analyses confirmed that versican is one of the most highly differentially expressed sialoglycoproteins in breast cancer [13]. In addition, coupling sialoglycoprotein enrichment methods with selected reaction monitoring (SRM) techniques indicated the upregulation of sialylated prostate specific antigen (PSA) in prostate cancer tissues [50]. Organ-specific glycosylated and sialylated proteins, such as PSA, can be used to increase the accuracy of cancer prognosis and diagnosis. Hydrazide chemistry, lectins, multilectin affinity chromatography, and metabolic incorporation of sugar analogs for glycoprotein isolation are all commonly used methods for biomarker and target discovery aimed at early detection and therapy of different cancer types [12].

4. Advances in Glycomics

Glycans are critical biomolecules in a number of diseases including cancer [7], immune disorders [51], cardiovascular disease [48], and HIV [52]; they also play major roles in the potency of monoclonal and bispecific antibodies. However, global glycomics has been hampered by the technical challenges in the structural analysis of glycans. A variety of analytical platforms, including capillary electrophoresis and liquid chromatography, are widely used after derivatizing the glycans by permethylation or carbodiimide coupling. The introduction of fluorescence tags such as 2-aminobenzamide enabled the quantitation of glycans with a fluorescence detector. Recent advancements in mass spectrometry technologies have enabled the determination of glycan composition and quantitation, which can also include solid phase immobilization and isobaric labeling [53]. Figure 3 shows an overview of common glycan analysis methods, including lectin microarrays, UPLC, and LC/MS/MS.

Isobaric tag approaches, such as TMT and iTRAQ, have been frequently used for peptide quantification to discover biomarkers or targets in various medical conditions. However, there has been limited success in glycan quantification with the use of isobaric tags, such as aminoxyTMT and iART [54], due to their tertiary amine structure. A novel mass spectrometry-based technology called quaternary amine-containing isobaric tag for glycans (QUANTITY) [54] was developed recently to improve the complete labeling of the glycans and enhance the reporter ion intensity upon MS2 fragmentation. Four-plex QUANTITY reagents include a reactive glycan conjugation site, a balancer that compensates for molecular mass changes, and a reporter that generates reporter ions ranging from 176 to 179 Daltons upon MS2 fragmentation for quantification purposes. Recently, the QUANTITY labeling approach was coupled with solid phase immobilization techniques for glycomic comparison of CHO cells engineered with glycosyltransferases. As shown in Figure 4, a number of samples from different tissues or cells can be denatured and immobilized on the AminoLink resin. To stabilize the sialic acid groups, p-toluidine can be used with a carbodiimide coupling reagent, and PNGase F can be used to release the N-glycans from the solid support. Next, the aldehyde group of the N-acetylglucosamine (GlcNAc) at the reducing end of glycans from each sample can be labeled with QUANTITY, followed by analysis with LC/MS/MS. A global proteomics analysis can also be conducted by performing on-bead digestion [54].
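As a rough illustration of how four-plex reporter ions could be turned into relative glycan abundances, the sketch below sums the intensities found near nominal m/z 176 to 179 in one MS2 spectrum and normalizes them. The function name, the 0.5 m/z window, and the peak values are hypothetical and not taken from [54]; real workflows would use exact reporter masses and correct for isotopic impurities, which is omitted here.

```python
from typing import Dict, List, Tuple

# Nominal reporter ion m/z values for the four channels (176-179 per the text).
REPORTER_MZ = (176.0, 177.0, 178.0, 179.0)


def reporter_ratios(
    peaks: List[Tuple[float, float]], tol: float = 0.5
) -> Dict[float, float]:
    """Return each channel's fraction of the total reporter signal.

    peaks: (m/z, intensity) pairs from a single MS2 spectrum.
    tol:   half-width of the m/z window around each nominal reporter
           mass (an assumed value, chosen only for illustration).
    """
    totals = {channel: 0.0 for channel in REPORTER_MZ}
    for mz, intensity in peaks:
        for channel in REPORTER_MZ:
            if abs(mz - channel) <= tol:
                totals[channel] += intensity
                break  # a peak is credited to at most one channel
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no reporter ion signal found")
    return {channel: value / grand_total for channel, value in totals.items()}


# Made-up spectrum: four reporter peaks plus one unrelated fragment ion.
example_peaks = [
    (176.1, 100.0),
    (177.1, 200.0),
    (178.1, 300.0),
    (179.1, 400.0),
    (366.1, 999.0),
]
ratios = reporter_ratios(example_peaks)
```

Each value in `ratios` is the share of reporter signal contributed by one labeled sample, so comparing channels for the same glycan gives its relative abundance across the four samples.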

Site-specific glycan occupancy and alterations in glycoproteins are also important in pathological conditions. Until recently, glycosites, glycopeptides, and glycans have been studied separately due to difficulties with simultaneous analysis. A method called solid phase extraction of N-linked glycans and glycosite-containing peptides (NGAG) [55] was developed for comprehensive analysis of glycans, glycosites, and glycopeptides from complex samples. For evaluation, this method was applied to a single protein, bovine fetuin, and to a complex cell line (OVCAR-3). In this method, the peptides were immobilized on an aldehyde-functionalized solid support. PNGase F and Asp-N digestions allowed for release of the N-glycans and N-glycopeptides, respectively. After the mass spectrometry analysis, a sample-specific intact glycopeptide database was established containing all possible glycosites and glycans. At the same time, intact glycopeptides from OVCAR-3 cells were isolated and analyzed by MS. The glycan oxonium ions were used as a signature to pick the spectra for intact glycopeptides. These spectra were mapped to the OVCAR-3 glycosylation-specific database using GPQuest software [55, 56].

Due to the absence of a known endoglycosidase, it is challenging to study O-glycomics. Recently, however, some analytical methods have been developed to analyze and quantify O-glycans. A microwave (MW)-assisted β-elimination procedure in the presence of pyrazolone analogues (BEP) was optimized to analyze the O-glycans from cells, tissues, serum and FFPE tissues [57]. In a variety of human disorders, especially in tumors, aberrant expression of the truncated O-glycan Tn (GalNAcα1-Ser/Thr) and its sialylated version sialyl-Tn (STn) (Neu5Acα2,6GalNAcα1-Ser/Thr) has been demonstrated. For this reason, both Tn and STn are known tumor carbohydrate markers [58].

5. Advances in Glycoinformatics

The availability of high-throughput technologies has led to huge increases in the amount of glycosylation data. Unfortunately, the development of high-throughput glycan analysis workflows and glycoinformatics tools has been limited by the diversity of glycans and the complexity of glycosylation processing. Therefore, resources that provide an integrated approach for analysis are still under development. Current high-throughput processing of glycomics data offers opportunities to organize, analyze, and integrate experimental data in order to obtain valuable insights. Consequently, high-throughput processing methodologies with automated pipelines and extensive glycoinformatics support and infrastructure are just as important as the experimental tools for glycosylation analysis. Toward that end, computational methods, databases, and tools are being developed, and a collection of glycoinformatics platforms is publicly available [26, 27, 59-62]. Some of these platforms serve as repositories of glycan structures, glycogenes, enzymes, and experimental glycan data. Other platforms permit analysis of glycans from diverse perspectives and interpret different types of experimental data. For example, there are publicly available databases that provide glycoproteomic entries and tools to predict glycosylation sites. UniPep lists 9651 peptides derived from 6027 proteins that form a representative library of various tissues mapped to theoretical glycopeptides (www.unipep.org). Building on the previously successful databases GlycoSuiteDB [63] and EUROCarbDB [27], the UniCarb KnowledgeBase (UniCarbKB; http://unicarbkb.org) was established to provide open access to a curated database of glycan structures of glycoproteins and to integrate functional and experimental data collections [64]. Hosted by Expasy, GlycoMod predicts the possible oligosaccharide structures that occur on proteins from their experimentally determined masses (http://web.expasy.org/glycomod/), whereas SugarBind provides information on known carbohydrate sequences to which pathogenic organisms bind (http://sugarbind.expasy.org/).
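The idea of predicting oligosaccharide compositions from an experimentally determined mass can be sketched as a brute-force search over monosaccharide counts. This is a minimal illustration only and is not GlycoMod's actual algorithm: the function name, the count limits, and the 0.01 Da tolerance are assumptions, and only four residue types are considered, for a free, underivatized glycan.

```python
from itertools import product
from typing import Dict, List, Optional

# Monoisotopic residue masses (Da) of four common monosaccharides, and water.
RESIDUE_MASS = {
    "Hex": 162.052824,
    "HexNAc": 203.079373,
    "dHex": 146.057909,
    "NeuAc": 291.095417,
}
WATER = 18.010565


def candidate_compositions(
    observed_mass: float,
    tol: float = 0.01,
    max_count: Optional[Dict[str, int]] = None,
) -> List[Dict[str, int]]:
    """List residue compositions whose free-glycan mass matches observed_mass.

    tol and max_count are illustrative assumptions, not published settings.
    """
    limits = max_count or {"Hex": 12, "HexNAc": 8, "dHex": 4, "NeuAc": 4}
    names = list(RESIDUE_MASS)
    matches = []
    # Try every combination of residue counts within the allowed limits.
    for counts in product(*(range(limits[name] + 1) for name in names)):
        mass = WATER + sum(n * RESIDUE_MASS[name] for n, name in zip(counts, names))
        if abs(mass - observed_mass) <= tol:
            matches.append(dict(zip(names, counts)))
    return matches


# Monoisotopic mass of a free Hex5HexNAc2 glycan (for example, Man5GlcNAc2).
hits = candidate_compositions(1234.4334)
```

A mass alone constrains composition but not linkage or branching, which is one reason such predictions are usually combined with knowledge of the biosynthetic pathway or with MS/MS data.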

One challenge is the lack of consensus on a common computer code for database exchange that represents structures for monosaccharide residues and glycan sequences [65], even though codes for graphical representation of glycans already exist. Many initiatives, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [66, 67] and the Consortium for Functional Glycomics (CFG) [59], developed unique structural code representations for glycans and independent databases to store them. This has made combining information from more than one database highly challenging. While there has been agreement on the sequence format GLYDE-II as a general exchange format for glycan structures [68], many of the databases and glycomics tools still use unique sequence codes. GlycomeDB [24, 69] is a single-resource database for glycan structures that was established to integrate entries from the major databases, including CFG, GLYCOSCIENCES.de, KEGG, CarbBank and the Protein Data Bank (PDB). GlycomeDB stores glycan structures in the GlycoCT sequence format [70]. In addition, GlycomeDB maintains the biological source information from the integrated databases and references the IDs of the original database entries. Databases not yet integrated into GlycomeDB include UniCarb-DB (http://unicarb-db.biomedicine.gu.se/) and UniCarbKB (http://unicarbkb.org/). A glycan registry, GlyTouCan (https://glytoucan.org/), was created as a new effort not only to integrate the structures from GlycomeDB and other databases but also to allow users to register their own structures. Table 3 describes the main databases that store carbohydrate structures, experimental data, and many other resources.

Table 4 highlights glycoinformatics resources and tools for glycan analysis and interpretation, including web applications, stand-alone applications, and web-based resource sites. A very important platform for the registration and discovery of glycoinformatics tools is the GlycomicsPortal (http://glycomics.ccrc.uga.edu/GlycomicsPortal). This web-based search engine is regularly updated and currently stores 35 databases, 39 web services, 33 software tools, and a workflow. Most of the tools mentioned in Tables 3 and 4 of this publication and in other publications are registered in the GlycomicsPortal, expediting the search for glycobiology resources. RINGS [71] is another resource website, which provides algorithmic and data mining tools. Additional resource tools for glycan analysis and interpretation are included in Table 4, with their corresponding literature references. Considering the massive collection of data and resources, a description of all the glycoinformatics tools is beyond the scope of this review. Resources and applications for molecular modeling of glycan structures are described in additional references [21, 72-74].

6. Systems Glycobiology and Integration of ‘Omics Datasets

Integrative glycoinformatics and systems glycobiology developments based on a holistic understanding of the complex glycosylation process and the relations among its components can provide a more complete analysis, one that is based not solely on glycan annotation but also on other aspects, such as enzymatic levels, glycan abundances, biosynthetic pathways, and complementary ‘omics datasets (Figure 5). However, most of the available glycoinformatics tools do not consider the integration of, and the complex relationships among, the different components of the glycosylation process (e.g., enzymes, glycans, sugar nucleotides, transporters), in which the glycan structures are defined by the action of many enzymes. Instead, these tools are based primarily on standard bioinformatics approaches mostly developed for proteomics and genomics, which have limitations in their application to glycomics. These tools may not work properly for glycans, given that glycans are not directly encoded in the genome and differ from proteins in that they are assembled by the interconnected action of several enzymes. For that reason, and because of the adoption of traditional bioinformatics approaches that do not consider the complexity of glycosylation, methods and tools for analysis of glycans have lagged, and most glycoinformatics tools available are specialized for the analysis of one type of data [75-78]. For example, a common approach for mass spectrometry-based glycoprofiling involves a one-to-one database matching of particular MS measurements to specific glycans from a known glycan library in order to annotate the individual peaks of the mass spectrum separately [79, 80]. Methods that consider the complexity of glycosylation would require that all glycan structures used for the annotation of the spectrum be generated by the enzymatic machinery of the studied organism [81]. This requirement ensures consistency between the enzyme activities and the structures assigned to each peak in the same spectrum.
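The contrast between library matching and enzyme-consistent annotation can be illustrated with a toy sketch in which the candidate library is generated from a set of enzyme rules, so that a peak is only annotated with structures the modeled enzymes could produce. Everything here is a simplifying assumption and not the model of [81]: the `Glycan` state, the six rules, the starting Man5GlcNAc2 structure, and the 0.01 Da tolerance are hypothetical and far coarser than a real N-glycosylation network.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, NamedTuple, Optional, Set


class Glycan(NamedTuple):
    man: int  # mannose residues
    gn: int   # antenna GlcNAc residues (the two core GlcNAc are implicit)
    gal: int  # galactose residues
    fuc: int  # core fucose (0 or 1)
    sia: int  # sialic acid residues


HEX, HEXNAC, DHEX, NEUAC, WATER = 162.052824, 203.079373, 146.057909, 291.095417, 18.010565


def mass(g: Glycan) -> float:
    """Monoisotopic mass (Da) of the free, underivatized glycan."""
    return (g.man + g.gal) * HEX + (2 + g.gn) * HEXNAC + g.fuc * DHEX + g.sia * NEUAC + WATER


# Toy enzyme rules: each returns the product, or None if the enzyme cannot act.
RULES: Dict[str, Callable[[Glycan], Optional[Glycan]]] = {
    "GnTI": lambda g: g._replace(gn=1) if g.man == 5 and g.gn == 0 else None,
    "ManII": lambda g: g._replace(man=3) if g.man == 5 and g.gn == 1 else None,
    "GnTII": lambda g: g._replace(gn=2) if g.man == 3 and g.gn == 1 else None,
    "GalT": lambda g: g._replace(gal=g.gal + 1) if g.gal < g.gn else None,
    "FUT8": lambda g: g._replace(fuc=1) if g.fuc == 0 and g.gn >= 1 else None,
    "SiaT": lambda g: g._replace(sia=g.sia + 1) if g.sia < g.gal else None,
}


def reachable(enzymes: Iterable[str], start: Glycan = Glycan(5, 0, 0, 0, 0)) -> Set[Glycan]:
    """Breadth-first enumeration of every structure the active enzymes can make."""
    active = list(enzymes)
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for name in active:
            nxt = RULES[name](current)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def annotate(peaks: Iterable[float], enzymes: Iterable[str], tol: float = 0.01) -> Dict[float, List[Glycan]]:
    """Assign to each peak mass only structures reachable with the given enzymes."""
    library = reachable(enzymes)
    return {p: sorted(g for g in library if abs(mass(g) - p) <= tol) for p in peaks}


# A core-fucosylated, agalactosylated biantennary glycan as the example peak.
g0f = Glycan(man=3, gn=2, gal=0, fuc=1, sia=0)
with_fut8 = annotate([mass(g0f)], RULES)
without_fut8 = annotate([mass(g0f)], [e for e in RULES if e != "FUT8"])
```

With all rules active the peak is assigned the fucosylated structure, whereas removing the hypothetical `FUT8` rule leaves the same peak unannotated. A plain library lookup would have returned the same assignment in both cases, regardless of whether the organism expresses the enzyme.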

Current bioinformatics techniques that attempt to integrate diverse data sets are still in early development [23, 82, 83] despite substantial progress on several ‘omics fronts. For example, statistical database-driven approaches to relate gene expression levels to the abundance of specific glycan linkages did not provide quantitative predictions of detailed glycan distributions [82, 83]. This reflects the need for integrative glycoinformatics and systems tools that identify glycan structural data and also link these data with gene expression data for the glycosylation enzymes that produce these glycan structures. Mathematical modeling of glycosylation may represent a promising approach to start understanding how mRNA levels relate to the actual amount and distribution of glycans found within healthy or diseased cells [23, 30, 81].

An approach that considers data integration could be highly effective in reducing variability (false positives and negatives) in the analytical high-throughput experimental platform. Results obtained with integrative glycoinformatics and systems glycobiology tools that are confirmed by different experimental data (e.g., glycogene expression and mass spectra profiles) will increase confidence in predictions and recommendations for biomarkers. Moreover, integrated glycoinformatics tools enable the analysis and comparison of multiple studies with multiple platforms, which can reveal limitations in analytical sensitivities.
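One simple way to picture this kind of cross-platform confirmation is to compare, enzyme by enzyme, the fold change seen in gene expression with the fold change in enzyme activity inferred from a glycan profile. The sketch below is hypothetical throughout: the function name, the two-fold threshold, and all fold-change values are made up for illustration and do not come from any study cited here.

```python
import math
from typing import Dict


def concordance(
    expression_fc: Dict[str, float],
    activity_fc: Dict[str, float],
    min_log2: float = 1.0,
) -> Dict[str, str]:
    """Label each enzyme present in both data sets.

    "concordant": both platforms change by at least 2**min_log2 in the same direction.
    "unchanged":  neither platform reaches the threshold.
    "discordant": anything else, which may merit a closer look at either platform.
    """
    calls = {}
    for enzyme in sorted(expression_fc.keys() & activity_fc.keys()):
        e = math.log2(expression_fc[enzyme])
        a = math.log2(activity_fc[enzyme])
        if abs(e) < min_log2 and abs(a) < min_log2:
            calls[enzyme] = "unchanged"
        elif abs(e) >= min_log2 and abs(a) >= min_log2 and e * a > 0:
            calls[enzyme] = "concordant"
        else:
            calls[enzyme] = "discordant"
    return calls


# Made-up fold changes (condition B relative to condition A).
expression = {"FUT1": 4.0, "MGAT5": 1.0, "B4GALT1": 1.2}
activity = {"FUT1": 3.0, "MGAT5": 5.0, "B4GALT1": 0.9}
calls = concordance(expression, activity)
```

Concordant calls are the ones that gain confidence from two independent measurements, while discordant calls point to candidates for platform-specific limitations rather than to biological conclusions.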

One approach has been to develop and implement tools for optimal analysis and effective validation of mass spectra and gene expression data sets to understand cancer glycosylation. A comprehensive simulation framework was implemented [23, 30] to integrate information across a broad spectrum of mass spectral data for leukemia cell types as compared to normal cell types [30]. The method was also used to integrate mass spectral data and mRNA datasets to identify glycan patterns associated with prostate cancer types [23]. These two examples provide a deeper understanding of the interrelation among the complex cellular processes leading to changes in cancer glycosylation. In both cases, a glycosylation model for mammalian cells that uses MALDI-TOF mass spectra was developed to completely characterize a measured glycan mass spectrum in terms of a relatively small number of enzyme activities [30]. The model produces an automatic annotation of the mass spectrum in terms of glycan structures, with every peak being assigned a full range of alternative glycan structures and abundances, as shown in the MS profile at the top of Figure 6. The method was initially applied to mass spectral data of normal human monocytes and monocytic leukemia cells, and it provided insights into the relevant glycosylation pathways that differentiate normal and diseased cells [30]. This is an important advance toward integrative systems glycobiology that connects the enzymatic activities of the glycosyltransferases, the complete bioprocessing pathways, and the resulting glycan structures.

A more advanced implementation of this method involves the integration of mass spectral and gene expression data, as illustrated in Figure 6. The power of this novel method was demonstrated by applying it to low- and high-passage lymph node carcinoma of the prostate (LNCaP) cancer cells, which correspond to androgen-dependent and the more metastatic androgen-independent cell stages [23]. The method identified and quantified glycan structural details not typically derived from single-stage mass spectral or gene expression data (Figure 6). Differences uncovered between the cell types include increases, in the more metastatic androgen-independent cells, of H type II and Lewis-y glycan structures characteristic of blood groups, together with a correspondingly greater activity of a fucosyltransferase (FUT1). The model further elucidated limitations in the two analytical platforms, including a defect in the microarray for detecting the GnTV (MGAT5) enzyme. The results demonstrate the potential of integrative systems glycobiology tools for elucidating key glycan biomarkers and potential therapeutic targets along with specifying limitations in the analytical platforms. The integration of multiple data sets demonstrates how a systems biology approach can provide a better understanding of complex cellular processes and lead to the elucidation of glycan signatures representative of potential biomarkers for cancer and other diseases.

Other pioneering studies have combined glycan analysis with multiple ‘omics analytical tools to gain insights into glycosylation differences in human populations and disease states as well as the genetic control of these changes [84].
In one approach, a genome-wide association study (GWAS) was combined with high-throughput HPLC analysis of plasma proteins from 2,705 individuals to reveal polymorphisms in the fucosyltransferase genes FUT6 and FUT8 as well as in Hepatocyte Nuclear Factor 1α (HNF1α) [85]. Furthermore, HNF1α and HNF4α were found to regulate expression of fucosyltransferase and fucose biosynthetic genes. The analysis was then extended to 3,533 individuals to identify polymorphisms in the genes for N-acetylglucosamine transferase (MGAT5), glucuronyltransferase (B3GAT1) and the proton pump SLC9A9 based on up to 45 glycan traits in the plasma glycome of tested individuals [86]. Other efforts involve relating the epigenome with the glycome [85]. In one study, HNF1α silencing through methylation of CpG sites was associated with changes in the plasma glycome of a population of 810 individuals [87]. Another study showed that global changes in the DNA methylation of ovarian cancer epithelial cells (OVCAR3) can cause differential alterations in glycan structures such as reduced core fucosylation, increased branching, and enhanced sialylation. These changes were related to alterations in the expression of the GMDS and FX genes involved in fucose biosynthesis and in the expression of MGAT5, which affects the branching and sialylation of secreted glycans [88]. These studies demonstrate how genomic and epigenetic analysis, when combined with glycan structural analysis, can yield powerful insights into the role of genes and pathways in glycosylation, metastasis and cancer progression [88].

Another study profiled both N- and O-glycan structures and the corresponding glycosylation machinery genes isolated from mouse embryonic stem cells (ES), embryoid bodies (EBs) and extraembryonic endodermal cells (ExE). By using pathway mapping, the researchers showed a significant correlation between transcriptional regulation and glycan expression. Increased polysialylation and α-Gal termination were observed in the differentiated cell types, whereas α-Gal-capped glycans were more abundant in ExE cells [89]. Another integration study mapped miRNA regulators onto the glycan biosynthetic pathways by incorporating glycomics data. By using lectin microarrays together with mimics and inhibitors of certain groups of miRNAs, microRNA regulators of high-mannose, fucose and β-GalNAc networks were determined [90]. Researchers have also profiled N-glycan and glycogene expression in the epithelial-to-mesenchymal transition (EMT) process, which occurs following transformation of the normal mouse mammary gland epithelial (NMuMG) cell model induced by transforming growth factor-β1 (TGFβ1). Using a systems glycobiology approach, the effects of TGFβ induction on the levels of high-mannose, antennary and bisecting GlcNAc N-glycans as well as on fucosylation were demonstrated. Fucosylation and bisecting GlcNAc glycans were significantly decreased while high-mannose type N-glycans were increased [31]. As a result, the integration of glycomics approaches together with other ‘omics tools has led to a much greater understanding of the control of the expression of glycosylation genes at the genomic, transcriptomic, and epigenetic levels and its impact on the glycan profiles of populations and in cellular transformations and disease states.

7. Biomarkers and Glycan-based Diseases

The discovery of glycan-based biomarkers requires efficient extraction and analysis of glycan structures. Indeed, the identification of appropriate glycan-based biomarkers of different cancer types is challenging and has been hindered by a number of factors, notably: specific glycans may not be present in current databases; cancer-associated glycans may be present at low levels; a collective pattern of glycan structures rather than a specific glycan may be more representative of the cancer state; a unified and standard format for glycan notation is lacking; and novel algorithms are required to handle glycan complexity. Most importantly, differences in glycan profiles between cancer and normal cells may involve subtle differences in amounts rather than on and off changes.
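
The distinction between quantitative shifts and on/off changes can be made concrete with a small calculation. The following is a minimal sketch, assuming two glycan profiles given as raw peak intensities; it normalizes each profile to relative abundances, reports a log2 fold change per glycan, and flags a glycan as an on/off change only when it falls below an assumed detection floor in exactly one profile. The glycan labels, intensities, and the floor value are invented for illustration and do not come from any cited data set.

```python
import math


def relative_abundance(profile):
    """Scale raw peak intensities so that they sum to 1."""
    total = sum(profile.values())
    return {glycan: value / total for glycan, value in profile.items()}


def compare_profiles(normal, disease, floor=1e-4):
    """Return {glycan: (log2 fold change disease/normal, on_off_flag)}.

    `floor` is a hypothetical detection limit on relative abundance; values
    below it are treated as absent and clamped to the floor.
    """
    norm = relative_abundance(normal)
    dis = relative_abundance(disease)
    report = {}
    for glycan in sorted(set(norm) | set(dis)):
        a = norm.get(glycan, 0.0)
        b = dis.get(glycan, 0.0)
        on_off = (a < floor) != (b < floor)
        log2fc = math.log2(max(b, floor) / max(a, floor))
        report[glycan] = (log2fc, on_off)
    return report


# Invented intensities, for illustration only.
normal = {"Man5": 40.0, "A2G2": 35.0, "FA2G2": 25.0}
disease = {"Man5": 30.0, "A2G2": 30.0, "FA2G2": 38.0, "FA2G2S1": 2.0}
report = compare_profiles(normal, disease)
for glycan, (log2fc, on_off) in report.items():
    print(f"{glycan}: log2FC={log2fc:+.2f} on/off={on_off}")
```

In this toy case three of the four glycans change by well under one log2 unit and only one is a true appearance, which is the situation the text describes: a marker search restricted to presence or absence would miss most of the signal.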

Fortunately, advances in ‘omics tools including RNA sequencing and mass spectrometry-based technologies have initiated a better understanding of the glycogenome, glycoproteome, and glycan changes in cells or pathological conditions. In addition to cancer, immunological and cardiovascular disorders, congenital disorders of glycosylation (CDGs) are one of the largest classes of diseases affected by glycosyltransferases and other glycogenes [91]. Understanding the glycogenome is a first key step towards identifying glycan or glycoprotein markers. Various sequencing technologies or arrays, including whole exome sequencing (WES) [92], have been used to identify functionally deleterious mutations. A website called GlyMAP (http://glymap.glycomics.ku.dk) was constructed to provide a global map of glycogenome genetic stability [91]. In addition, glycoproteomics is widely used for finding novel biomarkers from serum or tissues. Lectin affinity chromatography, hydrazide chemistry, titanium dioxide affinity chromatography, reductive amination chemistry, and boronic acid chemistry are the most widely used enrichment techniques for finding glycosites and glycopeptides [93]. By coupling these enrichment techniques with MS labeling technologies and global proteomics approaches, novel glycoprotein markers can be found and validated [94]. Examples of some studied glycan biomarkers in cancer are given in Table 5.

In addition, the subtle differences that may exist between cells and tissues may be addressed through the systems biology integration described in the previous section. Differences in glycan profiles and in enzyme expression profiles have the potential to serve as valuable biomarkers by indicating a transformation from a normal to a cancerous state [30] or from a less malignant to a more malignant cancer state [23]. One goal involves obtaining new epitopes based on differential glycan patterns observed between normal and diseased cells and tissues. Algorithms based on machine learning methods [95-98], frequent sub-tree mining [97, 99, 100], and mathematical modeling [23, 30] have been developed to predict glycan biomarkers or glycan biomarker patterns. Some of these algorithms can compare two glycan profiles directly, each consisting of thousands of structures, to find those substructures or combinations of substructures that most definitively characterize the differences between the two profiles.
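
The idea of comparing two profiles at the level of substructures can be sketched as follows. This is a deliberately reduced illustration in the spirit of q-gram and subtree-mining approaches, not an implementation of any of the cited algorithms [95-100]: glycans are represented as nested tuples, the only "substructures" considered are parent-child residue pairs with linkage positions ignored, and each pair is weighted by the summed relative abundance of the glycans that contain it. The example fragments and abundances are invented.

```python
from collections import Counter


def node(residue, *children):
    """Build a glycan tree node; the root is the reducing-end residue."""
    return (residue, children)


def linked_pairs(tree):
    """Yield every parent>child residue pair in the tree."""
    residue, children = tree
    for child in children:
        yield f"{residue}>{child[0]}"
        yield from linked_pairs(child)


def motif_weights(profile):
    """Sum relative abundance over all glycans that contain each pair."""
    weights = Counter()
    for tree, abundance in profile:
        for motif in set(linked_pairs(tree)):
            weights[motif] += abundance
    return weights


def discriminating_motifs(profile_a, profile_b):
    """Rank pairs by absolute difference in summed abundance (b minus a)."""
    wa, wb = motif_weights(profile_a), motif_weights(profile_b)
    diffs = {m: wb[m] - wa[m] for m in set(wa) | set(wb)}
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))


# Simplified terminal fragments (linkages omitted), for illustration only.
lacnac = node("GlcNAc", node("Gal"))
h_type2 = node("GlcNAc", node("Gal", node("Fuc")))
lewis_y = node("GlcNAc", node("Gal", node("Fuc")), node("Fuc"))

# Invented relative abundances for two hypothetical cell states.
profile_low = [(lacnac, 0.80), (h_type2, 0.15), (lewis_y, 0.05)]
profile_high = [(lacnac, 0.50), (h_type2, 0.30), (lewis_y, 0.20)]

ranked = discriminating_motifs(profile_low, profile_high)
for motif, diff in ranked:
    print(f"{motif}: {diff:+.2f}")
```

Here the fucose-on-galactose pair ranks first even though no single glycan is unique to either profile, which mirrors the point that a collective pattern of substructures can separate two states better than any one structure. Published methods extend this to larger subtrees, linkage-aware labels, and statistical scoring.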

In conclusion, glycomics represents a challenging but highly promising field of study to gain new insights in biomarker discovery and biotherapeutics development. Glycomics serves as one of the key initial tools with which to dig deeper and gain a better understanding of cellular physiology, particularly when used in concert with genomics, epigenomics, transcriptomics, and proteomics data. An emerging opportunity in processing diverse glycosylation data sets is the development of adequate tools that allow an integrative analysis perspective to interrogate, interpret, and gain insights from diverse ‘omics data sets. While important progress is being achieved on this end, future advances will focus on the implementation of integrative glycoinformatics and systems glycobiology approaches to help characterize glycosylation and the impact of changes or differences at the genetic, transcriptional, epigenetic, and translational levels. This work will require the development of methods to integrate disparate datasets, perform network analysis, visualize glycosylation processing differences, and create unified computational platforms.

Funding

This work was supported by the National Cancer Institute: Awards 5R41CA127885-02 and R01CA112314, the Consortium of Functional Glycomics: Bridging Grant Number U54GM062116-10, and NIH/NIGMS funding the National Center for Glycomics and Glycoproteomics (8P41GM103490). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute and the Consortium of Functional Glycomics.

Conflict of interest

The authors have declared no conflict of interest.

References

1. Brockhausen, I., Mucin-type O-glycans in human colon and breast cancer: glycodynamics and functions. Embo Reports, 2006. 7(6): p. 599-604.
2. Brockhausen, I., J. Schutzbach, and W. Kuhns, Glycoproteins and their relationship to human disease. Acta Anatomica, 1998. 161(1-4): p. 36-78.
3. Hakomori, S., Glycosylation defining cancer malignancy: new wine in an old bottle. Proc Natl Acad Sci USA, 2002. 99(16): p. 10231-10233.
4. Kim, Y.J. and A. Varki, Perspectives on the significance of altered glycosylation of glycoproteins in cancer. Glycoconj J, 1997. 14(5): p. 569-576.
5. Buskas, T., P. Thompson, and G.J. Boons, Immunotherapy for cancer: synthetic carbohydrate-based vaccines. Chem Commun (Camb), 2009(36): p. 5335-5349.
6. Tong, L., et al., Glycosylation changes as markers for the diagnosis and treatment of human disease. Biotechnol Gen Eng Rev, 2003. 20: p. 199-244.
7. Adamczyk, B., T. Tharmalingam, and P.M. Rudd, Glycans as cancer biomarkers. Biochim Biophys Acta, 2012. 1820(9): p. 1347-1353.
8. Fuster, M.M. and J.D. Esko, The sweet and sour of cancer: Glycans as novel therapeutic targets. Nat Rev Cancer, 2005. 5(7): p. 526-542.
9. Furukawa, K. and A. Kobata, Protein glycosylation. Curr Opin Biotechnol, 1992. 3(5): p. 554-559.
10. Apweiler, R., H. Hermjakob, and N. Sharon, On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochim Biophys Acta, 1999. 1473(1): p. 4-8.
11. Arnold, J.N., et al., Novel glycan biomarkers for the detection of lung cancer. J Prot Res, 2011. 10(4): p. 1755-1764.
12. Tian, Y. and H. Zhang, Characterization of disease-associated N-linked glycoproteins. Proteomics, 2013. 13(3-4): p. 504-511.
13. Tian, Y., et al., Altered expression of sialylated glycoproteins in breast cancer using hydrazide chemistry and mass spectrometry. Mol Cell Prot, 2012. 11(6): p. M111011403.
14. Tian, Y., et al., Identification of glycoproteins associated with different histological subtypes of ovarian tumors using quantitative glycoproteomics. Proteomics, 2011. 11(24): p. 4677-4687.
15. Rakus, J.F. and L.K. Mahal, New technologies for glycomic analysis: toward a systematic understanding of the glycome. Ann Rev Anal Chem, 2011. 4: p. 367-392.
16. Ito, H., et al., Strategy for glycoproteomics: identification of glyco-alteration using multiple glycan profiling tools. J Prot Res, 2009. 8(3): p. 1358-1367.
17. Tousi, F., et al., Technologies and strategies for glycoproteomics and glycomics and their application to clinical biomarker research. Anal Met, 2011. 3(1): p. 195-203.
18. Zhang, Y., H. Yin, and H. Lu, Recent progress in quantitative glycoproteomics. Glycoconj J, 2012. 29(5-6): p. 249-258.
19. Ito, S., K. Hayama, and J. Hirabayashi, Enrichment strategies for glycopeptides. Met Mol Biol, 2009. 534: p. 195-203.

20. Hua, S. and H.J. An, Glycoscience aids in biomarker discovery. BMB Rep, 2012. 45(6): p. 323-330.
21. Furukawa, J., N. Fujitani, and Y. Shinohara, Recent advances in cellular glycomic analyses. Biomolecules, 2013. 3: p. 198-225.
22. GlycomicsPortal, http://glycomics.ccrc.uga.edu/GlycomicsPortal.
23. Bennun, S.V., et al., Integration of the transcriptome and glycome for identification of glycan cell signatures. PLoS Comput Biol, 2013. 9(1): e1002813. doi:10.1371/journal.pcbi.1002813.
24. Ranzinger, R., et al., GlycomeDB-a unified database for carbohydrate structures. Nuc Acids Res, 2011. 39: p. D373-D376.
25. Taniguchi, N. and J.C. Paulson, Frontiers in glycomics: bioinformatics and biomarkers in disease. Proteomics, 2007. 7(9): p. 1360-1363.
26. Raman, R., et al., Glycomics: an integrated systems approach to structure-function relationships of glycans. Nat Met, 2005. 2(11): p. 817-824.
27. von der Lieth, C.W., et al., EUROCarbDB: An open-access platform for glycoinformatics. Glycobiol, 2011. 21(4): p. 493-502.
28. Akune, Y., et al., The RINGS resource for glycome informatics analysis and data mining on the Web. Omics, 2010. 14(4): p. 475-486.
29. Lutteke, T., et al., GLYCOSCIENCES.de: an internet portal to support glycomics and glycobiology research. Glycobiol, 2006. 16(5): p. 71R-81R.
30. Krambeck, F.J., et al., A mathematical model to derive N-glycan structures and cellular enzyme activities from mass spectrometric data. Glycobiol, 2009. 19(11): p. 1163-1175.
31. Tan, Z., et al., Altered N-glycan expression profile in epithelial-to-mesenchymal transition of NMuMG cells revealed by an integrated strategy using mass spectrometry and glycogene and lectin microarray analysis. J Proteome Res, 2014. 13(6): p. 2783-2795.
32. Comelli, E.M., et al., A focused microarray approach to functional glycomics: transcriptional regulation of the glycome. Glycobiol, 2006. 16(2): p. 117-131.
33. Hirabayashi, J., Lectin-based structural glycomics: glycoproteomics and glycan profiling. Glycoconj J, 2004. 21(1-2): p. 35-40.
34. Kaji, H., et al., Lectin affinity capture, isotope-coded tagging and mass spectrometry to identify N-linked glycoproteins. Nat Biotechnol, 2003. 21(6): p. 667-672.
35. Nairn, A.V., et al., Regulation of glycan structures in animal tissues: transcript profiling of glycan-related genes. J Biol Chem, 2008. 283(25): p. 17298-17313.
36. Yang, S. and H. Zhang, Glycomic analysis of glycans released from glycoproteins using chemical immobilization and mass spectrometry. Curr Protoc Chem Biol, 2014. 6(3): p. 191-208.
37. Mickum, M., et al., Deciphering the glycogenome of schistosomes. Front Genet, 2014. 5: p. 262.
38. Xu, X., et al., The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat Biotechnol, 2011. 29: p. 735-741.
39. North, S.J., et al., Glycomics profiling of Chinese hamster ovary cell glycosylation mutants reveals N-glycans of a novel size and complexity. J Biol Chem, 2010. 285(8): p. 5759-5775.
40. Kang, X., et al., Glycan-related gene expression signatures in human metastatic hepatocellular carcinoma cells. Exp Ther Med, 2012. 3(3): p. 415-422.
41. Yang, Z., et al., Engineered CHO cells for production of diverse, homogeneous glycoproteins. Nat Biotech, 2015. 33(8): p. 842-844.
42. Tian, Y. and H. Zhang, Glycoproteomics and clinical applications. Proteomics, 2010. 4(2): p. 124-132.
43. Megger, D.A., T. Bracht, H.E. Meyer, and B. Sitek, Label-free quantification in clinical proteomics. Biochim Biophys Acta, 2013. 1834(8): p. 1581-1590.
44. Kashyap, M.K., et al., SILAC-based quantitative proteomic approach to identify potential biomarkers from the esophageal squamous cell carcinoma secretome. Cancer Biol Ther, 2010. 10(8): p. 796-810.

45. Tian, Y., G.S. Bova, and H. Zhang, Quantitative glycoproteomic analysis of optimal cutting temperature-embedded frozen tissues identifying glycoproteins associated with aggressive prostate cancer. Anal Chem, 2011. 83(18): p. 7013-7019.
46. Raso, C., et al., Characterization of breast cancer interstitial fluids by TMT labeling, LTQ-Orbitrap Velos mass spectrometry, and pathway analysis. J Prot Res, 2012. 11(6): p. 3199-3210.
47. Zhang, H., X.J. Li, D.B. Martin, and R. Aebersold, Identification and quantification of N-linked glycoproteins using hydrazide chemistry, stable isotope labeling and mass spectrometry. Nat Biotechnol, 2003. 21(6): p. 660-666.
48. Yang, S., et al., Glycoproteins identified from heart failure and treatment models. Proteomics, 2015. 15(2-3): p. 567-579.
49. Yang, W., et al., Glycoproteomic study reveals altered plasma proteins associated with HIV elite suppressors. Theranostics, 2014. 4(12): p. 1153-1163.
50. Li, Y., et al., Simultaneous analysis of glycosylated and sialylated prostate-specific antigen revealing differential distribution of glycosylated prostate-specific antigen isoforms in prostate cancer tissues. Anal Chem, 2011. 83(1): p. 240-245.
51. Ząbczyńska, M. and E. Pocheć, The role of protein glycosylation in immune system. Postepy Biochem, 2015. 61(2): p. 129-137.
52. Garces, F., et al., Affinity maturation of a potent family of HIV antibodies is primarily focused on accommodating or avoiding glycans. Immunity, 2015. 43(6): p. 1053-1063.
53. Yang, S., A. Rubin, S.T. Eshghi, and H. Zhang, Chemoenzymatic method for glycomics: isolation, identification, and quantitation. Proteomics, 2016. 16(2): p. 241-256.
54. Yang, S., et al., QUANTITY: an isobaric tag for quantitative glycomics. Sci Rep, 2015. 5: p. 17585.
55. Sun, S., et al., Comprehensive analysis of protein glycosylation by solid-phase extraction of N-linked glycans and glycosite-containing peptides. Nat Biotechnol, 2016. 34(1): p. 84-88.
56. Eshghi, S., et al., GPQuest: a spectral library matching algorithm for site-specific assignment of tandem mass spectra to intact N-glycopeptides. Anal Chem, 2015. 87(10): p. 5181-5188.
57. Furukawa, J., et al., Quantitative O-glycomics by microwave-assisted β-elimination in the presence of pyrazolone analogues. Anal Chem, 2015. 87(15): p. 7524-7528.
58. Ju, T., et al., Tn and sialyl-Tn antigens, aberrant O-glycomics as human disease markers. Proteomics Clin Appl, 2013. 7(9-10): p. 618-631.
59. CFG. The Consortium for Functional Glycomics. http://www.functionalglycomics.org.
60. Raman, R., et al., Advancing glycomics: implementation strategies at the consortium for functional glycomics. Glycobiol, 2006. 16(5): p. 82R-90R.
61. Yoshida, K., A. Suzuki, and N. Taniguchi, Japan consortium for glycobiology and glycotechnology; toward establishment of international network and systems glycobiology. Prot Nuc Acid Enzyme, 2004. 49(15 Suppl): p. 2313-2318.
62. Hayes, C.A., et al., UniCarb-DB: a database resource for glycomic discovery. Bioinform, 2011. 27(9): p. 1343-1344.
63. Cooper, C.A., et al., GlycoSuiteDB: a new curated relational database of glycoprotein glycan structures and their biological sources. Nuc Acids Res, 2001. 29(1): p. 332-335.
64. Campbell, M.P., et al., UniCarbKB: building a knowledgebase platform for glycoproteomics. Nuc Acids Res, 2014. 42: p. D215-D221.
65. Ranzinger, R. and W.S. York, Glyco-Bioinformatics today (August 2011) – solutions and problems. http://www.beilstein-institut.de/glycobioinf2011/Proceedings, 2011.
66. Hashimoto, K., et al., KEGG as a glycome informatics resource. Glycobiol, 2006. 16(5): p. 63R-70R.
67. Kanehisa, M., et al., The KEGG resource for deciphering the genome. Nuc Acids Res, 2004. 32: p. D277-D280.

68. Packer, N.H., et al., Frontiers in glycomics: bioinformatics and biomarkers in disease. An NIH white paper prepared from discussions by the focus groups at a workshop on the NIH campus, Bethesda MD (September 11-13, 2006). Proteomics, 2008. 8(1): p. 8-20.
69. Ranzinger, R., et al., GlycomeDB - integration of open-access carbohydrate structure databases. BMC Bioinform, 2008. 9: p. 384.
70. Herget, S., et al., GlycoCT-a unifying sequence format for carbohydrates. Carbohydr Res, 2008. 343(12): p. 2162-2171.
71. Akune, Y., Hosoda, M., Kaiya, S., Shinmachi, D., and K.F. Aoki-Kinoshita, The RINGS resource for glycome informatics analysis and data mining on the web. OMICS, 2010. 14(4): p. 475-486.
72. DeMarco, M.L. and R.J. Woods, Structural glycobiology: a game of snakes and ladders. Glycobiol, 2008. 18(6): p. 426-440.
73. von der Lieth, C.W., et al., Bioinformatics for glycomics: status, methods, requirements and perspectives. Brief Bioinform, 2004. 5(2): p. 164-78.
74. Frank, M. and S. Schloissnig, Bioinformatics and molecular modeling in glycobiology. Cell Molecular Life Sci, 2010. 67(16): p. 2749-2772.
75. Ceroni, A., et al., GlycoWorkbench: a tool for the computer-assisted annotation of mass spectra of glycans. J Prot Res, 2008. 7(4): p. 1650-1659.
76. Maass, K., et al., "Glyco-peakfinder" - de novo composition analysis of glycoconjugates. Proteomics, 2007. 7(24): p. 4435-4444.
77. Goldberg, D., et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra. Proteomics, 2005. 5(4): p. 865-875.
78. Kawano, S., et al., Prediction of glycan structures from DNA microarray data. Glycobiol, 2004. 14(11): p. 1204-1204.
79. Packer, N.H., et al., Frontiers in glycomics: bioinformatics and biomarkers in disease - An NIH White Paper prepared from discussions by the focus groups at a workshop on the NIH campus, Bethesda MD (September 11-13, 2006). Proteomics, 2008. 8(1): p. 8-20.
80. Joshi, H.J., et al., Development of a mass fingerprinting tool for automated interpretation of oligosaccharide fragmentation data. Proteomics, 2004. 4(6): p. 1650-1664.
81. Krambeck, F.J. and M.J. Betenbaugh, A mathematical model of N-linked glycosylation. Biotechnol Bioeng, 2005. 92(6): p. 711-728.
82. Kawano, S., et al., Prediction of glycan structures from gene expression data based on glycosyltransferase reactions. Bioinform, 2005. 21(21): p. 3976-3982.
83. Suga, A., et al., An improved scoring scheme for predicting glycan structures from gene expression data. Genome Informatics, 2007. 18: p. 237-246.
84. Zoldos, V., Horvat, T., and G. Lauc, Glycomics meets genomics, epigenomics and other high throughput omics for system biology studies. Curr Opin Chem Biol, 2013. 17(1): p. 33-40.
85. Lauc, G., et al., Genomics meet glycomics – the first GWAS study of human N-glycome identifies HNF1α as a master regulator of plasma protein fucosylation. PLoS Genet, 2010. 6(12): e1001256.
86. Huffman, J.E., et al., Polymorphisms in B3GAT1, SLC9A9 and MGAT5 are associated with variation within the human plasma N-glycome. Hum Mol Genet, 2011. 20: p. 5000-5011.
87. Zoldos, V., et al., Epigenetic silencing of HNF1A associates with changes in the composition of the human plasma N-glycome. Epigenetics, 2012. 7(2): p. 164-172.
88. Saldova, R., et al., 5-AZA-2’-deoxycytidine induced demethylation influences N-glycosylation of secreted glycoproteins in ovarian cancer. Epigenetics, 2011. 6(11): p. 1362-1372.
89. Nairn, A.V., et al., Combined transcript profiling glycan-related genes and glycan structural analysis. J Biol Chem, 2012. 287(45): p. 37835-37856.
90. Agrawal, P., et al., Mapping posttranscriptional regulation of the human glycome uncovers microRNA defining the glycocode. Proc Natl Acad Sci, 2014. 111(11): p. 4338-4343.

91. Hansen, L., et al., A glycogene mutation map for discovery of diseases of glycosylation. Glycobiol, 2015. 25(2): p. 211-224.
92. Rabbani, B., Tekin, M., and N. Mahdieh, The promise of whole-exome sequencing in medical genetics. J Human Genet, 2014. 59: p. 5-15.
93. Zhang, Y., J. Jiao, P. Yang, and H. Lu, Mass spectrometry-based N-glycoproteomics for cancer biomarker discovery. Clin Proteomics, 2014. 11(18): doi:10.1186/1559-0275-11-18.
94. Ahn, J.M., et al., Integrated glycoproteomics demonstrates fucosylated serum paraoxonase 1 alterations in small cell lung cancer. Mol Cell Proteomics, 2014. 13(1): p. 30-48.
95. Hizukuri, Y., et al., Extraction of leukemia specific glycan motifs in humans by computational glycomics. Carbohydr Res, 2005. 340(14): p. 2270-2278.
96. Kuboyama, T., et al., A gram distribution kernel applied to glycan classification and motif extraction. Gen Inform, 2006. 17(2): p. 25-34.
97. Yamanishi, Y., F. Bach, and J.P. Vert, Glycan classification with tree kernels. Bioinform, 2007. 23(10): p. 1211-1216.
98. Li, L., et al., A weighted q-gram method for glycan structure classification. BMC Bioinform, 2010. 11 Suppl 1: p. S33.
99. Aoki-Kinoshita, K.F., Mining frequent subtrees in glycan data using the RINGS glycan miner tool. Met Molecular Biol, 2013. 939: p. 87-95.
100. Hashimoto, K., et al., Mining significant tree patterns in carbohydrate sugar chains. Bioinform, 2008. 24(16): p. i167-i173.
101. Doubet, S., et al., The Complex Carbohydrate Structure Database. Trends in Biochemical Sciences, 1989. 14(12): p. 475-477.
102. van Kuik, J.A., K. Hard, and J.F. Vliegenthart, A 1H NMR database computer program for the analysis of the primary structure of complex carbohydrates. Carbohydr Res, 1992. 235: p. 53-68.
103. Schomburg, I., et al., BRENDA, the enzyme database: updates and major new developments. Nucleic acids research, 2004. 32(Database issue): p. D431-3.
104. Coutinho, P.M., et al., An evolving hierarchical family classification for glycosyltransferases. Journal of Molecular Biology, 2003. 328(2): p. 307-317.
105. Tomiya, N., et al., Analyses of N-Linked Oligosaccharides Using a Two-Dimensional Mapping Technique. Analytical Biochemistry, 1988. 171(1): p. 73-90.
106. Bairoch, A., The ENZYME database in 2000. Nucleic acids research, 2000. 28(1): p. 304-5.
107. Yoshida, K., A. Suzuki, and N. Taniguchi, [Japan consortium for glycobiology and glycotechnology; toward establishment of international network and systems glycobiology]. Tanpakushitsu kakusan koso. Protein, nucleic acid, enzyme, 2004. 49(15 Suppl): p. 2313-8.
108. Campbell, M.P., et al., GlycoBase and autoGU: tools for HPLC-based glycan analysis. Bioinformatics, 2008. 24(9): p. 1214-1216.
109. Cooper, C.A., E. Gasteiger, and N.H. Packer, GlycoMod--a software tool for determining glycosylation compositions from mass spectrometric data. Proteomics, 2001. 1(2): p. 340-9.
110. Damodaran, D., et al., CancerLectinDB: a database of lectins relevant to cancer. Glycoconjugate Journal, 2008. 25(3): p. 191-198.
111. Goldman, R., et al., Detection of hepatocellular carcinoma using glycomic analysis. Clin Cancer Res, 2009. 15(5): p. 1808-1813.
112. Arnold, J.N., et al., Novel glycan biomarkers for the detection of lung cancer. J Proteome Res, 2011. 10(4): p. 1755-1764.
113. Kronewitter, S.R., et al., The glycolyzer: automated glycan annotation software for high performance mass spectrometry and its application to ovarian cancer glycan biomarker discovery. Proteomics, 2012. 12(15-16): p. 2523-2538.

Figure and Table Legends

Figure 1. Simplified view of principal N-glycosylation pathways of CHO cells.

Figure 2. Simplified view of principal N-glycosylation pathways in human cells.

Figure 3. Schematic overview of several glycan analysis methods used in the literature including lectin microarrays, UPLC, and LC/MS/MS.

Figure 4. Quantitative glycome analysis using solid phase immobilization and quaternary amine-containing isobaric tag for glycan labeling.

Figure 5. Emerging systems glycobiology paradigm in which multiple glycomics data sets are combined with other ‘omics inputs using glycoinformatics in order to elucidate insights about glycosylation processing from multiple data sets.

Figure 6. Integration of glycomics information from glycosylation models with transcriptomics data for elucidating biomarkers of prostate cancer LNCaP cells. This integrative glycoinformatics approach allows the simultaneous analysis of gene expression (pink) and mass spectra (blue) data using a glycosylation model to elucidate glycan substructures and enzyme/gene levels that differentiate prostate cancer types. Increases in H type II and Lewis Y glycan abundances in more malignant androgen-independent prostate cancer tumor cells versus less malignant cells were detected and show agreement for both MS (dark blue vs. light blue) and gene expression data (dark pink vs. light pink).

Table 1. Main enzymes and corresponding genes for N-Glycosylation pathway of CHO cells.

Table 2. Main enzymes and corresponding genes for N-Glycosylation pathway of human cells.

Table 3. Listing of some of the Web-based glycan resources.

Table 4. Directory of main links available for some glycan analysis databases and tools.

Table 5. Examples of documented glycan biomarkers, associated diseases, method of detection and reference source.

Fig. 1

Fig. 2

Fig. 3

Fig. 4

Fig. 5

Fig. 6

Table 1

Enzyme | Enzyme name | EC number | NCBI Gene ID | Glycogenes
ManI | α2-mannosidase I | 3.2.1.113 | 130074 |
ManII | α3/6-mannosidase II | 3.2.1.114 | 4124 4122 | MAN2A1 MAN2A2
a6FucT | α6-Fuc-transferase | 2.4.1.68 | 2530 | FUT8
GnTI | β2-GlcNAc-transferase I | 2.4.1.101 | 4245 | MGAT1
GnTII | β2-GlcNAc-transferase II | 2.4.1.143 | 4247 | MGAT2
GnTIII | β4-GlcNAc-transferase III | 2.4.1.144 | 4248 | MGAT3
GnTIV | β4-GlcNAc-transferase IV | 2.4.1.145 | 11320 11282 | MGAT4A MGAT4B
GnTV | β6-GlcNAc-transferase V | 2.4.1.155 | 4249 146664 | MGAT5 MGAT5B
iGnT | Blood group i β3-GlcNAc-transferase | 2.4.1.149 | 10678 10331 79369 | B3GNT2 B3GNT3 B3GNT4
b4GalT | β4-Gal-transferase | 2.4.1.38 | 2683 8704 8703 9334 | B4GALT1 B4GALT2 B4GALT3 B4GALT5
a3FucT | α3-Fuc-transferase | 2.4.1.152 | 2526 2527 2528 2529 10690 | FUT4 FUT5 FUT6 FUT7 FUT9
a3SiaT | α3-Sialyltransferase | 2.4.99.6 | 6487 6484 10402 | ST3GAL3 ST3GAL4 ST3GAL6

Table 2 Enzymes

Enzyme | Enzyme name | EC number | NCBI Gene ID | Glycogenes
ManI | α2-mannosidase I | 3.2.1.113 | 130074 |
ManII | α3/6-mannosidase II | 3.2.1.114 | 4124 4122 | MAN2A1 MAN2A2
a6FucT | α6-Fuc-transferase | 2.4.1.68 | 2530 | FUT8
GnTI | β2-GlcNAc-transferase I | 2.4.1.101 | 4245 | MGAT1
GnTII | β2-GlcNAc-transferase II | 2.4.1.143 | 4247 | MGAT2
GnTIII | β4-GlcNAc-transferase III | 2.4.1.144 | 4248 | MGAT3
GnTIV | β4-GlcNAc-transferase IV | 2.4.1.145 | 11320 11282 | MGAT4A MGAT4B
GnTV | β6-GlcNAc-transferase V | 2.4.1.155 | 4249 146664 | MGAT5 MGAT5B
iGnT | Blood group i β3-GlcNAc-transferase | 2.4.1.149 | 10678 10331 79369 | B3GNT2 B3GNT3 B3GNT4
b4GalT | β4-Gal-transferase | 2.4.1.38 | 2683 8704 8703 9334 | B4GALT1 B4GALT2 B4GALT3 B4GALT5
a3FucT | α3-Fuc-transferase | 2.4.1.152 | 2526 2527 2528 2529 10690 | FUT4 FUT5 FUT6 FUT7 FUT9
a3SiaT | α3-Sialyltransferase | 2.4.99.6 | 6487 6484 10402 | ST3GAL3 ST3GAL4 ST3GAL6
IGnT | Blood group I β6-GlcNAc-transferase | 2.4.1.150 | 2651 | GCNT2
a6SiaT | α6-sialyltransferase | 2.4.99.1 | 6480 | ST6GAL1
b3GalT | β3-Gal-transferase | | 8708 8707 10317 | B3GALT1 B3GALT2 B3GALT5
FucTLe | α3/4-Fuc-transferase III | 2.4.1.65 | 2525 | FUT3
FucTH | α2-Fuc-transferase, Se, H | 2.4.1.69 | 2523 2524 | FUT1 FUT2
GalNAcT-A | Blood group A α3-GalNAc-transferase | 2.4.1.40 | 28 | ABO
GalT-B | Blood group B α3-Gal-transferase | 2.4.1.37 | 28 | ABO
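
The enzyme-to-glycogene mapping in Table 2 can be queried in either direction. The following is a minimal sketch, assuming a small hand-copied subset of the table; the dictionary layout and the function name `enzymes_for_gene` are illustrative assumptions, not part of any published tool.

```python
# Illustrative subset of Table 2: enzyme abbreviation -> (EC number, glycogenes).
# The data layout and function name are assumptions made for this sketch.
TABLE2_SUBSET = {
    "ManII": ("3.2.1.114", ["MAN2A1", "MAN2A2"]),
    "a6FucT": ("2.4.1.68", ["FUT8"]),
    "GnTI": ("2.4.1.101", ["MGAT1"]),
    "GnTIV": ("2.4.1.145", ["MGAT4A", "MGAT4B"]),
    "GalNAcT-A": ("2.4.1.40", ["ABO"]),
    "GalT-B": ("2.4.1.37", ["ABO"]),
}


def enzymes_for_gene(gene_symbol, table=TABLE2_SUBSET):
    """Return the sorted enzyme abbreviations whose glycogene list contains gene_symbol."""
    symbol = gene_symbol.upper()
    return sorted(name for name, (_ec, genes) in table.items() if symbol in genes)
```

For instance, `enzymes_for_gene("ABO")` returns both blood group transferases, reflecting that a single glycogene can encode more than one enzyme activity listed in the table.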

Table 3

Glycan Databases | Description | Web address
CarbBank (CCSD) The complex carbohydrate structure database [101] | CarbBank was the first publicly available glycan structure database, with approximately 50,000 entries extracted from the literature. The database contains glycan structures, publications, biological source information, information about the experimental technique, and data about the attached glycan. | http://glycomics.ccrc.uga.edu/GlycomicsPortal/showEntry.action?id=46
Consortium for Functional Glycomics [26, 60] | The CFG glycan database includes structures from CFG's mammalian glycan array and structures extracted from other CFG databases (CarbBank and GlycoMinds Ltd). Glycan structures are linked with the CFG glycan binding protein database, which includes gene and protein sequences, biological functions, and binding specificities based on the glycan array data. CFG also provides a reagent bank, a glycosyltransferase database, knock-out mouse phenotype data, glycan profile data from MALDI-TOF mass spectra, and enzyme transcriptomic data for various tissues and cells. | http://www.functionalglycomics.org/
EUROCarbDB [27] | EuroCarbDB is a repository of carbohydrate structures and experimental evidence. It provides glycoinformatics tools and databases for interpreting and storing glycan structures and glycan experimental data, including tandem MS, HPLC and NMR. EuroCarbDB also includes a suite of tools: GlycoPeakFinder, AutoGU and Gasper. Other efforts and spin-offs are GlycoWorkbench, MonosaccharideDB, GLYCOSCIENCES.de, and GlycomicsPortal. | http://relax.organ.su.se/eurocarb/home.action
GlycomeDB [24, 69] | A carbohydrate structure metadatabase. Carbohydrate sequences from all freely available databases (CFG, KEGG, GLYCOSCIENCES.de, BCSDB and CarbBank) were converted to GlycoCT and integrated into a new database (GlycomeDB) containing all structures and annotations. GlycomeDB provides cross-references to the integrated databases. | http://www.glycome-db.org
GLYCOSCIENCES.de [29] | The site provides databases and bioinformatics tools for glycobiology and glycomics. Glycan structure data are mainly extracted from CarbBank, the PDB (Protein DataBank) and literature research. It includes biological source information, NMR data and literature references. A glycan 3D structure generation tool [47] is integrated with the structure database. | http://www.glycosciences.de
GlycoSuiteDB [63] | Contains literature-based glycan data that have been experimentally verified. The current database has been integrated into the UniCarbKB database (www.unicarbkb.org). | http://unicarbkb.org/
JCGGDB Japan Consortium for Glycobiology and Glycotechnology database [61] | A portal to search across major bioinformatics databases in Japan. Integrated resources include tumor marker references, glycoepitope databases, glycodisease gene databases, experiment support, glycan profile data from mass spectra, lectin array data, glycoprotein data, glycogene data and tools for glycan structure analysis. | http://jcggdb.jp/
KEGG Kyoto Encyclopedia of Genes [66, 67] | The KEGG glycan structure database contains entries from experimentally determined glycan structures, from CarbBank, from publications, and structures linked with the genes and pathways provided by other KEGG resources. Many resources are available at KEGG, including glycan structures, glycosyltransferases, glycogene information, glycan binding protein data, glycosyltransferase and glycosylhydrolase reactions, and glycosylation pathways. | http://www.genome.jp/kegg/glycan/
SUGABASE [102] | SUGABASE is a carbohydrate-NMR database that combines CarbBank Complex Carbohydrate Structure Data (CCSD) with proton and carbon chemical shift values. | http://glycomics.ccrc.uga.edu/GlycomicsPortal/showEntry.action?id=48
UniCarb-DB [62] | Database with LC-MS/MS data of glycan structures. For each structure, the database provides the biological source, LC-MS/MS data, and publications. | http://unicarb-db.biomedicine.gu.se/
UniCarbKB [64] | A curated resource with experimental data, linked with glyco-related UniProtKB data. In addition to glycan structures, the database provides biological source information and publications. | http://www.unicarbkb.org/

Table 4

Resource | Content | Web address
BRENDA [103] | Repository of comprehensive enzyme information. | http://www.brenda-enzymes.info/
CAZY [104] | Database for display and analysis of genomic, structural, and biochemical information on Carbohydrate-Active enzymes. | http://www.cazy.org/
Elution Coordinate Database [105] | Contains more than 400 pyridylaminated N-glycan structures, their code numbers, and their elution positions on Shimpack CLC-ODS and Amide-80 columns. The elution coordinate database of this site is necessary for the 2-D/3-D sugar mapping techniques. | http://www.gak.co.jp/ECD/Hpg_eng/hpg_eng.htm
Enzyme [106] | Repository of information relative to the nomenclature of enzymes. | http://enzyme.expasy.org/
GGDB Glycogene data base [107] | Database containing information for analysis of glycogenes, including genes associated with glycan synthesis, transport, and nucleotide synthesis. It was the first database to store information on substrate specificity. Genes are stored in XML format. | http://riodb.ibase.aist.go.jp/rcmg/ggdb/
GlycoBase [108] | GlycoBase 3.2.4 is an HPLC/UPLC resource of elution positions for N-linked and O-linked glycan structures determined by a combination of HPLC, UPLC, exoglycosidase sequencing and mass spectrometry (MALDI-MS, ESI-MS, ESI-MS/MS, LC-MS, LC-ESI-MS/MS). It provides glycoinformatics and visualization functionalities for the data mining of glycans. AutoGU, an application that assists the interpretation and assignment of HPLC glycan profiles, is integrated in the database. | https://glycobase.nibrt.ie/glycobase/about.action
GlycoMod [109] | A tool that can differentiate oligosaccharide structures on proteins from their experimentally determined masses and can be used for free or derivatized oligosaccharides and for glycopeptides. | http://web.expasy.org/glycomod/
Glyco-Peakfinder [76] | A tool for determination of glycan compositions from their mass signals for rapid annotation of MS and MSn data with different types of ions. | http://www.glyco-peakfinder.org/
GlycoPep ID | A tool for identifying the peptide moiety of glycopeptides generated using a nonspecific enzyme. | http://hexose.chem.ku.edu/predictiontable.php
GlycoWorkbench [75] | GlycoWorkbench is a standalone suite of software tools for rapid drawing of glycan structures and for assisting the process of structure determination from MS data and annotation using structures from other databases. | http://code.google.com/p/glycoworkbench/
LectinDB [110] | This database provides structure and sequence information on plant lectins. The database has been completely manually annotated. Lectin data are integrated into a common framework together with analytical tools and extensive links. Data for each lectin pertain to taxonomic, biochemical, domain architecture, molecular sequence, and structural details, as well as carbohydrate and blood group specificities. | http://proline.physics.iisc.ernet.in/lectindb/
RINGS [28] | Web resources site with algorithmic and data mining glycan structure tools. | http://rings.t.soka.ac.jp
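
Several tools in Table 4 (for example GlycoMod and Glyco-Peakfinder) are described as deriving glycan compositions from measured masses. As a rough illustration of that idea only, and not the algorithm used by either tool, the sketch below brute-forces monosaccharide counts whose summed mass matches a neutral, underivatized glycan mass. The residue masses are standard monoisotopic values rounded to four decimals and are not taken from the tables above; the function name, tolerance, and search bound are assumptions.

```python
from itertools import product

# Monoisotopic residue masses in Da (standard rounded values; an assumption of
# this sketch, not data from the tables above).
RESIDUE_MASS = {"Hex": 162.0528, "HexNAc": 203.0794, "dHex": 146.0579, "NeuAc": 291.0954}
WATER = 18.0106  # one water added for the free reducing end


def find_compositions(neutral_mass, tol=0.01, max_count=10):
    """Return every residue-count combination whose mass is within tol of neutral_mass."""
    names = list(RESIDUE_MASS)
    hits = []
    # Exhaustive search over 0..max_count copies of each residue type.
    for counts in product(range(max_count + 1), repeat=len(names)):
        mass = WATER + sum(n * RESIDUE_MASS[name] for n, name in zip(counts, names))
        if abs(mass - neutral_mass) <= tol:
            hits.append(dict(zip(names, counts)))
    return hits
```

Calling `find_compositions(1234.4334)` yields candidates that include five hexoses and two N-acetylhexosamines, the composition of a high-mannose Man5GlcNAc2 glycan. Real tools additionally handle adducts, derivatization, and biosynthetic plausibility, which this sketch ignores.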

Table 5

Disease | Method Used | Glycan Biomarker | Reference
Breast cancer | HILIC-HPLC-2AB Labeling | High sialylation of antenna fucosylated glycans | [88]
Hepatocellular carcinoma | MALDI-TOF on permethylated glycans | Decrease in tri- and tetraantennary glycan | [111]
Lung Cancer | HILIC and WAX-HPLC | Increase in SLeX, mono-antennary, tri-sialylated N-glycans | [112]
Ovarian Cancer | MALDI-FTICR-MS of glycans | Increase in sialylated glycans and decrease in neutral and high mannose type glycans | [113]


Graphical abstract
