Computational Biology and Chemistry 80 (2019) 217–224
Research Article
De novo glycan structural identification from mass spectra using tree merging strategy
Fusong Ju a,c,1, Jingwei Zhang a,c,1, Dongbo Bu a,c, Yan Li b,c, Jinyu Zhou b,c, Hui Wang a,c, Yaojun Wang a,c, Chuncui Huang b,c,⁎, Shiwei Sun a,c,⁎
a Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
b Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
c University of Chinese Academy of Sciences, Beijing 100049, China
ABSTRACT
Motivation: Glycans are large molecules with specific tree structures, and they play important roles in a great variety of biological processes. These roles are primarily determined by the fine details of their structures, making glycan structural identification highly desirable. Mass spectrometry (MS) has become the major technology for elucidating glycan structures. Most de novo approaches to glycan structural identification from mass spectra fall into three categories: enumeration followed by filtering, heuristic approaches, and dynamic programming-based approaches. The first suffers from low efficiency, while the latter two may miss the actual glycan structure. Thus, reliably and efficiently identifying glycan structures from mass spectra remains challenging.
Results: In this study we propose an efficient and reliable approach to glycan structure identification using a tree merging strategy. Briefly, for each MS peak, our approach first calculates the monosaccharide composition of the corresponding fragment ion, and then builds a constraint that forces these monosaccharides to be directly connected in the underlying glycan tree structure. According to these connection constraints, we next merge the constituent monosaccharides of the glycan into a complete structure step by step. During this process, the intermediate structures are represented as subtrees, which are merged iteratively until a complete tree structure is generated. Finally, the generated complete structures are ranked according to their compatibility with the input mass spectra. Unlike the traditional strategy of enumeration followed by filtering, our approach performs deisomorphism to remove isomorphic subtrees and rules out invalid structures that violate the connection constraints at each tree merging step, thus significantly increasing efficiency. In addition, all complete structures satisfying the connection constraints are enumerated, so no such structure is missed. Over a test set of 10 N-glycan standards, our approach accomplished structural identification in minutes and ranked the manually validated structure among the top three by score. We further applied our approach to the profiling and subsequent structure assignment of glycans released from glycoprotein mAb, with results in agreement with previous studies and CE analysis.
⁎ Corresponding authors at: University of Chinese Academy of Sciences, Beijing 100049, China.
E-mail addresses: [email protected] (F. Ju), [email protected] (C. Huang), [email protected] (S. Sun).
1 The first two authors contributed equally to this study.
https://doi.org/10.1016/j.compbiolchem.2019.03.015
Received 9 March 2019; Accepted 23 March 2019; Available online 30 March 2019
1476-9271/ © 2019 Published by Elsevier Ltd.

1. Introduction
Glycans are large molecules in which multiple monosaccharides are linked via glycosidic bonds to form tree-like branching structures (Fig. 1a). Glycans serve important purposes in a wide range of biological processes, including protein conformation, molecular recognition, and immunological responses (Raman et al., 2005; Rudd et al., 2001). These biological activities are closely related to their structures, making glycan structural identification highly desirable. However, this task is extremely challenging due to the complex tree structures and the unavailability of templates for glycans.

Mass spectrometry (MS) has become the primary experimental technique for elucidating glycan structures because of its high sensitivity and throughput. In an MS experiment, a glycan of interest is fragmented at a certain glycosidic bond, forming fragment ions denoted as Bi/Yn−i or Ci/Zn−i, where n represents the total number of residues and i represents the cleavage site. Cross-ring cleavages may also occur, forming A- and X-ions (Fig. 1b). The mass and intensity of these fragment ions are then measured and represented as peaks, which constitute the mass spectrum of the glycan under investigation. Mass spectra carry essential structural information and therefore enable the elucidation of glycan structures.

Fig. 1. Tree-like glycan structure (a) and fragment ions of B-, Y-, C-, Z-, A- and X-types produced during mass spectrometry experiments (b). Here we show the structure of N-glycan Man-5D1 as an example, which has a trimannosyl core linked to two HexNAc.

A straightforward strategy for interpreting mass spectra is database searching, i.e., comparing the query mass spectrum against theoretical spectra of glycans gathered in a database, or against experimental spectra with annotated glycan structures (Joshi et al., 2004; Lohmann and von der Lieth, 2004). The most similar mass spectrum is identified, and the corresponding glycan structure is reported as the identification result. The database search strategy has proved successful in peptide sequencing; however, unlike protein databases, glycan databases are still largely incomplete, thus precluding the application of this strategy to glycan structural identification.

Unlike the database search strategy, de novo interpretation of glycan mass spectra does not require any pre-built glycan database and thus has the potential to discover new glycan structures. Typically, a de novo approach consists of two procedures, i.e., enumerating all possible glycan structures and evaluating these candidate structures. An ideal enumerating procedure should generate a small number of candidate structures for further evaluation but should not miss the actual glycan structure. Recent approaches to enumerating glycan structures fall into three categories:

(i) Exhaustive search: Given the precursor ion mass of the glycan under investigation, the monosaccharide composition of the glycan can be easily calculated using the knapsack algorithm (Cooper et al., 2001). The exhaustive search approaches, e.g., STAT (Gaucher et al., 2000), StrOligo (Ethier et al., 2002), and OSCAR (Lapadula et al., 2005), enumerate all possible branching structures that match the monosaccharide composition. This strategy is feasible only for small glycans, as the number of candidate glycan structures increases exponentially with the number of monosaccharides. The huge search space might be narrowed down by applying biosynthetic rules to the candidate glycan structures; however, our knowledge of these rules is incomplete, thus limiting the general applicability of this operation (Ethier et al., 2003; Goldberg et al., 2005, 2006).

(ii) Heuristic approaches: The problem of generating candidate glycan structures without repetitive peak counting has been proved to be NP-hard (Shan et al., 2008). To make computation tractable, several heuristics have been proposed. For example, only a limited number of substructures were kept for each peak (Shan et al., 2008; Dong et al., 2015) to save time and space. Sun et al. proposed to reconstruct the glycan structure step by step and considered a fixed number of high-quality structures at each iteration (Sun et al., 2015). Bocker et al. developed a fixed-parameter tractable algorithm, where the parameter is the number of peaks. For a mass spectrum with a large number of peaks, only the k most intense peaks are required to be used at most once, while other peaks are allowed to be used multiple times (Bocker et al., 2011).

(iii) Dynamic programming-based approaches: Similar to de novo peptide sequencing (Chen et al., 2001), GLYCH uses dynamic programming to find the most probable branching structure from tandem MS spectra (Tang et al., 2005). GLYCH has proved successful except for its preference for linear structures over branched ones, which is incurred by repetitive peak counting (Bocker et al., 2011). Recently Kumozaki et al. formulated the candidate generating problem as an integer linear programming problem, and then used dynamic programming to infer the most probable structures (Kumozaki et al., 2015). To make computation manageable, dynamic programming approaches usually return a fixed number of top scoring structures, e.g., GLYCH reports the top 200 candidate structures for subsequent evaluation (Tang et al., 2005).

The evaluation of candidate glycan structures relies heavily on an accurate scoring scheme. Most of the existing scoring schemes are variants of the shared peak count, i.e., the count of the common peaks shared by the query experimental spectrum and the theoretical spectrum of a candidate glycan structure (Tang et al., 2005; Goldberg et al., 2005, 2006). Biosynthetic rules of glycans have also proved effective in evaluating structures (Bocker et al., 2011). Recently machine learning techniques were proposed to learn how to evaluate structural elements (Kumozaki et al., 2015; Horlacher et al., 2017) and rank candidate structures (Hong et al., 2017).

Analysis suggested that the existing exhaustive searching approaches usually suffer from low efficiency in generating candidate structures, whereas the heuristic and dynamic programming approaches may miss the correct glycan structures (Gaucher et al., 2000; Hong et al., 2017; Tang et al., 2005). Thus, accurately and efficiently identifying glycan structures from mass spectra remains challenging.

In this study, we present an efficient and reliable de novo approach, called gNovo, to glycan structural identification. The major advantages of gNovo are summarized as follows: (i) Unlike traditional exhaustive searching approaches, which enumerate all candidate structures for subsequent filtering, gNovo removes invalid and redundant structures at every step of its execution, therefore significantly improving the efficiency of candidate enumeration. (ii) In addition, our approach avoids missing structures, as all candidate structures satisfying the constraints are enumerated. Computational experiments showed that gNovo could perform accurate de novo sequencing of N-glycans and could be used for profiling and subsequent structural assignment of glycans released from glycoproteins.
2. Methods
Before describing the details of the gNovo algorithm, we first introduce the notations used in this study.

2.1. Notations
• We represent a mass spectrum as a peak list P = {p1, …, pN}, where pi denotes the mass of the ith peak, and the peaks are sorted in ascending order of mass. For convenience, we enrich the peak list by adding two auxiliary peaks: p0, a dummy peak with mass 0, and pN+1, the precursor ion. The enriched mass spectrum can thus be described as P = {p0, p1, …, pN, pN+1}.
• A glycan structure with n monosaccharides u1, u2, …, un is modeled as a tree T, in which a node represents a monosaccharide and an edge represents a glycosidic bond connecting two monosaccharides. Our approach constructs the tree structure by adding glycosidic bonds step by step; thus at intermediate steps of the construction process, only part of the glycosidic bonds are known, and the connected monosaccharides form several fragments of the underlying tree structure. Here each fragment is represented as a subtree. Thus, an incomplete glycan structure is usually a collection of subtrees rather than a complete tree, and is represented as a forest F.
• For each peak pi, we use a set Ci to represent the monosaccharide composition of the fragment ion corresponding to pi.
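For illustration only, a composition Ci could be obtained by a bounded enumeration over residue counts, in the spirit of the knapsack approach cited in the Introduction. The sketch below is a minimal Python version under stated assumptions: the residue masses are rounded placeholder values chosen for the example rather than values taken from this study, `target` is assumed to be the peak mass after terminal groups and the adduct have already been subtracted, and the name `enumerate_compositions` is hypothetical.

```python
def enumerate_compositions(target, residue_masses, tol=0.5):
    """Return every multiset of residues whose total mass is within tol of target.

    residue_masses: dict mapping a residue name to its (assumed) mass.
    The result is a list of dicts {residue name: count}, zero counts omitted.
    """
    names = sorted(residue_masses)
    results = []

    def extend(index, remaining, counts):
        if index == len(names):
            if abs(remaining) <= tol:
                results.append({n: c for n, c in counts.items() if c > 0})
            return
        name = names[index]
        mass = residue_masses[name]
        # Largest count of this residue that still fits into the remaining mass.
        max_count = int((remaining + tol) // mass)
        for count in range(max_count + 1):
            counts[name] = count
            extend(index + 1, remaining - count * mass, counts)
        counts.pop(name, None)

    extend(0, target, {})
    return results


# Placeholder masses for the example only (not values reported in the paper).
EXAMPLE_MASSES = {"Hex": 204.0, "Fuc": 174.0}
print(enumerate_compositions(582.0, EXAMPLE_MASSES))
```

With these placeholder masses, the only composition that adds up to 582 is two Hex plus one Fuc. When several compositions fit, the text selects among them by their match with the spectrum, which this sketch does not attempt.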
2.2. Algorithm
The objective of gNovo is to reconstruct the structure of the glycan of interest by interpreting its mass spectra. Here the MS2 spectrum was used to generate candidate structures, and multi-stage spectra were used to rank these candidate structures. The key idea of gNovo is to apply a tree merging strategy to generate candidate structures efficiently and reliably from the MS2 spectrum.

The basic procedure of gNovo is as follows. We start with a null structure without any glycosidic bond among monosaccharides, i.e., each monosaccharide is itself a degenerate subtree, and the collection of these subtrees forms a forest. Next, we merge certain subtrees into one according to the glycosidic bonds derived from MS2 peaks. The underlying rationale of tree merging is that each MS2 peak corresponds to a fragment of the glycan under investigation, which consists of several monosaccharides with glycosidic bonds connecting them. Thus each peak imposes the restriction that the monosaccharides concerned should form a directly connected subgraph in the underlying tree structure. This connection information is exploited to merge the subtrees concerned. The tree merging operation is repeated until all constituent monosaccharides are connected into a single, complete tree. To speed up this process, duplicated structures are removed after each tree merging operation to avoid redundancy. The pseudocode of gNovo is listed below.

Algorithm 1. GNOVO for glycan identification
    Enrich the input MS2 spectrum P = {p1, …, pN} by adding two auxiliary peaks p0 and pN+1, where p0 is a dummy peak with mass 0 and pN+1 represents the precursor ion;
    Calculate the monosaccharide composition of the glycan;
    Let Ci denote the monosaccharide composition of the fragment ion corresponding to peak pi;
    Set S = {CN+1}; // Start from the empty structure without any glycosidic bond among monosaccharides
    Set FS = {S}; // Initialize the forest set with only one structure
    for each peak pi in p0, p1, …, pN+1 do
        FSnew = {};
        for each structure S in FS do
            for each mapping of Ci onto S do
                Snew = TREEMERGING(Ci, S); // Merge certain subtrees in S based on Ci
                Insert Snew into FSnew;
            end for
        end for
        FS = DEISOMORPHISM(FSnew); // Remove isomorphic structures from FSnew
    end for
    Filter and rank candidate tree structures in FS, and return the top ones.
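The TREEMERGING step has to consider every way of joining the k nodes of a composition into one connected fragment. The following is a minimal sketch of that enumeration using Prüfer sequences, the technique the text names for this step: each sequence of length k − 2 over k labels decodes to exactly one labeled tree. It assumes nodes are simply numbered 0 to k − 1 and ignores monosaccharide types and pre-existing bonds (the super-node shrinking); the function names are illustrative and this is not the gNovo source code.

```python
import heapq
from itertools import product


def prufer_to_edges(seq, k):
    """Decode a Pruefer sequence over labels 0..k-1 into the k-1 edges of a tree."""
    degree = [1] * k
    for node in seq:
        degree[node] += 1
    # Min-heap of current leaves, so the smallest-labelled leaf is removed first.
    leaves = [node for node in range(k) if degree[node] == 1]
    heapq.heapify(leaves)
    edges = []
    for node in seq:
        leaf = heapq.heappop(leaves)
        edges.append((leaf, node))
        degree[node] -= 1
        if degree[node] == 1:
            heapq.heappush(leaves, node)
    # Exactly two nodes remain; they form the last edge.
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges


def enumerate_fragments(k):
    """Return all labeled trees on k nodes, each as a list of edges."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if k == 1:
        return [[]]  # a single node has no bond
    return [prufer_to_edges(seq, k) for seq in product(range(k), repeat=k - 2)]


print(enumerate_fragments(2))       # the single possible bond between two nodes
print(len(enumerate_fragments(4)))  # number of labeled trees on four nodes
```

For two nodes the sketch yields one fragment, matching the single fragment f obtained from C1 in the 2-FL example, and the number of trees grows as k^(k−2), which is why the enumeration is practical only for small compositions.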
We describe the key procedures of gNovo below.

• Calculating monosaccharide composition: Similar to Gaucher et al. (2000) and Ethier et al. (2003), we applied the knapsack algorithm to derive the monosaccharide composition of the glycan of interest from its mass. Since there are usually multiple possible monosaccharide compositions, we next select the most probable one according to its match with the given mass spectrum (Ethier et al., 2003). Finally, for each peak pi, the matched monosaccharide composition is identified and denoted as Ci. For example, from the precursor mass of 2-FL (m/z 651), we derived its monosaccharide composition as 2 galactose/glucose plus 1 fucose (Fig. 2a). Here galactose and glucose are not differentiated as they have identical mass. We also calculated the composition C1 of the peak p1 at m/z 433 as 2 galactose/glucose, and the composition C2 of the peak p2 at m/z 463 as 1 galactose/glucose plus 1 fucose. It is worth pointing out that although the knapsack problem is NP-hard, the algorithm presented here is very efficient, as the number of monosaccharides is usually less than 30 for general glycans.

• Merging subtrees in forest FS: The input of tree merging is two-fold: (i) an incomplete structure S consisting of a collection of subtrees, each subtree representing a fragment of the underlying glycan structure; and (ii) the monosaccharide composition Ci corresponding to peak pi, which is also a fragment of the glycan of interest. The goal of this procedure is to merge certain subtrees into one based on the connection information implied by Ci. Using glycan 2-FL as an example, we describe the main concept of tree merging (Fig. 2). First, as the correspondence between the monosaccharides in Ci and those of the glycan cannot be determined in advance, we need to enumerate all such correspondences. As shown in Fig. 2b, the monosaccharide composition derived from the peak p2 contains 1 fucose and 1 galactose/glucose. As there are 1 fucose and 2 galactose/glucose in the glycan, we have a total of 2 possible correspondences. Next, for each correspondence, we enumerate all possible glycosidic bonds that connect the monosaccharides in Ci into a fragment. To this end, we applied the Prüfer sequence technique (Kajimoto, 2003). Specifically, a fragment over |Ci| monosaccharides corresponds one-to-one to a sequence of length |Ci| − 2 over {1, 2, …, |Ci|}. Thus, we first generate all such sequences and then convert them into the corresponding fragments. Although there are O(|Ci|^(|Ci|−2)) Prüfer sequences in total, this enumerating operation is efficient in practice, as |Ci| is usually a small number. Note that before this fragment enumerating procedure, there might already exist some glycosidic bonds connecting certain monosaccharides in Ci, making the generic Prüfer sequence technique fail. To circumvent this difficulty, we shrank each connected component into a super-node before executing Prüfer coding, and expanded these super-nodes afterwards. Finally, each possible fragment was used as a bridge to merge the trees concerned into one tree. As shown in Fig. 2c (left panel), the fragment f connects two subtrees: one subtree has a single fucose, and the other has two galactose/glucose. Thus, using f as a bridge, we merged these two subtrees into one bigger subtree. These new trees constitute a new forest Snew (Fig. 2).

• Removing redundant structures from FS: Although our algorithm starts from a single initial structure, we always have multiple intermediate structures during the execution due to the various mappings of the monosaccharides in Ci onto a structure S. The intermediate structures, collected in the structure set FS, usually contain significant redundancy, i.e., some intermediate structures are essentially duplicates of other structures after an appropriate permutation of monosaccharides (e.g., the two structures Snew and S′new in Fig. 2c). The duplicated structures, called isomorphic in graph theory, will lead to substantial redundancy in subsequent structure construction. To detect and thereafter remove isomorphic structures, we employed the Weisfeiler–Lehman kernel technique (Shervashidze et al., 2010). The basic idea of the Weisfeiler–Lehman kernel technique is to assign each structure a hash value such that structures with identical hash values are isomorphic. Consider a tree structure rooted at node r. Its hash value is defined recursively over the children nodes v1, …, vd of r as follows:

\[
H(r) =
\begin{cases}
f([\,],\ \mathrm{label}(r)) & \text{if } r \text{ is a leaf node} \\
f(\mathrm{sort}([H(v_1), \ldots, H(v_d)]),\ \mathrm{label}(r)) & \text{otherwise}
\end{cases}
\]

Here label(r) represents the monosaccharide type of node r, and f(·, ·) is a function that transforms a sorted list and a monosaccharide type into a number.
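The recursive hash H(r) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: a tree is assumed to be a nested tuple `(label, [children])`, the function f is instantiated with SHA-256 over a canonical string (any injective encoding would serve), and a forest is hashed by attaching its subtrees to a virtual root whose label `"ROOT"` is a hypothetical placeholder.

```python
import hashlib


def f(sorted_child_hashes, label):
    """Combine a node label with the sorted hashes of its children."""
    payload = label + "(" + ",".join(sorted_child_hashes) + ")"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def tree_hash(node):
    """H(r): a leaf yields f([], label); otherwise the child hashes are sorted first."""
    label, children = node
    return f(sorted(tree_hash(child) for child in children), label)


def forest_hash(subtrees, virtual_label="ROOT"):
    """Hash a collection of subtrees by hanging them under a virtual root node."""
    return tree_hash((virtual_label, list(subtrees)))


def deisomorphism(structures):
    """Keep one representative forest per hash value."""
    representatives = {}
    for forest in structures:
        representatives.setdefault(forest_hash(forest), forest)
    return list(representatives.values())


# Two drawings of the same rooted tree, differing only in child order.
s_new = ("Hex", [("Hex", []), ("Fuc", [])])
s_new_prime = ("Hex", [("Fuc", []), ("Hex", [])])
linear = ("Hex", [("Hex", [("Fuc", [])])])

print(tree_hash(s_new) == tree_hash(s_new_prime))
print(len(deisomorphism([[s_new], [s_new_prime], [linear]])))
```

Sorting the child hashes before combining them is what makes the value independent of the order in which children are listed, so permuted copies of the same structure collapse onto one representative.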
For an intermediate structure consisting of multiple subtrees, we first transformed the forest into a single tree by adding a virtual root node v and connecting it to the root nodes of all constituent subtrees. Then H(v) is calculated in the same way and used as the hash value of this intermediate structure. The calculation of H(r) is efficient, as it takes only O(n) time, where n represents the number of monosaccharides.

• Filtering and ranking candidate tree structures in FS: The candidate tree structures in FS were filtered according to rules derived from the biosynthesis of N-glycans. Specifically, in eukaryotes, the first event of N-glycan biosynthesis involves assembling an oligosaccharide precursor structure. Following the covalent attachment of oligosaccharides to asparagine residues, a series of processing reactions and diversifications occurs. The N-glycan diversification generates three N-glycan subtypes, namely high-mannose, hybrid, and complex, depending on enzyme activity in the Golgi. The hybrid and complex glycans may exist with two or more GlcNAc-bearing branches, forming multi-antennary structures. During this process, GlcNAc residues may be added to the trimannosyl core by six different GlcNAc transferases. It should be pointed out that, due to competition among GlcNAc transferases for the same substrate, only a subset of the linkages performed by the GlcNAc transferases can occur on any one N-glycan structure (Wearsch et al., 2011).

According to the biosynthesis process of N-glycans, some candidate tree structures in FS are invalid. To filter out these invalid structures, we applied the following rules, which are similar to those used in StrOligo (Ethier et al., 2003):
(i) At most 4 antennae are allowed in multi-antennary structures.
(ii) A GlcNAc can be added to the middle mannose of the trimannosyl core.
(iii) Neu5Ac/Neu5Gc can be added to the end of the branches of complex-type N-glycans.

After filtering out invalid structures from FS, we ranked the remaining candidate structures and finally reported the top ones. In particular, for each candidate structure, we first predicted its theoretical spectrum by simulating its fragmentation in a mass spectrometry experiment, and then compared the theoretical spectrum against the experimental spectrum. The dot product between them was calculated as the similarity measure and thereafter used to rank candidate structures. In this study, multistage mass spectra were used to rank candidate structures.

Fig. 2. The concept of gNovo using glycan 2-FL as an example. (a) For the MS2 peak p1 (m/z 433), we calculated its monosaccharide composition C1 as 2 galactose/glucose. For the peak p2 at m/z 463, we calculated its monosaccharide composition C2 as 1 galactose/glucose plus 1 fucose. (b) gNovo starts from a degenerate structure FS0, where no glycosidic bond is known. As there are only two monosaccharides in C1, we obtained only 1 fragment f through enumerating all possible connections. According to f, we added a glycosidic bond between the two galactose/glucose, generating a new structure FS1. (c) As there are 1 fucose and 2 galactose/glucose in the glycan, we have 2 possible correspondences. For each correspondence, we used C2 to add glycosidic bonds, producing two structures Snew and S′new as results. Since Snew and S′new are essentially isomorphic, only one structure was kept after removing redundant structures. Thus gNovo successfully constructed the actual structure of 2-FL from its MS2 spectrum.

3. Results and discussion
The algorithm described above has been implemented in a program, gNovo, written in Java and Python on a Linux/Unix system. Running times were measured on an AMD CPU, 2.6 GHz, with 64 GB memory. We tested gNovo on mass spectra of 10 N-glycan standards (A2, Man-5D1, Man-6, Man-7D1, NA2, NA3, NA4, NGA2, NGA3, and NGA4), and 8 N-glycans released from glycoprotein mAb adalimumab. The molecular masses and structures of these glycans are listed in Supplementary Tables 1 and 2. The mass spectra of these glycans were obtained as described below.

3.1. MALDI-MS experiments
N-glycans A2, NGA3, NGA4, Man-5D1, and Man-6 were purchased from Ludger (Abingdon, England). Man-7D1 was provided by Vladimir Piskarev (Nesmeyanov Institute of Organoelement Compounds, Russian Academy of Sciences, Moscow). Glycoprotein mAb is an over-the-counter drug. Fractions of permethylated standards and N-glycans released from glycoproteins were analyzed on an Axima MALDI Resonance mass spectrometer (Shimadzu). A nitrogen laser was used to irradiate samples at 337 nm, with an average of 200 shots made. An aqueous solution of standards and glycans from glycoproteins (ca. 100 pmol, 0.5 μL) was applied to a μ focus MALDI plate target (900 μm, 384 circles, HST). A solution (0.5 μL) of 2,5-dihydroxybenzoic acid (DHB, 20 mg/mL) in a mixture of methanol-water (1:1) containing 0.1% trifluoroacetic acid (TFA) and 1 mM NaCl was added to the plate and then mixed with the samples. The mixture was air dried by keeping it at room temperature for several minutes. Observed ion peaks were calibrated using a mixture of standard peptides as mass markers. The settings of the mass spectrometer are listed in the supplementary material.

In this study we used the MS2 spectrum to calculate candidate structures and used MSn (n > 2) spectra to rank these candidate structures. For a glycan of interest, we generated its MSn (n > 2) spectra by selecting the most intense peaks (top 5) as precursor ions at each MS scanning stage. The instrument used in this study can generate spectra up to MS5. Thus for each glycan, we generated 1 MS1 spectrum, 1 MS2 spectrum, 5 MS3 spectra, 25 MS4 spectra, and 125 MS5 spectra. These spectra are available upon request.

Table 1
Identification of branching structures of 10 N-glycan standards with gNovo. Here all candidate structures were ranked using a scoring function, and the rank of the real structure is listed in column 4.

Glycan     #Residues   #Candidates   Rank   Running time
A2         11          124           1      2.692 s
Man5-D1    7           3             1      0.011 s
Man6       8           4             1      0.066 s
Man7D1     9           8             1      0.086 s
NA2        9           35            1      0.406 s
NA3        11          242           3      24.224 s
NA4        13          1610          2      804.173 s
NGA2       7           3             1      0.011 s
NGA3       8           4             1      0.054 s
NGA4       9           8             1      0.099 s

3.2. Structural assignment of N-glycan standards
We first evaluated gNovo on 10 N-glycan standards with branching structures. As shown in Table 1 and Supplementary Table 1, our approach correctly assigned structures for 8 out of the 10 N-glycans. In all cases, gNovo generated just a few candidate structures, and the real structure was among them. As the number of candidate structures was significantly reduced, gNovo accomplished structure assignment for these glycans within minutes or even seconds. More importantly, for all of these glycans except NA3 and NA4, the real structure was ranked first among the generated candidates, thus correctly elucidating the structure.

Take the N-glycan Man-7D1 as a concrete example. The MS1 spectrum gave a MNa+ at m/z 1987. According to its MS2 spectrum, gNovo generated eight candidate structures (Fig. 3). All of these candidate structures shared the same trimannosyl core but differed in the connecting pattern of the other four mannose, i.e., the four mannose were linearly connected to the trimannosyl core in candidate structures G2, G5 and G8 but formed complex branching patterns in G1, G3, G4, G6, and G7. According to the scoring function described above, the real structure G1 was ranked 1st and therefore reported as the final predicted structure.

On NA3 and NA4, although the real structure was listed as one of the candidate structures, it was not ranked 1st by the scoring function, thus leading to an incorrect structural assignment. As shown in Fig. 4, the MS1 spectrum of NA3 gave a MNa+ at m/z 2519, and according to its MS2 spectrum, gNovo generated a total of 242 candidate structures. These candidate structures shared the common trimannosyl core but differed in the connecting pattern of the three N-acetyl glucosamine and three mannose: in the real structure G3, three N-acetyl glucosamine connect to the trimannosyl core first, and three mannose then connect to the three N-acetyl glucosamine. In contrast, in the candidate structure G4, three mannose connect to the trimannosyl core first, and three N-acetyl glucosamine connect to the three mannose. Compared with G3 and G4, the candidates G1 and G2 have a mixed connecting scheme. As G1 and G2 were given a score (8.89) higher than that of G3 (8.25), gNovo reported G1 and G2 as the final prediction result. The case of NA4 is similar and is thus omitted here. The failure of gNovo on NA3 and NA4 implies that a more accurate scoring function is highly desirable, which is part of our future work.
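Because ranking hinges on the dot product between theoretical and experimental spectra, a small sketch may help make the scoring concrete. It rests on assumptions that the text does not specify: spectra are given as (m/z, intensity) pairs, peaks are matched by rounding m/z into bins of a hypothetical width `bin_width`, and no normalization is applied. The function names are illustrative only.

```python
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) pairs that fall into the same m/z bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned


def dot_product_score(theoretical, experimental, bin_width=1.0):
    """Unnormalized dot product between two binned spectra."""
    t = bin_spectrum(theoretical, bin_width)
    e = bin_spectrum(experimental, bin_width)
    return sum(value * e[key] for key, value in t.items() if key in e)


def rank_candidates(candidates, experimental, bin_width=1.0):
    """Return (score, name) pairs sorted from best to worst.

    candidates: dict mapping a candidate name to its theoretical peak list.
    """
    scored = [
        (dot_product_score(peaks, experimental, bin_width), name)
        for name, peaks in candidates.items()
    ]
    return sorted(scored, reverse=True)


# Toy data for illustration; these are not spectra from the study.
experimental = [(433.2, 3.0), (463.2, 5.0)]
candidates = {
    "A": [(433.0, 1.0), (463.0, 1.0)],
    "B": [(433.0, 1.0), (500.0, 1.0)],
}
print(rank_candidates(candidates, experimental))
```

In this toy case candidate A shares both peaks with the experimental spectrum and therefore outranks B. Candidates whose theoretical spectra overlap heavily would receive close scores under such a measure, which is consistent with the near ties reported for NA3.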
3.3. Application to profiling and structural assignment of individual glycans released from glycoprotein mAb adalimumab
We further applied our approach to N-glycan profiling and the subsequent detailed structural identification of individual components of glycoprotein mAb adalimumab (Fig. 5). The MS1 spectrum exhibited multiple components, and we applied gNovo to determine the glycan of each component. As shown in Table 2 and Supplementary Table 2, for the sample isolated from mAb adalimumab, the gNovo program performed product-ion scanning for each of the eight MNa+ peaks, m/z 1835, 2040, 2070, 2081, 2244, 2285, 2401, and 2605. Among these peaks, five could be identified as NGA2F, FNGA2B, FA2G1-a/FA2G1-b, NA2F, and FA2G1S1, essentially in agreement with a previous report on the Fc-glycosylation profile (Reusch et al., 2015) and CE analysis (Fig. 5).

However, gNovo failed to determine the structures at m/z 2070, 2285, and 2605. Close examination suggested that for these glycans, the real structure was generated as a candidate structure but was not ranked 1st by the scoring function. Take the peak at m/z 2070 as an example. The gNovo program generated the real structure (NA2) as one of the candidate structures but ranked it 2nd, therefore leading to an incorrect structural assignment. It is interesting to note that the structure ranked 1st differed from the real structure at the position of only one N-acetyl glucosamine. Similar phenomena were observed for the peaks at m/z 2285 and 2605. These incorrect structural assignments might be corrected by using a more accurate scoring function and filtering rules.

Taken together, these results demonstrate that the gNovo approach can be combined with MALDI-MS N-glycan profiling (Canis et al., 2012) to assign a structure to each N-glycan of a glycoprotein.

Fig. 3. Identification of Man-7D1 by gNovo. The MS1 spectrum gave a MNa+ at m/z 1987, and eight glycan structures were generated as candidates according to the MS2 spectrum. Each of the candidates was assigned a score according to its compatibility with the MSn spectra. Man-7D1 (G1) has the highest score and thus was reported as the actual glycan structure.
Fig. 4. Identification of NA3 by gNovo. The MS1 spectrum gave an MNa+ ion at m/z 1906 and, according to the MS2 spectrum, 242 glycan structures were generated as candidates; only 8 of them are shown here. In the real structure G3, three N-acetyl glucosamines connect to the trimannosyl core first, and three mannoses then connect to the three N-acetyl glucosamines. In contrast, the candidates G1 and G2 have a mixed connecting scheme. As G1 and G2 were given a score (8.89) higher than that of G3 (8.25), gNovo reported G1 and G2 as the final prediction result.

Fig. 5. Profiling and subsequent sequence assignment of N-glycans released from mAb adalimumab. (a) The MALDI mass spectrum of the released N-glycans as permethylated derivatives showed eight N-glycans with MNa+ at m/z 1835, 2040, 2070, 2081, 2244, 2285, 2401, and 2605, respectively. The structures identified by gNovo (Supplementary Table 2) are shown next to the ion peaks. For the peaks at m/z 2070, 2285 and 2605, the identification by gNovo failed, and the actual glycans are shown in red rectangles. The peak at m/z 2040 is a mixture of two isomers (peaks 9 and 10 in panel b). (b) The electropherogram of the released N-glycans indicated a total of twelve components; among these, the identities of eight can be corroborated by available N-glycan standards (peak 1, A1F; peak 6, A1F; peak 7, NGA2F; peak 9, FA2G1-a; peak 10, FA2G1-b; peak 11, NA2F) and the others by literature data (Mechref et al., 2009).

3.4. Comparison with exhaustive search strategy

To examine the efficiency of gNovo, we compared it against the exhaustive search strategy. As shown in Table 2, when N-glycan constraints were not considered, the exhaustive search strategy could not accomplish structure enumeration in reasonable time for glycans with 10 or more residues. The number of generated candidate structures was also high, leading to difficulties in subsequent filtering and ranking operations. When N-glycan constraints were considered, the exhaustive search strategy could finish in reasonable time, as it suffices to enumerate the residues outside the trimannosyl core. However, for large glycans, the exhaustive search strategy needed much more time than gNovo; e.g., for glycan NA4 with 13 residues, gNovo took only 804.173 s whereas the exhaustive search strategy took 14947.287 s even with the N-glycan constraints. More importantly, gNovo never generated more candidate structures than the exhaustive search strategy, and in many cases generated fewer. For example, gNovo generated 8 candidate structures for the glycan m/z 2401 of mAb, while the exhaustive search strategy generated 57 candidate structures. The advantage of a succinct candidate set is rooted in the deisomorphism operation of gNovo, which significantly eases the difficulties in filtering and ranking. It is also reasonable to expect that the advantage of gNovo is more prominent for larger glycans.
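The effect of deisomorphism on the size of the candidate set can be illustrated with a minimal sketch. It assumes that a glycan topology is modeled as a rooted tree of residue labels with unordered children and no linkage information; the names `canonical` and `deisomorphize` are hypothetical and are not taken from the gNovo implementation, whose actual procedure may differ.

```python
from typing import List, Tuple

# Assumed model: a topology is (residue_label, [child subtrees]).
# The order of children carries no meaning.
Tree = Tuple[str, list]


def canonical(tree: Tree) -> str:
    """Return an encoding that is identical for isomorphic rooted trees.

    Sorting the children's encodings removes any dependence on the
    order in which branches happen to be listed.
    """
    label, children = tree
    return label + "(" + ",".join(sorted(canonical(c) for c in children)) + ")"


def deisomorphize(candidates: List[Tree]) -> List[Tree]:
    """Keep one representative per isomorphism class, preserving input order."""
    seen = set()
    unique = []
    for tree in candidates:
        key = canonical(tree)
        if key not in seen:
            seen.add(key)
            unique.append(tree)
    return unique


# t1 and t2 list the two branches in opposite order; t3 is a linear chain.
t1 = ("GlcNAc", [("GlcNAc", [("Man", [("Man", []), ("Man", [("GlcNAc", [])])])])])
t2 = ("GlcNAc", [("GlcNAc", [("Man", [("Man", [("GlcNAc", [])]), ("Man", [])])])])
t3 = ("GlcNAc", [("GlcNAc", [("Man", [("Man", [("Man", [("GlcNAc", [])])])])])])

unique = deisomorphize([t1, t2, t3])
print(len(unique))  # 2: t1 and t2 collapse into a single candidate
```

Applying such a filter at every merging step, rather than once at the end, keeps duplicate subtrees from multiplying in later steps, which is consistent with the smaller candidate sets reported for gNovo.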
Table 2. Comparison of gNovo with the exhaustive search strategy over N-glycans. The label "–" is used when an algorithm could not finish in reasonable time (within 24 h). We examined two manners of exhaustive search: one without considering N-glycan constraints, i.e., enumerating all possible tree structures formed by the composing saccharides, and the other with N-glycan constraints taken into consideration, i.e., enumerating all possible structures formed by the composing saccharides outside the trimannosyl core. Running times are in seconds.

                       Exhaustive search w/o      Exhaustive search with
                       N-glycan constraints       N-glycan constraints       gNovo
Glycan     #Residues   #Candidates  Time (s)      #Candidates  Time (s)      #Candidates  Time (s)
A2         11          –            –             180          22.721        124          2.692
Man-5D1    7           78           0.228         3            0.021         3            0.011
Man-6      8           194          5.806         4            0.013         4            0.066
Man-7D1    9           460          146.767       8            0.071         8            0.086
NA2        9           1470         159.673       37           0.069         35           0.406
NA3        11          –            –             252          24.520        242          24.224
NA4        13          –            –             1659         14947.287     1610         804.173
NGA2       7           126          0.283         3            0.206         3            0.011
NGA3       8           357          6.611         4            0.011         4            0.054
NGA4       9           1007         161.186       8            0.070         8            0.099
mAb_1835   8           1637         7.205         4            0.024         4            0.007
mAb_2040   9           6870         194.347       11           0.018         11           0.055
mAb_2070   9           1470         166.657       37           0.070         37           0.208
mAb_2081   9           5560         196.983       9            0.016         9            0.353
mAb_2244   10          –            –             38           0.153         37           0.048
mAb_2285   10          –            –             53           0.146         50           0.569
mAb_2401   10          –            –             37           0.207         8            0.048
mAb_2605   11          –            –             124          2.401         49           0.695

4. Conclusion

We presented here a de novo approach to the elucidation of glycan structures. The capability of our approach was demonstrated by the successful assignment of structures for 10 glycan standards within minutes. For profiling N-glycans released from glycoproteins, our approach adds an extra dimension to traditional profiling by assigning a structure to each composing glycan. The structural identification results are essentially in agreement with previous studies and CE analysis. The reliability and efficiency of our approach highlight the special features of the tree merging strategy and of the deisomorphism operations accompanying the merging steps. In the present work, only glycosidic cleavages of B-, Y-, C- and Z-types were considered for inferring structures. These ions contain little information about the linkages among saccharides. In contrast, the cross-ring cleavages of A- and X-types contain linkage information and thus can be used for linkage assignment. How to incorporate these cleavages into gNovo is part of our future work. Taken together, the approach presented here should greatly facilitate our understanding of glycan structures and their applications in health and disease.
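To make concrete why glycosidic cleavages inform topology but say little about linkage, the following sketch enumerates the monosaccharide compositions of the fragments produced by a single glycosidic cleavage. It is an illustration under stated assumptions, not the gNovo procedure: the glycan is modeled as a rooted tree with the reducing-end residue at the root, fragments are represented by composition only (mass offsets between B/C and Y/Z ions and the effect of permethylation are ignored), and the function names are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

# Assumed model: (residue_label, [child subtrees]); the root is the reducing end.
Tree = Tuple[str, list]
Composition = Tuple[Tuple[str, int], ...]


def freeze(comp: Counter) -> Composition:
    """Turn a Counter into a hashable, order-independent composition."""
    return tuple(sorted(comp.items()))


def subtree_compositions(tree: Tree) -> List[Counter]:
    """Compositions of all subtrees in post-order; the whole tree comes last."""
    label, children = tree
    collected: List[Counter] = []
    total = Counter([label])
    for child in children:
        child_comps = subtree_compositions(child)
        collected.extend(child_comps)
        total += child_comps[-1]  # last entry is the child's full subtree
    collected.append(total)
    return collected


def glycosidic_fragments(tree: Tree) -> FrozenSet[Composition]:
    """Compositions obtainable by cutting exactly one glycosidic bond."""
    comps = subtree_compositions(tree)
    whole = comps[-1]
    fragments = set()
    for comp in comps[:-1]:
        fragments.add(freeze(comp))          # non-reducing-end side (B/C-like)
        fragments.add(freeze(whole - comp))  # reducing-end side (Y/Z-like)
    return frozenset(fragments)


# Same composition (1 GlcNAc, 2 Man), different topologies.
linear = ("GlcNAc", [("Man", [("Man", [])])])
branched = ("GlcNAc", [("Man", []), ("Man", [])])

print(sorted(glycosidic_fragments(linear)))
print(sorted(glycosidic_fragments(branched)))
```

In this toy case the linear chain yields four distinct fragment compositions and the branched tree only two, so the two topologies can be told apart. Nothing in the model records which ring position a bond is attached to, so two glycans that differ only in linkage would produce identical sets here; distinguishing them requires additional information such as the cross-ring A- and X-type cleavages.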
Funding

This work was supported by the National High-Tech Research and Development Project [2014AA021101], the National Natural Science Foundation of China [31600650, 31671369, and 31770775], and the National Key Research and Development program of China [FC2018YFC0910405].

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.compbiolchem.2019.03.015.

References

Bocker, S., Kehr, B., Rasche, F., 2011. Determination of glycan structure from tandem mass spectra. IEEE/ACM Trans. Comput. Biol. Bioinform. 8 (4), 976–986.
Canis, K., McKinnon, T.A.J., Nowak, A., Haslam, S.M., Panico, M., Morris, H.R., Laffan, M.A., Dell, A., 2012. Mapping the N-glycome of human von Willebrand factor. Biochem. J. 447 (2), 217–228.
Chen, T., Kao, M.Y., Tepel, M., Rush, J., Church, G.M., 2001. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 8 (3), 325–337.
Cooper, C.A., Gasteiger, E., Packer, N.H., 2001. Glycomod – a software tool for determining glycosylation compositions from mass spectrometric data. Proteomics 1 (2), 340–349.
Dong, L., Shi, B., Tian, G., Li, Y., Wang, B., Zhou, M., 2015. An accurate de novo algorithm for glycan topology determination from mass spectra. IEEE/ACM Trans. Comput. Biol. Bioinform. 12 (3), 568–578.
Ethier, M., Saba, J.A., Ens, W., Standing, K.G., Perreault, H., 2002. Automated structural assignment of derivatized complex N-linked oligosaccharides from tandem mass spectra. Rapid Commun. Mass Spectrom. 16 (18), 1743–1754.
Ethier, M., Saba, J.A., Spearman, M., Krokhin, O., Butler, M., Ens, W., Standing, K.G., Perreault, H., 2003. Application of the StrOligo algorithm for the automated structure assignment of complex N-linked glycans from glycoproteins using tandem mass spectrometry. Rapid Commun. Mass Spectrom. 17 (24), 2713–2720.
Gaucher, S.P., Morrow, J., Leary, J.A., 2000. STAT: a saccharide topology analysis tool used in combination with tandem mass spectrometry. Anal. Chem. 72 (11), 2331–2336.
Goldberg, D., Bern, M., Li, B., Lebrilla, C.B., 2006. Automatic determination of O-glycan structure from fragmentation spectra. J. Proteome Res. 5 (6), 1429–1434.
Goldberg, D., Sutton-Smith, M., Paulson, J., Dell, A., 2005. Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra. Proteomics 5 (4), 865–875.
Hong, P., Sun, H., Sha, L., Pu, Y., Khatri, K., Yu, X., Tang, Y., Lin, C., 2017. GlycoDeNovo – an efficient algorithm for accurate de novo glycan topology reconstruction from tandem mass spectra. J. Am. Soc. Mass Spectrom. 28 (11), 2288–2301.
Horlacher, O., Jin, C., Alocci, D., Mariethoz, J., Müller, M., Karlsson, N.G., Lisacek, F., 2017. Glycoforest 1.0. Anal. Chem. 89 (20), 10932–10940 (PMID: 28901741).
Joshi, H.J., Harrison, M.J., Schulz, B.L., Cooper, C.A., Packer, N.H., Karlsson, N.G., 2004. Development of a mass fingerprinting tool for automated interpretation of oligosaccharide fragmentation data. Proteomics 4 (6), 1650–1664.
Kajimoto, H., 2003. An extension of the Prüfer code and assembly of connected graphs from their blocks. Graphs Combinatorics 19 (2), 231–239.
Kumozaki, S., Sato, K., Sakakibara, Y., 2015. A machine learning based approach to de novo sequencing of glycans from tandem mass spectrometry spectrum. IEEE/ACM Trans. Comput. Biol. Bioinform. 12 (6), 1267–1274.
Lapadula, A.J., Hatcher, P.J., Hanneman, A.J., Ashline, D.J., Zhang, H., Reinhold, V.N., 2005. Congruent strategies for carbohydrate sequencing. 3. OSCAR: an algorithm for assigning oligosaccharide topology from MSn data. Anal. Chem. 77 (19), 6271–6279.
Lohmann, K.K., von der Lieth, C.-W., 2004. GlycoFragment and GlycoSearchMS: web tools to support the interpretation of mass spectra of complex carbohydrates. Nucleic Acids Res. 32 (suppl_2), W261–W266.
Mechref, Y., Hussein, A., Bekesova, S., Pungpapong, V., Zhang, M., Dobrolecki, L.E., Hickey, R.J., Hammoud, Z.T., Novotny, M.V., 2009. Quantitative serum glycomics of esophageal adenocarcinoma and other esophageal disease onsets. J. Proteome Res. 8 (6), 2656–2666.
Raman, R., Raguram, S., Venkataraman, G., Paulson, J.C., Sasisekharan, R., 2005. Glycomics: an integrated systems approach to structure–function relationships of glycans. Nat. Methods 2 (11), 817.
Reusch, D., Haberger, M., Maier, B., Maier, M., Kloseck, R., Zimmermann, B., Hook, M., Szabo, Z., Tep, S., Wegstein, J., et al., 2015. Comparison of methods for the analysis of therapeutic immunoglobulin G Fc-glycosylation profiles – part 1: separation-based methods. MAbs, vol. 7. Taylor & Francis, pp. 167–179.
Rudd, P.M., Elliott, T., Cresswell, P., Wilson, I.A., Dwek, R.A., 2001. Glycosylation and the immune system. Science 291 (5512), 2370–2376.
Shan, B., Ma, B., Zhang, K., Lajoie, G., 2008. Complexities and algorithms for glycan sequencing using tandem mass spectrometry. J. Bioinform. Comput. Biol. 6 (01), 77–91.
Shervashidze, N., Schweitzer, P., van Leeuwen, E.J., Mehlhorn, K., Borgwardt, K.M., 2010. Weisfeiler–Lehman graph kernels. J. Mach. Learn. Res. 1 (01), 1–48.
Sun, W., Lajoie, G.A., Ma, B., Zhang, K., 2015. A novel algorithm for glycan de novo sequencing using tandem mass spectrometry. In: International Symposium on Bioinformatics Research and Applications. Springer. pp. 320–330.
Tang, H., Mechref, Y., Novotny, M.V., 2005. Automated interpretation of MS/MS spectra of oligosaccharides. Bioinformatics 21 (suppl_1), i431–i439.
Wearsch, P.A., Peaper, D.R., Cresswell, P., 2011. Essential glycan-dependent interactions optimize MHC class I peptide loading. Proc. Natl. Acad. Sci. U.S.A. 108 (12), 4950–4955.