Analytica Chimica Acta 743 (2012) 80–89
Contents lists available at SciVerse ScienceDirect
Analytica Chimica Acta journal homepage: www.elsevier.com/locate/aca
Improve accuracy and sensibility in glycan structure prediction by matching glycan isotope abundance Guang Xu a,b , Xin Liu a , Qing Yan Liu b , Yanhong Zhou a,∗∗ , Jianjun Li b,∗ a b
College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China National Research Council Canada, Ottawa, Ont., Canada K1A 0R6
h i g h l i g h t s
g r a p h i c a l
a b s t r a c t
A glycan isotope pattern recognition strategy for glycomics. A new data preprocessing procedure to detect ion peaks in a giving MS spectrum. A linear soft margin SVM classification for isotope pattern recognition.
a r t i c l e
i n f o
Article history: Received 19 April 2012 Received in revised form 13 June 2012 Accepted 10 July 2012 Available online 16 July 2012 Keywords: Bioinformatics Glycan Isotope Mass spectrometry
a b s t r a c t Mass Spectrometry (MS) is a powerful technique for the determination of glycan structures and is capable of providing qualitative and quantitative information. Recent development in computational method offers an opportunity to use glycan structure databases and de novo algorithms for extracting valuable information from MS or MS/MS data. However, detecting low-intensity peaks that are buried in noisy data sets is still a challenge and an algorithm for accurate prediction and annotation of glycan structures from MS data is highly desirable. The present study describes a novel algorithm for glycan structure prediction by matching glycan isotope abundance (mGIA), which takes isotope masses, abundances, and spacing into account. We constructed a comprehensive database containing 808 glycan compositions and their corresponding isotope abundance. Unlike most previously reported methods, not only did we take into count the m/z values of the peaks but also their corresponding logarithmic Euclidean distance of the calculated and detected isotope vectors. Evaluation against a linear classifier, obtained by training mGIA algorithm with datasets of three different human tissue samples from Consortium for Functional Glycomics (CFG) in association with Support Vector Machine (SVM), was proposed to improve the accuracy of automatic glycan structure annotation. In addition, an effective data preprocessing procedure, including baseline subtraction, smoothing, peak centroiding and composition matching for extracting correct isotope profiles from MS data was incorporated. The algorithm was validated by analyzing the mouse kidney MS data from CFG, resulting in the identification of 6 more glycan compositions than the previous annotation and significant improvement of detection of weaker peaks compared with the algorithm previously reported. Crown Copyright © 2012 Published by Elsevier B.V. All rights reserved.
1. Introduction ∗ Corresponding author at: National Research Council Canada, 100 Sussex Drive, Ottawa, Ont., Canada K1A 0R6. Tel.: +1 613 990 0558; fax: +1 613 952 9092. ∗∗ Corresponding author. Tel.: +86 027 87792217. E-mail addresses:
[email protected] (Y. Zhou),
[email protected] (J. Li).
Glycomics is defined as the study of the biological functions of glycans. Determination of glycan structures is the basis and crucial part in glycomics field [1]. The structural changes of carbohydrate chains can cause abnormal proliferation and metastasis of cancer,
0003-2670/$ – see front matter. Crown Copyright © 2012 Published by Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.aca.2012.07.009
G. Xu et al. / Analytica Chimica Acta 743 (2012) 80–89
autoimmune diseases and inflammation [2,3]. However, the diversity and heterogeneity of the glycan structures still post challenges to precisely determinate the glycoforms. Unlike proteins, glycans are the secondary products of genes and their synthesis are generally associated with glycosyltransferase, glycosidases and other carbohydrate related enzymes [1–3]. Advances in glycomics are anticipated to be driven by improvements in the development of techniques to determine complex glycan structures, quantitative analysis of glycans, carbohydrate sequencing and bioinformatics. Mass Spectrometry (MS) is extremely useful for the characterization of glycan structures and their quantities. Tandem mass spectrometry (MS/MS) can provide more detailed fragment ion information, and consequently determine the glycan structures. MS has become one of the key means in glycomics research, however the manual processing and interpretation of huge amount of MS data have become a bottleneck in high throughput analysis of glycans. Therefore, computational methods with the ability to automatically identify glycan molecules and effectively annotate MS spectra are highly desired to the MS-based glycomics researches [4,5]. First reported in 2001, Glycomod, attempted to derive all possible monosaccharide compositions according to the detected molecule weights in conjunction with certain structural rules [6]. Unlike Glycomod, STAT and StrOligo can predict the topology structures (though no linkage information) of glycans from tandem mass spectrometry spectra [7,8]. However, these algorithms can only applied to those glycans with limited numbers of monosaccharide units owing to their computational complexity and are not able to mine structural information from cross-ring fragmentation. GLYCH is thus proposed to utilize the cross-ring fragmentation information, produced by high energy collision-induced dissociation MS/MS, to ingeniously resolve the diverse linkage types in glycan structures through a dynamic programming algorithm [9]. In addition, a top–down sequencing strategy was introduced in the Oligosaccharide Subtree Constraint Algorithm (OSCAR) [10]. This de novo algorithm proposes structures from multiple-stage tandem mass spectrometry (MSn ) data without using biosynthetic constraints or comparisons against reported glycans. However, it depends on how informative the fragmentation pathways are and more importantly it requires MSn data, which can be very challenging when the available instruments and sample amounts are limited. On the other hand, similar to the database-based strategies used in protein identification, such as MASCOT [11] and SEQUEST [12], databases searching algorithms are also exploited in glycan structures interpretation [13–17]. GlycosidIQ was developed to compare MS data against the previously annotated MS spectra stored in GlycoSuiteDB [13,14]. GlycoWorkbench is a software tool to assist manual interpretation of MS data [15]. This software can accomplish structural determination by searching database from Consortium for Functional Glycomics (CFG) [16] or GLYCOSCIENCES [17]. It can also evaluate a set of structures proposed by users by matching the corresponding theoretical list of fragment ions against the list of peaks derived from the spectrum. Cartoonist has been used to automatically annotate N-glycan compositions, which labels peaks in MALDI spectra of permethylated N-glycans with diagrams or cartoons of the most plausible glycans consistent with the peak masses and the types of glycans being analyzed [18]. Although it can determine the precision and calibration of the machine used to generate the spectrum automatically based on the spectrum itself, it does not include other crucial data preprocessing steps, such as background subtraction, noise smoothing and isotope pattern recognition for peak detection and structure prediction. Molecule isotope distribution has been used for accurate interpretation in mass spectrometry [19–23]. A comprehensive MS calibration base on the isotope cluster distribution has been proposed to achieve spectral accuracy and further enhance the mass
81
measurement and formula determination [19,20]. The isotope abundance matching has been used in reliably excluding the invalid elemental formulas of molecule ions and fragment ions [21–23]. However, matching glycan isotope abundance (mGIA) strategy has not been applied in glycomics. Most glycan tools only consider the m/z value of monoisotope variant, but not including the information related to isotope pattern. Thus it is expected that incorporating mGIA for MS-based glycomics can improve the accuracy in the glycan mass spectra data interpretation. In this paper, we present a mGIA algorithm for automatic interpretation of MS data and accurate annotation of glycan compositions and structures. This algorithm is especially useful for analyzing low abundant peaks. When applied to the MALDI-TOF MS data from mouse samples, mGIA algorithm detected more glycan structures than that reported in CFG profiling database or obtained by using Cartoonist algorithm [18]. 2. Methods 2.1. MALDI-TOF analysis MALDI-TOF spectra were acquired on a 4800 MALDI-TOF/TOF spectrometer (AB Sciex, Concord, Canada). Dihydroxybenzoic acid (DHB, 10 mg mL−1 ) in 50% MeCN was used as the matrix. 2.2. Data preprocessing The mGIA algorithm focused on improving the detection accuracy of isotope vectors in conjunction with the comprehensive information collected in CFG database. We first adopted a data preprocessing procedure with an intensity-weighted centroid method for peak detection. This algebraic algorithm can improve the sensitivity of centroid peak detection and mass accuracy without setting a conventional threshold. Based on the CFG glycan collection, we reconstructed such database to be used as a reference for mGIA, in which isotope vectors for each glycan composition is included. Using the publicly available data sets to train the algorithm, we obtained a linear SVM algorithm in a Cartesian coordinate system with m/z value as abscissa and ln(De ) (logarithmic Euclidean distance) as ordinate to identify signal peaks, which can assess on the confidence of the proposed structures. 2.3. Baseline correction MS data can be influenced by chemical background, sample preparation and instrumental operation conditions. In order to compare different areas in a single MS spectrum or multiple MS spectra against theoretically generated data, a flat baseline is needed. There are a few baseline removal and smoothing algorithms currently available for this purpose [24,25]. For example, both discrete wavelet transform and continuous wavelet transform have been used for MS-based proteomic data analysis [26,27]. Another baseline correction method, proposed by Xi and Rocke for NMR data, has recently been adopted to MALDI FT-ICR MS data analysis [28,29]. The primary mission of these algorithms is to remove the baseline as completely as possible. Unfortunately, it is very difficult to obtain a complete set of isotopes, especially for the glycans with low abundance, mainly due to the inadvertent removal of the corresponding heavier isotope variants. In order to implement the mGIA method, we modified a previously reported method [30,31]. Instead of using the mean or median peak intensity, we utilized a fixed window to divide a MS spectrum into small segments, 2-Da in width, and determined the minimum point bi in each segment. From the point sets {bi }, through incorporating Michael Thomas Flanagan’s java scientific library, we generated a baseline for the raw spectrum with a cubic
82
G. Xu et al. / Analytica Chimica Acta 743 (2012) 80–89
Fig. 1. Workflow chart of the new algorithm.
interpolation S{bi } rather than a linearly interpolation [30–32]. This modification makes the baseline curve more moderate, although the resulted background is not as clean as that from other baseline subtraction methods. The new data sets {x1b , y1b } with a m/z value xi and intensity yi in the raw data were constructed from the following functions: xib = xi yib
(1)
= (yi − S{bi } (yi ))+
However there often exist local lumps on the slope that can cause false identification (Fig. 2B). Although a smoothing procedure may reduce false rate over a simple baseline subtraction processing, it is still not possible to remove local lumps completely (Fig. 2C). When traversing the data points along the m/z axis in ascending order, each data point in a MS spectrum can be classified in either a rising slope side datasets RSi or a falling slope side datasets FSi of ith peak Pi . This classification makes it possible to divide the MS spectrum into small segment so that each one contains only a single peak, with a starting point at the beginning of a rising slope line and ending point at the beginning of a following arising slope line. If continuous increment in ion abundance of three adjacent data points is detected, the algorithm sets the first data point as the beginning of a rising slope side. Similarly, a data point is classified as in a falling slope side when the peak intensities decrease continuously over three neighboring points. This is to determine the turning point for falling slope and trigger the algorithm to search the beginning of another rising slope, i.e. the starting point of another peak. The advantage of the new peak centroiding algorithm is that the peak width, even for the weak isotope peaks, can be detected without the consideration of instrument-dependent parameters, such as mass resolution [33]. It is important to note that an effective smoothing procedure is essential to lowering the false-positive rate. After the segment width being determined, the centroided peaks (x1c , y1c ) can be calculated using an improved intensityweighted average method:
2 xj −xitop ≤min(xitop −xileft ,xiright −xitop ) xj yj c xi = , 2 xj −xitop ≤min(xitop −xileft ,xiright −xitop ) yj 3 xj −xitop ≤min(xitop −xileft ,xiright −xitop ) yj c yi = , 2 xj −xitop ≤min(xitop −xileft ,xiright −xitop ) yj top
left
(3)
(4)
right
where ˛+ ≡max{˛,0} and i ranges from 1 to the data numbers n (Fig. 1).
where xk , xi , xI are the m/z value of the maximum intensity, the lower end and upper end m/z values in the data sets FSi ∪RSi , top right left respectively. The difference between xk and xI or between xi
2.4. Smoothing and centroiding
and xk , whichever smaller, is selected as the window to calculate the intensity and the m/z value of the centroided peak so that the potential interference from peak tailing can be minimized. Furthermore, the square intensity-weighted averages are used to calculate the m/z values x1c and the integrated intensity y1c . The results indicated that this algorithm consequently decreases the interval error between isotopes. Fig. 3 shows the impact of data preprocessing on the raw data. The expanded view of MALDI-MS raw spectrum from the N-glycans of lactoferrin and a calculated baseline curve are presented in Fig. 3A. After subtracting the baseline from the data, the resulting curve clearly shows that the uneven background has been removed (Fig. 3B). The centroided spectrum is then produced from the baseline-subtracted data (Fig. 3C). It is worthy of mentioning that the computational error in the process of smoothing and centroiding could not be eliminated especially in the high m/z segment of a MS spectrum, due to the limited resolution of MALDI-TOF. A higher resolution instrument (e.g. MALDI-FT-ICR MS) will certainly be useful to obtain more meticulous peak information from the raw data because more data points can be used to calculate x1c and y1c .
top
In a MS spectrum with moderate mass resolution, one peak can consist tens of data points, either genuine signals or noise peaks. A Gaussian weighted filter was selected to smooth the spectrum in this study. The following functions were used to generate the smoothed data sets {x1s , y1s }:
xis = xib , yis =
2
ye (xj ,yj ) ∈ Wi j
(xj ,yj ) ∈ Wi
−(xj −xb ) /2 2 i
2
e
−(xj −xb ) /2 2
,
(2)
i
where the deviation equals to window width and Wi is the set consisting of all the data involved in the window with center at (x1b , y1b ) and the width of 6x1b (6 or 7 points in one window). The use of a variable width is to ensure sufficient data points for Gaussian weighted filter processing (Fig. 2A). Although the intensity of individual point may be slightly altered, the weighted smoothing method has the advantage of preserving the total intensity of all data points within the peak, which is extremely important since the isotopic distribution will be used for comparison in peak detection and glycan identification. In order to extract intact isotope profiles, every isotope in a mass spectrum has to be represented by exactly one data point, which leads to the requirement of peak centroiding in data preprocessing. A widely used algorithm for peak centroiding is to detect the local maximum and in conjunction with peak intensity threshold [33].
2.5. GIA detection Most available software uses the detected molecular weights to search against glycan databases for structural determination. CFG glycan database collected hundreds of N-linked glycan structures in combination with the information related to sample types, glycans from glycoconjugates in human and mouse tissues and cells, and
G. Xu et al. / Analytica Chimica Acta 743 (2012) 80–89
83
Fig. 2. Peak smoothing and centroiding. (A) The relation of the interval between two neighboring points and their corresponding m/z. (B) Peak shapes after baseline subtraction (dotted line). (C) Peak shapes after smoothing (solid line) and centroiding.
biological functions [34]. The goal of this algorithm is to develop a novel algorithm by using molecular weight matching for composition prediction and isotope profile matching for validation. Since the improved peak centroiding method makes it possible to obtain accurate GIA for a giving mass spectrum, the next step is to integrate the theoretical isotope profiles with their corresponding glycan structures. The GIA of each glycan can be calculated in combination with the natural abundances of atom carbon (C), hydrogen (H), oxygen (O) and nitrogen (N). The masses of the isotopic variants of a molecule can be computed by summing the masses of the isotopes that compose the molecule. The monoisotopic peak is composed of the most abundant isotopes, i.e. 12 C, 1 H, 16 O and 14 N. The contributions of other stable isotopes, i.e. 13 C, 2 H, 15 N, 17 O, and 18 O, were taken into account to calculate heavier isotopic variants, up to the isotopic variant with additional four neutrons (M + 1, M + 2, M + 3 and
M + 4). The GIA of each glycan thus can be defined using a Glycan Isotope Abundance Vector (GIAV), expressing as (RM , RM+1 , RM+2 , RM+3 , and RM+4 ). The term RM stands for the relative abundance of the monoisotopic variant and is always set to 100. The term RM+1 , RM+2 , RM+3 , and RM+4 is the normalized abundance of the isotopic variants with additional one, two, three and four neutrons, respectively. Due to the very low probability of occurrence, the isotopic variants with five or more additional neutrons are ignored in the algorithm. It should be noted that glycans with the same additional neutron content were pooled together to represent an aggregated isotopic variant (Figure S1 in Supporting Information). As previously addressed, the mass deficit can be ignored for the instruments with a limited resolution [35]. Moreover, the relative abundance of a glycan molecule can be obtained by symbolically expanding a polynomial function, typically the combination of atom isotope abundance vectors AIAVX (X∈{C,H,O,N}). For a glycan composition
Fig. 3. MALDI-MS data pre-processing example. (A) Expanded view of raw data and calculated baseline (m/z 1000–1100). (B) Profile after baseline subtraction. (C) Centroided spectrum.
84
G. Xu et al. / Analytica Chimica Acta 743 (2012) 80–89
Table 1 Examples of the glycan compositions and their corresponding m/z and NIRV. Composiion
Cal. m/z
NIRV
Cal. m/z (PerMe, Na)
NIRV
Hex4 dHex1 HexNAc4 Hex8 HexNAc2 Hex7 dHex1 NeuAc4 HexNAc6
1625.6 1721.6 3682.3
(100.0,73.29,35.49,12.88,3.88) (100.0,75.05,37.99,14.42,4.56) (100.0,165.79,157.00,108.39,60.26)
2040.03 2192.09 4587.27
(100.0,104.98,63.56,28.16,10.07) (100.0,111.27,71.53,33.67,12.80) (100.0,237.11,300.28,267.65,187.43)
of CnC HnH OnO NnN , the AIAVX (X∈{C,H,O,N}) is given according to the nature isotope abundance and the number of the atom X (nX ). The AIAVC can be easily calculated using the formula:
nC
1, 1
R13 C/12 C ,
nC
2 R13
2
C/12 C
nC
, 3
3 R13
C/12 C
nC
, 4
4 R13
(5)
C/12 C
where R13 C/12 C indicates the rate between atom isotope 13 C and in nature. Similarly, the AIAVH and AIAVN can be derived by replacing the corresponding nature abundance and the numbers of hydrogen and nitrogen in a molecule, respectively. Actually, the number of atom N can be smaller than 4 (nN < 4). In these cases, we set the corresponding vector components to 0, when the ordinals are greater than nN + 1 in the AIAVN . However, modification is necessary to compute AIAVO because of the presence of three stable isotopes, i.e. 16 O, 17 O, 18 O: 12 C
nO
1,
R17 O/16 O ,
1
+
nO 2
nO 2
2 R17 + O/16 O
2
R18 O/16 O +
3
nO
1
3
nO
R18 O/16 O ,
1
nO 3
3 R17 + O/16 O
2
nO
1
2
R18 O/16 O R17 O/16 O ,
nO 4
4 R17 O/16 O
2
(6)
R18 O/16 O R17 O/16 O
After the atom isotope abundance vectors are constructed, the relative abundance of isotope variants RM+i (i = 1, 2, 3, 4) can be obtained by the functions from the combination of AIAVC , AIAVH , AIAVO and AIAVN (see the Supporting Information, Computational 1). Table 1 shows several examples of the calculated glycan isotope abundance vectors (cGIAV). Derived from the CFG database, the cGIAV of 7820 glycan structures or 808 glycan compositions were obtained, in both native form and permethylated form. The next step is to detect the true glycan signal by matching the calculated molecule weight of each composition with the centroiding m/z value in a preprocessed MS spectrum. If a peak p0 (x0 , y0 ) matches a glycan composition with a difference in m/z value less than 1.0 (optional in this software), the algorithm then searches for the additional four isotopic peaks pi (xi , yi ), with the interval (x0 + i − 0.1, x0 + i + 0.1), i = 1, 2, 3, 4. The peak p0 (x0 , y0 ) is not considered as a glycan if missing any one of the isotopic variants, M + 1, M + 2 and M + 3. The results demonstrated that this criterion can significantly improve the identification confidence. However, it was also found that the presence of the isotopic variant with four additional neutron (M + 4) is not crucial for positively identifying a glycan composition. In such case, the intensity vector is simply filled with 0. Therefore, the detected GIAV (dGIAV) can be extracted from the spectrum with a format of 100 × (1, y1 /y0 , y2 /y0 , y3 /y0 , y4 /y0 ), and the detected isotope m/z vector consists of (x0 , x1 , x2 , x3 , x4 ). The Euclidean distance De between the cGIAV and dGIAV of each candidate composition is chosen to evaluate the confidence of glycan identification. We applied a logarithmic transformation for De , i.e. ln(De ), to reduce the orders of magnitude and amplify the difference in variation of data:
5 2 ln
(cGIAV (i) − dGIAV (i) ) ,
We can use the MALDI-MS spectrum of one known glycan structure, Hex4 dHex1 HexNAc4 , to demonstrate the merit of the proposed peak centroiding method and illustrate how the isotope matching works. The cGIA of the glycan and the experimental spectrum are presented in Fig. 4A and B, respectively. The calculated m/z values of the first five isotopic variants are m/z 2040.03, 2041.03, 2042.03, 2043.03 and 2044.03; whereas the experimental isotopic variants were determined to be m/z 2039.71, 2040.72, 2041.71, 2042.70 and 2043.70 (Fig. 4C). The presence of such comparable isotopic variants in the range of m/z 2040 and m/z 2045 indicates the great possibility that detected ion has the same composition as predicted. In the meantime, the cGIAV and dGIAV can be expressed as two vectors, (100, 104.98, 63.56, 28.16, 10.07) and (100, 101.20, 61.57, 30.93, 11.03), respectively. Thus, the algorithm may further
(7)
i=1
where ln is the natural logarithm (base e) and i is the ith element in vector.
validate the results by calculating the Euclidean distance and its logarithmic transformation, which is 1.64 in this particular case. Interestingly, this example also demonstrates that the peak widthbased method can result in erroneous centroiding spectrum due to the existence of many small lumps (Fig. 4D). 3. Results and discussion 3.1. Glycan identification CFG contains a large number of glycan MS profiling data from a variety of tissues and cells from human patients or mice [36]. The majority of CFG database are derived from MALDI-TOF mass experiments, which produced very little fragment signals and are suitable to our algorithm. The meticulous manual annotation of peaks appeared in the database can be used as the expert knowledge and help us to find the rules for screening the most likely ones from all the candidate compositions. Although the annotation in the database needs to be further confirmed, we assume that it is correct so that it can be used to generate classification criteria by the learning process. The criteria can then be applied for other data annotation. As discussed, molecular weight matching is selected as the preliminary measure for predicting glycan compositions, from which the m/z values of their corresponding isotopic variants can be derived. Such isotope peak information is then served as an additional criterion, i.e. the detection of the expected isotopic variants within ± 0.1 m/z. Both criteria deal with the mass accuracy, the mass error tolerance for molecular weight (maximum relative mass error allowed) or masses between two adjacent isotopic variants. However, how to use the isotopic distributions as an independent validation measure remains to be examined. In order to obtain a measurable norm, we analyzed three sets of MS data from CFG database, including spleen, liver and skin. For skin sample, a total
G. Xu et al. / Analytica Chimica Acta 743 (2012) 80–89
85
Fig. 4. Isotope profiles of glycan Hex4 dHex1 HexNAc4 (m/z 2040.03). (A) cGIA. (B) Raw MALDI-MS data. (C) Centroided spectrum using our method. (D) Centroided spectrum using peak width function [33].
Fig. 5. Linear SVM classifier are trained from two properties ln(De ) and m/z values of the detected peaks in human liver, skin, spleen samples, and the pooled data. The peaks were divided into two group, previously identified or unidentified. (A) Plot of property IMZRE vs m/z in human skin sample data. (B) Plot of ln(De ) vs m/z in human skin sample data. (C) The linear classifiers trained from different sample data. Lines 1, 2, 3 and 4 correspond to spleen, liver, combined and skin, respectively. Line 5 gives an safer boundary from the trained ones, the peaks below it possess a higher confidence. Key: ( ) previously unidentified, ( ) previously identified, () Support vectors.
86
G. Xu et al. / Analytica Chimica Acta 743 (2012) 80–89
of 162 compositions were derived from molecular weight matching with dynamic tolerance in combination with the detection of isotopic variants, in the range of m/z 1500–4000. Because of the positive correlation between mass error and m/z values (Fig. 2A), we could set the dynamic tolerance, a product of calculated m/z value and a constant coefficient. As illustrated in Figure S2 (Supporting Information), 0.0001, 0.0002 and 0.0004 were examined and 0.0002 was chosen as the coefficient in our algorithm because 0.0001 resulted in more false positives and 0.0004 resulted in more false negatives. A total of 32 out of the 33 CFG-reported compositions are identified among the 134 glycan candidates. The results
Table 2 The average correct rate of 10-fold cross-validation in different data classification applying linear SVM algorithm. Samples
Avg. correct rate
Liver Skin Spleen Combined
0.859 0.883 0.898 0.906
Fig. 6. Test and comparison of our method in mouse kidney sample data. (A) Plot of ln(De ) vs m/z. The areas below the solid line and between two lines contain the identification with either higher confidence or lower confidence. A highlighted peak is previously unidentified but detected by mGIA algorithm. (B) and (C) show the close matching between the calculated and detected isotopes profile at m/z 2418.22. Key: (- - -) combined classifier, (—) safe boundary, () unidentified peaks, () previously identified.
Fig. 7. Automatically annotation of one portion of the mouse kidney profile with m/z range from 1500 Da to 3250 Da. The corresponding compositions and references are listed in Table 3.
Table 3 The calculated m/z values, relative intensities and compositions of corresponding peaks and CFG ID number of the possible N-linked glycan structures in mouse source. Composition
Cal.
Det.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Hex5 HexNAc2 Hex6 HexNAc2 Hex3 dHex1 HexNAc4 Hex3 HexNAc5 Hex7 HexNAc2 Hex5 HexNAc4 Hex3 dHex1 HexNAc5 Hex8 HexNAc2 Hex5 dHex1 HexNAc4 Hex4 dHex1 HexNAc5 Hex9 HexNAc2 Hex5 dHex2 HexNAc4 Hex4 dHex2 HexNAc5 Hex5 dHex1 HexNAc5 Hex4 dHex1 HexNAc6 Hex5 dHex3 HexNAc4 Hex5 dHex2 HexNAc5 Hex3 dHex3 HexNAc6 Hex5 dHex3 HexNAc5 Hex6 dHex2 HexNAc5 Hex5 dHex2 HexNAc6 Hex6 dHex3 HexNAc5 Hex5 dHex3 HexNAc6 Hex6 NeuAc3 HexNAc3 Hex6 dHex2 HexNAc6 Hex6 dHex4 HexNAc5
1579.79 1783.89 1835.93 1906.97 1987.99 2070.04 2081.06 2192.09 2244.13 2285.16 2396.19 2418.22 2459.25 2489.26 2530.28 2592.31 2663.34 2674.36 2837.43 2867.44 2908.47 3041.53 3082.56 3112.53 3112.57 3215.62
1579.92 1784.04 1836.09 1907.14 1988.16 2070.20 2081.23 2192.27 2244.30 2285.36 2396.40 2418.43 2459.44 2489.46 2530.48 2592.51 2663.55 2674.55 2837.62 2867.63 2908.68 3041.69 3082.76 3112.72 3112.72 3215.76
Hex4 dHex1 HexNAc4 Hex4 HexNAc5 Hex7 dHex1 HexNAc2 Hex3 dHex1 HexNAc6 Hex6 dHex2 HexNAc3 Hex6 dHex1 HexNAc4 Hex10 HexNAc2 Hex6 dHex1 HexNAc7
2040.03 2111.07 2162.08 2326.18 2377.19 2448.23 2600.29 3183.61
2040.20 2111.26 2162.22 2326.35 2377.40 2448.42 2600.48 3183.87
1 2 3 4 5 6 7 8 a
m/z
Rel. Int. (%)
ln(De )
Annotation
CFG identification number
44.9 30.3 7.2 4.5 19.2 5.7 8.3 31.3 10.4 23.7 34.9 6.4 32.2 7.0 1.6 9.6 17.0 5.1 100 4.3 4.5 3.8 10.1 2.3 2.3 5.9
1.59 1.45 3.07 3.13 1.76 3.15 2.46 2.3 2.36 1.42 1.61 2.94 1.92 2.07 3.5 2.38 2.54 3.44 3.34 3.75 3.79 1.64 2.76 3.73 3.72 3.28
p,c p p N/A p p p p,c p p,c p,c N/A p,c p N/A p p,c N/A p,c c p,c p p,c N/A N/A p,c
carbNlink carbNlink carbNlink carbNlink carbNlink carbNlink carbNlink carbNlink carbNlink carbNlink carbNlink N/Aa N/A carbNlink carbNlink N/A carbNlink N/A N/A N/A N/A N/A N/A N/A N/A N/A
3.31 3.39 3.58 3.47 3.74 3.77 3.82 4.03
p N/A N/A N/A N/A N/A N/A N/A
carbNlink 21264 D000,carbNlink 21326 P,carbNlink 21240 D000,carbNlink 20938 D000 carbNlink 20422 D000, carbNlink 44158 D000,carbNlink 20428 D000 N/A N/A N/A carbNlink 20423 D000,carbNlink 21231 D000,carbNlink 40797 D000 N/A N/A
5.2 2.7 1.7 2.0 3.2 6.9 4.1 0.89
20482 40082 30182 21244 21078 20137 41658 20714 20382 21243 30017
D000, carbNlink 21123 D000, carbNlink 40975 D000, carbNlink 40431 D000 D000, carbNlink 45263 D000, carbNlink 21079 D000, carbNlink 40976 D000, carbNlink 40460 D000 A,carbNlink 20387 D000,carbNlink 45038 D000,carbNlink 20435 D000 D000,carbNlink 41136 D000,carbNlink 40734 D000 D000, carbNlink 40977 D000,carbNlink 40973 D000, carbNlink 45264 D000,carbNlink 40118 D000 D000,carbNlink 42667 D000,carbNlink 44160 D000 D000,carbNlink 40334 D000 D000,carbNlink 40972 D000,carbNlink 45265 D000, carbNlink 40466 D000,carbNlink 40080 D000 D000,carbNlink 20684 D000,carbNlink 41593 D000,carbNlink 20139 D000 D000,carbNlink 20939 D000,carbNlink 44165 D000 A
30247 A,carbNlink 44172 D000 40084 D000 20418 D000
G. Xu et al. / Analytica Chimica Acta 743 (2012) 80–89
Peak
N/A, not available; p, previously annotated by CFG database; c, previously annotated by Cartoonist algorithm.
87
88
G. Xu et al. / Analytica Chimica Acta 743 (2012) 80–89
also indicated that the average of m/z errors of isotope variants is dependent on the molecular weights. To make the discussion clearer, we computed the isotopes m/z relative error (IMZRE) by function:
1000
5 (cMZ (i) i=1
2
− dMZ (i) ) /5
cMZ (1)
(8)
where cMZ and dMZ are calculated and detected isotope m/z value vectors, respectively. Fig. 5a indicates that IMZRE is approximately steady over a range of m/z 1500–4500, which is also the reason for not using a fixed m/z error to evaluate the confidence for glycan composition determination. In the case of more than one composition annotate to one peak, we selected the composition with the smallest De (Euclidean distance). Compared to previous annotation, additional glycan compositions were identified using the proposed method. This suggests that the isotopic abundance vector may play an important role in improving glycan identification. Similar results were obtained in the human liver and spleen samples (see the Supporting Information, Figure S3A and S4A). SVM is a non-probabilistic binary linear classifier that has been widely used in bioinformatics community. We utilize a soft linear SVM algorithm to train the classifier from our samples. Fig. 5B shows the classification results of human skin data, in which the red rectangle and blue triangle stand for the previously unidentified and identified compositions, respectively. Interestingly, a total of 13 previously unidentified peaks were identified by using the proposed strategy; whereas we were not able to identify 6 previously identified compositions. The circled points are the support vectors which maximize the vectors margin width, and at the same time minimize the error penalty that related to the distance between one support vector and the margin. The classification results of the other samples are shown in Figure S3B and S4B (Supporting Information). Fig. 5C illustrates the overlap of the linear classifiers of the three samples, in which line 1, 2, 3 and 4 are the classifiers from liver, skin, combined and spleen sample data, respectively. To further evaluate our algorithm, we used 10-fold cross-validation, where the data were randomly divided into 10 balanced sets. The algorithm then iterated from 1 to 10 and on ith step using ith set for testing while all the other sets will be used for training. This evaluation method offers a quantitative measure to test and train the data, i.e. average correct rates (Table 2). The SVM algorithm and 10-fold cross-validation are implemented in Matlab and the codes are shown in Figure S3B and S4B (Supporting Information), Computational 2. When all data sets were combined, a linear classifier (i.e. combined linear classifier, line 3: y = 0.00071x + 2.075) derived from the same SVM algorithm can be applicable to a wider range of samples. Furthermore, all the lines are close enough such that it is possible to determine a safe boundary for confidence evaluation, i.e. line 5: y = 0.00068x ± 1.86 in Fig. 5c. This safe boundary was defined by the lower bound of all the linear classifiers in order to ensure the correct identification and can be easily obtained from the straight line over two lowest points at m/z 1500 and 4500 of four linear classifiers. The ions have no positive identification when their ln(De ) values locate in the space above the combined linear classifier, i.e. line 3. The peak identification is of high confidence if the ln(De ) of the ions locate in the region below the safe boundary, i.e. the space below line 5. For those ions with the ln(De ) values locate in the region above the safe boundary at the same time below the combined linear classifier, i.e. the space defined by lines 3 and 5, we define their corresponding identification confidence as low. 3.2. Application The performance of this algorithm was verified using the reported data. We further examine the algorithm using the data
from C57BL/6 mouse kidney sample with both combined linear classifier and safe boundary (Fig. 6A). A total of 42 peaks were identified by the combined classifier, of which 28 identifications were with high confidence. Interestingly, the ion at m/z 2418, highlighted in circle, was identified as the composition of Hex5 dHex2 HexNAc4 , but it was not identified previously. Fig. 6B and C shows the cGIAV and dGIAV of the isotopes which give the strong proof that the peak is not noise data even if its relative intensity is only 2.5%, which means it is difficult to detect and easily missed. We emphasize this case in order to demonstrate that our automated annotation algorithm can give mass spectrum experimenter effective assistance to extract the weak signal in MALDI-TOF MS, both in efficiency and in accuracy. A portion of the annotation results for the mouse kidney sample was illustrated in Fig. 7. The detected and calculated m/z values, relative intensities, compositions, previous annotation results and ID number in CFG database of the possible N-linked glycan structures are summarized in Table 3. In comparison with the annotation in the spectrum from CFG profiling database, a total of 13 additional glycan compositions were identified, 6 with high confidence, i.e. the ln(De ) of the ions locate in the region below the safe boundary, and 7 with low confidence, i.e. the ln(De ) of the ions locate in the region between line y = 0.00068x ± 1.86 and y = 0.00071x ± 2.075. Meanwhile, two previously annotated peaks were missed by the new method. Further validation revealed a significant difference between the calculated and detected isotope profiles. For the data from C57BL/6 mice kidney, the new method identified 23 additional glycan compositions but missed 2 compositions compared with that annotated by Cartoonist algorithm [18]. Among the 23 additional annotated peaks, 20 of them are low abundant ions with relative intensities lower than 10% (Table 3). This confirms that our data preprocessing method and mGIA algorithm can considerably improve the sensitivity and make it possible to detect weak peaks that were missed in Cartoonist software [18].
4. Conclusion We have developed a mGIA algorithm, based on isotope distribution of glycans, to automatically identify glycan compositions from MALDI-TOF mass spectra. The mGIA algorithm compared the similarity of the calculated and detected isotope vectors of glycan molecule, which include the isotope masses, abundances and spacing. In consideration of the possible mislabeled and missing annotation reported in the CFG profiling database, a linear soft margin SVM classification method is used for isotope pattern recognition. We also developed a novel data preprocessing procedure to detect ion peaks in a giving MS spectrum, i.e. a slopebased centroiding algorithm. By testing the mGIA algorithm with sample data from different human tissues, we proposed a novel combined classifier and a safe boundary, which demonstrated a significant improvement in accuracy and sensitivity. Unlike the traditional algorithm, where only the monoisotopic masses were used for deriving the compositions from the detected m/z, the mGIA algorithm used the isotopic profiles of each glycan composition candidate as an additional criterion. However, the application of the proposed algorithm can be limited by the presence of such mixture, e.g. the molecular weights of two glycoforms are different by 1 ± 0.1, 2 ± 0.1, 3 ± 0.1, or 4 ± 0.1 Da. Further improvement is underway to integrate of a deconvolution algorithm so that the mGIA algorithm can quantify mixtures of m/z-overlapped glycans. Despite significant progress made on data preprocessing and peak detection in the mass spectra of proteins and glycans, the bioinformatics for glycomics is still in its infancy. The lack of publicly available MALDI-TOF MS glycan profiles in combination
G. Xu et al. / Analytica Chimica Acta 743 (2012) 80–89
with their corresponding glycan structures and sample sources for training and testing is still a challenge to the development of novel computational tools. For example, many structures and compositions can be found from a database, but their biological sources of the samples were not documented. It is well-known that sample information can have a great impact on identification confidence by reducing false positive annotation for the glycans, since glycan structures can vary among different species. Nevertheless, this proposed method is suitable for application to high-throughput screening of glycan molecules. It is expected that training and testing the mGIA with more samples from different sample sources can further enhance its reliability and accuracy. Acknowledgement We thank Dr. Yu Xue for valuable discussions during the preparation of this manuscript. We would like to thank Functional Glycomics Consortium for glycan profile data sharing. G.X. acknowledges the financial support from China Scholarship Council. Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.aca.2012.07.009. References [1] R. Raman, S. Raguram, G. Venkataraman, J.C. Paulson, R. Sasisekharan, Nat. Methods 2 (2005) 817–824. [2] A. Varki, R.D. Cummings, J.D. Esko, H.H. Freeze, P. Stanley, C.R. Bertozzi, G.W. Hart, M.E. Etzler, Essentials of Glycobiology, Cold Spring Harbor Laboratory Press, New York, 2008, pp. 469–565. [3] M.E. Taylor, K. Drickamer, Introduction to Glycobiology, Oxford University Press, New York, 2003, pp. 183–193. [4] K.F. Aoki-Kinoshita, PLoS Comput. Biol. 4 (2008) e1000075. [5] C.W.V.D. Lieth, A. Bohne-Lang, K.K. Lohmann, M. Frank, Brief Bioinform. 5 (2004) 164–178. [6] C.A. Cooper, E. Gasteiger, N.H. Packer, Proteomics 1 (2001) 340–349. [7] S.P. Gaucher, J. Morrow, J.A. Leary, Anal. Chem. 72 (2000) 2332–2336. [8] M. Ethier, J.A. Saba, M. Spearman, O. Krokhin, M. Butler, W. Ens, K.G. Standing, H. Perreault, Rapid Commun. Mass Spectrom. 17 (2003) 2713–2720.
89
[9] H. Tang, Y. Mechref, M.V. Novotny, Bioinformatics 21 (2005) i431–i439. [10] A.J. Lapadula, P.J. Hatcher, A.J. Hanneman, D.J. Ashline, H. Zhang, V.N. Reinhold, Anal. Chem. 77 (2005) 6271–6279. [11] D.N. Perkins, D.J. Pappin, D.M. Creasy, J.S. Cottrell, Electrophoresis 20 (1999) 3551–3567. [12] J.K. Eng, A.L. McCormack, J.R. Yates III, J. Am. Soc. Mass Spectrom 5 (1994) 976–989. [13] H.J. Joshi, M.J. Harrison, B.L. Schultz, C.A. Cooper, N.H. Packer, N.G. Karlsson, Proteomics 4 (2004) 1650–1664. [14] C.A. Cooper, H.J. Joshi, M.J. Harrison, M.R. Wilkins, N.H. Packer, Nucleic Acids Res. 31 (2003) 511–513. [15] A. Ceroni, K. Maass, H. Geyer, R. Geyer, A. Dell, S.M. Haslam, J. Proteome Res. 7 (2008) 1650–1659. [16] R. Raman, M. Venkataraman, S. Ramakrishnan, W. Lang, S. Raguram, R. Sasisekharan, Glycobiology 16 (2006) 82R–90R. [17] T. Lütteke, A. Bohne-Lang, A. Loss, T. Goetz, M. Frank, C.W.V.D. Lieth, Glycobiology 16 (2006) 71R–81R. [18] D. Goldberg, M. Sutton-Smith, J. Paulson, A. Dell, Proteomics 5 (2005) 865–875. [19] Y. Wang, M. Gu, Anal. Chem. 82 (2010) 7055–7062. [20] Y. Wang, United States Patent No. 6,983,213 (2006). [21] J. Zhang, D. Xu, W. Gao, G. Lin, S. He, Rapid Commun. Mass Spectrom. 23 (2009) 3448–3456. [22] C.L. Do Lago, C. Kascheres, J. Comput. Chem. 15 (1991) 149–155. [23] J. Zhang, W. Gao, J. Cai, S. He, R. Zeng, R. Chen, IEEE/ACM Trans. Comput. Biol. Bioinform. 2 (2005) 217–230. [24] M. Hilario, A. Kalousis, C. Pellegrini, M. Müller, Mass Spectrom. Rev. 25 (2006) 409–449. [25] P. Roy, C. Truntzer, D. Maucort-Boulch, T. Jouve, N. Molinari, Brief. Bioinform. 12 (2011) 176–186. [26] K.R. Coombes, S. Tsavachidis, J.S. Morris, K.A. Baggerly, M.C. Hung, H.M. Kuerer, Proteomics 5 (2005) 4107–4117. [27] P. Du, W.A. Kibbe, S.M. Lin, Bioinformatics 22 (2006) 2059–2065. [28] D.A. Barkauskas, D.M. Rocke, Anal. Chim. Acta 657 (2010) 191–197. [29] D.A. Barkauskas, H.J. An, S.R. Kronewitter, M.L. de Leoz, H.K. Chew, R.W. de Vere White, G.S. Leiserowitz, S. Miyamoto, C.B. Lebrilla, D.M. Rocke, Bioinformatics 25 (2009) 251–257. [30] D. Mantini, F. Petrucci, D. Pieragostino, P. Del Boccio, M. Di Nicola, C. Di Ilio, G. Federici, P. Sacchetta, S. Comani, A. Urbani, BMC Bioinformatics 8 (2007) 101–117. [31] Y. Yasui, M. Pepe, M.L. Thompson, B.L. Adam, G.L. Wright Jr., Y. Qu, J.D. Potter, M. Winget, M. Thornquist, Z. Feng, Biostatistics 4 (2003) 449–463. [32] Available: http://www.ee.ucl.ac.uk/∼mflanaga/java/. [33] M. Gentzel, T. Köcher, S. Ponnusamy, M. Wilm, Proteomics 3 (2003) 1597–1610. [34] Available: http://www.functionalglycomics.org/glycomics/molecule/jsp/ carbohydrate/carbMoleculeHome.jsp. [35] D. Valkenborg, I. Mertens, F. Lemière, E. Witters, T. Burzykowski, Mass Spectrom. Rev. 31 (2011) 96–109. [36] Available: http://www.functionalglycomics.org/glycomics/publicdata/ glycoprofiling-new.jsp.