Journal of Chromatography A, 1602 (2019) 273–283
Contents lists available at ScienceDirect
Journal of Chromatography A journal homepage: www.elsevier.com/locate/chroma
An improved peak clustering algorithm for comprehensive two-dimensional liquid chromatography data analysis Jucai Xu a,b , Lin Zheng a,b , Guowan Su a,b , Baoguo Sun c , Mouming Zhao a,b,c,∗ a b c
School of Food Science and Engineering, South China University of Technology, Guangzhou 510640, China Guangdong Food Green Processing and Nutrition Regulation Technologies Research Center, Guangzhou 510640, China Beijing Advanced Innovation Center for Food Nutrition and Human Health, Beijing Technology & Business University, Beijing 100048, China
a r t i c l e
i n f o
Article history: Received 20 February 2019 Received in revised form 22 May 2019 Accepted 23 May 2019 Available online 27 May 2019 Keywords: Two-dimensional liquid chromatography Peak detection Retention time criterion Overlap criterion Data analysis
a b s t r a c t In this work, an improved algorithm was developed for two-dimensional (2D) peak detection in complex two-dimensional liquid chromatography (LC×LC) data sets. In the first step, conventional onedimensional peak detection was performed. In the second step, retention time, bidirectional overlap and unimodality criteria were applied to decide which of the individual peaks should be merged. To improve the peak detection with LC×LC analysis using shifting second dimension (2 D) gradients, the variable thresholds, which permitted different thresholds for candidate peaks at different first dimension (1 D) retention times, were employed for examination of the 2 D retention time differences. Furthermore, the bidirectional overlap criterion performed at specified height was recommended to improve detection for tailing peaks. The developed algorithm was further tested on data sets from different LC×LC analyses of a complex peptide mixture, and then quantitatively evaluated by comparison between the results by the algorithm and mass analysis. Evidently improved performance with an accuracy rate over 60% was obtained by the algorithm, even for peak detection with LC×LC analysis under relatively low 1 D sampling frequency or shifting 2 D gradients. This would help to improve LC×LC quantitative analysis and performance assessment. © 2019 Elsevier B.V. All rights reserved.
1. Introduction In the last decades, comprehensive two-dimensional liquid chromatography (LC × LC) has undergone an explosive development with wide use in various analyses [1–3]. However, there are only a few reports having been published on two-dimensional (2D) data analysis algorithms [4]. In LC × LC separation, one compound from the first dimension (1 D) is usually cut into multiple fractions, and further presented as several individual one-dimensional (1D) peaks across the second dimension (2 D) chromatograms [5,6]. For data analysis, an algorithm is required to merge these individual 1D peaks to reconstruct 2D peaks (also called peak cluster) [7]. This plays a key role in both LC × LC quantitative analysis and performance assessment [8]. Quantitative analysis based on a combined 2D peak can usually obtain higher accuracy than that based on an individual 1D peak, since the individual 1D peak area may be seriously affected by peak shifting in the first dimension [6,9]. Additionally, two-dimensional peaks formed from hundreds or even
∗ Corresponding author at: No.381 Wushan Road, Guangzhou 510640, China. E-mail address:
[email protected] (M. Zhao). https://doi.org/10.1016/j.chroma.2019.05.046 0021-9673/© 2019 Elsevier B.V. All rights reserved.
thousands of individual 1D peaks are essential for evaluation of LC × LC performance in orthogonality and peak capacity [10,11]. At present, LC × LC data analysis is mainly performed using the watershed algorithm and two-step algorithm [4]. The watershed algorithm (originally used in image analysis) makes use of the inverse of a 2D image generated from a LC × LC separation [12]. In that case, 2D peaks (i.e., mountains) appear like negative peaks (i.e., basins), and then the algorithm works to find different basins with a minimum lower than a certain threshold (i.e., 2D peaks with a maximum higher than a certain threshold) [12]. The main drawback of the watershed algorithm is its intolerance to 2 D retention time variability, which may induce the formation of “false” saddle 2D peaks from the same compound [13]. 2 D retention time shift correction can help to improve the performance of the algorithm in situations of slight variations in 2 D retention times [14]. However, for complex situations with large 2 D retention time variations, the watershed algorithm should be applied with care due to the possible “false” saddle 2D peaks [15]. Besides, relatively low 1 D sampling frequencies, which can promote the appearance of “true” saddle 2D peaks, may further increase the difficulty of detection by the watershed algorithm.
274
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283
As an alternative to the watershed algorithm, conventional 1D peak detection is employed as the first step in the two-step algorithm, and then 1D peaks likely from the same compound are merged in the second step [7]. Therefore, the two-step algorithm can work better in some cases as described above [15]. For peak clustering in the second step, two different methods have been reported: Peters’ method [7] and Bayesian method [15]. Peters’ method typically uses overlap and unimodality criteria to decide which of the individual peaks should be merged [7]. The overlap criterion compares peak regions of candidate peaks and the last peak of an existing peak cluster, and only candidate peaks with overlap degree higher than a threshold are selected. The unimodality criterion examines the intensities of candidate peaks to select the one with lower intensity if a maximum has been detected. Different from Peters’ method, Bayesian method calculates probabilities of myriads of peak clusters based on the Bayesian inference and then the peak cluster of the highest value is chosen [15]. Due to the large number of possible configurations, a relatively long time is required for data analysis by the Bayesian method (around 1 min in reference 15) [15]. Furthermore, this method should be carefully used when highly crowded chromatograms are treated according to the author [15]. Moreover, except reference 15, no more information can be found on this method in the open literature. In comparison, Peters’ method has been widely used in 2D peak detection and further extended to assessment of LC × LC separation performance [4,16]. However, this method should also be used
with care. In Peters’ method [7], the overlap ratio is always calculated using region length of the last peak as denominator. This unidirectional overlap ratio may cause incorrect decision when merging peaks with relatively large difference in region length, which is especially relevant for LC × LC analysis with relatively low 1 D sampling frequency. This may also contribute to a remarkable effect of merging direction on the detection [7]. Peak tailing in the 2 D chromatography may also affect results when using baseline peak regions in the overlap criterion. Furthermore, the overlap criterion may be insufficient for complex LC × LC data analysis (e.g., highly crowded secondary chromatograms with many saddle peaks). Though the retention time difference was reported to be simply added in comprehensive LC × LC data analysis by van der Klift [17], no details can be found in the report or in other literature. Moreover, with improving software tools for separation method programming, comprehensive LC × LC analysis increasingly makes use of shifting 2 D gradients (Figure S-1 in the supporting information) to improve full utilization of the separation space, and then more remarkable peak shifting will be observed in the secondary chromatograms [18,19]. This makes it more difficult for data analysis by the Peters’ method, and efforts need to be made for improving such data analyses. In the present study, an improved peak clustering algorithm was developed based on the Peters’ method. In the new algorithm, retention time criterion was added to improve peak detection for complex data sets, especially those from LC × LC analyses using shifting 2 D gradients. Furthermore, the bidirectional overlap criterion performed at specified peak height was also added to improve the data analysis in highly crowded chromatograms. Effects of the retention time criterion and bidirectional overlap criterion performed at specified peak height on 2D peak detection were quantitatively evaluated using a complex peptide mixture analyzed by the LC × LC coupled with a mass spectrometer. Moreover, a comparison between the accuracy of the 2D peak detection by the new algorithm and Peters’ method, respectively, was performed to illustrate the utility of the new algorithm.
2. Theory 2.1. Overview
Fig. 1. Schematic presentation of the new algorithm. Nmax , the number of rows of the temporary two-dimensional array.
A schematic presentation of the new algorithm is shown in Fig. 1. Firstly, data folding is performed in the new algorithm. In this step, raw data obtained from a LC × LC analysis, which are usually a twocolumn matrix with one column representing time and another column representing signal [7], are exported to the new algorithm. With known modulation in the LC × LC experiment, signal of each independent 2 D analysis are extracted from the two-column matrix, and then transposed and listed in a two-dimensional matrix row by row according to the fraction elution order. Additionally, 1 D time data of the fractions are listed in a one-column matrix, and 2 D time data of the cycling 2 D analysis are listed in a one-row matrix. As for 1D peak detection, conventional methods [20,21] are employed, and focus of this work is put on the peak clustering process in the second step. As shown in Fig. 1, an “initial peak” is obtained after 1D peak detection, and then taken as the starting peak of a peak cluster (the initial peak is usually the first peak from the first fraction). Once a starting peak is given, the iterative peak clustering process starts. At the beginning of each iteration, a temporary two-dimensional array with fixed number of rows (Nmax ) is initialized, and then used to cache the starting peak and other peaks from the next Nmax -1 1 D fractions as candidate peaks. These peaks from each fraction are also listed in the array row by row according to the fraction elution order. Then, several criteria are employed to decide which of the
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283
275
individual peaks should be merged. It should be noted that each iteration of the data process will generate a peak cluster and the iterative data analysis will continue until all the individual peaks are merged to clusters. Detailed explanation on these steps and criteria will be given in the following sections.
from each other, and the secondary chromatograms may be more crowded with occurrence of many saddle peaks (Fig. 2B (e–g)). In this work, the bidirectional overlap criterion is developed and the calculation of Ov is improved as below:
2.2. One-dimensional peak detection
where smaller (a,b) represents the smaller value between a and b, and b is the peak region of the next peak. In Eq. (2), calculation of Ov considers peak regions of the two peaks. Therefore, the Ov values in case c and d of Fig. 2B becomes very similar, and both the two cases can be accepted though at high TOv . This increased the screening ability of the overlap criterion. Furthermore, for complicated case e and f, where both the two candidate peaks match the unidirectional overlap criterion and similar retention time differences are presented, the bidirectional overlap criterion can still easily determine which of the peak should be merged, through the evident difference between o1 /b1 and o2 /b2 . In addition, considering that peak tailing may have a negative effect on 2D peak detection (Fig. 2B (g)), peak regions at specified peak height (Hs , given by the user) are recommended (Fig. 2B (h)). Using peak regions at Hs can translate complicated cases to simple cases (Fig. 2B (g–h)) and then improve the accuracy of 2D peak detection. It should be noted that, if a row in the temporary array becomes empty (no candidate peak kept) after applying the criteria above, only rows before this row will be kept.
In this paper, 1D peak detection is performed employing the conventional method as described in Ref [20]. Savitzky-Golay algorithm is applied for noise suppression [20,22], and the firstand second-order derivatives are used for calculation of the peak starting-time, retention time and end-time [23]. Baseline peak regions in each secondary chromatogram are then defined as regions between the peak start and peak end [7]. 2.3. Peak clustering process After 1D peak detection, all peaks are submitted to the peak clustering algorithm for further processing. In this step, several criteria are applied to the candidate peaks, and only those that match these criteria will be kept and then merged as a 2D peak. The number of rows of the temporary array (Nmax ) is a parameter given by the user. The effect of this parameter on 2D peak detection will be discussed in the following section. The criteria, including the retention time criterion, bidirectional overlap criterion, closest criterion and unimodality criteria, are explained in detail in the following subsections. 2.3.1. The retention time criterion Peaks from the same compound may shift in consecutive 2 D chromatograms [14], especially in LC × LC analysis with shifting 2 D gradients [18]. Therefore, the retention time criterion is used to examine the degree of the 2 D retention time difference (Pd ) between two candidate peaks, which is given by: Pd =2 tRd /2 tR 2t Rd
(1) 2D
is the retention time difference between two candiwhere date peaks and 2 tR is the 2 D retention time of the previous candidate peak. For evaluation of the candidate peaks, a threshold value range (TPdl (the lower bound)-TPdu (the upper bound)) is given by the user. In 2D peak detection with LC × LC analysis using normal 2 D gradient elution, constant TPdl and TPdu are applied for candidate peaks at different 1 D retention times, and only the TPdu is given by the user for convenience since TPdu = - TPdl . Considering peaks may shift to only the left or right while using shifting 2 D gradients, the absolute value of TPdl can be not equal to that of TPdu . In that case, variable threshold values, which permit different TPdl and TPdu values for candidate peaks at different 1 D retention times, are employed as shown in Fig. 2A. Candidate peaks with Pd lower than TPdl or higher than TPdu are all discarded. 2.3.2. The bidirectional overlap criterion Overlap criterion is used to evaluate the overlap degree between two candidate peaks from adjacent 1 D fractions (Ov ). In Peters’ method [7], Ov is calculated by Ov = o/a with o representing the overlapped region between two peaks and a being the peak region of the previous peak (Fig. 2B (a)), and a threshold value (TOv ) is given by the user. It works well in most situations as shown in Fig. 2B (a–b). As for case c and d in Fig. 2B, although the values of Ov differ a lot (1.0 for case c and much less than 1.0 in case d), both the two cases can be accepted at a low TOv . However, this asymmetrical calculation of Ov (or unidirectional overlap criterion) may cause some failures when clustering peaks for LC × LC analysis with relatively low 1 D sampling frequency. In that case, peak regions between two peaks from adjacent 1 D fractions may differ more
Ov =o/smaller(a,b)
(2)
2.3.3. The closest criterion After evaluation of candidate peaks by the criteria above, there still may exist more than one candidate peak at the same 1 D retention time. This may bring difficulties for further application of the unimodality criterion. In this case, the candidate peak with Pd value closest to the average value of TPdl and TPdu is selected. 2.3.4. The unimodality criterion When intensities of the candidate peaks are plotted at the maximum against the 1 D retention time (1 tR ), the 1 D peak-maxima profile should be unimodal if they belong to the same compound [7] (Fig. 3A). In other words, the 1 D peak-maxima profile should only exhibit one maximum. This property is used for further evaluation of the candidate peaks (the unimodality criterion). The unimodality criterion is performed as follows: in the first step, a peak-maxima profile is obtained as shown in Fig. 3B (a). In the second step, the first-order derivative of the peak-maxima profile is calculated (Fig. 3B (b)) and conventional 1D peak detection [19,20] is performed to find the local minimum after the first point (point 5, as indicated by the arrow in Fig. 3B (a)). In the third step, the nearest two points (4 and 6) are interpolated using the spline interpolation method [24] (the number of interpolated points, Ni , is given by the user) to increase the local resolution of the peak-maxima profile (Fig. 3B (c)). In the fourth step, the process in the second step is repeated to find the new local minimum between point 4 and 6, and then candidate peaks are split at the new local minimum. Candidate peaks before the new local minimal are then merged to form a 2D peak. Interpolation in the whole peak-maxima profile as described in the Peters’ method [7] is not recommended, since incorrect splitting may occur in that case [24]. As indicated by the arrows in Fig. 3B (d), more than one local maximum may be obtained while interpolating signal between points of similar values. In the new method, it is essential to ensure that only one candidate peak exists at the same 1 D retention time before applying the unimodality criterion, since the unimodality criterion works only to find the ending-peak among a series of candidate peaks across different 1 D retention times. This is a little different from the Peters’ method, in which the unimodality criterion works mainly to select the one of lower intensity from candidate peaks at the same 1 D
276
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283
Fig. 2. Variable threshold values (the lower bound, TPdl , and the upper bound, TPdu ) for degrees of the 2 D retention time differences (Pd ) for 2D peak detection with LC × LC analysis employing shifting 2 D gradients in the supporting information Figure S-1 (A), and schematic representation of peak regions of candidate peaks from adjacent 1 D fractions in different cases (B). The rectangles and dots in part B represent peak regions and maxima, respectively. Parameters a, b, b1 and b2 represent peak regions of the candidate peaks, and o, o1 and o2 represent overlapped regions between candidate peaks in the figure, respectively.
Fig. 3. Schematic representation of a unimodal peak-maxima profile of a 2D peak (A) and measurement of the end point (peak) of a 2D peak (B). For part A, circles joined with a line indicate the peak-maxima profile. For part B, large circles with a dot represent the peak maxima, whereas interpolated points are represented using small circles.
retention time after a maximum has been detected. The effect of this difference between the two methods on 2D peaks detection is relatively small, and this will not be discussed further. 3. Experimental 3.1. Chemicals, samples, and instrumentation A complex peptide mixture (containing 187 peptides) was prepared in 90% water /10% acetonitrile (v/v) with 0.1% formic acid (FA) using variety of commercial peptides (purity 95%, GL
Biochem, Shanghai, China, for more information, please see supporting information Table S-1). Ultra-pure water was obtained from a Milli-Q water purification system (Millipore, Milford, MA, USA). Acetonitrile and formic acid were purchased from Sigma-Aldrich (Steinheim, Germany). A LC × LC system was designed and constructed in-house employing an automated stop-flow interface following our previous research (see supporting information Figure S-2) [9]. The system was equipped with a 1 D pump, an autosampler, a 1 D column oven, an automated ten-port valve (Ultimate 3000 SD, Thermo Fisher Scientific Inc., Shanghai, China), a 2 D pump, a 2 D column oven
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283
and a UV detector (Acquity UPLC I-class, Waters, Shanghai, China). Chromeleon software 7.2 (Thermo Fisher Scientific Inc., Shanghai, China) was used for system control. The column set included a 4.6 × 150 mm Protein BEH SEC 1 D column (1.8 m, 120 Å, Waters, Shanghai, China) and a 2.1 × 50 mm HSS T3 2 D column (1.8 m, Waters, Shanghai, China). Isocratic elution was used for the 1 D analysis with a mobile phase consisting of water and acetonitrile (90:10, v/v) with 0.1% FA. Unless specified otherwise, the 2 D analysis was performed at a flow rate of 0.80 mL/min with mobile phase consisting of water (A) and acetonitrile (B) with 0.1% FA under gradient elution: 0-0.10 min (95% A isocratic),0.10-0.85 min (95%-60% A), 0.85-0.90 min (60%-95% A),0.90–1.40 min (95% A isocratic). In each independent 2 D run, a slice of 0.20 min was collected at a flow rate of 0.10 mL/min from the 1 D. At the end of the 0.20 min, the tenport valve was switched to the stopping position to transfer the slice to the 2 D. Simultaneously, the 1 D flow rate was set zero and the next 2 D gradient started. The 1 D injection volume for the complex peptide mixture was 1.4 L, and the 2 D UV absorbance at 214 nm was monitored at a sampling rate of 80 HZ. The LC × LC system was then coupled to an Impact II ultra-high resolution qq-time-of-flight mass spectrometer (QqTOF, Bruker Daltonics GmbH,Bremen, Germany) for further mass detection of the eluting peaks at a split ratio of 1:1. The mass spectrometer was equipped with an electrospray ionization source (ESI), and mass detection was performed in positive ion mode with an acquisition range over 20–1200 m/z at a spectra rate of 15 HZ. OtofControl 4.0 (Bruker Daltonics GmbH,Bremen, Germany) was used for controlling the QqTOF. 3.2. Data analysis The 2D peak detection was performed in home-built routines written in MATLAB 2015a (Mathworks, Natick, MA) following the strategies described above and those in the Peters’ method [7], respectively. Raw data obtained from the LC × LC system were exported as *. txt files and then processed with the developed programs. The number of points used for the Savitzky-Golay filter was 20 for noise suppression. Unless specified otherwise, the unimodality criterion option Ni was set 1 for both of the two methods. To evaluate the accuracy of the results by the two methods, mass data of the eluting peaks were analyzed using DataAnalysis 4.3 (Bruker Daltonics GmbH, Bremen, Germany), and eluting peaks from the same compound were merged to form a “true” 2D peak. It should be noted that if a candidate peak (e.g., peak 5 in Fig. 3B) was determined to be a mixture peak containing compounds from the left and right cluster, merging the candidate peak to the left or right cluster were both accepted in “true” 2D peaks. Accuracy rate (Ar ) was given by Ar = Nc / Nt × 100%, where Nc is the number of correct 2D peaks and Nt is the total number of the “true” 2D peaks. Considering the existence of some overlapped 2D peaks that consisted of several “true” 2D peaks (none of the algorithms could provide correct answer for these “true” 2D peaks in most cases), detection of these overlapped 2D peaks were considered as unavoidable error and was evaluated using the unavoidable error rate (Ur ): Ur = No / Nt × 100%, where No represents the number of overlapped 2D peaks. In addition, the erroneous 2D peaks were evaluated using error rate (Er ), which was given by Er = Ne / Nd × 100%, where Ne is the number of erroneous 2D peaks and Nd is the total number of 2D peaks detected by algorithms. Visio 2013 (Microsoft, Washington, USA) and Origin 9.0 (OriginLab, Northampton, MA01060, USA) were used for data visualization. 4. Results and discussion To illustrate how the new algorithm works in 2D peak detection and demonstrate the utility of the new algorithm, data
277
Fig. 4. The 2D peak detection for normal LC × LC analysis (Ar , the accuracy rate; Er , the error rate; Ur , the unavoidable error rate) at parameters: Nmax = 12, TPdu = 0.030, TOv = 0.60 and Hs = 0. Two dimensional peaks are presented as circles joined with a line in different color (black for correct 2D peaks, blue for overlapped 2D peaks, and red for erroneous 2D peaks). Nmax , the number of rows of the temporary twodimensional array; TPdu , the upper bound of the threshold for 2 D retention time difference; TOv , the threshold for overlap; Hs , the specified height for peak regions (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.).
sets obtained from normal LC × LC analysis (as described in the experimental section), LC × LC analysis under lower 1 D sampling frequency and LC × LC analysis employing shifting 2 D gradients were studied, respectively. 4.1. Two-dimensional peak detection with normal LC × LC analysis For normal LC × LC analysis of the highly complex sample, a total number of 144 “true” 2D peaks were detected as shown in the supporting information Figure S-3. While baseline peak regions were used for the overlap criterion (the specified peak height, Hs = 0), the 2D peak detection was performed by the new method at optimal parameters (Nmax = 12, TPdu = 0.030 and TOv = 0.60) as shown in Fig. 4. In the figure, contour profiles are presented with baseline subtracted, and 2D peaks are represented using circles joined with a line in different color (black for correct 2D peaks, blue for overlapped 2D peaks, and red for erroneous 2D peaks). For convenience, this representation was used in all of the following sections. From Fig. 4, a number of red erroneous 2D peaks were evidently observed in the left part of the figure. This was mainly due to the serious shifting of 2 D peaks in this part (see supporting information Figure S-3). Nevertheless, a relatively high accuracy rate of 65.97% was still obtained by the new method for this complex data set. Considering the large value used for Nmax , the effects of Nmax on 2D peak detection with the normal LC × LC analysis are presented in Fig. 5A. From the figure, Ar firstly increased and then remained unchanged with the increasing Nmax . This was mainly because the parameter Nmax limited the maximal number of individual peaks contained in each 2D peak. In most cases, Nmax was set no less than the number of fractions available from the broadest 1 D peak, and the use of Nmax will be explained in the following section. Fig. 5B shows the effects of the retention time criterion on 2D peak detection with the normal LC × LC analysis. From the figure, the highest Ar (65.97%) was obtained at TPdu = 0.030 or 0.035, and lower or larger TPdu causes lower Ar . Especially when using extremely low TPdu (0.005), Ar became only 53.47% and Er reached 67.20% due to the 2 D peak shifting as shown in the left of Fig. 4. Additionally, when TPdu was set relatively large to 0.060, the accuracy rate decreased to 63.19%. In this case, the candidate peaks in region 1 of Fig. 4 were newly merged as enlarged in Fig. 6A. Since the relatively large TPdu failed in evaluation of the candidate peak
278
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283
of Fig. 4 as depicted in Fig. 6B. From the figure, peak 1 and peak 2 belonging to the same compound (peptide AHW) were split into two different 2D peaks by mistake (as indicated by the red arrow), while they were correctly merged in Fig. 4 by the new method. The reason for this error was that the Ov value of peak 1 to peak 2 was evidently lower than 0.45 while individual peaks were merged in direction of decreasing 1 tR (the detection in another merging direction of increasing 1 tR could improve this error, but the erroneous 2D peaks were still left as they presented lower retention time difference according to Peters et al. [7]). A relatively low TOv (0.20) could help to fix the error in Fig. 6B, but lower Ar (57.64%) was also obtained due to the decreased evaluation ability by the unidirectional overlap criterion. Fig. 6C shows the newly merged candidate peaks from region 3 of Fig. 4 by the Peters’ method at TOv = 0.20. From the figure, the Ov value of individual peak 5 (peptide WLA) to individual peak 4 (peptide LFP) was evidently larger than 0.20, and hence peak 5 was merged with peaks 1–4 (peptide LFP) by mistake. In comparison, individual peaks in Fig. 6 B and C were all merged correctly in the corresponding regions of Fig. 4 by the new method. Tailing of the individual peak 4 in Fig. 6C contributed to the error by Peters’ method to a certain degree and this error might also occur if a lower TOv was used by the new method. Therefore, peak regions at specified height were recommended in the new method. As shown in Fig. 7A (the newly merged candidate peaks from region 3 of Fig. 4 at parameters: Nmax = 12, TPdu = 0.030, TOv = 0.10 and Hs = 0.5), the peak 5 and peak 4 could be easily distinguished when Ov was calculated at Hs = 0.5. Similar improvements could also be observed for region 4 of Fig. 4. As shown in the enlarged zone (Fig. 7B) of region 4 from Fig. 4, individual peak 6 (peptide MYPLPR) was merged with individual peak 10 (peptide YPGIADRM) together by mistake due to the highly overlapped peak regions between the two peaks. While peak regions at half height were compared, the Ov value between the two peaks became nearly 0, and the error was then fixed in Fig. 7C. Moreover, the highest accuracy rate (68.06%) was also obtained for 2D peak detection with the normal LC × LC analysis while using the peak regions at half height as listed in Table 1 (see also Figure S-5 in the supporting information). 4.2. Two-dimensional peak detection with LC × LC analysis under lower 1D sampling frequency
Fig. 5. The effects of Nmax (A), TPdu (B) and TOv (C) on the 2D peak detection with normal LC × LC analysis (Ar , the accuracy rate; Er , the error rate). For part A, TPdu = 0.030, and TOv = 0.60; for part B, Nmax = 12 and TOv = 0.60; for part C by the new method, Nmax = 12 and TPdu = 0.030. Nmax , the number of rows of the temporary two-dimensional array; TPdu , the upper bound of the threshold for 2 D retention time difference; TOv , the threshold for overlap.
7, the peaks 1–10 (peak 1–4, peptide RW; peak 5–7, peptide HD; peak 8–10, peptide GL) were merged by mistake as indicated by the arrows in the figure. In comparison, they were originally merged correctly in Fig. 4 at optimal TPdu . The effects of TOv on 2D peak detection by the new method and Peters’ method are shown in Fig. 5C. From the figure, the highest Ar (65.97%) by the new method at TOv = 0.60 was higher than that (61.11%) by the Peters’ method. It should be noted that the highest Ar by the Peters’ method was obtained at TOv = 0 or 0.05. In this case, overlap criterion did not work well, and hence the high Ar was attributed to the closest maximum principle as described in the method [7]. To illustrate the utility of the bidirectional overlap criterion over the unidirectional overlap criterion, detection by Peters’ method was performed at the next highest Ar point (TOv = 0.45) (see Figure S-4 in the supporting information). The attention was then focused on the newly merged candidate peaks from region 2
Lower 1 D sampling frequency usually means lower 1 D resolution and higher 2 D complexity in the data set [25]. Due to the increased sampling interval (0.40 min), the number of individual peaks contained in each 2D peak greatly decreased (see “true” 2D peaks presented in the supporting information Figure S-6). This made it more difficult for 2D peak detection. Fig. 8 shows the detection by the new method at optimal parameters as listed in Table 1 (Nmax = 8, TPdu = 0.0125, TOv = 0.10, Hs = 0.5 and Ni = 4). The Ar and Er by the detection were 60.78% and 34.01%, respectively, suggesting a relatively good performance of the method in the highly complex data set. As mentioned above, the parameter Nmax limited the maximal number of individual peaks contained in each peak cluster. Practically, Nmax was used to improve the detection for overlapped 2D peaks in some specific cases. As shown in Fig. 9 (the newly formed 2D peaks for region 1 of Fig. 8 at Nmax = 4), the originally overlapped 2D peak (including individual peak a–f) in Fig. 8 were merged as two 2D peaks correctly (a–c, peptide AVPYPQR; d, peptides AVPYPQR and LGYPR; e, peptide LGYPR; f, peptides LGYPR and EEHPTLL). Similar improved detection was also observed in the upper part of the figure. Two erroneous 2D peaks were formed in Fig. 8 due to the incorrect split as indicated by the red arrow in Fig. 9 (peak m, peptide YPP; i, peptide MGHF), but this error was then fixed by using Nmax = 4 as indicated by the black arrow in Fig. 9. This demon-
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283
279
Fig. 6. The newly formed 2D peaks for region 1 of Fig. 4 by the new method at TPdu = 0.060 (A), for region 2 of Fig. 4 by the Peters’ method at TOv = 0.45 (B) and for region 3 of Fig. 4 by the Peters’ method at TOv = 0.20 (C). The pink horizontal lines indicate the baseline peak regions. TPdu , the upper bound of the threshold for 2 D retention time difference; TOv , the threshold for overlap.
Fig. 7. The newly formed 2D peaks by the new method for region 3 of Fig. 4 at TOv = 0.10 and Hs = 0.5 (A), and the 2D peak detection by the new method for region 4 of Fig. 4 at original parameters: TOv = 0.60 and Hs = 0 (B) and new parameters: TOv = 0.10 and Hs = 0.5 (C). The pink horizontal lines represent the baseline peak regions. TOv , the threshold for overlap; Hs , the specified height for peak regions.
Table 1 Summary of the two-dimensional peak detection at optimal parameters for data sets obtained from different LC × LC analyses. normal LC × LC analysis
Nmax TPdu TOv Hs Ni Ar , % Er , %
LC × LC analysis under lower 1 D sampling frequency
LC × LC analysis with shifting 2 D gradients
New method
Peters’ method
New method
Peters’ method
New method
Peters’ method
12 0.030 0.10 0.5 1 68.06 25.34
– – 0.05 – 1 61.11 23.66
8 0.0125 0.10 0.5 4 60.78 34.01
– – 0.40 – 4 51.63 35.43
15 Variable 0.05 0.1 1 74.00 15.83
– – 0.35 – 1 67.33 26.90
Note: Variable TPdu values are presented in Fig. 2A with TPdl . Nmax , the number of rows of the temporary two-dimensional array; TPdu and TPdl , the upper and lower bounds of the thresholds for the second dimension (2 D) retention time differences, respectively; TOv , the threshold for overlap; Hs , the specified height for peak regions; Ni , the number of interpolated points; Ar , the accuracy rate; Er , the error rate; ‘-’, unavailable for the method.
280
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283
Fig. 8. The 2D peak detection performed by the new method for data set obtained from LC × LC analysis at a 1 D sampling interval of 0.40 min at optimal parameters as listed in Table 1. Two dimensional peaks are presented as circles joined with a line in different color (black for correct 2D peaks, blue for overlapped 2D peaks, and red for erroneous 2D peaks) (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.).
Fig. 9. The 3-dimensional plot with waterfall profiles for newly formed 2D peaks for region 1 of Fig. 8 by the new method at Nmax = 4. Black circles joined with a line indicate correct 2D peaks, and pink circles represent the interpolated points. Correct and incorrect splits are indicated by the arrows in black and red color, respectively. Nmax , the number of rows of the temporary two-dimensional array (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.).
strated a positive effect of an optimal Nmax on 2D peak detection in some specific cases, but Nmax should be set carefully. For 2D peak detection at Nmax = 4 (see also supporting information Figure S-7), relatively lower Ar (58.17%) and much higher Er (43.56%) were obtained. Therefore, it would be better to use a relatively large Nmax in most cases. Effects of the retention time criterion on 2D peak detection are shown in Fig. 10, and a similar conclusion can be reached as described in the previous section. An optimal TPdu played an important role in 2D peak detection. One should note that the highest Ar (61.44%) obtained at TPdu = 0.0075 was not suitable for such an analysis due to the relatively high Er (42.60%). In comparison, better performance could be obtained at TPdu = 0.0125 with the next highest Ar (60.13%) and relatively low Er (34.69%). In addition, to further demonstrate the utility of the new method over Peters’ method, the effects of merging direction on 2D peak detection at different TOv values are compared in Fig. 11. In Fig. 11A, individual peaks were merged along the direction of increasing 1 tR and decreasing 1 tR (the initial peak was set as the first peak from the last fraction) separately. Furthermore, a mixture of the results with both two directions considered was also obtained according
Fig. 10. The effects of TPdu (the upper bound of the threshold for 2 D retention time difference) on the accuracy rate (Ar , primary y-axis) and the error rate (Er , secondary y-axis) for 2D peak detection with LC × LC analysis at a 1 D sampling interval of 0.40 min. For other parameters used for the detection, please refer to Table 1.
to the method of Peters et al. [7]. From the figure, similar Ar values were obtained among the 2D peaks detected in different merging directions or with both directions considered at the same TOv , suggesting a relatively small effect of merging direction on 2D peak detection by the new method. In comparison, an evident influence of merging direction on 2D peak detection was observed in Fig. 11B by the Peters’ method [7]. Furthermore, an evidently increased Ar was obtained in 2D peak detection with both directions considered by Peters method at TOv = 0.30-0.50. This fully explains the necessity for 2D peak detection with both directions considered in the Peters’ method [7]. Nevertheless, the highest Ar (51.63%) obtained by the Peters’ method at TOv = 0.40 with both directions considered (see also supporting information Figure S-8) was much lower than that by the new method (60.78%) in merging direction of increasing 1 t (Fig. 8). Different from the Peters’ method, the 2D peak detecR tion in another direction or with both directions considered was not necessary for the new method. In most cases, merging individual peaks in one direction might be enough for even complex data analysis. Practically, the relatively short time for 2D peak detection in one direction (less than 1 s) could also facilitate the rapid optimization of parameters, while a relatively long time (around 40 s) was needed for 2D peak detection with both directions considered (further code optimization might help to reduce the time cost). 4.3. Two-dimensional peak detection with LC × LC analysis employing shifting 2 D gradients To further illustrate the utility of the new method in 2D peak detection, the developed algorithm was further tested on the data set obtained from LC × LC analysis employing shifting 2 D gradients as presented in the supporting information Figure S-1. The “true” 2D peaks for the date set are shown in the supporting information Figure S-9. Considering that candidate peaks might shift only to the left or right, variable TPdl and TPdu were used for 2D peak detection of candidate peaks at different 1 D retention times as presented in Fig. 2A. The detection using variable TPdl and TPdu values is presented in Fig. 12 (Nmax = 15, TOv = 0.05 and Hs = 0.1), and relatively good performance of high Ar (74.00%) and low Er (15.83%) was obtained by the algorithm as listed in Table 1. Benefiting from the use of variable TPdl and TPdu , most of the individual peaks shifting to only the left (as shown in the lower-right corner of the figure) or right (as shown in the middle part of the figure) were merged correctly. In comparison, much higher Er (24.14%) was obtained when a constant TPdu value (0.03) was used for the 2D peak detection, suggesting an evident increase in the erroneous 2D peaks (see Figure S-10 in the supporting information). This fully demonstrates
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283
281
Fig. 11. The effects of merging direction on 2D peak detection at different TOv (the threshold for overlap) values with LC × LC analysis at a 1 D sampling interval of 0.40 min by the new method (A) and the Peters’ method (B). For other parameters used for the detection, please refer to Table 1.
chosen as the final result, since they exhibited lower 2 D retention time differences according to Peters et al. [7]. 5. Conclusions
Fig. 12. The 2D peak detection performed for data set obtained from LC × LC analysis employing shifting 2 D gradients (see Figure S-1 in the supporting information) by the new method at optimal parameters as listed in Table 1. Variable TPdl and TPdu values used for the detection are presented in Fig. 2A. Two dimensional peaks are presented as circles joined with a line in different color (black for correct 2D peaks, blue for overlapped 2D peaks, and red for erroneous 2D peaks). TPdl and TPdu , the lower and upper bound of the thresholds for 2 D retention time differences, respectively (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.).
the superiority of variable TPdl and TPdu in 2D peak detection with LC × LC analysis employing shifting 2 D gradients. The data set was also analyzed using Peters’ method at optimal parameters (TOv = 0.35) as listed in Table 1 (see also Figure S-11 in the supporting information). A comparison was then performed between the results by the two methods. Compared with the detection by the new method, much lower Ar (67.33%) and higher Er (26.90%) were obtained by the Peters’ method. For more details, the region 1 of Fig. 12 is enlarged in Fig. 13A, and the newly formed 2D peaks by the Peters’ method for the region are presented in Fig. 13B. In Fig. 13A, individual peaks (peak 1–9, peptide VAPEEHPTLL; peak 10–17, peptide ADLAKVPL) were all merged correctly by the new method employing variable TPdl and TPdu . On the contrary, as both the candidate peaks 2 and 10 matched the overlap and unimodality criteria to peak 1 while merging candidate peaks along the direction of increasing 1 tR , the candidate peak 10 closest to peak 1 was selected by mistake in Fig. 13B by the Peters’ method. Practically, these individual peaks were correctly merged along the direction of decreasing 1 tR . However, the erroneous 2D peaks in Fig. 13B were
An improved peak clustering algorithm with retention time and bidirectional overlap criteria was developed for 2D peak detection in this work. Through combination of the retention time and bidirectional overlap criteria, the 2D peak detection for highly complex LC × LC data sets by the new algorithm was improved. Furthermore, the accuracy rate obtained by the new algorithm was at least 6% higher than that obtained by the Peters’ method. This could be attributed to the several advantages of the new algorithm. Firstly, the retention time criterion allowed the use of variable thresholds for 2 D retention time differences at different 1 D retention times. This could help to improve the 2D peak detection with LC × LC analysis using shifting 2 D gradients. Secondly, the bidirectional overlap criterion took peak regions of both the two candidate peaks into consideration while calculating Ov between them. This could reduce the effect of merging direction on 2D peak detection, and improve the peak detection for candidate peaks of evidently different peak regions. Moreover, the bidirectional overlap criterion could be applied with peak regions at specified peak height. This would have a positive effect on 2D peak detection for tailing 2 D peaks. With more accurate 2D peaks detected, LC × LC quantitative analysis and performance assessment would also be improved. It should be noted that the error rate was still very poor for the new algorithm though with improved 2D peak detection. Practically, the error rate of other previously published algorithms [7,13,15] was also very poor. Two-dimensional peak detection for highly complex data sets was a serious challenge for all of these methods. To further reduce the error rate, future work would be directed towards the enhancement of such an algorithm in multichannel data analysis. This would also help to improve the detection for overlapped 2D peaks. In this study, attempts were also made to improve the detection for overlapped 2D peaks, through limitation in the maximal number of individual peaks contained in each 2D peak (Nmax ). Though this worked only for specified 2D peaks and more errors might by observed in 2D peak detection due to the fixed Nmax for all 2D peaks, it provided a potential way for further improvements by automatically giving Nmax for different 2D peaks. This would be also included in our future work. One should note that, although
282
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283
Fig. 13. Enlarged zone of region 1 from Fig. 12 by the new method (A) and the newly formed 2D peaks by the Peters’ method for the region at TOv = 0.35 (B). TOv , the threshold for overlap.
the algorithm was only tested on LC × LC data, its application to two-dimensional gas chromatography data should at most need some minor modification. Acknowledgments This work was supported by the National Key Research and Development Program of China (No. 2017YFD0400201), the National Postdoctoral Program for Innovative Talents of China (No. BX201700081), the Guangzhou Science and Technology Plan Project (No. 201604020122) and the China Postdoctoral Science Foundation (No. 2017M622702). Appendix A. Supplementary data Supplementary material related to this article can be found, in the online version, at doi:https://doi.org/10.1016/j.chroma.2019. 05.046. References [1] B.W.J. Pirok, D.R. Stoll, P.J. Schoenmakers, Recent developments in two-dimensional liquid chromatography: fundamental improvements for practical applications, Anal. Chem. 91 (2018) 240–263. [2] Q. Wu, H. Yuan, L. Zhang, Y. Zhang, Recent advances on multidimensional liquid chromatography–mass spectrometry for proteomics: from qualitative to quantitative analysis-a review, Anal. Chim. Acta 731 (2012) 1–10. [3] P. Schoenmakers, P. Marriott, J. Beens, Nomenclature and conventions in comprehensive multidimensional chromatography, LCGC Europe 25 (2003) 335–339. [4] L.L.P.V. Stee, U.A.T. Brinkman, Peak detection methods for GC × GC: an overview, TrAC-Trend. Anal. Chem. 83 (2016) 1–13. [5] S. Lee, H. Choi, T. Chang, B. Staal, Two-dimensional liquid chromatography analysis of polystyrene/polybutadiene block copolymers, Anal. Chem. 90 (2018) 6259–6266. [6] P. Donato, F. Rigano, F. Cacciola, M. Schure, S. Farnetti, M. Russo, P. Dugo, L. Mondello, Comprehensive two-dimensional liquid chromatography-tandem mass spectrometry for the simultaneous determination of wine polyphenols and target contaminants, J. Chromatogr. A 1458 (2016) 54–62. [7] S. Peters, G. Vivó-Truyols, P.J. Marriott, P.J. Schoenmakers, Development of an algorithm for peak detection in comprehensive two-dimensional chromatography, J. Chromatogr. A 1156 (2007) 14–24.
[8] J.T. Matos, R.M. Duarte, A.C. Duarte, Trends in data processing of comprehensive two-dimensional chromatography: state of the art, J. Chromatogr. B 910 (2012) 31–45. [9] J. Xu, D. Sun-Waterhouse, C. Qiu, M. Zhao, B. Sun, L. Lin, G. Su, Additional band broadening of peptides in the first size-exclusion chromatographic dimension of an automated stop-flow two-dimensional high performance liquid chromatography, J. Chromatogr. A 1521 (2017) 80–89. [10] M. Gilar, J. Fridrich, M.R. Schure, A. Jaworski, Comparison of orthogonality estimation methods for the two-dimensional separations of peptides, Anal. Chem. 84 (2012) 8722–8732. [11] M. Gilar, P. Olivova, A.E. Daly, J.C. Gebler, Orthogonality of separation in two-dimensional liquid chromatography, Anal. Chem. 77 (2005) 6426–6434. [12] S.E. Reichenbach, M. Ni, V. Kottapalli, A. Visvanathan, Information technologies for comprehensive two-dimensional gas chromatography, Chemometr. Intell. Lab. 71 (2004) 107–120. [13] G. Vivó-Truyols, H. Janssen, Probability of failure of the watershed algorithm for peak detection in comprehensive two-dimensional chromatography, J. Chromatogr. A 1217 (2010) 1375–1385. [14] H. Parastar, M. Jalali-Heravi, R. Tauler, Comprehensive two-dimensional gas chromatography (GC×GC) retention time shift correction and modeling using bilinear peak alignment, correlation optimized shifting and multivariate curve resolution, Chemometr. Intell. Lab. 117 (2012) 80–91. [15] G. Vivótruyols, Bayesian approach for peak detection in two-dimensional chromatography, Anal. Chem. 84 (2012) 2622–2630. [16] P.G. Stevenson, M. Mnatsakanyan, G. Guiochon, R.A. Shalliker, Peak picking and the assessment of separation performance in two-dimensional high performance liquid chromatography, Analyst 135 (2010) 1541–1550. [17] V.D.K. Ej, G. Vivótruyols, F.W. Claassen, F.L. van Holthoon, T.A. van Beek, Comprehensive two-dimensional liquid chromatography with ultraviolet, evaporative light scattering and mass spectrometric detection of triacylglycerols in corn oil, J. Chromatogr. A 1178 (2008) 43–55. [18] B.W.J. Pirok, A.F.G. Gargano, P.J. Schoenmakers, Optimizing separations in on-line comprehensive two-dimensional liquid chromatography, J. Sep. Sci. 41 (2018) 68–98. [19] P. Jandera, T. Hájek, P. Cesla, Comparison of various second-dimension gradient types in comprehensive two-dimensional liquid chromatography, J. Sep. Sci. 33 (2010) 1382–1397. [20] G. Vivó-Truyols, J.R. Torres-Lapasió, A.M. van Nederkassel, H.Y. Vander, D.L. Massart, Automatic program for peak detection and deconvolution of multi-overlapped chromatographic signals part I: peak detection, J. Chromatogr. A 1096 (2005) 133–145. [21] Y.J. Yu, Q.L. Xia, S. Wang, B. Wang, F.W. Xie, X.B. Zhang, Y.M. Ma, H.L. Wu, Chemometric strategy for automatic chromatographic peak detection and background drift correction in chromatographic data, J. Chromatogr. A 1359 (2014) 262–270. [22] J. Luo, K. Ying, J. Bai, Savitzky-Golay smoothing and differentiation filter for even number data, Signal Process 85 (2005) 1429–1434.
J. Xu et al. / J. Chromatogr. A 1602 (2019) 273–283 [23] R. Danielsson, B. Dan, K.E. Markides, Matched filtering with background suppression for improved quality of base peak chromatograms and mass spectra in liquid chromatography-mass spectrometry, Anal. Chim. Acta 454 (2002) 167–184. [24] R.C. Allen, S.C. Rutan, Investigation of interpolation techniques for the reconstruction of the first dimension of comprehensive two-dimensional
283
liquid chromatography-diode array detector data, Anal. Chim. Acta 705 (2011) 253–260. [25] J.M. Davis, D.R.S. And, P.W. Carr, Effect of first-dimension undersampling on effective peak capacity in comprehensive two-dimensional separations, Anal. Chem. 80 (2008) 461–473.