Chemometrics and Intelligent Laboratory Systems 127 (2013) 1–6
Fast and shift-insensitive similarity comparisons of NMR using a tree-representation of spectra
Andrés Mauricio Castillo a, Lalita Uribe b, Luc Patiny c, Julien Wist d,⁎
a Facultad de Ingeniería, Universidad Nacional de Colombia, Bogotá D.C., Colombia
b Departamento de Química, Universidad Nacional de Colombia, Bogotá D.C., Colombia
c Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
d Chemistry Department, Universidad del Valle, A.A. 25360, Cali, Valle, Colombia
Article info
Article history: Received 11 March 2013; Received in revised form 20 May 2013; Accepted 21 May 2013; Available online 26 May 2013
Keywords: NMR; Search engine; Spectral similarity; Binary trees
Abstract
An efficient method to extract and store information from NMR spectra is proposed that is suitable for comparison and for the construction of a search engine. This tree-based method requires neither peak picking nor any pre-treatment of the data and is found to outperform the currently available methods in terms of both compactness and speed. Our approach was tested for 1D proton spectra and 2D HSQC spectra and compared with the method proposed by Pretsch and coworkers [1,2]. Additionally, the correspondence between spectral and structural similarity was evaluated for both methods. © 2013 Elsevier B.V. All rights reserved.
⁎ Corresponding author at: Universidad del Valle, Ciudad Universitaria Meléndez, Calle 13 No 100-00, A.A. 25360, Cali, Colombia. Tel.: +57 316 478 27 10. E-mail address: [email protected] (J. Wist).
0169-7439/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.chemolab.2013.05.009

1. Introduction
Modern strategies for discovering active compounds, for developing new tools for the diagnosis of diseases, or for monitoring food origin and quality make intensive use of Nuclear Magnetic Resonance (NMR). As a result, a large number of spectra are produced that have to be stored, processed and analyzed; today, experts usually perform this task manually. In addition, several applications have been described that make extensive use of databases and database lookup to determine the composition of mixtures [3], identify natural product compounds [4] or metabolites from biofluids [5], and predict chemical shifts [6,7].

Although universities and research institutes do not necessarily have strategies for managing spectroscopic data, such management can readily be automated in a way that ensures security, traceability, and privacy. During the last 6 years, our group has been working to build a cloud-like solution that connects spectrometers (IR, NMR, MS) directly to a server where the spectra can be accessed, manipulated and analyzed from the Web. This service [8], in use in our respective institutions, can be used for free.

Several databases can store NMR or other spectroscopic data [9–19], but most of them only accept spectra that have been assigned [10,11,13,15,16,19], i.e., spectra for which a list of peaks exists, each attributed to an atom of the molecule. This assignment has to be checked manually, and thus relies on the user's willingness to share data and analysis. The benefit is that the database can be searched easily; for example, all the spectra having peaks in a given range of chemical shifts can be selected. However, searching for similar spectra by peak-table comparison might fail because of variations in signal positions induced by changes in the experimental conditions.

Because all the spectra produced in our scenario are automatically sent to a server, a portion of them will not be further analyzed, or will be analyzed elsewhere. Thus, those raw spectra cannot be compared using methods that rely on peak tables. A solution to this problem would consist in extracting such tables automatically, assuming that a method exists that is robust and efficient at extracting peak positions. A procedure to find duplicate spectra based on peak tables [20] has been proposed, but the type of experiment, the acquisition conditions, the signal overlap, and the choice of solvent all affect our ability to obtain reliable peak positions automatically. Besides, as already mentioned, comparing peak tables from spectra obtained at different temperatures and pH values is not trivial, and searching spectra by chemical shift range might fail to capture the similarities between spectra.

Another approach consists in comparing spectra directly. Techniques such as cross-correlation [21,22] or area-overlap [23] (spectra intersection) were first proposed for IR spectra and allow for the pairwise comparison of vectors, but they are computationally intensive and are therefore suitable neither for large collections of data nor for data whose peak positions shift with the experimental conditions. A thorough discussion of the comparison of IR and MS spectra can be found elsewhere [24,25]. More recently Pretsch and coworkers [1]
proposed a method referred to as binning that consists in successively dividing the spectrum into bins of decreasing width. The intensities of the signals or signal integrals within each bin are computed at each step and stored linearly as a vector that is then used for comparison. A procedure for comparing these vectors is proposed that minimizes the effect of signal shifts on the measured similarity. This method does not rely on peak tables and is computationally efficient when compared to cross-correlation. Subsequently, an implementation of the binning approach was proposed for 2D-HSQC (Heteronuclear Single Quantum Coherence) spectra [2]. However, the efficiency of the binning method decreases with decreasing signal density, e.g., when switching from 1D to 2D spectra. Indeed, dividing an entire 2D spectrum produces a vast number of empty bins.

In this paper we introduce a new method based on trees for the storage of NMR spectra and a comparison algorithm that scales linearly with tree size, regardless of the dimensionality of the experiments. The paper is organized as follows: first, the method used to encode experimental data into trees is described; second, the procedure to compare the resulting trees is described; and finally, examples obtained with 1H NMR and 1H–13C HSQC spectra are shown, followed by a comparison of our results with those obtained using the binning method. We will discuss the results from two different standpoints: on the one hand, the ability to find replicated spectra of a single molecule and, on the other hand, the correlation between spectral similarity and molecular similarity. While the former is relevant for a search engine, the latter is particularly important when mining NMR data for drug design [24,26].
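To make the binning representation of [1] concrete, a minimal Python sketch is given here. It assumes that step k divides the spectrum into k equal-width bins, which is inferred from the vector length N(N + 1)/2 quoted later in this paper and is not taken from [1] itself; the function name and the synthetic spectrum are hypothetical, and the shift-tolerant comparison procedure of [1] is not reproduced.

```python
from typing import List


def binning_vector(intensities: List[float], n_iterations: int) -> List[float]:
    """Concatenate the bin integrals obtained with 1, 2, ..., n_iterations bins.

    ASSUMPTION: step k uses k equal-width bins (inferred, not from [1]).
    """
    n_points = len(intensities)
    vector: List[float] = []
    for n_bins in range(1, n_iterations + 1):
        for b in range(n_bins):
            # Integer bin boundaries that partition the whole spectrum.
            start = b * n_points // n_bins
            stop = (b + 1) * n_points // n_bins
            vector.append(sum(intensities[start:stop]))
    return vector


# Hypothetical spectrum of 1024 points with two signals.
spectrum = [0.0] * 1024
spectrum[300] = 1.0
spectrum[700] = 2.0
vector = binning_vector(spectrum, 35)
print(len(vector))
```

With this layout most entries of the vector are zero for a sparse spectrum, which is the inefficiency that motivates the tree representation.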
2. Theory
2.1. Constructing trees
The spectrum is split at its center of mass. An initial vertex, or root node, is created and identified by the chemical shift of the center of mass and the total intensity of the spectrum. The center of mass divides the spectrum into left and right sub-regions. The immediate descendants of the root node are created by determining the centers of mass and total intensities of both sub-regions. Those three nodes split the spectrum into four sub-regions, whose centers of mass determine four new descendants. Repeating this procedure iteratively generates a binary tree whose nodes converge towards the more crowded regions of the spectrum (see Fig. 1A and Code Fragment 1).

Since the noisy regions are excluded from this division scheme, the same idea can readily be extended to data of higher dimensionality. In the case of a 2D spectrum, for instance, the division at each center of mass yields 4 quadrants, and therefore a quad-tree is created, as depicted in Fig. 1B. Each of its vertices contains the coordinates of the center of mass, i.e., the chemical shifts in the direct and indirect dimensions, and the intensity of the region.

Since the width of the signals is not negligible, a peak can be divided between two regions, thus affecting the position of the center of mass for the next iteration. This is particularly true for large peaks that are broad close to the noise level, and a correction is needed to avoid children being too close to their parent node. Therefore, if the chemical shift of a node is very close to that of its parent, the child node is coalesced into the parent node. For this purpose, chemical shifts are considered sufficiently similar if they lie within one quarter of the minimum window size from the boundaries of the corresponding sub-region, where the minimum window size is an adjustable parameter that defines the minimum area that can be integrated (0.16 ppm in this work).

Fig. 1. A) Binary tree obtained from a 1D NMR spectrum. The first iteration finds the root node, and the iteration stops when no child nodes can be created, i.e., when the total intensity of the sub-region is below a certain threshold value. In this work the threshold was set to one percent of the total integral for 1D spectra, and half that amount for 2D spectra. B) HSQC spectrum and its resulting quad tree. The construction of trees for 2D spectra is very similar to that for 1D spectra. In this case, each node splits the spectrum into four quadrants and two coordinates are required to describe the position of the center of mass. The strength of this approach is that it ignores low-density regions of the spectrum. C) Illustration of the correction performed when the tail of a huge peak (left) has an integral larger than that of the weak signal (right). In this case a child node falls close to its parent, as indicated by the gray wide dashed line and the gray circle. Since the distance is smaller than one quarter of the minWindow parameter (represented by the gray area), this node is removed. The next generation falls outside the minimal distance and a node is created. Finally, a last child (on the rightmost side) is found close to the position of the weak signal, while the other child (left, open gray circle) cannot be created.
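The construction described in this section can be sketched in Python as follows. This is an illustrative reading of the text and of Fig. 1C, not the authors' Java implementation: the names `Node`, `build_tree` and `count_nodes` are hypothetical, intensities are assumed non-negative, and the way a collapsed child is handled (the search continues on the side of the sub-region away from the parent) is an assumption.

```python
import math
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    shift: float                      # center of mass of the region (ppm)
    integral: float                   # total intensity of the region
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(x: List[float], y: List[float], min_window: float = 0.16,
               threshold_fraction: float = 0.01) -> Optional[Node]:
    """Build a binary tree from a 1D spectrum; x must be sorted ascending."""
    threshold = threshold_fraction * sum(y)

    def build(i0: int, i1: int, parent_shift: Optional[float]) -> Optional[Node]:
        # Stop when the region is narrower than min_window or too weak.
        if i1 - i0 < 2 or x[i1 - 1] - x[i0] < min_window:
            return None
        total = sum(y[i0:i1])
        if total <= 0.0 or total < threshold:
            return None
        center = sum(xi * yi for xi, yi in zip(x[i0:i1], y[i0:i1])) / total
        # Split index, clamped so that both sub-regions are strictly smaller.
        k = min(max(bisect_left(x, center, i0, i1), i0 + 1), i1 - 1)
        if parent_shift is not None and abs(center - parent_shift) < min_window / 4:
            # ASSUMPTION: a child too close to its parent is dropped and the
            # search continues on the far side of the rejected center.
            if center > parent_shift:
                return build(k, i1, parent_shift)
            return build(i0, k, parent_shift)
        return Node(center, total, build(i0, k, center), build(k, i1, center))

    return build(0, len(x), None)


def count_nodes(node: Optional[Node]) -> int:
    return 0 if node is None else 1 + count_nodes(node.left) + count_nodes(node.right)


# Hypothetical spectrum: two equal Gaussian signals at 2.0 and 7.0 ppm.
x = [10.0 * i / 1023 for i in range(1024)]
y = [math.exp(-0.5 * ((xi - 2.0) / 0.05) ** 2)
     + math.exp(-0.5 * ((xi - 7.0) / 0.05) ** 2) for xi in x]
tree = build_tree(x, y)
print(round(tree.shift, 2), count_nodes(tree))
```

For clarity the sketch recomputes each integral with plain sums; precomputed cumulative sums of y and x·y would give each integral in constant time.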
Code Fragment 1. Java source code for the creation of the binary trees. The spectrum is stored in a, while the parameters from and to define the upper and lower boundaries for the calculation of the center of mass. Since the whole spectra were considered in this work, the upper and lower limits were initially set to 10 and 0 ppm. The adjustable parameters minWindow and threshold define the termination conditions and were set to 0.16 ppm and 1% of the integral of the whole spectrum. minWindow represents the smallest region that will be integrated; any node lying less than minWindow/4 from its parent is collapsed into it, as illustrated in Fig. 1C. The two functions sum(a, from, to) and weightSum(a, from, to) stand for $\int_{from}^{to} y\,dx$ and $\int_{from}^{to} x\,y\,dx$ respectively, and can be evaluated very efficiently. Here, x and y represent the chemical shift and the intensity.

2.2. Comparing trees
In order to compare the resulting trees, the simplest method was chosen, which consists in starting the analysis from the root node. The similarity s(a,b) between two nodes a and b of trees A and B can be written as a recursive equation:

\[ s(a,b) = \beta\, C(a,b) + (1-\beta)\, \frac{1}{m} \sum_{l=1}^{m} s\big(\mathrm{child}_l(a), \mathrm{child}_l(b)\big). \]

This equation is valid for both binary and quad trees, where m stands for the number of children nodes that exist in trees A and B, and children are matched following the canonical left-right or quadrant ordering. The term C(a,b) on the right-hand side measures the quality of the match between the two nodes under comparison, while the sum in the second term measures the quality of the matches between their children. β is an adjustable parameter that weights the relative importance of node matching and children matching. If β is close to unity, the match between the two root nodes determines the similarity between the trees. Decreasing this value increases the contributions from the children nodes. In this work β was set to 0.33, meaning that the contribution of each generation is lowered by a third; this prevents small shifts from dramatically affecting the overall similarity measurement. C(a,b) is defined as:

\[ C(a,b) = \begin{cases} 0, & \text{if at least one node is empty,} \\ \alpha\, \dfrac{\min(I_a, I_b)}{\max(I_a, I_b)} + (1-\alpha)\, e^{-\gamma |\delta_a - \delta_b|}, & \text{otherwise,} \end{cases} \]

where I and δ represent the integrals and the chemical shifts of the nodes. The first term compares the intensities, while the second term compares the chemical shifts. The α parameter weights the relative importance of intensity vs. shift match, and γ controls the attenuation of the effect of chemical shift differences on the similarity score. This parameter is adjusted according to the expected variations due to different experimental conditions and thus depends on the observed nuclei. In the present work, α = 0.1 and γ = 0.01. It can be shown that both equations ensure that s(a,b) = s(b,a).

2.3. Comparing molecules
As pointed out early on by Pretsch and coworkers [26], spectral similarity has to match structural similarity in order to provide a useful tool for searching similar compounds. The hypothesis is that similar molecules have similar properties. Thus, two similar spectra are expected to correspond to similar molecules. In order to validate this assumption, each molecule was described by a vector of features, a binary vector obtained by determining the presence (1) or absence (0) of substructures or structural descriptors. In this case, a set of 512 structural descriptors [27] was used to build these binary feature vectors, which were then compared using a simple cosine similarity.

3. Material and methods
To evaluate our approach, a set of 250 molecules that have comparable masses (250 ± 10) and chemical formulae (C10–20H10–30O0–10N0–10) was chosen. For each of them, the NMR parameters (chemical shifts and coupling constants) were predicted using the Spinus [28] online platform. In order to simulate different experimental conditions, random fluctuations with a standard deviation of 0.28 ppm were added to the predicted chemical shifts before simulating the spectra. The standard deviation was chosen to correspond to the average fluctuations in peak positions observed in the spectrum of a molecule when slightly varying temperature, pH and concentration (unpublished results). Thus, each original prediction resulted in 10 prediction tables that were used to simulate 10 spectra at a frequency of 400 MHz and 16k points with the algorithm that we described elsewhere [29].

In addition, a collection of HSQC spectra was built based on proton and carbon predictions. The latter were obtained using NMRShiftDB [9], while the connectivity was obtained from the molecular structures. A two-dimensional Gaussian function of width 0.2 and 0.02 ppm in the proton and carbon dimensions was added to the spectrum for each bonded C–H pair, and all signals were assigned the same intensity.
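The generation of perturbed prediction tables described in this section can be sketched as follows. This is only an illustration: the text gives the standard deviation but not the distribution, so independent Gaussian noise on each shift is an assumption, and the function name, the seed and the example shifts are hypothetical.

```python
import random
from typing import List


def perturbed_replicas(shifts: List[float], n_replicas: int = 10,
                       sigma: float = 0.28, seed: int = 0) -> List[List[float]]:
    """Return n_replicas copies of a shift table with random fluctuations.

    ASSUMPTION: each shift receives independent Gaussian noise of
    standard deviation sigma (in ppm).
    """
    rng = random.Random(seed)  # fixed seed so that the replicas are reproducible
    return [[s + rng.gauss(0.0, sigma) for s in shifts] for _ in range(n_replicas)]


predicted = [1.25, 3.70, 7.26]  # hypothetical predicted 1H shifts (ppm)
replicas = perturbed_replicas(predicted)
print(len(replicas), [round(s, 2) for s in replicas[0]])
```

Each replica would then be passed to the spectrum simulation; a different `sigma` can be supplied for other nuclei.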
Fluctuations in both the carbon and proton dimensions (3 and 0.28 ppm, respectively) were introduced to account for the effect of experimental conditions.

4. Discussion of the results
For each of the 2500 spectra of both sets of experiments the resolution was reduced to 1024 equidistant points, between 0 and 10 ppm for 1H and between 0 and 200 ppm for 13C, and the trees were computed. This reduction in resolution significantly speeds up the construction of trees and was found not to affect the overall results. Examples of trees obtained for 1D and 2D spectra are illustrated in Fig. 2. Similarity matrices were computed by evaluating the similarity between each pair of trees, as described in Section 2.2. For the sake of comparison, a similarity matrix was computed using binning vectors, strictly following the procedure described by Pretsch and coworkers [1]. Finally, to evaluate the correlation between the spectral and the molecular similarity, pair-wise similarities for molecules were calculated using the procedure described in Section 2.3.

Fig. 2. Examples of trees obtained from A) a proton and B) an HSQC spectrum. It should be noted that trees provide a sort of peak-picking; indeed, the centers of mass rapidly converge towards the chemical shifts of individual signals. On the right, the open circles represent the correlation peaks, while the black circles represent the nodes of the quad tree. (Axes: proton dimension [ppm]; carbon dimension [ppm] in panel B.)

Ideally, the similarity matrix is block diagonal. The elements of the matrix should be equal to 1 inside the diagonal blocks, i.e., for the 10 spectra that belong to the same molecule, and < 1 elsewhere. In real cases, the similarity between spectra of the same molecule (positive matches) is not equal to 1, but is still expected to be higher than the similarity measured for two spectra belonging to different molecules (negative matches). Therefore, the method that performs best is the one that maximizes the distances outside the diagonal blocks and minimizes the distances and the standard deviation of their distribution inside the diagonal blocks. If the distributions of positive and negative distances are completely separated, a threshold value can be found that allows for a perfect classification. Since the aim of this work is to test the robustness of our comparison methods, large chemical shift fluctuations have been introduced in the spectra. In addition, the data set was chosen so that the molecules are very similar, and hence overlapping distributions are expected. The area of overlap between the distributions reflects the performance of the classification methods, but determining the best threshold value is no longer trivial. The area under the ROC (Receiver Operating Characteristics) curve (AURC) reflects the overall performance [30] of the comparison method. The AURC reaches its maximum value of 1 for a perfect comparison and decreases as the quality of the comparison deteriorates, down to 0.5, which corresponds to a random comparison.

4.1. Comparison of 1D spectra
The results obtained for 1D proton spectra are summarized in Table 1 and illustrated in Fig. 4. It can be observed that similarities evaluated using trees achieved a slightly higher AURC.
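To make the evaluation concrete, the following Python sketch implements the node and tree similarities of Section 2.2 with the parameter values quoted there, together with a rank-based estimate of the AURC. It is a sketch under stated assumptions rather than the implementation used in this work: m is taken as the number of child slots occupied in at least one of the two trees, two leaves are scored by C(a,b) alone so that identical trees reach a similarity of 1, and the names `Node`, `node_match`, `similarity` and `auc` as well as the three small trees are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

ALPHA, BETA, GAMMA = 0.1, 0.33, 0.01   # parameter values quoted in Section 2.2


@dataclass
class Node:
    shift: float                       # chemical shift of the center of mass
    integral: float                    # integral of the region (assumed > 0)
    children: List[Optional["Node"]] = field(default_factory=lambda: [None, None])


def node_match(a: Optional[Node], b: Optional[Node]) -> float:
    """C(a, b): 0 if a node is missing, else intensity and shift agreement."""
    if a is None or b is None:
        return 0.0
    intensity = min(a.integral, b.integral) / max(a.integral, b.integral)
    return ALPHA * intensity + (1.0 - ALPHA) * math.exp(-GAMMA * abs(a.shift - b.shift))


def similarity(a: Optional[Node], b: Optional[Node]) -> float:
    """Recursive s(a, b); children are paired slot by slot."""
    if a is None or b is None:
        return 0.0
    pairs = [(ca, cb) for ca, cb in zip(a.children, b.children)
             if ca is not None or cb is not None]
    if not pairs:
        # ASSUMPTION: two leaves are scored by C(a, b) alone.
        return node_match(a, b)
    mean_children = sum(similarity(ca, cb) for ca, cb in pairs) / len(pairs)
    return BETA * node_match(a, b) + (1.0 - BETA) * mean_children


def auc(positives: Sequence[float], negatives: Sequence[float]) -> float:
    """Area under the ROC curve as the fraction of correctly ordered pairs."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(positives) * len(negatives))


# Hypothetical trees: t2 is a shifted replica of t1, t3 is a different molecule.
t1 = Node(4.5, 100.0, [Node(2.0, 50.0), Node(7.0, 50.0)])
t2 = Node(4.6, 100.0, [Node(2.1, 50.0), Node(7.1, 50.0)])
t3 = Node(3.0, 100.0, [Node(1.0, 90.0), None])
print(similarity(t1, t2), similarity(t1, t3))
print(auc([similarity(t1, t2)], [similarity(t1, t3), similarity(t2, t3)]))
```

The pairwise count in `auc` is quadratic in the number of scores; it is used here because it is easy to verify, and a sort-based ranking would be preferable for the full 2500 × 2500 similarity matrix.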
For binning, the results were obtained using a minimum bin width of 0.28 ppm, corresponding to N = 35 iterations. Decreasing the minimum bin width further was found to introduce noise into the binning vector, degrading the results. Conversely, the figures reported in Table 1 were found to be insensitive to the depth of the trees, as a consequence of the adjustment of the threshold intensity. This behavior is easily understood by considering that dividing into smaller bins introduces more empty bins, while adding further levels to a tree only appends children nodes that are close to their parents.

These results depend on the dataset, i.e., on the initial choice of the molecules. Nevertheless, considering that the molecules were chosen to have similar chemical formulae and similar masses, such figures confirm that both binning and trees achieve good discrimination and can be used to build search engines for NMR spectral databases. However, trees are much more compact. Indeed, binning performed best using a minimum bin width of 0.28 ppm, reached after N = 35 iterations and resulting in a binning vector of 630 points (N(N + 1)/2). Conversely, the maximum depth L of the trees was found to be 8, corresponding to at most 255 nodes ($2^L - 1$). However, the tree is sparse, and the maximal number of nodes in this dataset was 85. Clearly, trees provide a very compact manner to extract and store the relevant information available in NMR spectra in a way that enables comparison. Fig. 3 shows that the average number of nodes increases linearly with the molecular weight, using a set of 1100 molecules whose distribution is represented by the histogram in the same figure. These average values would allow us to store 1 million NMR spectra in 400 Mb.

For the sake of comparison, the predicted peak tables were compared directly using cross-correlation. To this end, each peak was assumed to be a triangular function, a method that is used to compare mass spectra. Not surprisingly, the results (solid gray line in Fig. 4A and B) were found to be very close to those obtained using cross-correlation to compare spectra (black dashed line), since most of the information is contained in the peak table. However, in the case of real spectra this would require a robust peak-picking procedure.
Table 1. Results obtained for proton and HSQC spectra compared using binning vectors or trees. The time measurements are normalized to the proton tree values, and the numbers in parentheses show the results when normalized to HSQC trees. The first and second hit values represent the similarities between the molecule of the target spectrum and the molecules associated with its most and 2nd most similar spectra. The values show averaged results for different target spectra.

| | Trees 1H | Bin 1H | Tree HSQC | Bin HSQC |
| AURC | 0.99 | 0.987 | 0.991 | 0.986 |
| First hit (best match) | 0.72 | 0.72 | 0.73 | 0.73 |
| Second hit (2nd best match) | 0.71 | 0.71 | 0.71 | 0.71 |
| Time for mapping | 1 | 1.71 | 7.75 (1) | 17.99 (2.32) |
| Time to compare | 1 | 6.43 | 2.35 (1) | 14.88 (6.33) |
Fig. 3. The mean number of tree nodes (+), the number of hydrogen (black diamond) and carbon (x) atoms grow linearly with increasing molecular weight. The gray envelope ranges from the minimum to the maximum number of nodes, while the black histogram (right scale) represents the molecular weight distribution in the dataset.
4.2. Comparison of 2D spectra
The compactness of trees makes them attractive for the comparison of multi-dimensional spectra. To illustrate this, 2500 HSQC spectra were prepared and a quad-tree was constructed for each spectrum. Their pair-wise similarities were evaluated as described in Section 2.2. While the time needed for the construction of trees increases with data dimensionality (7× in the present example, see Table 1), the time for comparison scales linearly with the number of nodes. The lower signal overlap in HSQC spectra was found to produce twice as many nodes as in the 1H experiments, thus doubling the time required for comparison (see Table 1). Once trees have been built and stored in the database, the search engine performs fast regardless of the type of experiments being compared. The mapping procedure is the bottleneck step for both methods; however, the tree-based search engine was always found to perform faster.

4.3. Spectral similarity vs. molecular similarity
Finally, we would like to compare the similarity values obtained from structures and from NMR spectra to estimate the power of the spectral search strategy for finding similar molecules. To this end, the hit list method, first proposed by Varmuza and coworkers for infrared spectroscopy [24], was implemented and applied to binning vectors and trees. For a spectrum picked at random (out of 2500 spectra), the h most similar spectra (the first h hits) were identified, and the h pairwise structural similarities between the target molecule and the molecules of the hits were computed and averaged. A grand average was obtained by repeating this procedure for each of the 2500 spectra. The first two hits are reported in Table 1, while the first 10 are plotted against h (see Fig. 4B and D). The lower bold line (Fig. 4B and D) represents the mean similarity obtained when the h spectra are chosen randomly and is thus related to the intrinsic structural similarity of the dataset. The upper bold line represents the best possible result, i.e., when the ranking obtained from spectral similarities exactly matches the one resulting from structural similarities. It also depends on the choice of the molecules that constitute the database.

The results show that both binning vectors and trees provide a similar insight into structural similarity. Indeed, trees and binning vectors were found to perform similarly; in fact, the hit list curves are almost indistinguishable (see Fig. 4B and D). These results depend on the data and on the set of descriptors [27] selected to construct the structural vector; nevertheless, the comparison between the two methods holds. Finally, we repeated the procedure for proton spectra, but using the peak tables obtained directly from the predictions instead of trees. In this case, the comparison was achieved using a standard cross-correlation. The AURC and hit lists obtained were found to be comparable to the ones obtained using the area-overlap method (dashed-dotted curve in Fig. 4A). We therefore conclude that the direct approach consisting in performing an automated peak picking to compare raw data is not viable when comparison of the whole spectra is required.

Fig. 4. ROC curves for proton spectra A) and HSQC spectra C), obtained using trees (black), binning (dotted), cross-correlation applied to spectra (dashed, proton only), cross-correlation applied directly to peak tables (gray solid line, proton only) and standard area-overlap (dotted dashed line). On the right-hand side, the hit list curves for proton B) and HSQC D) spectra using trees (black), binning (dotted), cross-correlation applied to spectra (dashed, proton only), cross-correlation applied directly to peak tables (gray solid line, proton only) and standard area-overlap (dotted dashed line). On the right, the upper and lower bold lines represent the best possible and random curves. It can be seen that the curves obtained for binning and for trees are completely superimposed, both for proton and HSQC spectra. (Axes: true positive rate vs. false positive rate in A and C; grand average vs. number of hits (h) in B and D. Curve labels: TREE, BIN, CC, PP, AO.)

5. Conclusions
We demonstrated that it is possible to compare NMR spectra efficiently using trees. Trees are faster to build, faster to compare and much more compact than binning vectors, while the performance of the two methods, measured with both AURC and hit lists, was found to be similar. The compactness of trees makes them suitable for search engines, since they can be pre-computed and stored to accelerate the lookup response. The quasi-linear relationship between the mean node number and the number of peaks makes trees attractive for finding similarities between spectra not only for 2D HSQC but also for higher-dimensional spectra. We showed that the automatic peak-picking approach, assuming a robust function exists for peak detection, would not help for comparing entire NMR spectra, at least not using a simple cross-correlation [20].

Acknowledgments
The authors acknowledge Colciencias-Renata (RC-561-2009) for funding this project.

References
[1] L. Bodis, A. Ross, E. Pretsch, A novel spectra similarity measure, Chemometrics and Intelligent Laboratory Systems 85 (2007) 1–8.
[2] L. Bodis, A. Ross, J. Bodis, E. Pretsch, Automatic compatibility tests of HSQC NMR spectra with proposed structures of chemical compounds, Talanta 79 (2009) 1379–1386.
[3] S.L. Robinette, F. Zhang, L. Brüschweiler-Li, R. Brüschweiler, Web server based complex mixture analysis by NMR, Analytical Chemistry 80 (2008) 3606–3611.
[4] A. Tsipouras, J. Ondeyka, C. Dufresne, S. Lee, G. Salituro, N. Tsou, M. Goetz, S.B. Singh, S.K.
Kearsley, Using similarity searches over databases of estimated 13C NMR spectra for structure identification of natural product compounds, Analytica Chimica Acta 316 (1995) 161–171.
[5] J. Xia, T.C. Bjorndahl, P. Tang, D.S. Wishart, MetaboMiner — semi-automated identification of metabolites from 2D NMR spectra of complex biofluids, BMC Bioinformatics 9 (2008) 507.
[6] Y. Binev, M. Corvo, J. Aires-de-Sousa, The impact of available experimental data on the prediction of 1H NMR chemical shifts by neural networks, Journal of Chemical Information and Computer Sciences 44 (2004) 946–949.
[7] J. Meiler, PROSHIFT: protein chemical shift prediction using artificial neural networks, Journal of Biomolecular NMR 26 (2003) 25–37.
[8] MyLIMS, My laboratory information management system. [Online]. Available: www.mylims.org, (Accessed: 04-Feb-2013).
[9] C. Steinbeck, S. Kuhn, NMRShiftDB — compound identification and structure elucidation support through a free community-built web database, Phytochemistry 65 (2004) 2711–2717.
[10] nmrshiftdb2 — open nmr database on the web. [Online]. Available: http://nmrshiftdb.nmr.uni-koeln.de, Jul-2002, (Accessed: 04-Feb-2013).
[11] BMRB — Biological Magnetic Resonance Bank. [Online]. Available: http://www.bmrb.wisc.edu, (Accessed: 04-Feb-2013).
[12] J.L. Markley, H. Akutsu, T. Asakura, M. Baldus, R. Boelens, A. Bonvin, R. Kaptein, A. Bax, I. Bezsonova, M.R. Gryk, J.C. Hoch, D.M. Korzhnev, M.W. Maciejewski, D. Case, W.J. Chazin, T.A. Cross, S. Dames, H. Kessler, O. Lange, T. Madl, B. Reif, M. Sattler, D. Eliezer, A. Fersht, J. Forman-Kay, L.E. Kay, J. Fraser, J. Gross, T. Kortemme, A. Sali, T. Fujiwara, K. Gardner, X. Luo, J. Rizo-Rey, M. Rosen, R.R. Gil, C. Ho, G. Rule, A.M. Gronenborn, R. Ishima, J. Klein-Seetharaman, P. Tang, P. van der Wel, Y. Xu, S. Grzesiek, S. Hiller, J. Seelig, E.D. Laue, H. Mott, D. Nietlispach, I. Barsukov, L.-Y. Lian, D. Middleton, T. Blumenschein, G. Moore, I. Campbell, J. Schnell, I.J. Vakonakis, A. Watts, M.R. Conte, J. Mason, M. Pfuhl, M.R. Sanderson, J. Craven, M. Williamson, C. Dominguez, G. Roberts, U. Günther, M. Overduin, J. Werner, P. Williamson, C. Blindauer, M. Crump, P. Driscoll, T. Frenkiel, A. Golovanov, S. Matthews, J. Parkinson, D. Uhrin, M. Williams, D. Neuhaus, H. Oschkinat, A. Ramos, D.E. Shaw, C. Steinbeck, M. Vendruscolo, G.W. Vuister, K.J. Walters, H. Weinstein, K. Wüthrich, S. Yokoyama, In support of the BMRB, Nature Structural and Molecular Biology 19 (2012) 854–860.
[13] D.S. Wishart, T. Jewison, A.C. Guo, M. Wilson, C. Knox, Y. Liu, Y. Djoumbou, R. Mandal, F. Aziat, E. Dong, S. Bouatra, I. Sinelnikov, D. Arndt, J. Xia, P. Liu, F. Yallou, T. Bjorndahl, R. Perez-Pineiro, R. Eisner, F. Allen, V. Neveu, R. Greiner, A. Scalbert, HMDB 3.0 — The Human Metabolome Database in 2013, Nucleic Acids Research 41 (2013) D801–D807.
[14] Public database of NMR spectra. [Online]. Available: http://www.acornnmr.com/database.htm, (Accessed: 04-Feb-2013).
[15] National Institute of Advanced Industrial Science, Technology (AIST), [Online]. Available: http://sdbs.riodb.aist.go.jp/sdbs/cgi-bin/cre_index.cgi?lang=eng, (Accessed: 04-Feb-2013).
[16] The Madison Metabolomics Consortium Database (MMCD). [Online]. Available: http://mmcd.nmrfam.wisc.edu, (Accessed: 04-Feb-2013).
[17] Q. Cui, I.A. Lewis, A.D. Hegeman, M.E. Anderson, J. Li, C.F. Schulte, W.M. Westler, H.R. Eghbalnia, M.R. Sussman, J.L. Markley, Metabolite identification via the Madison Metabolomics Consortium Database, Nature Biotechnology 26 (2008) 162–164.
[18] O. Lundberg, P. Vogel, T. Malusek, A. Lundquist, P.O. Cohen, L. Dahlqvist, MDL — the magnetic resonance metabolomics database (mdl.imv.liu.se), in 22th Annual Meeting of the European Society for Magnetic Resonance in Medicine and Biology, Magnetic Resonance Materials in Physics, Biology and Medicine, 2005.
[19] The Magnetic Resonance Metabolomics Database (mdl.imv.liu.se). [Online]. Available: http://www.liu.se/hu/mdl/main/.
[20] A. Hinneburg, B. Egert, A. Porzel, Duplicate detection of 2D-NMR Spectra, Journal of Integrative Bioinformatics 4 (2007) 53.
[21] S.J. Prestrelski, N. Tedeschi, T. Arakawa, J.F. Carpenter, Dehydration-induced conformational transitions in proteins and their inhibition by stabilizers, Biophysical Journal 65 (1993) 661–671.
[22] S.J. Prestrelski, T. Arakawa, J.F. Carpenter, Separation of freezing- and drying-induced denaturation of lyophilized proteins using stress-specific stabilization. II. Structural studies using infrared spectroscopy, Archives of Biochemistry and Biophysics 303 (1993) 465–473.
[23] B.S. Kendrick, A. Dong, S.D. Allison, M.C. Manning, J.F. Carpenter, Quantitation of the area of overlap between second-derivative amide I infrared spectra to determine the structural similarity of a protein in different states, Journal of Pharmaceutical Sciences 85 (8) (1996) 155–158.
[24] K. Varmuza, M. Karlovits, W. Demuth, Spectral similarity versus structural similarity: infrared spectroscopy, Analytica Chimica Acta 490 (2003) 313–324.
[25] W. Demuth, M. Karlovits, K. Varmuza, Spectral similarity versus structural similarity: mass spectrometry, Analytica Chimica Acta 516 (2004) 75–85.
[26] M. Zürcher, J. Clerc, M. Farkas, E. Pretsch, General theory of similarity measures for library search systems, Analytica Chimica Acta 206 (1998) 161–172.
[27] T. Sander, J. Freyss, M. von Korff, J.R. Reich, C. Rufener, OSIRIS, an entirely in-house developed drug discovery informatics system, Journal of Chemical Information and Modeling 49 (2009) 232–246.
[28] Y. Binev, M.M.B. Marques, J. Aires-de-Sousa, Prediction of 1H NMR coupling constants with associative neural networks trained for chemical shifts, Journal of Chemical Information and Modeling 47 (2007) 2089–2097.
[29] A.M. Castillo, L. Patiny, J. Wist, Fast and accurate algorithm for the simulation of NMR spectra of large spin systems, Journal of Magnetic Resonance 209 (2011) 123–130.
[30] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006) 861–874.