Information Processing Letters 110 (2010) 679–686
Contents lists available at ScienceDirect
Information Processing Letters www.elsevier.com/locate/ipl
Optimal global alignment of signals by maximization of Pearson correlation Pietro Di Lena ∗ , Luciano Margara Department of Computer Science, University of Bologna, Italy
a r t i c l e
i n f o
a b s t r a c t
Article history: Received 2 February 2010 Received in revised form 24 May 2010 Accepted 25 May 2010 Available online 27 May 2010 Communicated by A.A. Bertossi Keywords: Algorithms Signal similarity detection Pearson correlation Needleman–Wunsch algorithm Optimal global alignment
The problem of detecting the similarity between noisy signals obtained from electronic instrument measurements arises in several different contexts and it is approached with specific strategies accordingly. In this paper we propose a simple and general method for the comparison of noisy signals of different lengths. Assuming any a-priori knowledge about two noisy signals, their degree of similarity can be detected by computing the global alignment that maximizes their Pearson correlation. The Pearson correlation coefficient is a widely used measure of linear dependence between two random variables of the same length. The optimal alignment of two signals with respect to the Pearson correlation identifies the sub-regions of the two signals that exhibit the highest pairwise degree of similarity. We show that the optimal alignment of two signals by maximization of the Pearson correlation can be computed in (quadratic) polynomial-time by a simple application of the Needleman–Wunsch algorithm. Our approach can be used for the comparison of one-dimensional signals, multi-dimensional signals and multiple-alignments of (one-dimensional or multi-dimensional) signals. © 2010 Elsevier B.V. All rights reserved.
1. Introduction The problem of detecting the similarity between two discrete noisy signals arises in many different contexts. Some examples are electroencephalography (EEG) spike detection [1] and electrocardiogram (ECG) spike detection [2] for clinical diagnosis of epilepsy and heart diseases, respectively, or gas/liquid chromatography–mass spectrometry (GC-MS/LC-MS) peak detection [3] for the identification of different substances such as drugs and explosives. In the most general formulation, the signal similarity detection refers to the problem of selecting from a set of template-signals the ones that are more similar to a given target-signal. The objective is the extraction of information not readily available from the target-signal through visual inspection. The main difficulty in signal detection is that signals are corrupted by different types of noise and can
*
Corresponding author. E-mail addresses:
[email protected] (P. Di Lena),
[email protected] (L. Margara). 0020-0190/$ – see front matter doi:10.1016/j.ipl.2010.05.024
© 2010
Elsevier B.V. All rights reserved.
have different lengths. Notably, replicates of the same sample obtained from different experiments are usually not perfectly aligned in the time dimension. Such a problem is called time jitter and it is common to measurements obtained from electronic instruments. Most of the reported techniques for jitter compensation rely on the assumption that the signals to be matched belong to one of different classes whose waveforms are known. In those cases in which the waveforms are unknown these techniques cannot be used proficiently and different approaches are needed. In such cases, one general approach for signal detection is to perform pairwise alignments. The alignment between two signal involves the problem of detecting the subsequences of the two signals exhibiting the highest pairwise similarity. Alignment techniques have been extensively investigated for bio-sequence comparison in the last fifty years and several optimal or near-optimal techniques have been developed in this context for sequence comparison [4]. In this paper we describe a general approach for the comparison of (one-dimensional or multi-dimensional)
680
P. Di Lena, L. Margara / Information Processing Letters 110 (2010) 679–686
discrete noisy signals of different lengths, assuming any a-priori knowledge about the two types of signals. Our approach relies on a simple and direct combination of the Needleman–Wunsch polynomial-time algorithm [5] for optimal global alignment and the Pearson correlation coefficient [6] as a measure of similarity. The Pearson correlation coefficient is one of the most widely used measure of linear dependence between two random variables (realvalued vectors of the same length). Due to time jitters, even for signals of the same length the Pearson correlation is a poor measure of similarity if the two signals are not aligned to compensate jitter perturbations. We show that, given two signals, it is possible to compute in (quadratic) polynomial-time the optimal global alignment that maximizes their Pearson correlation (or their average Pearson correlation in the multi-dimensional case). The optimal global alignment via maximization of the Pearson correlation can be computed by using the exact Needleman– Wunsch algorithm with opportune scoring scheme and gap penalty function. The optimal alignment with respect to the Pearson measure identifies the subsequences of the two signals that share the highest pairwise degree of linear dependence with respect to the statistical distributions of the two signals. The higher is the correlation of the alignment, the higher is the confidence that the matched subsequences exhibit relevant similarity. In the context of signal detection, the identification of a cluster of similar signals (obtained, for example, from several replicates of the same experiment) is a common technique used to identify a profile-signal representative of that particular cluster. There are several ways to define a profile-signal. Some examples are the identification of the average signal in the cluster or the identification of a characteristic waveform. More advanced techniques have been developed in Computational Biology for profiling a set of similar sequences. One of the most important technique in this context is the multiple-alignment of all sequences within a cluster [7,8]. The multiple-alignment between all the elements within a cluster highlights the similarities that are common to all the elements and thus it is a quite convenient representation of the whole cluster. An alignment between two multiple-alignments is more sensitive than an alignment between single sequences for similarity detection. By a simple modification of the alignment algorithm for single signals, an optimal global alignment can be obtained in polynomial-time also for multiplealignments of signals, when the objective function is the maximization of the average Pearson correlation of all pairwise alignments between signals in the two profiles. The cost of the algorithm is quadratic on the length of the longest multiple-alignment and thus it does not depend on the number of signals used to compile the two multiplealignments. The paper is organized as follows. Section 2 is devoted to the introduction of the most basic background needed to understand the rest of the paper. In particular, in Section 2.1 we define the Pearson correlation coefficient and in Section 2.2 we review the Needleman–Wunsch polynomial-time algorithm for optimal global alignment. In Section 3 we describe our algorithms for optimal global
alignment by maximization of the Pearson correlation. Section 4 is devoted to the concluding remarks. 2. Preliminaries 2.1. Pearson correlation coefficients The Pearson product-momentum correlation coefficient is a measure of the linear dependence between two random variables (real-valued vectors). Historically, it is the first formal measure of correlation and it is still one of the most widely used measure of relationship [6]. The Pearson correlation coefficient of two variables X and Y is formally defined as the covariance of the two variables divided by the product of their standard deviations (which acts as a normalization factor) and it can be equivalently defined by:
rXY =
n i =1
Xi − X
n k =1 ( X k
−
X )2
· n
Yi − Y
(1)
2 k =1 ( Y k − Y )
n
where X = n1 i =1 X i denotes the mean of X . The coefficient r X Y ranges from −1 to 1 and it is invariant to linear transformations of either variables. A correlation close to 1 indicates a positive relationship between X and Y . A correlation close to −1 indicates a negative relationship. A correlation close to 0 indicates the absence of relationship between the two variables. Note that the Pearson correlation cannot be computed when one or both variables are constant, since in that case the quantity
n
k=1 ( X k
− X )2
is 0 and the coefficient (1) is undetermined. 2.2. Needleman–Wunsch algorithm for optimal global alignment An alignment between two sequences is a one-to-one partial mapping of elements from the first sequence to the second sequence. The Needleman–Wunsch algorithm [5] is a polynomial-time procedure for computing the optimal global alignment between two sequences with respect to a given scoring scheme and a given cost function for the introduction of gaps in the alignment. Let X , Y be two sequences of length m and n, respectively. We do not need to make any a-priori assumptions on the elements of X , Y : they can be numbers/symbols or vectors of numbers/symbols. We denote with I X = {1, . . . , n} the set of indices for the X sequence, i.e. the index i ∈ I X identifies the i-th element X i of X . An alignment between X and Y is a mapping f : I X → I Y that respects the following two properties: 1. f is an injective partial function, i.e. f is one-to-one and it does not necessarily map every element of I X (resp. I Y ) to an element of I Y (resp. I X ). When f is not defined on i ∈ I X (resp. i ∈ I Y ), we use the notation f (i ) = ∅ (resp. f −1 (i ) = ∅). 2. ∀i , j ∈ I X such that f (i ) = ∅ = f ( j ),
i< j
if and only if
f (i ) < f ( j ).
P. Di Lena, L. Margara / Information Processing Letters 110 (2010) 679–686
The first condition imposes that an element of X (resp. Y ) can be aligned at most with one element of Y (resp. X ). The non-aligned positions (i.e. i ∈ I X such that f (i ) = ∅ and i ∈ I Y such that f −1 (i ) = ∅) are assumed to be matched with a (opening) gap in the other sequence. Note that, intuitively, a matching of a gap in X with a gap in Y has no sense and it is not allowed. The second condition imposes that an alignment must preserve the ordering of the elements in X and Y . Note that the above definition of alignment, even if completely sound, deviates slightly from the Computational Biology tradition. The adoption of a different definition is justified by the attempt to simplify the notation in our theorems, especially when we deal with the multi-dimensional case. An optimal global alignment of X and Y is a alignment f : I X → I Y that maximizes some well-defined objective function. The objective function is defined in terms of a scoring scheme S : I X × I Y → R and a gap penalty G ∈ R. The scoring scheme defines a similarity score between every pair of elements of X and Y , while the gap penalty defines the cost for the introduction of a gap in the alignment. Given a scoring scheme S and a gap penalty G, the optimal global alignment between X and Y is formally defined as the alignment f : I X → I Y that maximizes the objective function n
S i , f (i ) + G · gap f
(2)
i =1 f (i )=∅
where gap f is the number of gaps introduced by f and it is defined as
gap f = i ∈ I X | f (i ) = ∅ + i ∈ I Y | f −1 (i ) = ∅ .
The optimal alignment with respect to the objective function (2) can be computed in O (mn) with the Needleman– Wunsch dynamic programming algorithm [5], originally introduced for biological sequence comparison. The Needleman–Wunsch algorithm can be easily modified in order to deal with more complex objective functions that encode different gap penalties for different positions in the alignment. In particular, the cost of the algorithm does not change if the gap penalty cost function is linear and increases to O (mn2 + nm2 ) for non-linear cost functions [9]. In this note we only need to consider constant gap penalties. 3. Optimal global alignment by maximization of Pearson correlation Given two signals X , Y of different lengths our objective is to find the alignment between X and Y that maximizes their Pearson correlation. To obtain the optimal global alignment with respect to the Pearson correlation, we use the Needleman–Wunsch algorithm with opportune scoring scheme and constant gap penalty. The constant gap assumption is in general not necessary for alignment algorithms but it is needed here to do not increase the complexity of the algorithm and to maintain its cost quadratic. In detail, we show that, under necessary conditions for the introduction of constant gaps, there exists an optimal (quadratic) polynomial-time algorithm maximizing
681
the Pearson correlation between two one-dimensional signals (Theorem 1 in Section 3.1) and the average Pearson correlation between two multi-dimensional signals (Theorem 2 in Section 3.1). We also show that the optimal alignment with respect to the average Pearson correlation can be computed in (quadratic) polynomial-time also for the problem of computing an alignment between multiplealignments of signals (Theorem 3 in Section 3.2), where a multiple-alignment of signals is an all-against-all alignment of an arbitrary number of (multi-dimensional) signals. Each theorem is the generalization of the previous one. Theorem 1 is a particular case of Theorem 2 (the case in which the dimension is one), which is, in turn, a particular case of Theorem 3 (the case in which the multiplealignments contain just a single signal). We present the different theorems in increasing level of complexity for the clearness of exposition. 3.1. Optimal global alignment between signals We first consider the problem of computing an optimal global alignment with respect to the Pearson correlation for one-dimensional signals. The approach can be easily generalized to the multi-dimensional case. In order to obtain an alignment between two onedimensional signals X and Y , we are allowed to insert gaps in both X and Y so that the gapped X and Y have the same length. Since we are dealing with real-valued vectors, a gap in vector X (resp. Y ) is represented by an opportune real value. Once the gaps have been introduced and the two vectors have the same length, we can compute the score of the alignment as the Pearson correlation defined in Eq. (1). To obtain a meaningful alignment with respect to the Pearson correlation we must assure that the real values chosen to represent gaps do not introduce artificial correlation. A necessary condition in order to avoid this problem is that the gaps introduced in a vector X do not change X ’s statistical distribution. In particular, from Eq. (1), a necessary condition is that the gaps inserted in X do not change the constant quantities X and
n
k=1 ( X k
− X )2 .
Note that this condition is very hard to satisfy if we allow the introduction of non-constant gaps and, very likely, non-constant gaps would lead to an exponential algorithm. Thus, under the assumption that the gaps inserted in X are constant regardless their positions and that they do not af fect the quantities X and
n
k=1 ( X k
− X )2 , it is easy to
obtain that the only possibility is that a gap value in X is equal to X (resp. Y for Y ). Note that this condition follows also if we only assume that the constant gaps introduced in X do not change the average X . Actually the quantity
n
k=1 ( X k
− X )2 is, in some sense, less important in the
determination of the correlation coefficient, since it acts only as a normalization factor. In the following theorem we prove that, under these conditions for the introduction of gaps, an optimal alignment maximizing Pearson correlation can be computed in polynomial-time. Theorem 1. Let X ∈ Rn , Y ∈ Rm be two non-constant vectors. Under the assumption that a gap value in X is equal to X and
682
P. Di Lena, L. Margara / Information Processing Letters 110 (2010) 679–686
a gap value in Y is equal to Y , the optimal global alignment that maximizes the Pearson correlation between X and Y is computed by the Needleman–Wunsch algorithm with scoring scheme
S ( i , j ) = n
Xi − X
k =1 ( X k
− X )2
· n
Yj −Y
k =1 ( Y k
(3)
− Y )2
and gap penalty G = 0. Proof. Let f be some given alignment between X and Y and consider the Pearson correlation coefficient of the alignment f . First of all, note that, since we are assuming that gap values in X are equal to X , we have that
( X − X )/
n
k=1 ( X k
− X )2 = 0. Then, by Eq. (1), a position
in Y aligned with a gap in X has no positive or negative contribute in the correlation coefficient (the same holds symmetrically for Y ). Thus, by Eq. (1), the Pearson correlation of an alignment f is equal to n i =1 f (i )=∅
Y f (i ) − Y
Xi − X
n k =1 ( X k
−
X )2
· n
.
(4)
2 k =1 ( Y k − Y )
Note that this quantity is exactly equal to the score of the alignment f (Eq. (2) in Section 2.2) n
S i , f (i ) + G · gap f
i =1 f (i )=∅
when the scoring scheme S is defined as in Eq. (3) and the gap penalty is G = 0. 2 Example 1. Consider the two signals
X = {3, 5, 3, 13, 3, 5, 3, 15, 3, 8, 3, 4, 4, 6, 3, 17, 4}, Y = {3, 2, 3, 2, 5, 11, 2, 3, 4, 15, 4, 2, 10, 2, 6, 2, 4, 4,
The alignment that maximizes the Pearson correlation between X and Y is:
_ 3
3 2
X Y
8 3 10 2
a signal is represented by two measures, mass and charge. We can represent a multi-dimensional signal with a matrix X ∈ Rl×n . We denote with X t ∈ Rn , for 1 t l, the t-th signal in the matrix and with X ti ∈ R the i-th element of the t-th signal. We say that a multi-dimensional signal X ∈ Rl×n is non-constant if X t is not constant for every 1 t l. Consider the problem of finding the optimal alignment between the two multi-dimensional signals X ∈ Rl×n and Y ∈ Rl×m . Obviously, an alignment between these two signals makes sense only if the signal (feature) X t is of the same “type” of signal Y t for 1 t l. Moreover, note that the introduction of a gap in the i-th time/space position of X implies the introduction of a gap in the i-th position of each X 1 , . . . , Xl . We can easily modify the algorithm described in Theorem 1 in order to obtain an alignment between multiple vectors by maximization of the average correlation of each pair X t , Y t , 1 t l. Theorem 2. Let X ∈ Rl×n , Y ∈ Rl×m be two non-constant multi-dimensional signals. Under the assumption that a gap value in X t (resp. Y t ) is equal to X t (resp. Y t ), 1 t l, the optimal global alignment f maximizing the average Pearson correlation n l 1
15, 2}.
X Y
Fig. 1. Graphical representation of the optimal global alignment maximizing Pearson correlation of X and Y . Only the aligned positions are drawn in the picture.
5 3 _ 13 3 3 2 5 11 2 _ 4 4 6 2 4
6 _
5 _
3 _ 15 _ 3 3 4 15 4 2
3 17 4 4 15 2
where a gap (denoted with “_”) in X is equal to X = 6 and a gap in Y to Y = 5.15. The score/correlation of this alignment is equal to 0.956916, which denotes a high degree of similarity between the matched subsequences. The graphical representation of the alignment is drawn in Fig. 1, where we can notice that the main peaks of X and Y are matched correctly and that the overall similarity of the aligned subsequences is very good. In most practical cases, not a unique signal but a finite set of signals (all of the same length) are representative of a single object. This happens, for example, in LC-MS, where
l
t =1
i =1 f (i )=∅
Y t f (i ) − Y t
X ti − X t
n k=1 ( X tk
− X t )2
· n
k=1 ( Y tk
− Y t )2 (5)
is computed by the Needleman–Wunsch algorithm with scoring scheme
1 l
S (i , j ) =
l
t =1
X ti − X t
n k=1 ( X tk
− Xt
)2
· n
Yt j − Yt
k=1 ( Y tk
− Y t )2
and gap penalty G = 0. Proof. Let f be some given alignment between X and Y . Due to the gap value hypothesis, a position in X (resp. Y ) aligned with a gap in Y (resp. X ) has no contribute in the average Pearson correlation (5). Rewrite Eq. (5) as n l 1 i =1 f (i )=∅
l
t =1
X ti − X t
n k=1 ( X tk
− X t )2
Y t f (i ) − Y t
· n
k=1 ( Y tk
− Y t )2
.
P. Di Lena, L. Margara / Information Processing Letters 110 (2010) 679–686
Note that this quantity is exactly equal to the score of the alignment f , as defined in (2), when the scoring scheme S is defined as in the statement of the theorem and the gap penalty is G = 0. Thus the Pearson correlation (5) can be maximized by running the Needleman–Wunsch algorithm with such scoring scheme S and gap penalty G = 0. 2 Note that, when the dimension l of the signals is equal to 1, the alignment computed by the algorithm described in Theorem 2 is equal to the alignment computed by the algorithm described in Theorem 1. From Theorem 2, it follows immediately that the cost of an optimal alignment between multi-dimensional signals is exactly O (lmn). Since l is always constant (and, in particular, it is a small number for most practical cases) the cost of an optimal alignment is bounded by O (mn).
683
tween two given multiple-alignments of signals. The same problem has been investigated for multiple-alignments of sequences in [15–17]. As usual, we are allowed to introduce gaps in the two multiple-alignments. A gap in the i-th position of the multiple-alignment X 1 , . . . , X p implies the introduction of a gap in the i-th position of all X u , 1 u p. As defined in Section 3.1, a gap in the i-th position of the multi-dimensional signal X u ∈ Rl×n means the introduction of the value X tu in the i-th position of X tu , for each 1 t l. Theorem 3. Let X 1 , . . . , X p ∈ Rl×n and Y 1 , . . . , Y q ∈ Rl×m be two multiple-alignments of multi-dimensional non-constant signals. The score of an alignment f between the two multiplealignments X 1 , . . . , X p and Y 1 , . . . , Y q is defined by
q p l 1 1
3.2. Optimal global alignment between multiple-alignments
pq
Due to time jitter perturbations and unavoidable imperfections of the electronic devices used to perform experimental measurements, signal samples obtained by several replicates of the same experiment are usually not perfectly aligned in the time/space domain and contain noisy information. When a cluster of replicate samples of the same experiment or, more generally, when a cluster of similar signals are available, a multiple-alignment between all signals in the cluster is more representative of the class than every single member of it. If the multiple-alignment is computed in a meaningful way, it highlights the most important characteristics that are common to all the signals in the cluster. The counterpart of this approach is that the problem of computing the optimal multiple-alignment of sequences (strings) is intractable [10–14] and, even if we do not have a formal proof, it is reasonable to expect that the intractability holds also for multiple-alignment of signals. Anyway, even if the problem is not tractable, there are several interesting techniques developed in Computational Biology for near-optimal multiple sequence alignment [7,8] that can be used also to obtain near-optimal multiple signal alignments. In this section we formalize the problem of computing the optimal alignment (with respect to the Pearson correlation) between multiple signal alignments by providing an exact polynomial-time algorithm. We also discuss the problem of computing near-optimal multiple signal alignments. A multiple-alignment of signals is an all-against-all alignment of signals. A multiple-alignment can be represented by a multi-dimensional vector but, in order to have a more manageable notation, we denote a multiple-alignment as a set of (one-dimensional or multi-dimensional) vectors/signals X 1 , . . . , X p ∈ Rl×n (the dimension of the signals is l 1). The ordering of the signals in the multiple-alignment is not important. Note that all the signals in the multiple-alignment must have the same length n; to obtain a list of signals all of the same length we are allowed to insert gaps in each single signal but we are not allowed to have a column of only gaps in their multiplealignment. We first consider the problem of computing the alignment that maximizes the average Pearson correlation be-
where
l
u =1 v =1
r Xtu Y tv =
t =1
n
=
r Xtu Y tv
lpq
t =1 v =1 u =1
r Xtu Y tv
(6)
Y tvf (i ) − Y tv
X tiu − X tu
n u k=1 ( X tk
i =1 f (i )=∅
q p l 1
− X tu )2
· n
v k=1 ( Y tk
.
− Y tv )2
Let X ∈ Rl×n , Y ∈ Rl×m be defined by
X ti =
p
n
X tiu − X tu
− X tu )2
u k=1 ( X tk
u =1
and
Y ti =
q
Y tiv − Y tv
n v k=1 ( Y tk
v =1
,
− Y tv )2
respectively. Then the optimal global alignment that maximizes the quantity (6) is computed by the Needleman–Wunsch algorithm with scoring scheme
S (i , j ) =
l 1
lpq
X ti Y ti
(7)
t =1
and gap penalty G = 0. Proof. Let f be an alignment between X 1 , . . . , X p and Y 1 , . . . , Y q . Consider the average Pearson correlation of the alignment f , as defined in Eq. (6). As in Theorem 2, the matching of a position with a gap has no contribute in (6). By using the definitions of X ti and Y ti in the hypothesis, we can rewrite (6) as n
q l 1
i =1 f (i )=∅
=
lpq n
i =1 f (i )=∅
=
n i =1 f (i )=∅
X ti ·
t =1 v =1 l 1
lpq
t =1
l 1
lpq
t =1
X ti
Y tvf (i ) − Y tv
n
q v =1
v k=1 ( Y tk
Y tvf (i ) − Y tv
n
X ti Y t f (i )
− Y tv )2
v k=1 ( Y tk
− Y tv )2
684
P. Di Lena, L. Margara / Information Processing Letters 110 (2010) 679–686
and this quantity is exactly equal to the score of the alignment, as defined in (2), when the scoring scheme is defined as in Eq. (7) and G = 0. 2 Note that, by Theorem 3, to represent a multiple-alignment X 1 , . . . , X p ∈ Rl×n we just need a multi-dimensional vector X ∈ Rl×n , independently from the number p of signals. This compact representation of a multiple-alignment is called profile for multiple-alignments of protein sequences [7]. Thus, the cost of an optimal alignment between multiple-alignments is exactly O (lnm), where the constant l represent the dimension of the signals. Then the cost of an optimal alignment between multiple-alignments does not depend on the number of signals in the two multiple-alignments. Consider now the problem of computing the optimal multiple-alignment for signals X 1 ∈ Rl×n1 , . . . , X p ∈ Rl×n p where the objective function is the maximization of the average Pearson correlation of all pairs ( X i , X j ), 1 i < j p. The Needleman–Wunsch algorithm can be extended to any number of signals/sequences, but the resulting procedure is exponential and it cannot used for the simultaneous optimal alignment of more than a few sequences. One of the most successful approaches for computing nearoptimal multiple-alignment of bio-sequences is to build up the multiple-alignment by performing pairwise alignments between single sequences and/or multiple-alignments, according to the ordering described by a binary guide tree. This approach was first introduced in [18] and it is general enough to be used also for multiple-alignment of signals. In [18] the guide tree is built from a similarity matrix by using the UPGMA (Unweighted Pair Group Method using arithmetic Averages) algorithm [19]. The UPGMA algorithm builds a hierarchical cluster from a set of sequences according to their degree of similarity, as described by a distance matrix M (where the distance is not necessarily a metric). In the following example we show an application of this method to compute a multiple-alignment of signals. Before to show the example, we need to show what is the range of possible values for the alignment scores, which will be used to define the distance matrix M. Recall that, the Pearson correlation coefficient ranges into the interval [−1, 1]. We show that the Pearson correlation of an optimal alignment between signals or, more generally, between multiple-alignments of signals ranges into the interval [0, 1]. l×n
l×m
and Y , . . . , Y ∈ R Corollary 1. Let X , . . . , X ∈ R be two multiple-alignments of multi-dimensional non-constant signals. The score of their optimal global alignment with respect to the Pearson correlation is a real number into the interval [0, 1]. 1
p
1
q
Proof. Let X ∈ Rl×n and Y ∈ Rl×m be the representations of the two multiple-alignments as defined in Theorem 6. Let f be the optimal global alignment between X and Y . To reach the proof it is sufficient to show that for every 1 i n such that f (i ) = ∅ we have that S (i , f (i )) 0. Assume for the moment that this is true. Then, since the S (i , f (i )) for all i ∈ [1, n] such score of the alignment is
Fig. 2. Hierarchical clustering of the set of signals { X , X1, X2, X3, X4} computed by the UPGMA algorithms. The dissimilarity score between two signals has been computed as 1 minus the score of their optimal pairwise alignment with respect to the Pearson correlation. The bar on the left indicates the distance between signals/clusters at each node of the tree.
that f (i ) = ∅, this quantity must be greater than or equal to 0. Now, assume for absurd that for some i ∈ [1, n] such that f (i ) = ∅ it happens that S (i , f (i )) < 0. Then we can increase the score of the alignment by matching the i-th column of X with a gap in Y and the f (i )-th column of Y with a gap in X . This contradicts the assumption that f is an optimal alignment. Then it must be S (i , f (i )) 0. 2 Example 2. Consider the following set of signals
X = {3, 5, 3, 13, 3, 5, 3, 15, 3, 8, 3, 4, 4, 6, 3, 17, 4}, X1 = {3, 4, 1, 13, 4, 4, 1, 4, 13, 3, 7, 3, 4, 3, 7, 3, 15, 2}, X2 = {3, 2, 5, 3, 13, 4, 1, 6, 5, 13, 2, 7, 4, 5, 6, 8, 2, 15, 2}, X3 = {3, 2, 3, 2, 5, 11, 2, 3, 4, 15, 4, 2, 10, 2, 6, 2, 4, 4, 15, 2}, X4 = {3, 3, 4, 1, 12, 5, 6, 3, 2, 17, 3, 2, 7, 3, 4, 3, 6, 3, 2, 17, 4}, obtained by perturbing the signal X of Example 1. In detail, every X i has been obtained by perturbing the entries of X with a random integer taken from the interval [−2, 2]. Moreover every X i contains exactly i more entries than X , randomly taken from {1, 2}. The signal X3 is equal to the signal Y of Example 1. The correlation coefficients of the optimal pairwise alignments between these signals are shown in the following dissimilarity matrix:
X1 X2 X3 X4
X X1 X2 X3 0.970970 0.944070 0.965705 0.956916 0.935653 0.937163 0.973658 0.948751 0.933223 0.952833
Since, by Corollary 1, these coefficients range into [0, 1] we can define the distance between two elements X i, X j simply as 1 − r X i X j . The binary guide tree built by the UPGMA algorithm from this set of distances is represented in Fig. 2. Following the guide tree of Fig. 2, the multiple-alignment of the set of signals { X , X1, X2, X3, X4} is built by first computing the pairwise alignment between X and X4, say X , and next between X1 and X2, say X ; then the alignment is computed between the multiple-alignment X and
P. Di Lena, L. Margara / Information Processing Letters 110 (2010) 679–686
Fig. 3. Graphical representation of the optimal multiple-alignment maximizing the average Pearson correlation for the set of signals { X , X1, X2, X3, X4}. Only the matched subsequences are drawn in the picture (i.e. only points that corresponds to columns in the multiplealignment that do not contain gaps).
the single sequence X3, say X , and the final multiplealignment is obtained by aligning X with X . The final result is:
X X4 X3 X2 X1
3 3 3 3 3
5 3 2 2 4
_ 4 3 5 _
3 1 2 3 1
_ _ 5 _ _
13 12 11 13 13
_ 5 _ _ _
_ 6 _ _ _
3 3 2 4 4
5 _ _ _ 4
3 2 3 1 1
_ _ _ 6 _
_ _ 4 5 4
15 17 15 13 13
X X4 X3 X2 X1
_ 3 4 _ _
3 8 3 _ 2 7 3 _ 2 10 2 6 2 7 4 _ 3 7 3 _
_ 4 _ _ 4
4 3 2 5 3
_ _ _ 6 _
_ 6 _ 8 7
4 3 4 _ _
6 _ _ _ _
3 2 4 2 3
17 17 15 15 15
4 4 2 2 2
The total score of the multiple-alignment is 0.947079, while the upper bound, i.e. the average of the maximum pairwise scores, is 0.951894. The optimal multiplealignment has a score between these two close values. In fact, the overall agreement of the common subsequences in the multiple-alignment is very good, as shown in Fig. 3. 4. Concluding remarks In this paper we showed a set of simple polynomialtime algorithms for the comparison of noisy signals. Our approach is general enough to be used in several different contexts and, in particular, it can be applied also in those situations in which no a-priori knowledge is available about the waveforms of the signals to be compared. Our approach for signal comparison relies on the Pearson correlation coefficient and the Needleman–Wunsch algorithm. We showed that, by a simple and natural application of the Needleman–Wunsch algorithm, it is possible to compute in (quadratic) polynomial-time the optimal global alignment that maximizes the Pearson correlation of two signals. The optimal global alignment by maximization of the Pearson correlation identifies the subsequences of two signals that share the highest pairwise degree of linear dependence with respect to the statistical distributions of the two signals. The similarity of the two subsequences is higher as higher is the correlation/score of the alignment. This approach can be used to align one-dimensional signals, multi-dimensional signals and multiple-alignments
685
of (one-dimensional or multi-dimensional) signals. When a cluster of similar signals is available, a multiple-alignment of all-against-all signals in the cluster is a quite convenient representation of the whole cluster since it highlights the similarities that are common to all the signals in that class. We showed that an optimal global alignment (with respect to the Pearson correlation) between two multiple-alignments can be computed in (quadratic) polynomial-time, regardless the number of signals in the two multiple-alignments. We conclude by briefly discussing our choice of considering optimal global alignments instead of optimal local alignments. The main difference between a global alignment and a local alignment is that the former aligns two sequences in order to maximize their overall similarity while the latter aligns two sequences in order to detect the two segments that exhibit the highest similarity (allowing the introduction of gaps in the two segments). The local alignment is useful when searching for similar local regions between two sequences that contain several different characteristic sub-patterns. The Smith–Waterman algorithm [20], is a simple modification of the Needleman–Wunsch algorithm for optimal local alignments. When used with the same scoring scheme and the same gap penalty function, the Smith–Waterman and Needleman–Wunsch algorithms can eventually compute different alignments between two sequences only if the gap penalty function can assume negative values. Thus, due to the necessary conditions we impose for the introduction of gaps, the algorithms we developed in Section 3 are, by definition, optimal with respect to both local and global alignments. References [1] S.B. Wilson, R. Emerson, Spike detection: a review and comparison of algorithms, Clinical Neurophysiology 113 (12) (2002) 1873–1881. [2] R. Jan, H. Rix, P. Caminal, P. Laguna, Alignment methods for averaging of high-resolution cardiac signals: a comparative study of performance, IEEE Trans. Biomed. Eng. 38 (6) (1991) 571–579. [3] E. Lange, R. Tautenhahn, S. Neumann, C. Gröpl, Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements, BMC Bioinformatics 9 (2008) 375. [4] D.W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Press, 2001. [5] S.B. Needleman, C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology 48 (3) (1970) 443–453. [6] J.L. Rodgers, W.A. Nicewander, Thirteen ways to look at the correlation coefficient, The American Statistician 42 (1) (1988) 59–66. [7] D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge Univ. Press, 1996. [8] C. Notredame, Recent progress in multiple sequence alignment: A survey, Pharmacogenomics 3 (1) (2002) 131–144. [9] M.S. Waterman, Efficient sequence alignment algorithms, J. Theor. Biol. 108 (3) (1984) 333–337. [10] L. Wang, T. Jiang, On the complexity of multiple sequence alignment, J. Comput. Biol. 1 (4) (1994) 337–348. [11] P. Bonizzoni, G. Della Vedova, The complexity of multiple sequence alignment with SP-score that is a metric, Theoretical Computer Science 259 (1) (2001) 63–79. [12] W. Just, Computational complexity of multiple sequence alignment with SP-score, J. Comput. Biol. 8 (6) (2001) 615–623. [13] W. Just, G. Della Vedova, Multiple sequence alignment as a facility location problem, INFORMS Journal on Computing 16 (4) (2004) 430–440. [14] I. Elias, Settling the intractability of multiple alignment, J. Comput. Biol. 13 (7) (2006) 1323–1339.
686
P. Di Lena, L. Margara / Information Processing Letters 110 (2010) 679–686
[15] J. Kececioglu, W. Zhang, Aligning Alignments, Lecture Notes in Computer Science, vol. 1448, 1998, pp. 189–208. [16] B. Ma, Z. Wang, K. Zhang, Alignment Between Two Multiple Alignments, Lecture Notes in Computer Science, vol. 2676, 2003, pp. 254– 265. [17] J. Kececioglu, D. Starrett, Aligning alignments exactly, in: Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, vol. 8, 2004, pp. 85–96.
[18] D.F. Feng, R.F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. 24 (4) (1987) 351–360. [19] R.R. Sokal, P.H.A. Sneath, Principles of Numerical Taxonomy, Freeman, San Francisco, 1973. [20] T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology 147 (1981) 195–197.