Expert Systems with Applications 41 (2014) 2842–2850
Asynchronism-based principal component analysis for time series data mining

Hailin Li, College of Business Administration, Huaqiao University, Quanzhou 362021, China
Keywords: Asynchronous correlation; Covariance matrix; Principal component analysis; Time series data mining; Dynamic time warping
Abstract

Principal component analysis (PCA) is often applied to dimensionality reduction in time series data mining. However, PCA is based on the synchronous covariance, which is not effective in some cases. In this paper, an asynchronism-based principal component analysis (APCA) is proposed to reduce the dimensionality of univariate time series. In APCA, an asynchronous method based on dynamic time warping (DTW) is developed to obtain interpolated time series derived from the original ones. The correlation coefficient or covariance between the interpolated time series represents the correlation between the original ones. In this way, a novel and valid principal component analysis based on the asynchronous covariance is achieved to reduce the dimensionality. The results of several experiments demonstrate that the proposed approach APCA outperforms PCA for dimensionality reduction in the field of time series data mining. © 2013 Elsevier Ltd. All rights reserved.
1. Introduction

Time series are among the most important research objects in the field of data mining, and the techniques applied to such data are called time series data mining (TSDM) (Esling & Agon, 2012). However, since their high dimensionality often renders standard data mining techniques inefficient, methods have been devised to reduce the dimensionality. The existing methods fall into two kinds of dimensionality reduction. One kind is based on univariate time series, such as the discrete Fourier transformation (DFT) (Agrawal, Faloutsos, & Swami, 1993), discrete wavelet transformation (DWT) (Maharaj & Urso, 2011; Struzik & Siebes, 1998, 1999), polynomial representation (PR) (Fuchs, Gruber, Pree, & Sick, 2010), piecewise linear approximation (PLA) (Keogh, Chu, Hart, & Pazzani, 2001; Papadakis & Kaburlasos, 2010; Shatkay & Zdonik, 1996), piecewise aggregate approximation (PAA) (Keogh, Chakrabarti, Pazzani, & Mehrotra, 2000; Li & Guo, 2011) and symbolic aggregate approximation (SAX) (Lee, Wu, & Lee, 2009; Lin, Keogh, Lonardi, & Chiu, 2003). These methods reduce dimensionality from the viewpoint of a single univariate time series. In other words, they concentrate on the transformation of a single time series so that the dimension of the reduced representation is lower than that of the original one. The other kind is based on the time series dataset, such as singular value decomposition (SVD) (Spiegel, Gaebler, & Lommatzsch,
2011), principal component analysis (PCA) (Singhal & Seborg, 2002) and independent component analysis (ICA) (Cichocki & Amari, 2002). SVD and PCA are often seen as the same method, retaining the first several principal components to represent the whole dataset, whereas ICA is a development of principal component analysis and factor analysis. In the field of time series data mining, these methods are often combined with corresponding measurements to discover information and knowledge from time series datasets. Krzanowski (1979) used PCA to construct the principal components and chose the first k principal components to represent a multivariate time series; the similarity between two time series is then calculated using the cosine of the angle between the corresponding principal components. Singhal and Seborg (2005) proposed a new approach, Sdist, to compute the similarity based on PCA, which performs better than the earlier methods. Karamitopoulos and Evangelidis (2010) used PCA to construct the feature space of the queried time series and projected every time series onto that space; they computed the error between two reconstructed time series as the distance between the query time series and the queried one. SVD is often based on PCA, which uses the Karhunen-Loève (KL) decomposition to reduce the dimensionality of time series. Li, Khan, and Prabhakaran (2006) proposed two methods to choose the feature vectors and used them to classify time series. Weng and Shen (2008) extended the traditional SVD to a two-dimensional SVD (2dSVD) that extracts the principal components from the column-column and row-row directions to compute the covariance matrix. Since feature extraction is one of the most important tasks for ICA, it has been applied to the analysis of time series. Wu and Yu (2005) used FastICA (Hyvärinen, 1999) to
obtain independent principal components for multivariate time series and cluster them by combining with the corresponding distance measurement. Baragona and Battaglia (2007) used ICA to detect anomalies by extracting the unusual components.

PCA is the basic theory and is widely used to reduce the dimensionality of time series (Karamitopoulos & Evangelidis, 2010; Bankó & Abonyi, 2012). It uses the variance to measure how much information is retained, and the covariance is computed to measure the correlation between two different time series. However, the traditional PCA uses a linear and synchronous method to compute the covariance between two time series, which is not effective when the two series are similar or correlated at different points in time. In other words, the same shape trends appearing in two time series at different points in time will be regarded as uncorrelated or negatively correlated, so in some cases PCA works ineffectively. Moreover, the lengths of the time series must be equal when they are analyzed by PCA. Meanwhile, PCA is often used to mine knowledge from multivariate time series datasets rather than univariate time series datasets.

The research motivations of this work are to overcome the above problems. Firstly, the dimensionality of time series with different lengths should be reducible by the principle of principal components, which means the proposed method can process time series with different lengths, whereas the existing work including SVD, PCA and ICA only processes series of equal length. Secondly, the existing methods only consider the synchronous relationship between two variables or two time series and neglect the asynchronous relationships; the proposed method must therefore consider the asynchronous relationships. Thirdly, the important information about time series should be captured better by the proposed method than by the existing work, because some points of a time series reflect the key shape trends and provide much more important information than the others.

Following these motivations, this work includes the measurement of the asynchronous correlation coefficient, the design of asynchronism-based PCA and the representation of univariate time series for dimensionality reduction. The asynchronous correlation derives from the correlation coefficient between a pair of interpolated time series that are formed by the elements of the best warping path, which can be found by dynamic time warping (DTW) (Yu, Yu, & Hu, 2011). The interpolated time series improve the effectiveness of the correlation coefficient (Pearson's product-moment correlation coefficient) (Rodgers & Nicewander, 1988), which measures the similarity (or correlation) between time series whose same shape trends appear at different points in time. The asynchronism-based PCA uses the asynchronous correlation to measure the whole time series dataset and obtains the first several principal components that retain as much important information about the dataset as possible. In particular, the tuple of the first several principal components is regarded as the corresponding representation, so that every time series can be represented by a short tuple for dimensionality reduction.
In comparison with the traditional PCA, the proposed method (asynchronism-based PCA, APCA) not only measures the synchronous correlation as PCA does, but also obtains the asynchronous correlation. It is a good approach to measuring the similarity between two time series whose similar shape trends appear at different points in time.

The remainder of the paper is organized as follows. In Section 2, we provide some necessary background material and discuss related work. In Section 3, we present the proposed method. The experimental evaluation of the new method is described in Section 4. Finally, we discuss our results further and conclude in Section 5.
2. Background and related work

PCA is a common method used to reduce the dimensionality of time series, and DTW is one of the most robust ways to measure the similarity between time series. Given the motivations of this work, these two methods are introduced first in this section.
2.1. Principal component analysis

PCA is a well-known statistical approach that is often used to reduce the dimensionality of a dataset (Jolliffe, 2004; Wang, 2012). The dataset can be represented as a data matrix $X_{n\times m}$, where n denotes the number of objects with m properties (or variables). Accordingly, $X_{n\times m}$ denotes a time series dataset that has m time series of length n. Each column represents a time series, and each row represents a group of values observed at a particular time.

PCA is an orthogonal linear transformation that transforms a dataset into a new coordinate system. The greatest variance of the data points under any projection lies on the first coordinate, which we call the first principal component, the second greatest variance on the second coordinate (the second principal component), and so on. In this way, PCA transforms the data matrix X of size $n\times m$ into a reduced matrix Y of size $n\times k$ for dimensionality reduction, where k < m. Formally, Y is a reduced dataset with k variables (or principal components) that are orthogonal to each other, and $Y_{n\times k}=X_{n\times m}V_{m\times k}$, where $V_{m\times k}$ is composed of the first k principal components and X has zero empirical mean. According to SVD, $\Sigma = V\Lambda V^{-1}$, where $\Sigma$ is the covariance matrix of X, that is, $\Sigma = X^{T}X$ because of the zero empirical mean. Meanwhile, $\Lambda$ is an $m\times m$ diagonal matrix whose nonnegative real diagonal entries are the eigenvalues of $\Sigma$ in descending order, and V is the corresponding eigenvector matrix of $\Sigma$. Moreover, according to the energy content of each eigenvector, the first k principal components are chosen as the first k eigenvectors whose cumulative energy content is not less than a threshold $\varepsilon$.

The algorithm of PCA can be described as follows.

Step 1: Organize the dataset (or data matrix) $X_{n\times m}$. Each row denotes an observation with m variables and each column denotes a variable.

Step 2: Calculate the empirical mean and transform the data matrix into a new one with zero mean, which replaces the original dataset. That is, $X = X - SU$, where $U(1,i)=\frac{1}{n}\sum_{j=1}^{n}X(j,i)$ for i = 1, 2, ..., m, and S is an $n\times 1$ column vector of ones.

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix $\Sigma = X^{T}X$. According to SVD, $\Sigma = V\Lambda V^{-1}$, where $\Lambda$ is the diagonal matrix of eigenvalues of $\Sigma$ in descending order. Meanwhile, sort the eigenvector matrix V according to the order of decreasing eigenvalues, that is, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$.

Step 4: Choose a subset of the eigenvectors as basis vectors according to the cumulative energy content. If the cumulative energy content $g_k=\sum_{i=1}^{k}\lambda_i$ is above a certain value, then the first k eigenvectors are selected as the basis vectors. In other words, if the contribution rate $\eta = g_k/g_m$ of cumulative energy is larger than a threshold $\varepsilon$, then the first k eigenvectors can be seen as the best principal components.

Step 5: Project the original dataset onto the new system. The new dataset Y with low dimension can be formed by $Y_{n\times k}=X_{n\times m}V_{m\times k}$. Since k is often less than m, PCA achieves the dimensionality reduction.
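To make the procedure concrete, the following Python sketch implements the five steps above with NumPy. It is a minimal illustration of standard PCA rather than the paper's own code, and the default threshold value is an assumed example.

```python
import numpy as np

def pca(X, eta_threshold=0.9):
    """Minimal PCA following Steps 1-5: X is an n-by-m data matrix
    (each column a time series); returns the reduced data Y and the
    retained eigenvector matrix V_k."""
    # Step 2: remove the empirical mean of every column (variable).
    X0 = X - X.mean(axis=0)
    # Step 3: covariance matrix and its eigendecomposition.
    sigma = X0.T @ X0
    eigvals, eigvecs = np.linalg.eigh(sigma)   # returned in ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort into descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: keep the first k components whose contribution rate
    # eta = g_k / g_m exceeds the threshold.
    eta = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(eta, eta_threshold)) + 1
    # Step 5: project the zero-mean data onto the retained components.
    V_k = eigvecs[:, :k]
    return X0 @ V_k, V_k
```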
In the field of time series data mining, PCA is often used to reduce the dimensionality of multivariate time series, and PCA-based measurements are also applied to this field. One of the earliest measurements (Krzanowski, 1979) retains the first k principal components and compares the angles between all combinations of the selected principal components. To better measure the similarity between two time series, another method (Johannesmeyer, 1999) modified the previous one by weighting the angles with the corresponding variances. Singhal and Seborg (2005) extended the previous methods by incorporating an extra term so as to resolve the case in which two time series datasets have similar principal components but different variable values. A similarity measurement, Eros, based on the acute angles between the corresponding components instead of all the components of the previous methods, was proposed by Yang and Shahabi (2004). Karamitopoulos, Evangelidis, and Dervos (2008) proposed a distance measure for time series similarity search that does not require the query object to be PCA-represented. However, these methods use PCA to reduce the dimensionality and measure the distance between two time series as the rigid Euclidean distance does. Moreover, the covariance between two time series only considers synchronous analysis and neglects the case of asynchronous correlation. Therefore, an improved version of PCA needs to be studied that considers the covariance based on asynchronism analysis.

2.2. Dynamic time warping

Dynamic time warping (DTW) is a robust way to measure the similarity between two time series (Keogh & Pazzani, 2001; Li, Guo, & Qiu, 2011). Unlike the Euclidean distance, it does not require the time series to have equal length: DTW can measure the similarity between two time series whose lengths are different. Meanwhile, DTW is able to obtain the best warping path for the minimum distance between two time series. Recently, it has been widely applied in the field of time series data mining.

Suppose there are two time series $Q = \{q_1, q_2, \ldots, q_M\}$ and $C = \{c_1, c_2, \ldots, c_N\}$. The minimum distance between Q and C is denoted as v = DTW(Q, C), and the corresponding best warping path can be written as $P = \{p_1, p_2, \ldots, p_K\}$, where $p_k = [q_i, c_j]$ represents the mapping between two data points $q_i$ and $c_j$ deriving from Q and C. As shown in Fig. 1, through executing the procedure of DTW, the best warping path P in grey is obtained, and a mapping such as $[q_{35}, c_{39}]$ makes up one element of the best warping path.

To obtain the best warping path P and the minimum distance v, DTW must construct a cost matrix R such that the best warping path can be found in the matrix. There are three constraints on the best path P, namely boundary, continuity and monotonicity (Keogh et al., 2001), which restrict the element $p_{k+1}$ to the three positions adjacent to the top-left corner of $p_k$ in the matrix R. Moreover, $d(p_k)=d(q_i,c_j)=(q_i-c_j)^2$ denotes the distance between the two data points $q_i$ and $c_j$. Finally, there exists a best warping path P such that
$\mathrm{DTW}(Q, C) = \min_{P} \sum_{k} d(p_k)$  (1)
To solve the above problem, we use dynamic programming to construct the cost matrix R, i.e.,
$R(i,j) = d(q_i, c_j) + \min(R(i, j-1),\; R(i-1, j-1),\; R(i-1, j))$  (2)
where i = 1, 2, ..., M, j = 1, 2, ..., N, $R(0,0)=0$ and $R(i,0)=R(0,j)=+\infty$ for i, j ≥ 1. In this way, the minimum distance between the two time series Q and C is v = DTW(Q, C) = R(M, N). At the same time, the best warping path P is composed of the elements (or mappings) that contribute to the minimum distance.

Since DTW can retrieve the best warping path used to map the points with the same shape and return a minimum distance, it is widely applied in time series data mining (Muscillo, Conforto, & Schmid, 2007; Wong & Wong, 2003), speech recognition (Itakura, 1975; Sakoe & Chiba, 1978) and other disciplines (Chu, Keogh, Hart, & Pazzani, 2002), such as medicine, meteorology, gesture recognition and finance. However, the quadratic time and space complexity O(NM) of DTW constrains its performance. There are some methods (Keogh et al., 2000; Zhou & Wong, 2005) used to speed up the calculation of DTW. Two of the most commonly used are the Sakoe-Chiba band (Sakoe & Chiba, 1978) and the Itakura parallelogram (Itakura, 1975). FTW (Sakurai, Yoshikawa, & Faloutsos, 2005) and FastDTW (Salvador & Chan, 2007) have also been proposed to reduce the computational cost.
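As a concrete illustration, the following Python sketch builds the cost matrix of Eq. (2) and backtracks to recover the best warping path. It is a plain O(MN) implementation for exposition, not the paper's own code.

```python
import numpy as np

def dtw(Q, C):
    """Return the DTW distance and the best warping path between
    sequences Q and C, following Eq. (2) with R(0,0) = 0 and an
    infinite first row and column."""
    M, N = len(Q), len(C)
    R = np.full((M + 1, N + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            d = (Q[i - 1] - C[j - 1]) ** 2
            R[i, j] = d + min(R[i, j - 1], R[i - 1, j - 1], R[i - 1, j])
    # Backtrack from (M, N) to recover the warping path.
    path, i, j = [], M, N
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))            # 0-based indices into Q and C
        step = np.argmin((R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return R[M, N], path
```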
3. Asynchronism-based PCA

When PCA is used to reduce the dimensionality of a time series dataset, the lengths of the time series in the dataset must be equal, so that the covariance between two variables (or two univariate time series) can be computed in the algorithm of PCA. Therefore, the equal-length requirement is one of the constraints of PCA. The covariance between two variables X and Y of length n can be calculated by

$\mathrm{Cov}(X, Y) = E[(X-\bar{X})(Y-\bar{Y})]$,  (3)

where $\bar{X}$ and $\bar{Y}$ are the means of X and Y respectively. For data matrices X' and Y' with zero empirical mean, the covariance is transformed into

$\mathrm{Cov}(X', Y') = E(X'Y')$.  (4)
Eq. (4) indicates that the product of each pair of data points deriving from the two data matrices X' and Y' contributes to the covariance. Therefore, the covariance depends on pairs of data points that occur at the same point in time, and so it only reflects the correlation between the synchronous points that come from the two variables at the same point in time. As shown in Fig. 2, the covariance in PCA only reflects the synchronous relationship.

Fig. 1. The best warping path between two time series is produced by DTW.

Fig. 2. The mappings between two variables (or time series) Q and C reflect the synchronous relationship.

However, in most cases one variable in a system often affects another variable with a lag. The synchronous correlation only reflects the relationship between two variables at the same points in time and neglects the asynchronous correlation that arises at different points in time. As shown in Fig. 1, the same shape trends should be mapped to each other so that the asynchronous result can better reflect the correlation between two time series.

To make PCA reflect the asynchronous correlation between two time series, we propose an asynchronism-based principal component analysis (APCA). Through analyzing the importance of asynchronism in the covariance computation, we design an asynchronous covariance calculation to replace the traditional covariance between two time series. APCA uses this approach to obtain the asynchronous covariance of every pair of time series in the dataset. In the design of the asynchronous covariance, DTW retrieves the best warping path between two time series and further produces two interpolated time series that are composed of the elements of the best warping path. The asynchronous covariance between the original time series is then given by the synchronous covariance between the interpolated ones.

Formally, let $X = \{X_1, X_2, \ldots, X_m\}$ be a time series data matrix whose dimensionality is to be reduced by APCA, where $X_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T$ represents a univariate time series and X has m univariate time series. To obtain interpolated time series that better reflect the correlation between the original time series $X_i$ and $X_j$, we first normalize each time series. Every pair of normalized time series ($X'_i$ and $X'_j$) is used to compute the best warping path $P_i^j=\{p_1, p_2, \ldots, p_K\}$, where $p_k=(x'_{il}, x'_{jh})$ is an element of the best warping path. The interpolated time series $X''_i$ and $X''_j$ are formed by all elements of the best warping path, that is, $X''_i=\{p_1(1), p_2(1), \ldots, p_K(1)\}$ and $X''_j=\{p_1(2), p_2(2), \ldots, p_K(2)\}$, where $p_k(1)=x'_{il}$ and $p_k(2)=x'_{jh}$. Finally, the asynchronous covariance between the two time series is computed by
$\mathrm{ACov}(X_i, X_j) = E(X''_i X''_j)$  (5)
As shown in Fig. 3, the interpolated time series can be obtained from the best warping path. The mappings between the two interpolated time series reflect the asynchronous relationship between the original ones. In this example, it is easy to see that some points from the original time series are interpolated to form an interpolated one. The figure also indicates that the interpolated time series match each other well at the same value of the X-axis. In addition, the length K of each interpolated time series is larger than that of the original ones: K lies in the range [max(M, N), M + N − 1], where M and N respectively denote the lengths of the original time series Q and C (or $X_i$ and $X_j$). Comparing Fig. 1 with Fig. 3, the length of the interpolated time series is 104 while that of the original ones is 100, which means that the interpolated series are longer than the original ones.

Fig. 3. The mappings between two interpolated time series Q'' and C'' reflect the asynchronous relationship between the original time series Q and C.

Following the algorithm of PCA, we give the algorithm of APCA as follows; a code sketch of the whole procedure appears after the steps.

Step 1: Organize the dataset (or data matrix) $X_{n\times m}$. Each row denotes an observation with m variables and each column denotes a variable, i.e. a univariate time series.

Step 2: Normalize each time series in the data matrix X. Z-score normalization is used, so that each normalized series $X'_i$ has zero mean and unit standard deviation, that is, $x'_{ih}\sim N(0,1)$.

Step 3: Obtain the interpolated time series. Using Eq. (1), DTW computes the best warping path $P_i^j$ for the normalized time series $X'_i$ and $X'_j$. The elements of the best warping path form the interpolated time series $X''_i=\{p_1(1), p_2(1), \ldots, p_K(1)\}$ and $X''_j=\{p_1(2), p_2(2), \ldots, p_K(2)\}$. All the normalized time series are thus extended to versions that reflect the corresponding asynchronous correlation.

Step 4: Let X denote the interpolated time series dataset X'', that is, X = X''.

Step 5: Execute the last three steps of PCA to achieve the dimensionality reduction.

Compared with PCA, APCA has at least three advantages, at the price of the extra time spent computing the interpolated time series. (1) The similar shape trends existing in two time series can be mapped by DTW, which makes APCA reflect the correlation between two time series at different points in time; that is, APCA considers asynchronism when reducing the dimensionality, whereas PCA only reflects the correlation at the same points in time. (2) The lengths $|X_i|$ of the time series $X_i$ in the data matrix X must be equal in PCA. Since DTW can compare two time series of unequal length, APCA can reduce the dimensionality of unequal-length time series as well as equal-length ones. It means that time series of arbitrary lengths in the data cell $X = \{X_1, X_2, \ldots, X_m\}$, with $|X_1|, |X_2|, \ldots, |X_m|$ not necessarily equal, can be processed by APCA. (3) APCA is more general than PCA. If we restrict the search scope in the cost matrix for the best warping path, DTW can produce interpolated series identical to the original time series; in this case APCA degenerates to PCA, so PCA can be regarded as a special case of APCA. Conversely, APCA is an extended and improved version of PCA.

After executing the algorithm of APCA, we obtain the first k principal components $V_{m\times k}$. Each univariate time series is then represented by the corresponding row vector of V, denoted as a tuple $F_i = V(i,:)$. That is, $F_i$ is the vector of feature values of the ith univariate time series in the dataset; in other words, $F_i$ is the representation of the ith univariate time series in X. In this way, the feature values instead of the original time series can be used in time series data mining. Moreover, each univariate time series of length n is transformed into a feature sequence of length k, where k < n and k < m.
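The following Python sketch assembles Steps 1-5, reusing the dtw and pca helpers sketched above. Building the covariance matrix pairwise from the interpolated series is one plausible reading of Steps 3-4 rather than the author's exact implementation.

```python
import numpy as np

def apca_components(series_list, k=2):
    """Sketch of APCA: series_list holds m univariate time series
    (possibly of different lengths); returns the first k principal
    components of the asynchronous covariance matrix (Eq. (5))."""
    m = len(series_list)
    # Step 2: z-score normalization of every series.
    Z = [(s - s.mean()) / s.std() for s in map(np.asarray, series_list)]
    # Steps 3-4: pairwise asynchronous covariance via DTW alignment.
    acov = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):
            _, path = dtw(Z[i], Z[j])                  # best warping path
            xi = np.array([Z[i][a] for a, _ in path])  # interpolated X_i''
            xj = np.array([Z[j][b] for _, b in path])  # interpolated X_j''
            acov[i, j] = acov[j, i] = np.mean(xi * xj)
    # Step 5: eigendecomposition, eigenvectors sorted by eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(acov)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:k]]   # row i is the feature tuple F_i
```

Row i of the returned matrix is the feature tuple $F_i$ that represents the ith time series.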
The time complexity of APCA mainly depends on the computation of DTW, which is used to obtain the interpolated time series. The maximal time complexity of DTW for computing the best warping path between two time series is O(MN) and the minimal one is O(rN), where M and N are the lengths of the two time series and r is a small factor proposed by Salvador and Chan (2007). At the same time, the lengths M' and N' of the two interpolated time series are larger than those of the original ones, that is, max(N, M) ≤ M' = N' ≤ M + N − 1. Therefore, the minimal time complexity of APCA for computing the relationship between two equal-length time series is O(rN) + O(M'), whereas the time complexity of PCA for the same computation is O(N). This extra computation is the price APCA pays.

4. Experimental evaluation

In the field of time series data mining, PCA is often applied to dimensionality reduction for time series so that the reduced data remove redundancy and irrelevance. To test the validity and superiority of APCA, its experimental results are compared with those of PCA for time series data mining, including simulation data clustering, UCI data clustering and classification, and stock time series clustering. The experiments are also designed to illustrate the working principle of APCA.

4.1. Simulation data clustering

According to the previous discussion, APCA can handle datasets with asynchronous correlation between time series, and both PCA and APCA are robust to random disturbances and offsets. In this experiment, four pairs of simulated time series are designed for clustering. They are produced by the following formulas:
$X_1 = \mathrm{mod}(2\pi t, 1)$
$X_2 = \mathrm{sign}(\sin(t))$
$X_3 = \sin(\pi t / 10)$
$X_4 = 2X_1 + X_2 + 3X_3$
$X_5 = \mathrm{mod}(2\pi(t+10), 1) + 5 + 0.5\,\mathrm{rand}(1,50)$
$X_6 = \mathrm{sign}(\cos(t)) + 6 + 0.5\,\mathrm{rand}(1,50)$
$X_7 = \cos(\pi t / 10) + 4 + 0.5\,\mathrm{rand}(1,50)$
$X_8 = X_4 + 3 + 0.5\,\mathrm{rand}(1,50)$

where t = 1, 2, ..., 50 and rand(1, 50) creates 50 random values that follow the standard normal (0-1) distribution. The eight time series, shown in Fig. 4, can be divided into four groups at first sight. From the above formulas, it is easy to see that the shape trends of the time series in each group are similar, that is, ($X_1$, $X_5$), ($X_2$, $X_6$), ($X_3$, $X_7$) and ($X_4$, $X_8$).
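For reproducibility, the eight series can be generated as below; reading rand(1, 50) as standard normal noise (numpy.random.randn) is an assumption.

```python
import numpy as np

t = np.arange(1, 51)                       # t = 1, 2, ..., 50
noise = lambda: 0.5 * np.random.randn(50)  # 0.5 * rand(1, 50)

X1 = np.mod(2 * np.pi * t, 1)
X2 = np.sign(np.sin(t))
X3 = np.sin(np.pi * t / 10)
X4 = 2 * X1 + X2 + 3 * X3
X5 = np.mod(2 * np.pi * (t + 10), 1) + 5 + noise()
X6 = np.sign(np.cos(t)) + 6 + noise()
X7 = np.cos(np.pi * t / 10) + 4 + noise()
X8 = X4 + 3 + noise()

X = np.column_stack([X1, X2, X3, X4, X5, X6, X7, X8])  # 50-by-8 data matrix
```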
Suppose $X_i$ is a column vector of length 50, so the data matrix can be represented as $X_{50\times 8}=\{X_1, X_2, \ldots, X_8\}$. Using the algorithms of APCA and PCA, the interpolated time series are obtained. Note that different pairs of time series produce different pairs of interpolated time series, because the best warping path between each pair yields different interpolated values. For instance, as shown in Fig. 5, two pairs of time series, ($X_1$, $X_2$) and ($X_1$, $X_3$), are used to compute the relationships; the interpolated versions of $X_1$ in the two comparisons, denoted as $X''_1(X_2)$ and $X''_1(X_3)$, differ according to the best warping path computed for each pair. Compared with Fig. 4, the interpolated time series $X''_i$ retains all the information of the original one and gains interpolated values that better describe the asynchronous relationship. In addition, the length of $X''_i$ is often larger than that of $X_i$. Finally, the principal components of the data matrix X are obtained by each method; they are denoted as $V_{APCA}$ and $V_{PCA}$.

Fig. 4. Eight simulated time series have similar shape trends within each group.

Fig. 5. The interpolated versions of X1 differ according to the series (X2 or X3) with which it is aligned.

The principal components are used to represent the original time series. If we intend to retain k (k < m) principal components, then the first k column vectors of $V_{APCA}$ and $V_{PCA}$ are stored, which means that the first k principal components represent the features of the original time series. Here we retain k = 2 principal components, so the first two column vectors are stored. They are

F_APCA =
[0.0557  0.0710
 0.1875  0.6768
 0.1809  0.0683
 0.6566  0.2140
 0.0648  0.1019
 0.1897  0.6696
 0.1758  0.0235
 0.6533  0.1646]

and

F_PCA =
[0.0208  0.0172
 0.1274  0.1236
 0.1746  0.0554
 0.6927  0.0083
 0.0006  0.0952
 0.0167  0.9809
 0.0024  0.0967
 0.6875  0.0253]
From the values of the feature representations, we can see that in F_APCA the feature values of time series with similar shape trends are close to each other. For example, the first row vector F_APCA(1,:) = (0.0557, 0.0710) is close to the fifth one F_APCA(5,:) = (0.0648, 0.1019), the second row vector F_APCA(2,:) = (0.1875, 0.6768) is close to the sixth one F_APCA(6,:) = (0.1897, 0.6696), and so on. However, in F_PCA the feature values of time series with similar shape trends are not close to each other. The scatter diagrams of the feature values in F are shown in Figs. 6 and 7. It is easy to see that the feature values produced by APCA reflect the true correlation between the original time series with similar shape trends.

As is well known, hierarchical clustering is a method used to observe the similarities among different objects, and its dendrogram intuitively shows the degree of similarity. Thereby, hierarchical clustering is used to mine knowledge from the representations produced by PCA and APCA. Different experiments are run according to the parameters, namely the reduced dimension k (number of principal components) and the contribution rate $\eta$ of cumulative energy. The clustering results are shown in Fig. 8.

Comparing APCA with PCA, the clustering results in Fig. 8 show that APCA has advantages over PCA. For the various numbers of retained principal components, time series with similar shape trends at different points in time are clustered earlier. For example, time series $X_3$ and $X_7$ are placed in one group by APCA-based clustering in every setting, and $X_1$ and $X_5$ (as well as $X_2$ and $X_6$) are merged by APCA-based clustering at the earliest stage. In addition, for the same number k of principal components, the cumulative energy of APCA exceeds that of PCA. In the cases k = 1, 2, 3, the contribution rate $\eta$ of the cumulative energy of APCA is larger than that of PCA; formally, $\eta^{k=1}_{APCA}=90.07\% > \eta^{k=1}_{PCA}=82.88\%$, $\eta^{k=2}_{APCA}=97.17\% > \eta^{k=2}_{PCA}=88.98\%$ and $\eta^{k=3}_{APCA}=99.00\% > \eta^{k=3}_{PCA}=98.59\%$. In other words, APCA retains much more important information about the time series than PCA for the same number of retained principal components.
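A minimal way to reproduce this kind of dendrogram from the feature tuples is sketched here with SciPy; F_apca stands for the k-column feature matrix returned by the APCA sketch above, and the linkage choice (Ward) is an assumption, since the paper does not state one.

```python
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# F_apca: m-by-k matrix of feature tuples, one row per time series
# (for instance the output of apca_components above).
F_apca = apca_components([X1, X2, X3, X4, X5, X6, X7, X8], k=2)

Z = linkage(F_apca, method="ward")    # agglomerative hierarchical clustering
dendrogram(Z, labels=[f"X{i}" for i in range(1, 9)])
plt.show()
```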
Fig. 6. The scatter diagram of the feature values produced by APCA.

Fig. 7. The scatter diagram of the feature values produced by PCA.

Fig. 8. The clustering results of the simulated time series according to the reduced dimensions.

4.2. Mining on UCI data and stock time series

To further test the validity of APCA for time series clustering, we perform APCA-based clustering and PCA-based clustering on two types of time series datasets. One is a classic UCI dataset (Synthetic Control), which is often used to test the performance of algorithms; the other is a set of Chinese stock time series. From the first dataset we arbitrarily choose 12 time series of equal length. They essentially divide into 6 groups, namely (1, 2), (3, 4), (5, 6), (7, 8), (9, 10) and (11, 12); meanwhile, the overall trends of (1, 2, 3, 4), (5, 6, 9, 10) and (7, 8, 11, 12) are level, up and down respectively. For the second dataset, 10 Chinese stock time series of different lengths are selected. They are the closing prices and have the same start/stop times (2010-03-01/2011-04-31). The stock codes from number one to ten are {002096, 002097, 002098, 002099, 002100, 002101, 002102, 002103, 002104, 002105}, and their lengths are {1:241, 2:239, 3:240, 4:234, 5:237, 6:240, 7:236, 8:237, 9:240, 10:239}, where i:L means that the ith stock time series has length L. The clustering results of the two methods are shown in Figs. 9 and 11.

Fig. 9. The clustering results on synthetic control time series with equal length.

Fig. 11. The clustering results on stock time series with unequal length.

In Fig. 9, it is easy to see that the clustering results of APCA are better than those of PCA. APCA clusters the series into the 6 groups at lower reduced dimensions such as k = 1 and k = 2, which is consistent with the essential divisions. In addition, time series with similar shape trends are recognized by APCA: when k = 1 and k = 2, the time series (1, 2, 3, 4), (5, 6, 9, 10) and (7, 8, 11, 12) are respectively divided into the level, up and down groups. At the same time, the cumulative energy retained by APCA is larger than that retained by PCA, which means that APCA obtains more of the important information. These experimental conclusions agree with those of the previous experiment on simulated time series. Therefore, the experimental results indicate that APCA, which considers the asynchronous correlation between time series, is able to improve the
performance of the dimensionality reduction of the traditional PCA, which only considers the synchronous correlation.

In addition, classification is one of the most important tasks in time series data mining. The 1-nearest-neighbor classification combined with APCA and PCA is used to classify time series. We randomly take 30 and 10 time series from the synthetic control dataset and regard them as the testing set and the training set respectively. APCA and PCA are first applied to the dimensionality reduction of the two sets according to different reduced dimensions k, giving two feature sets. For a particular value of k, the 1-nearest-neighbor classification is performed on the two feature sets: every feature in the testing set searches for the most similar object in the training set. The classification results of the methods are expressed by the error rates, which reflect how many objects have a class label different from that of the query. As shown in Fig. 10, the classification results of APCA are much better than those of PCA for the different reduced dimensions k.

In most cases, time series of unequal length have to be mined in the field of time series data mining. Unfortunately, PCA is not able to process this kind of time series, because the covariance between two unequal-length time series cannot be computed by Eq. (4). To overcome this difficulty, the interpolated time series derived from the original unequal-length ones are used to compute the covariance by Eq. (5); moreover, the interpolated series reflect the asynchronous correlation between the original time series.
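The evaluation protocol can be sketched as follows; the Euclidean distance between feature tuples is an assumption, as the paper does not name the metric used on the reduced features.

```python
import numpy as np

def nn_error_rate(train_feats, train_labels, test_feats, test_labels):
    """1-nearest-neighbor error rate on the reduced feature tuples."""
    errors = 0
    for f, label in zip(test_feats, test_labels):
        dists = np.linalg.norm(train_feats - f, axis=1)  # Euclidean distance
        nearest = np.argmin(dists)
        if train_labels[nearest] != label:
            errors += 1
    return errors / len(test_feats)
```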
Fig. 10. The classification results of the two methods on synthetic control in terms of different dimensions k.

Fig. 12. The comparison between stock time series 8 and 10.
In Fig. 11, the results of stock time series clustering demonstrate that APCA not only has the ability of dimensionality reduction for unequal-length time series but also consistently places the stock time series with similar shape trends into the same group. For example, stocks 8 and 10 are always clustered together. From the perspective of the shapes of the time series, their trends are the most similar, as shown in Fig. 12. Although the lengths of the two time series (237 and 239) are different, APCA can measure their relationship while PCA cannot, and their relationship does reflect the similar trend.

In some cases an investor has to purchase stocks. To obtain a large profit and reduce risk, the investor had better purchase stocks from different clusters: when playing the stock market, it is smarter to divide one's money among different stocks than to put the whole amount into just one, that is, to put all one's eggs in one basket. Therefore, the first task is to find the different clusters by APCA so that the unequal-length stocks can be divided.

5. Conclusions

An asynchronism-based principal component analysis (APCA) is proposed to reduce dimensionality in light of the asynchronous correlation between time series. The existing methods such as SVD, PCA and ICA are often used to reduce the dimensionality of time series. In particular, SVD and PCA follow a similar principle and are based on principal components, while ICA is a development of principal component analysis and factor analysis. They are widely applied in different fields, including time series data mining. Unfortunately, they only consider the synchronous relationship between two time series (or two variables) and neglect the asynchronous relationship. Instead, the proposed method APCA considers the asynchronous covariance in the process of computation and is an improved version of PCA. Since the covariance is conventionally used to measure the correlation between two equal-length variables, it does not work for unequal-length time series. To better measure the covariance between unequal-length time series with similar shape trends, we use DTW to find the best warping path and obtain interpolated time series that reflect
the similar shape trends of the time series at different points in time. Moreover, the lengths of the interpolated time series are stretched to be equal, so PCA can be applied to the equal-length interpolated series instead of the original unequal-length ones. In addition, the experimental results indicate that APCA is more powerful than PCA for time series data mining.

The advantages of APCA over PCA can be summarized as follows. (1) APCA can reduce the dimensionality of time series of different lengths in the dataset, whereas PCA only processes series of equal length; moreover, the idea of PCA is applied within APCA, which makes APCA an extended version of PCA. (2) Time series with similar shape trends at different points in time can be mapped to each other by APCA, so a correlation including both synchronism and asynchronism between two time series of different lengths can be reflected, while PCA only reflects the synchronous one. (3) Since repeated values are interpolated to form the interpolated time series, more important information is considered by APCA than by PCA, which means that APCA obtains more cumulative energy than PCA for the same reduced dimension (the same number of retained principal components).

Thus, extending the traditional PCA to process time series of different lengths, making PCA consider asynchronous relationships, and reflecting the important and repeated information in time series are the unique research contributions of this paper. These contributions overcome the shortcomings of PCA and widen the application of the principle of principal components.

Although APCA is effective for time series data mining, extra time is consumed by DTW in the process of APCA. Presently, much work (Lemire, 2009; Salvador & Chan, 2007) concentrates on decreasing the time cost of DTW, which may speed up the computation of APCA. However, these methods depend on some factors that are difficult to set. Therefore, one piece of future work is to propose a suitable factor-free method of finding the best warping path in APCA so that the algorithm can execute faster. At the same time, since APCA considers the asynchronous relationships and is able to process time series of different lengths, it, like SVD, PCA and ICA, can be used to solve problems in image recognition, speech recognition, financial analysis, and so on. In particular, in the field of multivariate stock data analysis, the asynchronous relationships between different variables are very important for volatility analysis. Therefore, the application to volatility analysis and co-movement analysis of stock markets is another research direction.

Acknowledgments

This work has been partly supported by the Fundamental Research Funds for the Central Universities (12SKGC-QG03), the Research Funds of Huaqiao University (13SKBS104) and the National Natural Science Foundation of China (61300139).

References

Agrawal, R., Faloutsos, C., & Swami, A. (1993). Efficient similarity search in sequence databases. In Proceedings of the fourth international conference on foundations of data organization and algorithms (pp. 69–84).
Bankó, Z., & Abonyi, J. (2012). Correlation based dynamic time warping of multivariate time series. Expert Systems with Applications, 39, 12814–12823.
Baragona, R., & Battaglia, F. (2007). Outliers detection in multivariate time series by independent component analysis.
Neural Computation, 19(7), 1962–1984.
Chu, S., Keogh, E., Hart, D., & Pazzani, M. (2002). Iterative deepening dynamic time warping for time series. In Proceedings of the second SIAM international conference on data mining (pp. 195–212).
Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing. John Wiley.
Esling, P., & Agon, C. (2012). Time-series data mining. ACM Computing Surveys, 45, 1201–1234.
Fuchs, E., Gruber, T., Pree, H., & Sick, B. (2010). Temporal data mining using shape space representation of time series. Neurocomputing, 74, 379–393.
Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
Itakura, F. (1975). Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1), 52–72.
Johannesmeyer, M. C. (1999). Abnormal situation analysis using pattern recognition techniques and historical data. M.S. thesis, UCSB, Santa Barbara, CA.
Jolliffe, I. T. (2004). Principal component analysis. New York: Springer.
Karamitopoulos, L., Evangelidis, G., & Dervos, D. (2008). Multivariate time series data mining: PCA-based measures for similarity search. In Proceedings of the 2008 international conference in data mining (pp. 253–259).
Karamitopoulos, L., & Evangelidis, G. (2010). PCA-based time series similarity search. Data Mining, Annals of Information Systems, 8, 255–276.
Keogh, E., & Pazzani, M. (2001). Derivative dynamic time warping. In Proceedings of the first SIAM international conference on data mining (pp. 1–11).
Keogh, E., Chakrabarti, K., Pazzani, M., & Mehrotra, S. (2000). Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, 3, 263–286.
Keogh, E., Chu, S., Hart, D., & Pazzani, M. (2001). An online algorithm for segmenting time series. In Proceedings of the 2001 IEEE international conference on data mining (pp. 289–296).
Krzanowski, W. (1979). Between-groups comparison of principal components. Journal of the American Statistical Association, 74, 703–707.
Lee, A. J. T., Wu, H. W., Lee, T. Y., et al. (2009). Data & Knowledge Engineering, 68, 1071–1090.
Lemire, D. (2009). Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recognition, 42, 2169–2180.
Li, H., & Guo, C. (2011). Piecewise cloud approximation for time series mining. Knowledge-Based Systems, 24, 492–500.
Li, H., Guo, C., & Qiu, W. (2011). Similarity measure based on piecewise linear approximation and derivative dynamic time warping for time series mining. Expert Systems with Applications, 38, 14732–14743.
Li, C., Khan, L., & Prabhakaran, B. (2006). Real-time classification of variable length multi-attribute motion data. International Journal of Knowledge and Information Systems, 10(2), 163–183.
Lin, J., Keogh, E., Lonardi, S., & Chiu, B. (2003). A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the eighth ACM SIGMOD international conference on management of data workshop on research issues in data mining and knowledge discovery (pp. 2–11).
Maharaj, E. A., & Urso, P. D. (2011). Fuzzy clustering of time series in the frequency domain. Information Sciences, 181, 1187–1211.
Muscillo, R., Conforto, S., & Schmid, S. (2007). Classification of motor activities through derivative dynamic time warping applied on accelerometer data. In Proceedings of the 29th annual international conference of engineering in medicine and biomedicine (pp. 4930–4933).
Papadakis, S. E., & Kaburlasos, V. G. (2010). Piecewise-linear approximation of nonlinear models based on probabilistically/possibilistically interpreted intervals numbers (INs). Information Sciences, 180, 5060–5076.
Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42, 59–66.
Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1), 43–49.
Sakurai, Y., Yoshikawa, M., & Faloutsos, C. (2005). FTW: Fast similarity search under the time warping distance. In Proceedings of the 2005 principles of database systems (pp. 326–337).
Salvador, S., & Chan, P. (2007). FastDTW: Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11, 561–580.
Shatkay, H., & Zdonik, S. (1996). Approximation queries and representations for large data sequences. In Proceedings of the 12th IEEE international conference on data engineering (pp. 536–545).
Singhal, A., & Seborg, D. E. (2002). Clustering of multivariate time-series data. In Proceedings of the American control conference (Vol. 5).
Singhal, A., & Seborg, D. E. (2005). Clustering multivariate time-series data. Journal of Chemometrics, 19, 427–438.
Spiegel, S., Gaebler, J., Lommatzsch, A., et al. (2011). Pattern recognition and classification for multivariate time series. In Proceedings of the fifth international workshop on knowledge discovery from sensor data (pp. 1–9).
Struzik, R., & Siebes, A. (1998). Wavelet transform in similarity paradigm. In Proceedings of the second Pacific-Asia conference on knowledge discovery and data mining (pp. 295–309).
Struzik, R., & Siebes, A. (1999). The Haar wavelet transform in the time series similarity paradigm. In Proceedings of the third European conference on principles and practice of knowledge discovery in databases (pp. 12–22).
Wang, H. (2012). Block principal component analysis with L1-norm for image analysis. Pattern Recognition Letters, 33, 537–542.
Weng, X., & Shen, J. (2008). Classification of multivariate time series using two-dimensional singular value decomposition. Knowledge-Based Systems, 21(7), 535–539.
Wong, T. S. F., & Wong, M. H. (2003). Efficient subsequence matching for sequences databases under time warping. In Proceedings of the seventh international database engineering and applications symposium (pp. 139–148).
Wu, E. C., & Yu, P. H. (2005). Independent component analysis for clustering multivariate time series data. Advanced Data Mining and Applications, 474–482.
Yang, K., & Shahabi, C. (2004). A PCA-based similarity measure for multivariate time series. In Proceedings of the second ACM international workshop on multimedia databases (pp. 65–74).
Yu, D., Yu, X., Hu, Q., et al. (2011). Information Sciences, 181, 2787–2796.
Zhou, M., & Wong, M. H. (2005). A segment-wise time warping method for time scaling searching. Information Sciences, 173(1), 227–254.