Knowledge-Based Systems 112 (2016) 80–91
Exploring shapelet transformation for time series classification in decision trees

Willian Zalewski a,*, Fabiano Silva b, A.G. Maletzke c, C.A. Ferrero d

a Federal University for Latin American Integration, Foz do Iguaçu, Parana, Brazil
b Federal University of Parana, Curitiba, Parana, Brazil
c State University of West Parana, Foz do Iguaçu, Parana, Brazil
d Federal Institute of Santa Catarina, Lages, Santa Catarina, Brazil
Article history: Received 11 September 2015; Revised 24 August 2016; Accepted 31 August 2016; Available online 1 September 2016

Keywords: Data mining; Classification; Time series; Shapelets
Abstract

In data mining tasks, time series classification has been widely investigated. Recent studies using non-symbolic learning algorithms have reported significant results in terms of classification accuracy. However, in applications related to decision-making processes it is necessary to understand the reasoning used in the classification process. To take this into account, the shapelet primitive has been proposed in the literature as a descriptor of local morphological characteristics. On the other hand, most of the existing work related to shapelets has been dedicated to the development of more effective approaches in terms of time and accuracy, disregarding the need for classifier interpretation. In this work, we propose the construction of symbolic models for time series classification using the shapelet transformation. Moreover, we develop strategies to improve the representation quality of the shapelet transformation using feature selection algorithms. We performed experimental evaluations comparing our proposal with the state-of-the-art algorithms in the time series classification literature. Based upon the experimental results, we argue that the improvement in shapelet representation can contribute to the construction of classifiers that are more interpretable and competitive in comparison to non-symbolic methods.
1. Introduction

The analysis of events and behaviors in time series is a non-trivial task, strongly related to the application domain [1]. Approaches based upon data mining methods have been widely studied over the past two decades [2–6]. Traditional data mining algorithms were developed to handle data without considering the existence of order among the attributes. In recent years, motivated by this problem, many studies have been proposed in order to handle time-dependent data. Among data mining tasks, time series classification has attracted great research attention [7–16]. Recent studies [10,17], based on extensive empirical evaluation, have found that the 1-nearest neighbor algorithm (1NN) in combination with the Euclidean distance (ED) or dynamic time warping (DTW) provides the best performance in terms of accuracy for most assessed domains. Therefore, the 1NN method has been regarded as the state-of-the-art for time series classification [18]. Nonetheless,
in decision-making processes, such as medical diagnosis, industrial production control, and safety monitoring systems for aircraft and power plants, the reasoning used in the classification process must be explained [19–23]. Interpretability is hardly present in the 1NN algorithm, because it only provides information about the degree of similarity between two time series. Furthermore, because it is based on a lazy learning strategy, no explicit model of knowledge is constructed. The most common strategy for constructing intelligible classifiers has been the use of symbolic machine learning algorithms, such as decision trees or decision rules. In general, rule-based methods allow greater understanding of the knowledge represented in the machine learning models [24]. In the literature, most methods proposed for constructing symbolic classifiers of time series are based on the representation of statistical features and/or the adjustment of function parameters [7,25–28]. However, the use of those representations can affect knowledge understanding in the symbolic models, because they refer to concepts more distant from human intuition. In this context, the shapelet primitive [21] has been proposed to improve comprehension using local morphological characteristics, which are closer to human perception
for identifying patterns in time series [29]. A shapelet is a subsequence of a time series that can be used as a primitive for time series data mining. Classification via shapelets involves measuring the similarity between a shapelet and each time series, and then using this similarity as a discriminatory feature. Classification with shapelets offers several advantages over other approaches. First, shapelets can be directly interpreted and can offer explanatory insights about the domain. Second, shapelet identification is based on shape similarity, independent of scale and phase.

In order to facilitate the adoption of new approaches, the separation of the shapelet identification process from the induction of the model was proposed in [30]. The authors presented the shapelet transformation algorithm, which maps each time series to a set of distance values, one for each identified shapelet. The shapelet transformation algorithm has three main parameters that must be estimated for each domain: the minimum and maximum lengths of the subsequences to be explored, and the number of shapelets to be selected. The correct definition of these parameters is not a trivial task, because they directly affect the quality of the representation of the time series in the shapelet transformation. An improper restriction of the minimum and maximum range of the subsequences may not be enough to capture the most relevant shapelets. Also, selecting too few shapelets can favor certain classes over others. On the other hand, using too many shapelets can produce models that overfit the training set and dilute the influence of the most representative shapelets [30,31]. Despite these problems, to the best of our knowledge, the existing work on shapelet transformation has not objectively assessed the influence of these parameters. Hence, one of our main goals in this work is to explore the use of these parameters.

Besides the aspects mentioned above, it is noteworthy that most of the work related to shapelets in the literature has been dedicated to the development of more effective approaches in terms of processing time and accuracy [31–35]. Although positive, this scenario shows that the original purpose of providing a greater degree of intelligibility to time series classification is yet to be explored. Therefore, the other main focus of this work is to revisit the use of shapelets in decision trees for time series classification.

In this work, our main contributions can be summarized as follows:
1. We suggest two extensions of the shapelet transformation algorithm to explore the effect of these parameters on classification accuracy.
2. We introduce the reduced approach, an alternative extension that uses feature selection algorithms to automatically determine the number k of best shapelets.
3. We evaluate our proposed extensions using decision trees and compare them with other shapelet-based algorithms, such as logical shapelets and fast shapelets, and with state-of-the-art algorithms, such as 1NN ED and 1NN DTW.

The paper is organized as follows: Section 2 provides background and definitions on time series classification and the shapelet primitive. In Section 3, three extensions of the shapelet transformation are proposed, in order to investigate their performance in decision trees. In Section 4, we describe our experimental design and results, and perform qualitative and statistical analyses. Finally, in Section 5, we present our conclusions and suggestions for future work.

2. Definitions and background

In this section we provide some preliminary definitions of time series and present a brief description of the shapelet transformation for time series classification.
2.1. Time series classification

Definition 1 (Time series). A time series T = {t_1, ..., t_i, ..., t_j, ..., t_m} is a sequence of m ≥ 2 ordered values, such that if i < j then t_i occurs chronologically before t_j.

The classification task is one of the most studied tasks in the literature dedicated to time series. This task determines a function that enables the association of a class to a time series. Formally, let T be the set of all possible series of a particular domain and C = {c_1, ..., c_w} a set of classes, such that:
∀ T_i ∈ T : ((T_i ∈ c_1) ∨ ... ∨ (T_i ∈ c_w)) ∧ (T_i ∈ c_j → T_i ∉ c_k, j ≠ k)    (1)

A time series classifier is a function f that maps a time series T_i ∈ T to a class c ∈ C:

f : T → {c_1, ..., c_w}    (2)
2.2. The shapelet primitive

A shapelet is a subsequence of a time series T ∈ T that is reached through an exhaustive search of every possible subsequence between predefined minimum and maximum lengths.

Definition 2 (Subsequence). A subsequence s = {t_p, ..., t_{p+n−1}} is a contiguous subset of n values of T, which starts at position p, such that 2 ≤ n ≤ m and 1 ≤ p ≤ m − n + 1.

The shapelet discovery comprises three main steps: candidate generation, shapelet distance calculation and shapelet assessment. In the first step, given a set B_M ⊆ T of M time series, candidate subsequences are found via a sliding window process on each time series of B_M.

Definition 3 (Sliding window). Given a time series T of length m, and n the length of the candidate subsequences, a sliding window composes the set of all distinct subsequences of length n that can be extracted from T. The process of constructing a sliding window is based on selecting a subsequence of size n from each position p of a time series T, denoted by s_{n,p} = {t_p, ..., t_{n+p−1}}, such that 1 ≤ p ≤ m − n + 1. As a result, the set of (m − n) + 1 subsequences defined by the sliding window can be expressed by:

S_T^n = {s_{n,1} ∪ s_{n,2} ∪ ... ∪ s_{n,(m−n+1)}}    (3)
The set of all candidate subsequences of size n that can be extracted from each T_i ∈ B_M is S^n = {S_1^n ∪ S_2^n ∪ ... ∪ S_M^n}. Assuming that the size n is restricted to the range (min, max), the set of all candidate subsequences is SC = {S^min ∪ S^{min+1} ∪ ... ∪ S^max}, where the number of elements in SC is:

|SC| = Σ_{n=min}^{max} Σ_{T_i ∈ B_M} (m_i − n + 1)    (4)
In the second stage, for each candidate subsequence s_n ∈ SC, a set of distances is calculated using each time series in B_M.

Definition 4 (Euclidean distance). Given two subsequences s = {v_1, ..., v_n} and r = {u_1, ..., u_n} of length n, the squared Euclidean distance between s and r is:

Dist(s, r) = Σ_{i=1}^{n} (v_i − u_i)²    (5)
As suggested in [21], before the distance calculations, the subsequences s and r need to be z-normalized. For example, each
observation v_i ∈ s is computed by znorm(v_i) = (v_i − μ(s)) / σ(s), where μ(s) and σ(s) are the mean and standard deviation of s.

Definition 5 (Subsequence distance). Given two subsequences s and r of length n_s and n_r, respectively, such that n_s ≤ n_r, and let S_r^{n_s} be the set of subsequences generated by a sliding window of length n_s applied over r, the subsequence distance is:

SubDist(s, r) = Min(Dist(s, q) | q ∈ S_r^{n_s})    (6)

where Min is a function that returns the minimum distance between s and all subsequences in S_r^{n_s}. In Fig. 1 we show an illustration of the location of the minimum distance between two subsequences.

Fig. 1. Illustration of the minimum distance location between S and R [29].

Definition 6 (Set of distances). Given a candidate subsequence s of length n and a dataset B_M, the set of distances from s is:

D_s = {d_1, ..., d_M} = {SubDist(s, T_1), ..., SubDist(s, T_M)}    (7)

In the third stage, the quality of each candidate subsequence s is assessed based on the set of distances D_s.

Definition 7 (Orderline). Given a set of distances D_s, a dataset B_M, and a set C of w classes, an orderline O is a one-dimensional representation of D_s, where the values {d_1, d_2, ..., d_M} are sorted in ascending order.

In Fig. 2 we show a representation of the orderline for a given subsequence and a dataset B_8 = {T_1, ..., T_8}, where there are four series for class circle and four series for class square.

Definition 8 (Split point). A split point sp is a real number that divides an orderline O into two disjoint subsets O_L = {d_i : d_i ∈ O, d_i ≤ sp} and O_R = {d_i : d_i ∈ O, d_i > sp}. Let O be an orderline; the set of possible split points is {sp_1, sp_2, ..., sp_{M−1}}, where sp_i = (d_i + d_{i+1}) / 2.

Definition 9 (Information gain). Given a split point sp and the sizes of O, O_L and O_R as |O|, |O_L| and |O_R|, respectively, the information gain is defined as:

IG(O, sp) = E(O) − ( (|O_L| / |O|) E(O_L) + (|O_R| / |O|) E(O_R) )    (8)

Definition 10 (Entropy). Given an orderline O and e_i the number of elements of O associated to class c_i ∈ C = {c_1, ..., c_w}, the entropy E of O is:

E(O) = Σ_{i=1}^{w} −(e_i / N) log(e_i / N)    (9)

where N = Σ_{j=1}^{w} e_j. When the same information gain was found for distinct split points, we used the separation gap as a tiebreaker criterion [21].

Definition 11 (Separation gap). Given a split point sp, the separation gap is:

DistGap(O, sp) = (1 / |O_R|) Σ_{d_i ∈ O_R} d_i − (1 / |O_L|) Σ_{d_j ∈ O_L} d_j    (10)
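The distance and assessment steps above can be summarized in a short Python sketch. It is only an illustration of Definitions 4–10 under the stated assumptions (z-normalization before every comparison, squared Euclidean distance, and midpoints between consecutive orderline values as candidate split points); the function names are ours, not the authors'.

import numpy as np

def znorm(s):
    """Z-normalise a subsequence; a constant subsequence maps to zeros."""
    sd = np.std(s)
    return (s - np.mean(s)) / sd if sd > 0 else np.zeros_like(s)

def sub_dist(s, r):
    """SubDist of Eq. (6): minimum squared Euclidean distance (Eq. (5)) between
    the z-normalised candidate s and every z-normalised window of r of length len(s)."""
    s = znorm(np.asarray(s, dtype=float))
    r = np.asarray(r, dtype=float)
    n = len(s)
    return min(np.sum((s - znorm(r[p:p + n])) ** 2) for p in range(len(r) - n + 1))

def entropy(labels):
    """Class entropy E(O) of Eq. (9) for the labels falling on one side of a split."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_information_gain(distances, labels):
    """Assess a candidate (Definitions 7-9): sort the orderline and take the split
    point (midpoint between consecutive distances) with maximum information gain."""
    order = np.argsort(distances)
    d, y = np.asarray(distances)[order], np.asarray(labels)[order]
    base, best_gain, best_sp = entropy(y), 0.0, None
    for i in range(len(d) - 1):
        sp = (d[i] + d[i + 1]) / 2.0
        left, right = y[d <= sp], y[d > sp]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = base - (len(left) / len(y)) * entropy(left) - (len(right) / len(y)) * entropy(right)
        if gain > best_gain:
            best_gain, best_sp = gain, sp
    return best_gain, best_sp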
Definition 12 (Shapelet). A shapelet is a tuple s_H = (s, sp), where s is a subsequence of SC and sp is the split point that enables the division of an orderline O into two disjoint subsets, such that the information gain is the maximum available for O.

The shapelet discovery is based on an enumerative search. The time complexity to find a single shapelet of length n in a dataset of M time series is O(Mnm), and the search for all shapelets is O(M²m⁴), where m is the size of the largest time series in the dataset.

2.3. Embedded-based shapelets

The original shapelet algorithm was introduced in [21]. The algorithm builds a decision tree classifier by recursively searching for a discriminatory shapelet that splits the data into two groups, where the information gain is the maximum possible. The decision tree construction is finished when the entropy of the two groups equals zero. This approach was extended in [36], in which the authors proposed the logical shapelets (LS). The algorithm is based on the combination of multiple shapelets in the same decision tree node using conjunctions or disjunctions of shapelets. In addition, the authors also proposed speed-up techniques for searching shapelets, such as an approach based on the precalculation of distances between time series, which reduces the time complexity to O(M²m³) in the worst case. However, considering that the shapelet discovery process is still time consuming, in [32] the authors proposed the fast shapelets (FS) algorithm, another extension of [21]. This algorithm is based on the symbolic aggregate approximation (SAX) algorithm to discretize time series. Thereafter, the random projection technique is used to compute the similarity between the discretized time series. Using the FS algorithm, the time complexity is reduced to O(Mm²) in the worst case. On the other hand, that algorithm is based on a non-deterministic method (random projections), where each execution of the algorithm can find a different subset of shapelets. Moreover, there are four new parameters that must be evaluated for each dataset. The space complexity of those algorithms is O(k) in the worst case, where k is the number of shapelets used in the decision tree induction.

2.4. Shapelet transformation
In [30], the authors introduced a generalization of the shapelet primitive in order to separate the task of finding shapelets from the decision tree induction process. This approach extracts the best k shapelets of the training set to build a new representation of the time series, where each time series is represented by its set of distances to the k selected shapelets. The shapelet transformation algorithm can be understood in three distinct stages: estimation of the parameter k, selection of the best shapelets, and transformation. In the initial stage, the proper number k of shapelets must be estimated. In [30], the authors proposed two approaches. In the first approach, the value of k is set to m/2. The second approach is based on the application of a cross-validation with 5 partitions on the training set. For each partition, m sets of shapelets are created, each one with k ∈ {1, 2, ..., m} shapelets. The final value of k is defined by the number of shapelets that yields the best average accuracy over the 5 partitions.
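As a rough illustration of the cross-validation strategy for choosing k, the sketch below assumes the training set has already been shapelet-transformed into a matrix whose columns are ordered by decreasing shapelet quality, and uses scikit-learn's decision tree as a stand-in classifier; the helper name estimate_k is hypothetical.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def estimate_k(X_train, y_train):
    """Pick k by 5-fold cross-validation on the training set only.
    X_train: shapelet-transformed matrix whose columns are assumed to be
    sorted by decreasing shapelet quality, so 'the best k shapelets'
    are simply the first k columns."""
    best_k, best_score = 1, -np.inf
    for k in range(1, X_train.shape[1] + 1):
        score = cross_val_score(DecisionTreeClassifier(random_state=0),
                                X_train[:, :k], y_train, cv=5).mean()
        if score > best_score:
            best_k, best_score = k, score
    return best_k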
Fig. 2. Orderline for the subsequence S. Each time series is placed in the orderline based on the SubDist from S [36].
In the second stage, the k best shapelets are selected (Algorithm 1). For each time series in B_M, all subsequences of lengths between min and max are generated (line 3). The set of distances of each subsequence s ∈ SC is computed (line 5) and assessed (line 6). Once all candidate subsequences in SC have been assessed, they are sorted by quality (line 9), and self-similar subsequences are removed (line 10). Two subsequences are considered similar if they are taken from the same time series and have overlapping indices. Finally, only the k best shapelets are selected (line 11).

Algorithm 1: kBestShapelets(B_M, min, max, k)
1   kShapelets ← ∅;
2   shapelets ← ∅;
3   SC ← GenerateCandidates(B_M, min, max);
4   for each subsequence s in SC do
5       D_s ← ComputeDistances(s, B_M);
6       quality ← AssessQuality(D_s);
7       shapelets ← shapelets ∪ (s, quality);
8   end
9   SortByQuality(shapelets);
10  RemoveSelfSimilar(shapelets);
11  kShapelets ← Merge(k, kShapelets ∪ shapelets);
12  return kShapelets

Algorithm 1 uses the min and max parameters to define the range of lengths of the candidate subsequences. To reduce the computational cost, the authors proposed Algorithm 2 to estimate these parameters. The procedure is based on 10 randomly selected time series from B_M (lines 3, 4). Thereafter, Algorithm 1 is used to find the 10 best shapelets with parameters min = 3 and max = m (line 5). This task is repeated ten times, yielding a set of 100 subsequences (shapelets*). The set of those subsequences is sorted by length (line 7), where the length of the 25th subsequence is defined as min (line 8) and the length of the 75th subsequence is defined as max (line 9). Notwithstanding, since the transformation algorithm processes all the subsequences in the (min, max) interval, the time complexity is the same as for the original embedded approach, O(M²m⁴) in the worst case.

Algorithm 2: EstimateMinMax(B_M)
1   shapelets* ← ∅;
2   for i ← 1 until 10 do
3       RandomiseOrder(B_M);
4       B_10 ← {T_1, T_2, ..., T_10};
5       shapelets* ← shapelets* ∪ kBestShapelets(B_10, 3, m, 10);
6   end
7   SortByLength(shapelets*);
8   min ← Length(s_H^25);
9   max ← Length(s_H^75);
10  return (min, max)

In the last stage, the k best shapelets found by Algorithm 1 are used in Algorithm 3 to transform B_M into a new representation (Table 1). In that representation, each time series T_i ∈ B_M, 1 ≤ i ≤ M, is described by the set of k shapelets, where each shapelet j contributes a distance value d_ij = SubDist(s_H^j, T_i), 1 ≤ j ≤ k. Hence, the space complexity of the shapelet transformation algorithm is O(kM). In Algorithm 3, for each time series T_i ∈ B_M the subsequence distances d_ij (line 4) are stored in D_Ti = {d_i1, d_i2, ..., d_ik} (line 5). The shapelet transformation of B_M is given by X_R.

Algorithm 3: ShapeletTransformation(B_M, kShapelets)
1   X_R ← ∅;
2   for each time series T_i in B_M do
3       for each shapelet s_H^j in kShapelets do
4           d_ij ← SubDist(s_H^j, T_i);
5           D_Ti ← D_Ti ∪ d_ij;
6       end
7       X_R ← X_R ∪ D_Ti;
8   end
9   return X_R

In [30], the authors performed an experimental evaluation to demonstrate the equivalence between the embedded shapelet approach of [21] and the shapelet transformation algorithm in combination with the C4.5 classifier. Also, the authors used the shapelet transformation with other classifiers, such as nearest-neighbor (using Euclidean distance and DTW), naive Bayes, Bayesian networks, random forests and SVM.
Table 1
Attribute-value representation of shapelet transformation.

| Time series (T) | s_H^1 | s_H^2 | ... | s_H^k | Class (c) |
| T_1 | d_11 | d_12 | ... | d_1k | c_1 |
| T_2 | d_21 | d_22 | ... | d_2k | c_2 |
| ... | ... | ... | ... | ... | ... |
| T_M | d_M1 | d_M2 | ... | d_Mk | c_w |
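A compact Python analogue of Algorithm 3, reusing the sub_dist sketch given earlier for Eq. (6), could look as follows; shapelet_transform is an illustrative name, not the authors' implementation.

import numpy as np

def shapelet_transform(dataset, shapelets):
    """Analogue of Algorithm 3: map each series T_i to the vector
    (SubDist(s_H^1, T_i), ..., SubDist(s_H^k, T_i)) of Table 1.
    'shapelets' is a list of subsequences (the split points are not needed here);
    sub_dist is the sketch given earlier for Eq. (6)."""
    X = np.empty((len(dataset), len(shapelets)))
    for i, series in enumerate(dataset):
        for j, s in enumerate(shapelets):
            X[i, j] = sub_dist(s, series)
    return X  # rows = time series, columns = distances to the k shapelets

# The transformed matrix can then be paired with the class labels and handed
# to any conventional attribute-value learner (e.g., a decision tree).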
Fig. 3. Illustration of the exhaustive approach disadvantage.
Fig. 4. Experimental configuration (flow diagram: shapelets discovery with ST, STk and ST+; feature selection with CS, CFS and FCBF; shapelet transformation; building models; evaluation of results).
3. Extension approaches

In this section we present our extension approaches to overcome some disadvantages of the shapelet transformation algorithm. We propose three approaches: exhaustive, relaxed and reduced.

3.1. Exhaustive approach

In the exhaustive approach (STk), in order to explore the parameters min and max of the original algorithm [30], we suggest an exhaustive search process. In this approach, the EstimateMinMax algorithm is no longer necessary; thus, all possible subsequence lengths are analyzed to find the best k shapelets. The basis of this proposal rests on the idea that better shapelets are not considered when the length interval of the subsequences is restricted. The smallest significant subsequence length is min = 3, and the largest possible length for a shapelet is the length of the time series itself, max = m. As a result of using these parameters we process more subsequences than the ST algorithm; however, the time complexity remains the same, O(M²m⁴) in the worst case. Also, the value of the parameter k is fixed to m/2 as in [31]; therefore, the space complexity does not change in relation to ST, that is, O(kM).

3.2. Relaxed approach

In the original shapelet transformation algorithm the authors proposed to use only the best k shapelets, and they presented two strategies. The first one uses a fixed value k = m/2 and the second is based upon a cross-validation using the training set.
Table 2
Summary of datasets.

| # | Dataset | w | Mtrain | Mtest | m |
| 1 | Adiac | 37 | 390 | 391 | 176 |
| 2 | Beef | 5 | 30 | 30 | 470 |
| 3 | CBF | 3 | 30 | 900 | 128 |
| 4 | ChlorineConcentration | 3 | 467 | 3840 | 166 |
| 5 | Coffee | 2 | 28 | 28 | 286 |
| 6 | DiatomSizeReduction | 4 | 16 | 306 | 345 |
| 7 | ECG200 | 2 | 100 | 100 | 96 |
| 8 | ECGFiveDays | 2 | 23 | 861 | 136 |
| 9 | FaceAll | 14 | 560 | 1690 | 131 |
| 10 | FaceFour | 4 | 24 | 88 | 350 |
| 11 | FacesUCR | 14 | 200 | 2050 | 131 |
| 12 | Fish | 7 | 175 | 175 | 463 |
| 13 | GunPoint | 2 | 50 | 150 | 150 |
| 14 | ItalyPowerDemand | 2 | 67 | 1029 | 24 |
| 15 | Lighting2 | 2 | 60 | 61 | 637 |
| 16 | Lighting7 | 7 | 70 | 73 | 319 |
| 17 | MedicalImages | 10 | 381 | 760 | 99 |
| 18 | MoteStrain | 2 | 20 | 1252 | 84 |
| 19 | OliveOil | 4 | 30 | 30 | 570 |
| 20 | OSULeaf | 6 | 200 | 242 | 427 |
| 21 | SonyAIBORobotSurface | 2 | 20 | 601 | 70 |
| 22 | SwedishLeaf | 15 | 500 | 625 | 128 |
| 23 | Symbols | 6 | 25 | 995 | 398 |
| 24 | SyntheticControl | 6 | 300 | 300 | 60 |
| 25 | Trace | 4 | 100 | 100 | 275 |
| 26 | TwoPatterns | 4 | 1000 | 4000 | 128 |
| 27 | TwoLeadECG | 2 | 23 | 1139 | 82 |
Both strategies are based upon the length m of the time series to compute the value of k. However, to the best of our knowledge, there is no evidence in the literature demonstrating the relevance of m for determining the parameter k. Furthermore, an inadequate estimation of the parameter k can significantly deteriorate the quality of the shapelet transformation and, therefore, the quality of the classification [31]. A possible problem is the grouping of similar shapelets among most of the k selected shapelets. Shapelets are considered similar when they represent the same morphological information in the same or in different time series. Thus, a set of shapelets with the highest information gain can be selected over all other shapelets (with lower information gain) that would better represent other classes. As in the example shown in Fig. 3, let a time series domain be characterized by three classes {c_1, c_2, c_3} and let k = 4 be the number of shapelets to be selected from a universe of 8 identified shapelets. The four highest-quality shapelets may describe only class c_1, while the shapelets for classes c_2 and c_3 will not be considered. Using the STk approach as a basis, the idea of the relaxed approach (ST+) is simply to omit the k parameter, so that all the subsequences are kept during the search process and also in the final representation. The purpose of this modification is to eliminate the estimation of the k parameter, enabling the retention of the shapelets that best represent each of the classes. A major disadvantage of this approach is the need to store all non-overlapping subsequences. Thus, the space complexity is O((m/3)M) in the worst case, because min = 3.

3.3. Reduced approach

Using the ST+ approach, the estimation of the parameter k is unnecessary; therefore, all the possible subsequences are used in the shapelet representation.
Table 3
Experimental results for evaluation 1. The values are presented in the acc (rank) form, where acc is the classification accuracy (%) on the test set, and rank is the relative position of the algorithm in comparison to the others for a given dataset. The highlighted cells represent the best accuracy for each dataset.

| # | ST | STk | ST+ | ST+CFS | ST+CS | ST+FCBF |
| 1 | 24.30 (6.00) | 89.00 (4.00) | 87.47 (5.00) | 91.56 (3.00) | 92.33 (1.00) | 91.82 (2.00) |
| 2 | 60.00 (6.00) | 73.33 (2.00) | 73.33 (2.00) | 73.33 (2.00) | 63.33 (5.00) | 70.00 (4.00) |
| 4 | 56.48 (6.00) | 98.88 (2.00) | 98.20 (4.00) | 98.33 (3.00) | 99.06 (1.00) | 98.02 (5.00) |
| 5 | 85.71 (4.50) | 85.71 (4.50) | 85.71 (4.50) | 85.71 (4.50) | 96.43 (1.50) | 96.43 (1.50) |
| 6 | 75.16 (1.00) | 67.97 (2.00) | 66.34 (4.00) | 67.65 (3.00) | 56.21 (5.00) | 47.39 (6.00) |
| 8 | 96.17 (6.00) | 97.10 (3.00) | 97.10 (3.00) | 97.10 (3.00) | 97.10 (3.00) | 97.10 (3.00) |
| 10 | 76.14 (1.00) | 67.05 (4.00) | 42.05 (6.00) | 51.14 (5.00) | 73.86 (2.50) | 73.86 (2.50) |
| 13 | 90.67 (6.00) | 92.00 (3.00) | 92.00 (3.00) | 92.00 (3.00) | 92.00 (3.00) | 92.00 (3.00) |
| 14 | 90.96 (5.00) | 91.64 (2.50) | 91.64 (2.50) | 91.64 (2.50) | 90.28 (6.00) | 91.64 (2.50) |
| 16 | 53.42 (3.00) | 52.05 (4.50) | 47.95 (6.00) | 53.42 (2.00) | 60.27 (1.00) | 52.05 (4.50) |
| 17 | 44.87 (6.00) | 84.61 (3.00) | 83.16 (4.00) | 85.53 (1.50) | 81.32 (5.00) | 85.53 (1.50) |
| 18 | 84.42 (6.00) | 85.30 (3.00) | 85.30 (3.00) | 85.30 (3.00) | 85.30 (3.00) | 85.30 (3.00) |
| 21 | 84.53 (3.00) | 73.38 (5.00) | 73.38 (5.00) | 73.38 (5.00) | 86.02 (1.50) | 86.02 (1.50) |
| 23 | 47.14 (6.00) | 57.39 (4.50) | 57.39 (4.50) | 66.33 (3.00) | 75.38 (1.00) | 74.17 (2.00) |
| 24 | 90.33 (6.00) | 93.67 (5.00) | 95.33 (3.00) | 97.33 (1.00) | 95.00 (4.00) | 97.00 (2.00) |
| 25 | 98.00 (2.00) | 98.00 (2.00) | 98.00 (2.00) | 95.00 (6.00) | 97.00 (4.50) | 97.00 (4.50) |
| 26 | 85.25 (6.00) | 85.51 (4.00) | 85.51 (4.00) | 85.51 (4.00) | 96.58 (1.50) | 96.58 (1.50) |
| avg. acc | 73.15 | 81.92 | 79.99 | 81.78 | 84.56 | 84.23 |
| avg. rank | 4.68 | 3.41 | 3.85 | 3.21 | 2.91 | 2.94 |
| 1 vs. all | 3 | 6 | 6 | 7 | 10 | 8 |
As discussed in [30,31], selecting too few shapelets in ST may not provide enough information to construct representative classifiers for a given domain. On the other hand, using too many shapelets can produce models that overfit the training set and dilute the influence of the most representative shapelets, depending on the classifier used. To handle this problem in ST+, we propose the reduced approach, which is based on a post-processing filter strategy that selects a smaller subset of shapelets according to a quality criterion. Herein, we apply this filter through feature selection algorithms. Among the subset selection strategies, we adopt filter algorithms because they do not require any learning algorithm: they provide an automatic estimation of the parameter k and enable the analysis of intrinsic feature relationships in the data. Using this strategy, we can significantly reduce the number of shapelets in the final representation; thus the space complexity is O(kM) in the worst case, where we can expect that k ≪ m/3. Herein we propose the use and evaluation of three traditional subset selection algorithms (a sketch of this filtering step is given after the list):
• The Consistency-based Filter (CS) is a probabilistic algorithm which selects subsets with fewer features and high consistency with the class. CS favors subsets that have some majority class [37]. The time complexity of CS is O(k²M) in the worst case.
• The Correlation-based Feature Selection (CFS) algorithm ranks the feature subsets based on relevance and redundancy measures. Thus, the best subset contains features that are highly correlated with the class and uncorrelated with each other [38]. The time complexity of CFS is O(k²M) in the worst case.
• The Fast Correlation-based Filter (FCBF) algorithm evaluates the features in two separate steps. First, a set of features that is highly correlated with the class is selected. Second, the algorithm applies heuristics to remove the redundant features and keep the features most relevant to the class [39]. The advantage of the FCBF two-step strategy is its time complexity, O(k²) in the worst case.
As a result of combining ST+ with these subset selection algorithms, three strategies for the reduced approach are created: ST+CS, ST+CFS and ST+FCBF.
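The sketch below illustrates the reduced approach as a pipeline: a simplified, FCBF-flavored filter based on symmetric uncertainty selects a subset of the shapelet-transformed features, which is then passed to a decision tree. It is only an approximation of the filters cited above [37–39]; the discretization scheme, the function names and the thresholds are our assumptions.

import numpy as np

def _entropy(x):
    _, c = np.unique(x, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(a, b):
    """SU(a, b) = 2 * I(a; b) / (H(a) + H(b)) for two discrete vectors."""
    ha, hb = _entropy(a), _entropy(b)
    joint = _entropy([f"{u}|{v}" for u, v in zip(a, b)])
    mi = ha + hb - joint
    return 0.0 if ha + hb == 0 else 2.0 * mi / (ha + hb)

def fcbf_like_filter(X, y, n_bins=10, delta=0.0):
    """Simplified FCBF-style filter: keep features whose SU with the class exceeds
    delta, then drop a feature if some better-ranked kept feature is more predictive
    of it than the class is (redundancy test). Only a sketch of the idea in [39]."""
    disc = np.array([np.digitize(col, np.histogram_bin_edges(col, bins=n_bins)[1:-1])
                     for col in X.T])
    su_class = np.array([symmetric_uncertainty(f, y) for f in disc])
    order = [j for j in np.argsort(-su_class) if su_class[j] > delta]
    selected = []
    for j in order:
        if all(symmetric_uncertainty(disc[j], disc[i]) < su_class[j] for i in selected):
            selected.append(j)
    return selected

# Reduced approach in a nutshell: filter the shapelet-transformed training matrix,
# then learn an interpretable model on the surviving columns only, e.g.:
#   from sklearn.tree import DecisionTreeClassifier
#   selected = fcbf_like_filter(X_train, y_train)
#   tree = DecisionTreeClassifier().fit(X_train[:, selected], y_train)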
Fig. 5. Critical difference diagram using the Bonferroni–Dunn post hoc test.

Table 4
Comparison results between the reduced approach variations and ST, in terms of wins, losses and draws.

|        | ST+CS | ST+CFS | ST+FCBF |
| Wins   | 13 | 11 | 13 |
| Losses | 4 | 4 | 4 |
| Draws  | 0 | 2 | 0 |
4. Experimental evaluation and results

In this section we present an empirical evaluation of the original shapelet transformation algorithm (ST) and our proposed extensions (Subsection 4.3). Moreover, we evaluate our approaches in comparison to embedded shapelet methods, such as logical shapelets (LS) and fast shapelets (FS), and to state-of-the-art methods, such as 1NN ED and 1NN DTW (Subsection 4.4).

4.1. Datasets

The datasets used in this experimental evaluation include 27 time series domains provided by the UCR benchmark [40], which is widely used by the time series community. All datasets available in the UCR are partitioned into training and testing sets, which include series of various sizes and numbers of classes. Furthermore, the UCR benchmark contains time series from artificial fields and from real-world domains. In our experiments we used 27 of the 45 datasets available in the UCR. Those sets were used in [32] to evaluate the LS and FS algorithms, and also in [40] to evaluate the 1NN ED and 1NN DTW algorithms. In [31], 17 of the 27 datasets were used to evaluate ST. Table 2 describes the relevant information for each dataset used in this experimental evaluation.
Fig. 6. Graphical representation of the results obtained by reduced approach variations (ST+CFS, ST+CS and ST+FCBF) versus the original shapelet transformation algorithm (ST). Each point represents a distinct dataset. The points above the diagonal indicate that the reduced approach has a better accuracy performance.
Table 5
Experimental results for evaluation 2. The values are presented in the acc (rank) form, where acc is the classification accuracy (%) on the test set, and rank is the relative position of the algorithm in comparison to the others for a given dataset. The highlighted cells represent the best accuracy for each dataset.

| # | LS | FS | 1NN ED | 1NN DTW | ST+CFS | ST+CS | ST+FCBF |
| 1 | 58.60 (6.00) | 48.60 (7.00) | 61.10 (4.00) | 60.40 (5.00) | 91.56 (3.00) | 92.33 (1.00) | 91.82 (2.00) |
| 2 | 56.70 (4.00) | 55.30 (5.00) | 53.30 (6.00) | 50.00 (7.00) | 73.33 (1.00) | 63.33 (3.00) | 70.00 (2.00) |
| 3 | 88.60 (5.00) | 94.70 (2.00) | 85.20 (6.00) | 99.70 (1.00) | 81.00 (7.00) | 90.78 (4.00) | 91.89 (3.00) |
| 4 | 61.80 (6.00) | 58.30 (7.00) | 65.00 (4.00) | 64.80 (5.00) | 98.33 (2.00) | 99.06 (1.00) | 98.02 (3.00) |
| 5 | 96.40 (3.00) | 93.20 (4.00) | 75.00 (7.00) | 82.10 (6.00) | 85.71 (5.00) | 96.43 (1.50) | 96.43 (1.50) |
| 6 | 80.10 (4.00) | 88.30 (3.00) | 93.50 (2.00) | 96.70 (1.00) | 67.65 (5.00) | 56.21 (6.00) | 47.39 (7.00) |
| 7 | 87.00 (2.00) | 76.60 (6.00) | 88.00 (1.00) | 77.00 (5.00) | 85.00 (3.00) | 83.00 (4.00) | 73.00 (7.00) |
| 8 | 99.40 (2.00) | 99.60 (1.00) | 79.70 (6.00) | 76.80 (7.00) | 97.10 (4.00) | 97.10 (4.00) | 97.10 (4.00) |
| 9 | 65.90 (6.00) | 58.90 (7.00) | 71.40 (3.00) | 80.80 (1.00) | 67.99 (5.00) | 72.78 (2.00) | 70.06 (4.00) |
| 10 | 48.90 (7.00) | 91.00 (1.00) | 78.40 (3.00) | 83.00 (2.00) | 51.14 (6.00) | 73.86 (4.50) | 73.86 (4.50) |
| 11 | 66.20 (7.00) | 67.20 (6.00) | 76.90 (4.00) | 90.49 (1.00) | 78.00 (3.00) | 69.80 (5.00) | 79.85 (2.00) |
| 12 | 77.70 (7.00) | 80.30 (5.00) | 78.30 (6.00) | 83.30 (4.00) | 94.29 (1.00) | 90.29 (3.00) | 91.43 (2.00) |
| 13 | 89.30 (7.00) | 93.90 (1.00) | 91.30 (5.00) | 90.70 (6.00) | 92.00 (3.00) | 92.00 (3.00) | 92.00 (3.00) |
| 14 | 93.60 (3.00) | 90.50 (6.00) | 95.50 (1.00) | 95.00 (2.00) | 91.64 (4.50) | 90.28 (7.00) | 91.64 (4.50) |
| 15 | 42.60 (7.00) | 70.50 (3.00) | 75.40 (2.00) | 86.90 (1.00) | 65.57 (4.50) | 50.82 (6.00) | 65.57 (4.50) |
| 16 | 54.80 (5.00) | 76.70 (1.00) | 57.50 (4.00) | 72.60 (2.00) | 53.42 (6.00) | 60.27 (3.00) | 52.05 (7.00) |
| 17 | 58.70 (6.00) | 56.70 (7.00) | 68.40 (5.00) | 73.70 (4.00) | 85.53 (1.50) | 81.32 (3.00) | 85.53 (1.50) |
| 18 | 83.20 (6.00) | 78.30 (7.00) | 87.90 (1.00) | 83.50 (5.00) | 85.30 (3.00) | 85.30 (3.00) | 85.30 (3.00) |
| 19 | 83.33 (6.00) | 72.30 (7.00) | 86.70 (2.50) | 86.70 (2.50) | 90.00 (1.00) | 86.67 (4.50) | 86.67 (4.50) |
| 20 | 68.60 (3.00) | 68.00 (4.00) | 51.70 (7.00) | 59.10 (6.00) | 78.10 (1.00) | 64.05 (5.00) | 77.27 (2.00) |
| 21 | 86.00 (3.00) | 68.60 (7.00) | 69.50 (6.00) | 72.50 (5.00) | 73.38 (4.00) | 86.02 (1.50) | 86.02 (1.50) |
| 22 | 81.30 (4.00) | 73.10 (7.00) | 78.70 (6.00) | 79.00 (5.00) | 89.60 (2.00) | 90.08 (1.00) | 88.80 (3.00) |
| 23 | 64.30 (7.00) | 93.20 (2.00) | 90.00 (3.00) | 95.00 (1.00) | 66.33 (6.00) | 75.38 (4.00) | 74.17 (5.00) |
| 24 | 47.00 (7.00) | 91.90 (5.00) | 88.00 (6.00) | 99.30 (1.00) | 97.33 (2.00) | 95.00 (4.00) | 97.00 (3.00) |
| 25 | 100.00 (1.50) | 99.80 (3.00) | 76.00 (7.00) | 100.00 (1.50) | 95.00 (6.00) | 97.00 (4.50) | 97.00 (4.50) |
| 26 | 85.60 (6.00) | 91.00 (4.50) | 91.00 (4.50) | 100.00 (1.00) | 85.51 (7.00) | 96.58 (2.50) | 96.58 (2.50) |
| 27 | 53.90 (7.00) | 88.70 (2.00) | 74.70 (6.00) | 90.40 (1.00) | 85.53 (3.00) | 84.18 (4.00) | 82.23 (5.00) |
| avg. acc | 73.32 | 78.71 | 77.34 | 82.57 | 81.68 | 82.22 | 82.91 |
| avg. rank | 5.09 | 4.46 | 4.37 | 3.30 | 3.69 | 3.52 | 3.57 |
| 1 vs. all | 1 | 4 | 3 | 10 | 5 | 5 | 3 |
For each dataset there are: the number of classes (w), the size of the training set (Mtrain), the size of the test set (Mtest), and the length of the time series (m).

4.2. Experimental configuration

The experimental assessment of our approaches has been organized in four stages. In Fig. 4, the solid lines represent the flow of the experiments. Our experiments were performed in a JAVA environment on a Linux/Ubuntu system running a 3.00 GHz Intel Xeon (CPU E5-2690 v2) processor with 40 cores and 250 GB of RAM. Additional material about the experiments is available on a web page:
http://www.willianz.tk/research/kbs2015.
(1) Shapelets discovery: in this step the shapelet discovery² process is performed, according to the approaches proposed herein. (2) Shapelet transformation: in this step, the shapelets identified in step (1) were used to transform the training and testing sets into the attribute-value representation. (3) Building models: in this step we used only the training set, according to the splitting of the data proposed in the UCR (Table 2). This splitting is the same as that adopted by the other methods investigated in our experiments. Furthermore, we used the J48 algorithm with the default parameter values from the WEKA [41] framework for the construction of the classification models (we performed only one run over the training set).
2 The source code of shapelet discovery was provided by Dr. Jason Lines, and the original code has been updated at project page http://www.uea.ac.uk/computing/ machine-learning/shapelets/shapelet-code.
Fig. 7. Critical difference diagram using the Nemenyi post hoc test.
(4) Evaluation of results: in this step, we evaluated the results using the accuracy on each test set. We performed a holdout evaluation according to the training and test splitting of the data provided by the UCR. Despite the existence of several other performance measures, accuracy was used because it is the most widely used measure in the community.

4.3. Evaluation of the extension approaches

In this evaluation we analyzed the performance of the algorithms based upon the shapelet transformation approach. The idea behind this evaluation is to check whether there is any significant improvement that justifies the use of the proposed extensions in comparison to the original shapelet transformation algorithm. In this scenario, we compared our extension approaches (STk, ST+, ST+CS, ST+CFS and ST+FCBF) to the original shapelet transformation algorithm (ST). In these experiments we used 17 of the 27 datasets presented in Table 2, which were also used in [30,31]. The experimental results are shown in Table 3, where the results for the ST algorithm were obtained from [31]. The values are presented in the acc (rank) format: acc corresponds to the classification accuracy on each test set (in percentage), and rank is the relative position of the algorithm in comparison to the others for a given dataset.
The highlighted cells represent the best accuracy for a given dataset, considering all evaluated algorithms. The last three lines represent, for each algorithm, the average accuracy, the average rank, and the number of times that the algorithm showed the best accuracy (1 vs. all). As recommended in [42], the Friedman test can be used to perform the statistical performance analysis of different algorithms applied to distinct datasets. In this experimental assessment, we used the Iman and Davenport version [43] of the Friedman test to test the null-hypothesis that all algorithms have similar performance and that the observed differences are merely random. Additionally, we used the Bonferroni–Dunn post hoc test to examine whether there was any significant difference in comparison to the ST algorithm. The null-hypothesis that all algorithms have the same accuracy for α = 5% was rejected with p-value = 0.0458 (the F-statistic is equal to 2.38). Using the Bonferroni–Dunn post hoc test for α = 5%, the critical difference was CD = 1.65. Fig. 5 shows the critical difference diagram for ranked accuracies. The diagram is a graphical representation derived from the overall test of significance of the average ranks, where classifiers are grouped into cliques represented by solid bars. Based on the results shown in Fig. 5, we can see that the algorithms ST+CS and ST+FCBF are not included in the clique of ST. Therefore, their performance differences are significant in comparison to the ST algorithm. In Fig. 6, we summarize those results by plotting the accuracy performance for each dataset into pairwise scatter plots. The points above the diagonal line indicate that the method on the horizontal axis has greater accuracy, and the points below the diagonal line indicate that the method on the vertical axis has greater accuracy. Detailed results are presented in Table 4. One of the advantages associated with the reduced approach is that the feature selection algorithms used in this work are based on intrinsic characteristics of the data.
Fig. 8. Graphical representation of the results obtained by reduced approach variations (ST+CFS, ST+CS and ST+FCBF) versus the embedded methods (LS and FS). Each point represents a different dataset. The points above the diagonal indicate that the reduced approach has a better accuracy performance.
Fig. 9. Graphical representation of the results obtained by reduced approach variations (ST+CFS, ST+CS and ST+FCBF) versus the state-of-the-art algorithms (1NN ED and 1NN DTW). Each point represents a different dataset. The points above the diagonal indicate that the reduced approach has a better accuracy performance.
Table 6
Comparison results between the reduced approach and the embedded methods, in terms of wins, losses and draws.

|        | ST+CFS vs LS | ST+CS vs LS | ST+FCBF vs LS | ST+CFS vs FS | ST+CS vs FS | ST+FCBF vs FS |
| Wins   | 17 | 21 | 21 | 15 | 15 | 16 |
| Losses | 10 | 6 | 6 | 12 | 12 | 11 |
| Draws  | 0 | 0 | 0 | 0 | 0 | 0 |
Thus, the shapelets are selected based on their relationships with one another and with the domain classes. Moreover, the reduced approach does not require the estimation of parameters such as the minimum and maximum size of the subsequences or the number of shapelets (k). Based on the results and analysis presented in this section, we can observe that all proposed approaches presented better average rank values than ST, in particular ST+CS, ST+CFS and ST+FCBF, which presented the best average rank values. Furthermore, we observed that exploring all possible lengths of subsequences, as in the exhaustive approach (STk), and keeping all the identified subsequences, as in the relaxed approach (ST+), were not sufficient to demonstrate superior performance in comparison to ST.

4.4. Reduced approach versus embedded shapelet and state-of-the-art methods

The idea behind this evaluation is to compare our proposal to the state-of-the-art methods for time series classification and to other shapelet-based algorithms. In this evaluation, we used the reduced approach algorithms (ST+CS, ST+CFS and ST+FCBF), because they present the best results among the algorithms based on the shapelet transformation (ST).
Table 7
Comparison results between the reduced approach and the state-of-the-art methods, in terms of wins, losses and draws.

|        | ST+CFS vs 1NN ED | ST+CS vs 1NN ED | ST+FCBF vs 1NN ED | ST+CFS vs 1NN DTW | ST+CS vs 1NN DTW | ST+FCBF vs 1NN DTW |
| Wins   | 16 | 18 | 17 | 14 | 13 | 12 |
| Losses | 11 | 9 | 10 | 13 | 14 | 15 |
| Draws  | 0 | 0 | 0 | 0 | 0 | 0 |
We analyzed the performance of the reduced approach in comparison to the embedded shapelet methods (LS and FS) and in comparison to the state-of-the-art algorithms (1NN using the ED and DTW distance measures). The experimental results are shown in Table 5. The values are presented in the acc (rank) format, where acc corresponds to the classification accuracy on the test set (in percentage), and rank is the relative position of the algorithm in comparison to the others for a given dataset. The highlighted cells represent the best accuracy for a given dataset, considering all evaluated algorithms. The last three lines represent, for each algorithm, the average accuracy, the average rank, and the number of times that an algorithm presented the best accuracy (1 vs. all), respectively. The results for the LS and FS algorithms were obtained from [32], and the results for 1NN were gathered from [44]. In these experiments we used all datasets presented in Table 2, and we again used the Friedman test. The null-hypothesis that all algorithms have the same accuracy for α = 5% was rejected with p-value = 0.0149 (the F-statistic is equal to 2.59). We used the Nemenyi post hoc test at α = 5% to find significant differences. The critical difference was CD = 1.73. Fig. 7 shows the critical difference diagram for ranked accuracies.
Fig. 10. Texas sharpshooter plot for reduced approach (ST+CFS, ST+CS and ST+FCBF) versus 1NN (using ED and DTW).
Based on the results shown in Fig. 7, we can observe that the only significant difference is between LS and 1NN DTW. However, we point out the better average ranks of the reduced approach in comparison with the LS, FS and 1NN ED algorithms. Only the 1NN DTW method presented a better average rank than our approaches.

4.4.1. Reduced approach versus embedded methods

We did not find any statistical difference between the reduced approach and the embedded methods. However, on several datasets, the reduced approach presented better accuracy results. In comparison to the LS and FS algorithms, ST+CFS presented the best accuracy in 63% and 56% of the datasets, respectively; ST+CS was better in 78% and 56% of the datasets; and ST+FCBF was better in 78% and 59% of the datasets. Table 6 shows the comparison in terms of wins, losses and draws. In Fig. 8, we summarize those results by plotting the accuracy performance for each dataset into pairwise scatter plots. Despite the similar performance of the reduced approach variations compared to the FS algorithm, we emphasize the non-determinism of the FS algorithm, which can generate a different tree for each execution. Also, FS only presented a better average rank than the LS algorithm.

4.4.2. Reduced approach versus state-of-the-art

No statistical difference was found between our reduced approach and the state-of-the-art algorithms (1NN using ED and DTW). Notwithstanding, the accuracy results for the reduced approach were better than those of 1NN ED. In comparison to the 1NN ED and 1NN DTW algorithms, ST+CFS presented the best accuracy in 59% and 52% of the datasets, respectively; ST+CS presented better performance in 67% and 48% of the datasets; and ST+FCBF was better in 63% and 44% of the datasets. Table 7 shows the comparison in terms of wins, losses and draws. In Fig. 9, we summarize those results by plotting the accuracy performance for each dataset into pairwise scatter plots.
Based on these experimental results, we can see that our reduced approach has accuracy performance similar to the state-of-the-art algorithms. In addition, our approaches are based only on symbolic models, such as decision trees. We emphasize that the 1NN DTW method presented the best average rank.

4.4.3. Texas sharpshooter fallacy

In [18] an alternative analysis of the results for time series classification problems was presented. The authors proposed evaluating the results by using a type of plot based on the Texas sharpshooter fallacy. The motivation behind this proposal is that many papers have reported results using only the accuracy on the test set and claimed better performance on some datasets. Therefore, the authors argue that the utility of an algorithm should be indicated ahead of time, stating for which datasets the algorithm is expected to perform well. They proposed the use of the function gain = A/B to measure the expected gain (on training data) and the actual gain (on testing data), where A and B represent the accuracy performance of algorithms A and B, respectively. The plot of expected gain vs. actual gain can be understood as a continuous form of contingency table, and it can be divided into four regions: region TP (True Positive) indicates that algorithm A is more accurate than B for training and testing; region TN (True Negative), that algorithm B is more accurate than A for training and testing; region FN (False Negative), that algorithm A is more accurate than B for testing, but not for training; and region FP (False Positive), that algorithm B is more accurate than A for testing, but not for training. In Fig. 10, we show the plots of the gain comparisons between our reduced approach variations (ST+CFS, ST+CS and ST+FCBF) and the state-of-the-art algorithm 1NN (using ED and DTW), where each point inside a plot represents a dataset. In order to calculate the expected gain we used leave-one-out cross-validation, and for the actual gain we used the accuracy results presented in Table 5.
Table 8
Results for each region of the Texas sharpshooter plot.

|         | TP (vs 1NN ED) | TN | FP | FN | TP (vs 1NN DTW) | TN | FP | FN |
| ST+CFS  | 12 | 4 | 5 | 1 | 14 | 6 | 3 | 0 |
| ST+CS   | 13 | 2 | 6 | 1 | 14 | 4 | 5 | 0 |
| ST+FCBF | 13 | 3 | 7 | 0 | 13 | 5 | 5 | 0 |
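The region assignment used in Fig. 10 and Table 8 reduces to a comparison of two accuracy ratios; the following sketch, with hypothetical accuracy values, illustrates it.

def sharpshooter_region(expected_gain, actual_gain):
    """Classify one dataset into the TP/TN/FP/FN regions of the Texas sharpshooter
    plot: gains are accuracy ratios A/B (> 1 means algorithm A beats baseline B)."""
    if expected_gain > 1 and actual_gain > 1:
        return "TP"   # improvement predicted on training data and confirmed on test data
    if expected_gain <= 1 and actual_gain <= 1:
        return "TN"   # no improvement predicted, none observed
    if expected_gain <= 1 and actual_gain > 1:
        return "FN"   # test-set win that the training data did not anticipate
    return "FP"       # predicted win that did not materialise on the test set

# Example with hypothetical accuracies: expected gain from leave-one-out on the
# training set, actual gain from the held-out test set.
expected = 0.92 / 0.88   # acc_A(train, LOOCV) / acc_B(train, LOOCV)
actual = 0.90 / 0.91     # acc_A(test) / acc_B(test)
print(sharpshooter_region(expected, actual))  # -> "FP"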
Furthermore, because some datasets were not used in the work from which the results were collected, we stress that for those plots we used 23 of the 27 datasets presented in Table 2. Based on the analysis of Fig. 10, we can see that most of the points fall in the TP and TN regions. More precisely, the performance of our approaches was correctly estimated for 73%, 68% and 70% of the datasets, respectively, in comparison to 1NN ED, and for 87%, 78% and 78% of the datasets, respectively, in comparison to 1NN DTW. Detailed results for each region are presented in Table 8. The results show that we can confidently predict when our approaches will outperform or will be outperformed by the competing algorithms.

5. Conclusion

In this work we investigated shapelet transformation for time series classification using decision trees. In particular, we proposed three approaches to evaluate and overcome the disadvantages of the original algorithm. Among the extensions presented, we obtained the most interesting results using the reduced approach. Using this approach, we eliminated the need for estimating the minimum and maximum lengths of the subsequences as well as the number of selected shapelets (parameter k). In order to evaluate the proposed approaches, we provided experimental analyses using datasets widely studied in the time series literature. Based upon the experimental results we demonstrated the superiority of our variations ST+CS and ST+FCBF, in terms of accuracy, in relation to the original shapelet transformation method (ST). In addition, we performed an experimental comparison between our reduced approach and the embedded shapelet algorithms (LS and FS), and the state-of-the-art methods (1NN using ED and DTW). In this analysis we did not find a statistical difference for the reduced approach. However, except for 1NN DTW, the reduced approach variations presented better mean rank values than all other evaluated algorithms. These results show how promising the reduced approach can be for time series classification. Based on the experimental results presented in this work, we can conclude that the quality of the representation of the shapelet transformation affects the performance of decision tree classifiers. Therefore, the development of techniques for improving shapelet identification processes can contribute to the construction of interpretable classifiers that are competitive with non-symbolic methods in terms of accuracy. Future work includes the extension of the reduced approach to early classification problems. Another interesting extension of our work would be the analysis of order relationships between shapelets using sequential rule mining techniques.

Acknowledgments

We would like to thank Dr. Jason Lines for providing the original source code of the shapelet transformation algorithm and Dr. Eamonn Keogh for providing the datasets used in the experimental evaluation. We also would like to acknowledge the Brazilian Funding Agency (CAPES) for the financial support of this work.
References [1] A.G. Maletzke, H.D. Lee, G. Enrique, A.P.A. Batista, C.S.R. Coy, J.J. Fagundes, W.F. Chung, Time series classification with motifs and characteristics, in: R. Espin, R.B. Prez, A. Cobo, J. Marx, A.R. Valds (Eds.), Soft Computing for Business Intelligence, Studies in Computational Intelligence, 537, Springer Berlin Heidelberg, 2014, pp. 125–138. [2] C.M. Antunes, A.L. Oliveira, Temporal data mining: an overview, in: KDD Workshop on Temporal Data Mining, 2001, pp. 1–13. [3] M. Last, A. Kandel, H. Bunke, Data Mining in Time Series Databases, World Scientific Publishing, 2004. [4] F. Morchen, Time Series Knowledge Mining, Doctoral thesis Department of Mathematics and Computer Science–Philipps-University, Marburg, Hesse, Germany, 2006. [5] T. chung Fu, A review on time series data mining, Eng. Appl. Artif. Intell. 24 (1) (2011) 164–181, doi:10.1016/j.engappai.2010.09.007. [6] P. Esling, C. Agon, Time-series data mining, ACM Comput. Surv. 45 (1) (2012) 12:1–12:34, doi:10.1145/2379776.2379788. [7] B.R. Bakshi, G. Locher, G. Stephanopoulos, G. Stephanopoulous, Analysis of operating data for evaluation, diagnosis and control of batch operations, J. Process Control 4 (4) (1994) 179–194. [8] E. Keogh, M. Pazzani, An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback, in: R. Agrawal, P. Stolorz, G. Piatetsky-Shapiro (Eds.), 4th International Conference on Knowledge Discovery and Data Mining (KDD’98), ACM Press, New York City, NY, 1998, pp. 239–241. [9] H. Zhang, T. Ho, M. Lin, A non-parametric wavelet feature extractor for time series classification, in: H. Dai, R. Srikant, C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, 3056, Springer Berlin Heidelberg, 2004, pp. 595–603. [10] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, E. Keogh, Querying and mining of time series data: experimental comparison of representations and distance measures, Proc. VLDB Endow. 1 (2) (2008) 1542–1552, doi:10.14778/ 1454159.1454226. [11] K. Buza, L. Schmidt-Thieme, Motif-based classification of time series with Bayesian networks and SVMs, in: A. Fink, B. Lausen, W. Seidel, A. Ultsch (Eds.), Advances in Data Analysis, Data Handling and Business Intelligence, Studies in Classification, Data Analysis, and Knowledge Organization, Springer Berlin Heidelberg, 2010, pp. 105–114. [12] A. Bagnall, L.M. Davis, J. Hills, J. Lines, Transformation based ensembles for time series classification, in: Proceedings of the 12th SIAM International Conference on Data Mining, Anaheim, California, USA, April 26–28, 2012., 2012, pp. 307–318, doi:10.1137/1.9781611972825.27. [13] J. Lin, R. Khade, Y. Li, Rotation-invariant similarity in time series using bag-of– patterns representation, J. Intell. Inf. Syst. 39 (2) (2012) 287–315. [14] J. Grabocka, A. Nanopoulos, L. Schmidt-Thieme, Invariant time-series classification, in: P. Flach, T. De Bie, N. Cristianini (Eds.), Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, 7524, Springer Berlin Heidelberg, 2012, pp. 725–740. [15] M. Baydogan, G. Runger, E. Tuv, A bag-of-features framework to classify time series, Pattern Anal. Mach. Intell. IEEE Trans. 35 (11) (2013) 2796–2802, doi:10. 1109/TPAMI.2013.72. [16] V. Souza, D. Silva, G. Batista, Extracting texture features for time series classification, in: 22nd International Conference on Pattern Recognition (ICPR), 2014, 2014, pp. 1425–1430, doi:10.1109/ICPR.2014.254. [17] X. 
Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, E. Keogh, Experimental comparison of representation methods and distance measures for time series data, Data Min. Knowl. Discov. 26 (2) (2013) 275–309, doi:10.1007/ s10618- 012- 0250- 5. [18] G.E.A.P.A. Batista, X. Wang, E.J. Keogh, A complexity-invariant distance measure for time series, in: Proceedings of the 11th SIAM International Conference on Data Mining (SDM 2011), SIAM/Omnipress, Mesa, Arizona, USA, 2011, pp. 699–710. [19] P. Geurts, Pattern extraction for time series classification, in: Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, Springer-Verlag, London, UK, 2001, pp. 115–127. [20] Y. Yamada, E. Suzuki, H. Yokoi, K. Takabayashi, Experimental evaluation of time-series decision tree, in: S. Tsumoto, T. Yamaguchi, M. Numao, H. Motoda (Eds.), Active Mining, Lecture Notes in Computer Science, 3430, Springer Berlin Heidelberg, 2005, pp. 190–209. [21] L. Ye, E. Keogh, Time series shapelets: a new primitive for data mining, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, in: KDD ’09, ACM, New York, NY, USA, 2009, pp. 947– 956, doi:10.1145/1557019.1557122. [22] Z. Xing, J. Pei, P. Yu, Early classification on time series, Knowl. Inf. Syst. 31 (1) (2012) 105–127, doi:10.1007/s10115- 011- 0400- x. [23] M. Ghalwash, V. Radosavljevic, Z. Obradovic, Extraction of interpretable multivariate patterns for early diagnostics, in: Data Mining (ICDM), 2013 IEEE 13th International Conference on, 2013, pp. 201–210, doi:10.1109/ICDM.2013.19. [24] R.S. Michalski, I. Bratko, M. Kubat, Machine Learning and Data Mining, Wiley, Chichester, West Sussex, England, 1998.
[25] J.J. Rodriguez, C.J. Alonso, H. Bostrom, Boosting interval based literals, Intell. Data Anal. 5 (3) (2001) 245–262. [26] F. Morchen, Time Series Feature Extraction for Data Mining Using DWT and DFT, Technical Report 33, Department of Mathematics and Computer Science, Philipps-University Marburg, Germany, 2003. [27] M.W. Kadous, C. Sammut, Classification of multivariate time series and structured data using constructive induction, Mach. Learn. 58 (2–3) (2005) 179–216, doi:10.1007/s10994-005-5826-5. [28] P. Cotofrei, K. Stoffel, First-order logic based formalism for temporal data mining, in: T. Young Lin, S. Ohsuga, C.-J. Liau, X. Hu, S. Tsumoto (Eds.), Foundations of Data Mining and Knowledge Discovery, Studies in Computational Intelligence, 6, Springer Berlin Heidelberg, 2005, pp. 185–210, doi:10.1007/11498186-12. [29] L. Ye, E. Keogh, Time series shapelets: a novel technique that allows accurate, interpretable and fast classification, Data Min. Knowl. Discov. 22 (1–2) (2011) 149–182, doi:10.1007/s10618-010-0179-5. [30] J. Lines, L.M. Davis, J. Hills, A. Bagnall, A shapelet transform for time series classification, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, ACM, New York, NY, USA, 2012, pp. 289–297, doi:10.1145/2339530.2339579. [31] J. Hills, J. Lines, E. Baranauskas, J. Mapp, A. Bagnall, Classification of time series by shapelet transformation, Data Min. Knowl. Discov. 28 (4) (2014) 851–881, doi:10.1007/s10618-013-0322-1. [32] E.J. Keogh, T. Rakthanmanon, Fast shapelets: a scalable algorithm for discovering time series shapelets, in: Proceedings of the 13th SIAM International Conference on Data Mining, May 2–4, 2013, Austin, Texas, USA, 2013, pp. 668–676, doi:10.1137/1.9781611972832.74. [33] Q. He, Z. Dong, F. Zhuang, T. Shang, Z. Shi, Fast time series classification based on infrequent shapelets, in: Machine Learning and Applications (ICMLA), 2012 11th International Conference on, 1, 2012, pp. 215–219, doi:10.1109/ICMLA.2012.44. [34] P. Senin, S. Malinchik, SAX-VSM: interpretable time series classification using SAX and vector space model, in: 2013 IEEE 13th International Conference on Data Mining (ICDM), 2013, pp. 1175–1180, doi:10.1109/ICDM.2013.52.
[35] D. Gordon, D. Hendler, L. Rokach, Fast and space-efficient shapelets-based time-series classification, Intell. Data Anal. (IDA) 19 (5) (2015) 953–981, doi:10. 3233/IDA-150753. [36] A. Mueen, E. Keogh, N. Young, Logical-shapelets: an expressive primitive for time series classification, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, in: KDD ’11, ACM, New York, NY, USA, 2011, pp. 1154–1162, doi:10.1145/2020408.2020587. [37] H. Liu, R. Setiono, A probabilistic approach to feature selection - a filter solution, in: 13th International Conference on Machine Learning, 1996, pp. 319–327. [38] M.A. Hall, Correlation-based feature selection for discrete and numeric class machine learning, in: Proceedings of the 17th International Conference on Machine Learning, in: ICML ’00, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 20 0 0, pp. 359–366. [39] L. Yu, H. Liu, Feature selection for high-dimensional data: a fast correlation-based filter solution, in: Proceedings of the 20th International Conference on Machine Learning, AAAI Press, 2003, pp. 856–863. [40] E. Keogh, X. Xi, L. Wei, C.A. Ratanamahatana, The UCR time series classification/clustering homepage, URL http://www.cs.ucr.edu/∼eamonn/time_series_ data/, 2006. [41] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explor. Newsl. 11 (1) (2009) 10–18. [42] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30. [43] R.L. Iman, J.M. Davenport, Approximations of the critical region of the fbietkan statistic, Commun. Stat. - Theory Methods 9 (6) (1980) 571–595, doi:10.1080/ 03610928008827904. [44] D.F. Silva, V.M.A. De Souza, G.E.A.P.A. Batista, Time series classification using compression distance of recurrence plots, in: Data Mining (ICDM), 2013 IEEE 13th International Conference on, 2013, pp. 687–696, doi:10.1109/ICDM.2013. 128.