XG-SF: An XGBoost Classifier Based on Shapelet Features for Time Series Classification



ScienceDirect

Procedia Computer Science 147 (2019) 24–28

www.elsevier.com/locate/procedia

2018 International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2018

Cun Ji a,c,∗, Xiunan Zou a, Yupeng Hu b, Shijun Liu b, Lei Lyu a, Xiangwei Zheng a

a School of Information Science and Engineering, Shandong Normal University, Jinan 250014, China
b School of Software, Shandong University, Jinan 250101, China
c Shandong Provincial Key Laboratory of Software Engineering, Shandong University, Jinan 250101, China

Abstract

Time series classification (TSC) has attracted significant interest over the past decade, and many TSC methods have been proposed. Among these methods, shapelet-based methods are promising because they are interpretable, more accurate, and faster than other methods. Their training time is high, however, so many acceleration strategies have been proposed. The accuracy of these speedup methods is not ideal. To address these problems, an XGBoost classifier based on shapelet features (XG-SF) is proposed in this work. In XG-SF, an XGBoost classifier trained on shapelet features is used to improve classification accuracy. Our experimental results demonstrate that XG-SF is faster than the state-of-the-art classifiers and that its classification accuracy is also improved to a certain extent.

© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 2018 International Conference on Identification, Information and Knowledge in the Internet of Things.

Keywords: time series classification; XGBoost; shapelet feature.

1. Introduction

Time series classification (TSC) has long been an important research problem for both academic researchers and industry practitioners [14]. In TSC, an unlabeled time series is assigned to one of two or more predefined classes. Many TSC methods have been proposed [1, 4, 8, 9]. Among them, classifiers based on shapelet features are promising because they are interpretable, more accurate, and faster than other methods.

The training time of shapelet-based methods is usually high [6], and many acceleration strategies have been proposed to reduce it. However, the classification accuracy of these speedup methods is not ideal, as shown in our previous work [11]. To address these problems, an XGBoost [7] classifier based on shapelet features (XG-SF) is proposed in this work. The main contribution of this paper is the combination of the XGBoost classifier with shapelet features, which improves classification accuracy.

The remainder of this paper is structured as follows. Section 2 discusses related work on shapelet-based methods. The proposed XGBoost classifier based on shapelet features is introduced in Section 3. Experimental results are presented in Section 4, and our conclusions are given in Section 5.

∗ Corresponding author. E-mail address: [email protected]
1877-0509 © 2019 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2019.01.179

2. Related Work

Since being introduced in 2009 [18], shapelet-based classifiers have attracted the interest of many researchers, for three reasons:

• First, they are more compact than many alternatives, resulting in faster classification.
• Second, shapelets are directly interpretable.
• Third, shapelets allow for the detection of shape-based similarities in subsequences. This type of similarity is often hard to detect when examining the whole series.

There are many methods for classifying time series with shapelet features. Ye et al. [18] embedded the shapelet discovery algorithm in a decision tree and found a new shapelet at each node of the tree. Lines et al. [16] proposed the shapelet transform (ST). They found the best k shapelets and used them to produce a transformed data set, where each of the k features represents the distance between a time series and a shapelet. The main advantages of ST are that it optimizes the process of shapelet selection and that it allows various classification strategies (such as C4.5 Tree, 1NN, Naive Bayes, Bayesian Network, Random Forest, Rotation Forest, SVM, etc.) to be adopted. Bagnall et al. [3] built an ensemble of basic classifiers on ST to obtain higher accuracy. Karlsson et al. [15] proposed generalized random shapelet forests. Shi et al. [17] proposed the random pairwise shapelets forest. However, the classification accuracy of these methods is not ideal.

3. An XGBoost Classifier Based on Shapelet Features

To build a classifier with higher accuracy, an XGBoost [7] classifier based on shapelet features (XG-SF) is proposed in this paper. XG-SF consists of five main steps: 1) sample time series, 2) filter subsequences, 3) evaluate candidates, 4) shapelet transform and 5) classifier training. The first two steps, sampling time series and filtering subsequences, are acceleration strategies proposed in our previous work [11]; see [11] for details. The remaining steps, evaluating candidates, shapelet transform and classifier training, are described in detail below.

3.1. Evaluating Candidates

In XG-SF, information gain is used to evaluate shapelet candidates. Suppose that a time series dataset $D$ contains $n$ time series from $c$ different classes, and that the probability of a time series belonging to class $i$ is $p_i$. The entropy $e$ of $D$ is calculated as in (1). A shapelet $s$ with a distance threshold $d$ separates the dataset into two smaller datasets, $D_L$ and $D_R$: a time series in $D$ whose distance to $s$ is greater than $d$ is put into $D_R$, otherwise it is put into $D_L$. The numbers of time series in $D_L$ and $D_R$ are $n_L$ and $n_R$, respectively. The information gain is calculated as in (2). We calculate the distance between the shapelet $s$ and every time series in the dataset, sort the time series by these distances, and look for an optimal split point between two neighboring distances. The maximal value over these split points is the information gain of $s$, as in (3).

\[ e(D) = -\sum_{i=1}^{c} p_i \log(p_i) \tag{1} \]

\[ gain(s, d) = e(D) - \frac{n_L}{n} e(D_L) - \frac{n_R}{n} e(D_R) \tag{2} \]

\[ gain(s) = \max_{d \,\in\, \text{optimal split points}} gain(s, d) \tag{3} \]
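The evaluation in (1)–(3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Java implementation: the function names `entropy` and `best_information_gain` are hypothetical, the natural logarithm is assumed (the paper does not state the base, and the base does not change which split is best), and each candidate threshold is taken as the midpoint between two neighboring distances.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy e(D) = -sum_i p_i * log(p_i) over the class proportions, Eq. (1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def best_information_gain(distances, labels):
    """Return (gain(s), d) for one shapelet s, following Eqs. (2) and (3).

    distances[j] is the distance between s and the j-th time series and
    labels[j] is the class of that series. Series with distance <= d go
    to D_L, all others go to D_R.
    """
    n = len(labels)
    order = sorted(range(n), key=lambda j: distances[j])
    sorted_d = [distances[j] for j in order]
    sorted_y = [labels[j] for j in order]
    e_d = entropy(sorted_y)

    best_gain, best_split = 0.0, None
    for k in range(1, n):
        if sorted_d[k - 1] == sorted_d[k]:
            continue  # no threshold can separate two equal distances
        # Assumption: use the midpoint between neighboring distances as d.
        d = (sorted_d[k - 1] + sorted_d[k]) / 2.0
        left, right = sorted_y[:k], sorted_y[k:]
        gain = (e_d
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_gain, best_split = gain, d
    return best_gain, best_split
```

For a shapelet that separates two classes perfectly, the returned gain equals the entropy of the whole dataset, since both $e(D_L)$ and $e(D_R)$ are zero.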


3.2. Shapelet Transform

We use shapelets to transform time series into a new feature space. Each attribute of the transformed data corresponds to the distance between a shapelet and the original time series. The distance between a shapelet $S$ and a time series $T$ is computed as in (4). In (4), $T_i^l$ is the contiguous subsequence of $T$ that starts at point $i$ and has length $l$, where $l$ is also the length of $S$ and $m$ is the length of $T$; $dist(T_i^l, S)$ is the Euclidean distance between $T_i^l$ and $S$. Suppose that we choose the top $k$ shapelets in the shapelet selection process. A time series $T$ is then converted into the form given in (5), where $S_1, \cdots, S_k$ are the top $k$ shapelets we selected.

\[ dist(T, S) = \min\left(dist(T_1^l, S), \cdots, dist(T_i^l, S), \cdots, dist(T_{m-l+1}^l, S)\right) \tag{4} \]

\[ T_D = \{dist(T, S_1), \cdots, dist(T, S_k)\} \tag{5} \]
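A minimal Python sketch of (4) and (5) follows. It is an illustration rather than the authors' code: the names `euclidean`, `shapelet_distance` and `shapelet_transform` are hypothetical, series and shapelets are assumed to be plain lists of numbers, and no normalization of subsequences is applied because the text does not mention one.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shapelet_distance(series, shapelet):
    """Eq. (4): minimum distance between the shapelet and any window of length l."""
    m, l = len(series), len(shapelet)
    if l > m:
        raise ValueError("shapelet must not be longer than the series")
    # Slide over the m - l + 1 contiguous subsequences of length l.
    return min(euclidean(series[i:i + l], shapelet) for i in range(m - l + 1))


def shapelet_transform(dataset, shapelets):
    """Eq. (5): map every series to its k distances to the k shapelets."""
    return [[shapelet_distance(t, s) for s in shapelets] for t in dataset]
```

Each row of the result has one entry per shapelet, so a dataset of n series and k shapelets becomes an n-by-k feature matrix on which any standard classifier can be trained.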

3.3. Classifier Training

Finally, we train a classifier on the transformed dataset. We select XGBoost as the final classifier. XGBoost is a scalable end-to-end tree boosting system [7]. It is widely used by data scientists and provides state-of-the-art results on many problems.

We tune the parameters of our XGBoost classifier through 5-fold cross-validation. We partition the training dataset into 5 equal-sized folds. One fold is retained as the validation data for testing the model, and the remaining 4 folds are used as training data. The cross-validation process is then repeated 5 times, with each of the 5 folds used exactly once as the validation data. The 5 results are then averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.

4. Experiments and Evaluation

In this paper, we selected 12 data sets from the UEA & UCR Time Series Classification Repository [5]. Results of comparative experiments on these 12 data sets can be found in the relevant literature. Our method is based on code that is freely accessible from an online repository [2]. Our code and detailed results are openly available¹ so that the results can be independently replicated. The experiments were carried out in Java on a 3.10 GHz Intel Core i5 CPU with 16 GB of 1333 MHz DDR3 memory, using MyEclipse with JDK 1.8.

4.1. Contrast Experiments for Classification Accuracy

First, we compare our method (XG-SF) with the methods FSS, FS, ST, LS, SD, RS, gRSF and SALSA-R, which were used in our previous work. Table 1 lists their classification accuracy. In Table 1, the most accurate classifier for each dataset is given in bold. As Table 1 shows, our method has the highest average classification rate and the smallest average rank among all the methods. This demonstrates that our method is more accurate than the other shapelet-based methods.

4.2. Contrast Experiments for Running Time

In Table 1, the average ranks of XG-SF, ST, LS, FSS and gRSF are smaller than those of FS, SD, RS and SALSA-R, and their average classification rates are higher. The results in Table 1 therefore demonstrate that XG-SF, ST, LS, FSS and gRSF are more accurate than FS, SD, RS and SALSA-R.

¹ Web page for our code: https://github.com/sdujicun/XGBoostWithST.



Table 1: Comparison of Classification Accuracy

                                                         Acceleration Methods
Dataset                ST¹    LS¹    XG-SF  FSS¹   gRSF¹  SALSA-R¹  SD¹    FS¹    RS¹
Adiac                  0.783  0.522  0.762  0.785  0.732  0.726     0.583  0.593  0.516
Beef                   0.900  0.867  0.833  0.833  0.633  0.609     0.507  0.567  0.324
ChlorineConcentration  0.700  0.592  0.698  0.643  0.658  0.671     0.553  0.546  0.572
Coffee                 0.964  1.000  0.929  1.000  0.964  0.960     0.961  0.929  0.769
DiatomSizeReduction    0.925  0.980  1.000  0.856  0.779  0.769     0.896  0.866  0.774
ItalyPowerDemand       0.948  0.960  0.985  0.940  0.944  0.951     0.920  0.917  0.924
Lightning7             0.726  0.795  0.757  0.740  0.726  0.695     0.652  0.644  0.635
MedicalImages          0.670  0.664  0.732  0.712  0.697  0.686     0.676  0.624  0.529
MoteStrain             0.897  0.883  0.950  0.895  0.952  0.854     0.783  0.777  0.815
Symbols                0.882  0.932  0.920  0.951  0.755  0.864     0.865  0.934  0.795
Trace                  1.000  1.000  1.000  1.000  1.000  1.000     0.965  1.000  0.934
TwoLeadECG             0.997  0.996  1.000  0.986  0.991  0.958     0.867  0.924  0.914
average rate           0.866  0.849  0.881  0.862  0.819  0.812     0.769  0.777  0.708
average rank           3.208  3.667  2.625  3.458  4.458  5.375     6.917  6.750  8.250

¹ The accuracy listed for ST, LS, FSS, gRSF, SALSA-R, SD, FS and RS is taken from https://github.com/sdujicun/FastShapeletSelection.

Table 2: Running Time Comparison

Dataset                XG-SF   FSS      gRSF     FS        ST          LS
ChlorineConcentration  8.655   488.623  624.438  175.519   21921.631   6389.524
Coffee                 0.165   2.554    6.438    6.798     548.464     241.068
DiatomSizeReduction    0.313   1.997    13.344   8.295     332.628     780.014
Lightning7             10.636  22.608   123.970  114.9000  13646.7760  14535.8910
MoteStrain             0.036   0.887    1.860    0.319     2.065       23.539
Symbols                4.785   7.887    40.160   37.457    4477.296    4781.698
Trace                  18.121  31.899   80.130   55.468    6902.039    5332.514

Next, we show that XG-SF is the fastest among the methods with high accuracy (XG-SF, ST, LS, FSS and gRSF). FS is also used as a baseline. Table 2 lists the running time of XG-SF, FSS, ST, LS, FS and gRSF². Because the running time of ST and LS is very long (for example, training ST on "CinCECGtorso" takes about 20 days in our experimental environment), we selected 7 data sets for this experiment. As Table 2 shows, XG-SF is the fastest method among the methods with high accuracy.
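The comparison in Table 2 can be checked with a short script. The sketch below transcribes the running times from Table 2 as printed (in the same units as the table) and reports, for each data set, the fastest method and the ratio between a baseline and XG-SF. The names `METHODS`, `RUNNING_TIME`, `fastest_method` and `speedup` are hypothetical helpers introduced only for this illustration.

```python
# Running times transcribed from Table 2, in the column order of METHODS.
METHODS = ["XG-SF", "FSS", "gRSF", "FS", "ST", "LS"]
RUNNING_TIME = {
    "ChlorineConcentration": [8.655, 488.623, 624.438, 175.519, 21921.631, 6389.524],
    "Coffee": [0.165, 2.554, 6.438, 6.798, 548.464, 241.068],
    "DiatomSizeReduction": [0.313, 1.997, 13.344, 8.295, 332.628, 780.014],
    "Lightning7": [10.636, 22.608, 123.970, 114.900, 13646.776, 14535.891],
    "MoteStrain": [0.036, 0.887, 1.860, 0.319, 2.065, 23.539],
    "Symbols": [4.785, 7.887, 40.160, 37.457, 4477.296, 4781.698],
    "Trace": [18.121, 31.899, 80.130, 55.468, 6902.039, 5332.514],
}


def fastest_method(times):
    """Name of the method with the smallest running time in one table row."""
    return METHODS[min(range(len(times)), key=times.__getitem__)]


def speedup(dataset, baseline):
    """How many times faster XG-SF is than `baseline` on `dataset`."""
    times = RUNNING_TIME[dataset]
    return times[METHODS.index(baseline)] / times[METHODS.index("XG-SF")]


if __name__ == "__main__":
    for name, times in RUNNING_TIME.items():
        print(name, fastest_method(times), round(speedup(name, "ST"), 1))
```

On every one of the 7 data sets the smallest entry in the row belongs to XG-SF, which matches the statement above.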

² Because gRSF runs in parallel, we use CPU time as its running time.

5. Conclusion

Shapelet-based methods have attracted a lot of attention in the last decade. However, the time complexity of the shapelet selection process is too high, so many acceleration strategies have been proposed. The classification accuracy of these speedup methods is not ideal. To address these problems, XG-SF is proposed in this work. XG-SF uses an XGBoost classifier based on shapelet features to improve classification accuracy. The experimental results demonstrate that XG-SF is more accurate than other shapelet-based methods. Meanwhile, XG-SF is the fastest among the methods with high accuracy. In the future, we will apply XG-SF to our industrial big data platform [12, 13, 10] to solve real-world problems.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61872222, 91546203), the National Key Research and Development Program of China (2017YFA0700601), the Major Program of Shandong Province Natural Science Foundation (ZR2018ZB0419), and the Key Research and Development Program of Shandong Province (2017CXGC0605, 2017CXGC0604, 2018GGX101019).

References

[1] Bagnall, A., Bostrom, A., Large, J., Lines, J., 2016a. The great time series classification bake off: An experimental evaluation of recently proposed algorithms. Extended version. arXiv preprint arXiv:1602.01711.
[2] Bagnall, A., Bostrom, A., Lines, J., 2016b. The UEA TSC codebase. https://bitbucket.org/aaron_bostrom/time-series-classification.
[3] Bagnall, A., Davis, L., Hills, J., Lines, J., 2012. Transformation based ensembles for time series classification, in: Proceedings of the 2012 SIAM International Conference on Data Mining, SIAM. pp. 307–318.
[4] Bagnall, A., Lines, J., Bostrom, A., Large, J., Keogh, E., 2017. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery 31, 606–660.
[5] Bagnall, A., Lines, J., Vickers, W., Keogh, E., 2016c. The UEA & UCR Time Series Classification Repository. http://timeseriesclassification.com.
[6] Chang, K.W., Deka, B., Hwu, W.W., Roth, D., 2012. Efficient pattern-based time series classification on GPU, in: 2012 IEEE 12th International Conference on Data Mining, IEEE. pp. 131–140.
[7] Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM. pp. 785–794.
[8] Esling, P., Agon, C., 2012. Time-series data mining. ACM Computing Surveys 45, 12.
[9] Fu, T.c., 2011. A review on time series data mining. Engineering Applications of Artificial Intelligence 24, 164–181.
[10] Ji, C., Liu, S., Yang, C., Cui, L., Pan, L., Wu, L., Liu, Y., 2016a. A self-evolving method of data model for cloud-based machine data ingestion, in: 2016 IEEE 9th International Conference on Cloud Computing (CLOUD), IEEE. pp. 814–819.
[11] Ji, C., Liu, S., Yang, C., Pan, L., Wu, L., Meng, X., 2018. A shapelet selection algorithm for time series classification: New directions. Procedia Computer Science 129, 461–467.
[12] Ji, C., Liu, S., Yang, C., Wu, L., Pan, L., 2015. IBDP: An Industrial Big Data Ingestion and Analysis Platform and Case Studies, in: International Conference on Identification, Information, and Knowledge in the Internet of Things, pp. 223–228.
[13] Ji, C., Shao, Q., Jiao, S., Liu, S., Li, P., Lei, W., Yang, C., 2016b. Device data ingestion for industrial big data platforms with a case study. Sensors 16.
[14] Ji, C., Zhao, C., Pan, L., Liu, S., Yang, C., Wu, L., 2017. A fast shapelet discovery algorithm based on important data points. International Journal of Web Services Research (IJWSR) 14, 67–80.
[15] Karlsson, I., Papapetrou, P., Boström, H., 2016. Generalized random shapelet forests. Data Mining and Knowledge Discovery 30, 1053–1085.
[16] Lines, J., Davis, L.M., Hills, J., Bagnall, A., 2012. A shapelet transform for time series classification, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM. pp. 289–297.
[17] Shi, M., Wang, Z., Yuan, J., Liu, H., 2018. Random pairwise shapelets forest, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Springer. pp. 68–80.
[18] Ye, L., Keogh, E., 2009. Time series shapelets: a new primitive for data mining, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM. pp. 947–956.