Computer Networks 157 (2019) 89–98
A just-in-time shapelet selection service for online time series classification
Cun Ji a,b,c, Chao Zhao b, Li Pan b,∗, Shijun Liu b,∗, Chenglei Yang b, Xiangxu Meng b
a School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China
b School of Software, Shandong University, Jinan 250101, China
c Shandong Provincial Key Laboratory of Software Engineering, Shandong University, Jinan 250101, China
Article info
Article history: Received 24 April 2018; Revised 26 February 2019; Accepted 22 April 2019; Available online 24 April 2019
Keywords: Industrial internet of things; Cyber-physical system; Time series classification; Shapelet; Subclass split; Local farthest deviation points
Abstract
Time series classification has attracted significant interest over the past decade as a result of the enormous amounts of data that flow into Cyber-Physical Systems. However, in such systems time-variant mechanisms violate the stationarity hypothesis that is usually assumed in the design of classification systems, which impairs classifier accuracy. To cope with this issue, classifiers with just-in-time adaptive training mechanisms are needed: they detect a change in stationarity and modify the classifier configuration accordingly to track the process evolution. This paper proposes an online time series classification system including a just-in-time shapelet selection service (JSSS), which selects shapelets as the features for time series classification. The JSSS is based on a fast shapelet selection algorithm (FSS). First, the FSS samples time series from the training dataset with the help of a subclass splitting method. Next, the FSS identifies Local Farthest Deviation Points (LFDPs) in the sampled time series; the subsequences between pairs of different LFDPs are then selected as shapelet candidates. Through these two steps, the number of shapelet candidates is sharply reduced, and so is the training time, which ensures efficient training and feature extraction in an online time series classification system. Experiments showed that the JSSS obtains results in less than 30 s in the worst case over all datasets. At the same time, classification accuracy rates improved by more than 9.9% in the offline scenario and 7.1% in the online scenario. © 2019 Elsevier B.V. All rights reserved.
1. Introduction
In the Industry 4.0 [1] era, implementations of the Internet of Things (IoT) are becoming increasingly prevalent, and more and more machines are integrated into Cyber-Physical Systems [2,3]. For example, the Chinese government requires that makers of new energy vehicles establish monitoring platforms to track the safety state of their vehicles over their entire life cycle. In this regard, cloud service technology and IoT technology have become more and more important for machine monitoring [4,5]. Usually, machines deliver data at equal time intervals, so the data they send back naturally form time series. Typically, analyzing machine data at a single moment is meaningless. We
R This paper was completed by the first author when he was a Ph.D. student at Shandong University and was supplemented during his work at Shandong Normal University.
∗ Corresponding authors.
E-mail addresses: [email protected] (C. Ji), [email protected] (C. Zhao), [email protected] (L. Pan), [email protected] (S. Liu), [email protected] (C. Yang), [email protected] (X. Meng).
https://doi.org/10.1016/j.comnet.2019.04.020 1389-1286/© 2019 Elsevier B.V. All rights reserved.
need to extract meaning from them by using time series analysis methods. Time series analysis encompasses tasks such as indexing, representation, pattern discovery, clustering, classification, and anomaly detection [6–10]. Effective time series classification has long been an important research problem for both academic researchers and industry practitioners. However, online time series classification is hard because of the high training time complexity of time series classification algorithms. To address this problem, a just-in-time shapelet selection service (JSSS) is proposed in this paper. The JSSS is based on a fast shapelet selection algorithm (FSS), which aims at reducing the number of shapelet candidates. First, the service samples time series from the training dataset with the help of a subclass splitting method. Then, the FSS identifies Local Farthest Deviation Points (LFDPs) in the sampled time series and selects the subsequences between pairs of different LFDPs as shapelet candidates. Through these steps, the number of shapelet candidates can be
greatly reduced, which leads to a significant reduction in time complexity.
This paper makes the following contributions:
• First, we put forward an online time series classification system with seven services (a data receiving service, classification service, training service, feedback service, data transform service, shapelet selection service, and coordinate service), in which the JSSS ensures just-in-time training.
• Second, we propose the FSS, which selects shapelets quickly by reducing the number of candidates.
• Third, we conduct experiments to show the effectiveness of our service. The experiments show that the JSSS obtains results in less than 30 s in the worst case for all datasets, while classification accuracy rates improve at the same time.
The remainder of this paper is structured as follows. Section 2 discusses related work on the Industrial Internet of Things (IIoT) and Cyber-Physical Systems (CPS), time series analysis services, and shapelet-based time series classifiers. Section 3 describes the problem motivation. Section 4 presents our online time series classification system. The proposed JSSS with its two acceleration strategies is introduced in Section 5. Experimental results are presented in Section 6, and our conclusions are given in Section 7.
2. Related work
In this section, we briefly discuss related work on 1) IIoT and CPS, 2) time series analysis services, 3) shapelet-based time series classifiers, and 4) acceleration strategies for shapelet-based classifiers.
2.1. IIoT and CPS
The IoT is the information network of physical objects (such as sensors, devices, vehicles, home appliances, and other items) that allows these objects to interact and cooperate to reach common goals [11–14]. IoT applications include industrial environments, transportation, healthcare, smart homes, smart cities, etc. [11,15–17]. Among them, the IIoT refers in particular to industrial environments.
In the IIoT, smart electronics are embedded into production systems along the life cycle of a product [18]. The IIoT enables manufacturing industries to extend their existing applications and even conceive new ways of operating. For example, manufacturers analyze machine health or predict maintenance through data collected by the IIoT. Many ideas for designing IIoT hardware can be traced back to the idea of CPS [19]. As NASA defines it, a CPS is an "emerging class of physical systems that exhibit complex patterns of behavior due to highly capable embedded software components" [20]. IIoT and CPS have aroused the interest of many academic researchers and industry practitioners; for detailed reviews of IIoT and CPS, please see [18,21–24]. In this paper, we focus on time series data analysis services in CPS and the IIoT. We review related work on such services in the next subsection.
2.2. Time series data analysis services
Much previous work on time series data management for IIoT systems has been carried out. Several time series storage services have been proposed. Yin et al. proposed GridMW, a scalable time series data management system for smart grids [25]. Cudré-Mauroux et al. proposed SciDB, a DBMS-based solution for scientific array data [26]. They
addressed the challenge of efficient time series data storage. Ji et al. proposed a heterogeneous device data ingestion model for an industrial big data platform, which ingests heterogeneous device data from multiple sources and stores them in the format of time series. However, these systems cannot provide generalized services for time series data that come from various IIoT applications.
Some services can be used to search for patterns in time series. Xu et al. proposed time series analytics as a service (TSaaaS) on the IoT [27]. TSaaaS builds and updates an index, so it can search for a pattern much faster than scan-based approaches. Papadimitriou et al. proposed SPIRIT, which incrementally captures correlations and discovers trends in time series efficiently and effectively [28]. Services have also been used in the field of Hyper Spectral Image (HSI) analysis. For example, Qi et al. proposed an ensemble learning framework that applies the boosting technique to learn multiple kernel classifiers for classification problems [29]; however, the main data used in this framework are contiguous spectral images.
Few services focus on time series classification. For this reason, we propose an online time series classification system. In our system, we adopt the JSSS, which can select shapelets in a very short time. Thus, our system can use shapelet-based classifiers online.
2.3. Shapelet-based classifiers
Two types of classification algorithms exist: those that consider the whole time series sequence as a global feature for classification, and those that consider a portion of a single time series sequence as a local feature for classification [30]. Shapelet-based algorithms are well-known local feature-based methods. Recent empirical evidence strongly suggests that shapelet-based algorithms outperform many previous classification algorithms [30]. Shapelet-based algorithms are interpretable, more accurate, and faster than state-of-the-art classifiers.
Since 2009, when shapelet-based classifiers were introduced [31], they have aroused the interest of many researchers. Lines et al. proposed a single-scan shapelet algorithm, called ST, that finds the best k shapelets and uses them to produce a transformed dataset, where each of the k features represents the distance between a time series and a shapelet [32,33]. In this way, time series classification problems are transformed into general classification problems [34]. However, the time complexity of the shapelet selection process in ST is high. Many techniques have been proposed to speed up the shapelet selection process; we briefly review acceleration strategies in the next subsection.
2.4. Acceleration strategies
Some methods speed up the shapelet selection process by reducing the time complexity of evaluating each candidate shapelet. Ye and Keogh developed two speedup methods: subsequence distance early abandon (SDEA) and admissible entropy pruning (AEP) [35]. In SDEA, once the accumulated distance is larger than the current smallest distance, the computation is abandoned. AEP calculates a cheap-to-compute upper bound on the information gain and uses it to admissibly prune certain candidates. Mueen et al. accelerated shapelet discovery by reusing computations and pruning the search space [36]. Xing et al. accelerated the shapelet selection process by sharing the computation among different local shapelets [37]. Both of these speedup methods can only handle non-normalized time series. He et al. reduced the runtime by exploiting infrequent shapelet candidates [38]. They assumed that discriminative subsequences are usually
Fig. 1. The diagram of a time series classification system for machine status monitoring.
infrequent compared to other subsequences; this assumption may not hold in some databases.
Other methods speed up the shapelet selection process by reducing the number of shapelet candidates. Xing et al. [37] set the minimum length of shapelets to 5. If the length of the time series is greater than 100, the maximum shapelet length is set to half the time series length. Moreover, the step size is set to 3. Through these settings, only about 1/3 or 1/6 of the subsequences are selected as shapelet candidates. Hills et al. defined a simple algorithm for estimating the minimum and maximum length of the shapelets [33]. They randomly selected ten series, found the best ten shapelets in this subset of the data, and repeated this ten times. The resulting shapelets were sorted by length, with the length of the 25th shapelet returned as the minimum length and the length of the 75th shapelet returned as the maximum length.
3. Problem motivation
Effective time series classification has long been an important research problem for both academic researchers and industry practitioners. For example, Fig. 1 shows a time series classification system for machine status monitoring (such as the Sony AIBO Robot system). The system receives machine data and classifies them in the format of time series. However, two problems arise for such a system:
A) In order to achieve high classification accuracy rates, the trained classifier needs to be updated dynamically; otherwise, poor classification accuracy may occur. Thus, newly available supervised information and feedback need to be integrated into the classifier in order to improve classification accuracy. Moreover, the initial training set in a machine status monitoring system is sometimes very small. For example, in the Sony AIBO Robot system, only 20 and 27 time series can be used to build a classifier for the two surface parameters at the beginning.
B) Many classical classification methods are not suitable for time series due to their enormous training time. For example, execution on the ChlorineCon dataset (see Section 6) takes 100,421 s with a traditional algorithm and 697 s even with a fast algorithm. As a result, a faster algorithm is needed if this dataset is to be used in an online classification system.
The focus of this work was on finding a time series classification service that can train a classifier online and achieve a good classification accuracy rate. For this reason, we designed an online time series classification system. This system selects shapelets as the features of time series. Then, we can transform time series classification problems into general classification problems with
the help of shapelets. In this way, our online time series system can train the classifier online while maintaining a good classification accuracy rate.
4. Our online time series classification system
Implementing various services across multiple domains, networks, and the cyber-physical world is imperative for efficiently analyzing and processing big data [39]. In this section, we introduce our online time series classification system. As Fig. 2 shows, the system is composed of seven services: the data receiving, classification, training, feedback, data transform, shapelet selection, and coordinate services. These services work together to speed up data processing. Among these seven services, two kinds of data streams exist: data for classification, represented with solid lines, and data for training, represented with dotted lines. Three data caches are also available in the pipeline-style service system: the shapelets cache, the classifier cache, and the training data cache. Every cache is refreshed by a former service and used by a subsequent service in the data pipeline.
4.1. The coordinate service
The data transform service uses the shapelets cache, which is refreshed by the shapelet selection service. The classification service uses a classifier built by the training service. To ensure that the system works in a stable way, our system provides the coordinate service, which is responsible for composing and coordinating a series of service calls by sending messages and data. The message flow between the coordinate service and the other services is shown in Figs. 3 and 4, where the sequence of control messages is marked by numbers.
When a mistake occurs in classification, the feedback service sends a "refresh shapelet cache" message to the coordinate service.
Then, the coordinate service makes the data transform service wait until a new training process is completed. Subsequently, the shapelet selection service is started by the coordinate service. After new shapelets are selected, the data transform service is resumed and new data are generated by it. When the classifier cache needs to be refreshed, the data transform service sends a "refresh classifier cache" message to the coordinate service once it has resumed. The coordinate service then makes the classification service wait until the training service finishes refreshing the classifier cache. After the classifier cache is
Fig. 2. Online time series classification system and the 7 services.
Fig. 3. The message flow between the coordinate service and other services while refreshing shapelets cache.
Fig. 4. The message flow between the coordinate service and other services while refreshing the classifier cache.
updated, the coordinate service resumes the classification service.
The most important property of the pipeline process is just-in-time shapelet selection, because new shapelets need to be selected to rebuild the classifier whenever newly available supervised information and feedback are generated.
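As an illustration of this message-driven coordination, the following toy sketch (our own; the message strings follow the text, but the class, field, and method names are hypothetical) pauses and resumes the dependent services:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the coordinate service's pause/resume protocol.
public class Coordinator {
    enum State { RUNNING, WAITING }
    State dataTransform = State.RUNNING;
    State classification = State.RUNNING;
    Queue<String> log = new ArrayDeque<>();

    void onMessage(String msg) {
        switch (msg) {
            case "refresh shapelet cache":      // from the feedback service
                dataTransform = State.WAITING;  // pause transforms during reselection
                log.add("shapelet selection started");
                break;
            case "shapelet cache refreshed":    // selection finished
                dataTransform = State.RUNNING;  // resume transforms
                break;
            case "refresh classifier cache":    // from the data transform service
                classification = State.WAITING; // pause classification during training
                log.add("training started");
                break;
            case "classifier cache refreshed":  // training finished
                classification = State.RUNNING;
                break;
        }
    }

    public static void main(String[] args) {
        Coordinator c = new Coordinator();
        c.onMessage("refresh shapelet cache");
        c.onMessage("shapelet cache refreshed");
        System.out.println(c.dataTransform); // prints RUNNING
    }
}
```

The real service would additionally forward data between services; this sketch only tracks the wait/resume states described above.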
4.2. The data receiving service
The main function of the data receiving service is to accept data and transform them into the format of time series. It is also responsible for transporting the time series to the data transform service.
dist(s, S) = \sqrt{\sum_{j=1}^{l} (S_j - s_j)^2}    (2)
The data transform service uses the shapelets cache, which is refreshed by the shapelet selection service.
4.4. The shapelet selection service
Shapelet-based algorithms are interpretable, more accurate, and faster than state-of-the-art classifiers. However, the shapelet selection process is time consuming, and in many cases shapelet selection must be executed online. For example, in new energy vehicle monitoring applications, feature extraction and reporting must be completed within only one minute after the training data are updated. Thus, the shapelet selection service must be fast enough to satisfy this requirement. For this reason, the JSSS for online time series classification is proposed. The JSSS is based on the FSS, which adopts two acceleration strategies: first, it samples time series from the training dataset with the help of the subclass splitting method; then, it identifies the LFDPs of the sampled time series and selects the subsequences between pairs of different LFDPs as shapelet candidates. Through these two steps, the number of shapelet candidates can be greatly reduced, which, in turn, leads to a significant reduction in training time. We describe the FSS in detail in the next section.
Fig. 5. Sliding window whole.
4.5. The training service
The training service uses the transformed training data to build the classifier. In our online time series classification system, the training time series data are transformed in advance. Therefore, classification can be viewed as a general classification problem, and many classical classification methods, such as the C4.5 tree, 1NN, Naive Bayes, and Random Forest, can be used in our system. The training service builds the classifier that is used in the classification service.
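As a sketch of how a classical classifier operates on the transformed data, the following hypothetical 1NN routine (our own illustration; all names are ours) treats each training instance as a vector of distances to the selected shapelets:

```java
// 1NN on shapelet-transformed data: each instance is a vector of
// distances to the selected shapelets, plus a class label.
public class OneNN {
    static int classify(double[][] trainX, int[] trainY, double[] x) {
        int best = 0;
        double bestD = Double.POSITIVE_INFINITY;
        for (int i = 0; i < trainX.length; i++) {
            double d = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = trainX[i][j] - x[j];
                d += diff * diff; // squared Euclidean distance in feature space
            }
            if (d < bestD) { bestD = d; best = trainY[i]; }
        }
        return best; // label of the nearest training instance
    }

    public static void main(String[] args) {
        double[][] X = {{0.1, 2.0}, {0.2, 1.9}, {1.8, 0.1}}; // toy transformed data
        int[] y = {0, 0, 1};
        System.out.println(classify(X, y, new double[]{1.7, 0.2})); // prints 1
    }
}
```

Any of the classifiers named above could be substituted; the point is that, after the transform, the learner never sees the raw time series.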
Fig. 6. Sliding window partly.
The length of the time series is usually determined by the time series in the training dataset. Our data receiving service provides two methods to generate time series: one, called sliding window whole (Fig. 5), generates time series without overlapping subsequences; the other, called sliding window partly (Fig. 6), generates time series with overlapping subsequences. In addition, the data receiving service needs to handle missing data; it fills in missing values by interpolating between adjacent data points.
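The two window-generation modes can be sketched as follows (a minimal illustration; the class and method names are ours):

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindow {
    // "Sliding window whole": non-overlapping windows of length w.
    static List<double[]> whole(double[] data, int w) {
        List<double[]> out = new ArrayList<>();
        for (int i = 0; i + w <= data.length; i += w) {
            double[] win = new double[w];
            System.arraycopy(data, i, win, 0, w);
            out.add(win);
        }
        return out;
    }

    // "Sliding window partly": windows of length w advanced by step < w,
    // so consecutive windows overlap.
    static List<double[]> partly(double[] data, int w, int step) {
        List<double[]> out = new ArrayList<>();
        for (int i = 0; i + w <= data.length; i += step) {
            double[] win = new double[w];
            System.arraycopy(data, i, win, 0, w);
            out.add(win);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] d = {1, 2, 3, 4, 5, 6};
        System.out.println(whole(d, 3).size());     // prints 2
        System.out.println(partly(d, 3, 1).size()); // prints 4
    }
}
```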
4.6. The classification service
After the training service refreshes the classifier cache, the classification service uses the classifier to classify newly transformed data. In this service, unlabeled transformed data are assigned to one of two or more predefined classes, which are given in the training time series. After this, the classification process is finished and the result is shown to the user.
4.7. The feedback service
4.3. The data transform service
Many classical classification methods do not adapt well to time series because of the continuity of time series. For this reason, we transform time series classification problems into general classification problems. Each attribute of the transformed data corresponds to the distance between a shapelet and the original time series. The distance between a shapelet S and a time series T can be computed as (1). In (1), T_i^l is a contiguous subsequence of the time series whose start point is i and whose length is l; l is also the length of S. We use the Euclidean distance in (1): the distance between a sequence s = {s_1, s_2, ..., s_j, ..., s_l} and a shapelet S = {S_1, S_2, ..., S_j, ..., S_l} is calculated as (2).
dist(T, S) = \min \{ dist(T_1^l, S), \ldots, dist(T_i^l, S), \ldots, dist(T_{m-l+1}^l, S) \}, \quad i = 1, \ldots, m-l+1    (1)
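A minimal sketch of the distance computations in (1) and (2) (our own illustration; no normalization is applied, and the class and method names are ours):

```java
public class ShapeletDistance {
    // Equation (2): Euclidean distance between a shapelet S and an
    // equal-length subsequence s of the time series.
    static double dist(double[] s, double[] S) {
        double sum = 0.0;
        for (int j = 0; j < S.length; j++) {
            double d = S[j] - s[j];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Equation (1): minimum over every length-l subsequence T_i^l of T.
    static double dist(double[] T, double[] S, int l) {
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i + l <= T.length; i++) {
            double[] sub = java.util.Arrays.copyOfRange(T, i, i + l);
            double d = dist(sub, S);
            if (d < best) best = d;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] T = {0, 0, 1, 2, 1, 0};
        double[] S = {1, 2, 1};
        System.out.println(dist(T, S, 3)); // S occurs exactly in T, prints 0.0
    }
}
```

Each such minimum distance becomes one attribute of the transformed instance, with one attribute per selected shapelet.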
The feedback service has two main functions. First, it shows classification results to users. Second, it triggers the shapelet selection service if misclassified results are detected. The data in the testing set are sent one by one to the simulation model. When a misclassification occurs, the feedback service triggers the online training service: new shapelets are reselected and the classifier is updated.
5. The just-in-time shapelet selection service
In our online time series classification system, the most critical part is the shapelet selection service. The selected shapelets are used to transform time series classification problems into general classification problems.
Fig. 7. Flow diagram of the FSS.
In order to meet the real-time requirements of online time series classification, the JSSS is put forward. The flow diagram of the JSSS is shown in Fig. 7. The JSSS is based on the FSS. For every training dataset, three processes yield the final shapelets. First, the FSS samples time series from the training dataset with the help of the subclass splitting method. Next, the FSS identifies LFDPs in the sampled time series and selects the subsequences between pairs of different LFDPs as shapelet candidates. Finally, the FSS assesses the shapelet candidates and obtains the final shapelets.
Two acceleration strategies are applied: "sample time series" reduces the number of time series used to generate shapelet candidates, while "generate candidates based on LFDPs" reduces the number of shapelet candidates generated per time series. Through these two steps, the number of shapelet candidates can be greatly reduced, which, in turn, leads to a significant reduction in shapelet selection time.
Four assessment methods are available for measuring shapelet quality [33]: information gain, Kruskal–Wallis [40], F-statistic, and Mood's median [41]. In this paper, we use information gain, which is the most common method. Specifically, when assessing shapelet candidates, we calculate their information gain on the whole dataset. Finally, the candidates with the k highest information gains are selected as the final shapelets and returned (for a detailed description of the assessment of shapelet candidates, please see [32,33]).
5.1. Time series sampling
A certain number of time series is usually available for every class in a training set. For example, the dataset "ChlorineConcentration", from the UEA & UCR Time Series Classification Repository [42], has 3 classes and 467 time series. Why not simply sample time series from every class? Sathianwiriyakhun et al.
demonstrated that directly sampling one or a few time series per class is often insufficient and can greatly reduce accuracy [43]. One important reason is that subclasses are typically present in most real-world datasets. Consequently, we sample time series with the help of the subclass splitting method. As shown in Fig. 8, three steps are used to sample the time series:
• Step 1: Select the criteria time series;
• Step 2: Split subclasses;
• Step 3: Sample time series.
Fig. 8. Processes to sample time series.
5.1.1. Step 1: Select the criteria time series
Suppose there are n_c time series in one class and the time series length is m:

D_c = \{T_1, T_2, \ldots, T_i, \ldots, T_{n_c}\}    (3)

T_i = \{t_1, t_2, \ldots, t_j, \ldots, t_m\}    (4)
First, the sum of the values of every time series T_i is calculated as (5), and the mean of these sums as (6). Then, the time series closest to the average sum is selected as the criteria for splitting the subclass, as described in (7).
Sum_i = \sum_{j=1}^{m} t_j    (5)

Mean = \frac{1}{n_c} \sum_{i=1}^{n_c} Sum_i    (6)

T_C = \arg\min_{i \in \{1, 2, \ldots, n_c\}} |Sum_i - Mean|    (7)
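Equations (5)–(7) amount to picking the series whose sum of values is closest to the class-average sum, which can be sketched as follows (our own illustration; names are ours):

```java
public class CriteriaSelection {
    // Equations (5)-(7): return the index of the criteria time series.
    static int selectCriteria(double[][] series) {
        int nc = series.length;
        double[] sums = new double[nc];
        double mean = 0.0;
        for (int i = 0; i < nc; i++) {
            for (double v : series[i]) sums[i] += v;   // (5) per-series sum
            mean += sums[i];
        }
        mean /= nc;                                    // (6) mean of the sums
        int best = 0;
        for (int i = 1; i < nc; i++)                   // (7) argmin |Sum_i - Mean|
            if (Math.abs(sums[i] - mean) < Math.abs(sums[best] - mean)) best = i;
        return best;
    }

    public static void main(String[] args) {
        double[][] cls = {{1, 1, 1}, {2, 2, 2}, {9, 9, 9}};
        // sums = 3, 6, 27; mean = 12 -> series 1 (sum 6) is closest
        System.out.println(selectCriteria(cls)); // prints 1
    }
}
```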
5.1.2. Step 2: Split subclasses
After determining the criteria, we calculate the Euclidean distance between each time series in the class and the criteria. The Euclidean distance between T_i = {t_1, t_2, ..., t_j, ..., t_m} and T_c = {c_1, c_2, ..., c_j, ..., c_m} is calculated as (8).
dist = \sqrt{\sum_{j=1}^{m} (t_j - c_j)^2}    (8)
Next, we split the subclass as in [24]. First, we sort the time series by these distance values and determine the adjacent discrepancies of the distance values. Then, we calculate the standard deviation of these adjacent discrepancies. Subsequently, we separate the data into subclasses by splitting at each position whose discrepancy is larger than half of the computed standard deviation.
5.1.3. Step 3: Sample time series
Finally, we sample time series from every subclass. In every subclass, the time series with the smallest sum of distances to the other time series in that subclass is selected as the sample.
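The gap-based split of Step 2 can be sketched as follows (our own illustration; for brevity the sketch only counts the resulting subclasses rather than materializing them):

```java
import java.util.Arrays;

public class SubclassSplit {
    // Split sorted distances into subclasses wherever the gap between
    // adjacent values exceeds half the standard deviation of all gaps.
    static int countSubclasses(double[] distances) {
        double[] d = distances.clone();
        Arrays.sort(d);
        int n = d.length - 1;                  // number of adjacent gaps
        double[] gaps = new double[n];
        double mean = 0.0;
        for (int i = 0; i < n; i++) { gaps[i] = d[i + 1] - d[i]; mean += gaps[i]; }
        mean /= n;
        double var = 0.0;
        for (double g : gaps) var += (g - mean) * (g - mean);
        double sd = Math.sqrt(var / n);
        int subclasses = 1;
        for (double g : gaps)
            if (g > sd / 2) subclasses++;      // split at every large gap
        return subclasses;
    }

    public static void main(String[] args) {
        // Two tight clusters of distances separated by one big jump.
        double[] dist = {0.1, 0.2, 0.3, 5.0, 5.1, 5.2};
        System.out.println(countSubclasses(dist)); // prints 2
    }
}
```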
Table 1
Datasets.
5.2. Generating candidates based on LFDPs
Every sampled time series still has too many subsequences; we should select those with certain significance as shapelet candidates. Zhang et al. use key points to retain shapelet candidates that have certain significance [44]. In this research, we generate shapelet candidates with the help of LFDPs. LFDPs are defined as the points that satisfy the following conditions: the point is in the sequence with the maximum weight, and the point is at the maximum distance from the line of best fit of that sequence. In the LFDP definition, the sequence weight is related to the distances between the points and the fitting line of the sequence. The weight of one sequence is expressed as (9):
weight = \max \{ dist_{sum}, 2 \cdot dist_{max} \}    (9)
In (9), dist_sum is the sum of the distances of all points in the sequence and dist_max is the maximum among these distances. The distances are the fitting errors of the points with respect to the fitting line of the sequence. There are three types of distance from a point to the fitting line of a sequence: Euclidean distance, orthogonal distance, and vertical distance [45,46]. The orthogonal distance and the vertical distance are widely used. In the generation of LFDPs, the vertical distance is equivalent to the orthogonal distance but can be computed more conveniently; in our work, we use the vertical distance.
The LFDP selection method selects the LFDPs of a time series. Sequences between any two LFDPs are selected as shapelet candidates. Note that we do not use LFDPs to downsample the time series; they are selected as the start points and end points of the shapelet candidates. On the basis of the definition of LFDPs, we show how to select LFDPs from a time series in Algorithm 1.

Algorithm 1 LFDP Selection Algorithm.
Input: time series data: T = {t_1, t_2, ..., t_m}
Output: the indices of the LFDPs
1: int[] LFDP = new int[√m + 1];
2: LFDP[0] = 1; LFDP[1] = m;
3: List list = new ArrayList();
4: list.add(T);
5: int tempNumber = 2;
6: while tempNumber < √m + 1 do
7:    sortSegmentByWeight(list);
8:    Segment s = list.get(0);
9:    int LFDPIndex = getLFDPIndex(s);
10:   LFDP[tempNumber] = LFDPIndex;
11:   tempNumber++;
12:   Segment sL = leftSegment(s, LFDPIndex);
13:   Segment sR = rightSegment(s, LFDPIndex);
14:   list.add(sL); list.add(sR);
15:   list.remove(0);
16: end while
17: return sortLFDPFromSmallToBig(LFDP)

First, as shown in Line 2 of Algorithm 1, the first and last points of the time series are selected as LFDPs, and the sequence between them (the whole time series) is put into a sequence list (Lines 3–4).
Then, as in Lines 6–16 of Algorithm 1, the following steps are repeated until the number of selected LFDPs reaches √m + 1 (m is the length of the time series):
Step 1: Sort the sequences in the sequence list by weight from largest to smallest (Line 7 in Algorithm 1).
Dataset        Number of classes   Training set size   Testing set size   Time series length
ChlorineCon    3                   467                 3840               166
Coffee         2                   28                  28                 286
ECGFiveDays    2                   23                  861                136
Trace          4                   100                 100                275
TwoLeadECG     2                   23                  1139               82
Step 2: Select the sequence with the largest weight from the sequence list (Line 8 in Algorithm 1).
Step 3: Select the point at the maximum distance from the line of best fit of the sequence selected in Step 2 as a new LFDP (Lines 9–10 in Algorithm 1).
Step 4: Divide the sequence selected in Step 2 into two sequences at the new LFDP selected in Step 3, and put both sequences into the sequence list (Lines 12–14 in Algorithm 1).
Step 5: Remove the sequence selected in Step 2 from the sequence list (Line 15 in Algorithm 1).
After obtaining the LFDPs, we can generate the shapelet candidates of a sampled time series with their help: any subsequence between two different LFDPs is selected as a shapelet candidate. The candidate generating process is shown in Algorithm 2.

Algorithm 2 generateCandidatesWithLFDP.
Input: time series data: T = {t_1, t_2, ..., t_m}
Input: the endpoints selected for T: LFDP
Output: candidates: shapelet candidates
1: candidates = ∅;
2: for j = 1 to LFDP.length − 2 do
3:    begin = LFDP[j];
4:    for i = j + 1 to LFDP.length do
5:       end = LFDP[i];
6:       S = {t_begin, t_begin+1, t_begin+2, ..., t_end−1};
7:       candidates.add(S);
8:    end for
9: end for
10: return candidates
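The candidate-generation step can be sketched in Java as follows (our own illustration; the LFDP array holds 1-based positions into T, and each candidate spans t_begin through t_end−1 as in Algorithm 2):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CandidateGeneration {
    // Every subsequence between two different LFDPs becomes a candidate.
    static List<double[]> generate(double[] T, int[] lfdp) {
        List<double[]> candidates = new ArrayList<>();
        for (int j = 0; j < lfdp.length - 1; j++)
            for (int i = j + 1; i < lfdp.length; i++)
                // 1-based t_begin .. t_{end-1} maps to 0-based [begin-1, end-1)
                candidates.add(Arrays.copyOfRange(T, lfdp[j] - 1, lfdp[i] - 1));
        return candidates;
    }

    public static void main(String[] args) {
        double[] T = {1, 2, 3, 4, 5, 6};
        int[] lfdp = {1, 3, 6};                       // 3 LFDPs -> C(3,2) pairs
        System.out.println(generate(T, lfdp).size()); // prints 3
    }
}
```

With k selected LFDPs, only k(k−1)/2 candidates are generated per sampled series, instead of the O(m²) subsequences of the raw time series.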
6. Experimental evaluation
6.1. Datasets
We performed experiments on five datasets from the UEA & UCR Time Series Classification Repository [42], listed in Table 1. We selected these datasets for two reasons: 1) they are sensor reading or human sensor reading time series; 2) they are relatively large scale: on average no fewer than 10 time series per class, and a time series length of no less than 80.
6.2. Shapelet selection time comparison
The experiments were carried out in Java on a 3.10 GHz Intel Core i5 CPU with 16 GB of 1333 MHz DDR3 memory, using MyEclipse with JDK 1.8. The number of shapelets (k) was set to half of the size of the training set. We compared the shapelet selection
Fig. 9. Shapelet selection time comparison.
Table 2
Shapelet selection time comparison (s).

Dataset        ST          ST-LFDP   FSS
ChlorineCon    100421.03   697.62    6.66
Coffee         165.74      1.08      0.30
ECGFiveDays    24.66       0.40      0.30
Trace          15612.81    61.84     20.01
TwoLeadECG     4.66        0.15      0.09

time of ST (the state-of-the-art classifier), ST-LFDP (using LFDPs to speed up, but without the sampling technology), and FSS. Table 2 and the plots in Fig. 9 report the shapelet selection execution times obtained by applying ST, ST-LFDP, and FSS to the different datasets. As Fig. 9 shows, the FSS is the fastest method. For all datasets, the FSS can select shapelets in less than 30 s in the worst case. Thus, it can be used in an online time series classification system.

6.3. Offline classification accuracy rate comparison

Table 3 lists the classification accuracy rates of four well-known classifiers using either the original time series or the data transformed by the FSS. In Table 3, the higher classification accuracy rate for each classifier is given in bold. As Table 3 shows, the transformed data produced better results on more datasets for all four classifiers. Further, the average classification accuracy rates on the transformed data are higher than the corresponding average rates on the original data. For the C4.5 tree, 1NN, Naive Bayes, and Random Forest, the average classification accuracy rates improved by 9.9%, 11.0%, 11.7%, and 11.9%, respectively. In short, the classification accuracy rates on the data transformed by the FSS improved by more than 9.9%.

6.4. Online classification comparison
Typically, the training set in a machine status monitoring system is very small at the beginning. For example, in the Sony AIBO Robot system, mentioned in the problem motivation in Section 3, only 20 and 27 time series are available at the beginning to build the classifiers for the two surface parameters, respectively. The dataset is shown in Table 4. Because the training set is so small, the classification results with offline training are not ideal. Over time, the training set grows larger and larger. However, the classifier previously could not be updated dynamically because of the delay in shapelet selection. Nevertheless, the FSS is fast enough to ensure just-in-time shapelet selection. The selected shapelets are then used to transform the data, and the classifier is built on the transformed data, which ensures efficient training and feature extraction in an online time series classification system. We also compared online and offline time series classification. Fig. 10 shows the classification accuracy rates for the Sony AIBO Robot system online and offline. Note that in the online classification comparison experiments, we used 1NN to classify the data. As Fig. 10 shows, the online classification accuracy rates are higher than the offline ones. For "surface1", the improvement is 12.2%; for "surface2", it is 7.1%. Moreover, both online accuracy rates are higher than 90.0%.
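The transform-then-classify pipeline described above can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the shapelet distance here is plain Euclidean over same-length subsequences, without the normalization or pruning optimizations an actual implementation would likely use, and all function names are ours.

```python
import numpy as np

def min_dist(ts, shapelet):
    """Distance between a series and a shapelet: the minimum Euclidean
    distance over all same-length subsequences of the series."""
    l = len(shapelet)
    s = np.asarray(shapelet, dtype=float)
    return min(np.linalg.norm(np.asarray(ts[i:i + l], dtype=float) - s)
               for i in range(len(ts) - l + 1))

def transform(dataset, shapelets):
    """Shapelet transform: each series becomes a feature vector of its
    distances to the selected shapelets."""
    return np.array([[min_dist(ts, s) for s in shapelets] for ts in dataset])

def classify_1nn(train_X, train_y, query):
    """1NN in the transformed feature space, as used in the online
    comparison experiments."""
    d = np.linalg.norm(train_X - query, axis=1)
    return train_y[int(np.argmin(d))]
```

When new labelled series arrive, the online system would re-run shapelet selection, rebuild `train_X` with `transform`, and keep classifying with `classify_1nn`; because selection is fast, this rebuild can happen just in time.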
Table 3
Accuracy rate comparison between original data and transformed data.

                C4.5 Tree            1NN                  Naive Bayes          Random Forest
Datasets        origin  transformed  origin  transformed  origin  transformed  origin  transformed
ChlorineCon     0.688   0.562        0.650   0.602        0.346   0.415        0.713   0.655
Coffee          0.929   0.929        1.000   1.000        0.929   0.893        1.000   0.929
ECGFiveDays     0.721   0.972        0.797   0.994        0.797   0.995        0.721   0.999
Trace           0.790   0.990        0.760   0.980        0.800   1.000        0.780   1.000
TwoLeadECG      0.718   0.886        0.747   0.932        0.699   0.852        0.724   0.953
Average Rate    0.769   0.868        0.791   0.901        0.714   0.831        0.788   0.907
Rate Improved   /       0.099        /       0.110        /       0.117        /       0.119
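As a quick arithmetic check, the "Rate Improved" row follows (up to rounding) from the per-dataset rates in Table 3; the snippet below simply recomputes the mean improvement per classifier from the values above.

```python
# Accuracy rates copied from Table 3: (original, transformed) per classifier,
# rows ordered ChlorineCon, Coffee, ECGFiveDays, Trace, TwoLeadECG.
table3 = {
    "C4.5 Tree":     ([0.688, 0.929, 0.721, 0.790, 0.718],
                      [0.562, 0.929, 0.972, 0.990, 0.886]),
    "1NN":           ([0.650, 1.000, 0.797, 0.760, 0.747],
                      [0.602, 1.000, 0.994, 0.980, 0.932]),
    "Naive Bayes":   ([0.346, 0.929, 0.797, 0.800, 0.699],
                      [0.415, 0.893, 0.995, 1.000, 0.852]),
    "Random Forest": ([0.713, 1.000, 0.721, 0.780, 0.724],
                      [0.655, 0.929, 0.999, 1.000, 0.953]),
}

# Improvement = mean transformed accuracy minus mean original accuracy.
improvements = {name: sum(t) / len(t) - sum(o) / len(o)
                for name, (o, t) in table3.items()}
```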
Table 4
Datasets for Sony AIBO robot surface.

Dataset    Number of classes   Training set size   Testing set size   Time series length
Surface1   2                   20                  601                70
Surface2   2                   27                  953                65
Fig. 10. Classification accuracy rate comparison between online and offline.
7. Conclusion

Effective time series classification has long been an important research problem for both academic researchers and industry practitioners. However, the training time of most time series classification algorithms is high, and many classical classification methods do not adapt to time series because of the continuity of time series. To address these two main difficulties, we have presented a just-in-time shapelet selection service for an online time series classification system. The JSSS can reselect shapelets very quickly whenever the old classifier of our system needs to be updated. The reselected shapelets are then viewed as the new features, with whose help time series classification problems can be transformed into general classification problems. The experiments show that the JSSS can obtain results in less than 30 s in the worst case for all the datasets. At the same time, the classification accuracy rates improved by more than 9.9% in the offline scenario and 7.1% in the online scenario.

Conflict of interest

The authors have no conflicts of interest to declare.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [grant number 61872222]; the National Key Research and Development Program of China [grant number 2017YFA0700601]; the Major Program of Shandong Province Natural Science Foundation [grant number ZR2018ZB0419]; the Key Research and Development Program of Shandong Province [grant numbers 2017CXGC0605, 2017CXGC0604, 2018GGX101019]; and the Young Scholars Program of Shandong University.

References

[1] S. Shead, Industry 4.0: the next industrial revolution, Engineer (2013).
[2] N. Jazdi, Cyber physical systems in the context of Industry 4.0, in: Proceedings of the 2014 IEEE International Conference on Automation, Quality and Testing, Robotics, IEEE, 2014, pp. 1–4.
[3] P. Derler, E.A. Lee, S. Tripakis, M. Törngren, Cyber-physical system design contracts, in: Proceedings of the ACM/IEEE 4th International Conference on Cyber-Physical Systems, ACM, 2013, pp. 109–118.
[4] J. Bughin, M. Chui, J. Manyika, Clouds, big data, and smart assets: ten tech-enabled business trends to watch, McKinsey Q. 56 (1) (2010) 75–86.
[5] C. Ji, S. Liu, C. Yang, L. Cui, L. Pan, L. Wu, Y. Liu, A self-evolving method of data model for cloud-based machine data ingestion, in: Proceedings of the 2016 IEEE 9th International Conference on Cloud Computing, IEEE, 2016, pp. 814–819.
[6] T.-C. Fu, A review on time series data mining, Eng. Appl. Artif. Intell. 24 (1) (2011) 164–181.
[7] P. Esling, C. Agon, Time-series data mining, ACM Comput. Surv. (CSUR) 45 (1) (2012) 12.
[8] C. Ji, C. Zhao, L. Pan, S. Liu, C. Yang, L. Wu, A fast shapelet discovery algorithm based on important data points, Int. J. Web Serv. Res. (IJWSR) 14 (2) (2017) 67–80.
[9] C. Ji, S. Liu, C. Yang, L. Pan, L. Wu, X. Meng, A shapelet selection algorithm for time series classification: new directions, Procedia Comput. Sci. 129 (2018) 461–467.
[10] C. Ji, C. Zhao, S. Liu, C. Yang, L. Pan, L. Wu, X. Meng, A fast shapelet selection algorithm for time series classification, Comput. Netw. 148 (2019) 231–240.
[11] S. Jeschke, C. Brecher, H. Song, D.B. Rawat, Industrial Internet of Things: Cybermanufacturing Systems, Springer, 2016.
[12] L. Atzori, A. Iera, G. Morabito, The internet of things: a survey, Comput. Netw. 54 (15) (2010) 2787–2805.
[13] Y. Sun, H. Yan, C. Lu, R. Bie, Z. Zhou, Constructing the web of events from raw data in the web of things, Mob. Inf. Syst. 10 (1) (2014) 105–125.
[14] Y. Sun, A.J. Jara, An extensible and active semantic model of information organizing for the internet of things, Pers. Ubiquitous Comput. 18 (8) (2014) 1821–1833.
[15] H. Song, D. Zhang, A. Jara, J. Wan, K. Boussetta, IEEE Access special section editorial: smart cities, IEEE Access 4 (2016) 3671–3674.
[16] H. Song, R. Srinivasan, T. Sookoor, S. Jeschke, Smart Cities: Foundations, Principles, and Applications, Wiley Publishing, 2017.
[17] M.N. Sadiku, Y. Wang, S. Cui, S.M. Musa, Industrial internet of things, IJASRE 3 (2017).
[18] Y. Liao, E.d.F.R. Loures, F. Deschamps, Industrial internet of things: a systematic literature review and insights, IEEE Internet Things J. (2018).
[19] S. Jeschke, C. Brecher, T. Meisen, D. Özdemir, T. Eschert, Industrial internet of things and cyber manufacturing systems, in: Industrial Internet of Things, Springer, 2017, pp. 3–19.
[20] NASA, Cyber-Physical Systems Modeling and Analysis (CPSMA) Initiative, http://www.nasa.gov/centers/ames/cct/office/studies/cyber-physical_systems.html. Accessed September 30, 2018.
[21] H. Song, D.B. Rawat, S. Jeschke, C. Brecher, Cyber-Physical Systems: Foundations, Principles and Applications, Morgan Kaufmann, 2016.
[22] H. Song, G.A. Fink, S. Jeschke, Security and Privacy in Cyber-physical Systems: Foundations, Principles, and Applications, John Wiley & Sons, 2017.
[23] Y. Lu, Industry 4.0: a survey on technologies, applications and open research issues, J. Ind. Inf. Integr. 6 (2017) 1–10.
[24] J.A. Saucedo-Martínez, M. Pérez-Lara, J.A. Marmolejo-Saucedo, T.E. Salais-Fierro, P. Vasant, Industry 4.0 framework for management and operations: a review, J. Ambient Intell. Hum. Comput. (2017) 1–13.
[25] J. Yin, A. Kulkarni, S. Purohit, I. Gorton, B. Akyol, Scalable real time data management for smart grid, in: Proceedings of the Middleware 2011 Industry Track Workshop, ACM, 2011, p. 1.
[26] P. Cudré-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D.L. Wang, M. Balazinska, J. Becla, et al., A demonstration of SciDB: a science-oriented DBMS, Proc. VLDB Endow. 2 (2) (2009) 1534–1537.
[27] X. Xu, S. Huang, Y. Chen, K. Brown, I. Halilovic, W. Lu, TSAaaS: time series analytics as a service on IoT, in: Proceedings of the 2014 IEEE International Conference on Web Services (ICWS), IEEE, 2014, pp. 249–256.
[28] S. Papadimitriou, J. Sun, C. Faloutsos, Streaming pattern discovery in multiple time-series, in: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB Endowment, 2005, pp. 697–708.
[29] C. Qi, Z. Zhou, Y. Sun, H. Song, L. Hu, Q. Wang, Feature selection and multiple kernel boosting framework based on PSO with mutation mechanism for hyperspectral classification, Neurocomputing 220 (2017) 181–190.
[30] M. Arathi, A. Govardhan, Accurate time series classification using shapelets, Int. J. Data Min. Knowl. Manage. Process 4 (2) (2014) 39.
[31] L. Ye, E. Keogh, Time series shapelets: a new primitive for data mining, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 947–956.
[32] J. Lines, L.M. Davis, J. Hills, A. Bagnall, A shapelet transform for time series classification, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2012, pp. 289–297.
[33] J. Hills, J. Lines, E. Baranauskas, J. Mapp, A. Bagnall, Classification of time series by shapelet transformation, Data Min. Knowl. Discovery 28 (4) (2014) 851–881.
[34] C. Ji, X. Zou, Y. Hu, S. Liu, L. Lyu, X. Zheng, XG-SF: an XGBoost classifier based on shapelet features for time series classification, Procedia Comput. Sci. 147 (2019) 24–28.
[35] L. Ye, E. Keogh, Time series shapelets: a novel technique that allows accurate, interpretable and fast classification, Data Min. Knowl. Discovery 22 (1–2) (2011) 149–182.
[36] A. Mueen, E. Keogh, N. Young, Logical-shapelets: an expressive primitive for time series classification, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 1154–1162.
[37] Z. Xing, J. Pei, P.S. Yu, K. Wang, Extracting interpretable features for early classification on time series, in: Proceedings of the 2011 SIAM International Conference on Data Mining, SIAM, 2011, pp. 247–258.
[38] Q. He, Z. Dong, F. Zhuang, T. Shang, Z. Shi, Fast time series classification based on infrequent shapelets, in: Proceedings of the 2012 11th International Conference on Machine Learning and Applications (ICMLA), vol. 1, IEEE, 2012, pp. 215–219.
[39] X. Xu, Q.Z. Sheng, L.-J. Zhang, Y. Fan, S. Dustdar, From big data to big service, Computer 48 (7) (2015) 80–83.
[40] W.H. Kruskal, et al., A nonparametric test for the several sample problem, Ann. Math. Stat. 23 (4) (1952) 525–540.
[41] A.M. Mood, Introduction to the Theory of Statistics, 1950.
[42] A. Bagnall, J. Lines, W. Vickers, E. Keogh, The UEA & UCR Time Series Classification Repository, http://www.timeseriesclassification.com. Retrieved 2019-04-23.
[43] P. Sathianwiriyakhun, T. Janyalikit, C.A. Ratanamahatana, Fast and accurate template averaging for time series classification, in: Proceedings of the 2016 8th International Conference on Knowledge and Smart Technology (KST), IEEE, 2016, pp. 49–54.
[44] Z. Zhang, H. Zhang, Y. Wen, X. Yuan, Accelerating time series shapelets discovery with key points, in: Asia-Pacific Web Conference, Springer, 2016, pp. 330–342.
[45] T.-C. Fu, F.-L. Chung, R. Luk, C.-M. Ng, Representing financial time series based on data point importance, Eng. Appl. Artif. Intell. 21 (2) (2008) 277–300.
[46] C. Ji, S. Liu, C. Yang, L. Wu, L. Pan, X. Meng, A piecewise linear representation method based on importance data points for time series data, in: Proceedings of the 2016 IEEE 20th International Conference on Computer Supported Cooperative Work in Design (CSCWD), IEEE, 2016, pp. 111–116.
Cun Ji obtained his BS and PhD degrees from Shandong University, China. His current research interests include time series data analysis, enterprise services computing and services system for manufacturing.
Chao Zhao is a master's student at the School of Computer Science and Technology at Shandong University, China. He is a member of the HCI&VR Group of Shandong University. His main research interests include time series data analysis, cloud computing, etc.
Li Pan obtained her BS, MS and PhD degrees from Shandong University, China. She is a Lecturer of Shandong University. Her current research interests include cloud computing, cloud manufacturing, and market-oriented resource allocation.
Shijun Liu obtained his BS degree in oceanography from Ocean University of China, and MS and PhD degrees in computer science from Shandong University, China. He is a Professor of Shandong University. His current research interests include services computing, enterprise services computing and manufacturing data analytic.
Chenglei Yang obtained his BS, MS and PhD degrees from Shandong University, China. He is a Professor of Shandong University. His current research interests include human computer interaction and virtual reality.
Xiangxu Meng obtained his BS and MS degrees from Shandong University, China, and his PhD degree from the Chinese Academy of Sciences. He is a Professor of Shandong University. His current research interests include human computer interaction, virtual reality and data visualization.