Computer Networks 157 (2019) 89–98
A just-in-time shapelet selection service for online time series classification
Cun Ji a,b,c, Chao Zhao b, Li Pan b,∗, Shijun Liu b,∗, Chenglei Yang b, Xiangxu Meng b
a School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China
b School of Software, Shandong University, Jinan 250101, China
c Shandong Provincial Key Laboratory of Software Engineering, Shandong University, Jinan 250101, China
Article info
Article history: Received 24 April 2018; Revised 26 February 2019; Accepted 22 April 2019; Available online 24 April 2019
Keywords: Industrial internet of things; Cyber-physical system; Time series classification; Shapelet; Subclass split; Local farthest deviation points
Abstract
Time series classification has attracted significant interest over the past decade as a result of the enormous amounts of data that flow into Cyber-Physical Systems. However, in such systems time-variant mechanisms violate the stationarity hypothesis that is usually assumed in the design of classification systems, which impairs classifier accuracy. To cope with this issue, classifiers with just-in-time adaptive training mechanisms are needed: they detect a change in stationarity and modify the classifier configuration accordingly to track the process evolution. This paper proposes an online time series classification system including a just-in-time shapelet selection service (JSSS), which selects shapelets as the features for time series classification. The JSSS is based on a fast shapelet selection algorithm (FSS). First, the FSS samples time series from the training dataset with the help of a subclass splitting method. Next, the FSS identifies Local Farthest Deviation Points (LFDPs) in the sampled time series; the subsequences between pairs of different LFDPs are then selected as shapelet candidates. Through these two steps, the number of shapelet candidates is sharply reduced, and so is the training time, which ensures efficient training and feature extraction in an online time series classification system. Experiments showed that the JSSS obtains results in less than 30 s in the worst case over all datasets. At the same time, classification accuracy rates improved by more than 9.9% in the offline scenario and 7.1% in the online scenario. © 2019 Elsevier B.V. All rights reserved.
1. Introduction
In the Industry 4.0 [1] era, implementations of the Internet of Things (IoT) are becoming increasingly prevalent, and more and more machines are integrated into Cyber-Physical Systems [2,3]. For example, the Chinese government requires that makers of new energy vehicles establish monitoring platforms to track the safety state of their vehicles over their entire life cycle. In this regard, cloud service technology and IoT technology have become more and more important for machine monitoring [4,5]. Usually, machines deliver data at equal time intervals, so the data they send back naturally form time series. Typically, analyzing machine data at a single moment is meaningless. We
R This paper was completed by the first author when he was a Ph.D. student at Shandong University and was supplemented during his work at Shandong Normal University.
∗ Corresponding authors.
E-mail addresses: [email protected] (C. Ji), [email protected] (C. Zhao), [email protected] (L. Pan), [email protected] (S. Liu), [email protected] (C. Yang), [email protected] (X. Meng).
https://doi.org/10.1016/j.comnet.2019.04.020 1389-1286/© 2019 Elsevier B.V. All rights reserved.
need to extract meaning from them by using time series analysis methods. Time series analysis encompasses tasks such as indexing, representation, pattern discovery, clustering, classification, and anomaly detection [6–10]. Effective time series classification has long been an important research problem for both academic researchers and industry practitioners. However, online time series classification is hard because of the high training time complexity of time series classification algorithms. To address this problem, a just-in-time shapelet selection service (JSSS) is proposed in this paper. The JSSS is based on a fast shapelet selection algorithm (FSS), which aims at reducing the number of shapelet candidates. First, the service samples time series from the training dataset with the help of a subclass splitting method. Then, the FSS identifies Local Farthest Deviation Points (LFDPs) in the sampled time series and selects the subsequences between pairs of different LFDPs as shapelet candidates. Through these steps, the number of shapelet candidates can be
greatly reduced, which leads to a significant reduction in time complexity.
This paper makes the following contributions:
• First, we put forward an online time series classification system with seven services (a data receiving service, classification service, training service, feedback service, data transform service, shapelet selection service, and coordinate service), in which the JSSS ensures just-in-time training.
• Second, we propose the FSS, which selects shapelets quickly by reducing the number of candidates.
• Third, we conduct experiments to show the effectiveness of our service. The experiments show that the JSSS obtains results in less than 30 s in the worst case for all datasets, while classification accuracy rates improve at the same time.
The remainder of this paper is structured as follows. Section 2 discusses related work on the Industrial Internet of Things (IIoT) and Cyber-Physical Systems (CPS), time series analysis services, and shapelet-based time series classifiers. Section 3 describes the problem motivation. Section 4 presents our online time series classification system. The proposed JSSS with its two acceleration strategies is introduced in Section 5. Experimental results are presented in Section 6, and our conclusions are given in Section 7.
2. Related work
In this section, we briefly discuss related work on 1) IIoT and CPS, 2) time series analysis services, 3) shapelet-based time series classifiers, and 4) acceleration strategies for shapelet-based classifiers.
2.1. IIoT and CPS
The IoT is the information network of physical objects (such as sensors, devices, vehicles, home appliances, and other items) that allows these objects to interact and cooperate to reach common goals [11–14]. IoT applications include industrial environments, transportation, healthcare, smart homes, smart cities, etc. [11,15–17]. Among them, the IIoT refers in particular to industrial environments.
In the IIoT, smart electronics are embedded into production systems along the life cycle of a product [18]. The IIoT enables manufacturing industries to extend their existing applications and even conceive new ways of operating. For example, manufacturers analyze machine health or predict maintenance through data collected by the IIoT. Many ideas for designing IIoT hardware can be traced back to the idea of CPS [19]. As NASA defines it, a CPS is an "emerging class of physical systems that exhibit complex patterns of behavior due to highly capable embedded software components" [20]. IIoT and CPS have aroused the interest of many academic researchers and industry practitioners; for detailed reviews of IIoT and CPS, please see [18,21–24]. In this paper, we focus on time series data analysis services in CPS and the IIoT. We review related work on such services in the next subsection.
2.2. Time series data analysis services
Much previous work on time series data management for IIoT systems has been carried out. Several time series storage services have been proposed. Yin et al. proposed GridMW, a scalable time series data management system for smart grids [25]. Cudré-Mauroux et al. proposed SciDB, a DBMS-based solution for scientific array data [26]. They
addressed the challenge of efficient time series data storage. Ji et al. proposed a heterogeneous device data ingestion model for an industrial big data platform, which ingests heterogeneous device data from multiple sources and stores them in the format of time series. However, these systems cannot provide generalized services for time series data that come from various IIoT applications.
Some services can be used to search for patterns in time series. Xu et al. proposed time series analytics as a service (TSaaaS) on the IoT [27]. TSaaaS builds and updates an index, so it can search for a pattern much faster than scan-based approaches. Papadimitriou et al. proposed SPIRIT, which incrementally captures correlations and discovers trends in time series efficiently and effectively [28]. Services have also been used in the field of Hyper Spectral Image (HSI) analysis. For example, Qi et al. proposed an ensemble learning framework that applies the boosting technique to learn multiple kernel classifiers for classification problems [29]; however, the main data used in this framework are contiguous spectral images.
Few services focus on time series classification. For this reason, we propose an online time series classification system. In our system, we adopt the JSSS, which can select shapelets in a very short time. Thus, our system can use shapelet-based classifiers online.
2.3. Shapelet-based classifiers
Two types of classification algorithms exist: those that consider the whole time series sequence as a global feature for classification, and those that consider a portion of a single time series sequence as a local feature for classification [30]. Shapelet-based algorithms are well-known local feature-based methods. Recent empirical evidence strongly suggests that shapelet-based algorithms outperform many previous classification algorithms [30]. Shapelet-based algorithms are interpretable, more accurate, and faster than state-of-the-art classifiers.
Since 2009, when shapelet-based classifiers were introduced [31], they have aroused the interest of many researchers. Lines et al. proposed a single-scan shapelet algorithm, called ST, that finds the best k shapelets and uses them to produce a transformed dataset, where each of the k features represents the distance between a time series and a shapelet [32,33]. In this way, time series classification problems are transformed into general classification problems [34]. However, the time complexity of the shapelet selection process in ST is high. Many techniques have been proposed to speed up the shapelet selection process; we briefly review acceleration strategies in the next subsection.
2.4. Acceleration strategies
Some methods speed up the shapelet selection process by reducing the time complexity of evaluating each candidate shapelet. Ye and Keogh developed two speedup methods: subsequence distance early abandon (SDEA) and admissible entropy pruning (AEP) [35]. In SDEA, once the accumulated distance is larger than the current smallest distance, the computation is abandoned. AEP calculates a cheap-to-compute upper bound on the information gain and uses it to admissibly prune certain candidates. Mueen et al. accelerated shapelet discovery by reusing computations and pruning the search space [36]. Xing et al. accelerated the shapelet selection process by sharing the computation among different local shapelets [37]. Both of these speedup methods can only handle non-normalized time series. He et al. reduced the runtime by exploiting infrequent shapelet candidates [38]. They assumed that discriminative subsequences are usually
Fig. 1. The diagram of a time series classification system for machine status monitoring.
infrequent compared to other subsequences; this assumption may not hold in some databases.
Other methods speed up the shapelet selection process by reducing the number of shapelet candidates. Xing et al. [37] set the minimum length of shapelets to 5. If the length of the time series is greater than 100, the maximum shapelet length is set to half the time series length. Moreover, the step size is set to 3. Through these settings, only about 1/3 or 1/6 of the subsequences are selected as shapelet candidates. Hills et al. defined a simple algorithm for estimating the minimum and maximum length of the shapelets [33]. They randomly selected ten series, found the best ten shapelets in this subset of the data, and repeated this ten times. The resulting shapelets were sorted by length, with the length of the 25th shapelet returned as the minimum length and the length of the 75th shapelet returned as the maximum length.
3. Problem motivation
Effective time series classification has long been an important research problem for both academic researchers and industry practitioners. For example, Fig. 1 shows a time series classification system for machine status monitoring (such as the Sony AIBO Robot system). The system receives machine data and classifies them in the format of time series. However, two problems arise for such a system:
A) In order to achieve high classification accuracy rates, the trained classifier needs to be updated dynamically; otherwise, poor classification accuracy may occur. Thus, newly available supervised information and feedback need to be integrated into the classifier in order to improve classification accuracy. Moreover, the initial training set in a machine status monitoring system is sometimes very small. For example, in the Sony AIBO Robot system, only 20 and 27 time series can be used to build a classifier for the two surface parameters at the beginning.
B) Many classical classification methods are not suitable for time series due to their enormous training time. For example, execution on the ChlorineCon dataset (see Section 6) takes 100,421 s with a traditional algorithm and 697 s even with a fast algorithm. As a result, a faster algorithm is needed if this dataset is to be used in an online classification system.
The focus of this work was on finding a time series classification service that can train a classifier online and achieve a good classification accuracy rate. For this reason, we designed an online time series classification system. This system selects shapelets as the features of time series. Then, we can transform time series classification problems into general classification problems with
the help of shapelets. In this way, our online time series system can train the classifier online while maintaining a good classification accuracy rate.
4. Our online time series classification system
Implementing various services across multiple domains, networks, and the cyber-physical world is imperative for efficiently analyzing and processing big data [39]. In this section, we introduce our online time series classification system. As Fig. 2 shows, the system is composed of seven services: the data receiving, classification, training, feedback, data transform, shapelet selection, and coordinate services. These services work together to speed up data processing. Among these seven services, two kinds of data streams exist: data for classification, represented with solid lines, and data for training, represented with dotted lines. Three data caches are also available in the pipeline-style service system: the shapelets cache, the classifier cache, and the training data cache. Every cache is refreshed by a former service and used by a subsequent service in the data pipeline.
4.1. The coordinate service
The data transform service uses the shapelets cache, which is refreshed by the shapelet selection service. The classification service uses a classifier built by the training service. To ensure that the system works in a stable way, our system provides the coordinate service, which is responsible for composing and coordinating a series of service calls by sending messages and data. The message flow between the coordinate service and the other services is shown in Figs. 3 and 4, where the sequence of control messages is marked by numbers.
When a mistake occurs in classification, the feedback service sends a "refresh shapelet cache" message to the coordinate service.
Then, the coordinate service makes the data transform service wait until a new training process is completed. Subsequently, the shapelet selection service is started by the coordinate service. After new shapelets are selected, the data transform service is resumed and new data are generated by it. When the classifier cache needs to be refreshed, the data transform service sends a "refresh classifier cache" message to the coordinate service once it has resumed. The coordinate service then makes the classification service wait until the training service finishes refreshing the classifier cache. After the classifier cache is
Fig. 2. Online time series classification system and the 7 services.
Fig. 3. The message flow between the coordinate service and other services while refreshing shapelets cache.
Fig. 4. The message flow between the coordinate service and other services while refreshing the classifier cache.
updated, the coordinate service resumes the classification service.
The most important property of the pipeline process is just-in-time shapelet selection, because new shapelets need to be selected to rebuild the classifier whenever newly available supervised information and feedback are generated.
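As an illustration of this message-driven coordination, the following toy sketch (our own; the message strings follow the text, but the class, field, and method names are hypothetical) pauses and resumes the dependent services:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the coordinate service's pause/resume protocol.
public class Coordinator {
    enum State { RUNNING, WAITING }
    State dataTransform = State.RUNNING;
    State classification = State.RUNNING;
    Queue<String> log = new ArrayDeque<>();

    void onMessage(String msg) {
        switch (msg) {
            case "refresh shapelet cache":      // from the feedback service
                dataTransform = State.WAITING;  // pause transforms during reselection
                log.add("shapelet selection started");
                break;
            case "shapelet cache refreshed":    // selection finished
                dataTransform = State.RUNNING;  // resume transforms
                break;
            case "refresh classifier cache":    // from the data transform service
                classification = State.WAITING; // pause classification during training
                log.add("training started");
                break;
            case "classifier cache refreshed":  // training finished
                classification = State.RUNNING;
                break;
        }
    }

    public static void main(String[] args) {
        Coordinator c = new Coordinator();
        c.onMessage("refresh shapelet cache");
        c.onMessage("shapelet cache refreshed");
        System.out.println(c.dataTransform); // prints RUNNING
    }
}
```

The real service would additionally forward data between services; this sketch only tracks the wait/resume states described above.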
4.2. The data receiving service
The main function of the data receiving service is to accept data and transform them into the format of time series. It is also responsible for transporting the time series to the data transform service.
dist(s, S) = \sqrt{\sum_{j=1}^{l} (S_j - s_j)^2}    (2)
The data transform service uses the shapelets cache, which is refreshed by the shapelet selection service.
4.4. The shapelet selection service
Shapelet-based algorithms are interpretable, more accurate, and faster than state-of-the-art classifiers. However, the shapelet selection process is time consuming, and in many cases shapelet selection must be executed online. For example, in new energy vehicle monitoring applications, feature extraction and reporting must be completed within only one minute after the training data are updated. Thus, the shapelet selection service must be fast enough to satisfy this requirement. For this reason, the JSSS for online time series classification is proposed. The JSSS is based on the FSS, which adopts two acceleration strategies: first, it samples time series from the training dataset with the help of the subclass splitting method; then, it identifies the LFDPs of the sampled time series and selects the subsequences between pairs of different LFDPs as shapelet candidates. Through these two steps, the number of shapelet candidates can be greatly reduced, which, in turn, leads to a significant reduction in training time. We describe the FSS in detail in the next section.
Fig. 5. Sliding window whole.
4.5. The training service
The training service uses the transformed training data to build the classifier. In our online time series classification system, the training time series data are transformed in advance. Therefore, classification can be viewed as a general classification problem, and many classical classification methods, such as the C4.5 tree, 1NN, Naive Bayes, and Random Forest, can be used in our system. The training service builds the classifier that is used in the classification service.
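As a sketch of how a classical classifier operates on the transformed data, the following hypothetical 1NN routine (our own illustration; all names are ours) treats each training instance as a vector of distances to the selected shapelets:

```java
// 1NN on shapelet-transformed data: each instance is a vector of
// distances to the selected shapelets, plus a class label.
public class OneNN {
    static int classify(double[][] trainX, int[] trainY, double[] x) {
        int best = 0;
        double bestD = Double.POSITIVE_INFINITY;
        for (int i = 0; i < trainX.length; i++) {
            double d = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = trainX[i][j] - x[j];
                d += diff * diff; // squared Euclidean distance in feature space
            }
            if (d < bestD) { bestD = d; best = trainY[i]; }
        }
        return best; // label of the nearest training instance
    }

    public static void main(String[] args) {
        double[][] X = {{0.1, 2.0}, {0.2, 1.9}, {1.8, 0.1}}; // toy transformed data
        int[] y = {0, 0, 1};
        System.out.println(classify(X, y, new double[]{1.7, 0.2})); // prints 1
    }
}
```

Any of the classifiers named above could be substituted; the point is that, after the transform, the learner never sees the raw time series.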
Fig. 6. Sliding window partly.
The length of the time series is usually determined by the time series in the training dataset. Our data receiving service provides two methods to generate time series: one, called sliding window whole (Fig. 5), generates time series without overlapping subsequences; the other, called sliding window partly (Fig. 6), generates time series with overlapping subsequences. In addition, the data receiving service needs to handle missing data; it fills in missing values by interpolating between adjacent data points.
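The two window-generation modes can be sketched as follows (a minimal illustration; the class and method names are ours):

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindow {
    // "Sliding window whole": non-overlapping windows of length w.
    static List<double[]> whole(double[] data, int w) {
        List<double[]> out = new ArrayList<>();
        for (int i = 0; i + w <= data.length; i += w) {
            double[] win = new double[w];
            System.arraycopy(data, i, win, 0, w);
            out.add(win);
        }
        return out;
    }

    // "Sliding window partly": windows of length w advanced by step < w,
    // so consecutive windows overlap.
    static List<double[]> partly(double[] data, int w, int step) {
        List<double[]> out = new ArrayList<>();
        for (int i = 0; i + w <= data.length; i += step) {
            double[] win = new double[w];
            System.arraycopy(data, i, win, 0, w);
            out.add(win);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] d = {1, 2, 3, 4, 5, 6};
        System.out.println(whole(d, 3).size());     // prints 2
        System.out.println(partly(d, 3, 1).size()); // prints 4
    }
}
```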
4.6. The classification service
After the training service refreshes the classifier cache, the classification service uses the classifier to classify newly transformed data. In this service, unlabeled transformed data are assigned to one of two or more predefined classes, which are given in the training time series. After this, the classification process is finished and the result is shown to the user.
4.7. The feedback service
4.3. The data transform service
Many classical classification methods do not adapt well to time series because of the continuity of time series. For this reason, we transform time series classification problems into general classification problems. Each attribute of the transformed data corresponds to the distance between a shapelet and the original time series. The distance between a shapelet S and a time series T can be computed as (1). In (1), T_i^l is a contiguous subsequence of the time series whose start point is i and whose length is l; l is also the length of S. We use the Euclidean distance in (1): the distance between a sequence s = {s_1, s_2, ..., s_j, ..., s_l} and a shapelet S = {S_1, S_2, ..., S_j, ..., S_l} is calculated as (2).
dist(T, S) = \min \{ dist(T_1^l, S), \ldots, dist(T_i^l, S), \ldots, dist(T_{m-l+1}^l, S) \}, \quad i = 1, \ldots, m-l+1    (1)
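A minimal sketch of the distance computations in (1) and (2) (our own illustration; no normalization is applied, and the class and method names are ours):

```java
public class ShapeletDistance {
    // Equation (2): Euclidean distance between a shapelet S and an
    // equal-length subsequence s of the time series.
    static double dist(double[] s, double[] S) {
        double sum = 0.0;
        for (int j = 0; j < S.length; j++) {
            double d = S[j] - s[j];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Equation (1): minimum over every length-l subsequence T_i^l of T.
    static double dist(double[] T, double[] S, int l) {
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i + l <= T.length; i++) {
            double[] sub = java.util.Arrays.copyOfRange(T, i, i + l);
            double d = dist(sub, S);
            if (d < best) best = d;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] T = {0, 0, 1, 2, 1, 0};
        double[] S = {1, 2, 1};
        System.out.println(dist(T, S, 3)); // S occurs exactly in T, prints 0.0
    }
}
```

Each such minimum distance becomes one attribute of the transformed instance, with one attribute per selected shapelet.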
The feedback service has two main functions. First, it shows classification results to users. Second, it triggers the shapelet selection service if misclassified results are detected. The data in the testing set are sent one by one to the simulation model. When a misclassification occurs, the feedback service triggers the online training service: new shapelets are reselected and the classifier is updated.
5. The just-in-time shapelet selection service
In our online time series classification system, the most critical part is the shapelet selection service. The selected shapelets are used to transform time series classification problems into general classification problems.
Fig. 7. Flow diagram of the FSS.
In order to meet the real-time requirements of online time series classification, the JSSS is put forward. The flow diagram of the JSSS is shown in Fig. 7. The JSSS is based on the FSS. For every training dataset, three processes yield the final shapelets. First, the FSS samples time series from the training dataset with the help of the subclass splitting method. Next, the FSS identifies LFDPs in the sampled time series and selects the subsequences between pairs of different LFDPs as shapelet candidates. Finally, the FSS assesses the shapelet candidates and obtains the final shapelets.
Two acceleration strategies are applied: "sample time series" reduces the number of time series used to generate shapelet candidates, while "generate candidates based on LFDPs" reduces the number of shapelet candidates generated per time series. Through these two steps, the number of shapelet candidates can be greatly reduced, which, in turn, leads to a significant reduction in shapelet selection time.
Four assessment methods are available for measuring shapelet quality [33]: information gain, Kruskal–Wallis [40], F-statistic, and Mood's median [41]. In this paper, we use information gain, which is the most common method. Specifically, when assessing shapelet candidates, we calculate their information gain on the whole dataset. Finally, the candidates with the k highest information gains are selected as the final shapelets and returned (for a detailed description of the assessment of shapelet candidates, please see [32,33]).
5.1. Time series sampling
A certain number of time series is usually available for every class in a training set. For example, the dataset "ChlorineConcentration", from the UEA & UCR Time Series Classification Repository [42], has 3 classes and 467 time series. Why not simply sample time series from every class? Sathianwiriyakhun et al.
demonstrated that directly sampling one or a few time series per class is often insufficient and can greatly reduce accuracy [43]. One important reason is that subclasses are typically present in most real-world datasets. Consequently, we sample time series with the help of the subclass splitting method. As shown in Fig. 8, three steps are used to sample the time series:
• Step 1: Select the criteria time series;
• Step 2: Split subclasses;
• Step 3: Sample time series.
Fig. 8. Processes to sample time series.
5.1.1. Step 1: Select the criteria time series
Suppose there are n_c time series in one class and the time series length is m:

D_c = \{T_1, T_2, \ldots, T_i, \ldots, T_{n_c}\}    (3)

T_i = \{t_1, t_2, \ldots, t_j, \ldots, t_m\}    (4)
First, the sum of the values of every time series T_i is calculated as (5), and the mean of these sums as (6). Then, the time series closest to the average sum is selected as the criteria for splitting the subclass, as described in (7).
Sum_i = \sum_{j=1}^{m} t_j    (5)

Mean = \frac{1}{n_c} \sum_{i=1}^{n_c} Sum_i    (6)

T_C = \arg\min_{i \in \{1, 2, \ldots, n_c\}} |Sum_i - Mean|    (7)
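Equations (5)–(7) amount to picking the series whose sum of values is closest to the class-average sum, which can be sketched as follows (our own illustration; names are ours):

```java
public class CriteriaSelection {
    // Equations (5)-(7): return the index of the criteria time series.
    static int selectCriteria(double[][] series) {
        int nc = series.length;
        double[] sums = new double[nc];
        double mean = 0.0;
        for (int i = 0; i < nc; i++) {
            for (double v : series[i]) sums[i] += v;   // (5) per-series sum
            mean += sums[i];
        }
        mean /= nc;                                    // (6) mean of the sums
        int best = 0;
        for (int i = 1; i < nc; i++)                   // (7) argmin |Sum_i - Mean|
            if (Math.abs(sums[i] - mean) < Math.abs(sums[best] - mean)) best = i;
        return best;
    }

    public static void main(String[] args) {
        double[][] cls = {{1, 1, 1}, {2, 2, 2}, {9, 9, 9}};
        // sums = 3, 6, 27; mean = 12 -> series 1 (sum 6) is closest
        System.out.println(selectCriteria(cls)); // prints 1
    }
}
```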
5.1.2. Step 2: Split subclasses
After determining the criteria, we calculate the Euclidean distance between each time series in the class and the criteria. The Euclidean distance between T_i = {t_1, t_2, ..., t_j, ..., t_m} and T_c = {c_1, c_2, ..., c_j, ..., c_m} is calculated as (8).
dist = \sqrt{\sum_{j=1}^{m} (t_j - c_j)^2}    (8)
Next, we split the subclass as in [24]. First, we sort the time series by these distance values and determine the adjacent discrepancies of the distance values. Then, we calculate the standard deviation of these adjacent discrepancies. Subsequently, we separate the data into subclasses by splitting at each position whose discrepancy is larger than half of the computed standard deviation.
5.1.3. Step 3: Sample time series
Finally, we sample time series from every subclass. In every subclass, the time series with the smallest sum of distances to the other time series in that subclass is selected as the sample.
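The gap-based split of Step 2 can be sketched as follows (our own illustration; for brevity the sketch only counts the resulting subclasses rather than materializing them):

```java
import java.util.Arrays;

public class SubclassSplit {
    // Split sorted distances into subclasses wherever the gap between
    // adjacent values exceeds half the standard deviation of all gaps.
    static int countSubclasses(double[] distances) {
        double[] d = distances.clone();
        Arrays.sort(d);
        int n = d.length - 1;                  // number of adjacent gaps
        double[] gaps = new double[n];
        double mean = 0.0;
        for (int i = 0; i < n; i++) { gaps[i] = d[i + 1] - d[i]; mean += gaps[i]; }
        mean /= n;
        double var = 0.0;
        for (double g : gaps) var += (g - mean) * (g - mean);
        double sd = Math.sqrt(var / n);
        int subclasses = 1;
        for (double g : gaps)
            if (g > sd / 2) subclasses++;      // split at every large gap
        return subclasses;
    }

    public static void main(String[] args) {
        // Two tight clusters of distances separated by one big jump.
        double[] dist = {0.1, 0.2, 0.3, 5.0, 5.1, 5.2};
        System.out.println(countSubclasses(dist)); // prints 2
    }
}
```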
Table 1
Datasets.
5.2. Generating candidates based on LFDPs
Every sampled time series still has too many subsequences; we should select those with certain significance as shapelet candidates. Zhang et al. use key points to retain shapelet candidates that have certain significance [44]. In this research, we generate shapelet candidates with the help of LFDPs. LFDPs are defined as the points that satisfy the following conditions: the point is in the sequence with the maximum weight, and the point is at the maximum distance from the line of best fit of that sequence. In the LFDP definition, the sequence weight is related to the distances between the points and the fitting line of the sequence. The weight of one sequence is expressed as (9):
weight = \max \{ dist_{sum}, 2 \cdot dist_{max} \}    (9)
In (9), dist_sum is the sum of the distances of all points in the sequence and dist_max is the maximum among these distances. The distances are the fitting errors of the points with respect to the fitting line of the sequence. There are three types of distance from a point to the fitting line of a sequence: Euclidean distance, orthogonal distance, and vertical distance [45,46]. The orthogonal distance and the vertical distance are widely used. In the generation of LFDPs, the vertical distance is equivalent to the orthogonal distance but can be computed more conveniently; in our work, we use the vertical distance.
The LFDP selection method selects the LFDPs of a time series. Sequences between any two LFDPs are selected as shapelet candidates. Note that we do not use LFDPs to downsample the time series; they are selected as the start points and end points of the shapelet candidates. On the basis of the definition of LFDPs, we show how to select LFDPs from a time series in Algorithm 1.

Algorithm 1 LFDP Selection Algorithm.
Input: time series data: T = {t_1, t_2, ..., t_m}
Output: the indices of the LFDPs
1: int[] LFDP = new int[√m + 1];
2: LFDP[0] = 1; LFDP[1] = m;
3: List list = new ArrayList();
4: list.add(T);
5: int tempNumber = 2;
6: while tempNumber < √m + 1 do
7:    sortSegmentByWeight(list);
8:    Segment s = list.get(0);
9:    int LFDPIndex = getLFDPIndex(s);
10:   LFDP[tempNumber] = LFDPIndex;
11:   tempNumber++;
12:   Segment sL = leftSegment(s, LFDPIndex);
13:   Segment sR = rightSegment(s, LFDPIndex);
14:   list.add(sL); list.add(sR);
15:   list.remove(0);
16: end while
17: return sortLFDPFromSmallToBig(LFDP)

First, as shown in Line 2 of Algorithm 1, the first and last points of the time series are selected as LFDPs, and the sequence between them (the whole time series) is put into a sequence list (Lines 3–4).
Then, as in Lines 6–16 of Algorithm 1, the following steps are repeated until the number of selected LFDPs reaches √m + 1 (m is the length of the time series):
Step 1: Sort the sequences in the sequence list by weight from largest to smallest (Line 7 in Algorithm 1).
Dataset        Number of classes   Training set size   Testing set size   Time series length
ChlorineCon    3                   467                 3840               166
Coffee         2                   28                  28                 286
ECGFiveDays    2                   23                  861                136
Trace          4                   100                 100                275
TwoLeadECG     2                   23                  1139               82
Step 2: Select the sequence with the largest weight from the sequence list (Line 8 in Algorithm 1).
Step 3: Select the point at the maximum distance from the line of best fit of the sequence selected in Step 2 as a new LFDP (Lines 9–10 in Algorithm 1).
Step 4: Divide the sequence selected in Step 2 into two sequences at the new LFDP selected in Step 3, and put both sequences into the sequence list (Lines 12–14 in Algorithm 1).
Step 5: Remove the sequence selected in Step 2 from the sequence list (Line 15 in Algorithm 1).
After obtaining the LFDPs, we can generate the shapelet candidates of a sampled time series with their help: any subsequence between two different LFDPs is selected as a shapelet candidate. The candidate generating process is shown in Algorithm 2.

Algorithm 2 generateCandidatesWithLFDP.
Input: time series data: T = {t_1, t_2, ..., t_m}
Input: the endpoints selected for T: LFDP
Output: candidates: shapelet candidates
1: candidates = ∅;
2: for j = 1 to LFDP.length − 2 do
3:    begin = LFDP[j];
4:    for i = j + 1 to LFDP.length do
5:       end = LFDP[i];
6:       S = {t_begin, t_begin+1, t_begin+2, ..., t_end−1};
7:       candidates.add(S);
8:    end for
9: end for
10: return candidates
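The candidate-generation step can be sketched in Java as follows (our own illustration; the LFDP array holds 1-based positions into T, and each candidate spans t_begin through t_end−1 as in Algorithm 2):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CandidateGeneration {
    // Every subsequence between two different LFDPs becomes a candidate.
    static List<double[]> generate(double[] T, int[] lfdp) {
        List<double[]> candidates = new ArrayList<>();
        for (int j = 0; j < lfdp.length - 1; j++)
            for (int i = j + 1; i < lfdp.length; i++)
                // 1-based t_begin .. t_{end-1} maps to 0-based [begin-1, end-1)
                candidates.add(Arrays.copyOfRange(T, lfdp[j] - 1, lfdp[i] - 1));
        return candidates;
    }

    public static void main(String[] args) {
        double[] T = {1, 2, 3, 4, 5, 6};
        int[] lfdp = {1, 3, 6};                       // 3 LFDPs -> C(3,2) pairs
        System.out.println(generate(T, lfdp).size()); // prints 3
    }
}
```

With k selected LFDPs, only k(k−1)/2 candidates are generated per sampled series, instead of the O(m²) subsequences of the raw time series.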
6. Experimental evaluation
6.1. Datasets
We performed experiments on five datasets from the UEA & UCR Time Series Classification Repository [42], listed in Table 1. We selected these datasets for two reasons: 1) they are sensor reading or human sensor reading time series; 2) they are relatively large scale: on average no fewer than 10 time series per class, and a time series length of no less than 80.
6.2. Shapelet selection time comparison
The experiments were carried out in Java on a 3.10 GHz Intel Core i5 CPU with 16 GB of 1333 MHz DDR3 memory, using MyEclipse with JDK 1.8. The number of shapelets (k) was set to half of the size of the training set. We compared the shapelet selection
Fig. 9. Shapelet selection time comparison.
Table 2
Shapelet selection time comparison (s).

Dataset        ST          ST-LFDP   FSS
ChlorineCon    100421.03   697.62    6.66
Coffee         165.74      1.08      0.30
ECGFiveDays    24.66       0.40      0.30
Trace          15612.81    61.84     20.01
TwoLeadECG     4.66        0.15      0.09

time of ST (the state-of-the-art classifier), ST-LFDP (using LFDPs to speed up, but without the sampling technology), and FSS. Table 2 and the plots in Fig. 9 report the shapelet selection execution times obtained by applying ST, ST-LFDP, and FSS to the different datasets. As Fig. 9 shows, the FSS is the fastest method. For all datasets, the FSS can select shapelets in less than 30 s in the worst case. Thus, it can be used in an online time series classification system.

6.3. Offline classification accuracy rate comparison

Table 3 lists the classification accuracy rates of four well-known classifiers using either the original time series or the data transformed by the FSS. In Table 3, the higher classification accuracy rate for each classifier is given in bold. As Table 3 shows, the transformed data produced better results on more datasets for all four classifiers. Further, the average classification accuracy rates on the transformed data are higher than the corresponding average rates on the original data. For the C4.5 tree, 1NN, Naive Bayes, and Random Forest, the average classification accuracy rates improved by 9.9%, 11.0%, 11.7%, and 11.9%, respectively. In short, the classification accuracy rates on the data transformed by the FSS improved by more than 9.9%.

6.4. Online classification comparison
Typically, the training set in a machine status monitoring system is very small at the beginning. For example, in the Sony AIBO Robot system, mentioned in the problem motivation in Section 3, only 20 and 27 time series are available at the beginning to build the classifiers for the two surface parameters, respectively. The dataset is shown in Table 4. Because the training set is so small, the classification results with offline training are not ideal. Over time, the training set grows larger and larger. However, the classifier previously could not be updated dynamically because of the delay in shapelet selection. Nevertheless, the FSS is fast enough to ensure just-in-time shapelet selection. The selected shapelets are then used to transform the data, and the classifier is built on the transformed data, which ensures efficient training and feature extraction in an online time series classification system. We also compared online and offline time series classification. Fig. 10 shows the classification accuracy rates for the Sony AIBO Robot system online and offline. Note that in the online classification comparison experiments, we used 1NN to classify the data. As Fig. 10 shows, the online classification accuracy rates are higher than the offline ones. For "surface1", the improvement is 12.2%; for "surface2", it is 7.1%. Moreover, both online accuracy rates are higher than 90.0%.
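The transform-then-classify pipeline described above can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the shapelet distance here is plain Euclidean over same-length subsequences, without the normalization or pruning optimizations an actual implementation would likely use, and all function names are ours.

```python
import numpy as np

def min_dist(ts, shapelet):
    """Distance between a series and a shapelet: the minimum Euclidean
    distance over all same-length subsequences of the series."""
    l = len(shapelet)
    s = np.asarray(shapelet, dtype=float)
    return min(np.linalg.norm(np.asarray(ts[i:i + l], dtype=float) - s)
               for i in range(len(ts) - l + 1))

def transform(dataset, shapelets):
    """Shapelet transform: each series becomes a feature vector of its
    distances to the selected shapelets."""
    return np.array([[min_dist(ts, s) for s in shapelets] for ts in dataset])

def classify_1nn(train_X, train_y, query):
    """1NN in the transformed feature space, as used in the online
    comparison experiments."""
    d = np.linalg.norm(train_X - query, axis=1)
    return train_y[int(np.argmin(d))]
```

When new labelled series arrive, the online system would re-run shapelet selection, rebuild `train_X` with `transform`, and keep classifying with `classify_1nn`; because selection is fast, this rebuild can happen just in time.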
Table 3
Accuracy rate comparison between original data and transformed data.

                C4.5 Tree            1NN                  Naive Bayes          Random Forest
Datasets        origin  transformed  origin  transformed  origin  transformed  origin  transformed
ChlorineCon     0.688   0.562        0.650   0.602        0.346   0.415        0.713   0.655
Coffee          0.929   0.929        1.000   1.000        0.929   0.893        1.000   0.929
ECGFiveDays     0.721   0.972        0.797   0.994        0.797   0.995        0.721   0.999
Trace           0.790   0.990        0.760   0.980        0.800   1.000        0.780   1.000
TwoLeadECG      0.718   0.886        0.747   0.932        0.699   0.852        0.724   0.953
Average Rate    0.769   0.868        0.791   0.901        0.714   0.831        0.788   0.907
Rate Improved   /       0.099        /       0.110        /       0.117        /       0.119
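As a quick arithmetic check, the "Rate Improved" row follows (up to rounding) from the per-dataset rates in Table 3; the snippet below simply recomputes the mean improvement per classifier from the values above.

```python
# Accuracy rates copied from Table 3: (original, transformed) per classifier,
# rows ordered ChlorineCon, Coffee, ECGFiveDays, Trace, TwoLeadECG.
table3 = {
    "C4.5 Tree":     ([0.688, 0.929, 0.721, 0.790, 0.718],
                      [0.562, 0.929, 0.972, 0.990, 0.886]),
    "1NN":           ([0.650, 1.000, 0.797, 0.760, 0.747],
                      [0.602, 1.000, 0.994, 0.980, 0.932]),
    "Naive Bayes":   ([0.346, 0.929, 0.797, 0.800, 0.699],
                      [0.415, 0.893, 0.995, 1.000, 0.852]),
    "Random Forest": ([0.713, 1.000, 0.721, 0.780, 0.724],
                      [0.655, 0.929, 0.999, 1.000, 0.953]),
}

# Improvement = mean transformed accuracy minus mean original accuracy.
improvements = {name: sum(t) / len(t) - sum(o) / len(o)
                for name, (o, t) in table3.items()}
```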
Table 4
Datasets for Sony AIBO robot surface.

Dataset    Number of classes   Training set size   Testing set size   Time series length
Surface1   2                   20                  601                70
Surface2   2                   27                  953                65
Fig. 10. Classification accuracy rate comparison between online and offline.
7. Conclusion

Effective time series classification has long been an important research problem for both academic researchers and industry practitioners. However, the training time of most time series classification algorithms is high, and many classical classification methods do not adapt to time series because of the continuity of time series. To address these two main difficulties, we have presented a just-in-time shapelet selection service for an online time series classification system. The JSSS can reselect shapelets very quickly whenever the old classifier of our system needs to be updated. The reselected shapelets are then viewed as the new features, with whose help time series classification problems can be transformed into general classification problems. The experiments show that the JSSS can obtain results in less than 30 s in the worst case for all the datasets. At the same time, the classification accuracy rates improved by more than 9.9% in the offline scenario and 7.1% in the online scenario.

Conflict of interest

The authors have no conflicts of interest to declare.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [grant number 61872222]; the National Key Research and Development Program of China [grant number 2017YFA0700601]; the Major Program of Shandong Province Natural Science Foundation [grant number ZR2018ZB0419]; the Key Research and Development Program of Shandong Province [grant numbers 2017CXGC0605, 2017CXGC0604, 2018GGX101019]; and the Young Scholars Program of Shandong University.

References

[1] S. Shead, Industry 4.0: the next industrial revolution, Engineer (2013).
[2] N. Jazdi, Cyber physical systems in the context of Industry 4.0, in: Proceedings of the 2014 IEEE International Conference on Automation, Quality and Testing, Robotics, IEEE, 2014, pp. 1–4.
[3] P. Derler, E.A. Lee, S. Tripakis, M. Törngren, Cyber-physical system design contracts, in: Proceedings of the ACM/IEEE 4th International Conference on Cyber-Physical Systems, ACM, 2013, pp. 109–118.
[4] J. Bughin, M. Chui, J. Manyika, Clouds, big data, and smart assets: ten tech-enabled business trends to watch, McKinsey Q. 56 (1) (2010) 75–86.
[5] C. Ji, S. Liu, C. Yang, L. Cui, L. Pan, L. Wu, Y. Liu, A self-evolving method of data model for cloud-based machine data ingestion, in: Proceedings of the 2016 IEEE 9th International Conference on Cloud Computing, IEEE, 2016, pp. 814–819.
[6] T.-C. Fu, A review on time series data mining, Eng. Appl. Artif. Intell. 24 (1) (2011) 164–181.
[7] P. Esling, C. Agon, Time-series data mining, ACM Comput. Surv. (CSUR) 45 (1) (2012) 12.
[8] C. Ji, C. Zhao, L. Pan, S. Liu, C. Yang, L. Wu, A fast shapelet discovery algorithm based on important data points, Int. J. Web Serv. Res. (IJWSR) 14 (2) (2017) 67–80.
[9] C. Ji, S. Liu, C. Yang, L. Pan, L. Wu, X. Meng, A shapelet selection algorithm for time series classification: new directions, Procedia Comput. Sci. 129 (2018) 461–467.
[10] C. Ji, C. Zhao, S. Liu, C. Yang, L. Pan, L. Wu, X. Meng, A fast shapelet selection algorithm for time series classification, Comput. Netw. 148 (2019) 231–240.
[11] S. Jeschke, C. Brecher, H. Song, D.B. Rawat, Industrial Internet of Things: Cybermanufacturing Systems, Springer, 2016.
[12] L. Atzori, A. Iera, G. Morabito, The internet of things: a survey, Comput. Netw. 54 (15) (2010) 2787–2805.
[13] Y. Sun, H. Yan, C. Lu, R. Bie, Z. Zhou, Constructing the web of events from raw data in the web of things, Mob. Inf. Syst. 10 (1) (2014) 105–125.
[14] Y. Sun, A.J. Jara, An extensible and active semantic model of information organizing for the internet of things, Pers. Ubiquitous Comput. 18 (8) (2014) 1821–1833.
[15] H. Song, D. Zhang, A. Jara, J. Wan, K. Boussetta, IEEE Access special section editorial: smart cities, IEEE Access 4 (2016) 3671–3674.
[16] H. Song, R. Srinivasan, T. Sookoor, S. Jeschke, Smart Cities: Foundations, Principles, and Applications, Wiley Publishing, 2017.
[17] M.N. Sadiku, Y. Wang, S. Cui, S.M. Musa, Industrial internet of things, IJASRE 3 (2017).
[18] Y. Liao, E.d.F.R. Loures, F. Deschamps, Industrial internet of things: a systematic literature review and insights, IEEE Internet Things J. (2018).
[19] S. Jeschke, C. Brecher, T. Meisen, D. Özdemir, T. Eschert, Industrial internet of things and cyber manufacturing systems, in: Industrial Internet of Things, Springer, 2017, pp. 3–19.
[20] NASA, Cyber-Physical Systems Modeling and Analysis (CPSMA) Initiative, http://www.nasa.gov/centers/ames/cct/office/studies/cyber-physical_systems.html. Accessed September 30, 2018.
[21] H. Song, D.B. Rawat, S. Jeschke, C. Brecher, Cyber-Physical Systems: Foundations, Principles and Applications, Morgan Kaufmann, 2016.
[22] H. Song, G.A. Fink, S. Jeschke, Security and Privacy in Cyber-physical Systems: Foundations, Principles, and Applications, John Wiley & Sons, 2017.
[23] Y. Lu, Industry 4.0: a survey on technologies, applications and open research issues, J. Ind. Inf. Integr. 6 (2017) 1–10.
[24] J.A. Saucedo-Martínez, M. Pérez-Lara, J.A. Marmolejo-Saucedo, T.E. Salais-Fierro, P. Vasant, Industry 4.0 framework for management and operations: a review, J. Ambient Intell. Hum. Comput. (2017) 1–13.
[25] J. Yin, A. Kulkarni, S. Purohit, I. Gorton, B. Akyol, Scalable real time data management for smart grid, in: Proceedings of the Middleware 2011 Industry Track Workshop, ACM, 2011, p. 1.
[26] P. Cudré-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D.L. Wang, M. Balazinska, J. Becla, et al., A demonstration of SciDB: a science-oriented DBMS, Proc. VLDB Endow. 2 (2) (2009) 1534–1537.
[27] X. Xu, S. Huang, Y. Chen, K. Brown, I. Halilovic, W. Lu, TSAaaS: time series analytics as a service on IoT, in: Proceedings of the 2014 IEEE International Conference on Web Services (ICWS), IEEE, 2014, pp. 249–256.
[28] S. Papadimitriou, J. Sun, C. Faloutsos, Streaming pattern discovery in multiple time-series, in: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB Endowment, 2005, pp. 697–708.
[29] C. Qi, Z. Zhou, Y. Sun, H. Song, L. Hu, Q. Wang, Feature selection and multiple kernel boosting framework based on PSO with mutation mechanism for hyperspectral classification, Neurocomputing 220 (2017) 181–190.
[30] M. Arathi, A. Govardhan, Accurate time series classification using shapelets, Int. J. Data Min. Knowl. Manage. Process 4 (2) (2014) 39.
[31] L. Ye, E. Keogh, Time series shapelets: a new primitive for data mining, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 947–956.
[32] J. Lines, L.M. Davis, J. Hills, A. Bagnall, A shapelet transform for time series classification, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2012, pp. 289–297.
[33] J. Hills, J. Lines, E. Baranauskas, J. Mapp, A. Bagnall, Classification of time series by shapelet transformation, Data Min. Knowl. Discovery 28 (4) (2014) 851–881.
[34] C. Ji, X. Zou, Y. Hu, S. Liu, L. Lyu, X. Zheng, XG-SF: an XGBoost classifier based on shapelet features for time series classification, Procedia Comput. Sci. 147 (2019) 24–28.
[35] L. Ye, E. Keogh, Time series shapelets: a novel technique that allows accurate, interpretable and fast classification, Data Min. Knowl. Discovery 22 (1–2) (2011) 149–182.
[36] A. Mueen, E. Keogh, N. Young, Logical-shapelets: an expressive primitive for time series classification, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 1154–1162.
[37] Z. Xing, J. Pei, P.S. Yu, K. Wang, Extracting interpretable features for early classification on time series, in: Proceedings of the 2011 SIAM International Conference on Data Mining, SIAM, 2011, pp. 247–258.
[38] Q. He, Z. Dong, F. Zhuang, T. Shang, Z. Shi, Fast time series classification based on infrequent shapelets, in: Proceedings of the 2012 11th International Conference on Machine Learning and Applications (ICMLA), vol. 1, IEEE, 2012, pp. 215–219.
[39] X. Xu, Q.Z. Sheng, L.-J. Zhang, Y. Fan, S. Dustdar, From big data to big service, Computer 48 (7) (2015) 80–83.
[40] W.H. Kruskal, et al., A nonparametric test for the several sample problem, Ann. Math. Stat. 23 (4) (1952) 525–540.
[41] A.M. Mood, Introduction to the Theory of Statistics, 1950.
[42] A. Bagnall, J. Lines, W. Vickers, E. Keogh, The UEA & UCR Time Series Classification Repository, http://www.timeseriesclassification.com. Retrieved 2019-04-23.
[43] P. Sathianwiriyakhun, T. Janyalikit, C.A. Ratanamahatana, Fast and accurate template averaging for time series classification, in: Proceedings of the 2016 8th International Conference on Knowledge and Smart Technology (KST), IEEE, 2016, pp. 49–54.
[44] Z. Zhang, H. Zhang, Y. Wen, X. Yuan, Accelerating time series shapelets discovery with key points, in: Asia-Pacific Web Conference, Springer, 2016, pp. 330–342.
[45] T.-C. Fu, F.-L. Chung, R. Luk, C.-M. Ng, Representing financial time series based on data point importance, Eng. Appl. Artif. Intell. 21 (2) (2008) 277–300.
[46] C. Ji, S. Liu, C. Yang, L. Wu, L. Pan, X. Meng, A piecewise linear representation method based on importance data points for time series data, in: Proceedings of the 2016 IEEE 20th International Conference on Computer Supported Cooperative Work in Design (CSCWD), IEEE, 2016, pp. 111–116.
Cun Ji obtained his BS and PhD degrees from Shandong University, China. His current research interests include time series data analysis, enterprise services computing and services system for manufacturing.
Chao Zhao is a master's student at the School of Computer Science and Technology at Shandong University, China. He is a member of the HCI&VR Group of Shandong University. His main research interests include time series data analysis, cloud computing, etc.
Li Pan obtained her BS, MS and PhD degrees from Shandong University, China. She is a Lecturer of Shandong University. Her current research interests include cloud computing, cloud manufacturing, and market-oriented resource allocation.
Shijun Liu obtained his BS degree in oceanography from Ocean University of China, and MS and PhD degrees in computer science from Shandong University, China. He is a Professor of Shandong University. His current research interests include services computing, enterprise services computing and manufacturing data analytic.
Chenglei Yang obtained his BS, MS and PhD degrees from Shandong University, China. He is a Professor of Shandong University. His current research interests include human computer interaction and virtual reality.
Xiangxu Meng obtained his BS and MS degrees from Shandong University, China, and his PhD degree from the Chinese Academy of Sciences. He is a Professor of Shandong University. His current research interests include human computer interaction, virtual reality and data visualization.