Combining Contextual Neural Networks for Time Series Classification
Amadu Fullah Kamara^{a,b}, Enhong Chen^{a,*}, Qi Liu^{a}, Zhen Pan^{a}

a School of Computer Science and Technology, University of Science and Technology of China, China
b Department of Mathematics & Statistics, Fourah Bay College, University of Sierra Leone, Sierra Leone
Abstract

For the past ten years, linear models have been applied to time series classification in various domains. Before applying these algorithms, many studies first extracted hand-crafted features presumed to capture local patterns in the data. More recently, deep learning has made it possible to feed data directly into a model without extensive hand-crafted feature engineering. In this paper, the proposed framework performs feature extraction in a self-supervised manner using both a Contextual Long Short-Term Memory (CLSTM) network and a Contextual Convolutional Neural Network (CCNN). The features obtained from the CLSTM and CCNN blocks are concatenated, fed into an attention block, passed through a multilayer perceptron (MLP) block, and finally passed through a terminal softmax layer for classification. The task is non-trivial because a major challenge arises when applying such a model to the time series classification (TSC) problem: overfitting. We address this challenge in two ways: first, we adjust the number of neurons in each stage; second, we introduce dropout after every layer in each stage of the model. Experiments on the University of California, Riverside (UCR) datasets indicate the model's superiority.

Keywords: Time series classification, contextual convolutional neural networks, contextual long short-term memory, attention, multilayer perceptron
1. Introduction

Recently, the time series classification (TSC) problem has attracted many researchers from various fields, including data mining [1], economics [2], statistics [3], seismology [4], meteorology [5], finance [6], industry [7], and health care [8]. Research has identified TSC as one of the leading and most inspiring problems of the past ten years [9]. The need to extract useful information from time series data in order to address the TSC problem clearly grows every day. The TSC problem is simply the task of predicting class labels for time series.

* Corresponding author: [email protected]. The code of the model is available at https://github.com/AmaduFullah/CNTC MODEL
However, most existing TSC approaches fall into two groups [10]: distance-based and feature-based methods. Among distance-based approaches, the k-nearest-neighbor classifier is remarkably prominent in the TSC domain and has been very successful [11]. The algorithmic rationale behind distance-based methods is to assess the similarity of a pair of sequences. Depending on the similarity measure, approaches such as k-nearest-neighbors (kNN) [12] or support vector machines (SVM) [13] can be combined with similarity measures such as the Euclidean distance or Dynamic Time Warping (DTW) [14] to make a classification. The combination of the DTW similarity measure with a k-nearest-neighbor classifier has been a celebrated and coherent method over the previous ten years. In feature-based approaches [15], each time series is represented by a vector of features, and any feature-based classifier can then perform the classification. The methods in this category mostly differ in the features they extract. It is also clear that these approaches require a healthy repertoire of pre-processing techniques and a wealth of knowledge in hand-crafted feature engineering [16]. Ensemble-based algorithms also produce cutting-edge results in time series classification by combining different classifiers. For example, the Elastic Ensemble (PROP) [17] integrates eleven time series classifiers using a weighted ensemble scheme, while the Shapelet Ensemble (SE) [18] uses shapelet transforms coupled with an ensemble of classifiers. Another well-known example is the Collective Of Transformation-based Ensembles (COTE) [18], a combination of thirty-five classifiers that extract both frequency- and time-domain characteristics. Previous efforts have applied deep neural networks to the task of time series classification, particularly convolutional neural networks (CNN) [19]. In [20], a multi-channel convolutional neural network (MCNN) is presented for the multivariate TSC problem. In [21, 22], deep learning approaches are fully exploited for the univariate time series classification task. Most features remain absorbed deep within the neural network, particularly in the lower hidden layers. These layers may not be useful; as a result, interpretability becomes a problem, and sometimes they even reduce the network's ability to recognise objects. To improve on the above situation, [23] proposed a new architecture called multi-feature fusion deep networks (MFFDN), which is based on the Denoising Autoencoder. In this study we present a unique deep learning architecture, a combination of a CCNN [24] and a CLSTM [25] as feature extractors (CNTC), a neural network designed mainly to address the time series classification problem. The proposed model was evaluated on the UCR benchmark datasets [26]. The CNTC architecture is a neural network model proposed to address the TSC problem efficiently; neither feature engineering nor heavy preprocessing of the data is necessary. The model was tested using a subset of 44 UCR time series benchmarks and compared with existing TSC models. The evidence shows that CNTC beats most of the up-to-date TSC models by a wide margin. The paper is organised as follows. Section 2 provides a brief review of existing work on time series classification.
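To make the distance-based baselines discussed above concrete, the following is a minimal illustrative sketch (not taken from the paper) of the dynamic-programming DTW distance on which the 1-NN classifiers rely; the function name and the squared-difference local cost are assumptions for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming DTW distance between two 1-D series a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])
```

A 1-NN classifier then simply assigns a test series the label of the training series with the smallest dtw_distance.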
Figure 1: The network structure of the CNTC neural network.
Section 3 presents the architecture of our proposed model. In Section 4, the overall setup of the experiment is fully described, listing the evaluation metrics used to assess the model; it also presents the figures for the time series data, the hyper-parameters, and a variety of other measurable indicators. Section 5 concludes the paper and discusses future work.

2. Related Work

The time series classification problem has received serious consideration over the last decade, and a profusion of TSC approaches has been advanced. The long-established TSC approaches are of two kinds: the first uses kNN classifiers on top of distance measures defined between time series; the second uses feature-based classifiers that search for and extract deterministic features in both the frequency and time domains before applying known baseline methods [10]. In the last decade, several methods that group multiple classifiers together have also received consideration. A complete history of the traditional TSC approaches is beyond the scope of this work; instead, we perform an all-inclusive comparison of the baseline approaches and our proposed model in subsection 4.1. The works related to the proposed network are as follows:
Deep Neural Networks: Recently, there has been active research on deep neural networks [27, 28, 29, 30, 31, 32], which blend classification and hierarchical feature extraction together. These approaches reduce the time-consuming effort of writing complex feature-extraction code required by traditional linear methods. A broad comparison reveals that the use of convolutions [33] in such networks performs feature extraction more efficiently than previous linear approaches [34], thereby mitigating the labour-demanding feature-engineering effort and providing results that support fruitful deep learning. [22] examines three models, namely a multilayer perceptron (MLP), a residual network (ResNet), and a fully convolutional network (FCN). All three models use their first layers to extract features from the data and their last layers to provide the output. Unfortunately, the first layers perform well only because of their capacity to capture short-range dependencies; long-term dependencies are not properly captured. ResNet is a deep model formed by adding shortcut connections to each block, which increases model performance; this is not always possible because deep networks are difficult to train owing to the vanishing gradient problem. In addition, [35] proposed a deep convolutional neural network to classify ultrasonic signals from polymer specimens. Because this approach uses CNNs to learn a representation for each signal from wavelet coefficients, performance was not optimal, as CNNs do not capture long-term features well. In another work, a multi-task learning framework capable of learning two classifiers with a deep neural network [36] is introduced. Because the method is based on an MLP setting, long-range dependencies are not effectively extracted. Continuing the investigation of deep learning models for feature extraction, [37] presented a CNN-based tracking model that incorporates spatial-temporal saliency detection as a guide, though CNNs are poor feature extractors for long-term dependencies. Multi-Channel CNN: The multi-channel convolutional neural network was presented to address the multivariate time series classification problem [19]. Features are extracted by feeding each time series into a distinct CNN; the resulting features are then concatenated and fed into a new CNN framework. As the model is dominated by CNN modules, effective feature extraction is only achieved for short-term dependencies. In [38], a CNN is fed with transformed variables obtained through a new variable-selection method. Features for long-term dependencies are not effectively accounted for, as the work focuses on MLP- or CNN-based models. In [39], a self-supervised model that transforms the raw time series data into a new space before applying a nearest-neighbor classifier is proposed to address the TSC problem. This approach exploits short-range features by integrating a nonlinear convolutional component, but it does not exploit long-term features.
Multi-Scale Convolutional Neural Networks: Because time series features do not adapt to the right scales, and because their discriminative patterns are often deformed by random noise and high-frequency perturbations in realistic time series data, the multi-scale convolutional neural network (MCNN) [21] offers solutions to these issues. The MCNN framework was proposed to address the time series classification problem using the three stages below:
• The transformation stage: various transformations, such as identity mapping, time-domain down-sampling, and frequency-domain spectral transformations, are applied to the time series data.
• The local convolution stage: several convolutional layers are used as feature extractors.
• The full convolution stage: all extracted features are concatenated and passed through further convolutional layers, i.e. fully connected layers in addition to a terminal softmax layer, to produce the final output.
This approach is dominated by CNN-based modules, which do not perform well at long-term feature extraction.
LSTM Fully Convolutional Networks: At first glance, our proposed model architecture resembles LSTM fully convolutional networks [40]. That model uses a parallel configuration of CNN and LSTM blocks as feature extractors to address the TSC problem. The CNN extracts short-term dependencies, while the LSTM extracts long-term dependencies from the time series. The separately extracted feature sets are concatenated and passed through a softmax layer that performs the classification. However, the input data is loaded into the two blocks simultaneously even though they have opposite interests, which hinders effective feature extraction. This paper presents a modus operandi that addresses the time series classification problem in a self-supervised manner with efficient feature extraction.
3. Network Architecture

In this section, a comprehensive description of the framework used to classify time series data is presented. The CNTC framework comprises four sequential stages: feature extraction, concatenation, an attention mechanism, and a multilayer perceptron (MLP). The parameters of the entire network were initialised using the Glorot uniform initialisation approach [41].

3.1. Feature extraction stage

Feature extraction is performed by a Contextual Convolutional Neural Network (CCNN) [24] and a Contextual Long Short-Term Memory (CLSTM) [25] network. The time series data is fed into the CLSTM and CCNN networks simultaneously, and each network perceives it differently. In the CLSTM block, the input data is viewed as a multivariate time series with a single time stamp. In contrast, the CCNN block receives univariate data with numerous time stamps.

3.1.1. Contextual CNN:

CNNs have outperformed leading-edge models in image tasks such as semantic segmentation [42]. Their success stems from the fact that every output pixel acts as a classifier corresponding to a receptive field. It can therefore be observed that, for semantic segmentation, the use of CNN modules alone can prevail over more refined models. To solve the TSC problem through efficient feature extraction, we adopt a Contextual CNN (CCNN) [24] as one arm of the feature extraction [43]. An ordinary CNN is better at extracting short-range feature dependencies and performs poorly when extracting features for long-term dependencies; we therefore refer to the features for long-term dependencies as the contextual features. The input time series data is a matrix S ∈ R^{p×d}, formed by stacking p time series vectors, each of length d. Both recurrence, which extracts the context features, and convolution are integrated into a single layer to ensure efficient feature extraction by the CCNN arm of the model. This single layer, known as the contextual convolutional layer, is the most critical module in the network; its output is transferred to the succeeding layer after an adequate number of iterations, namely the step K. While the state of the contextual convolutional layer evolves continuously, the input time series matrix remains static. The contextual region responsible for updating the present unit, as a function of both the filter size n and the predefined iteration step K, is (n − 1)K + 1. Throughout the extraction process, the layer computes convolutions over the time series matrix so that the output can update itself. After initialisation, the output unit µ_{uv}(k) of the v-th feature map at index position u and time step k is updated from the input time series s and the matrix M in the following manner [24]:

µ_{uv}(k) = β( U_v^s · s_{uv}(k) + U_v^c · M_u^{k−1} + b_v^c + b_v^s ),    (1)

where s_{uv} is an input time series vector, M_u^{k−1} is the antecedent output matrix of the same layer, b_v^c and b_v^s are bias terms, and β is a non-linear activation function, taken in this study to be the ReLU. We can also form a contextual modulation by eliminating the first term of equation (1) as follows [24]:

µ_{uv}(k) = β( U_v^c · M_u^{k−1} + b_v^c + b_v^s ).    (2)
Equation (2) represents the recurrent contextual layer, which is the same as equation (1) for k = 1 when M is initialised as the input matrix. The recurrent CNN (i.e. a network with recurrent convolutional layers), after K iteration steps and in the absence of a 1-D max pooling layer (i.e. one with a filter of size l × 1), is equivalent to a K-layer CNN. However, after computing the convolutions, local response normalisation or Batchnormalisation is applied to all layers to prevent the states from exploding. Local response normalisation is represented by the following equation [24]:

Φ(µ_{uv}(k)) = µ_{uv}(k) / ( E + θ \sum_{u'=\max(0, u−(N−1)/2)}^{\min(p, u+(N−1)/2)} µ_{u'v}(k)^2 )^{λ},    (3)

where N represents the number of adjacent filter maps, while the constants E, θ, and λ are the hyper-parameters responsible for controlling the normalisation amplitude. In this work, we use Batchnormalisation in order to prevent the states from exploding. After numerous iterations, the final output of the contextual convolutional layer is fed into a standard CNN layer, whose output is also passed to a Batchnormalisation layer. Lastly, we introduce a dropout layer as regularisation to prevent overfitting.
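As a rough illustration of how the contextual convolutional layer of equations (1)-(3) can be realised, the NumPy sketch below iterates K update steps in which the static input is convolved with one filter bank and the evolving output (the context) with another. The function name, shapes, and random initialisation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def contextual_conv_layer(S, K=3, n_filters=8, kernel=3, seed=0):
    """Sketch of the contextual convolutional layer update in Eq. (1).

    S : (p, d) input time series matrix (p series of length d). The layer
    keeps an evolving output M while S stays static; at each of the K steps
    the new output blends a convolution over the static input with a
    convolution over the previous output (the 'context').
    """
    rng = np.random.default_rng(seed)
    p, d = S.shape
    # U_s acts on the input, U_c on the context; b_s, b_c are the two biases.
    U_s = rng.normal(scale=0.1, size=(n_filters, kernel))
    U_c = rng.normal(scale=0.1, size=(n_filters, kernel))
    b_s = np.zeros(n_filters)
    b_c = np.zeros(n_filters)

    def conv1d(x, w):
        # 'same'-padded 1-D correlation of a length-d signal with the filter w
        pad = kernel // 2
        xp = np.pad(x, pad)
        return np.array([xp[i:i + kernel] @ w for i in range(d)])

    relu = lambda z: np.maximum(z, 0.0)

    # M holds one feature map per filter and per input series: (p, n_filters, d)
    M = np.zeros((p, n_filters, d))
    for _ in range(K):
        M_new = np.empty_like(M)
        for u in range(p):               # index over the p time series rows
            for v in range(n_filters):   # index over feature maps
                M_new[u, v] = relu(conv1d(S[u], U_s[v]) +
                                   conv1d(M[u, v], U_c[v]) +
                                   b_s[v] + b_c[v])
        M = M_new
    return M
```

In the full model, as described above, the result is Batchnormalised, fed into a standard CNN layer, and followed by dropout.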
3.1.2. Contextual LSTM:

Recurrent neural network (RNN) models have produced results for numerous sequence learning tasks, including but not limited to handwriting recognition and sentiment analysis [44]. RNNs are well known for their ability to recall preceding states and learn interdependencies. Backpropagation is the approach by which RNNs [45] are trained, and the vanishing gradient problem affects them. LSTM networks were introduced to solve the vanishing gradient issue by employing a gating technique that ignores irrelevant subsequent time stamps while weighting the relevant ones [44, 46]. LSTM models have produced excellent results in sequence-to-sequence learning tasks such as machine translation [47], as well as image captioning [48, 49] and natural language generation and reconstruction [50, 51]. The aim of this work is to extract features efficiently from our data to achieve optimum results in solving the TSC problem. For this, a Contextual LSTM (CLSTM) [25] is constructed as the other arm of the feature extraction process. We let M_i denote the mean of the i-th sliding window [52] of size j of our time series. Given a time series s_1, s_2, s_3, ..., s_{n−1}, s_n, we denote the i-th sliding window of size j as L_{i,j}. Setting the window size to five, the corresponding windows are L_{1,5} = {s_1, s_2, s_3, s_4, s_5}, L_{2,5} = {s_2, s_3, s_4, s_5, s_6}, ..., L_{n−1,5} = {s_{n−5}, s_{n−4}, s_{n−3}, s_{n−2}, s_{n−1}}, L_{n,5} = {s_{n−4}, s_{n−3}, s_{n−2}, s_{n−1}, s_n}. The means of the windows L_{1,5}, L_{2,5}, ..., L_{n−1,5}, L_{n,5} are therefore M_1, M_2, ..., M_{n−1}, M_n respectively. Let P_d be the vector of means corresponding to the size of a window. For instance, a time series with a chosen window of length 3 has vectors P_d as follows: P_1 = {M_1, M_2, M_3}, P_2 = {M_4, M_5, M_6}, ..., P_{n−1} = {M_{n−5}, M_{n−4}, M_{n−3}}, P_n = {M_{n−2}, M_{n−1}, M_n}. Hence, we have the vector P = {P_1, P_2, P_3, ..., P_{n−1}, P_n}. The CLSTM module treats the vector of window means P as the contextual features. In the equations below, the terms W^{ip}_k P and b^{ip}_k are the adjustments made to the primary LSTM equations. At the k-th time stamp of the extraction process, the LSTM block updates the hidden state h_k by using the antecedent latent state h_{k−1}, the context vector P, and the time series input s_k, as indicated in the following:

u^i_k = η( W^{is}_k s_k + W^{ih}_k h_{k−1} + b^{is}_k + b^{ih}_k + W^{ip}_k P + b^{ip}_k ),
u^f_k = η( W^{fs}_k s_k + W^{fh}_k h_{k−1} + b^{fs}_k + b^{fh}_k + W^{ip}_k P + b^{ip}_k ),
u^o_k = η( W^{os}_k s_k + W^{oh}_k h_{k−1} + b^{os}_k + b^{oh}_k + W^{ip}_k P + b^{ip}_k ),    (4)
u^c_k = u^f_k ⊙ h_{k−1} + u^i_k ⊙ tanh( W^{cs}_k s_k + W^{ch}_k h_{k−1} + b^{cs}_k + b^{ch}_k + W^{ip}_k P + b^{ip}_k ),
h_k = u^o_k ⊙ tanh(u^c_k),

where u^i_k, u^f_k, and u^o_k are the input, forget, and output gates respectively, h_{k−1} is the previous hidden state vector, s_k is the input time series vector, and h_k is the final output of this step. One CLSTM layer was applied with 8, 16, 32 or 64 cells depending on the dataset.
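The context construction and the gate updates of equation (4) can be sketched as follows. This is a simplified single-step illustration in NumPy under assumed shapes (one shared contextual projection for every gate, and a conventional cell-state update), not the authors' code.

```python
import numpy as np

def context_vector(series, window=5, group=3):
    """Build the context P from sliding-window means, as in Section 3.1.2."""
    s = np.asarray(series, dtype=float)
    n = len(s)
    # mean of each length-`window` window ending at position i
    means = np.array([s[max(0, i - window + 1):i + 1].mean() for i in range(n)])
    usable = (n // group) * group
    # group consecutive means into the vectors P_1, P_2, ...
    return means[:usable].reshape(-1, group)

def clstm_step(s_k, h_prev, c_prev, P, W, U, Wp, b):
    """One contextual LSTM update: a standard LSTM step whose gates all
    receive an extra projection of the (flattened) context vector P."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    ctx = Wp @ P.ravel()                    # contextual term W^{ip} P
    i = sigmoid(W["i"] @ s_k + U["i"] @ h_prev + b["i"] + ctx)
    f = sigmoid(W["f"] @ s_k + U["f"] @ h_prev + b["f"] + ctx)
    o = sigmoid(W["o"] @ s_k + U["o"] @ h_prev + b["o"] + ctx)
    c = f * c_prev + i * np.tanh(W["c"] @ s_k + U["c"] @ h_prev + b["c"] + ctx)
    h = o * np.tanh(c)
    return h, c
```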
3.2. Concatenation stage

This stage is responsible for merging the final outputs of the two feature extraction arms of our model. Given µ_k and h_k as the final outputs of the CCNN and CLSTM stages respectively, the final output c_k of the concatenation stage is given as
c_k = concatenation(µ_k, h_k).    (5)
Figure 2: (a) Scatter plot of testing errors of NTSC model against CNTC model on 44 UCR datasets; (b) Scatter plot of testing errors of DATS model against CNTC model on 44 UCR datasets.
3.3. Attention stage

Average and max pooling are non-linear down-sampling techniques. They are applied first in this stage to reduce the size of the feature maps and the number of parameters in the following layer, which prevents overfitting and improves computational efficiency. The stage is included because an LSTM cannot efficiently capture long-term dependencies across a long time series [53]. An attention mechanism [54] module was therefore developed to address this problem. This technique is frequently utilised in text translation with neural networks.
Table 1: Test errors and mean of rankings for forty-four UCR datasets.

Dataset              ResNet  FCN    ITRD   DTW_F  DATS   LPS    MCNN   NTSC   TSCD   CNTC
Adiac                0.406   0.243  0.241  0.312  0.23   0.383  0.255  0.153  0.184  0.142
Beef                 0.366   0.132  0.366  0.266  0.199  0.132  0.286  0.249  0.232  0.286
CBF                  0.003   0.001  0.002  0.001  0      0.01   0.009  0      0.006  0
ChlorineCon          0.353   0.315  0.204  0.346  0.341  0.313  0.337  0.158  0.173  0.022
CinCECGTorso         0.339   0.054  0.048  0.12   0.115  0.011  0.252  0.177  0.219  0.204
Coffee               0       0      0.036  0.036  0      0      0.004  0      0      0
CricketX             0.236   0.144  0.172  0.336  0.249  0.287  0.268  0.175  0.169  0.185
CricketY             0.255   0.166  0.153  0.327  0.207  0.325  0.258  0.207  0.194  0.2
CricketZ             0.245   0.127  0.141  0.312  0.245  0.276  0.262  0.186  0.186  0.211
DiatomSizeR          0.034   0.083  0.024  0.037  0.047  0.07   0.127  0.071  0.07   0.017
ECGFiveDays          0.232   0      0      0      0      0.055  0.183  0.015  0.045  0.057
FaceAll              0.202   0.115  0.245  0.251  0.22   0.257  0.244  0.081  0.176  0.192
FaceFour             0.17    0.091  0      0.034  0      0.034  0.051  0.068  0.068  0.088
FacesUCR             0.085   0.047  0.053  0.093  0.032  0.069  0.08   0.042  0.032  0.064
50words              0.3     0.181  0.18   0.357  0.291  0.278  0.199  0.311  0.263  0.12
fish                 0.187   0.039  0.061  0.027  0.021  0.067  0.09   0.039  0.021  0.011
GunPoint             0.093   0.007  0      0      0      0.06   0.011  0      0.007  0
Haptics              0.624   0.489  0.531  0.585  0.537  0.608  0.489  0.45   0.496  0.415
InlineSkate          0.617   0.552  0.619  0.574  0.512  0.654  0.604  0.59   0.434  0.451
ItalyPower           0.049   0.035  0.029  0.085  0.052  0.052  0.095  0.029  0.039  0.037
Lightning2           0.13    0.163  0.163  0.261  0.147  0.097  0.256  0.196  0.245  0.24
Lightning7           0.284   0.257  0.229  0.298  0.352  0.284  0.272  0.147  0.174  0.143
MALLAT               0.056   0.026  0.047  0.054  0.003  0.082  0.027  0.01   0.011  0.005
MedicalImages        0.263   0.258  0.26   0.474  0.288  0.305  0.269  0.208  0.228  0.215
MoteStrain           0.165   0.085  0.079  0.115  0.073  0.051  0.135  0.055  0.105  0.06
NonInvThorax1        0.21    0.093  0.064  0.169  0.161  0.174  0.138  0.039  0.052  0.043
NonInvThorax2        0.135   0.073  0.06   0.118  0.101  0.118  0.13   0.045  0.049  0.049
OliveOil             0.157   0.09   0.123  0.123  0.09   0.123  0.08   0.157  0.123  0.037
OSULeaf              0.409   0.145  0.271  0.074  0.012  0.273  0.329  0.012  0.021  0.018
SonyAIBORobot        0.275   0.146  0.23   0.265  0.321  0.238  0.175  0.032  0.015  0.042
SonyAIBORobotII      0.159   0.066  0.06   0.178  0.088  0.056  0.186  0.028  0.028  0.03
StarLightCurves      0.103   0.041  0.033  0.106  0.031  0.103  0.032  0.043  0.039  0.048
SwedishLeaf          0.198   0.036  0.056  0.131  0.062  0.11   0.065  0.024  0.032  0.039
Symbols              0.05    0.046  0.049  0.029  0.032  0.083  0.034  0.038  0.128  0.102
SyntheticControl     0       0.007  0      0.003  0.04   0.03   0.033  0.008  0.01   0
Trace                0       0.01   0      0      0      0.05   0.02   0      0      0
TwoLeadECG           0       0.015  0.001  0.015  0.004  0.029  0.001  0      0      0
TwoPatterns          0.096   0      0.002  0.001  0.016  0.048  0.046  0.103  0      0
UWaveX               0.271   0.195  0.179  0.269  0.24   0.247  0.163  0.245  0.212  0.181
UWaveY               0.365   0.266  0.267  0.363  0.312  0.321  0.248  0.274  0.331  0.29
UWaveZ               0.342   0.265  0.232  0.336  0.312  0.346  0.217  0.271  0.245  0.278
wafer                0.02    0.001  0.002  0.001  0.001  0.002  0.004  0.003  0.003  0
WordSynonyms         0.361   0.276  0.286  0.449  0.355  0.367  0.312  0.43   0.378  0.224
yoga                 0.062   0.154  0.103  0.102  0.159  0.071  0.149  0.139  0.145  0.132
Score                3       7      6      4      10     4      3      13     8      18
Percentage Score (%) 7       16     14     9      23     9      7      30     18     41
Mean of Rankings     7.727   4.712  4.977  7.114  4.023  7.182  7.530  3.932  4.500  3.136
Figure 3: (a) Plot of testing errors of NTSC and CNTC models on 44 UCR datasets; (b) Plot of testing errors of DATS and CNTC models on 44 UCR datasets.
It uses a context vector R that is conditioned on the target series y. As in [53], our context vector r_i relies on a set of annotations (g_1, ..., g_N) to which the input series is mapped by an encoder. Each annotation g_i holds information about the entire input series, with a strong concentration on the parts close to the i-th term of the input series. The context vector r_i is then calculated as a weighted sum of the annotations [55]:

r_i = \sum_{j=1}^{N} η_{ij} g_j.    (6)

Each annotation g_j is given a weight η_{ij} that is calculated as follows [55]:

η_{ij} = exp(γ_{ij}) / \sum_{k=1}^{N} exp(γ_{ik}),    (7)

where

γ_{ij} = p(s_{i−1}, g_j).    (8)

Equation (8) represents an alignment model that scores how well the j-th input and the i-th output match. The score is based on both the latent state s_{i−1} of the RNN and the annotation g_j at position j of the input time series. As in [53], the alignment model γ_{ij} is parameterised as a feedforward network and trained together with all the other constituents of the framework. The alignment model directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated.
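A compact sketch of the attention computation in equations (6)-(8) is given below, with the alignment model written as a small additive feedforward scorer; the parameter names W_a, U_a, v_a and the additive form are assumptions for illustration.

```python
import numpy as np

def attention_context(annotations, s_prev, W_a, U_a, v_a):
    """Sketch of Eqs. (6)-(8): additive attention over encoder annotations.

    `annotations` g_1..g_N have shape (N, d); `s_prev` is the previous hidden
    state. Alignment scores gamma_ij come from a small feedforward net, are
    softmax-normalised into weights eta_ij, and average the annotations into
    the context vector r_i.
    """
    scores = v_a @ np.tanh(W_a @ s_prev[:, None] + U_a @ annotations.T)  # (N,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # eta_ij, Eq. (7)
    return weights @ annotations                   # r_i, Eq. (6)
```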
For our problem settings, the attention mechanism uses an attention width of 8 or 10, with a dropout rate of 0.5 applied to combat overfitting.

3.4. Multilayer Perceptron (MLP) stage

In this final component of our model's architecture, we develop an MLP module by stacking two layers of 64 neurons each. Dropout is used as a regulariser to prevent our model from overfitting: dropout rates of 0.5 and 0.8 are applied to the first and second layers respectively, after every Rectified Linear Unit (ReLU) activation of an MLP layer. Finally, the softmax layer, which is the last layer, provides the final label.
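For orientation, the following is a rough Keras sketch of how the four stages can be wired together. The contextual recurrence inside the CCNN/CLSTM arms and the attention block are approximated here by standard Conv1D/LSTM layers and pooling, so this is a simplified stand-in rather than the authors' exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cntc_sketch(n_timesteps, n_classes):
    """Rough sketch of the CNTC stages: feature extraction, concatenation,
    (pooling in place of attention), MLP, and softmax output."""
    inp = layers.Input(shape=(n_timesteps, 1))

    # "CCNN" arm: convolutional feature extractor (short-range dependencies)
    c = layers.Conv1D(32, 3, padding="same", activation="relu")(inp)
    c = layers.BatchNormalization()(c)
    c = layers.Dropout(0.5)(c)
    c = layers.GlobalAveragePooling1D()(c)

    # "CLSTM" arm: recurrent feature extractor (long-range dependencies)
    r = layers.LSTM(32, dropout=0.8)(inp)

    # Concatenation stage (Eq. 5), then the MLP stage with dropout, then softmax
    x = layers.concatenate([c, r])
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.8)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```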
4. Experiment

4.1. Experimental Setup

In this work, our focus is to compare the CNTC model with nine baseline models, namely: Time Series Classification using Deep Learning for process planning (TSCD) [7], Learned Pattern Similarity (LPS) [56], DTW features (DTW_F) [57], deep neural networks for TSC from scratch (FCN and ResNet) [22], TSC using Multi-scale Convolutional Neural Networks (MCNN) [21], a novel non-parametric method for time series classification based on k-Nearest Neighbors and Dynamic Time Warping Barycenter Averaging (NTSC) [58], Data Augmentation using synthetic data for time series classification with deep residual networks (DATS) [59], and Improved Time Series Classification with Representation Diversity and SVM (ITRD) [60]. The proposed model leverages high computational capacity, exceeding 65 GPUs in one large cluster, and the results obtained are compared with the other, more recent models. The experimental setup adopts the standard UCR train/test splits, with results rounded to three decimal places. The CNTC network is trained with the Adam optimizer [61] using a learning rate of 0.5, β1 = 0.9, β2 = 0.999 and ε = 1e-7. The loss function for CNTC is the sparse categorical cross-entropy, defined as [62]:
max_{W,b} \sum_{i=1}^{N} log o^{(i)}_{y_i},    (9)
where o^{(i)}_{y_i} is the y_i-th output of the i-th instance produced by the CNTC network, i.e. the probability assigned to the true label. The implementation was written in the Python programming language using the Jupyter notebook [63] environment. Neural networks are universal approximators that can easily overfit because of their large number of parameters, and given the small size of our datasets, considerable overfitting was expected. The learning rate was repeatedly reduced by a factor of 0.7 whenever 50 epochs passed with no improvement in the validation score, until the final learning rate was reached. We select the model that attains the lowest training error and report its performance on the test data. The batch size takes values of 16, 32, or 64 depending on the dataset. In the CLSTM module, the LSTM block uses 8, 16, 32, or 64 cells depending on the dataset, and a dropout rate of 80% is used to prevent overfitting. For the CCNN module, the RNN and CNN that make up the contextual convolutional layer take 8, 16, 32, or 64 neurons each depending on the dataset, while the standard CNN layer takes 8, 16, or 32 neurons depending on the dataset. In order to prevent the states from exploding, we apply Batchnormalisation after the RNN and CNN layers of the contextual convolutional layer.
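The optimiser settings and learning-rate schedule described above map naturally onto standard Keras facilities. The sketch below is an assumed configuration following the quoted values, not the authors' training script; `cntc_model`, `x_train`, and `y_train` stand for any compiled model (such as the sketch at the end of Section 3) and a UCR training split.

```python
import tensorflow as tf

# Adam with the quoted hyper-parameters
optimizer = tf.keras.optimizers.Adam(learning_rate=0.5, beta_1=0.9,
                                     beta_2=0.999, epsilon=1e-7)

# Reduce the learning rate by a factor of 0.7 after 50 epochs without
# improvement in validation loss (the final learning rate 1e-4 is assumed).
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                 factor=0.7, patience=50,
                                                 min_lr=1e-4)

# Keep the weights of the model that attains the lowest training error.
checkpoint = tf.keras.callbacks.ModelCheckpoint("best_cntc.h5",
                                                monitor="loss",
                                                save_best_only=True)

# Example usage (batch_size is chosen per dataset: 16, 32 or 64):
# cntc_model.compile(optimizer=optimizer,
#                    loss="sparse_categorical_crossentropy",
#                    metrics=["accuracy"])
# cntc_model.fit(x_train, y_train, epochs=150, batch_size=32,
#                validation_split=0.2, callbacks=[reduce_lr, checkpoint])
```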
The attention block utilises an attention width of 8 or 10 depending on the dataset, with a dropout rate of 50% to combat overfitting. Finally, the MLP block is kept constant throughout the experiments. We also examined and compared the time complexities [64] of CNTC and three benchmark models, together with the number of epochs each requires. Lastly, we investigated the proposed model's performance when the contextual convolution and contextual LSTM are replaced by an ordinary convolution and a vanilla LSTM, using the same settings as above.

4.2. Dataset and Evaluation Metrics

The UCR data repository presently contains over 128 datasets. Because of the slow training of our proposed model, a subset of 44 datasets was used for the experiments [65]. The UCR archive is publicly available and comprises datasets drawn from realistic applications with different properties. Since we use more than 30 datasets in our experiments, rank-based statistics were necessary to further investigate the performance of our proposed model, as advised by the central limit theorem [65]. The rank-based statistics used are the Wilcoxon rank-sum test [22] and the Kruskal-Wallis test [66]. CNTC is appraised with the following metrics: score, percentage score, mean of rankings [22], the Wilcoxon rank-sum test, and the Kruskal-Wallis test. The Wilcoxon rank-sum test is a non-parametric statistical test of the null hypothesis that a testing error value randomly selected from one model is equally likely to be less than or greater than a testing error value randomly selected from another model on the same datasets. The Kruskal-Wallis test is also non-parametric and tests whether the models originate from the same distribution; we use it to compare two or more independent models of equal or unequal sample sizes. The mean of rankings is the arithmetic mean of the ranks of the testing error values achieved by each model. The Wilcoxon rank-sum test is used to compare the median ranks of the benchmark models with that of the proposed model, with the null and alternative hypotheses as follows [40]:

H0: Median_cntc model = Median_benchmark model
H1: Median_cntc model ≠ Median_benchmark model

The Kruskal-Wallis test is a one-tailed test used to analyse whether the models are drawn from the same population. The null and alternative hypotheses of the Kruskal-Wallis test are as follows:

H0: The probability distributions of the proposed model and a benchmark model are the same
H1: The probability distributions of the proposed model and a benchmark model are not the same

Since we observed ties in the ranked data, we used the Kruskal-Wallis test of the form [66]:

H_m = H / M,    (10)

where H_m is the Kruskal-Wallis statistic when data ties exist. We calculate H as follows [66]:

H = \frac{12}{N(N+1)} \sum_i \frac{R_i^2}{n_i} − 3(N + 1),    (11)

where H is the H-statistic, i.e. the Kruskal-Wallis statistic when there are no tied data, N is the total number of testing error values across all the models in the experiment, R_i^2 is the square of the sum of the rankings of model i, n_i is the number of testing error values of model i, and i is a natural number in the range 1 ≤ i ≤ 4 in our problem setting. The value of M is obtained from the equation [66]:

M = 1 − \frac{\sum_{i=1}^{p} (l_i^3 − l_i)}{N^3 − N},    (12)

where p denotes the number of groups of tied testing error values and l_i denotes the number of tied testing error values in the i-th group.
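Equations (10)-(12) can be computed directly as in the following sketch, which ranks the pooled testing errors, accumulates the per-model rank sums, and applies the tie-correction factor; the function name is illustrative and scipy is used only for average ranking.

```python
import numpy as np
from scipy import stats

def kruskal_wallis_with_ties(*groups):
    """Tie-corrected Kruskal-Wallis statistic H_m of Eqs. (10)-(12).
    Each group is one model's testing errors over the datasets."""
    all_vals = np.concatenate(groups)
    N = len(all_vals)
    ranks = stats.rankdata(all_vals)               # average ranks handle ties
    # Eq. (11): H from the rank sums R_i of each group of size n_i
    start, H = 0, 0.0
    for g in groups:
        R_i = ranks[start:start + len(g)].sum()
        H += R_i ** 2 / len(g)
        start += len(g)
    H = 12.0 / (N * (N + 1)) * H - 3 * (N + 1)
    # Eq. (12): tie-correction factor M from the tied-group sizes l_i
    _, counts = np.unique(all_vals, return_counts=True)
    M = 1.0 - np.sum(counts ** 3 - counts) / (N ** 3 - N)
    return H / M                                    # Eq. (10)
```

The library routines scipy.stats.kruskal and scipy.stats.ranksums provide, respectively, the same tie-corrected Kruskal-Wallis statistic and the Wilcoxon rank-sum test used for Tables 2 and 3.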
Figure 4: Comparison of CNTC against three Baseline Models regarding test errors.
4.3. Results And Analysis

A summary of the experimental results is provided in Table 1, where bold cells in the published table denote the best-performing model on a particular dataset. Our proposed model outperforms the benchmark models on at least 18 datasets, while the NTSC model performs best on 13 datasets and is rated second best after the CNTC model. The DATS model ranks third, performing best on 10 datasets. Both the percentage score and the mean of rankings in Table 1 indicate the proposed model's advantage over the existing benchmark models. Table 2 compares our network with three benchmark models using the p-values of the Wilcoxon rank-sum test at the 5% significance level. In any column, the number of significant p-values (those below 0.05, marked in bold in the published table) is a measure of the performance of that model: the more such p-values, the better the performance. The table indicates the superiority of the proposed model, which has four significant p-values and thus outperforms the three models NTSC, DATS, and TSCD, while NTSC (with three significant p-values) outperforms the remaining models. It also indicates that DATS (with one significant p-value) and TSCD (with none) are third and fourth in performance, respectively. Table 3 offers further insight into our proposed model. The Kruskal-Wallis test yields an H-value far greater than the critical value at the 5% significance level, implying that we reject H0 in favour of H1, i.e. the models are drawn from different populations. Furthermore, the sums of ranks of the four models are ordered as T_CNTC < T_NTSC < T_DATS < T_TSCD. Since a smaller sum of ranks indicates better model performance, our proposed model (CNTC) outperforms the three baseline models because its sum of ranks (T_CNTC) is the lowest, and the NTSC, DATS, and TSCD models are ranked from best to worst, respectively. Table 4 compares our proposed model and the three baseline models using the average training time per epoch and the number of epochs. On average, the training time of one CNTC epoch is lower than that of the other baseline models except NTSC, and with regard to training cycles, CNTC requires approximately 150 epochs, while the NTSC, DATS, and TSCD models reach their optimum results after approximately 164, 172, and 189 epochs, respectively. Table 5 reports the performance of the proposed model with both the CNN + LSTM and CCNN + CLSTM configurations on the 44 UCR datasets. The CNN + LSTM combination performs better on eleven datasets, with a 25% percentage score and a mean of rankings of 4.001, while the CCNN + CLSTM configuration performs better on eighteen datasets, with a 41% percentage score and a mean of rankings of 3.136.
In subfigures 2(a) and 2(b), the regions above the lines of best fit are areas where the NTSC and DATS models, respectively, perform better, while the area below the line of best fit is the region where the CNTC model performs better. Since the best possible score (zero testing error) lies below the line of best fit in both plots, the plots demonstrate that our proposed model outperforms both the NTSC and DATS models when applied to the 44 UCR datasets. In subfigures 3(a) and 3(b) we observe that our proposed model outperforms both the NTSC and DATS models on almost all of the 44 UCR datasets, as manifested by the CNTC curve lying below the NTSC and DATS curves in the two subfigures. Figure 4 also provides evidence that our proposed model performs better than the other three benchmark models, since it has the lowest median testing error over the 44 UCR datasets. From that figure, the NTSC model is second best in performance after the CNTC model, while DATS and TSCD are third and fourth best, respectively.
Table 2: Comparison of our proposed model with three baseline models using the p-values of the Wilcoxon rank-sum test.

           CNTC     NTSC     DATS     TSCD
ResNET     0.0015   0.0046   0.0296   0.3341
FCN        0.2880   0.4800   0.6709   0.1047
ITRD       0.2602   0.4141   0.6678   0.8712
DTW_F      0.0217   0.0545   0.1180   0.5212
DATS       0.2391   0.3979   —        0.4891
LPS        0.0079   0.0249   0.1023   0.9221
MCNN       0.0102   0.0351   0.1371   0.7725
NTSC       0.4436   —        0.7957   0.1101
TSCD       0.3883   0.5663   0.7635   —
CNTC       —        0.8027   0.9366   0.5980
Table 3: The H-value of the Kruskal-Wallis test and the sums of ranks of our proposed model and the three benchmark models.

T_CNTC   T_NTSC   T_DATS   T_TSCD   H-value
1581     1632     1689     1727     7.954
Table 4: Comparison of CNTC with three baseline models based on the average training time per epoch and the number of epochs.

Model   Training time (s)   Epochs
CNTC    10                  150
NTSC    8                   164
DATS    21                  172
TSCD    29                  189
5. Conclusions

In this paper, we present CNTC, a novel model that achieves a notable improvement over the current benchmarks. The objective of this study is to investigate the performance of our proposed model on 44 UCR datasets. Our approach extracts features from the time series in a self-supervised manner using both a CCNN and a CLSTM, and finally uses a softmax layer to make predictions. It is a well-tailored deep learning approach to time series classification. Our proposed network is trained end-to-end without any feature crafting or preprocessing of the raw data, yet it still achieves improved performance. The marked improvement over the other nine networks indicates that CNTC can advantageously complement the achievements of those models in classifying time series data. An extensive breakdown of the performance of our proposed model is provided and compared with other approaches. At present, the datasets available for time series classification are not very large, so we envision that CNTC will perform even better when applied to considerably larger datasets. For future work, we shall investigate our proposed model for time series classification by applying it to different data sources, such as the text messages received per day by the customer care service of a mobile company, the images uploaded to a particular website or social media group, and the voice messages received by the customer service of a bank.
Acknowledgments We thank the UCR time series archive for making the datasets available. This research was partially supported by grants from the National Key Research and Development Program of China (No. 2016YFB1000904), and the National Natural Science Foundation of China (Grants No. U1605251, 61727809).
Table 5: Ablation performance of the CNTC model with CNN + LSTM and CCNN + CLSTM block configurations on 44 UCR datasets (summary).

Configuration   Score   Percentage Score (%)   Mean of Rankings
CCNN + CLSTM    18      41                     3.136
CNN + LSTM      11      25                     4.001

References

[1] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, P.-A. Muller, Deep learning for time series classification: a review, Data Mining and Knowledge Discovery 33 (4) (2019) 917–963.
[2] G. M. Goerg, A nonparametric frequency domain em algorithm for time series classification with applications to spike sorting and macro-economics, Statistical Analysis and Data Mining: The ASA Data Science Journal 4 (6) (2011) 590–603. [3] A. Nanopoulos, R. Alcock, Y. Manolopoulos, Feature-based classification of time-series data, International Journal of Computer Research 10 (3) (2001) 49–61. [4] W. Li, N. Narvekar, N. Nakshatra, N. Raut, B. Sirkeci, J. Gao, Seismic data classification using machine learning, in: 2018 IEEE Fourth International Conference on Big Data Computing Service and Applications (BigDataService), IEEE, 2018, pp. 56–63. [5] G. He, L. Chen, C. Zeng, Q. Zheng, G. Zhou, Probabilistic skyline queries on uncertain time series, Neurocomputing 191 (2016) 224–237. [6] C. A. Assis, E. J. Machado, A. C. Pereira, E. G. Carrano, Hybrid deep learning approach for financial time series classification, Revista Brasileira de Computa¸ca ˜o Aplicada 10 (2) (2018) 54–63. [7] N. Mehdiyev, J. Lahann, A. Emrich, D. Enke, P. Fettke, P. Loos, Time series classification using deep learning for process planning: A case from the process industry, Procedia Computer Science 114 (2017) 242–249. [8] S. A. Lashari, R. Ibrahim, N. Senan, N. Taujuddin, Application of data mining techniques for medical data classification: A review, in: MATEC Web of Conferences, Vol. 150, EDP Sciences, 2018, p. 06003. [9] Q. Yang, X. Wu, 10 challenging problems in data mining research, International Journal of Information Technology & Decision Making 5 (04) (2006) 597–604. [10] Z. Xing, J. Pei, E. Keogh, A brief survey on sequence classification, ACM Sigkdd Explorations Newsletter 12 (1) (2010) 40–48. [11] G. E. Batista, X. Wang, E. J. Keogh, A complexity-invariant distance measure for time series, in: Proceedings of the 2011 SIAM international conference on data mining, SIAM, 2011, pp. 699–710. [12] T. M. Tran, X.-M. T. Le, V. T. Vinh, H. T. Nguyen, T. M. Nguyen, A weighted local mean-based k-nearest neighbors classifier for time series, in: Proceedings of the 9th International Conference on Machine Learning and Computing, ACM, 2017, pp. 157–161. [13] A. Kampouraki, G. Manis, C. Nikou, Heartbeat time series classification with support vector machines, IEEE Transactions on Information Technology in Biomedicine 13 (4) (2008) 512–518. [14] E. Keogh, C. A. Ratanamahatana, Exact indexing of dynamic time warping, Knowledge and information systems 7 (3) (2005) 358–386. [15] A. Onan, S. Koruko˘ glu, A feature selection model based on genetic rank aggregation for text sentiment classification, Journal of Information Science 43 (1) (2017) 25–38. [16] A. Onan, An ensemble scheme based on language function analysis and feature engineering for text genre classification, Journal of Information Science 44 (1) (2018) 28–47. [17] J. Lines, A. Bagnall, Time series classification with ensembles of elastic distance measures, Data Mining and Knowledge Discovery 29 (3) (2015) 565–592. [18] A. Bagnall, J. Lines, J. Hills, A. Bostrom, Time-series classification with cote: the collective of transformation-based ensembles, IEEE Transactions on Knowledge and Data Engineering 27 (9) (2015) 2522–2535. [19] Y. Zheng, Q. Liu, E. Chen, Y. Ge, J. L. Zhao, Time series classification using multi-channels deep convolutional neural networks, in: International Conference on Web-Age Information Management, Springer, 2014, pp. 298–310. [20] Y. Zheng, Q. Liu, E. Chen, Y. Ge, J. L. 
Zhao, Exploiting multi-channels deep convolutional neural networks for multivariate time series classification, Frontiers of Computer Science 10 (1) (2016) 96–112. [21] Z. Cui, W. Chen, Y. Chen, Multi-scale convolutional neural networks for time series classification, arXiv preprint arXiv:1603.06995. [22] Z. Wang, W. Yan, T. Oates, Time series classification from scratch with deep neural networks: A strong baseline, in:
Neural Networks (IJCNN), 2017 International Joint Conference on, IEEE, 2017, pp. 1578–1585. [23] G. Ma, X. Yang, B. Zhang, Z. Shi, Multi-feature fusion deep networks, Neurocomputing 218 (2016) 164–171. [24] J. Shin, Y. Kim, S. Yoon, K. Jung, Contextual-cnn: A novel architecture capturing unified meaning for sentence classification, in: Big Data and Smart Computing (BigComp), 2018 IEEE International Conference on, IEEE, 2018, pp. 491–494. [25] S. Ghosh, O. Vinyals, B. Strope, S. Roy, T. Dean, L. Heck, Contextual lstm (clstm) models for large scale nlp tasks, arXiv preprint arXiv:1602.06291. [26] Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, G. Batista, The ucr time series classification archive (2015). [27] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural computation 18 (7) (2006) 1527–1554. [28] Y. Bengio, et al., Learning deep architectures for ai, Foundations and Trends in Machine Learning 2 (1) (2009) 1–127.
[29] I. Arel, D. C. Rose, T. P. Karnowski, et al., Deep machine learning-a new frontier in artificial intelligence research, IEEE computational intelligence magazine 5 (4) (2010) 13–18. [30] P. Barros, G. I. Parisi, C. Weber, S. Wermter, Emotion-modulated attention improves expression recognition: A deep learning model, Neurocomputing 253 (2017) 104–114. [31] S. Rönnqvist, P. Sarlin, Bank distress in the news: Describing events through deep learning, Neurocomputing 264 (2017) 57–70. [32] S. Yu, Y. Wu, W. Li, Z. Song, W. Zeng, A model for fine-grained vehicle classification based on deep learning, Neurocomputing 257 (2017) 97–103. [33] L. Wei, R. Su, B. Wang, X. Li, Q. Zou, X. Gao, Integration of deep feature representations and handcrafted features to improve the prediction of n6-methyladenosine sites, Neurocomputing 324 (2019) 3–9. [34] H. P. Martinez, Y. Bengio, G. N. Yannakakis, Learning deep physiological models of affect, IEEE Computational Intelligence Magazine 8 (2) (2013) 20–33. [35] M. Meng, Y. J. Chua, E. Wouterson, C. P. K. Ong, Ultrasonic signal classification and imaging system for composite materials via deep convolutional neural networks, Neurocomputing 257 (2017) 128–135. [36] S. Sun, B. Zhang, L. Xie, Y. Zhang, An unsupervised deep domain adaptation approach for robust speech recognition, Neurocomputing 257 (2017) 79–87. [37] P. Zhang, T. Zhuo, W. Huang, K. Chen, M. Kankanhalli, Online object tracking based on cnn with spatial-temporal saliency guided sampling, Neurocomputing 257 (2017) 115–127. [38] M. Dalto, Deep neural networks for time series prediction with applications in ultra-short-term wind forecasting. [39] Y. Zheng, Q. Liu, E. Chen, J. L. Zhao, L. He, G. Lv, Convolutional nonlinear neighbourhood components analysis for time series classification, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2015, pp. 534–546. [40] F. Karim, S. Majumdar, H. Darabi, S. Chen, Lstm fully convolutional networks for time series classification, IEEE Access 6 (2018) 1662–1669. [41] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256. [42] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440. [43] J. Xu, X. Luo, G. Wang, H. Gilmore, A. Madabhushi, A deep convolutional neural network for segmenting and classifying epithelial and stromal regions in histopathological images, Neurocomputing 191 (2016) 214–223. [44] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, G. Shroff, Lstm-based encoder-decoder for multi-sensor anomaly detection, arXiv preprint arXiv:1607.00148.
[45] M. M. Baig, M. M. Awais, E.-S. M. El-Alfy, Adaboost-based artificial neural network learning, Neurocomputing 248 (2017) 120–126. [46] K. Kawakami, Supervised sequence labelling with recurrent neural networks, Ph.D. thesis, PhD thesis. Ph. D. thesis, Technical University of Munich (2008). [47] K. Cho, B. Van Merri¨ enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using rnn encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078. [48] S. Bengio, O. Vinyals, N. Jaitly, N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, in: Advances in Neural Information Processing Systems, 2015, pp. 1171–1179. [49] G. Lv, T. Xu, E. Chen, Q. Liu, Y. Zheng, Reading the videos: Temporal labeling for crowdsourced time-sync videos based on semantic embedding, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016. [50] J. Li, M.-T. Luong, D. Jurafsky, A hierarchical neural autoencoder for paragraphs and documents, arXiv preprint arXiv:1506.01057. [51] K. Zhang, E. Chen, Q. Liu, C. Liu, G. Lv, A context-enriched neural network method for recognizing lexical entailment, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017. [52] H. Hota, R. Handa, A. Shrivas, Time series data prediction using sliding window based rbf neural network, International Journal of Computational Intelligence Research 13 (5) (2017) 1145–1156. [53] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473. [54] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in neural information processing systems, 2017, pp. 5998–6008. [55] F. Karim, S. Majumdar, H. Darabi, S. Chen, Lstm fully convolutional networks for time series classification, IEEE Access 6 (2017) 1662–1669. [56] M. G. Baydogan, G. Runger, Time series representation and similarity based on local autopatterns, Data Mining and Knowledge Discovery 30 (2) (2016) 476–509. [57] R. J. Kate, Using dynamic time warping distances as features for improved time series classification, Data Mining and Knowledge Discovery 30 (2) (2016) 283–312. [58] T. M. Tran, X.-M. T. Le, H. T. Nguyen, V.-N. Huynh, A novel non-parametric method for time series classification based on k-nearest neighbors and dynamic time warping barycenter averaging, Engineering Applications of Artificial Intelligence 78 (2019) 173–185. [59] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, P.-A. Muller, Data augmentation using synthetic data for time series classification with deep residual networks, arXiv preprint arXiv:1808.02455. [60] R. Giusti, D. F. Silva, G. E. Batista, Improved time series classification with representation diversity and svm, in: Machine Learning and Applications (ICMLA), 2016 15th IEEE International Conference on, IEEE, 2016, pp. 1–6. [61] D. Kingma, J. Ba, Adam: a method for stochastic optimization (2014). arxiv preprint, arXiv preprint arXiv:1412.6980. [62] P. Meletis, G. Dubbelman, On boosting semantic street scene segmentation with weak supervision, arXiv preprint arXiv:1903.03462. [63] F. Chollet, et al., Keras (2015) (2017). [64] E. Tsironi, P. Barros, C. Weber, S. Wermter, An analysis of convolutional long short-term memory recurrent neural networks for gesture recognition, Neurocomputing 268 (2017) 76–86. [65] H. A. Dau, A. Bagnall, K. Kamgar, C.-C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. 
Ratanamahatana, E. Keogh, The ucr time series archive, arXiv preprint arXiv:1810.07758. [66] S. Guo, S. Zhong, A. Zhang, Privacy-preserving kruskal–wallis test, Computer methods and programs in biomedicine 112 (1) (2013) 135–145.
Amadu Fullah Kamara received his BSc degree in Applied Mathematics from Fourah Bay College, University of Sierra Leone, Sierra Leone, in 2005. He received his MSc degree in Computational Mathematics from the University of Science and Technology of China, Hefei, China, in 2013. He is currently a Ph.D. student in the School of Computer Science and Technology, University of Science and Technology of China. His research interests include data mining, deep learning, and big data analytics.
Enhong Chen (SM’07) is a professor and vice dean of the School of Computer Science at USTC. He received the Ph.D. degree from USTC. His general area of research includes data mining and machine learning, social network analysis and recommender systems. He has published more than 100 papers in refereed conferences and journals, including IEEE Trans. KDE, IEEE Trans. MC, KDD, ICDM, NIPS, and CIKM. He was on program committees of numerous conferences including KDD, ICDM, SDM. His research is supported by the National Science Foundation for Distinguished Young Scholars of China. He is a senior member of the IEEE.
Qi Liu is an associate professor at USTC. He received the Ph.D. degree in Computer Science from USTC. His general area of research is data mining and knowledge discovery. He has published prolifically in refereed journals and conference proceedings, e.g., TKDE, TOIS, TKDD, TIST, KDD, IJCAI, AAAI, ICDM, SDM and CIKM. He is a member of ACM and IEEE. Dr. Liu is the recipient of the ICDM 2011 Best Research Paper Award, the Best of SDM 2015 Award, and the KDD 2018 (Research Track) Best Student Paper Award.
Zhen Pan received the BE degree from the University of Science and Technology of China in 2014. He is currently working toward the PhD degree under the supervision of Prof. Enhong Chen in the School of Computer Science and Technology, University of Science and Technology of China. His research interests include social networks, machine learning, and computational advertising.