Trendlets: A Novel Probabilistic Representational Structures for Clustering the Time Series Data

Journal Pre-proof

Johnpaul C I, Munaga V.N.K. Prasad, S. Nickolas, G.R. Gangadharan

PII: S0957-4174(19)30836-X
DOI: https://doi.org/10.1016/j.eswa.2019.113119
Reference: ESWA 113119

To appear in: Expert Systems With Applications

Received date: 1 September 2019
Revised date: 8 November 2019
Accepted date: 2 December 2019

Please cite this article as: JohnpaulC I, Munaga V.N.K. Prasad, S. Nickolas, G.R. Gangadharan, Trendlets: A Novel Probabilistic Representational Structures for Clustering the Time Series Data, Expert Systems With Applications (2019), doi: https://doi.org/10.1016/j.eswa.2019.113119

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.

Highlights

• A time series representational method that presents the collective trend of the data.
• A user-defined segmentation method for time series data.
• A probabilistic approach to forming representational building blocks for time series.
• Unsupervised trend-based hierarchical clustering of time series data.

Trendlets: A Novel Probabilistic Representational Structures for Clustering the Time Series Data

Johnpaul C I (a,b), Munaga V. N. K. Prasad (b), S. Nickolas (a), G. R. Gangadharan (a,*)

(a) National Institute of Technology, Tiruchirappalli, India
(b) Institute for Development and Research in Banking Technology, Hyderabad, India

Abstract

Time series data is a sequence of values recorded systematically over a period of time, mostly used for prediction, clustering, and analysis. The two essential features of time series data are trend and seasonality. Preprocessing of the time series data is necessary for performing prediction tasks. In most cases, the trend and the seasonality are removed before applying regression algorithms, and the accuracy of such algorithms depends upon the functions used for their removal. Clustering unlabeled time series data in the presence of trend and seasonality is challenging. In this paper, we propose a probabilistic representational learning method for grouping time series data. We introduce five terminologies in our method of clustering, namely the trendlets, uplets, downlets, equalets and the trendlet string; these elements are the representational building blocks of our proposed method. Experiments on the proposed algorithm are performed with renewable energy data on the electricity supply system of continental Europe, which includes the demand and inflow of renewable energy for the period 2012 to 2014, and with the UCR-2018 time series archive containing 128 datasets. We compared our proposed representational method with various clustering algorithms using the silhouette score. Mini-batch k-means and agglomerative hierarchical clustering algorithms show better performance in terms of quality, logical accordance with the data and time taken for clustering.

* Corresponding author.
Email addresses: [email protected] (Johnpaul C I), [email protected] (Munaga V. N. K. Prasad), [email protected] (S. Nickolas), [email protected] (G. R. Gangadharan)

Preprint submitted to Elsevier, December 10, 2019

Keywords: Time series, Clustering, Pattern Matching, Trend, Seasonality

1. Introduction

Time series data is a sequential ordering of observations with respect to a time interval. Data mining on time series data helps to generate models, decisions, rules and forecasting modules. Machine learning algorithms play a major role in understanding time series data for use in various applications. Unsupervised learning of time series data identifies groups of unlabeled time series elements based on the nature of their values. The instances of an unlabeled time series dataset have no designated class labels (Längkvist et al., 2014). If class labels are not defined, then clustering methods are essential for identifying and labeling similar time series sequences. Class labels are of little use in forecasting or prediction applications, but they are essential for developing classifier models (Wang et al., 2019a). We focus on the trend of the time series values of every instance and propose a representational structure to identify similar groups among the instances based on trend.

Time series sequences are characterized by two major features, namely the trend and the seasonality (Lim et al., 2018). Trend is defined as the direction of growth of the time series values with respect to the time interval; it can be increasing, decreasing or equal. Seasonality is the time duration in which the time series repeats its growth of trend. Traditional time series processing methods involve systematic preprocessing steps to extract the relevant information from the data. Most methods require the removal of trend and seasonality to obtain a stationary time series before applying machine learning algorithms. In general, any preprocessing step risks a loss of information (Kuznetsov and Mohri, 2015), and the impact of this loss depends on the domain and usage of the time series data. Forecasting and prediction of current time series values from previous values are not affected by trend and seasonality removal (Ferreira and Zhao, 2016); such tasks produce better results on stationary time series data. Trend and seasonality are removed by transforming the time series into constant time period values with the help of various overlapping functions (Alsallakh et al., 2014). New representational models are essential to retain the trend information of time series data for clustering and other machine learning operations.
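As a rough illustration of the detrending step mentioned above, the following sketch (ours, not from the paper) removes trend and seasonality with statsmodels' seasonal_decompose; the hourly series and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical hourly series: a linear trend plus a daily cycle.
idx = pd.date_range("2012-01-01", periods=24 * 30, freq="H")
values = np.linspace(0.0, 10.0, len(idx)) + np.sin(2 * np.pi * idx.hour / 24)
series = pd.Series(values, index=idx)

# Decompose into trend, seasonal and residual parts; subtracting the first
# two leaves an approximately stationary residual for regression tasks.
# (The trend has NaNs at the edges from the centered moving average.)
parts = seasonal_decompose(series, model="additive", period=24)
stationary = series - parts.trend - parts.seasonal
```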

Unlabeled time series data is rarely used for generating a classifier model, for the following reasons. The quantity of time series sequences recorded at a specific time interval over a prescribed time period is huge, and identifying the trend of this voluminous data requires a reliable representation that various learning algorithms can use without an overhead in implementation and physical interpretation. Moreover, the properties of the respective domain create ambiguity in selecting a method for labeling a time series sequence. In most cases, if labels do not exist, an unsupervised clustering method must be performed over the time series to label it (Ali et al., 2018). The methods for clustering time series sequences are not the same as those for clustering traditional data points. The two noticeable aspects of an unsupervised clustering mechanism are determining the right number of clusters and selecting appropriate distance measures. The elbow method is one of the simplest and most common methods used with unsupervised clustering algorithms to find an optimal number of clusters in the data (clu, 2019); see the sketch below. In some cases, the experiments prescribe the number of intended clusters to be generated from the data. Other methods for determining the number of clusters are the average silhouette and the gap statistic (Tibshirani et al., 2001; Subbalakshmi et al., 2015). The selection of an appropriate distance measure forms an initial step in the clustering process. If the time series sequences are long, then distance-based clustering consumes more time. The use of various dimensionality reduction methods improves the performance of a clustering mechanism only to a limited extent. Clustering structures that highlight the trend of a time series sequence, rather than its raw numerical values, help in dimensionality reduction and fast clustering. Principal component analysis (PCA) and wavelets are among the dimensionality reduction methods used along with the clustering of time series data (Garcke et al., 2017).

The clustering of time series data helps to understand the existence of similar time series sequences. Further, the clustering model can be used in several applications aimed at decision making, recognizing significant data points, classification of new time series sequences, intervention analysis, pattern recognition over a time period, etc. (Bode et al., 2019). The clustering of time series data is categorized into model-based, shape-based and feature-based (web, 2018). In this paper we present trend-based novel clustering structures called trendlets, which are a variant of model-based clustering.
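The elbow method mentioned above can be sketched in a few lines; the feature matrix here is hypothetical, and this is only one of several ways to choose the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix: one row per time series representation.
X = np.random.rand(200, 8)

# Inertia (within-cluster sum of squares) versus k; the "elbow" where the
# curve flattens suggests a reasonable number of clusters.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(2, 12)
]
```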

The important aspect of the proposed clustering method is the representation of time series data as a set of triplets based on the trend. The generated triplets are subjected to various clustering methods for grouping similar time series sequences. This new representational approach emphasizes the trend of the time series, and the proposed trendlet-based clustering helps to determine the behaviour of a time series sequence. The salient contributions of this paper are as follows:

• A simple time series representation using trendlets that presents the collective trend of the time series data.
• A time series data segmentation method and a probabilistic approach to forming the representational building blocks, namely trendlets, uplets, downlets, equalets and the trendlet string.
• A detailed analysis of the hierarchical agglomerative clustering algorithm in an unsupervised environment to identify clusters based on the trend of the time series data.

The remainder of this paper is organized as follows. Section 2 presents the literature relevant to time series clustering. Section 3 describes the proposed representational structure, clustering methods and approaches. Section 4 presents a case study and a detailed analysis of various clustering methods with trendlets. Section 5 includes the experiments on the UCR-2018 time series archive. Section 6 presents the conclusion and future scope of the work.

2. Literature Review

Clustering is a machine learning technique in which data points are evaluated based on a distance measure between them and grouped with the nearest data points, with or without transformations. The nature of clustering differs in structure and implementation. An interval-based time series clustering method divides the time series sequences into blocks and groups the sequences based on segment similarity. The performance of a clustering mechanism can be improved by increasing the stages of the clustering process (Guijo-Rubio et al., 2018). All clustering mechanisms attempt to ascertain a similarity between the data points. A variety of similarity and distance measures are used to find the proximity of data sequences. Combinations of various distance measures are also used in

time series clustering. Apart from the distance between data points, the density of data sequences also gives rise to groups. The most effective and promising distance measure used for time series clustering is dynamic time warping (DTW). DTW barycenter averaging (DBA) is one of the widely used time series averaging methods (Ma and Angryk, 2017). Wang et al. (2019) describe an interval-value based clustering of time series data with a modified version of the DTW distance measure. This interval-based clustering method focuses on obtaining a trend value from every segment; the values representing a time series sequence are then subjected to clustering using DTW. Different variants of the DTW distance are discussed and compared with the new representational model of the time series data, and the new distance measures are tested with hierarchical clustering algorithms by varying the depth parameter.

Clustering of a set of time series sequences using modified density-based clustering is discussed by Putri et al. (2019). Their clustering method, named ChronoClust, is experimented with cytological response data. The work presents a significant contribution to the life sciences by tracking cell behaviour through lineage determination and historical proximity. The authors also present the formation of micro-clusters in a complex environment of attributes. Folgado et al. (2018) describe a new distance measure for time series data named time alignment measurement (TAM). Traditional time series clustering methods use DTW as an efficient distance measure. TAM generates a similarity measurement in the time domain, in contrast with DTW, which brings out the similarity in the amplitude of time series sequences. TAM exhibits better performance for datasets with lower amplitude deviation and higher temporal variance, and it can also be considered a new quality index for comparing various signal alignment procedures. Jiang et al. (2019) explain a new distance measure for time series clustering, namely the maximum shifting correlation distance (MSCD). Generally, the distance measures for time series sequences are divided into two categories, namely lock-step and elastic measures. DTW outperforms all other distance measures in terms of finding the similarity between time series sequences. The MSCD distance is calculated based on the correlation of values between two time series sequences; the maximum of the correlation coefficients with respect to a shifted series is the basis for identifying similar time series sequences.

Time series clustering for obtaining similarity among multivariate time series data is also discussed by Jiang et al. (2019), who describe a similarity learning method for multivariate time series data with missing values. The proposed clustering method generates a kernel function, namely the time series cluster kernel (TCK). The missing data is handled through the properties of a Gaussian mixture model (GMM). This clustering task also performs sensitivity analysis, hyperparameter tuning and visualization of the clustered points. TCK is capable of addressing three types of missing values, namely missing completely at random (MCAR), missing not at random (MNAR) and missing at random (MAR). Paparrizos and Gravano (2017) formulated a novel time series clustering algorithm for fast and accurate clustering. The two new clustering methods proposed by the authors are k-shape and k-multishape, which belong to the category of shape-based time series clustering. A new distance measure called shape-based distance (SBD) and a normalized cross-correlation measure are used to compare the shapes of two time series sequences to find the similarity between them. Two sets of centroids are obtained to identify the proximity of the clusters. The two methods show high accuracy and efficiency relative to other traditional methods. Mori et al. (2016) present a method to determine an appropriate distance measure from a set of distance measures used for time series clustering. The basic preliminaries and steps for clustering time series data are discussed by the authors, including the distance measures, the parameters of the clustering algorithms and the statistical measurements confined to time series values. A systematic evaluation of clustering methods with accuracy and F-statistic measurements is presented using the datasets of the UCR database. The misaligned time series clustering proposed by Salgado et al. (2017) helps to find the similarity of time series sequences containing concealed patterns and trends. They propose a fuzzy clustering algorithm, namely mixed fuzzy clustering (MFC), which uses DTW as the major distance measure. The algorithm is experimented on time series sequences from a medical database to identify similar behavioral patterns in patients, and MFC outperforms the traditional fuzzy c-means clustering algorithm on various evaluation metrics. Slavakis et al. (2018) describe time series clustering as one of the prominent tools for analyzing brain networks utilizing Riemannian geometry (X Wang, 2014). Each cluster is represented by a finite number of structures obtained from Riemannian multi-manifold modeling (RMMM).

In contrast to traditional clustering algorithms built around centroids, RMMM methods use structures called Riemannian sub-manifolds, which describe the spread of values in a hyper-plane. These structures are used for clustering network-based time series sequences. Time series sequences can also be generated from objects by mathematical transformations and used in classification tasks. Lexiang and Eamonn (2009) illustrate a time series primitive named the shapelet, which is a sub-sequence of a time series. Shapelet structures distinguish objects clearly without considering the whole data sequence. The DTW method is used to find the optimum distance between the elements for classification, and decision tree classifiers with shapelets improve the classification accuracy. Determining time series similarity by an empirical recurrence ratio is proposed by Bhaduri and Zhan (2018). The method focuses on finding the seasonal similarity between time series sequences rather than other outliers. The empirical recurrence rates ratio (ERRR), obtained from the Poisson distribution, helps in binary classification problems. The ERRR values provide the rate of dependence between two time series sequences based on the time period, and clustering with ERRR shows better performance on various datasets compared to the traditional distance measures.

Time series data sequences contain characteristics or features which can provide valuable information about the behavior of the sequences, and the majority of clustering algorithms compute these features to reveal that behavior (Wang et al., 2006). The usage of these characteristics is subject to the type of clustering and decision model. Sub-sequence time series clustering addresses the similarity of internal segments in a time series; these sub-sequences reveal sufficient information about the data to distinguish sequential patterns, periodic patterns, motifs, etc. The disadvantages of sub-sequence based clustering include high memory usage, the possibility of unexpected failures for some parameters, and high complexity (Zolhavarieh et al., 2014). Time series sequences are also grouped with semi-supervised clustering methods such as COBRAS, which maintains a set of predefined clusters that are refined iteratively; the elements of every cluster are known as super-instances, and different versions of COBRAS exist based on the distance measures used (Roelofsen, 2018). Clustering mechanisms are also affected by uneven values in time series sequences. Motlagh et al. (2019) describe various steps to obtain useful information from noisy time series sequences and transform it into a

structure ready for other clustering algorithms. A noise metric is also defined to evaluate the density of the noise present in the time series sequences. Time series data containing the electrical load usage of residential electricity customers is considered for experimenting with their clustering strategies. Both labeled and unlabeled time series data require the necessary preprocessing steps before applying the various clustering algorithms. Feature learning from the discriminating features of time series data involves segments and shapelets. Wang et al. (2019b) discuss a new semi-supervised shapelet learning (SSSL) method to acquire the necessary information from the shapelets and segments of a time series. Salles et al. (2019) illustrate different learning methods on stationary time series data. Non-stationary time series data cannot be used directly for prediction and regression tasks; the transformation from non-stationary to stationary form is essential for applying forecasting algorithms, and proper transformation methods enhance the forecasting performance for a typical non-stationary time series sequence. Time series classification (TSC) is a supervised model in which incoming time series sequences are classified into an appropriate group. Compressing time series data and using such data for creating a classification model is challenging. Wavelets are used to compress the data in a lossy manner; relevant compression methods improve the performance of classification in both accuracy and time. Moreover, efficient compression methods are extensively used to overcome the difficulties of storage and processing to a great extent (Daoyuan et al., 2016). Clustering methods are often used for finding patterns of electricity consumption, and shape-based methods provide appropriate results for grouping time series sequences. A shapelet contains a limited number of values (Ji et al., 2019). Wen et al. (2019) describe a shape-based method which uses a modified k-means clustering procedure to group the time series sequences. An extensive real-time experiment is performed with data obtained from the Commission for Energy Regulation (CER), which includes the energy consumption of Irish homes. Khosravi et al. (2018) performed a case study on machine learning algorithms which predict the wind speed in a wind farm. The wind speed is measured at specific time intervals and fed to different machine learning techniques, including neural networks, to predict the wind speed; support vector regression (SVR) and a multi-layer feed-forward neural network (MFFNN) are used for the prediction task. A modified dynamic time warping method is used for the time series clustering of remote sensing images; the modification to DTW is performed with the use of the

Canberra distance (CD). This distance measure calculation is more reliable even in the presence of obstacles such as cloud. Clustering of the images is done based on the maps, with the help of the modified distance measure in the k-means algorithm (Zhao et al., 2016).

Most of the existing methods discuss the clustering of time series data based on similarity measures that capture only the numerical similarity between the data points. The shapelet method described in the related works is a noticeable representational structure, with the following differences from our proposed method. Shapelets are used exclusively in classification experiments, a supervised learning domain with existing distance measures and classifiers, and the classification of objects using shapelets targets elements that have a structure rather than a functionality. We focus on unsupervised clustering of time series data based on the trend of the values as a feature of their functionality, which is less discussed in the literature. Moreover, preprocessing of time series data usually removes the trend and seasonality before further processing by machine learning algorithms, whereas trendlet clustering focuses on trend-based clustering of time series data. In our novel method of clustering, we present a detailed analysis of clustering time series data considering both the data similarity and the nature of the data values. Table 1 shows an outline of various methods for mining time series data.

Table 1: List of methods for time series feature learning

| Author | Time series learning method | Features & remarks |
|---|---|---|
| Guijo-Rubio et al. (2018) | Variable time series segmental clustering | Statistical segmentation clustering of time series; skewness- and variance-based time series feature extraction |
| Ma and Angryk (2017) | Distance density clustering method using medoids | Density-based; majority voting algorithms to determine cluster elements; faster-converging method for DTW barycenter averaging (DBA) |
| Wang et al. (2019) | Improved DTW time series clustering | Interval-based clustering; mathematical and geometrical modification of DTW; three-tuple representation of a time series sequence |
| Putri et al. (2019) | Density-based clustering (ChronoClust) | Clustering of discrete time series sequences; maintains the temporal evolution; historical proximity of cluster elements considered for grouping; temporal relationships among the data maintained |
| Folgado et al. (2018) | Time alignment measurement (TAM) | DTW-based time series alignment with amplitude similarity of sequences; temporal characterization of data related to human movement |
| Jiang et al. (2019) | Maximum shifting correlation distance (MSCD) | New distance measure MSCD, also known as the shrinking effect; drift in both amplitude and phase of time series sequences considered |
| Paparrizos and Gravano (2017) | Fast and accurate time series clustering; shape-based distance (SBD) | Novel methods to extract shapes from time series sequences, namely shape extraction (SE) and multi-shape extraction (MSE); extensive statistical analysis of time series clustering methods |
| Mori et al. (2016) | Similarity measure selection procedure for time series | Two-phase algorithm comprising a clustering and an evaluation step to select an appropriate distance measure for time series experiments; discusses various statistical features of time series sequences |
| Salgado et al. (2017) | Mixed fuzzy clustering (MFC) | Spatio-temporal clustering algorithm for time series sequences; processes misaligned time series data; DTW used to find concealed patterns from time series data |
| Slavakis et al. (2018) | Time series clustering using Riemannian geometry | Cluster representation using Riemannian multi-manifold modeling (RMMM) structures; best suited for analysing brain-network time series data |
| Lexiang and Eamonn (2009); Ji et al. (2019); Wen et al. (2019) | Time series generation from physical objects by geometrical structures called shapelets | Sub-sequence based time series primitive; distinguishes objects clearly via the shapelet transformation; modified k-means clustering using shapelets for clustering time series data |
| Bhaduri and Zhan (2018) | Time series similarity by empirical recurrence rate ratio (ERRR) | Binary classification of time series sequences; seasonal similarity of time series data addressed |
| Wang et al. (2006) | Structural characteristics of time series sequences | Various clustering algorithms experimented on extracted global time series features; search algorithm to select the best feature set |
| Roelofsen (2018) | Semi-supervised clustering method called COBRAS | Predefined clusters refined iteratively; pairwise medoid method used for distance computation; cluster elements known as super-instances |
| Motlagh et al. (2019) | Clustering of electricity load time series data | Dimensionality reduction methods used; processes noisy time series values; metric defined to measure the density of noise in time series sequences |
| Salles et al. (2019) | Non-stationary time series preprocessing method | Logarithmic transform (LT), Box-Cox transform (BCT), percentage change transform (PCT), moving average smoother (MAS), etc. used for transformation |
| Daoyuan et al. (2016) | Time series classification method | Discrete wavelet transform (DWT) and other wavelet functions such as Haar and Daubechies used for dimensionality reduction of time series sequences |
| Khosravi et al. (2018) | Prediction algorithms based on time series | Machine learning methods for predicting time series values; wind speed prediction from historical time series data using neural network methods |
| Zhao et al. (2016) | Canberra distance DTW (CD-DTW) clustering | Distance-based method for time series data clustering using remote sensing images; extracts features even in the presence of noisy data |

3. Trendlets based time series clustering

Trendlets are the smallest building blocks of our novel method of clustering time series data. The behavior of a time series (increasing, decreasing or equal) can be easily identified by observing its trendlets. Figure 1 shows the significant modules in the clustering process. The accuracy of prediction tasks is determined by the capability of the functions that remove the trend and seasonality, and in some cases removing them is not a wise option for the application at hand. For example, in time series clustering, the trend and seasonality structures play a major role in identifying similarity between series. This method focuses

Figure 1: Structural flow of trendlets based time series clustering

on clustering by representing the trend as the base element for finding the similarity between time series data. The following five terminologies are defined for the clustering process.

Trendlets: Trendlets are the smallest elementary representational structures of our proposed method. A trendlet is formed by encoding the increase, decrease or equality of the time series values, and is generated based on the probability of three elementary sets, namely the uplet, downlet and equalet. A time series Tn with n values, divided into segments of length m, contains $\lfloor n/m \rfloor + 1$ trendlets if $(n \bmod m) \neq 0$, and $n/m$ trendlets if $(n \bmod m) = 0$.

Uplet, downlet and equalet: Uplet, downlet and equalet are the sets of elements in a segment, selected by a unidirectional comparison process (refer to Theorem 1). These elements are named trendlet components. The time series values in a segment are divided into uplet, downlet and equalet based on consecutive comparisons performed between the values, one pair at a time. For instance, consider a segment S1 of the time series Tn containing t1, t2, t3, t4, t5, ..., tm. If t1 < t2, then t2 ∈ uplet and t1 ∈ downlet. If t2 > t3, then t3 ∈ downlet, and so on. If t1 and t2 are equal, then t2 is added to the equalet set, provided t1 has been added to some other set in the previous comparison. The number of elements in the equalet set determines the number of equal trend values in the segment.

Trendlet string: A trendlet string closely resembles a binary string containing a series of 1's and 0's; the occurrence of the different binary sequences is given by Equation 6. A time series Tn is represented by a trendlet string of length $\lfloor n/m \rfloor + 1$, where n and m represent the length of the time series and the length of a segment respectively. The formation of a trendlet is shown in Figure 2.

Figure 2: Formation of a typical trendlet string
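The comparison rule above can be sketched in Python as follows; the function name and structure are ours, a simplified reading of the definitions rather than the authors' implementation.

```python
def extract_components(segment):
    """Split one segment into uplet, downlet and equalet by the
    one-directional pairwise comparison described above."""
    uplet, downlet, equalet = [], [], []
    for i in range(len(segment) - 1):
        a, b = segment[i], segment[i + 1]
        if i == 0:
            # The first element joins a set according to the first pair.
            if a < b:
                downlet.append(a)
            elif a > b:
                uplet.append(a)
            else:
                equalet.append(a)
        # Every later element joins a set according to its predecessor.
        if a < b:
            uplet.append(b)
        elif a > b:
            downlet.append(b)
        else:
            equalet.append(b)
    return uplet, downlet, equalet
```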

3.1. Preprocessing and segmentation of time series data

Time series clustering requires systematic preprocessing, which involves finding the mean value over the required time dimensions. In our proposed method we use two re-sampling methods over the time series data, namely wavelets and variation of the time dimension. The time dimensions include the day, week, month and year; a lower time dimension provides deeper insight into the clustering of the time series data. Wavelet-based re-sampling of the time series data is performed to reduce the value n without losing the properties of the series, and thereby the time required for clustering can be reduced significantly. Figure 3 shows the steps adopted for preprocessing the time series data Tn. The segmentation process is performed after the re-sampling. If n is the length of the time series Tn, then a segment S contains m time series values; the condition m < n holds throughout the clustering process. The impact of varying the segment length m over a time series of length n is studied and discussed in Section 4. Re-sampling of the time series data helps to uncover the structure of the time series through the expressive behaviour of every trendlet.

Figure 3: Preprocessing of time series data for clustering
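A minimal sketch of this re-sampling step, assuming a pandas Series with a DatetimeIndex and using PyWavelets as one possible wavelet library (the function name and defaults are ours):

```python
import pywt  # PyWavelets


def resample_series(series, rule="D", wavelet="haar", level=1):
    """Shorten a long series before segmentation: average over a coarser
    time dimension, then keep the wavelet approximation coefficients."""
    coarse = series.resample(rule).mean()      # e.g. hourly -> daily means
    values = coarse.to_numpy()
    for _ in range(level):                     # each level roughly halves n
        values, _ = pywt.dwt(values, wavelet)  # keep approximation cA only
    return values
```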

3.2. Trendlets Formation

Segmentation is the first step of trendlets formation. Algorithm 1 shows the initial steps of clustering. The structure clusterInfo stores the details of the clusters formed by the proposed method. SEG_LEN is the user-defined length (number of values) that divides a sequence into segments. Aggregating the time series data having a unique trendlet pattern helps to reveal the behavior of each time series sequence. The patterns generated by extractTrend() (refer to Algorithm 2) are subjected to hierarchical agglomerative clustering with ward distance as the linkage measure. The variable nTS stores the count of time series sequences in the dataset D. The binary representation and trendlet components of the time series sequences are stored in the arrays binaryVal and cData. Hierarchical clustering is performed on the array cData to identify similarly trended time series sequences.

Algorithm 1 Trendlets based Time series Clustering
1: procedure trendletClustering(series, length)
2:   Input: a time series dataset D containing time series T1 to TnTS
3:   Output: clusters
4:   Initialization: clusterInfo = { }
5:   segmentLength = SEG_LEN
6:   binaryVal = [ ]
7:   cData = [ ]
8:   nTS = length(D)
9:   seriesId = getTimeSeriesId(D)
10:  index = 0
11:  repeat
12:    tsId ← seriesId[index]
13:    binaryVal[tsId], cData[tsId] ← extractTrend(tsId, D[index], segmentLength)
14:    index ← index + 1
15:  until (index == nTS)
16:  clusterFinding(nTS, cData)
17:  finalClusterInfo()
18: end procedure

The function getTimeSeriesId() obtains the node identities from the dataset. The function extractTrend() performs the pattern encoding of the time series data; its inputs are the time series values of each generator node Tn (D[index]), the corresponding node id tsId, and the user-defined segment length segmentLength. The global dictionary clusterInfo is updated by Algorithm 2 and is used by finalClusterInfo() to compute the density of each cluster. The sub-procedures of extractTrend() are initialization(), append(), appendTrend(), trendUpdate(), countUpdate(), emptyArrays(), valCheckIncrement() and stringCreation(). The variables of extractTrend() are initialized by the initialization() procedure and are defined as follows. The length of a time series sequence is denoted tsLen; uplet, downlet, equalet and trendlet are arrays that store the trend values. The number of segments segNum in a time series sequence is obtained from Equation 1, where segLen is the maximum number of time series values in a segment:

$$segNum = \left\lceil \frac{tsLen}{segLen} \right\rceil \qquad (1)$$

The variable segCount, which stores the segment count, is initialised to 1. The iterative and flag variables lastCount, count and lastVal record the count of proper segments, the count of time series values and the last value of the final proper segment respectively; these are initialized to 0. The dictionary variable series records the trendlet set of each segment, with the segment number as the key and the trendlet sets as the value. Each iteration results in the formation of a dictionary entry of the form { a, b }, where the key a is the respective node id and the value b is the multiple-array trendlet. The append() procedure described in Algorithm 3 locally updates the arrays after the formation of the uplet, downlet and equalet values of a trendlet. The appendTrend() procedure adds the trend information, together with the trendlet components of the respective segment of a sequence, into the array trendlet. The procedure trendUpdate() performs three functions, namely the increment of count, appendTrend() and the dictionary update. The increment step resets count and nextVal to 0 and increments segCount by 1; the dictionary update adds trendlet into series with segCount - 1 as the key and reinitializes the arrays uplet, downlet, equalet and trendlet to null. The procedure emptyArrays() re-initializes the respective arrays for the next segment. The function valCheckIncrement() performs the timely increment, update and iterative check of the flag variables: it initializes check to 1 and current to nextVal, after which nextVal is incremented by 1.

Table 2: An instance of time series values and trendlets formation

Time series: { 12, 13, 9, 8, 2, 23, 20, 22, 1, 4, 11, 14, 15, 16, 18, 17, 3 }
Segment length: 6; number of segments: 3

| Seg. No | Uplet | Downlet | Equalet |
|---|---|---|---|
| 1 | { 13, 23 } | { 12, 9, 8, 2 } | { } |
| 2 | { 22, 4, 11, 14 } | { 1, 20 } | { } |
| 3 | { 16, 18, 17 } | { 15, 3 } | { } |
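Running the extract_components sketch from above on the first segment of Table 2 reproduces its first row:

```python
segment = [12, 13, 9, 8, 2, 23]          # first segment of Table 2
uplet, downlet, equalet = extract_components(segment)
print(uplet, downlet, equalet)           # [13, 23] [12, 9, 8, 2] []
```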

The extractTrend() procedure starts with an initial comparison of consecutive values of a time series sequence Tn (refer to Theorem 1). The values are obtained in sequence with the help of the procedure valCheckIncrement(). The output of extractTrend() is the trendlet string. After the comparison process, the encoded string is formed from the probability of each trendlet component of

Algorithm 2 The Trendlets Formation
1: procedure extractTrend(tsid, ts, segLen)
2:   initialization()
3:   for val ← 0, tsLen do                         // iterate over the sequence length
4:     current ← val                               // variables to track sequence indices
5:     nextVal ← current + 1
6:     if (nextVal ≠ tsLen - 1) then               // comparison of the first element
7:       if (count == segLen - 1) and (segCount ≠ segNum - 1) then
8:         if (ts[val - count] < ts[val - count + 1]) then
9:           append(downlet, ts[val - count])      // downlet array formation
10:          trendUpdate()
11:        else if (ts[val - count] > ts[val - count + 1]) then
12:          append(uplet, ts[val - count])
13:          trendUpdate()                         // trendlet update and count increment
14:        else
15:          append(equalet, ts[val - count])
16:          trendUpdate()
17:        end if                                  // elements other than the first element
18:      else if (count ≠ segLen - 1) then
19:        if (ts[current] < ts[nextVal]) then
20:          append(uplet, ts[nextVal])
21:          countUpdate()
22:        else if (ts[current] > ts[nextVal]) then
23:          append(downlet, ts[nextVal])
24:          countUpdate()                         // variables to track the segment length
25:        else
26:          append(equalet, ts[val - count])
27:          countUpdate()
28:        end if
29:      else                                      // inclusion of the last segment element
30:        lastCount ← count
31:        lastVal ← val
32:        if (ts[val - count] < ts[val - count + 1]) then
33:          append(downlet, ts[val - count])
34:        else if (ts[val - count] > ts[val - count + 1]) then
35:          append(uplet, ts[val - count])
36:        else
37:          append(equalet, ts[val - count])
38:        end if
39:        appendTrend()                           // appending the three trendlet components
40:        series[segCount] ← trendlet
41:        emptyArrays(trendlet, uplet, downlet, equalet)  // reinitialization
42:        break
43:      end if
44:    else
45:      break                                     // trendlets formed from the last proper segment
46:    end if
47:  end for

48: lastSegNumber ← tsLen - (lastVal + 1)          // length of the additional segment
49: current ← lastVal + 1                          // initialization for the additional-segment iteration
50: nextVal ← current + 1
51: check ← 0                                      // tracks the last additional-segment index
52: for rest ← val + 1, tsLen - 1 do               // iterating over the additional segment
53:   if check == 0 then
54:     if (ts[current] < ts[nextVal]) then
55:       append(downlet, ts[current])
56:       append(uplet, ts[nextVal])
57:       valCheckIncrement()                      // update of temporary variables
58:     else if (ts[current] > ts[nextVal]) then
59:       append(uplet, ts[current])
60:       append(downlet, ts[nextVal])
61:       valCheckIncrement()
62:     else
63:       append(equalet, ts[current])
64:       valCheckIncrement()
65:     end if
66:   else                                         // remaining additional-segment elements
67:     if ts[current] < ts[nextVal] then
68:       append(uplet, ts[nextVal])
69:       current ← nextVal                        // update of iteration variables
70:       nextVal ← nextVal + 1
71:     else if ts[current] > ts[nextVal] then
72:       append(downlet, ts[nextVal])
73:       current ← nextVal
74:       nextVal ← nextVal + 1
75:     else
76:       append(equalet, ts[nextVal])
77:       current ← nextVal
78:       nextVal ← nextVal + 1
79:     end if
80:   end if
81: end for
82: trendletS ← stringCreation()                   // string creation from trendlets
83: return trendletS, trendInfo
84: end procedure


Algorithm 3 Trendlet Appending Procedure
1: procedure appendTrend
2:   append(trendlet, uplet)
3:   append(trendlet, downlet)
4:   append(trendlet, equalet)
5: end procedure

a segment. The procedure stringCreation() performs the probability computation and the trendlet string formation. The formation of trendlets is illustrated with the time series sequence shown in Table 2. The number of segments in the time series is three, since $(n \bmod m) \neq 0$. Consider the first segment of the time series given in Table 2, with elements { 12, 13, 9, 8, 2, 23 }. A one-directional comparison is performed between the consecutive values of the segment. The set equalet is empty, as there are no equal elements in the segment. A noticeable feature of the comparison process is that the total number of times the elements of a segment participate in comparisons is always even, irrespective of the parity of the segment size (refer to Theorem 1).

Theorem 1. If a segment $S_i$, $0 \leq i < m$, contains $k$ elements and each element of $S_i$ participates in the comparison process $C_{it}$ times, $1 \leq t \leq k$, then $\sum_{t=1}^{k} C_{it}$, $0 \leq i < m$, is an even number irrespective of the parity of the segment length $m$.

Proof: Let $S_i$ be a typical segment of a time series sequence $T_n$ and let $k$ be the number of elements in the segment. Let $v_{i1}$ and $v_{ik}$ be the first and the last elements of $S_i$. In the one-directional comparison process, the first and the last terms participate only once, i.e., $C_{i1} = 1$ and $C_{ik} = 1$; this holds for a segment $S_i$ since $v_{i0}$ and $v_{i(k+1)}$ are invalid values for the segment. There are $k - 2$ elements in a segment other than the first and the last. If $t_1, t_2, t_3, \ldots, t_k$ are the time series values of a segment $S_i$, then there are $k - 2$ values starting from $t_2$ and ending with $t_{k-1}$. The value $t_2$ participates in the comparison process with $t_1$ and $t_3$; similarly, all the other $k - 2$ values participate in the comparison process twice, i.e., $C_{i2} = 2, C_{i3} = 2, \ldots, C_{i(k-1)} = 2$. Hence the total involvement $TI_k$ of the $k$ elements in the comparison process of a segment $S_i$ is given by Equation 2:

$$TI_k = C_{i1} + C_{ik} + \sum_{l=2}^{k-1} C_{il}, \quad 0 \leq i < m. \qquad (2)$$

If $S_i$ is an even segment, i.e., $k$ is even, then $k - 2$ is also even and $\sum_{l=2}^{k-1} C_{il} = 2(k-2)$. According to Equation 2, $TI_k$ for an even segment is $2 + 2(k-2)$, which simplifies to $2(k-1)$, an even number. An odd segment $S_j$ contains $k$ elements where $k$ is odd; then $k - 2$ is also odd. As explained above, the total involvement of these $k - 2$ elements is $\sum_{l=2}^{k-1} C_{jl} = 2(k-2)$. Hence, again by Equation 2, $TI_k$ for an odd segment also simplifies to $2(k-1)$, an even number. This establishes the theorem.
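A quick numeric check of Theorem 1 (our own sketch): counting each element's participations under adjacent-pair comparison always gives $2(k-1)$.

```python
def total_involvement(k):
    """Total comparison participations for a k-element segment under
    adjacent-pair comparison (t1-t2, t2-t3, ..., t_{k-1}-t_k)."""
    counts = [0] * k
    for i in range(k - 1):
        counts[i] += 1      # left element of the pair
        counts[i + 1] += 1  # right element of the pair
    return sum(counts)

for k in (5, 6, 7, 8):  # odd and even segment lengths
    assert total_involvement(k) == 2 * (k - 1)  # always an even number
```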

Figure 4: Different types of time series segments

The functioning of extractTrend() is based on the time series length n and the segment length m. A time series Tn of length n contains two types of segments, namely proper segments and an additional segment; Figure 4 shows the different types of segments. If $n \bmod m = 0$, then Tn contains only proper segments of length m. If $n \bmod m \neq 0$, then Tn contains both proper segments and an additional segment. The number of proper segments in such cases equals $\lfloor n/m \rfloor$, and the additional segment contains $n \bmod m$ elements, which is fewer than m. The additional segment is processed after the proper segments, and a sequence of trendlet components is formed.

The stringCreation() procedure described in Algorithm 4 involves five significant structures for a typical time series: trendlet, series, trendletProb, encodedString and trendInfo. The trendlet is an array that stores the sets of uplet, downlet and equalet for a segment $S_i$. The variable series stores the corresponding trendlet with its segment number as the key. The dictionary variable trendletProb contains the probabilities of uptrend, downtrend and equal trend in the form of a triplet (A, B, C), where A, B and C signify the probabilities of uplet, downlet and equalet respectively. The trendInfo contains the count of trendlet components of each segment with the respective segment number. The procedure trendProbability() described in Algorithm 5 computes the probability of the trendlet components of each segment $S_i$; these probabilities are used to generate the trendlet string for every time series. The procedure for generating the trendlet string is described in Algorithm 6, and Equation 6 shows the basic rule set to generate the binary string corresponding to each time series sequence.

Probability-based modeling of a trendlet string: Let a time series Tn contain k segments; the value of k depends upon the segment size m and the length n of the time series. A trendlet of a segment $S_i$, $0 \leq i < k$, contains three subsets, namely uplet $U_i$, downlet $D_i$ and equalet $E_i$, where $|S_i| = |U_i| + |D_i| + |E_i|$. Let $|U_i| = p$, $|D_i| = q$ and $|E_i| = e$. Let $P_i^u$, $P_i^d$ and $P_i^e$ be the probabilities of a segment behaving as an uptrend, downtrend or equal trend; they are calculated from Equations 3, 4 and 5 respectively. Equation 6 shows the resulting representations of a segment when the user is interested in analyzing the uptrend of the time series data; the physical interpretation of its first three cases is given below.

$$P_i^u = \frac{|U_i|}{m} \qquad (3)$$
$$P_i^d = \frac{|D_i|}{m} \qquad (4)$$
$$P_i^e = \frac{|E_i|}{m} \qquad (5)$$

  [      [      [ Ei = [    [      [    [

1 0 0 1 0 1 1

1 0 1 0 1 0 1

1 1 0 0 1 1 0

] ] ] ] ] ] ]

if if if if if if if

Piu == Pid == Pie Pie > (Pid & Piu ) Pid > (Pie & Piu ) Piu > (Pid & Pie ) Piu < (Pid & Pie ) Pid < (Piu & Pie ) Pie < (Piu & Pid )

(6)

The binary sequence [1 1 1] indicates that all the trendlet components are equal in the segment. The representation [0 0 1] shows that the equalets in the segment dominate the other two trendlet components. Similarly, [0 1 1] indicates that the downlets and equalets outnumber the uplets. For a set of three variables under a binary interpretation of 0 or 1, each element admits two conditions: it can be smaller than the other two, or greater than the other two. Algorithm 5 generates the significant structures for clustering the time series sequences. The variables probability and trendDetails store the details of the trendlet components. The cardinalities of the uplet, downlet and equalet of a segment are stored in uCount, dCount and eCount respectively, together with their segment numbers and time series identities. A trendlet of a typical segment $S_i$ is replaced with an encoded value $E_i$, a triple over {0, 1}. The selection of encoding values rests on the priority of trend selection; we assume an uptrend analysis throughout our experiments. Under this priority assumption, Algorithm 6 generates a trendlet string for a single time series sequence based on Equation 6. The probability values for each trendlet component of a time series segment are obtained from Algorithm 5. Each element of the dictionary probability is of the form (key1, (key2, value)), where key1 is the series id, key2 is the segment id and value is an array of three elements containing the probabilities of the trendlet components: value[0], value[1] and value[2] hold the probabilities of uplet, downlet and equalet respectively.
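Equation 6 and Algorithm 6 can be condensed into a small Python function; this is our sketch of the rule set, with the Table 2 probabilities as a usage example.

```python
def encode_segment(pu, pd, pe):
    """Map a segment's (uplet, downlet, equalet) probabilities to its
    three-bit code under the uptrend-priority rule of Equation 6."""
    if pu == pd == pe:
        return '111'
    if pe > pd and pe > pu:
        return '001'
    if pd > pu and pd > pe:
        return '010'
    if pu > pd and pu > pe:
        return '100'
    if pu < pd and pu < pe:
        return '011'
    if pd < pu and pd < pe:
        return '101'
    return '110'  # pe smaller than both pu and pd

# Segment 1 of Table 2: 2 uplets, 4 downlets, 0 equalets, m = 6.
print(encode_segment(2 / 6, 4 / 6, 0.0))  # '010' (downtrend-dominant)
```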


Algorithm 4 Trendlet String Formation
1: procedure stringCreation
2:   Output: trendletS
3:   appendTrend()
4:   series[segCount + 1] ← trendlet
5:   emptyArrays(uplet, downlet, equalet, trendlet)
6:   trendletProb, trendInfo ← trendProbability(series, segLen)
7:   trendletS ← encoding(trendletProb, segLen)
8:   return trendletS, trendInfo
9: end procedure

Algorithm 5 Probability Computation
1: procedure trendProbability(series, length)
2:   Input: series, length
3:   Output: probability
4:   Initialization:
5:   probability ← { }
6:   trendDetails ← { }
7:   for all key, value ∈ series do
8:     trend ← value
9:     uCount ← len(trend[0])
10:    dCount ← len(trend[1])
11:    eCount ← len(trend[2])
12:    trendDetails[key] ← [uCount, dCount, eCount]
13:    uProbability ← uCount / length
14:    dProbability ← dCount / length
15:    eProbability ← eCount / length
16:    probability[key] ← (uProbability, dProbability, eProbability)
17:  end for
18:  return probability, trendDetails
19: end procedure

The binary string of each segment is concatenated to form a single string, stored in the variable trendletS. This binary string is returned to Algorithm 4, which finally sends the details to Algorithm 2. The different types of segments in a time series sequence are shown in Figure 4. The first part of Algorithm 2 processes the $\lfloor n/m \rfloor$ proper segments, while the second loop of the algorithm contains the statements for processing the additional segment; the steps for generating the trendlet components apply equally to the additional segment.

Algorithm 6 Encoding of Trendlet String
1: procedure encoding(probability, length)
2:   Output: trendletS
3:   trendletS ← NULL
4:   for all key, value ∈ probability do
5:     if value[0] == value[1] == value[2] then
6:       trendletS ← concatenate(trendletS, '111')
7:     else if value[2] > (value[1] & value[0]) then
8:       trendletS ← concatenate(trendletS, '001')
9:     else if value[1] > (value[0] & value[2]) then
10:      trendletS ← concatenate(trendletS, '010')
11:    else if value[0] > (value[1] & value[2]) then
12:      trendletS ← concatenate(trendletS, '100')
13:    else if value[0] < (value[1] & value[2]) then
14:      trendletS ← concatenate(trendletS, '011')
15:    else if value[1] < (value[0] & value[2]) then
16:      trendletS ← concatenate(trendletS, '101')
17:    else
18:      trendletS ← concatenate(trendletS, '110')
19:    end if
20:  end for
21:  return trendletS
22: end procedure

3.3. Trendlet clustering

The clustering of trendlet strings is performed based on the counts of the trendlet components. Consider the time series dataset DnTS, where nTS is the number of time series sequences; the number of trendlet strings generated by the representational method described above is then nTS. Algorithm 4 generates the details of the trendlet components, stores them in trendInfo and sends them to Algorithm 1. The trendlet details of every time series sequence are stored by Algorithm 1 in a dictionary cData, whose elements are (key, value) pairs: key refers to the time series id and value to the segmental count sequences of the trendlet components. The aggregation of trendlets for clustering is shown in Figure 5. The segments S1, S2, ..., Sn are grouped to create cData, based on forenoon (F), afternoon (A), night (N) and midnight (MN) respectively.


Figure 5: Aggregation of trendlets for clustering process

Clustering is performed on these values by the hierarchical agglomerative clustering algorithm. The clustering process is performed in an unsupervised environment where the number of initial clusters is not known. The parameters provided to the clustering algorithm are the distance threshold and the ward linkage measure.

4. Results and analysis

The trendlet clustering experiments are performed using open-source tools and frameworks. We used Python 3 with the libraries pandas, NumPy, scikit-learn and matplotlib for our implementation; graphs are generated with matplotlib and gnuplot 5.0. The programs are executed on a desktop machine with an Intel Core i7-5500U CPU (2.40 GHz x 4), 16 GB of primary memory and the Ubuntu 16.04 LTS 64-bit operating system. The proposed algorithms were evaluated with renewable energy data on the electricity supply system of continental Europe. The data includes the demand and inflow of renewable energy for the period 2012 to 2014; power generation is accomplished using solar and wind generators. The time series data is compiled from the electrical load values of 1494 generators, captured hourly from 2012 to 2014. We also performed experiments on the solar and wind datasets of COSMO (Consortium for Small-Scale Modelling) and ECMWF (European Centre for Medium-Range Weather Forecasts) respectively. The time series data used in our experiments comprises the power generation load, the intensity of solar energy received and the wind energy generated. The analyses performed over the time series are divided into the formats 24-hour, forenoon, afternoon, night and midnight. The day-wise clustering of time series sequences is performed with a segment size of 24. If the segment size is 6, then the details of midnight, forenoon, afternoon and night are obtained by the multiples of 1, 2, 3 and 4. This method of clustering helps to analyze and group the generators based on a user-defined time period segmentation, and the trend similarity information of the time series sequences is obtained from this representational method.

4.1. Hierarchical clustering of trendlets

Clustering experiments are performed with the hierarchical clustering algorithm to find similar generators using the trend information. Since the number of clusters is not known before grouping, determining the clusters or a range of clusters is challenging. The number of clusters is one of the essential parameters of traditional clustering algorithms for starting the clustering process, and hierarchical clustering likewise expects the anticipated number of clusters. If the number of expected clusters is not known, an empirical analysis of the hierarchical clustering is performed by tuning parameters such as the distance threshold, the linkage and the distance measures. The experiments performed on the time series sequences are divided into two sections. Section 4.1.1 describes the process of finding the actual and adjusted clusters through the hierarchical clustering algorithm by the average silhouette method; this is performed by varying the distance threshold and plotting the corresponding graphs. We performed a user-defined segmentation of the time series sequences into different parts of the day, namely forenoon, afternoon, night and midnight, each covering a period of 6 hours. The time series sequences contain the load values recorded in every hour of a day. We considered a segment length of six to capture the trend of the different parts of a day; the segment length for the daily-basis analysis is taken as 24. The cluster formation also reveals the nature of the load present in the time series sequences. Section 4.1.2 illustrates the performance of four basic clustering algorithms, namely birch, two-means, k-means and average hierarchical, initialized with the optimal numbers of clusters obtained in the experiments described in Section 4.1.1.

4.1.1. Analysis of clustering based on day-wise segmentation (24-hour)

The clustering of time series sequences based on 24-hour segmentation reveals the similarity of generators based on the trend of their daily power production. The results of the cluster analysis are shown in Figures 6 to 12. In most of the datasets, clustering on trendlets shows a common pattern of silhouette variation as the distance threshold changes.

Figure 6: Cluster and silhouette score variation with distance threshold (Load)

Figure 7: Cluster and silhouette score variation with distance threshold (Solar cosmo)
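The threshold sweep behind these figures can be sketched with scikit-learn; the trendlet-component matrix and the threshold grid below are hypothetical stand-ins for the paper's cData.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Hypothetical trendlet-component count matrix: one row per generator.
cData = np.random.rand(300, 16)

# Sweep the ward-linkage distance threshold and record the silhouette
# score and cluster count, mirroring the analysis in Figures 6-10.
for threshold in range(5, 60, 5):
    model = AgglomerativeClustering(
        n_clusters=None, distance_threshold=threshold, linkage="ward")
    labels = model.fit_predict(cData)
    if 1 < model.n_clusters_ < len(cData):  # silhouette needs 2..n-1 clusters
        print(threshold, model.n_clusters_, silhouette_score(cData, labels))
```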

Initially, the silhouette score increases with the distance threshold; after a certain level, the score decreases with distance. This reflects the behaviour of the clustering process: the cluster elements are first rearranged into their best grouping, and the quality of the clusters decreases with further merging of cluster elements. The distance threshold and the number of clusters at this level are noted and considered the optimum values. For instance, Figures 6, 9 and 10 are marked with the optimum distance threshold at the highest silhouette score. Figures 7 and 8 show the maximum score obtained after its first decrease; the distance threshold at this level does not yield the optimum clusters. Every energy source shows a saturation point where the silhouette score decreases with the distance threshold of hierarchical clustering. Figures 11 and 12 show the consolidated variation of silhouette score, optimum clusters and distance

Figure 8: Cluster and silhouette score variation with distance threshold (Solar ecmwf)

The inferences obtained from these figures show the nature of hierarchical clustering. For instance, the distance threshold obtained for the load data based on 24-hour clustering is 38, and the corresponding number of clusters that can be formed is 135. The silhouette score is highest for the load data and lowest for wind cosmo. The load dataset contains fewer sparse values than the other datasets, whereas the cluster elements of the wind cosmo dataset are less similar to one another, so the distance threshold required to stabilize its clusters is larger. The solar cosmo and solar ecmwf datasets contain surplus sparse values of power production in the morning and night hours. The power generation of the solar generators does not follow a common pattern, since the distribution of day length differs across regions. This is evident from Figures 11 and 12, where the number of clusters and the distance threshold are larger and the silhouette score is smaller for the solar datasets. The wind generators show greater similarity in their pattern of power production than the solar generators: the silhouette score of wind ecmwf is greater, and its number of clusters smaller, than that of solar ecmwf.


Figure 9: Cluster and silhouette score variation with distance threshold (Wind cosmo)

Figure 10: Cluster and silhouette score variation with distance threshold (Wind ecmwf)

Figure 11: Variation of distance threshold and optimum clusters with the energy source


Figure 12: Silhouette score variation with the energy source for the optimum clusters

The trend-based cluster analysis of the day-wise segmented time series sequences shows better results when the values are continuous. The presence of sparse values in the solar and wind data disrupts the pattern formation and thus the grouping. The load data shows the best trend-based clustering with the hierarchical clustering method, with a silhouette score between 0.8 and 0.9.

4.1.2. Analysis of clustering based on parts of a day segmentation

The time series sequences are clustered based on the different parts of a day, namely forenoon, afternoon, night and midnight. Since every generator contains hourly recorded values for two years, the series can be segmented with a segment length of six and the segments aggregated for the respective parts of a day (see the sketch below). This process generates a more intuitive and meaningful clustering for each dataset, revealing the trend of power generation. Figure 13 shows the variation of the silhouette score and the number of clusters for the forenoon segmentation in three datasets. The silhouette score of the load data steadily increases, reaches a maximum and then decreases with the distance threshold. The solar data shows little clustering tendency at the initial distance threshold values; after the first maximum, the clustering process behaves randomly owing to the nature of the solar data. Figure 14 presents the clustering process for the afternoon segmentation. The load and wind datasets show a better clustering process than the solar data. The solar clustering process is random because there is significant variance in the distribution of solar energy across the generators in different regions.
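A minimal sketch of this parts-of-day segmentation, under our own assumptions about the hourly layout (the series starts at hour 0, and consecutive six-hour segments map cyclically onto the four parts of the day, i.e. the multiples 1, 2, 3 and 4 of the segment size):

```python
import numpy as np

PARTS = ("midnight", "forenoon", "afternoon", "night")  # 6 hours each

def parts_of_day(series, seg_len=6):
    """Split an hourly series into per-part groups of segments."""
    n = len(series) - len(series) % seg_len
    segments = np.asarray(series[:n], dtype=float).reshape(-1, seg_len)
    out = {}
    for i, part in enumerate(PARTS):
        out[part] = segments[i::4]   # every 4th segment belongs to this part
    return out

hours = np.arange(24 * 7, dtype=float)       # one week of hourly values
agg = parts_of_day(hours)
print({k: v.shape for k, v in agg.items()})  # each part: (7, 6)
```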

Figure 13: Variation of silhouette score and number of clusters for forenoon segmentation

Figure 14: Variation of silhouette score and number of clusters for afternoon segmentation


Compared to the forenoon segmentation, the afternoon segmentation of the solar data shows a steadier clustering process. External factors such as the presence of clouds, rain and day length affect the production of electricity in the solar generators. The distance thresholds for the different datasets in different parts of the day are shown in Table 3 and Figure 15. The wind datasets require large distance thresholds to obtain the optimum clusters, while the solar datasets show varying distance thresholds for the different parts of a day.

Table 3: Distance threshold for optimum clusters in various datasets for different parts of a day

Data set       Forenoon   Afternoon   Night   Midnight
Load               23         18        18       18
Solar cosmo        60          5        15       11
Solar ecmwf        42         24         5       12
Wind cosmo         69         67        69       71
Wind ecmwf         78         78        77       78

Figure 15: Variation of distance threshold in various datasets for different parts of a day

The variation of the actual clusters and adjusted clusters for the different datasets is illustrated in Table 4. The actual clusters are the initial number of clusters used to start the hierarchical clustering algorithm.

Table 4: Adjusted and normal clusters in various datasets for different parts of a day

                        Forenoon            Afternoon             Night              Midnight
Data set            Normal  Adjusted    Normal  Adjusted    Normal  Adjusted    Normal  Adjusted
Load                  279      113        275      144        266       87        279      131
Solar Cosmo          1490      155       1491     1491        538      188       1450      264
Solar ECMWF          1491      392       1491      266        950      548       1282      262
Wind Cosmo           1493      425       1493      355       1491      425       1490      375
Wind ECMWF           1492      390       1492      391       1492      418       1492      390

In most cases, the distance threshold of the algorithm is initialized to one and increased up to a level where further clustering is not possible. The adjusted clusters are determined by the distance threshold at which the first drop of the silhouette score is observed during the clustering process: the silhouette score initially increases up to an optimum distance threshold and decreases thereafter. This point is captured, and the number of clusters is read from the plotted graphs (a sketch of this rule is given below). Figure 16 shows the range of adjusted clusters together with the actual clusters in different parts of a day for the various datasets.

Figure 16: Variation of actual clusters and adjusted clusters in various datasets for different parts of a day
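The adjusted-cluster rule can be expressed compactly. In this sketch (our own formulation, assuming the sweep from the listing in Section 4.1.1 produced (threshold, clusters, score) triples; the numbers in the example are illustrative only), the adjusted cluster count is the one recorded just before the silhouette score first drops:

```python
def adjusted_clusters(sweep):
    """Return (threshold, clusters) recorded just before the first
    drop of the silhouette score in a threshold sweep.

    `sweep` is a list of (threshold, n_clusters, score) triples,
    ordered by increasing threshold."""
    for prev, cur in zip(sweep, sweep[1:]):
        if cur[2] < prev[2]:               # first decrease of the score
            return prev[0], prev[1]
    return sweep[-1][0], sweep[-1][1]      # score never dropped

sweep = [(1, 1490, 0.10), (5, 900, 0.25), (10, 392, 0.31), (20, 150, 0.28)]
print(adjusted_clusters(sweep))            # -> (10, 392)
```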

4.1.3. Comparison of clustering algorithms with the optimal clusters

Table 5 shows the silhouette score analysis of the clustering processes of different clustering algorithms. The clusters are initialized with the normal and adjusted cluster counts obtained through the analysis of hierarchical clustering. The algorithms selected for analysis and comparison are Birch, Mini-batch k-means (MBK-means), k-means and average hierarchical, Avg (H). These algorithms require a minimum of additional parameters to perform the clustering process. A sketch of this comparison is given below.
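A minimal scikit-learn sketch of such a comparison, assuming `X` holds the trendlet representation and `k` is the normal or adjusted cluster count (the synthetic data and the remaining parameter defaults are our assumptions):

```python
import numpy as np
from sklearn.cluster import (AgglomerativeClustering, Birch,
                             KMeans, MiniBatchKMeans)
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # stand-in for trendlet triplets
k = 5                           # normal or adjusted cluster count

algorithms = {
    "BIRCH": Birch(n_clusters=k),
    "MBK-MEANS": MiniBatchKMeans(n_clusters=k, n_init=10),
    "K-MEANS": KMeans(n_clusters=k, n_init=10),
    "AVG(H)": AgglomerativeClustering(n_clusters=k, linkage="average"),
}

for name, algo in algorithms.items():
    labels = algo.fit_predict(X)
    print(f"{name:10s} silhouette = {silhouette_score(X, labels):.4f}")
```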

Table 5: Silhouette scores for normal and adjusted clusters formed by clustering algorithms for various datasets in different parts of a day

Part of day  Algorithm   LOAD                 SOLAR-COSMO          SOLAR-ECMWF          WIND-COSMO             WIND-ECMWF
                         Normal    Adjusted   Normal    Adjusted   Normal    Adjusted   Normal      Adjusted   Normal    Adjusted
FORENOON     BIRCH       0.83077   0.8656     0.00082   0.0856     0.00401   0.0925     0.0000074   0.0774     0.002677  0.1504
             MBK-MEANS   0.83830   0.83675    0.00418   0.023      0.0088    0.0277     0.0030     -0.017238   0.01120   0.050
             K-MEANS     0.831994  0.86444    0.000822  0.0747     0.040160  0.0748     0.000074    0.06320    0.0026    0.0875
             AVG(H)      0.831994  0.85774    0.000821  0.0510     0.00401   0.01164    0.00027     0.025501   0.00267   0.0209
AFTERNOON    BIRCH       0.8362    0.8592     0.000851  0.000851   0.00401   0.0909     0.00010     0.08172    0.002677  0.10373
             MBK-MEANS   0.8459    0.8453     0.0054    0.0054     0.00925   0.0165     0.00344    -0.01044    0.08012   0.0424
             K-MEANS     0.8353    0.86560    0.00085   0.00085    0.0040    0.06994    0.0001      0.067175   0.002677  0.0877
             AVG(H)      0.8353    0.8512     0.00077   0.00077    0.00401  -0.08101    0.00010     0.01373    0.00267   0.0016
NIGHT        BIRCH       0.8418    0.8544     0.6400    0.6847     0.4645    0.4600     0.0004      0.08318    0.0026    0.112
             MBK-MEANS   0.8529    0.83711    0.6498    0.6375     0.4734    0.46665    0.00472    -0.0045     0.00934   0.053
             K-MEANS     0.8433    0.85292    0.6405    0.6773     0.4645    0.4799     0.0004      0.0673     0.00267   0.0951
             AVG(H)      0.8433    0.73719    0.6405    0.5169     0.4645    0.25827    0.0038      0.03071    0.002677  0.02610
MIDNIGHT     BIRCH       0.8299    0.86022    0.05153   0.212      0.14710   0.252      0.4645      0.0778     0.0026    0.1103
             MBK-MEANS   0.84595   0.8398     0.0723    0.178      0.16583   0.1938     0.4751     -0.0150     0.00692   0.0475
             K-MEANS     0.8313    0.8642     0.05253   0.1928     0.15060   0.2730     0.4645      0.0631     0.0026    0.0915
             AVG(H)      0.83132   0.85463    0.5153    0.00027    0.15060   0.07779    0.0         0.01866    0.00267   0.0103

Figure 17: Variation of silhouette score proportion of various datasets with normal and adjusted clusters during forenoon and afternoon

Table 5 shows the detailed analysis of the four clustering algorithms on the different datasets during various parts of the day. The pattern of clustering is observed in terms of the silhouette score variation of the clusters formed with the normal and adjusted numbers of clusters. Figure 17 shows the quality of the clusters formed during the forenoon and afternoon. The load dataset shows an equal proportion of cluster score for the adjusted and normal clusters during the forenoon and afternoon. The solar data initially shows no clustering tendency during the afternoon: the amount of energy received by the solar generators varies across regions, so a common pattern of energy production is not available in the afternoon. The forenoon shows a clustering tendency for the adjusted clusters in solar cosmo, reflecting the inactivity of the solar generators across the different regions during the forenoon. There is a feeble tendency of clustering during the afternoon in solar cosmo compared to solar ecmwf, which reveals that there is no pattern in the energy production of solar cosmo with respect to solar ecmwf. Similar behavior is observed for the wind datasets during this period. In some cases the clustering process also shows a wrong tendency of clustering, resulting in a negative silhouette score. Solar-cosmo and wind-ecmwf exhibit a normal clustering tendency during the forenoon. Figure 18 shows the behaviour of the clustering algorithms on the datasets during the night and midnight. The load and solar datasets show better clustering results than the wind data during the night, whereas the wind data shows a clustering tendency at midnight with both the normal and adjusted clusters.

Figure 18: Variation of silhouette score proportion of various datasets with normal and adjusted clusters during night and midnight

The wind-ecmwf data exhibits a low clustering tendency during all segments of the day, which illustrates the nature of the values generated by wind-ecmwf: they do not contain a trend pattern suitable for clustering. The clustering algorithms are also compared by their execution time on each dataset when initialized with the actual and adjusted clusters; the experiments performed during the different parts of the day are shown in Figure 19 (a timing sketch is given below). Among the four algorithms, Birch clustering takes a significantly larger amount of time than the others, while average hierarchical and Mini-batch k-means are comparatively faster. K-means shows an increased clustering time for the adjusted clusters of the solar-ecmwf data during midnight: the power generation is random at midnight, and the presence of more unique patterns increases the clustering time. Figures 18 and 19 show that Mini-batch k-means and k-means perform best when both the quality and the time of clustering are considered. Birch clustering also performs well in terms of quality, but its clustering time is greater than that of the others.
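A minimal sketch of how such timings can be collected, reusing the `algorithms` mapping and data `X` from the listing in Section 4.1.3 (the use of `time.perf_counter` is our choice, not necessarily the one used for Figure 19):

```python
import time

def time_clustering(algorithms, X):
    """Fit each algorithm on X and record wall-clock time in seconds."""
    timings = {}
    for name, algo in algorithms.items():
        start = time.perf_counter()
        algo.fit(X)
        timings[name] = time.perf_counter() - start
    return timings

# e.g. time_clustering(algorithms, X) with the objects defined earlier
```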


Figure 19: Time taken by various clustering algorithms for different parts of the day


5. Experiments on the UCR-2018 time series archive

The UCR-2018 time series archive contains well-organized time series datasets that are used for different experimental purposes (Dau et al., 2018). The archive contains 128 datasets from different domains. Each dataset comprises a predefined training set, a test set and a document describing the details of the dataset. The experiments on the UCR-2018 archive are divided into two parts, namely clustering experiments and agglomerative clustering with a varying distance threshold. We performed these experiments on the trendlet representation of the dataset elements. Table 6 shows the details of the datasets used in the experiments, characterized by the number of classes (C), the imbalance ratio (IR), the number of instances (I) and the length of the time series (LEN). The IR is computed as the ratio of the number of instances in the majority and minority classes (Guijo-Rubio et al., 2018), and thus captures the distribution of data instances among the classes. Machine learning models generated from a balanced dataset provide better results on the respective test cases. The number of instances (I) includes the training as well as the test instances. The UCR-2018 archive contains datasets of different data formats normalized to their respective time series domains. It also contains a set of 15 datasets with missing values and variable data lengths (MVVDL); the missing values are obtained through linear interpolation. A sketch of these two preprocessing computations is given below.
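A minimal sketch of the two computations just described, under our own assumptions about the data layout (labels as a nonnegative integer array, missing values encoded as NaN):

```python
import numpy as np

def imbalance_ratio(labels):
    """IR = (size of majority class) / (size of minority class)."""
    counts = np.bincount(np.asarray(labels))
    counts = counts[counts > 0]            # ignore unused label values
    return counts.max() / counts.min()

def fill_missing(series):
    """Linearly interpolate NaN entries of a 1-D series."""
    x = np.asarray(series, dtype=float)
    idx = np.arange(len(x))
    nan = np.isnan(x)
    x[nan] = np.interp(idx[nan], idx[~nan], x[~nan])
    return x

print(imbalance_ratio([0, 0, 0, 1, 1, 2]))            # -> 3.0
print(fill_missing([1.0, np.nan, 3.0, np.nan, 5.0]))  # -> [1. 2. 3. 4. 5.]
```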


Table 6: Details of the datasets in the UCR-2018 time series archive

Dataset              #C    #I   LEN    %IR  | Dataset              #C    #I   LEN    %IR
ACSF1                10   200  1460   1.0   | GPoint                2   200   150   1.08
Adiac                37   781   176   3.75  | GPointAgeSpan         2   451   150   1.01
AGWX                 10  1000   500   1.0   | GPointMVF             2   451   150   1.10
AGWY                 10  1000   500   1.0   | GPointOVY             2   451   150   1.09
AGWZ                 10  1000   500   1.0   | Ham                   2   214   431   1.096
ArrowHead             3   211   251   1.0   | HandOutlines          2  1370  2709   1.762
Beef                  5    60   470   1.0   | Haptics               5   463  1092   2.0
BeetleFly             2    40   512   1.0   | Herring               2   128   512   1.56
BirdChicken           2    40   512   1.0   | HouseTwenty           2   159  2000   1.0
CBF                   3   930   128   1.5   | InlineSkate           7   650  1882   2.0
Car                   4   120   577   1.545 | InsectEPGRT           3   311   601   3.0
Chinatown             2   363    24   1.0   | InsectEPGST           3   266   601   2.66
ChlorineC             3  4307   166   2.879 | InsectWBS            11  2200   256   1.0
CinCECGTorso          4  1420  1639   2.6   | ItalyPD               2  1096    24   1.030
Coffee                2    56   286   1.0   | LKAppliances          3   750   720   1.0
Computers             2   500   720   1.0   | Lightning2            2   121   637   2.0
CricketX             12   780   300   1.56  | Lightning7            7   143   319   3.8
CricketY             12   780   300   1.285 | Mallat                8  2400  1024   5.5
CricketZ             12   780   300   1.64  | Meat                  3   120   448   1.0
Crop                 24 24000    46   1.0   | MedicalImages        10  1141    99  33.83
DiatomSizeR           4   322   345   6.0   | MelbournePDT         10  3633    24   1.043
DPOAgeGroup           3   539    80   8.56  | MPOAgeGroup           3   554    80   4.309
DPOCorrect            2   876    80   1.702 | MPOCorrect            2   891    80   1.830
DistalPhalanxTW       6   539    80  11.83  | MPTW                  6   553    80   6.4
DLoopDay              7   158   288   1.2   | MSRTrain              5  2925  1024   1.0
DLoopGame             2   158   288   1.0   | MSSTrain              5  2525  1024   1.0
DLoopWeekend          2   158   288   1.0   | MoteStrain            2  1272    84   1.0
ECG200                2   200    96   2.22  | NIFlECGThorax1       42  3765   750   1.6
ECG5000               5  5000   140 146.0   | NIFECGThorax2        42  3765   750   1.6
ECGFiveDays           2   884   136   1.55  | OSULeaf               6   442   427   3.533
EOGHorizontalSignal  12   724  1250   1.03  | OliveOil              4    60   570   3.25
EOGVerticalSignal    12   724  1250   1.03  | PLAID                11  1074  1344   6.769
Earthquakes           2   461   512   4.55  | POutlinesCorrect      2  2658    80   1.866
ElectricDevices       7 16637    96   4.72  | Phoneme              39  2110  1024  24.0
EthanolLevel          4  1004  1751   1.0   | PickupGWZ            10   100   361   1.0
FaceAll              14  2250   131   1.0   | PigAirwayPressure    52   312  2000   1.0
FaceFour              4   112   350   2.66  | PigArtPressure       52   312  2000   1.0
FacesUCR             14  2250   131   8.25  | PigCVP               52   312  2000   1.0
FiftyWords           50   905   270  52.0   | Plane                 7   210   144   2.22
Fish                  7   350   463   1.33  | PowerCons             2   360   144   1.0
FordA                 2  4921   500   1.051 | PPOAgeGroup           3   605    80   2.625
FordB                 2  4446   500   1.047 | PPOCorrect            2   891    80   2.092
FreezerRegularTrain   2  3000   301   1.0   | PPTW                  6   605    80  11.25
FreezerSmallTrain     2  2878   301   1.0   | RefDevices            3   750   720   1.0
Fungi                18   204   201   1.0   | Rock                  4    70  2844   1.0
GestureMidAirD1      26   338   360   1.0   | ScreenType            3   750   720   1.0
GestureMidAirD2      26   338   360   1.0   | SHandGenderCh2        2   900  1500   1.0
GestureMidAirD3      26   338   360   1.0   | SHandMovementCh2      6   900  1500   1.0
GesturePebbleZ1       6   304   455   1.25  | SHandSubjectCh2       5   900  1500   1.0
GesturePebbleZ2       6   304   455   1.13  | ShakeGWZ             10   100   385   1.0
ShapeletSim           2   200   500   1.0   | ShapesAll            60  1200   512   1.0
SKitchenAppliances    3   750   720   1.0   | SmoothSubspace        3   300    15   1.0
SonyAIBORS1           2   621    70   2.33  | SonyAIBORS2           2   980    65   1.45
StarLightCurves       3  9236  1024   3.769 | Strawberry            2   983   235   1.799
SwedishLeaf          15  1125   128   1.615 | Symbols               6  1020   398   2.66
SyntheticControl      6   600    60   1.0   | ToeSegmentation1      2   268   277   1.0
ToeSegmentation2      2   166   343   1.0   | Trace                 4   200   275   1.476
TwoLeadECG            2  1162    82   1.090 | TwoPatterns           4  5000   128   1.143
UMD                   3   180   150   1.0   | UWaveGLibraryAll      8  4478   945   1.27
UWaveGLibraryX        8  4478   315   1.27  | UWaveGLibraryY        8  4478   315   1.27
UWaveGLibraryZ        8  4478   315   1.27  | Wafer                 2  7164   152   9.30
Wine                  2   111   234   1.11  | WordSynonyms         25   905   270  30.0
Worms                 5   258   900   4.470 | WormsTwoClass         2   258   900   1.381
Yoga                  2  3300   426   1.18  | MVVDL               ***   ***   ***    ***

5.1. Cluster analysis of the UCR-2018 archive using trendlets

The trendlet representations of the UCR-2018 archive are clustered with four basic clustering algorithms, namely k-means, Birch, mini-batch k-means (MBK-Means) and average hierarchical, Avg (H). We initialized these algorithms with the number of classes and performed clustering on the trendlet representation to obtain the respective silhouette scores. The comparison shows that most of the scores for all algorithms are positive. Even though hierarchical clustering produces better silhouette scores in some cases, the chance of wrong clustering is higher owing to the instability of the IR and the number of instances; this is evident from the negative silhouette values in Table 7 when read alongside Table 6. A silhouette score greater than 0.5 indicates a balanced clustering, which is observed for several algorithms in Table 7. Mini-batch k-means and Birch show better clustering in terms of the silhouette score. The number of instances (I), the number of classes (C) and the IR together determine the cluster quality. For instance, the dataset PPOCorrect has the highest silhouette score because it has two classes, a low IR and a relatively large number of instances. For the datasets PickupGWZ, PigAirwayPressure, PigArtPressure and PigCVP, the silhouette scores of Avg (H) are negative, and the other algorithms also show smaller positive scores; this variation is due to the larger number of classes and the smaller number of instances in these datasets.

Table 7: Silhouette score comparison of various clustering algorithms on the trendlet-based representation of the UCR-2018 archive

Dataset              K-Means  Birch   MBK-M   Avg(H)  | Dataset              K-Means  Birch   MBK-M   Avg(H)
ACSF1                 0.695   0.682   0.696   0.481   | GesturePebbleZ1       0.479   0.454   0.464  -0.097
Adiac                 0.350   0.525   0.534   0.305   | GesturePebbleZ2       0.479   0.452   0.464  -0.0702
AGWX                  0.304   0.349   0.357   0.490   | GPoint                0.481   0.556   0.564   0.576
AGWY                  0.332   0.298   0.345   0.445   | GPointAgeSpan         0.482   0.529   0.528   0.611
AGWZ                  0.307   0.350   0.364   0.239   | GPointMVF             0.470   0.516   0.529   0.611
ArrowHead             0.304   0.390   0.391   0.545   | GPointOVY             0.509   0.529   0.529   0.611
Beef                  0.521   0.515   0.591   0.416   | Ham                   0.448   0.427   0.450   0.654
BeetleFly             0.322   0.360   0.360   0.367   | HandOutlines          0.757   0.732   0.720   0.857
BirdChicken           0.431   0.431   0.431   0.058   | Haptics               0.204   0.199   0.233   0.301
CBF                   0.124   0.161   0.179   0.275   | Herring               0.291   0.307   0.311   0.413
Car                   0.509   0.530   0.525   0.108   | HouseTwenty           0.553   0.622   0.622   0.485
Chinatown             0.199   0.213   0.252   0.484   | InlineSkate           0.500   0.541   0.544   0.421
ChlorineC             0.472   0.603   0.598   0.342   | InsectEPGRT           0.452   0.458   0.456   0.205
CinCECGTorso          0.526   0.627   0.635   0.436   | InsectEPGST           0.451   0.455   0.453   0.577
Coffee                0.459   0.552   0.562   0.428   | InsectWBS             0.101   0.155   0.171   0.387
Computers             0.496   0.492   0.491   0.238   | ItalyPD               0.275   0.258   0.253   0.510
CricketX              0.148   0.168   0.192  -0.161   | LKAppliances          0.576   0.306   0.445   0.341
CricketY              0.116   0.159   0.178   0.063   | Lightning2            0.466   0.439   0.425   0.632
CricketZ              0.132   0.174   0.193  -0.117   | Lightning7            0.353   0.403   0.401   0.234
Crop                  0.481   0.774   0.792   0.490   | Mallat                0.159   0.214   0.219   0.130
DiatomSizeR           0.466   0.506   0.517   0.086   | Meat                  0.284   0.204   0.274   0.663
DPOAgeGroup           0.556   0.566   0.567   0.769   | MedicalImages         0.527   0.561   0.584   0.551
DPOCorrect            0.873   0.873   0.873   0.873   | MelbournePDT          0.189   0.240   0.244   0.314
DistalPhalanxTW       0.314   0.332   0.334   0.439   | MPOAgeGroup           0.487   0.509   0.509   0.618
DLoopDay              0.098   0.080   0.116   0.394   | MPOCorrect            0.888   0.888   0.888   0.888
DLoopGame             0.814   0.814   0.814   0.786   | MPTW                  0.268   0.285   0.297   0.209
DLoopWeekend          0.814   0.814   0.814   0.786   | MSRTrain              0.253   0.338   0.307   0.039
ECG200                0.349   0.382   0.393   0.577   | MSSTrain              0.248   0.326   0.306   0.083
ECG5000               0.159   0.184   0.205   0.438   | MoteStrain            0.282   0.296   0.313   0.618
ECGFiveDays           0.162   0.170   0.173   0.480   | NIFlECGThorax1        0.290   0.334   0.343  -0.218
EOGHorizontalSignal   0.488   0.510   0.533   0.350   | NIFECGThorax2         0.271   0.323   0.339  -0.123
EOGVerticalSignal     0.517   0.514   0.518   0.628   | OSULeaf               0.435   0.539   0.538   0.483
Earthquakes           0.428   0.446   0.440   0.506   | OliveOil              0.535   0.556   0.560   0.371
ElectricDevices       0.267   0.288   0.334  -0.071   | PLAID                 0.433   0.444   0.445  -0.038
EthanolLevel          0.437   0.416   0.439   0.302   | POutlinesCorrect      0.881   0.881   0.881   0.881
FaceAll               0.414   0.492   0.501   0.120   | Phoneme               0.108   0.135   0.157   0.172
FaceFour              0.475   0.405   0.497   0.018   | PickupGWZ             0.381   0.390   0.446  -0.120
FacesUCR              0.269   0.430   0.448  -0.128   | PigAirwayPressure     0.250   0.228   0.241  -0.093
FiftyWords            0.656   0.802   0.792   0.624   | PigArtPressure        0.201   0.205   0.196  -0.160
Fish                  0.473   0.541   0.548   0.377   | PigCVP                0.166   0.142   0.153  -0.247
FordA                 0.144   0.166   0.17    0.448   | Plane                 0.438   0.376   0.438   0.196
FordB                 0.126   0.164   0.169   0.494   | PowerCons             0.106   0.126   0.124   0.345
FreezerRegularTrain   0.545   0.551   0.552   0.474   | PPOAgeGroup           0.545   0.540   0.555   0.617
FreezerSmallTrain     0.550   0.550   0.552   0.539   | PPOCorrect            0.911   0.911   0.911   0.911
Fungi                 0.664   0.771   0.738   0.429   | PPTW                  0.528   0.340   0.344   0.524
GestureMidAirD1       0.366   0.338   0.369   0.017   | RefDevices            0.420   0.466   0.463   0.242
GestureMidAirD2       0.344   0.355   0.354   0.170   | Rock                  0.307   0.264   0.336   0.141
GestureMidAirD3       0.313   0.307   0.316   0.129   | ScreenType            0.3926  0.392   0.389   0.288
SemgHandGenderCh2     0.126   0.152   0.166   0.443   | SemgHandMovementCh2   0.123   0.160   0.158   0.148
SHandSubjectCh2       0.114   0.157   0.168   0.152   | ShakeGWZ              0.469   0.513   0.515   0.330
ShapeletSim           0.176   0.170   0.180   0.496   | ShapesAll             0.232   0.240   0.254  -0.473
SKitchenAppliances    0.714   0.727   0.732   0.523   | SmoothSubspace        0.651   0.775   0.849   0.720
SonyAIBORS1           0.368   0.418   0.419   0.496   | SonyAIBORS2           0.351   0.468   0.462   0.357
StarLightCurves       0.388   0.408   0.408  -0.048   | Strawberry            0.523   0.524   0.524   0.480
SwedishLeaf           0.216   0.240   0.256  -0.262   | Symbols               0.471   0.480   0.481  -0.224
SyntheticControl      0.121   0.174   0.191   0.078   | ToeSegmentation1      0.609   0.545   0.528   0.620
ToeSegmentation2      0.692   0.664   0.661   0.630   | Trace                 0.493   0.525   0.546   0.360
TwoLeadECG            0.331   0.338   0.353   0.552   | TwoPatterns           0.288   0.355   0.355  -0.250
UMD                   0.338   0.340   0.330   0.432   | UWaveGLibraryAll      0.285   0.307   0.337   0.242
UWaveGLibraryX        0.256   0.304   0.336   0.318   | UWaveGLibraryY        0.274   0.329   0.327   0.295
UWaveGLibraryZ        0.267   0.311   0.326   0.275   | Wafer                 0.497   0.459   0.457   0.497
Wine                  0.373   0.373   0.389   0.349   | WordSynonyms          0.542   0.586   0.593   0.419
Worms                 0.116   0.147   0.167   0.789   | WormsTwoClass         0.783   0.783   0.783   0.802
Yoga                  0.530   0.527   0.530   0.560   | MVVDL                   ***     ***     ***     ***

5.2. Hierarchical cluster analysis based on distance threshold and silhouette score

Hierarchical clustering is widely used for studying the grouping characteristics of datasets. Section 4 describes various experiments on hierarchical clustering to identify clusters based on the trend of the time series sequences. In this section we analyze the UCR-2018 archive using agglomerative hierarchical clustering with ward linkage, where the closeness of data instances within a group is computed with the ward distance measure. Empirically, we chose the distance threshold range between 1 and 1000; the majority of the datasets form the predefined number of clusters within this range (a sketch of this threshold-range search is given below). Table 8 shows the range of the distance threshold (DT), the number of classes (C) and the respective silhouette score (Score) obtained for each dataset in the UCR-2018 archive. All the observed silhouette scores are positive, which signifies proper clustering. The variation of the distance threshold depends upon the number of instances (I) and groups (C) in a dataset (refer to Section 5.1). Even uniformly distributed data instances do not guarantee fast clustering in terms of trend: for the datasets Computers, LKAppliances, MSSTrain and MSRTrain, with few classes and an IR of 1.0 (from Table 6), the distance threshold must go beyond 1000 to reach the predetermined number of classes. The dataset StarLightCurves required a distance threshold of 6000 to converge into 3 classes; its IR and LEN values are high relative to its number of instances. In general, clustering based on trendlets reveals various noticeable patterns in the time series sequences that help in developing various learning models.
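A minimal sketch of the threshold-range search, under our own assumptions (a linear scan over integer thresholds with synthetic data; the actual step size and bounds used for Table 8 may differ):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def threshold_range(X, n_classes, dt_max=1000):
    """Scan integer distance thresholds and return the (low, high)
    range for which ward-linkage clustering yields exactly n_classes
    clusters, or None if the range lies beyond dt_max."""
    lo = hi = None
    for dt in range(1, dt_max + 1):
        labels = AgglomerativeClustering(
            n_clusters=None, linkage="ward",
            distance_threshold=float(dt)).fit_predict(X)
        k = labels.max() + 1
        if k == n_classes:
            lo = dt if lo is None else lo
            hi = dt
        elif k < n_classes:            # merged past the target; stop
            break
    return (lo, hi) if lo is not None else None

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 3)) for c in (0, 5, 10)])
print(threshold_range(X, n_classes=3))
```

Refitting for every threshold is wasteful; the merge distances from a single fitted dendrogram could be reused instead, but the linear scan mirrors the empirical procedure described above.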

Table 8: Distance threshold range of the hierarchical clustering algorithm based on ward distance over the predefined number of clusters

Dataset                 DT       #C  Score  | Dataset                 DT       #C  Score
ACSF1                299-329     10  0.693  | GPoint               105-168      2  0.502
Adiac                  8-9       37  0.484  | GPointAgeSpan        138-287      2  0.528
AGWX                 168-188     10  0.313  | GPointMVF            179-268      2  0.488
AGWY                 156-174     10  0.278  | GPointOVY            138-287      2  0.528
AGWZ                 167-216     10  0.314  | Ham                  107-237      2  0.450
ArrowHead             75-88       3  0.388  | HandOutlines         814-999      2  0.749
Beef                  24-24       5  0.596  | Haptics              270-385      5  0.195
BeetleFly             65-90       2  0.322  | Herring               83-147      2  0.291
BirdChicken          145-300      2  0.431  | HouseTwenty          600-900      2  0.553
CBF                   58-60       3  0.134  | InlineSkate          284-288      7  0.525
Car                   22-31       4  0.546  | InsectEPGRT          318-453      3  0.611
Chinatown             18-19       2  0.186  | InsectEPGST          234-436      3  0.622
ChlorineC            264-409      3  0.640  | InsectWBS            165-170     11  0.110
CinCECGTorso         789-999      4  0.621  | ItalyPD               40-59       2  0.256
Coffee                26-50       2  0.551  | LKAppliances        1200-1500     3  0.523
Computers           1450-2000     2  0.342  | Lightning2           119-181      2  0.466
CricketX              60-61      12  0.118  | Lightning7            31-32       7  0.315
CricketY              60-61      12  0.115  | Mallat               368-495      8  0.181
CricketZ              60-61      12  0.147  | Meat                  29-35       3  0.750
Crop                  37-39      24  0.757  | MedicalImages         26-31      10  0.552
DiatomSizeR           16-20       4  0.488  | MelbournePDT          29-29      10  0.199
DPOAgeGroup           26-28       3  0.899  | MPOAgeGroup           22-28       3  0.895
DPOCorrect            41-92       2  0.813  | MPOCorrect            45-78       2  0.455
DistalPhalanxTW       15-15       6  0.318  | MPTW                  18-19       6  0.267
DLoopDay              39-10       7  0.107  | MSRTrain             973-999      5  0.258
DLoopGame            161-163      2  0.813  | MSSTrain            1000-1300     5  0.259
DLoopWeekend         161-163      2  0.813  | MoteStrain           203-297      2  0.269
ECG200                39-79       2  0.335  | NIFlECGThorax1        62-64      42  0.293
ECG5000              159-176      5  0.239  | NIFECGThorax2         61-62      42  0.287
ECGFiveDays           95-102      2  0.112  | OSULeaf               38-46       6  0.495
EOGHorizontalSignal  473-563     12  0.475  | OliveOil              14-19       4  0.583
EOGVerticalSignal    469-499     12  0.483  | PLAID                414-453     11  0.408
Earthquakes          471-889      2  0.428  | POutlinesCorrect      69-952      2  0.882
ElectricDevices      366-441      7  0.274  | Phoneme               98-99      39  0.086
EthanolLevel         248-316      4  0.435  | PickupGWZ             27-36      10  0.406
FaceAll               27-29      14  0.454  | PigAirwayPressure     73-78      52  0.259
FaceFour              80-117      4  0.459  | PigArtPressure       58-2500     52  0.230
FacesUCR              34-35      14  0.417  | PigCVP                36-38      52  0.165
FiftyWords             6-8       50  0.684  | Plane                 19-21       7  0.439
Fish                  32-33       7  0.511  | PowerCons             75-86       2  0.103
FordA                408-430      2  0.115  | PPOAgeGroup           21-27       3  0.913
FordB                355-389      2  0.107  | PPOCorrect            29-554      2  0.911
FreezerRegularTrain  264-450      2  0.533  | PPTW                  14-16       6  0.315
FreezerSmallTrain    263-441      2  0.532  | RefDevices           629-999      3  0.461
Fungi                  3-5       18  0.732  | Rock                 552-641      4  0.284
GestureMidAirD1       85-86      26  0.347  | ScreenType          1200-1900     3  0.398
GestureMidAirD2       66-66      26  0.342  | SHandGenderCh2       220-240      2  0.116
GestureMidAirD3       82-83      26  0.316  | SemgHandMovementCh2  125-153      6  0.096
GesturePebbleZ1      288-293      6  0.476  | SemgHandSubjectCh2   142-179      5  0.104
GesturePebbleZ2      288-293      6  0.476  | ShakeGWZ              54-58      10  0.490
ShapeletSim           70-71       2  0.176  | ShapesAll             73-74      60  0.239
SKitchenAppliances   656-999      3  0.714  | SmoothSubspace        11-13       3  0.727
SonyAIBORS1            65         2  0.378  | SonyAIBORS2            90         2  0.459
StarLightCurves       6000        3  0.352  | Strawberry           200-230      2  0.476
SwedishLeaf           38-41      15  0.226  | Symbols              396-461      6  0.515
SyntheticControl      20-23       6  0.139  | ToeSegmentation1       185        2  0.475
ToeSegmentation2       246        2  0.632  | Trace                 32-46       4  0.518
TwoLeadECG             140        2  0.303  | TwoPatterns          580-730      4  0.302
UMD                   36-42       3  0.398  | UWaveGLibraryAll    2200-2300     8  0.349
UWaveGLibraryX       757-847      8  0.294  | UWaveGLibraryY       770-820      8  0.281
UWaveGLibraryZ       778-783      8  0.294  | Wafer               1400-300      2  0.349
Wine                  14-16       2  0.34   | WordSynonyms          15-17      25  0.579
Worms                283-335      5  0.795  | WormsTwoClass        640-660      2  0.791
Yoga                 369-380      2  0.533  | MVVDL                  ***      ***    ***

6. Conclusion and future work

Trendlet clustering is a novel method of grouping temporal data based on trend. We have performed an exploratory analysis of clustering time series data with trendlets, and the cluster results agree logically with the nature of the data in the various datasets. The encoding of a given time series sequence into a trendlet string represents the behavior of the time series values present in it. The number of data points is also reduced to approximately 3 · (n/m) values, a triplet for each of the n/m segments of length m; thus, the clustering algorithms show better performance in terms of execution time and cluster quality. The behavior of every segment of a time series sequence is expressed by a triplet containing the counts of uplets, downlets and equalets. The experiments show that the Mini-batch k-means and k-means algorithms result in better clustering than the others. The trendlet representation can also be viewed as a feature-based dimensionality reduction procedure applicable to time series data.

During our experiments with the trendlet representation, we obtained equalets, which indicate segments of identical consecutive values. We further verified the dominance of trendlets against the individual time series mean and the global dataset mean; compared to the other trendlet components, the number of equalets is smaller, except for the solar data. The silhouette score variation follows a consistent pattern in identifying the quality of the clusters in our experiments, and the influence of the different parts of a day on the time series clustering of generator data can be visualized from our analysis. Unsupervised clustering of time series data requires extensive exploratory analysis. The process of trendlet formation is controlled by the user-defined segmentation of the time series data, and determining an optimal segment length needs to be explored further in terms of the number of time series values. Several features, such as the mean, mode and distribution of the data values in a segment, can be considered to extract useful patterns, and more features need to be explored so that the contribution of each segment can be identified through the influence of trendlets. Generally, clustering of a normal time series is confined to all the values in the sequence; trendlet formation provides the possibility of clustering based on user-defined requirements. If the user is interested in the values of a particular time period, such segment-based clustering results in the grouping of generator nodes having similar behavior in that period. Trendlet-based clustering also represents a time series as a binary string; the possibility of utilizing this binary data in decision making is to be explored further in terms of power optimization, anomaly detection, generator placement, etc. The experiments on the UCR-2018 time series archive also reveal the importance of trendlets in determining predefined clusters and patterns of data values. Different types of trendlet features that can be generated from the UCR-2018 archive using fuzzy logic, linear programming, etc., are to be explored further.

Acknowledgements: This research received funding from the Netherlands Organization for Scientific Research (NWO) in the framework of the Indo-Dutch Science Industry Collaboration Program in relation to project NextGenSmartDC (629.002.102).


References

Ali, Abbas Raza, Bogdan Gabrys, and Marcin Budka (2018) "Cross-domain Meta-learning for Time-series Forecasting," Procedia Computer Science, Vol. 126, pp. 9 – 18.

Alsallakh, Bilal, Markus Bögl, Theresia Gschwandtner, Silvia Miksch, Bilal Esmael, Arghad Arnaout, Gerhard Thonhauser, and Philipp Zöllner (2014) "A Visual Analytics Approach to Segmenting and Labeling Multivariate Time Series Data," in EuroVis Workshop on Visual Analytics, EuroVA 2014, Swansea, UK.

Bhaduri, M. and J. Zhan (2018) "Using Empirical Recurrence Rates Ratio for Time Series Data Similarity," IEEE Access, Vol. 6, pp. 30855 – 30864.

Bode, Gerrit, Thomas Schreiber, Marc Baranski, and Dirk Müller (2019) "A time series clustering approach for Building Automation and Control Systems," Applied Energy, Vol. 238, pp. 1337 – 1345.

(2019) "Determining clusters," https://www.datanovia.com/en/lessons/, accessed on February 3, 2019.

Li, Daoyuan, Tegawendé F. Bissyandé, Jacques Klein, and Yves Le Traon (2016) "Time Series Classification with Discrete Wavelet Transformed Data: Insights from an Empirical Study," in The 28th International Conference on Software Engineering and Knowledge Engineering (SEKE 2016), pp. 01 – 06.

Dau, Hoang Anh, Anthony J. Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ratanamahatana, and Eamonn J. Keogh (2018) "The UCR Time Series Archive," arXiv, Vol. abs/1810.07758.

Ferreira, Leonardo N. and Liang Zhao (2016) "Time series clustering via community detection in networks," Information Sciences, Vol. 326, pp. 227 – 242.

Folgado, Duarte, Marília Barandas, Ricardo Matias, Rodrigo Martins, Miguel Carvalho, and Hugo Gamboa (2018) "Time Alignment Measurement for Time Series," Pattern Recognition, Vol. 81, pp. 268 – 279.


Garcke, Jochen, Rodrigo Iza-Teran, Marvin Marks, Mandar Pathare, Dirk Schollbach, and Martin Stettner (2017) "Dimensionality Reduction for the Analysis of Time Series Data from Wind Turbines," in Scientific Computing and Algorithms in Industrial Simulations: Projects and Products of Fraunhofer SCAI, pp. 317 – 339.

Guijo-Rubio, David, Antonio Manuel Durán-Rosal, Pedro Antonio Gutiérrez, Alicia Troncoso, and César Hervás-Martínez (2018) "Time series clustering based on the characterisation of segment typologies," Computing Research Repository, Vol. abs/1810.11624.

Ji, Cun, Chao Zhao, Shijun Liu, Chenglei Yang, Li Pan, Lei Wu, and Xiangxu Meng (2019) "A fast shapelet selection algorithm for time series classification," Computer Networks, Vol. 148, pp. 231 – 240.

Jiang, Gaoxia, Wenjian Wang, and Wenkai Zhang (2019) "A novel distance measure for time series: Maximum shifting correlation distance," Pattern Recognition Letters, Vol. 117, pp. 58 – 65.

Khosravi, A., L. Machado, and R.O. Nunes (2018) "Time-series prediction of wind speed using machine learning algorithms: A case study Osorio wind farm, Brazil," Applied Energy, Vol. 224, pp. 550 – 566.

Kuznetsov, Vitaly and Mehryar Mohri (2015) "Learning Theory and Algorithms for Forecasting Non-stationary Time Series," in Advances in Neural Information Processing Systems 28, pp. 541 – 549.

Längkvist, Martin, Lars Karlsson, and Amy Loutfi (2014) "A review of unsupervised feature learning and deep learning for time-series modeling," Pattern Recognition Letters, Vol. 42, pp. 11 – 24.

Ye, Lexiang and Eamonn Keogh (2009) "Time Series Shapelets: A New Primitive for Data Mining," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, Paris, pp. 947 – 956.

Lim, Bryan Y., Jiaguo (George) Wang, and Yaqiong Yao (2018) "Time-series momentum in nearly 100 years of stock returns," Journal of Banking & Finance, Vol. 97, pp. 283 – 296.


Ma, R. and R. Angryk (2017) "Distance and Density Clustering for Time Series Data," in 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 25 – 32, Nov.

Mori, U., A. Mendiburu, and J. A. Lozano (2016) "Similarity Measure Selection for Clustering Time Series Databases," IEEE Transactions on Knowledge and Data Engineering, Vol. 28, No. 1, pp. 181 – 195.

Motlagh, Omid, Adam Berry, and Lachlan O'Neil (2019) "Clustering of residential electricity customers using load time series," Applied Energy, Vol. 237, pp. 11 – 24.

Paparrizos, John and Luis Gravano (2017) "Fast and Accurate Time-Series Clustering," ACM Transactions on Database Systems, Vol. 42, No. 2, pp. 8:1 – 8:49.

Putri, Givanna H., Mark N. Read, Irena Koprinska, Deeksha Singh, Uwe Röhm, Thomas M. Ashhurst, and Nicholas J.C. King (2019) "ChronoClust: Density-based clustering and cluster tracking in high-dimensional time-series data," Knowledge-Based Systems, Vol. 174, pp. 9 – 26.

Roelofsen, Pjotr (2018) Time series clustering, Department of Mathematics, Vrije Universiteit Amsterdam.

Salgado, C. M., M. C. Ferreira, and S. M. Vieira (2017) "Mixed Fuzzy Clustering for Misaligned Time Series," IEEE Transactions on Fuzzy Systems, Vol. 25, No. 6, pp. 1777 – 1794.

Salles, Rebecca, Kele Belloze, Fabio Porto, Pedro H. Gonzalez, and Eduardo Ogasawara (2019) "Nonstationary time series transformation methods: An experimental review," Knowledge-Based Systems, Vol. 164, pp. 274 – 291.

Slavakis, K., S. Salsabilian, D. S. Wack, S. F. Muldoon, H. E. Baidoo-Williams, J. M. Vettel, M. Cieslak, and S. T. Grafton (2018) "Clustering Brain-Network Time Series by Riemannian Geometry," IEEE Transactions on Signal and Information Processing over Networks, Vol. 4, No. 3, pp. 519 – 533.

Subbalakshmi, Chatti, G. Rama Krishna, Krishna Mohan Rao, and P. Venketeswa Rao (2015) "A Method to Find Optimum Number of Clusters Based on Fuzzy Silhouette on Dynamic Data Set," Procedia Computer Science, Vol. 46.

Tibshirani, Robert, Guenther Walther, and Trevor Hastie (2001) "Estimating the Number of Clusters in a Data Set Via the Gap Statistic," Journal of the Royal Statistical Society Series B, Vol. 63, pp. 411 – 423.

Wang, Haishuai, Qin Zhang, Jia Wu, Shirui Pan, and Yixin Chen (2019) "Time series feature learning with labeled and unlabeled data," Pattern Recognition, Vol. 89, pp. 55 – 66.

Wang, Xiao, Fusheng Yu, Witold Pedrycz, and Lian Yu (2019) "Clustering of interval-valued time series of unequal length based on improved dynamic time warping," Expert Systems with Applications, Vol. 125, pp. 293 – 304.

Wang, Xiaozhe, Kate Smith, and Rob Hyndman (2006) "Characteristic-Based Clustering for Time Series Data," Data Mining and Knowledge Discovery, Vol. 13, pp. 335 – 364.

(2018) "Time Series Definitions," http://www.statsoft.com/textbook/timeseries-analysis, accessed December 2018.

Wen, Lulu, Kaile Zhou, and Shanlin Yang (2019) "A shape-based clustering method for pattern recognition of residential electricity consumption," Journal of Cleaner Production, Vol. 212, pp. 475 – 488.

Wang, X., G. Lerman, and K. Slavakis (2014) "Riemannian multi-manifold modeling," arXiv e-prints, arXiv:1410.0095v1.

Zhao, Yao, Lei Lin, Wei Lu, and Yu Meng (2016) "Landsat time series clustering under modified Dynamic Time Warping," in 2016 4th IEEE International Workshop on Earth Observation and Remote Sensing Applications (EORSA), pp. 62 – 66, July.

Zolhavarieh, Seyedjamal, Saeed Aghabozorgi, and Ying Wah Teh (2014) "A review of subsequence time series clustering," The Scientific World Journal, pp. 1 – 19.


Declaration of interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Credit author statement Johnpaul C I: Conceptualization, Methodology, Software, Writing - Original Draft Munaga V. N. K. Prasad: Validation, Investigation, Resources S. Nickolas: Visualization, Supervision, Investigation G. R. Gangadharan: Formal analysis, Project administration, Funding acquisition, Writing - Review & Editing