Engineering Applications of Artificial Intelligence 21 (2008) 277–300
www.elsevier.com/locate/engappai

Representing financial time series based on data point importance

Tak-chung Fu (a,b,*), Fu-lai Chung (a), Robert Luk (a), Chak-man Ng (b)

(a) Department of Computing, Hong Kong Polytechnic University, Hunghom, Kowloon, Hong Kong
(b) Department of Computing and Information Management, Hong Kong Institute of Vocational Education (Chai Wan), Chai Wan, Hong Kong

*Corresponding author. E-mail: [email protected] (T.-c. Fu).

Received 12 September 2006; received in revised form 3 March 2007; accepted 27 April 2007. Available online 29 June 2007.
Abstract

Recently, the increasing use of time series data has initiated various research and development attempts in the field of data and knowledge management. Time series data are characterized by large size, high dimensionality and continuous updating. Moreover, a time series is always considered as a whole rather than as individual numerical fields. A large portion of time series data comes from the stock market, and stock time series have their own characteristics compared with other time series. Furthermore, dimensionality reduction is an essential step before many time series analysis and mining tasks. For these reasons, research is prompted to augment existing technologies and to build new representations for managing financial time series data. In this paper, a financial time series is represented according to the importance of its data points. Based on the concept of data point importance, a tree data structure that supports incremental updating is proposed to represent the time series, and an access method for retrieving the time series data points from the tree in their order of importance is introduced. This technique can present the time series at different levels of detail and facilitates multi-resolution dimensionality reduction of the time series data. Different data point importance evaluation methods, a new updating method and two dimensionality reduction approaches are proposed and evaluated by a series of experiments. Finally, the application of the proposed representation in a mobile environment is demonstrated.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Financial time series representation; Multi-resolution visualization; Incremental updating; Dimensionality reduction; Tree data structure; Mobile application
1. Introduction

Recently, the increasing use of temporal data, in particular time series data, has initiated various research and development attempts in the field of data and knowledge management (Last et al., 2001). A time series is a collection of observations made chronologically. The nature of time series data includes: large data size, high dimensionality and continuous updating. Moreover, a time series is always considered as a whole rather than as individual numerical fields. There is a variety of time series research, for example, finding similar time series (Liao et al., 2004), querying time series databases (Rafiei and Mendelzon, 2000), segmentation (Wang and Willett, 2004; Feng et al., 2005), dimensionality reduction (Keogh et al., 2000; Keogh et al., 2001), clustering (Policker and Geva, 2000), classification (Wang and Willett, 2004) and forecasting (Pantazopoulos et al., 1998; Sfetsos and Siriopoulos, 2004, 2005). These problems have been studied in considerable detail by both the database and the pattern recognition communities for different domains of time series data (Keogh and Kasetty, 2002).

While most of the research community has concentrated on the above issues, the fundamental problem of how to represent a time series in multi-resolution, which can also be considered as information granulation (Bargiela and Pedrycz, 2003), has not yet been fully addressed. Representing a time series is essential because time series data are hard to manipulate in their original structure; defining a more effective and efficient time series representation scheme is therefore of fundamental importance. The time series data used in data and knowledge management are high dimensional, and before they can be processed and analyzed, this dimensionality must be
reduced, commonly using approaches that focus on lower bounding the Euclidean distance. These approaches, however, smooth out salient points of the original time series, which is counterproductive for financial time series data, as financial analysis often depends on the shape of the data and on salient data points to identify technical patterns. It is therefore important to reduce dimensionality while retaining the information associated with these salient points, which are the points most important to the shape of the time series. Previous approaches to reducing dimensionality while retaining point information include sampling, with a rate of m/x, where m is the length of the time series P and x is the dimension after dimensionality reduction; however, sampling distorts the shape of the compressed time series if the sampling rate is too low. As already noted, most other time series dimensionality reduction approaches, such as principal component analysis (PCA) (Fukunaga, 1990), singular value decomposition (SVD) (Korn et al., 1997), discrete Fourier transform (DFT) (Agrawal et al., 1993; Rafiei and Mendelzon, 2000; Chu and Wong, 1999), discrete wavelet transform (DWT) (Popivanov and Miller, 2002; Kahveci and Singh, 2001; Chan and Fu, 1999), piecewise aggregate approximation (PAA) (Keogh et al., 2000; Yi and Faloutsos, 2000) and adaptive piecewise constant approximation (APCA) (Keogh et al., 2001), focus on lower bounding the Euclidean distance. However, because such approaches often lose important data points, they may fail to retain the general shape of the time series after compression (Fig. 1).

A time series is constructed from a sequence of data points, and the amplitude of each data point has a different extent of influence on the shape of the time series. That is, each data point has its own importance to the time series. One data point may contribute to the overall shape of the time series while another may have little influence or may even be discarded. For example, frequently appearing technical time series patterns are typically characterized by a few salient points. The head and shoulders pattern, for instance, consists of a head point, two shoulder points and a pair of neck points. These points are perceptually important in the human visual identification process and are therefore more important than other data points in the time series. A data point whose importance is calculated in this way is named a perceptually important point (PIP). The identification of PIPs was first introduced by Chung et al. (2001) and used for matching technical (analysis) patterns in financial applications. The idea was later found to be similar to a technique proposed about 30 years earlier by Douglas and Peucker (1973) for reducing the number of points required to represent a line (see also Hershberger and Snoeyink, 1992). We also found independent work by Perng et al. (2000), Pratt and Fink (2002) and Fink and Pratt (2003) based on similar ideas. However, none of these techniques proposes a data structure to organize and store the identified salient points.

In this paper, we propose a time series representation framework based on the concept of data point importance. The challenges here include how to recognize the salient points and how to design a data structure to represent them that facilitates incremental updating, multi-resolution retrieval and dimensionality reduction. The proposed framework can reduce the time series dimension to different levels of detail based on the importance of the data points. At the same time, the original accuracy can be maintained and salient points are not distorted. A tree data structure that stores the data points of the time series is then proposed, together with efficient computation for accumulating new data points, incremental maintenance of the data structure to avoid expensive recomputation, and an access method for retrieving the time series data points from the tree according to their importance.
Fig. 1. Dimensionality reduction by (a) sampling and (b) PAA.
The remainder of this paper is organized as follows: Section 2 describes the concept of data point importance and three methods for evaluating it. Section 3 describes the proposed time series representation framework, the proposed specialized binary tree (SB-Tree), and how the SB-Tree is used to create, update, retrieve and reduce the dimension of a time series. In Section 4, we analyze the results of the experiments, and the mobile application of the proposed representation is demonstrated in Section 5. Section 6 offers our conclusion.

2. Defining and evaluating data point importance

In this section, we describe the concept of data point importance based on identifying the perceptually important points (PIPs). We then introduce three methods for evaluating the importance of the PIPs in a time series: Euclidean distance (PIP-ED), perpendicular distance (PIP-PD) and vertical distance (PIP-VD). A simple example is given at the end of this section to illustrate the PIP identification process using the different data point importance evaluation methods.

2.1. Identifying the perceptually important points

The concept of data point importance is defined by the influence of a data point on the shape of the time series. A data point that has a greater influence on the overall shape of the time series is considered more important. A data point whose importance is calculated in this way is named a perceptually important point (PIP), and the process to identify the PIPs is as follows: given the time series P, all the data points, p1, ..., pm, in P go through the PIP identification process according to the pseudo code described in Fig. 2. The first two PIPs found are the first and last points of P. The next PIP is the point in P with the greatest distance to the first two PIPs. The fourth PIP is then the point in P with the greatest distance to its two adjacent PIPs, located either between the first and second PIPs or between the second and last PIPs. The process of locating the PIPs continues until all the points in P are attached to a list. Three data point importance evaluation methods are proposed to measure the distance to the two adjacent PIPs.

Fig. 2. Pseudo code of the PIP identification process.
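Since the pseudo code of Fig. 2 is reproduced only as an image in the original, the following C sketch illustrates the process just described, using the vertical distance of Section 2.4 as the importance measure. The function names and the O(m^2) scan are our own illustrative choices, not necessarily the authors' implementation.

    #include <math.h>
    #include <stdlib.h>

    /* Vertical distance from point k to the line joining points i and j
       (see Section 2.4); x is taken to be the array index here. */
    static double vd(const double *y, int i, int j, int k)
    {
        double yc = y[i] + (y[j] - y[i]) * (double)(k - i) / (double)(j - i);
        return fabs(yc - y[k]);
    }

    /* Fills order[0..m-1] with indices of the points of P sorted by
       importance: order[0] and order[1] are the first and last points;
       each further entry is the unselected point with the greatest
       distance to its two adjacent, already selected PIPs. */
    void identify_pips(const double *y, int m, int *order)
    {
        char *used = calloc(m, sizeof(char));
        used[0] = used[m - 1] = 1;
        order[0] = 0;
        order[1] = m - 1;
        for (int n = 2; n < m; n++) {
            int best = -1;
            double best_d = -1.0;
            for (int k = 1; k < m - 1; k++) {
                if (used[k]) continue;
                int i = k, j = k;
                while (!used[i]) i--;   /* left adjacent PIP  */
                while (!used[j]) j++;   /* right adjacent PIP */
                double d = vd(y, i, j, k);
                if (d > best_d) { best_d = d; best = k; }
            }
            used[best] = 1;
            order[n] = best;
        }
        free(used);
    }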
2.2. Euclidean distance

The first data point importance evaluation method uses the Euclidean distance (ED). As illustrated in Fig. 3, this measurement calculates the sum of the EDs from the test point p3 = (x3, y3) to its adjacent PIPs p1 = (x1, y1) and p2 = (x2, y2), i.e.,

ED(p_3, p_1, p_2) = \sqrt{(x_2 - x_3)^2 + (y_2 - y_3)^2} + \sqrt{(x_1 - x_3)^2 + (y_1 - y_3)^2}.   (1)

This measurement is biased toward the middle of the region covered by p1 and p2.
Fig. 3. Euclidean distance-based data point importance evaluation method (PIP-ED).
2.3. Perpendicular distance

The second data point importance evaluation method uses the perpendicular distance (PD). This measurement calculates the PD between the test point p3 and the line connecting the two adjacent PIPs, as shown in Fig. 4, i.e.,

Slope(p_1, p_2) = s = (y_2 - y_1) / (x_2 - x_1),   (2)

x_c = (x_3 + s y_3 - s y_2 + s^2 x_2) / (1 + s^2),   (3)

y_c = s x_c - s x_2 + y_2,   (4)

PD(p_3, p_c) = \sqrt{(x_c - x_3)^2 + (y_c - y_3)^2}.   (5)
2.4. Vertical distance

The final data point importance evaluation method uses the vertical distance (VD). This measurement, depicted in Fig. 5, calculates the VD between the test point p3 and the line connecting the two adjacent PIPs, i.e.,

VD(p_3, p_c) = |y_c - y_3| = |y_1 + (y_2 - y_1) (x_c - x_1)/(x_2 - x_1) - y_3|,   (6)

where x_c = x_3. It is intended to capture the fluctuation of the sequence, so that highly fluctuating points are considered as PIPs.

Fig. 4. Perpendicular distance-based data point importance evaluation method (PIP-PD).
Fig. 5. Vertical distance-based data point importance evaluation method (PIP-VD).
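For concreteness, the three measures can be written directly from Eqs. (1)–(6) as the following C functions (a sketch; the function names are ours):

    #include <math.h>

    /* Eq. (1): sum of the Euclidean distances from the test point
       (x3,y3) to the two adjacent PIPs (x1,y1) and (x2,y2). */
    double pip_ed(double x1, double y1, double x2, double y2,
                  double x3, double y3)
    {
        return sqrt((x2 - x3) * (x2 - x3) + (y2 - y3) * (y2 - y3))
             + sqrt((x1 - x3) * (x1 - x3) + (y1 - y3) * (y1 - y3));
    }

    /* Eqs. (2)-(5): perpendicular distance from (x3,y3) to the line
       through (x1,y1) and (x2,y2). */
    double pip_pd(double x1, double y1, double x2, double y2,
                  double x3, double y3)
    {
        double s  = (y2 - y1) / (x2 - x1);                            /* Eq. (2) */
        double xc = (x3 + s * y3 - s * y2 + s * s * x2)
                  / (1.0 + s * s);                                    /* Eq. (3) */
        double yc = s * xc - s * x2 + y2;                             /* Eq. (4) */
        return sqrt((xc - x3) * (xc - x3) + (yc - y3) * (yc - y3));   /* Eq. (5) */
    }

    /* Eq. (6): vertical distance from (x3,y3) to the line through
       (x1,y1) and (x2,y2), evaluated at xc = x3. */
    double pip_vd(double x1, double y1, double x2, double y2,
                  double x3, double y3)
    {
        double yc = y1 + (y2 - y1) * (x3 - x1) / (x2 - x1);
        return fabs(yc - y3);
    }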
2.5. An illustrating example

An example of the PIP identification process using the different data point importance evaluation methods is given in this subsection. Fig. 6 shows the steps of the PIP identification process, in which the first five PIPs of the sample time series are identified using PIP-VD. The earlier a PIP is identified, the more important it is. Fig. 7 shows the final result of the PIP identification process; the points with smaller number labels are more important (e.g. PIP 3 is more important than PIP 4). Tables 1 and 2 show the data point importance lists built using the different data point importance evaluation methods for this example. Besides the importance of the PIPs, their corresponding amplitude y and the distances measured (PIP-ED, PIP-PD and PIP-VD) are also shown in the tables for reference. As can be seen, PIP-PD and PIP-VD obtain the same result in this case, while PIP-ED behaves differently.

3. Tree representation for dimensionality reduction

The management of financial time series data in multi-resolution requires the definition of a suitable time series representation data structure. In Section 3.1, we therefore describe a tree structure for representing financial time series that is based on determining the importance of the data points in the time series. Instead of storing the time series data according to time or transforming it into another domain (e.g. the frequency domain), the data points of a time series are stored according to their importance. In Section 3.2, a new point-by-point updating method is proposed. In Section 3.3, we provide a formal definition of the retrieval of a time series from the tree data structure, and in Section 3.4, we propose multi-resolution retrieval and dimensionality reduction approaches for this time series representation.

3.1. A tree data structure for representing a time series

The proposed time series representation structure is called the specialized binary (SB) tree and is developed from the binary tree structure. The SB-Tree supports fast lookup of a time series starting from the most important data point. Intuitively, an SB-Tree contains a hierarchy of the data points in the time series, and all nodes have the same size. A detailed description of the SB-Tree structure follows:
- A node contains the information related to the PIP identified in each iteration of the PIP identification process, namely the x- and y-coordinates of that PIP.
- The path between nodes, i.e. the pointer to a child node, stores the distance measured when that child node was identified.
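A minimal C rendering of this node layout might look as follows; the field names are illustrative, and the distance stored on the arc is modelled here as a field of the child node:

    /* One SB-Tree node per PIP. The dist field corresponds to the
       value stored on the path (arc) from the parent, i.e. the
       distance measured when this PIP was identified. */
    typedef struct sb_node {
        double x;               /* time position of the PIP            */
        double y;               /* amplitude of the PIP                */
        double dist;            /* importance (e.g. VD) when identified */
        struct sb_node *left;   /* PIPs with smaller x                 */
        struct sb_node *right;  /* PIPs with larger x                  */
    } sb_node;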
Fig. 6. Identification of the first 5 PIPs using PIP-VD.
Fig. 7. The importance of the PIPs after the identification process using PIP-VD.

Table 1
The data point importance list built using PIP-ED

Importance   x    y     Distance measure (ED)
1            10   0.5   n/a
2            1    0.1   n/a
3            5    0.9   9.1
4            9    0.3   5.06
5            3    0.3   4.10
6            6    0.6   4.06
7            7    0.7   3.04
8            4    0.7   2.10
9            8    0.4   2.05
10           2    0.4   2.05

Table 2
The data point importance list built using PIP-PD and PIP-VD

Importance   x    y     PD     VD
1            10   0.5   n/a    n/a
2            1    0.1   n/a    n/a
3            5    0.9   0.57   0.62
4            9    0.3   0.22   0.28
5            3    0.3   0.09   0.20
6            2    0.4   0.14   0.20
7            6    0.6   0.08   0.15
8            7    0.7   0.14   0.20
9            8    0.4   0.04   0.10
10           4    0.7   0.03   0.10

To create an SB-Tree, the overall PIP identification process is adopted. The first and last data points in the sequence are the first two nodes of the SB-Tree. The node representing the last data point of the time series becomes the root of the tree, and the node representing the first data point becomes the child on the left-hand side of the root. The third PIP identified becomes the child on the right-hand side of the node representing the first data point. The tree is then built recursively as follows, starting from the parent node pnode (the third PIP initially) and the current/child node cnode (the fourth PIP initially):

- If cnode.x < pnode.x, go to the left arc of pnode:
  - If pnode.left is empty, add cnode at this position.
  - Else set pnode = pnode.left and start the next iteration.
- Else go to the right arc of pnode:
  - If pnode.right is empty, add cnode at this position.
  - Else set pnode = pnode.right and start the next iteration.

Fig. 8 shows the detailed algorithm for creating an SB-Tree. As a simple example, Fig. 9 shows the steps of creating an SB-Tree based on the sample time series given in Fig. 6. In Fig. 9, the number on each node is the cnode.x value of the data point, while the number on the path before each node shows the distance measured, cnode.dist. A C sketch of this insertion rule is given below.
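Assuming the sb_node type sketched above, the insertion rule can be rendered as follows; nodes must be inserted in their order of importance, as produced by the PIP identification process (the first two PIPs carry no distance value, so 0 may be passed for them in this sketch):

    #include <stdlib.h>

    /* Insert a PIP (in importance order) into the SB-Tree rooted at
       *root, following the cnode.x / pnode.x comparison rule above. */
    void sb_insert(sb_node **root, double x, double y, double dist)
    {
        sb_node *cnode = malloc(sizeof(sb_node));
        cnode->x = x; cnode->y = y; cnode->dist = dist;
        cnode->left = cnode->right = NULL;

        if (*root == NULL) { *root = cnode; return; } /* last data point */
        sb_node *pnode = *root;
        for (;;) {
            if (cnode->x < pnode->x) {
                if (pnode->left == NULL) { pnode->left = cnode; return; }
                pnode = pnode->left;
            } else {
                if (pnode->right == NULL) { pnode->right = cnode; return; }
                pnode = pnode->right;
            }
        }
    }

Inserting the PIPs of Table 2 in importance order (x = 10, 1, 5, 9, 3, ...) reproduces the tree of Fig. 9: the last point becomes the root, the first point its left child, and the third PIP the right child of the first point.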
3.2. Incremental updating

New time series data arrive frequently and continuously, and building/rebuilding an SB-Tree for the newly arrived data points is very time consuming. Therefore, an efficient updating mechanism for the SB-Tree is necessary. In our previous work (Fu et al., 2005), three categories of updating methods, namely periodic rebuild, batch update and point-by-point update, were proposed. In this paper, a simplified point-by-point updating method is proposed to strike a balance, in terms of performance and accuracy, between the previously proposed guaranteed and approximate updating approaches.
Fig. 8. Pseudo code of creating the SB-Tree.
Fig. 9. SB-Tree building process.
A necessary step for the point-by-point updating method is to determine whether any of the identified PIPs change after a new data point is added. If the same PIPs are identified, there is no change in this iteration and the same check can be carried out along the right-hand side of the SB-Tree. Because the new data point is appended to the rightmost end of the time series, the potentially affected segment is located only on the right-hand side. In other words, if there is a change of PIP at the root node, the update amounts to creating a new SB-Tree (see Fig. 10a). On the other hand, if there is no change on the leaf nodes of the rightmost path, the new data point is simply added to the right-hand side of the rightmost leaf node (see Fig. 10b). With this behavior, the updating process can be formulated as follows (a sketch in C follows the list):
- Start from the second node of the SB-Tree, pnode = root.left.
- If the point with the maximum VD is the same as the child on the right-hand side of pnode, there is no change on pnode.right or on the left sub-tree; set pnode = pnode.right and repeat the check (for the right sub-tree).
- Else reconstruct the sub-tree on the right-hand side of this node (pnode.right) using the tree construction mechanism introduced in Section 3.1.
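A high-level C sketch of this walk is given below. The helpers max_vd_index() (the maximum-VD scan over a segment, described in the following paragraphs) and rebuild_subtree() (the tree construction of Section 3.1 applied to a segment of raw data points) are hypothetical names, and data points are addressed by 0-based array index.

    int  max_vd_index(const double *y, int lo, int hi);                 /* hypothetical */
    void rebuild_subtree(sb_node **t, const double *y, int lo, int hi); /* hypothetical */

    /* Called after a new data point y[m-1] has been appended.
       Walks the rightmost path of the SB-Tree and rebuilds the first
       sub-tree whose next PIP changes; root->left is the node of the
       first data point. */
    void sua_update(sb_node *root, const double *y, int m)
    {
        sb_node *pnode = root->left;      /* second node of the SB-Tree */
        while (pnode != NULL) {
            /* point with maximum VD between pnode and the new point */
            int k = max_vd_index(y, (int)pnode->x, m - 1);
            if (pnode->right != NULL && k == (int)pnode->right->x) {
                pnode = pnode->right;     /* no change; check next segment */
            } else {
                /* change of PIP (or end of the rightmost path): rebuild
                   the right sub-tree from the raw points in [pnode.x, m-1] */
                rebuild_subtree(&pnode->right, y, (int)pnode->x, m - 1);
                return;
            }
        }
    }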
Fig. 10. Different behaviors after adding a new data point.
Although the previously proposed approximate updating approach (Fu et al., 2005) provides fast updating, it cannot guarantee that the data points retrieved from the SB-Tree are in order of importance, while the guaranteed updating approach (Fu et al., 2005) produces a sequence identical to building the whole tree but offers little performance gain over rebuilding in terms of processing time. Therefore, an updating algorithm that preserves the structure of the SB-Tree with efficient processing time is proposed. The process of locating a change of PIP is very similar to the PIP identification process used when building an SB-Tree. The data point with maximum VD is determined by calculating the VD from every data point to the line joining the current PIP and the new data point; this data point should in fact be the next PIP for that segment. To decide whether there is a change of PIP, we only need to check whether the next PIP in the original SB-Tree is the same as the data point identified in this process. If they are not the same, there is a change of PIP and the sub-tree covering the data points from the current PIP to the new data point has to be rebuilt. Otherwise, there is no change of PIP in this iteration and the check continues for the data points from the right-hand side of the next PIP to the new data point. Since this identification of PIP changes is a simplification of the PIP identification process of building an SB-Tree, accessing only the nodes on the right-hand side, we name it the simplified updating approach. Fig. 11 illustrates the steps of identifying a change of PIP by this approach. The updating process starts at the second node of the SB-Tree (the child of the root, i.e. the first data point of the series) and finds the point with maximum VD between the first and last data points. If the point identified is the same as the corresponding PIP in the original SB-Tree, there is no change of PIP and the check continues at the child on the right-hand side of the node, finding the point with maximum VD between that child and the last data point. The process ends when all the nodes on the right-hand side have been traversed. If there is no change of PIP by the end of the process, the new data point is simply added as the rightmost leaf node and becomes the root of the SB-Tree. Otherwise, the sub-tree under the current node has to be rebuilt.
The tree built in Fig. 9 needs to be updated when new data points are appended to it (Fig. 12). Given data point 11, the shape of the overall time series is greatly changed: when evaluating the root node with the new entry, data point 9 (previously the point with maximum distance on the opposite side) replaces the original PIP and becomes the newly identified PIP. Therefore, the whole tree needs to be rebuilt (Fig. 13b). The same happens again when adding data point 12 (Fig. 13c). In this example, rebuilding the whole tree occurs frequently because of the short length of the time series and the large fluctuation of the data points. In most cases, rebuilding happens only in a small sub-tree, as when adding data points 13–15: in Fig. 13d the new data point only needs to be appended to the SB-Tree, and the sub-tree under data point 9 is rebuilt when data point 14 is added (Fig. 13e). An SB-Tree mostly grows biased towards its right-hand side until the tree is greatly unbalanced and the whole tree needs to be rebuilt (i.e. the shape of the time series is greatly influenced once a considerable number of data points has accumulated). The processing time for locating a change of PIP depends on the structure of the SB-Tree: the fewer the right-hand nodes, the shorter the time. In other words, an SB-Tree biased to the right takes more time to update than a tree biased to the left. In addition, the processing time for updating a new point is related to the location of the change of PIP: the higher in the tree the change is located, the more nodes have to be rebuilt and the longer the processing time.
3.3. Retrieving a time series from the SB-Tree

Suppose an SB-Tree has been built and the time series must be retrieved. Time series data points are retrieved according to their importance. The SB-Tree is accessed recursively, starting from the root:

- The root of the tree representation is the first PIP. Its child (the first point of the time series) is put in a heap (a sorted heap is preferred) for retrieval.
- The tree representation is accessed from the root, and each accessible node on each path of the tree is checked.
Fig. 11. Pseudo code of the Find_Rebuild_Position function for the simplified updating approach (SUA).
- An accessible node is defined as the first node on a path that has not yet been retrieved. All such nodes are put into the heap.
- By sorting the distances of all the accessible nodes in the heap, the one with the maximum distance (VD) is selected as the next PIP and removed from the heap. This process continues until all the nodes in the tree are processed and retrieved.

Fig. 14 shows the pseudo code for accessing an SB-Tree; a C sketch follows it. As an example, the sample time series is retrieved from the SB-Tree created in Section 3.1. Starting from the root of the SB-Tree in Fig. 15, the third PIP (the starting and ending points of the time series being the first two PIPs) is the third node (i.e. point 5, Fig. 15a). This node is marked as USED (shown in black in the diagram). To identify the next PIP, all the accessible nodes in the tree are first identified (bold circles). Then, the node with the greatest stored distance is identified as the next PIP, for example point 9 in Fig. 15b.

Fig. 12. Five data points are added to the sample time series.
Fig. 13. The SB-Tree updating process.
Fig. 14. Pseudo code of accessing the SB-Tree.
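The access order can be sketched in C as follows, assuming the sb_node type from Section 3.1; for brevity, a linear scan over a candidate array stands in for the heap of Fig. 14, and emit() is a caller-supplied output function (both simplifications are ours).

    /* Emit the PIPs in order of importance. */
    #define MAX_CAND 1024                 /* illustrative bound */

    void sb_retrieve(const sb_node *root, void (*emit)(const sb_node *))
    {
        const sb_node *cand[MAX_CAND];
        int n = 0;

        emit(root);                       /* last data point (first PIP)   */
        if (root->left != NULL) {
            emit(root->left);             /* first data point (second PIP) */
            if (root->left->right != NULL)
                cand[n++] = root->left->right;   /* the third PIP */
        }
        while (n > 0) {
            int best = 0;                 /* accessible node with max distance */
            for (int i = 1; i < n; i++)
                if (cand[i]->dist > cand[best]->dist) best = i;
            const sb_node *p = cand[best];
            cand[best] = cand[--n];       /* remove from the candidate set */
            emit(p);
            /* children of a retrieved node become accessible */
            if (p->left  != NULL && n < MAX_CAND) cand[n++] = p->left;
            if (p->right != NULL && n < MAX_CAND) cand[n++] = p->right;
        }
    }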
This process continues until all the nodes in the tree are marked and the whole time series has been retrieved according to the order of data point importance in Table 2. The benefit of the suggested approach over the traditional approach, which retrieves the time series sequentially from start to end, is that the overall shape of the time series can be captured even if only a few PIPs are obtained. That is, the suggested approach provides a mechanism to retrieve the time series from a low resolution to a high one.

3.4. Dimensionality reduction

After transforming the time series data into an SB-Tree, it is possible to further reduce the size of the tree to minimize its space consumption. This can be done by determining the minimum number of PIPs necessary to represent the time series while retaining its shape. If only a few of the most important PIPs are used to represent the whole time series, the error will be very large and the overall shape may be deformed, as shown in Fig. 16a. Conversely, if all the PIPs are manipulated, system performance will be very low. Fig. 16b shows a suitable number of PIPs for representing the time series.

The simplest way to reduce the size of an SB-Tree is a lossless pruning approach that prunes only those nodes whose stored distance equals 0 (Fig. 17). Such nodes have no effect on the shape of the time series because they lie on a straight line formed by other PIPs. Alternatively, an acceptable level of error can be specified for filtering a large number of PIPs; thus, a lossy approach is preferred to reduce the size of an SB-Tree by pruning its "unnecessary" nodes. In this case, the compression ratio is defined as

CR = (number of data points of the original time series) / (number of PIPs used to represent the time series).   (7)
For example, if the compression ratio is set to 2, the number of PIPs filtered from the sample time series is 5. The corresponding pruned tree is shown in Fig. 18a, and Fig. 18b shows the result after dimensionality reduction. To control the compression of the SB-Tree to a suitable level, two dimensionality reduction approaches are proposed below.

Tree pruning approach: Less significant data points of a time series can be filtered according to a threshold λ as the tree is accessed from the top and the distance measurement of each node is considered. When the distance measurement of a node is smaller than λ, the fluctuation is not wide and the node's descendants are considered less important to the users; thus, this node and all its descendants are filtered. The corresponding pruned tree using a threshold of 0.2 (i.e. λ = 0.2) is shown in Fig. 19a, and Fig. 19b shows the result after dimensionality reduction. However, it is difficult for users to define the threshold λ, and different time series may need different values of λ to preserve the general shape of the series. Therefore, an automatic approach for determining this threshold is necessary. It can be achieved by finding the natural gap in the change of the distance measurement as PIPs are retrieved from the SB-Tree: the distance measurement showing a significant decrease is taken as the suitable λ for a particular time series. An example is shown in Section 4.3. A C sketch of the pruning step follows.
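A C sketch of the pruning step, again assuming the sb_node type from Section 3.1. The first and last data points (the root and its left child) carry no stored distance and are always kept, so the pruning would be applied below them; the lossless variant would instead remove only nodes whose stored distance equals 0.

    #include <stdlib.h>

    static void sb_free(sb_node *node)
    {
        if (node == NULL) return;
        sb_free(node->left);
        sb_free(node->right);
        free(node);
    }

    /* Remove every node (and all its descendants) whose stored
       distance is smaller than the threshold lambda. */
    void sb_prune(sb_node **node, double lambda)
    {
        if (*node == NULL) return;
        if ((*node)->dist < lambda) {   /* node and descendants filtered */
            sb_free(*node);
            *node = NULL;
            return;
        }
        sb_prune(&(*node)->left, lambda);
        sb_prune(&(*node)->right, lambda);
    }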
Fig. 15. The SB-Tree accessing process.
Error threshold approach: The second dimensionality reduction approach is based on determining the error of the representation compared with the original time series. PIPs are retrieved from the SB-Tree to represent the time series until the error falls below a given threshold, α. Error is defined as the mean squared distance between the original time series and the series formed by the n PIPs used to represent it. Fig. 20 shows the error compared with the original time series when only 3 PIPs are used. Again, it is necessary to determine the value of α. A reasonable α value is the point beyond which there is no significant decrease in the error. It can be determined by finding the largest decrease of error obtained by adding one more PIP to represent the time series. Including this PIP, a suitable number of PIPs for representing the time series is obtained, as the decrease of error will be at a much lower level than for the first few PIPs. Again, an example is given in Section 4.3. A C sketch of this error computation is given below.
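A sketch of this error measure in C: the series is reconstructed from the retained PIPs by linear interpolation and the mean squared difference is taken. The function name and the index-based representation of PIP positions are our own choices.

    /* Mean squared distance between the original series y[0..m-1] and
       the series formed by the n retained PIPs. pip_x[] holds the PIP
       positions (array indices) in ascending time order and must
       include 0 and m-1. */
    double representation_error(const double *y, int m,
                                const int *pip_x, int n)
    {
        double sum = 0.0;
        int seg = 0;   /* current PIP segment [pip_x[seg], pip_x[seg+1]] */
        for (int t = 0; t < m; t++) {
            while (seg < n - 2 && t > pip_x[seg + 1]) seg++;
            int i = pip_x[seg], j = pip_x[seg + 1];
            double approx = y[i] + (y[j] - y[i])
                          * (double)(t - i) / (double)(j - i);
            double d = y[t] - approx;
            sum += d * d;
        }
        return sum / (double)m;
    }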
Fig. 16. Number of PIPs used to represent a time series: (a) too few PIPs used and (b) a suitable number of PIPs used.
Fig. 17. Lossless pruning of the SB-Tree.
4. Experimental results

In this section, we evaluate the performance of the data point importance evaluation methods (PIP-ED, PIP-PD and PIP-VD), the proposed point-by-point updating method for the SB-Tree, and the dimensionality reduction methods (the tree pruning method and the error threshold method). The experiments were implemented in the C programming language and performed on a Sun computer (Sun Solaris Ultra 5 with two 200 MHz UltraSPARC CPUs and 256 MB memory).

4.1. Evaluation of the proposed data point importance evaluation methods

To evaluate the three data point importance evaluation methods, PIP-ED, PIP-PD and PIP-VD, we conducted simulation tests using time series data with 2500 data points captured from the past ten years of the Hong Kong Hang Seng Index (HSI).
In the simulation, we note that PIP-ED, PIP-PD and PIP-VD take similar time to build the SB-Tree. We therefore focus on the objective performance, the error, of the three methods for measuring data point importance. Fig. 21 shows the error in representing the time series using different numbers of PIPs determined by the three evaluation methods. All three methods give a similar error-decreasing curve, and the error drops steeply over the first few PIPs identified by any of the three methods. Zooming in on the tail of the curves, Fig. 22 shows that PIP-VD obtains the least error among the three methods. One point worth mentioning is that the error here is not a monotonically decreasing function of the number of PIPs. This is because the measurement of data point importance depends on the shape captured in the previous PIP identification iteration rather than on the overall shape, so a change of this shape affects the error of the original time series relative to the shape formed by the identified PIPs. This scenario is most obvious for the PIP-PD method. Fig. 23 shows the error in representing the time series at different compression ratios: PIP-VD again obtains the least error among the three methods at the same compression ratio. We can therefore conclude that PIP-VD outperforms the other two methods in the objective evaluation.

We then evaluate the subjective performance, the visualization effect, of the three methods. When visualizing the time series using different numbers of PIPs determined by the three evaluation methods, the error of the time series can be kept at a relatively low level even when only one twenty-fifth of the data points are used. Fig. 24 shows the sample visualization effect when the time series is displayed after dimensionality reduction. The time series is very similar to the original one even when only 100 PIPs out of 2500 data points are used, i.e. a compression ratio of 25, as shown in Fig. 24b. When the compression ratio is further increased to 250, the overall shape of the stock time series is still preserved by the proposed methods, and the important points (the salient points) are preserved as well (Fig. 24a).
Fig. 18. Dimensionality reduction by compression ratio (CR = 2): (a) the accessing result of the pruned SB-Tree and (b) the result after dimensionality reduction.
Fig. 19. Dimensionality reduction by the tree pruning approach (λ = 0.2): (a) the accessing result of the pruned SB-Tree and (b) the result after dimensionality reduction.
Fig. 20. Error in representing a time series when only 3 PIPs are used, compared with the original time series.
Fig. 21. Error in representing the time series with different numbers of PIPs.
Fig. 22. Tail of the error curve in Fig. 21.
Fig. 23. Error vs. compression ratio.
Among the three methods, PIP-VD has the highest ability to capture the fluctuation of the time series, as shown in the ending period of the sample time series. Combining the results of this section, PIP-VD is the preferable method for measuring data point importance in most cases.

4.2. Updating the tree representation

In this subsection, the performance of the proposed simplified updating approach (SUA) is compared with the approaches previously proposed in Fu et al. (2005): the periodic rebuild approach (PRA), the sub-tree updating approach (STU) and its variant with a rebuild mechanism (STUR), the guaranteed updating approach (GUA) and the approximate updating approach (AUA).

First, the precision of each approach was evaluated. The SB-Tree updated to hold m data points was compared with a newly created SB-Tree for the same m data points. Precision was evaluated by calculating the percentage of correct PIPs retrieved. Starting from the first 10 data points of the time series, the comparison was carried out after adding each point until the remaining 2490 data points had been added. Retrieving 40% of the PIPs from the updated SB-Tree and from the newly created SB-Tree according to the retrieval method of Section 3.3, the percentage of correct PIPs retrieved was calculated by counting how many PIPs retrieved from the newly created SB-Tree could also be found among the PIPs retrieved from the updated SB-Tree. To simplify the calculation, if, for example, the third PIP retrieved from the updated SB-Tree is the same as the third PIP retrieved from the newly created SB-Tree, it is considered correct. A sketch of this precision measure follows.
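A minimal C sketch of this precision measure, assuming ref[] and upd[] hold the first k PIP positions retrieved from the newly created and the updated tree, respectively (k being 40% of the series length):

    /* Percentage of the first k PIPs retrieved from the newly created
       SB-Tree (ref[]) that also appear among the first k PIPs
       retrieved from the updated SB-Tree (upd[]). */
    double pip_precision(const int *ref, const int *upd, int k)
    {
        int hits = 0;
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++)
                if (ref[i] == upd[j]) { hits++; break; }
        return 100.0 * (double)hits / (double)k;
    }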
Fig. 24. Visualization of the time series with different numbers of PIPs using the different data point importance evaluation methods (PIP-ED, PIP-PD and PIP-VD): (a) 10 PIPs and (b) 100 PIPs.
Fig. 25. Percentage of correct PIPs retrieved and PIPs in correct sequence for different updating approaches.
Fig. 25 shows the average percentage of PIPs correctly retrieved, using the different updating methods, for time series updated from 10 data points to 2500 data points. Like the GUA, the SUA guarantees that the structure of the updated SB-Tree is the same as that of a newly built SB-Tree; therefore, both the percentage of correct points retrieved and the percentage of points in the correct sequence are 100% throughout the updating process. The errors of the other approaches (i.e. PRA, STU (STUR) and AUA) arise because the VD of some nodes in the SB-Tree is not updated: after adding a new data point, the VDs of some nodes change, but nodes not evaluated during the updating process are not updated in these approaches. Moreover, PRA, STU (STUR) and AUA do not guarantee that the structure of the updated SB-Tree is the same as rebuilding the whole tree. PRA simply adds the last data point to the right-hand side of the rightmost node; STU only attaches the sub-tree to the original tree; and AUA may fail to detect a change of PIP because the identified PIPs are not limited to the PIP candidates.

The second part of the updating experiment compared the processing times of the different updating methods. The total time for updating the time series from 10 to 2500 data points required nearly 30 min if a new SB-Tree is built for every new data point (Fig. 26). For the simplified updating approach, the total updating time was about 72 s; on average, about 0.03 s was required to add a new data point, and the average number of time series data points that had to be rebuilt was about 91. Fig. 27 shows that the processing time increases with the number of points that need to be rebuilt, which implies that most of the processing time for updating a new data point is spent on rebuilding the SB-Tree.

Fig. 28 shows the average processing time for updating a data point during the process of updating the time series from 10 to 2500 data points using the different updating methods. STU updates a point within the shortest time since it can add a number of points at a time. However, the advantage of STU is not preserved when the rebuild mechanism is introduced (STUR), because a large amount of time is consumed rebuilding the whole tree. For the point-by-point updating methods, the performance of AUA and SUA is satisfactory, requiring only 0.016 s and 0.029 s, respectively, to update a point.

In summary, the percentage of correct points retrieved and of points in the correct sequence is plotted against the processing time for the different updating approaches in Fig. 29. According to the experimental results, like GUA, SUA is the preferred solution when the data point retrieval order is critical. Both GUA and SUA guarantee the SB-Tree structure, so that data points can be retrieved according to their importance; they are therefore attractive for multi-resolution data visualization applications, e.g. multi-resolution visualization of stock data on mobile devices (see Section 5). However, GUA requires a large amount of time for each update, while SUA is a sped-up version of GUA. Finally, Fig. 30 shows that the processing time of the updating process depends on the number of points that have to be rebuilt: the more data points have to be rebuilt, the more time is required.
Fig. 26. Processing time to build the SB-Tree from 10 points to 2500 points.
Fig. 27. Processing time of updating the SB-Tree using the simplified updating approach.

4.3. Evaluation of the proposed dimensionality reduction methods
In this subsection, we evaluate the two proposed dimensionality reduction methods, the tree pruning method and the error threshold method, adopting PIP-VD as the data point importance evaluation method. We test the ability of the two approaches to reduce the number of data points of a time series to a number of PIPs suitable for its representation, using an objective measurement of how accurately the proposed approaches reduce the dimension to a suitable level. The data set used was a synthetic data set consisting of 110 time series of three different lengths (25, 43 and 61). Each series belongs to one of five technical patterns (with the corresponding number of PIPs in brackets): H&S (7), double tops (5), triple tops (7), rounded top (4 or 5) and spike top (5) (see Fig. 31). Each technical pattern was used to generate 22 variants by applying different levels of scaling, time warping and noise. The pseudo code of the variant generation process is shown in Fig. 32. First, the patterns are uniformly time-scaled from 7 data points to 25, 43 and 61 data points. Then, each salient point of a pattern may be warped to a position between its previous and next salient points.
Fig. 28. Processing time of updating the SB-Tree for different updating approaches.
Fig. 29. Comparison of precision and processing time of different updating approaches.
Fig. 30. Comparison of processing time and number of points rebuilt for different updating approaches.
Fig. 31. The five technical patterns: head and shoulders (H&S) (7 PIPs), double tops (5 PIPs), triple tops (7 PIPs), rounded top (4 or 5 PIPs) and spike top (5 PIPs).
Finally, noise is added to the set of patterns. The addition of noise is controlled by two parameters: the probability α of adding noise to each data point and the level β of noise added to a point.

The process of the proposed dimensionality reduction methods is first described by an example. Taking an H&S pattern as an example, Fig. 33a shows the distance (VD values) for different numbers of PIPs. Using the tree pruning approach, Fig. 33b indicates that the suitable number of PIPs can be obtained by locating the PIP number with the peak change of VD. The dotted lines in the figures show the size of the SB-Tree (i.e. the number of PIPs) after dimensionality reduction.
Fig. 34 shows the result using the error threshold approach. Both approaches reduce the dimension of the time series to the correct number of PIPs, i.e. 7 PIPs. Table 3 compares the accuracy and processing time of the two dimensionality reduction approaches, where accuracy means the number of patterns for which the dimensionality reduction process identifies the correct number of PIPs. According to Table 3, the error threshold approach outperforms the tree pruning approach in accuracy; however, the time consumed by the error threshold approach is about one-third higher than that of the tree pruning approach. This is because, from 3 PIPs until the correct number of PIPs is obtained, the error between the original time series and the time series constructed by the selected PIPs has to be calculated at each step.
Fig. 32. Pseudo code of generating the time series pattern variants.
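Since Fig. 32 is reproduced only as an image, the following C sketch illustrates the three generation steps described in the text: uniform time scaling, warping of the interior salient points, and noise addition. The rand01() helper and the interpretation of the noise level as a uniform perturbation in ±β are our assumptions.

    #include <stdlib.h>

    static double rand01(void) { return rand() / (RAND_MAX + 1.0); }

    /* Generate one variant of a template pattern given by its salient
       points (sx[i], sy[i]), i = 0..ns-1, scaled to length len.
       alpha: probability of adding noise to a data point;
       beta:  maximum level of noise added to a point. */
    void make_variant(const double *sx, const double *sy, int ns,
                      int len, double alpha, double beta, double *out)
    {
        /* 1. uniform time scaling: map salient positions onto 0..len-1 */
        int pos[16];   /* assumes ns <= 16 for this sketch */
        for (int i = 0; i < ns; i++)
            pos[i] = (int)(sx[i] * (len - 1) / sx[ns - 1] + 0.5);

        /* 2. time warping: shift each interior salient point to a
           random position strictly between its neighbours */
        for (int i = 1; i < ns - 1; i++)
            pos[i] = pos[i - 1] + 1
                   + (int)(rand01() * (pos[i + 1] - pos[i - 1] - 1));

        /* linear interpolation between consecutive salient points */
        for (int i = 0; i + 1 < ns; i++)
            for (int t = pos[i]; t <= pos[i + 1]; t++)
                out[t] = sy[i] + (sy[i + 1] - sy[i])
                       * (double)(t - pos[i]) / (double)(pos[i + 1] - pos[i]);

        /* 3. noise: with probability alpha, perturb a point by up to
           +/- beta */
        for (int t = 0; t < len; t++)
            if (rand01() < alpha)
                out[t] += (2.0 * rand01() - 1.0) * beta;
    }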
Fig. 33. (a) The VD curve with different numbers of PIPs retrieved by applying the tree pruning approach. (b) The change of VD with different numbers of PIPs. The peak value corresponds to the suitable number of PIPs.
On the other hand, the low accuracy of the tree pruning approach is due to over-pruning of the SB-Tree when determining the largest gap among the VDs. Furthermore, the accuracy of the proposed dimensionality reduction approaches is independent of the length of the time series.
Fig. 34. (a) The error curve with different numbers of PIPs retrieved by applying the error threshold approach. (b) The change of error with different numbers of PIPs. The peak value corresponds to the suitable number of PIPs.
Table 3
Comparison of the two dimensionality reduction approaches

                               Tree pruning   Error threshold
Length of time series = 25
  Accuracy                     72.22%         97.30%
  Processing time              0.09 s         0.12 s
Length of time series = 43
  Accuracy                     62.63%         94.74%
  Processing time              0.65 s         0.49 s
Length of time series = 61
  Accuracy                     72.86%         100.00%
  Processing time              0.65 s         1.19 s
Overall
  Accuracy                     69.09%         97.27%
  Processing time              1.02 s         1.80 s
Fig. 35 shows three sample time series with an incorrect size after the dimensionality reduction process.

5. Mobile application

Unlike systems running on a fixed network (Saha et al., 2001), mobile devices operating in a wireless environment suffer from limited resources (Pham et al., 2001). Mobile devices have limited display screen sizes, which makes it a challenging task to illustrate a complete time series chart clearly. The network bandwidth of a mobile device is also limited, and sometimes expensive when cellular technology is used. The storage and computation capacities of mobile devices are also much inferior to their fixed-network counterparts. Moreover, mobile devices often experience constant changes in the availability of resources, which significantly affects system performance. On the other hand, financial time series data are often massive. The current approach for fixed networks is to transform the time series into a bitmap that suits the dimensions of the designated device. However, this approach lacks efficiency and flexibility: it requires the pre-generation of bitmap images at different fidelities, scales, output dimensions and time ranges of the same time series, and the same user requesting different viewpoints of the same time series requires retransmission of the bitmaps belonging to that series. Moreover, compared with the data size of the raw time series, the bitmaps themselves are too large for wireless transmission. Hence, it is natural to send the raw time series data to the client for chart plotting. The remaining problem is how to represent and disseminate the time series data points in response to the different requirements of mobile devices and end users.

In view of the characteristics of the mobile environment and of financial time series data mentioned previously, a new representation for financial time series visualization on mobile devices is needed.
Fig. 35. Incorrect sample time series after dimensionality reduction: (a) result from both approaches, (b) result from the error threshold approach and (c) result from the tree pruning approach.
It should be able to:

- Capture the shape of the time series, especially its fluctuations, and important signals such as the salient points (e.g. local highest and lowest prices). These are important to m-commerce applications and mobile investors. In this regard, transform-based representations (e.g. transforming the time series to its frequency coefficients) are less preferred because they tend to smooth out the salient points.
- Facilitate multi-resolution visualization. Before making investment decisions, mobile users would like to analyze the time series data with respect to different time spans, start and/or end dates, levels of detail, etc. Multi-resolution visualization becomes particularly important in such mobile applications.
- Reduce the dimensionality of the time series data in order to save bandwidth and the cost of data exchange in the mobile environment.
In this section, the application of the proposed representation to develop a multi-resolution time series visualization method for mobile financial data is described. The SB-Tree representation is capable of presenting the time series at different levels of detail and facilitates streaming of time series data based on the importance of the data points. Its application to stock price time series visualization on mobile devices is demonstrated. By accessing the tree, progressive time series data retrieval can be achieved. Unlike the traditional database approach, which stores and retrieves the time series from the start time to the end time, the proposed method is able to pick a few (more important) PIPs to approximate the original time series. Furthermore, a coarse-to-fine retrieval of the time series data can be obtained; thus, multi-resolution time series visualization can be achieved.

5.1. Adaptation of the proposed representation in the mobile environment

Instead of accessing the whole time series, different criteria can be set to control the number of PIPs to be retrieved (e.g. CR) and transferred to the mobile devices, for the purposes of dimensionality reduction and multi-resolution visualization. Another advantage of this method is that only just-enough time series data, rather than the whole time series, needs to be sent to the users. Given the width of the device's display, the corresponding number of time points can be sent to the mobile device based on the tree. The communicated data can then be minimized while the received time series still captures the main trends and fluctuations of the original. Moreover, this strategy reduces the amount of memory needed to store the corresponding time series.

Besides sending the whole time series when a user first makes a request, users may like to focus on a particular segment, especially when the display size of the mobile device is limited. Therefore, multi-resolution support, mostly from low resolution to high resolution, is necessary, and it is an easy task to introduce multi-resolution display ability into the proposed visualization framework. While the lower-resolution time series is being displayed, users can select the segment they are interested in at any time by sending the starting and ending points of the requested segment to the server, and the data points of that segment are then sent to the user. The strategy for sending the corresponding data points is simply based on their order in accessing the tree (i.e. the same as the proposed mechanism): the starting and ending points of the segment are sent to the user first, and the other data points within the segment are sent based on their order in accessing the tree from its root. Therefore, the progressive visualization effect still applies when displaying subsequences. This function is especially useful for the proposed mobile applications because, during progressive visualization, once users find that they are interested in a particular segment, they can select it for further investigation immediately; they do not need to wait until the whole time series has been completely transferred at a lower resolution. Once the user makes a new request for displaying a segment of the original time series, the previous progressive visualization process stops, and the server starts to transfer the data points within the newly requested segment based on their order in accessing the tree.

5.2. Performance analysis

In our mobile application experiments, the proposed methods for mobile time series charting (the mobile client) were implemented on the Java 2 platform, micro edition (J2ME), connected limited device configuration (CLDC), which defines a standard Java platform for small, resource-constrained, connected devices and enables the dynamic delivery of Java applications and content to those devices.
5.2. Performance analysis

In our mobile application experiments, the proposed methods for mobile time series charting (the mobile client) were implemented on the Java 2 Platform, Micro Edition (J2ME), under the Connected Limited Device Configuration (CLDC), which defines a standard Java platform for small, resource-constrained, connected devices and enables the dynamic delivery of Java applications and content to them. The profile currently developed for the CLDC configuration is the Mobile Information Device Profile (MIDP). The screen size of the mobile device used is 96 × 90 pixels. To measure the ability to minimize the size of the time series data sent to the mobile device, different approaches were compared, including a pre-generated graph, sampling and PAA; since extensive computation is needed to reconstruct a time series from transformation-based techniques (like wavelets), transformation-based methods are not suitable for the mobile client. Furthermore, as there is no floating-point data type in J2ME, the time series data points were converted to approximate integer values before being sent to the mobile device, at 2 bytes per integer.
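Under this constraint, the wire format can be sketched as follows: each point costs 4 bytes, two for the time index and two for the amplitude scaled to an integer. The scale factor and names are illustrative assumptions, not the original client's code.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Sketch of the 2-bytes-per-value encoding used for transmission. */
public class PipEncoder {
    /** Encode PIPs as pairs of 16-bit integers: time index, scaled amplitude. */
    static byte[] encode(int[] times, double[] values, double scale)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int i = 0; i < times.length; i++) {
            out.writeShort(times[i]);                            // x-coordinate
            out.writeShort((int) Math.round(values[i] * scale)); // y, as integer
        }
        return buf.toByteArray();
    }
}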
Fig. 36. Data size needed with different lengths of time series (series compared: original, pre-generated graph, sampling, PAA and PIP; x-axis: number of time series data points; y-axis: data size in bytes).
Fig. 37. Data size needed with different display widths in mobile devices (series compared: original, pre-generated graph, sampling, PAA and PIP; x-axis: display width in data points; y-axis: data size in bytes).
Fig. 36 shows the amount of data that must be transferred to, or stored in, the mobile device under the different approaches. The pre-generated graph approach requires the largest data size for sending the time series to the client, and this size grows steadily with the length of the series. For sampling, PAA and the proposed PIP-based approach, the same number of data points is sent for a fixed screen size, so the data size depends on the screen size rather than on the length of the time series and remains constant. The proposed approach has to send more data because, besides the amplitude of each data point (the y-coordinate), the time position (the x-coordinate) must also be sent to the client, as the data points are not transmitted in sequential time order. Fig. 37 shows the effect of increasing the screen size (x-axis) on the data size for a fixed-length time series. Again, sampling, PAA and the proposed method require much smaller data sizes than sending the whole time series or a pre-generated graph.
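These relationships admit a back-of-envelope estimate, sketched below under the stated 2-bytes-per-integer assumption; the series length and display width are illustrative and do not reproduce the exact values of the experiments.

/** Rough data-size estimates in bytes, assuming 2 bytes per integer value. */
public class SizeEstimate {
    public static void main(String[] args) {
        int seriesLength = 2000;   // points in the original time series
        int displayWidth = 150;    // points the handset can display

        int original = seriesLength * 2;      // every amplitude value
        int paaOrSampling = displayWidth * 2; // one amplitude per displayed point
        int pip = displayWidth * 2 * 2;       // PIPs carry both x and y

        System.out.println("original:     " + original);      // 4000
        System.out.println("PAA/sampling: " + paaOrSampling); // 300
        System.out.println("PIP:          " + pip);           // 600
    }
}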
Fig. 38 shows the visualization results for sampling, PAA and the proposed method. Important signals (fluctuations) are clearly lost under the sampling approach because of its sampling rate (see the circled region in Fig. 38a). A better result was achieved by PAA, as shown in Fig. 38b: although the shape of the time series is better captured, the data points, and in particular the salient points, are smoothed out by the averaging effect. The PIP-based approach, on the other hand, captures the fluctuations of the time series, and the important signals are preserved even when the series is compressed and displayed on a small screen (Fig. 38c). It is therefore more suitable for applications in which the fluctuations of the time series matter, such as financial chart analysis (e.g. wave principle analysis).

Fig. 38. Visualization results on a mobile device using different approaches: (a) sampling, (b) PAA and (c) the proposed PIP-based method.
Owing to the progressive nature of accessing the tree representation, progressive transmission of time series data to mobile devices comes naturally. While PIPs are being transmitted over a slow wireless link, a user can view an outline of the time series at an early stage; as subsequent PIPs add detail, the major fluctuations can be grasped from only a few early data points, so the user-perceived delay is significantly reduced. Moreover, the proposed method can be tailored to the display dimensions and the network bandwidth, so that only just enough PIPs for the target display are transmitted. Under very low bandwidth, the number of PIPs can be halved, or even cut to one fifth, to minimize the overall delay while still preserving sufficient detail for financial analysis. Fig. 39 shows an example of progressive/multi-resolution visualization after only 3, then 24, 48 and 72 data points have been received by the mobile device; the final result, with all data points received, was shown in Fig. 38c. Unlike the traditional visualization process, the overall trend of the time series appears at a very early stage while data is still being transferred, greatly reducing the user's waiting time. In addition, users can interrupt the visualization process at any time once the displayed chart already suffices for making a decision, or drill into a particular segment for further investigation and analysis.
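On the client side, this behaviour can be sketched as follows: each arriving PIP is merged into a time-ordered point list and the chart is redrawn, so every prefix of the importance-ordered stream already yields a valid, progressively refined view. The class and method names below are assumptions for illustration (Java 16+ syntax).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of progressive chart refinement on the mobile client. */
public class ProgressiveChart {
    record Pip(int time, double value) {}

    private final List<Pip> received = new ArrayList<>();

    /** Called as each PIP arrives: keep the points sorted by time, then redraw. */
    void onPipReceived(Pip p) {
        int pos = Collections.binarySearch(received, p,
                (a, b) -> Integer.compare(a.time(), b.time()));
        received.add(pos < 0 ? -pos - 1 : pos, p);
        redraw();
    }

    void redraw() {
        // A real client would draw line segments between consecutive PIPs;
        // here we only report the current resolution of the chart.
        System.out.println("chart now has " + received.size() + " points");
    }
}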
The communication bandwidth, air time and storage are thus saved by not transmitting or storing unnecessary data points. It is also common for financial investors to investigate a particular time segment of a series. With the proposed method, it is straightforward for the system to retrieve only the data points of a specific time range from the tree representation and schedule their transfer to the mobile device promptly. The significance is that, even while a time series is being transmitted, the proposed approach lets the mobile application offer highly flexible interaction: users can zoom in and out of the time series and view any segment promptly without much redundant data being transmitted. Figs. 40a and b show the result of displaying the subsequence of the time series in Fig. 38c from the 1379th to the 1840th data point (462 data points in total), and Figs. 40c and d show the effect when the user zooms further into the range from the 1547th to the 1679th data point (133 data points altogether). Visualization results for different numbers of received data points are shown: 48 data points for Figs. 40a and c, and 96 data points for Figs. 40b and d.
Fig. 39. Progressive/multi-resolution display of 3 to 24, 48 and 72 data points on mobile device using the proposed visualization approach.
Fig. 40. Multi-resolution display of time series: (a) segment 1379–1840, resolution = 48 data points; (b) segment 1379–1840, resolution = 96 data points; (c) segment 1547–1679, resolution = 48 data points; and (d) segment 1547–1679, resolution = 96 data points.
6. Conclusions

This paper has presented a financial time series representation based on a tree structure built according to the importance of the data points. The process of perceptually important point (PIP) identification, which evaluates the importance of a data point, has been illustrated, and three importance evaluation methods, PIP-ED, PIP-PD and PIP-VD, have been proposed. Experiments show that PIP-VD is the preferable method for evaluating data point importance in most cases in the financial domain. A novel tree structure, the SB-Tree, has then been proposed for storing the time series data hierarchically, and the creation, updating and retrieval of the SB-Tree have been discussed. Two dimensionality reduction approaches, tree pruning and an error threshold, have been proposed. The significant advantages of the proposed approaches over traditional time series dimensionality reduction methods are: (1) the time series is easily reduced to different resolutions; (2) the overall shape of the time series is preserved by determining a suitable number of data points to represent it; (3) the incremental updating problem of the time series representation is handled; and (4) important points (i.e. salient points) are not filtered out even under a high compression ratio. These properties are particularly attractive in applications such as stock technical pattern analysis, and the mobile application of the proposed representation has been demonstrated.