Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 139 (2018) 256–262
www.elsevier.com/locate/procedia
The International Academy of Information Technology and Quantitative Management, the Peter Kiewit Institute, University of Nebraska
Finding Patterns of Stock Returns Based on Sequence Alignment
Yong Shi a,b,c,d, Ye-ran Tang a,b,c, Wen Long a,b,c,*
a School of Economics & Management, University of Chinese Academy of Sciences, Beijing 100190, P.R. China
b Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing 100190, P.R. China
c Key Laboratory of Big Data Mining & Knowledge Management, Chinese Academy of Sciences, Beijing 100190, P.R. China
d College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182, USA
Abstract
In this paper, we propose a method based on sequence alignment to find patterns of stock returns. We use 5-minute high-frequency data of the CSI 300 index to test this method, and find that we can predict a sharp rise or drop in stock returns according to patterns in the sample symbol sequence. The empirical analysis suggests it is possible to find and predict patterns in stock returns based on the method of sequence alignment.
© 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer review under responsibility of the scientific committee of The International Academy of Information Technology and Quantitative Management, the Peter Kiewit Institute, University of Nebraska.
Keywords: finding patterns; sequence alignment; stock returns
1. Introduction
The stock market is important for the optimization of resource allocation, and it is also one of the important channels for investment. However, the stock market contains a great deal of information, investors operate at different levels, and high returns come with high risks. Therefore, a valid method to predict and analyze the stock market is urgently needed, which is meaningful for both theoretical and empirical studies.

* Corresponding author. Tel.: +86-10-82680927. E-mail address: [email protected].
1877-0509 © 2018 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2018.10.265
With the rapid development of the stock market, the structure of financial time series has become complex and their volume has increased sharply. Methods based on big data and interdisciplinary approaches have found a wide range of applications [1-5]. Among them, sequence alignment, a method from bioinformatics, has also been used to analyze the stock market [6-9].

Deals in the stock market, as actions of investors, are affected by investors' psychological factors. Investors often make similar investments in similar situations based on experience. Therefore, based on the patterns in the stock market, we can predict stock prices from historical stock data.

In this paper, we propose a method based on bioinformatics to find patterns in the stock market [10-12]. We obtain the patterns from historical stock returns using the method of sequence alignment. If there is a common relationship between a sequence and the returns that follow it, the samples in the financial series are not independent; this relationship is called a pattern in this paper. Once we find the patterns, we can predict the stock price based on the historical sequence. This method is not only an extension of the sequence alignment method in bioinformatics, but also a complement to existing methods of stock return analysis.

The advantages of our proposed method include: (1) No guidance: we do not need to specify in advance a certain sequence to analyze; (2) Intuitive: the model is simple and can be understood by investors; (3) Universality: we do not need to test the stationarity of the financial time series. The method also has a relatively high tolerance to noise because of symbolization.

2. Methodology
In this section, we introduce the method of finding patterns and making predictions based on sequence alignment:
(1) Symbolization of the returns sequence: we assume that the initial time sequence is $\{r_t, t = 1, 2, \ldots, i\}$. Using a symbolization method for continuous numerical time sequences, we map $\{r_t\}$ to the symbol set $\Omega = \{A_1, A_2, \ldots, A_j\}$, and the symbolized sequence is $\{R_t, t = 1, 2, \ldots, i\}$.
(2) Setting the time window and getting the full permutation: an aligned sequence is a symbol sequence that is used to search for patterns in the sample sequence. We assume that the length of the aligned sequence is $n$. In order to predict the symbol of the stock return at position $n+1$, we generate the full permutation set of length $n+1$, $\{S_k, k = 1, 2, \ldots, j^{n+1}\}$, where $S_k$ is one of the sequences in the full permutation and $|S_k| = n+1$. For example, $\underbrace{A_1 A_1 \cdots A_1}_{n+1}$ is one such sequence, and the number of sequences in the full permutation with length $n+1$ is $j^{n+1}$.
(3) Getting the frequency of symbol sequence $S_k$ occurring in the symbolized sample sequence $\{R_t\}$: we conduct sequence alignment between the symbol sequence $S_k$ and $\{R_t\}$ in order to find all occurrences of $S_k$ in $\{R_t\}$. This gives the frequency of $S_k$ in the symbolized sample sequence $\{R_t\}$. Letting $k$ run from 1 to $j^{n+1}$, we get the frequency of every possible $S_k$.
(4) Finding patterns and making predictions: set the threshold as $\sigma$. Only when the frequency of $S_k$ exceeds $\sigma$ ($\sigma > 0$) can the sequence $S_k$ be a pattern. Since, for the same first $n$ symbols, the $(n+1)$th symbol can differ, $S_k$ has $j$ situations. If the difference in frequency among these situations is small, the first $n$ symbols and the $(n+1)$th symbol are independent, which means we cannot predict the $(n+1)$th symbol given these $n$ symbols. On the contrary, if the difference is large and one of the situations, $S_k^*$, has the largest frequency, there is a pattern between the $(n+1)$th symbol and the previous $n$ symbols. Therefore, when the first $n$ symbols of $S_k^*$ appear consecutively, we can predict that the $(n+1)$th symbol is the last symbol of $S_k^*$, and the relationship between the $(n+1)$th symbol and the $n$ symbols is a pattern.

3. Empirical studies on high frequency data of stock returns
3.1. Data and symbolization
The sample data we use are the returns of the CSI 300 index from June 3rd, 2014 to September 30th, 2015. The sampling frequency is 5 minutes. Since the trading hours of the stock market are 9:30 to 11:30 and 13:00 to 15:00, we get 48 samples per day. Thus, the total sample size of the return series is 15,744. The test data are from October 8th, 2015 to December 30th, 2015, so the size of the test data is 2,928.

Let the price of the stock index at time $t$ be $P_t$; its return is $r_t = \ln P_t - \ln P_{t-1}$. Then, using the method proposed in Section 2, we find patterns in the return series of the CSI 300 index and test their validity for prediction.

We use an ex post method for symbolization, that is, we define the boundaries of each symbol interval according to the probability of the sample falling in each interval. The number of groups is $j = 3$. The first interval is drop, the second is stable, and the third is rise; they are represented as A, N and T, respectively. The probability of each symbol is 1/3. The symbol intervals are shown in Table 1.

Table 1. Symbol intervals of returns series of CSI 300 index

| $r_t$ | (-∞, -0.0590) | [-0.0590, 0.0749) | [0.0749, +∞) |
|---|---|---|---|
| $R_t$ | A | N | T |
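The symbolization step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the boundaries are taken as simple order statistics at the 1/3 and 2/3 positions, which is one assumed way to obtain equal-probability intervals (the paper does not state how its boundaries were computed).

```python
import math

def log_returns(prices):
    """Compute r_t = ln P_t - ln P_{t-1} for consecutive prices."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

def tercile_bounds(returns):
    """Ex post boundaries (assumed: order statistics) so that about 1/3
    of the sample falls in each of the three intervals."""
    ordered = sorted(returns)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]

def symbolize(returns, lower, upper):
    """Map each return to A (drop), N (stable) or T (rise).
    The stable interval is [lower, upper), matching the layout of Table 1."""
    symbols = []
    for r in returns:
        if r < lower:
            symbols.append("A")
        elif r < upper:
            symbols.append("N")
        else:
            symbols.append("T")
    return "".join(symbols)

# Using the boundaries reported in Table 1
print(symbolize([-0.1, -0.0590, 0.0, 0.0749, 0.2], -0.0590, 0.0749))  # ANNTT
```

With the Table 1 boundaries, a return exactly equal to -0.0590 is mapped to N and a return exactly equal to 0.0749 is mapped to T, because the intervals are closed on the left.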
3.2. Finding patterns of returns series
To find patterns in the symbol sequence of the returns of the CSI 300 index with the method above, we need to set the length of the patterns and then generate the full permutation of symbol sequences of this length. Each symbol sequence serves as an aligned sequence, and the time window is the length of each sequence. If the time window is too long, the sample sequence can hardly be completely consistent with the aligned sequence; if the time window is too short, clear patterns may not form, and the purpose of prediction cannot be achieved. Therefore, the time windows are 5, 6, and 7 in this paper, which correspond to time intervals of 25, 30, and 35 minutes, respectively. Thus, the number of sequences in the full permutation over the symbol set $\{A, N, T\}$ is 243, 729, and 2,187, respectively.

Then, according to step (3) in Section 2, we conduct sequence alignment between each aligned sequence and the symbolized sample sequence to get the number of successful matches. The total sample size is 15,744, and the average occurrence number for lengths 5, 6, and 7 is 64.8, 21.6, and 7.2, respectively. The calculation formula is (sample size - length of aligned sequence + 1) / number of sequences in the full permutation. The threshold $\sigma$ in step (4) in Section 2 should be at least larger than the average occurrence number. In this paper, $\sigma$ is set to 100, 50, and 25 when the time windows are 5, 6, and 7, respectively.

Considering that investors are usually sensitive to large fluctuations of stock prices, for prediction we are more concerned with the last symbol being A or T, that is, stock prices dropping or rising. So we need to calculate the occurrence probabilities of the last symbol being A, N, and T, respectively, when the previous $n$ symbols are determined. For prediction, the predicted last symbol is the $(n+1)$th symbol with the highest probability.
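The counting step and the average occurrence formula can be sketched as below. This is an illustrative sketch under stated assumptions: exact matching is done with an overlapping sliding window (which is what the "sample size - length + 1" term in the formula suggests), and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def count_sequences(symbols, length):
    """Count every run of `length` consecutive symbols (overlapping windows)."""
    return Counter(symbols[i:i + length]
                   for i in range(len(symbols) - length + 1))

def average_occurrence(sample_size, length, n_symbols=3):
    """(sample size - length + 1) / number of sequences in the full permutation."""
    return (sample_size - length + 1) / n_symbols ** length

def frequent_sequences(symbols, length, sigma, alphabet="ANT"):
    """Aligned sequences whose number of matches exceeds the threshold sigma."""
    counts = count_sequences(symbols, length)
    candidates = ("".join(p) for p in product(alphabet, repeat=length))
    return {s: counts[s] for s in candidates if counts[s] > sigma}

# Reproduce the average occurrence numbers quoted in the text
for window in (5, 6, 7):
    print(window, round(average_occurrence(15744, window), 1))
# 5 64.8
# 6 21.6
# 7 7.2
```

Enumerating all candidates with `product` mirrors the full permutation described in the text; in practice, iterating over the keys of the counter alone would give the same result, since sequences that never occur cannot exceed a positive threshold.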
Taking a time window of 5 as an example, the matches of aligned sequences and the sample sequence are shown in Table 2.

Table 2. Matches of aligned sequence and sample sequence (taking time window of 5 as an example)

| previous n symbols | aligned sequence | times of matches | previous n symbols | aligned sequence | times of matches | previous n symbols | aligned sequence | times of matches |
|---|---|---|---|---|---|---|---|---|
| AAAT | AAATA | 117 | ATTA | ATTAA | 122 | TATA | TATAA | 110 |
| | AAATN | 59 | | ATTAN | 51 | | TATAN | 50 |
| | AAATT | 103 | | ATTAT | 119 | | TATAT | 119 |
| AATA | AATAA | 134 | ATTT | ATTTA | 146 | TATT | TATTA | 128 |
| | AATAN | 42 | | ATTTN | 60 | | TATTN | 60 |
| | AATAT | 140 | | ATTTT | 94 | | TATTT | 105 |
| AATT * | AATTA | 110 | NNNN * | NNNNA | 150 | TTAA | TTAAA | 107 |
| | AATTN | 86 | | NNNNN | 559 | | TTAAN | 71 |
| | AATTT | 123 | | NNNNT | 135 | | TTAAT | 122 |
| ATAA | ATAAA | 102 | TAAA | TAAAA | 96 | TTAT | TTATA | 100 |
| | ATAAN | 48 | | TAAAN | 42 | | TTATN | 64 |
| | ATAAT | 135 | | TAAAT | 127 | | TTATT | 127 |
| ATAT | ATATA | 133 | TAAT | TAATA | 134 | TTTA | TTTAA | 116 |
| | ATATN | 59 | | TAATN | 57 | | TTTAN | 66 |
| | ATATT | 113 | | TAATT | 152 | | TTTAT | 120 |
Note: * marks prefixes for which the number of matches with A or T as the last symbol is not much larger than that with N. Here, "much larger" means that the number of matches for A or T exceeds 3 times that for N.
From Table 2, it can be seen that, except for the first 4 symbols AATT and NNNN, the number of matches with A or T as the last symbol is much larger than that with N, while there is no obvious difference between A and T. These results suggest that there are patterns in the sequence of the CSI 300 index: when the first 4 symbols (other than AATT and NNNN) appear, the next symbol will probably be A or T. We get similar results when the time windows are 6 and 7.

We count the number of matches between the patterns in Table 2 and the sample sequence, and approximate the probability of a match by its frequency. For example, in the symbolized sample sequence, the pattern AAAT occurs 279 times, including 117 times as AAATA, 59 times as AAATN, and 103 times as AAATT. Thus, when the first 4 symbols are AAAT, the probability of the 5th symbol being A or T is (117+103)/279 = 78.85%. To compare the results for time windows of 5, 6, and 7, we show the 10 patterns with the highest probability of A or T for the last symbol in Table 3.
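The frequency-based probability in the worked example can be reproduced with a few lines. This is a minimal sketch with hypothetical function names; the counts are copied from Table 2 for two prefixes only.

```python
def last_symbol_probabilities(counts, prefix, alphabet="ANT"):
    """Estimate P(next symbol | prefix) as the relative frequency of prefix + symbol."""
    total = sum(counts.get(prefix + s, 0) for s in alphabet)
    if total == 0:
        return {}
    return {s: counts.get(prefix + s, 0) / total for s in alphabet}

def prob_rise_or_drop(counts, prefix):
    """Probability that the symbol following `prefix` is A (drop) or T (rise)."""
    probs = last_symbol_probabilities(counts, prefix)
    return probs.get("A", 0.0) + probs.get("T", 0.0)

# Match counts copied from Table 2 (time window of 5)
table2 = {
    "AAATA": 117, "AAATN": 59, "AAATT": 103,
    "AATAA": 134, "AATAN": 42, "AATAT": 140,
}

print(round(100 * prob_rise_or_drop(table2, "AAAT"), 2))  # 78.85
print(round(100 * prob_rise_or_drop(table2, "AATA"), 2))  # 86.71
```

The two printed values agree with the entries for AAAT and AATA in the time window of 5 column of Table 3.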
[Figure 1: line chart; x-axis: order (1 to 10); y-axis: probability (%), from 70 to 100; series: time window of 5, time window of 6, time window of 7]

Fig. 1. Probability of A or T in the patterns with different time windows

Table 3. 10 patterns with the highest probability of A or T for the last symbol

| ordering number | time window of 5: patterns | time window of 5: probability (%) | time window of 6: patterns | time window of 6: probability (%) | time window of 7: patterns | time window of 7: probability (%) |
|---|---|---|---|---|---|---|
| 1 | AATA | 86.71 | ATAAA | 89.22 | TAAATA | 94.12 |
| 2 | TAAA | 84.15 | AAATA | 87.18 | AATAAA | 91.84 |
| 3 | TAAT | 83.38 | TAATA | 86.57 | AATAAT | 89.39 |
| 4 | ATAA | 83.16 | ATATT | 85.84 | ATATAT | 88.71 |
| 5 | ATTA | 82.53 | AATAA | 85.82 | AAATAA | 86.27 |
| 6 | TATA | 82.08 | ATAAT | 84.44 | ATAATA | 85.25 |
| 7 | ATAT | 80.66 | TTAAT | 84.43 | ATTTAT | 85.00 |
| 8 | ATTT | 80.00 | ATATA | 84.21 | AATATA | 84.62 |
| 9 | TATT | 79.52 | TATTA | 83.59 | TTTAAT | 84.31 |
| 10 | AAAT | 78.85 | TATTT | 81.90 | TAATAT | 81.67 |
It can be seen from Table 3 and Figure 1 that the probabilities of A or T in the top 10 patterns with time windows of 6 and 7 are larger than those with a time window of 5, which suggests that the patterns linking the aligned sequence and the next symbol are more reliable when the time windows are 6 and 7. Besides, as can be seen from Figure 1, from the 5th pattern onward, the probabilities for time windows of 6 and 7 are very similar.

3.3. Prediction and testing
We use a total of 2,928 test data points from October 8th, 2015 to December 30th, 2015 for prediction, and test the validity of our pattern-finding method based on sequence alignment. First, we symbolize the original test data into a symbol sequence according to the symbolization rules in Section 3.1. Then, with the time windows set to 5, 6, and 7, we match patterns in the test sequence based on the top 10 patterns listed in Table 3. Next, we check the accuracy of the prediction. After obtaining the symbol sequences in the test sequence that are completely consistent with the aligned sequence, we get the number of times that A, N, and T occur as the last symbol, respectively. Following the order of patterns in Table 3, we calculate the cumulative number of times A, N, and T occur as the last symbol, and then obtain the overall probability of A or T at the last symbol.

Table 4. The times and probability of A or T at the last symbol in test sequence

| Number of patterns | Window 5: cumulative times of A or T at last symbol | Window 5: cumulative times of N at last symbol | Window 5: probability of A or T at last symbol (%) | Window 6: cumulative times of A or T at last symbol | Window 6: cumulative times of N at last symbol | Window 6: probability of A or T at last symbol (%) | Window 7: cumulative times of A or T at last symbol | Window 7: cumulative times of N at last symbol | Window 7: probability of A or T at last symbol (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 31 | 4 | 88.57 | 15 | 1 | 93.75 | 8 | 1 | 88.89 |
| 2 | 73 | 12 | 85.88 | 35 | 2 | 94.59 | 13 | 1 | 92.86 |
| 3 | 106 | 25 | 80.92 | 42 | 4 | 91.30 | 17 | 1 | 94.44 |
| 4 | 143 | 41 | 77.72 | 49 | 9 | 84.48 | 27 | 2 | 93.10 |
| 5 | 177 | 53 | 76.96 | 58 | 15 | 79.45 | 31 | 8 | 79.49 |
| 6 | 228 | 64 | 78.08 | 77 | 17 | 81.91 | 36 | 9 | 80.00 |
| 7 | 273 | 75 | 78.45 | 85 | 24 | 77.98 | 41 | 11 | 78.85 |
| 8 | 304 | 89 | 77.35 | 113 | 29 | 79.58 | 49 | 13 | 79.03 |
| 9 | 325 | 99 | 76.65 | 120 | 31 | 79.47 | 51 | 16 | 76.12 |
| 10 | 364 | 113 | 76.31 | 129 | 34 | 79.14 | 53 | 17 | 75.71 |
Table 4 and Figure 2 show the cumulative times and probabilities of A or T as the last symbol in the test sequence after the aligned sequence, for time windows of 5, 6, and 7, respectively. The ratio of A or T as the last symbol in the test sequence reflects the accuracy of the prediction, since we predict that the next symbol after the aligned sequence is A or T. Thus, the higher the ratio, the higher the accuracy of the prediction, and vice versa. From Figure 2, we can see that the ratio of A or T as the last symbol is in most cases higher when the time window is 6 or 7 than when it is 5. Therefore, the prediction results with a time window of 6 or 7 are generally better than those with a time window of 5. On the other hand, as the time window increases, the number of patterns we find decreases sharply, mainly because the longer the pattern, the more difficult it is to match the sequence. Therefore, in practice, we need a trade-off between accuracy and the number of patterns. On the one hand, for a given number of patterns, the length of the time window has a positive impact on accuracy; on the other hand, for a given time window, the accuracy gradually decreases as the number of patterns increases.
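The cumulative accuracy in Table 4 can be recomputed as sketched below. The function name is hypothetical, and the per-pattern counts used here are not reported directly in the paper: they are obtained by differencing the cumulative counts in the first four rows of the time window of 5 columns of Table 4.

```python
from itertools import accumulate

def cumulative_accuracy(hits, misses):
    """hits[i]   = times the last symbol is A or T for the i-th pattern,
    misses[i] = times the last symbol is N for the i-th pattern.
    Returns the cumulative probability (%) of A or T after each pattern is added."""
    cum_hits = accumulate(hits)
    cum_misses = accumulate(misses)
    return [round(100 * h / (h + m), 2) for h, m in zip(cum_hits, cum_misses)]

# Per-pattern counts derived from Table 4 (time window of 5, patterns 1 to 4):
# cumulative A or T: 31, 73, 106, 143 -> increments 31, 42, 33, 37
# cumulative N:       4, 12,  25,  41 -> increments  4,  8, 13, 16
hits = [31, 42, 33, 37]
misses = [4, 8, 13, 16]

print(cumulative_accuracy(hits, misses))  # [88.57, 85.88, 80.92, 77.72]
```

The printed values match the probability column for the time window of 5 in Table 4, which illustrates how adding lower-ranked patterns pulls the cumulative accuracy down.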
[Figure 2: line chart; x-axis: number of patterns (1 to 10); y-axis: probability (%), from 70 to 100; series: time window of 5, time window of 6, time window of 7]

Fig. 2. Probability of A or T in the patterns with different time windows in test sequence
Figure 2 also suggests that the prediction accuracy of the next symbol being A or T is roughly 76%-80% when the number of patterns is more than 4, for time windows of 5, 6, and 7, which is much higher than the average value of 66.67%. This indicates that a given sequence and the next symbol are not independent and that patterns exist, which validates the method proposed in this paper.

4. Conclusion
In this paper, we use the method of sequence alignment to find patterns of stock returns and make predictions on stock returns. We apply this method to 5-minute high-frequency returns of the CSI 300 index. By finding patterns and using them to predict a sharp rise or drop, we test the validity of the proposed method. In investment practice, if the symbolized sequence of stock returns matches the patterns we find, we can predict a sharp rise or drop in the stock price in the near future, which can provide investors with a risk alert. Thus, the method proposed in this paper is meaningful for investment practice.

Acknowledgements
This research was supported by grants from the National Natural Science Foundation of China (No. 71771204, 71331005, 91546201).
References
[1] Novak M G, Velušček D. Prediction of stock price movement based on daily high prices. Quantitative Finance, 2016, 16(5):793-826.
[2] Abu-Mostafa Y S, Atiya A F. Introduction to financial forecasting. Applied Intelligence, 1996, 6(3):205-213.
[3] Gestel T V, Suykens J A K, Baestaens D E, et al. Financial time series prediction using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks, 2001, 12(4):809.
[4] Huang W, Nakamori Y, Wang S Y. Forecasting stock market movement direction with support vector machine. Computers & Operations Research, 2005, 32(10):2513-2522.
[5] Mettenheim H J V, Breitner M H. Forecasting Daily Highs and Lows of Liquid Assets with Neural Networks. Operations Research Proceedings, 2014:253-258.
[6] Takuya Y, et al. Analysis of indicator time series by quantitative sequence alignment. Computational Statistics and Data Analysis, 2008(53):486-495.
[7] Xu X. Technical analysis model for stock market based on fuzzy candlesticks sequence alignment. Computer Applications & Software, 2010.
[8] Mei X U. Empirical Research of Stock Market Volatility Based on Sequence Alignment Method. Journal of Wuhan University of Technology, 2013.
[9] Yih-Wenn Laih. Measuring rank correlation coefficients between financial time series: A GARCH-copula based sequence alignment algorithm. European Journal of Operational Research, 2014, 232(2):375-382.
[10] Needleman S B, Wunsch C D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 1970, 48(3):443-453.
[11] Smith T F, Waterman M S. Identification of common molecular subsequences. Journal of Molecular Biology, 1981, 147(1):195-197.
[12] Pevsner J. Basic Local Alignment Search Tool (BLAST). Bioinformatics and Functional Genomics. John Wiley & Sons, Inc. 2005:87-125.