ARTICLE IN PRESS
Physica A 355 (2005) 183–189 www.elsevier.com/locate/physa
Non-linear logit models for high-frequency data analysis
Naoya Sazuka
Corporate Finance & Strategy Office, Global Hub, Sony Corporation, 6-7-35 Kitashinagawa Shinagawa-ku, Tokyo 141-0001, Japan
Received 31 October 2004; received in revised form 1 February 2005
Available online 4 May 2005
Abstract
We analyze tick-by-tick data, the highest-frequency data available, of yen–dollar exchange rates, with a focus on the direction (up or down) of price movements. We propose a non-linear logit model to describe a non-trivial probability structure that is apparently invisible in the price changes themselves but appears in binarized data that extract the up or down price movement. The model selected by AIC agrees well with the empirical results. Additionally, a similar bias is obtained from binarized tick-by-tick data on the NYSE, for example GE. Our model could be useful for extracting non-trivial probability structures from a wide range of binary time series.
© 2005 Elsevier B.V. All rights reserved.
Keywords: High-frequency data; Tick-by-tick data; Yen–dollar exchange rate; Binary data; Conditional probabilities; Non-linear logit models; AIC
1. Introduction
High-frequency financial data analysis has received considerable attention. In recent years, various studies have been conducted on high-frequency data, showing properties that cannot be observed in low-frequency data [1]. Above all, we analyze the highest-frequency data available, called tick-by-tick data, which are the records of every transaction. The purpose of this paper is to show a non-trivial probability structure of tick-by-tick data on yen–dollar exchange rates, with a focus on up and down price movements. We propose a non-linear logit model of the non-trivial probability structure shown in binarized data extracting the up and down direction. The model agrees well with the empirical results. We also show a similar bias in tick-by-tick data on the NYSE. Therefore, our model could be useful for a wide range of binary data.

E-mail address: [email protected].
0378-4371/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.physa.2005.02.082

[Figure 1: two panels, (A) and (B), each plotting Y against t.]
Fig. 1. Time series plots for data set A in (A) and data set B in (B).
2. Tick-by-tick data
We analyze tick-by-tick data, the finest data available, of yen–dollar exchange rates. We use two data sets of tick-by-tick "trade" data provided by Bloomberg for the periods 10/26/1998 to 11/30/1998 (data set A) and 1/4/1999 to 3/12/1999 (data set B) (Fig. 1). The time series consist of values $Y(t)$ of the yen value per dollar at "tick step" $t$. Note that $t$ is not real time but a discrete step index; the transactions take place at variable time intervals. Data sets A and B contain 267,398 and 578,509 data points, respectively. On average, two consecutive ticks are separated by 7 s. We binarize the data to extract the direction of the price movements as follows: $X(t) = +1$ if $Y(t+1) - Y(t) > 0$ and $X(t) = -1$ if $Y(t+1) - Y(t) < 0$. In order to focus on up and down price movements, we leave out the cases in which a trade is at the same price as the one before ($Y(t+1) - Y(t) = 0$). With this reduction, data sets A and B contain 145,542 and 344,791 data points, respectively, and the average time interval between ticks is about 10–13 s [2].
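The binarization rule described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `binarize` and the use of NumPy are not from the paper, and the input is assumed to be a plain sequence of prices ordered by tick step.

```python
import numpy as np

def binarize(prices):
    """Map a price series Y(t) to X(t) in {+1, -1} (hypothetical helper).

    X(t) = +1 if Y(t+1) - Y(t) > 0 and X(t) = -1 if Y(t+1) - Y(t) < 0.
    Steps with Y(t+1) - Y(t) = 0 are dropped, as described in the text.
    """
    prices = np.asarray(prices, dtype=float)
    diff = np.diff(prices)        # Y(t+1) - Y(t) for every tick step
    diff = diff[diff != 0]        # leave out unchanged prices
    return np.sign(diff).astype(int)
```

For example, `binarize([5, 6, 6, 4, 7])` discards the unchanged step between the two 6s and keeps the three remaining movements.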
3. The probability structure in the up and down price movements
In our previous study [3], we showed striking similarities between the two binarized data sets, for example in the correlation functions and in the estimated parameters of an AR model, a linear autoregressive model (Fig. 2). However, when the price change itself is analyzed, these properties are completely different for each data set. We also showed a non-trivial structure in the conditional probabilities, which represent the dependence of the current step on several steps in the past. This probability structure has a strong bias with an up (+) and down (−) price movement symmetry, not only in the first-order conditional probabilities but also in the higher-order ones. Additionally, prices tend to continue moving in the same direction after three consecutive steps in the same direction (i.e., $P(+ \mid +,+,+,+) > P(+ \mid +,+,+) = P(+ \mid +,+)$). This tendency may reflect the behaviour of dealers trying to follow the market trend. These results are summarized in Table 1 [2]. Other studies [5] mentioned the bias of the price movement between two consecutive ticks but not relations among multiple ticks. On the other hand, a similar higher-order statistical analysis at half-hour intervals did not find notable deviations in yen–dollar trading [4]. Therefore, at the tick level, a very notable probabilistic rule may exist in the up and down price movements of the yen–dollar exchange rate, as found in these data sets. Knowing this property could be useful for actual trading.

[Figure 2: two panels. Left: correlation function against lag. Right: estimated parameter against lag.]
Fig. 2. (Left) Correlation functions $C_X(k) = \langle X(t), X(t+k) \rangle$ for binary data A (triangle) and binary data B (circle) almost overlap. (Right) Estimated parameters fitted to an AR(4) process for binary data A (triangle) and binary data B (circle) are very similar. These properties are, however, completely different for the two data sets when the price change itself is analyzed.
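The conditional probabilities discussed in this section can be estimated by simple counting. The following is a minimal sketch, not the procedure used in the paper: the function name `conditional_up_probability` is hypothetical, and it assumes that the conditions are listed with the most recent step first, i.e. $P(+ \mid a, b)$ means $X(t-1) = a$ and $X(t-2) = b$. The source does not state the ordering.

```python
def conditional_up_probability(x, history):
    """Estimate P(X(t) = +1 | X(t-1) = history[0], X(t-2) = history[1], ...).

    x is a sequence of +1/-1 values; history is a tuple of +1/-1 values,
    most recent step first (an assumed convention). Returns NaN if the
    history never occurs in x.
    """
    history = tuple(history)
    m = len(history)
    matches = 0   # number of steps whose past equals `history`
    ups = 0       # of those, how many were followed by an up move
    for t in range(m, len(x)):
        past = tuple(x[t - i] for i in range(1, m + 1))
        if past == history:
            matches += 1
            if x[t] == 1:
                ups += 1
    return ups / matches if matches else float("nan")
```

With an empty history the function returns the unconditional frequency of up moves; comparing `conditional_up_probability(x, (1, 1))` with `conditional_up_probability(x, (1, 1, 1, 1))` would be one way to look for the persistence effect mentioned above.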
4. Model
The standard logit model, $\log \frac{P}{1-P} = \sum_{i=1}^{k} q_i X(t-i)$, is known to be suitable for binary analysis [6]. However, we found that a conventional logit model was not sufficient for our data because of their non-linear behavior, as shown in Fig. 5. This motivated us to develop an extended, non-linear logit model of the binary probabilistic structure [7]. The non-linear logit model of order $k$ is defined by
$$\log \frac{P}{1-P} = q_0 + \sum_{i_1=1}^{k} q_{i_1} X(t-i_1) + \sum_{i_1,i_2=1}^{k} q_{i_1 i_2} X(t-i_1) X(t-i_2) + \cdots + \sum_{i_1,i_2,\ldots,i_k=1}^{k} q_{i_1 i_2 \cdots i_k} X(t-i_1) X(t-i_2) \cdots X(t-i_k), \qquad (1)$$
Table 1
The conditional probabilities computed from data set A, data set B and a simulation of the non-linear logit model.
The number of training data points for the simulation is 57,891, corresponding to one week of data, including the cases in which the price remains unchanged between two consecutive ticks. We compute the maximum likelihood estimate of the parameters using the first week of data set B, then simulate 20,000 points from the maximum likelihood model. The error bars represent the 95% confidence interval on the mean. All the other conditional probabilities not shown in the table can be derived from those shown by the up and down symmetry. Without time correlation in the data, such similarity between the two data sets should not be observed.
where $P = P(X(t) = +1 \mid X(t-1), \ldots, X(t-k))$ is the probability that $X(t) = +1$ at step $t$ given the previous $k$ states. The $k$th-order model considers products of up to $k$ values of $X$. We then select the appropriate order of the non-linear logit model based on AIC (Akaike Information Criterion), defined as $-2(\text{maximum log likelihood}) + 2(\text{number of model parameters})$ [8]. AIC has a minimum at the 5th order for both data sets, and the shapes of the two curves are similar (Fig. 3). On the other hand, for AR models and normal logit models, AIC does not show a minimum at the same order for the two data sets (Figs. 4, 5). Therefore, the 5th-order model is an appropriate one according to AIC, and it captures the probability structure very well, including the tendency to continue moving in the same direction mentioned above (Table 1). A period of 5 states is roughly equivalent to a few minutes in real time. This result is consistent with dealers' perceptions: their strategies are said to change slowly, on a time scale of a few minutes [9].
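To make the model and the AIC-based order selection concrete, here is a minimal Python sketch. It is not the author's implementation, and several choices are assumptions: all function names are hypothetical; because $X = \pm 1$ implies $X^2 = 1$, terms of Eq. (1) with repeated indices are collapsed, so the order-$k$ model gets one coefficient per subset of the lags $1, \ldots, k$; the maximum likelihood fit uses Newton's method, which the paper does not specify; and every order is scored on the same steps so that the AIC values are comparable.

```python
import itertools
import numpy as np

def design_matrix(x, k, start=None):
    """Product features for the order-k model (k >= 1, start >= k).

    Row j corresponds to step t = start + j. Each column is the product of
    X(t-i) over one subset of the lags {1, ..., k}; the empty subset gives
    the constant column for q_0.
    """
    x = np.asarray(x, dtype=float)
    start = k if start is None else start
    t = np.arange(start, len(x))
    lags = np.column_stack([x[t - i] for i in range(1, k + 1)])
    cols = []
    for r in range(k + 1):
        for subset in itertools.combinations(range(k), r):
            if subset:
                cols.append(lags[:, list(subset)].prod(axis=1))
            else:
                cols.append(np.ones(len(t)))
    target = (x[t] > 0).astype(float)   # 1 if X(t) = +1, else 0
    return np.column_stack(cols), target

def fit_logit(features, target, n_iter=100, tol=1e-10):
    """Maximum likelihood fit by Newton's method (an assumed choice).

    Returns the coefficient vector q and the maximum log likelihood.
    """
    q = np.zeros(features.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-features @ q))
        grad = features.T @ (target - p)
        hess = (features * (p * (1.0 - p))[:, None]).T @ features
        # Tiny diagonal term only guards against a singular Hessian.
        step = np.linalg.solve(hess + 1e-9 * np.eye(len(q)), grad)
        q += step
        if np.max(np.abs(step)) < tol:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-features @ q)), 1e-12, 1.0 - 1e-12)
    loglik = float(np.sum(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))
    return q, loglik

def aic(loglik, n_params):
    """AIC = -2 (maximum log likelihood) + 2 (number of parameters)."""
    return -2.0 * loglik + 2.0 * n_params

def select_order(x, max_order):
    """Return {order: AIC}, scoring every order on the steps t >= max_order."""
    scores = {}
    for k in range(1, max_order + 1):
        features, target = design_matrix(x, k, start=max_order)
        _, loglik = fit_logit(features, target)
        scores[k] = aic(loglik, features.shape[1])
    return scores
```

Given a binarized series `x`, `select_order(x, 6)` returns one AIC value per order, and the order with the smallest value would be the one selected. Note that under the collapsing assumption the order-$k$ model has $2^k$ coefficients, so the fit can fail to converge if some history of length $k$ is always followed by the same move.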
[Figure 3: two panels, each plotting AIC against model order (1 to 6).]
Fig. 3. AIC values of the non-linear logit model for data A (triangle) and data B (circle). AIC shows a minimum value at the 5th order for both data sets.
[Figure 4: two panels, each plotting AIC against lag (2 to 14).]
Fig. 4. AIC values of the AR model for data A (triangle) and data B (circle). AIC decreases gradually as the order of the model grows, which means that no appropriate order can be selected.
[Figure 5: two panels, each plotting AIC against model order (1 to 5).]
Fig. 5. AIC values of the normal logit model for data A (triangle) and data B (circle). AIC shows a minimum, but at different orders for the two data sets.
5. The bias across different markets
In order to investigate the generality of our model, we are now analyzing tick-by-tick data for GE, one of the most active stocks on the NYSE. Interestingly, we found a much stronger bias in the GE data. We use two data sets of tick-by-tick "trade" data taken from normal trading hours, 9:30 am to 4:00 pm Eastern time, for October and December 2002. The number of data points is about 18,000 a day on average, so trades are more frequent than in the yen–dollar exchange rate data. We find a strong bias in the conditional probabilities, with an up and down symmetry, in both independent data sets. It should be noted that the bias in the GE data is stronger than that in the yen–dollar exchange rates. The comparison is shown in Fig. 6. Furthermore, such a strong bias exists not only in the GE data but also in data for other active stocks such as MSFT, INTL, WMT, PFE, CSCO, JPM, etc. So the bias in high-frequency data exists across different markets. We speculate that the bias is due to dealers' reflex actions, which may affect the market price within a very short time window. The time intervals of tick-by-tick data could be too short for conscious decisions based on news or events. The bias that we have found indicates that some notable rules might be visible in the aggregate of many people's reflex actions. Therefore, we expect that the model would be useful in this case as well.

[Figure 6: bar chart on a probability scale from 0 to 1 comparing USD/JPY and GE for P(+), P(+|+), P(+|+,+), P(+|-,+), P(+|+,+,+), P(+|-,+,+), P(+|+,-,+), P(+|+,+,-), P(+|+,+,+,+), P(+|-,+,+,+), P(+|+,-,+,+), P(+|+,+,-,+), P(+|+,+,+,-), P(+|-,-,+,+), P(+|-,+,-,+), P(+|-,+,+,-).]
Fig. 6. Comparison between the conditional probabilities of data set B of yen–dollar exchange rates and those of tick-by-tick GE stock data for December 2002. The tick-by-tick GE stock data have a much stronger bias in the conditional probabilities than the yen–dollar exchange rates. The error bars represent the 95% confidence interval on the mean.
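The error bars on the conditional probabilities are described as 95% confidence intervals on the mean, but the paper does not say how they were computed. As a rough sketch only, one common choice is the normal-approximation interval for a proportion. The function name `proportion_interval` is hypothetical, and the formula treats the counted steps as independent, which is only an approximation for correlated tick data.

```python
import math

def proportion_interval(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    successes: number of up moves observed after a given history.
    trials:    number of times that history occurred.
    z = 1.96 corresponds to a 95% interval (an assumed convention).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    half_width = z * math.sqrt(p * (1.0 - p) / trials)
    return p - half_width, p + half_width
```

Two estimated conditional probabilities whose intervals do not overlap, for instance one from the yen–dollar data and one from the GE data, would then be regarded as clearly different under these assumptions.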
6. Conclusion
In this paper, we have analyzed tick-by-tick data, the highest-frequency data available, from the yen–dollar exchange market, with a focus on up and down price movements. The proposed non-linear logit model captures the non-trivial probability structure in the binary tick-by-tick data, which was not possible with conventional methods such as AR models or normal logit models. This indicates that the new non-linear logit model is a useful tool for analyzing a wide range of binary time series with non-trivial dynamical structures. In order to explore the potential of our model, we are now analyzing tick-by-tick data for GE, one of the most active stocks on the NYSE. Interestingly, we have found a much stronger bias in the conditional probabilities of the binary GE stock data. We also obtain a similarly strong bias from other tick-by-tick stock data, such as MSFT, INTL, WMT, PFE, CSCO, JPM, etc. This bias in a very short time window may be attributable to dealers' reflex actions, in which some notable rules might be visible. Finally, although we did not discuss it in this paper, the bias can also be observed in the conditional probabilities of the three states "up", "down" and "unchanged" when the "unchanged" case, in which the price equals that of the previous trade, is included.
Acknowledgements
I would like to thank Toru Ohira of Sony Computer Science Laboratories and Yoshiyuki Kabashima of Tokyo Institute of Technology (TIT) for their fruitful comments and suggestions, and Jun-ichi Inoue of Hokkaido University, A.C.C. Coolen, Giulia Iori and Peter Sollich of King's College, London, and Toshiyuki Tanaka of Tokyo Metropolitan University for stimulating discussions. The NYSE data were obtained from King's College, London, while I was working at the college.

References
[1] M. Dacorogna, R. Gencay, U. Müller, R. Olsen, V. Pictet, An Introduction to High-Frequency Finance, Academic Press, San Diego, 2001.
[2] T. Ohira, N. Sazuka, K. Marumo, T. Shimizu, M. Takayasu, H. Takayasu, Physica A 308 (N1–4) (2002) 368.
[3] N. Sazuka, T. Ohira, K. Marumo, T. Shimizu, M. Takayasu, H. Takayasu, Physica A 324 (N1–2) (2003) 366.
[4] Y. Zhang, Physica A 269 (1999) 30.
[5] R. Tsay, Analysis of Financial Time Series, Wiley-Interscience Publication, New York, 2002.
[6] J. Aldrich, F. Nelson, Linear Probability, Logit, and Probit Models, Sage Publications, Beverly Hills, 1984.
[7] N. Sazuka, T. Ohira, Non-linear logit models for high frequency currency exchange data, in: Computational Finance and its Applications, WIT Press, 2004, pp. 297–305.
[8] H. Akaike, Information theory as an extension of the maximum likelihood principle, Proceedings of Second International Symposium on Information Theory, Budapest, 1973, pp. 267–281.
[9] M. Takayasu, H. Takayasu, M. Okazaki, Transaction Interval Analysis of High Resolution Foreign Exchange Data, in: H. Takayasu (Ed.), Empirical Science of Financial Fluctuations: The Advent of Econophysics, Springer, Tokyo, 2002, pp. 18–25.