Model selection of the generalized von Mises distribution based on empirical mode decomposition with data analyses

Model selection of the generalized von Mises distribution based on empirical mode decomposition with data analyses

Statistical Methodology 14 (2013) 15–25 Contents lists available at SciVerse ScienceDirect Statistical Methodology journal homepage: www.elsevier.co...

607KB Sizes 0 Downloads 49 Views

Statistical Methodology 14 (2013) 15–25

Contents lists available at SciVerse ScienceDirect

Statistical Methodology journal homepage: www.elsevier.com/locate/stamet

Model selection of the generalized von Mises distribution based on empirical mode decomposition with data analyses✩ Xu Qin a,∗ , Jiang-She Zhang b , Xiao-Dong Yan c a

School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, PR China

b

School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, PR China

c

Key Laboratory of Regional Climate-Environment Research for Temperature East Asia, Institute of Atmospheric Physics, Chinese Academy of Science, Beijing 100029, PR China

article

info

Article history: Received 9 November 2012 Received in revised form 31 January 2013 Accepted 31 January 2013 Keywords: Generalized von Mises Empirical mode decomposition Circular data Wind data

abstract This paper presents a method for selecting a distribution within the generalized von Mises (GvM) class. In this method, the logarithmic form of the GvM probability frequency function is regarded as the sum of a constant and several cosine functions with different frequencies. Based on the empirical mode decomposition (EMD) method, the corresponding logarithmic series is decomposed to several intrinsic mode functions (IMF) whose corresponding instantaneous frequencies (IF) are used to be the basis of the GvM model selection. The applications of the proposed method are illustrated using simulated circular data and real wind direction data. The results demonstrate that the method proposed here can provide a good choice for the GvM model selection. © 2013 Elsevier B.V. All rights reserved.

1. Introduction Circular data are the data which have no magnitude and can be conveniently represented as points on the circumference of a unit circle centered at the origin or as unit vectors connecting the origin to these points. Circular data arise quite frequently in many natural and physical sciences like Biology, Medicine, Ecology, Geology etc. [12,13]. For instance, some biologists study the flight directions of just-released birds as they disappear over the horizon [17]. Jammalamadaka et al. [11] discuss an

✩ This work is supported by the National Natural Science Foundation of China (Grant No. 51205046).



Corresponding author. E-mail address: [email protected] (X. Qin).

1572-3127/$ – see front matter © 2013 Elsevier B.V. All rights reserved. doi:10.1016/j.stamet.2013.01.007

16

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

interesting medical application where the angle of knee flexion was measured to assess the recovery of orthopedic patients [11]. Ecologists consider the prevailing wind direction as an important factor in many studies including those which involve pollutant transport [21]. Geologists study paleocurrents to infer about the direction of flow of rivers in the past [18] and analyze the paleomagnetic direction of the Earth’s magnetic pole to investigate the phenomenon of pole-reversal as well as in support of the hypothesis of continental drift [4]. Also, any periodic phenomenon with a known period such as a day, a month or a year, can be represented on the circle where the circumference corresponds to this period, by aggregating as necessary, data over several individuals or periods. Examples include arrival times of patients over the day, or the timing of breast cancer surgery within the menstrual cycle. As another example, the circle may be taken to represent the 365 days in the year and one could plot the occurrence of airplane accidents to see if they are uniformly distributed over the different seasons of the year. A circular random variable is one which takes values on the circumference of a circle, i.e. they are angles in the range (0, 2π ) radians or (0°, 360°). Because the circumference is a bounded closed space, this random variable must be analyzed by techniques differing from those appropriate for the usual Euclidean type variables. The circular probability distributions are used to fit the distribution of circular data [12]. Since the early 1970s, researchers have presented some circular probability distribution models [19,12,15,2]. Therein the von Mises distribution (vM) is the most commonly used circular distribution (illustrated in [14]). In 2007 Gatto and Jammalamadaka generalized the vM model and presented the generalized von Mises of order k (GvMk ) distribution, which was broad enough to cover unimodality as well as multimodality, symmetry as well as asymmetry of circular data [7,5]. In [7], the authors mainly considered the case of k = 2, i.e. the GvM2 distribution. In further research, researchers mainly consider the computer aspects and the corresponding information theory for the case of small k [5,6]. In practice, given one circular series, the modality of the circular data is unknown. We need to select a proper GvMk distribution model for fitting the distribution of the given circular data. However, in available literature, the study of model selection of the GvMk distribution is scarce. In [5], Gatto used the Akaike information criterion (AIC) and the measured entropy (ME) to select models. In that paper, the authors computed all possible GvM models given the value of order k and then obtained the corresponding values of AIC and ME. And the results of experiments show that the two model selection methods can provide similar results. For small k, the two methods of model selection are effective. However, just as stated by Gatto [5], cases of k > 2 lead to some practical difficulties, e.g. in the (k) computation of the normalizing constants G0 and in the computation of the estimators. In addition, the GvM2 distribution models are not always to be fitted well for some circular data in practice. In this paper, we try to search for an intuitive method for selecting GvM distribution models. We find that the logarithmic form of the density function of the GvM distribution can be regarded as the sum of a constant and several cosine functions which is similar to the result of empirical mode decomposition (EMD). The concept of EMD has been developed rapidly in many disciplines of science and engineering since Huang et al. invented the method [8,9,16]. The EMD algorithm has been designed for the time–frequency analysis of signals. It decomposes the signal in hand into a number of oscillatory modes called intrinsic mode functions (IMFs) from which we can estimate the corresponding instantaneous frequencies (IF) [1]. And based on the available IF series, we can get a choice of the k in the GvM model. The rest of the paper is organized as follows. In Section 2, we simply introduce the method of EMD. In Section 3, we give the model selection method based on EMD. In Section 4, several simulated examples and real data are carried out to illustrate the effectiveness of the method proposed in this paper. Finally, Section 5 summarizes the main results and makes relevant discussion. 2. EMD method The EMD method is necessary to decompose any data from non-stationary and non-linear processes into a series of IMFs which will yield meaningful IF through the Hilbert transform. Contrary to almost all the previous decomposition methods, EMD is empirical, intuitive, direct, and adaptive [8].

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

17

Given one time series x(t ), the algorithm of EMD can be summarized as follows: Algorithm 1. S1. Identify the extrema of x(t ). S2. Interpolate the local maxima (resp. minima) by cubic splines to form an upper envelope of x(t ), a1 (t ) (resp. lower envelope, a2 (t )). Calculate the mean of a1 (t ) and a2 (t ) to get the mean envelope m11 (t ) = (a1 (t ) + a2 (t ))/2. S3. Calculate h11 (t ) = x(t ) − m11 (t ). S4. Repeat S1–S3, until h1(k−1) (t ) − m1k = h1k (t ) satisfies the stopping criterion, i.e. h1k becomes an IMF component, where k denotes the sifting times. S5. Designate c1 (t ) = h1k (t ), then c1 (t ) is the first IMF component from x(t ). S6. Make r1 (t ) = x(t ) − c1 (t ). It is treated as the new data and subjected to the same sifting process as described above. Finally, the result is r2 (t ) = r1 (t ) − c2 (t )

.. . rn (t ) = rn−1 (t ) − cn (t ). The above decomposed process should stop when the residue, rn (t ), becomes a constant, or a monotonic function, or a function containing a single extremum. On the stopping criterion mentioned in step 4, Huang et al. presents successively two modes, the Cauchy type [8] and the CE(S,M) [9]. In this paper, we adopt the latter one. With the EMD method, we finally obtain the following equation x(t ) =

n 

ci (t ) + rn (t )

i=1

i.e. x(t ) is decomposed into the sum of n IMF components and one residue. Then we use the Hilbert transform to each IMF component in order to get the corresponding series [1]. 3. Model selection based on EMD The density function of the GvMk distribution is defined as f (θ; µ1 , . . . , µk , κ1 , . . . , κk )

=

1 (k)

2π G0 (δ1 , . . . , δk−1 , κ1 , . . . , κk )

exp

 k 

 κj cos j(θ − µj )

(1)

j =1

where κ1 , . . . , κk > 0, θ ∈ [0, 2π ), µ1 ∈ [0, 2π ), µ2 ∈ [0, π ), µk ∈ [0, 2π /k), δ1 = (µ1 − µ2 ) mod π , δ2 = (µ1 − µ3 ) mod (2π /3), . . . , and δk−1 = (µ1 − µk ) mod (2π /k). the normalizing (k) constant G0 is given by (k)

G0 (δ1 , . . . , δk−1 , κ1 , . . . , κk ) =

1 2π





exp{κ1 cos θ + κ2 cos 2(θ + δ1 )

0

+ · · · + κk cos k(θ + δk−1 )}dθ . Consider the logarithmic form of the density function of the GvMk distribution log f (θ; µ1 , . . . , µk , κ1 , . . . , κk )

=

k 

κj cos j(θ − µj ) − log[2π G(0k) (δ1 , . . . , δk−1 , κ1 , . . . , κk )].

(2)

j=1

(k)

In (2), log f (θ; µ1 , . . . , µk , κ1 , . . . , κk ) can be regarded as the sum of a constant ‘− log[2π G0 ]’ and several cosine functions with different frequencies ‘j’. The right side of (2) is similar to the result of empirical mode decomposition (EMD), which inspires us to use the EMD method to decompose the log(f ) series and further get the basis for selecting models.

18

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

Given a series of circular data {θi , i = 1, . . . , n}, we try to use the GvMk distribution to fit the data. The steps of the model selection method based on EMD are as follows Algorithm 2. S1. Give a rough estimate of the distribution density series {fi , i = 1, . . . , h + 1} at [0 : 2π /h : 2π ] by virtue of the given data {θi , i = 1, . . . , n}. S2. Apply the EMD method to the series{log(fi ), i = 1, . . . , h + 1}. And we get p IMF components and one residue. S3. Compute the IF of each IMF component. Then for each IF series, take the integer of the mean value. We can get p integral rough frequency series {mj , j = 1, . . . , p}. Thus we choose the following model f (θ; µm1 , . . . , µmp , κm1 , . . . , κmp )



1

=

exp

(k )

2π G0 m (µm1 , . . . , µmp , κm1 , . . . , κmp )

p 

 κmj cos mj (θ − µmj )

j =1

(k )

as the finally chosen GvM distribution model where the normalizing constant G0 m is given by (km )

G0

(µm1 , . . . , µmp , κm1 , . . . , κmp ) =

1







exp

 p 

0

 κmj cos mj (θ − µmj ) dθ .

(3)

j =1

Remark 1. In step 1, we use the circular kernel density estimation method to estimate the density function roughly [20,10]. And the circular kernel is taken as the von Mises kernel. In this paper, h is taken as 100. Given a series of circular data {θi }, the computation of circular kernel density estimation at θ used in this paper is as follows [20]: fˆ (θ; z ) =

n 

1

n(2π )I0 (z ) i=1

exp{z cos(θ − θi )}

where



z = 3nκˆ 2 I2 (2κ){ ˆ 4π 1/2 I0 (κ) ˆ 2 } −1

2/5

.

And the computation of κˆ used in this paper is [10]:

 q √ 1/3  q √ 1/3 + − − D − κˆ = − + D 2

2

1 6(t − 1)

where

 p 3

D=

3



2

+

t = C +S

 q 2 2

 2 1/2

,

,

p=

3(t − 1) − 2 24(t − 1)2

C = n− 1

n  i=1

, q=

4 − 9(t − 1) + 54(t − 1)2

cos(θi ), S = n−1

432(t − 1)3 n 

sin(θi ).

i =1

Remark 2. In step 2, the mode mixing problem of the EMD method is solved by virtue of masking signals [3]. Remark 3. For the case of p = 2, we can deduce the computation of the corresponding normalization (k ) constant G0 m , which is similar with the computation of (6) stated in [7]: If m2 can be divided by m1 , denote tm = m2 /m1 , δm = (µm1 − µm2 ) mod (2π /m2 ), (k )

G0 m = I0 (κm1 )I0 (κm2 ) + 2

∞  l =1

Itm l (κm1 )Ik (κm2 ) cos(tm κm1 lδm ).

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

19

Table 1 The model selection results of the eight simulation series. Data

Number of samples

Selected model

Data 1 Data 2 Data 3 Data 4 Data 5 Data 6

167 220 146 140 140 352

m1 m1 m1 m1 m1 m1

=1 =2 =3 = 1, m2 = 3 = 2, m2 = 3 = 2, m2 = 4

If m2 cannot be divided by m1 , (k )

G0 m = I0 (κm1 )I0 (κm2 ).

4. Experiments In order to show the effectiveness of the model selection method proposed in this paper, we apply it to six simulated series and four real wind data series. All programs are written in Matlab. The EMD is computed by running emd.m in the EMD package (http://perso.enslyon.fr/patrick.flandrin/emd.html). And the IF is computed by running hilbert .m. 4.1. Simulated data We mainly consider the selection of {mj } in Eq. (3). Six random circular series are generated by a standard ratio-of-uniform algorithm stated in [5]. And the distribution density functions of the six series are as follows Data 1 : f (θ; π , 1.2) ∝ exp{1.2 cos(θ − π )} Data 2 : f (θ; 0, 0.9, 0, 1.2) ∝ exp{1.2 cos 2(θ − 0.9)} Data 3 : f (θ; 0, 0, 1, 0, 0, 1.8) ∝ exp{1.8 cos 3(θ − 1)} Data 4 : f (θ; 0, 0, 1, 1.2, 0, 1.8) ∝ exp{1.2 cos θ + 1.8 cos 3(θ − 1)} Data 5 : f (θ; 0, 0, 0, 0, 1, 2) ∝ exp{cos 2θ + 2 cos 3θ } Data 6 : f (θ; 0, 0.5, 0, 0, 0, 0.2, 0, 1.2) ∝ exp{0.2 cos 2(θ − 0.5) + 1.2 cos 4θ }. For the six series, the values of {mj } are {m1 = 1}, {m1 = 2}, {m1 = 3}, {m1 = 1, m2 = 3}, {m1 = 2, m2 = 3}, {m1 = 2, m2 = 4} respectively. And the values of order k are 1, 2, 3, 3, 3, 4. We apply the method proposed to the GvM models for the six series. The decomposition results of log(f ) are shown in Figs. 1–6. Table 1 shows the results of model selection for these series. For the six series, the numbers of IMFs are 1, 1, 1, 2, 2, and 2 respectively. And the corresponding IF series of IMFs oscillate around 1, 2, 3, (1, 3), (2, 3), and (2, 4) respectively which are consistent with the actual results. We can see that the method based EMD can provide satisfied results of model selection for the simulated data under study. 4.2. Wind data In this section, we take four daily wind direction series from NCEP reanalysis data at 850 hPa in 2005 to illustrate the effectiveness of the method proposed in this paper. The four grid points are (112.5°E, 30°N), (117.5°E, 30°N), (112.5°E, 37.5°N), and (112.5°E, 27.5°N) respectively. For each series, the length of the sample is 365. For convenience, we use vM2 , vM3 , GvM13 and GvM23 to denote the case of p = 1, m1 = 2, the case of p = 1, m1 = 3, the case of p = 2, m1 = 1, m2 = 3 and the case of p = 2, m1 = 2, m2 = 3 respectively. The method proposed in this paper is applied to the four series. Figs. 7–10 show the decomposition results of the corresponding {log(f )} series for the

20

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

Fig. 1. The EMD result of the series {log(fi ), i = 1, . . . , 101} for Data 1.

Fig. 2. The EMD result of the series {log(fi ), i = 1, . . . , 101} for Data 2.

four grid points. We can see that the selected models for the four series are vM2 , GvM23 , GvM13 , vM2 respectively. We also use the selection method based on AIC to these four series. And we use the maximum likelihood (ML) method to estimate the parameters of the possible models including vM, vM2 , vM3 , GvM13 , GvM2 , GvM23 . The lower the AIC value is, the greater the fit. The AIC value results are shown in Table 2. For the four grid points, the selected models are vM2 , GvM23 , GvM13 , GvM2 respectively judging from the AIC values. Compared with the results obtained by the EMD method, the selected models are the same for grid points (112.5°E, 30°N), (117.5°E, 30°N), and (112.5°E, 37.5°N)

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

21

Fig. 3. The EMD result of the series {log(fi ), i = 1, . . . , 101} for Data 3.

Fig. 4. The EMD result of the series {log(fi ), i = 1, . . . , 101} for Data 4.

while the selected models are different for grid point (112.5°E, 27.5°N). However, we can see that the difference between the AIC value of GvM2 and the one of vM2 is not obvious for grid point (112.5°E, 27.5°N). Consequently, the model selection based on EMD without computation of AIC can still provide a satisfied choice. 5. Discussion and conclusions In this paper, we present an empirical method for model selection of the GvM distribution models based on the EMD method. For a circular data, we decompose the logarithmic form of the roughly estimated distribution density function to several IMF components whose corresponding frequencies

22

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

Fig. 5. The EMD result of the series {log(fi ), i = 1, . . . , 101} for Data 5.

Fig. 6. The EMD result of the series {log(fi ), i = 1, . . . , 101} for Data 6. Table 2 The AIC values (×103 ) of several given models for the grid data.

(112.5°E, 30°N) (117.5°E, 30°N) (112.5°E, 37.5°N) (112.5°E, 27.5°N)

vM

vM2

vM3

GvM13

GvM2

GvM23

1.3436 1.3267 1.2588 1.3229

1.2461 1.3251 1.3311 1.1728

1.3425 1.3160 1.3026 1.3333

1.3476 1.3012 1.2173 1.3147

1.2478 1.3130 1.2325 1.1603

1.2510 1.2995 1.2921 1.1645

are used to select the GvM distribution model. And the simulation and real experiments under study all illustrate its effectiveness.

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

23

Fig. 7. The EMD result of the series {log(fi ), i = 1, . . . , 101} for grid data on (112.5°E, 30°N).

Fig. 8. The EMD result of the series {log(fi ), i = 1, . . . , 101} for grid data on (117.5°E, 30°N).

The method proposed in this paper is an empirical selection for the GvM distribution model. Compared with the existing method based on AIC, we do not need to consider all the possible models and compute the corresponding parameters in the method proposed in this paper. In practice, the empirical method can reduce significantly the computational complexity and computation time. In this paper, we mainly consider the case of p = 2 taking account of the complexity of the (k ) normalization constant G0 m . For the case of p > 3, we will discuss this in a further study.

24

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

Fig. 9. The EMD result of the series {log(fi ), i = 1, . . . , 101} for grid data on (112.5°E, 37.5°N).

Fig. 10. The EMD result of the series {log(fi ), i = 1, . . . , 101} for grid data on (112.5°E, 27.5°N).

In addition, we mainly give the selection of k in the GvM distribution density function f (θ; µ1 , . . . , µk , κ1 , . . . , κk ). The study about estimation of the parameters is outside the scope of this paper. We will discuss it in a further study. References [1] B. Boashash, Estimating and interpreting the instantaneous frequency of a signal, part 2: algorithm and application, Proceeding of IEEE 80 (1992) 540–568. [2] J.A. Carta, C. Bueno, P. Ramirez, Statistical modelling of directional wind speeds using mixtures of von Mises distribution: case study, Energy Conversion and Management 49 (2008) 897–907.

X. Qin et al. / Statistical Methodology 14 (2013) 15–25

25

[3] R. Deering, F.J. Kaiser, The use of a masking signal to improve empirical mode decomposition, in:ICASSP2005, vol. 4, 2005, pp. 485–488. [4] M. Fuller, C. Laj, E. Herrero-Bervera, The reversal of the earth’s magnetic field, American Scientist 84 (1996) 552–561. [5] R. Gatto, Some computational aspects of the generalized von Mises distribution, Statistics and Computing 18 (2008) 321–331. [6] R. Gatto, Information theoretic results for circular distributions, Statistics 43 (2009) 409–421. [7] R. Gatto, S.R. Jammalamadaka, The generalized von Mises distribution, Statistical Methodology 4 (2007) 341–353. [8] N.E. Huang, Z. Shen, S.R. Long, et al., The empirical mode decomposition and the Hilbert spectrum for non-linear and non-stationary time series analysis, Proceedings of the Royal Society of London. Series A 454 (1998) 899–955. [9] N.E. Huang, M.L. Wu, S.R. Long, et al., A confidence limit for the empirical mode decomposition and Hilbert spectral analysis, Proceedings of the Royal Society of London. Series A 459 (2003) 2317–2345. [10] A.G. Hussin, I.B. Mohamed, Efficient approximation for the von Mises concentration parameter, Asian Journal of Mathematics and Statistics 1 (2008) 165–169. [11] S.R. Jammalamadaka, N. Nhadra, D. Chaturvedi, et al., Functional assessment of knee and ankle during level walking, Data Analysis in the Life Sciences (1986). [12] S.R. Jammalamadaka, A. SenGupta, Topics in Circular Statistics, World Scientific Publishing, 2001. [13] A. Lee, Circular data, Wiley Interdisciplinary Reviews: Computational Statistics 2 (2010) 477–486. [14] J.A. Mooney, P.J. Helms, I.T. Jolliffe, Fitting mixture of von Mises distributions: a case study involving sudden infant death syndrome, Computational Statistics & Data Analysis 41 (2003) 505–513. [15] A.M. Razali, A. Ahmad, A. Zaharim, et al. Fitting the probability distribution to wind direction data in Bangi, in: Seminar on Engineering Mathematics, 2008. [16] G. Rilling, P. Flandrin, P. Goncalvés, On empirical mode decomposition and its algorithms, in: IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, NSIP-03, 2003. [17] K. Schmidt-Koenig, On the role of loft, the distance and site of release in pigeon homing (the ‘cross-left experiment’), Biology Bulletin 125 (1963) 154–164. [18] S. SenGupta, J.S. Rao, Statistical analysis of crossbedding azimuths from the Kamthi formation around Bheemaram, Pranhita Godavari Valley, Sankhya¯ 28 (1967) 165–174. [19] O.E. Smith, Vector wind and vector shear models 0–27 km altitude for Caper Kennedy, NASA TM-X-73319, Florida and Vandenberg AFB, California, 1976. [20] C.C. Taylor, Automatic bandwidth selection for circular density estimation, Computational Statistics & Data Analysis 52 (2008) 3493–3500. [21] I.C. Ziomas, D. Melas, C.S. Zerefos, et al., Forecasting peak pollutant levels from meteorological variables, Atmospheric Environment 29 (1995) 3703–3711.