A Bayesian mixture model for clustering circular data

A Bayesian mixture model for clustering circular data

Computational Statistics and Data Analysis 143 (2020) 106842 Contents lists available at ScienceDirect Computational Statistics and Data Analysis jo...

637KB Sizes 1 Downloads 95 Views

Computational Statistics and Data Analysis 143 (2020) 106842

Contents lists available at ScienceDirect

Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda

A Bayesian mixture model for clustering circular data ∗

Carlos E. Rodríguez a , , Gabriel Núñez-Antonio b , Gabriel Escarela b a b

Department of Probability and Statistics, IIMAS-UNAM, Mexico Department of Mathematics, Universidad Autónoma Metropolitana (Iztapalapa), Mexico

article

info

Article history: Received 17 January 2019 Received in revised form 17 September 2019 Accepted 19 September 2019 Available online 27 September 2019 Keywords: Classification Label switching Projected normal Slice sampler Reversible jump

a b s t r a c t Clustering complex circular phenomena is a common problem in different scientific disciplines. Examples include the clustering of directions of animal movement in the wild to identify migration patterns, and the classification of angular positions of meteorological events to investigate seasonality fluctuations. The main goal is to develop a novel methodology for clustering and classification of circular data, under a Bayesian mixture modeling framework. The mixture model is defined assuming that the number of components is finite, but unknown, and that each component follows a projected normal distribution. Model selection is performed by jointly making inferences about the parameters of the mixture model and the number of components, choosing the model with the highest posterior probability. A deterministic relabeling strategy is used to recover identifiability for the components in the chosen model. Estimates of both the posterior classification probabilities and the scaled densities are approximated via the relabeled MCMC output. The proposed methods are illustrated using both simulated and real datasets, and performance comparisons with existing strategies are also given. The results suggest that the new approach is an appealing alternative for the clustering and classification of circular data. © 2019 Elsevier B.V. All rights reserved.

1. Introduction Directional data arise from observations that can be represented as unit vectors in a q-dimensional space, with the sample space being the unit sphere with dimension q − 1. Circular data are a particular case of directional data when directions can be represented in a 2-dimensional space, and thus the corresponding sample space is the unit circle. Measurements representing directions appear in several areas of research and are common in biology, geology, meteorology, ecology and environmental sciences. Examples can be found in the analysis of wind directions, migration patterns of some species, camera-trap records of activity pattern of mammalian species in the wild, propagation of cracks in materials, orientation of mineral deposits, departure directions of animals, adjustments in the timing of seasonal events that modulate climate impacts, and radiotherapy X-ray beams on a circle. Clustering analysis is a technique used to partition a complete dataset into meaningful groups of observations, where the observations within each group are similar in some sense. From a statistical perspective, clustering methodologies can be divided into two approaches: model-based and model-free (or non-parametric). Under the non-parametric view, a distance (or a similarity measure), combined with a computational strategy, is used for the partitioning of the data into clusters, e.g. k-means as in Lloyd (1982). Under a model-based setting, it is assumed that each cluster follows a probability ∗ Correspondence to: Department of Probability and Statistics, IIMAS, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Circuito Escolar s/n, C.P.04510, CDMX, Mexico. E-mail address: [email protected] (C.E. Rodríguez). https://doi.org/10.1016/j.csda.2019.106842 0167-9473/© 2019 Elsevier B.V. All rights reserved.

2

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

distribution, and thus mixture models can be implemented to model the underlying distribution of each cluster. Such mixture models provide a flexible tool to study datasets arising from irregular densities or to model heterogeneity. In the Bayesian framework of finite mixture model analysis, Markov Chain Monte Carlo (MCMC) techniques have been implemented when the number of components is assumed to be known (Diebolt and Robert, 1994). For the inference of an unknown number of components, Richardson and Green (1997) used reversible jump MCMC (Green, 1995; Stephens, 2000a) used a birth–death process, and Nobile and Fearnside (2007) proposed an alternative method requiring that the component parameters can be integrated out of the model analytically. Recently, Malsiner-Walli et al. (2016) proposed an MCMC strategy that consists in deliberately overfitting a finite Gaussian mixture model, imposing a sparse prior over the weights to empty superfluous components. For a recent and broad overview of the methods and applications of mixture models: including EM algorithm, Bayesian mixture models, model-based clustering, high-dimensional data and hidden Markov models, see Fruhwirth-Schnatter et al. (2019). Under a Bayesian setting, typically the main goal is to find the posterior distribution, and then compute posterior summaries via MCMC methods. However, the likelihood of a mixture model is invariant under permutation of the indices since, under symmetric priors, the posterior will inherit the likelihood’s invariance. As a result, in an MCMC algorithm, labels of the components can permute multiple times between iterations of the sampler, causing ergodic averages to become useless to estimate characteristics of the components. To deal with this matter, which is known as the label switching problem, a proposal is to undo the label switching deterministically, finding permutations of the parameters that minimize a loss function (e.g. Stephens (1997) and Rodríguez and Walker (2014a)). This can be regarded as a partition problem, where the latent allocations define partitions over the observations; here, an optimal partition has to be chosen from a sample of partitions (e.g. Rastelli and Friel (2018)). Several procedures have been proposed for the clustering of linear data, see e.g. Kaufman and Rousseeuw (2008) and Xu and Wunsch (2005); however, cluster analysis for circular data has not been fully developed, despite its potential implementation in various disciplines (e.g. Cressie (1977), Lund (1999), Gopal and Yang (2014), and Whitfield (2017)). It is important to point out that the majority of the proposals to tackle this problem have come from the frequentist perspective of statistics, and there have been fewer attempts from a Bayesian viewpoint. Now, if we work under topological spaces other tan Rk , such as the unit circle, the complexity increases. On the one hand, following a non-parametric approach, the difficult part is to define coherent measures of distance that allow the grouping of circular data, and these data have a periodic behavior. On the other hand, if we adopt a model-based perspective, we need to use distributions defined over the unit circle, and these always come with some challenges. A general approach has been simply to adapt procedures for linear data to the analysis of circular data (e.g. Ackermann (1997)); nevertheless, such an approach is ill-specified since it fails to consider the inherent topology of the unit circle. In this paper, we devise a methodology for clustering circular data under a Bayesian mixture modeling setting. First, we set a finite mixture model with an unknown number of components, assuming that each component follows a projected normal distribution defined on the unit circle. Second, we perform model selection by jointly making inferences about the parameters of the mixture and the number of components. Third, we select the model with highest posterior probability, and use a deterministic relabeling strategy to recover identifiability for its mixture components. Finally, we find estimates of both the posterior classification probabilities for the draws of observations and the densities for each cluster. The rest of this paper is organized as follows. In Section 2, we present the model and describe some of its characteristics. Our MCMC strategy to sample from the posterior distribution of parameters for each component of the mixture, as well as for the number of components, is described in Section 3; here, we first deal with the case of a fixed number of components and then move to the trans-dimensional case. In Section 4, we discuss an approach to deal with the label switching problem, which is based on the data and a deterministic relabeling strategy. In Section 5, we illustrate and compare our methodology with other alternatives using simulated data, and we also analyze two real datasets. We conclude with a discussion and an outline of future work. 2. The model 2.1. The projected normal distribution One relatively straightforward way to generate distributions for circular data is to radially project a distribution originally defined on the plane R2 onto the unit circle S. Let X = (X1 , X2 )t be a vector defined on R2 , and define the transformation to polar coordinates (R, Θ ), where X1 = R cos (Θ ) and X2 = R sin (Θ ). It follows that, integrating out R, the variable Θ has a distribution defined on S. An important instance is that in which X has a bivariate normal distribution, N2 (·|µ, Σ), with mean vector µ = E(X ) and variance matrix Σ. Then, Θ is said to have a projected normal distribution, which is denoted by PN (·|µ, Σ). Let

µ=

(

µ1 µ2

)

( and Σ =

σ12 ρσ1 σ2

) ρσ1 σ2 , σ22

then if θ ∼ PN (θ|µ, Σ), the probability density function (p.d.f.) of θ is given by p(θ|µ, Σ) =

(

1 2π A(θ )

)

⎡ 1

B(θ ) FZ

|Σ|− 2 exp {C } ⎣1 + √

A(θ ) f Z

(

√B(θ ) A(θ )

)⎤

(

√B(θ ) A(θ )

)⎦

1 (0,(θ2)π] ,

(1)

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

1

3

where ut = (cos(θ ), sin(θ )), A(θ ) = ut Σ−1 u, B(θ ) = ut Σ−1 µ and C = − 21 µt Σ−1 µ; here, is the indicator function, (0,2π ] and FZ (·) and fZ (·) denote the cumulative density function (c.d.f) and p.d.f. of a standard normal distribution, respectively. The properties of distribution in Eq. (1) were examined, for example, in Mardia (1972) and Wang and Gelfand (2013). The projected normal distribution depends on the parameters µ and Σ in a complex way, which has precluded its wider use in many applications (see, for example, Jammalamadaka and SenGupta (2001) and Wang and Gelfand (2013)). The parameters µ and Σ are not identifiable without further constraints, since, for any a > 0, taking µ∗ = aµ and Σ∗ = a2 Σ does not alter the distribution of the observed directions. One way of addressing this issue is to restrict the determinant of Σ to be equal to 1. A simpler approach, which we use here, is to set Σ = I , the 2 × 2 identity matrix, with the resulting distribution being denoted as PN (·|µ). Thus, the projected normal distribution becomes unimodal and rotationally symmetric about the mean direction φ , which is such that (cos(φ ), sin(φ )) = µ/∥µ∥. Under the above formulation, ∥µ∥ plays the relevant role in the concentration of Θ , since the mean resultant length of Θ is given by

ρ=



(·)

π ζ /2e−ζ [I0 (ζ ) + I1 (ζ )],

where ζ = ∥µ∥2 /4 and Iq (·) is the modified Bessel function of the first kind of order q, see Kendall and Harding (1974); also, the parameter ρ is a useful benchmark to compare models. It follows that PN (θ|µ) =

{

1 exp − ∥µ∥2 2π 2 1

}[ 1+

b(θ ) FZ (b(θ )) fZ (b(θ ))

]

1 (0,(θ2)π] ,

(2)

where b(θ ) = ut µ. Wang and Gelfand (2013) imposed the less restrictive constraint Var(X2 ) = 1 and showed that this leads to a more flexible circular density which can be asymmetric and bimodal. This idea has been recently generalized by Hernandez Stumpfhauser et al. (2017) to model directional data in higher dimensions. Even though the projected normal distribution can be used to deal with bimodality and some forms of asymmetry, the search for more flexible models is currently an active field of research (e.g., Ghosh et al. (2003) and McVinish and Mengersen (2008)). In this study, we define a cluster as a set of observations described by a unimodal and symmetric distribution. In order to avoid bimodality and asymmetry, we will focus our attention on the projected normal distribution with parameters (µ, Σ = I ), PN (θ|µ). This simpler model does not present identifiability problems, and its use avoids the need for a prior distribution over the parameters of the full model in Eq. (1). 2.2. Finite mixture of projected normal distributions In our model formulation, the finite mixture model with k components is defined as follows

w , M , k) = p(θi |w

k ∑

wj PN (θi |µj ), independently for i = 1, . . . , n,

(3)

j=1

where w = {wj }kj=1 , M = {µj }kj=1 and k is assumed to be unknown; here, the weights w are non-negative and sum up to one. Each component is a circular distribution with the location (mean direction) and the dispersion (concentration) defined by the vector µj = (µ1,j , µ2,j )t . The aim of this choice of components is to identify the number of mixtures, k, with the number of modes of the data. Hence, as described in the previous section, the best alternative is to avoid bimodality and identifiability problems for each mixture component. A useful idea, when working with mixture models, is to include the latent allocation vector z = {zi }ni=1 , such that, given zi , the component from which θi has been drawn is known. That is, p(θi |M , zi ) = PN (θ|µzi ),

w) ∝ with p(z|w

∏k

j=1

nj j

w and nj =

(4)

∑n

i=1

1

{zi =j}

. While Eq. (3) is employed to perform model based clustering, the latent

allocation vector z automatically induces a clustering structure over the observations, and thus the mixture model in Eq. (3) leads to the definition of a cluster as a set of observations, which can be modeled by a single projected normal distribution; here, k is the number of clusters within the data, wj is the weight of cluster j in the population, and PN (θ|µj ) is a parametric distribution that models the behavior of cluster j. 3. Inference In this section, we establish a procedure to sample from the posterior distribution of the parameters (w , M , k). First, in sub-Section 3.1, we describe a general algorithm for sampling from the posterior of µ. Second, in sub-Section 3.2, we outline a sampling procedure for the mixture model with a fixed number of components, and then, in sub-Section 3.3, we present a methodology for handling an unknown number of components. Our hybrid MCMC strategy follows the ideas in Tierney (1994) with the following steps:

4

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

1. Gibbs kernel. This procedure updates the Markov Chain, for fixed k. 2. Split–combine step. This move is intended to split a component into two or combine two components into one. It employs a mixture of two Metropolis–Hastings kernels, both attempting to move k. 3. Birth and death of empty components. This move is intended to add a new empty component or delete an existing one. It uses another mixture of two Metropolis–Hastings kernels, both attempting to move k. The preceding steps are iterated until the sampler has converged. This strategy has been performed by Richardson and Green (1997) to make inferences on the number of components in a finite mixture of normal distributions, and by Rodríguez and Walker (2014b), who replaced the univariate normal distributions in the mixture for flexible unimodal distributions. 3.1. Sampling the projected normal distribution Sampling from the posterior distribution of µ, when assuming the projected normal distribution model PN (θ|µ), is not straightforward. However, the inclusion of the strategic latent vector r = (r1 , . . . , rn ) leads to the sampling from the posterior of the means of two independent normal distributions (see Núñez-Antonio and Gutiérrez-Peña (2005)), thereby easing the computational burden. This procedure is described as follows: for each i, let (xi , yi ) ∼ N2 (µ, I ), that means, the joint density function of X and Y denoted by pX ,Y (xi , yi |µ) is a bivariate normal distribution with identity covariance matrix. Using the transformation (X , Y ) → (R, θ ) into polar coordinates, i.e., xi = ri cos(θi ) and yi = ri sin(θi ), where ri = ∥(Xi , Yi )∥, we obtain p(θi , ri |µ) = ri pX ,Y (ri cos(θi ), ri sin(θi )|µ)

} { 1 ∝ ri exp − (ri2 − 2ri b(θi ) + ∥µ∥) 2

1 (0(r, ∞) ) 1 (0(,θ2)π ] . i

i

Integrating out the latent variable, we have PN (θi |µ) =





p(θi , ri |µ)dri , 0

and thus the projected normal distribution is recovered. With the inclusion of the latent vector r, it is possible to go from the independent normals to the projected normal distribution. This idea also works backwards: first, start with the projected normal distribution model, and then introduce the latent vector r. Given a random sample {θ1 , θ2 , . . . , θn } from PN (θ|µ), we include the latent vector r = (r1 , . . . , rn ). Thus, after transforming xi = ri cos(θi ) and yi = ri sin(θi ), the likelihood of the model becomes p(x, y|µ) =

n ∏ i=1

{ n } { n } N(xi |µ2 , 1)N(yi |µ1 , 1) ∝ exp − (x¯ − µ1 ) exp − (y¯ − µ2 ) , 2 2

where x = {xi }ni=1 and y = {yi }ni=1 . Setting the prior p(µ|µ01 , τ01 , µ02 , τ02 ) = N(µ1 |µ01 , 1/τ01 )N(µ2 |µ02 , 1/τ02 ),

(5)

the posterior density for the parameter µ is given by p(µ|x, y) ∝ p(x, y|µ)p(µ|µ01 , τ01 , µ02 , τ02 ),

) ( ⏐ ∑n ) ⏐ ∑n x + µ τ 1 1 01 01 ⏐ i=1 i ⏐ i=1 yi + µ02 τ02 = N µ1 ⏐ N µ2 ⏐ . , , n + τ01 n + τ01 n + τ02 n + τ02 A Markov Chain Monte Carlo (MCMC) strategy to sample from the posterior of µ is described in the Supplementary (

material. 3.2. Finite mixture model, fix k In the mixture model of Eq. (3), with fixed k, we can include latent variables r1 , r2 , . . . , rn , and devise a Gibbs sampler to sample from the posterior distribution of the parameters (w , M , z). In this case, the full joint conditionals are proportional to

w) = p(x, y|M , z)p(z|w

n ∏ i=1

wzi p(xi , yi |µzi ) =

⎧ k ⎨ ∏ j=1



n

wj j

∏ {i:zi =j}

⎫ ⎬

p(xi , yi |µj )

.

(6)



In this study, we specify a symmetric Dirichlet distribution p(w |δ ) = Dir(w |δ, . . . , δ ) for the prior of the weights, independent normals, as in Eq. (5), for the prior of µj , j = 1, . . . , k, and finally, a hyper-prior Gamma for the priors of the scale parameters τ01 and τ02 . Here, Gamma distributions Ga(τ |α, β ) are parametrized such that the expectation is given by α/β . Since the concentration of the variable Θ is a function of the vector ∥µ∥, the vector (τ01 , τ02 ) can also be viewed as a parameter that contributes to the concentration of Θ . A complete description of the Gibbs sampler for (w , M , z) is provided in the supplementary material.

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

5

3.3. Finite mixture model, moving k For the moving k case, the acceptance probability for the Metropolis–Hastings is obtained via the product space model (Godsill, 2001). In our implementation, we use the derivation in Rodríguez and Walker (2014a), which can be seen as a reversible jump sampler (Green, 1995), and as a ‘‘standard’’ Metropolis–Hastings under the product space formulation. 3.3.1. Split–combine steps In the split move, component j is chosen randomly from one of the k mixture components and split into components j1 and j2 . In the combine step, this is exactly the ‘‘inverse move’’; here, the components j1 and j2 are selected randomly from {1, . . . , k + 1} and combined into a component j, with the number of components being within the finite set {1, 2, . . . , kmax }. The integer kmax depends on the prior on k, e.g. this could be discrete uniform over the range {1, 2, . . . , kmax } or a truncated Poisson distribution over the same set. If bk and dk denote the probabilities of proposing the split or combine moves, respectively, then they can be formulated as bk = dk = 12 for k = 2, 3, . . . , kmax − 1 and b1 = dkmax = 1 where bk + dk = 1 for k = 1, . . . , kmax . With a mixture of k components, the first step is to make a random choice between the split or combine moves. This choice is made with probabilities bk and dk , as indicated above. Let us assume the split move was chosen, then to achieve the dimensional matching, the random variables u ∼ beta(2, 2), ϵ1 ∼ N(0, 1), and ϵ2 ∼ N(0, 1), are generated and, then,

wj1 = uwj ↔ wj2 = (1 − u)wj , µ1j1 = µj − ϵ1 ↔ µ1j2 = µj + ϵ1 , µ2j1 = µj − ϵ2 ↔ µ2j2 = µj + ϵ2 . These proposals are similar to those used by Cappé et al. (2003), and originally inspired by Richardson and Green (1997). Thus, the acceptance probability for the split is given by

{ } πk+1 αk,k+1 = min 1, , πk with

∏ πk+1 = πk

i∈Sj1



N(xi |µ1si , 1)N(yi |µ2si , 1) Sj2



δ−1+nj1

(k + 1)dk+1 p(k + 1) wj1 bk

N(xi |µ1j , 1)N(yi |µ2j , 1)

p(k)

δ−1+nj

wj

δ−1+nj2

wj2

B(δ, kδ )

i∈Sj

×

N(µ1j1 |µ01 , 1/τ01 )N(µ1j2 |µ01 , 1/τ01 )N(µ2j1 |µ02 , 1/τ02 )N(µ2j2 |µ02 , 1/τ02 ) N(µ1j |µ01 , 1/τ01 )N(µ2j |µ02 , 1/τ02 )

× {beta(u|2, 2)N(ϵ1 |0, 1)N(ϵ2 |0, 1)Palloc }−1 × 4wj ,

(7)

where Sl = {zi = l}. The transformation for the combine is obtained directly from the split as follows:

wj = wj1 + wj2 ↔ u = µ1j = µ2j =

(µ1j1 + µ1j2 ) 2

(µ2j1 + µ2j2 ) 2

↔ ϵ1 = ↔ ϵ2 =

wj1 , wj1 + wj2 (µ1j1 − µ1j2 ) 2

(µ2j1 − µ2j2 ) 2

, ;

{ [

here, the acceptance probability for the combine is given by αk,k−1 = min 1,

πk πk−1

]−1 }

.

3.3.2. Birth and death for empty components In the birth move, a new empty component is generated, whilst on the death move, a random choice is made between any existing empty components, for which no observations have been assigned, and then the chosen component is deleted. Here, just proposals for the continuous variables are needed, whereas for the birth, a weight and parameters for the proposed new component are drawn as:

wj∗ ∼ beta(1, k),

µ1,j∗ ∼ N(µ01 , 1/τ01 ),

µ2,j∗ ∼ N(µ02 , 1/τ02 ),

6

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

with the existing weights being rescaled so that they sum up to 1, i.e.

wj′ = wj (1 − wj∗ ). In the death process for the existing empty components, an empty component is randomly chosen and deleted, and the remaining weights are rescaled to sum up to 1; i.e. if j∗ is the chosen empty component then wj′ = wj /(1−wj∗ ). In this case,

πk+1 p(k + 1) 1 1 n+kδ−k = wδ− (k + 1) ∗ (1 − wj∗ ) πk p(k) B(kδ, δ ) j ×

1

dk+1

(k0 + 1)bk beta(wj∗ |1, k)

(1 − wj∗ )k−1 ,

where, k0 is the number of empty components for a mixture of k components, see Richardson and Green (1997, 1998) for details. To attempt the move from a mixture of k components to one of k − 1, we use again αk,k−1 . This completes the specification of the moving k steps. 4. Dealing with the label switching problem The likelihood of a mixture model is invariant under permutations of its parameters and, under symmetric priors across the components, the corresponding posterior distribution inherits such invariance. Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations and, as a result, the hidden groups cannot be identified, making ergodic averages useless when it comes to estimating characteristics of the components. In the mixture models literature, this has been called the label switching problem, see for example Stephens (2000b) and Rodríguez and Walker (2014a). A possible solution to the label switching problem is based on a deterministic relabeling strategy. Suppose a sample from the posterior distribution is generated with copies of (w t , M t ) ∼ p(w , M |θ ), where t = 1, . . . , h indexes the iterations of the MCMC algorithm. Then, we need to find a permutation ρt ∈ Pk (the set of the k! permutations of the labels {1, 2 . . . , k}), such that the following permuted sample recovers identifiably for the mixture components:

ρt (w t , M t ) = ((wρt t (1) , . . . , wρt t (k) ), (µtρt (1) , . . . , µtρt (k) )),

(8)

for t = 1, . . . , h. To find such permutations, the ideas in Rodríguez and Walker (2014a) are followed closely. First, define a loss function, second, select a pivot variable and, finally, minimize the loss given the pivot. The loss function is defined to connect the information about the clusters in the latent variables and the data. The objective is to reduce the variability of the loss function, leading to a stable and robust relabeling algorithm. Let Lt be the loss function to minimize, and clt,j the associated cost if permutation ρt (l) = j is chosen. The minimization step on the sampled values at iteration t can be formulated via the so-called assignment problem, see Burkard et al. (2009). Hence, we obtain the following optimization problem: Minimize Lt =

k k ∑ ∑

al,j clt,j ,

(9)

l=1 j=1

subject to

k ∑

al , j =

j=1

k ∑

al,j = 1, (l, j = 1, . . . , k),

l=1

al,j ∈ {0, 1},

(l, j = 1, . . . , k).

Thus, we need to solve Eq. (9) by minimizing over the (al,j ); if for some t, (ˆ al,j ) is an optimal solution for the problem and it is that ˆ al,j = 1, then this corresponds to the permutation ρt (l) = j. 4.1. Data based relabeling for circular data Our pivot variables will be taken as the centers of the groups {m1 , . . . , mk }, which are assumed to be known. To undo the label switching, we find the permutation ρt ∈ Pk , t = 1, . . . , N, that minimizes the loss function in Eq. (9) with the following costs: cl,j = nj −



cos(θi − ml ),

{i:zi =j}

= nj − cos(ml )

∑ {i:zi =j}

cos(θi ) − sin(ml )

∑ {i:zi =j}

sin(θi ).

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

7

In most cases, the centers are not known, but if the MCMC algorithm has converged, the data and the allocation variables can be used to find estimates {m ˆ1 , . . . , m ˆk }. Thus, the strategy is divided into two steps: first, find estimates for the centers, and then, determine permutations to recover identifiability for the mixture components. This strategy is described in Algorithms 1 and 2. Algorithm 1 Data based relabeling for circular data: finding estimates Require: θ , zt and nt1 , nt2 , . . . , ntk for t = 1, . . . , h 1: for j = 1 to k do s 2: Initialize nm j = 1, nj = 1 3: 4: 5: 6:

Initialize m ˆj = θ(1) + R k+1 {with R = θ(n) − θ(1) } end for for t = 1 to h do Find ρt solving Eq. (9) with j

(

)



ˆl ) clt,j = ntj − cos(m

cos(θi ) − sin(m ˆl )

{i:zit =j}

7: 8: 9: 10:

11:

12: 13: 14: 15: 16: 17: 18: 19: 20: 21:

for j = 1 to k do if ntρt (j) > 0 then m ˆj = (nm ˆj j − 1)m ∑ 1

µ1j =

µ2j =

nρt (j)t 1



sin(θi )

{i:zit =j}

cos(θi )

i:zit =ρt (j)

{

}



sin(θi ) {i:zit =ρt (j)} if arctan(µ2j , µ1j ) > 0 then ) 1 ( m ˆj = m m ˆj + arctan(µ2j , µ1j ) nρt (j)t

nj

else m ˆj =

) 1 ( m ˆj + arctan(µ2j , µ1j ) + 2π m nj

end if m nm j = nj + 1 end if end for end for return m ˆ1 , m ˆ2 , . . . , m ˆk

Algorithm 2 Data based relabeling for circular data: finding the permutations Require: m ˆ1 , m ˆ2 , . . . , m ˆk , θ , zt and nt1 , nt2 , . . . , ntk for t = 1, . . . , h 1: for t = 1 to h do 2: Find ρt solving Eq. (9) with clt,j = ntj − cos(m ˆl )



cos(θi ) − sin(m ˆl )

{i:zit =j}

3: 4:



sin(θi )

{i:zit =j}

end for return ρt , for t = 1, 2 . . . , h

After relabeling the MCMC output via Algorithms 1 and 2, the estimation of the posterior classification probabilities is carried out via p(zi = j|θi ) ≈

N 1 ∑

N

t =1

wjt PN (θi |µtj ) , ∑k t t l=1 wj PN (θi |µj )

(10)

8

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

and then each observation is classified into the component with highest probability, the so-called single best clustering algorithm. Then, scaled predictive components are approximated by calculating E(wj PN (θi |µj , I )|θ ) ≈

N 1 ∑

N

wjt PN (θi |µtj , I ),

(11)

t =1

and posterior estimates for the parameters are computed via E(µj |θ ) ≈

N 1 ∑

N

µtj and E(wj |θ ) ≈

t =1

N 1 ∑

N

wjt ,

(12)

t =1

where (w t , M t ) ∼ p(w , M |θ, k) for t = 1, . . . , N. 5. Simulation study and illustrations In this section, we illustrate our methodology using five simulated datasets, a dataset previously analyzed in the literature, and a real dataset concerning animal activity patterns. 5.1. Models for the simulation study For the simulation study, we analyzed data from the following models: Model 1 : g1 (θ ) = 0.5 PN (θ|µ1 ) + 0.2 PN (θ|µ2 ) + 0.3 PN (θ|µ3 ), where Model 2 : where Model 3 : where Model 4 : where Model 5 : where

µ1 = (0, 4)t µ2 = (−4, 0)t and µ3 = (−4, −4)t . g2 (θ ) = 0.3 PN (θ|µ1 ) + 0.2 PN (θ|µ2 ) + 0.3 PN (θ|µ3 ) + 0.2 PN (θ|µ4 ), µ1 = (0, 3)t , µ2 = (−2, 0.5)t , µ3 = (−0.75, −1.5)t and µ4 = (2.3, −2.3)t . g3 (θ ) = 0.1 PN (θ|µ1 ) + 0.2 PN (θ|µ2 ) + 0.05 PN (θ|µ3 ) + 0.3 PN (θ|µ4 ) + 0.15 PN (θ|µ5 ) + 0.2 PN (θ|µ6 ), µ1 = (4, 2.5)t , µ2 = (0.2, 3)t , µ3 = (−3, 5)t , µ4 = (−3, 0.4)t , µ5 = (−2, −2)t and µ6 = (6, −2)t . g4 (θ ) = 0.3v M(θ|µ1 , κ1 ) + 0.7v M(θ |µ2 , κ2 ), µ1 = π/4, κ1 = 5, µ2 = π, κ2 = 3 g5 (θ ) = 0.4 PN (θ|µ1 , Σ1 ) + 0.6 PN (θ|µ2 , Σ2 ), µ1 = (2.5, 1.5)t , σ11 = 1, σ12 = 0.3, ρ1 = −0.7 and, µ2 = (−3, −3.5)t , σ21 = 2, σ22 = 2.5, ρ2 = −0.4.

These models were designed to present challenging scenarios: different number of clusters, different degrees of overlapping, and small weights. Models 4 and 5 are mixtures of von Mises and projected normal distributions (Σ1 ̸ = I and Σ2 ̸ = I ), respectively. Thus, components of the Model 4 do not follow a projected normal distribution, and components of the Model 5 are unimodal but asymmetric; hence, we assess robustness of the approach with respect to deviations from our fundamental assumption: the distribution of each component is a unimodal projected normal distribution, which is rotationally symmetric about its mean direction. The density functions of the five mixture models in the simulation study are shown in Fig. 1. 5.2. Inference The following settings for the unspecified constants of the model are used in the inference process (see, subSection 3.2.):

µ01 = µ02 = 0, α1 = α1 = β1 = β2 = 0.275, δ = 1. Also, under the assumption of no information, a discrete uniform prior over k is considered; that is, p(k) =

1 kmax

with kmax = 15.

1 (1, . .(k) .,k

max )

,

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

9

Fig. 1. Densities for models 1 to 5.

For all the datasets analyzed, we used the simulation algorithms described in the previous sections to approximate the posterior for k. In all cases, we ran our hybrid sampler for 600,000 iterations, with the first 500,000 iterations being used as a burn-in period. The starting point for k in all the runs was k = 15. From these runs, k was estimated by simply taking the mode of p(k|θ ), then the relevant MCMC output corresponding to this k was extracted. Algorithms 1 and 2 were applied to these MCMC outputs. Thus, with the permuted samples, we estimated posterior classification probabilities via Eq. (10), and then the single best clustering was determined by calculating

ˆ zi = arg max{p(zi = j|θi )}, for i = 1, . . . , n. j=1,...,k

5.3. Comparisons To evaluate the performance of our proposal, we have included comparisons of our method with the following four strategies, which are all available via the corresponding functions and packages in R. 1. Function Mclust from package mclust. A Gaussian mixture model (with variable variance across the mixture components) fitted by the EM algorithm, with the optimal model being chosen according to the BIC criterion (Scrucca et al., 2016). 2. Function mix.vmf from package Directional. A model based clustering for circular data assuming von Mises– Fisher distributions. The function requires a matrix expressed as unit vectors, which we calculate as (x = cos(θ ), y = sin(θ ))n×2 . The function bic.mixvmf was first used to choose the number of components (Tsagris et al., 2018). 3. Function cec from package CEC. This cross entropy clustering divides the data into Gaussian type clusters. The function automatically returns the optimum number of clusters based upon the concentration of the clusters beyond that expected from uniformly distributed data (Kamieniecki and Spurek, 2018). 4. Function kmeans from R base and author’s code. The circular data are clustered by the k-means method, which aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centers is minimized using the algorithm of Forgy (1965). The function uses k-means to generate from 2 clusters to 10, and the number of clusters in the chosen model is that with the lowest number of clusters, whose ratio of the between/total sum of squares is at least 90% of the maximum. These and other clustering strategies, for circular data, were compared by Whitfield (2017) and found that none of the methods performed consistently well.

10

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

To assess the quality of the clustering generated by each algorithm, the following agreement indices were used: the Jaccard similarity coefficient, the Rand Index (RI) and the Adjusted Rand Index (ARI), which are available via the function adjustedRand in the library clues of R, see Chang et al. (2010). The Jaccard coefficient measures similarity between the estimated and true clustering, and is defined as the size of the intersection divided by the size of the union of the two partitions. The Rand Index is a measure of assignment of cluster members to the correct cluster (Rand, 1971), and the ARI penalizes for cases assigned to the incorrect cluster and for having an incorrect number of clusters, see Hubert and Arabie (1985). It is important to highlight the fact that, though the Jaccard and Rand coefficients may only yield values over the interval [0, 1], the ARI could yield negative values if the index is less than the expected index. Classifications with Jaccard, RI and ARI index closer to 1 are always better. 5.4. Simulation study To assess the performance of our proposal, 1. We drew one random sample of size n = 20 from Model 1, and obtained the true clustering directly while simulating the data, via i.i.d

zi ∼ Mult(z |0.5, 0.2, 0.3) H⇒ θi ∼ PN (θ|µzi ). Thus, z = (z1 , . . . , z20 ) is the true clustering. 2. We estimated the number of groups and clustering with our proposal, and also with the four alternative methods. 3. We calculated the three similarity coefficients for the clusterings that were produced with each strategy. We repeated the same steps with other 99 random samples of size n = 20, hence generating information using 100 samples of size n = 20, and performed the same process for samples of sizes n = 50, 100 and 300, completing all simulations for Model 1. We then proceeded in the same manner for models 2 to 5. 5.5. Results



Tables 1 and 2 show summaries of the simulation study: k is the mode of the 100 estimates, 1(kt =k) is the number of estimates that coincide with the true number of clusters, and Jaccard, RI, and ARI are the medians of each of the similarity coefficients. We call our proposal Bayesian mixture of projected normal distributions (BMPN). Observing the summaries corresponding to Model 1, k-means is the best algorithm for size n = 20 since it estimates correctly the number of groups in 58 samples, and obtains maximum values for the medians of the three agreement indices. For n = 50, BMNP gives the best results, followed closely by k-means. For samples of size n = 100, again BMNP gives the best results, followed by the Mclust alternative, and for n = 300, Mclust marginally beats BMPN. In the case of Model 2, summaries of samples of size n = 20 show that there is a competition between CEC and k-means, with CEC delivering the best accuracy for estimating the number of clusters, and k-means giving the best clustering; also, for n = 50, k-means is the best algorithm overall, whereas for samples of sizes n = 100 and 300 BMNP is the best alternative. In Model 3 we observe a similar pattern as in Models 1 and 2, and note poor performances overall when it comes to estimating the number of groups with samples of size n = 20 and 50, which is a consequence of the fact that larger random samples are needed to recover the 6 modes of the mixture model. It possible to notice that CEC and k-means are the best strategies for small sample sizes, while BMNP is the best approach for larger sample sizes. Models 4 and 5 do not follow the hypothesis of BMNP; however, it is the best strategy for Model 4 overall. For the various sample size scenarios, either Mclust or CEC is the runner-up for both models, and k-means is the best algorithm for Model 5. It can be noticed that the performance of BMNP decreases when the sample size increases. The cause of such poor performance is similar to that when data from a mixture of skew normal distributions are fitted with a mixture of normals: to model the asymmetry, two or more normal distributions are needed. Data generated from models 1, 2 and 3, were obtained using a mixture of symmetric projected normal distributions, and thus, judging from the corresponding summaries, we can conclude that, for small sample sizes, k-means is the best alternative, while BMPN is the best strategy for moderate to large sample sizes. Also, results for Model 4 suggest that BMPN also performs well for data sets with few and symmetric unimodal clusters. Results for Model 5 indicate that, in the presence of asymmetric clusters and small samples sizes, BMPN has acceptable performance; however, when the sample size increases, BMPN overestimates the number of groups, with the corresponding estimate approaching k = 4. To have a better picture of this shortcoming, we have further analyzed the results for a single random sample of size n = 300, and used it to estimate k, along with the classification and scaled densities. The results are displayed in Fig. 2, where the density estimate is shown on the left hand side. For this random sample, p(k = 4|θ ) ≈ 0.45, which is the maximum; thus, the estimated number of clusters is 4 and when we know the true number of groups is 2. The clustering and the scaled density estimates for k = 4 depicted in Fig. 2 indicate that, to model a single asymmetric projected normal distribution, we need two symmetric projected normal distributions.

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

11

Table 1 ∑ Summaries of the simulation study: sample sizes n = 20 and 50. k is the mode of the 100 estimates, 1(k t =k) is the number of estimates that coincide with the true number of clusters, Jaccard, RI, and ARI are the medians of each of the similarity coefficients. Model

1

2

3

4

5

Algorithm

n = 20 k



BMPN

2

Mclust Mix von Mises

n = 50 Jaccard

RI

ARI

k



32

0.82

0.92

0.84

3

2

29

0.81

0.92

0.84

2

18

0.80

0.91

0.82

CEC

4

39

0.59

0.84

k-means

3

58

0.88

0.95

k



BMPN

2

Mclust

2

Mix von Mises

1(k t =3)

Jaccard

RI

ARI

72

0.93

0.97

0.94

3

48

0.86

0.93

0.86

2

15

0.79

0.90

0.80

0.67

6

11

0.71

0.89

0.77

0.90

3

60

0.88

0.96

0.91

Jaccard

RI

ARI

1(k t =3)

Jaccard

RI

ARI

k



0

0.43

0.71

0.46

3

12

0.48

0.76

0.50

6

0.38

0.69

0.41

2

2

0.44

0.70

0.44

1

1

0.27

0.27

0.00

1

4

0.27

0.27

0.00

CEC

3

32

0.48

0.81

0.59

7

6

0.36

0.80

0.45

k-means

3

13

0.53

0.83

0.63

3

29

0.53

0.83

0.59

k



Jaccard

RI

ARI

k



Jaccard

RI

ARI

BMPN

2

0

0.35

0.66

0.38

3

2

0.46

0.79

0.51

Mclust

2

4

0.33

0.63

0.34

2

0

0.33

0.58

0.28

Mix von Mises

1

0

0.20

0.20

0.00

1

0

0.24

0.24

0.00

CEC

4

2

0.43

0.82

0.55

9

12

0.44

0.86

0.56

k-means

3

0

0.52

0.84

0.64

3

0

0.53

0.83

0.60

k



Jaccard

RI

ARI

BMPN

2

Mclust

1

Mix von Mises

1(k t =4)

1(k t =6)

1(k t =4)

1(k t =6)

Jaccard

RI

ARI

k



83

0.84

0.90

0.81

2

87

0.93

0.96

0.92

42

0.68

0.73

0.39

2

67

0.87

0.92

0.84

1

32

0.61

0.67

0.00

2

52

0.63

0.75

0.51

CEC

2

34

0.61

0.74

0.51

4

13

0.69

0.81

0.61

k-means

3

10

0.53

0.72

0.50

4

1

0.52

0.72

0.48

k



Jaccard

RI

ARI

BMPN

2

Mclust Mix von Mises

1(k t =2)

1(k t =2)

Jaccard

RI

ARI

k



74

1.00

1.00

1.00

2

59

1.00

1.00

1.00

2

80

1.00

1.00

1.00

2

81

1.00

1.00

1.00

2

78

1.00

1.00

1.00

2

77

1.00

1.00

1.00

CEC

3

18

0.62

0.80

0.62

3

7

0.65

0.82

0.66

k-means

2

86

1.00

1.00

1.00

2

94

1.00

1.00

1.00

1(k t =2)

1(k t =2)

5.5.1. Computations and storage requirements All the experiments were performed in an iMac (processor 3.5 GHz Intel Core i5) running macOS High Sierra. We coded our approach in C and used the .C interface to R to have a friendly data input, see Peng and Leeuw (2002) for a good introduction. The analysis and graphics were implemented in R (R Core Team, 2019). To solve Eq. (9) efficiently via the Hungarian method, we borrowed the R function solve_LSAP from the library clue, whose source code is written in C, see Hornik (2019). The reported CPU times were measured using the function system.time of R. To assess the performance of our proposal, we generated a random sample of n = 300 from Model 3. This is the only dataset that we analyze in this section.

12

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

Table 2 Summaries of the simulation study: sample sizes n = 100 and 300. k is the mode of the 100 estimates, ∑ 1(k t =k) is the number of estimates that coincide with the true number of clusters, Jaccard, RI, and ARI are the medians of each of the similarity coefficients.

Model

Algorithm

n = 100 k

1

ARI

k



1(k t =3)

Jaccard

RI

ARI

0.94

0.98

0.96

3

96

0.94

0.98

0.95

Mclust

3

74

0.93

0.97

0.94

3

99

0.94

0.98

0.95

Mix von Mises

2

21

0.77

0.89

0.78

4

3

0.63

0.86

0.67

CEC

4

20

0.84

0.94

0.87

3

68

0.93

0.97

0.94

k-means

3

70

0.93

0.98

0.95

3

81

0.94

0.98

0.95

Jaccard

RI

ARI



1(k t =4)

Jaccard

RI

ARI

k



1(k t =4)

BMPN

3

34

0.55

0.83

0.60

4

78

0.59

0.87

0.66

Mclust

2

4

0.46

0.71

0.43

3

12

0.52

0.81

0.55

Mix von Mises

3

6

0.48

0.75

0.49

4

53

0.55

0.85

0.61

CEC

7

1

0.45

0.83

0.52

7

5

0.52

0.85

0.58

k-means

3

26

0.54

0.82

0.59

3

22

0.51

0.81

0.54

Jaccard

RI

ARI

k

Jaccard

RI

ARI



1(k t =6)



1(k t =6)

BMPN

4

8

0.54

0.84

0.61

6

49

0.64

0.91

0.72

Mclust

2

1

0.39

0.63

0.34

3

0

0.52

0.82

0.58

3

1

0.48

0.81

0.54

5

2

0.56

0.88

0.65

10

10

0.46

0.86

0.55

5

22

0.55

0.86

0.61

0

0.54

0.83

0.60

3

0

0.52

0.82

0.58

Jaccard

RI

ARI

Mix von Mises

k-means

3 k



1(k t =2)

Jaccard

RI

ARI

k



1(k t =2)

BMPN

2

87

0.92

0.95

0.89

2

82

0.91

0.95

0.89

Mclust

3

48

0.88

0.92

0.84

3

16

0.89

0.94

0.87

Mix von Mises

3

43

0.58

0.76

0.54

3

1

0.54

0.73

0.48

CEC

2

28

0.83

0.89

0.78

2

70

0.87

0.92

0.82

k-means

4

0

0.51

0.71

0.45

4

0

0.50

0.70

0.44

1(k t =2)

Jaccard

RI

ARI

k

5

RI

88

CEC

4

Jaccard

3

k

3

1(k t =3)

n = 300

BMPN

k

2





1(k t =2)

Jaccard

RI

ARI

k



BMPN

2

42

0.93

0.96

0.93

4

1

0.59

0.79

0.59

Mclust

2

80

1.00

1.00

1.00

2

45

0.93

0.96

0.93

Mix von Mises

2

58

0.96

0.98

0.96

3

22

0.63

0.80

0.62

CEC

3

15

0.77

0.88

0.77

2

47

0.93

0.96

0.93

k-means

2

95

1.00

1.00

1.00

2

100

1.00

1.00

1.00

The BMPN strategy takes 3.6 min over all: run 600,000 iterations for the MCMC, generate the posterior for k, extract the relevant MCMC output, undo the label switching, estimate the posterior probabilities and generate the single best clustering. Also, to store the MCMC output and the classification probabilities for the relevant k, 966 MB of space were used. With the alternatives, function Mclust estimated the number of groups and the partition for the clustering in 0.028 s, the functions of the libraries Directional and CEC took 0.4 s and 0.27 s, respectively, to perform both tasks. Finally, for the k-means approach, it took only 0.014 s. We did not notice any storage requirements of these methods.

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

13

Fig. 2. Density estimate (left), scaled predictive densities and single best clustering for a random sample of size n = 300 from Model 5, k = 4. Table 3 Posterior distribution of k for the turtle data. k

1

2

3

>3

p(k |θ )

0.00068

0.46

0.3

0.235

5.6. Real-data examples In this section, we apply our methodology to the analysis of two real datasets. The number of iterations, burn in period and unspecified constants of the model are the same as for the simulated datasets. Here we took advantage of our model base framework, and were able to perform density and parameter estimation. 5.6.1. Example 1 For our first analysis, we consider the well known turtles dataset. These data are formed by the directions taken by 76 turtles after treatment. Fig. 3(a) shows a circular plot of these data, see Stephens (1969) for a complete description. Mardia and Jupp (2000) described these data as a bimodal distribution with modes roughly 180◦ apart. The dominant mode is located in the interval (60◦ , 90◦ ) and the subsidiary mode in (240◦ , 270◦ ). In addition, these data have been analyzed in the literature via a mixture of two von Mises distributions. A comparative study of different procedures is presented in Chang-Chien et al. (2012), where it is pointed out that the concentration parameters of the corresponding components have not been well estimated. In Table 3 posterior probabilities for the number of components are displayed. The posterior for the number of components supports k = 2, as in previous analyses. Fig. 3(b) displays the single clustering and the scaled density estimates. Posterior mean estimates of the parameters of the mixture are given by E(w |θ ) = {0.83, 0.17},

M |θ ) = {(−0.9692, −1.9)t , (0.7, 1.43)t }. E(M Thus, the mean direction of the dominant component (E(w1 |θ ) = 0.83) is 63.9◦ , which was defined by the mean direction of the first mixture component E(µ1 |θ ) = (−0.9692, −1.9)t , and 242.9◦ for the second component. To compare our parameter estimates with previous analyses via the von Mises distribution, we calculate estimates of the concentration parameters ρ1 and ρ2 associated to each of the projected normal distribution components. The corresponding estimators are ρˆ 1 = 0.763 and ρˆ 2 = 0.868, which are concordant with all the estimators of the concentration parameters κj , j = 1, 2, as previously reported in the literature, see Table 10 in Chang-Chien et al. (2012). 5.6.2. Example 2 The study of the interaction among species is an active area of research in Ecology. The use of camera-trapping strategies have allowed ecologists to generate detailed information, even for animal species otherwise difficult to detect in the wild. This type of records leads naturally to the analysis of circular data. As part of a larger research project in El Triunfo biosphere reserve (Mexico), in 2015, a set of camera trap data relating to the presence of three species was collected. Here, we will only refer to a specific mammalian species, namely Collared Peccary (Pecari tajacu). These data

14

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

Fig. 3. Turtles data. Table 4 Peccary data. 176.23

157.12

184.57

45.93

132.15

153.83

260.79

248.09

134.19

266.66

130.37

140.83

189.94

234.26

252.63

244.26

Table 5 Posterior distribution of k for the Peccary data. k

1

2

3

>3

p(k |θ )

0.16

0.31

0.23

0.3

are shown in Table 4 (in degrees). The circular diagram corresponding to transformed time-of-day (based on a 24-h clock) is displayed in Fig. 4(a). These data have been studied by Núñez-Antonio et al. (2018) using a Bayesian nonparametric analysis, in which the posterior predictive density was based on infinite mixtures of projected normal distributions. Then, this predictive, was used to carry out inferences for an overlapping coefficient between species. Here we apply our clustering and classification procedure to the Peccary data. Table 5 shows posterior probabilities for the number of clusters. The posterior distribution for k gives the highest probability to k = 2 groups. Fig. 4(b), shows the single best clustering and the scaled density estimate. This classification is in concordance with the results in Núñez-Antonio et al. (2018). Estimates for the mixture parameters are E(w |θ ) = {0.63, 0.37},

M |θ ) = {(−1.19, 0.52)t , (−0.86, −1.64)t , } E(M It appears that Peccaries in El Triunfo have two activity peaks, the most active period as shown by the weight of the first mixture component (E(w1 |θ ) = 0.63), is around 10:25 h (defined by the mean direction of the first mixture component, E(µ1 |θ ) = (−1.19, 0.52)t , and a less active period at 16:12 h (the mean direction of the second mixture component). This activity pattern could be related to the interaction with other species. We believe this type of analysis could help ecologists improve their understanding of the activity patterns for these kinds of species.

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

15

Fig. 4. Peccary data.

6. Discussion In this paper, we have proposed a Bayesian methodology to perform model-based clustering of circular data. First, via the posterior distribution of the number of components of a mixture, we estimated the number of clusters within the data. Second, using a divergence measure, we devised a Data relabeling algorithm to solve the label switching problem for circular data, with the key being the relationship between the allocation variables and the observations. Comparisons with four alternative clustering strategies showed that when analyzing circular data, our proposal gives acceptable estimates for the number of clusters and also yields good clustering results. We observed that if the distributional assumptions of the model are met, the performance of our strategy improves when the sample size increases, showing uniformly better results than the alternatives for samples sizes around n = 100 and larger. Our model-based strategy uses a mixture of projected normal distributions, where both the mean direction and dispersion depend upon the vector (µ1 , µ2 ). Initially, this choice of distribution seemed too restrictive for clustering; however, we were able to describe clusters generated by mixtures of von Mises distributions. In the presence of asymmetric clusters, nevertheless, the performance of our strategy decreases when the sample size increases. This is simply because, in order to model a single asymmetric cluster, two or more symmetric distributions are needed. This has been noted, for example, by Richardson and Green (1997) and Escobar and West (1995) who overestimated the number of groups using mixtures of normal distributions. Under a simulated setting, and solving the label switching problem, we confirmed that this is indeed the case, see Fig. 2. Also, this is consistent with our cluster definition; a set of data modeled by a symmetric projected distribution. Based on our simulation study, we can go further and say that for small sample sizes (n ≤ 100) there is not enough information for our algorithm to consistently detect the asymmetric distributions, but when the sample size increases, it detects this asymmetry and overestimates the number of clusters. Some extensions to our approach are possible. First, it would be interesting to extend our ideas to mixtures of projected normal distributions with a covariance matrix other than the identity, aiming to model symmetric and asymmetric clusters. Second, while we have followed the performance of the MCMC procedures for the projected normal distribution, which consists of the generation of the latent variables for each observed data, we believe it is possible to define a more efficient strategy that avoids the need of generating such latent variables. Finally, although we have focused on the case of a circular response, this methodology can be extended to the case of a directional response in higher dimensions in a more natural fashion than existing procedures. Acknowledgments The authors gratefully acknowledge two anonymous referees, the associate editor, as well as professor Eduardo Gutiérrez Peña for his constructive comments on an earlier draft of this paper. We are deeply thankful to professor Eduardo Mendoza (Universidad Michoacana de San Nicolás de Hidalgo, Mexico) for the corresponding permission to use the Peccari-data. This research was supported in part by CONACYT, Mexico, through Sistema Nacional de Investigadores, and a PRODEP, Mexico research project, Mexico.

16

C.E. Rodríguez, G. Núñez-Antonio and G. Escarela / Computational Statistics and Data Analysis 143 (2020) 106842

Appendix A. Supplementary data Supplementary material related to this article can be found online at https://doi.org/10.1016/j.csda.2019.106842. The supplementary material includes a detailed description of the slice sampler for the projected normal distribution, the Gibbs sampler for finite mixtures of projected normal distributions, and details for the calculation of the probabilities for the proposals of the discrete allocation variables (Palloc ). References Ackermann, H., 1997. A note on circular nonparametrical classification. Biom. J. 39 (5), 577–587. Burkard, R.E., Dell’Amico, M., Martello, S., 2009. Assignment Problems. SIAM, Philadelphia. Cappé, O., Robert, C.P., Rydén, T., 2003. Reversible jump, birth-and-death and more general continuous time Markov chain Monte Carlo samplers. J. R. Stat. Soc. Ser. B Stat. Methodol. 65 (3), 679–700. Chang, F., Qiu, W., Zamar, R.H., Lazarus, R., Wang, X., 2010. Clues: An R package for nonparametric clustering based on local shrinking. J. Stat. Softw. 33 (4), 1–16. Chang-Chien, S., Wen-Liang, H., Miin-Shen, Y., 2012. On mean shift-based clustering for circular data. Soft Comput. 16 (6), 1043–1060. Cressie, N., 1977. On some properties of the scan statistic on the circle and the line. J. Appl. Probab. 14 (2), 272?283. Diebolt, J., Robert, C.P., 1994. Estimation of finite mixtures distributions through Bayesian sampling. J. R. Stat. Soc. Ser. B Stat. Methodol. 56 (2), 363–375. Escobar, M.D., West, M., 1995. Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90 (430), 577–588. Forgy, E.W., 1965. Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics 21 (3), 768–769. Fruhwirth-Schnatter, S., Celeux, G., Robert, C.P. (Eds.), 2019. Handbook of Mixture Analysis. In: Chapman & Hall/CRC Handbooks of Modern Statistical Methods, CRC Press. Ghosh, K., Jammalamadaka, R., Tiwari, R., 2003. Semiparametric Bayesian techniques for problems in circular data. J. Appl. Stat. 30 (2), 145–161. Godsill, S.J., 2001. On the relationship between Markov chain Monte Carlo methods for model uncertainty. J. Comput. Graph. Statist. 10 (2), 230–240. Gopal, S., Yang, Y., 2014. von Mises-Fisher clustering models. In: Proceedings of the 31 th International Conference on Machine Learning, Vol. 32. Green, P.J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 (4), 711–732. Hernandez Stumpfhauser, D., Breidt, F.J., van der Woerd, M.J., 2017. The general projected normal distribution of arbitrary dimension: modeling and Bayesian inference. Bayesian Anal. 12 (1), 113–133. Hornik, K., 2019. clue: Cluster ensembles. R package version 0.3-57. URL https://CRAN.R-project.org/package=clue. Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classification 2 (1), 193–218. Jammalamadaka, S.R., SenGupta, A., 2001. Topics in Circular Statistics. In: Series on Multivariate Analysis, World Scientific. Kamieniecki, K., Spurek, P., 2018. CEC: Cross-Entropy Clustering. Kaufman, L., Rousseeuw, P.J., 2008. Finding Groups in Data. John Wiley & Sons, Inc.. Kendall, D.G., Harding, E.F., 1974. Stochastic Geometry: A Tribute to the Memory of Rollo Davidson. John Wiley & Sons, Wiley, London, New York. Lloyd, S., 1982. Least squares quantization in PCM. IEEE Trans. Inform. Theory 28 (2), 129–137. Lund, U., 1999. Least circular distance regression for directional data. J. Appl. Stat. 26 (6), 723–733. Malsiner-Walli, G., Frühwirth-Schnatter, S., Grün, B., 2016. Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 26 (1), 303–324. Mardia, K., 1972. Statistics of Directional Data. In: Probability and Mathematical Statistics: A Series of Monographs and Textbooks, Academic Press. Mardia, K.V., Jupp, P.E., 2000. Circular Data. John Wiley and Sons, Inc.. McVinish, R., Mengersen, K., 2008. Semiparametric Bayesian circular statistics. Comput. Statist. Data Anal. 52 (10), 4722–4730. Nobile, A., Fearnside, A.T., 2007. Bayesian finite mixtures with an unknown number of components: the allocation sampler. Stat. Comput. 17 (2), 147–162. Núñez-Antonio, G., Gutiérrez-Peña, E., 2005. A Bayesian analysis of directional data using the projected normal distribution. J. Appl. Stat. 32 (10), 995–1001. Núñez-Antonio, G., Mendoza, M., Contreras-Cristán, A., Gutiérrez-Peña, E., Mendoza, E., 2018. Bayesian nonparametric inference for the overlap of daily animal activity patterns. Environ. Ecol. Stat. 25 (4), 471–494. Peng, R.D., Leeuw, J., 2002. An Introduction to the .C Interface to R. Manual, UCLA: Academic Technology Services, Statistical Consulting Group, http://www.ats.ucla.edu/stat/r/library/interface.pdf. R Core Team, 2019. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Rand, W.M., 1971. Objective criteria for the evaluation of clustering methods. J. Amer. Statist. Assoc. 66 (336), 846–850. Rastelli, R., Friel, N., 2018. Optimal Bayesian estimators for latent variable cluster models. Stat. Comput. 28 (6), 1169–1186. Richardson, S., Green, P.J., 1997. On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 59 (4), 731–792. Richardson, S., Green, P.J., 1998. Corrigendum: On Bayesian analysis of mixtures with an unknown number of components. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 (3), 661. Rodríguez, C.E., Walker, S.G., 2014a. Label switching in Bayesian mixture models: deterministic relabeling strategies. J. Comput. Graph. Statist. 1 (23), 25–45. Rodríguez, C.E., Walker, S.G., 2014b. Univariate Bayesian nonparametric mixture modeling with unimodal kernels. Stat. Comput. 1 (24), 35–49. Scrucca, L., Fop, M., M., T.B., Raftery, A.E., 2016. Mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 8 (1), 205–233. Stephens, M.A., 1969. Techniques for directional data. Technical Report 150, Department of Statistics, Stanford University, Stanford. Stephens, M., 1997. Bayesian methods for mixtures of normal distributions. unpublished ph.d thesis, University of Oxford, Oxford, http://stephenslab. uchicago.edu/MSpapers/DPhilMS.ps.gz. Stephens, M., 2000a. Bayesian analysis of mixture models with an unknown number of components - an alternative to reversible jump methods. Ann. Statist. 28 (1), 40–74. Stephens, M., 2000b. Dealing with label switching in mixture models. J. R. Stat. Soc. Ser. B Stat. Methodol. 62 (4), 795–809. Tierney, L., 1994. Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22 (4), 1701–1762. Tsagris, M., Athineou, G., Sajib, A., Amson, E., Waldstein, M.J., 2018. Directional: directional statistics. Wang, F., Gelfand, A.E., 2013. Directional data analysis under the general projected normal distribution. Stat. Methodol. 10 (1), 113–127. Whitfield, P.H., 2017. Clustering of seasonal events: A simulation study using circular methods. Comm. Statist. Simulation Comput. 1–23. Xu, R., Wunsch, D., 2005. Survey of clustering algorithms. IEEE Trans. Neural Netw. 16 (3), 645–678.