Modal posterior clustering motivated by Hopfield’s network

Computational Statistics and Data Analysis 137 (2019) 92–100

Ruth Fuentes-García a, Ramsés H. Mena b,∗, Stephen G. Walker c

a Facultad de Ciencias, Universidad Nacional Autónoma de México, Mexico
b IIMAS, Universidad Nacional Autónoma de México, Mexico
c Department of Mathematics, University of Texas at Austin, Austin, USA

Abstract

Motivated by Hopfield's network, a conditional maximization routine is used in order to compute the posterior mode of a random allocation model. The proposed approach applies to a general framework covering parametric and nonparametric Bayesian mixture models, product partition models, and change point models, among others. The resulting algorithm is simple to code and very fast, thus providing a highly competitive alternative to Markov chain Monte Carlo methods. Illustrations with both simulated and real data sets are presented.

Article history: Received 25 April 2018; Received in revised form 15 February 2019; Accepted 16 February 2019; Available online 23 February 2019.
Keywords: Conditional maximization; Hopfield network; Modal estimation

1. Introduction

Given a set of observations, y := (y1, . . . , yn), an important problem in statistics is to partition the data into k ≤ n groups. Such a problem is of interest for clustering analysis (e.g. McLachlan and Peel, 2000; Fraley and Raftery, 2002), density estimation (e.g. Lo, 1984; Escobar and West, 1995), change point detection (e.g. Loschi and Cruz, 2005; Martinez and Mena, 2014), and product partition models (e.g. Quintana and Iglesias, 2003). The problem can be recast by means of the so-called latent allocation variables, d := (d1, . . . , dn), where di = j indicates that yi is allocated to cluster j ∈ {1, . . . , k}, and where k may be infinite. If one assumes a probabilistic model for y which depends on d, say f(y | d), then a prior π(d) on all possible allocations induces a posterior model given by

π(d | y) ∝ π(d) f(y | d).    (1)

To provide a concrete example, let us consider the data model where the (yi) are independent and identically distributed (i.i.d.) from the mixture model

f(y \mid w, \theta) = \sum_{j=1}^{\infty} w_j\, f(y \mid \theta_j),

where the weights w := {wj}_{j=1}^∞ sum to one almost surely (a.s.), the parameters θ := {θj}_{j=1}^∞ are i.i.d. from some fixed distribution with density p(θ) and are independent of w. Here, f(y | θ) is a density function for each θ ∈ Θ. Hence, the posterior on allocations reduces to

\pi(d \mid y) \propto E_w\!\left[\prod_{i=1}^{n} w_{d_i}\right] \times E_\theta\!\left[\prod_{i=1}^{n} f(y_i \mid \theta_{d_i})\right],    (2)



where the first term accounts for the prior π(d) and the second term for the intra-cluster likelihood f(y | d). Note that this approach includes both finite and infinite mixture models. Also note that for any partition of the y there are many ds yielding such a partition, for which π(d | y) are all identical. In particular, there will be a configuration of d which has no gaps, in the sense that if the number of groups is k then the components of d are taken precisely from the integers {1, . . . , k}. Letting ρj := {i : di = j} and considering only those j for which nj := |ρj| > 0, the variable d induces a partition ρn := (ρ1, . . . , ρk), i.e. a P[n]-valued variable, where P[n] denotes the set of all partitions of [n] := {1, 2, . . . , n}. Indeed, a popular approach in the Bayesian nonparametric literature is to look at the posterior partition model π(ρn | y) ∝ π(ρn) f(y | ρn), where, e.g., π(ρn) is a suitable exchangeable partition probability function (EPPF) (cf. Hjort et al., 2010), i.e. a symmetric probability function depending only on the composition induced by the group sizes in ρn. Another approach to construct a π(ρn), without directly using a mixture model, is through the so-called product partition model (PPM), given by

\pi\bigl(\rho_n = (\rho_1, \ldots, \rho_k) \mid k\bigr) \propto \prod_{j=1}^{k} c(n_j),

where c : [n] → \mathbb{R}^{+} \cup \{0\} is known as a cohesion function (see Hartigan, 1990; Barry and Hartigan, 1992, 1993). Conditioning on a particular allocation d, the intra-cluster likelihood is computed as

f(y \mid d) = \prod_{j=1}^{k} \int_{\Theta} \Biggl\{\prod_{i \in \rho_j} f(y_i \mid \theta)\Biggr\}\, p(\theta)\, d\theta.    (3)

Though in principle such approaches resolve posterior clustering inference for a given data set, several complications arise. The foremost of these is the sheer number of possible values d can take, together with the lack of any natural ordering on them; the space is simply too large. To date, Markov chain Monte Carlo methods have been the standard tool for performing inference via sampling. However, it has become increasingly recognized that for this problem such sampling is often not feasible, and alternatives to sampling strategies have started to gain traction. Direct sampling strategies for sampling d from π(d | y) have been investigated by, among others, Fuentes-García et al. (2010b), who proposed an algorithm to sample from (2) under the Dirichlet process mixture model. See also Mena and Walker (2015), who adjusted this model to reduce the identifiability problems associated with it: the well known label-switching problem and the inherent unidentifiability of the parameters. Wang and Dunson (2011) and Wang and Walker (2017) attempted to reduce the dimension of d by approximating π(d | y) with the idea of determining the posterior sequentially, working with the sequence of conditional distributions

π(di | y1, . . . , yi−1, d1, . . . , di−1).    (4)

Starting with d1 = 1, one can then easily force di to be one of the previously used labels in (d1, . . . , di−1), i.e. no new group, or 1 + max{d1, . . . , di−1}, implying a new group. Another direction has been to find an optimal d according to some specified loss function. That is, one minimizes

l(d') = \sum_{d} l(d', d)\, \pi(d \mid y),

for some choice of l(d′, d). Dahl (2006) and Lau and Green (2007) propose loss functions which penalize mis-assigned groups. A discussion of types of loss functions and of routines to find the optimal d is given in Wade and Ghahramani (2018).

Here we argue that a coherent Bayesian point estimate of d is the maximum a posteriori (MAP) estimate. Nevertheless, there are points of view criticizing this particular single point estimate; see Wade and Ghahramani (2018). As a Bayesian, one has a posterior belief distribution over d space, with each d carrying a belief probability π(d | y). Given that the only information is y, to us at least it appears incoherent to select a d′ as opposed to the MAP estimate d̂ for which

π(d′ | y) < π(d̂ | y).

Whatever d′ is, the experimenter believes it is less likely to be the true classification than d̂; at this juncture there seems little motive for selecting d′ as a single point estimate. The argument against using the mode is that all other classifications, no matter what their posterior probabilities are, are assigned the same loss value under the 0–1 loss function associated with the decision theoretic approach to selecting the mode. We do not believe the introduction of loss functions is warranted in this problem, as the coherence of the mode is overwhelming. However, even from a decision theoretic point of view, if one single classification is sought then assigning all the unwanted classifications the same loss makes sense. Indeed, other loss functions, such as those used in the literature, are more extreme than the 0–1 loss in that they can assign larger loss to classifications with higher posterior probabilities, i.e. l(d) > l(d′) yet π(d | y) > π(d′ | y). This, to us, appears incoherent. It is worth noting that we would not make the same claim in a continuous setting, since there the mode of a posterior density does not represent the probability of a particular outcome.


However, it remains a non-trivial exercise to achieve the goal of finding d̂. The main objective of this paper is to provide an algorithm for finding the MAP estimate, and one which does it fast, so that in particular it can be run with multiple starting points (recommended for all optimization algorithms), thus mitigating the problem of local modes. The inspiration for the algorithm comes from neuroscience in the form of Hopfield's network. It is the speed at which the Hopfield network algorithm reaches a mode which provided the backdrop for the initial experimentation with sequences of conditional maximizations. While such routines are known in the statistical estimation literature, they are far from popular. Hopefully, the present paper will breathe new life into this type of algorithm.

The layout of the paper is as follows. In Section 2 we provide the Hopfield network foundations of the search algorithm for a mode of the posterior distribution on the allocation variables. In Section 3 we look at choices of prior for the allocation variables in the context of a mixture model, whereas in Section 4 we consider various intra-cluster models for the data for which allocation variables are relevant. Section 5 contains numerical illustrations involving both synthetic and real data.

2. Hopfield's network

A Hopfield network is essentially an algorithm to search for the minimum of an energy function of the form

E = s'Ws = \sum_{i,j=1}^{N} s_i s_j w_{i,j},

where W = (wi,j) is an N × N matrix which contains "memories", i.e. each node represents a memory stored in the neurons denoted by (si). Here sj ∈ {0, 1}, where sj = 1 indicates that neuron j is "on" and sj = 0 indicates that neuron j is "off". The input/output of the Hopfield network is a sequence of conditional maximization steps from some joint probability mass function. As such it guarantees an increase in the joint probability which, by suitably defining the energy, is equivalent to a decrease in energy. The algorithm works as follows: from a starting point s one moves through the integers j = 1, . . . , N, and if

\sum_{i \neq j} w_{i,j}\, s_i > 0 \ \text{ then } s_j \to 1, \quad \text{else } s_j \to 0.    (5)

One way to view the network is in terms of whether the current state of a neuron changes or not; if it does, it moves to a fixed alternate state. The change or not can be viewed as a conditional maximization step from the probability mass function p(si | s−i) ∝ exp(−½ s′Ws). For our setting, i.e. the joint density and algorithm for the allocation variables, we can likewise first ask whether a current state changes or not; if it does, it moves to a new d which yields the maximum probability from a set of N − 1 probabilities, as opposed to the deterministic fixed move of the Hopfield network. The change or not is also governed by a conditional maximization step from some joint density function, i.e. (4). It is remarkable that this algorithm finds the nearest mode very fast; indeed within a handful of iterations, however large the dimension. For further details see Hopfield (1982). To understand this algorithm better, let us consider the joint probability mass function on {0, 1}^N given by

p(s_1, \ldots, s_N) \propto \exp\bigl(-\tfrac{1}{2}\, s'Ws\bigr),    (6)

where s = (s1, . . . , sN). We can easily see that a full conditional density is given by

p(s_j = 1 \mid s_{-j}) \propto \exp\Bigl\{\tfrac{1}{2}\sum_{i \neq j,\, k \neq j} s_i w_{i,k} s_k + \sum_{i \neq j} s_i w_{i,j}\Bigr\}

and

p(s_j = 0 \mid s_{-j}) \propto \exp\Bigl\{\tfrac{1}{2}\sum_{i \neq j,\, k \neq j} s_i w_{i,k} s_k\Bigr\}.

Therefore, if (5) holds, this conditional density is maximized with sj = 1; otherwise it is maximized with sj = 0. The net will converge to a stable configuration (Fausett, 1993; Takefuji, 1992; Bruck, 1990). So we see that the Hopfield algorithm is a sequence of maximizations of the full conditional densities obtained from the joint density (6). As a point of reference, the Gibbs sampler would instead involve taking a sample from the full conditional densities. While conditional maximization algorithms for finding modes of probability mass functions have been seen in the literature, they are by no means common, at least not in the present context. Mode hunting for a joint density function is usually done using the EM algorithm, with conditional maximization steps only used when the M step as a whole is intractable.
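To make the update concrete, the following is a minimal R sketch of one asynchronous sweep of rule (5), assuming a symmetric weight matrix W with zero diagonal; it is only an illustration, not the supplementary code accompanying this paper.

# One asynchronous sweep of rule (5); s is a 0/1 state vector and W a
# symmetric N x N weight matrix with zero diagonal (illustrative sketch).
hopfield_sweep <- function(s, W) {
  for (j in seq_along(s)) {
    s[j] <- as.numeric(sum(W[-j, j] * s[-j]) > 0)   # rule (5)
  }
  s
}

# Repeat sweeps until the configuration no longer changes, i.e. is stable.
hopfield_run <- function(s, W) {
  repeat {
    s_new <- hopfield_sweep(s, W)
    if (all(s_new == s)) return(s)
    s <- s_new
  }
}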


As we mentioned, our idea, motivated by Hopfield's network and the speed with which it finds a mode, is to use conditional maximization steps for finding the mode of the posterior density of the allocation variables of a mixture model. In this respect, we find

π(d1, . . . , dn | data),    (7)

and then search for a mode using conditional maximization, i.e. for each i we find the j which maximizes π(di = j | d−i, data). Which choices of j are available depends on the exact form of the model. If the number of components is known or has an upper bound then the choices are explicit. When these are not known we can, and do, use the sample size as an upper limit. We can run the algorithm repeatedly with different starting points to diminish the effect of being trapped in local modes. Due to the label switching problem associated with π(d | y), there are different configurations inducing the same clustering structure, i.e. with the same probability under π(d | y). Hence, in order to mitigate this effect in the algorithm we restrict our search to those labelings without gaps. Thus, if a d has only 4 distinct labels then they would be {1, 2, 3, 4}. We can ensure this by only allowing the j from π(di = j | d−i, data) to be one of the values from d−i or 1 + max{d−i}. The convergence of the algorithm to a mode is straightforward. Given a current d, it is easy to see that

π(d′i | d−i, data) ≥ π(di | d−i, data),

where di is the current value and d′i the new value. Hence,

π(d′i, d−i | data) ≥ π(di, d−i | data),

and so the value of (7) is non-decreasing. When there is no strict increase, the algorithm must have reached a mode. As with all searches for a global mode, a number of runs with different starting points is needed. An R implementation of this code is available in the Supplementary material; see Appendix A.
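To fix ideas, the following is a minimal R sketch of the conditional maximization search just described; it is not the supplementary implementation, and it assumes a user-supplied function log_post(d) returning log π(d | data) up to an additive constant (for instance, a log prior from Section 3 plus a log intra-cluster likelihood from Section 4). The helper names are ours.

relabel <- function(d) match(d, sort(unique(d)))     # map labels to 1, ..., k with no gaps

mode_search <- function(d, log_post, max_sweeps = 100) {
  d <- relabel(d)
  for (sweep in seq_len(max_sweeps)) {
    d_old <- d
    for (i in seq_along(d)) {
      cand <- unique(c(d[-i], max(d[-i]) + 1))       # values of d[-i] or 1 + max{d[-i]}
      lp   <- vapply(cand, function(j) { dd <- d; dd[i] <- j; log_post(dd) },
                     numeric(1))
      d[i] <- cand[which.max(lp)]
      d    <- relabel(d)                             # keep the labeling gap-free
    }
    if (identical(d, d_old)) break                   # a full sweep with no change: a mode
  }
  d
}

In practice the routine is run from several randomized starting points, e.g. d0 <- sample(1:10, n, replace = TRUE), keeping the run with the highest value of log_post.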

3. Priors on allocations derived from Bayesian models

Although we focus on the classifications from the Bayesian nonparametric mixture model, in principle any model for which we can highlight a vector of integer valued variables crucial to the estimation of the model falls under our methodology for obtaining a posterior mode. Here we look into some cases.

3.1. Nonparametric mixture models

In order to compute (2) for the Bayesian mixture model, we need to specify the weights and to determine a prior for them. For the sake of illustration, we assume the commonly used stick-breaking construction, i.e.

w_1 = v_1, \qquad w_j = v_j \prod_{k < j} (1 - v_k), \quad \text{with } v_j \overset{\mathrm{ind}}{\sim} \mathrm{Be}(a_j, b_j),

where Be(a, b) denotes the Beta distribution with mean a/(a + b), and the parameters (aj, bj) are chosen such that the weights add up to one a.s. As described in Ishwaran and James (2001), such a setting encompasses nonparametric mixture models based on the Dirichlet process, where (aj, bj) = (1, α), or the two parameter Poisson–Dirichlet process, where (aj, bj) = (1 − ξ, α + jξ), with α > −ξ, as well as those based on more general nonparametric priors; see Favaro et al. (2016), for example. Hence, for such a class of models, the prior on the allocations is given by

\pi(d) = E_w\!\left[\prod_{i=1}^{n} w_{d_i}\right] = E\!\left[\prod_{j=1}^{k} v_j^{n_j} (1 - v_j)^{m_j}\right] = \prod_{j=1}^{k} \frac{\Gamma(a_j + b_j)\,\Gamma(n_j + a_j)\,\Gamma(m_j + b_j)}{\Gamma(a_j)\,\Gamma(b_j)\,\Gamma(n_j + m_j + a_j + b_j)},    (8)

where k = max{d1, . . . , dn}, n_j = \sum_{i=1}^{n} I(d_i = j) and m_j = \sum_{i=1}^{n} I(d_i > j). Noting that m_j = n_{j+1} + m_{j+1}, n = n_1 + m_1, and arranging terms, the two parameter Poisson–Dirichlet process case reduces to

\pi(d) = \left(\frac{\prod_{j=1}^{k-1}(\alpha + j\xi)}{(\alpha + 1)_{n-1}} \prod_{j=1}^{k} (1 - \xi)_{n_j - 1}\right) \prod_{j=1}^{k} \frac{n_j - \xi}{n_j + m_j + \alpha + \xi(j - 1)},    (9)

where (a)_n = a(a + 1) · · · (a + n − 1). The expression in the parenthesis in (9) corresponds to Pitman's sampling formula, which reduces to Ewens' sampling formula when ξ = 0. The above expression could equally be computed for other classes of random weights, such as those introduced in Fuentes-García et al. (2010a). In particular, if w_j = λ(1 − λ)^{j−1}, with λ ∼ Be(a, b), i.e. the geometric weights nonparametric prior, then

\pi(d) = \frac{(a)_n\,(b)_{D-n}}{(a + b)_D} \propto \beta(a + n,\; D - n + b),    (10)


where D = ∑_{i=1}^{n} d_i and β(a, b) denotes Euler's Beta function. Other priors are available, such as those based on product partition models. The posterior is given by (1), with f(y | d) as in (3).
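To show how cheaply these priors can be evaluated inside the search, here is a minimal R sketch of the log of (8), which covers the Dirichlet and two parameter Poisson–Dirichlet processes through the choice of (a_j, b_j), and of the geometric-weights prior (10); the function names are ours.

# Log of the stick-breaking allocation prior (8); a and b are vectors of
# stick-breaking parameters (a_j, b_j) of length at least k = max(d).
log_prior_sb <- function(d, a, b) {
  k  <- max(d)
  nj <- tabulate(d, nbins = k)                  # n_j = #{i : d_i = j}
  tj <- rev(cumsum(rev(nj)))                    # tail sums n_j + ... + n_k
  mj <- c(tj[-1], 0)                            # m_j = #{i : d_i > j}
  sum(lgamma(a[1:k] + b[1:k]) + lgamma(nj + a[1:k]) + lgamma(mj + b[1:k]) -
      lgamma(a[1:k]) - lgamma(b[1:k]) - lgamma(nj + mj + a[1:k] + b[1:k]))
}

# Dirichlet process with total mass alpha: (a_j, b_j) = (1, alpha), e.g.
#   log_prior_sb(d, a = rep(1, length(d)), b = rep(alpha, length(d)))
# Two parameter Poisson-Dirichlet: a_j = 1 - xi, b_j = alpha + j * xi.

# Geometric-weights prior (10), up to an additive constant.
log_prior_geom <- function(d, a, b) lbeta(a + length(d), sum(d) - length(d) + b)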

3.2. Change point detection models

In (3) it is assumed that the observations (yi) are i.i.d. from the mixture model which, given d, implies that observations are also i.i.d. within each cluster. More generally, one could have

f(y \mid d) = \prod_{j=1}^{k} \int_{\Theta} f(y_{\rho_j} \mid \theta)\, p(\theta)\, d\theta,    (11)

where f(y_{ρ_j} | θ) refers to the density corresponding to {y_i : i ∈ ρ_j}, which shows some dependence among its elements. This is of interest, for example, when the underlying classification problem is aimed at determining the time points at which a set of time-dependent observations change their distribution. Indeed, change point detection techniques have, as their primary goal, to identify the time points, τ = {(τ_j)_{j=1}^{k} ; τ_1 < · · · < τ_k}, at which the data y = (y_1, . . . , y_n), observed at times t = {(t_i)_{i=1}^{n} ; t_1 < · · · < t_n}, change their distribution. Clearly, such a problem is a model-based classification problem, as it depends on the properties and flexibility of the stochastic model assumed for y, namely the regime distribution. The distinctive feature of prior and posterior distributions on d for change point models is that overlapping group configurations such as ({y_1, y_3}, {y_2, y_4, . . . , y_n}) should receive probability zero. In other words, only allocations satisfying the constraint d_1 ≤ d_2 ≤ · · · ≤ d_n are allowed in the support of all time-segmentations. When dealing with random partitions, Martinez and Mena (2014) discuss some prior distributions meeting such requirements. They show that a well defined model for segmentations can be constructed via

\pi_c(d) = \frac{n!}{k! \prod_{j=1}^{k} n_j!}\; \pi(d),    (12)

where π(d) is given as in (9), for instance, and (12) is supported on those ds preserving the ordering. Another mechanism is attained by truncating a given posterior distribution on allocations to those (d_i) that preserve the ordering. We would then have

π_c(d | y) ∝ π(d | y) 1(d_1 ≤ · · · ≤ d_n);

see also Fuentes-García et al. (2010b). Clearly, the constraint will lead to a necessary modification of the algorithm: when we look to update values, we only need to check those d_i which are on a border, i.e. with d_{i−1} ≠ d_i or d_{i+1} ≠ d_i.
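The ordered-allocation prior (12) is then a simple reweighting of any π(d) from Section 3.1. A minimal sketch, reusing the (hypothetical) log_prior_sb helper sketched earlier:

# Log of the ordered prior (12); only valid for d with d_1 <= ... <= d_n.
log_prior_ordered <- function(d, a, b) {
  stopifnot(!is.unsorted(d))                    # the time-segmentation constraint
  k  <- max(d)
  nj <- tabulate(d, nbins = k)
  lfactorial(length(d)) - lfactorial(k) - sum(lfactorial(nj)) +
    log_prior_sb(d, a, b)
}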

4. Intra-cluster likelihood examples

Depending on the problem under study, there might be many scenarios for the law regulating the data. We present three examples of intra-cluster likelihoods: two related to Bayesian nonparametric mixture models (univariate and multivariate) and one corresponding to the law of an AR(1) process, with the purpose of modeling each of the regimes in a change-point detection time series problem.

4.1. Univariate AR(1) process likelihood

Let us assume that the time series y = (y_1, . . . , y_n) follows an autoregressive process of order one, specifically Y_t = µ + φY_{t−1} + ε_t, where µ ∈ R, 0 < φ < 1 and the ε_t are i.i.d. N(0, λ^{−1}). If y_0 ∼ N(µ, λ^{−1}) then the model is strictly stationary and y ∼ N(µ, Σ), where µ = (µ_1, . . . , µ_n)′ and Σ = (cov(y_i, y_j)) with cov(y_i, y_j) = λ^{−1}φ^{|i−j|}. Here, N(µ, Σ) stands for the Gaussian distribution with mean µ and covariance Σ. In this case, the corresponding precision matrix is given by Σ^{−1} = λ(1 − φ^2)^{−1}A, where

A = \begin{pmatrix}
1 & -\phi & 0 & \cdots & 0 & 0 \\
-\phi & 1 + \phi^2 & -\phi & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & \cdots & & -\phi & 1 + \phi^2 & -\phi \\
0 & \cdots & & 0 & -\phi & 1
\end{pmatrix}.    (13)

We assume the prior π (µ, λ) = π (µ|λ)π (λ) with π (µ|λ) = N(µ; 0, (c λ)−1 ), π (λ) = Ga(λ; a, b), and where c > 0. Thus, for a given value of the dependence parameter, φ , (11) reduces to

f(y \mid d) = \prod_{j=1}^{k} \frac{\Gamma(n_j/2 + a)\,(2\pi)^{-n_j/2}}{\Gamma(a)}\, \bigl(b(1 - \phi^2)\bigr)^{a} \left(\frac{c(1 + \phi)(1 - \phi^2)}{c + n_j - \phi(n_j - c - 2)}\right)^{1/2} \bigl(b(1 - \phi^2) + S_j/2\bigr)^{-(n_j/2 + a)},    (14)


with

S_j = y_j' A_j y_j - \frac{(1 - \phi)\left(\sum_{i=1}^{n_j} y_{j,i} - \phi \sum_{i=2}^{n_j - 1} y_{j,i}\right)^{2}}{c + n_j - \phi(n_j - c - 2)},

and where A_j is the matrix (13) corresponding to the segmentation y_j = (y_{j,i} : i = 1, . . . , n_j).
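As a check on the formula, the following is a minimal R sketch of the log of a single factor of (14), i.e. the marginal likelihood of one segment as reconstructed above; it builds A_j explicitly, which is adequate for moderate segment lengths, and the function name is ours.

# Log marginal likelihood of one segment y (numeric vector) under the AR(1)
# regime model of Section 4.1, following (14) and the definition of S_j.
log_marg_ar1 <- function(y, a, b, c, phi) {
  n <- length(y)
  A <- diag(1 + phi^2, n)                       # build the matrix (13)
  A[1, 1] <- 1
  A[n, n] <- 1
  if (n > 1) {
    A[cbind(1:(n - 1), 2:n)] <- -phi
    A[cbind(2:n, 1:(n - 1))] <- -phi
  }
  den   <- c + n - phi * (n - c - 2)
  inner <- sum(y) - phi * (if (n > 2) sum(y[2:(n - 1)]) else 0)
  S     <- drop(t(y) %*% A %*% y) - (1 - phi) * inner^2 / den
  lgamma(n / 2 + a) - lgamma(a) - (n / 2) * log(2 * pi) +
    a * log(b * (1 - phi^2)) +
    0.5 * log(c * (1 + phi) * (1 - phi^2) / den) -
    (n / 2 + a) * log(b * (1 - phi^2) + S / 2)
}

The full log intra-cluster likelihood is then the sum of log_marg_ar1 over the segments induced by an ordered allocation d.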

4.2. Univariate mixture models

Here we consider the mixture model defined in (2) with f(· | θ) chosen as N(· | µ, λ^{−1}) and a conjugate prior π(µ, λ) = π(µ | λ)π(λ), where π(µ | λ) = N(µ | 0, (cλ)^{−1}) and π(λ) = Ga(λ | a, b), with c > 0. In such a scenario, expression (3) is a particular case of (14), when φ = 0, and thus reduces to the form

f(y \mid d) = \prod_{j=1}^{k} \frac{\Gamma(a + n_j/2)\, b^{a} \sqrt{c}\,(2\pi)^{-n_j/2}}{\bigl\{b + S_j^2/2\bigr\}^{a + n_j/2} \sqrt{c + n_j}\;\Gamma(a)},

where

S_j^2 = \sum_{i : d_i = j} y_i^2 - \frac{n_j \bar{y}_j^{\,2}}{1 + c/n_j} \quad \text{and} \quad \bar{y}_j = n_j^{-1} \sum_{i : d_i = j} y_i.

In this case the prior π(d) is assumed to be of the form (8).
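For completeness, a minimal R sketch of the log of one factor of the product above, i.e. the marginal likelihood of a single cluster, together with the full intra-cluster log likelihood; the function names are ours.

# Log marginal likelihood of the observations in one cluster under the
# Normal-Gamma setup of Section 4.2 (the phi = 0 special case of (14)).
log_marg_cluster <- function(y, a, b, c) {
  n  <- length(y)
  S2 <- sum(y^2) - n * mean(y)^2 / (1 + c / n)
  lgamma(a + n / 2) - lgamma(a) + a * log(b) + 0.5 * log(c) -
    0.5 * log(c + n) - (n / 2) * log(2 * pi) - (a + n / 2) * log(b + S2 / 2)
}

# log f(y | d): sum over the clusters defined by the allocation d.
log_lik_mix <- function(y, d, a, b, c)
  sum(vapply(split(y, d), log_marg_cluster, numeric(1), a = a, b = b, c = c))

Combined with the log_prior_sb sketch above, log_post(d) = log_prior_sb(d, a_w, b_w) + log_lik_mix(y, d, a, b, c) is one concrete way to assemble the objective for the mode_search sketch of Section 2 (all names ours).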

4.3. Multivariate mixture case

Now we consider the intra-cluster likelihood induced by a p-dimensional Gaussian kernel, N_p(µ, Σ), and prior distribution π(µ, Σ) = π(µ | Σ)π(Σ), with π(µ | Σ) = N_p(µ | 0, K^{−1}Σ) and π(Σ) = InvWishart_c(Σ; T). In such a case, (3) takes the form

f(y \mid d) \propto \prod_{j=1}^{k} \frac{\pi^{-n_j p/2}\, |T|^{c/2}\, \Gamma_p\bigl((c + n_j)/2\bigr)\, K^{p/2}}{|\Psi|^{(c + n_j)/2}\, \Gamma_p(c/2)\,(n_j + K)^{p/2}},

where

\Psi := T + \sum_{i : d_i = j} Y_i Y_i' - \frac{n_j^2}{n_j + K}\, \bar{Y}_j \bar{Y}_j' \quad \text{and} \quad \Gamma_p(a) := \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\bigl(a + (1 - j)/2\bigr).

We also assume (8) as the prior for d.
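A minimal R sketch of the corresponding log marginal likelihood of one cluster, with Y the n_j × p matrix of observations in the cluster; Tmat plays the role of T above, constants dropped by the proportionality sign are ignored, and the function names are ours.

# Log multivariate Gamma function, Gamma_p(a), as defined above.
lgamma_p <- function(a, p)
  p * (p - 1) / 4 * log(pi) + sum(lgamma(a + (1 - seq_len(p)) / 2))

# Log marginal likelihood (up to a constant) of one cluster under the
# setup of Section 4.3; Tmat corresponds to T in the text.
log_marg_cluster_mv <- function(Y, K, c, Tmat) {
  n    <- nrow(Y)
  p    <- ncol(Y)
  ybar <- colMeans(Y)
  Psi  <- Tmat + crossprod(Y) - (n^2 / (n + K)) * tcrossprod(ybar)
  -n * p / 2 * log(pi) +
    (c / 2) * as.numeric(determinant(Tmat, logarithm = TRUE)$modulus) +
    lgamma_p((c + n) / 2, p) + (p / 2) * log(K) -
    ((c + n) / 2) * as.numeric(determinant(Psi, logarithm = TRUE)$modulus) -
    lgamma_p(c / 2, p) - (p / 2) * log(n + K)
}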

5. Illustrations

We present three scenarios for finding a mode of a posterior classification distribution using the conditional maximization routine, as championed by Hopfield's network. For the first case, we consider benchmark univariate data sets available in the literature, specifically the Galaxy, Enzyme and Acidity data sets (e.g. Richardson and Green, 1997). The second case consists of the bivariate diabetes data used in Banfield and Raftery (1993). Lastly, we use the USD–Mexican Peso exchange rate used in Martinez and Mena (2014) within the context of change point detection. The purpose of these examples is merely to illustrate the flexibility of the approach and the low number of iterations required to attain a mode. It is worth mentioning that, given the model-based nature of the above models, the modal allocation might vary among different choices of kernel, intra-cluster likelihood, and prior on allocations. All this said, the results can be compared with those in the literature using the same kernel and prior specifications; e.g. an application of our method with the intra-cluster likelihood of Section 4.2 can be compared with those typically obtained for Bayesian mixture models.

5.1. Univariate example

For the three univariate data sets, we apply the Hopfield network algorithm described in Section 2 to find a modal allocation, d̂, of (1). We assume the intra-cluster likelihood, f(y | d), is given as in Section 4.2 and the prior π(d) is given by (9). For the latter, we consider the Dirichlet process and an instance of the two parameter Poisson–Dirichlet process. In both cases, the corresponding parameters, that is (α, ξ), were chosen so that the expected number of groups is far from the most accepted allocation. If Kn denotes the random variable indicating the number of distinct groups induced by the two parameter Poisson–Dirichlet process, then we choose (α, ξ) such that E[Kn] ≈ 10, 15 and 60 for the Acidity, Galaxy and Enzyme data sets, respectively. This corresponds to prior specifications far from the most accepted number of groups for these data sets.

Table 1
Results of the conditional maximization via Hopfield's network to attain the modal allocation of mixture models.

Data set   n     Most frequent cluster   DP(α): Avg. Iter.   DP(α): %   2PD(α, ξ): Avg. Iter.   2PD(α, ξ): %
Acidity    155   (92, 63)                3.55                85%        3.55                    88%
Enzyme     245   (152, 86, 7)            4.57                60%        4.80                    65%
Galaxy     82    (7, 72, 3)              3.49                85%        3.17                    93%

Fig. 1. Scatter plots for diabetes data set.

The integrated likelihood, and thus the posterior π(d | y), is sensitive to the choice of hyperparameters (a, b, c). We have chosen these so that c^{−1}(b/(a − 1)) ≈ max{y} − min{y}. In particular, we set c = 0.1, i.e. so as not to encourage too many groupings, and subsequently tune a and b. In a more general setup with further hyper-priors on these parameters, the resulting integrated likelihood might be complicated; however, by means of finite discrete priors one could work it out without adding too much pressure on the search algorithm.

We started the algorithm with a large number of groups and randomized allocations. Specifically, we ran the Hopfield optimization algorithm from 100 different starting points for d and computed the average number of iterations required to attain a mode, as well as the percentage of these runs coinciding with the most frequent modal cluster. Table 1 summarizes the outcomes of all these cases. It is evident that the number of iterations required to attain a modal allocation is very low and that a large percentage of the nets attain the most accepted mode for these data sets, i.e. the one obtained by means of Bayesian nonparametric mixture estimation approaches. A non-informativeness effect as ξ increases, similar to that noted in Lijoi et al. (2007), is also present here. We also note that with very few iterations we attain a mode of the posterior distribution on classifications; on average one optimization on our personal computer took one second. While there are relatively fast algorithms now in the literature, they end up with approximations which rely on MCMC outputs and can be challenging to optimize.

5.2. Multivariate mixture model

For the diabetes data of Banfield and Raftery (1993), shown in Fig. 1, we consider the integrated likelihood of Section 4.3 with prior distribution on allocations given by (9), where α = 2 and ξ = 0, e.g. corresponding to a Dirichlet process mixture prior with E[Kn] ≈ 9. This data set has 145 observations in three known groups: Chemical diabetes (36 observations), Normal (76 observations) and Overt diabetes (33 observations). The algorithm was run with the values (K = 1, c = 3) and T = I2 or T = I3, for the specific dimension case. As before, 100 different starting points were used. The average number of iterations required to find a mode was 5.5, with 83% of the searches pointing to a mode


Table 2
Classification for the modal partition for the diabetes data set.

Actual/Estimated   Normal   Chemical   Overt
Normal             75       1          0
Chemical           27       7          2
Overt              2        1          30

Fig. 2. Price of one US Dollar in Pesos. Change points are indicated with vertical lines for the different models.

with three groups and the other 17% to a mode with four groups. The 83% was distributed as follows: 55.4% pointed to a mode, d̂, with partition sizes (104, 9, 32), 25.3% to (101, 9, 35), 15.7% to (104, 6, 35) and 3.6% to (109, 1, 35). For the sake of completeness, Table 2 summarizes the outcomes, corresponding to a global misclassification rate of 22.7%. As with the previous data sets, we verified that the number of iterations required to attain a mode is very low, thus yielding a fast and competitive algorithm.

5.3. Change point detection

In this last illustration we aim at detecting the change points of the time series corresponding to the exchange rate between US Dollars and Mexican Pesos during the period January 2007–December 2012. For this case, we use the regime likelihood (14) with a = b = 1, c = 0.3 and φ = 0.5. Furthermore, we use the prior on segmentations, i.e. ordered allocations, given by (12) with (α, ξ) = (1, 0.9). These specifications match those used by Martinez and Mena (2014). The modal segmentation was found in 180 iterations and is shown in Fig. 2. The modal change points are at October 3, 2008, May 1, 2009, February 10, 2010 and August 9, 2011, which agree with those reported in Martinez and Mena (2014).

6. Discussion

In this paper, we have presented a Bayesian point estimate for a latent set of classification/allocation variables. With this mode we can also construct the posterior distribution for the parameters conditional on this classification. If a point estimate is required for the allocations, which is often the case, we have argued that the coherent choice, based on posterior beliefs, must be the mode. Moreover, the mode is actually, and perhaps surprisingly so, remarkably easy and quick to find. The motivation was provided by Hopfield's network, which has input/output equivalent to a sequence of conditional maximizations from a joint probability mass function. The variables move to their modal values at a very fast speed; we have demonstrated this speed in a number of illustrations.

Future directions of research would include how to sample from π(d | y) with the assistance of d̂; for example, having a proposal distribution centered on d̂. This would enable us to sample very quickly from the entire posterior distribution.


Acknowledgments

We thank an Associate Editor and two anonymous referees for comments made on a previous version of the paper. The work was completed during a visit by the first and second authors to the Department of Statistics & Data Sciences at the University of Texas at Austin. Hospitality from the department is gratefully acknowledged, as is support from a Fulbright Scholarship for the second author. The first author thankfully acknowledges the support of CONACyT, Mexico, project 241195. The third author acknowledges support from NSF, USA, grants DMS 1506879 & 1612891. The authors are also grateful to Enrique Reyes for his advice with R programming.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.csda.2019.02.008.

References

Banfield, J., Raftery, A., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821.
Barry, D., Hartigan, J., 1992. Product partition models for change point problems. Ann. Statist. 20, 260–279.
Barry, D., Hartigan, J., 1993. A Bayesian analysis for change point problems. J. Amer. Statist. Assoc. 88, 309–319.
Bruck, J., 1990. On the convergence properties of the Hopfield model. Proc. IEEE 78 (10), 1579–1585.
Dahl, D.B., 2006. Model-based clustering for expression data via a Dirichlet process mixture model. In: Do, K., Mueller, P., Vannucci, M. (Eds.), Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press, Cambridge, pp. 201–218.
Escobar, M.D., West, M., 1995. Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90 (430), 577–588.
Fausett, L., 1993. Fundamentals of Neural Networks: Architectures, Algorithms and Applications. Prentice Hall.
Favaro, S., Lijoi, A., Nava, C., Nipoti, B., Prünster, I., Teh, Y., 2016. On the stick-breaking representation for homogeneous NRMIs. Bayesian Anal. 11, 697–724.
Fraley, C., Raftery, A.E., 2002. Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 (458), 611–631.
Fuentes-García, R., Mena, R.H., Walker, S.G., 2010a. A new Bayesian nonparametric mixture model. Comm. Statist. Simulation Comput. 39 (4), 669–682.
Fuentes-García, R., Mena, R.H., Walker, S.G., 2010b. A probability for classification based on the Dirichlet process mixture model. J. Classification 27, 389–403.
Hartigan, J., 1990. Partition models. Comm. Statist. Theory Methods 19 (8), 2745–2756.
Hjort, N., Holmes, C., Müller, P., Walker, S.G., 2010. Bayesian Nonparametrics. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
Hopfield, J.J., 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. 79 (8), 2554–2558.
Ishwaran, H., James, L.F., 2001. Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc. 96 (453), 161–173.
Lau, J.W., Green, P.J., 2007. Bayesian model-based clustering procedures. J. Comput. Graph. Statist. 16 (3), 526–558.
Lijoi, A., Mena, R., Prünster, I., 2007. Controlling the reinforcement in Bayesian non-parametric mixture models. J. R. Stat. Soc. Ser. B 69 (4), 715–740.
Lo, A., 1984. On a class of Bayesian nonparametric estimates: I. Density estimates. Ann. Statist. 12, 351–357.
Loschi, R.H., Cruz, F.R.B., 2005. Extension to the product partition model: computing the probability of a change. Comput. Statist. Data Anal. 48, 255–268.
Martinez, A., Mena, R., 2014. On a nonparametric change point detection model in Markovian regimes. Bayesian Anal. 9, 823–858.
McLachlan, G., Peel, D., 2000. Finite Mixture Models. Wiley–Interscience.
Mena, R., Walker, S.G., 2015. On the Bayesian mixture model and identifiability. J. Comput. Graph. Statist. 24, 1155–1169.
Quintana, F.A., Iglesias, P.L., 2003. Bayesian clustering and product partition models. J. R. Stat. Soc. Ser. B 65, 557–574.
Richardson, S., Green, P., 1997. On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. R. Stat. Soc. Ser. B 59, 731–792.
Takefuji, Y., 1992. Neural Network Parallel Computer. Kluwer Academic Publishers.
Wade, S., Ghahramani, Z., 2018. Bayesian cluster analysis: point estimation and credible balls. Bayesian Anal. 13, 559–626.
Wang, L., Dunson, D., 2011. Fast Bayesian inference in Dirichlet process mixture models. J. Comput. Graph. Statist. 20, 196–216.
Wang, X., Walker, S.G., 2017. An optimal data ordering scheme for the Dirichlet process mixture model. Comput. Statist. Data Anal. 112, 42–52.