Learning the fundamentals in a stationary environment


Nabil I. Al-Najjar and Eran Shmaya
Department of Managerial Economics and Decision Sciences, Kellogg School of Management, Northwestern University, Evanston, IL 60208, United States

Games and Economic Behavior 109 (2018) 616–624. https://doi.org/10.1016/j.geb.2018.02.007

✩ We thank Ehud Kalai, Ehud Lehrer, Rann Smorodinsky and Benjamin Weiss for helpful discussions.
* Corresponding author. E-mail address: [email protected] (N.I. Al-Najjar).

Article history: Received 16 December 2016; available online 16 March 2018.
JEL classification: C61, D83.
Keywords: Learning; Merging; Stationarity.

Abstract. A Bayesian agent relies on past observations to learn the structure of a stationary process. We show that the agent's predictions about near-horizon events become arbitrarily close to those he would have made if he knew the long-run empirical frequencies of the process.

1. Introduction

Consider a Bayesian decision maker who observes a stationary stochastic process with a finite set of outcomes. From the ergodic theorem, we know that, in the limit, this observer will learn all the long-run empirical frequencies of the process. Learning from infinite histories, however, is of little value when making decisions based on finite past observations.

This paper relates these two perspectives: frequentist learning over infinite horizons and Bayesian predictions based on finite data. We consider the long-run properties of the predictive distribution, defined as the distribution of next period's outcome given a finite number of past observations. We show that, almost surely, the predictive distribution in most periods becomes arbitrarily close to the predictive distribution had the true data generating process been known. Thus, as data accumulates, an observer's predictive distribution based on a finite history becomes nearly as good as what it would have been given knowledge of the objective empirical frequencies over infinite histories. We demonstrate that the various qualifications we impose cannot be dropped.

Our results connect several literatures on learning and prediction in stochastic environments. First, there is the literature on merging of opinions, pioneered by Blackwell and Dubins (1962). They prove that the beliefs of Bayesian observers with mutually absolutely continuous priors will strongly merge, in the sense that their posteriors about the infinite future become arbitrarily close. Motivated by applications to decision making under uncertainty and game theory, Kalai and Lehrer (1994) and Lehrer and Smorodinsky (1996) introduced weaker notions of merging, which focus on closeness of near-horizon predictive distributions.
While Blackwell and Dubins' strong merging obtains only under stringent assumptions, weak merging can be more easily satisfied. In our setting the posteriors typically do not strongly merge with the true parameter, no matter how much data accumulates.

Another line of inquiry focuses on representations of the form $\mu = \int_\Theta \mu_\theta \, d\lambda(\theta)$, where the law of the stochastic process $\mu$ is expressed as a convex combination of distributions $\{\mu_\theta\}_{\theta \in \Theta}$ that may be viewed as "simple," or "elementary." Such representations, also called decompositions, are useful in models of learning where the set of parameters $\Theta$ may be viewed as the object of learning. Two seminal theorems of this type are de Finetti's representation of exchangeable distributions and the ergodic representation of stationary processes. The ergodic decomposition is the finest decomposition possible using parameters that are themselves stationary. Our main theorem states that a Bayesian decision maker's predictions about a stationary process become arbitrarily close to those he would have made given knowledge of the true ergodic component.

Our result should also be contrasted with Doob's consistency theorem, which states that Bayesian posteriors weakly converge to the true parameter. When the focus is on decision making, what matters is not the agent's beliefs about the true parameter but the quality of his predictions. Although the two concepts are related, they are not the same. The difference is seen in the following example (Jackson et al., 1999, Example 5): Assume that the outcomes Heads and Tails are generated by tossing a fair coin. If we take the set of all Dirac measures on infinite sequences of Heads and Tails as "parameters," then the posterior about the parameter converges weakly to a belief concentrated on the true realization. On the other hand, the agent's predictions about next period's outcome are constant and never approach the predictions given the true "parameter." This example highlights that convergence of posterior beliefs to the true parameter may have little relevance to an agent's predictions and behavior.

A third related literature, which traces back to Cover (1975), concerns the non-Bayesian estimation of stationary processes. See Morvai and Weiss (2005) and the references therein. This literature seeks an algorithm for making predictions about near-horizon events that are accurate for every stationary process. Our proof of Theorem 1 and Example 4 rely on techniques developed in this literature. There is, however, a major difference between that literature and our work: we are interested in a specific algorithm, namely predictions derived from Bayesian updating.

2. Decompositions and learning

In this section we recall the notion of ergodic decomposition of a stationary process, interpreted as an agent's belief, and introduce the notion of weak merging, which formalizes the idea of learning used in this paper. We also contrast this notion of learning with the more familiar notion of Bayesian consistency, which concerns learning the underlying parameter.

2.1. Preliminaries

An agent (a decision maker, a player, or a statistician) observes a stochastic process $(\zeta_0, \zeta_1, \zeta_2, \dots)$ where the outcome in each period belongs to some fixed finite set $A$. Time is indexed by $n$ and the agent starts observing the process at $n = 0$. Let $\Omega = A^{\mathbb{N}}$ be the space of realizations of the process, with generic element denoted $\omega = (a_0, a_1, \dots)$. Endow $\Omega$ with the standard product topology and the induced Borel structure $\mathcal{F}$. Let $\Delta(\Omega)$ be the set of probability distributions over $\Omega$.

A way to represent uncertainty about the process is in terms of an index set of "parameters:"

Definition 1. A decomposition of $\mu \in \Delta(\Omega)$ is a quadruple $(\Theta, \mathcal{B}, \lambda, (\mu_\theta)_{\theta \in \Theta})$ where

• $(\Theta, \mathcal{B}, \lambda)$ is a standard probability space of parameters;
• $\mu_\theta \in \Delta(\Omega)$ for every $\theta \in \Theta$;
• for every $S \in \mathcal{F}$, the map $\theta \mapsto \mu_\theta(S)$ is $\mathcal{B}$-measurable and

$$\mu(S) = \int_\Theta \mu_\theta(S) \, \lambda(d\theta). \tag{1}$$



A decomposition may be thought of as a way for a Bayesian agent to arrange his beliefs. The agent views the process as a two-stage randomization: in the first stage, a parameter $\theta$ is chosen according to $\lambda$, and in the second the outcomes are generated according to $\mu_\theta$.

Beliefs can be represented in many ways. The two extreme decompositions are: (1) the trivial decomposition, with $\Theta = \{\bar\theta\}$, $\mathcal{B}$ trivial, and $\mu_{\bar\theta} = \mu$; and (2) the Dirac decomposition, with $\Theta = A^{\mathbb{N}}$, $\mathcal{B} = \mathcal{F}$, and $\lambda = \mu$. A "parameter" in this case is just a Dirac measure $\delta_\omega$ that assigns probability 1 to the realization $\omega$. Note that parameters are not required to be elements of some finite-dimensional vector space, as is often assumed in statistical models.

Stationary beliefs admit a well-known decomposition with natural properties. Recall that a stochastic process $(\zeta_0, \zeta_1, \dots)$ is stationary if, for every natural number $k$, the joint distribution of the $k$-tuple $(\zeta_n, \zeta_{n+1}, \dots, \zeta_{n+k-1})$ does not depend on $n$. A stationary distribution is the distribution of a stationary process. The set of stationary distributions over $\Omega$ is convex and compact in the weak*-topology. Its extreme points are called ergodic distributions. We denote the set of ergodic distributions by $\mathcal{E}$. Every stationary belief $\mu \in \Delta(\Omega)$ admits a unique decomposition in which the parameter set is the set of ergodic distributions: $\mu = \int \nu \, \lambda(d\nu)$ for some belief $\lambda \in \Delta(\mathcal{E})$. This decomposition is called the ergodic decomposition.
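The two-stage view is easy to make concrete in code. The following minimal sketch (Python with numpy; not part of the original paper) samples a realization by first drawing a parameter from the prior and then drawing outcomes from the corresponding component. For concreteness it uses the i.i.d.-coin decomposition of Example 1 below, where $\lambda$ is uniform on $[0,1]$ and $\mu_\theta$ is an i.i.d. coin with bias $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_two_stage(n_periods: int):
    """Sample a realization in two stages: theta ~ lambda, then omega ~ mu_theta."""
    theta = rng.random()                                  # stage 1: parameter from the uniform prior
    omega = (rng.random(n_periods) < theta).astype(int)   # stage 2: i.i.d. tosses with bias theta
    return theta, omega

theta, omega = sample_two_stage(20)
print(theta, omega)
```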


A fundamental fact about ergodic processes, called the ergodic theorem, ties the probability of an event to the frequency of its occurrence in the realized sequence of outcomes. According to the ergodic theorem, for every stationary belief $\mu$ and every block $(\bar a_0, \dots, \bar a_{k-1}) \in A^k$, the limit frequency

$$\Phi(\omega; \bar a_0, \dots, \bar a_{k-1}) = \lim_{n \to \infty} \frac{1}{n} \, \#\{0 \le t < n : a_t = \bar a_0, \dots, a_{t+k-1} = \bar a_{k-1}\}$$

exists for $\mu$-almost every realization $\omega = (a_0, a_1, \dots)$. When $\mu$ is ergodic this limit is the same for $\mu$-almost every $\omega$, and equals the probability $\mu(\{\omega = (a_0, a_1, \dots) \mid a_0 = \bar a_0, \dots, a_{k-1} = \bar a_{k-1}\})$. Thus, for ergodic processes, the probability of observing any given block equals its (objective) empirical frequency in the infinite realized sequence. The ergodic decomposition theorem states that for $\mu$-almost every $\omega$, the function $\Phi(\omega; \cdot)$ defined over blocks can be extended to a stationary measure in $\Delta(\Omega)$ which is also ergodic. Moreover, $\mu = \int \Phi(\omega; \cdot) \, \mu(d\omega)$, so the function $\omega \mapsto \Phi(\omega; \cdot)$ recovers the ergodic parameter from the realization of the process.¹

In summary, the parameter $\mu_\theta$ in the ergodic decomposition of a stationary process represents the empirical distribution of finite sequences of outcomes along the realization. These parameters capture our intuition of the "fundamentals" of the process. The belief $\lambda$ over the parameters in the ergodic decomposition represents uncertainty over these fundamentals. From an econometric point of view, the ergodic theorem implies that the ergodic component can be reconstructed from a single realization of the process. In Al-Najjar and Shmaya (2015), we make this formal using the concept of empirical identification.
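The reconstruction of $\Phi(\omega; \cdot)$ from data is easy to illustrate numerically. The following sketch (Python with numpy; the two-state transition matrix is an arbitrary choice, not from the paper) simulates a long realization of an ergodic Markov chain and compares the empirical frequency of a block with its probability under the stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# An ergodic two-state Markov chain on A = {0, 1}; P is an arbitrary choice.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])          # stationary distribution: pi @ P == pi

def simulate(n):
    x = np.empty(n, dtype=int)
    u = rng.random(n)
    x[0] = 0 if u[0] < pi[0] else 1
    for t in range(1, n):
        x[t] = 0 if u[t] < P[x[t - 1], 0] else 1
    return x

def block_frequency(x, block):
    """Empirical frequency of `block` among all windows of the realization."""
    k = len(block)
    hits = np.ones(len(x) - k + 1, dtype=bool)
    for i, b in enumerate(block):
        hits &= x[i:len(x) - k + 1 + i] == b
    return hits.mean()

x = simulate(200_000)
print(block_frequency(x, (0, 0, 1)),    # empirical Phi(omega; 0, 0, 1)
      pi[0] * P[0, 0] * P[0, 1])        # exact mu(a0=0, a1=0, a2=1); the two are close
```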





A special case of the ergodic decomposition is the decomposition of an exchangeable distribution $\mu$ via i.i.d. distributions. For future reference, consider the following example:

Example 1. The set of outcomes is $A = \{0, 1\}$ and the agent's belief is given by

$$\mu(\zeta_n = a_0, \dots, \zeta_{n+k-1} = a_{k-1}) = \frac{1}{(k+1) \binom{k}{d}}$$

for every $n, k \in \mathbb{N}$ and $a_0, \dots, a_{k-1} \in A$, where $d = a_0 + \dots + a_{k-1}$. Thus, the agent believes that if he observes the process for $k$ consecutive periods then the number $d$ of good periods (periods with outcome 1) is distributed uniformly in $\{0, \dots, k\}$, and all configurations with $d$ good outcomes are equally likely. De Finetti's decomposition is given by $(\Theta, \mathcal{B}, \lambda, (\mu_\theta)_{\theta \in \Theta})$ where $\Theta = [0, 1]$ is equipped with the standard Borel structure $\mathcal{B}$ and the Lebesgue measure $\lambda$, and $\mu_\theta \in \Delta(\Omega)$ is the distribution of i.i.d. coin tosses with probability of success $\theta$:

$$\mu_\theta(\zeta_n = a_0, \dots, \zeta_{n+k-1} = a_{k-1}) = \theta^d (1-\theta)^{k-d}. \quad \square$$
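Example 1's decomposition can be sanity-checked numerically: integrating $\mu_\theta(\text{block}) = \theta^d(1-\theta)^{k-d}$ against the uniform prior should reproduce $1/((k+1)\binom{k}{d})$, which is the Beta integral $\int_0^1 \theta^d (1-\theta)^{k-d}\,d\theta$. A small sketch (Python with numpy; the midpoint grid is an arbitrary numerical choice):

```python
import numpy as np
from math import comb

def mu_block(k: int, d: int) -> float:
    """mu of a fixed block of length k with d ones (Example 1)."""
    return 1.0 / ((k + 1) * comb(k, d))

def mixture_block(k: int, d: int, grid: int = 100_000) -> float:
    """Integrate mu_theta(block) = theta^d (1-theta)^(k-d) over the uniform prior."""
    theta = (np.arange(grid) + 0.5) / grid      # midpoint rule on [0, 1]
    return float(np.mean(theta**d * (1 - theta)**(k - d)))

for k, d in [(3, 1), (5, 0), (10, 7)]:
    print(k, d, mu_block(k, d), mixture_block(k, d))   # each pair should agree
```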





2.2. Learning

For every $\mu \in \Delta(\Omega)$ and sequence $(a_0, \dots, a_{n-1}) \in A^n$ with positive $\mu$-probability, the period-$n$ predictive distribution is the probability distribution $\mu(\cdot \mid a_0, \dots, a_{n-1})$ on $A$ representing the agent's prediction about next period's outcome given the prior $\mu$ and the observations $a_0, \dots, a_{n-1}$. For expository simplicity, predictive distributions in this paper always refer to one-step-ahead predictions (our analysis covers any finite horizon).

Kalai and Lehrer (1994) and Kalai et al. (1999) introduced notions of merging of predictive distributions. First, some definitions. A bounded sequence of real numbers $a_0, a_1, \dots$ is said to strongly Cesàro converge to a real number $a$, denoted $a_n \xrightarrow{\text{s.c.}} a$, if

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} |a_k - a| = 0.$$

Equivalently, $a_n \xrightarrow{\text{s.c.}} a$ if there exists a set $T \subseteq \mathbb{N}$ such that $\lim_{n \to \infty, n \in T} a_n = a$ and $T$ has density 1, i.e., $\lim_{n \to \infty} |T \cap \{0, 1, \dots, n\}|/n = 1$. Also, for every pair $p, q \in \Delta(A)$ we let $\|p - q\| = \max_{a \in A} |p(a) - q(a)|$.
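Strong Cesàro convergence allows occasional large deviations, as long as they occur on a set of times of density 0. The toy sequence below (a sketch in Python with numpy; the choice of spike times is arbitrary) equals 1 at the sparse times $n = m^2$ and 0 elsewhere; it does not converge, but it strongly Cesàro-converges to 0:

```python
import numpy as np

N = 1_000_000
a = np.zeros(N)
squares = np.arange(1, int(N**0.5) + 1) ** 2
a[squares[squares < N]] = 1.0           # spikes on a density-0 set of times

# Cesaro averages of |a_n - 0| tend to 0 even though a_n itself does not converge.
cesaro = np.cumsum(np.abs(a)) / np.arange(1, N + 1)
print(cesaro[999], cesaro[99_999], cesaro[N - 1])   # roughly n^(-1/2) decay
```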

Definition 2. Let $\mu, \tilde\mu \in \Delta(\Omega)$. Then $\tilde\mu$ merges with $\mu$ if

$$\|\tilde\mu(\cdot \mid a_0, \dots, a_{n-1}) - \mu(\cdot \mid a_0, \dots, a_{n-1})\| \xrightarrow[n \to \infty]{} 0$$

for $\mu$-almost every realization $\omega = (a_0, a_1, \dots) \in A^{\mathbb{N}}$; $\tilde\mu$ weakly merges with $\mu$ if

$$\|\tilde\mu(\cdot \mid a_0, \dots, a_{n-1}) - \mu(\cdot \mid a_0, \dots, a_{n-1})\| \xrightarrow[n \to \infty]{\text{s.c.}} 0$$

for $\mu$-almost every realization $\omega = (a_0, a_1, \dots) \in A^{\mathbb{N}}$.

˜ to be close to the These definitions were inspired by Blackwell and Dubins (1962), who required the prediction of μ prediction of μ not just for the next period but for the infinite horizon. Kalai and Lehrer (1993) applied the concept of merging to learning in games. 1

For a reference, see Gray (2009).

Definition 3. A decomposition $(\Theta, \mathcal{B}, \lambda, (\mu_\theta))$ of $\mu \in \Delta(\Omega)$ is learnable if $\mu$ merges with $\mu_\theta$ for $\lambda$-almost every $\theta$. The decomposition is weakly learnable if $\mu$ weakly merges with $\mu_\theta$ for $\lambda$-almost every $\theta$.

As an example of a learnable decomposition, consider the Bayesian agent of Example 1. In this case

$$\mu(1 \mid a_0, \dots, a_{n-1}) = \frac{a_0 + \dots + a_{n-1} + 1}{n + 2}.$$

The strong law of large numbers implies that for every parameter $\theta \in [0, 1]$ this expression converges $\mu_\theta$-almost surely to $\theta$. In addition, $\mu_\theta(1 \mid a_0, \dots, a_{n-1}) = \theta$ for every $a_0, \dots, a_{n-1}$. Therefore $\mu$ merges with $\mu_\theta$ for every $\theta$, so de Finetti's decomposition is learnable (and, a fortiori, weakly learnable). This is a rare case in which the predictions $\mu(\zeta_n \in \cdot \mid a_0, \dots, a_{n-1})$ and $\mu_\theta(\zeta_n \in \cdot \mid a_0, \dots, a_{n-1})$ can be calculated explicitly. In general, merging and weak merging are difficult to establish because the Bayesian prediction about the next period is a complicated expression that potentially depends on the entire history of observations.
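This explicit case (Laplace's rule of succession) is easy to simulate. The sketch below (Python with numpy; not from the paper) draws $\theta$ from the prior, generates coin tosses, and tracks the merging error $|\mu(1 \mid a_0, \dots, a_{n-1}) - \theta|$, which shrinks to 0:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.random()                         # true parameter, drawn from the prior
n = 100_000
tosses = (rng.random(n) < theta).astype(int)

# mu(1 | a_0..a_{n-1}) = (a_0 + ... + a_{n-1} + 1) / (n + 2), Laplace's rule
counts = np.concatenate(([0], np.cumsum(tosses)))[:-1]   # sum of the first i outcomes
predictions = (counts + 1) / (np.arange(n) + 2)

# mu_theta's prediction is theta itself, so the merging error is:
errors = np.abs(predictions - theta)
print(theta, errors[10], errors[1000], errors[n - 1])    # the error shrinks to 0
```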

˜ the agent’s belief. To say that μ ˜ weakly In applications, μ represents the true process generating observations and μ merges with μ means that this agent’s prediction about next period’s outcome is accurate except for rare times. Consider a statistical decision problems where in each period n the agent chooses the action f (a0 , . . . , an−1 ) that maximizes expected utility given past observations. The agent’s payoff depends on this decision and on the outcome an for that period. If the agent’s prior weakly merges with the true data generating process, then it is straightforward to show that this agent will take an -optimal decision in the long-run. A similar conclusion holds if the agent aggregates periods’ payoffs using a discount factor that is close enough to 1.2 Kalai et al. (1999) provide another motivation for weak merging in terms of the properties of calibration tests, which ˜ weakly merges compare the predicted frequency of events to their realized empirical frequencies. They showed that μ ˜ pass a class of calibration tests when the outcomes are generated according to μ. with μ if and only if forecasts made by μ Finally, Lehrer and Smorodinsky (2000) provide a characterization of weak merging in terms of the relative entropy between μ˜ and μ.3 No similar characterization is known for merging. 2.4. Merging and the consistency of Bayesian estimators A common way to think about Bayesian inference is in terms of the consistency of Bayesian estimator. Consistency is formulated in terms learning the parameter itself. Recall that the Bayesian estimator of the parameter θ is the agent’s conditional belief over θ after observing the outcomes of the process. It is well known that under any ‘reasonable’ decomposition (, B , λ, (μθ )), the Bayesian estimator is consistent, i.e., the estimator weakly converges to the Dirac measure over the true parameter as data accumulates, for a set of parameters of λ-measure 1. The argument traces back to Doob. See, for example, Weizsäcker (1996) and the references therein (see also Freedman (1963) for consistency for every parameter).4 The problem is that consistency of an estimator does not imply that the agent can make better predictions about future outcomes. Recall the example of Jackson et al. (1999), discussed in the introduction, with i.i.d. coin tosses and the Dirac decomposition. After observing the first n outcomes of the process the agent’s belief about the parameter is uniform over all ω that agrees with the true parameter ω∗ on the first n coordinates. While this belief indeed converges to δω∗ , the agent does not gain any new insight about the process that enables him to make better predictions about the future. This decomposition is therefore not learnable in the sense of Definition 3. 3. Main theorem We are now in a position to state our main theorem. Theorem 1. The ergodic decomposition of every stationary stochastic process is weakly learnable. To understand the difficulty in proving the theorem, note that if θ were known, the predictive distribution about period n outcome is:

$$\frac{\mu_\theta(a_0, \dots, a_{n-1}, a_n)}{\mu_\theta(a_0, \dots, a_{n-1})}. \tag{2}$$

² See the working paper version for a formal statement of these observations.
³ However, we do not know whether their condition can be used to prove Theorem 1 without repeating the whole argument.
⁴ Bayesian consistency holds whenever the decomposition has the property that the realization of the process determines the parameter.


The numerator and the denominator of this expression are of the form $\mu_\theta(b)$ for some block $b = (a_0, \dots, a_k)$ of outcomes. When the agent does not know $\theta$, the ergodic theorem implies that $\mu_\theta(b)$ is the asymptotic frequency of occurrences of the block $b$ in the realized sequence. The agent's assessment of $\mu_\theta(b)$, therefore, becomes asymptotically accurate for every such $b$. However, in period $n$, when the agent actually has to compute the predictive distribution (2), he has only one observation of a block of length $n$ and no observation of a block of length $n + 1$, so he has no observations to use in inferring the probabilities that appear in (2). Theorem 1 says that the ratio appearing in (2), i.e., the agent's predictive distribution, will still be approximately correct in most periods.
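The scarcity of long-block observations is easy to see in a simulated realization. The sketch below (Python with numpy; i.i.d. fair coins are an arbitrary concrete choice, not from the paper) counts how often the realized prefix of length $k$ recurs in a sample of length 5000; the counts collapse to 1 as $k$ grows, which is exactly the difficulty: the block whose probability the agent needs in (2) has been observed only once.

```python
import numpy as np

rng = np.random.default_rng(3)
x = (rng.random(5000) < 0.5).astype(int)   # one realization of i.i.d. fair coins

def occurrences(x, block):
    """Number of times `block` occurs (with overlaps) in the realization x."""
    k = len(block)
    return sum(int(np.array_equal(x[t:t + k], block)) for t in range(len(x) - k + 1))

for k in [2, 5, 10, 15, 20]:
    print(k, occurrences(x, x[:k]))   # counts collapse to 1 as k grows
```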



 



The remainder of the section introduces examples that illustrate some of the implications of Theorem 1. We begin with an example of a hidden Markov process:

Example 2. An agent believes that the state of the economy in every period is a noisy signal of an underlying "hidden" state that changes according to a Markov chain with memory 1. Formally, let $A = \{B, G\}$ be the set of outcomes, $H = \{B, G\}$ the set of hidden (unobserved) states, and let $1/2 < p, q < 1$ be the parameters. Denote by $\hat\mu_{p,q} \in \Delta((H \times A)^{\mathbb{N}})$ the steady-state distribution of the $(H \times A)$-valued Markov process with transition matrix $\rho: H \times A \to \Delta(H \times A)$ given by

$$\rho(h, a)[h', a'] = \big(p \delta_{h,h'} + (1-p)(1-\delta_{h,h'})\big) \cdot \big(q \delta_{h',a'} + (1-q)(1-\delta_{h',a'})\big).$$

Thus, if the hidden state in some period is $h$ then in the next period the hidden state remains $h$ with probability $p$ and changes with probability $1 - p$. The observed outcome in every period is the hidden state of that period with probability $q$. Let $\mu_{p,q} \in \Delta(A^{\mathbb{N}})$ be the marginal of $\hat\mu_{p,q}$ over $A^{\mathbb{N}}$, which represents the belief over observed outcomes. Then $\mu_{p,q}$ is a stationary process which is not Markov of any order. If the agent is uncertain about $p$ and $q$, then his belief $\mu$ about the outcome process is again stationary, and can be represented by some prior over the parameter set $\Theta = (1/2, 1] \times (1/2, 1]$. This decomposition of $\mu$ is the ergodic decomposition. $\square$

The consistency of the Bayesian estimator in Example 2 implies that the conditional belief over the parameter $(p, q)$ converges almost surely, in the weak topology over $\Delta(\Theta)$, to the belief concentrated on the true parameter. However, because next-period predictions involve complicated expressions that depend on the entire history of the process, it is not clear whether these predictions merge with the truth. Nevertheless, Theorem 1 implies that these predictions weakly merge with the truth as data accumulates.
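Theorem 1's conclusion for Example 2 can be probed by simulation. The sketch below (Python with numpy; not from the paper) approximates the continuum prior by a uniform grid inside $(1/2, 1)^2$, runs one forward filter per grid parameter, and tracks the Cesàro average of the gap between the grid-Bayesian one-step predictive and the predictive of an oracle who knows $(p, q)$. The true parameters, grid, and horizon are all arbitrary choices, so this only approximates the Bayesian predictive.

```python
import numpy as np

rng = np.random.default_rng(4)
p_true, q_true, T = 0.85, 0.75, 20_000

# Simulate the hidden Markov process of Example 2 (1 = G, 0 = B).
h = np.empty(T, dtype=int)
h[0] = rng.integers(2)
for t in range(1, T):
    h[t] = h[t - 1] if rng.random() < p_true else 1 - h[t - 1]
a = np.where(rng.random(T) < q_true, h, 1 - h)   # outcome = hidden state w.p. q

# Discretized prior over (p, q): a uniform grid inside (1/2, 1).
pq = np.array([(p, q) for p in np.linspace(0.55, 0.95, 9)
                      for q in np.linspace(0.55, 0.95, 9)])
P, Q = pq[:, 0], pq[:, 1]
w = np.full(len(pq), 1.0 / len(pq))   # posterior weights over the grid
b = np.full(len(pq), 0.5)             # P(h_{t-1} = G | a_0..a_{t-1}), per parameter
b_true = 0.5

gap_sum = 0.0
for t in range(T):
    hb = b * P + (1 - b) * (1 - P)            # P(h_t = G | past), per parameter
    pred = hb * Q + (1 - hb) * (1 - Q)        # P(a_t = G | past), per parameter
    hb_t = b_true * p_true + (1 - b_true) * (1 - p_true)
    pred_true = hb_t * q_true + (1 - hb_t) * (1 - q_true)

    gap_sum += abs(np.dot(w, pred) - pred_true)   # |Bayes predictive - oracle|
    if t + 1 in (100, 1000, T):
        print(t + 1, gap_sum / (t + 1))           # Cesaro averages shrink

    # Update the filters and the posterior with the realized outcome a_t.
    like = pred if a[t] == 1 else 1 - pred        # per-parameter prob of a_t
    b = hb * (Q if a[t] == 1 else 1 - Q) / like   # condition h_t on a_t
    w = w * like
    w /= w.sum()
    lt = pred_true if a[t] == 1 else 1 - pred_true
    b_true = hb_t * (q_true if a[t] == 1 else 1 - q_true) / lt
```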

The following well-known example recalls that, although the agent's predictions about near-horizon events become as good as if he knew the fundamentals of the process, this conclusion does not extend to predicting long-run events, no matter how much data accumulates:

Example 3. Consider, as in Example 1, an agent who faces a sequence of i.i.d. coin tosses with parameter $\theta \in [0, 1]$ representing the probability of Heads, and assume that the agent has a uniform prior over $[0, 1]$. This agent will eventually learn to predict near-horizon outcomes as if he knew the true parameter $\theta$, but he will continue to assign probability 0 to the event that the long-run frequency is exactly $\theta$. $\square$

Our final example illustrates that weak learnability in Theorem 1 cannot be replaced by learnability. The example is a modification of an example given by Ryabko (1988) for the forward prediction problem in a non-Bayesian setup.

Example 4. Every period there is probability 1/2 of the eruption of war. If no war erupts then the outcome is either a bad economy or a good economy, depending on the number of peaceful periods since the last war. The function from the number of peaceful periods to outcomes is an unknown parameter of the process, and the agent has a uniform prior over this parameter.

Formally, let $A = \{W, B, G\}$ be the set of outcomes. We define $\mu \in \Delta(A^{\mathbb{N}})$ through its ergodic decomposition. Let $\Theta = \{B, G\}^{\{1, 2, \dots\}}$ be the set of parameters, with the standard Borel structure $\mathcal{B}$ and the uniform distribution $\lambda$. Thus, a parameter is a function $\theta: \{1, 2, \dots\} \to \{B, G\}$. We can think of the belief $\mu_\theta$ as a hidden Markov model where the unobservable process $\xi_0, \xi_1, \dots$ is the time that has elapsed since the last war. Thus, $\xi_0, \xi_1, \dots$ is the $\mathbb{N}$-valued stationary Markov process with transition $\rho: \mathbb{N} \to \Delta(\mathbb{N})$ given by

$$\rho(k)[j] = \begin{cases} 1/2, & \text{if } j = k + 1, \\ 1/2, & \text{if } j = 0, \\ 0, & \text{otherwise,} \end{cases}$$

for every $j, k \in \mathbb{N}$, while $\mu_\theta$ is the distribution of a sequence $\zeta_0, \zeta_1, \dots$ of $A$-valued random variables such that

$$\zeta_n = \begin{cases} W, & \text{if } \xi_n = 0, \\ \theta(\xi_n), & \text{otherwise.} \end{cases}$$


Consider a Bayesian agent who observes the process. After the first time a war erupts, the agent can keep track of the state of the process $\xi_n$ in every period. If there were no uncertainty about the parameter, i.e., if the Bayesian agent knew $\theta$, his prediction about the next outcome when $\xi_n = k$ would give probability 1/2 to the outcome W and probability 1/2 to the outcome $\theta(k + 1)$. If, on the other hand, the agent does not know $\theta$ but believes that it is randomized according to $\lambda$, he can deduce the values $\theta(k)$ gradually as he observes the process. However, for every $k \in \{1, 2, 3, \dots\}$ there will be a time at which the agent observes $k$ consecutive peaceful periods for the first time, and at this point the agent's prediction about the next outcome will give probability 1/2 to W, 1/4 to B, and 1/4 to G. Thus there will be infinitely many occasions on which an agent who predicts according to $\mu$ predicts differently from an agent who predicts according to $\mu_\theta$. Therefore the decomposition is not learnable. On the other hand, in agreement with Theorem 1, these occasions become increasingly infrequent, so the decomposition is weakly learnable. $\square$
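Example 4's contrast between merging and weak merging is visible in simulation. The sketch below (Python with numpy; not from the paper; for simplicity the simulation starts right after a war, and $\theta$ is sampled lazily) flags the periods at which the Bayesian's prediction differs from the oracle's, namely the first visits to each peace length. Such periods keep occurring, but they have density 0:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 1_000_000
theta = {}                       # the parameter theta(k), sampled lazily
seen = set()                     # depths k at which theta(k) has been revealed
k = 0                            # peaceful periods since the last war
novel = np.zeros(T, dtype=bool)  # periods where the agent's prediction differs

for t in range(T):
    novel[t] = (k + 1) not in seen   # theta(k+1) unknown: predictions disagree by 1/4
    if rng.random() < 0.5:           # war erupts; the counter resets
        k = 0
    else:                            # peaceful period; theta(k) gets revealed
        k += 1
        theta.setdefault(k, rng.integers(2))
        seen.add(k)

dist = 0.25 * novel                  # prediction distance per period
print(novel.sum())                   # grows without bound as T grows (no merging)
print(dist[:1000].mean(), dist.mean())   # but Cesaro averages vanish (weak merging)
```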

4. Proof of Theorem 1

Up to now we have assumed that the stochastic process starts in period $n = 0$. When working with stationary processes it is natural to extend the index set of the process from $\mathbb{N}$ to $\mathbb{Z}$, i.e., to assume that the process has an infinite past. This is without loss of generality: every stationary stochastic process $\zeta_0, \zeta_1, \dots$ admits a unique extension $\dots, \zeta_{-1}, \zeta_0, \zeta_1, \dots$ to the index set $\mathbb{Z}$ (Kallenberg, 2002, Lemma 10.2). We therefore assume hereafter, in harmless conflict with our previous notation, that $\Omega = A^{\mathbb{Z}}$. For every $n, m \in \mathbb{Z} \cup \{+\infty, -\infty\}$, we denote by $\mathcal{F}_m^n$ the $\sigma$-algebra of subsets of $\Omega$ generated by the variables $\zeta_k$ for $m \le k < n$. For every pair $\mathcal{D}, \mathcal{D}'$ of $\sigma$-algebras we denote by $\mathcal{D} \vee \mathcal{D}'$ the smallest $\sigma$-algebra that contains both.

We require some definitions and properties of conditional probabilities in standard Borel spaces. All these definitions are intuitive and have an easy formulation in the case of conditioning over a $\sigma$-algebra generated by a finite or countable partition, but they involve some subtleties in the general case. Let $\mathcal{D}$ be a $\sigma$-algebra of Borel subsets of $\Omega$. The quotient space of $(\Omega, \mathcal{F}, \mu)$ with respect to $\mathcal{D}$ is the unique (up to isomorphism of measure spaces) standard probability space $(\Theta, \mathcal{B}, \lambda)$ together with a measurable map $\alpha: \Omega \to \Theta$ (called the projection) such that $\mathcal{D}$ is generated by $\alpha$, i.e., for every $\mathcal{D}$-measurable function $f$ from $\Omega$ to some standard probability space there exists a (unique up to equality $\lambda$-almost surely) $\mathcal{B}$-measurable function $\tilde f$ defined over $\Theta$ such that $f = \tilde f \circ \alpha$, $\mu$-a.s. The conditional distributions of $\mu$ over $\mathcal{D}$ form the unique (up to equality $\lambda$-almost surely) family $(\mu_\theta)_{\theta \in \Theta}$ of probability measures over $(\Omega, \mathcal{F})$ such that the following two conditions hold:

(a) for every $\theta \in \Theta$, we have $\mu_\theta(\{\omega \in \Omega \mid \alpha(\omega) = \theta\}) = 1$;
(b) the map $\theta \mapsto \mu_\theta(S)$ is $\mathcal{B}$-measurable and satisfies (1) for every $S \in \mathcal{F}$.

We call $(\Theta, \mathcal{B}, \lambda, (\mu_\theta))$ the decomposition of $\mu$ induced by $\mathcal{D}$. For every belief $\mu \in \Delta(\Omega)$, the trivial decomposition of $\mu$ is generated by the trivial $\sigma$-algebra $\{\emptyset, \Omega\}$, and the Dirac decomposition is generated by the $\sigma$-algebra of all Borel subsets of $\Omega$. If $\mu$ is stationary then its ergodic decomposition is induced by the $\sigma$-algebra $\mathcal{I}$ of all invariant Borel sets of $\Omega$, i.e., all Borel sets $S \subseteq \Omega$ such that $S = T(S)$, where $T: \Omega \to \Omega$ is the left shift given by $T(\omega)_n = \omega_{n+1}$ for every $n \in \mathbb{Z}$.

For every bounded Borel function $f: \Omega \to \mathbb{R}$, the conditional expectation of $f$ given $\mathcal{D}$, denoted $E(f \mid \mathcal{D})$, is the $\mathcal{D}$-measurable random variable given by $E(f \mid \mathcal{D})(\omega) = \int f \, d\mu_{\alpha(\omega)}$, where $\alpha: \Omega \to \Theta$ is the projection map to the quotient space. In the case that $f$ is an indicator function, i.e., $f = 1_B$ for some Borel set $B$, we call this random variable the conditional probability of $B$ given $\mathcal{D}$ and denote it by $\mu(B \mid \mathcal{D})$. If $\zeta: \Omega \to A$ is a random variable with values in some finite or countable set $A$, then the conditional distribution of $\zeta$ given $\mathcal{D}$, denoted $\mu(\zeta = \cdot \mid \mathcal{D})$, is the $\Delta(A)$-valued random variable given by $\mu(\zeta = \cdot \mid \mathcal{D})(a) = \mu(\zeta = a \mid \mathcal{D})$.

We will prove a more general version of Theorem 1, which may be interesting in its own right. Let $T: A^{\mathbb{Z}} \to A^{\mathbb{Z}}$ be the left shift. A $\sigma$-algebra $\mathcal{D}$ of Borel subsets of $\Omega$ is shift-invariant if for every Borel subset $S$ of $A^{\mathbb{Z}}$ it holds that $S \in \mathcal{D}$ if and only if $T(S) \in \mathcal{D}$.

Theorem 2. Let $\mu$ be a stationary distribution over $\Omega$ and let $\mathcal{D}$ be a shift-invariant $\sigma$-algebra of subsets of $\Omega$ such that $\mathcal{D} \subseteq \mathcal{F}_{-\infty}^0$. Then the decomposition of $\mu$ induced by $\mathcal{D}$ is weakly learnable.

Theorem 1 follows from Theorem 2 since the $\sigma$-algebra of invariant sets $\mathcal{I}$, which induces the ergodic decomposition, satisfies the assumptions of Theorem 2.⁵ We will prove Theorem 2 using Lemma 4.1.

Lemma 4.1. Let $\mu$ be a stationary distribution over $A^{\mathbb{Z}}$ and let $\mathcal{D}$ be a shift-invariant $\sigma$-algebra of Borel subsets of $A^{\mathbb{Z}}$. Then

$$\|\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n \vee \mathcal{D}) - \mu(\zeta_n = \cdot \mid \mathcal{F}_{-\infty}^n \vee \mathcal{D})\| \xrightarrow[n \to \infty]{\text{s.c.}} 0, \quad \mu\text{-a.s.}$$

⁵ We formulated Theorem 1 for $\Omega = A^{\mathbb{N}}$, while in Theorem 2 we assume $\Omega = A^{\mathbb{Z}}$. This contrast is harmless because the ergodic decomposition of $\mu \in \Delta(A^{\mathbb{N}})$ is the same as the decomposition of its unique extension to $\Delta(A^{\mathbb{Z}})$.


For each $n$, the random variables $\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n \vee \mathcal{D})$ and $\mu(\zeta_n = \cdot \mid \mathcal{F}_{-\infty}^n \vee \mathcal{D})$ are, respectively, the conditional distribution of $\zeta_n$ given $\mathcal{F}_0^n \vee \mathcal{D}$ and given $\mathcal{F}_{-\infty}^n \vee \mathcal{D}$. These random variables take values in $\Delta(A)$. To understand the implications of Lemma 4.1, consider the case in which $\mathcal{D} = \{\emptyset, \Omega\}$. Then Lemma 4.1 says that a Bayesian agent who observes a stationary process from period 0 onwards will make predictions in the long run as if he knew the infinite history of the process.
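For intuition, the trivial-$\mathcal{D}$ case of Lemma 4.1 can be illustrated with the hidden Markov process of Example 2, whose conditional distributions are computable by filtering. In the sketch below (Python with numpy; not from the paper; $M = 500$ past observations stand in for the infinite past, which is an approximation), one-step predictions conditioned on $\mathcal{F}_0^n$ are compared with predictions that also use the extra past; the gap dies out quickly, so its Cesàro average vanishes.

```python
import numpy as np

rng = np.random.default_rng(6)
p, q = 0.85, 0.75          # known parameters of Example 2 (arbitrary choice)
M, T = 500, 5000           # M past observations stand in for the infinite past

n = M + T                  # simulate periods -M, ..., T-1
h = np.empty(n, dtype=int)
h[0] = rng.integers(2)
for t in range(1, n):
    h[t] = h[t - 1] if rng.random() < p else 1 - h[t - 1]
a = np.where(rng.random(n) < q, h, 1 - h)

def predictions(obs):
    """One-step-ahead P(next outcome = G | observations so far), by filtering."""
    b, out = 0.5, []                       # b = P(current hidden state = G | obs)
    for x in obs:
        hb = b * p + (1 - b) * (1 - p)     # propagate the hidden state
        pred = hb * q + (1 - hb) * (1 - q)
        out.append(pred)
        like = pred if x == 1 else 1 - pred
        b = hb * (q if x == 1 else 1 - q) / like   # condition on the outcome
    return np.array(out)

pred_finite = predictions(a[M:])       # conditions on F_0^n only
pred_past = predictions(a)[M:]         # also conditions on the M extra past outcomes
diff = np.abs(pred_finite - pred_past)
print(diff[:3], diff[100], diff.mean())   # the filters merge; Cesaro average ~ 0
```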

Proof of Lemma 4.1. For every $n \ge 0$ let $f_n: \Omega \to \Delta(A)$ be a version of the conditional distribution of $\zeta_0$ according to $\mu$ given the finite history $\zeta_{-1}, \dots, \zeta_{-n}$ and $\mathcal{D}$:

$$f_n = \mu(\zeta_0 = \cdot \mid \mathcal{F}_{-n}^0 \vee \mathcal{D}),$$

and let $f_\infty: \Omega \to \Delta(A)$ be a version of the conditional distribution of $\zeta_0$ according to $\mu$ given the infinite history $\zeta_{-1}, \zeta_{-2}, \dots$ and $\mathcal{D}$:

$$f_\infty = \mu(\zeta_0 = \cdot \mid \mathcal{F}_{-\infty}^0 \vee \mathcal{D}).$$

Let $g_n = \|f_n - f_\infty\|$. By the martingale convergence theorem, $\lim_{n \to \infty} f_n = f_\infty$ $\mu$-a.s., and therefore

$$\lim_{n \to \infty} g_n = 0, \quad \mu\text{-a.s.} \tag{3}$$

It follows from the stationarity of $\mu$ and the fact that $\mathcal{D}$ is shift-invariant that

$$\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n \vee \mathcal{D}) = f_n \circ T^n \quad \text{and} \quad \mu(\zeta_n = \cdot \mid \mathcal{F}_{-\infty}^n \vee \mathcal{D}) = f_\infty \circ T^n.$$

Therefore, $\mu$-a.s.,

$$\|\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n \vee \mathcal{D}) - \mu(\zeta_n = \cdot \mid \mathcal{F}_{-\infty}^n \vee \mathcal{D})\| = \|f_n \circ T^n - f_\infty \circ T^n\| = g_n \circ T^n, \tag{4}$$

and

$$\frac{1}{N} \sum_{n=0}^{N-1} \|\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n \vee \mathcal{D}) - \mu(\zeta_n = \cdot \mid \mathcal{F}_{-\infty}^n \vee \mathcal{D})\| = \frac{1}{N} \sum_{n=0}^{N-1} g_n \circ T^n \xrightarrow[N \to \infty]{} 0,$$

where the equality follows from (4) and the limit follows from (3) and Maker's generalization of the ergodic theorem (Kallenberg, 2002, Corollary 10.8), which covers multiple functions simultaneously:

Maker's Ergodic Theorem. Let $\mu \in \Delta(\Omega)$ be a stationary distribution and let $g_0, g_1, \dots: \Omega \to \mathbb{R}$ be such that $\sup_n |g_n| \in L^1(\mu)$ and $g_n \to g_\infty$ $\mu$-a.s. Then

$$\frac{1}{N} \sum_{n=0}^{N-1} g_n \circ T^n \xrightarrow[N \to \infty]{} E(g_\infty \mid \mathcal{I}), \quad \mu\text{-a.s.} \quad \square$$

Proof of Theorem 2. From $\mathcal{D} \subseteq \mathcal{F}_{-\infty}^0$ it follows that $\mathcal{F}_{-\infty}^n \vee \mathcal{D} = \mathcal{F}_{-\infty}^n$. Therefore, from Lemma 4.1 we get that

$$\|\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n \vee \mathcal{D}) - \mu(\zeta_n = \cdot \mid \mathcal{F}_{-\infty}^n)\| \xrightarrow[n \to \infty]{\text{s.c.}} 0, \quad \mu\text{-a.s.}$$

By the same lemma (with $\mathcal{D}$ trivial),

$$\|\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n) - \mu(\zeta_n = \cdot \mid \mathcal{F}_{-\infty}^n)\| \xrightarrow[n \to \infty]{\text{s.c.}} 0, \quad \mu\text{-a.s.}$$

By the last two limits and the triangle inequality,

$$\|\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n) - \mu(\zeta_n = \cdot \mid \mathcal{F}_0^n \vee \mathcal{D})\| \le \|\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n) - \mu(\zeta_n = \cdot \mid \mathcal{F}_{-\infty}^n)\| + \|\mu(\zeta_n = \cdot \mid \mathcal{F}_0^n \vee \mathcal{D}) - \mu(\zeta_n = \cdot \mid \mathcal{F}_{-\infty}^n)\| \xrightarrow[n \to \infty]{\text{s.c.}} 0, \quad \mu\text{-a.s.} \tag{5}$$

Let $(\Theta, \mathcal{B}, \lambda)$ be the quotient of $(\Omega, \mathcal{F}, \mu)$ over $\mathcal{D}$ and $(\mu_\theta)_{\theta \in \Theta}$ the corresponding conditional distributions. Let $S$ be the set of all realizations $\omega = (\dots, a_{-1}, a_0, a_1, \dots)$ such that

$$\|\mu(\zeta_n = \cdot \mid a_0, \dots, a_{n-1}) - \mu_{\alpha(\omega)}(\zeta_n = \cdot \mid a_0, \dots, a_{n-1})\| \xrightarrow[n \to \infty]{\text{s.c.}} 0.$$

Then $\mu(S) = 1$ by (5). But $\mu(S) = \int \mu_\theta(S) \, \lambda(d\theta)$ by (1). It follows that $\mu_\theta(S) = 1$ for $\lambda$-almost every $\theta$, as desired. $\square$


5. Ergodicity and sufficiency for prediction

Mixing conditions formalize the intuition that observing a sequence of outcomes of a process does not change one's beliefs about events in the far future. Standard examples of mixing processes are i.i.d. processes and non-periodic Markov processes. In this section we recall the mixing condition, called sufficiency for prediction, introduced by Jackson et al. (1999). We show that the ergodic decomposition is not necessarily sufficient for prediction, and that a decomposition finer than the ergodic decomposition is both sufficient for prediction and weakly learnable.

Let $\overrightarrow{\mathcal{T}} = \bigcap_{m \ge 0} \mathcal{F}_m^\infty$ be the future tail $\sigma$-algebra, where $\mathcal{F}_m^\infty$ is the $\sigma$-algebra of $\Omega$ generated by $(\zeta_m, \zeta_{m+1}, \dots)$. A probability distribution (not necessarily stationary) $\nu \in \Delta(\Omega)$ is mixing if it is $\overrightarrow{\mathcal{T}}$-trivial, i.e., if $\nu(B) \in \{0, 1\}$ for every $B \in \overrightarrow{\mathcal{T}}$.⁶

Proposition 3. Let $(\Theta, \mathcal{B}, \lambda, (\mu_\theta))$ be the decomposition of $\mu \in \Delta(\Omega)$ induced by the tail $\sigma$-algebra $\overrightarrow{\mathcal{T}}$. Then $\mu_\theta$ is mixing for $\lambda$-almost every $\theta$.

This proposition appears as Theorem 15 in Berti and Rigo (2007). The result is not trivial: it is not true for every $\sigma$-algebra $\mathcal{D}$ that the conditional distributions of $\mu$ over $\mathcal{D}$ are almost surely $\mathcal{D}$-trivial. This property is very intuitive (and, indeed, easy to prove) when $\mathcal{D}$ is generated by a finite partition, or more generally when $\mathcal{D}$ is countably generated, but the tail is not countably generated. We note that the proof of Theorem 1 in Jackson et al. (1999), which asserts that $\mu_\theta$ is sufficient for prediction, contains a gap: the first sentence of that proof assumes the Berti–Rigo result above.

The following is a well-known example of a situation in which the tail decomposition differs from the ergodic decomposition.

Example 5. Let $A = \{B, G\}$ and let $\alpha \in [0, 1]$ be irrational. Let $\xi_0, \xi_1, \dots$ be the $[0, 1]$-valued stationary process of rotation by $\alpha$, so that $\xi_0$ has the uniform distribution and $\xi_n = (\xi_0 + n\alpha) \bmod 1$. Let

$$\zeta_n = \begin{cases} G, & \text{if } \xi_n > 1/2, \\ B, & \text{if } \xi_n < 1/2. \end{cases}$$

The process $\zeta_0, \zeta_1, \dots$ is ergodic, so its ergodic decomposition is the trivial decomposition. On the other hand, the tail decomposition of this process is the Dirac decomposition, consisting of atomic measures over the possible realizations of the process.⁷ $\square$
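Both halves of Example 5 are easy to see numerically. The sketch below (Python with numpy; not from the paper; $\alpha = \sqrt{2}/2$ and the grid of candidate phases are arbitrary choices) checks that the long-run frequency of G is 1/2 regardless of the starting phase (ergodicity), and that the set of phases consistent with a growing number of observations shrinks toward $\xi_0$, which is why the realization determines the "parameter" and the tail decomposition is the Dirac decomposition (see footnote 7).

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = 0.5 ** 0.5        # an irrational rotation angle (arbitrary choice)
xi0 = rng.random()        # the unobserved phase

def outcomes(x0, n):
    """First n outcomes of the rotation process started at phase x0 (1 = G)."""
    return ((x0 + alpha * np.arange(n)) % 1.0 > 0.5).astype(int)

# Ergodicity: the long-run frequency of G is 1/2 from every starting phase.
print(outcomes(xi0, 100_000).mean(), outcomes(rng.random(), 100_000).mean())

# Tail determinism: phases consistent with the observed data shrink toward xi0.
z = outcomes(xi0, 200)
candidates = np.linspace(0.0, 1.0, 1_000_000, endpoint=False)
for n, zn in enumerate(z):
    candidates = candidates[(((candidates + alpha * n) % 1.0 > 0.5).astype(int)) == zn]
    if n + 1 in (1, 10, 50, 200):
        print(n + 1, len(candidates))   # surviving candidates collapse around xi0
```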

Theorem 4 below uses Lemma 4.1 to show that the tail decomposition is also weakly learnable. Since the tail decomposition is finer than the ergodic decomposition, this theorem implies that the ergodic decomposition does not capture all the learnable properties of a stationary process.

Theorem 4. The tail decomposition of a stationary stochastic process is weakly learnable.

Proof. Consider the past tail of the process, given by $\overleftarrow{\mathcal{T}} = \bigcap_{m \ge 0} \mathcal{F}_{-\infty}^{-m}$. From Theorem 2 it follows that the decomposition induced by the past tail $\overleftarrow{\mathcal{T}}$ is weakly learnable, since $\overleftarrow{\mathcal{T}}$ is shift-invariant and is contained in $\mathcal{F}_{-\infty}^0$. The theorem now follows from the fact that for every stationary belief $\mu$ over a finite set of outcomes it holds that $\overrightarrow{\mathcal{T}}_\mu = \overleftarrow{\mathcal{T}}_\mu$, where $\overrightarrow{\mathcal{T}}_\mu$ and $\overleftarrow{\mathcal{T}}_\mu$ are the completions of the future and past tails under $\mu$. See Weiss (2000, Section 7). Therefore, the decomposition of $\mu$ induced by $\overrightarrow{\mathcal{T}}$ equals the decomposition induced by $\overleftarrow{\mathcal{T}}$, which is weakly learnable. $\square$

We note that the proof makes use of the equality of the past and future tails of a stationary process. This fact, established in Weiss (2000), is not trivial, as it relies on the notion of entropy and on the finiteness of the set of outcomes $A$.

We conclude with further comments on the relationship with Jackson et al. (1999). Their main result characterizes the class of distributions that admit a decomposition which is both learnable and sufficient for prediction. They dub these processes "asymptotically reverse mixing." In particular, they prove that for every such process, the decomposition induced by the future tail is learnable and sufficient for prediction. In Example 4, the tail decomposition equals the ergodic decomposition and, as we have shown, is not learnable. This shows that stationary processes need not be asymptotically reverse mixing. On the other hand, the class of asymptotically reverse mixing processes contains non-stationary processes. For example, the Dirac measure $\delta_\omega$ is asymptotically reverse mixing for every realization $\omega \in \Omega$.

⁶ An equivalent way to write this condition is that for every $n$ and $\epsilon > 0$, there is $m$ such that $|\nu(B \mid a_0, \dots, a_{n-1}) - \nu(B)| < \epsilon$ for every $B \in \mathcal{F}_m^\infty$ and every partial history $(a_0, \dots, a_{n-1}) \in A^n$. Jackson et al. (1999) call such a belief sufficient for prediction. They establish the equivalence with the mixing condition in the proof of their Theorem 1.
⁷ This follows from the fact that an observer who starts observing the process from some period $n$ onwards will be able to deduce the value of $\zeta_0$.


6. Extensions

In this section we discuss the extent to which the theorems and tools of this paper extend to a larger class of processes. This sheds further light on the assumptions made in our work.

6.1. Infinite set of outcomes

The definitions of merging and weak merging can be extended to the case in which the outcome set $A$ is a compact metric space⁸: if $\phi$ is the Prokhorov metric over $\Delta(A)$, then we say that $\tilde\mu \in \Delta(A^{\mathbb{N}})$ merges with $\mu \in \Delta(A^{\mathbb{N}})$ if

$$\phi\big(\mu(\cdot \mid a_0, \dots, a_{n-1}), \tilde\mu(\cdot \mid a_0, \dots, a_{n-1})\big) \xrightarrow[n \to \infty]{} 0$$

for $\mu$-almost every realization $\omega = (a_0, a_1, \dots) \in A^{\mathbb{N}}$, and that $\tilde\mu$ weakly merges with $\mu$ if the limit holds in the strong Cesàro sense. The proof of Theorem 1 extends to the case of an infinite set of outcomes. The following example shows, however, that Theorem 4 requires finiteness:

Example 6. Let $A = \{0, 1\}^{\mathbb{N}}$, equipped with the standard Borel structure. Thus an element $a \in A$ is given by $a = (a(0), a(1), \dots)$ where $a(k) \in \{0, 1\}$ for every $k \in \mathbb{N}$. Let $\mu$ be the belief over $A^{\mathbb{Z}}$ such that $\{\zeta_n(0)\}_{n \in \mathbb{Z}}$ are i.i.d. fair coin tosses and $\zeta_n(k) = \zeta_{n-k}(0)$ for every $k \ge 1$. Note that in this case $\overrightarrow{\mathcal{T}} = \mathcal{B}$, so the future tail contains the entire history of the process. The tail decomposition in this case is therefore the Dirac decomposition. However, this decomposition is not learnable: an agent who predicts according to $\mu$ will at every period $n$ be completely in the dark about $\zeta_{n+1}(0)$. $\square$

6.2. Relaxing stationarity

Stationary beliefs are useful for modeling situations where there is nothing remarkable about the point in time at which the agent starts to keep track of the process (so other agents who start observing the process at different times have the same beliefs) and where the agent is a passive observer who has no impact on the process itself. The first assumption is strong, and it can be somewhat relaxed. In particular, consider a belief that is the posterior of some stationary prior conditioned on the occurrence of some event. (A similar situation is an agent who observes a finite-state Markov process that starts at a given state rather than at the stationary distribution.) Let us say that a belief $\nu \in \Delta(A^{\mathbb{N}})$ is conditionally stationary if there exists some stationary belief $\mu$ such that $\nu = \mu(\cdot \mid B)$ for some Borel subset $B$ of $A^{\mathbb{N}}$ with $\mu(B) > 0$. While such processes are not stationary, they still admit an ergodic decomposition. In addition, they exhibit the same tail behavior as stationary processes. In particular, Theorem 1 extends to such processes.

References

Al-Najjar, N.I., Shmaya, E., 2015. Uncertainty and disagreement in equilibrium models. J. Polit. Economy 123, 778–808.
Berti, P., Rigo, P., 2007. 0–1 laws for regular conditional distributions. Ann. Probab. 35, 649–662.
Blackwell, D., Dubins, L., 1962. Merging of opinions with increasing information. Ann. Math. Stat. 33, 882–886.
Cover, T.M., 1975. Open problems in information theory. In: 1975 IEEE Joint Workshop on Information Theory, pp. 35–36.
D'Aristotile, A., Diaconis, P., Freedman, D., 1988. On merging of probabilities. Sankhya, Ser. A 50, 363–380.
Freedman, D.A., 1963. On the asymptotic behavior of Bayes' estimates in the discrete case. Ann. Math. Stat. 34, 1386–1403.
Gray, R., 2009. Probability, Random Processes, and Ergodic Properties. Springer Verlag.
Jackson, M.O., Kalai, E., Smorodinsky, R., 1999. Bayesian representation of stochastic processes under learning: de Finetti revisited. Econometrica 67, 875–893.
Kalai, E., Lehrer, E., 1993. Rational learning leads to Nash equilibrium. Econometrica 61, 1019–1045.
Kalai, E., Lehrer, E., 1994. Weak and strong merging of opinions. J. Math. Econ. 23, 73–86.
Kalai, E., Lehrer, E., Smorodinsky, R., 1999. Calibrated forecasting and merging. Games Econ. Behav. 29, 151–159.
Kallenberg, O., 2002. Foundations of Modern Probability, second edition. Springer-Verlag, New York.
Lehrer, E., Smorodinsky, R., 1996. Compatible measures and merging. Math. Oper. Res. 21, 697–706.
Lehrer, E., Smorodinsky, R., 2000. Relative entropy in sequential decision problems. J. Math. Econ. 33, 425–439.
Morvai, G., Weiss, B., 2005. Forward estimation for ergodic time series. Ann. Inst. Henri Poincaré Probab. Stat. 41, 859–870.
Ryabko, B.Y., 1988. Prediction of random sequences and universal coding. Probl. Pereda. Inf. 24, 3–14.
Weiss, B., 2000. Single Orbit Dynamics. AMS Bookstore.
Weizsäcker, H., 1996. Some Reflections on and Experiences with SPLIFs. Lecture Notes—Monograph Series, pp. 391–399.

⁸ Similar definitions may be introduced for the case that $A$ is a separable metric space, but in that case there are several possible non-equivalent definitions of merging; see D'Aristotile et al. (1988). We note that in our setup, where the set of outcomes is the same in every period, this definition of merging is the same as 'weak star merging' in D'Aristotile et al. (1988).