A note on the e–a histogram

Statistics and Probability Letters 103 (2015) 105–109

T.H. Kirschenmann (a), P. Damien (b,*), S.G. Walker (c)

(a) Institute for Computational Engineering and Sciences, University of Texas at Austin, 201 East 24th St., CO200, Austin, TX 78712, USA
(b) Department of Information, Risk, and Operations Management, McCombs School of Business, University of Texas at Austin, 1 University Station, B6500, Austin, TX 78712, USA
(c) Department of Mathematics, College of Natural Sciences, University of Texas at Austin, 1 University Station C1200, Austin, TX 78712, USA

* Corresponding author. E-mail addresses: [email protected] (T.H. Kirschenmann), [email protected] (P. Damien), [email protected] (S.G. Walker).

http://dx.doi.org/10.1016/j.spl.2015.04.021

Article history: Received 7 February 2015; received in revised form 2 April 2015; accepted 2 April 2015; available online 27 April 2015.

Abstract: The e–a histogram dominates, under square error loss, a fixed bin width histogram when both are assigned the same number of bins. That is, the fixed bin width histogram is inadmissible and the dominating alternative is the e–a histogram.

Keywords: bin width; histogram; quantile; random mass

1. Introduction

Histograms are useful exploratory graphical tools for summarizing data. The aim of this paper is to study the e–a histogram of Denby and Mallows (2009); henceforth we label this reference D&M. We demonstrate that e–a histograms dominate fixed bin width histograms with respect to square error loss for the mass assigned to the bins.

It should be made clear at the outset that this paper is not about how to select the number of bins for a given sample size; our conclusions concern the type of bin once that number has been selected. There is a substantial literature on selecting the number of bins; see, for example, He and Meeden (1997), who provide a decision theoretic approach. Nor do we treat the histogram as a density estimator. Several authors have addressed these points; see, as examples, Sturges (1926), Doane (1976), Scott (1979), Freedman and Diaconis (1981), Kanazawa (1992), Wand (1997), Shimazaki and Shinomoto (2007), Scott and Scott (2008), Wang and Zhang (2012), and Lu et al. (2013). We regard the histogram as a graphical tool, so it is the probabilities assigned to the bins, and their accuracy, that are the critical factors. We therefore assess the accuracy of the probability assigned to each bin, using mean square error as our measure.

A histogram is a set of probabilities; that is, for sets $(A_t)_{t=1}^m$, mass $F(A_t)$ is assigned such that $\sum_{t=1}^m F(A_t) = 1$. Throughout this paper we assume that $m$ is fixed and known, and the exercise is about how to assign mass to the bins. We consider two types of histogram: the fixed bin width histogram treats the $(A_t)$ as fixed, with mass assigned randomly (determined by the observations); in the e–a histogram the $(A_t)$ are random and the mass assigned, $F(A_t)$, is fixed at $1/m$ for each bin.
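To make the two objects concrete, the following minimal sketch (our own illustration in Python/NumPy; none of the names come from the paper) constructs both sets of bin masses from the same sample: fixed bins with random mass, and random quantile-based bins each carrying fixed mass $1/m$.

```python
import numpy as np

def fixed_width_histogram(x, m):
    # Fixed bins A_t = ((t-1)/m, t/m]; the mass n_t/n in each bin is random.
    # (np.histogram uses half-open bins on the other side; immaterial for
    # continuous data.)
    edges = np.linspace(0.0, 1.0, m + 1)
    counts, _ = np.histogram(x, bins=edges)
    return edges, counts / len(x)

def ea_histogram(x, m):
    # e-a bins: the edges are sample quantiles, each bin has fixed mass 1/m.
    inner = np.quantile(x, np.arange(1, m) / m)
    edges = np.concatenate(([0.0], inner, [1.0]))
    return edges, np.full(m, 1.0 / m)

rng = np.random.default_rng(0)
x = rng.uniform(size=99)  # sample assumed to lie in (0, 1)
for name, (edges, mass) in [("fixed", fixed_width_histogram(x, 4)),
                            ("e-a  ", ea_histogram(x, 4))]:
    print(name, np.round(edges, 3), np.round(mass, 3))
```

The two displays carry the shape information differently: the fixed histogram varies in height over equal-width bins, while the e–a histogram varies in width over equal-mass bins.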





D&M motivate and demonstrate the use of e–a histograms and study some of their asymptotic properties. We obtain a rather remarkable result: the e–a histogram dominates the standard fixed bin width histogram, so the latter is inadmissible with respect to square error loss for the mass assigned to the bins. As D&M note, since a histogram is mainly a graphical tool, the shape, determined by the mass of the bins, should be accurate; in terms of bin mass, e–a histograms dominate fixed bin width ones. The e–a histogram is best suited to showing spikes in the observed data, where the bins corresponding to such spikes are tall and narrow. Large bins in the tails tend to conceal outliers, but the e–a histogram, which by construction identifies anomalies in the data, shows remote outliers quite clearly; see D&M for a graphical illustration of this property.

To assess the accuracy of these histograms, consider the expected square error loss over the sets; that is, consider $v = (v_t)$ where $v_t = E\big(F(A_t) - F_0(A_t)\big)^2$ and the expectation is either over the mass $F_0(A_t)$, with $F(A_t)$ fixed as a constant, or over the mass $F(A_t)$, with $F_0(A_t)$ a constant. Here $F_0$ denotes the true distribution function. If $F_1$ denotes the histogram based on fixed bin widths and $F_2$ the histogram based on quantiles, then we show that

$$E\big(F_1(A_{1t}) - F_0(A_{1t})\big)^2 > E\big(F_2(A_{2t}) - F_0(A_{2t})\big)^2 \qquad (1)$$

for all $t$, where $(A_{1t})$ and $(A_{2t})$ will be defined properly later in the paper. The striking conclusion is that the standard fixed bin width histogram is inadmissible with respect to square error loss for the mass assigned to the bins.

In Section 2 we start with results for the case in which there is a relation between the sample size and the number of bins, namely $n/m$ or $(n+1)/m$ an integer. Later in that section we extend the results to arbitrary $n$ and $m$. Section 3 contains a brief discussion.

2. Main results

2.1. Restricted sample and bin sizes

To introduce the idea we start with the simplest possible case, a histogram with two bins. For the e–a histogram with $n = 2r + 1$ observations, we use the $(r+1)$st order statistic $x_{(r+1)}$ from the sample $(x_1, \ldots, x_n)$ to provide the partition point for the bins; the remaining points provide the mass. Each bin then contains $r$ points: $A = (0, x_{(r+1)}]$ and $A = (x_{(r+1)}, 1]$. Even though the first bin includes $x_{(r+1)}$ as a boundary, we use this point only to estimate the median, and the others to assign the mass.

To keep the mathematics from becoming a distraction, and keeping the notation as simple as possible, we assume without loss of generality that the sample is independent and identically distributed uniformly on the interval $(0, 1)$. Consequently, $F_0(A) = |A|$ for all sets $A$. That this is without loss of generality follows straightforwardly: if the $(x_i)$ are independent and identically distributed from $F_0$, then it is well known that $F_0(x_{(1)}), \ldots, F_0(x_{(n)})$ is equal in distribution to $u_{(1)}, \ldots, u_{(n)}$, where $(u_1, \ldots, u_n)$ are independent and identically distributed from the uniform distribution. This equality in distribution is sufficient for all our results to hold whatever $F_0$ is.

Standard results for order statistics (David and Nagaraja, 2003) give $\hat q_{1/2} = x_{(r+1)} \sim \mathrm{beta}(r+1, r+1)$. Hence, the random histogram $F_2$, based on quantiles, assigns mass $\tfrac12$ to $(0, \hat q_{1/2}]$ and mass $\tfrac12$ to $(\hat q_{1/2}, 1)$. Of course the correct uniform distribution has $\tfrac12$ as the median.
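The reduction to the uniform case rests on the probability integral transform; here is a quick numerical check (our own sketch, with an exponential $F_0$ chosen arbitrarily) that the transformed order statistics match uniform order statistics in distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 7, 50000

# Order statistics of F0(x_i) for an exponential F0, where F0(x) = 1 - exp(-x).
x = rng.exponential(size=(reps, n))
transformed = np.sort(1.0 - np.exp(-x), axis=1)

# Order statistics of genuinely uniform draws.
u = np.sort(rng.uniform(size=(reps, n)), axis=1)

# Both u_(i) and F0(x_(i)) are Beta(i, n - i + 1), with mean i/(n + 1).
print(np.round(transformed.mean(axis=0), 3))
print(np.round(u.mean(axis=0), 3))
print(np.round(np.arange(1, n + 1) / (n + 1), 3))
```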

2

Here we measure the accuracy of the histogram by considering F2 (A) − F0 (A)



F0 (A) = |A| and F2 (A) = since F2 (A) =

1 2

1 . 2

Hence, for either A, v2 = E F2 (A) − F0 (A)



2

=E

and for A = (0, q1/2 ], we have F0 (A) =  q1/2 .

1 2

where A = (0, q1/2 ] and ( q1/2 , 1) and

− x(r +1)

2

which is Var(x(r +1) ). This follows

2

On the other hand, we define the square error from the fixed width histogram as v1 = E F1 (A)− F0 (A)

=E

1

On the other hand, the standard histogram with bins (0, 21 ] and ( 12 , 1) would provide square error loss given by E

1



where n1 ∼ Bin(n, 21 ), and now we have A = (0, 21 ] and ( 12 , 1) with F0 (A) =

1 2

and F1 (A) = n1 /n.

2

− n1 /n

2

− n1 /n

2

Lemma 2.1. With $v_1$ and $v_2$ defined as above, $v_1 > v_2$.

Proof. Using well known results for the beta distribution, we have

$$v_2 = \frac{(r+1)^2}{(2r+2)^2(2r+3)} = \frac{1}{4(2r+3)}.$$

With $n_1 \sim \mathrm{Bin}(n, \tfrac12)$, we then have

$$v_1 = \mathrm{Var}(n_1/n) = \frac{1}{4n} = \frac{1}{4(2r+1)} > v_2,$$

thus proving the lemma.
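A Monte Carlo check of the lemma is immediate; this sketch (our own, with r and the replication count chosen arbitrarily) estimates both risks for the two-bin case with odd n.

```python
import numpy as np

rng = np.random.default_rng(2)
r, reps = 5, 200000
n = 2 * r + 1  # odd sample size; two bins split at the sample median

x = rng.uniform(size=(reps, n))
median = np.sort(x, axis=1)[:, r]          # x_(r+1) ~ Beta(r+1, r+1)
v2_mc = np.mean((0.5 - median) ** 2)       # e-a risk: Var(x_(r+1))

n1 = (x <= 0.5).sum(axis=1)                # count in the fixed bin (0, 1/2]
v1_mc = np.mean((0.5 - n1 / n) ** 2)       # fixed-width risk: Var(n1/n)

print(v2_mc, 1 / (4 * (2 * r + 3)))        # should agree with the lemma
print(v1_mc, 1 / (4 * n))                  # should agree with 1/(4n)
```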

If, on the other hand, we have an even number of observations, then we take $\hat q_{1/2} = \tfrac12(x_{(n/2)} + x_{(n/2+1)}) = \tfrac12(u + v)$ and are interested in

$$v_2 = E\big(\tfrac12 - \hat q_{1/2}\big)^2 = \tfrac14 E\big(1 - 2u - 2v + u^2 + v^2 + 2uv\big),$$

where the joint density for $(u, v)$, see David and Nagaraja (2003), is given by

$$f(u, v) \propto u^{n/2-1}(1 - v)^{n/2-1}\,\mathbf{1}(u < v).$$

Then

$$E(u) = \frac{1}{2}\,\frac{n}{n+1}, \quad E(v) = \frac{1}{2}\,\frac{n+2}{n+1}, \quad E(u^2) = \frac{1}{4}\,\frac{n}{n+1}, \quad E(v^2) = \frac{1}{4}\,\frac{n+4}{n+1} \quad \text{and} \quad E(uv) = \frac{1}{4}\,\frac{n+2}{n+1}.$$

This now yields $v_2 = 1/[4(n+1)] < 1/(4n) = v_1$.

2.2. General bin sizes

Now we extend this idea to histograms with $m$ bins and, for simplicity of notation, we let $n = (r+1)m - 1$ for some integer $r \ge 1$. In the quantile-based histogram the $m$ bins are constructed from the $m - 1$ points $x_{(r+1)}, x_{(2(r+1))}, \ldots, x_{((m-1)(r+1))}$, which form the boundaries of the bins. The first and last bins are $(0, x_{(r+1)}]$ and $(x_{((m-1)(r+1))}, 1]$, respectively. Each bin has mass $1/m$. In this case we are interested in

$$v_{2t} = E\Big(\frac{1}{m} - \big(x_{((t+1)(r+1))} - x_{(t(r+1))}\big)\Big)^2$$

for $t = 0, 1, \ldots, m - 1$, with $x_{(0)} = 0$ and $x_{(m(r+1))} = 1$. This follows since $F_0(A_t) = F_0\big((x_{(t(r+1))}, x_{((t+1)(r+1))}]\big) = x_{((t+1)(r+1))} - x_{(t(r+1))}$.

We compare this with the accuracy of the fixed bin width histogram; hence we are interested in

$$v_{1t} = E\big(\tfrac{1}{m} - n_t/n\big)^2,$$

where marginally $n_t \sim \mathrm{Bin}(n, \tfrac{1}{m})$.

Theorem 2.2. For all $t$, with $v_{1t}$ and $v_{2t}$ defined as above, $v_{1t} > v_{2t}$.

Proof. Following David and Nagaraja (2003), the joint density for the order statistics $(u, v) = (x_{(t(r+1))}, x_{((t+1)(r+1))})$ is given by

$$f(u, v) \propto u^{t(r+1)-1}(1 - v)^{(r+1)(m-t-1)-1}(v - u)^r\,\mathbf{1}(u < v).$$

From this we need to evaluate $E(v - u)$ and $E(v - u)^2$. Given that

$$\int\!\!\int_{u < v} u^{a-1}(1 - v)^{b-1}(v - u)^c \, du \, dv = \frac{\Gamma(a)\,\Gamma(b)\,\Gamma(c+1)}{\Gamma(a + b + c + 1)},$$

we can state

$$E(v - u) = \frac{r+1}{m(r+1)} = \frac{1}{m} \quad \text{and} \quad E(v - u)^2 = \frac{r+2}{m(m(r+1)+1)}.$$

This yields

$$v_{2t} = v_2 = \frac{m-1}{m^2(m(r+1)+1)}.$$

For the standard histogram with bins of equal width $1/m$, we have $v_{1t} = E\big(\tfrac{1}{m} - n_t/n\big)^2$ with $n_t \sim \mathrm{Bin}(n, \tfrac{1}{m})$. Hence

$$v_{1t} = v_1 = \frac{m-1}{m^2 n} = \frac{m-1}{m^2(m(r+1)-1)} > v_2,$$

completing the proof.

In the above we have assumed that $(n+1)/m$ is an integer. If $n/m$ is an integer, then we would select the bin boundaries to be $\hat q_{t/m} = \tfrac12(x_{(tn/m)} + x_{(tn/m+1)})$ for $t = 1, \ldots, m - 1$. Hence the actual intervals would be, for $t = 1, \ldots, m$, $(\hat q_{(t-1)/m}, \hat q_{t/m}]$, where $\hat q_{0/m} = 0$ and $\hat q_{m/m} = 1$. Based on the result for even $n$ with two bins, we can again show that the mean square error for the e–a histogram is smaller than that for the fixed bin width histogram.
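The dominance is easy to see numerically from the two closed forms in the proof; a short tabulation (our own sketch) over a few values of m and r:

```python
def risks(m, r):
    # Closed forms from Theorem 2.2, valid for n = (r + 1) * m - 1.
    n = (r + 1) * m - 1
    v1 = (m - 1) / (m**2 * n)                   # fixed bin width histogram
    v2 = (m - 1) / (m**2 * (m * (r + 1) + 1))   # e-a (quantile) histogram
    return n, v1, v2

for m in (2, 5, 10):
    for r in (1, 4, 16):
        n, v1, v2 = risks(m, r)
        print(f"m={m:2d} r={r:2d} n={n:3d}  v1={v1:.3e}  v2={v2:.3e}  v1/v2={v1/v2:.3f}")
```

The ratio $v_1/v_2 = (m(r+1)+1)/(m(r+1)-1)$ exceeds 1 for every $m$ and $r$ but tends to 1 as $n$ grows, consistent with the asymptotic remark at the end of Section 2.3.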


2.3. General sample and bin sizes

In this section we assume there is no special relation between $n$ and $m$. We need to estimate the quantiles $\hat q_{t/m}$ for $t = 1, \ldots, m - 1$. There are a number of ways to do this; in particular we choose the linear interpolation method (see Hyndman and Fan, 1996). For illustration, and without loss of generality, we take $t = 1$: let $k$ be such that $(k-1)/n < 1/m < k/n$; then $\hat q_{1/m} = w\,x_{(k-1)} + (1 - w)\,x_{(k)}$, where $w = k - n/m$. Letting $u = x_{(k-1)}$ and $v = x_{(k)}$, the joint density for $(u, v)$ is

$$f(u, v) \propto u^{k-2}(1 - v)^{n-k}\,\mathbf{1}(u < v).$$

It now follows that

$$E(u) = \frac{k-1}{n+1}, \quad E(v) = \frac{k}{n+1}, \quad E(u^2) = \frac{k(k-1)}{(n+1)(n+2)}, \quad E(v^2) = \frac{k(k+1)}{(n+1)(n+2)} \quad \text{and} \quad E(uv) = \frac{(k-1)(k+1)}{(n+1)(n+2)}.$$

Therefore, we are interested in $v_2 = E\big(\tfrac{1}{m} - \hat q_{1/m}\big)^2$ and want to show this is no greater than $v_1 = (m-1)/(nm^2)$. Now $E(\hat q_{1/m}) = n/[m(n+1)]$ and so

$$v_2 = \frac{-n+1}{m^2(n+1)} + \frac{w^2 k(k-1) + (1-w)^2 k(k+1) + 2w(1-w)(k-1)(k+1)}{(n+1)(n+2)},$$

and hence

$$v_2 = \frac{-n+1}{m^2(n+1)} + \frac{k^2 + k(1-2w) - 2w(1-w)}{(n+1)(n+2)}.$$

Given that $k = w + n/m$, we have

$$v_2 \le \frac{-n+1}{m^2(n+1)} + \frac{(n/m)^2 + n/m}{(n+1)(n+2)},$$

and so we are left with showing that

$$\frac{n(m-1) + 2}{m^2(n+1)(n+2)} \le \frac{m-1}{m^2 n},$$

which is straightforward: cross-multiplying reduces it to $2n \le (m-1)(3n+2)$, which holds for all $m \ge 2$.

2

1 It is then straightforward using the same techniques to also show that v2t = E m − ( qt /m − q(t −1)/m ) < v1 for all t = 2, . . . , m. It should be noted in general that as sample size tends to infinity, and assuming the number of bins to increase slower than the sample size, both types of histograms discussed in this paper would perform roughly the same asymptotically using the square error metric. But again, depending on the number of spikes in the data, the e–a histogram with increasing sample size would show better details in the shape of the data while singling out abrupt spikes. A formal result along the lines of the above insight is given in D&M. Our result, namely the dominance of the e–a histogram over the fixed bin width histogram, is more demonstrable for smaller data sets.
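For completeness, the interpolation estimator used in this section can be coded directly; the following is a sketch (ours) under the paper's convention $w = k - n/m$, assuming $n/m$ is not an integer. Note that np.quantile's default follows a related but not identical rule from Hyndman and Fan (1996), so the two printed values need not coincide exactly.

```python
import numpy as np

def q_hat(x, p):
    # Linearly interpolated quantile: w * x_(k-1) + (1 - w) * x_(k),
    # with (k - 1)/n < p < k/n and w = k - n*p.
    # Assumes n*p is not an integer and n*p >= 1, so that k >= 2.
    xs = np.sort(x)
    n = len(xs)
    k = int(np.ceil(n * p))       # smallest k with k/n > p
    w = k - n * p                 # weight on the lower order statistic
    return w * xs[k - 2] + (1 - w) * xs[k - 1]   # 0-based indexing

rng = np.random.default_rng(3)
x = rng.uniform(size=37)          # n = 37, m = 5: no special relation
print(q_hat(x, 1 / 5))            # estimate of the 1/5 quantile
print(np.quantile(x, 1 / 5))      # a nearby textbook rule, for comparison
```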



3. Discussion

The e–a histogram proposed by D&M dominates the fixed bin width histogram; that is,

$$\max_{A_{1t}} E\big(F_1(A_{1t}) - F_0(A_{1t})\big)^2 > \max_{A_{2t}} E\big(F_2(A_{2t}) - F_0(A_{2t})\big)^2,$$

where $A_{1t} = (t/m, (t+1)/m]$ and $A_{2t} = (x_{(t(r+1))}, x_{((t+1)(r+1))}]$. This result, combined with the fact that quantiles are interpretable, implies that the random e–a histogram is a superior exploratory graphical tool compared to the fixed bin width histogram, whose bins are largely uninterpretable. At the heart of the matter is that the linearly interpolated $\hat q_{1/m}$, based on a sample of size $n$, has a smaller mean square error for estimating $1/m$ than does an $n^{-1}\,\mathrm{Bin}(n, \tfrac{1}{m})$ random variable, even though the latter is unbiased.

References

David, H.A., Nagaraja, H.N., 2003. Order Statistics. In: Wiley Series in Probability and Statistics. Wiley.
Denby, L., Mallows, C., 2009. Variations on the histogram. J. Comput. Graph. Statist. 18, 21–31.

Doane, D.P., 1976. Aesthetic frequency classification. Amer. Statist. 30, 181–183.
Freedman, D., Diaconis, P., 1981. On the histogram as a density estimator: L2 theory. Z. Wahrscheinlichkeitstheor. Verwandte Geb. 57, 453–476.
He, K., Meeden, G., 1997. Selecting the number of bins in a histogram: a decision theoretic approach. J. Statist. Plann. Inference 61, 59–69.
Hyndman, R.J., Fan, Y., 1996. Sample quantiles in statistical packages. Amer. Statist. 50, 361–365.
Kanazawa, Y., 1992. An optimal variable cell histogram based on the sample spacings. Ann. Statist. 20, 291–304.
Lu, L., Jiang, H., Wong, W.H., 2013. Multivariate density estimation by Bayesian sequential partitioning. J. Amer. Statist. Assoc. 108 (504), 1402–1410.
Scott, D., 1979. On optimal and data-based histograms. Biometrika 66, 605–610.
Scott, D.W., Scott, W.R., 2008. Smoothed histograms for frequency data on irregular intervals. Wiley Internat. Rev. Comput. Stat. 62, 256–261.
Shimazaki, H., Shinomoto, S., 2007. A method for selecting the bin size of a time histogram. Neural Comput. 19, 1503–1527.
Sturges, H.A., 1926. The choice of a class interval. J. Amer. Statist. Assoc. 21, 65–66.
Wand, M.P., 1997. Data-based choice of histogram bin width. Amer. Statist. 51, 59–64.
Wang, X.X., Zhang, J.F., 2012. Histogram-kernel error and its application for bin width selection in histograms. Acta Math. Appl. Sin. 28, 607–624.
