Statistical Methodology 5 (2008) 46–55 www.elsevier.com/locate/stamet
Estimating relative rank correlation: Theory and application Alan D. Hutson ∗ University at Buffalo, Department of Biostatistics, Farber Hall Room 249A, 3435 Main Street, Buffalo, NY 14214-3000, United States Received 4 October 2006; received in revised form 2 January 2007; accepted 6 April 2007
Abstract

In this note we introduce a novel correlation estimator termed the relative rank correlation (RRC), along with a new descriptive RRC plot. The goal is to estimate the RRC per subject for the purpose of examining “within subject” correlations. The interesting feature of this correlation estimator is that it can be calculated given only a single paired observation per subject. Specific applications of the RRC estimator include model diagnostics, examination of influential observations, graphical exploration and simple description. This approach extends to “within subject” simple linear regression estimation as well. © 2007 Elsevier B.V. All rights reserved.
Keywords: Bootstrap; Concomitant order statistics
∗ Tel.: +1 716 829 2594. doi:10.1016/j.stamet.2007.04.002

1. Introduction

In this note we introduce a novel correlation estimator termed the relative rank correlation (RRC), along with a new descriptive RRC plot. The goal is to estimate the RRC per subject for the purpose of examining “within subject” correlations. The interesting feature of this correlation estimator is that it can be calculated given only a single paired observation per subject. Specific applications of the RRC estimator include model diagnostics, examination of influential observations, graphical exploration and simple description. This approach extends to “within subject” simple linear regression estimation as well. The key to our approach is to take advantage of certain local asymptotic correlation structures of concomitant order statistics equated to the observed data points; see David et al. [2] for
an introductory discussion of concomitant order statistics. Our approach aims to incorporate estimates of the local correlation structure of concomitant order statistics in order to derive a summary correlation measure per subject. This new approach is straightforward, computationally feasible and follows from standard theory. Simulation studies and a biomedical example are provided.

2. Relative rank correlation

Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample of size $n$ from $f(x)$, $X_i \in \mathbb{R}^2$. Furthermore, denote the corresponding marginal ranks of $X_i$ as $R_i$, where the rankings occur within the respective margin $j$, $j = 1, 2$. If we denote the specific elements of the $2 \times 1$ vectors $X_i$ and $R_i$ as $X_{i,j}$ and $R_{i,j}$, respectively, then the $j$th marginal order statistics may be represented as $X_{R_{i,j}:n}$ for $i = 1, 2, \ldots, n$ and $j = 1, 2$. Given the above data structure we then proceed with the following:

Theorem 2.1. Denote $F_j$ as the marginal d.f. corresponding to the bivariate distribution function $F$, $j = 1, 2$. Re-express the corresponding rank of $X_{i,j}$ as $R_{i,j} = n\lambda_{i,j} + O_p(n^{1/2})$, $\lambda_{i,j} \in (0, 1)$, such that $Q_j(\lambda_{i,j}) = F_j^{-1}(\lambda_{i,j})$ denotes the $\lambda_{i,j}$th quantile for the $j$th margin. If $F_j$ is twice continuously differentiable in a neighborhood of $Q_j(\lambda_{i,j})$, then the asymptotic distribution of the vector

\[ \sqrt{n}\,\bigl( X_{R_{i,1}:n} - Q_1(\lambda_{i,1}),\; X_{R_{i,2}:n} - Q_2(\lambda_{i,2}) \bigr) \tag{2.1} \]

is bivariate normal with mean vector zero and variance–covariance matrix

\[ \Sigma_i = \tau_i \begin{pmatrix} \lambda_{i,1}(1 - \lambda_{i,1}) & \sigma_{i_{1,2}} \\ \sigma_{i_{2,1}} & \lambda_{i,2}(1 - \lambda_{i,2}) \end{pmatrix} \tau_i^{T}, \tag{2.2} \]

where

\[ \tau_i = \bigl( Q_1'(\lambda_{i,1}),\; Q_2'(\lambda_{i,2}) \bigr), \tag{2.3} \]

\[ \sigma_{i_{1,2}} = \sigma_{i_{2,1}} = F\bigl( Q_1(\lambda_{i,1}), Q_2(\lambda_{i,2}) \bigr) - \lambda_{i,1}\lambda_{i,2}, \tag{2.4} \]

and $Q_j'(\lambda_{i,j})$ denotes the derivative of $Q_j(\lambda_{i,j})$ with respect to $\lambda_{i,j}$.

Proof. Follows straightforwardly from Goel and Hall [4] in conjunction with a re-expression of the results found in Babu and Rao [1].

Taking advantage of the components developed via Theorem 2.1 leads to the following:

Definition 2.1. The relative rank correlation (RRC) for subject $i$ is defined as

\[ \mathrm{RRC} = \rho(X_{R_{i,1}:n}, X_{R_{i,2}:n}) = \frac{\sigma_{i_{1,2}}}{\sqrt{\lambda_{i,1}(1 - \lambda_{i,1})\,\lambda_{i,2}(1 - \lambda_{i,2})}}, \tag{2.5} \]

where $\sigma_{i_{1,2}}$ corresponds to the off-diagonal element of $\Sigma_i$ at (2.2).

3. RRC estimation

The key idea towards estimating $\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n})$ at (2.5) is to capture the local correlation structure around the pair of concomitant order statistics $X_{R_{i,1}:n}, X_{R_{i,2}:n}$. An alternative form of the RRC at (2.5), given by Goel and Hall [4], which lends itself well to estimation, is

\[ \rho(X_{R_{i,1}:n}, X_{R_{i,2}:n}) = \frac{\pi_1\pi_4 - \pi_2\pi_3}{\sqrt{\lambda_{i,1}(1 - \lambda_{i,1})}\,\sqrt{\lambda_{i,2}(1 - \lambda_{i,2})}}, \tag{3.1} \]

where

\[ \pi_1 = P(X_{R_{i,1}:n} < x_{R_{i,1}:n},\; X_{R_{i,2}:n} < x_{R_{i,2}:n}), \tag{3.2} \]

\[ \pi_2 = P(X_{R_{i,1}:n} < x_{R_{i,1}:n},\; X_{R_{i,2}:n} > x_{R_{i,2}:n}), \tag{3.3} \]

\[ \pi_3 = P(X_{R_{i,1}:n} > x_{R_{i,1}:n},\; X_{R_{i,2}:n} < x_{R_{i,2}:n}), \tag{3.4} \]

\[ \pi_4 = P(X_{R_{i,1}:n} > x_{R_{i,1}:n},\; X_{R_{i,2}:n} > x_{R_{i,2}:n}). \tag{3.5} \]
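As a brief check that (3.1) agrees with (2.5), the following sketch reads $\pi_1, \ldots, \pi_4$ as the four quadrant probabilities of a generic pair drawn from $F$ about the point $(Q_1(\lambda_{i,1}), Q_2(\lambda_{i,2}))$, with continuous margins so that the four probabilities sum to one. That reading is an assumption made here for illustration; under it, the numerator of (3.1) reduces to the covariance term at (2.4).

```latex
\begin{align*}
\pi_1 &= F\bigl(Q_1(\lambda_{i,1}), Q_2(\lambda_{i,2})\bigr), \qquad
\pi_1 + \pi_2 = \lambda_{i,1}, \qquad
\pi_1 + \pi_3 = \lambda_{i,2}, \\
\pi_1\pi_4 - \pi_2\pi_3
  &= \pi_1\,(1 - \pi_1 - \pi_2 - \pi_3) - \pi_2\pi_3 \\
  &= \pi_1 - (\pi_1 + \pi_2)(\pi_1 + \pi_3) \\
  &= F\bigl(Q_1(\lambda_{i,1}), Q_2(\lambda_{i,2})\bigr) - \lambda_{i,1}\lambda_{i,2}.
\end{align*}
```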
In order to estimate (3.1), and to provide a finite sample continuity type correction, we first define the knots corresponding to the empirical quantile function to be at the points $r_{i,j}/(n + 1)$, i.e. we replace $\lambda_{i,j}$ in (3.1) with $r_{i,j}/(n + 1)$. Secondly, we estimate $\pi_1$ through $\pi_4$ at (3.2)–(3.5) by

\[ \hat\pi_1 = \frac{\sum_{k=1}^{n} I(x_{k,1} \le x_{r_{i,1}:n},\; x_{k,2} \le x_{r_{i,2}:n})}{n + 1}, \tag{3.6} \]

\[ \hat\pi_2 = \frac{\sum_{k=1}^{n} I(x_{k,1} \le x_{r_{i,1}:n},\; x_{k,2} > x_{r_{i,2}:n})}{n + 1}, \tag{3.7} \]

\[ \hat\pi_3 = \frac{\sum_{k=1}^{n} I(x_{k,1} > x_{r_{i,1}:n},\; x_{k,2} \le x_{r_{i,2}:n})}{n + 1}, \tag{3.8} \]

\[ \hat\pi_4 = \frac{\sum_{k=1}^{n} I(x_{k,1} \ge x_{r_{i,1}:n},\; x_{k,2} \ge x_{r_{i,2}:n})}{n + 1}, \tag{3.9} \]

where $I(\cdot)$ denotes the indicator function and $(x_{i,1}, x_{i,2})$ are the observed marginal pairs of data. Note that the form of the inequalities within the indicator functions above accounts for the inherent discreteness within the data. The estimator of the RRC for the $i$th subject, $\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n})$, then takes the form

\[ \widehat{\mathrm{RRC}} = \hat\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n}) = \frac{\hat\pi_1\hat\pi_4 - \hat\pi_2\hat\pi_3}{\sqrt{\hat\lambda_{i,1}(1 - \hat\lambda_{i,1})}\,\sqrt{\hat\lambda_{i,2}(1 - \hat\lambda_{i,2})}}, \tag{3.10} \]

where the estimates of $\pi_1$ through $\pi_4$ are given at (3.6)–(3.9), $\hat\lambda_{i,1} = r_{i,1}/(n + 1)$ and $\hat\lambda_{i,2} = r_{i,2}/(n + 1)$, respectively. Estimates for the off-diagonal elements of $\Sigma$ given at (2.2) are then given by the product of $\hat\sigma_{r_{i_1,k_1}:n}$, $\hat\sigma_{r_{i_2,k_2}:n}$ and $\hat\rho(X_{R_{i_1,k_1}:n}, X_{R_{i_2,k_2}:n})$.

4. Large sample properties

The large sample asymptotic properties are straightforward to obtain and are summarized by the following:
Fig. 1. RRC plot for simulated bivariate normal data (ρ = 0).
Theorem 4.1. For large $n$, the distribution of the RRC estimator is given by

\[ \sqrt{n}\,\bigl( \hat\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n}) - \rho(X_{R_{i,1}:n}, X_{R_{i,2}:n}) \bigr) \sim AN\bigl( 0,\; \sigma^2_{\hat\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n})} \bigr), \tag{4.1} \]

where

\[ \sigma^2_{\hat\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n})} = \pi_2^2\pi_3(1 - 4\pi_3) + \pi_1\pi_4(\pi_1 + \pi_4 - 4\pi_1\pi_4) + \pi_2\pi_3(\pi_3 + 8\pi_1\pi_4). \tag{4.2} \]

Proof. Follows straightforwardly from a combination of Slutsky’s theorem and standard large sample theory pertaining to multinomial vectors; e.g. see Serfling [5].
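To make the multinomial step of the proof concrete, the following sketch applies the delta method to the numerator $g(\pi) = \pi_1\pi_4 - \pi_2\pi_3$ of (3.1). It assumes, for illustration only, that $\sqrt{n}(\hat\pi - \pi)$ is asymptotically normal with the multinomial covariance $\mathrm{diag}(\pi) - \pi\pi^{T}$, and it concerns the numerator alone, with the rank-based denominator of (3.10) held fixed. Expanding the quadratic form reproduces the three terms of (4.2).

```latex
\begin{align*}
\nabla g(\pi) &= (\pi_4,\, -\pi_3,\, -\pi_2,\, \pi_1)^{T}, \\
\nabla g^{T}\bigl[\mathrm{diag}(\pi) - \pi\pi^{T}\bigr]\nabla g
  &= \pi_1\pi_4^2 + \pi_2\pi_3^2 + \pi_3\pi_2^2 + \pi_4\pi_1^2
     - (2\pi_1\pi_4 - 2\pi_2\pi_3)^2 \\
  &= \pi_1\pi_4(\pi_1 + \pi_4) + \pi_2\pi_3(\pi_2 + \pi_3)
     - 4\pi_1^2\pi_4^2 + 8\pi_1\pi_2\pi_3\pi_4 - 4\pi_2^2\pi_3^2 \\
  &= \pi_2^2\pi_3(1 - 4\pi_3) + \pi_1\pi_4(\pi_1 + \pi_4 - 4\pi_1\pi_4)
     + \pi_2\pi_3(\pi_3 + 8\pi_1\pi_4).
\end{align*}
```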
Fig. 2. RRC plot for simulated bivariate normal data (ρ = 0.5).
Large sample inference and confidence interval construction follow from Theorem 4.1 in a standard fashion.

5. RRC plot

The RRC plot is a graphical tool that plots the raw data overlaid with the estimated RRCs for all n pairs. For the purpose of illustration we simulated random variates from a bivariate normal population with a sample size of n = 100 and values of ρ = 0, 0.5, 0.9, 0.99. The plots are given in Figs. 1–4 and clearly illustrate the estimated RRC. The RRC plot allows one to descriptively examine the leverage or influence of individual points relative to the data cluster and to examine correlation patterns “within” the data. In Fig. 1, with ρ = 0, we see that the corresponding values for the estimated RRCs are also close to 0. As we cycle through the values of ρ, one can visually see how the RRCs increase as a function of ρ.

Fig. 3. RRC plot for simulated bivariate normal data (ρ = 0.9).

As an example pertaining to real data, we generated the RRC plot in Fig. 5 from a study looking at the correlation between serum and urinary folate levels (nmol/d) for n = 24 subjects. The raw data and estimated RRC values are given in Table 1. The Spearman rank correlation for these data was r = 0.71. We can immediately see that subject 21 is a potential outlier with a low estimated RRC = 0.11379, and hence his/her values may need further examination/explanation. This would not be immediately obvious from looking at a standard scatterplot, where the data for subject 21 are “near” the center of the data.
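The estimator at (3.6)–(3.10) can be sketched in a few lines of Python. The function name `rrc_estimates` is hypothetical, and the rank convention used (the rank of a value is the number of observations less than or equal to it) is an assumption that coincides with the usual rank when there are no ties. Applied to the folate data as printed in Table 1, this sketch reproduces the value 0.11379 quoted for subject 21. Two serum values are printed as 17.98, so the output for subjects 11 and 22 need not match the tabulated entries.

```python
from math import sqrt

# Folate data as printed in Table 1 (subjects 1..24).
urinary = [28.62, 8.76, 14.61, 15.86, 6.22, 10.42, 264.9, 11.05, 15.30, 8.85,
           175.6, 120.0, 189.7, 188.5, 212.5, 9.73, 115.2, 11.44, 167.6, 165.3,
           159.3, 27.55, 152.7, 12.58]
serum = [16.00, 10.72, 12.38, 8.92, 4.32, 8.57, 23.52, 11.72, 9.13, 11.93,
         17.98, 16.79, 18.90, 13.19, 24.05, 21.70, 16.49, 11.24, 18.38, 22.23,
         24.31, 17.98, 24.40, 9.23]


def rrc_estimates(x1, x2):
    """Return the estimated RRC (3.10) for every observed pair (x1[i], x2[i])."""
    n = len(x1)
    pairs = list(zip(x1, x2))
    out = []
    for a, b in pairs:
        # Quadrant counts behind (3.6)-(3.9), each divided by n + 1.
        p1 = sum(u <= a and v <= b for u, v in pairs) / (n + 1)
        p2 = sum(u <= a and v > b for u, v in pairs) / (n + 1)
        p3 = sum(u > a and v <= b for u, v in pairs) / (n + 1)
        p4 = sum(u >= a and v >= b for u, v in pairs) / (n + 1)
        # Assumed rank convention: r = number of observations <= the value,
        # so lambda_hat = r / (n + 1) always lies strictly inside (0, 1).
        lam1 = sum(u <= a for u in x1) / (n + 1)
        lam2 = sum(v <= b for v in x2) / (n + 1)
        denom = sqrt(lam1 * (1 - lam1)) * sqrt(lam2 * (1 - lam2))
        out.append((p1 * p4 - p2 * p3) / denom)
    return out


rrc = rrc_estimates(urinary, serum)
for subject in (1, 5, 7, 21):
    print(subject, round(rrc[subject - 1], 5))
```

Plotting each pair together with its entry of `rrc` gives the kind of display described for the RRC plot; the plotting step itself is omitted here.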
Fig. 4. RRC plot for simulated bivariate normal data (ρ = 0.99).
6. Relative rank regression

Simple “within subject” regression, for subject $i$, of the form

\[ X_{R_{i,1}:n} = \beta_0(X_{R_{i,1}:n}, X_{R_{i,2}:n}) + \beta_1(X_{R_{i,1}:n}, X_{R_{i,2}:n})\,X_{R_{i,2}:n} + \epsilon, \tag{6.1} \]

may be carried forth using the RRC correlation estimator at (3.10) along with some exact bootstrap moment estimators given by Hutson and Ernst [3], where $E(\epsilon) = 0$ and $\epsilon$ is assumed to have finite variance.
Fig. 5. Urinary versus serum folate levels.
Towards this end define the exact bootstrap estimator of the variance of an order statistic as

\[ \hat\sigma^2_{r_{i,j}:n} = \sum_{k=1}^{n} w_{k(r_{i,j})}\,\bigl( X_{k,j:n} - \hat\mu_{r_{i,j}:n} \bigr)^2, \tag{6.2} \]

where $X_{k,j:n}$ is the $k$th order statistic corresponding to the $j$th margin, $j = 1, 2$, and

\[ \hat\mu_{r_{i,j}:n} = \sum_{k=1}^{n} w_{k(r_{i,j})}\,X_{k,j:n}, \tag{6.3} \]
Table 1
Estimated RRCs and regression parameters for each individual subject. The columns $\hat\rho$, $\hat\beta_0$ and $\hat\beta_1$ denote $\hat\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n})$, $\hat\beta_0(X_{R_{i,1}:n}, X_{R_{i,2}:n})$ and $\hat\beta_1(X_{R_{i,1}:n}, X_{R_{i,2}:n})$.

Subject   Urinary level   Serum level   $\hat\rho$   $\hat\beta_0$   $\hat\beta_1$
 1         28.62           16.00         0.76         −208            18.36
 2          8.76           10.72         0.52         2.51             0.55
 3         14.61           12.38         0.75         −72.4            6.87
 4         15.86            8.92         0.42         −63.6           11.25
 5          6.22            4.32         1.00         3.53             0.62
 6         10.42            8.57         0.59         5.75             0.64
 7        264.9            23.52         0.47         59.40            8.08
 8         11.05           11.72         0.42         −0.34            1.01
 9         15.30            9.13         0.31         −47.9            7.74
10          8.85           11.93         0.49         5.05             0.33
11        175.6            17.98         0.41         106.6            3.96
12        120.0            16.79         0.76         −236            20.74
13        189.7            18.90         0.59         64.67            6.67
14        188.5            13.19         0.39         139.6            3.19
15        212.5            24.05         0.34         14.02            8.60
16          9.73           21.70         0.25         6.57             0.16
17        115.2            16.49         0.76         −230            20.09
18         11.44           11.24         0.40         −9.21            1.98
19        167.6            18.38         0.42         80.31            4.58
20        165.3            22.23         0.36         47.45            5.03
21        159.3            24.31         0.11         7.44             5.72
22         27.55           17.98         0.55         −212            14.38
23        152.7            24.40         0.27         −803            38.38
24         12.58            9.23         0.30         −15.2            3.03
where

\[ w_{k(r_{i,j})} = r_{i,j}\binom{n}{r_{i,j}}\left[ B\!\left(\frac{k}{n};\, r_{i,j},\, n - r_{i,j} + 1\right) - B\!\left(\frac{k-1}{n};\, r_{i,j},\, n - r_{i,j} + 1\right) \right], \tag{6.4} \]

and $B(x; a, b) = \int_0^x t^{a-1}(1 - t)^{b-1}\,dt$. It follows straightforwardly that the least squares estimators for $\beta_0(X_{R_{i,1}:n}, X_{R_{i,2}:n})$ and $\beta_1(X_{R_{i,1}:n}, X_{R_{i,2}:n})$ are given by

\[ \hat\beta_0(X_{R_{i,1}:n}, X_{R_{i,2}:n}) = \hat\mu_{r_{i,1}:n} - \hat\mu_{r_{i,2}:n}\,\hat\beta_1(X_{R_{i,1}:n}, X_{R_{i,2}:n}), \tag{6.5} \]

\[ \hat\beta_1(X_{R_{i,1}:n}, X_{R_{i,2}:n}) = \hat\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n})\,\frac{\hat\sigma_{r_{i,1}:n}}{\hat\sigma_{r_{i,2}:n}}, \tag{6.6} \]

respectively, where the functional forms of the estimators $\hat\mu_{r_{i,j}:n}$ and $\hat\sigma_{r_{i,j}:n}$ are given above. The regression of $X_{R_{i,2}:n}$ on $X_{R_{i,1}:n}$ can be performed in a similar fashion by simply reversing the subscripts corresponding to $j$ in the formulation above. The large sample results pertaining to the distribution of $\hat\beta_1(X_{R_{i,1}:n}, X_{R_{i,2}:n})$ follow from Theorem 4.1 such that

\[ \sqrt{n}\,\bigl( \hat\beta_1(X_{R_{i,1}:n}, X_{R_{i,2}:n}) - \beta_1(X_{R_{i,1}:n}, X_{R_{i,2}:n}) \bigr) \sim AN\!\left( 0,\; \sigma^2_{\hat\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n})}\,\frac{\sigma^2_{r_{i,1}:n}}{\sigma^2_{r_{i,2}:n}} \right), \tag{6.7} \]

and, of less interest,

\[ \sqrt{n}\,\bigl( \hat\beta_0(X_{R_{i,1}:n}, X_{R_{i,2}:n}) - \beta_0(X_{R_{i,1}:n}, X_{R_{i,2}:n}) \bigr) \sim AN\!\left( 0,\; \sigma^2_{\hat\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n})}\,\mu^2_{r_{i,2}:n}\,\frac{\sigma^2_{r_{i,1}:n}}{\sigma^2_{r_{i,2}:n}} \right), \tag{6.8} \]

where, as before,

\[ \sigma^2_{\hat\rho(X_{R_{i,1}:n}, X_{R_{i,2}:n})} = \pi_2^2\pi_3(1 - 4\pi_3) + \pi_1\pi_4(\pi_1 + \pi_4 - 4\pi_1\pi_4) + \pi_2\pi_3(\pi_3 + 8\pi_1\pi_4). \tag{6.9} \]
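The estimators at (6.2)–(6.6) can be sketched as follows. The function names are hypothetical. To avoid any dependence on a numerical library, the sketch uses the standard identity that, for integer $r$, the normalized incomplete beta term in (6.4) equals a binomial tail probability, $r\binom{n}{r}B(x; r, n - r + 1) = P(\mathrm{Bin}(n, x) \ge r)$. The data are the folate values as printed in Table 1, with urinary level taken as margin 1 and serum level as margin 2 (an assumption based on the scale of the tabulated slopes).

```python
from math import comb, sqrt

# Folate data as printed in Table 1 (subjects 1..24).
urinary = [28.62, 8.76, 14.61, 15.86, 6.22, 10.42, 264.9, 11.05, 15.30, 8.85,
           175.6, 120.0, 189.7, 188.5, 212.5, 9.73, 115.2, 11.44, 167.6, 165.3,
           159.3, 27.55, 152.7, 12.58]
serum = [16.00, 10.72, 12.38, 8.92, 4.32, 8.57, 23.52, 11.72, 9.13, 11.93,
         17.98, 16.79, 18.90, 13.19, 24.05, 21.70, 16.49, 11.24, 18.38, 22.23,
         24.31, 17.98, 24.40, 9.23]


def boot_weights(r, n):
    """Weights w_k, k = 1..n, of (6.4) for the r-th order statistic."""
    def tail(x):
        # P(Binomial(n, x) >= r), equal to r*C(n, r)*B(x; r, n - r + 1).
        return sum(comb(n, m) * x**m * (1 - x)**(n - m) for m in range(r, n + 1))
    return [tail(k / n) - tail((k - 1) / n) for k in range(1, n + 1)]


def boot_mean_sd(x, r):
    """Exact bootstrap mean (6.3) and standard deviation (square root of (6.2))."""
    xs = sorted(x)
    w = boot_weights(r, len(xs))
    mu = sum(wk * xk for wk, xk in zip(w, xs))
    var = sum(wk * (xk - mu) ** 2 for wk, xk in zip(w, xs))
    return mu, sqrt(var)


def relative_rank_regression(x1, x2, r1, r2, rho_hat):
    """Intercept (6.5) and slope (6.6) for the regression of margin 1 on margin 2."""
    mu1, sd1 = boot_mean_sd(x1, r1)
    mu2, sd2 = boot_mean_sd(x2, r2)
    beta1 = rho_hat * sd1 / sd2
    beta0 = mu1 - mu2 * beta1
    return beta0, beta1


# Subject 5 has the smallest value in both margins, so r1 = r2 = 1;
# rho_hat = 1.00 is the Table 1 entry for that subject.
b0, b1 = relative_rank_regression(urinary, serum, r1=1, r2=1, rho_hat=1.0)
print(round(b0, 2), round(b1, 2))
```

The printed pair can be compared with the Table 1 entries $\hat\beta_0 = 3.53$ and $\hat\beta_1 = 0.62$ for subject 5. Since the table reports rounded data, small discrepancies are to be expected.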
Now let us return to our example, where we generated the RRC plot in Fig. 5 from a study looking at the correlation between serum and urinary folate levels (nmol/d) for n = 24 subjects. In addition to the RRC estimates in Table 1 we provide corresponding estimates for $\beta_0(X_{R_{i,1}:n}, X_{R_{i,2}:n})$ and $\beta_1(X_{R_{i,1}:n}, X_{R_{i,2}:n})$, such that individual regression lines may be overlaid if so desired. One may now consider several standard regression features of interest such as individual confidence bands, prediction intervals, inferences, etc.

Acknowledgements

We wish to thank the reviewer and associate editor for their helpful critique.

References

[1] C.J. Babu, C.R. Rao, Joint asymptotic distribution of marginal quantiles and quantile functions in samples from a multivariate population, Journal of Multivariate Analysis 27 (1988) 15–23.
[2] H.A. David, M.J. O’Connell, S.S. Yang, Distribution and expected value of the rank of a concomitant of an order statistic, Annals of Statistics 5 (1977) 216–223.
[3] A.D. Hutson, M.D. Ernst, The exact bootstrap mean and variance of an L-estimator, Journal of the Royal Statistical Society, Series B 62 (2000) 89–94.
[4] P.K. Goel, P. Hall, On the average difference between concomitants and order statistics, The Annals of Probability 22 (1994) 126–144.
[5] R.J. Serfling, Approximation Theorems of Mathematical Statistics, Wiley, New York, 1980.