Statistics & Probability Letters 15 (1992) 329-335
North-Holland
13 November 1992

Asymptotic comparison factors for smoothing parameter choices in regression problems

Jim Kay
Department of Mathematics and Statistics, University of Stirling, Scotland, UK

Received March 1989
Revised March 1992
Abstract: We present a theoretical comparison of a number of different choices of the smoothing parameter within a general regression framework. Asymptotic formulae are derived which provide a generalisation of some results reported in Hall and Titterington (1987). This leads to the development of asymptotic comparison factors which do not depend on the unknown model parameters and which therefore may be utilised to correct sub-optimal choices.

Keywords: regression; smoothing parameter; optimal smoothing; generalised singular value decomposition; asymptotic comparison factor; regularisation.
1. Introduction
We consider the question of the choice of smoothing parameter within the context of a general regression model which is used in a variety of problems including image restoration, the regularisation of linear integral equations, periodic spline-smoothing and linear ridge regression. See Kay (1988), Davies and Anderssen (1986a,b), Hall and Titterington (1987) and Golub et al. (1979), for instance. Within these problems it is important to choose the smoothing parameter judiciously because it controls the trade-off between fidelity to the observed data and the smoothness of the resulting solution. Too small a value leads to a highly oscillatory solution, even to the point that the solution is noisier than the data in the case of inverse problems such as image restoration. On the other hand, too large a value leads to an over-smoothed, biased solution. While it is sometimes the case that a range of values of the smoothing parameter provides an acceptable solution, it is important to be in the correct window of values. In addition, from the theoretical viewpoint, in the derivation of asymptotic results the smoothing parameter must tend to zero at the correct rate as the amount of information in the data grows to infinity. Following Hall and Titterington (1987), we consider a number of choices of the smoothing parameter and develop asymptotic formulae for the amount of smoothing imposed by the different methods. As a by-product, we obtain asymptotic comparison factors which enable a direct comparison of the different degrees of smoothing conferred by the different methods; these factors are independent of the model parameters and so may be used to correct sub-optimally determined degrees of smoothing. This work generalises the results of Hall and Titterington (1987) and demonstrates, in particular, that their results
relating to ridge regression are not canonical. In addition, we derive rates of convergence for the smoothing parameter based on both prediction and estimation criteria and demonstrate the difference between them. The article proceeds as follows. In Section 2, we re-iterate the definitions of the five choices of the smoothing parameter, following as closely as possible the notation of Hall and Titterington (1987) in order to facilitate comparison with their paper. We employ the generalised singular value decomposition in Section 3 in order to derive simple forms for the algebraic apparatus required in making the choices of smoothing parameter. In Section 4, asymptotic formulae are derived for the smoothing parameter choices defined in Section 2 and the asymptotic comparison factors are then easily deduced. We then present a comparison of our results with those of Hall and Titterington (1987) in the case of linear ridge regression.
2. Smoothing parameter choices
Consider the regression model

Z = Mγ + η,   (1)

where M is a known n × s matrix, assumed to be of full rank min(n, s), η is a vector of random errors such that E(η) = 0 and Cov(η) = σ²I, where I is the n × n identity matrix, and Z is a random vector representing the observed data. In problems such as image restoration and the regularisation of linear integral equations, it is usually the case that s ≥ n. In such situations, the usual least-squares estimator of γ, although unbiased, normally exhibits an unacceptable level of variability. As a result of this, the method of regularisation is adopted and we take the regularised estimate of γ to be the solution to the optimisation problem

inf_γ { (1/n)‖Z - Mγ‖² + λ γᵀFᵀFγ }.   (2)

Here we have taken, without loss of generality, the matrix of the quadratic penalty to be of the form FᵀF, where F is an r × s matrix of rank r (r ≤ s). The parameter λ (> 0) is the smoothing parameter. The resulting estimator of γ is

γ̂ = (MᵀM + nλFᵀF)⁻¹MᵀZ.   (3)
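As a concrete illustration, the following is a minimal numpy sketch of the estimator (3); the function name, the first-difference penalty F and the simulated data are our own hypothetical choices, not taken from the paper.

```python
import numpy as np

def regularised_estimate(M, F, Z, lam):
    """Solve (M'M + n*lam*F'F) gamma = M'Z, i.e. the estimator (3)."""
    n = M.shape[0]
    A = M.T @ M + n * lam * (F.T @ F)
    return np.linalg.solve(A, M.T @ Z)

# Hypothetical example: 'no-blur' design with a first-difference roughness penalty.
rng = np.random.default_rng(0)
n = s = 50
M = np.eye(n)
F = np.eye(s, k=1) - np.eye(s)
gamma_true = np.sin(np.linspace(0.0, 3.0, s))
Z = M @ gamma_true + 0.1 * rng.standard_normal(n)
gamma_hat = regularised_estimate(M, F, Z, lam=1e-3)
```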
In order to simplify the analysis, we now employ the generalised singular value decomposition (GSVD); see Golub and Van Loan (1983). By the GSVD, there exist orthonormal matrices U, V of orders n × n and r × r, respectively, and a non-singular s × s matrix P such that

M = U D₁ P   and   F = V D₂ P.   (4)

Here D₂ is of the form [C₁, 0], where C₁ = diag(c₁, c₂, …, c_r) (r ≤ s), and D₁ takes the forms [M₁ᵀ, 0ᵀ]ᵀ, [M₁, 0] and M₁ in the cases n > s, n < s and n = s, respectively, where M₁ = diag(m₁, m₂, …) is of the appropriate order. The model (1) then becomes Y = D₁β + ε, where Y = UᵀZ, β = Pγ and ε = Uᵀη, so that E(ε) = 0, Cov(ε) = σ²I, since U is orthonormal. Problem (2) becomes

inf_β { (1/n)‖Y - D₁β‖² + λ βᵀD₂ᵀD₂β }
with solution

β̂(λ) = (D₁ᵀD₁ + nλD₂ᵀD₂)⁻¹D₁ᵀY.

We define

RSS(λ) = (1/n)‖Z - Mγ̂‖² = (1/n)‖Y - D₁β̂‖² = (1/n)‖(I - N(λ))Y‖²,

where N = N(λ) = D₁(D₁ᵀD₁ + nλD₂ᵀD₂)⁻¹D₁ᵀ. We also define EDF(λ) = tr{I - N(λ)}. We now re-iterate the definitions of the choices of smoothing parameter to be considered; the notation follows Hall and Titterington (1987) as closely as possible.

(a) λ_RESID is defined as the solution of

E{RSS(λ)} = σ².   (5)

(b) λ_EDF is defined as the solution of

E{RSS(λ)} = (σ²/n) EDF(λ).   (6)

(c) λ_GCV is defined to be the value of λ that minimises

E{GCV(λ)} = E{n RSS(λ)}/{EDF(λ)}²,   (7)

where GCV(λ) = n RSS(λ)/{EDF(λ)}².

(d) λ_P-RISK is defined to be the minimiser of the mean-squared prediction error

PMSE(λ) = (1/n) E‖D₁β̂(λ) - D₁β‖².   (8)

(e) λ_E-RISK is defined to be the minimiser of the mean-squared estimation error

EMSE(λ) = (1/n) E‖β̂(λ) - β‖².   (9)
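To make the criteria concrete, here is a minimal sketch (our own construction; the grid and names are hypothetical) that evaluates GCV(λ) as in (7) from the raw matrices of (1)-(3) and minimises it over a grid. The hat matrix H in the original coordinates satisfies H = U N(λ) Uᵀ, so it has the same trace and residual norm as N(λ).

```python
import numpy as np

def gcv_score(M, F, Z, lam):
    """GCV(lam) = n * RSS(lam) / EDF(lam)^2, with EDF(lam) = tr(I - N(lam))."""
    n = M.shape[0]
    # Hat matrix of the regularised fit (3): Z_hat = H @ Z.
    H = M @ np.linalg.solve(M.T @ M + n * lam * (F.T @ F), M.T)
    resid = Z - H @ Z
    rss = resid @ resid / n
    edf = n - np.trace(H)
    return n * rss / edf**2

def choose_lambda_gcv(M, F, Z, grid):
    """Return the grid point minimising the GCV score (7)."""
    scores = [gcv_score(M, F, Z, lam) for lam in grid]
    return grid[int(np.argmin(scores))]

# e.g. lam_gcv = choose_lambda_gcv(M, F, Z, np.logspace(-6, 0, 25))
```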
We now note the following expressions for future reference:

E{RSS(λ)} = (σ²/n) tr(I - N)² + (1/n) βᵀD₁ᵀ(I - N)²D₁β,   (10)

PMSE(λ) = (σ²/n) tr(N²) + (1/n) βᵀD₁ᵀ(I - N)²D₁β,   (11)

EMSE(λ) = (σ²/n) tr[(D₁ᵀD₁ + nλD₂ᵀD₂)⁻¹D₁ᵀD₁(D₁ᵀD₁ + nλD₂ᵀD₂)⁻¹] + (1/n) ‖{(D₁ᵀD₁ + nλD₂ᵀD₂)⁻¹D₁ᵀD₁ - I}β‖².   (12)

We now proceed to develop simple algebraic expressions which enable the computational solution of (5)-(9).
3. A canonical characterisation

Using (10)-(12), we present canonical versions of (5)-(9). The expressions are similar in all of the cases n > s, n < s and n = s, and are identical with respect to λ. We present only the results for the case n < s, the other cases yielding identical results.
EDF(λ) = Σ_{i=1}^{l} nλc_i²/(m_i² + nλc_i²),   (13)

E{RSS(λ)} = (σ²/n) Σ_{i=1}^{l} (nλc_i²)²/(m_i² + nλc_i²)² + (1/n) Σ_{i=1}^{l} m_i²β_i²(nλc_i²)²/(m_i² + nλc_i²)²,   (14)

PMSE(λ) = (σ²/n) Σ_{i=1}^{l} m_i⁴/(m_i² + nλc_i²)² + (1/n) Σ_{i=1}^{l} m_i²β_i²(nλc_i²)²/(m_i² + nλc_i²)²,   (15)

EMSE(λ) = (σ²/n) Σ_{i=1}^{l} m_i²/(m_i² + nλc_i²)² + (1/n) Σ_{i=1}^{l} β_i²(nλc_i²)²/(m_i² + nλc_i²)²,   (16)

where terms not involving λ have been omitted. Note that l = min(r, n).
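In these coordinates each criterion is a sum of l scalar terms, so no matrix inversion is required; the following is a small numpy sketch (our own, with hypothetical names) of (13) and (14):

```python
import numpy as np

def edf_canonical(m2, c2, n, lam):
    """EDF(lam) via (13); m2 and c2 hold the squared values m_i^2 and c_i^2."""
    return np.sum(n * lam * c2 / (m2 + n * lam * c2))

def expected_rss_canonical(m2, c2, beta, sigma2, n, lam):
    """E{RSS(lam)} via (14), omitting the lambda-independent terms."""
    w = n * lam * c2 / (m2 + n * lam * c2)   # diagonal entries of I - N
    return (sigma2 / n) * np.sum(w**2) + np.sum(m2 * beta**2 * w**2) / n
```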
Now in the case of ridge regression, we have r = s and FᵀF = I, and so c_i² = 1 (i = 1, 2, …, s). It is then clear that expressions (13)-(16) reduce to those given in Section 2 of Hall and Titterington (1987) only in the special 'no-blur' case, M = I. In particular, the results (2.8) and (2.9) and the statement on p. 187 of their paper, that PMSE(λ) = EMSE(λ), do not hold in this more general setting. The canonical expressions (13)-(16) have direct computational advantages; this point is also made by Nychka et al. (1984) in connection with a different canonical version of a different inverse problem. We now develop an asymptotic comparison of the methods defined in Section 2. This general approach has the advantage that results for a variety of different problems appear as special cases.
4. Asymptotic comparisons
In order to develop our asymptotic analysis, we assume that s = O(n), that l = O(n) and that n is large. The conditions of our theorem are rather restrictive. However, they are particularly appropriate in the problem of image restoration, where n may be 2^16 with s at least as large. Circulant approximations and Fourier computation can be justified in this case; see Gonzales and Wintz (1987). Other applications for which these assumptions are relevant are given in Green et al. (1985) and Craig and Brown (1985). We now state the theorem; an outline of its proof is given in the appendix. As stated in Section 3, we develop the results for the case n < s; identical results may be derived in the other cases.
Theorem. For each method of choice of smoothing parameter we assume:

(i) m_i² ~ nd₁⁻¹i^{-μ} and c_i² ~ d₂i^{κ} as n → ∞, where d₁, d₂ are positive constants, μ, κ ≥ 0 and ν = μ + κ > 1;   (17)

(ii) λ → 0 as n → ∞, while nλ^{1/ν} → ∞.   (18)

For parts (a)-(d) we assume also that

b₁ = d₁d₂² Σ_{i=1}^{∞} i^{ν+κ} β_i² < ∞,

whereas for part (e) we assume that

b₂ = d₁²d₂² Σ_{i=1}^{∞} i^{2ν} β_i² < ∞.   (19)

Then it follows that:

(a) λ_RESID = [σ²(2k₁ - k₂)/(nb₁)]^{ν/(2ν+1)} (1 + o(1)),

(b) λ_EDF = [σ²(k₁ - k₂)/(nb₁)]^{ν/(2ν+1)} (1 + o(1)),

(c) λ_GCV = [σ²k₂/(2νnb₁)]^{ν/(2ν+1)} (1 + o(1)),

(d) λ_P-RISK = [σ²k₃/(nb₁)]^{ν/(2ν+1)} (1 + o(1)),

(e) λ_E-RISK = [σ²k₄/(nb₂)]^{ν/(μ+2ν+1)} (1 + o(1)),

where o(1) → 0 as n → ∞ and

k₁ = (d₁d₂)^{-1/ν} ∫₀^∞ dx/(1 + x^ν),

k₂ = (d₁d₂)^{-1/ν} ∫₀^∞ dx/(1 + x^ν)²,

k₃ = d₁d₂(d₁d₂)^{-(ν+1)/ν} ∫₀^∞ x^ν dx/(1 + x^ν)³,

k₄ = d₁²d₂(d₁d₂)^{-(μ+ν+1)/ν} ∫₀^∞ x^{μ+ν} dx/(1 + x^ν)³.

By expressing the constants k₁-k₃ in terms of the standard Beta function, it follows immediately that, under the conditions of the theorem:
Corollary.

(f) λ_RESID/λ_P-RISK ~ (2ν(ν + 1)/(ν - 1))^{ν/(2ν+1)},

(g) λ_EDF/λ_P-RISK ~ (2ν/(ν - 1))^{ν/(2ν+1)},

(h) λ_GCV/λ_P-RISK → 1,

(i) λ_RESID/λ_EDF ~ (ν + 1)^{ν/(2ν+1)}.
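As a quick numerical sanity check (our own; it uses the Beta-function expressions for k₁-k₃ given in the appendix), one can confirm that k₂ = 2νk₃, which is exactly result (h), and evaluate the factors (f), (g) and (i) for a given ν:

```python
from scipy.special import beta

nu = 2.5                                  # any nu > 1; hypothetical value
# The common factor (d1*d2)^(-1/nu) cancels in every ratio, so it is dropped.
k1 = beta(1/nu, 1 - 1/nu) / nu
k2 = beta(1/nu, 2 - 1/nu) / nu
k3 = beta(1 + 1/nu, 2 - 1/nu) / nu

e = nu / (2*nu + 1)                       # the common exponent
print(k2 / (2 * nu * k3))                 # -> 1.0, i.e. result (h)
print(((2*k1 - k2) / k3) ** e)            # factor (f) = (2nu(nu+1)/(nu-1))^e
print(((k1 - k2) / k3) ** e)              # factor (g) = (2nu/(nu-1))^e
print(((2*k1 - k2) / (k1 - k2)) ** e)     # factor (i) = (nu+1)^e
```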
We see that the asymptotic comparison factors presented in (f)-(i) depend only on ν; in particular, we note that they do not involve the unknown model parameters. It is also clear from results (f)-(i) that, from an asymptotic standpoint, results (2.8) and (2.9) of Hall and Titterington (1987) do not hold in general. Clearly, λ_RESID, λ_EDF and λ_GCV all tend to zero at the same rate as the minimiser of the mean-squared prediction error, and so these are prediction, not estimation, criteria. Our results also demonstrate that the minimisers of the mean-squared errors for prediction and estimation tend to zero at different rates and have different constants. Compared with λ_GCV, λ_RESID and λ_EDF tend, on average, to impose too much smoothing. Result (h) is a re-expression of the 'GCV theorem' of Golub et
al. (1979) which shows that λ_GCV is an asymptotically optimal prediction criterion in the sense of having the correct rate and constant. A useful by-product of results (f) and (g) is that they may be used to provide asymptotic correction factors, hence making available the results obtained by Hall and Titterington (1987), in the context of periodic spline-smoothing, for a wider class of problems. Given that, with respect to mean-squared prediction error, λ_RESID and λ_EDF are sub-optimal, and given a consistent estimator of σ², λ_P-RISK may be estimated from either λ_RESID or λ_EDF using results (f) and (g) respectively. The results (a)-(i) may be used in a variety of problems. For example, in linear ridge regression the matrix in the quadratic roughness penalty would usually be positive definite. Taking d₂ = 1 and κ = 0 gives asymptotic formulae for this case. Taking d₁ = 1 and μ = 0 provides asymptotic results for the case where periodic smoothing splines are used in nonparametric regression. Our results also apply to the problems of image restoration and the inversion of periodic, linear, discretised integral equations. In practice, where periodic approximations are appropriate and Fourier computation, via the fast Fourier transform, may be used, condition (i) of the theorem may be checked using plots of log m_i² and log c_i² (ordered) versus log i. If these are approximately linear, then d₁, d₂, μ and κ can be 'estimated' and hence the asymptotic comparison factors can be calculated. This technique was used successfully in Thompson, Kay and Titterington (1991).
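A sketch of this diagnostic and of the correction just described (our own construction; variable names are hypothetical, and the factors are those of results (f) and (g)):

```python
import numpy as np

def estimate_exponents(m2, c2):
    """Slopes of log m_i^2 and log c_i^2 (ordered) against log i give -mu and kappa."""
    i = np.arange(1, len(m2) + 1)
    mu = -np.polyfit(np.log(i), np.log(m2), 1)[0]
    kappa = np.polyfit(np.log(i), np.log(np.sort(c2)), 1)[0]
    return mu, kappa

def correct_sub_optimal(lam, nu, method="RESID"):
    """Rescale lam_RESID or lam_EDF to the prediction-optimal rate via (f)/(g)."""
    e = nu / (2 * nu + 1)
    factor = {"RESID": (2 * nu * (nu + 1) / (nu - 1)) ** e,
              "EDF": (2 * nu / (nu - 1)) ** e}[method]
    return lam / factor

# m2, c2 from the GSVD of (M, F); lam_resid from criterion (a), e.g.:
# mu, kappa = estimate_exponents(m2, c2)
# lam_p_risk = correct_sub_optimal(lam_resid, nu=mu + kappa, method="RESID")
```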
Appendix

We present an outline proof of the theorem and corollary of Section 4. Recall that l = O(n). Using (10), (5) and (14), we see that equation (5) is equivalent to

Σ_{i=1}^{l} λ²d₁d₂² i^{ν+κ} β_i²/(1 + λd₁d₂i^ν)² = (σ²/n) Σ_{i=1}^{l} [2/(1 + λd₁d₂i^ν) - 1/(1 + λd₁d₂i^ν)²].

As n → ∞, given (17) and (18), the left-hand side is asymptotic to λ²b₁ whereas the right-hand side is asymptotic to λ^{-1/ν}(σ²/n)(2k₁ - k₂). This gives (a). (Note that k₁ > k₂.)

From (6), (10), (13) and (14), equation (6) becomes

Σ_{i=1}^{l} λ²d₁d₂² i^{ν+κ} β_i²/(1 + λd₁d₂i^ν)² = (σ²/n) [ Σ_{i=1}^{l} 1/(1 + λd₁d₂i^ν) - Σ_{i=1}^{l} 1/(1 + λd₁d₂i^ν)² ].

Given (17) and (18), as n → ∞ this becomes

λ²b₁ = λ^{-1/ν}(σ²/n)(k₁ - k₂)(1 + o(1)),

which yields (b). We omit the proof of (c) as it is directly analogous to a part of a proof in Wahba (1977, Section 3, Theorem 1).

From (8), (11) and (15), (∂/∂λ) E‖D₁β̂(λ) - D₁β‖² = 0 is equivalent to

λ Σ_{i=1}^{l} d₁d₂² i^{ν+κ} β_i²/(1 + λd₁d₂i^ν)³ = (σ²/n) d₁d₂ Σ_{i=1}^{l} i^ν/(1 + λd₁d₂i^ν)³.

Given (17) and (18), the left-hand side is asymptotic to λb₁, whereas the right-hand side is asymptotic to n⁻¹σ²λ^{-1-1/ν}k₃, which gives (d).
From (9), (12) and (16), (∂/∂λ) E‖β̂(λ) - β‖² = 0 is equivalent to

λ Σ_{i=1}^{l} d₁²d₂² i^{2ν} β_i²/(1 + λd₁d₂i^ν)³ = (σ²/n) d₁²d₂ Σ_{i=1}^{l} i^{μ+ν}/(1 + λd₁d₂i^ν)³.

Given (17) and (18), the left-hand side is asymptotic to λb₂, whereas the right-hand side is asymptotic to n⁻¹σ²λ^{-1-(μ+1)/ν}k₄, which gives (e). □

The corollary follows from direct use of the following expressions for k₁-k₃ in terms of the standard Beta function:

k₁ = ν⁻¹(d₁d₂)^{-1/ν} B(1/ν, 1 - 1/ν),

k₂ = ν⁻¹(d₁d₂)^{-1/ν} B(1/ν, 2 - 1/ν),

k₃ = ν⁻¹(d₁d₂)^{-1/ν} B(1 + 1/ν, 2 - 1/ν).
References

Craven, P. and G. Wahba (1979), Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalised cross-validation, Numer. Math. 31, 377-403.
Davies, A.R. and R.S. Anderssen (1986a), Improved estimates of statistical regularisation parameters in Fourier differentiation and smoothing, Numer. Math. 48, 671-697.
Davies, A.R. and R.S. Anderssen (1986b), Optimisation in the regularisation of ill-posed problems, J. Austral. Math. Soc. Ser. B 28, 114-133.
Golub, G.H., M. Heath and G. Wahba (1979), Generalised cross-validation as a method for choosing a good ridge parameter, Technometrics 21, 215-223.
Gonzales, R.C. and P. Wintz (1987), Digital Image Processing (Addison-Wesley, Reading, MA).
Hall, P. and D.M. Titterington (1987), Common structure of techniques for choosing smoothing parameters in regression problems, J. Roy. Statist. Soc. Ser. B 49(2), 184-198.
Nychka, D., G. Wahba, S. Goldfarb and T. Pugh (1984), Cross-validated spline methods for the estimation of three-dimensional tumor size distributions from observations on two-dimensional cross sections, J. Amer. Statist. Assoc. 79, 832-846.
Thompson, A.M., J.W. Kay and D.M. Titterington (1991), Noise estimation in signal restoration using regularisation, Biometrika 78, 475-488.
Wahba, G. (1977), Practical approximate solutions to linear operator equations when the data are noisy, SIAM J. Numer. Anal. 14, 651-667.