Agricultural Systems 64 (2000) 37±53
www.elsevier.com/locate/agsy
Statistical methods for evaluating a crop nitrogen simulation model, N_ABLE J. Yang a,*, D.J. Greenwood b, D.L. Rowell a, G.A. Wadsworth a, I.G. Burns b a
Department of Soil Science, The University of Reading, Whiteknights, PO Box 233, Reading RG6 6DW, UK b Department of Soil and Environment Sciences, Horticulture Research International, Wellesbourne, Warwick, CV35 9EF, UK
Abstract Modelling nitrogen (N) dynamics is a valuable tool to predict the uptake, mobility and leaching of mineral N in soil pro®les. Many crop/soil N simulation models have been developed in the last 20 years for this purpose. However, methods for the operational evaluation of simulation models are not well established. Standard test statistics such as the F- and t-tests are being questioned because they may give dierent conclusions. Dierence measures are being used as alternatives but they have not been thoroughly investigated. This paper reviews statistical methods that may be helpful in comparing simulations with measured variables. They have been used to analyse comparisons of simulations produced by a nitrogen response model, N_ABLE, with measurements made in a ®eld experiment with lettuce. The techniques include: (1) data transformation, (2) regression analysis and (3) analysis of dierence. They show that accuracy of prediction for dierent variables varied in dierent growth periods. Mean absolute errors (MAE) were within 0.5% and 35 kg haÿ1 of the measured values of percent-N and soil mineral N, respectively, over the whole growth period. They also demonstrate systematic errors when crop weights exceeded 2 t haÿ1, with 0±0.38 t haÿ1 under-estimation of dry weight and 3±16 kg haÿ1 under-estimation of N uptake when the harvest date was set at values between 42 and 61 days. Use of regression analysis showed that the data sets all violated normality and equal variance assumptions and that choosing the right transformation before analysis was crucially important. When dierence measures were used, it was found that there was strong correlation between the outcome of some, but not others. Overall, the results suggest that the tests can be grouped according to the degree of correlation between individual tests and that only one test needs to be made from each correlated group. We suggest that two sets of four statistics can be used, each statistic explaining a special property of the data. Each set leads to useful conclusions. The ®rst set is mean of error (E), root mean square error, modi®ed forecasting eciency and paired t-statistic, and the second set is E, MAE, forecasting coecient and F-ratio of lack of ®t over experimental error (F(Y=X) ). Either LF * Corresponding author. 0308-521X/00/$ - see front matter # 2000 Elsevier Science Ltd. All rights reserved. PII: S0308-521X(00)00010-X
38
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
set can give the same conclusions which could not be quantitatively detected by graphical inspection of the experimental data. # 2000 Elsevier Science Ltd. All rights reserved. Keywords: Statistical evaluation; Test statistics; Dierence measures; Nitrogen simulation model, N_ABLE; Nitrogen uptake; Soil mineral N
1. Introduction Operational evaluation is the assessment of accuracy and precision of a simulation model and methods vary from inspection and demonstration to analytical test (Knepell and Arangno, 1993). In general, what we most require to know is what proportion of the treatment variation, i.e. that excluding experimental error, can be accounted for by the model. We also need to know about biases the model has over the entire response surface, i.e. in this case yield against N fertiliser rate and time. At a more detailed level it may be helpful for us to know: (1) In what regions of the response surface are the discrepancies between simulated and measured variables greatest and over what regions are they smallest? (2) Does the model generally give the right shape of response surface? Do the simulated values simply dier from the measured values by a ®xed amount or are they a ®xed proportion of the measured values? (3) Does the model give transformed or un-transformed values that are linearly related to the measured values? The purpose of this paper is to discuss a range of statistical techniques for answering these questions. Two types of statistical methods have commonly been used in model evaluation: test statistics and dierence measures. Test statistics (both parametric and non-parametric) have been reviewed by Reckhow et al. (1990) and O'Leary and Connor (1996), and comprehensive discussions on dierence measures have been given by Willmott et al. (1985), Loague and Green (1991) and Kabat et al. (1995). It seems that there is no robust statistic which can be used for all models because of the complexity of the data structure of a model, e.g. continuous or discrete, error or error-free and one- or two-dimensional. In this paper, we try to select some available statistical methods both from test statistics and from dierence measures for the evaluation of the model, N_ABLE (Greenwood and Draycott, 1989a, b; Greenwood et al. 1996) using four measured data sets from a nitrogen experiment with lettuce. 2. Data transformation 2.1. Experiment and data structure A nitrogen experiment with lettuce was carried out on light sandy loam soil during April±July 1997 in the South of England (51 270 N, 0 560 W). The details of the experiment have been given by Yang et al. (1999) and a brief description of the data structure follows. In the experiment, plant and soil were sampled after transplanting in six N treatments each with three replicate plots to monitor plant growth and soil
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
39
mineral N on a total of seven dates on Days 12, 19, 29, 42, 50, 57 and 61 in each plot during the growth period. Four state variables were measured regularly during the experiment in each of six N treatments with lettuce (Yang et al., 1999). The state variables are: weight of dry matter excluding ®brous roots (WDM, t haÿ1), soil mineral N in the 0±30 cm layer (NS, kg haÿ1), N uptake excluding ®brous roots (NU, kg haÿ1) and percent N in dry matter (PN,%). The above data were simulated with the model N_ABLE. The model calculates the daily changes in the distributions of mineral-N and water down the soil pro®le together with the increments in plant dry weight and N-uptake. The major inputs to the model are level and timing of fertiliser application, the maximum potential yield of dry matter, and the daily rainfall and potential evaporation together with the initial soil characteristics. Detailed values of the input parameters were given by Yang et al. (1999). It was found that output is sensitive to the harvest time parameter (Th), i.e. the date on which the simulation is stopped. So the eect of changing Th was examined in the simulation by setting Th to the sample Days 42, 50, 57 or 61, producing four data sets for each of the four variables. The following notation is used in the data analysis later. A measured variable is represented by Y and a simulated variable by X, while measured data are denoted by yij (i=1, 2 . . . n represents the total measurements on each data set (see later example) and j=1, 2 . . . r represents replicates) and simulated data by xi, because the model cannot simulate random experimental errors for each replicate. Then the dierence between simulation and measurement can be de®ned as: DYÿX dij yij ÿ xi
1 i 1; 2 . . . n;
j 1; 2 . . . r
where yij is the measured value for the ith measurement in the jth replicate and xi is the simulated value for the ith measurement. The data set was prepared as follows for WDM, NU, PN and NS. There were measured values and simulated values. At each of the harvest dates there were six fertiliser treatments and three replicates. The measured data for the ®rst harvest date were added to that for the second and so on. As there were harvests at 12, 19, 29 and 42 days there are a total of 6 (treatments)3 (replicates)4 (measurements) measured values at Th=42 days. For each level of fertiliser at 12, 19, 29 and 42 days, the simulated values were obtained by running the model with the input value of the maximum yield Wmax=2.5 at Th=42 days. This produced a total of 6 (treatments)4 (measurements) simulated values (i.e. no simulated random errors for replicates). These measured values plus those made at 50 days (i.e. a total of 635 values) were compared with the model values (i.e. a total of 65 values) simulated with the input value corresponding to the maximum yield Wmax=3.0 at Th=50 days. Likewise the measured data sets at 57 days (i.e. a total of 636 values) were prepared for the simulated values (i.e. a total of 66 values) simulated with the maximum yield Wmax=3.5 at Th=57. Finally, the measured data sets at 61 days (i.e. a total of 637 values) were prepared for the simulated values (i.e. a total of 67 values) simulated with the ®nal harvested maximum
40
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
yield Wmax=4.3 at the ®nal harvest Th=61 days. Thus there is an overlap in the measured but not the simulated values for the dierent times. The magnitude of variables can be in¯uenced by N levels and by the duration of growth. Graphs of simulated NU and PN and measured data against time of growth are shown in Figs. 1 and 2. Graphs of measured WDM and NS were shown in a previous paper (Yang et al., 1999). 2.2. Test of normality and homoskedasticity Regression methods are frequently used in model testing and validation processes (Sutherland et al., 1986; Reckhow et al., 1990; O'Leary and Connor, 1996) and provided one of the methods used in this paper (see later for details). In the linear regression model: Yi Xi "i
2
the t-test can be used to test H0: a=0, b=1 since ta
a ÿ =Sa t (Nÿ2) and tb=(b ÿ )/Sbt (Nÿ2), where a and b are least squares measures of and (Aigner, 1971). However, for the classical least squares estimation in Eq. (2), the signi®cant test of the regression parameters assumes that the individual error terms ei are normally distributed, are of equal variance, are independent of each other, and
Fig. 1. Simulated and measured above ground N uptake (means of three plots and the standard error) for dierent fertiliser-N treatments. Symbols (*) are experimental values; lines are simulated values. The four lines for each treatment represent simulated values for Wmax=4.3, 3.5, 3.0 and 2.5 at Th=61, 57, 50 and 42 (from the bottom to the top), respectively. N0, N30/20, NA, 80/50/50, N180/60/70, NB, 180/60, NC, 70/120/50 refer to dierent N fertiliser treatments where the numbers in subscript refer to basal and top dressings (kg N haÿ1).
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
41
Fig. 2. Simulated and measured percent-N in dry matter (means of three plots and the standard error) with dierent fertiliser-N treatments. Symbols (*) are experimental values; lines are simulated values. The four lines for each treatment are simulated values for Wmax=4.3, 3.5, 3.0 and 2.5 at Th=61, 57, 50 and 42 (from the top to the bottom), respectively.
are not correlated with the independent variable, Xi. In our research, transformations are generally required because the range in response variables is large. For example, the largest values of WDM, NU and NS are all 10 or 100 times more than the smallest values. Diagnostic plots for WDM, NU, PN and NS produced evidence that these data sets seriously violated the assumption of equal variance, and slightly violated the assumption of normality (Yang, 1999). Violations of these two assumptions may cause biased estimates of the parameter variances in the regression analysis. Shapiro and Wilk's W-test is considered to be the best multi-functional test of normality (Shapiro and Wilk, 1965; Royston, 1982), whereas White's w2-test was thought to be a good statistic to check heteroskedasticity of ei since it was derived from a null hypothesis that not only are the errors homoskedastic, but they are also independent of the regressors (White, 1980). 2.3. Transformation for normality and homoskedasticity Common assumptions for statistical analyses have been discussed in detail for water quality models (Reckhow et al., 1990), in biological research (Bowker, 1993), and in agricultural research (Whitmore, 1991; O'Leary and Connor, 1996). Although the logarithmic transformation is the most frequently used method (Sokal and Rohif, 1987), it assumes that the original relationship is Y=aebX which is not the same as our assumption [Eq. (2)]. In order to get a better transformation to ®t the
42
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
standard linear model shown in Eq. (2), there is a need to stabilise error variances. Two frequently used transformation methods were chosen for use in this paper based on the methods discussed by Pindyck (1981) and Bowker (1993). They are: 1. Model Yi Xi "i was transformed by 1/Si, and the model becomes: Yi =Si
1=Si
Xi =Si "i =Si
3
p where Si is the sample standard derivation, i.e. Si
yij ÿ yi 2 =
r ÿ 1, and yi is the mean value of the ith measured data. Eq. (3) shows that Si will be measured with experimental errors, meaning that the greater the number of the replicates, the more stable the standard errors. 2. Model Yi Xi "i was transformed by 1=X=2 i , and the model becomes: 1ÿ=2
1=X=2 "i =X=2 Yi =X=2 i i
Xi i
4
In ®tting Eq. (4), d/2 was set to 1.25, 1, 0.75, 0.5 and 0.25 based on our ®nding from diagnostic plots that residual error ei is proportionally related to the independent variable X/2 i . Eqs. (3) and (4) were ®tted by the weighted regression method using SAS software (SAS Institute, 1989, 1990a, b). To select the best transformation, W and w2 statistics of random errors in Eqs. (3) and (4) were ®rst calculated for each data set at Th=42, 50, 57 and 61 days individually, and then mean values of W and w2 for dierent Th values were compared. The ideal transformation is one where both the minimum w2 and the maximum W occurred simultaneously, but this condition was only roughly met by the NS data set (Table 1). When this condition did not hold, we identi®ed the minimum value of w2 as our selected transformation. Following this rule, we concluded that the 1/Xi transformation gave the best correction for equal variance in WDM, NU and PN data gave the best correction for both norsets with slight loss of normality, and 1/X0.5 i mality and equal variance in NS data sets (Table 1). 2.4. Transformation of normality of data set D (YÿX) When the paired t-statistic [Eq. (11)] is used to test the dierence between measured and simulated data D, the data set D should be a normal variable with a null hypothesis H0: d y ÿ x 0: Our research shows that all D data sets derived from each variable WDM, NU, PN and NS violated normality. Again W statistics of data sets D were calculated to select the best transformation from them. Following the same rule as the method given earlier, we concluded that 1/Xi transformation gives the best correction of normality of D data sets generated from WDM and NU, 1/X0.5 i gives the best correction for PN, and 1/X0.25 gives the best correction for NS data i (data not shown).
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
43
Table 1 Comparison of dierent transformations of data for testing normality and homoskedasticity using mean values (Th=42, 50, 57 and 61) of W and White-w2 statistics for each state variablea State variables
Transfer variables
W (P>W) H0; normal of residual
w2 (P>w2) H0; homoskedastic of residual
WDM
No-transfer 1/Si 1/X1.25 i 1/Xi 1/X0.75 i 1/X0.5 i 1/X0.25 i
0.9118 (0.00) 0.8838 (0.00) 0.8835 (0.00) 0.8722 (0.00) 0.9152 0.9222 (0.01) 0.9350 (0.01)
21.25 (0.00) 10.06 (0.03) 9.30 (0.03) 4.52 (0.42) 17.60 (0.00) 20.19 (0.00) 18.95 (0.00)
NU
No-transfer 1/Si 1/X1.25 i 1/Xi 1/X0.75 i 1/X0.5 i 1/X0.25 i
0.9236 0.8968 (0.00) 0.8835 (0.00) 0.8991 (0.00) 0.9152 (0.00) 0.9276 (0.01) 0.9350 (0.01)
18.10 (0.00) 10.23 (0.03) 9.29 (0.03) 8.77 (0.03) 17.59 (0.00) 18.37 (0.00) 18.95 (0.00)
PN
No-transfer 1/Si 1/X1.25 i 1/Xi 1/X0.75 i 1/X0.5 i 1/X0.25 i
0.9711 (0.26) 0.9756 (0.40) 0.9635 (0.18) 0.9650 (0.19) 0.9665 (0.20) 0.9681 (0.22) 0.9696 (0.23)
11.49 (0.01) 6.36 (0.17) 8.37 (0.05) 5.99 (0.10) 9.49 (0.03) 9.59 (0.01) 12.40 (0.01)
NS
No-transfer 1/Si 1/X1.25 i 1/Xi 1/X0.75 i 1/X0.5 i 1/X0.25 i
0.9571 (0.13) 0.9491 (0.03) 0.9660 (0.21) 0.9636 (0.14) 0.9619 (0.14) 0.9604 (0.15) 0.9590 (0.14)
9.78 (0.01) 9.82 (0.02) 11.33 (0.01) 9.35 (0.01) 8.59 (0.06) 3.59 (0.23) 5.55 (0.19)
3. Regression analysis 3.1. Testing of the null hypothesis
H0 : 0; 1 Linear regression was therefore carried out using Eq. (2) with the weighted variable 1/Xi for WDM, NU and PN data sets, and with the weighted variable 1/Xi0.5 for NS data set, and the t-test used to test H0: =0, =1 in the model. In our data, N=nr, and a and b are weighted least squares measures of and . Calculated t-statistics are: ta
ÿ 0=sa
sa
tb
b ÿ 1=sb
sb
p p
MSE =nr x2 s2b
5
MSE =
xi ÿ x2
6
44
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
where MSE is the mean square of experimental error, calculated as shown later. Signi®cance of the values of coecients a and b were tested by comparing ta and tb with the signi®cance level of t0.05 (nrÿ2) or t0.01 (nrÿ2). 3.2. Lack of ®t test In regression analysis with replicates of dependent variables, the lack of ®t test was carried out by the F ratio of MSLF over MSE. Values of both mean squares were obtained as follows. The sums of squares of residuals (SSR) were partitioned into two parts, the sums of squares of the randomised error (SSE) and the sums of squares of the lack of ®t (SSLF). The degrees of freedom of residuals (DFR) were then partitioned into the degree of freedom of randomised error (DFE) and the degree of freedom of lack of ®t (DFLF), where: SSR SSE SSLF SSR
yij ÿ a ÿ bxi 2 SSE
yij ÿ yi 2 SSLF SSR ÿ SSE
DFR DFE DFLF DFR
nr ÿ 2 DFE n
r ÿ 1 DFLF
n ÿ 2
MSLF
SSLF =DFLF MSE
SSE =DFE
7
8
and where yi is the measured mean of the ith measurement. The variance ratio is: FLF MSLF =MSE
9
Statistical inferences from the above analysis of variance of residuals were drawn by comparing FLF with the signi®cance level of F0.05 (DFLF, DFE) or F0.01 (DFLF, DFE). 3.3. Goodness of ®t test R2 is a commonly used statistic for testing the goodness of ®t and is de®ned as: R2 SSU =SSY 1 ÿ
SSR =SSY
10
Following the earlier method, the results of regression with a lack of ®t test on simulated and measured data from the four state variables are listed in Table 2. Graphical displays of Y against X and regression lines for NU and PN are shown in Fig. 3 and 4. 4. Analysis of dierence Regression analysis in model evaluation actually tests its precision, i.e. the degree to which simulated values approach a linear function of the measured data (Willmott et
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
45
Table 2 Weighted linear regression, Y=a+bX, with a lack of ®t test on simulated data (xi) and measured data yija Variablesb
Th
WDM
42 50 57 61
NU
42 50 57 61
PN
42 50 57 61
NS
42 50 57 61
a b
b
FLF
R2
0.93** 1.06** 1.29** 1.36**
3.60** 2.75** 4.60** 7.39**
0.9498 0.9591 0.9515 0.9373
1.04 1.16** 1.33** 1.38**
4.42** 3.68** 6.34** 8.30**
0.9448 0.9567 0.9408 0.9236
1.19** 0.64** 0.10 0.32*
0.78** 0.88* 0.97 0.91*
6.92** 4.76** 5.12** 3.69**
0.3768 0.6258 0.7369 0.7641
4.11 6.77 6.25 9.50**
1.17** 1.15** 1.09 1.03
2.27* 2.67** 2.25** 4.09**
0.8607 0.8228 0.8131 0.7824
a 0.007* 0.007* 0.006 0.009* ÿ0.42* 0.175 ÿ0.12 0.074
*,** Refer to 0.05 and 0.01 signi®cance levels. WDM, NU and PN data were weighted by 1/Xi for the regression; NS data were weighted by 1=X0:5 i for the regression.
al., 1985). In model testing, however, our interest is in the model's accuracy, i.e. the extent to which simulated values approach the measured data. This can be achieved by examining the dierence directly from dij=yijÿxi. It is equal to the test of the goodness of ®t of model Y=X, but not Y=a+bX. In this circumstance, the regression method with a=0 or b=1 uses the following equation to calculate R2, e.g. R2=1ÿ(residual sum of squares/uncorrected total sum of squares) (SAS Institute, 1990b), to avoid situations where R2>1 or R2<0 (Aigner, 1971; Pindyck, 1981). 4.1. Test statistics Two dierent statistics were used for this propose: 4.1.1. Paired t-statistic In testing simulated data against experimental data, the paired t-statistic was used to test the null hypothesis d
xi ÿ yi 0 (Reckhow et al., 1990). In our data, it is calculated as follows: paired
t d=sd
11
where d is the mean value for the dierence variable shown in Eq. (1), and sd is the standard error of the mean. It is important that the dierence variable D (YÿX) should be normally distributed and be independent without considering equal variance in the paired t-test (Snedecor and Cochran, 1976). D data sets used in the paired t-test were transformed as discussed previously.
46
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
Fig. 3. Comparison of measured N uptake against simulated values for each of the sampling dates, for simulations with dierent harvest times; dashed lines are Y=X. Solid lines are Y=a+bX; Th is the harvest date (days after transplanting).
4.1.2. F
YX statistic LF This still holds for the model Y=X (Whitmore, 1991), but the sums of squares of residuals SSR (Y=X) and FLF (Y=X) values are calculated dierently from those in Eqs. (7) and (9):
yij ÿ xi 2 SS
YX R
YX ÿ SSE SSLF SS
YX R
DF
YX nr R DF
YX n LF
12
where SSE is the same as given in Eq. (7). The variance ratio is
SS
YX =DF
YX =
SSE =DFE F
YX LF LF LF
13
Calculated values of paired t and FLF (Y=X) statistics using Eqs. (11) and (13) are recorded in Table 3, and used to test the dierence D (YÿX).
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
47
Fig. 4. Comparison of measured percent-N in dry matter against simulated values for each of the sampling dates, for simulations with dierent harvest times; dashed lines are Y=X, solid lines are Y=a+bX; Th is the harvest date (days after transplanting).
4.2. Dierence measures Several simple dierence measures have been developed for this purpose by Loague and Green (1991) and Willmott et al. (1985). Kabat et al. (1995) employed ®ve of them in evaluating soil nitrogen simulation models. These statistics have the following features in common: (1) they measure dierence, dij yij ÿ xi , in several ways, including the sum of dierence, (yijÿxi), the sum of absolute dierence, jyijÿxij and the sum of squares of dierence, (yijÿxi)2; and (2) they de®ne a statistical function of dij as a measure of dierence to examine model performance by characterising systematic under- or over-prediction. Bearing in mind these de®nitions, here we discuss six statistics as our alternative indicators for further evaluation of the N_ABLE model: Mean error E
1=nr
yij ÿ xi
14
48
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
Table 3 Values of dierence statistics (test statistics and dierence measures) when assessing dierence D (YÿX) for each state variablea Th (days)
F(Y=X)b LF YX
Paired jtjc
WDM 42 50 57 61
3.91** 4.14** 15.86** 23.31**
1.84 4.89** 13.58** 14.55**
NU 42 50 57 61
4.23** 5.85** 16.84** 21.07**
PN 42 50 57 61 NS 42 50 57 61
Ed
MAE (50)
RMSE (50)
0.01 ÿ0.08 ÿ0.30 ÿ0.38
0.12 0.19 0.33 0.40
0.03 5.85** 11.08** 12.09**
ÿ3.12 ÿ7.00 ÿ13.43 ÿ15.64
9.28** 5.88** 4.83** 3.77**
2.65** 2.51** 0.37 1.13
ÿ0.15 ÿ0.11 0.02 0.05
3.66** 3.91** 2.78** 4.74**
4.66** 5.07** 3.93** 3.85**
ÿ23.7 ÿ23.3 ÿ16.2 ÿ13.1
C (50)
EF (41)
EF1 (41)
No. of measurements
0.19 0.30 0.48 0.56
0.1318 0.1429 0.1917 0.1976
0.9575 0.9366 0.8866 0.8640
0.8447 0.8238 0.7437 0.7088
72 90 108 126
6.34 9.41 14.13 16.86
9.91 14.31 20.84 23.83
0.1517 0.1661 0.2067 0.2171
0.9397 0.9137 0.8498 0.8197
0.8178 0.7759 0.6984 0.6597
72 90 108 126
0.48 0.41 0.34 0.33
0.56 0.50 0.44 0.44
0.1003 0.0887 0.0787 0.0774
0.2149 0.5749 0.7371 0.7596
0.0487 0.3233 0.5022 0.5539
72 90 108 126
47.88 47.39 43.99 40.53
0.2476 0.2542 0.2313 0.2287
0.7285 0.7293 0.7679 0.7877
0.5690 0.5659 0.6205 0.6211
63e 80 97 115
34.4 34.5 30.1 28.3
a
*,** Refer to 0.05 and 0.01 signi®cance levels. For F(Y=X) calculation, data set have been transformed as in Table 1. LF c For paired t-calculation, data sets D have been transformed based on selections made as follows; D data from WDM, and NU were transformed by 1/Xi; D data from PN were transformed by 1/X0.5 i ; and D data from NS were transformed by 1/X0.25 . i d No data transformation was made for calculation of E, MAE, RMSE, C, EF and EF1. e There were missing values in the measured NS data set. b
Mean absolute error MAE
1=nr j yij ÿ xi j
15
Root mean square error RMSE
p
yij ÿ xi 2 =nr
16
Forecasting eciency EF 1 ÿ
yij ÿ xi 2 =
yij ÿ y2
17
Modi®ed forecasting eciency EF1 1 ÿ
j yij ÿ xi j = j yij ÿ y j
18
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
49
Cooecient of error C
1=nr j yij ÿ xi j=y MAE=y
19
E is an indicator of whether the model predictions tend to over- or under-estimate measured data (Addiscott and Whitmore, 1987). MAE and RMSE have been used by Willmott et al. (1985) to evaluate their models. E, MAE and RMSE all take on units of dij yij ÿ xi and are mainly used as measures of accuracy to compare the output of the same variables, e.g. to compare WDM output of the model with measured data for dierent simulation dates, or to compare the output of the same variables among dierent models. EF is a relative measure of error used by many authors (Loague and Green, 1991) and preferred by Loague and Freeze (1985). It is de®ned as for R2 in regression analysis, but EF6R2. EF=1 if yij xi and EF<1 for any realistic simulation. EF<0 if the model predicted values are worse than simply using the measured mean of yij, whereas corresponding values of R2 are 04R241. This is because EF is actually a statistic to test the goodness of ®t of the model Y=X, and thus has the restrictions a=0 and b=1 in the model Y=a+bX. Under these conditions, the partition of SSY into SSU and SSR no longer holds in general: SSY 6SSU+SSR (Aigner, 1971). For this reason, we consider that R2 is not a good statistic in testing the goodness of ®t of the model Y=X. It is considered that quadratic residual functions, such as EF, are sensitive to outliers (Klepper and Rouse, 1991). To overcome this, we have de®ned another function EF1 by replacing the sum of squares of dierence with the sum of absolute dierences [Eq. (18)], where jEF1j4jEFj. C is another relative average measure of absolute dierence which is expressed as a proportion of the mean of the measured variable, Y (Klepper and Rouse, 1991). C, EF and EF1 are all dimensionless indices (C50, EF and EF141), and they have been used to depict the degree to which dij yij ÿ xi approaches the null set, i.e. dij=0. They have also been used to compare accuracy of model outputs for dierent variables. Detailed results of dierence measures calculated using Eqs. (11), (13) and (14)±(19) are also listed in Table 3. 5. General discussion It can be seen from Table 2 that most of the values of FLF and a and b are statistically signi®cant at the 0.01 level, but the values are small, e.g. FLF < 9. In general, good ®ts for WDM and NU are obtained for early harvests (Days 42 and 50), and for PN at Days 57 and 61. There is only a small dierence between the four lines for NS. The order of goodness of ®t can be ranked by R2 as WDM>NU>NS>PN (Table 2), indicating that the poorest linear relationship detected is for PN when Th is set to 42 days (Fig. 4a). Table 3 provides detailed information for assessing the dierence between measured and simulated values. Four paired t-values in Table 3 are not signi®cant at the 0.05 level, indicating that there are no statistically signi®cant dierences between
50
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
means of simulated and measured data. This conclusion is valuable because it cannot be deduced from a graphical display. It shows that all the dierences between measured and simulated values in these regions of the response surfaces can be attributed to experimental error. All signi®cant values of the paired t and F(Y=X) LF values indicate that further modi®cation of the model is needed. This focuses attention on regions of WDM and NU for Th=61 in the ®rst instance because of the and paired t-values. F(Y=X) values calculated in Table 3 higher values of both F(Y=X) LF LF were based on the model Y=X, indicating the extent to which the dierences (yijÿxi) dier from experimental error. By comparison, the values of FLF in Table 2 were based on the model Y=a+bX, indicating the extent to which the residual errors (yijÿaÿbxi) diered from experimental error. For this reason, the F(Y=X) LF values are the real measures of the accuracy of model simulation. Most of the E values are negative (Table 3), indicating that the model generally underestimated the measured data except for PN at Days 57 and 61 and WDM at Day 42. The magnitude of underestimation increased with time for the WDM and NU variables, but the reverse is true for PN and NS variables. MAE and RMSE showed that the largest dierence occurred for NU when Th=61. C values indicate that the largest relative error is <25% for all variables, meeting the tolerance limit of 26% given by Klepper and Rouse (1991). The order of goodness of match among the four variables is WDM>NU>NS>PN ranked by values of EF and EF1 (Table 3), giving a similar result to that drawn from R2 in Table 2. It has been found that the dierence statistics in Table 3 are strongly correlated with the outcome of some tests but not others. The tests can be grouped according to the degree of correlation (Table 4). Thus, RMSE, MAE and C were strongly correlated with one another, but not with any of the others. Likewise, paired t and were strongly correlated, but not with the other tests as were EF and EF1. A F(Y=X) LF single linkage dendrogram for the correlation matrix (rij>0) in Table 4, for eight dierence statistics is shown in Fig. 5. Table 4 shows that values of E have a strong negative correlation with C, MAE and RMSE. In this study, the negative correlations result from negative E values which are caused by X>Y in Eq. (1). In contrast, the other seven statistics are all Table 4 Correlation coecient matrix rij for eight dierence statistics listed in Table 3a
F(Y=X) LF Paired t E MAE RMSE C EF EF1 a
F(Y=X) LF
Paired t
E
MAE
RMSE
C
EF
EF1
1.00 0.89 0.07 ÿ0.22 ÿ0.22 0.23 0.08 0.08
1.00 ÿ0.15 0.02 0.02 0.51 0.26 0.28
1.00 ÿ0.96 ÿ0.96 ÿ0.83 ÿ0.02 ÿ0.05
1.00 0.99 0.82 0.01 0.03
1.00 0.82 0.01 0.04
1.00 0.29 0.34
1.00 0.99
1.00
Values with bold font represent correlation coecients among each group.
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
51
Fig. 5. Single linkage dendrogram for the correlation matrix rij (>0) in Table 4 for eight dierence statistics.
positive irrespective of whether the dierence D=(YÿX) is negative or positive. The correlations between E and C, MAE and RMSE are not universally applicable and for this reason the negative correlation has not been used in the cluster analysis in Fig. 5. We conclude from Fig. 5 that only one statistic from each group needs be used to test the validation of the model. Group 1 includes RMSE, MAE and C (grouped by rij 50.82) and provides measures of the magnitude of the dierence. Group 2 includes EF and EF1 (grouped by rij50.99), and gives measures the goodness of match or the and paired t-statistics (grouped co-ordinate of the dierence. Group 3 includes F(Y=X) LF by rij50.89) and gives a measure of the probability of signi®cant dierence. Group 4 includes only E and provides a measure of direction of dierence (i.e. under- or overestimation). The maximum correlation coecient between Groups 1 and 2 is very small (i.e. rij=0.51), and between Groups 3 and 4 is only 0.07 (Table 4). 6. Conclusions Data from the nitrogen experiment were shown to violate normality and equal variance assumptions and transformation is necessary to ensure standard statistical analyses are carried out correctly. Shapiro and Wilk's W-test and White's w2 test have been found to be useful methods for testing the normality and heteroskedasticity of "i . Two types of transformations were employed in this study to ensure both normality and equal variance, and 1/Xi was required for the WDM, NU and PN data, and 1=X0:5 was required for the NS data. In addition, transformation i to normalise the data set D (YÿX) was also necessary where the paired t-test was used. Regression analysis with a lack of ®t test was used to evaluate a simulation model, but this method proved to be very precise and it should be interpreted carefully. The FLF test is a precise measure for lack of ®t, and is sensitive to both experimental error and degrees of freedom. Whitmore (1991) suggested that a baltest. Furthermore, ance of DFLF and DFE should be considered in the F(Y=X) LF
52
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
Reckhow et al. (1990) suggested that a prediction b0 could be used for testing b0 but not for testing 1, where b0 can be decided from a real subject (i.e. b0=1, jj>0 is an acceptable error). In addition, if the t-test shows that parameters and do not deviate from H0 when regression analysis is carried out, R2 can be used to check the goodness of ®t. Thus, if both of these tests are carried out, we can then draw a ®nal conclusion, whereas a decision based on any one of them can be misleading. Dierence measures shown in Eqs. (14)±(19) have advantages over test statistics in that they are easy to interpret and do not need data transformation, but the methods are not fully developed. Many authors believe that there is no robust statistic which can be used to draw conclusions in model evaluations and therefore several methods need to be used together to give a double check. In this study, we found that eight dierence statistics (Table 3) were strongly correlated with each other allowing one to be chosen from each correlated group so as to save time without losing accuracy. We suggest that the same conclusion can be reached by using either RMSE, EF1, and E as the dierence measures because each paired t and E or MAE, EF, F(Y=X) LF of them is present in Groups 1, 2, 3 and 4 (Fig. 5). In addition, graphical display should be used as a visual aid for comparison between simulation and measurement and for interpretation of the statistics used in the testing and evaluation processes. Acknowledgements Financial support through an ORS award from UK is gratefully acknowledged. We also thank Andrew Mead from Horticulture Research International, UK, for his statistical comments on the ®rst draft of this paper. References Addiscott, T.M., Whitmore, A.P., 1987. Computer simulation of changes of soil mineral nitrogen and crop nitrogen during autumn, winter and spring. Journal of Agricultural Science, Cambridge 109, 141±157. Aigner, D.J., 1971. Basic Eeconometrics. Prentice-Hall, Englewood Clis, NJ. Bowker, D.W., 1993. Dynamic models of homogeneous systems. In: Fry, J.C. (Ed.), Biological Data Analysis: A Practical Approach. Oxford University Press, Oxford, pp. 313±343. Greenwood, D.J., Draycott, A., 1989a. Experimental validation of an N-response model for widely different crops. Fertiliser Research 18, 153±174. Greenwood, D.J., Draycott, A., 1989b. Quantitative relationships for growth and N content of dierent vegetable crops grown with and without ample fertiliser-N on the same soil. Fertiliser Research 18, 175±188. Greenwood, D.J., Rahn, C.R., Draycott, A., Vaidyanathan, L.V., Paterson, C., 1996. Modelling and measurement of the eects of fertilizer-N and crop residue incorporation on N-dynamics in vegetable cropping. Soil Use and Management 12, 13±24. Kabat, P., Marshall, B., van den Broek, B.J., 1995. Comparison of simulation results and evaluation of parameterization schemes. In: Katab, P., Marshall, B., Van den Broek, B.J., Vos, J., Van Keulen, H. (Eds.), Modelling and Parameterization of the Soil-plant-atmosphere System. A Comparison of Potato Growth Models. Wageningen Pers, Wageningen, pp. 439±501. Klepper, O., Rouse, D.I., 1991. A procedure to reduce parameter uncertainty for complex models by comparison with real system output illustrated on a potato growth model. Agricultural Systems 36, 375±395.
J. Yang et al. / Agricultural Systems 64 (2000) 37±53
53
Knepell, P.L., Arangno, D.C., 1993. Simulation Validation: A Con®dence Assessment Methodology. IEEE Computer Society Press, Los Alamitos, CA. Loague, K.M., Freeze, R.A., 1985. A comparison of rainfall-runo modelling techniques on small upland catchments. Water Resources Research 21, 329±348. Loague, K., Green, R.E., 1991. Statistical and graphical methods for evaluating solute transport models: overview and application. Journal of Contaminant Hydrology 7, 51±73. O'Leary, G.J., Connor, D.J., 1996. A simulation model of the wheat crop in response to water and nitrogen supply: II. Model validation. Agricultural Systems 52, 31±55. Pindyck, R.S., 1981. Econometric Models and Economic Forecasts. McGraw-Hill Inc, USA. Reckhow, K.H., Clements, J.T., Dodd, R.C., 1990. Statistical evaluation of mechanistic water-quality models. Journal of Environmental Engineering 116, 250±268. Royston, J.P., 1982. An extension of Shapiro and Wilk's W Test for Normality to large samples. Applied Statistics 31, 115±124. SAS Institute Inc, 1989. SAS/STAT User Guide, Version 6, 4th Edition, Volume 2. SAS Institute, Cary, NC. SAS Institute Inc, 1990aa. SAS Procedures Guide. Version 6, 3rd Edition. Cary, NC, USA. SAS Institute Inc, 1990bb. SAS/ETS Software, Applications Guide 2, Version 6, First Edition: Econometric Modelling, Simulation, and Forecasting. Cary, NC, USA. Shapiro, S.S., Wilk, M.B., 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 591±611. Snedecor, G.W., Cochran, W.G., 1976. Statistical Methods, 6th Edition. The Iowa State University Press, Ames, IA, USA. Sokal, R.R., Rohif, F.J., 1987. Introduction to Biostatistics, 2nd Edition. W.H. Freeman, New York. Sutherland, R.A., Wright, C.C., Verstraeten, L.M.J, Greenwood, D.J., 1986. The de®ciency of the `economic optimum' application for evaluating models which predict crop yield response to nitrogen fertilizer. Fertilizer Research 10, 251±262. White, H., 1980. A heteroskedasticity-consistent coveriance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817±829. Whitmore, A.P., 1991. A method for assessing the goodness of computer simulation of soil processes. Journal of Soil Science 42, 289±299. Willmott, C.J., Ackleson, S.G., Davis, R.E., Feddema, J.J., Klink, K.M., Legates, D.R., O'Donnell, J., Rowe, C.M., 1985. Statistics for the evaluation and comparison of models. Journal of Geophysical Research 90, 8995±9005. Yang, J., 1999. Testing and evaluation of a nitrogen simulation model, N_ABLE, using independent ®eld data. PhD thesis, The University of Reading, Reading, UK Yang J., Wadsworth, G.A., Rowell, D.L., Burns, I.G., 1999. Evaluating a crop nitrogen simulation model, N_ABLE, using a ®eld experiment with lettuce. Nutrient Cycling in Agroecosystems, 55, 221± 230.