Ecological Modelling 294 (2014) 19–26
Applying an artificial neural network to simulate and predict Chinese fir (Cunninghamia lanceolata) plantation carbon flux in subtropical China Xuding Wen a , Zhonghui Zhao a , Xiangwen Deng a, *, Wenhua Xiang a , Dalun Tian a , Wende Yan a , Xiaolu Zhou a,b , Changhui Peng a,b, ** a Faculty of Life Science and Technology, National Engineering Laboratory for Applied Technology of Forestry & Ecology in South China, Central South University of Forestry and Technology, Changsha, Hunan 410004, China b Center of CEF/ESCER, Department of Biological Sciences, University of Quebec at Montreal, Case postal 8888, Succ Centre-Ville, Montreal, QC H3C 3P8, Canada
ARTICLE INFO
ABSTRACT
Article history: Received 27 March 2014 Received in revised form 4 August 2014 Accepted 10 September 2014 Available online xxx
Carbon (C) flux between forest ecosystems and the atmosphere is an important component of ecosystem C cycling, and modeling C flux plays a critical role in assessing both C cycles and budgets. This study aimed to determine important non-redundant input variables for quantifying C flux and to develop a new application of a genetic neural network (GNN) model that accurately simulates C flux. Four input variables (atmospheric CO2 concentration, air temperature, photosynthetically active radiation (PAR), and relative humidity) were fixed, whereas three additional input variables (wind speed, soil temperature, and rainfall) were combined in every possible way to compile eight combinations of input variables (CIV 1–CIV 8). C flux and meteorological data were collected over a four-year period between January 2008 and December 2011 at the Huitong National Research Station of Forest Ecosystem. Results showed that CIV 8 (grouping atmospheric CO2 concentration, air temperature, PAR, relative humidity, wind speed, and soil temperature) performed best, yielding a coefficient of determination (R2) of 0.87, an outlier rate of 0.79%, and a root mean square error (RMSE) of 0.11. C flux data during summer generally provided the best performance, with R2 ranging from 0.74 to 0.82, volumetric fitting (Ivf) ranging from 1.00 to 1.02, and outliers ranging from 1.20% to 1.40%. Spring data performance ranked second and winter last. When combining seasonal data to reflect the entire year, R2 ranged from 0.77 to 0.83, Ivf ranged from 0.92 to 0.97, outliers ranged from 1.40% to 1.78%, and RMSE ranged from 0.10 to 0.11, indicating that the GNN model is capable of capturing C flux dynamics and of simulating and predicting C flux in a Cunninghamia lanceolata plantation in subtropical China. © 2014 Published by Elsevier B.V.
Keywords: C flux Cunninghamia lanceolata plantation Artificial neural network Nonlinear problem
1. Introduction Quantifying the carbon (C) flux and understanding its drivers are essential for forest (in particular plantation) management to mitigate carbon-related climate change (Melesse and Hanley, 2005). Cunninghamia lanceolata (Lamb.) Hook is the third most commonly used plantation tree species worldwide (Lugo et al., 2006). C. lanceolata had a total plantation area of 9.21 million hectares in China in 2005 (Lei, 2005). These plantations account for
* Corresponding author. Tel.: +86 731 85623458. ** Corresponding author. Tel.: +1 514 987 3000x1056; fax: +1 514 987 4718. E-mail addresses:
[email protected] (X. Deng),
[email protected] (C. Peng). http://dx.doi.org/10.1016/j.ecolmodel.2014.09.006 0304-3800/ © 2014 Published by Elsevier B.V.
approximately 30% of all China’s plantations and contribute approximately 65% of all national C sequestration, as reported by Piao et al. (2009). Owing to their extensive area and the ecosystem services they offer, C. lanceolata plantations are an important case for determining whether plantations act as C sinks or sources. Accordingly, an effective method must be developed to quantify and forecast C. lanceolata plantation C flux. Over the years, several methods have been used to calculate C flux. These include sample plot inventory, model fitting (Geng et al., 2012), and micrometeorological methods (Hagen et al., 2006). Eddy covariance has been the most widely used micrometeorological technique (Schimel et al., 2001) and is one of the most accurate approaches for measuring C flux between the atmosphere and terrestrial ecosystems when atmospheric conditions (wind, temperature, humidity,
X. Wen et al. / Ecological Modelling 294 (2014) 19–26
and CO2 concentration) are steady (Baldocchi, 2003). However, for a variety of reasons (precipitation, frost, etc.), between 25% and 35% of flux tower data are found to be erroneous or missing (Eva et al., 2001; Pattey et al., 2002). Many techniques have recently been applied to calculate C flux into and out of forest ecosystems, a number of them based on eddy covariance. However, most research on this subject has been carried out during the growing season (Sun et al., 2008; Zhou et al., 2008), on a short-term basis (Melesse and Hanley, 2005), or with multiple years of collected data (Bracho et al., 2012). Although the relationships between C flux and other micrometeorological parameters (moisture, temperature, etc.) are complicated and nonlinear, there is a large literature on gap filling and predictive models used in Europe and North America. An important issue therefore remains for subtropical C. lanceolata plantations: how best to select the non-redundant environmental factors that control C flux while forecasting it at differing spatial and temporal scales. As a developing area in cross-disciplinary science, the artificial neural network (ANN) method is efficient at dealing with nonlinear problems. An ANN can map nonlinear relationships accurately and does not place strict requirements on the sample data. The ANN method has been widely used to simulate the effects of global climate change as well as disturbances on ecosystem structure, function, and services (Liu et al., 2010). This is due to the substantial ability shown by this type of computational model when tackling the typical difficulties that arise when handling forest data, such as nonlinear relationships and non-normality (Diamantopoulou, 2005; Ingram et al., 2005; Melesse and Hanley, 2005; Scrinzi et al., 2007).
The ANN models have been successfully used to map soil structure (Levine et al., 1996) as well as to simulate soil erosion (Licznar and Nearing, 2003), organic C (Somaratne et al., 2005), and hydrological cycles (Liu et al., 2012). Back propagation neural networks (BPNNs) are known for their ability to manage a wide variety of prediction and classification problems. However, slow convergence during learning and a tendency to converge to local minima cannot be avoided when adopting the BPNN gradient method. In addition, BPNN convergence is affected by the selection of both learning factors and inertial factors, a selection usually made by experience (Wen et al., 2000). One of the biggest concerns with machine learning techniques such as ANNs is the potential for overfitting, a common problem in their application. Liu et al. (2012) showed that a BPNN model appeared to overfit when training data were limited. The best way to avoid overfitting is to use a large amount of training data. If one has at least 30 times as many training cases as there are weights in the network, one is unlikely to suffer from much overfitting (Geman et al., 1992). Genetic algorithms (GAs) were pioneered by John Holland, whose first book (Holland, 1975) was an early landmark in evolutionary algorithms and the best introduction to GAs (Goldberg and Holland, 1988). In this study, we used the GA to develop a new genetic neural network (GNN) in order to select training data and to interpret the output behavior of neural networks (Liu et al., 2012). GAs tend to converge on optimal or near-optimal solutions (Wong and Tan, 1994). However, an ANN has yet to be applied to simulate C flux for an entire year in C. lanceolata plantations. The objectives of this study were: (1) to select important non-redundant input variables (factors) for the GNN model to simulate C flux data obtained between 2008 and 2011 from a C.
lanceolata plantation in Huitong County, Hunan Province, China; and (2) to validate model simulations against eddy flux tower measurements as well as to identify any potential limitations that ANNs might have in modeling C cycles.
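The rule of thumb cited above (at least 30 training cases per network weight) lends itself to a quick check. The sketch below is illustrative only: it assumes a fully connected three-layer network with bias terms, the function names are hypothetical, and the hidden-layer size of 10 is an assumed value because the number of hidden nodes is not reported. The figure of 17,520 cases is simply 48 half-hourly records per day over 365 days.

```python
def count_weights(n_inputs: int, n_hidden: int, n_outputs: int = 1) -> int:
    """Count weights and biases in a fully connected three-layer network.

    Assumption: one bias per hidden node and per output node.
    """
    input_to_hidden = (n_inputs + 1) * n_hidden
    hidden_to_output = (n_hidden + 1) * n_outputs
    return input_to_hidden + hidden_to_output


def enough_training_cases(n_cases: int, n_weights: int, ratio: int = 30) -> bool:
    """Return True if there are at least `ratio` training cases per weight."""
    return n_cases >= ratio * n_weights


# Six inputs (as in CIV 8) and one output (C flux); 10 hidden nodes is hypothetical.
n_weights = count_weights(n_inputs=6, n_hidden=10)

# One year of half-hourly records: 48 per day * 365 days.
n_cases = 48 * 365

print(n_weights, n_cases, enough_training_cases(n_cases, n_weights))
```

Under these assumptions a single year of half-hourly data comfortably exceeds the 30-to-1 threshold; a much larger hidden layer or a heavily gap-filtered record would narrow that margin.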
2. Data sources and methods 2.1. Site descriptions Experimental plots were situated within the central C. lanceolata plantation production area in Guangping Township (longitude 109°45′E, latitude 26°50′N), Huitong County, Hunan Province, China (Fig. 1). The region has a typical humid mid-subtropical monsoon climate. Annual mean temperature is 15.0 °C to 17.8 °C, with monthly means ranging from 4.4 °C in the coldest month (January) to 29.5 °C in the warmest month (July). Elevation of the study site is approximately 280–390 m above mean sea level (Zhao et al., 2009). Mean annual precipitation is 1275.3 mm, ranging from 1100.0 to 1500.0 mm, and is uniformly distributed. Air humidity is greater than 80% (Deng et al., 2007). The soil is a clay loam red soil developed from shale and slate parent rocks. The flux tower (32.5 m high and extending approximately 16 m above the tree canopy) situated within the research station was constructed at the second C. lanceolata plantation watershed. Most measurement data (i.e., air temperature and air relative humidity) were collected at the top of the tower above the tree canopy. From the position of the tower, a stream lies beyond 200 m to the south, farmland beyond 400 m to the east, and sparse fields beyond 300 m to the north; the west is bordered by several kilometers of C. lanceolata plantations. A total of eight C. lanceolata plantation watersheds are situated within the research station. The C. lanceolata plantation located in the study area was planted in 1966. A clear cutting experiment was carried out at the second watershed at the end of 1992, followed by prescribed burning and the complete cultivation of the site. Because of poor growth, a clear cutting was carried out again in 1995. The site was once again afforested to a C. lanceolata plantation in the spring of 1996. Between 1996 and 1997, young trees were completely managed twice a year (May and August). 2.2.
Data sources Half-hourly meteorological data from the flux tower (including air temperature, relative humidity, photosynthetically active radiation, atmospheric CO2 concentration, wind speed, and soil temperature) were used to derive CO2 flux modeling parameters. Four years of half-hourly CO2 flux measurements, from 2008 to 2011, were collected and analyzed. Data between January 1 and February 27, 2008 were missing due to ice damage on the tower; data between July 21 and August 22, 2010 were missing due to a failure in data storage; and data between January 15 and March 15, 2011 were missing due to an interruption in power supply. Because of the ice damage in 2008, two years (2008 and 2009) were used for training. Data that could not meet quality standards were excluded. In all, approximately 35% of data from each year were excluded. It should be noted that all the missing data were interpolated by means of interdiurnal variation and nonlinear regression methods; the details are reported by Zhao et al. (2011). We also needed to deal with the problem of nighttime flux estimation under calm conditions (low wind speed), which causes underestimation of nighttime C flux. Nighttime flux data were therefore removed when the friction velocity was below the critical value (0.1 m s−1) (Reichstein et al., 2005). The model was first run using different combinations of input parameters (groupings of 4, 5, 6, and 7 parameters), utilizing partial data (from July to September in 2008). The assessment criteria values are provided in Table 2. Data collected over the four specified years were separated into two parts: 2008 and 2009 data were used for training while 2010 and 2011 data were used for validation. This study used NeuroShell Easy Predictor software (Yuksel et al., 2000; Ryan et al., 2004), and the upper limit on the amount of training data
Fig. 1. Location of the study site (Huitong National Research Station of Forest Ecosystem) where the eight Chinese fir plantation watersheds are located.
is 16,000. Because the amount of data obtained throughout an entire year exceeded this limit, the model was run in half-year intervals, and the input and output data of the two half-years were later combined. 2.3. Artificial neural networks Neural networks use machine learning based on the concept of self-adjustment of internal control parameters. An ANN is a nonparametric model inspired by the human brain. Just as humans apply knowledge gained from past experiences to solve new problems or circumstances, a neural network takes previously solved examples to build a system of “neurons” to make new decisions, classifications, and forecasts. A number of types of ANNs have been widely used in many situations. The arrangement of neurodes (network architecture) and the ways to determine weights and functions for inputs and neurodes (training) are the main differences between the various types of ANNs developed (Caudill and Butler, 1992). However, a network must include an input layer, a hidden layer, and an output layer. In theory, a three-layer ANN model can approximate any nonlinear mapping (Liu et al., 2010). The genetic algorithm (GA) is an algorithm used for optimization and learning based on a number of features specific to biological evolution. It often takes the form of binary coding. Problems are coded as chromosomes, and each chromosome is gradually evolved by biological operations (Bauer, 1994). After the initialization step, each chromosome is evaluated by the fitness function. According to the value of the fitness function, the chromosomes associated with the fittest individuals will be reproduced more often than those associated with unfit
Fig. 2. Schematic diagram of GNN. Input variables include: PAR (photosynthetically active radiation), Ta (air temperature), CO2 (carbon dioxide concentration in air), Rh (relative humidity), Ws (wind speed), and Ts (soil temperature). The GNN was constructed in three layers: an input layer, a hidden layer, and an output layer. Input layers have six variables. The hidden layer generated panel points automatically.
individuals (Davis, 1994). In addition, mutation arbitrarily alters one or more components of a selected chromosome. It provides the means for introducing new information into the population, which is initialized by chromosomes. Finally, genetic algorithms tend to converge on optimal or near-optimal solutions (Wong and Tan, 1994). As they relate to ANNs, GAs have been widely used in three ways: (1) to set weights in a fixed architecture, which includes both supervised and unsupervised learning; the approach has also been used to set the learning rates that are in turn used by other types of learning algorithms; (2) to learn neural network topologies when evolving them for function approximation, which includes the problem of specifying how many hidden units a neural network should have and how the nodes are connected; and (3) to select training data and to interpret ANN output behavior.

2.4. Model training and validation For this study, measured parameters including atmospheric carbon dioxide concentration (CO2), air temperature (Ta), relative humidity (Rh), photosynthetically active radiation (PAR), wind speed (Ws), soil temperature (Ts), and rainfall were closely correlated to C flux dynamics (Fig. 2). Important non-redundant factors affecting C flux must first be selected. Half-hourly C flux (Cf) derived from tower measurements was compared to model outputs (Cf). Throughout the experiment, half-hourly data between 2008 and 2009 were used for training and half-hourly data between 2010 and 2011 were used for validation. In order to test which factors would have a significant effect on C flux, the data between July and September in 2008 were used for training, applying incrementally larger groupings of three, four, five, six, and seven factors. The first four factors (CO2, Ta, PAR, and Rh) are the most important for photosynthesis and plant respiration, which are in turn the primary processes governing C flux dynamics. Every possible factor combination was grouped. The comparison carried out in this study used four commonly applied assessment criteria: R2, which is widely used in regression model testing (Nash and Sutcliffe, 1970), volumetric fitting (Ivf), the root mean square error (RMSE), and outliers, all of which are widely used to assess model performance. R2 is calculated as follows:

$$R^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$$

Ivf, an index widely used to assess model performance, is calculated as follows:

$$I_{vf} = \frac{\sum |\hat{y}|}{\sum |y|}$$

Outliers (unusual values) are another important assessment criterion for model evaluation (Blessing, 1997; Aggarwal and Yu, 2001; Hodge and Austin, 2004). The fewer the outliers, the better the simulation performance. The “unusual value” in a standard curve is identified by the standardized residual error (e′) method, increasing the precision of simulated values. Error e′ and outliers are calculated as follows:

$$e'_i = \frac{y_i - \hat{y}_i}{\sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - p - 1}}}$$

$$\text{outlier} = \frac{\text{total number of } e'_i < -3 \text{ and } e'_i > 3}{\text{total number of } y}$$

The RMSE can be calculated using:

$$\mathrm{RMSE} = \sqrt{\frac{\sum (\hat{y} - y)^2}{n}}$$

In the equations above, $y$ is the observed C flux (Cf) data collected from the flux tower; $\hat{y}$ is the simulated C flux (Cf) resulting from ANN outputs; $\bar{y}$ is the mean of $y$ values; $i$ denotes the ith element within all data inputted into the model; $n$ is the number of observations; and $p$ is the number of inputs. R2 ranges from 0 (worst case scenario) to 1 (perfect correlation). Ivf ranges from 0 (no correlation) through 1 (perfect correlation) to $+\infty$ (worst case scenario) (Hu et al., 2008). The “unusual value” (also referred to as the outlier) is either less than −3 or greater than 3, given a 99.7% range from −3 to 3. The quality, or performance, of the ANN model can be used to estimate how much of the response can be mapped (explained) with the input drivers provided. During ANN training, the sum of squared errors (SSEerr) is optimized: $\mathrm{RMSE} = \sqrt{\mathrm{SSE}_{err}}$. 3. Results 3.1. Selecting C flux driving factors Table 1 shows that CIV 1 through CIV 7 had R2 values ranging from 0.8325 to 0.8360, but CIV 8 (excluding rain) provided the best R2 result overall (0.8679). Parameters included in CIV 8 were CO2, Ta,
Table 1 Major factors related to C flux selected for GNN. This table shows the important non-redundant parameters affecting C flux. Y marks parameters included in a combination of input variables (CIV) while N marks unused parameters. Evaluation criteria for each input combination are provided. There are eight options in all, but CIV 8 yielded the best overall R2 performance (0.8679) and best overall outlier performance (0.79%). Included in CIV 8 were CO2, Ta, PAR, Ws, Rh, and Ts. RMSE: root mean square error.

       Inputs                                Evaluation criteria
CIV    Ta   PAR  Rh   CO2  Rain  Ts   Ws     R2      Ivf     Outlier (%)  RMSE
1      Y    Y    Y    Y    Y     N    N      0.8325  0.9936  1.06         0.1264
2      Y    Y    Y    Y    Y     Y    N      0.8332  0.9925  1.14         0.1262
3      Y    Y    Y    Y    Y     N    Y      0.8355  0.9913  1.09         0.1253
4      Y    Y    Y    Y    Y     Y    Y      0.8360  0.9906  1.23         0.1251
5      Y    Y    Y    Y    N     Y    N      0.8332  0.9916  1.09         0.1262
6      Y    Y    Y    Y    N     N    N      0.8325  0.9926  1.04         0.1264
7      Y    Y    Y    Y    N     N    Y      0.8354  0.9898  1.11         0.1253
8      Y    Y    Y    Y    N     Y    Y      0.8679  0.9908  0.79         0.1123
Fig. 3. Comparison of observations and model simulations (predictions) for selected days of the experimental years: (a) 2008–2010 and (b) 2008–2011. The date is shown on the horizontal axis and Cf (CO2 flux) is shown on the vertical axis. When Cf < 0, trees take up more C via photosynthesis than they release via respiration during daytime hours; when Cf > 0, either only tree respiration takes place or respiration exceeds photosynthesis.
Rh, PAR, Ws, and Ts. CIV 6 yielded the maximum Ivf value (0.9926) while CIV 7 yielded the minimum Ivf value (0.9898). CIV 8 yielded an Ivf value between the maximum and minimum (0.9908). However, no significant differences in Ivf were found among the eight combinations of input variables (CIV 1 through CIV 8). Similar to R2, outliers also showed regular patterns. CIV 1–CIV 7 yielded values ranging from 1.04% to 1.23%, and CIV 8 yielded the minimum value (0.79%). Results showed that CIV 8 yielded the best R2 among all simulations, while also yielding the minimum outlier rate. Ivf showed no significant differences among the groupings. CIV 8 had the minimum RMSE (0.1123), while the RMSE values of the other groupings ranged from 0.1251 to 0.1264. CIV 8 thus differed markedly from CIV 1 through CIV 7.
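The four assessment criteria reported in Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function names are hypothetical, and the absolute-value form of Ivf and the (n − p − 1) scaling of the standardized residual are assumptions read from the definitions in Section 2.4.

```python
import math


def r_squared(obs, sim):
    """R2 = 1 - sum((y - y_hat)^2) / sum((y - y_mean)^2)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((y - s) ** 2 for y, s in zip(obs, sim))
    ss_tot = sum((y - mean_obs) ** 2 for y in obs)
    return 1.0 - ss_res / ss_tot


def volumetric_fit(obs, sim):
    """Ivf = sum(|y_hat|) / sum(|y|); a value of 1 means matched totals."""
    return sum(abs(s) for s in sim) / sum(abs(y) for y in obs)


def rmse(obs, sim):
    """Root mean square error over n observations."""
    return math.sqrt(sum((s - y) ** 2 for y, s in zip(obs, sim)) / len(obs))


def outlier_fraction(obs, sim, n_inputs):
    """Share of standardized residuals below -3 or above 3.

    Residuals are scaled by sqrt(SSE / (n - p - 1)), with p = n_inputs.
    Multiply the result by 100 to express it as a percentage.
    """
    n = len(obs)
    residuals = [y - s for y, s in zip(obs, sim)]
    scale = math.sqrt(sum(r * r for r in residuals) / (n - n_inputs - 1))
    return sum(1 for r in residuals if abs(r / scale) > 3) / n


# Toy values for illustration only (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.1, 1.9, 3.2, 3.8]

print(r_squared(observed, simulated))
print(volumetric_fit(observed, simulated))
print(rmse(observed, simulated))
print(outlier_fraction(observed, simulated, n_inputs=1))
```

Note that `outlier_fraction` divides by the residual scale, so it is undefined for a perfect fit (all residuals zero) and needs more observations than inputs plus one.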
3.2. Training results and model validation Observed yearly C fluxes for 2008 and 2009 were 915.15 g CO2 m−2 year−1 and 946.19 g CO2 m−2 year−1, respectively. Model simulation results were 820.62 g CO2 m−2 year−1 for 2008 and 800.93 g CO2 m−2 year−1 for 2009. For the training period, R2, Ivf, and RMSE values showed no significant differences between 2008 and 2009, but 2009 exhibited lower outliers (1.48% compared to 1.78% for 2008). C flux diagrams were produced for every fifteen-day interval during the growing season (the date shown on the horizontal axis and Cf (C flux) shown on the vertical axis). When Cf < 0, trees take up more C via photosynthesis than respiration
Fig. 4. Linear regressions of observations and model simulations during the training period. Given that the flux tower experienced ice damage in 2008, two years were substituted (2008 and 2009) for data training. Observations are shown on the horizontal axis and simulations are shown on the vertical axis.
Table 2 Comparison results of statistical performance for two different years. Data from the four-year experimental period were separated into two parts: 2008 and 2009 data were used for training while 2010 and 2011 data were used for validation. 2010 yielded better R2 values than 2011, but the opposite was true for Ivf.

Period      Model  Year   R2      Ivf     Outlier  RMSE
Training    2008   2008   0.8286  0.9698  1.78%    0.0997
Training    2009   2009   0.8298  0.9494  1.48%    0.0993
Validation  2008   2010   0.8003  0.9208  1.66%    0.1040
Validation  2008   2011   0.7762  0.9561  1.40%    0.1065
Validation  2009   2010   0.7775  0.9204  1.68%    0.1098
Validation  2009   2011   0.7668  0.9565  1.62%    0.1087
Fig. 5. Linear regressions of observations and model simulations for each year of the validation period. 2010 and 2011 were simulated by 2008 and 2009, respectively. Observations are shown on the horizontal axis and simulations are shown on the vertical axis.
during daytime hours; when Cf > 0, either only tree respiration takes place or more respiration takes place than photosynthesis (Fig. 3a and b). A linear regression was additionally carried out between simulations and observations (Fig. 4). Simulations were compared to observed values, and the results are provided in Table 3. For the validation period, the 2008 simulation predicted 2010 results well (R2 = 0.8003), while the 2009 simulation gave the best Ivf value (0.9565) when predicting 2011 results. Outliers ranged from 1.40% to 1.68%. The 2008 simulation predicted the best 2011 outlier result overall. Table 2 shows that 2011 Ivf and outliers outperformed 2010 while R2 was inferior to 2010. The RMSE values showed no significant differences between 2010 and 2011. A linear regression was additionally carried out between observations and simulations during both training and validation periods. For the validation period, regression lines of simulations against observations are provided in Fig. 5, and the corresponding parameters are shown in Table 3. There was no large difference between 2008 and 2009 for the training period. For the validation period, 2011 had larger slopes A (0.827 and 0.822), but inferior intercepts B (0.012 and 0.011) and smaller R2 (0.780 and 0.771) than 2010, which had A (0.805 and 0.803), B (0.000 and 0.012), and R2 (0.800 and 0.778). There was no significant difference between 2010 and 2011 in terms of RMSE values. 3.3. Training and validation: seasonal performance
A year was divided into four sections: winter (January–March), spring (April–June), summer (July–September), and autumn (October–December). Four years of observed half-hourly data were used for model training and validation. Quarter-yearly results are provided in Table 4. For the training period, R2 ranged between 0.7397 and 0.8486; Ivf ranged between 0.8510 and 0.9868; outliers ranged between 1.04% and 2.4%; and RMSE ranged between 0.0701 and 0.1253. Summer results (R2, Ivf, and outliers) outperformed other seasons, but summer had a worse RMSE. The 2009 winter simulation did not predict the 2010 results well (R2 = 0.6378). Similarly, the 2008 winter simulation did not predict the 2011 results well (the worst overall Ivf generated). The 2008 winter simulation yielded the worst R2 (0.7397) and outliers (2.4%) overall. For the validation period, 2011 yielded the two maximum outliers in winter (2.85% and 2.25%). Ivf between model simulations and observations was always greater than 1.0 in summer. The best R2 performance overall was 0.8022 in autumn 2011, predicted by the 2008 simulation. R2 ranged between 0.6378 and 0.6699 in winter while it ranged between 0.7406 and 0.7938 in summer. The best Ivf overall (1.0033) was yielded in spring 2011, predicted by the 2009 simulation. The Ivf minimum value for 2011 was 0.6468, predicted by the 2008 simulation. Ivf ranged between 0.6468 and 0.8351 in winter and ranged between 1.0049 and 1.0246 in summer. The minimum outlier (1.19%) was yielded in spring 2011, predicted by the 2008 simulation. Outliers ranged between 1.19% and 1.47% in spring and summer while they ranged between 1.81% and 2.85% in winter. For the whole study period, RMSE showed an opposite trend. Summer had the largest RMSE, followed by spring, autumn, and finally winter. For the validation period, 2010 outperformed 2011 during growing seasons (spring and summer), but not in autumn and winter. 4.
Discussion To our knowledge, this is the first time that a GNN model has been used to simulate C flux and its key control factors for a Chinese fir
Table 3 Linear regression parameter list (Y = AX + B; Y is the simulation value; X is the observed value; R2 is the coefficient of determination). RMSE is the root mean square error. R2 ranged from 0.77 to 0.83. Parameters ranged from 0.803 to 0.835. Results demonstrated that 2011 outperformed 2010.

Period      Model  Year   A      B      R2      RMSE    N
Training    2008   2008   0.834  0.001  0.8280  0.0997  17,568
Training    2009   2009   0.835  0.000  0.8301  0.0993  17,520
Validation  2008   2010   0.805  0.000  0.8002  0.1040  17,520
Validation  2008   2011   0.827  0.012  0.7800  0.1065  17,520
Validation  2009   2010   0.803  0.000  0.7783  0.1098  17,520
Validation  2009   2011   0.822  0.011  0.7710  0.1087  17,520
Table 4 Comparison results of statistical performance for different periods of a year. A year was divided into four parts: January–March, April–June, July–September, and October–December (generally corresponding to winter, spring, summer, and autumn). Four years of observed data acquired at half-hourly intervals were used for model training and validation. Data are presented quarter-yearly in this table. Validation columns are labeled by model year and predicted year. Results demonstrated that the model performed better in summer than in winter.

                                        Training          Validation
                                                          Model 2008        Model 2009
Period                        Criterion 2008     2009     2010     2011     2010     2011
January–March (winter)        Ivf       0.8956   0.8510   0.7307   0.6468   0.8351   0.8343
                              R2        0.7397   0.7988   0.6536   0.6417   0.6378   0.6699
                              Outlier   2.40%    1.97%    2.06%    2.85%    1.81%    2.25%
                              RMSE      0.0701   0.0779   0.0985   0.0865   0.1007   0.0830
April–June (spring)           Ivf       0.9780   0.9647   0.9392   1.0431   0.9105   1.0033
                              R2        0.8238   0.8132   0.7938   0.7561   0.7649   0.7406
                              Outlier   1.26%    1.28%    1.47%    1.19%    1.30%    1.40%
                              RMSE      0.1087   0.1087   0.1057   0.1144   0.1128   0.1179
July–September (summer)       Ivf       0.9863   0.9868   1.0099   1.0246   1.0229   1.0049
                              R2        0.8238   0.8132   0.7938   0.7561   0.7649   0.7406
                              Outlier   1.20%    1.04%    1.38%    1.27%    1.40%    1.27%
                              RMSE      0.1253   0.1220   0.1112   0.1237   0.1164   0.1290
October–December (autumn)     Ivf       0.9747   0.9469   0.9052   0.9618   0.8367   0.9154
                              R2        0.8457   0.8319   0.7917   0.8022   0.7550   0.7957
                              Outlier   1.86%    2.04%    1.90%    1.63%    2.06%    2.06%
                              RMSE      0.0856   0.0812   0.0998   0.0970   0.1083   0.0985

plantation ecosystem. This study aimed to take advantage of GNNs to determine the important non-redundant factors that control C flux. Based on these controlling factors and the observed C flux data from an entire year, the following year’s C flux can be predicted. In our view, the GNN performed better than previous ANN models (Papale and Valentini, 2003; Diamantopoulou, 2005; Melesse and Hanley, 2005; Liu et al., 2012) because the genetic algorithm used in the GNN is able to overcome the slow convergence and the tendency toward local convergence of other algorithms (such as BPNN). In addition, the GNN model has good stability and less overfitting in the training period (Liu et al., 2012). The best results from ANN modeling do not derive from structural complexity or a greater number of input parameters. Hu et al. (2008) suggested that more inputs could generate more accumulated errors. This study combined different numbers of input parameters (groupings of 4, 5, 6, and 7) for model training. CIV 8 provided the best results for model training. It comprised Ta, CO2, Rh, PAR, Ws, and Ts. Plant photosynthesis, plant respiration, and soil respiration are the most important processes for C flux dynamics. Ta, CO2, and PAR play an important role in plant photosynthesis (Zhou et al., 2008; Antje et al., 2010). Rh is critical for stomatal opening (Zhou et al., 2008), which has an indirect effect on photosynthesis. Ws affects Ta, CO2, and Rh, and Ts plays a crucial role in controlling soil respiration (Phillips et al., 2010). Together, these variables provide optimal model performance. For this study, R2 and outliers provided better results for all summer training and validation procedures, but RMSE did not. R2 ranged between 0.7406 and 0.8238, and outliers between 1.04% and 1.40%, in summer. In contrast to the other indices, RMSE, which ranged between 0.0701 and 0.1290, was worst in summer and best in winter.
This is because the model failed to simulate C flux peaks during specific periods of a year. Summer, compared to other seasons, has larger C sequestration during daytime but larger C emission during nighttime. The difference between observed and predicted peaks is too large to yield a good RMSE. R2 ranged between 0.7397 and 0.8486 during training periods, and Ivf ranged from 0.8510 to 0.9868. The 2009 simulation outperformed the 2008 simulation in winter. For all other seasons, 2009 exhibited no obvious advantage over 2008. For autumn and spring, 2008 results were slightly better than 2009 results, but only occasionally was this true for an entire year (with the exception of summer). 2009 outperformed 2008 owing to the ice damage that
occurred in January and February 2008. R2 ranged between 0.6378 and 0.8554 for the model validation period. Ivf ranged from 0.6468 to 1.0431, its best performance occurring in spring (1.0033). Even though data between July 25 and August 22, 2010 were lost, the summer of 2010 yielded the best overall R2. This was largely on account of the good training model developed for 2008 and 2009 as well as the loss of data that occurred between July 25 and August 22, 2010. Since missing data were interpolated, the data could have had a similar pattern to the model (a phenomenon called overfitting). The winter of 2011 exhibited the worst performance overall. Generally speaking, winter has a greater amount of rainy, foggy, and snowy days, and this could result in unreliable data. Consequently, conditions for perfect model training and validation were not possible for winter simulations. Although this generally was not the case for summer, results for Ivf suggested not all summer model simulations outperformed other seasons. Spring 2011 resulted in the best Ivf performance overall (predicted by the 2009 simulation). The largest Ivf measured (1.0431) occurred in spring 2011 (predicted by the 2008 simulation), but this could be due to certain randomness in Ivf. For this study, summer Ivf performance was greater than 1.000 while yielding larger R2 and smaller outliers than other seasons. This is attributable to the absence of fog and frost days as well as plant growth promoted by favorable summer climate conditions. Table 3 provides evidence that the GNN model could be applied to predict yearly subtropical C. lanceolata plantation C flux. For different training periods, R2 ranged between 0.7397 and 0.8486. During validation periods, however, the worst R2 results exhibited a decrease (down to 0.6378) compared to the best R2 results (that can reach up to 0.8554). No significant differences were observed for Ivf and outliers for different seasons. 
Results also suggested that the GNN model performed well overall. For example, no overfitting in observed data was found. For winter, 2009 training outperformed 2008 training with regard to model validation. Summer yielded the best performance overall, followed by autumn and spring. Winter yielded the worst performance overall. This may be related to the fact that summer has less noisy data than autumn, spring, and winter, and that summer has more variation in carbon flux (the difference between carbon influx (uptake by photosynthesis) and carbon efflux (ecosystem respiration)). The value of simulating an entire year was greater than results from observed values. In other words, no good answer was discovered as to why the model failed to simulate C flux peaks during specific
periods of a year. This was similar to ANN C flux results reported in other studies (Papale and Valentini, 2003; Diamantopoulou, 2005; Melesse and Hanley, 2005; Sun et al., 2008; Liu et al., 2012). There are several possible reasons for this anomaly. First, observed C flux values may contain certain errors due to flux tower positioning. Second, measurement and model uncertainty (i.e., noisy data or an imperfect model) may contribute, as discussed by Hagen et al. (2006).
5. Conclusions
An ANN was used to simulate C flux, applying select micrometeorological parameters obtained between 2008 and 2011 from a subtropical C. lanceolata plantation in China. Seven inputs were selected to run the model. Half-hourly observed micrometeorological data were divided into a training set (2008 and 2009 data) and a validation set (2010 and 2011 data). Results showed that the ANN was able to successfully simulate C flux for a C. lanceolata plantation. Simulated C flux was in good agreement with observations. In general, model performance was better in summer than in other seasons except for the peak values. A more accurate GNN model with no overfitting would have resulted if there had been no missing data during the training and validation periods. However, the ANN was still unable to simulate the C flux peak value, which may be attributable to uncertainties related not only to measurements and noisy data but also to the model itself (parameterization and choice of functional form). Future studies on this topic should focus on capturing C flux peak values and seasonality.
Acknowledgments
This study was supported by the Chinese 973 Program (2010CB833504-X), the Introducing Advanced Technology Program (948 Program) (no. 2010-4-03), the Program for Science and Technology Innovative Research Team in Higher Educational Institutions of Hunan Province, and the Furong Scholar Program. We thank Brian Doonan for his assistance in editing the manuscript.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ecolmodel.2014.09.006.
References
Aggarwal, C.C., Yu, P.S., 2001. Outlier detection for high dimensional data. ACM SIGMOD Record 30 (2), 37–46.
Antje, M.M., Clemens, B., Galinl, C., et al., 2010. Characterization of ecosystem responses to climatic controls using artificial neural networks. Glob. Change Biol. 16, 2737–2749.
Bauer, J.R., 1994. Genetic Algorithms and Investment Strategies, Wiley Finance Editions. John Wiley and Sons, New York, pp. 55–94.
Blessing, R.H., 1997. Outlier treatment in data merging. J. Appl. Cryst. 30, 421–426.
Bracho, R., Starr, G., Gholz, H.L., Martin, T.A., Cropper Jr., W.P., Loescher, H.W., 2012. Controls on carbon dynamics by ecosystem structure and climate for southeastern U.S. pine plantations. Ecol. Monogr. 82, 101–128.
Caudill, M., Butler, C., 1992. Understanding Neural Networks. Basic Networks, 1st ed. MIT Press, Cambridge, MA.
Davis, L., 1994. Genetic algorithms and financial applications. In: Deboeck, G.J. (Ed.), Trading on the Edge. Wiley, pp. 133–147.
Deng, X.W., Kang, W.X., Tian, D.L., et al., 2007. Runoff changes in Chinese fir plantations at different age classes Huitong, Hunan Province. Scientia Silvae Sinicae 43 (6), 1–6 (in Chinese).
Baldocchi, D.D., 2003. Assessing the eddy covariance technique for evaluating carbon dioxide exchange rates of ecosystems: past, present and future. Glob. Change Biol. 9 (4), 479–492.
Diamantopoulou, M.J., 2005. Artificial neural networks as an alternative tool in pine bark volume estimation. Comput. Electron. Agr. 48, 235–244.
Eva, F., Dennis, B., Richard, O., et al., 2001. Gap filling strategies for defensible annual sums of net ecosystem exchange. Agr. Forest. Meteorol. 107, 43–69.
Geman, S., Bienenstock, E., Doursat, R., 1992. Neural networks and the bias/variance dilemma. Neural Comput. 4, 1–58.
Geng, J., Ruan, H.H., Tu, L.L., et al., 2012. Estimation of net primary productivity by a satellite data-driven Carnegie-Ames-Stanford approach model in Wawu Mountain forest farm. China Forest. Sci. Technol. 26 (3), 90–96 (in Chinese).
Goldberg, D., Holland, J.H., 1988. Genetic algorithms in machine learning. Mach. Learn. 3, 95–99.
Hagen, S.C., Braswell, B.H., Linder, E., et al., 2006. Statistical uncertainty of eddy flux-based estimates of gross ecosystem carbon exchange at Howland Forest. J. Geophys. Res. 111 (8), 1984–2012.
Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press.
Hodge, V.J., Austin, J., 2004. A survey of outlier detection methodologies. Artif. Intell. Rev. 22 (2), 85–126.
Hu, C.H., Hao, Y.H., Pang, B., et al., 2008. Simulation of spring flows from a karst aquifer with an artificial neural network. Hydrol. Process. 22, 596–604.
Ingram, J.C., Dawson, T.P., Whittaker, R.J., 2005. Mapping tropical forest structure in southeastern Madagascar using remote sensing and artificial neural networks. Remote Sens. Environ. 94, 491–507.
Lei, J.F., 2005. Forest Resources of China, 1st ed. China Forestry Publish, Beijing, pp. 172 (in Chinese).
Levine, E.R., Kimes, D.S., Sigillito, V.G., 1996. Classifying soil structure using neural networks. Ecol. Model. 92, 101–108.
Licznar, P., Nearing, M.A., 2003. Artificial neural networks of soil erosion and runoff prediction at the plot scale. Catena 51, 89–114.
Liu, Z.L., Peng, C.H., Xiang, W.H., et al., 2010. Application of artificial neural networks in global climate change and ecological research: an overview. Chinese Sci. Bull. 55, 2987–2997.
Liu, Z.L., Peng, C.H., Xiang, W.H., et al., 2012. Simulations of runoff and evapotranspiration in Chinese fir plantation ecosystems using artificial neural networks. Ecol. Model. 226, 71–76.
Lugo, A.D., Ball, J., Carle, J., 2006. Global planted forest thematic study: results and analysis. In: Planted Forests and Trees Working Paper. Food and Agriculture Organization of the United Nations (FAO) Forestry Department. Wiley, Rome.
Melesse, A.M., Hanley, R.S., 2005. Artificial neural network application for multiecosystem carbon flux simulation. Ecol. Model. 189, 305–314.
Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models. J. Hydrol. 10, 282–290.
Papale, D., Valentini, R., 2003. A new assessment of European forests carbon exchanges by eddy fluxes and artificial neural network spatialization. Glob. Change Biol. 9, 525–535.
Pattey, E., Strachan, I.B., Desjardins, R.L., et al., 2002. Measuring nighttime CO2 flux over terrestrial ecosystems using eddy covariance and nocturnal boundary layer methods. Agr. Forest. Meteorol. 113, 145–158.
Phillips, C.L., Nickerson, N., Risk, D., et al., 2010. Interpreting diel hysteresis between soil respiration and temperature. Glob. Change Biol. 17 (1), 515–527.
Piao, S., Fang, J., Philippe, C., et al., 2009. The carbon balance of terrestrial ecosystems in China. Nature 458, 1009–1013.
Reichstein, M., Falge, E., Baldocchi, D., et al., 2005. On the separation of net ecosystem exchange into assimilation and ecosystem respiration: review and improved algorithm. Glob. Change Biol. 11 (9), 1424–1439.
Ryan, M.G., Binkley, D., Fownes, J., Giardina, C.P., Senock, R.S., 2004. An experimental test of the causes of forest growth decline with stand age. Ecol. Monogr. 74 (3), 393–414.
Schimel, D.S., House, J.I., Hibbard, K.A., et al., 2001. Recent patterns and mechanisms of carbon exchange by terrestrial ecosystems. Nature 414, 169–172.
Scrinzi, G., Marzullo, L., Galvagni, D., 2007. Development of a neural network model to update forest distribution data for managed alpine stands. Ecol. Model. 206, 331–346.
Somaratne, S., Seneviratne, G., Coomaraswamy, U., 2005. Prediction of soil organic carbon across different land-use patterns: a neural network approach. Soil Sci. Soc. Am. J. 69, 1580–1589.
Sun, J., Peng, C., McCaughey, H., et al., 2008. Simulating carbon exchange of Canadian boreal forests II. Comparing the carbon budgets of a boreal mixedwood stand to a black spruce forest stand. Ecol. Model. 219, 276–286.
Wen, J., Zhao, J.L., Luo, S.W., et al., 2000. The improvements of BP neural network learning algorithm. 5th International Conference Signal Processing Proceedings 3, 1647–1649.
Wong, F., Tan, C., 1994. In: Deboeck, G.J. (Ed.), Trading on the Edge, Machine Learning: Hybrid Neural, Genetic, and Fuzzy Systems. Wiley, pp. 243–261.
Yuksel, N., Turkoglu, M., Baykara, T., 2000. Modelling of the solvent evaporation method for the preparation of controlled release acrylic microspheres using neural networks. J. Microencapsul. 17 (5), 541–551.
Zhao, M.F., Xiang, W.H., Tian, D.L., et al., 2009. Simulating age-related changes in carbon storage and allocation in a Chinese fir plantation growing in southern China using the 3-PG model. Forest Ecol. Manag. 257, 1520–1531.
Zhao, Z.H., Zhang, L.P., Kang, W.X., et al., 2011. Characteristics of CO2 flux in a Chinese Fir plantation ecosystem in Huitong County, Hunan Province. Scientia Silvae Sinicae 47 (11), 6–12 (in Chinese).
Zhou, X., Peng, C., Dang, Q., et al., 2008. Simulating carbon exchange in Canadian Boreal forests I. Model structure, validation, and sensitivity analysis. Ecol. Model. 219, 287–299.