Evaluation of different approaches for identifying optimal sites to predict mean hillslope soil moisture content


Accepted Manuscript (Research papers)

Kaihua Liao, Zhiwen Zhou, Xiaoming Lai, Qing Zhu, Huihui Feng

PII: S0022-1694(17)30052-5
DOI: http://dx.doi.org/10.1016/j.jhydrol.2017.01.043
Reference: HYDROL 21782

To appear in: Journal of Hydrology

Received Date: 12 September 2016
Revised Date: 21 January 2017
Accepted Date: 23 January 2017

Please cite this article as: Liao, K., Zhou, Z., Lai, X., Zhu, Q., Feng, H., Evaluation of different approaches for identifying optimal sites to predict mean hillslope soil moisture content, Journal of Hydrology (2017), doi: http://dx.doi.org/10.1016/j.jhydrol.2017.01.043

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Kaihua Liao (a), Zhiwen Zhou (a), Xiaoming Lai (a), Qing Zhu (a, b, *), Huihui Feng (a)

(a) Key Laboratory of Watershed Geographic Sciences, Nanjing Institute of Geography and Limnology, Chinese Academy of Sciences, Nanjing 210008, China

(b) Jiangsu Collaborative Innovation Center of Regional Modern Agriculture & Environmental Protection, Huaiyin Normal University, Huaian, 223001, China

Submitted to: Journal of Hydrology

(*) Corresponding author. Tel.: +86 25 86882139; fax: +86 25 57714759. E-mail address: [email protected] (Q. Zhu).

ABSTRACT: The identification of representative soil moisture sampling sites is important for the validation of remotely sensed mean soil moisture in a certain area and for ground-based soil moisture measurements in catchment or hillslope hydrological studies. Numerous approaches have been developed to identify optimal sites for predicting mean soil moisture. Each method has certain advantages and disadvantages, but they have rarely been evaluated and compared. In our study, surface (0-20 cm) soil moisture data from January 2013 to March 2016 (a total of 43 sampling days) were collected at 77 sampling sites on a mixed land-use (tea and bamboo) hillslope in the hilly area of Taihu Lake Basin, China. A total of 10 methods (temporal stability (TS) analyses based on 2 indices, K-means clustering based on 6 kinds of inputs and 2 random sampling strategies) were evaluated for determining optimal sampling sites for mean soil moisture estimation. They were: TS analyses based on the smallest index of temporal stability (ITS, a combination of the mean relative difference and the standard deviation of relative difference (SDRD)) and on the smallest SDRD; K-means clustering based on soil properties and terrain indices (EFs), repeated soil moisture measurements (Theta), EFs plus one-time soil moisture data (EFsTheta), and the principal components derived from EFs (EFs-PCA), Theta (Theta-PCA), and EFsTheta (EFsTheta-PCA); and global and stratified random sampling strategies. Results showed that the TS based on the smallest ITS was better (RMSE = 0.023 m³ m⁻³) than that based on the smallest SDRD (RMSE = 0.034 m³ m⁻³). The K-means clustering based on EFsTheta (-PCA) was better (RMSE < 0.020 m³ m⁻³) than those based on EFs (-PCA) and Theta (-PCA). The sampling design stratified by land use was more efficient than the global random method. Forty and 60 sampling sites are needed for stratified sampling and global sampling, respectively, to make their performances comparable to the best K-means method (EFsTheta-PCA). Overall, TS required only one site, but its accuracy was limited. The best K-means method required <8 sites and yielded high accuracy, but extra soil and terrain information is necessary when using this method. The stratified sampling strategy is appropriate only when no prior knowledge of soil moisture variation is available. This information will help in selecting the optimal method for estimating the area-mean soil moisture.

Keywords: Mean soil moisture; Temporal stability; K-means clustering; Random sampling

1. Introduction

Soil moisture is an important variable influencing water, solute and energy fluxes at the Earth's surface (Vereecken et al., 2007; Destouni and Verrot, 2014). It is a major component of the hydrologic cycle, controlling processes of runoff, infiltration and evapotranspiration at various scales (Pachepsky et al., 2003). In addition, soil moisture variations have a substantial influence on nutrient losses and availability (Zhu et al., 2009; Schmidt et al., 2011). Therefore, soil moisture variation is critical in hydrological, ecological and environmental studies (Fu et al., 2003; Lin, 2006; Zhu and Lin, 2011; Romano, 2014).

Passive and active microwave satellites have been demonstrated to provide useful retrievals of near-surface soil moisture variations at regional and global scales (McCabe et al., 2005; Owe et al., 2008; Liu et al., 2011). Recently, the launch of the Soil Moisture Active Passive (SMAP) satellite has produced large amounts of soil
moisture data, with improved spatio-temporal resolution (Vereecken et al., 2014). The most important step after obtaining satellite-based soil moisture estimates is to validate them against observed values. This requires upscaling point-scale soil moisture measurements to satellite-scale pixel averages. However, direct measurement of soil moisture content at the satellite resolution is costly, time-consuming and labor-intensive (Hu and Si, 2014). In this case, representative sampling sites need to be identified for accurate prediction of mean soil moisture content in a certain area and for ground-based soil moisture measurements in catchment or hillslope hydrological studies.

Temporal stability (TS) analysis has often been used to determine the optimal sampling site of mean soil moisture in previous studies (Hu et al., 2010; Zhao et al., 2010; Penna et al., 2013; Li and Shao, 2014). Vachaud et al. (1985) defined the TS as the time-invariant association between a spatial location and classical statistical parameters. The optimal sampling site for estimating the field-mean soil moisture should have the smallest standard deviation of the mean difference between the soil moisture of this site and the observed field-mean soil moisture. An advantage of the TS analysis is that only one site is needed for the mean soil moisture prediction. However, there is some debate as to whether the representative sites identified by this approach can always yield reliable estimates. Several studies (e.g., Zhao et al., 2010; Penna et al., 2013) have reported that the representative site with the smallest standard deviation of relative difference (SDRD) is more appropriate. However, others (e.g., Yang, 2010; Van Arkel and Kaleita, 2014) selected
the site with the smallest index of TS (ITS), which was calculated as a combination of the mean relative difference and the SDRD (refer to Section 2.4.1 for the equations to calculate ITS). They have argued that the TS method had poor accuracy since the ITS of rank-stable locations changed from time to time. The mixed results suggest that the accuracy of the TS method depends on the empirical data used to identify representative sites. Therefore, further evaluation of the performance of different TS methods is still needed.

Recently, Van Arkel and Kaleita (2014) applied soil properties and terrain indices as inputs into the K-means clustering algorithm to determine the critical sampling sites. They found that this method performed better than TS analysis in estimating field-mean soil moisture. K-means clustering separates the data points into a predefined number (k) of clusters containing points with similar characteristics. In this algorithm, cluster centers are initially chosen at random from the input observation vectors. An iterative approach is then employed to minimize the Euclidean distance between the input vectors and the centroid vectors. Finally, the centroid vector of each cluster is identified. When using this method, exhaustive presampling of soil moisture is not needed (Van Arkel and Kaleita, 2014). However, its results are sensitive to differences in the magnitude or scale of the original variables (Yeung and Ruzzo, 2001; Luai et al., 2006). K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round clusters. As a result, clusters tend to be separated along variables with higher variance, which can reduce clustering efficiency. In addition, Euclidean distances tend to be inflated
in high-dimensional spaces, leading to very large and approximately equal distances between any two data points (Yao and Ruzzo, 2006). Therefore, running a dimensionality reduction algorithm such as principal component analysis (PCA) prior to K-means clustering can mitigate these problems and achieve more stable results (Ben-Hur and Guyon, 2003). However, no study has reported the use of PCA and K-means clustering for identifying optimal moisture sampling sites with the consideration of various combinations of inputs.

The third method is the random sampling strategy. This method was demonstrated to be comparable to the TS analysis (Yang, 2010). Van Arkel and Kaleita (2014) also proposed that random sampling can be a viable approach to estimate the field-mean soil moisture. The advantage of this method is that no pre-analysis is needed. However, it may produce erratic results when a limited number of data points are selected (Aune-Lundberg and Strand, 2014). Little research has been conducted to quantify the influence of sampling density on the accuracy of random sampling for estimating the field-mean soil moisture. Such analysis is important for selecting the optimal number of sampling sites. In addition, a sampling design stratified by a variable correlated with the target variable has been demonstrated to obtain precise regional estimates (Stehman et al., 2011). For example, land use has been considered to play a significant role in the spatio-temporal pattern of soil moisture (Jia et al., 2013; Wang et al., 2013), so this property can be used for sample stratification. However, the stratified sampling strategy has rarely been used in mean soil moisture prediction in a certain area.

The advantages and disadvantages of the TS, K-means clustering and random sampling strategy have not been systematically compared and comprehensively discussed in previous studies. Therefore, the objective of this study is to evaluate these three approaches for identifying optimal sites to predict mean hillslope soil moisture content. Specifically, a selection strategy is proposed for all considered methods, with consideration of the accuracy and resources required for the prediction of mean soil moisture. We hypothesize that comparing different methods can help find optimal sites and improve the mean soil moisture estimation with limited resources.

2. Materials and methods

2.1. Study hillslope

This study was conducted on a hillslope (31°21′N, 119°03′E) with an area of 0.6 ha in the hilly area of Taihu Lake Basin, China (Fig. 1). The study area is characterized by a north subtropical to middle subtropical transitional monsoon climate with four distinct seasons. From 2006 to 2016, the annual mean temperature was 15.9 °C and the annual mean precipitation was 1157 mm. Green tea (Camellia sinensis (L.) O. Kuntze) and Moso bamboo (Phyllostachys edulis (Carr.) H. de Lehaie) are dominant on the hillslope. The elevation of the hillslope ranges from 77 to 88 m and the slope ranges from 0 to 21%. The soil type of the hillslope is shallow lithosols according to the FAO soil classification (Orthents according to Soil Taxonomy). The parent material is quartz sandstone. Soils have a silt loam texture, with silt content generally larger than 60%. Surface (0-20 cm) soil organic matter contents
were about 2% on this hillslope. The depth to bedrock varies from <0.3 m at the summit slope position to about 1.0 m at the foot slope position (Liao et al., 2016).

2.2. Soil moisture measurement

For monitoring volumetric soil moisture, polyvinyl chloride access tubes were installed at 77 sites on the hillslope (Fig. 1). A portable time-domain reflectometry (TDR) TRIME-PICO-IPH soil moisture probe (IMKO, Ettlingen, Germany) was used on 43 dates from January 2013 to March 2016. The factory-set calibration curve that translates the dielectric constant of the soil into soil water content was used for all measurements. Before the campaign on each survey date, the TDR probe was calibrated in buckets with dry and saturated beads following the standard procedure in the user manual. Volumetric soil moisture was measured at the depth of 0 to 20 cm each time (note that the TRIME-PICO-IPH probe has a length of 18 cm). At each site, the TRIME-PICO-IPH probe was rotated in the access tube to face different directions and 3 readings were taken. The average of these readings was used as the final water content for each site on a specific date. In addition, an automatic rain gauge was set up to record the rainfall.

2.3. Soil properties and terrain attributes

Around each soil moisture access tube (within a distance of 1 m), soil samples at the depth of 0 to 20 cm were collected using a hand auger. Three subsamples were collected for each site and then fully mixed. These samples were air dried, weighed, ground and sieved through a 2 mm polyethylene sieve. Particles larger than 2 mm (rock fragments) were weighed to determine the rock fragment (RF) content. Soils
that passed through the 2 mm polyethylene sieve were used to analyze the particle size distribution using the Malvern Mastersizer 2000 laser analyzer (Malvern Instruments Inc., Worcestershire, UK). The fractions of <0.002 mm (clay), 0.002–0.05 mm (silt), and 0.05–2 mm (sand) were determined for each soil sample. The percentage of organic matter (OM) in the soil was measured by the titration method, which is based on the oxidation of organic matter by K2Cr2O7. In addition, the depths to bedrock (DB) of all 77 sites were determined when installing the access tubes for soil moisture measurements and taking soil samples using a hand auger. A high-resolution (1 m) digital elevation model (DEM) of the study hillslope was derived from a 1:1000 contour map. Terrain attributes including elevation, slope, plane curvature (PLC), profile curvature (PRC), and topographic wetness index (TWI) were determined from this DEM in ArcGIS 10.0 (ESRI, Redlands, CA) (Tarboton, 1997; Ågren et al., 2014). The terrain attributes of all 77 sampling points were then extracted. Soil properties and terrain attributes of this hillslope are listed in Table 1.

2.4. Approaches for identifying optimal sampling sites

2.4.1. Temporal stability analysis

The TS of soil moisture was analyzed using the approach proposed by Vachaud et al. (1985):

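$$ \bar{\theta}_j = \frac{1}{N} \sum_{i=1}^{N} \theta_{ij} \tag{1} $$

$$ \delta_{ij} = \frac{\theta_{ij} - \bar{\theta}_j}{\bar{\theta}_j} \tag{2} $$

$$ MRD_i = \frac{1}{M} \sum_{j=1}^{M} \delta_{ij} \tag{3} $$

$$ SDRD_i = \sqrt{\frac{1}{M-1} \sum_{j=1}^{M} \left( \delta_{ij} - MRD_i \right)^2} \tag{4} $$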

where θij is the soil moisture at site i on day j; θ̄j is the arithmetic mean of soil moisture over all sites on day j; N is the number of sites; δij is the relative difference of soil moisture at site i on day j; and M is the number of sampling days. MRD is the arithmetic mean relative difference of soil moisture at site i; SDRD is the standard deviation of the relative difference. There are two criteria of TS. The first one is the SDRD: a smaller SDRD means that a site is more temporally stable. Given the soil moisture content (θOSS) from the optimal sampling site (OSS) with the smallest SDRD, the mean soil moisture (θest) can be calculated using the following equation (Grayson and Western, 1998):

$$ \theta_{est} = \frac{\theta_{OSS}}{1 + MRD_{OSS}} \tag{5} $$

where MRDOSS is the MRD from this OSS. The second one is the index of TS (ITS), which can be defined as (Penna et al., 2013):

$$ ITS = \sqrt{MRD^2 + SDRD^2} \tag{6} $$
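The temporal stability indices above can be illustrated with a minimal Python sketch. It assumes the soil moisture data are held as a plain nested list `theta[i][j]` (site i, day j); the function names and the small data set are hypothetical and serve only to show how Eqs. (1)-(6) fit together, not to reproduce the analysis of this study.

```python
import math


def temporal_stability(theta):
    """Return (MRD, SDRD, ITS) per site for theta[i][j] (site i, day j)."""
    n_sites, n_days = len(theta), len(theta[0])
    # Eq. (1): spatial mean of all sites on each day
    day_mean = [sum(theta[i][j] for i in range(n_sites)) / n_sites
                for j in range(n_days)]
    stats = []
    for i in range(n_sites):
        # Eq. (2): relative difference of site i on each day
        delta = [(theta[i][j] - day_mean[j]) / day_mean[j]
                 for j in range(n_days)]
        mrd = sum(delta) / n_days                                # Eq. (3)
        sdrd = math.sqrt(sum((d - mrd) ** 2 for d in delta)
                         / (n_days - 1))                         # Eq. (4)
        its = math.sqrt(mrd ** 2 + sdrd ** 2)                    # Eq. (6)
        stats.append((mrd, sdrd, its))
    return stats


def estimate_from_sdrd_site(theta_oss, mrd_oss):
    """Eq. (5): correct the reading at the smallest-SDRD site by its MRD."""
    return theta_oss / (1.0 + mrd_oss)


# Hypothetical calibration data (m3 m-3): 4 sites x 3 days, illustration only.
theta = [
    [0.10, 0.20, 0.30],
    [0.20, 0.40, 0.60],
    [0.14, 0.32, 0.44],
    [0.16, 0.28, 0.48],
]
stats = temporal_stability(theta)
site_sdrd = min(range(len(stats)), key=lambda i: stats[i][1])  # smallest SDRD
site_its = min(range(len(stats)), key=lambda i: stats[i][2])   # smallest ITS

# A new reading at each optimal site gives two estimates of the hillslope mean.
new_day = [0.12, 0.24, 0.17, 0.19]  # hypothetical validation-day readings
est_sdrd = estimate_from_sdrd_site(new_day[site_sdrd], stats[site_sdrd][0])
est_its = new_day[site_its]  # the smallest-ITS site is used without correction
```

With these made-up numbers the two criteria select different sites: the first site is consistently drier than the mean but very stable (smallest SDRD), whereas the third site stays close to the mean (smallest ITS). This is why the SDRD-based estimate needs the MRD correction of Eq. (5) while the ITS-based estimate does not.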

The soil moisture content from the OSS with the smallest ITS was taken directly as the mean soil moisture content.

2.4.2. K-means clustering and principal component analysis

K-means is one of the most commonly used clustering algorithms. It separates the data points into a predefined number (k) of clusters containing points with similar characteristics. In the K-means algorithm, cluster centers are initially chosen at
random from the set of input observation vectors. An iterative approach is then employed to minimize the squared Euclidean distance ξ between the input vectors and their centroid vectors:

$$ \xi = \sum_{i=1}^{n} \lVert d_i - C_j \rVert^2 \tag{7} $$

where Cj is the center of the jth cluster, which is the center nearest to data object di, and n is the number of elements in the dataset. The centroid vector of each cluster was then identified. Finally, the input vector with the smallest distance from each centroid was taken as the best matching unit (BMU) of that cluster centroid. The mean soil moisture (θest) can be calculated using the following equation:

$$ \theta_{est} = \frac{1}{N} \sum_{i=1}^{k} \theta_{BMU_{ij}} \, n_i \tag{8} $$

where θBMUij is the soil moisture content on the jth day for the BMU to the centroid of the ith cluster, k is the number of clusters, ni is the number of sampling sites in the ith cluster, and N is the total number of sampling sites (N=77).

It is difficult to determine the optimal number of clusters in the K-means algorithm, so a trial-and-error approach was applied. However, sampling efficiency must also be considered, because direct measurement of soil moisture content on the hillslope is costly, time-consuming and labor-intensive. We therefore explored the selection of 2, 4, 6 and 8 clusters. This represents a range of OSS numbers from 2 (the smallest number) to approximately 10% of the sampling sites (the ideal number).

Three kinds of input variables were used in the algorithm. The first one, denoted as “Theta”, applied the temporal soil moisture data during the calibration period to
find the OSSs. The second one, denoted as “EFs”, employed the RF, sand, silt, clay, OM, DB, elevation, slope, PLC, PRC and TWI as inputs into the algorithm. The last one, denoted as “EFsTheta”, used the EFs and the soil moisture data on 9 January 2013 (the first sampling date). In order to reduce the input dimensions, PCA was applied to each kind of input variable (giving “Theta-PCA”, “EFs-PCA”, and “EFsTheta-PCA”). Each principal component (PCi) is a linear combination of the standardized original variables (Uj), as given by the following equation:

$$ PC_i = \sum_{j=1}^{M} a_{ij} U_j, \qquad a_{ij} = L_{ij} / \lambda_i \tag{9} $$

where M is the number of original variables, aij is the factor score coefficient, Lij is the factor loading, and λi is the variance of PCi. Each PCi explains a part of the total variance of Uj, with the total variance equal to the sum of λi:

$$ \sum_{j=1}^{M} \mathrm{Var}(U_j) = \sum_{i=1}^{M} \lambda_i \tag{10} $$
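The clustering steps of this section can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the two-column feature table and the moisture readings are hypothetical, a plain Lloyd iteration stands in for whatever K-means routine was actually run, and the input rows could equally be retained PC scores instead of standardized variables. The helper `n_retained_pcs` only applies a cumulative-variance threshold to a list of PC variances in the sense of Eq. (10); it does not compute the PCA itself.

```python
import math
import random


def standardize(rows):
    """Scale every column to zero mean and unit variance (no constant columns)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance, the summand of Eq. (7)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration from randomly chosen initial centres."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centre
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def best_matching_units(points, labels, centroids):
    """Index of the site nearest to each non-empty cluster centroid."""
    bmus = []
    for c, centre in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            bmus.append(min(members, key=lambda i: sq_dist(points[i], centre)))
    return bmus


def estimate_mean(theta_day, labels, bmus):
    """Eq. (8): BMU readings weighted by the size of their cluster."""
    return sum(theta_day[b] * labels.count(labels[b]) for b in bmus) / len(labels)


def n_retained_pcs(variances, threshold=0.90):
    """Smallest number of leading PCs whose variance share exceeds threshold."""
    total, running = sum(variances), 0.0
    for n, lam in enumerate(sorted(variances, reverse=True), start=1):
        running += lam
        if running / total > threshold:
            return n
    return len(variances)


# Hypothetical features per site (elevation in m, clay in %), illustration only.
features = [[77, 10], [78, 12], [79, 11], [86, 30], [87, 31], [88, 29], [87, 30]]
theta_day = [0.12, 0.13, 0.14, 0.24, 0.25, 0.26, 0.25]  # hypothetical readings

points = standardize(features)
labels, centroids = kmeans(points, k=2)
bmus = best_matching_units(points, labels, centroids)
theta_est = estimate_mean(theta_day, labels, bmus)
```

Standardizing the columns first reflects the scale sensitivity of K-means noted in the Introduction. Only the BMU sites need to be visited on later sampling dates; the cluster sizes act as fixed weights.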

The first few PCs that together explain more than 90% of the total variance were retained in the PCA.

2.4.3. Random sampling strategy

Two kinds of random sampling strategy were used in this study. The first one is global random sampling, in which 100 random realizations of 10, 20, 40, 60 and 70 points were generated from all 77 sites. For each realization, sampling was conducted without replacement in such a way that each point had an equal probability of being chosen. The multiple realizations represent the uncertainty in mean soil moisture estimation. The mean soil moisture content can be calculated by averaging the
observed soil moisture at the corresponding random points on each day. The second one is stratified random sampling. First, all 77 points were partitioned into 2 subgroups based on the land use. One hundred random realizations of 5, 10, 20, 30 and 35 points were then generated for each subgroup. Finally, the mean soil moisture was obtained. These two sampling strategies are also the most common sampling approaches. Global sampling is appropriate when the entire population is homogeneous, while stratified sampling is generally used when the population is heterogeneous or dissimilar. Stratified sampling offers certain advantages and disadvantages compared to global sampling. It is essential to compare these two sampling strategies for predicting mean hillslope soil moisture content.

2.5. Evaluation criteria

The temporal soil moisture data on the first 25 dates (approximately 60% of all observation dates) were used as the calibration period for the analyses of TS and K-means clustering. The remaining 18 dates (approximately 40%) were used as the validation period to test the performances of the different approaches for predicting the mean soil moisture content, based on the Pearson correlation coefficient (r), the mean error (ME) and the root mean squared error (RMSE).

3. Results and discussion

3.1. Temporal variations of mean soil moisture content

The temporal variations of the mean soil moisture were analyzed. From January 2013 to March 2016, the mean soil moisture contents ranged from 0.090 to 0.252 m³ m⁻³, showing a substantial fluctuation of the mean soil moisture on the study hillslope
(Fig. 2). This fluctuation was influenced by precipitation and evapotranspiration. However, seasonal patterns of mean soil moisture were similar from one year to the next. The highest mean soil moisture contents were observed in summer (from June to August) owing to intense and heavy rainfall, while the lowest moisture values were found in winter and spring (from November to March) owing to the relatively low precipitation. Temporal variations of soil moisture variability, as described by the standard deviation (SD), were also investigated. The SDs of soil moisture ranged from 0.037 to 0.101 m³ m⁻³. The temporal variations of SD showed a similar trend to those of the mean soil moisture (Fig. 2). A significant positive linear correlation (R² = 0.587, p<0.01, n=43) was found between the mean soil moisture and SD, indicating that the soil moisture variation increased as the soil became wetter. These results differ from those of some previous studies (e.g., Western et al., 2003; Vereecken et al., 2007; Penna et al., 2009; Brocca et al., 2012), which found a convex upward relationship between the mean soil moisture and SD. These varied results can be attributed to the different ranges of soil moisture monitored in different studies. If soil moisture is monitored over a wide range (e.g., from very dry to very wet), a bell-shaped relationship between the mean soil moisture and SD is expected. However, if the monitored soil moisture covers only the relatively dry range, as in our study, a positive relationship is more likely to be observed (e.g., Famiglietti et al., 1998; Hupet and Vanclooster, 2002).

3.2. Identification of representative sites based on TS analysis

The relationship between MRD and SDRD was investigated. The rank-ordered MRD and its associated SDRD, as well as the ITS values, are shown in Fig. 3. The range of MRD was 1.869 (from -0.804 to 1.065), which was comparable to that reported by Hu et al. (2010), but greater than those of Grayson and Western (1998), Mohanty and Skaggs (2001), and Grant et al. (2004). The main reason for the wide range of MRD may be the wide range of terrain attributes on this hillslope. For example, the elevation ranges from 77 to 88 m on the 0.6-ha hillslope, giving a large ratio (18.3) of elevation change (m) to total area (ha). The SDRD was also found to vary greatly in space, with a range of 0.053–0.399. During the calibration period, MRD was positively correlated (R² = 0.316, p<0.01, n=77) with SDRD, which implies that drier sites were found to produce more pronounced TS. This is consistent with the results of Martínez-Fernández and Ceballos (2003) and Hu et al. (2010).

The representative site was identified based on the smallest SDRD. The SDRD of site 13 was the smallest, indicating that it can be considered as the OSS to predict mean soil moisture content according to Eq. (5). This site is located near the boundary of the hillslope and has a slope of 10.22%, which is close to the mean slope of the hillslope (10.04%). Brocca et al. (2012) found that time stability patterns were preserved in sites with average topographic characteristics. The mean soil moisture contents predicted using site 13 were plotted against the observed mean values during the validation period (Fig. 4). The r value was 0.545 (p=0.019, n=18), indicating that only marginal accuracy was achieved. The ME (-1.89%) was less than 0, suggesting that the mean soil moisture contents were generally overestimated. Compared to other studies,
the RMSE (0.034 m³ m⁻³) suggests that our result is worse than those reported by Zhao et al. (2010) and Penna et al. (2013). Hu et al. (2010) also noted that it is risky to identify sites for mean soil moisture content based on the SDRD when using Eq. (5), owing to the sensitivity of TS results to the bias of the relative difference.

The representative site was also identified based on the smallest ITS. The ITS ranged between 0.101 and 1.118, showing considerable variation in space (Fig. 3). The smallest ITS was found at site 35, which may be related to the fact that the elevation of this site (82.04 m) is close to the mean hillslope elevation (81.85 m). Therefore, site 35 can directly represent the mean soil moisture conditions on the study hillslope. The soil moisture contents from this site were plotted against the observed mean values during the validation period (Fig. 4). The r, ME and RMSE values were 0.912 (p<0.001, n=18), 2.03% and 0.023 m³ m⁻³, respectively. A strong correlation was found between the observed and the predicted values. The ME was larger than 0, suggesting that site 35 generally underestimated the mean soil moisture contents.

The performances of the TS analyses based on the 2 indices were evaluated. In terms of the RMSE, the performance of the TS method based on site 35 was better than that based on site 13, but still worse than those reported by Zhao et al. (2010) and Penna et al. (2013). This can be attributed to the fact that the ITS of rank-stable locations differed between the calibration and validation periods (results not shown). The ITS of site 35 was not the smallest during the validation period, which can decrease the accuracy of predictions. Previous studies also found that the accuracy of the TS method is limited when the TS results change between time periods
(Martínez-Fernández and Ceballos, 2003; Schneider et al., 2008).

3.3. Identification of representative sites using K-means clustering

The representative sites were identified by using the PCA and K-means clustering methods. The PCs derived from the original variables (EFs, Theta and EFsTheta) were used as inputs in the K-means clustering algorithm to determine critical sampling sites. The first 3 PCs explain 92.18% of the total variance in Theta (Fig. 5), whereas the first 7 PCs are needed to explain more than 90% of the total variance in EFs and EFsTheta. Therefore, these corresponding PCs were used in the K-means approach. The representative sites identified by K-means clustering differed depending on the input data and the number of clusters (Table 2). This is consistent with the findings of Van Arkel and Kaleita (2014). However, some interesting findings can be observed in our study. For example, sites 28 and 33 were selected 10 and 8 times, respectively. In addition, site 35, identified by TS analysis, was also selected by the K-means methods based on the temporal soil moisture data.

The performances of the K-means clustering on the original variables and the corresponding PCs were evaluated in terms of the r, ME and RMSE values (Fig. 6). The r values are larger than 0.80 (p<0.001, n=18) for all cases. These results indicate a good agreement between the observed mean soil moisture contents and the values predicted by the K-means methods. The r values generally increase with increasing number of representative sites for all cases except EFsTheta. Compared with using the original variables, using PCs in the K-means algorithm yields higher correlations. The ME values ranged between -3% and 4%, showing no consistent trend from 2 to 8
clusters for all cases. However, the MEs of Theta, Theta-PCA and EFsTheta-PCA are closer to 0 than those of the others, indicating that these three kinds of inputs generate only a weak bias. In terms of the RMSE, the K-means methods on Theta, Theta-PCA and EFsTheta-PCA yielded smaller RMSE values (always <0.020 m³ m⁻³) and more stable predictive results than the others. This suggests that these three kinds of inputs perform better in predicting mean soil moisture content, possibly because the soil moisture data were directly used in the K-means algorithm. The poor performance of K-means on EFs suggests that soil and terrain properties alone are not enough to select optimal sites to predict mean soil moisture.

The effectiveness of clustering with the PCs instead of the original variables was investigated in detail. The PCs used in the K-means algorithm generally produced lower RMSE values and relatively higher accuracy than the original variables, especially in the case of EFsTheta-PCA. This is related to the fact that, if the original variables are used, clusters tend to be separated along variables with higher variance. PCA, however, reduces the dimensions of the dataset, which improves clustering efficiency. Since the Euclidean distance calculated with the first few PCs is just an approximation to the Euclidean distance calculated with all the datasets, the first few PCs may contain most of the cluster information while the last PCs are mostly noise (Yeung and Ruzzo, 2001). Previous studies have combined PCA and clustering to achieve more stable results. Yeung and Ruzzo (2001) found that K-means clustering using the first few PCs usually achieved higher or comparable adjusted Rand indices to those without PCA. In the study by Ben-Hur and
Guyon (2003), using PCA as a preprocessing step before clustering improved the quality of clustering of gene expression data. In our study, K-means clustering with PCA can be applied to identify the optimal sites for the prediction of the mean soil moisture content.

3.4. Random sampling strategy for predicting mean soil moisture content

The performances of the global random sampling strategy in predicting mean soil moisture content are shown in Fig. 7. The r values are larger than 0.95 (p<0.001, n=18) for all numbers of points used, indicating a good correlation between the observed mean soil moisture contents and the predicted values. The r values increase with increasing number of sites: the median values of r increase from 0.982 to 0.999 when the number of sites increases from 10 to 70, while the range of variation (maximum minus minimum) in r decreases from 0.040 to 0.001. This suggests that a larger number of sites produces more stable predictions of the mean soil moisture content. In terms of ME, the maximum (5.55%) and minimum (-6.58%) values are observed with 10 sites, indicating that a large bias can be generated by using 10 sites for predicting mean soil moisture content. With increasing number of points, the median values of ME become closer to 0 (from -1.63% to -0.19%), while the range of variation in ME decreases from 12.13% to 1.69%. The RMSE also varies substantially with the number of sites. The median values of RMSE decrease from 0.022 to 0.004 m³ m⁻³ when the number of points increases from 10 to 70, while the range of variation in RMSE decreases from 0.061 to 0.008 m³ m⁻³.

The performances of the stratified random sampling strategy are shown in Fig. 8.
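The two random sampling strategies of Section 2.4.3 and the criteria of Section 2.5 can be sketched as follows. This is a hedged illustration rather than the authors' code. Three points are assumptions: the ME is taken as observed minus predicted, which matches the way negative values are read as overestimation in Section 3.2; the stratified estimate weights each stratum mean by the stratum size, which the text does not specify; and the 40/37 split of the 77 sites between the two land uses, like the synthetic moisture values, is purely hypothetical.

```python
import math
import random
import statistics


def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mean_error(obs, pred):
    # Sign convention assumed here: observed minus predicted.
    return sum(o - p for o, p in zip(obs, pred)) / len(obs)


def pearson_r(obs, pred):
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)


def global_rmse(theta, n, rng):
    """One realization: n sites drawn without replacement from all sites."""
    sites = rng.sample(range(len(theta[0])), n)
    obs = [sum(day) / len(day) for day in theta]
    pred = [sum(day[i] for i in sites) / n for day in theta]
    return rmse(obs, pred)


def stratified_rmse(theta, strata, n_per_stratum, rng):
    """One realization: n sites per stratum, stratum means weighted by size."""
    total = len(theta[0])
    picks = [(len(m), rng.sample(m, n_per_stratum)) for m in strata]
    obs = [sum(day) / len(day) for day in theta]
    pred = [sum(size / total * sum(day[i] for i in idx) / n_per_stratum
                for size, idx in picks)
            for day in theta]
    return rmse(obs, pred)


rng = random.Random(1)
tea = list(range(0, 40))      # hypothetical split of the 77 sites
bamboo = list(range(40, 77))
# theta[j][i]: synthetic moisture on day j at site i (bamboo wetter than tea)
theta = [[rng.gauss(0.12 if i < 40 else 0.24, 0.03) + 0.01 * j
          for i in range(77)]
         for j in range(18)]

# Median RMSE over 100 realizations with 10 sites in total for each strategy.
global_med = statistics.median(
    global_rmse(theta, 10, rng) for _ in range(100))
strat_med = statistics.median(
    stratified_rmse(theta, [tea, bamboo], 5, rng) for _ in range(100))
```

Each realization fixes one set of sites and applies it to every validation day, so the spread of the RMSE across realizations reflects the uncertainty due to site selection, as in Figs. 7 and 8.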

For the stratified sampling, all r values are larger than 0.960 (p=0.000, n=18), which is slightly higher than those of the global sampling. The median values of r increase from 0.986 to 0.999 when the number of sites increases from 10 to 70, while the range of variation in r decreases from 0.028 to 0.001. With increasing number of sites, the median values of ME become closer to 0 (from -1.42% to -0.02%), while the range of variation in ME decreases from 6.01% to 0.95%. This implies that the bias in predictions generated by stratified sampling is less than that produced by global sampling. The RMSE also varies substantially with the number of sites. The median values of RMSE decrease from 0.022 to 0.004 m3 m-3 when the number of sites increases from 10 to 70, while the range of variation in RMSE decreases from 0.061 to 0.008 m3 m-3. Because of the spatial heterogeneity of soil moisture, stratified sampling produced higher accuracy than global sampling when the same number of sampling sites was selected. Stratified sampling works well when a heterogeneous population is split into fairly homogeneous groups.

The optimal number of sampling sites required for each random sampling strategy was analyzed. Previous studies have applied the random sampling strategy for estimating mean soil moisture content. For example, Yang (2010) found that the performances of the random sampling strategy were comparable to those of the temporal stability method in the estimation of the mean soil moisture content. Van Arkel and Kaleita (2014) observed that random sampling can give good results, but can also perform worse than the TS and K-means methods. However, only a limited number of random points (four points) were selected in their studies. In our study, random sampling with small to large numbers of points was conducted, aiming at finding the optimal number of sites for predicting mean soil moisture. To make the performances of the random sampling comparable to those of the best K-means methods, the optimal numbers of sites were identified as those with RMSE values less than 0.020 m3 m-3. For global sampling, 60 sampling sites should be chosen for accurately predicting mean soil moisture content, while 40 sampling sites are needed for stratified sampling. The soil moisture content of the forest at lower elevation was two times larger than that of the tea garden at higher elevation (Liao et al., 2016). Therefore, the stratified sampling required fewer sampling points than the global sampling.

3.5. Comprehensive evaluation of different methods

In this section, the applicability of the TS analysis, K-means clustering and random sampling strategy for predicting mean soil moisture is evaluated. The TS analysis based on the smallest ITS is more suitable for predicting mean soil moisture than that based on the smallest SDRD. The RMSE (0.023 m3 m-3) for the ITS-based method suggests an acceptable performance. The advantage of this method is that it needs only one sampling site for the prediction of mean soil moisture. However, TS analysis requires good prior knowledge of soil moisture before it can be applied. Previous studies proposed that at least about one year of data is needed to identify the time-stable site (Martínez-Fernández and Ceballos, 2003; Hu et al., 2012). This makes TS analysis hard to implement in practice.

Among all K-means methods, EFs(-PCA) and EFsTheta are not recommended for determining the optimal sites in our study. In contrast, Theta(-PCA) and EFsTheta-PCA are viable K-means methods to predict mean soil moisture content because of their relatively low RMSE values. Like TS analysis, Theta(-PCA) also needs enough soil moisture data. However, EFsTheta-PCA requires only a “one-time” measurement of soil and terrain properties and soil moisture data. Therefore, EFsTheta-PCA is more practical for finding optimal sites than the other K-means methods. In addition, EFsTheta-PCA needs very few sampling sites (<8) for the prediction of mean soil moisture. A good match between the predictions by EFsTheta-PCA and the observations can be observed, regardless of the number of clusters (Fig. 9).

For the random sampling method, the stratified sampling was more efficient than the global sampling. The advantage of the stratified sampling is that it does not need any soil moisture data prior to use. However, to ensure the accuracy and reliability of the stratified sampling, 40 sampling sites are needed for our study hillslope. Measuring soil moisture at 40 sites is still costly, time-consuming and labor-intensive in practice, although the number of sites is reduced by half. If no site data are available, stratified sampling can be used only on the premise that enough sampling sites are selected; otherwise, it may produce a large bias in predictions.

As a result, comparing different methods is useful for finding optimal sites and improving the mean soil moisture estimation with limited resources. Based on the above analysis, selection strategies for all considered methods can be developed and are summarized in Table 3. This table helps us select the suitable methods according to the available soil and terrain properties, with consideration of the accuracy and resources required for the prediction of mean soil moisture on the study hillslope. In addition, it provides a reference for soil moisture estimation in other regions.

4. Conclusions

This study evaluated three methods (TS analysis, K-means clustering and random sampling strategy) for identifying representative locations to predict the mean soil moisture. The TS results show that sites 13 and 35 had the smallest SDRD and the smallest ITS, respectively. According to the RMSE, the TS analysis based on the smallest ITS performed better than that based on the smallest SDRD in estimating the mean soil moisture content. Among the K-means methods, Theta, Theta-PCA and EFsTheta-PCA produced smaller RMSE values and higher accuracy than EFs, EFs-PCA and EFsTheta, as well as the TS analysis. Unlike the Theta(-PCA) and TS methods, EFsTheta-PCA requires only a “one-time” measurement of soil and terrain properties and soil moisture data. The stratified sampling was more efficient than the global sampling. Nevertheless, 40 sampling sites should be selected with the stratified sampling in order to achieve a predictive accuracy comparable to that of the best K-means method. Overall, the selection of the optimal approach to predict mean soil moisture should be based on the data already available, meeting the requirements of both accuracy and cost saving of the study.

Acknowledgements

This study was financially supported by the National Natural Science Foundation of China (41622102 and 41571080) and the Natural Science Foundation of Jiangsu Province (BK20151613 and BK20151061).

References

Ågren, A.M., Lidberg, W., Strömgren, M., Ogilvie, J., Arp, P.A., 2014. Evaluating digital terrain indices for soil wetness mapping – a Swedish case study. Hydrol. Earth Syst. Sci. 18, 3623–3634.
Aune-Lundberg, L., Strand, G.H., 2014. Comparison of variance estimation methods for use with two-dimensional systematic sampling of land use/land cover data. Environ. Modell. Softw. 61, 87–97.
Ben-Hur, A., Guyon, I., 2003. Detecting stable clusters using principal component analysis. In: Brownstein, M.J., Khodursky, A. (Eds.), Functional Genomics: Methods and Protocols. Humana Press, pp. 159–182.
Brocca, L., Tullo, T., Melone, F., Moramarco, T., Morbidelli, R., 2012. Catchment scale soil moisture spatial–temporal variability. J. Hydrol. 422–423, 63–75.
Destouni, G., Verrot, L., 2014. Screening long-term variability and change of soil moisture in a changing climate. J. Hydrol. 516, 131–139.
Famiglietti, J.S., Rudnicki, J.W., Rodell, M., 1998. Variability in surface moisture content along a hillslope transect: Rattlesnake Hill, Texas. J. Hydrol. 210, 259–281.
Fu, B.J., Wang, J., Chen, L.D., Qiu, Y., 2003. The effects of land use on soil moisture variation in the Danangou catchment of the Loess Plateau, China. Catena 54, 197–213.
Grant, L., Seyfried, M., McNamara, J., 2004. Spatial variation and temporal stability of soil water in a snow-dominated, mountain catchment. Hydrol. Process. 18, 3493–3511.
Grayson, R.B., Western, A.W., 1998. Towards areal estimation of soil water content from point measurements: time and space stability of mean response. J. Hydrol. 207, 68–82.
Hu, W., Shao, M., Han, F., Reichardt, K., Tan, J., 2010. Watershed scale temporal stability of soil water content. Geoderma 158, 181–198.
Hu, W., Tallon, L.K., Si, B.C., 2012. Evaluation of time stability indices for soil water storage upscaling. J. Hydrol. 475, 229–241.
Hu, W., Si, B.C., 2014. Can soil water measurements at a certain depth be used to estimate mean soil water content of a soil profile at a point or at a hillslope scale? J. Hydrol. 516, 67–75.
Hupet, F., Vanclooster, M., 2002. Interseasonal dynamics of soil moisture variability within a small agricultural maize cropped field. J. Hydrol. 261, 86–101.
Jia, X., Shao, M., Wei, X., Wang, Y., 2013. Hillslope scale temporal stability of soil water storage in diverse soil layers. J. Hydrol. 498, 254–264.
Li, D., Shao, M., 2014. Temporal stability of soil water storage in three landscapes in the middle reaches of the Heihe River, northwestern China. Environ. Earth Sci. doi: 10.1007/s12665-014-3604-z.
Liao, K., Lai, X., Liu, Y., Zhu, Q., 2016. Uncertainty analysis in near-surface soil moisture estimation on two typical land-use hillslopes. J. Soil. Sediment. 16, 2059–2071, doi: 10.1007/s11368-016-1405-6.

Lin, H., 2006. Temporal stability of soil moisture patterns and subsurface preferential flow pathways in the Shale Hills catchment. Vadose Zone J. 5, 317–340.
Liu, Y.Y., Parinussa, R.M., Dorigo, W.A., De Jeu, R.A.M., Wagner, W., van Dijk, A.I.J.M., McCabe, M.F., Evans, J.P., 2011. Developing an improved soil moisture dataset by blending passive and active microwave satellite-based retrievals. Hydrol. Earth Syst. Sci. 15, 425–436.
Luai, A.S., Zyad, S., Basel, K., 2006. Data mining: A preprocessing engine. J. Comput. Sci. 2(9), 735–739.
Martínez-Fernández, J., Ceballos, A., 2003. Temporal stability of soil moisture in a large-field experiment in Spain. Soil Sci. Soc. Am. J. 67, 1647–1656.
McCabe, M.F., Gao, H., Wood, E.F., 2005. Evaluation of AMSRE-derived soil moisture retrievals using ground-based and PSR airborne data during SMEX02. J. Hydrometeorol. 6, 864–877.
Mohanty, B.P., Skaggs, T.H., 2001. Spatio-temporal evolution and time-stable characteristics of soil moisture within remote sensing footprints with varying soil, slope, and vegetation. Adv. Water Resour. 24, 1051–1067.
Owe, M., De Jeu, R., Holmes, T., 2008. Multisensor historical climatology of satellite-derived global land surface moisture. J. Geophys. Res. 113, F01002, doi:10.1029/2007JF000769.
Pachepsky, Y., Radcliffe, D.E., Selim, H.M., 2003. Scaling Methods in Soil Physics. CRC Press, Boca Raton, FL.
Penna, D., Borga, M., Norbiato, D., Dalla Fontana, G., 2009. Hillslope scale soil moisture variability in a steep alpine terrain. J. Hydrol. 364, 311–327.
Penna, D., Brocca, L., Borga, M., Dalla Fontana, G., 2013. Soil moisture temporal stability at different depths on two alpine hillslopes during wet and dry periods. J. Hydrol. 477, 55–71.
Romano, N., 2014. Soil moisture at local scale: Measurements and simulations. J. Hydrol. 516, 6–20.
Schneider, K., Huisman, J.A., Breuer, L., Zhao, Y., Frede, H.-G., 2008. Temporal stability of soil moisture in various semi-arid steppe ecosystems and its application in remote sensing. J. Hydrol. 359 (1-2), 16–29.
Schmidt, J.P., Beegle, D.B., Zhu, Q., Sripada, R., 2011. Improving in-season nitrogen recommendations for corn using an active sensor. Field Crop Res. 120, 94–101.
Stehman, S.V., Hansen, M.C., Broich, M., Potapov, P.V., 2011. Adapting a global stratified random sample for regional estimation of forest cover change derived from satellite imagery. Remote Sens. Environ. 115, 650–658.
Tarboton, D.G., 1997. A new method for the determination of flow directions and upslope areas in grid digital elevation models. Water Resour. Res. 33, 309–319.
Vachaud, G., Passerat de Silans, A., Balabanis, P., Vauclin, M., 1985. Temporal stability of spatially measured soil water probability density function. Soil Sci. Soc. Am. J. 49, 822–828.
Van Arkel, Z., Kaleita, A.L., 2014. Identifying sampling locations for field-scale soil moisture estimation using K-means clustering. Water Resour. Res. 50, 7050–7057.

Vereecken, H., Kamai, T., Harter, T., Kasteel, R., Hopmans, J., Vanderborght, J., 2007. Explaining soil moisture variability as a function of mean soil moisture: a stochastic unsaturated flow perspective. Geophys. Res. Lett. 34, L22402, doi:10.1029/2007GL031813.
Vereecken, H., Huisman, J.A., Pachepsky, Y., Montzka, C., van der Kruk, J., Bogena, H., Weihermüller, L., Herbst, M., Martinez, G., Vanderborght, J., 2014. On the spatio-temporal dynamics of soil moisture at the field scale. J. Hydrol. 516, 76–96, doi:10.1016/j.jhydrol.2013.11.061.
Wang, X., Pan, Y., Zhang, Y., Dou, D., Hu, R., Zhang, H., 2013. Temporal stability analysis of surface and subsurface soil moisture for a transect in artificial revegetation desert area, China. J. Hydrol. 507, 100–109.
Western, A.W., Grayson, R.B., Blöschl, G., Wilson, D.J., 2003. Spatial variability of soil moisture and its implications for scaling. In: Pachepsky, Y., Radcliffe, D.E., Magdi Selim, H. (Eds.), Scaling Methods in Soil Physics. CRC Press, Boca Raton, Fla, pp. 19–142.
Yang, L., 2010. Spatio-temporal patterns of field-scale soil moisture and their implications for in situ soil moisture network design. PhD thesis, Iowa State Univ., Ames.
Yao, Z., Ruzzo, W.L., 2006. A regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data. Bioinformatics 7 Suppl 1, 273–288.
Yeung, K., Ruzzo, W., 2001. An empirical study of principal component analysis for clustering gene expression data. Bioinformatics 17, 763–774.
Zhao, Y., Peth, S., Wang, X., Lin, H., Horn, R., 2010. Controls of surface soil moisture spatial patterns and their temporal stability in a semi-arid steppe. Hydrol. Process. 24, 2507–2519.
Zhu, Q., Schmidt, J.P., Lin, H.S., Sripada, R.P., 2009. Hydropedological processes and their implications for nitrogen availability to corn. Geoderma 154, 111–122.
Zhu, Q., Lin, H.S., 2011. Influences of soil, terrain, and crop growth on soil moisture variation from transect to farm scales. Geoderma 163, 45–54.

List of Tables:

Table 1. Statistical summaries of soil and terrain properties on the study hillslope.

Properties                     Min       Max       Mean      SD        CV
Terrain attributes
  Elevation (m)                77.50     87.17     81.85     2.69      0.03
  Plane curvature              -9.80     13.67     -0.15     3.50      -23.14
  Profile curvature            -20.70    12.63     0.04      4.94      116.74
  Slope percent (%)            0.02      19.49     10.04     3.83      0.38
  Topographic wetness index    -2.65     4.94      0.51      1.61      3.15
Soil properties
  Depth to bedrock (cm)        18.12     86.34     52.55     13.40     0.26
  Rock fragment content (%)    17.27     65.59     45.89     9.61      0.21
  Sand (%)                     4.60      34.76     12.63     5.18      0.41
  Silt (%)                     55.89     82.24     73.62     4.84      0.07
  Clay (%)                     9.35      19.90     13.75     2.04      0.15
  Organic matter (%)           1.31      2.95      2.12      0.38      0.18

SD: standard deviation; CV: coefficient of variation.

Table 2. Sites identified for sampling by the K-means clustering with the temporal soil moisture data during the calibration period (Theta), soil and terrain properties (EFs), both EFs and soil moisture data on Jan 9, 2013 (EFsTheta), and the principal components derived from Theta (Theta-PCA), EFs (EFs-PCA), and EFsTheta (EFsTheta-PCA).

Method          2 sites   4 sites        6 sites              8 sites
Theta           31,48     30,35,48,63    21,22,30,63,71,76    5,22,31,57,62,63,70,77
Theta-PCA       31,50     27,35,48,62    9,30,35,48,63,76     9,30,35,56,62,63,71,76
EFs             18,70     3,22,56,66     3,7,19,33,65,75      19,28,33,43,56,58,60,70
EFs-PCA         30,75     28,33,59,75    17,19,28,29,50,65    4,12,30,31,33,34,51,75
EFsTheta        18,56     22,24,28,56    19,28,32,33,42,63    19,20,28,33,60,62,70,77
EFsTheta-PCA    28,50     28,50,59,65    19,28,33,37,41,50    17,28,29,30,33,34,51,75
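A site selection of the kind listed in Table 2 can be sketched as follows for the EFsTheta-PCA case. This is a minimal illustration under stated assumptions, not the authors' code: the inputs are synthetic, variables are standardized before PCA, the PCs explaining 90% of the variance are retained, the site nearest to each cluster centroid is taken as representative, and the mean is predicted as a cluster-size-weighted average of the representative sites. All function names are hypothetical, and a plain NumPy implementation of Lloyd's algorithm stands in for whichever K-means routine was actually used.

```python
import numpy as np

def pca_scores(X, var_explained=0.90):
    """Standardize columns and return scores on the leading PCs that
    together explain at least `var_explained` of the total variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    n_pc = int(np.searchsorted(ratio, var_explained) + 1)
    return Z @ Vt[:n_pc].T

def assign(X, centroids):
    """Label each row of X with the index of its nearest centroid."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, rng, n_iter=100):
    """Plain Lloyd's algorithm with random initial centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centroids)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(X, centroids), centroids

def representative_sites(X, labels, centroids):
    """Pick, per non-empty cluster, the member site nearest the centroid,
    and weight it by the share of sites in that cluster."""
    sites, weights = [], []
    for j, c in enumerate(centroids):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        nearest = members[np.linalg.norm(X[members] - c, axis=1).argmin()]
        sites.append(int(nearest))
        weights.append(members.size / len(X))
    return np.array(sites), np.array(weights)

# Synthetic stand-ins: 77 sites, 11 soil/terrain properties, 18 dates.
rng = np.random.default_rng(1)
n_sites, n_dates = 77, 18
efs = rng.normal(size=(n_sites, 11))
wetness = 0.25 + 0.03 * efs[:, 0] + rng.normal(0, 0.01, n_sites)
theta = wetness[None, :] + rng.normal(0, 0.03, n_dates)[:, None]
efs_theta = np.column_stack([efs, theta[0]])  # EFs plus one survey date

scores = pca_scores(efs_theta)
labels, centroids = kmeans(scores, k=4, rng=rng)
sites, weights = representative_sites(scores, labels, centroids)
predicted = theta[:, sites] @ weights
observed = theta.mean(axis=1)
rmse = np.sqrt(np.mean((predicted - observed) ** 2))
print(sites, round(float(rmse), 4))
```

Because K-means starts from random centroids, the selected sites can change between runs; fixing the seed or repeating the clustering several times is one way to check how stable a selection is.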

Table 3. Selection of the methods for predicting mean soil moisture based on the root mean squared errors (RMSE).

                        RMSE (m3 m-3)
Method                  >0.03   0.02-0.03   <0.02   Remark
TS
  SDRD                  +       -           -       Require only 1 site, but large amounts of soil moisture data
  ITS                   -       +           -       Require only 1 site, but large amounts of soil moisture data
K-means
  Theta                 -       -           +       Large amounts of soil moisture data are needed
  Theta-PCA             -       -           +       Large amounts of soil moisture data are needed
  EFs                   -       -           +       Soil and terrain properties are needed
  EFs-PCA               -       +           -       Soil and terrain properties are needed
  EFsTheta              -       +           -       Soil and terrain properties and few soil moisture data are needed
  EFsTheta-PCA          -       -           +       Soil and terrain properties and few soil moisture data are needed
Random sampling
  Global                -       -           +       60 sites are needed, with no pre-analysis
  Stratified            -       -           +       40 sites are needed, with no pre-analysis

For K-means methods, 2 clusters (sites) are selected.
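The SDRD and ITS criteria that appear in Table 3 can be illustrated with a short sketch. It assumes the commonly used relative-difference definitions, which are not restated in this part of the manuscript: the relative difference of site i on date j is the deviation of its soil moisture from the spatial mean divided by that mean, MRD and SDRD are its temporal mean and standard deviation, and ITS is taken as sqrt(MRD^2 + SDRD^2). The constant-offset correction used for prediction and all function names are likewise assumptions for illustration, and the data are synthetic.

```python
import numpy as np

def ts_indices(theta):
    """theta has shape (n_dates, n_sites). Returns MRD, SDRD and ITS
    for each site, using the assumed definitions described above."""
    spatial_mean = theta.mean(axis=1, keepdims=True)
    rel_diff = (theta - spatial_mean) / spatial_mean
    mrd = rel_diff.mean(axis=0)
    sdrd = rel_diff.std(axis=0, ddof=1)
    its = np.sqrt(mrd ** 2 + sdrd ** 2)
    return mrd, sdrd, its

def predict_mean(theta_site, mrd_site):
    """Estimate the spatial mean from one site by removing its mean
    relative offset (an assumed correction, not taken from the text)."""
    return theta_site / (1.0 + mrd_site)

# Synthetic data: 36 dates x 77 sites, split into calibration/validation.
rng = np.random.default_rng(2)
site_offset = rng.normal(0.0, 0.15, 77)
daily_mean = rng.uniform(0.15, 0.35, 36)
theta = daily_mean[:, None] * (1 + site_offset[None, :])
theta = theta + rng.normal(0, 0.01, theta.shape)
cal, val = theta[:18], theta[18:]

mrd, sdrd, its = ts_indices(cal)
for name, index in (("SDRD", sdrd), ("ITS", its)):
    site = int(np.argmin(index))
    pred = predict_mean(val[:, site], mrd[site])
    rmse = np.sqrt(np.mean((pred - val.mean(axis=1)) ** 2))
    print(name, "site", site, "RMSE", round(float(rmse), 4))
```

The site with the smallest SDRD is the most stable over time but may be consistently wetter or drier than the mean, whereas the smallest ITS also penalizes that offset, which is why the two criteria can point to different sites.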

List of Figures:

Fig. 1. Location of the study area and sampling sites on the study hillslope.
Fig. 2. Time series of precipitation, mean soil moisture and corresponding standard deviations.
Fig. 3. Rank ordered mean relative differences (MRD) during the calibration period. Vertical bars correspond to ±standard deviation of the relative difference (SDRD) over time. The blue solid line denotes the index of TS (ITS). Sampling sites are ordered according to the MRD.
Fig. 4. Observed mean soil moisture content vs. predicted soil moisture content obtained based on the TS analysis with the smallest standard deviation of relative difference (SDRD) and the smallest index of TS (ITS).
Fig. 5. Percentage of the total variance explained by the principal components (PCs) derived from soil and terrain properties (EFs), temporal soil moisture data during the calibration period (Theta), and both EFs and soil moisture data on Jan 9, 2013 (EFsTheta). The horizontal dashed line denotes 90% of the total variance explained by the PCs.
Fig. 6. Performances of the K-means clustering approach on the soil and terrain properties (EFs), temporal soil moisture data during the calibration period (Theta), both EFs and soil moisture data on Jan 9, 2013 (EFsTheta), and the principal components derived from EFs (EFs-PCA), Theta (Theta-PCA), and EFsTheta (EFsTheta-PCA).
Fig. 7. The influences of sampling size on the accuracy of the random global sampling strategy. The Box-Whisker plots show the minimum, maximum, median, lower quartile and upper quartile values.
Fig. 8. The influences of sampling size on the accuracy of the random stratified sampling strategy. The Box-Whisker plots show the minimum, maximum, median, lower quartile and upper quartile values.
Fig. 9. Observed mean soil moisture content vs. predicted values by using the K-means clustering approach on the principal components derived from both soil and terrain properties and soil moisture data on Jan 9, 2013 (EFsTheta-PCA), with different numbers of clusters.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Highlights

Three approaches were compared in predicting mean hillslope soil moisture.

Temporal stability analysis had limited accuracy of soil moisture prediction.

K-means clustering had high accuracy of soil moisture prediction.

Random sampling required many (>40) sites for accurately predicting mean soil moisture.

Selection of the optimal approach should be based on different existing data.