A probabilistic method for assisting knowledge extraction from artificial neural networks used for hydrological prediction

Greer B. Kingston, Holger R. Maier, Martin F. Lambert

Centre for Applied Modelling in Water Engineering, School of Civil and Environmental Engineering, The University of Adelaide, Adelaide, S.A., 5005, Australia

Received 6 December 2004; accepted 11 February 2005

Mathematical and Computer Modelling 44 (2006) 499–512

Abstract

Knowledge extraction from artificial neural network weights is a developing and increasingly active field. In the attempt to overcome the 'black-box' reputation, numerous methods have been applied to interpret information about the modelled input-to-output relationship that is embedded within the network weights. However, these methods generally do not take into account the uncertainty associated with finding an optimum weight vector, and thus do not consider the uncertainty in the modelled relationship. In order to take this into account, a generic framework for extracting probabilistic information from the weights of an ANN is presented in this paper together with the specific methods used to carry out each stage of the process. The framework is applied to two case studies where the results show that the consideration of uncertainty is extremely important if meaningful information is to be gained from the model, both in terms of an ANN's ability to capture physical input-to-output relations and improving the understanding of the underlying system.

Keywords: Artificial neural networks; Knowledge extraction; Bayesian methods; Hydrological modelling

1. Introduction

Artificial neural networks (ANNs) have been used increasingly in recent years for numerous hydrological applications including rainfall–runoff modelling, streamflow forecasting and prediction of water quality [1,2]. As opposed to conventional modelling approaches, ANNs do not require in-depth knowledge of the physics driving a process, nor do they require the functional form of the model to be specified a priori. Therefore, ANNs can be viewed as an appealing alternative when knowledge of the underlying hydrological relationship is lacking. While the predictive capability of ANNs in hydrology has been proven in numerous studies [3], the utility of ANNs as hydrological modelling tools has been limited due to their inability to explain, in a comprehensible way, the process by which the model outputs are generated [4]. Any explanation of the internal behaviour of an ANN is only revealed back to the user in the form of an "optimal" weight vector, which is difficult to interpret as a functional relationship. Therefore, ANNs are frequently criticized for operating as "black box" models [1].

In the endeavour to have ANNs become more widely accepted and reach their full potential as hydrological models, methods have been applied to extract the information embedded within a trained ANN [5–8]. If it can be demonstrated that an ANN has captured knowledge of the underlying system, the model can be applied with greater confidence. Moreover, the explanation capability achieved provides insight into the system under study, which helps towards further understanding of the underlying physical processes. In particular, if part of the modelled relationship (e.g. the contribution of one or more inputs to the generation of the output) can be validated by an expert, new information extracted from the ANN (e.g. the contribution of a poorly understood component of the system) is more likely to be considered plausible.

Typically, when knowledge extraction methods are applied, the aim is to interpret information contained in a single optimal weight vector [9–11]. However, due to the stochastic nature of hydrological processes, the limitations of a finite training data set and the generally high dimensionality (i.e. degrees of freedom) of ANN models, there may exist many different weight vectors that result in similar network performance; in other words, there may be a range of "optimal" weight vectors. As a consequence, there may be substantial uncertainty in the modelled relationship and therefore, when considering the abilities of a given ANN to capture knowledge of the system, it is more informative to examine the range of possible relationships that can be modelled.

A framework is presented in this paper for probabilistic knowledge extraction from the weights of an ANN in order to reveal the range of relationships between the model inputs and outputs that can be modelled. This framework involves several general steps and recommended methods for carrying out each step are presented in this paper, including a Bayesian inference approach for estimating weight distributions and a 'connection weight' method for determining the explanatory contributions of the causal inputs. However, the framework is generic, which enables users to apply alternative methods at each step if desired. Two case studies are presented, in which the recommended methods are applied and the importance of considering knowledge extracted from a trained ANN probabilistically is demonstrated. The first case study involves the use of a synthetically generated data set, which allows the probabilistic input contributions obtained by the method to be compared to the actual input contributions of the relationship used to generate the data, such that the methods used can be verified. In the second case study, the proposed framework is applied to examine the relationship modelled when an ANN is used to simulate streamflow from observed data. In this case study, uncertainties associated with fluxes into and out of the system, and the aptness of a single weight vector in representing the range of possible relationships that can be modelled, are investigated.

The ANNs used in this paper are limited to multilayer perceptrons (MLPs) with one hidden layer, as MLPs are the most popular ANN architecture [12] and networks with one hidden layer are able to model any continuous function given sufficient degrees of freedom [13]. However, the proposed framework is applicable to other types of ANNs provided that suitable methods for implementing each step of the framework are available.
2. Methods

The proposed framework for extracting probabilistic knowledge from trained ANNs involves the following steps:

Step 1: Train the ANN and quantify the uncertainty associated with the optimized weight vector.
Step 2: Select an appropriate method for extracting knowledge from ANN weights, such as a measure of the causal importance of an input.
Step 3: Use a Monte Carlo approach to randomly sample weight vectors from the distribution obtained in Step 1 and calculate the corresponding input importance measures, forming distributions that encapsulate the uncertainty in the modelled relationship.

There is some controversy as to which methods are best suited to extracting information about input-to-output relationships from ANNs. Therefore, as the emphasis of this paper is on the importance of considering such information probabilistically, rather than the methods used to achieve this, the framework is deliberately generic, allowing the user to select the specific methods necessary to carry out each step according to preferences or new findings. However, the methods applied in this study are considered to be the most appropriate given the current state of knowledge and are discussed in the following subsections.

2.1. Quantification of weight uncertainty

ANNs, like all models, make the assumption that there exists a function f(·) relating a set of causal input variables, x, to the dependent variable of interest, y, as follows:

y = f(x; w) + ε    (1)

where w is a vector of connection weights and network biases and ε is a random noise component with zero mean and constant variance. The objective of ANN calibration, or "training", is to find an optimal estimate of w such that the best fit between the observed data and the modelled responses is obtained. Thus, for a given network structure, w contains the information used by the network to model the desired relationship. However, the weights of an ANN are highly dependent on the data used for training and, due to the stochastic nature of these data, the optimized weight vector is partly the result of chance; therefore, so is the modelled relationship. By quantifying the uncertainty associated with w, the full set of plausible weight vectors (those that give an acceptable fit to the data) is identified and can be used to assess the range of relationships that can be modelled by a given ANN. In this paper, the uncertainty associated with w is quantified using Bayesian methods.

2.1.1. Bayesian weight estimation

Bayesian methodology was first applied to ANN training by Mackay [14] and Neal [15]. Under this paradigm, Bayes' Theorem is used to describe the posterior of the weight vector (the distribution of the weight vector given the training data) as follows:

P(w|y) = P(y|w)P(w) / P(y)    (2)

where y is a vector of N observations. In (2), P(w) is the prior weight distribution, which describes any knowledge of the weights before the data were obtained, P(y|w) is the likelihood function, which is the function through which the prior knowledge of w is updated by the data, and P(y) is a normalization factor. The likelihood expresses all of the information about w contained in the training data. Therefore, if there is only vague prior knowledge of w, weight vectors producing the highest likelihood values will dominate the posterior. Assuming that the model residuals (ε in (1)) are independently and normally distributed with zero mean and constant variance σ², the likelihood function can be described by

P(y|w) = ∏_{i=1}^{N} (1/(√(2π)σ)) exp[−(1/2)((y_i − f(x_i; w))/σ)²]    (3)

where f(x_i; w) is the ANN output for the ith input vector.

For complex hydrological problems, the high dimensionality of the conditional probabilities in (2) makes it virtually impossible to directly calculate the posterior weight distribution. Consequently methods have been introduced to approximate the posterior. Mackay [14] used the simplifying assumption that the posterior weight distribution could be approximated by a Gaussian distribution. However, the assumption of a Gaussian weight distribution is generally not a good one, particularly in the case of multilayered ANNs, which often have multimodal weights. To avoid the need to make such an approximation, Neal [15] used a Markov chain Monte Carlo (MCMC) implementation to sample from the posterior weight distribution.

2.1.2. The Metropolis algorithm

The objective of MCMC methods is to generate samples from a continuous target density, which in this case is the posterior weight distribution. The Metropolis algorithm is a commonly used MCMC approach, which has had great success in a number of applications [13]. As it is difficult to sample from the complex target distribution directly, this method makes use of a simpler, symmetrical distribution, known as the proposal distribution, to generate candidate weight vectors. Simply, the proposal distribution can be a multinormal distribution centred on the current weight state, w^i. Thus, a Markov chain is generated which forms a random walk in weight space.

The Metropolis algorithm makes use of an adaptive acceptance–rejection criterion such that the random walk sequence continually adapts to the posterior distribution of the weights.
At each iteration of the algorithm, the probability of accepting the candidate weight vector, w*, generated from the proposal distribution is calculated from

α(w*|w^i) = min{ [P(y|w*)P(w*)] / [P(y|w^i)P(w^i)], 1 }.    (4)

However, there is generally little prior knowledge of the weights of an ANN and therefore it is reasonable to assume a uniform prior distribution over a specified range. In this study, the range [−3, 3] was used for the input to hidden layer weights, while the range [−6, 6] was used for the hidden to output layer weights, for reasons discussed in Section 2.2. The assumption of uniform prior distributions simplifies (4) to

α(w*|w^i) = min{ P(y|w*) / P(y|w^i), 1 }    (5)

which only requires evaluation of the likelihood for each new candidate weight state. However, in this study, the effect of the prior distribution was included by ensuring that each candidate weight vector conformed to the weight ranges specified. If α(w*|w^i) is greater than a random number u ∼ U(0, 1), the candidate weight vector is accepted as the new weight state, w^{i+1}. Otherwise, the candidate is rejected, the new weight state w^{i+1} remains the same as the previous weight state w^i, and the process is repeated.

Given sufficient iterations, the Markov chain produced by the Metropolis algorithm should converge to a stationary distribution. From this point onwards it can be considered that the sampled weight vectors are generated from the posterior distribution. However, the covariance of the proposal distribution has important implications for the convergence properties and efficiency of the algorithm. If the spread of the proposal distribution is large relative to the posterior weight distribution, then acceptance of the candidate weight vectors will be low and convergence will be slow. On the other hand, if the spread is small, the algorithm will take longer to sample from the entire region of the posterior and therefore the distribution may not be adequately represented by the samples generated within a specified number of iterations [16]. This makes it difficult to determine whether convergence has been achieved or how many iterations are required for convergence.

Haario et al. [17] introduced a variation of the Metropolis algorithm that was developed to increase the convergence rate. In this algorithm, the proposal distribution continually adapts to the posterior distribution by updating the covariance at each iteration based on all previous states of the weight vector. The adaptation strategy ensures that information about the target distribution, accumulated from the beginning of the simulation, is used to increase the efficiency of the algorithm. In a comparison of four MCMC approaches, Marshall et al. [18] found the adaptive Metropolis (AM) algorithm to be superior in terms of both statistical and computational efficiency. Therefore, in this study, the AM algorithm was used to estimate the posterior distribution of the weight vector.
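For illustration, the following sketch implements a basic random-walk Metropolis sampler of the kind described above for a one-hidden-layer MLP with a tanh hidden layer and linear output, using the Gaussian likelihood of Eq. (3), with the uniform prior enforced by rejecting candidates outside the specified ranges so that the acceptance probability reduces to Eq. (5). It is a minimal sketch only: the residual standard deviation σ is assumed known, the proposal covariance is fixed rather than adapted as in the AM algorithm [17], and all function and variable names are illustrative rather than those used in the study.

```python
import numpy as np

def mlp_output(w, X, n_hidden):
    """One-hidden-layer MLP with tanh hidden nodes and a linear output, f(x; w)."""
    n_in = X.shape[1]
    k = n_in * n_hidden
    W1 = w[:k].reshape(n_in, n_hidden)        # input-to-hidden weights
    b1 = w[k:k + n_hidden]                    # hidden biases
    W2 = w[k + n_hidden:k + 2 * n_hidden]     # hidden-to-output weights
    b2 = w[k + 2 * n_hidden]                  # output bias
    return np.tanh(X @ W1 + b1) @ W2 + b2

def log_likelihood(w, X, y, sigma, n_hidden):
    """Log of the Gaussian likelihood of Eq. (3), up to an additive constant."""
    resid = y - mlp_output(w, X, n_hidden)
    return -0.5 * np.sum((resid / sigma) ** 2)

def metropolis(X, y, w0, lo, hi, sigma=1.0, n_hidden=2, n_iter=100_000,
               step=0.05, seed=0):
    """Random-walk Metropolis with a uniform prior enforced as bounds (Eq. (5))."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    ll = log_likelihood(w, X, y, sigma, n_hidden)
    samples = np.empty((n_iter, w.size))
    for i in range(n_iter):
        w_cand = w + step * rng.standard_normal(w.size)    # symmetric proposal
        if np.all(w_cand >= lo) and np.all(w_cand <= hi):  # zero prior outside the ranges
            ll_cand = log_likelihood(w_cand, X, y, sigma, n_hidden)
            # Accept with probability min(1, P(y|w*)/P(y|w_i)), evaluated in log space.
            if np.log(rng.uniform()) < ll_cand - ll:
                w, ll = w_cand, ll_cand
        samples[i] = w      # a rejected candidate repeats the current state
    return samples
```

In practice the proposal covariance would be adapted from the accumulated chain history, as in the AM algorithm used in this study, and the chain would be checked for convergence before the retained samples are treated as draws from the posterior.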
2.2. Knowledge extraction

There is currently no widely accepted method for extracting knowledge from the weights of a trained ANN. However, a number of methods have been proposed for interpreting the weights in order to assess the importance of the model inputs (see [9,11] for a discussion of these). The relative contributions of ANN inputs in calculating the output are dependent on the magnitude and direction of the connection weights. An input will have a positive impact on the output if the input-hidden and hidden-output weights are of the same sign (i.e. both positive or both negative), whereas it will have an inhibitory effect on the output if the signs of the input-hidden and hidden-output weights are opposing. Additionally, if an input has a positive effect on the output through one hidden node, but an inhibitory effect on the output through another hidden node, the counteracting signs act to diminish the overall contribution of the input.

In this study, the Connection Weight Approach [11], which is based on the sum of the products of input-hidden and hidden-output connection weights, or 'overall connection weight' [6], was selected to determine the causal importance of each input. With reference to Fig. 1, the overall connection weight (OCW) of input 1 can be calculated by determining c_{A,1} and c_{B,1}, which are the contributions of input 1 via hidden nodes A and B, respectively, and summing them to obtain OCW_1 as follows:

c_{A,1} = w_{A,1} × w_{O,A}
c_{B,1} = w_{B,1} × w_{O,B}
OCW_1 = c_{A,1} + c_{B,1}.    (6)

Fig. 1. Example ANN structure.

The OCWs of inputs 2 and 3 can then be calculated in a similar fashion. It has been noted that the OCW measure ignores the "squashing" effect nonlinear activation functions have on the weights [9], and therefore the magnitude of the OCWs may be misleading. Nevertheless, Olden et al. [11] found the Connection Weight Approach to provide the best overall methodology for quantifying ANN input importance in comparison to other commonly used methods. In this study, the hyperbolic tangent function was used on the hidden layer, while a linear function was used at the output. Therefore, only squashing of the input to hidden layer weights was a concern, and, as the effect of squashing becomes more pronounced as the magnitude of the weights increases, the values of these weights were restricted to the range [−3, 3] to reduce the impact of squashing. This range was selected as the hyperbolic tangent function is insensitive to values outside of this range. Restricting the hidden to output layer weights was not necessary and therefore the wider range of [−6, 6] was prescribed for these weights.

In this study, the OCWs were normalized to determine the relative contribution (RC) of each input in predicting the output as follows:

RC_i = OCW_i / (∑_{j=1}^{n} |OCW_j|) × 100%    (7)

where n is the total number of inputs. A positive RC value indicates that the input positively contributes towards the generation of the output, whereas a negative value indicates that the input has an inhibitory effect on the model output.
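As a concrete illustration of Eqs. (6) and (7), the short function below computes OCW and RC values directly from the input-to-hidden and hidden-to-output weights of a one-hidden-layer MLP. The weight values in the example are arbitrary and are intended only to mirror the three-input, two-hidden-node structure of Fig. 1.

```python
import numpy as np

def relative_contributions(W_ih, w_ho):
    """Connection Weight Approach: overall connection weights (Eq. (6))
    normalized to relative contributions in per cent (Eq. (7)).

    W_ih : array of shape (n_inputs, n_hidden), input-to-hidden weights
    w_ho : array of shape (n_hidden,), hidden-to-output weights
    """
    ocw = W_ih @ w_ho                        # OCW_i = sum over hidden nodes of w_{h,i} * w_{O,h}
    return 100.0 * ocw / np.sum(np.abs(ocw))

# Arbitrary weights for a network like that of Fig. 1 (three inputs, hidden nodes A and B).
W_ih = np.array([[0.8, -0.4],    # input 1 -> A, B
                 [1.2,  0.9],    # input 2 -> A, B
                 [-0.5, 0.3]])   # input 3 -> A, B
w_ho = np.array([0.6, -1.1])     # A, B -> output
print(relative_contributions(W_ih, w_ho))
```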

2.3. Quantifying uncertainty in input contributions

Once a distribution of the weight vector has been obtained and a method has been selected to estimate input importance, a Monte Carlo approach can be used to randomly sample a specified number of the plausible weight vectors such that when the corresponding input importance measures are evaluated, distributions of the importance measure for each input are produced. These distributions should encapsulate the uncertainty in the relationship modelled by a given ANN. This step is straightforward when a Monte Carlo method, such as the AM algorithm, is used to evaluate the distribution of plausible weight vectors in Step 1 of the proposed framework, as the output of such methods is a string of random samples from this distribution. In this study, the AM algorithm was used to randomly sample 100,000 weight vectors from the posterior distribution and the corresponding input RC values were calculated, in accordance with (7), producing RC distributions for each input.

While the RC distributions should contain all of the information about the relationship modelled by a given ANN for a given set of data, it may be practical to summarize the distributions using a measure of location (e.g. mean, mode, median) and a measure of variation (e.g. standard deviation). The mean of the distribution is the expected value of the input RC, given the data, and the mode may be viewed as the 'most typical' RC value for the given data set. However, while these values may be useful, probably of greater importance is the variation measure, which indicates how well-posed the model is in relation to the data and shows how much confidence to place in any information gained about the underlying process.
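A minimal sketch of this Monte Carlo step is given below. It assumes that the posterior weight samples are stored row-wise in an array with the same weight-vector layout as in the sampler sketch of Section 2.1.2, and that the network has the single-hidden-layer structure used throughout this paper; the function and variable names are illustrative.

```python
import numpy as np

def rc_distributions(weight_samples, n_inputs, n_hidden):
    """Map each sampled weight vector to input RC values (Eq. (7)),
    producing one RC distribution (one column) per input."""
    k = n_inputs * n_hidden
    rcs = np.empty((len(weight_samples), n_inputs))
    for i, w in enumerate(weight_samples):
        W_ih = w[:k].reshape(n_inputs, n_hidden)     # input-to-hidden weights
        w_ho = w[k + n_hidden:k + 2 * n_hidden]      # hidden-to-output weights
        ocw = W_ih @ w_ho                            # overall connection weights
        rcs[i] = 100.0 * ocw / np.sum(np.abs(ocw))
    return rcs

# Example use (with 'samples' returned by the Metropolis sketch above):
# rcs = rc_distributions(samples, n_inputs=3, n_hidden=2)
# print(rcs.mean(axis=0), rcs.std(axis=0))   # location and variation summaries
```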
3. Case study 1 — synthetically generated data

In order to demonstrate the importance of considering the knowledge extracted from an ANN probabilistically and to verify the methodology used to carry out each step of the framework, the methods described in Section 2 were applied to an ANN developed to model synthetically generated data. The use of synthetic data meant that the actual input contributions were known, making it possible to compare the modelled input contributions to the actual input contributions.

As a basis for comparison of the proposed framework, a deterministic approach for developing an ANN and extracting knowledge from the weights was also applied, i.e. a model was trained to obtain a single "optimum" weight vector and the single input RC values were calculated for this set of weights in accordance with (7). This approach involved maximizing the likelihood function given by (3) using the SCE-UA algorithm [19], as this algorithm has been found to be effective for globally optimizing ANNs [20]. The optimized weights obtained with this algorithm were also used to initialize the AM algorithm.

The assumption is often made that if an ANN provides a good fit to observed data, a good approximation to the underlying relationship has been obtained. Therefore, the outputs of each model were also examined to assess the validity of this assumption. The posterior weight distribution was used to obtain a predictive distribution for each input pattern, which was then reduced to a set of 95% prediction limits. The variation in the RC values was compared to the variation in the model predictions (as indicated by the width of the prediction limits) to assess how the uncertainty in the modelled relationship was reflected in the predictions.

3.1. Model and data

Autoregressive (AR) models are commonly used to model hydrological time series data. An autoregressive model of order 9 (i.e. AR(9)) was used to generate a set of synthetic time series data according to

x_t = 0.3x_{t−1} − 0.6x_{t−4} − 0.5x_{t−9} + ε_t    (8)
where ε_t is a random noise component with distribution N(0, 1), making the assumption of independently, normally distributed noise made by the use of (3) valid. The actual input contributions were determined using (7), substituting the coefficients in (8) for the OCW values.

To help validate the methodology introduced in this paper and to assess the effects of model structure on the relationship modelled, two ANNs were considered under both the probabilistic and deterministic knowledge extraction approaches. Each ANN contained one hidden layer, with the first having two hidden layer nodes and the second having three. The two-hidden-node ANN was considered to be more than adequate to model the simple linear relationship, while the three-hidden-node ANN was used to assess the relationship modelled given further degrees of freedom. Theoretically, the RC distributions obtained using each of the models should be similar (e.g. shape and location) if the methodology applied is appropriate for probabilistic knowledge extraction from ANNs. However, it was expected that the distributions obtained using the three-hidden-node ANN would exhibit more uncertainty due to the greater number of parameters available for estimating the relationship. Hereafter, the two- and three-hidden-node probabilistic ANNs will be referred to as ANN 2P and ANN 3P, respectively, and the two- and three-hidden-node deterministic ANNs will respectively be referred to as ANN 2D and ANN 3D. For each of the models the hyperbolic tangent activation function was used in the hidden layer while a linear activation function was used at the output layer. It has been shown that a one-hidden-layer network with this configuration of activation functions is able to approximate any continuous function (linear or nonlinear) arbitrarily well, given that there are a sufficient number of hidden nodes [13].

Eq. (8) was used to generate a total of 1000 data points, of which the first 300 were discarded to cancel out any effects due to initial conditions, another 500 were used to train the model and the remaining 200 were used for validation. It was considered that only two data sets were necessary, as cross-validation with a test set is only required if overtraining is suspected. It has been demonstrated, both theoretically and empirically, that overtraining does not occur if the number of training samples is at least 30 times the number of weights in the network [21]. The two- and three-hidden-node networks used to model the data have 11 and 16 weights, respectively, and therefore cross-validation was considered unnecessary when a training set of 500 data points was used (i.e. 500/11 ≈ 45 > 30 and 500/16 ≈ 31 > 30).
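A script along the following lines reproduces this data-generation and splitting procedure; the random seed is arbitrary, so the series will not match the one used in the study, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed

# Generate 1000 points of the AR(9) series of Eq. (8) with N(0, 1) noise.
n_total = 1000
x = np.zeros(n_total)
for t in range(9, n_total):
    x[t] = 0.3 * x[t - 1] - 0.6 * x[t - 4] - 0.5 * x[t - 9] + rng.standard_normal()

# Discard the first 300 points, then form input patterns [x_{t-1}, x_{t-4}, x_{t-9}]
# and targets x_t for the remaining 700 time steps.
t_idx = np.arange(300, n_total)
X = np.column_stack([x[t_idx - lag] for lag in (1, 4, 9)])
y = x[t_idx]

# 500 training patterns and 200 validation patterns.
X_train, y_train = X[:500], y[:500]
X_valid, y_valid = X[500:], y[500:]
```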

Fig. 2. Validation outputs generated by ANN 2D and ANN 3D. E is the coefficient of efficiency.

Table 1
Relative contributions (RCs) of AR(9) inputs obtained using ANN 2D and ANN 3D in comparison to the actual RC values (%)

Input      Actual    ANN 2D    ANN 3D
x_{t−1}     21.43     20.30     34.15
x_{t−4}    −42.86    −46.53    −30.27
x_{t−9}    −35.71    −33.17    −35.57

The data generated by (8) were used to represent "measured" hydrological data, where the random noise component, ε, was included to characterize the random measurement errors that could be expected during data collection. In order to assess how well the ANN was able to estimate the underlying data-generating relationship, a second set of validation data was also generated without the addition of noise. This data set was considered to represent the "true" time series.

3.2. Results

Plots of the validation outputs generated by ANN 2D and ANN 3D against the "measured" and "true" validation data are shown in Fig. 2. According to the fit between the model outputs and the "true" time series, it appears that the underlying relationship has been estimated well by both models. The coefficient of efficiency, E, was used to quantitatively assess the fit between the model outputs and the data, and values of this measure are also displayed in Fig. 2. The range of E is −∞ to 1, with −∞ being the worst case and a value of 1 indicating perfect correlation between the model outputs and the data. It can be seen that, according to the E values, a slightly better fit to both the "measured" and "true" data was obtained using ANN 2D, rather than ANN 3D. This confirms that two hidden nodes were sufficient for modelling the AR(9) data and three hidden nodes provided unnecessary complexity. Nevertheless, it would generally be concluded that both ANN 2D and ANN 3D have generalized well to the underlying relationship.

A comparison of the actual input contributions of the AR(9) relationship with the RCs estimated by ANN 2D and ANN 3D (calculated according to (7)) is given in Table 1. It can be seen that, according to the RC values, ANN 2D has approximated the underlying relationship reasonably well, whereas ANN 3D has incorrectly modelled the order of input importance, as shown by the magnitudes of the RC values. This was neither evident from the E value obtained for this model, nor by visual inspection of the output plot. Furthermore, it is not apparent in Fig. 2 that ANN 2D and ANN 3D have estimated different relationships, as for the most part, the output plots overlie one another. These results demonstrate that a good fit to the data does not necessarily indicate that the underlying relationship has been modelled accurately and care needs to be taken when considering the knowledge extracted from the optimized weights.

Histograms of the input RCs obtained using ANN 2P and ANN 3P are displayed in Fig. 3(a) and (b), respectively, where the dotted lines mark the actual input contributions and the crosses mark the deterministic RC estimates obtained using ANN 2D (Fig. 3(a)) and ANN 3D (Fig. 3(b)).

Fig. 3. Relative contribution (RC) distributions of the AR(9) inputs as modelled by (a) ANN 2P and (b) ANN 3P. The dotted lines mark the actual input contributions and the X marks the deterministic input RC values.

Table 2
Summary measures of RC distributions for AR(9) inputs obtained using ANN 2P and ANN 3P in comparison to the actual RC values (%)

Input      Actual    Mean      Mode      Standard deviation
ANN 2P
x_{t−1}     21.43     20.06     23.46     4.08
x_{t−4}    −42.86    −47.23    −43.58     6.35
x_{t−9}    −35.71    −32.71    −34.71     3.41
ANN 3P
x_{t−1}     21.43     27.46     30.04     9.17
x_{t−4}    −42.86    −42.04    −34.89    15.75
x_{t−9}    −35.71    −30.45    −33.02     7.90

As expected, the distributions obtained by the two models are similar, being of approximately the same shape and centred roughly around the same values. It can also be seen that the distributions obtained using ANN 3P exhibit greater uncertainty, or variance, than those obtained using ANN 2P due to the greater complexity of this model.

Table 2 contains the mean, mode and standard deviation summary measures of the RC distributions obtained with both models. It can be seen in this table that the summary measures of location (i.e. mean and mode) are close to the actual RC values. This demonstrates the capacity of each ANN for extracting the correct input-to-output relationships from the data and supports the use of the Connection Weight Approach for estimating the input-to-output relations modelled by an ANN. Furthermore, although ANN 3D incorrectly estimated the order of input importance, both the mean and mode values of the distributions obtained using ANN 3P indicate the correct order of input importance, particularly the mean values which give a much better approximation of the data-generating relationship than the deterministic RC values. However, it can also be seen that there is a rather high degree of variance (indicated by the standard deviations) associated with each of the input contributions, in particular, those obtained using ANN 3P, which means that a reasonably wide range of estimated relationships result in similar network performance. This was confirmed by an inspection of the 95% prediction limits obtained using each of the models, as shown in Fig. 4. The tightness and similarity of the 95% prediction limits obtained using ANN 2P and ANN 3P show that the large variance in the modelled relationships is not reflected well in the predictions.

Fig. 4. 95% prediction limits generated by ANN 2P and ANN 3P for validation data.
The fact that each true data point has been accounted for within these limits implies that the "optimum relationship" is included within the estimated range of plausible relationships; however, although the RC distributions give a good indication of the order and signs of the input contributions, they are too wide, or uncertain, to pinpoint the true relationship. This suggests that caution should be taken in using precise values to gain information about the system under study, such as those obtained from a single weight vector or by summarizing the RC distributions.

4. Case study 2 — Boggy Creek streamflow

In the previous case study, the uncertainty associated with the relationship modelled by an ANN was demonstrated, highlighting the difficulty in selecting a single optimum weight vector to best characterize the input-to-output relations contained in the data, as many different relationships may achieve a similar fit to the data. In this case study, the proposed probabilistic knowledge extraction framework was applied to an ANN developed to model observed streamflow data, in order to assess the impact of the uncertainty in the modelled contributions of rainfall and evaporation on the generation of streamflow and compare the RC distributions obtained to prior knowledge of the system. The deterministic knowledge extraction approach used in the previous case study was again applied in order to assess how reasonable (or unreasonable) it is to only consider a single deterministic estimate of the modelled relationship in a real hydrological situation. Unlike in Case Study 1, it was not possible to determine the actual input contributions for the streamflow relationship; however, to satisfy a mass balance of the system, rainfall must have a positive influence on the generation of streamflow while the effect of evaporation must be negative. The outputs of the deterministic ANN and 95% prediction limits of the probabilistic ANN were again examined to assess the fit of the model to the data, given the estimated input RCs, and to compare the variation in the RC values to the variation in the simulated streamflow.

4.1. Study site and data

Boggy Creek is located within the Ovens River Basin in Victoria, Australia (Fig. 5), and has a catchment area of 108 km². The streamflow and rainfall data used in this case study were recorded at Angleside (−36.610S, 146.360E), site code 403226, while the evaporation data were recorded at station number 82039 (−36.105S, 146.509E). 17 years of daily rainfall, evaporation and streamflow data, from 1/1/76 to 31/12/92, were available for calibrating and validating the ANN. The data were divided such that the calibration set contained 3000 data points between 1/1/76 and 7/8/89 and the validation set contained 1242 data points between 8/8/89 and 31/12/92. Cross-validation was again considered unnecessary, as only ANNs with greater than 100 weights would require cross-validation when a training data set of 3000 samples is used (i.e. 3000/100 = 30). From preliminary investigations, it was seen that such complexity was not needed to model the streamflow data.

Fig. 5. Ovens River Basin, Victoria, Australia.

Table 3
Relative contributions (RCs) of streamflow generation inputs obtained with a single weight vector

Input               RC (%)
rainfall_t          −2.24
evaporation_t        0.38
streamflow_{t−1}    97.39

4.2. ANN model

An ANN was developed using current rainfall and evaporation data as inputs, together with streamflow data lagged by one time step (i.e. one day), which were included to incorporate memory into the model. These inputs were selected based on investigations carried out on the Boggy Creek streamflow data by Metcalf et al. [22], where it was found that a good fit to the observed data could be achieved with an AR(1) model, but the inclusion of current rainfall data was required to achieve synchronization of modelled streamflow with rainfall events. Evaporation was included as an input in the current study in an attempt to improve the physical plausibility of the modelled relationship. Preliminary investigations showed that an ANN with a single hidden layer and two hidden nodes was capable of accurately simulating the streamflow data given these inputs. The hyperbolic tangent activation function was again used in the hidden layer and a linear activation function was used at the output layer.

4.3. Results

A plot of the validation outputs generated by the deterministic ANN (single weight vector) is shown in Fig. 6, together with the observed validation streamflow data and the coefficient of efficiency, E. From the fit between the model outputs and the observed data, it is apparent that a good approximation of the underlying input-to-output relationships has been obtained. However, the corresponding RC values, given in Table 3, show that the input contributions estimated by the model do not satisfy the criteria that the contribution of rainfall_t should be positive and the contribution of evaporation_t should be negative. This indicates that the rainfall–runoff relationship estimated by the deterministically trained ANN was incorrect.

Histograms of the input RCs obtained using the proposed probabilistic approach are shown in Fig. 7(a). The deterministic RC values for each input are also shown in the figure (marked by crosses), demonstrating how poorly the deterministic estimates of the modelled input–output relationships summarize, or represent, the range of possible relationships that can be modelled by the ANN to achieve a good fit to the observed streamflow data. The rainfall_t and evaporation_t RC distributions span zero, with almost equal probability of obtaining a solution where rainfall_t and/or evaporation_t have physically implausible contributions. This is due to the fact that streamflow_{t−1} dominates the simulations with rainfall_t and evaporation_t having little effect on the overall fit of the model outputs to the data.

Fig. 6. Validation streamflow outputs generated by a single (deterministic) weight vector. E is the coefficient of efficiency.

Fig. 7. Relative contribution (RC) distributions of the streamflow generation inputs obtained using (a) the original set of plausible weight vectors and (b) the reduced set of weight vectors. In (a) the X marks the deterministic input RC values.

This is in agreement with the findings of Metcalf et al. [22] and further justifies the use of the RC measure to quantify modelled input contributions. However, by identifying the physically implausible solutions within the range of modelled relationships providing a good fit to the data, it was possible to filter out and remove weight vectors resulting in these erroneous solutions from the string of weight vectors sampled from the posterior distribution. This resulted in 41,144 physically plausible weight vectors and histograms of the modified input RCs are shown in Fig. 7(b).

The location and variation summary measures of both the original and modified RC distributions are presented in Table 4. Here, the large variation in the original rainfall_t and evaporation_t RC distributions (relative to their mean) can be seen, with the mean and mode values of these distributions having conflicting signs. Therefore, it is not possible to gain any definite information about these input contributions, other than that they are minor.
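A sketch of this filtering step is given below, assuming that the RC samples are stored with columns ordered (rainfall_t, evaporation_t, streamflow_{t−1}) and that the corresponding weight vectors are stored row-wise in the same order; the function and variable names are illustrative.

```python
import numpy as np

def filter_physically_plausible(weight_samples, rc_samples):
    """Retain only weight vectors whose RC values satisfy the mass-balance
    constraints: rainfall contributes positively and evaporation negatively.
    Columns of rc_samples are assumed ordered (rainfall, evaporation, streamflow_{t-1})."""
    plausible = (rc_samples[:, 0] > 0.0) & (rc_samples[:, 1] < 0.0)
    return weight_samples[plausible], rc_samples[plausible]

# Example use:
# w_ok, rc_ok = filter_physically_plausible(samples, rcs)
# print(len(w_ok), "physically plausible weight vectors retained")
```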

Table 4
Summary measures of RC distributions for streamflow generation inputs obtained with the posterior weight distribution (%)

Input                Mean     Mode     Standard deviation
Original RC distributions
rainfall_t          −0.42     2.01     4.91
evaporation_t        0.01    −0.06     0.62
streamflow_{t−1}    95.55    98.00     3.16
Modified RC distributions (a)
rainfall_t           3.65     3.01     2.08
evaporation_t       −0.53    −0.37     0.34
streamflow_{t−1}    95.82    96.67     2.22

(a) Weight vectors resulting in physically implausible RC values removed.

Fig. 8. 95% simulation limits generated by the posterior weight distribution (probabilistic) for the validation streamflow data.

Conversely, after removing the weight vectors that resulted in physically implausible solutions, the variation in these distributions is significantly reduced and the mean and mode values are approximately equal. In this case, greater confidence can be placed in the knowledge extracted about the modelled relationship.

A plot of the 95% prediction limits obtained with the unreduced set of weight vectors is shown in Fig. 8. Like for the probabilistic ANN models in Case Study 1, the variation in the modelled relationship is not reflected in the width of the prediction limits, and in this case, the tightness of these limits indicates that even a change in direction of the rainfall_t and evaporation_t contributions does not significantly affect the simulations. By removing the physically implausible weight vectors, the average width of the 95% simulation limits was reduced from 0.19 mm/day to 0.09 mm/day, thus increasing the confidence in the simulations; however, a plot of these limits is not shown as this result is difficult to see given the already narrow 95% simulation limits generated by the entire posterior weight distribution.

5. Summary and conclusions

In this paper, a general framework was proposed for extracting probabilistic information from the weights of an ANN in order to assess the modelled input-to-output relationships. The results of the case studies presented demonstrated the importance of considering such information probabilistically by highlighting the uncertainty associated with the relationships modelled by ANNs.

In Case Study 1 (synthetic AR(9) data), it was found that the three-hidden-node deterministic ANN (ANN 3D) incorrectly estimated the order of input importance in the relationship underlying the data. However, when the probabilistic knowledge extraction approach was applied (ANN 3P) the resulting RC distributions clearly indicated the signs and relative magnitudes of the input contributions. The RC distributions obtained using both the two- and three-hidden-node probabilistic ANNs (ANN 2P and ANN 3P, respectively) were similar in shape and approximately centred on the actual input RC values of the data-generating relationship, thus verifying the methodology used to obtain the distributions and the capacity of the ANNs for extracting information about the input-to-output relationships from the data.

However, the uncertainty in the relationship modelled by each ANN was demonstrated by the variation in the RC distributions, particularly in the case of the three-hidden-node ANN, which provided unnecessary complexity for modelling the relationship. This variation was not reflected in the model predictions, suggesting that many different relationships result in a similar fit to the data and, therefore, a good fit to the observed data is insufficient for assuming that the underlying input-to-output relationships have been adequately approximated.

In Case Study 2 (observed Boggy Creek streamflows), the deterministic RC values indicated that the rainfall–runoff relationship had been incorrectly estimated, with rainfall providing a negative contribution and evaporation making a positive contribution to the generation of streamflow. Inspection of the RC distributions obtained under the proposed framework showed that the relative contributions of rainfall and evaporation spanned zero, with almost equal likelihood that one or both of these contributions were physically implausible. This led to the identification of erroneous weight vectors within the set of weight vectors that provided a good fit to the data, allowing them to be removed from the posterior distribution, thus improving the confidence in the knowledge extracted from the weights and in the predictions, as only physically plausible solutions were considered. If the ANN was assessed based only on the deterministic RC values, it is unlikely that a great deal of confidence would be placed in the predictions made by the ANN, or that much trust would be placed in the RC measure to quantify the modelled input contributions.

Overall, the results of this study demonstrated the difficulties associated with extracting knowledge from deterministically trained ANNs, particularly as the complexity of the model increases and there are more degrees of freedom available for estimating input-to-output relationships. However, given the difficulty in selecting the appropriate ANN structure to model a given set of data, it is not uncommon for ANNs to be more complex than necessary. In such cases, a single measure of input importance obtained using deterministic knowledge extraction approaches is unlikely to be sufficient for gaining meaningful information about the relationship modelled by the ANN for a given set of data. The proposed probabilistic approach assists with knowledge extraction from ANNs by revealing all of the information about the modelled relationship given the data. The degree of variation in the RC distributions obtained may be used as a diagnostic to indicate how well-posed the model is with respect to the available data and whether the model needs to be made more parsimonious or more data need to be obtained to increase the confidence in the information gained from the weights and/or in the predictions. This information would not be available if the modelled relationship was assessed deterministically. Although alternative methods may be used to apply the proposed framework, it is considered that the methods presented in this paper are well suited to ANNs and their validity in this approach was verified in the case studies presented.
Furthermore, the methods presented are relatively simple to implement and therefore should not detract from an ANN's ease-of-use appeal.

Acknowledgements

The authors would like to thank Dr. Theresa Heneker for providing the data used in the Boggy Creek case study. This project is funded by an Australian Research Council Discovery grant.

References

[1] ASCE Task Committee on Application of Artificial Neural Networks in Hydrology, Artificial neural networks in hydrology. I: preliminary concepts, Journal of Hydrologic Engineering, ASCE 5 (2) (2000) 115–123.
[2] H.R. Maier, G.C. Dandy, Neural networks for the prediction and forecasting of water resources variables: a review of modelling issues and applications, Environmental Modelling and Software 15 (1) (2000) 101–124.
[3] ASCE Task Committee on Application of Artificial Neural Networks in Hydrology, Artificial neural networks in hydrology. II: hydrologic applications, Journal of Hydrologic Engineering, ASCE 5 (2) (2000) 124–137.
[4] R. Andrews, J. Diederich, A.B. Tickle, Survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge-Based Systems 8 (6) (1995) 373–389.
[5] T. Tchaban, M.J. Taylor, J.P. Griffin, Establishing impacts of the inputs in a feedforward neural network, Neural Computing and Applications 7 (1998) 309–317.
[6] J.D. Olden, D.A. Jackson, Illuminating the "black box": a randomization approach for understanding variable contributions in artificial neural networks, Ecological Modelling 154 (1–2) (2002) 135–150.
[7] A. Jain, K.P. Sudheer, S. Srinivasulu, Identification of physical processes inherent in artificial neural network rainfall–runoff models, Hydrological Processes 18 (3) (2004) 571–581.

[8] R.L. Wilby, R.J. Abrahart, C.W. Dawson, Detection of conceptual model rainfall–runoff processes inside an artificial neural network, Hydrological Sciences Journal 48 (2) (2003) 163–181.
[9] W.S. Sarle, How to measure the importance of inputs? SAS Institute Inc., Cary, NC, 2000. ftp://ftp.sas.com/pub/neural/importance.html.
[10] M. Gevrey, I. Dimopoulos, S. Lek, Review and comparison of methods to study the contribution of variables in artificial neural network models, Ecological Modelling 160 (3) (2003) 249–264.
[11] J.D. Olden, M.K. Joy, R.G. Death, An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data, Ecological Modelling 178 (3–4) (2004) 389–397.
[12] C.W. Dawson, R.L. Wilby, Hydrological modelling using artificial neural networks, Progress in Physical Geography 25 (1) (2001) 80–108.
[13] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[14] D.J.C. Mackay, A practical Bayesian framework for backpropagation networks, Neural Computation 4 (3) (1992) 448–472.
[15] R.M. Neal, Bayesian training of backpropagation networks by the hybrid Monte Carlo method, Technical Report CRG-TR-92-1, Department of Computer Science, University of Toronto, 1992.
[16] M. Thyer, G. Kuczera, Q.J. Wang, Quantifying parameter uncertainty in stochastic models using the Box–Cox transformation, Journal of Hydrology 265 (1–4) (2002) 246–257.
[17] H. Haario, E. Saksman, J. Tamminen, An adaptive Metropolis algorithm, Bernoulli 7 (2) (2001) 223–242.
[18] L. Marshall, D. Nott, A. Sharma, A comparative study of Markov chain Monte Carlo methods for conceptual rainfall–runoff modeling, Water Resources Research 40 (2) (2004) W02501.
[19] Q. Duan, S. Sorooshian, V. Gupta, Effective and efficient global optimization for conceptual rainfall–runoff models, Water Resources Research 28 (4) (1992) 1015–1031.
[20] G.B. Kingston, H.R. Maier, M.F. Lambert, Understanding the mechanisms modelled by artificial neural networks for hydrological prediction, in: MODSIM 2003, vol. 2, 2003, pp. 825–830.
[21] S.-I. Amari, N. Murata, K.-R. Müller, M. Finke, H.H. Yang, Asymptotic statistical theory of overtraining and cross-validation, IEEE Transactions on Neural Networks 8 (5) (1997) 985–996.
[22] A. Metcalf, T.M. Heneker, M.F. Lambert, G. Kuczera, I. Atan, A comparison of models for catchment runoff, in: 27th Hydrology and Water Resources Symposium, 2002 (on CD-ROM).