Environmental Modelling & Software 83 (2016) 271–275
Software data news
A programming tool for nonparametric system prediction using Partial Informational Correlation and Partial Weights

Ashish Sharma*, Raj Mehrotra, Jingwan Li, Sanjeev Jha 1

School of Civil and Environmental Engineering, University of New South Wales, Sydney, Australia
Article history: Received 16 September 2015; Received in revised form 25 May 2016; Accepted 26 May 2016; Available online 20 June 2016

Abstract
Identification of system predictors forms the first step towards formulating a predictive model. Approaches for identifying such predictors are often limited by the need to assume a relationship between the predictor and response. To address this limitation, Sharma and Mehrotra (2014) presented a nonparametric predictive model using Partial Informational Correlation (PIC) and Partial Weights (PW). This study describes the open-source Nonparametric Prediction (NPRED) R package. NPRED identifies system predictors using the PIC logic, and predicts the response using a k-nearest-neighbor regression formulation based on a PW-based weighted Euclidean distance. The capabilities of the package are demonstrated using synthetic examples and a real application of predicting seasonal rainfall in the Warragamba dam catchment near Sydney, Australia. The results show clear improvements in predictability as compared to the use of linear predictive alternatives, as well as nonparametric alternatives that use an unweighted Euclidean distance. © 2016 Elsevier Ltd. All rights reserved.
Keywords: Nonparametric prediction; Partial Informational Correlation; Partial Weight; k-nearest neighbor (k-nn); Partial Mutual Information (PMI)
Software availability

The open-source R software NPRED is available for download from http://hydrology.unsw.edu.au/download/software/NPRED. Source code is available, along with help files and the example synthetic datasets used to generate the outcomes reported.

1. Introduction

We present here an empirical alternative for system specification that has its roots in information theory, and that is capable of approximating the underlying predictive relationship using observational information alone. The alternative is the NPRED R software package, which consists of three key tools. The first of these is the Partial Informational Correlation (PIC), an information-theoretic measure of partial dependence between the response and a predictor given a pre-existing vector of predictor variables (Sharma and Mehrotra, 2014). The second tool is the Partial Weight (PW), the information-theoretic weight that, when used in
* Corresponding author. E-mail addresses: [email protected] (A. Sharma), [email protected] (S. Jha). 1 Currently at University of Manitoba, Winnipeg, Canada.
estimating the weighted Euclidean distance between observations and the conditioning predictor vector, leads to a clearer predictive relationship between predictors and the response (Sharma and Mehrotra, 2014). The third tool is a modified k-nearest neighbor regression estimator, formulated after the k-nn conditional simulator of Lall and Sharma (1996), which is used as the predictive model to illustrate the benefits of the tools presented.

System specification has a long and rich history, with an excellent review of contributions to this area provided in Galelli et al. (2014). We focus here on the predictor selection methods that have their basis in information theory, that is, those derived from the Mutual Information (MI) criterion of Fraser and Swinney (1986). MI assesses dependence between a predictor and a response by quantifying the ratio between their joint distribution and the product of their marginal distributions (equations and details are presented in the next section). When the predictor and response are independent, their joint distribution collapses to the product of the two marginals, and the ratio equals one. As this metric is based on probabilities, as long as the estimation of the associated probability distributions is sensitive to the actual data samples representing the two variables, the resulting measure is generic and not tied to any assumed functional form. However, the MI can only quantify dependence between the response and a single predictor, and is of little use when multiple predictors need to be identified, as the added impact of a new predictor cannot directly be
quantified. To address this limitation, Sharma (2000a) presented the Partial Mutual Information (PMI) as an extension to the MI, along the same lines that the Partial Correlation is an extension of the linear correlation as a measure of dependence between two variables. The utility of the approach over alternatives in which the functional relationship needs to be assumed was demonstrated with applications to the prediction of seasonal rainfall in Sharma (2000b) and Sharma et al. (2000). This led to a number of applications in different fields, recent ones ranging from assessing and interpreting brain-electroencephalogram signals (Zhao et al., 2015), predicting drought (AghaKouchak, 2015), and forecasting agricultural commodity futures prices (Xiong et al., 2015), to predicting algal blooms (Coad et al., 2014). In addition, the method has been modified and expanded in scope by Fernando et al. (2009), Li et al. (2015), and Chao and Xuefeng (2015), including a major modification published recently by Sharma and Mehrotra (2014). This modification presented a new mathematical formulation of the PMI, transformed to allow expression as a correlation and termed the Partial Informational Correlation (PIC). In addition, the measure was used to ascertain the nonparametric equivalent of the regression coefficients in multiple linear regression, allowing these to be expressed as Partial Weights (PW) in the estimation of the weighted Euclidean distance between the response and the predictor variable set. The formulation of the PW has a significant impact in any nonparametric regression or conditional simulation estimator, as the Euclidean distances used to arrange "neighbors" from near to far now offer a better representation of predictability than was the case before. This paper encapsulates the PIC and PW metrics in an easy-to-use R package termed NPRED (a shortening of Nonparametric PREDiction).
Additionally, a version of the k-nearest neighbor bootstrap of Lall and Sharma (1996), modified to use the proposed weighted Euclidean distance metric and structured to output either the conditional expectation (regression) or a bootstrap resample (simulation), is included.
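As a primer on the information-theoretic machinery formalized in the next section, the sample MI estimate and its Linfoot (1957) conversion to a correlation scale can be sketched in a few lines. The sketch below is illustrative Python, not NPRED code (NPRED is written in R); the function names, the use of SciPy's Gaussian KDE in place of the bandwidth rules of Harrold et al. (2001), and the synthetic data are all assumptions of this sketch:

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_sample_estimate(x, p):
    """Sample MI estimate: mean log-ratio of the joint PDF to the
    product of the marginal PDFs, each estimated by Gaussian KDE."""
    joint = gaussian_kde(np.vstack([x, p]))
    fx, fp = gaussian_kde(x), gaussian_kde(p)
    return np.mean(np.log(joint(np.vstack([x, p])) / (fx(x) * fp(p))))

def linfoot(mi):
    """Linfoot (1957) transform of an information measure to a (0, 1) scale."""
    return np.sqrt(1.0 - np.exp(-2.0 * max(mi, 0.0)))

rng = np.random.default_rng(1)
x = rng.normal(size=500)
p_dep = 0.9 * x + np.sqrt(1.0 - 0.9 ** 2) * rng.normal(size=500)  # correlated with x
p_ind = rng.normal(size=500)                                      # independent of x

mi_dep = mi_sample_estimate(x, p_dep)   # large: strong dependence
mi_ind = mi_sample_estimate(x, p_ind)   # near zero: independence
```

For the correlated pair, `linfoot(mi_dep)` should approach the underlying linear correlation of 0.9, while `linfoot(mi_ind)` sits near zero; this is the property that lets the PIC be interpreted on the familiar correlation scale.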
2. Methodology

2.1. Partial Informational Correlation (PIC)

For a finite sample, the Mutual Information (MI) (Fraser and Swinney, 1986) between two variables X and P can be estimated as (Sharma, 2000a):

$$\widehat{MI}(X,P) = \frac{1}{n}\sum_{i=1}^{n} \log\left[\frac{f_{X,P}(x_i, p_i)}{f_X(x_i)\, f_P(p_i)}\right] \qquad (1)$$

where f_X(x) and f_P(p) are the marginal probability density function (PDF) estimates of X and P, f_{X,P}(x, p) is the joint PDF of X and P, and (x_i, p_i), i = 1 … n, are sample observations of (X, P). Throughout this paper, upper-case terms denote variables (such as X, P) and lower-case terms denote observations. Readers are referred to Sharma (2000a) and Sharma and Mehrotra (2014) for the kernel density estimation procedures used to estimate the marginal and joint probability densities involved, and to Harrold et al. (2001) for rules for optimal bandwidth estimation when the aim is to estimate the MI. The estimate in (1) stems from an expectation of the term inside the summation, which simplifies to a sample estimate instead of an integral over the probability density as in the original formulation of the Mutual Information by Fraser and Swinney (1986).

The specification of MI is easily extendable to more than two variables. Consider the vector (X, P, Z) where X is the response, P a potential predictor of the response, and Z an additional variable which we will later define as a pre-existing predictor. If multiple pre-existing predictors exist (in which case the pre-existing predictor vector is denoted Z), the denominator in (1) comprises the product of the marginal probability densities of the individual variables in Z (and not the joint probability of Z).

As an extension of the MI for predictor identification, Sharma (2000a) introduced the Partial Mutual Information (PMI), which estimates the MI between variables (X, P) given the pre-existing predictor set Z. This was accomplished by first forming new variables (X', P') which represent the original variables (X, P) after removing the information in each that could be predicted given the pre-existing predictor set Z. Alternately, (X', P') represent the residuals of (X, P) given Z, the residuals being formulated using a kernel regression approach in the original PMI formulation of Sharma (2000a). While this represented a unique way to quantify partial dependence, and hence to identify a predictor set that could support the best assumption-free predictive model, the need to estimate model residuals (and hence formulate a regression model) introduced complexity and made the procedure less stable with limited sample sizes and a potentially large predictor set. To address these limitations, Sharma and Mehrotra (2014) presented Partial Information (PI) as an extension to the MI and an improvement over the PMI, given that the need to estimate model residuals is eliminated. A sample estimate of PI(X, P | Z) is written as:

$$\widehat{PI}(X,P\,|\,\mathbf{Z}) = \frac{1}{n}\sum_{i=1}^{n} \log\left[\frac{f_{X|Z,\,P|Z}(x_i, p_i\,|\,\mathbf{z}_i)}{f_{X|Z}(x_i\,|\,\mathbf{z}_i)\, f_{P|Z}(p_i\,|\,\mathbf{z}_i)}\right] \qquad (2)$$

which can be rearranged to equal:

$$\widehat{PI}(X,P\,|\,\mathbf{Z}) = \widehat{MI}(X,P,\mathbf{Z}) - \widehat{MI}(X,\mathbf{Z}) - \widehat{MI}(P,\mathbf{Z}) + \widehat{MI}(\mathbf{Z}) \qquad (3)$$

where Equation (1) is used to calculate each sample estimate of MI. Using the transformation of Linfoot (1957), the PI can be converted to a (0, 1) scale, yielding the Partial Informational Correlation (PIC):

$$\widehat{PIC} = \sqrt{1 - \exp\left(-2\,\widehat{PI}\right)} \qquad (4)$$

thereby enabling the use of the same measures of confidence that are associated with the coefficient of correlation. Readers are referred to Sharma and Mehrotra (2014) for additional details on the tests of significance that apply.

2.2. Modified k-nearest neighbor conditional bootstrap or regression estimator

Given a conditioning vector (denoted Z as per the algorithm in the preceding section), a nonparametric approach for system prediction or classification involves identifying observations that are "local" to the conditioning plane. This definition of "local" involves the use of some measure of distance to identify neighbors to the predictor vector Z, the most common being the Euclidean distance, or, as one of its special cases, the weighted Euclidean distance in (5):

$$x_i^2 = \sum_{j=1}^{m} \left[\frac{\beta_j\,(Z_j - z_{i,j})}{s_j}\right]^2 \qquad (5)$$
Here, x_i is the distance of the m-dimensional conditioning vector Z to the i'th m-dimensional data point z_i, s_j represents a measure of spread for the j'th dimension, such as the sample standard deviation, and β_j a measure of importance for the j'th predictor Z_j. Most applications do not have a basis for specifying β_j, and the default for
this is assumed to equal 1.

The k-nearest neighbor conditional bootstrap (Lall and Sharma, 1996) is a nonparametric alternative for resampling a response (X) conditional to a given predictor set (Z). It operates by identifying the k nearest neighbors of Z as per (5), and then resampling the response based on the following cumulative conditional probability distribution:

$$P(X\,|\,\mathbf{Z}) = \sum_{i} \frac{1/k_i}{\sum_{j=1}^{K} 1/k_j} \qquad (6)$$

where k_i represents the number of observations whose distance from Z is less than or equal to the distance to z_i, with the subscript j ranging from 1 to K, where K is the maximum number of neighbors permissible. Lall and Sharma (1996) suggest choosing K equal to the square root of n, where n represents the number of observations in the sample. The conditional expectation derived from the probability distribution in (6) can be written as:

$$E(X\,|\,\mathbf{Z}) = \sum_{i} \frac{x_i/k_i}{\sum_{j=1}^{K} 1/k_j} \qquad (7)$$

This represents the k-nn regression estimator that will be used to ascertain the results reported later.

2.3. Partial Weight (PW)

Writing the covariance of [X, Z] as:

$$\mathrm{Cov}[X,\mathbf{Z}] = \begin{bmatrix} S_{XX} & S_{ZX}^{T} \\ S_{ZX} & S_{ZZ} \end{bmatrix} \qquad (8)$$

the conditional standard deviation, or the spread associated with the conditional probability density of (X | Z), can be written as:

$$S' = S_{XX} - S_{ZX}^{T}\, S_{ZZ}^{-1}\, S_{ZX} \qquad (9)$$

Sharma and Mehrotra (2014) presented an estimate of β_j in (5) as equivalent to the regression coefficient in a linear regression model, except that instead of the measures of partial correlation used in developing such estimates in linear regression, the PIC metric in (4) is adopted. This relationship can be written as:

$$\beta_j = PIC_{X,Z_j|\mathbf{Z}_{(j)}}\,\frac{S_{X|\mathbf{Z}_{(j)}}}{S_{Z_j|\mathbf{Z}_{(j)}}} \qquad (10)$$

where Z_(j) represents the predictor vector without the j'th predictor variable, and S_{X|Z(j)} and S_{Z_j|Z(j)} are the scaled conditional standard deviations representing the scaled spread of the residuals obtained using the unweighted version of the k-nn regression estimator in (7). Scaling is performed by dividing the conditional standard deviations by the standard deviations of X and Z_j, respectively. Readers are referred to Sharma and Mehrotra (2014) for additional details about these choices.

3. Software architecture

3.1. Details of the software
The PIC-PW algorithm described in the previous section is implemented on the R platform in the NPRED software tool. Information on the required inputs and their recommended formats is provided in the help file. It should be noted that if the aim is to ascertain autoregressive predictors, instead of identifying a subset from a set of exogenous predictors, the user needs to arrange columns 2 onwards at lags of 1 and more. The resulting predictor outcomes will define the structure of the time-series forecasting or simulation model that results. An example of such an application to synthetic data is provided in the Supplementary Material associated with this publication.

The R package has inbuilt codes for predictor identification ("calc.PIC"), with inbuilt metrics to define significance, estimation of Partial Weights ("calc.PW"), as well as the modified k-nearest neighbor bootstrap or regression function estimator ("knn"). Each of these codes comes with an associated help file that guides users on how it is to be used. Interested users are referred to the help manuals for additional details on the package. Fig. 1 illustrates the sequence of R commands that can be used to perform predictor identification and prediction using the NPRED package for a synthetic model (an autoregressive model of order 9; see the Supplementary Material for additional details and results). Once the predictors and their associated Partial Weights have been identified, the modified k-nn conditional bootstrap is invoked first to resample the response, and next to ascertain the conditional expectation. All codes in the package are open source.

3.2. Application to Warragamba quarterly rainfall

The NPRED package was applied to the seasonal prediction of June-July-August (JJA) rainfall averaged over the Warragamba dam catchment, 80 km west of Sydney.
The Warragamba dam supplies the majority of water to Sydney and nearby areas, and is hence of high importance from a water management point of view. The application presented here is tailored after a similar application reported in Sharma (2000b) and Sharma et al. (2000), but uses more recent data (1961–2010) that is more reliable than the above
Fig. 1. R-script illustrating a typical usage of the NPRED package. Here the task is to identify predictors of an Autoregressive order 9 model, where 15 plausible predictor variables have been considered.
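To make the mechanics behind the workflow in Fig. 1 concrete, the weighted k-nn conditional expectation of Section 2.2 can be re-implemented outside of R. The sketch below is illustrative Python under stated assumptions (ties in distance are ignored, so k_i in (6) reduces to the neighbour rank j, and the synthetic data and function names are invented for the sketch); it is not the NPRED code itself:

```python
import numpy as np

def knn_expectation(x_obs, z_obs, z_new, beta, k=None):
    """E(X | Z = z_new) via the k nearest neighbours under the
    beta-weighted Euclidean distance of Eq. (5), with the 1/rank
    resampling kernel of Lall and Sharma (1996); K defaults to sqrt(n)."""
    n = len(x_obs)
    k = k or int(np.sqrt(n))
    s = z_obs.std(axis=0, ddof=1)                       # spread of each predictor
    d = np.sqrt((((z_new - z_obs) * beta / s) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]                         # k nearest neighbours
    w = 1.0 / np.arange(1, k + 1)                       # kernel weight 1/j for rank j
    return np.sum((w / w.sum()) * x_obs[nearest])

rng = np.random.default_rng(42)
z_obs = rng.uniform(-1.0, 1.0, size=(400, 2))
x_obs = 2.0 * z_obs[:, 0] + 0.1 * rng.normal(size=400)  # only column 0 is informative
pred = knn_expectation(x_obs, z_obs, np.array([0.5, 0.0]),
                       beta=np.array([1.0, 0.0]))       # PW-style weights zero out column 1
```

With the uninformative second predictor weighted out, `pred` should fall close to the underlying value 2 × 0.5 = 1.0. Resampling neighbours with probabilities proportional to the same kernel weights, rather than averaging, yields the conditional bootstrap variant.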
Table 1. Concurrent prediction of JJA Warragamba rainfall (mm). As the linear regression coefficients reported here are derived using un-standardised data, they cannot be compared to the PW values presented.

Identified predictors: II (regression coefficient 14.1, PW 0.27); NINO3 (regression coefficient 0.11, PW 0.23).

MSE-L: 7780 | MSE-K: 7302 | MSE-KPW: 7222 | MSE-Kall: 9706
MAE-L: 72 | MAE-K: 67 | MAE-KPW: 66 | MAE-Kall: 76
referenced study. Furthermore, the relationship explored here is for a concurrent prediction of rainfall, using established indices derived from global sea surface temperature anomaly datasets as per Khan et al. (2015). Only winter season results are presented, as they conveyed significant skill using multiple predictor variables in the regression formulation used. The 9 predictor variables used for seasonal prediction here are: 1 - Indian Ocean west pole index, 2 - Indian Ocean east pole index (Robertson and Wang, 2012), 3 - Indonesian index (II) (Verdon and Franks, 2005), 4 - El Niño Modoki-W (Ashok et al., 2007), 5 - Niño-4 (Sharma and Chowdhury, 2011), 6 - El Niño Modoki-C (Ashok et al., 2007), 7 - Niño-3.4, 8 - Niño-3 (Sharma and Chowdhury, 2011), 9 - El Niño Modoki-E (Ashok et al., 2007).

Table 1 presents results from the use of NPRED. All Mean Square Error (MSE) and Mean Absolute Error (MAE) statistics are ascertained in a leave-one-out cross-validation mode. MSE-L and MAE-L represent the performance of a linear regression model fitted using the predictor variables identified based on PIC. MSE-K and MAE-K represent the performance of a regular k-nn regression model (equal weights). MSE-KPW and MAE-KPW represent the case of k-nn using Partial Weights. MSE-Kall and MAE-Kall represent the k-nn regression case using equal weights and all 9 indices as predictor variables. This last case most clearly illustrates the problem of using redundant predictor variables in a nonparametric predictive setting. The results are typical of the predictive accuracy for seasonal rainfall in the region, and also of the improvements that are likely when using the PW rationale when predictors are few and their weights not markedly different. Similar predictive performances are reported in Khan et al. (2015), with sound arguments presented in Westra and Sharma (2010) for an asymptotic maximum predictability of 14.7% of the variance of seasonal rainfall.
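The leave-one-out comparison underlying Table 1 can be emulated on synthetic data. The sketch below is illustrative Python only (the Warragamba data and NPRED's own cross-validation code are not reproduced here, and the data-generating model is an assumption of the sketch); it contrasts an equal-weight k-nn regression with one whose weights suppress a redundant predictor, mirroring the MSE-K versus MSE-KPW comparison:

```python
import numpy as np

def loo_knn_mse(x, z, beta):
    """Leave-one-out MSE of a k-nn regression using the beta-weighted
    Euclidean distance and a 1/rank neighbour kernel."""
    n = len(x)
    k = int(np.sqrt(n))
    s = z.std(axis=0, ddof=1)
    w = 1.0 / np.arange(1, k + 1)
    w /= w.sum()
    errs = []
    for i in range(n):
        mask = np.arange(n) != i                     # hold out observation i
        d = np.sqrt((((z[i] - z[mask]) * beta / s) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]
        pred = np.sum(w * x[mask][nearest])
        errs.append((x[i] - pred) ** 2)
    return np.mean(errs)

# synthetic check: one informative predictor, one pure noise predictor
rng = np.random.default_rng(7)
z = rng.normal(size=(300, 2))
x = z[:, 0] + 0.2 * rng.normal(size=300)
mse_weighted = loo_knn_mse(x, z, beta=np.array([1.0, 0.0]))  # noise column suppressed
mse_equal = loo_knn_mse(x, z, beta=np.array([1.0, 1.0]))     # equal weights
```

Under this setup `mse_weighted` should come out below `mse_equal`, because the equal-weight distance lets the redundant predictor corrupt the neighbour selection, which is the same effect that inflates MSE-Kall in Table 1.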
4. Discussion and conclusions

The previous sections outlined the open-source NPRED R package, which contains the codes and scripts for nonparametric system identification and prediction based on the metrics described in Sharma and Mehrotra (2014). We emphasize that the PIC-PW rationale is not specific to the k-nn bootstrap or regression model adopted here, but plays a significant role in any predictive algorithm that uses a Euclidean distance to ascertain neighbors. This has significant implications for many nonparametric predictive methods currently in use. We also emphasize that the approach presented here represents an empirical platform for specifying a system model, unlike physically based modelling approaches that presume physical laws and relationships characterize the system. However, the generic steps for characterizing system performance listed in the position paper by Bennett et al. (2013) are as valid here as they are for the specification of a physically based model. Model uncertainty can be considered by formulating multiple empirical models as per the seasonal forecasting approach described in Sharma and Chowdhury (2011), with the model choice refined as its performance in testing and validation in extreme settings is found to be wanting.
Acknowledgements

We acknowledge funding support from the Australian Research Council (FT110100576) and the Sydney Catchment Authority over the years, which has contributed significantly to the work reported in this paper. Assistance from Dipayan Chowdhury in preparing the data sets used for the Warragamba rainfall example is gratefully acknowledged.

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.envsoft.2016.05.021.

References

AghaKouchak, A., 2015. A multivariate approach for persistence-based drought prediction: application to the 2010–2011 East Africa drought. J. Hydrol. 526, 127–135.
Ashok, K., Behera, S.K., Rao, S.A., Weng, H.Y., Yamagata, T., 2007. El Niño Modoki and its possible teleconnection. J. Geophys. Res. Oceans 112 (C11).
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Chao, C., Xuefeng, Y., 2015. Optimization of a multilayer neural network by using minimal redundancy maximal relevance-partial mutual information clustering with least square regression. IEEE Trans. Neural Netw. Learn. Syst. 26 (6), 1177–1187.
Coad, P., Cathers, B., Ball, J.E., Kadluczka, R., 2014. Proactive management of estuarine algal blooms using an automated monitoring buoy coupled with an artificial neural network. Environ. Model. Softw. 61, 393–409.
Fernando, T.M.K.G., Maier, H.R., Dandy, G.C., 2009. Selection of input variables for data driven models: an average shifted histogram partial mutual information estimator approach. J. Hydrol. 367 (3–4), 165–176.
Fraser, A.M., Swinney, H.L., 1986. Independent coordinates for strange attractors from mutual information. Phys. Rev. A 33 (2), 1134–1140.
Galelli, S., Humphrey, G.B., Maier, H.R., Castelletti, A., Dandy, G.C., Gibbs, M.S., 2014. An evaluation framework for input variable selection algorithms for environmental data-driven models. Environ. Model. Softw. 62, 33–51.
Harrold, T.I., Sharma, A., Sheather, S., 2001. Selection of a kernel bandwidth for measuring dependence in hydrologic time series using the mutual information criterion. Stoch. Environ. Res. Risk Assess. 15 (4), 310–324.
Khan, M.Z.K., Sharma, A., Mehrotra, R., Schepen, A., Wang, Q.J., 2015. Does improved SSTA prediction ensure better seasonal rainfall forecasts? Water Resour. Res. 51 (5), 3370–3383.
Lall, U., Sharma, A., 1996. A nearest neighbor bootstrap for time series resampling. Water Resour. Res. 32 (3), 679–693.
Li, X., Zecchin, A.C., Maier, H.R., 2015. Improving partial mutual information-based input variable selection by consideration of boundary issues associated with bandwidth estimation. Environ. Model. Softw. 71, 78–96.
Linfoot, E.H., 1957. An informational measure of correlation. Inf. Control 1 (1), 85–89.
Robertson, D.E., Wang, Q.J., 2012. Bayesian approach to predictor selection for seasonal streamflow forecasting. J. Hydrometeorol. 13 (1), 155–171.
Sharma, A., 2000a. Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: part 1 - a strategy for system predictor identification. J. Hydrol. 239 (1–4), 232–239.
Sharma, A., 2000b. Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: part 3 - a nonparametric probabilistic forecast model. J. Hydrol. 239 (1–4), 249–258.
Sharma, A., Chowdhury, S., 2011. Coping with model structural uncertainty in medium-term hydro-climatic forecasting. Hydrol. Res. 42 (2–3), 113.
Sharma, A., Luk, K.C., Cordery, I., Lall, U., 2000. Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: part 2 - predictor identification of quarterly rainfall using ocean-atmosphere information. J. Hydrol. 239 (1–4), 240–248.
Sharma, A., Mehrotra, R., 2014. An information theoretic alternative to model a natural system using observational information alone. Water Resour. Res. 50, 650–660.
Verdon, D.C., Franks, S.W., 2005. Indian Ocean sea surface temperature variability and winter rainfall: Eastern Australia. Water Resour. Res. 41 (9).
Westra, S., Sharma, A., 2010. An upper limit to seasonal rainfall predictability? J. Clim. 23 (12), 3332–3351.
Xiong, T., Li, C., Bao, Y., Hu, Z., Zhang, L., 2015. A combination method for interval forecasting of agricultural commodity futures prices. Knowl. Based Syst. 77, 92–102.
Zhao, H., Guo, X., Wang, M., Li, T., Pang, C., Georgakopoulos, D., 2015. Analyze EEG signals with extreme learning machine based on PMIS feature selection. Int. J. Mach. Learn. Cybern. 1–7.