Complete ensemble empirical mode decomposition hybridized with random forest and kernel ridge regression model for monthly rainfall forecasts

Complete ensemble empirical mode decomposition hybridized with random forest and kernel ridge regression model for monthly rainfall forecasts

Journal Pre-proofs Research papers Complete ensemble empirical mode decomposition hybridized with random forest and Kernel ridge regression model for ...

3MB Sizes 0 Downloads 26 Views

Journal Pre-proofs Research papers Complete ensemble empirical mode decomposition hybridized with random forest and Kernel ridge regression model for monthly rainfall forecasts Mumtaz Ali, Ramendra Prasad, Yong Xiang, Zaher Mundher Yaseen PII: DOI: Reference:

S0022-1694(20)30107-4 https://doi.org/10.1016/j.jhydrol.2020.124647 HYDROL 124647

To appear in:

Journal of Hydrology

Received Date: Revised Date: Accepted Date:

23 November 2019 30 January 2020 31 January 2020

Please cite this article as: Ali, M., Prasad, R., Xiang, Y., Yaseen, Z.M., Complete ensemble empirical mode decomposition hybridized with random forest and Kernel ridge regression model for monthly rainfall forecasts, Journal of Hydrology (2020), doi: https://doi.org/10.1016/j.jhydrol.2020.124647

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Elsevier B.V. All rights reserved.

Complete ensemble empirical mode decomposition hybridized with random forest and Kernel ridge regression model for monthly rainfall forecasts Mumtaz Ali1, Ramendra Prasad2, Yong Xiang1, and Zaher Mundher Yaseen3 1School

of Information Technology, Deakin University, VIC 3125 Australia

2Department

of Science, School of Science and Technology, The University of Fiji, Saweni, Lautoka, Fiji.

3

Research and Development, Duy Tan University, Da Nang 550000, Vietnam

E-mail: [email protected] E-mail: [email protected] E-mail: [email protected] E-mail: [email protected]

Abstract Persistent risks of extreme weather events including droughts and floods due to climate change require precise and timely rainfall forecasting. Yet, the naturally occurring nonstationarity entrenched within the rainfall time series lowers the model performances and is an ongoing research endeavour for practicing hydrologists and drought-risk evaluators. In this paper, an attempt is made to resolve the non-stationarity challenges faced by rainfall forecasting models via a complete ensemble empirical mode decomposition (CEEMD) combined with Random Forest (RF) and Kernel Ridge Regression (KRR) algorithms in designing a hybrid CEEMD-RF-KRR model in forecasting rainfall at Gilgit, Muzaffarabad, and Parachinar in Pakistan at monthly time scale. The rainfall time-series data are simultaneously factorized into respective intrinsic mode functions (IMFs) and a residual

element using CEEMD. Once the significant lags of each IMF and the residual are identified, both are forecasted using the RF algorithm. Finally, the KRR model is adopted where the forecasted IMFs and the residual components are combined to generate the final forecasted rainfall. The CEEMD-RF-KRR model shows the best performances at all three sites, in comparison to the comparative models, with maximum values of correlation coefficient (0.97-0.99), Willmott’s index (0.94-0.97), Nash-Sutcliffe coefficient (0.94-0.97) and LegatesMcCabe’s index (0.74-0.81). Furthermore, the CEEMD-RF-KRR model generated the most accurate results for Gilgit station considering the Legate-McCabe’s index as base assessment criteria in addition to obtaining the lowest magnitudes of RMSE = 2.52 mm and MAE = 1.98 mm. The proposed hybrid CEEMD-RF-KRR model attained better rainfall forecasting accuracy which is imperative for agriculture, water resource management, and early drought/flooding warning systems. Keywords: 1.

rainfall; water resources; drought, CEEMD; random forest; KRR.

Introduction

The rainfall (P) patterns are significantly affected by climate change that reshuffles agricultural crops, water resources and natural extreme weather scenarios (Barredo, 2007) such as droughts (Ali et al., 2018a; Vörösmarty et al., 2010) and flooding (Bhalme and Mooley, 1980). and In Pakistan, the 2010 monsoon rainfall (News, 2010) caused severe damages costing around $43 billion (Tarakzai, 2010). Hence, the policymakers in conjunction with hydrologists must design new and reliable predictive tools to forecast P events for effective hydrological planning and to devise new strategies in managing rainfall risks. The accuracy of current hydrological decision support systems, which characterize different watersheds (Gebremariam et al., 2014; Romagnoli et al., 2017) is contingent upon the physical conditions and proper calibrations. As such a number of statistical modelling

techniques have been explored including an empirical method (Chiew et al., 1998), statistical non-parametric probabilistic model (Sharma, 2000), copula models (Nguyen-Huy et al., 2017) and Autoregressive Moving Averages (ARMA) model (Burlando et al., 1993). Gaussian copula methods showed huge improvements in the correlation coefficient and a decrease in root mean square error in estimating rainfall from satellite data in Iran (Moazami et al., 2014). For the case of Pakistan, only a handful of studies pertaining to rainfall forecasting have been performed. An autoregressive integrated moving averaging (ARIMA) model was developed to forecast precipitation trends (Salma et al., 2012) while multiple linear regression models were designed for seasonal runoff (Archer and Fowler, 2008). Additionally, extreme rainfall events in Indus River Valley, Pakistan were estimated using the global data assimilation model (Reale et al., 2012). Later, the Thiessen polygon method was developed for precipitation forecasts (Faisal and Gaffar, 2012). The other issue is that these tools are generally built on regional-scale sub-models and the spatiotemporal aspects of the variables and require enormous spatial data with apt calibrations (He et al., 2014; Prasad et al., 2017). The artificial intelligence algorithms, on the other hand, do not require detailed information about physical or hydrological behaviours of the watershed in providing immense insights into a rainfall forecasting and forewarning system. Artificial intelligence models have been successfully applied in forecasting future P trends (Luk et al., 2001) by dealing with non-linear input features (Nasseri et al., 2008). Artificial neural networks (ANN) were used to forecast rainfall in Thailand (Hung et al., 2009). ANN was also used for a localized P forecasting system (Kashiwao et al., 2017) using climatological data yet the forecasts were not significant in comparison to Japan Meteorological Agency. In Australia, monthly rainfall was forecasted using climate indices as inputs via the ANN model (Abbot and Marohasy, 2014). In another study for Spain, the occurrence and amount of rainfall were

forecasted with meteorological variables via hierarchical nominal–ordinal support vector classifier (Sánchez-Monedero et al., 2014), while support vector machines were utilized for P forecasts in Taiwan (Lin et al., 2009). Again, literature shows that for Pakistan only one study with machine learning models has been explored whereby (Ali et al., 2018b) designed a Markov Chain, Monte Carlo, using the copula-based machine learning model to forecast rainfall. As such, more studies in regard to rainfall forecasting need to be carried out for Pakistan either using statistical or machine learning models. In addition, the past studies reveal that precipitation forecasting trends have been mostly based on climate data using the regression model or basic machine learning models. It is also evident that the previous works have been done on large scales at either provincial or national levels, but not for small areas. Hence, more studies pertaining to designing datadriven algorithms for accurate precipitation forecasting at a smaller spatial scale are required that will be undertaken in this study. Furthermore, the above-mentioned studies were beneficial to various stakeholders to develop machine learning models for a reasonably accurate forecast of precipitation but all these studies were not able to handle the non-stationarity and non-linearity issues that arise in the rainfall data (Ali et al., 2018b; Sánchez-Monedero et al., 2014). The naturally abrupt P trends compounded by changing climate conditions that have been observed for the last few years in Pakistan and in other regions need to be aptly considered and captured by such models. Particularly for Pakistan, there is no study performed on the development of an advanced data intelligent model that tackles the issues of non-stationarity and non-linearity of rainfall data. To address these gaps, there is a dire need for developing data-driven models for accurate P forecasting at a much better scale than endeavoured beforehand that is able to ameliorate non-stationarity and non-linearity issues, which is the novelty of this current study.

To overcome these issues an apt multi-resolution analytical tool is used to extract the embedded features from the rainfall time series data that inherently has embedded climateinduced variability. The extracted features form a suite of non-static time-series signals that help to improve a standalone data intelligent model. To deal with this issue, the complete ensemble empirical mode decomposition (CEEMD) which is an advanced form of empirical mode decomposition (Huang et al., 1998), can act as a useful tool in improving the accuracy that is lacking in the existing models. The CEEMD is able to clearly resolve and isolate the large fluctuating signals into respective smaller frequency components. After the inception of CEEMD method, it gained attention because of its completely data-dependent and selfadaptability characteristics (Alvanitopoulos et al., 2014). Thus, CEEMD is able to extract pertinent features while preserving the physical structure of the temporal input (Wu et al., 2011). In this work, CEEMD is integrated with a robust bootstrap-aggregation approach, i.e., random forest (RF) model and kernel ridge regression (KRR) to develop the “CEEMD-RFKRR” model for rainfall forecasting. The key objectives of this study include; (1) the application of CEEMD to decompose the predictor variables into respective IMFs and a residual factor to resolve the non-stationarity and non-linearity problems; (2) to integrate the RF model to CEEMD in order to forecast the respective IMFs with residual component; (3) to aggregate the forecasted IMFs and residuals using the KRR to forecast P values; (4) to validate the forecasting capability of CEEMD-RF-KRR model in rainfall forecasting of diverse climatic regions in Pakistan. For comparison, the results are benchmarked with standalone KRR, RF and CEEMD-RF models. The developed CEEMD-RF-KRR approach is evaluated to forecast precise point-based rainfall in Gilgit, Muzaffarabad, and Kurrum Agency, Parachinar districts in Pakistan. 2.

Theoretical Background

2.1

Complete ensemble empirical mode decomposition (CEEMD) method The CEEMD data decomposition technique addresses the non-stationarity and non-

linearity of input data (Torres et al., 2011). In the decomposition process, Gaussian white noise is added to the original data to construct a uniform reference frame, which eventually is removed by averaging of the respective IMFs and residue. A short description of the CEEMD process for an undecomposed input x(t), is presented below (Torres et al., 2011): 1) Supply some white noise ω(t) to the time series (x(t)) such that x’(t) = x(t) + ω(t). 2) Decompose x’(t) via EMD (the original ensemble mode decomposition) technique to obtain the first decomposed component. 3) Follow steps 1 and 2 with distinct number of realizations. This process is usually repeated with‘𝜏’ numbers of times (ensemble number) each with different realizations. 4) Then compute the mean of all IMF1 as below: 1

𝜏

𝐼𝑀𝐹1(𝑡) = 𝜏 ∑𝑘 = 1𝜓1(𝑥(𝑡) + 𝜀.𝜔𝑘(𝑡))

(1)

Where 𝜓1 is the computation of kth IMF and ε specifies the amplitude regulation which is necessary to obtain an appropriate signal. 5) Hence, the first residual component or the remaining component results as follows: R1(t) = x(t) – IMF1(t)

(2)

6) Next, the extraction of IMF2 from the original signal, x(t), can be extracted as: 1

𝜏

𝐼𝑀𝐹2(𝑡) = 𝜏 ∑𝑘 = 1𝜓1[𝑥(𝑡) + 𝜀.𝜓1(𝜔𝑘(𝑡))] 7) The above steps are repeated to acquire the (β + 1)th IMF factor as:

(3)

1

𝜏

𝐼𝑀𝐹𝛽 + 1(𝑡) = 𝜏 ∑𝑘 = 1𝜓1[𝑥(𝑡) + 𝜀.𝜓𝛽(𝜔𝑘(𝑡))]

(4)

8) Finally, the residuals are averaged which generally is the trend, display a gradual variation around the long-term average. 2.2

Random Forest (RF) Model The building blocks of decision tree-based modelling approach, i.e., the RF model is

bootstrapping, and aggregation called bagging (Breiman, 1996; Schapire et al., 1998). The RF model randomly adopts a bagging approach in identifying features whereby each node is randomly separated by choosing the most dominant possible predictors that would improve the model accuracy without causing overfitting (Breiman, 2001). The following steps are adopted in designing RF models: 1: Using the input predictor variables, ntrees of bootstrapping ensembles are generated where n is the number of trees. 2: An unpruned regression tree is built by choosing maximum predictors splitting creating a randomized input predictors sample denoted mtry 3: The predictions from ntrees are aggregated to forecast future rainfall. The RF model has been applied in forecasting soil moisture (Prasad et al., 2018), shallow water table (Koch et al., 2019) as well as for hydrological (Moore et al., 1991) and environmental management (Ascough Ii et al., 2008) applications. Interested readers may refer to (Breiman, 2001; Liaw and Wiener, 2002) for a comprehensive background on RF modelling. 2.3

Kernel Ridge Regression (KRR) model The KRR is based on the kernel approach combined with ridge regressions (Zhang et

al., 2013). A key advantage is that the KRR uses a regularization and kernel technique in

capturing the non-linear relationship overcoming the over-fitting issues in the regression (You et al., 2018). The KRR can be expressed in mathematical form as; 2

1 m arg min  fi  yi   f m i 1

2 H

m

fi    j   x j , xi 

(5)

(6)

j 1

Where . H is the Hilbert normed space. The Eq. (5) can concisely be rewritten for an m×m kernel matrix K as follows:

 K  nI   y m



y    x , x  i i i 1

(7)



(8)

During the training stage, the KRR algorithm approximates  by finding solutions for Eq. (7). It then searches for an optimum  and  from the parameter set for optimized performances. The α is then used in the testing stage to calculate the regressand of the newer samples x as in Eq. (8). In this study, the linear, polynomial and Gaussian kernels were used to get the best accuracy (Alaoui and Mahoney, 2015; You et al., 2018). The ridge regression reduces the sum of squared errors (SSE) with constraint fulfillment providing an estimate to penalize it (You et al., 2018). The feature’s estimate does not perform arbitrarily. In case we have very large estimates, the SSE will reduce with the increment in penalty term. Thus, KRR selects the less influential features to undergo more penalization estimates (Zhang et al., 2013). If the number of independent predictors is too many where it is not easy to determine which predictor influences more the target/response variable, the KRR handles the situation better than others (Alaoui and Mahoney, 2015).

In KRR, there might be a possibility of the case where the number of dimensions can be higher, or even infinitely greater than the number of data and the inverse is not easy to compute. This problem can be handled with the help of Eq. (8) to perform the inverse either by the dimension of feature space or the number of data-cases (Welling, 2013). The solution must exist in the span of the data even the dimension is very large which shows the algorithm is linear in the feature space. 3.

Materials and Methods

3.1

Rainfall (P) Data As the data-driven models highly depend on prognostic features from the past, the

rainfall data from 1962 to 2013 was collated from the Pakistan Meteorological Department (PMD, 2016) for Gilgit, Muzaffarabad, Parachinar, as shown in Fig. 1. To assess the adaptability of the CEEMD-RF-KRR for P forecasting in Pakistan, the selected locations were broadly the main representative of the diverse climatic regions. Gilgit (35.281°N, 74. 841°E), is the capital of Gilgit-Baltistan province in Pakistan which is located in a broad valley adjacent to the confluence of the Gilgit River and Hunza River. The climate of Gilgit has sharp variations due to surrounding mountains and hills (PMD, 2016) with an average temperature of 20-25oC across the valleys and a drop in temperature in January with an average of -10 to 0oC. Normally, May is the rainiest month (approximately 25.1 mm) while November considered being the driest month (2.4mm rainfall). The capital of Azad Kashmir is Muzaffarabad (34.359° N, 73.471°E) situated near the Jhelum and Neelum rivers. Muzaffarabad is normally warm and temperate with heavy rainfall in the driest months. The average annual temperature and rainfall in Muzaffarabad are 20.2°C and 1457 mm respectively with a difference in rainfall is 272 mm between the driest and rainiest months. Parachinar (33.899° N, 70.100°E) is the capital of Kurram district in Khyber Pakhtunkhwa

(KP) province, which features a humid subtropical climate with much higher rainfall. The snowfall is common in the winter season and frost occurs on most mornings. The average annual rainfall is 781.7 mm while the average mean temperature is 15.3o C. Table 1 shows some descriptive statistics of rainfall for the selected study zones. 3.2 Design of the hybrid CEEMD-RF-KRR and comparison models The proposed hybrid CEEMD-RF-KRR model was established in MATLAB R2016b (The Math Works Inc. USA) running on Pentium 4, dual-core Central Processing Unit with 2.93 GHz of processing speed. The rainfall data were incorporated to design the hybrid CEEMDRF-KRR, CEEMD-RF, standalone RF, and standalone KRR models. The steps involved in establishing the CEEMD-RF-KRR are: Step 1: The CEEMD method was employed to resolve the rainfall (P) into IMFs and residuals. The parameters defined in constructing CEEMD are the number of realizations of the Gaussian Noise (𝜏 = 500) and the provided amplitude in terms of added white noise (ε = 0.2) (Ouyang et al., 2016; Ren et al., 2015; Wang et al., 2013; Wu and Huang, 2009). For Gilgit and Muzaffarabad, the CEEMD method decomposed the P time-series into nine IMFs (IMF1, IMF2,…, IMF9) and a residual part whereas Parachinar has eight IMFs (IMF1, IMF2,…, IMF8) with a residual factor. Figure 2 and Table 2 illustrates the CEEMD decomposed IMFs and residuals with P data for all locations. Step 2: The statistically significant lag time series computed by partial autocorrelation function (PACF) of IMFs and residual components presented in Fig. 3. Step 3: The normalization was established using Eq. (9) for the purpose of data fluctuation and spikes (Hsu et al., 2003) as below:

 norm 

    min 

  max   min 

(9)

where  indicates input/target,  min is the minimum point,  max denotes the maximum point and  norm is the anticipated normalized value. Step 4: The normalized significant lags of IMFs and the residual factors were incorporated as inputs to the RF model to forecast IMFs and residuals one by one. The number of trees (ntrees = 1000) and the number of predictors (5) were the predefined parameters used in the RF model at this stage. Basically, these parameters were determined by the trial and error technique taking different combinations. After several times of trial and error, the optimum results with these parameters were obtained. Step 5: The KRR model was finally applied to eventuate the CEEMD-RF-KRR model to forecast P where the forecasted IMFs and residuals were utilized as input predictors. KRR models use different types of kernels (Gaussian, Polynomial, and Linear). In this study, the polynomial kernel appeared to be the most suitable choice after a comparative study of the respective kernels were performed. The performance of CEEMD-RF-KRR (i.e., CEEMDRF-KRRPolynomial) was evaluated against CEEMD-RF, CEEMD-RF-KRRGaussian, CEEMD-RFKRRLinear, standalone KRR (based on each kernel type), and RF models. Table 3 describes the parameters used in the CEEMD-RF-KRR model vs. CEEMD-RF, CEEMD-RFKRRGaussian, CEEMD-RF-KRRLinear, standalone KRR, and standalone RF models. Figure 4 illustrates a schematic view of the hybrid CEEMD-RF-KRR model. 3.3 Model Assessment Criteria The performance of CEEMD-RF-KRR vs. CEEMD-RF, CEEMD-RF-KRRGaussian, CEEMDRF-KRRLinear, standalone RF and the standalone KRR models in rainfall forecasting were assessed utilizing a number of statistical and standardized metrics (Yen, 1995). The description of these assessment criteria is discussed below (Dawson et al., 2007; Legates and McCabe, 1999; Willmott, 1984).

I. Correlation coefficient (R) is formulated as:   r     

P n

OBS , j

j 1

P n

j 1

OBS , j

 P OBS , j

 P OBS , j

 P

FOR , j

 P 2

 P FOR , j

n

j 1

FOR , j



 P FOR , j



2

      

(10)

II. Willmott’s Index (EWI) is defined as:

E

WI

n 2   PFOR, j  POBS , j    j 1  1  n    PFOR , j  P OBS , j  POBS , j  P OBS , j  j 1





  , 0  E 1 2 WI   

(11)

III. Nash-Sutcliffe coefficient (ENS) is described as:

ENS

2   n    POBS , j  PFOR , j    ,   ENS  1  1   jn1 2   P P FOR , j   OBS , j   j 1 



(12)



IV. Root mean square error (RMSE, mm) is given as: RMSE 

2 1 n PFOR , j  POBS , j    n j 1

(13)

V. Mean absolute error (MAE, mm) is expressed as: MAE 

1 n   PFOR, j  POBS , j  n j 1

(14)

VI. Legates-McCabe’s index (ELM) is defined as:

E

LM

 n   PFOR ,i  POBS , j  1   jn1    POBS , j  POBS , j  j 1

  ,0  E 1 LM   

(15)

VII. Relative root mean square error (RRMSE¸%) is stated as:

RRMSE 

2 1 n PFOR , j  POBS , j    n j 1 100 1 n P   OBS , j  n j 1

(16)

Where POBS, j and

PFOR, j

are the observed and forecasted jth values of P, P OBS , j and P FOR , j are

the average of P and n is the entire number of tested values. In addition, the computation of the RMSE is based on the aggregation of residuals between observed and forecasted P (Nourani et al., 2011). The higher P values are largely captured by the RMSE, which ranges from 0 (ideal value) to positive infinity. However, this metric should not essentially be used to compare model performance at geographically diverse sites (Hora and Campos, 2015) as these are in their absolute terms. As such the relative values of root mean squared error (RRMSE) were utilized (Dawson et al., 2007) for this purpose.

4.

Results The CEEM-RF-KRR model is undertaken to forecast rainfall for three highly

geographically diverse regions of Pakistan. The hybrid CEEMD-RF-KRR (based on Polynomial kernel) model is appraised with respect to standalone RF, KRR, and CEEMD-RF models. To establish parsimonious models, PACF played a vital role in deciding the most relevant lags of P. The arithmetical metrics discussed in Eq. (9–15) are utilized to explain the results. In terms of numerical model quantification, the performance of the CEEMD-RF-KRR (i.e. CEEMD-RF-KRRPolynomial) model is evaluated against the counterparts in Table 4. It is evident that the hybrid CEEMD-RF-KRR performed better at all three sites in achieving lower errors, RMSE and MAE and very high Pearson’s correlation coefficient (R), while the standalone KRR model performed poorly at all the three sites under consideration. Ideally, the magnitudes of RMSE and MAE must be as small as possible to reflect the lowest deviations of forecasted values from the observed ones. For the case of Muzaffarabad and Parachinar, high error values were apparent from all models. The competing models (CEEMRF, CEEMD-RF-KRRGaussian, CEEMD-RF-KRRLinear, standalone RF, and Standalone KRR

(based on each kernel) registered higher error values with MAE ranging between 25.03 mm to 195.18 mm and RMSE ranged between 35.51 mm to 222.07 mm. Yet the hybrid CEEMDRF-KRR model outperformed with MAE of 13.65 mm (Parachinar) and 19.99 mm (Muzaffarabad) and least RMSE as well. In comparison to the standalone KRR (worst performing model), a two-fold increase in r is noted at Parachinar (percentage increase ≃ 93 %), and a huge over three-fold increase is seen for Muzaffarabad (percentage increase ≃ 160 %) and Gilgit sites (percentage increase ≃ 171 %). Even with a comparison to the CEEMDRF (second-best performing model), the high percentage decrease in MAE (range: -17 % to 80 %) and RMSE (range: -25 % to -82 %) was achieved by the CEEMD-RF-KRR model further depicting its suitability in forecasting monthly P. To holistically evaluate the model performances, the correlation coefficient, R, is plotted in the form of a Taylor diagram that provides a more detailed appraisal of the model performances (Xu et al., 2016). The Taylor diagram (Fig. 5) portrays a more tangible and convincing statistical relation between the forecasted and observed monthly P depending on R with respect to standard deviations. It is clearly displayed that the standalone KRR and standalone RF models are not appropriate as the r to SD points were highly parted from the ideal observed point. The proposed hybrid CEEMD-RF-KRR model was lying close to the observed P which confirms the forecasting accuracy was significantly higher at all three sites than the other three comparing models (CEEMD-RF, standalone RF, and standalone KRR) in this study. The model evaluation was also undertaken with the normalized metrics (EWI, ENS, and ELM) for the three sites in forecasting monthly rainfall, which ideally must be equal to unity for perfectly fitting models (Table 5). The normalized metrics also justified a better utility of the CEEMD-RF-KRR (i.e. CEEMD-RF-KRRPolynomial) model as there was a significant increase in these values at all three study sites. The CEEMD-RF models at all sites did

perform equally well, however, was not able to outperform the proposed CEEMD-RF-KRR models. At all sites, the CEEMD-RF-KRR model attained the highest values of normalized metrics, EWI, ENS, and ELM, with percentage increase in EWI ranging between 3.6 % to 5.7 %, ENS ranging between 4 % to 6 %, and ELM ranging between 7 % to 13 % in comparison to the competing CEEMD-KRR model which has the second-best performance. It must be noted that models with ENS below 0.800 are ‘unsatisfactory’, those between 0.800 and 0.900 are ‘fairly good’ and ENS values greater than 0.900 are considered to be ‘very satisfactory’(Shamseldin, 1997). Hence, according to the abovementioned criteria, the proposed hybrid CEEMD-RF-KRR models at all sites could be regarded as ‘very satisfactory’ when applied for precipitation forecasting at these three sites in Pakistan. More precisely, Table 5 examines the CEEMD-RF-KRR efficiency in predicting P using EWI, ENS and ELM criterion for Gilgit: (EWI = 0.966, ENS = 0.972 and ELM = 0.810), Muzaffarabad (EWI = 0.938, ENS = 0.941 and ELM = 0.739) and Parachinar (EWI = 0.950, ENS = 0.950 and ELM = 0.759). The outcomes of benchmark models (Table 5) confirms the supremacy of the CEEMD-RF-KRR for all study sites in this paper. Hence, based on the above criteria, the proposed model could be regarded as ‘satisfactory’ for P forecasting at these sites in Pakistan. In addition, the model appraisal was undertaken using diagnostic plots. Prior to making the box-plots, forecasting errors (FE) were computed during the testing period. The FE ideally needs to be zero, hence when plotted on box plots, the distribution of FE needs to be as close to zero as possible for best performing models. The absolute FE (|𝐹𝐸|) values were plotted for ease of comprehension. In accordance with Tables 4 and 5, Fig. 6 CEEMDRF-KRR models at all sites registered smaller |FE| distributions revealing its better forecasting capability. According to the normalized metrics, EWI, ENS, and ELM, the CEEMDRF models seemed to have a very close performance to that of CEEMD-RF-KRR models.

Contrariwise, the box-plots make a clear distinction of performances since the distribution of |𝐹𝐸| generated from CEEMD-RF models for Muzaffarabad and Parachinar were fairly scattered with a large number of outlier points. Likewise, for Site 1: Gilgit the CEEMD-RFKRR model had a narrower |𝐹𝐸| distribution bounded near zero with very fewer outliers. Additionally, the empirical cumulative distribution function (ECDF) plots of |𝐹𝐸| were also plotted in Fig. 7 to get a clearer picture of the forecast error distributions. For Sites 2Muzaffarabad and 3-Parachinar, the ECDF line plots of CEEMD-KRR, standalone RF and standalone KRR displayed a very close profile, while at Site 1-Gilgit the ECDF profiles of standalone RF and standalone KRR were very similar. On the other hand, at all three sites, the ECDF profile of CEEMD-RF-KRR displayed a remarkable narrow profile confined within the least |𝐹𝐸| range. Hence, the box-plots (Fig. 6) together with the ECDF plots (Fig. 7) for all three sites further ascertain better performance of the CEEMD-RF-KRR model in forecasting monthly P compared to the competing models (CEEMD-RF, standalone RF, and standalone KRR). A further diagnostic was performed with bar graph plots (Fig. 8) based on correlation coefficient, R. Note that the interest here is to evaluate all models in examining the percentages of P values that can be effectively forecasted by the respective models (CEEMDRF-KRR, CEEMD-KRR, and standalone KRR and standalone RF). The standalone KRR models at all sites performed poorly registering lower R values. Evidently, the CEEMD-RFKRR model exhibits larger R values at all three sites in comparison with the competing models. Consequently, over 90% of forecasted values are directly associated with the observed values in the testing period. Specifically, results reflected that for the case of Gilgit (R = 0.986), Muzaffarabad R = 0.971) and Parachinar R = 0.976), which concur with the results in Tables 4 and 5.

From the outcomes of the numerical metrics (Table 4 and 5), a difference in performances of models at the three distinctive sites is apparent. The key limitation of these metrics is their inability to compare models at geographically disparate sites. The relative performance (Table 6) suitably discovered that the CEEMD-RF-KRR (i.e. CEEMD-RFKRRPolynomial) models exhibit the lowest values of RRMSE in comparison with CEEMD-RF, standalone RF and KRR models for all three sites. More accurately, the RRMSE magnitudes when comparing CEEMD-RF-KRR with the second-best performing model CEEMD-RF, in the combination [CEEMD-RF-KRR: CEEMD-RF] were as follows: Gilgit: [19.20%: 29.40%];

Muzaffarabad:

[21.50%:

29.60%]

and

Parachinar:

[18.20%:

24.40%].

Consequently, the RRMSE exhibited that the CEEMD-RF-KRR accomplished the best accuracy for Site 3-Parachinar, followed by Site 1-Gilgit and Site 2- Muzaffarabad. 5.

Discussion This study has designed a hybrid CEEMD-RF-KRR (i.e. CEEMD-RF-KRRPolynomial)

model using the historical significant lags of rainfall (P) data to forecast monthly P in Pakistan at three sites that experience a differing annual climate pattern. The model utilized the CEEMD algorithm to factorize the P into respective IMFs and residuals components. The PACF is used to compute the statistically relevant lags of IMFs with residuals that are utilized into the RF model to predict each IMF and the residue. Finally, the KRR model was applied to forecast and aggregate the P by incorporating the forecasted IMFs and residuals as input predictors. The proposed CEEMD-RF-KRR model was appraised with CEEMD-RF, standalone RF, and KRR models. The developed CEEMD-RF-KRR and comparison models were successfully appraised to generate smaller relative percentage errors in terms of RRMSE with reasonably higher magnitudes of R, EWI, ENS and ELM (Tables 4, 5 and 6). The performance was high,

according to the achieved assessment criteria (Eqs. 9-15) used in this paper. Thus, the proposed hybrid CEEMD-RF-KRR framework can aptly be applied to forecast monthly P where the forecasting of future P is likely to become more erratic caused by global warming and an increase in water demand. The hybrid CEEMD-RF-KRR and competing models generated slightly varying results for each of the stations due to diverse geographic conditions as displayed in Table 1. Each study zone possesses distinct climatic and environmental properties which significantly affect rainfall (Qing et al., 2011). For example, Gilgit is basically a hilly area at higher elevation 2437 m which significantly affects the rainfall in the region. Rainfall is a lot more common at a higher elevation region as the cooler weather and atmosphere cannot hold in as much condensation, causing it to rain or snow more often (Poage and Chamberlain, 2001; Siegenthaler and Oeschger, 1980). Since the study locations in this paper are at different elevation levels from low (Muzaffarabad = 679 m) to high (Gilgit = 2437 m), the models produced different prediction results (Vuille and Glaciers, 2011). Accurate rainfall forecasts are extremely important for the efficient operation of water infrastructure and disaster risk reductions in cases of floods and droughts (Anderson and Burt, 1985). The nature of the rainfall forecast horizon is usually defined by the application for which the forecasts are being used. For instance, rainfall forecasts can be made from the start of the forecast up to short, medium or long term into the future. Forecasts for water management are needed to plan effective use of water, ranging from hydroelectric power generation to water supply, irrigation as well as drought and flood early warning systems. In particular, an effective early warning system must provide sufficient lead time for individual communities in the case of drought/flood caused by rainfall to respond accordingly. In the case of flood forecasts, lead time can be critical in reducing infrastructural damages and loss of life. Forecasts must be sufficiently accurate to uphold confidence so that communities and

users will take effective actions when warned. If forecasts are inaccurate, credibility is reduced and an adequate response is not made. Accurate P forecasting and developing early warning systems can have numerous benefits, both nationally and internationally (Barredo, 2007; Langridge et al., 2006). Precise P forecasts can play an important role in the aforementioned situations for both developed and developing nations such as Pakistan. Governments and water resource managers use machine learning tools in providing guidance for the operation of infrastructure and the implementation of appropriate measures (Anderson and Burt, 1985). Further, this study can be adopted anywhere for P forecasting using the CEEMD algorithm in addressing non-static and non-linear problems. The hybrid CEEMD-RF-KRR model can be applied for agriculture crop yield prediction which could be important to the government’s national policy-making to help minimize crop estimation uncertainties (Akhtar, 2012). However, establishing a viable rainfall forecasting and warning system for communities at risk requires the combination of meteorological and hydrological data, forecast tools and trained forecasters (Bellerby et al., 2000) and more studies pertaining to advanced machine learning modelling approaches are recommended. It is noted that the current study utilized the historical significant lags of P to forecast future P and therefore, carries some limitations. To enhance the scope of this study, other meteorological data such as synoptic-scale climate mode indices, temperature, humidity, cloud cover, sunshine, air temperatures, soil moisture, wind, and solar radiation could also be collated as predictor inputs. Remotely sensed predictor variables through satellites or atmospheric simulation models (Bauer, 1975; Collins et al., 2006) may also improve the model performances for modelling P in remote areas. More importantly, since process-based modeling is resource-demanding and cost-prohibitive for developing countries, the proposed hybrid CEEMD-RF-KRR model may be seen as a viable solution. Moreover, the proposed

hybrid model CEEMD-RF-KRR model can be applied in those regions where other meteorological data is not available. The hybrid CEEMD-RF-KRR model could be further developed with ensemble strategies to possibly achieve more accurate results. Additionally, bio-inspired techniques such as Quantum-Behaved Particle Swarm Optimization and the Firefly Algorithm can assist in selecting input predictors, which have been tested in the hybridization of RF and other models (Ghorbani et al., 2017; Pal et al., 2012). Another possible choice could be empirical wavelet transform (Gilles, 2013) and advance multi-resolution analysis (Huang et al., 1998). Future work could apply statistical copula tools (Nguyen-Huy et al., 2017) in order to study the joint behavior of multivariate data. 6.

Conclusions

A hybrid CEEMD-RF-KRR (i.e. CEEMD-RF-KRRPolynomial) model has been constructed in this paper to forecast future rainfall (P). The P time-series from 1962-2013 were decomposed by employing the CEEMD algorithm into respective IMFs and residuals components. The significant lags of each IMFs and residual were calculated using PACF. Each significant lag of PACF was incorporated into an RF model to predict IMFs and residual factor. Finally, the forecasted IMFs with residuals were inserted in the KRR model to design the robust CEEMD-RF-KRR model. Further, different kinds of assessment metrics were utilized to measure the efficiency of the CEEMD-RF-KRR model. The proposed hybrid CEEMD-RF-KRR model was compared with CEEMD-RF, CEEMD-RF-KRRGaussian, CEEMD-RF-KRRLinear, standalone RF, and standalone KRR (with respective kernel functions) models. As evident by assessment metrics, the accuracy of the proposed CEEMD-RF-KRR model is better in relation to comparative models. The |FE| errors for the best site Gilgit attained by the proposed hybrid CEEMD-RF-KRR model were

RMSE = 2.52 mm and MAE = 1.98 mm while the achieved normalized metrics were R = 0.986, WI = 0.966 and ENS = 0.972 (See; Table 4 and 5). On the basis of ELM, the acquired magnitudes of ELM agreement values for site 1 (i.e. Gilgit) were ELM = 0.810 (CEEMD-RF-KRR), 0.717 (CEEMD-RF), 0.403 (standalone RF) and 0.047 (standalone KRR) whereas the RRMSE values were 19.2% (CEEMD-RF-KRR) compared with 29.4% (CEEMD-RF), 68.4% (standalone RF) and 108.5% (standalone KRR). A significant amount of geographic variability was seen on the basis of RRMSE for Gilgit site with comparison to Parachinar and Muzaffarabad. This study provides a baseline in terms of using the CEEMD algorithm to dissolve the data into IMFs and residual component to avoid the non-stationarity and non-linearity issues in monthly P forecasting. The CEEMD-RF-KRR model may be used for drought, soil moisture, agricultural crops, solar radiations, etc. that will help hydrologists and agriculturist of Pakistan in the optimal resource management. Besides, precise P forecasting can be used to alert the government and impacted stakeholders prior to significant water resources depletion, flood, and drought events. Acknowledgement This paper has utilized climate data that were sourced from the Pakistan Meteorological Department, Pakistan and is duly acknowledged. References Abbot, J. and Marohasy, J., 2014. Input selection and optimisation for monthly rainfall forecasting in Queensland, Australia, using artificial neural networks. Atmos. Res., 138: 166-178. Doi: https://doi.org/10.1016/j.atmosres.2013.11.002 Akhtar, I.U.H., 2012. Pakistan Needs a New Crop Forecasting System. Sci. Dev.Net. http://www.scidev.net/global/climate-change/opinion/pakistanneeds-a-new-crop-forecasting-system.html. Alaoui, A. and Mahoney, M.W., 2015. Fast randomized kernel ridge regression with statistical guarantees. Proceedings of Adv. in Neural Information Pro.

Sys. (NIPS 2015): 775-783. http://papers.nips.cc/paper/5716-fastrandomized-kernel-ridge-regression-with-statistical-guarantees.pdf Ali, M., Deo, R.C., Downs, N.J. and Maraseni, T., 2018a. An ensemble-ANFIS based uncertainty assessment model for forecasting multi-scalar standardized precipitation index. Atmospheric Research, 207: 155-180. Doi: https://doi.org/10.1016/j.atmosres.2018.02.024 Ali, M., Deo, R.C., Downs, N.J. and Maraseni, T., 2018b. Multi-stage hybridized online sequential extreme learning machine integrated with Markov Chain Monte Carlo copula-Bat algorithm for rainfall forecasting. Atmospheric Research, 213: 450-464. Doi: https://doi.org/10.1016/j.atmosres.2018.07.005 Alvanitopoulos, P.-F., Andreadis, I., Georgoulas, N., Zervakis, M. and Nikolaidis, N., 2014. Solar Radiation Time-Series Prediction Based on Empirical Mode Decomposition and Artificial Neural Networks, 10th IFIP International Conference on Artificial Intelligence Applications and Innovations (AIAI). IFIP Advances in Information and Communication Technology, AICT-436. Springer, Rhodes, Greece: 447-455. Anderson, M.G. and Burt, T.P., 1985. Hydrological forecasting, 372. Wiley Chichester. Archer, D.R. and Fowler, H.J., 2008. Using meteorological data to forecast seasonal runoff on the River Jhelum, Pakistan. Journal of Hydrology, 361(1): 10-23. Doi: https://doi.org/10.1016/j.jhydrol.2008.07.017 Ascough Ii, J., Maier, H., Ravalico, J., and Strudley, M., 2008. Future research challenges for incorporation of uncertainty in environmental and ecological decision-making. Ecological modelling, 219(3-4): 383-399. https://doi.org/10.1016/j.ecolmodel.2008.07.015 Barredo, J.I., 2007. Major flood disasters in Europe: 1950–2005. Natural Hazards, 42(1): 125-148. https://doi.org/10.1007/s11069-006-9065-2 Bauer, M.E., 1975. The role of remote sensing in determining the distribution and yield of crops. Advances in Agronomy, 27: 271-304. https://doi.org/10.1016/S0065-2113(08)70012-9 Bellerby, T., Todd, M., Kniveton, D. and Kidd, C.J.J.o.a.M., 2000. Rainfall estimation from a combination of TRMM precipitation radar and GOES multispectral satellite imagery through the use of an artificial neural network. Journal of Applied Meteorology and Climatology, 39(12): 21152128. https://doi.org/10.1175/15200450(2001)040<2115:REFACO>2.0.CO;2 Bhalme, H.N., and Mooley, D.A., 1980. Large-scale droughts/floods and monsoon circulation. Monthly Weather Review, 108(8): 1197-1211. https://doi.org/10.1175/1520-0493(1980)108<1197:LSDAMC>2.0.CO;2 Breiman, L., 1996. Bagging Predictors. Machine Learning, 24(2): 123-140. Breiman, L.J.M.l., 2001. Random forests. 45(1): 5-32. https://doi.org/10.1023/A:1018054314350 Burlando, P., Rosso, R., Cadavid, L.G. and Salas, J.D., 1993. Forecasting of short-term rainfall using ARMA models. Journal of Hydrology, 144(1-4): 193-211. https://doi.org/10.1016/0022-1694(93)90172-6 Chiew, F.H., Piechota, T.C., Dracup, J.A. and McMahon, T.A., 1998. El Nino/Southern Oscillation and Australian rainfall, streamflow and drought: Links and potential for forecasting. Journal of Hydrology, 204(1-4): 138149. https://doi.org/10.1016/S0022-1694(97)00121-2 Collins, W.D., Rasch, P.J., Boville, B.A., Hack, J.J., McCaa, J.R., Williamson, D.L., and Briegleb, B.P., 2006. The formulation and atmospheric simulation of

the Community Atmosphere Model version 3 (CAM3). Journal of Climate, 19(11): 2144-2161. https://doi.org/10.1175/JCLI3760.1 Dawson, C.W., Abrahart, R.J. and See, L.M., 2007. HydroTest: a web-based toolbox of evaluation metrics for the standardised assessment of hydrological forecasts. Environmental Modelling & Software, 22(7): 10341052. https://doi.org/10.1016/j.envsoft.2006.06.008 Faisal, N. and Gaffar, A., 2012. Development of Pakistan’s new area-weighted rainfall using Thiessen polygon method. Pakistan Journal of Meteorology, 9(17): 107-116. Gebremariam, S.Y., Martin, J.F., DeMarchi, C., Bosch, N.S., Confesor, R., and Ludsin, S.A., 2014. A comprehensive approach to evaluating watershed models for predicting river flow regimes critical to downstream ecosystem services. Environmental modelling & software, 61: 121-134. https://doi.org/10.1016/j.envsoft.2014.07.004 Ghorbani, M.A., Deo R.C., C., Zaher M., Y., Mahsa H, K. and Babak, M., 2018. Pan Evaporation Prediction Using a Hybrid Multilayer Perceptron-Firefly Algorithm (MLP-FFA) Model: Case study in North Iran. Theoretical and Applied Climatology, 133, pages 1119–1131. https://doi.org/10.1007/s00704-017-2244-0 Gilles, J., 2013. Empirical wavelet transform. IEEE transactions on signal processing, 61(16): 3999-4010. Doi: 10.1109/TSP.2013.2265222 He, Z., Wen, X., Liu, H. and Du, J., 2014. A comparative study of artificial neural network, adaptive neuro-fuzzy inference system and support vector machine for forecasting river flow in the semiarid mountain region. Journal of Hydrology, 509: 379-386. https://doi.org/10.1016/j.jhydrol.2013.11.054 Hora, J. and Campos, P., 2015. A review of performance criteria to validate simulation models. Expert Systems, 32(5): 578-595. https://doi.org/10.1111/exsy.12111 Hsu, C.W., Chang, C.C. and Lin, C.J., 2003. A practical guide to support vector classification. https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf Huang, N.E., Shen, Z., Long, S.R., Wu, M.C., Shih, H.H., Zheng, Q., Yen, N.C., Tung, C.C., and Liu, H.H., 1998. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis, Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences. The Royal Society: 903-995. https://doi.org/10.1098/rspa.1998.0193 Hung, N.Q., Babel, M.S., Weesakul, S. and Tripathi, N., 2009. An artificial neural network model for rainfall forecasting in Bangkok, Thailand. Hydrology and Earth System Sciences, 13(8): 1413-1425. https://doi.org/10.5194/hess-13-1413-2009, 2009. Kashiwao, T., Nakayama, K., Ando, S., Ikeda, K., Lee, M., Bahadori, A., 2017. A neural network-based local rainfall prediction system using meteorological data on the Internet: A case study using data from the Japan Meteorological Agency. Applied Soft Computing, 56: 317-330. https://doi.org/10.1016/j.asoc.2017.03.015 Koch, J., Berger, H., Henriksen, H.J. and Sonnenborg, T.O., 2019. Modelling of the shallow water table at high spatial resolution using Random Forests. Hydrology and Earth System Sciences Discussions 23, 4603-4619. https://doi.org/10.5194/hess-23-4603-2019

Langridge, R., Christian-Smith, J., and Lohse, K., 2006. Access and resilience: analyzing the construction of social resilience to the threat of water scarcity. Ecology and Society, 11(2). http://www.ecologyandsociety.org/vol11/iss2/art18/ Legates, D.R. and McCabe, G.J.J., 1999. Evaluating the use of “goodness‐of‐fit” measures in hydrologic and hydroclimatic model validation. Water Resources Research, 35(1): 233-241. Doi: 10.1029/1998WR900018 Liaw, A. and Wiener, M.J., 2002. Classification and regression by random Forest. R News, 2(3): 18-22. Lin, G.F., Chen, G.R., Wu, M.C. and Chou, Y.C., 2009. Effective forecasting of hourly typhoon rainfall using support vector machines. Water Resources Research, 45(8). https://doi.org/10.1029/2009WR007911 Luk, K.C., Ball, J.E. and Sharma, A., 2001. An application of artificial neural networks for rainfall forecasting. Mathematical and Computer modelling, 33(6-7): 683-693. https://doi.org/10.1016/S0895-7177(00)00272-7 Moazami, S., Golian, S., Kavianpour, M.R. and Hong, Y., 2014. Uncertainty analysis of bias from satellite rainfall estimates using copula method. Atmospheric research, 137: 145-166. https://doi.org/10.1016/j.atmosres.2013.08.016 Moore, I.D., Grayson, R. and Ladson, A., 1991. Digital terrain modelling: a review of hydrological, geomorphological, and biological applications. Hydrological processes, 5(1): 3-30. https://doi.org/10.1002/hyp.3360050103 Nasseri, M., Asghari, K. and Abedini, M., 2008. Optimized scenario for rainfall forecasting using genetic algorithm coupled with artificial neural network. Expert Systems with Applications, 35(3): 1415-1421. https://doi.org/10.1016/j.eswa.2007.08.033 News, D., 2010. Floods to hit economic growth: Finance Ministry, Dawn News. https://www.dawn.com/news/667567 Nguyen-Huy, T., Deo, R.C., An-Vo, D.A., Mushtaq, S. and Khan, S., 2017. Copula-statistical precipitation forecasting model in Australia’s agroecological zones. Agricultural Water Management, 191: 153-172. https://doi.org/10.1016/j.agwat.2017.06.010 Nourani, V., Kisi, Ö. and Komasi, M., 2011. Two-hybrid Artificial Intelligence approaches for modeling rainfall-runoff process. Journal of Hydrology, 402(1-2): 41-59. https://doi.org/10.1016/j.jhydrol.2011.03.002 Ouyang, Q., Lu, W., Xin, X., Zhang, Y., Cheng, W., Yu, T., 2016. Monthly Rainfall Forecasting Using EEMD-SVR Based on Phase-Space Reconstruction. Water Resources Management, 30(7): 2311-2325. https://doi.org/10.1007/s11269-016-1288-8 Pal, S.K., Rai, C. and Singh, A.P., 2012. Comparative study of firefly algorithm and particle swarm optimization for noisy non-linear optimization problems. International Journal of Intelligent Systems and Applications, 4(10): 50. Doi: 10.5815/ijisa.2012.10.06 PMD, 2016. Pakistan Meteorological Department, Pakistan. http://www.pmd.gov.pk/en/ Poage, M.A., and Chamberlain, C.P., 2001. Empirical relationships between elevation and the stable isotope composition of precipitation and surface waters: considerations for studies of paleoelevation change. American Journal of Science, 301(1): 1-15. http://earth.geology.yale.edu/~ajs/2001/Jan/001/001.htm

Prasad, R., Deo, R.C., Li, Y. and Maraseni, T., 2017. Input selection and performance optimization of ANN-based streamflow forecasts in the drought-prone Murray Darling Basin region using IIS and MODWT algorithm. Atmospheric Research, 197: 42-63. https://doi.org/10.1016/j.atmosres.2017.06.014 Prasad, R., Deo, R.C., Li, Y. and Maraseni, T., 2018. Soil moisture forecasting by a hybrid machine learning technique: ELM integrated with ensemble empirical mode decomposition. Geoderma, 330: 136-161. https://doi.org/10.1016/j.geoderma.2018.05.035 Qing, Y., Zhu-Guo, and M., Liang, 2011. A preliminary analysis of the relationship between precipitation variation trends and altitude in China. Atmospheric and Oceanic Science Letters, 4(1): 41-46. https://doi.org/10.1080/16742834.2011.11446899 Reale, O., Lau, K., Susskind, J., and Rosenberg, R., 2012. AIRS impact on analysis and forecast of an extreme rainfall event (Indus River Valley, Pakistan, 2010) with a global data assimilation and forecast system. Journal of Geophysical Research: Atmospheres, 117(D8). https://doi.org/10.1029/2011JD017093 Ren, Y., Suganthan, P.N. and Srikanth, N., 2015. A comparative study of empirical mode decomposition-based short-term wind speed forecasting methods. IEEE Transactions on Sustainable Energy, 6(1): 236–244. 10.1109/TSTE.2014.2365580 Romagnoli, M., Portapila, M., Rigalli, A., Maydana, G., Burgues, M., Garcia, C.M., 2017. Assessment of the SWAT model to simulate a watershed with limited available data in the Pampas region, Argentina. Science of The Total Environment, 596: 437-450. https://doi.org/10.1016/j.scitotenv.2017.01.041 Salma, S., Rehman, S. and Shah, M.J., 2012. Rainfall trends in different climate zones of Pakistan. Pakistan Journal of Meteorology, 9(17). http://www.pmd.gov.pk/rnd/rnd_files/vol8_issue17/4.pdf Sánchez-Monedero, J., Salcedo-Sanz, S., Gutiérrez, P.A., Casanova-Mateo, C. and Hervás-Martínez, C., 2014. Simultaneous modelling of rainfall occurrence and amount using a hierarchical nominal–ordinal support vector classifier. Engineering Applications of Artificial Intelligence, 34: 199-207. https://doi.org/10.1016/j.engappai.2014.05.016 Schapire, R.E., Freund, Y., Bartlett, P. and Lee, W.S., 1998. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of statistics: 1651-1686. https://www.cc.gatech.edu/~isbell/tutorials/boostingmargins.pdf Shamseldin, A.Y., 1997. Application of a neural network technique to rainfallrunoff. Journal of Hydrology, 199: 272-294. https://doi.org/10.1016/S0022-1694(96)03330-6 Sharma, A., 2000. Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 3—A nonparametric probabilistic forecast model. Journal of Hydrology, 239(1): 249-258. https://doi.org/10.1016/S0022-1694(00)00348-6 Siegenthaler, U. and Oeschger, H., 1980. Correlation of 18O in precipitation with temperature and altitude. Nature, 285: 314–317. Doi: https://doi.org/10.1038/285314a0. Tarakzai, S., 2010. Pakistan battles economic pain of floods, Jakarta Globe. https://en.wikipedia.org/wiki/2010_Pakistan_floods

Torres, M.E., Colominas, M.A., Schlotthauer, G. and Flandrin, P., 2011. A complete ensemble empirical mode decomposition with adaptive noise, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): 4144-4147. 10.1109/ICASSP.2011.5947265 Vörösmarty, C.J., McIntyre, P.B., Gessner, M.O., Dudgeon, D., Prusevich, A., Green, P., Gliddens, S., Bunn, S.E., Sullivan, C.A., Reidy Liermann, C., and Davies, P.M., 2010. Global threats to human water security and river biodiversity. Nature, 467(7315): 555-561. https://doi.org/10.1038/nature09440 Vuille, M., 2011. Climate variability and high altitude temperature and precipitation. In: Singh V.P., Singh P., Haritashya U.K. (eds) Encyclopedia of Snow, Ice and Glaciers. Encyclopedia of Earth Sciences Series. Springer, Dordrecht: 153-156. https://doi.org/10.1007/978-90-4812642-2_66 Wang, W.-c., Xu, D.-m., Chau, K.-w. and Chen, S., 2013. Improved annual rainfall-runoff forecasting using PSO–SVM model based on EEMD. Journal of Hydroinformatics, 15(4): 1377-1390. https://doi.org/10.2166/hydro.2013.134 Welling, M., 2013. Kernel ridge regression. Max Welling's Classnotes in Machine Learning: 1-3. https://www.ics.uci.edu/~welling/classnotes/papers_class/KernelRidge.pdf Willmott, C.J., 1984. On the evaluation of model performance in physical geography, Spatial statistics, and models. Springer: 443-460. https://doi.org/10.1007/978-94-017-3048-8_23 Wu, Z. and Huang, N.E., 2009. Ensemble empirical mode decomposition: A noise-assisted data analysis method. Advances in Adaptive Data Analysis, 1(1): 1-41. https://doi.org/10.1142/S1793536909000047 Wu, Z., Huang, N.E., Wallace, J.M., Smoliak, B.V. and Chen, X., 2011. On the time-varying trend in global-mean surface temperature. Climate Dynamics, 37(3-4): 759-773. https://doi.org/10.1007/s00382-011-11288 Xu, Z., Hou, Z., Han, Y. and Guo, W., 2016. A diagram for evaluating multiple aspects of model performance in simulating vector fields. Geoscientific Model Development, 9(12): 4365-4380. https://doi.org/10.5194/gmd-94365-2016 Yen, B.C., 1995. Discussion and Closure: Criteria for Evaluation of Watershed Models. Journal of Irrigation and Drainage Engineering, 121(1): 130-132. https://doi.org/10.1061/(ASCE)0733-9437(1995)121:1(130) You, Y., Demmel, J., Hsieh, C.-J. and Vuduc, R., 2018. Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems. ACM International Conference on Supercomputing (ICS) 2018: June 12– 15, 2018, Beijing, China. Doi: 10.1145/3205289.3205290 Zhang, Y., Duchi, J. and Wainwright, M., 2013. Divide and conquer kernel ridge regression, Conference on Learning Theory: 592-617. http://proceedings.mlr.press/v30/Zhang13.html

List of Figures

Figure 1: Map of the study region showing the locations.

Site 1: Gilgit

140

P

100 60 20 1

70

140

210

280

0

20

IMF 4

IMF 3

0

0

10

IMF 6

-20

10

IMF 5

-20

0

0 -10

4 2 0 -2 -4

1 0.5 0 -0.5

IMF 8

-10

IMF 7

552

0

20

Residuals

1

IMF 9

490

-20

-50

0 -1 1

60

120

180

240

300

360

420

552

15 14 13 1

60

120

180

240

300

360

420

552

Site 2: Muzaffarabad

700

P

420

20

IMF 2

IMF 1

50

350

300 0

200

IMF 2

0

0

-200

-200

200 100 0 -100

50

IMF 4

IMF 3

IMF 1

200

50

IMF6

IMF 5

0 -50

50 0

0

20

10

IMF 8

-50

IMF 7

-50

0

Residuals

5

IMF 9

0 -10

-20

0 -5 1

552

480

400

320

240

160

80

1

80

160

240

320

400

480

552

140

135 1

80

160

240

320

400

480

552

Kurrum Agency, Parachinar

P

400 200 1

100

200

300

0

40 20 0 -20 -40

IMF 4

100

IMF 3

-100

0 -100

20

IMF 6

20

IMF 5

552

0

-200

0 -20

0

-20

20 10 0 -10 1

IMF 8

20

100

200

300

Residuals

IMF 7

500

100

IMF 2

IMF 1

200

400

400

500

0

552

-20 1

100

200

300

400

200

300

400

500

552

100

50 1

100

500

552

Figure 2: The resolved subseries: Intrinsic mode functions (IMFs) and residual from CEEMD for Site 1: Gilgit, Site 2: Muzaffarabad, and Site 3: Parachinar. IMF1

8

2

4 Lags

6

IMF7

2

4 Lags

6

-1

8

0

2

4 Lags

6

2

4 Lags

6

8

6

8

0 -1

0

2

4 Lags

4 Lags

6

8

6

8

6

8

IMF6

0

2

4 Lags IMF9

2

residuals

1

2

0 -1

8

IMF8

0

0

1

0 -1

8

6

IMF5

1

PACF 0

4 Lags

0 -1

8

0 -1

2

0

PACF

0 0

0

1

PACF

PACF PACF

6

IMF4

1

(a)

4 Lags

PACF

2

1

-1

0 -1

0

IMF3

1

PACF

PACF

0 -1

IMF2

1

PACF

PACF

1

0 -2

0

2

4 Lags

6

-1

8

IMF7

1 0

2

PACF

-1

IMF1

4 0 Lags

-1

0

6

2

IMF4

0

2

4 Lags

-1

6

PACF 0

-1 2

4 Lags

6

8

1 2 0 -1

0

8

4 6 Lags residuals

0

2

4 Lags

0 0

2

4 Lags

4 Lags

6

IMF5 6

8

4 Lags

6

0

2

4 Lags

0

IMF3

1 0 -1

2

4 Lags

6

8

0

2

4 Lags

6

8

6

8

6

8

IMF6

1

8

0 -1 0

2

4 Lags

4 Lags residuals

-1 2

8

0

8

0 0

6

IMF9

IMF8

1

8

0

-1

8

6

IMF6

1

IMF2

1 2

-1

8

0

6

0

IMF7

1

0

4 Lags

2

1

IMF8

1 8 0

0 -1

2

PACF

PACF

1

PACF

6

0

-1 0

-1

8

4 Lags

-1

8

IMF5

1

0

6

PACF

4 Lags

1

4 Lags

PACF

2

2

0

PACF

PACF

PACF 0

0

1

0 -1

(c)

-1

8

IMF4

1

PACF

6

0

PACF

4 Lags

0

PACF

2

IMF3

1

PACF

PACF 0

PACF

PACF

0 -1

IMF2

1

PACF

IMF1

1

PACF

(b)

6

8

1 0 -1 0

2

4 Lags

Figure 3: Statistical significant PACF of IMFs and residual used for developing the CEEMD-RFKRR model for (a) Gilgit, (b) Muzaffarabad, and (c) Parachinar.

Figure 4: Flowchart of the proposed hybrid CEEMD-RF-KRR model.

Figure

5:

Taylor

diagram showing the correlation coefficient

between observed and forecasted rainfall and standard deviation of CEEMD-RF-KRR (i.e. CEEMD-RF-KRRPolynomial) vs. CEEMD-RF, Standalone RF, and Standalone KRR (i.e. Standalone KRRGaussian) models for Site 1: Gilgit, Site 2: Muzaffarabad, and Site 3: Parachinar.

Site 1: Gilgit

70

Site 2: Muzaffarabad

500

Site 3: Parachinar 250

60 400 200

40 30

300

|FE| (mm)

|FE| (mm)

|FE| (mm)

50

200

150

100

F

R K ne

ne

al o an d

C

EE

M

D

FK -R D EE M C

R

F -R

R

R K ne da lo

na St

al o an d St

R

R

F ne

-R D M EE C

-R D EE M C

R

F

R R FK

R K ne al o

an d

an d

al o St

St

0

al o

0

R

F ne

-R St

C

EE M

D

R FK -R D M EE C

R

F

R

0

an d

50

St

100 10

R

20

Figure 6: Box-plots of absolute forecasted error |FE| (mm) registered by CEEMD-RF-KRR (i.e. CEEMD-RF-KRRPolynomial) vs. CEEMD-RF, Standalone RF, and Standalone KRR (i.e. Standalone KRRGaussian) models for Site 1: Gilgit, Site 2: Muzaffarabad, and Site 3: Parachinar.

Site 1: Gilgit

1

Site 2: Muzaffarabad

1

0.8

0.8

0.6

0.6

0.6

ECDF

ECDF

ECDF

0.8

0.4 CEEMD-RF-KRR CEEMD-RF Standalone RF Standalone KRR

0.2

0

0.4

0.4

0.2

0.2

0 0

20

40

60

80

0

200

400

|FE| (mm)

|FE| (mm)

Site 3: Parachinar

1

600

0

0

100

200

|FE| (mm)

Figure 7: Empirical cumulative distribution function (ECDF) of the absolute forecasting error, |FE| (mm) at the testing stations using CEEMD-RF-KRR (i.e. CEEMD-RF-KRRPolynomial) vs. CEEMD-RF, standalone RF, and standalone KRR (i.e. Standalone KRRGaussian) models.

300

Figure 8: A comparison of the correlation of coefficient (R) for P forecasting generated by the hybrid CEEMD-RF-KRR (i.e. CEEMD-RF-KRRPolynomial) vs. the CEEMD-RF, Standalone KRR (i.e. Standalone KRRGaussian), and Standalone RF model.

List of Tables Table 1: Descriptive statistics of the study locations. Geographic characteristics

Climatological statistics (1962–2013) Precipitation P (mm)

Study Sites

Longitude (oE)

Latitude (oN)

Elevation (m)

Mean

Max

Min

Std.

Skew

Kurt

Gilgit

74.841

35.281

2437

12.35

127.00

0.06

16.12

2.83

10.12

Muzaffarabad

73.471

34.359

679

127.00

721.00

0.00

113.19

1.87

4.97

Parachinar

70.100

33.899

1705

74.79

424.60

0.50

61.39

1.68

4.33

Table 2: Decomposed IMFs and residuals for each study site using CEEMD method.

Study Sites Gilgit Muzaffarabad Parachinar

Noise Standard Deviation

No. of Realization

No. of Maximum Iteration

0.2 0.2 0.2

500 500 500

5000 5000 5000

No. of decomposed signals (IMFs & Res.) 10 10 9

Table 3: Parameters used in the development of CEEMD-RF-KRR model vs. CEEMD-RF, standalone KRR, and standalone RF models. The choices of Kernel types were: polynomial, linear and Gaussian. The results of the best i.e., Gaussian are presented. CEEMD-RF-KRR Study Sites

Gilgit Muzaffarabad Parachinar

Kernel type Gaussian Gaussian Gaussian Linear Linear Linear Polynomial Polynomial Polynomial

CEEMD-RF

standalone KRR

No. trees

No. split predictor

Kernel type

1000

5

Gaussian Gaussian

1000

5

1000

5

Gaussian Linear Linear Linear Polynomial Polynomial Polynomial

standalone RF No. trees

No. split predictor

1000

2

1000

2

1000

2

Table 4: Testing performance of CEEMD-RF-KRR (i.e., CEEMD-RF-KRRPolynomial) vs. CEEMD-RF, Standalone KRR and Standalone RF models measured by root mean square error (RMSE), mean absolute error (MAE), Pearson’s correlation coefficient (R). Gilgit

Sites

MAE

Models

RMSE (mm)

Standalone RF

8.99

6.24

Standalone KRRLinear

17.20

Standalone KRRPolynomial

Muzaffarabad RMSE

MAE

(mm)

(mm)

0.826

60.70

44.79

10.68

0.154

124.30

19.82

12.91

0.047

Standalone KRRGaussian

14.25

9.95

CEEMD-RF

3.86

CEEMD-RF-KRRLinear

Parachinar RMSE (mm)

MAE (mm)

R

0.858

39.04

29.08

0.880

88.67

0.273

72.60

51.88

0.457

222.07

195.18

0.216

106.22

82.72

0.299

0.364

99.43

71.40

0.372

63.60

47.34

0.505

2.95

0.974

35.51

26.37

0.957

22.22

16.46

0.962

5.54

4.04

0.952

55.67

42.99

0.883

51.60

40.35

0.795

CEEMD-RF-KRRPolynomial

2.52

1.98

0.986

25.76

19.99

0.971

16.58

13.65

0.976

CEEMD-RF-KRRGaussian

5.22

3.43

0.982

37.28

25.03

0.978

22.94

15.73

0.977

R

(mm)

R

Table 5: Performance of CEEMD-RF-KRR vs. CEEMD-RF, Standalone KRR and Standalone RF models using Willmott’s index (EWI), Nash-Sutcliffe (ENS) and LegatesMcCabe’s (ELM) agreement. Note that the best model is boldfaced (blue). Gilgit

Sites

Muzaffarabad

Parachinar

EWI

ENS

ELM

EWI

ENS

ELM

EWI

ENS

ELM

Standalone RF

0.565

0.644

0.403

0.649

0.674

0.416

0.711

0.720

0.486

Standalone KRRLinear

0.387

-0.303 -0.023

0.411

-0.367 -0.156

0.494

0.021

0.075

Standalone KRRPolynomial

0.407

-0.730 -0.236

0.391

-3.363 -1.545

0.463

-1.094 -0.474

Standalone KRRGaussian

0.088

0.105

0.047

0.161

0.125

0.068

0.274

0.249

0.156

CEEMD-RF

0.923

0.934

0.717

0.887

0.888

0.656

0.917

0.909

0.709

CEEMD-RF-KRRLinear

0.820

0.864

0.612

0.683

0.725

0.439

0.361

0.511

0.286

CEEMD-RF-KRRPolynomial

0.966

0.972

0.810

0.938

0.941

0.739

0.950

0.950

0.759

Models

CEEMD-RF-KRRGaussian

0.848

0.879

0.670

0.872

0.876

0.673

0.911

0.903

Table 6: Geographic comparison of the accuracy of the CEEMD-RF-KRR vs. CEEMD-RF, Standalone KRR and Standalone RF models in terms of relative mean absolute error (RRMSE, %) computed within the test sites. Note that the best model is boldfaced (blue). Sites

Gilgit

Muzaffarabad

Parachinar

Models

RRMSE (%)

RRMSE (%)

RRMSE (%)

Standalone RF

68.4

50.7

42.8

Standalone KRRLinear

130.9

103.7

79.5

Standalone KRRPolynomial

150.8

185.3

116.4

Standalone KRRGaussian

108.5

82.9

69.7

CEEMD-RF

29.4

29.6

24.4

CEEMD-RF-KRRLinear

42.2

46.4

56.6

CEEMD-RF-KRRPolynomial

19.2

21.5

18.2

CEEMD-RF-KRRGaussian

39.7

31.1

25.16

0.721

Competing interests The authors declare that they have no conflict of interest. MA, RP, YX and ZMY

Credit Author Statement MA and RP designed the experiment, carried out the experimentation and prepared the manuscript. YX and ZMY provided the expert advice and carried out the proofreading work.

Highlights 

Complete ensemble empirical mode decomposition-based time series decomposition.



Ensemble model developed with CEEMD, random forest and Kernel Ridge Regression.



Accurate rainfall forecasts generated at 3 candidate Sites in Pakistan.



Hybrid CEEMD-RF-KRR outperformed CEEMD-RF and standalone KRR and RF models.



Hybrid CEEMD-RF-KRR model can be a useful tool for water resource management.