Marine Pollution Bulletin 152 (2020) 110902
Contents lists available at ScienceDirect
Marine Pollution Bulletin journal homepage: www.elsevier.com/locate/marpolbul
Baseline
A novel approach to predict chlorophyll-a in coastal-marine ecosystems using multiple linear regression and principal component scores
Jayaseelan Benjamin Franklin a,⁎, Thadikamala Sathish a,1,2, Nambali Valsalan Vinithkumar a, Ramalingam Kirubagaran b
a Atal Centre for Ocean Science and Technology, National Institute of Ocean Technology, Ministry of Earth Sciences, Government of India, Port Blair 744103, India
b Marine Biotechnology Division, Ocean Science and Technology for Islands, National Institute of Ocean Technology, Ministry of Earth Sciences, Government of India, Pallikaranai, Chennai 600100, India
ARTICLE INFO
Keywords: Chlorophyll-a; Multiple linear regression analysis; Principal component analysis; Seawater quality; Prediction; Mathematical modeling
ABSTRACT
Chlorophyll-a is an established index marker for phytoplankton abundance and biomass amongst primary food producers in aquatic ecosystems. Understanding and modeling the level of chlorophyll-a as a function of environmental parameters is highly beneficial for the management of coastal ecosystems. This study developed a mathematical model to predict chlorophyll-a concentrations based on a data-driven modeling approach. The prediction model was developed using principal component analysis (PCA) and multiple linear regression analysis (MLR). The predictive success (R²) of the model was ~84.8% for the first approach and ~83.8% for the second approach. A final model was generated using a combined principal component scores (PCS) and MLR approach that involves fewer parameters and has a predictive ability of 83.6%. The PCS-MLR method helped to identify the relationships amongst the dependent and predictor variables and eliminated collinearity problems. The final model is simple and intuitive and can be used to understand real system operations.
Chlorophyll-a (Chl-a) is a photosynthetic pigment found in all plants, including free-floating algae (phytoplankton) (Wetzel, 2001). Phytoplankton are the primary food producers in aquatic food chains and their growth is directly dependent on the availability of nutrients. Hence, phytoplankton density, which is typically measured as the concentration of Chl-a, represents a vital indicator of nitrification (Riegl et al., 2014). The accelerated production of phytoplankton in coastal-marine water is generally the result of excess nutrients, especially total nitrogen and phosphorus (Smith, 2006). Therefore, estimates of Chl-a levels in coastal-marine waters could be used as an indicator for both the direct (Wiedenmann et al., 2013) and indirect (Fabricius et al., 2010) negative effects on seawater quality characteristics and trophic state conditions. Measuring and predicting the levels of Chl-a are vital components of seawater quality monitoring and coastal management programs. To achieve sustainable development that is accompanied by environmental preservation and protection of vulnerable habitats, it is necessary to analyze the factors leading to changes and to predict those changes early enough to facilitate appropriate countermeasures or alternatives (Kim and Park, 2013).
⁎ Corresponding author at: Bombay Natural History Society, Mumbai 400 001, India.
E-mail address: [email protected] (J.B. Franklin).
1 Present address: Koringa College of Pharmacy, Korangi-533 461, India.
2 Joint First Author.
https://doi.org/10.1016/j.marpolbul.2020.110902
Received 24 September 2019; Received in revised form 7 January 2020; Accepted 8 January 2020; Available online 14 January 2020
0025-326X/© 2020 Elsevier Ltd. All rights reserved.

Modeling of Chl-a levels in lakes and reservoirs is considered one of the most important analytical tools for biological and ecological investigations (Tufford and McKeller, 1999; Chen, 1970; Di Toro et al., 1971). Strong relationships between phosphorus, chlorophyll and water clarity have been observed and reported for freshwater systems around the world (Sakamoto, 1966; Brown et al., 2000). The application of an empirical model to facilitate nutrient-related management strategies of marine systems was first considered by Vollenweider et al. (1992), then by Meeuwig et al. (2000), and was further developed by Hoyer et al. (2002). These studies clearly suggested the huge potential of empirically based models that can predict the consequences of increased nutrient loading in estuaries and coastal-marine waters. Since then, various empirical models and approaches have been employed by several researchers to understand primary production in aquatic environments (Nogueira et al., 1998; Elliot, 2000; Wehde et al., 2001; Camdevyren et al., 2005). Few studies, though, have defined the relationship between Chl-a concentrations and biotic and other corresponding factors using univariate and multivariate linear or nonlinear approaches (Wuttichaikitcharoen and Babel, 2014). Multiple regression analysis is an important method to correlate the dependency of critical variables with other parameters. Although some regression models yield higher R² values, they fail to define the complex nature of the ecosystem (Camdevyren et al., 2005) owing to multicollinearity of the variables in the model. Multicollinearity between predictor variables could lead to erroneous identification of the most important predictors (Sharma, 1996; Hoe and Kim, 2004). Principal component analysis (PCA) is a familiar method used extensively to find the relationship between various biotic and abiotic parameters; it avoids multicollinearity problems and helps researchers to develop better prediction models (Camdevyren et al., 2005; Wuttichaikitcharoen and Babel, 2014).

Port Blair bay is an extension of the Andaman Sea that begins from the eastern side of the Andaman and Nicobar Islands, extends as narrow stretches towards the southwest and ends in Flat bay (Fig. 1). Port Blair city is bounded mostly by the bay on its southern and western sides. There are several small bays within Port Blair bay. In recent years, the ecological status of these bays has been negatively impacted as a result of the frequent occurrence of natural environmental stressors and unchecked anthropogenic activities due to rising local human populations. Developments in urbanization, tourism and shipping are the major causes of declining water quality in Port Blair bay. The bays towards the head end of Port Blair bay exhibit limited water exchange with the open sea. Therefore, the ability to withstand environmental stressors in this area is very likely to be diminished, thereby causing declining conditions in both abiotic and biotic systems. Anthropogenic discharge elevates nutrient concentrations and frequently stimulates algal blooms. These blooms are a common phenomenon in coastal waters of the Andaman and Nicobar Islands (Kumar et al., 2012; Karthik and Padmavati, 2014; Sachithanandam et al., 2013; Sahu et al., 2014; Begum et al., 2015). However, with an improved understanding of aquatic ecosystem processes, statistical tools and advanced computing capabilities, process-based models can now be used to address water quality problems in other parts of the world.

Fig. 1. Map of Port Blair bay showing the selected sites for sampling.

The aim of this study is to develop a multivariate statistical approach to classify predictor variables according to their inter-relations and to predict Chl-a levels in coastal-marine ecosystems. The predictive model for Chl-a levels was developed using principal component analysis (PCA) and multiple linear regression analysis (MLR). Two approaches were followed to identify, by MLR, the real parameters that influence Chl-a levels in marine coastal waters. To predict the Chl-a levels, real variables were first used as input variables in MLR, and then principal component scores (PCS) were used as independent variables in MLR (PCS-MLR). A PCS-MLR data-driven Chl-a model was developed and validated. To the best of our knowledge based on surveys of the literature, this is the first attempt at predicting Chl-a levels using the PCS-MLR approach in coastal-marine ecosystems.

We utilized 20 variables from 64 observations that were made during four seasons in the year 2016. Surveys were conducted at seven stations, namely Flat bay, Minnie bay, Junglighat bay, Haddo harbour, Phoenix bay, Aberdeen bay and the Open Sea (Fig. 1). Seawater samples were collected using a trawler and selected parameters were analyzed in the laboratory. The point of sample collection at each bay or the Open Sea was positioned approximately 1 to 2 km from the coastline. Seawater quality data comprise 20 parameters: seawater temperature (Tsw), pH, transparency (TRANS), salinity (SAL), total suspended solids (TSS), dissolved oxygen (DO), nitrite-nitrogen (NO2-N), nitrate-nitrogen (NO3-N), ammonia-nitrogen (NH3-N), total nitrogen (TN), inorganic phosphate (IP), total phosphorus (TP), reactive silicate (SiO4), Enterococcus faecalis (EF), Escherichia coli (EC), total viable counts (TVC), zooplankton (ZOO), phytoplankton (PHY), pheophytin (PHE) and Chl-a.

Tsw and pH were measured in situ using a calibrated thermometer with ± 0.1 °C accuracy (Brannan, Cleator Moor, UK) and a pH meter (Thermo Orion 420 A plus, Waltham, Massachusetts, USA). TRANS was measured from Secchi disc depth. DO was analyzed by Winkler's method (Grasshoff et al., 1999) and SAL was estimated by the Mohr-Knudsen argentometric titration method (Strickland and Parsons, 1977). TSS was determined by filtering 1 L of seawater through pre-dried and pre-weighed 0.45 μm pore-size filter paper (Merck Millipore Ltd., Delhi, India). Filtered samples were used to analyze nutrient concentrations (i.e., NO2-N, NO3-N, NH3-N, IP and SiO4) (Grasshoff et al., 1999). Unfiltered samples were used to determine TN and TP. The concentration of nutrients in seawater was measured on a lambda-25 UV/Visible spectrophotometer with an accuracy of 0.004 μM/L and MDL between 0.004 μM/L and 100 μM/L (Perkin Elmer, Waltham, Massachusetts, USA) (Grasshoff et al., 1999). Data quality was checked by careful standardization, procedural blank measurements and duplicate samples. ZOO samples were identified and counted in a Sedgewick-Rafter plankton counting chamber under a zoom stereo microscope (Nikon SMZ800, Tokyo, Japan) and a phase contrast microscope (Nikon Eclipse Ni-U, Tokyo, Japan) (Kasturirangan, 1963; Santhanam and Srinivasan, 1994). PHY was identified and counted under the same zoom stereo and phase contrast microscopes (Utermöhl, 1931; Santhanam et al., 1987). Chl-a and PHE concentrations were measured on the lambda-25 UV/Visible spectrophotometer with an accuracy of 0.004 μM/L and MDL between 0.004 μM/L and 100 μM/L (USEPA, 2002a). For enumeration and isolation of health indicator bacteria, the membrane filtration method of the United States Environmental Protection Agency (USEPA, 2002b) and the Central Pollution Control Board, India, was applied. Further, biochemical and sugar fermentation tests were done to identify EC and EF (Holt et al., 1994). TVC was estimated using Zobell Marine Agar by the spread plate method. The results were expressed as colony forming units (CFU)/100 mL for each water sample.

Data were arranged as a 20 × 64 matrix (20 variables with 64 observations) for further transformation and analysis. The overall data set was checked for any missing observations. Because some of the variables have very high or very low values, a combination of log-scaling and standardization was applied to the data to avoid heteroscedasticity. The data were first log transformed and then standardized by the Z-score transformation method. The pretreated data were used for PCA and MLR analysis. The statistical analyses were carried out using SPSS version 23.0. The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett's test of sphericity were used to verify the applicability of PCA to the data. The KMO compares the magnitudes of the observed correlation coefficients with the magnitudes of the partial correlation coefficients (Ul-Saufie et al., 2011). Values above 0.5 imply that the data set is suitable for PCA or factor analysis (FA) (Kaiser, 1974). The PCA technique extracts the eigenvalues and eigenvectors from the covariance matrix of the original variables. The eigenvectors correspond to principal components (PCs) and the eigenvalues represent the variance explained by the PCs. The PCs are uncorrelated (orthogonal) variables, obtained by multiplying the original correlated variables by the eigenvector, which is a list of coefficients (i.e., loadings or weightings). Hence, PCs are weighted linear combinations of the original variables. The contribution of each variable to a PC is represented by its factor loading. Each PC provides data on the most expressive parameters, which describe the whole data set and afford data reduction with the least loss of original information. To maximize the difference in factor loadings on each PC, a VARIMAX rotation procedure was employed (Terdalkar and Pai, 2001). These loading values were used to group variables in PCs. Score values were calculated from the weights of the variables in the PCs and the standardized variables by using Eq. (1):
$$S_{kj} = t_{1k} z_{1j} + t_{2k} z_{2j} + \dots + t_{pk} z_{pj} \tag{1}$$

In the above equation, $S_{kj}$ = the score value for the jth observation in the kth PC; j = 1, 2, …, n indexes the observations; k = 1, 2, …, q indexes the selected PCs; p = the number of independent variables; $z_{ij}$ = the standardized value of variable i at the jth observation; and $t_{ik}$ = the weight of variable i in the kth PC. In the present work, two PCAs were used to understand the correlation of Chl-a with other parameters. The first PCA consists of 20 parameters including Chl-a and the second PCA consists of 19 parameters without Chl-a. In the second PCA, PCS were used to develop the mathematical model for prediction of Chl-a. MLR analysis is one of the modeling techniques used to investigate the relationship between a dependent variable and several independent variables (Camdevyren et al., 2005). In this study, 19 variables were considered as independent variables and Chl-a was selected as the dependent variable. The general form of MLR is presented in Eq. (2):

$$Y = \beta_0 + \sum_{i=1}^{n} \beta_i X_i + \varepsilon \tag{2}$$

In the above equation, n = the number of variables; $\beta_0$ = the constant; $\beta_i$ = the regression coefficients; and $\varepsilon$ = the error term of the model. Out of 19 variables, only a few showed little or no correlation with Chl-a levels. Nevertheless, these insignificant variables may influence the quality and stability of the fitted model. Therefore, stepwise regression was employed to exclude extraneous variables from the model and to identify the variables most closely linked with Chl-a levels. Principal component scores (PCS) combined with MLR is based on the hypothesis that the total variance of each parameter is stored in the PCS. This method is primarily utilized in air pollution studies (Swlethcki and Krejci, 1996). In addition, this method has also been used for surface water quality of river ecosystems to estimate the contribution of identified sources to the concentration of the physicochemical parameters (Camdevyren et al., 2005). In this model, PCS were used as input variables rather than real variables by using Eq. (3):
$$Y = \beta_0 + \sum_{i=1}^{n} \beta_i\,(\mathrm{PCS})_i + \varepsilon \tag{3}$$

In the above equation, PCS = the principal component score; n = the number of PCS employed; $\beta_0$ = the constant; $\beta_i$ = the regression coefficients; and $\varepsilon$ = the error term of the model. In this method, the scores of the second PCA (excluding Chl-a) were used as input variables. Stepwise regression was used in the selection of PCS. Fig. 2 illustrates the complete model approach used in this study.

Fig. 2. Flow diagram of the development of Chl-a prediction model.

In the first PCA, the KMO value of 0.660 indicates that the data set is suitable for PCA. In addition, the χ² value of 669.40 from Bartlett's sphericity test (d.f. = 190, P < .001) implies that PCA is applicable to the current data set. Only seven out of 20 PCs had an eigenvalue > 1 and so were selected. These chosen PCs explained 76.65% of the total variation of the variables (Table 1). The first two PCs explained a total variability of 31.38%. The loadings of the selected PCs are presented in Table 1. The high factor loading of Chl-a indicates its significance relative to the other parameters studied. In PC1, Chl-a, PHY, PHE and TRANS had the highest factor loadings. Nevertheless, TRANS, with a high factor loading in PC1, had a negative correlation with the other parameters. Similarly, SF, EC, TVC and TSS represented PC2. PC3 was represented by SiO4, NO3-N, SAL and Tsw, in which Tsw and SAL had a negative correlation with SiO4 and NO3-N. pH and TN had significant loadings in PC4, while ZOO was negatively correlated. PC5 was represented by NO2-N and TP, while IP possessed a significant loading only in PC6. In PC7, NH3-N and DO showed the highest factor loadings. Similarly, DO has significant factor loading in PC1.

For the second PCA, the KMO value (0.674) and the χ² value (555.10) from Bartlett's sphericity test (d.f. = 171, P < .001) indicate that the data are suitable for PCA/FA. The aim of the second PCA was to use the PCA scores for MLR by extracting all PCs, following Camdevyren et al. (2005). In this PCA, 12 of 19 PCs had an eigenvalue higher than 1 (Table 2). Because PC scores were more important for MLR than factor loadings and eigenvalues, the scores were used for further analysis.

The log transformed and standardized data were used for MLR. The Chl-a level was used as the dependent variable and all other parameters were considered as independent variables in the stepwise MLR. The results from both data sets are comparable (Table 3). PHE, TVC, TRANS and DO were found to be significant model terms. TRANS and DO had a negative relationship with Chl-a levels, while PHE and TVC showed a positive relationship. Eqs. (4) and (5) represent the models derived for Chl-a determination using log transformed and standardized data, respectively. These equations explained 84.8% of the variation in the data set.
$$\log(\text{Chl-}a) = -0.2218 + 0.8480\,\log(\mathrm{PHE}) + 0.1934\,\log(\mathrm{TVC}) - 0.2068\,\log(\mathrm{TRANS}) - 0.2949\,\log(\mathrm{DO}) \tag{4}$$
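As an illustration, Eq. (4) can be evaluated directly once the four predictors have been measured. The sketch below is not part of the original analysis: it assumes base-10 logarithms (the text does not state the base), takes the coefficients from Table 3, and uses a hypothetical function name and hypothetical input values.

```python
import math

# Coefficients of Eq. (4) as reported in Table 3 (log transformed data).
EQ4_INTERCEPT = -0.22184
EQ4_COEFFS = {"PHE": 0.848033, "TVC": 0.193486, "TRANS": -0.20685, "DO": -0.29491}


def predict_log_chla_eq4(phe, tvc, trans, do):
    """Return log(Chl-a) from Eq. (4); base-10 logarithms are an assumption."""
    values = {"PHE": phe, "TVC": tvc, "TRANS": trans, "DO": do}
    return EQ4_INTERCEPT + sum(
        coef * math.log10(values[name]) for name, coef in EQ4_COEFFS.items()
    )


# Hypothetical inputs, expressed in the units of the original measurements.
log_chla = predict_log_chla_eq4(phe=1.0, tvc=1.0, trans=1.0, do=1.0)
chla = 10 ** log_chla  # back-transform to the original Chl-a scale
```

With all predictors equal to 1 the logarithms vanish and the prediction reduces to the intercept; increasing TRANS or DO lowers the prediction, in line with their negative coefficients.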
Table 1. Eigenvalues and factor loadings of PCA of 20 variables including Chl-a. Values marked in bold indicate maximum correlation between variables and their corresponding components.

Variable                      PC1      PC2      PC3      PC4      PC5      PC6      PC7
Eigenvalue                 3.3244   2.9529   2.6293   2.2051   1.6244   1.3409   1.2541
Variance (%)              16.6221  14.7645  13.1465  11.0255   8.1219   6.7043   6.2703
Cumulative variance (%)   16.6221  31.3866  44.5331  55.5586  63.6805  70.3848  76.6551
TRANS                     −0.6057  −0.5307  −0.0347   0.1272   0.0370   0.0254   0.2393
WT                         0.5556  −0.2012  −0.6541  −0.1402  −0.2159   0.1045   0.0165
SAL                       −0.0603  −0.1936  −0.7809   0.1735   0.2369  −0.2135   0.0932
pH                        −0.1988   0.0893   0.2306   0.6718  −0.0131   0.4032  −0.1764
TSS                        0.0529   0.5518  −0.0592   0.5513   0.1741   0.1830  −0.3017
DO                         0.0854  −0.0880  −0.0787  −0.1077   0.1758   0.0974   0.8557
NO2-N                      0.2990   0.1191   0.4382   0.1143  −0.6436   0.2443   0.0032
NO3-N                      0.0717   0.0334   0.6937   0.3550  −0.0349   0.0569   0.0350
NH3-N                     −0.3469   0.3161  −0.0519  −0.1545  −0.2382  −0.4181   0.4905
TN                         0.0112   0.2831  −0.0367   0.6618   0.2104  −0.2096  −0.1507
SiO4                      −0.0167   0.0940   0.7975  −0.0336  −0.2508  −0.0500  −0.0662
IP                         0.0262   0.2333   0.0644   0.0296  −0.1195   0.8607   0.0822
TP                         0.2138  −0.0999  −0.1793   0.0270   0.7856   0.0017   0.1346
SF                         0.3706   0.7377   0.0862   0.0609  −0.3448   0.1951   0.0260
TVC                        0.1747   0.7676   0.2546  −0.0361   0.0968   0.0360  −0.0136
EC                         0.0770   0.8492   0.1159   0.1380  −0.1690   0.0866   0.0378
ZOO                       −0.0134   0.1800  −0.0547  −0.8612   0.2214  −0.0226  −0.0869
PHY                        0.8131   0.2000  −0.2468  −0.0337   0.2037   0.0728  −0.0376
PHE                        0.8647   0.0329   0.0794  −0.0092  −0.0275  −0.0173   0.1632
Chl-a                      0.8667   0.2762   0.1485   0.0409   0.0029   0.0163  −0.0427
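The variance percentages in Table 1 follow from the eigenvalues: because the variables were standardized, each of the 20 variables contributes one unit of variance, so a component's share is its eigenvalue divided by 20. The short check below is a worked illustration using the tabulated eigenvalues; the variable names are chosen here for readability and are not from the original analysis.

```python
# Eigenvalues of the seven retained components, copied from Table 1.
eigenvalues = [3.3244, 2.9529, 2.6293, 2.2051, 1.6244, 1.3409, 1.2541]
n_variables = 20  # the first PCA includes Chl-a

# Share of total variance carried by each component, in percent.
variance_pct = [100.0 * ev / n_variables for ev in eigenvalues]

# Running total, matching the "Cumulative variance (%)" row.
cumulative_pct = []
running_total = 0.0
for pct in variance_pct:
    running_total += pct
    cumulative_pct.append(running_total)
```

The recomputed values agree with the table to within rounding of the printed eigenvalues, for example about 16.62% for PC1 and about 76.66% cumulatively for PC1 to PC7.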
$$\mathrm{Stz}_{\text{Chl-}a} = -7.9 \times 10^{-7} + 0.7384\,\mathrm{Stz}_{\mathrm{PHE}} + 0.2554\,\mathrm{Stz}_{\mathrm{TVC}} - 0.18526\,\mathrm{Stz}_{\mathrm{TRANS}} - 0.1135\,\mathrm{Stz}_{\mathrm{DO}} \tag{5}$$

Table 2. Eigenvalues and factor loadings of PCA of 19 variables without Chl-a. Values marked in bold indicate maximum correlation between variables and their corresponding component. Rows are components and columns are variables.

Component  Eigenvalue  Total variance (%)  Cumulative variance (%)
PC1        1.601       8.426               8.426
PC2        1.232       6.483               14.909
PC3        1.134       5.966               20.875
PC4        1.118       5.884               26.759
PC5        1.091       5.743               32.502
PC6        1.086       5.718               38.220
PC7        1.079       5.677               43.897
PC8        1.067       5.618               49.514
PC9        1.059       5.573               55.087
PC10       1.052       5.538               60.625
PC11       1.037       5.460               66.085
PC12       1.012       5.325               71.410
PC13       0.989       5.207               76.617
PC14       0.970       5.103               81.720
PC15       0.868       4.570               86.291
PC16       0.863       4.540               90.830
PC17       0.791       4.162               94.993
PC18       0.724       3.810               98.803
PC19       0.227       1.197               100.000

PC      TRANS      WT     SAL      pH     TSS      DO   NO2-N   NO3-N   NH3-N      TN    SiO4      IP      TP      SF     TVC      EC     ZOO     PHY     PHE
PC1   −0.2770 −0.0496 −0.1124  0.1002  0.2688 −0.0194  0.2110  0.0745  0.0876  0.0980  0.0507  0.1139 −0.0840  0.6225  0.3342  0.9070  0.0248  0.0931  0.0742
PC2   −0.2742  0.1828 −0.0361 −0.0380  0.0206  0.0734  0.1434  0.0413 −0.0527 −0.0340  0.0037  0.0182  0.0682  0.1881  0.0167  0.0419  0.0225  0.3879  0.9460
PC3   −0.0333 −0.2207 −0.3387  0.0532  0.0412 −0.0383  0.3045  0.1616  0.0229 −0.0389  0.8999 −0.0027 −0.1135  0.0655  0.0790  0.0433 −0.0019 −0.0835  0.0123
PC4   −0.0252  0.0102  0.1650 −0.0521  0.0897  0.0836 −0.2052 −0.0677 −0.0586  0.0194 −0.1276 −0.0439  0.9574 −0.1862 −0.0115 −0.0544  0.0134  0.2131  0.0598
PC5   −0.0378  0.0271 −0.1428  0.1543  0.1198  0.0185  0.1552  0.0731 −0.0549 −0.0490 −0.0127  0.9671 −0.0457  0.2044  0.0788  0.0864 −0.0179  0.0414  0.0172
PC6   −0.1665 −0.0935 −0.1092 −0.0143  0.1175 −0.0099  0.0806  0.0901  0.0170  0.0966  0.0695  0.0663 −0.0137  0.3233  0.8859  0.2318  0.0768  0.2027  0.0090
PC7   −0.0165 −0.0685  0.0690  0.0909  0.2780 −0.0424 −0.0581  0.0860 −0.0773  0.9471 −0.0398 −0.0437  0.0195  0.0427  0.1172  0.1053 −0.1789  0.0615 −0.0384
PC8   −0.0730 −0.0020 −0.0621 −0.2731 −0.1476  0.0259 −0.1716 −0.1632  0.0176 −0.1684 −0.0025 −0.0190  0.0130  0.0040  0.0869  0.0185  0.9306  0.0379  0.0184
PC9    0.0366 −0.1444  0.0427 −0.1279 −0.0473  0.0700 −0.0552 −0.0678  0.9693 −0.0823  0.0327 −0.0558 −0.0610  0.0608  0.0192  0.0943  0.0160 −0.1840 −0.0478
PC10   0.0091 −0.2051 −0.1958  0.1079 −0.0415 −0.0452  0.1100  0.9210 −0.0579  0.0798  0.1653  0.0627 −0.0603  0.1088  0.0991  0.0519 −0.1543 −0.0400  0.0407
PC11   0.0999  0.0766  0.0313 −0.1032 −0.1462  0.9776 −0.0509 −0.0525  0.0724 −0.0441 −0.0455  0.0197  0.0886 −0.0396 −0.0114 −0.0113  0.0272  0.0240  0.0832
PC12  −0.1101  0.8479  0.1842 −0.1758 −0.0442  0.0515  0.0846 −0.1582 −0.0983 −0.0463 −0.1828  0.0180  0.0139  0.1226 −0.0836 −0.0790 −0.0029  0.3001  0.1222
PC13   0.0860 −0.1807 −0.0403  0.8585  0.3165 −0.0746  0.0589  0.0862 −0.0904  0.0722  0.0444  0.1107 −0.0338  0.0255 −0.0097  0.0937 −0.2097 −0.1063 −0.0214
PC14  −0.8483  0.1065 −0.1311 −0.0885  0.1916 −0.0655  0.0606 −0.0062 −0.0254  0.0113  0.0226  0.0280  0.0194  0.1815  0.1420  0.1774  0.0549  0.2134  0.1828
PC15   0.1174  0.1618  0.8138 −0.0389  0.0284  0.0207 −0.2121 −0.1371  0.0260  0.0446 −0.2288 −0.0867  0.1041 −0.0893 −0.0865 −0.0649 −0.0421  0.0637 −0.0276
PC16  −0.0549  0.0744 −0.2095  0.0540 −0.0428 −0.0311  0.8104  0.0802 −0.0325 −0.0376  0.2069  0.0935 −0.1263  0.1859  0.0595  0.1100 −0.1111  0.0019  0.0930
PC17  −0.1504 −0.0414  0.0246  0.2277  0.7874 −0.0791 −0.0385 −0.0242 −0.0279  0.1525  0.0328  0.0682  0.0501  0.1162  0.0803  0.1481 −0.0899  0.0593  0.0070
PC18  −0.1435  0.1882  0.0462 −0.0734  0.0533  0.0112  0.0069 −0.0197 −0.0820  0.0301 −0.0459  0.0221  0.0962  0.2302  0.1138 −0.0071  0.0197  0.7397  0.1550
PC19  −0.0328  0.0229 −0.0195  0.0039  0.0270 −0.0055  0.0370  0.0171  0.0075  0.0059  0.0103  0.0248 −0.0236  0.4640  0.0368 −0.0414  0.0007  0.0526  0.0231

In PCS-MLR, the scores obtained in the second PCA (without Chl-a) were used as independent variables and log-transformed Chl-a was used as the dependent variable (Table 4). Stepwise MLR determined the PCS that had a significant correlation with Chl-a levels. Regression coefficients of the selected PCS are presented in Table 4. In this approach, six PCs (PC1, PC2, PC6, PC12, PC14 and PC18) showed a significant correlation with Chl-a. The coefficient of determination (R²) with these PCS was 0.838, which accounts for 83.8% of the variation in Chl-a abundance. The regression coefficients were all positive and significant, indicating that the PCS are associated with Chl-a levels. PC2 had the highest positive correlation with Chl-a, followed by PC14. PC2 represented PHE, which has a significant association with Chl-a levels. This is because PHE is a common degradation product of Chl-a and could interfere with the determination of Chl-a. PC14 represented TRANS, which had a negative association with Chl-a levels. In PC6, TVC had a significant relationship with Chl-a levels. Similar associations between Chl-a levels and the abundance of Vibrio species are well documented in the literature (Johnson et al., 2010). Similarly, PHY possessed a significant score only in PC18. Further, EC and SF were found to have a significant relationship with Chl-a in PC1. The score values of Tsw alone were significant in PC12. Based on this approach, the Chl-a values were determined using Eq. (6):

$$\log(\text{Chl-}a) = 0.222 + 0.262\,\mathrm{PC2} + 0.126\,\mathrm{PC14} + 0.102\,\mathrm{PC6} + 0.089\,\mathrm{PC18} + 0.051\,\mathrm{PC1} + 0.048\,\mathrm{PC12} \tag{6}$$

The PCS-MLR identified the most significant parameters associated with Chl-a abundance, and the logarithmic values of those parameters were linked with the Chl-a data. Further, the logarithmic values of seven parameters (TRANS, Tsw, SF, TVC, EC, PHY and PHE) were employed in stepwise regression, with log (Chl-a) as the dependent variable. Here too, PHE, TVC and TRANS were associated with Chl-a. This model explained 83.6% of the variation in Chl-a using PHE, TVC and TRANS. If the four excluded variables (Tsw, SF, EC and PHY) were included in the model, the coefficient of determination would rise to 84.6%, although this difference is not statistically significant. The regression coefficients obtained through this approach are presented in Table 5. The final regression model to predict Chl-a levels is given in Eq. (7):

$$\log(\text{Chl-}a) = -0.4340 + 0.8112\,\log(\mathrm{PHE}) + 0.1900\,\log(\mathrm{TVC}) - 0.2441\,\log(\mathrm{TRANS}) \tag{7}$$

The correlation coefficient (R) between the predicted Chl-a values and log Chl-a was found to be 0.912, which indicates a strong correlation between the log transformed experimental values and the model predicted Chl-a values. Fig. 3 describes the correlation between the observed and predicted Chl-a data. The idea behind PCR and MLR methods is to discard irrelevant and unstable parameters and to use only the most relevant variables for regression. Hence, the collinearity problem could be solved so that more stable regression equations and predictions could be obtained. The most common technique in statistics for detecting multicollinearity is the Variance Inflation Factor (VIF). The larger the VIF value, the more serious the collinearity problem. In practice, if any VIF value is equal to or larger than 10, there is near collinearity and the regression coefficients are not reliable. In this study, however, the low VIF values of 1.324 (PHE), 1.227 (TVC) and 1.582 (TRANS) indicate the absence of multicollinearity amongst the selected variables. The residual statistics are also an important factor in deciding the adequacy of the developed model. Importantly, the zero mean of the residuals, constant variance and absence of outliers signified the accuracy of the predicted model (Table 6). Therefore, the residual analysis confirmed applicability and validity of the simulated model
(Eq. (7)). The collective equation (Eq. (7)) can hence be used to predict Chl-a with limited parameters. It has been found that the incorporation of PCs as independent variables in regression models enhanced model prediction and reduced model complexity by eliminating multicollinearity (Polt and Gunay, 2015).

Table 3. Coefficients of regression analysis of log transformed and standardized data.

Term      Regression coefficient  Standardized regression coefficient  t-value     P-value
Log transformed data (R² = 0.8480)
Constant  −0.22184                –                                    −1.32614    0.189905
PHE       0.848033                0.738418                             12.25406    7.41 × 10^−n (a)
TVC       0.193486                0.255419                             4.540424    2.82 × 10^−05
TRANS     −0.20685                −0.18526                             −2.8199     0.006532
DO        −0.29491                −0.11354                             −2.14573    0.03602
Standardized data (R² = 0.8480)
Constant  −7.9 × 10^−07           –                                    −1.6E-05    0.999988
PHE       0.738417                0.738417                             12.25404    7.41 × 10^−n (a)
TVC       0.255419                0.255419                             4.540417    2.82 × 10^−05
TRANS     −0.18526                −0.18526                             −2.81991    0.006531
DO        −0.11354                −0.11354                             −2.14573    0.03602
(a) The exponent is given as −17 for one data set and −18 for the other.

Table 4. Coefficients of regression analysis of PCS with log Chl-a (R² = 0.838).

Term      Regression coefficient  Standardized regression coefficient  t-value     P-value
Constant  0.221649                –                                    11.66689    9.81 × 10^−17
PC2       0.262235                0.729956                             13.69497    1.13 × 10^−19
PC14      0.126458                0.352008                             6.60415     1.45 × 10^−08
PC6       0.102443                0.285158                             5.349961    1.62 × 10^−06
PC18      0.08931                 0.248603                             4.66413     1.92 × 10^−07
PC1       0.050933                0.141778                             2.659948    0.010131
PC12      0.048331                0.134534                             2.524039    0.014412

Table 5. Regression coefficients.

Term      Regression coefficient  Standardized regression coefficient  t-value     P-value
Constant  −0.43405                –                                    −3.12457    0.002742
PHE       0.811238                0.706379                             11.7519     3.34 × 10^−17
TVC       0.190044                0.250875                             4.334533    5.66 × 10^−05
TRANS     −0.24416                −0.21867                             −3.32753    0.0015

Table 6. Residual analysis.

Statistic                          Mean            Standard deviation  Minimum        Maximum
Predicted value                    0.221649        0.328514            −0.48309       0.866316
Standard error of predicted value  0.035946        0.009828            0.01935        0.064556
Adjusted predicted value           0.223826        0.329599            −0.52194       0.861247
Residual                           −3.3 × 10^−17   0.145389            −0.4767        0.247898
Cook's distance                    0.027294        0.093949            1.54 × 10^−5   0.728534
Mahal. distance                    2.953125        2.200721            0.078447       10.84519

Fig. 3. Correlation between the observed and regression model predicted Chl-a values.

Recently, PCS-MLR techniques have been widely used in diverse fields such as environmental studies, medical research and macro/microeconomic indices. For example, PCS-MLR has been used to: 1) predict concentrations of particulate matter of 10 μm (PM10) in air pollution studies (Ul-Saufie et al., 2011), 2) predict atmospheric ozone concentrations and model ground level ozone (Sousa et al., 2007; Ozbay et al., 2011), 3) predict sediment yield for land and water management in river basins (Wuttichaikitcharoen and Babel, 2014), 4) identify pollution sources and their contribution towards surface water quality variation (Mustapha and Abdu, 2012), 5) predict chlorophyll-a levels in reservoirs (Camdevyren et al., 2005), 6) assess arterial blood pressure (Kaur et al., 2012) and 7) forecast the Sarajevo Stock Exchange Index 10 (Rovčanin et al., 2015). Regardless of the focus, all these studies aim to develop better accuracy in predictive models and to eliminate multicollinearity problems. Nevertheless, it is consistently difficult to perform regression analysis using PC scores, because PCA requires statistical expertise as well as specific software, which are difficult to access in practice. Hence, our approach could be useful for non-experts. In the present study, subsequent to the PCS-MLR approach, the real variables were identified based on the developed regression Eqs. (6) and (7). The final predictive model equation (Eq. (7)) is simple to understand and can be used to predict Chl-a concentrations with limited parameters. Our goal here has been to offer some of this analytical power to end-users with minimal training in statistics.

End-users such as ecosystem managers, port authorities, non-government agencies and researchers could use this model to predict Chl-a concentrations with limited biotic and abiotic variables, and it could guide them towards a better understanding of the concepts underlying exploratory statistical modeling. This is the first study to develop a mathematical model based on the PCS-MLR approach to predict Chl-a levels in coastal-marine ecosystems, with a predictive power of 83.6% and a correlation of 0.912 between the observed and predicted values. This study improved the regression model by reducing the number of variables and eliminating non-significant variables and variables with only indirect effects on the predicted variable. The
Marine Pollution Bulletin 152 (2020) 110902
J.B. Franklin, et al.
predictive power attained from the present model and the data-driven approach using limited parameters promotes the validity of our model in predicting Chl-a levels in other bays, provided that there is an adequate data with comparable biotic and abiotic conditions. Ecosystem managers continue to describe potential methods to manage ecological systems with legal certainty that does not mesh well with ecological realities, but remedial actions are often sought when degradation has already impaired a number of functional interactions in the ecosystem. They fail to predict the factors/indicators responsible for the complex functioning of ecological systems. Nevertheless, the strength of management is in the recognition and prediction of such uncertainties. Therefore, innovative solutions are required with moderate data requirements, manageable complexity and merits in a clear and simple manner to forecast the factors/indicators responsible for changes in ecosystem functions to reduce risks of further ecosystem degradation. In effect, the model or approach developed here could be a valuable tool to access and regulate Chl-a levels in coastal-marine ecosystems.
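For readers who want to try the general PCS-MLR workflow discussed above, the following is a minimal sketch, not the authors' implementation and not a reproduction of Eqs. (6) and (7). It standardizes the predictors, extracts principal component scores from their correlation matrix, regresses the response on the leading scores, and maps the fitted coefficients back onto the standardized original variables. The function name `pcs_mlr`, the synthetic data and the choice of three components are illustrative assumptions only.

```python
import numpy as np


def pcs_mlr(X, y, n_components):
    """Regress y on the leading principal component scores of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Standardize predictors so that PCA operates on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Eigen-decomposition of the correlation matrix, sorted by descending eigenvalue.
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]                      # shape (p, k)
    explained = eigvals[order].sum() / eigvals.sum()  # share of variance retained

    # PC scores are mutually uncorrelated, which removes collinearity
    # among the regressors.
    scores = Z @ loadings                             # shape (n, k)

    # Ordinary least squares of y on the scores, with an intercept.
    A = np.column_stack([np.ones(len(y)), scores])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ gamma
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    # Express the model in terms of the standardized original variables.
    beta_std = loadings @ gamma[1:]
    return {
        "intercept": gamma[0],
        "beta_std": beta_std,
        "r2": r2,
        "explained_variance": explained,
        "scores": scores,
    }


# Synthetic example (assumed data): two of the four predictors are nearly collinear.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))
X = np.column_stack([
    latent[:, 0],
    latent[:, 0] + 0.05 * rng.normal(size=n),  # near-duplicate of column 0
    latent[:, 1],
    rng.normal(size=n),                        # unrelated predictor
])
y = 2.0 * latent[:, 0] - 1.0 * latent[:, 1] + 0.3 * rng.normal(size=n)

fit = pcs_mlr(X, y, n_components=3)
print("R2:", round(fit["r2"], 3))
print("variance retained:", round(fit["explained_variance"], 3))
print("coefficients on standardized variables:", np.round(fit["beta_std"], 3))
```

Retaining all components reproduces an ordinary MLR fit on the standardized predictors. Dropping the low-variance components is what trades a small loss in fit for stable coefficients when predictors are collinear.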
Contributors
The research concept and the experiments were achieved by J.B.F and T.S. J.B.F and T.S analyzed data and prepared the manuscript. N.V.V and R.K evaluated the data. All authors approved the final manuscript.

Credit author statement
The research concept and the experiments were achieved by J.B.F and T.S. J.B.F and T.S analyzed data and prepared the manuscript. N.V.V and R.K evaluated the data and revised the manuscript. All authors approved the final manuscript.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements
This study was supported by the Ministry of Earth Sciences (MoES), Government of India [Grant Number: MOES/ICMAM-PD/AOGIA/32/2012]. The authors are thankful to Dr. M. A. Atmanand, Director, National Institute of Ocean Technology, for constant support and encouragement to perform this research. JBF thanks Dr. habil. Yogendra Kumar Mishra, Kiel University, Germany, for his first revision of this manuscript. We thank the anonymous reviewer for his constructive comments and suggestions. JBF thanks Prof. Thomas F. Duda Jr., University of Michigan, for critically reading and revising the final draft of this manuscript.

References
Begum, M., Biraja, K.S., Das, A.K., Vinithkumar, N.V., Kirubagaran, R., 2015. Extensive Chaetoceros curvisetus bloom in relation to water quality in Port Blair Bay, Andaman Islands. Env. Monit. Assess. 187, 1–14.
Brown, C.D., Hoyer, M.V., Bachmann, R.W., Canfield Jr., D.E., 2000. Nutrient-chlorophyll relationships: an evaluation of empirical nutrient-chlorophyll models using Florida and northern temperate lake data. Can. J. Fish. Aquat. Sci. 57, 1574–1583.
Camdevyren, H.C., Demyr, N., Kanik, A., Keskyn, S., 2005. Use of principal component scores in multiple linear regression models for prediction of Chlorophyll-a in reservoirs. Ecol. Model. 181, 581–589.
Chen, C.W., 1970. Concepts and Utilities of Ecologic Model. J. Sanit. Eng. Div. 96, 1086–1097.
Di Toro, D.M., O'Connor, D.J., Thomann, R.V., 1971. A dynamic model of the phytoplankton population in the Sacramento-San Joaquin Delta. In: Non Equilibrium Systems in Natural Water Chemistry. Advances in Chemistry Series, vol. No. 106. American Chemical Society, New York, pp. 131–150.
Elliot, A.H., 2000. Settling of fine sediment in a channel with emergent vegetation. J. Hydraul. Eng. 126, 570–577.
Fabricius, K.E., Okaji, K., De'ath, A.G., 2010. Three lines of evidence to link outbreaks of the crown-of-thorns seastar Acanthaster planci to the release of larval food limitation. Coral Reefs 29, 593–605.
Grasshoff, K., Kremling, K., Ehrhardt, M., 1999. Methods of Sea Water Analysis, Third Edition. Verlag Chemie, Weinheim, Germany, pp. 600.
Hoe, J.S., Kim, D.S., 2004. A new method of ozone forecasting using fuzzy expert and neural network systems. Sci. Total. Envi. 325, 221–237.
Holt, J.G., Krieg, N.R., Sneath, P.H.A., Staley, J.T., Williams, S.T., 1994. Gram-negative aerobic/microaerophilic rods and cocci. In: Hensley, W.R. (Ed.), Bergey's Manual of Determinative Bacteriology, Ninth edition. Williams and Wilkins 482, East Presion Street, Baltimore, Maryland. Norylve, USA, pp. 179.
Hoyer, M.V., Frazer, T.K., Notestein, Sky K., Canfield Jr., D.E., 2002. Nutrient, chlorophyll, and water clarity relationships in Florida's near shore coastal waters with comparisons to freshwater lakes. Can. J. Fish. Aquat. Sci. 59, 1024–1031.
Johnson, C.N., Flowers, A.R., Noriea, N.F., Zimmerman, A.M., Bowers, J.C., DePaola, A., Grimes, D.J., 2010. Relationships between environmental factors and pathogenic vibrios in the northern Gulf of Mexico. Appl. Environ. Microbiol. 76, 7076–7084.
Kaiser, H.F., 1974. Index of factorial simplicity. Psychometrika 39, 31–36.
Karthik, R., Padmavati, G., 2014. Dinoflagellate bloom produced by Protoperidinium divergens response to ecological parameters and anthropogenic influences in the Junglighat Bay of South Andaman Islands. App. Envi. Res. 36, 19–27.
Kasturirangan, L.R., 1963. A Key for the Identification of the More Common Planktonic Copepod of Indian Coastal Waters. CSIR, New Delhi, India, pp. 1–83.
Kaur, G., Arora, A.S., Jain, V.K., 2012. Multiple linear regression model based on principal component scores to study the relationship between anthropometric variables and BP reactivity to unsupported back in normotensive post-graduate females. In: Pavelkova, D., Strouhal, J., Pasekova, M. (Eds.), Advances in Environment, Biotechnology and Biomedicine. WSEAS Press, Czech Republic, pp. 373–377.
Kim, J., Park, J., 2013. A statistical model for computing causal relationships to assess changes in a marine environment. J. Coastal. Res. 65, 980–985.
Kumar, M.A., Karthik, R., Elangovan, S.S., Padmavati, G., 2012. Occurrence of Trichodesmium erythraeum bloom in the coastal waters of south Andaman. Int. J. Curr. Res. 4, 281–284.
Meeuwig, J.J., Kauppila, P., Pitka, H., 2000. Predicting coastal eutrophication in the Baltic: a limnological approach. Can. J. Fish. Aquat. Sci. 57, 844–855.
Mustapha, A., Abdu, A., 2012. Application of principal component analysis & multiple regression models in surface water quality assessment. J. Environ. Earth Sci. 2, 16–23.
Nogueira, E., Perez, F.F., Ríos, A.F., 1998. Modelling nutrients and chlorophyll-a time series in an estuarine upwelling ecosystem (ria de Vigo, NW Spain) using the Box-Jenkins approach. Estuar. Coast. Shelf Sci. 46, 267–286.
Ozbay, B., Keskin, G.A., Dogruparnak, S.C., Ayberk, S., 2011. Multivariate methods for ground level ozone modeling. Atmos. Res. 102, 57–65.
Polt, E., Gunay, S., 2015. The comparison of partial least squares regression, principal component regression and ridge regression with multiple linear regression for predicting PM10 concentration level based on meteorological parameters. J. Data. Sci. 13, 663–692.
Riegl, B., Glynn, P.W., Wieters, E., Purkis, S., d'Angelo, C., Wiedenmann, J., 2014. Water column productivity and temperature predict coral reef regeneration across the Indo-Pacific. Sci. Rep. 5, 8273.
Rovčanin, A., Abdić, A., Abdić, A., 2015. Forecasting SASX-10 index using multiple regression based on principal component analysis. Int. Busin. Manag. 1, 23–29.
Sachithanandam, V., Mohan, P.M., Kathik, R., Elangovan, S.S., Padmavati, G., 2013. Climate change influence the phytoplankton bloom (Prymnesiophyceae: Phaeocystis sp.) in North Andaman coastal region. Indian J. Geo. Mar. Sci. 42, 58–66.
Sahu, B.K., Begum, M., Kumarasamy, P., Vinithkumar, N.V., Kirubagaran, R., 2014. Dominance of Trichodesmium and associated biological and physico-chemical parameters in coastal water of Port Blair, South Andaman Island. Indian J. Geo-Marine. Sci. 43, 1–7.
Sakamoto, M., 1966. Primary production by phytoplankton community in some Japanese lakes and its dependence on lake depth. Arch. Hydrobiol. 62, 1–28.
Santhanam, R., Srinivasan, A., 1994. A Manual of Marine Zooplankton. Oxford and IBH publishing Co. Pvt. Ltd, New Delhi, India, pp. 160.
Santhanam, R., Ramanathan, N., Venkataramanujam, K., Jegatheesan, G., 1987. Phytoplankton of the Indian Seas (An Aspect of Marine Botany). Daya Publishing House, Delhi, India, pp. 127.
Sharma, S., 1996. Applied multivariate techniques. John Wiley and Sons, Canada, New York, pp. 493.
Smith, V.H., 2006. Responses of estuarine and coastal marine phytoplankton to nitrogen and phosphorus enrichment. Limnol. Oceanogr. 51, 377–384.
Sousa, S.I.V., Martins, F.G., Alvim-Ferraz, M.C.M., Pereira, M.C., 2007. Multiple linear regression and artificial neural networks based on principal components to predict ozone concentrations. Envi. Modell. Soft. 22, 97–103.
Strickland, J.D.H., Parsons, T.R., 1977. A practical handbook of sea water analysis. Bull. Fish. Res. Board Can. 167, 1–311.
Swlethcki, E., Krejci, R., 1996. Source characterization of the central European atmospheric aerosol using multivariate statistical methods. Nucl. Instr. Methods. Physics. Res. 109–110, 519–525.
Terdalkar, S., Pai, I.K., 2001. Statistical approaches for computing diversity of zooplankton in the Andaman sea. J. Trop. Ecol. 42, 243–250.
Tufford, D.L., McKeller, H.N., 1999. Spatial and temporal hydrodynamic and water quality modeling analysis of a large reservoir on the South Carolina (USA) coastal plain. Ecol. Model. 114, 137–173.
Ul-Saufie, A.Z., Yahya, A.S., Ramli, N.A., 2011. Improving multiple linear regression model using principal component analysis for predicting PM10 concentration in Seberang Prai, Pulau Pinang. Int. J. Envi. Sci. 2, 403–410.
USEPA, 2002a. Method 1600, Enterococci in Water by Membrane Filtration Using Membrane-enterococcus Indoxyl-β-D-Glucoside Agar (mEI). USEPA Office of Water, Washington, D.C.
USEPA, 2002b. Method 1604, Total Coliforms and Escherichia coli in Water by Membrane Filtration Using a Simultaneous Detection Technique (MI Medium). USEPA Office of Water, Washington, D.C.
Utermöhl, V.H., 1931. Neue Wege in der quantitativen Erfassung des Planktons (Mit besonderer Berücksichtigung des Ultraplanktons). Verh. Int. Verein. Theor. Angew. Limnol. 5, 567–595.
Vollenweider, R.A., Rinaldi, A., Montanari, G., 1992. Eutrophication, structure and dynamics of a marine coastal system: results of a ten-year monitoring along the Emilia-Romagna coast (Northwest Adriatic Sea). In: Vollenweider, R.A., Marchetti, R., Viviani, R. (Eds.), Marine Coastal Eutrophication. The Response of Marine Transitional Systems to Human Impact: Problems and Perspectives for Restoration. Science of the Total Environment Supplement, Elsevier Scientific, Amsterdam, The Netherlands, pp. 63–105.
Wehde, H., Backhaus, J.O., Hegseth, E.N., 2001. The influence of oceanic convection in primary production. Ecol. Model. 138, 115–126.
Wetzel, R.G., 2001. Limnology: Lake and River Ecosystems, Third edition. Academic Press, San Diego, California, pp. 1006.
Wiedenmann, J., D'Angelo, C., Smith, E.G., Hunt, A.N., Legiret, F., Postle, A.D., Achterberg, E.P., 2013. Nutrient enrichment can increase the susceptibility of reef corals to bleaching. Nature. Clim. Change. 3, 160–164.
Wuttichaikitcharoen, P., Babel, M.S., 2014. Principal component and multiple regression analyses for the estimation of suspended sediment yield in ungauged basins of northern Thailand. Water 6, 2412–2435.