ScienceDirect Procedia Computer Science 104 (2017) 3 – 11
ICTE in Regional Development, December 2016, Valmiera, Latvia
Modelling of Water Supply Costs

Edvins Karnitis^a, Girts Karnitis^a,*, Janis Zuters^a, Viktorija Bobinaite^b

^a University of Latvia, Raina blvd. 19, Riga, LV-1586, Latvia
^b Lithuanian Energy Institute, Breslaujos st. 3, Kaunas, LT-44403, Lithuania
Abstract

Setting water supply tariffs is a labour-intensive regulatory procedure; a number of informative and procedural shortcomings and problems currently exist. The aim of this research is to improve the methodology for determining the substantiated costs of providing water services. A working hypothesis was advanced to modernize the methodology: the specific costs (€/m3) required for the provision of water services in a specific region are a variable multi-parameter function of key performance indicators. A benchmark modelling procedure is preferred, based on factual cases (declared indicators of water utilities) and synthesis of a general regularity. The model is developed using two independent modelling procedures. The correlation of the synthesized model with the declared specific costs of Latvian water utilities is strong (0.88). The correlation between the respective modelled indications exceeds 0.95; hence, the trustworthiness of the results is high. The prospect is the determination of price ceilings and then operative tariff setting, thus significantly improving the methodology.

© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the scientific committee of the international conference ICTE 2016.
Keywords: Water utilities; Benchmarking methodologies; Data mining; Artificial neural networks
1. Introduction

The provision of water services is typically a highly segmented function. Usually only one water utility operates in any specific territory (in total, even hundreds of utilities in most countries); consequently, all of them are local monopolies. Therefore, tariff setting is usually the task of the National Regulatory Authority (NRA). Accordingly, under the Law^1, the Public Utilities Commission of Latvia (PUC)^2 regulates drinking water and sewerage services, including tariff setting^3. The tariff setting methodology^4 prescribes that the water utility prepares
* Corresponding author. Tel.: +371 67034488; fax: +371 67225039.
E-mail address: [email protected]
1877-0509 © 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the scientific committee of the international conference ICTE 2016. doi:10.1016/j.procs.2017.01.040
and submits to the PUC a particular tariff draft, which contains justified data on the volume of the service and the costs in previous years, as well as a prognosis for the next year; the data are detailed in a number of cost positions and should be based on documents. The process is similar in many countries; methodologies are based on the aggregation of a large number of cost items^5. The differences lie in the details: some countries need to save water resources^6, while others have quality problems or have implemented the universal service principle through tariffs. In the last decade, benchmarking has become widely considered a tool to motivate water utilities to raise their productivity^7. Both of the most popular benchmark methods (metric and process benchmarking) compare performance indicators (PIs) of utilities to find the more efficient companies and to share best practice^8. Unfortunately, currently there is benchmarking of separate data only "…over time, across water utilities, and across countries"^9, without reflections and conclusions on the impact of the benchmarking process on sector management and development. The question remains unanswered: how can benchmarking achieve some regulatory outcome, e.g., evaluation of costs and tariff setting?

2. Shortages of the methodological approach

More detailed analysis identifies a number of informative and procedural shortcomings and problems in the current approach to tariff setting.
Methodological principles based on careful evaluation of all cost items create the need for an extremely detailed, laborious individual assessment of each position of each tariff draft, since:
• Water utilities use different business models; e.g., a utility can maintain and repair the infrastructure and employ its own legal and/or IT specialists, or it can use outsourcing^10; comparative assessment is not possible
• National regulations on accounting and bookkeeping are quite general, and account layouts really are quite different; this especially relates to administrative and personnel costs, material accounting, etc.

The regulatory procedure becomes long and hard; in addition, it stimulates long-term application of the tariff. Applied tariffs frequently lag behind the times and become unjustified because of frequent changes in business scale as well as in energy, material and service prices, wages, etc. Another source of problems is the low quality, compatibility and reliability of the input data (values of the PIs), since:
• There is a lack of regulations on the material, human and other resources needed for an efficient (i.e., economically substantiated) water supply service
• The large number of utilities means potentially considerable diversity in the comprehension of the PIs
• The huge number of PIs used is a strong administrative burden for utilities: "It is not only the small utilities that find it difficult to evaluate such large number of PIs, larger utilities fare no better.
…"^11
• Many utilities are multi-sector companies providing both regulated and non-regulated services; there is low assurance of the absence of cross-subsidies, particularly subsidization of non-regulated services from regulated ones
• Frequent stochastic changes in network length, consumption, water losses and other aspects make prognoses inaccurate and unreliable

Moreover, a detailed audit of the various cost models and structures (even due diligence) of utilities is not a regulatory function; the NRA should examine the validity of costs as a whole instead of examining every cost item (including those with negligible impact) and their composition.

The aim of the current research is to improve the methodology for determining the substantiated costs of providing water services, in order to enable NRAs to increase their efficiency and to significantly reduce the administrative burden on utilities. This article presents the results of the first stage (water supply) of an on-going project.
3. Working hypothesis

The enumeration of currently existing deficiencies not only clearly demonstrates the need for a radical enhancement of the methodological approach to water tariff setting, but also outlines the main principles for modernization of the algorithm according to the realities of the water industry. Generalization over the tariff draft (assessment of the total specific costs only) and over utilities (comparison of specific costs) would be achievable by:
• Investigation of the potential dependency of the specific costs on the key PIs that are declared by the utilities
• Determination of the general correlations among the specific costs of utilities and synthesis of the corresponding functional regularity
• Determination of reasonable/substantiated costs for each water utility using the synthesized general regularity

A facilitating practical aspect for the synthesis of the general regularity would be similar basic normative and business conditions for water utilities that operate in the same geographical area, with comparable environmental and socio-economic factors as well as rules of the game for business (e.g., a NUTS 2 level region, as in the case of Latvia). Proceeding from this aspect, we created a working hypothesis: the total specific costs (C) required for the provision of water services in a specific region are a variable multi-parameter function of the set of key PIs (Π), which characterizes the scale and specific features of this business. These PIs would then serve as the drivers for determination of the substantiated specific costs; the searched regularity is:

C = f(Π)
(1)
A huge amount of unreliable input data naturally cannot form the necessary basis for proof of the hypothesis, and consequently for setting justified tariffs. A well-known information processing axiom postulates that the quality of the output data is fully determined by the quality of the input data. To increase the latter, i.e., to achieve accuracy of the actual values of the PIs, the set of used PIs is limited to:
• Clearly and unambiguously defined PIs, to provide their uniform understanding in all utilities
• Quantitatively measurable and controllable PIs (input quantity data), to ensure the reliability of their values
• Well-known and widely used PIs that exist in business accounting and are obtainable by the NRA from annual reports
• A small number of key PIs that characterize the utility's business and thus determine the substantiated costs; this will reduce the administrative burden on utilities and by default raise the quality of data

To achieve the aim of the current research and develop a methodology for determination of the functional regularity (1), we should prove the working hypothesis and make it tenable for practical implementation (control of costs and tariff setting). Whereas it is impossible to define the regularity theoretically, a benchmark modelling procedure is preferred, based on factual cases (declared information of water utilities).

4. Selection of input data

Careful selection of input data is a well-known critical precondition for successful modelling. The long-term experience of water professionals suggested the first step in the selection of indicators: the scale of utility k's business (i.e., the amount of authorised consumption A(k)) and the size of its infrastructure (i.e., the total length of the pipe network L(k)) provide a first relative notion (although a very rough one) of the specific costs C(k) of utility k. Consequently, the initial composition of the input data set Π(k) would be defined as:

Π(k) = {A(k), L(k)}
(2)
A difference between the amounts of authorised consumption A(k) and produced water P(k) (so-called non-revenue water) actually exists due to a number of reasons, e.g., technological consumption, leaks, disruptions, and unmetered and unauthorized connections. Non-revenue water raises significant costs for its sourcing, treatment and partial pumping. Therefore, the initial data set (2) is supplemented with an indicator of water use efficiency E(k), which displays the share of produced water that is supplied to authorized consumers:

E(k) = A(k) / P(k)
(3)
Many water utilities serve not only one city or town but also provide services in a number of smaller neighboring settlements (e.g., villages, hamlets). Fragmentation of the total network L(k) into s(k) separate segments adversely affects the substantiated costs (personnel, transport, management, etc.) of the utility. The network concentration index H(k) takes into account the total number of isolated segments s(k) in the network L(k), as well as the specific weight of the length of each isolated segment L(m(k)) in the total length of the network:

H(k) = Σ_{m=1}^{s(k)} (L(m(k)) / L(k))²
(4)
The connection of the consumer is an exit point of the utility's infrastructure and the boundary of its responsibility for the service. The number of connections N(k) characterizes expenditures related to consumer services; it also indicates the fragmentation of the network, as well as the ratio between the larger-diameter (transmission and distribution) and smaller-diameter (consumer) pipes. Now the overall input data set has become:

Π(k) = {A(k), L(k), E(k), H(k), N(k)}
(5)
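As a concrete illustration, the derived indicators E(k) and H(k) and the full PI set (5) can be computed directly from a utility's declared figures. The sketch below is ours, not the authors' implementation; the record fields and function names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class UtilityRecord:
    """Declared indicators of one utility k (illustrative field names)."""
    A: float               # authorised consumption, m3/year
    P: float               # produced water, m3/year
    L: float               # total network length, km
    segments: list         # lengths of the isolated network segments, km
    N: int                 # number of consumer connections

def efficiency(r: UtilityRecord) -> float:
    """E(k) = A(k) / P(k): share of produced water reaching consumers, eq. (3)."""
    return r.A / r.P

def concentration(r: UtilityRecord) -> float:
    """H(k) = sum over segments of (L(m)/L)^2, eq. (4): a Herfindahl-type index.
    H = 1 for a single contiguous network; H falls as fragmentation grows."""
    return sum((lm / r.L) ** 2 for lm in r.segments)

def pi_set(r: UtilityRecord) -> tuple:
    """Overall input data set Pi(k) = {A, L, E, H, N} of equation (5)."""
    return (r.A, r.L, efficiency(r), concentration(r), r.N)
```

For example, a utility with a 50 km network split into segments of 40 km and 10 km gets H = (40/50)² + (10/50)² = 0.68, reflecting the cost penalty of operating two isolated parts.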
Several aspects that are significant in other countries are not relevant in Latvia. The single real water quality problem is iron removal, but it is quite uniform across the whole territory; the tap water is suitable for drinking without further filtration in any settlement. Water shortage problems are negligible at least until 2040^6. Support of vulnerable consumers is implemented through a special accommodation allowance at the municipal level. The set of PIs (5), which is in practice the result of several iterations, was used in the project.

The accuracy of the input data (the selected PIs) remains something of a challenge for the water utilities (currently only the volume of authorised consumption is used to calculate the tariff), although these are primary operational data of everyday business. It is clear that, in any case, part of the particular input data sets Π(k) will stay in the risk zone. A significant advantage of the modelling is the ability to pick the most qualitative and reliable (good) data sets Π(k) from the full data pack declared by all utilities, and to develop the model on this basis; the obtained general regularity will be applicable to the remaining (bad) utilities too. Moreover, theoretical research in data analysis and modelling shows that it is much more reasonable to carry out the analysis and develop data models by selecting the most reliable, good data sets instead of using the full data pack, thus minimizing information noise^12. To select good data sets, a detailed formal analysis of the declared data for 2013 and 2014 (which were used for the creation of the model) and a comparative analysis of the declared data for 2012–2014 were carried out.
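Such a formal screening of declared data can be sketched as a plausibility filter. The function and argument names below are ours; the thresholds are of the kind reported for the Latvian data:

```python
def is_low_confidence(A: float, P: float, N: int, L: float,
                      L_prev: float, A_prev: float) -> bool:
    """Flag one declared data set as low-confidence ('bad').
    A, P, N, L follow the paper's PIs; *_prev are the previous year's values.
    Illustrative sketch of the screening rules, not the authors' code."""
    E = A / P
    if E > 0.99:           # produced volume ~ supplied volume: implausible
        return True
    if N / L > 100:        # connection density far above the 22.6/km average
        return True
    if A / N / 12 < 6:     # < 6 m3/month per connection (average: 48.3)
        return True
    growth = (L - L_prev) / L_prev
    if (growth < 0 or growth > 0.30) and A <= A_prev:
        return True        # network shrank or jumped while consumption was flat
    return False
```

Records that pass all checks form the good data pack on which the model is built.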
Utilities of low confidence (bad utilities) were considered to be those which declared, e.g.:
• A volume of produced water that is equal in two years, or very close to the volume of supplied water (E > 0.99)
• A very large average connection density (N/L > 100 connections per km; the average in Latvia is 22.6) and/or a very low average consumption per connection (< 6 m3/month; the average in Latvia is 48.3)
• A decreased or significantly increased (> 30%) total network length in the current year in comparison with the previous one, with an unchanged or even declining amount of authorised consumption

Practitioners of data analysis recommend including around 70% of the total data sets in the good data; the synthesized regularity will be applied to the remaining 30% too^13. We supplemented this factor with another one: the selected good data should cover the full ranges of the declared data. Because of the unconvincing data quality, we carried out modelling using the data of only 33 utilities for 2013 (50% of the total) and 38 utilities for 2014 (60%).

5. Modelling procedure

For the development of the benchmark model, water utility k is considered as a multiple-input single-output converter with a variable internal state (transition regularity), which transforms the input data set (5) (the set of n PIs (cost drivers); in our case n = 5) into the corresponding output indicator, the specific costs C(k):

C(k) = f(Π(k)) = f(A(k), L(k), E(k), H(k), N(k))
(6)
The converter is a deterministic one: its input data set unambiguously defines the internal state of the converter and the output information (specific costs). Synthesis of the model then means creation of a monotonous (due to economic logic) multi-functional regularity that describes the hyper-surface in n-dimensional space which is the geometric locus of the values of the specific costs. The number of input data sets u, and consequently of output data, is finite (the number of utilities u). Nevertheless, in order to be able to evaluate new and/or modified undertakings, the regularity should be continuous for any PI within the determined ranges of input data. The internal structure and operation of the utility are irrelevant for performing this task; the converter is considered as a black box (see Fig. 1) that can be characterized by its transition function (6).
Fig. 1. Functionality of the benchmark model.
There are u different transition functions f(1), f(2), …, f(k), …, f(u) for the u utilities; they form the factual basis for the modelling, a set of practical cases that can be used for synthesis of the general regularity from input/output examples. This is an advantage over the need to rely only on theoretical preconceptions. The modelling is then an inductive process: synthesis of the general regularity of the transition function C = f(Π) on the basis of u particular cases C(k) = f(Π(k))^14; the sought equation is developed using a mathematical modelling procedure. It can be predicted that a model completely adequate to all real utilities will not be achievable. The leading motive for practical purposes is to create the equation with the best possible quality. As the quality criterion for the created general regularity, we used the correlation of the values of the synthesized model (equation) with the corresponding declared specific costs (correl(C(k); C)).

Two mutually unrelated modelling procedures have been used to ensure a cross-check of the results. Both consist of several phases and activities and are carried out in an iterative manner, gradually approaching the searched general regularity. One procedure is based on the nonlinear regression process (NLR)^15. To create a model, equation (6) should be generalized because the impact of any cost driver (performance indicator) on specific costs varies depending on the PI value. For this purpose, the specific values of any PI should be replaced by mathematical functions; of course,
impact regularities of each PI will be different:
C = f(Π) = f(f1(A), f2(L), f3(E), f4(H), f5(N))
(7)
The synthesis of the searched general regularity then comes down to the determination of all the functions in equation (7). According to Occam's razor principle, the search for suitable functions f1…f5 was made among the elementary functions that satisfy the monotony condition (exponential, logarithmic, power), to indicate the particular regularity that correlates best with all the practical examples. The NLR process in our case is an empirical movement (navigation) in the multi-dimensional search space formed by the input data vectors (see Fig. 2), to find the optimal function and its optimal parameters for each cost driver. The incremental and efficient bottom-up navigation was implemented as a gradual process with a definite trend towards the target, the maximum achievable quality criterion (a correl(C(k); C) value of 0.88 was achieved).

The other modelling procedure uses a type of artificial neural network, the multi-layer perceptron (MLP), with the error back-propagation training algorithm^16. We found that it is enough to have just three computing units (neurons) in the hidden layer and one output unit (see Fig. 3) to achieve the best accuracy^17. Each neuron is represented by a mathematical function whose weights (or parameters) are set automatically via a machine learning process. Input data normalization was performed by an S-type function because of the very diverse scales of the data values^16.
Fig. 2. Modelling using nonlinear regression process.
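The NLR procedure can be illustrated with a toy sketch: assume one candidate regularity in which every cost driver enters as a power function, and fit its parameters by nonlinear least squares with scipy's curve_fit. The functional form, the synthetic data and the parameter values below are illustrative only, not the forms selected in the study:

```python
import numpy as np
from scipy.optimize import curve_fit

def candidate(X, c0, a, b, c, d, e):
    """One candidate regularity C = c0 * A^a * L^b * E^c * H^d * N^e:
    a product of monotone power functions, one per cost driver.
    (Illustrative choice; the study searched power/log/exp forms per PI.)"""
    A, L, E, H, N = X
    return c0 * A**a * L**b * E**c * H**d * N**e

# Synthetic "declared" data for u utilities (stand-in for real PI sets)
rng = np.random.default_rng(0)
u = 40
A = rng.uniform(5e4, 5e6, u)      # authorised consumption
L = rng.uniform(5, 500, u)        # network length
E = rng.uniform(0.6, 0.99, u)     # water use efficiency
H = rng.uniform(0.3, 1.0, u)      # network concentration
N = rng.uniform(100, 2e4, u)      # connections
C = candidate((A, L, E, H, N), 5.0, -0.15, 0.10, -0.5, -0.2, 0.05)
C = C * rng.normal(1.0, 0.05, u)  # declared specific costs with 5% noise

popt, _ = curve_fit(candidate, (A, L, E, H, N), C,
                    p0=[1.0, 0, 0, 0, 0, 0], maxfev=50000)
model = candidate((A, L, E, H, N), *popt)
quality = np.corrcoef(C, model)[0, 1]   # quality criterion correl(C(k); C)
```

In the real procedure this fit is repeated over different elementary forms for each driver, keeping the combination with the highest correlation.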
The final outcome (i.e., the searched transition function) is obtained in the form of a trained neural network; its quality was carefully evaluated in a combined way:
• In addition to the correlation between the modelled specific costs and the declared ones, the accuracy of the modelling was controlled by a stopping criterion: continuation of the training process until the modelling error falls below some predetermined threshold ε^18. The best correlation obtained was 0.96, with an ε value of 0.00001
• Models obtained with the maximum possible accuracy/correlation adapt too closely to the concrete data points used for training and are thus invalid for modelling the overall process (so-called overfitting). Therefore, specific monotony tests of the model were defined to obtain the maximum quality of the searched regularity while preserving its monotony; e.g., the final value of the threshold ε was increased to 0.0001, and the obtained correlation was similar to that in the NLR case, 0.88
Fig. 3. Modelling using multi-layer perceptron.
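A comparable MLP setup (three hidden neurons with S-type activation, normalized inputs, tolerance-based stopping) can be sketched with scikit-learn on synthetic data. This is an illustrative stand-in, not the authors' implementation: the lbfgs solver is used here for stability on a tiny sample, whereas the study used error back-propagation, and min-max scaling replaces the paper's S-type normalization function:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
u = 40
X = np.column_stack([
    rng.uniform(5e4, 5e6, u),    # A: authorised consumption
    rng.uniform(5, 500, u),      # L: network length
    rng.uniform(0.6, 0.99, u),   # E: water use efficiency
    rng.uniform(0.3, 1.0, u),    # H: network concentration
    rng.uniform(100, 2e4, u),    # N: connections
])
# Synthetic declared specific costs (illustrative dependence on E and H)
C = 0.4 + 1.5 * (1.0 - X[:, 2]) + 0.3 * X[:, 3]

# Normalize the very diverse scales of the input data
Xn = MinMaxScaler().fit_transform(X)

# Three hidden units, logistic (S-type) activation, one output unit;
# tol plays the role of the error threshold epsilon in the stopping criterion
mlp = MLPRegressor(hidden_layer_sizes=(3,), activation='logistic',
                   solver='lbfgs', tol=1e-4, max_iter=10000, random_state=0)
mlp.fit(Xn, C)
quality = np.corrcoef(C, mlp.predict(Xn))[0, 1]   # correl(C(k); C)
```

Relaxing tol (the stopping threshold) trades a little training accuracy for monotony and generalization, as described above for the ε value.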
6. Modelling results and inferences

Using both modelling procedures, the benchmark model of a Latvian water utility was synthesized on the basis of the declared data for 2013 and 2014. The 2013 models for 33 good utilities, in comparison with the declared costs, are shown in Fig. 4. The correlation of the modelled costs C with the declared specific costs C(k) (Table 1) is very strong (> 0.85); the p-values (i.e., the probability that the obtained correlations are accidental) are less than 10^-8. More than 70% of the modelled specific costs lie within the standard segment formed by the values C(k) +/- 10%. Modelling on the good data of 2014 (38 utilities) provides results of even better quality; a correlation value of 0.88 was achieved (increased input data quality!). The mathematical expressions of the models are different, but the surfaces practically coincide over the whole range of input data; the correlations between the respective indications of the models are extremely strong (> 0.95). The difference between the modelled costs for a particular utility does not exceed 10%; the biggest differences are for utilities whose specific costs are outside the standard segment or on its border. Hence, the trustworthiness of the results is high.
Fig. 4. Modelled and declared specific costs of 2013 (33 good utilities).
Application of the synthesized regularities to the bad utilities (see Fig. 5) clearly identifies two clusters of bad data, i.e., utilities that declared unduly high or low costs relative to the scale of their business. Let us remember: the bad data were separated according to the results of a very formal data analysis, without any connection to specific costs. The practically unchanged, excellent mutual coincidence of the models for all utilities points to the currently existing information asymmetry and incorrect data as the most probable reason for the differences.
Fig. 5. Modelled (MLP model) and declared specific costs of 2014 (all 63 utilities).
According to the principles of benchmark modelling, the general regularity shows the mutual correlation between the specific costs of good utilities and represents mean values of the declared specific costs. Since both artificially reduced and exaggerated costs exist, it can roughly be assumed that the synthesized benchmark model presents average/reasonable costs. In the next stages, it will be possible to use the modelled specific costs as a motivator to increase the operational efficiency of utilities and to reduce their expenses.

Comparison of the two NLR models for 2013 and 2014 (see Fig. 6) shows gradually growing production costs in general. Increasing consumption means decreasing specific costs (e.g., U13), while a developed but currently untapped network results in increasing costs (e.g., U1, U2, U22). The relatively small scale of business and weak technological base are the basic reasons for the significant impact of any serious accident on specific costs in comparison with the previous year: disruptions in 2013 (e.g., U15, U17, U18, U23) or in 2014 (e.g., U1, U2, U7, U22). This factor clearly shows the necessity of operative tariff changes.
Fig. 6. NLR models 2013 and 2014.
A principal practical regulatory need is the coincidence of models when the regularity of the previous year is used with the real PIs of the current year. Fig. 7 shows that there is a serious shift only for U1; the reason is the different modelling ranges.
Fig. 7. Modelled specific costs for 2014 (MLP model), using models 2013 and 2014.
7. Conclusion

The level of accuracy and credibility of the results, and the coincidence of the obtained models with the factual cases, clearly demonstrate the correctness of the working hypothesis and the prospects of the research. There is a strong basis for continuing the research to further enhance the compliance of the model and to increase its quality.

Reduction of the currently existing information asymmetry is the primary task. The results indicate much greater input data problems in comparison with any inadequacy of the model (the mutual correlation of the independent modelling results is much higher than their correlation with the declared costs of utilities). Studies related to the potential incompleteness of the input data set should also be continued, to determine whether all the substantial input data (cost drivers) are included in the set of PIs. It is necessary to identify and quantify the individualities that distort the regularity of the declared costs of utilities: some artificially reduce costs, e.g., municipal subsidies or underinvestment, while others generate exaggerated costs, e.g., excessive capacity of some infrastructure objects.

Thus, the general goals (setting of substantiated tariffs, growing efficiency of utilities, reduction of the administrative burden on business and increased efficiency of the NRA) will be achieved. An analogous methodology could also be developed for evaluation of the costs of sewerage and district heating utilities.

References

1. On Regulators of Public Utilities. Available: http://www.vvc.gov.lv/export/sites/default/docs/LRTA/Likumi/On_Regulators_of_Public_Utilities.pdf; 2016.
2. Sabiedrisko pakalpojumu regulesanas komisija (in Latvian). Available: http://www.sprk.gov.lv
3. Udenssaimniecibas pakalpojumu likums (in Latvian). Available: http://likumi.lv/doc.php?id=275062; 2016.
4. Udenssaimniecibas pakalpojumu tarifu aprekinasanas metodika (in Latvian). Available: http://likumi.lv/doc.php?id=209845; 2016.
5. Geriamojo vandens tiekimo ir nuoteku tvarkymo bei pavirsiniu nuoteku tvarkymo paslaugu kainu nustatymo metodika (in Lithuanian). Available: https://www.e-tar.lt/portal/lt/legalAct/4c3e62a08a9311e4a98a9f2247652cf4; (in Bulgarian). Available: http://www.dker.bg/files/DOWNLOAD/directions_water_1.pdf; 2016.
6. Luo T, Young R, Reig P. Aqueduct projected water stress rankings. Available: http://www.wri.org/publication/aqueduct-projectedwater-stress-country-rankings; 2016.
7. Berg SV. Water utility benchmarking: measurement, methodologies, performance incentives. London: IWA Publishing; 2010. 172.
8. Storto C. Benchmarking operational efficiency in the integrated water service provision: does contract type matter? Benchmarking. Vol. 21, 6; 2014. p. 917-943.
9. Berg S, Padowski JC. Overview of Water Utility Benchmarking Methodologies: From Indicators to Incentives. Available: http://warrington.ufl.edu/centers/purc/purcdocs/papers/0712_Berg_Overview_of_Water.pdf; 2016.
10. Baranzini A, Faust A, Maradan D. Water supply: costs and performance of water utilities, evidence from Switzerland. Available: http://arodes.hes-so.ch/record/274/files/lm.pdf; 2016.
11. Shinde VR, Hirayama N, Mugita A, Itoh S. Revising the existing performance indicator system for small water supply utilities in Japan. Urban Water. Vol. 10, 6; 2013. p. 377-393.
12. Barzdins J, Barzdins G, Apsitis K, Sarkans U. Towards Efficient Inductive Synthesis of Expressions from Input/Output Examples. Proceedings of the 4th International Workshop on Algorithmic Learning Theory. London: Springer-Verlag; 1993. p. 59-72.
13. Leek J. The Elements of Data Analytic Style. Victoria, British Columbia: Leanpub; 2015. 94.
14. Angluin D. Inductive inference: theory and methods. Computing Surveys. Vol. 15, 3; 1983. p. 237-267.
15. Dean J. Big data, data mining and machine learning. Hoboken, New Jersey: Wiley; 2014. 266.
16. Haykin S. Neural Networks and Learning Machines. New York: Pearson Education Inc.; 2010. 936.
17. Alpaydin E. Introduction to Machine Learning. Cambridge: The MIT Press; 2010. 584.
18. Mitchell TM. Machine Learning. Columbus, OH: McGraw-Hill Education; 1997. 432.
Girts Karnitis, born in 1974, earned his Doctor's degree in Computer Science from the University of Latvia in 2004. He has published more than 20 scientific papers and has participated in many software projects as a designer and programmer. His main scientific interests include business process modelling and database technologies, including NoSQL databases and Big Data technologies. Contact him at [email protected].