Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 156 (2019) 300–307

www.elsevier.com/locate/procedia
8th International Young Scientist Conference on Computational Science
Detection of lost circulation in drilling wells employing sensor data using machine learning technique

Ivan Khodnenko a,*, Sergey Ivanov a, Dmitry Perets b, Maxim Simonov b

a ITMO University, Birzhevaya Line 14, 199034 Saint Petersburg, Russia
b GAZPROM NEFT STC, 75-79 liter D Moika River emb., 190000 Saint Petersburg, Russia

* Corresponding author. Tel.: +7-931-581-07-37. E-mail address: [email protected]
Abstract

Despite the steady growth of petroleum products, oil production has become more difficult every year. One reason is the drop in production of «light» oil and the increasing water influx. This process consists of the entry of alien (external, from another reservoir) waters into the main flow of oil. The reasons for such a violation may be weak cementing of the production string or disruption of the cement ring. This work aims to develop an algorithm of detection of lost circulation in drilling wells employing sensor data using a machine learning technique. The following classification methods were tested during the development process: k-nearest neighbors, support vector machine, random forest classifier, logistic regression, naive Bayes. The logistic regression model showed the best results. The accuracy of the developed algorithm for wells is 79%, and it can be useful in the detection of lost circulation in drilling wells.

© 2019 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 8th International Young Scientist Conference on Computational Science.

Keywords: detection of lost circulation; sensor data; machine learning; time series classification; drilling wells; oil production
1. Introduction

Starting from 2006, oil production in Russia has grown. At the end of 2018, the volume had increased to 555 million tons, according to the central dispatching control of the fuel and energy complex [1]. Despite the steady growth
of petroleum products, oil production has become more difficult every year [2, 3]. One of the reasons is the decrease of «light» (easily retrievable) oil. Another reason is the entry of alien waters into the main flow of oil. Usually, a distinction is drawn between lost circulation and production leakage. Lost circulation is demonstrated in figure 1a: this process happens when water from layer A moves to layer B together with the oil. Figure 1b demonstrates a production leakage, when water soaks into the main flow through the well casing. The reasons for such a violation may be weak cementing of the production string or disruption of the cement ring.
Fig. 1. (a) an illustration of lost circulation; (b) an illustration of a production leakage.

To determine the described fault, the production log test (PLT) of wells is used as the main tool. This test makes it possible to analyze the state and parameters of a well and to create an informal description of its working state. Specialists then read this description and, using their experience, can determine lost circulation. However, this type of work is expensive and involves several successive stages, which complicates its use. Also, this method does not have a formal result: the specialist describes the working state of the well and must then analyze the resulting text of the test and draw a conclusion. Nevertheless, this method is used as the main instrument. There are two ways to choose a well for PLT: when a well is producing too much water, or by choosing wells randomly. The second way is the more interesting one, because exactly this process can be improved. So, the aim of this work is to develop an algorithm which predicts the probability of lost circulation in drilling wells. This solution will help to choose wells for PLT more effectively and will save money and time.

2. Related work

The most widely used method of detection of lost circulation in drilling wells is the water control diagnostic plots developed by Chan [4]. This method is based on building a plot in logarithmic (log-log) coordinates. The plot includes two relations: the first shows the water/oil ratio (WOR) and the second displays the derivative of WOR. Based on the behavior of the curves, it is possible to determine when water flooding started and what type of water flooding occurred. This method has proved successful in many cases, for example in the paper on applying techniques for water restriction [5]. Moreover, in [6] the authors improved this method for horizontal wells, and the accuracy of the method was about 60%. Its major drawback is the need for a human to analyze the plots. Another practical method is the Merkulova-Ginzburg method [7], which is based on studying the dependencies between parameters. To analyze a well using this method, it is necessary to build a plot in a dimensionless coordinate system. The plot shows the differences between the volumes of water and oil over time. It can be created using the following formulas:
$$x^{(i)} = \frac{V_{н.в}^{(i)} + V_{в}^{(i)}}{V_{н.в.к}^{(i)} + V_{в.к}^{(i)}}; \qquad y^{(i)} = \frac{V_{н.в}^{(i)}}{V_{н.в}^{(i)} + V_{в}^{(i)}} \tag{1}$$
Here $V_{в}^{(i)}$ and $V_{н.в}^{(i)}$ are the current volume values of water and of oil with water, and $V_{в.к}^{(i)}$ and $V_{н.в.к}^{(i)}$ are the corresponding volume values for the whole period under review. After the plot is created, the points which lie approximately on a straight line are fitted with a line. By analyzing the slopes of these lines, it is possible to find a period of incorrect work of a well. This method has a drawback similar to the previous one. Unfortunately, other studies could not be found in the English-language literature, but there is more research on similar topics in Russian-language articles, including reviews of analytical methods [8, 9].

3. Methodology

3.1. The source data

As source data, 196 wells with main parameters for the period from 2014 to 2018 were provided by Gazpromneft STC. The main parameters are water cut, fluid and oil rates, and bottom-hole pressure. A table with PLT data was also given. The number of wells was then reduced to the 44 wells which have stable data and information about the state of work. Figure 2 contains an example of the input data.
Fig. 2. Example of input data.

After analyzing this fragment of input data, the first conspicuous problem detected was the water cut measurements. There are too many empty values and, in addition, many repeated values. This behavior of the water cut exists throughout the data, and the average frequency of measurement is one per week. To bypass this problem, we have proposed two solutions. The first one is to calculate the water cut using equation (2) and to supplement the measured water cut with the calculated values:

$$W_{calc} = 1 - \frac{Q_{н}}{\rho \cdot Q_{ж}} \tag{2}$$
where $Q_{н}$ and $Q_{ж}$ are the oil and liquid rates, respectively, and $\rho$ is the oil density.
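A minimal sketch of this supplementation step is given below. This is not the authors' code: the pandas column names oil_rate, liquid_rate and water_cut are hypothetical, and the computation simply follows equation (2).

```python
import pandas as pd

def supplement_water_cut(df: pd.DataFrame, oil_density: float) -> pd.DataFrame:
    """Fill gaps in the measured water cut with values from equation (2)."""
    out = df.copy()
    # Equation (2): W_calc = 1 - Q_n / (rho * Q_zh)
    w_calc = 1.0 - out["oil_rate"] / (oil_density * out["liquid_rate"])
    # Keep measured values where available, fall back to calculated ones
    out["water_cut"] = out["water_cut"].fillna(w_calc)
    return out
```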
The calculated values have similar behavior but different magnitudes. The deviation is roughly constant within one well but differs between wells because of the different oil density. Thus, a coefficient would have to be selected for each well to match the calculated values with the original water cut, which means this method does not seem appropriate. Alternatively, there are other water cut time series, measured in different ways: TM water cut and chemical analytical laboratory (CAL) water cut. The first one is obtained using special sensors. The positive side of this source is stable data without empty and repeated values. The negative side is that these sensors are a new technology and many wells do not have them; also, the magnitudes measured this way have different behavior and values. The CAL water cut is measured in a laboratory. Not every well has this parameter, but the values are similar to the main water cut, so the main time series can be supplemented with the CAL values. This first analysis helped to restore the path of the water cut. Nevertheless, the data still contain empty and repeated values as well as outliers, so the next section describes cleaning the data and preparing it for modeling.

3.2. Preprocessing of data

Initially, the repeating values were removed. The next step is more complex, as there are many ways to remove outliers. For example, paper [10] describes a way to create a graph to find anomalies. Visual methods are popular, and it is easy to find similar cases, for instance [11, 12]. However, methods like these are not applicable when a dataset is big or an automatic algorithm is needed. Another method is described in paper [13], and it uses information about the a priori probability of the process. Under the conditions of the problem under consideration, outliers occur due to the manual work of engineers and the carelessness of the working personnel. Taking this information into account and following paper [14], a sufficient condition is to compute quantiles: values below the lower or above the upper quantile are the outliers. The main requirement was to exclude abnormal values of the parameters while minimizing the risk of damaging the source time series. Results were assessed by visual analysis, and the best result was obtained with the 99.7% quantile. Thus, a confidence interval was created for each parameter: if a value does not fall in the interval $L \le x_i \le U$, where $L$ and $U$ are the lower and upper thresholds, then this value is an outlier.

The last step was the interpolation of the data. Interpolation is a way to find unknown values using a known discrete set of values. The main requirement when choosing an interpolation method (fig. 3a) was that at the interpolation points S and T the interpolated values must tend toward line A and away from line B. Three methods were tested (fig. 3b): polynomial interpolation, Akima interpolation, and monotonic cubic interpolation. The Akima model showed the best result and was chosen; a minimal sketch of this preprocessing pipeline is given after fig. 3.
Fig. 3. (a) the requirement for interpolation; (b) testing of interpolation methods.
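The sketch below illustrates the preprocessing described above, assuming a pandas Series with a DatetimeIndex; it uses the 99.7% quantile rule of section 3.2 and SciPy's Akima interpolator, and is not the authors' original code.

```python
import pandas as pd
from scipy.interpolate import Akima1DInterpolator

def remove_outliers(series: pd.Series, q: float = 0.997) -> pd.Series:
    """Replace values outside the [L, U] quantile interval with NaN (section 3.2)."""
    lower, upper = series.quantile(1.0 - q), series.quantile(q)
    return series.where((series >= lower) & (series <= upper))

def akima_fill(series: pd.Series) -> pd.Series:
    """Interpolate the missing values with the Akima model."""
    known = series.dropna()
    x = known.index.astype("int64")              # timestamps as integers
    interp = Akima1DInterpolator(x, known.to_numpy())
    filled = pd.Series(interp(series.index.astype("int64")), index=series.index)
    return series.fillna(filled)

# cleaned = akima_fill(remove_outliers(raw_series))
```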
After the model was chosen, it is necessary to find the maximum number of empty values that can be interpolated, because interpolating an interval which contains a month of empty values will create more noise than valuable information. To determine this maximum, table 1 was created. It shows the average, median and 95% quantile of the number of consecutive days with empty values.

Table 1. Statistics of empty values (consecutive days with empty values)

                            Well number
Parameter  Characteristic   1      2      3      4      5      6      7      9      10
Qw         Average          1.47   1.95   1.59   1.84   1.47   1.42   2.05   2.11   3.27
           Median           1      1      1      1      1      1      1      1      1
           95% quantile     4      6      4      5      3      3      7      5.95   5
W          Average          4.75   5.72   5.01   1.9    5.68   6.21   7.98   8.5    5.9
           Median           5      5      5      1      4.5    7      8      9      5
           95% quantile     11     12     10.5   8.15   13     12     14     14     13
Pb†        Average          1.61   1.85   1.85   1.89   1.97   1.11   1.52   1.27   2.03
           Median           1      1      1      1      1      1      1      1      1
           95% quantile     5      5      6      3.6    5      2      4      2.35   6.3

† Bottom-hole pressure
Analyzing the created table, one can see that the average and median gap lengths for fluid rate and bottom-hole pressure are not greater than two days. Moreover, their 95% quantile is not higher than seven days, so, choosing seven days as the threshold, only 5% of the gaps will not be restored. The situation with water cut is more complicated: its 95% quantile is two weeks, which is too big for recovering. In this case, it was decided to take seven days as well and to cover half of the skipped data. Considering the other two parameters, this is sufficient for creating a classification model.

3.3. Algorithm

After analyzing the preprocessed data, time series classification (TSC) was chosen for building the predictive model [15]. The main feature of this approach is that it classifies time series, i.e. data with a time dependence between values. For implementation, the source data set first needs to be transformed, and two ways of doing this are well known. The first one is to approximate the time series with a model of order P; the obtained coefficients can then be used as a new dataset. For example, in paper [16] the authors use the Fourier transform and the discrete wavelet transform to get a new dataset. However, in our case this method does not seem appropriate for several reasons: the different sizes of the time series, the different measurement frequencies and the small length of the time series. The second way is to segment the source time series of length N. New features which describe each segment are then generated, and each segment is given a target label. Thus, a new dataset is generated for classification purposes. Under the conditions of the problem, the following features have been chosen:
• an expected value for each parameter;
• a standard deviation for each parameter;
• a relation between the expected values of water cut and fluid rate;
• a relation between the expected values of fluid rate and bottom-hole pressure;
• a relation between the expected values of water cut and bottom-hole pressure;
• a relation between the expected values of water cut, fluid rate and bottom-hole pressure;
• a slope of the approximated line for each parameter;
• a coefficient of determination of the approximated line for each parameter.
After that, it is necessary to select the length of a segment. To solve this, we decided to generate several datasets with segment lengths from 1 to 40 days.
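Before comparing segment lengths, each segment has to be turned into a feature row. A minimal sketch of that step is given below; the exact form of the "relation" features is not specified in the text, so plain ratios are assumed here, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

PARAMS = ["water_cut", "fluid_rate", "pressure"]  # hypothetical column names

def segment_features(segment: pd.DataFrame) -> dict:
    """Generate the features of section 3.3 for one fixed-length segment."""
    feats = {}
    t = np.arange(len(segment))
    for p in PARAMS:
        values = segment[p].to_numpy()
        feats[f"{p}_mean"] = values.mean()        # expected value
        feats[f"{p}_std"] = values.std()          # standard deviation
        fit = linregress(t, values)               # approximated line
        feats[f"{p}_slope"] = fit.slope           # slope of the line
        feats[f"{p}_r2"] = fit.rvalue ** 2        # coefficient of determination
    # Relations between expected values (plain ratios assumed)
    feats["wc_to_fluid"] = feats["water_cut_mean"] / feats["fluid_rate_mean"]
    feats["fluid_to_pressure"] = feats["fluid_rate_mean"] / feats["pressure_mean"]
    feats["wc_to_pressure"] = feats["water_cut_mean"] / feats["pressure_mean"]
    feats["wc_fluid_to_pressure"] = (feats["water_cut_mean"] * feats["fluid_rate_mean"]
                                     / feats["pressure_mean"])
    return feats
```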
For a segment length of one day, the size of the resulting dataset is 6414 rows; when the length equals 40 days, the size is 489 rows. Longer segments were not considered because of the small size of the generated dataset. Each dataset has been tested with the following models:
• logistic regression;
• support vector machine;
• k-nearest neighbors;
• naïve Bayes;
• random forest classifier.
These models were chosen because some of them are used in time series classification and the others are popular in many applications. Figure 4 shows the results of testing the models with different segment lengths using the scikit-learn library [17, 18]. The models were used with default parameters, and cross-validation with three folds was applied for this test.
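A minimal sketch of this comparison, assuming X and y are the segment feature matrix and target labels built in section 3.3 (default parameters and three folds, as in the text):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

MODELS = {
    "logistic regression": LogisticRegression(),
    "support vector machine": SVC(),
    "k-nearest neighbors": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(),
}

def compare_models(X, y):
    """Mean 3-fold cross-validation accuracy for every model (cf. fig. 4)."""
    return {name: cross_val_score(model, X, y, cv=3).mean()
            for name, model in MODELS.items()}
```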
Fig. 4. The test of classification models with different segment lengths.

As can be seen from the figure, the logistic regression and random forest classifier models have the highest accuracy values, and the accuracy is stably high when the length of a segment is greater than 19 days. Based on this, it was decided to check the quality of the selected models by the number of correctly classified wells. The length of a segment was chosen equal to 19 days, as the minimal number of days on the stable interval. The models were tuned with the help of a brute-force method: a matrix with all possible parameter values was created for each model, and both models were then launched with every combination of parameters. The brute-force search showed that logistic regression needs the parameter C equal to 3.45 and the parameter «penalty» equal to «l1»; random forest requires one hundred estimators and a maximum depth of 7. According to the testing results, the logistic regression model correctly determined the presence of water inflow in 7 out of 8 wells, while the random forest model identified a violation in 6 of the 8 wells. Thus, the logistic regression model coped best with the task, and this particular model was used in the further study.
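A minimal sketch of this brute-force tuning using scikit-learn's grid search; the parameter grids here are illustrative rather than the ones from the paper, and the l1 penalty requires a compatible solver such as liblinear:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# The paper reports C = 3.45 with penalty 'l1' for logistic regression
# and 100 estimators with max_depth = 7 for the random forest.
logreg_search = GridSearchCV(
    LogisticRegression(solver="liblinear"),      # liblinear supports the l1 penalty
    {"C": [0.1, 1.0, 3.45, 10.0], "penalty": ["l1", "l2"]},
    cv=3,
)
forest_search = GridSearchCV(
    RandomForestClassifier(),
    {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 7, None]},
    cv=3,
)
# logreg_search.fit(X, y); forest_search.fit(X, y)
# logreg_search.best_params_, forest_search.best_params_
```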
4. Experiments

The validation was done using the same wells. To evaluate the result, accuracy was used, which shows the percentage of correctly identified wells. At first, there were 200 runs of the developed algorithm; for each run, 20% of the wells were randomly left out for testing. The result was inconclusive: the range of obtained values was 80%, which is too big, and the confidence interval stretched from 10% to 90%. The analysis identified that the bad results occurred when the training dataset contained the wells with fewer than 40 segments while the test dataset contained the wells with more than 40 segments. After correcting this unbalanced distribution, the confidence interval narrowed to the range from 67% to 72%. These numbers are more reliable because the range is narrower. The confidence interval could be made smaller by running the algorithm more times; however, the execution time of 200 launches is over two hours because of the long segmentation time, and increasing the number of launches would increase the execution time further. For now, the accuracy of the algorithm is sufficient and appropriate for practical needs.
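A minimal sketch of this repeated hold-out validation is shown below. It is an assumption-laden reconstruction: wells is a hypothetical dict mapping a well name to its (features, labels) arrays, and a well is treated as flooded when the mean predicted probability of its segments exceeds 0.5, following the averaging of segment probabilities mentioned in the discussion of fig. 5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def run_validation(wells, n_runs=200, test_frac=0.2, seed=0):
    """Repeatedly hold out 20% of the wells and measure well-level accuracy."""
    rng = np.random.default_rng(seed)
    names = list(wells)
    accuracies = []
    for _ in range(n_runs):
        test = set(rng.choice(names, size=max(1, int(test_frac * len(names))),
                              replace=False))
        train = [w for w in names if w not in test]
        X_tr = np.vstack([wells[w][0] for w in train])
        y_tr = np.concatenate([wells[w][1] for w in train])
        model = LogisticRegression(C=3.45, penalty="l1", solver="liblinear")
        model.fit(X_tr, y_tr)
        # Well-level decision: average segment probability vs. the well's label
        # (assumed to be 1 if any of its segments is marked as lost circulation)
        correct = sum(
            (model.predict_proba(wells[w][0])[:, 1].mean() > 0.5)
            == bool(wells[w][1].max())
            for w in test
        )
        accuracies.append(correct / len(test))
    return np.mean(accuracies), np.percentile(accuracies, [2.5, 97.5])
```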
Fig. 5 shows an example of a single approbation.

Fig. 5. Approbation of the algorithm on test data using logistic regression.

Analyzing the figure, three errors can be seen: one is a type I error and two are type II errors. This shows that the developed algorithm made errors in wells with fewer than 40 segments; for larger numbers of segments in a time series, the average probability gives correct results.

5. Conclusion and future works

In the process of development, we created an algorithm for detection of lost circulation in drilling wells employing sensor data using a machine learning technique. The method is based on time series classification theory. It is automatic, uses a minimal set of well parameters and does not require a specialist to operate. As a result of testing using cross-validation, the best results were shown by the logistic regression and random forest models, namely 78% and 75%, respectively. When testing the algorithm on earmarked wells, the percentage of correctly identified wells was 70%.

The developed algorithm can be improved. First, every time series starts in 2014, which means that if a lost circulation started in 2015, for example, the training data would contain wrong values. To solve this problem, the time series should be analyzed using a known method; for example, when creating a dataset, the Merkulova-Ginzburg method could be used to remove wrong data, which would reduce the classification error. It is also necessary to check the algorithm on new additional data from other wells and to try other algorithms for modeling, for example XGBoost or LightGBM, which should help to improve the results.

So, the proposed solution has a positive result and meets the requirements for practical needs. The algorithm allows predicting the probability of oil facility failures, which can help to choose candidates for more accurate PLT procedures.

References
[1] "Oil production in Russia in 2018 increased by 1.6% - Economy and business - TASS." [Online]. Available: https://tass.ru/ekonomika/5971425. [Accessed: 24-Apr-2019].
[2] M. A. Lapshin and U. V. Danilchenko, "The current state and perspectives of development of the oil and gas industry in Russia," Actual Probl. Aviat. Cosmonaut., vol. 2, no. 8, 2012.
[3] K. V. Strizhnev, "Unconventional oil reserves, prospects, experience, projects," 2015.
[4] K. S. Chan, "Water Control Diagnostic Plots," in SPE Annual Technical Conference and Exhibition, 1995.
[5] A. V. Raspopov, A. S. Kazantsev, D. V. Andreev, I. V. Averina, D. D. Sidorenko, and S. N. Glazyrin, "Experience and prospects of application of water inflow limitation in the fields of Perm region," Geol. Geophys. Dev. Oil Gas Fields, no. 9, pp. 41–45, 2016.
[6] D. A. Ostapchuk and I. A. Sintsov, "Improved diagnostic method for determining the causes of water flooding," 2015, pp. 115–122.
[7] T. M. Petchorin, "Analysis of hydrodynamic flows at filled-flows of liquid," 2010.
[8] K. M. Fedorov and T. N. Pechyorin, "Comparative effectiveness of techniques of production water cut causes diagnostics," J. High. Educ. Institutions. Oil Gas, no. 4, pp. 49–57, 2009.
[9] A. S. Ustyugov, V. V. Sutyagin, and M. M. Galiullin, "Express-diagnostics of definition of wells water-flooding causes based on field data analysis," Autom. Telemech. Commun. Oil Ind., no. 8, pp. 4–9, 2016.
[10] L. Akoglu, H. Tong, and D. Koutra, "Graph based anomaly detection and description: a survey," Data Min. Knowl. Discov., vol. 29, no. 3, pp. 626–688, May 2015.
[11] R. J. Hyndman, E. Wang, and N. Laptev, "Large-Scale Unusual Time Series Detection," in 2015 IEEE International Conference on Data Mining Workshop (ICDMW), 2015, pp. 1616–1619.
[12] N. Cao, C. Shi, S. Lin, J. Lu, Y.-R. Lin, and C.-Y. Lin, "TargetVue: Visual Analysis of Anomalous User Behaviors in Online Communication Systems," IEEE Trans. Vis. Comput. Graph., vol. 22, no. 1, pp. 280–289, Jan. 2016.
[13] H. N. Akouemo and R. J. Povinelli, "Probabilistic anomaly detection in natural gas time series data," Int. J. Forecast., vol. 32, no. 3, pp. 948–956, Jul. 2016.
[14] O. Vallis, J. Hochenbaum, and A. Kejariwal, "A Novel Technique for Long-Term Anomaly Detection in the Cloud," 2014.
[15] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh, "The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances," Data Min. Knowl. Discov., vol. 31, no. 3, pp. 606–660, May 2017.
[16] F. Mörchen, "Time series feature extraction for data mining using DWT and DFT," 2003.
[17] F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[18] "scikit-learn: machine learning in Python — scikit-learn 0.20.3 documentation." [Online]. Available: https://scikit-learn.org/stable/. [Accessed: 24-Apr-2019].