
A Comparative Analysis of Bubble Point Pressure Prediction Using Advanced Machine Learning Algorithms and Classical Correlations

Xi Yang, Shell International E&P Inc., Data Science; Birol Dindoruk, Shell International E&P Inc. & University of Houston, Petroleum Engineering, Mathematics and Data Science; and Ligang Lu, Shell International E&P Inc., Data Science

Abstract

The need for fluid properties, or PVT (Pressure-Volume-Temperature) properties, spans the entire Exploration and Production (E&P) lifecycle, from exploration to mature asset management and typical later-life events such as Improved Oil Recovery (IOR). As projects mature, the need for such data, its integration into various discipline-specific workflows, and its interpretation in the light of reservoir performance varies. Among the key PVT properties, bubble point pressure is probably the most important parameter. Bubble point pressure is important because it is the point at which the constant composition and variable composition portions of the depletion path merge. Geometrically, the bubble point pressure appears as a discontinuity. In addition, it dictates the existence (or not) of the incipient phase (i.e., the gas phase), leading to changes in the flow characteristics both in porous media and within the wellbore and the facilities. Furthermore, it is also a good indicator of a possible gas cap when the reservoir is at saturation (reservoir pressure equal to the bubble point pressure) or near-saturated. Beyond the highlighted uses, there are many more, such as the determination of the elements of miscibility, gas lift design, etc. Therefore, it is very important to estimate the bubble point pressure accurately. In this study, tree-based advanced machine learning algorithms, including XGBoost, LightGBM, and random forest regressor, and a multi-layer perceptron (neural network) regressor are implemented to predict bubble point pressure (p_bp). A novel super learner model, also known as a stacking ensemble, is used to enhance the base machine learning models' performance in predicting bubble point pressure. Three datasets with different predictors are prepared to study the machine learning algorithms' performance for three situations: only compositional data are available; only bulk properties (gas-oil ratio, gas gravity, API gravity, and reservoir temperature) are available; and both compositional data and bulk properties are available. To the best of our knowledge from the literature review, there is no prior research on using only compositional data and temperature to predict bubble point pressure. Our super learner model offers an accurate solution for oil bubble point pressure when only compositional data and temperature are available. Machine learning models perform better than empirical correlations with limited input data (i.e., bulk properties). When compositional data and bulk properties are all used as predictors, the super learner reaches about 5.146% mean absolute relative error in predicting the bubble point pressure from global samples with bubble point pressures in the range of 100 to 10,000 psi, which is a wider range compared to most ANN models published in the literature.

Introduction

In the field of petroleum reservoir studies, PVT properties are crucial for reservoir performance management. One of the most important PVT properties for reservoir oils is the bubble point pressure, because the trends of other properties, such as solution GOR or FVF, with pressure all change at the bubble point pressure. Bubble point pressure is typically measured via P-V expansion experiments, also known as constant composition expansion (CCE) or constant mass expansion (CME) experiments. However, there is a need to estimate the bubble point pressure when direct experimental data are not available. This could be due to various reasons, from in-country restrictions, to cost and/or time restrictions, or simply to check the quality of some of the measurements and perform trend analysis. Traditionally, empirical correlations such as Standing (1977), McCain (1991), Petrosky and Farshad (1998), and Dindoruk and Christman (2004) are used for bubble point pressure estimation using four easy-to-estimate input properties: solution GOR, gas gravity, API gravity, and reservoir temperature. Compositional data, which are also helpful and directly related to PVT properties, by design cannot be utilized by traditional empirical correlations. Thus, machine learning models are considered by researchers to help in predicting some of the PVT properties, especially bubble point pressure.

Though there are many linear and non-linear machine learning models available, the artificial neural network has become the most popular one for predicting PVT properties. Artificial neural networks (ANN) have been widely used for PVT property prediction since the late 1990s (Alimadadi et al. 2011; Al-Marhoun and Osman 2002; Elsharkawy 1998; Gharbi and Elsharkawy 1999; Gharbi et al. 1999; Goda et al. 2003; Osman et al. 2001; Osman and Al-Marhoun 2005; Numbere et al. 2013; Alakbari et al. 2016; Rammay and Abdulraheem 2016; Ramirez et al. 2017). However, tree-based ensemble regressors (e.g., XGBoost, LightGBM, and random forest regressor) are also robust and capable of maintaining good accuracy on small datasets, which are common for PVT data. Thus, we choose XGBoost, LightGBM, and random forest regressor for bubble point pressure prediction and compare their performance with a multi-layer perceptron (ANN) regressor in this study. Moreover, most studies on using machine learning to predict bubble point pressure are either limited in the range of bubble point pressure (usually under 5,000 psi) or restricted to a certain geological region (Al-Marhoun and Osman 2002; Elsharkawy 1998; Gharbi and Elsharkawy 1999; Osman and Al-Marhoun 2005; El-Sebakhy et al. 2007; Numbere et al. 2013; Ramirez et al. 2017; Rammay and Abdulraheem 2016). Therefore, our machine learning models are applied to global samples with measured bubble point pressures ranging from 100 to 10,000 psi.

Machine learning algorithms are known to have an advantage in handling multiple predictors related to the response. However, Al-Marhoun and Osman (2002), Elsharkawy (1998), Gharbi and Elsharkawy (1999), Gharbi et al. (1999), Goda et al. (2003), Osman et al. (2001), Osman and Al-Marhoun (2005), El-Sebakhy et al. (2007), Numbere et al. (2013), Ramirez et al. (2017), and Rammay and Abdulraheem (2016) all used the four bulk properties of traditional empirical correlations (solution GOR R_s, gas specific gravity γ_g, API gravity, and temperature T) as machine learning predictors to estimate the bubble point pressure. Research on the use of compositional data to predict PVT properties is limited. Alimadadi et al. (2011) utilized compositional data to predict oil formation volume factor (FVF) and density, but bubble point pressure was used as an input predictor rather than the response to be predicted. In this study, we first use a dataset with only compositional data and temperature for bubble point pressure prediction, and our super learner model achieves accurate results. We also employ the four commonly used characteristic input variables (GOR R_s, γ_g, API gravity, and temperature T), and then the combination of those commonly used variables coupled with the compositional data, as predictors for the machine learning algorithms for comparison.

As is well known in the literature (Geweke 1994; Brown et al. 2005; Tso and Yau 2007; Kampichler et al. 2010), a single machine learning model does not always perform as well as expected across many predictive modeling problems, and bubble point pressure prediction behaves similarly. Even if a machine learning model performs the best by every criterion on the whole dataset, it may still have larger errors than other machine learning models over a certain range or subset. The super learner developed by van der Laan et al. (2007) uses an ensemble to stack different base learners together to reach better predictions than even the best base learner, because the super learner absorbs the systematic errors of the base learners and thus reduces the error of the final prediction (Ju et al. 2019). The detailed super learner algorithm was proposed by Polley and van der Laan (2010) as a powerful tool for machine learning. The super learner has been applied in biological domains including genetics, medicine, epidemiology, and healthcare (Sinisi et al. 2007; Wang et al. 2011; Sweeney et al. 2014; Rose 2016; Wyss et al. 2018; Ju et al. 2019). Wyss et al. (2018) and Ju et al. (2019) found that the super learner is more robust than any of the base learners it is built on. However, for a super learner, all of its base learners and the meta learner have to be trained, so the computational time for the super learner will be longer than for any of the base learners. The PVT datasets in this research all have fewer than 1,000 samples, so the computational cost is acceptable.

Methodology

In this study, four machine learning algorithms, XGBoost (Chen and Guestrin 2016), LightGBM (Ke et al. 2017), random forest regressor (Breiman 2001), and multi-layer perceptron regressor (an artificial neural network algorithm) (Hinton 1989; Glorot and Bengio 2010), are applied for oil bubble point pressure prediction. In order to enhance the robustness of the machine learning models and improve the accuracy of bubble point pressure prediction, the super learner (van der Laan et al. 2007; Polley and van der Laan 2010), also known as a stacking ensemble, is applied in addition to the four machine learning algorithms listed earlier. It was first proposed in biostatistics by van der Laan et al. (2007), while the detailed algorithm was developed by Polley and van der Laan (2010). For some oil samples, a single machine learning model may have large prediction error while another one performs well, so the super learner can further improve the predictive capability by stacking the prediction results from the base machine learning algorithms. A brief introduction to the super learner algorithm as presented by Polley and van der Laan (2010) follows:

1. Choose a library L of m base learners. Base learners could be any machine learning algorithms, including linear and non-linear models. In our study, the four candidate base learners are a) XGBoost, b) LightGBM, c) random forest regressor, and d) multi-layer perceptron regressor.

2. Fit each base learner in L on the entire dataset X with dimension n × p (n observations and p predictors). The fitted models of all base learners are then stored.

3. Split the dataset X into K folds using a K-fold cross-validation scheme: the n observations are split into K equal-size groups; let the k-th fold be the validation samples, V(k), and the remaining groups be the training samples, T(k), for k = 1, 2, ..., K. The usual choice is 5-fold or 10-fold cross-validation, but K could be any integer between 2 and n.

4. For the k-th fold, fit each algorithm in L on T(k), and then use the fitted model to predict on V(k), for k = 1, 2, ..., K. Thus, every observation has a prediction from a model trained without itself (out-of-fold prediction).

5. Stack the predictions from each algorithm in L to create Z, which is an n × m matrix.

6. Fit the meta learner on Z and store the learner (see the sketch after Table 1). For the simplest meta learner, each algorithm in L is assigned a coefficient β_j (with Σ β_j = 1), and the values of β_j are determined using least squares. However, the meta learner could be another higher-level linear regression or even a non-linear model. This step is forward only, which means the training and optimization of the base learners are not re-done.

In this study, all the machine learning algorithms mentioned earlier (XGBoost, LightGBM, random forest regressor, MLP neural network, and super learner) are implemented using Python 3.7. The corresponding Python packages and their versions are listed in Table 1. A more detailed introduction to the machine learning algorithms utilized is given in Appendix A.

Table 1—Machine learning algorithm Python packages.

| Algorithm | Python Package | Package Version | Website |
| XGBoost | xgboost | 0.9.0 | https://xgboost.readthedocs.io/en/latest/index.html |
| LightGBM | lightgbm | 2.2.3 | https://lightgbm.readthedocs.io/en/latest/ |
| Random Forest Regressor | scikit-learn.ensemble (Pedregosa et al. 2011) | 0.20.3 | https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html |
| MLP Neural Network | scikit-learn.neural_network (Pedregosa et al. 2011) | 0.20.3 | https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html |
| Super Learner | mlens.ensemble.super_learner | 0.2.3 | http://ml-ensemble.com/ |
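For readers who want to reproduce the workflow, a minimal sketch of the super learner (stacking) procedure outlined in steps 1 through 6 above is given below, built from scikit-learn-style estimators of the packages in Table 1. The hyperparameter values and the ordinary least-squares meta learner are illustrative assumptions, not the exact settings used in this study.

```python
# Minimal sketch of the super learner (stacking) procedure in steps 1-6 above.
# Hyperparameters and the least-squares meta learner are illustrative
# assumptions, not the exact settings used in the study.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor


def fit_super_learner(X, y, n_splits=5):
    """X, y are NumPy arrays; returns the fitted base learners and meta learner."""
    base_learners = [
        XGBRegressor(n_estimators=300),                           # a) XGBoost
        LGBMRegressor(n_estimators=300),                          # b) LightGBM
        RandomForestRegressor(n_estimators=300),                  # c) random forest
        MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=2000), # d) MLP (ANN)
    ]
    # Steps 3-5: out-of-fold predictions of each base learner form the n x m matrix Z.
    Z = np.zeros((len(y), len(base_learners)))
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        for j, learner in enumerate(base_learners):
            fold_model = clone(learner).fit(X[train_idx], y[train_idx])
            Z[val_idx, j] = fold_model.predict(X[val_idx])
    # Step 6: fit the (least-squares) meta learner on the stacked predictions Z.
    meta_learner = LinearRegression().fit(Z, y)
    # Step 2: refit each base learner on the full dataset for later prediction.
    fitted_bases = [clone(learner).fit(X, y) for learner in base_learners]
    return fitted_bases, meta_learner


def super_learner_predict(fitted_bases, meta_learner, X_new):
    Z_new = np.column_stack([model.predict(X_new) for model in fitted_bases])
    return meta_learner.predict(Z_new)
```

The study's implementation relies on the mlens package listed in Table 1; the Bayesian ridge meta learner used later in the Results section could be swapped in by replacing LinearRegression with sklearn.linear_model.BayesianRidge.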

Dataset

In this study, we used the PVT data stored in a standard format in GeoMark RFDBASE (RFDbase – Rock & Fluid Database by GeoMark Research, Ltd.). We have used all the global PVT data to study the ability of the machine learning algorithms to predict oil bubble point pressure without constraining the range of the data and/or the regional split. To predict bubble point pressure (p_bp), we have used two categories of input properties and their combination from the subject database, along with the reservoir temperature T: a) black oil properties: GOR R_s, gas specific gravity γ_g, and API gravity; and b) compositional data, including mole percentages of the species, molecular weight of the plus fraction (C7+), and the total molecular weight. From the selected input parameters, we used three sets of parameters as predictors for bubble point pressure prediction. Set No. 1 is mainly compositional data and reservoir temperature; Set No. 2 is the three black oil properties used in traditional correlations plus temperature; Set No. 3 is the union of Set No. 1 and Set No. 2. The details of the three sets of predictors (also summarized in the sketch after this list) are:

• No. 1 (15 predictors): measured mole % of N2, CO2, H2S, C1, C2, C3, iC4, nC4, iC5, nC5, C6, and C7+; MW of C7+; MW of the reservoir fluid; reservoir temperature (or the temperature of the p_bp measurement).

• No. 2 (4 predictors): single stage GOR, R_s; specific gas gravity, γ_g; API gravity; reservoir temperature (the temperature of the p_bp measurement).

• No. 3 (18 predictors): combination of (1) and (2) above.
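For illustration, the three predictor sets can be organized as the following configuration; the column names are hypothetical placeholders, not the actual RFDBase field names.

```python
# Illustrative grouping of the three predictor sets; the column names are
# hypothetical placeholders, not the actual RFDBase field names.
COMPOSITIONAL = ["N2", "CO2", "H2S", "C1", "C2", "C3", "iC4", "nC4",
                 "iC5", "nC5", "C6", "C7plus", "MW_C7plus", "MW_fluid"]
BULK = ["GOR_single_stage", "gas_gravity", "API"]

PREDICTOR_SETS = {
    "No1": COMPOSITIONAL + ["T_res"],         # 15 predictors
    "No2": BULK + ["T_res"],                  # 4 predictors
    "No3": COMPOSITIONAL + BULK + ["T_res"],  # 18 predictors
}
```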

Data cleaning/quality control is also a crucial step before machine learning application. Sometimes the issues with the data are as simple as typos or occasional unit errors, but they still need to be fixed. In addition, the samples with missing data and/or erroneous data that cannot be verified are all excluded. As a result, for set No. 1 of predictors shown above, the corresponding dataset after cleaning has 746 samples, and the detailed statistical information is shown in Table 2. Similarly, for sets No. 2 and No. 3 of predictors shown above, the statistical information of the corresponding datasets No. 2 (459 samples) and No. 3 (397 samples) is shown in Tables 3 and 4. The detailed data cleaning procedure is covered in Appendix B.

Table 2—Dataset No.1 (746 samples) statistical summary.

| Property | Mean | Standard Deviation | Min | Median | Max |
| Temperature (°F) | 185.68 | 52.52 | 68.00 | 180.00 | 330.00 |
| Bubble Point Pressure (psi) | 3586 | 2016 | 66 | 3375 | 10335 |
| N2 (mole %) | 0.51 | 0.79 | 0.00 | 0.25 | 10.87 |
| CO2 (mole %) | 1.89 | 5.19 | 0.00 | 0.27 | 34.19 |
| H2S (mole %) | 0.14 | 1.16 | 0.00 | 0.00 | 19.33 |
| C1 (mole %) | 41.98 | 16.52 | 0.32 | 42.64 | 78.33 |
| C2 (mole %) | 5.32 | 2.62 | 0.05 | 5.29 | 13.56 |
| C3 (mole %) | 4.55 | 2.28 | 0.00 | 4.66 | 13.82 |
| iC4 (mole %) | 1.05 | 0.56 | 0.00 | 0.96 | 4.86 |
| nC4 (mole %) | 2.52 | 1.24 | 0.00 | 2.50 | 7.52 |
| iC5 (mole %) | 1.22 | 0.66 | 0.00 | 1.13 | 6.58 |
| nC5 (mole %) | 1.48 | 0.79 | 0.00 | 1.42 | 5.43 |
| C6 (mole %) | 2.62 | 1.54 | 0.00 | 2.23 | 9.82 |
| C7+ (mole %) | 36.73 | 15.33 | 11.87 | 34.74 | 85.08 |
| C7+ MW | 255.76 | 52.29 | 72.01 | 249.75 | 540.40 |
| Reservoir Fluid MW | 114.75 | 51.83 | 42.20 | 98.25 | 348.14 |

Table 3—Dataset No.2 (459 samples) statistical summary.

| Property | Mean | Standard Deviation | Min | Median | Max |
| Temperature (°F) | 186.21 | 49.28 | 75.00 | 181.00 | 330.00 |
| Bubble Point Pressure (psi) | 3884.09 | 2168.17 | 255.00 | 3812.00 | 10335.00 |
| API Gravity (°) | 29.96 | 6.36 | 9.80 | 29.60 | 52.20 |
| Single Stage GOR (SCF/STB) | 974.97 | 689.54 | 30.00 | 844.49 | 5934.00 |
| Single Stage Gas Gravity | 0.85 | 0.16 | 0.57 | 0.83 | 1.82 |

Table 4—Dataset No.3 (397 samples) statistical summary.

| Property | Mean | Standard Deviation | Min | Median | Max |
| Temperature (°F) | 185.17 | 46.62 | 75.00 | 180.30 | 320.00 |
| Bubble Point Pressure (psi) | 4003.99 | 2128.64 | 255.00 | 3965.00 | 10335.00 |
| N2 (mole %) | 0.35 | 0.45 | 0.00 | 0.22 | 4.02 |
| CO2 (mole %) | 0.65 | 2.77 | 0.00 | 0.13 | 34.19 |
| H2S (mole %) | 0.07 | 0.99 | 0.00 | 0.00 | 19.33 |
| C1 (mole %) | 45.98 | 16.39 | 0.32 | 49.76 | 78.33 |
| C2 (mole %) | 5.19 | 2.60 | 0.05 | 5.22 | 12.22 |
| C3 (mole %) | 4.36 | 2.16 | 0.00 | 4.55 | 13.43 |
| iC4 (mole %) | 0.95 | 0.47 | 0.00 | 0.93 | 4.24 |
| nC4 (mole %) | 2.36 | 1.16 | 0.00 | 2.33 | 7.52 |
| iC5 (mole %) | 1.16 | 0.62 | 0.00 | 1.08 | 6.58 |
| nC5 (mole %) | 1.40 | 0.77 | 0.00 | 1.31 | 5.43 |
| C6 (mole %) | 2.33 | 1.32 | 0.00 | 2.03 | 9.82 |
| C7+ (mole %) | 35.21 | 14.78 | 11.87 | 30.35 | 83.73 |
| C7+ MW | 268.46 | 50.08 | 72.01 | 265.18 | 540.40 |
| Reservoir Fluid MW | 114.04 | 51.04 | 48.55 | 97.31 | 348.14 |
| API Gravity (°) | 29.75 | 6.31 | 9.80 | 29.30 | 52.20 |
| Single Stage GOR (SCF/STB) | 986.077 | 635.616 | 30.00 | 900.165 | 3263.00 |
| Single Stage Gas Gravity | 0.84 | 0.15 | 0.57 | 0.82 | 1.82 |

Results & Discussion

Four machine learning algorithms, XGBoost, LightGBM, random forest regressor, and MLP regressor, are applied to all three combinations of input parameters shown earlier. The super learner is also applied to the same combinations of input parameters, with the four base learners (XGBoost, LightGBM, random forest regressor, and MLP regressor) streamed into Bayesian ridge regression (MacKay 1992; Tipping 2001) as the meta learner (Fig. 1). In order to avoid bias in the selection of training and test data, we use a 5-fold cross-validation scheme on all three sets of input predictors. First, every dataset is split evenly and randomly into 5 folds, and the distribution of each fold is checked to make sure there is no major difference from the distribution of the whole dataset. For each iteration, one fold of samples is used as test or validation data while the remaining 4 folds are used as training data to fit the model. The fitted model is then used to predict on the validation data. The iterations continue until all folds of samples have been predicted. In this way, every sample in the dataset is predicted by a model that was fitted without that sample in its training data. Such a cross-validation scheme reduces the risk of overfitting to a fixed training dataset while making sure that all the data are fully utilized. The schematic of K-fold cross-validation is shown in Fig. 2 for clarification.

Figure 1—Schematic of super learner model for oil bubble point pressure prediction.

Figure 2—Schematic of K-fold cross-validation scheme.
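A minimal sketch of this cross-validation evaluation, together with the error statistics reported in Tables 5 through 8, is given below. The relative-error convention (predicted minus measured, divided by measured) and the function names are assumptions made for illustration.

```python
# Sketch of the 5-fold cross-validation evaluation and the error statistics
# reported in Tables 5-8. The relative error is assumed to be
# (predicted - measured) / measured; names are illustrative.
import numpy as np
from scipy.stats import pearsonr
from sklearn.base import clone
from sklearn.model_selection import KFold


def out_of_fold_predictions(model, X, y, n_splits=5, seed=0):
    """Every sample is predicted by a model that never saw it during training."""
    y_pred = np.zeros(len(y))
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        y_pred[val_idx] = fitted.predict(X[val_idx])
    return y_pred


def error_statistics(y_true, y_pred):
    rel_err = (y_pred - y_true) / y_true
    return {
        "mean absolute error (psi)": np.mean(np.abs(y_pred - y_true)),
        "mean absolute relative error (%)": 100 * np.mean(np.abs(rel_err)),
        "absolute relative error std (%)": 100 * np.std(np.abs(rel_err)),
        "mean relative error (%)": 100 * np.mean(rel_err),
        "relative error std (%)": 100 * np.std(rel_err),
        "Pearson correlation": pearsonr(y_true, y_pred)[0],
    }
```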

Dataset No.1

When the bubble point pressure is not available but a compositional analysis has been completed, we may choose to predict the bubble point pressure directly from the compositional data. Although black-oil-proxy based classical correlations cannot be used to calculate the bubble point pressure from direct compositional data, we utilize such compositional information and show the cross-validation results of the four base learners as well as the super learner in Fig. 3. The error distribution plots for the different machine learning algorithms are shown in Fig. 4, while the statistical dimensions of the outcome are shown in Table 5. Although the differences between the base learners and the super learner model are not significant, the super learner still performs better than the rest of the base learners. XGBoost, LightGBM, and random forest regressor have wide error distributions for high bubble point pressure (8,000 to 10,000 psi) fluids, while the MLP regressor tends to under-predict bubble point pressure within this range. The super learner does not cure this problem, but it balances the under-prediction with reduced prediction errors in this range. As shown in Table 5, the super learner ranks first on all performance criteria. From the error distribution plots in Fig. 4, the super learner has the highest density around zero prediction error, and the errors are distributed symmetrically.


Figure 3—Cross plot of measured p_bp vs. predicted p_bp of dataset No.1 using: (a) XGBoost; (b) LightGBM; (c) Random forest regressor; (d) Multi-layer perceptron regressor (ANN); (e) Super learner.


Figure 4—Prediction error distribution plots of dataset No.1 using: (a) XGBoost; (b) LightGBM; (c) Random forest regressor; (d) Multi-layer perceptron regressor (ANN); (e) Super learner.

Table 5—Dataset No.1 cross validation prediction error statistical results.

| Quantity | XGBoost | LightGBM | RF Regressor | MLP Regressor | Super Learner |
| Mean absolute error (psi) | 202.092 | 218.953 | 228.898 | 217.877 | 192.979 |
| Mean absolute relative error (%) | 7.884 | 8.324 | 9.661 | 8.546 | 7.162 |
| Absolute relative error standard deviation (%) | 17.796 | 15.949 | 24.580 | 15.736 | 14.155 |
| Mean relative error (%) | 2.221 | 2.248 | 3.749 | 0.972 | 0.953 |
| Relative error standard deviation (%) | 19.337 | 17.850 | 26.143 | 17.881 | 15.835 |
| Pearson correlation | 0.990 | 0.988 | 0.988 | 0.988 | 0.990 |

Dataset No.2

For dataset No.2, the predictors we use are the three black oil properties, solution GOR R_s, gas specific gravity γ_g, and API gravity, plus the reservoir temperature T. Again, we applied the four candidate base learners as well as the super learner on top of the base learners. Due to the synergy between these input parameters and the traditional correlations, we also considered some selected correlations from the literature that are aligned with the range of information considered here. For example, some published correlations, such as Standing (1977) and Petrosky and Farshad (1998), were not developed to cover the wide range of bubble point pressures up to 10,000 psi. The Standing (1977) and Petrosky and Farshad (1998) correlations are widely used to estimate the bubble point pressure within their range of validity (and even extrapolated beyond their limits); therefore, we still use them here for comparison against the machine learning models. We have also used the bubble point pressure correlation proposed by Dindoruk and Christman (2004), which is designed to cover a wide range of bubble point pressures but was originally developed for Gulf of Mexico (GOM) oils. To make a fair comparison, the coefficients of the original Standing (1977), Petrosky and Farshad (1998), and Dindoruk and Christman (2004) correlations are also re-trained (or re-calibrated) using the same datasets utilized in our machine learning models. In order to tune the coefficients of the Standing (1977), Petrosky and Farshad (1998), and Dindoruk and Christman (2004) bubble point pressure correlations, we set up an objective function defined as the sum of the absolute errors between the measured values and the predicted results on the training data (Eq. 1):

f(\mathbf{a}) = \sum_{i=1}^{N} \left| \, p_{bp,\mathrm{corr}}\!\left(R_{s,i}, \gamma_{g,i}, \mathrm{API}_i, T_i; \mathbf{a}\right) - p_{bp,i} \, \right| ..........................................(1)

where \mathbf{a} is the vector of correlation coefficients. The Standing (1977) correlation for bubble point pressure, with 5 coefficients, is shown in Eq. 2; the Petrosky and Farshad (1998) correlation, with 8 coefficients, is shown in Eq. 3; and the Dindoruk and Christman (2004) correlation, with 11 coefficients, is shown in Eq. 4.

p_{bp} = a_1 \left[ \left( \frac{R_s}{\gamma_g} \right)^{a_2} 10^{\,a_3 T - a_4 \mathrm{API}} - a_5 \right] ............................................................................................(2)

p_{bp} = a_5 \left[ \frac{R_s^{\,a_6}}{\gamma_g^{\,a_7}} \cdot 10^{X} - a_8 \right], \quad \text{where } X = a_1 T^{\,a_2} - a_3 \,\mathrm{API}^{\,a_4} ..............................................(3)

p_{bp} = a_8 \left[ \frac{R_s^{\,a_9}}{\gamma_g^{\,a_{10}}} \cdot 10^{A} + a_{11} \right], \quad \text{where } A = \frac{a_1 T^{\,a_2} + a_3 \,\mathrm{API}^{\,a_4}}{\left( a_5 + \dfrac{2 R_s^{\,a_6}}{\gamma_g^{\,a_7}} \right)^{2}} ......................................................(4)
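To make the re-tuning step concrete, the sketch below minimizes the objective function of Eq. 1 over the Standing-form coefficients of Eq. 2 with SciPy's Nelder-Mead implementation (the optimization method is discussed in the next paragraph). The starting point uses Standing's published coefficients, and the input arrays and function names are assumptions made for illustration rather than the exact implementation used in this study.

```python
# Sketch of the correlation re-tuning: minimize the sum of absolute errors
# (Eq. 1) over the Standing-form coefficients (Eq. 2) with Nelder-Mead.
# The starting point uses Standing's published coefficients; Rs, gamma_g,
# api, T and pbp_measured are assumed to be NumPy arrays.
import numpy as np
from scipy.optimize import minimize


def standing_pbp(a, Rs, gamma_g, api, T):
    # Eq. 2: p_bp = a1 * [(Rs / gamma_g)^a2 * 10^(a3*T - a4*API) - a5]
    a1, a2, a3, a4, a5 = a
    return a1 * ((Rs / gamma_g) ** a2 * 10.0 ** (a3 * T - a4 * api) - a5)


def tune_standing_coefficients(Rs, gamma_g, api, T, pbp_measured):
    def objective(a):
        # Eq. 1: sum of absolute errors on the training data
        return np.sum(np.abs(standing_pbp(a, Rs, gamma_g, api, T) - pbp_measured))

    a0 = np.array([18.2, 0.83, 0.00091, 0.0125, 1.4])  # Standing (1977) values
    result = minimize(objective, a0, method="Nelder-Mead",
                      options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-8})
    return result.x
```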

The objective function shown in Eq. 1 is then minimized using the Nelder–Mead method (Nelder and Mead 1965; Gao and Han 2012), a commonly applied numerical method for minimization in multi-dimensional space. By integrating this optimization algorithm with the Dindoruk and Christman (2004) bubble point pressure correlation, we are able to adjust its eleven coefficients using one training dataset and predict on another dataset. In other words, using this analog approach, the tuned Dindoruk and Christman bubble point pressure correlation can now be used in the same way as any other machine learning algorithm. The same optimization approach is also applied to the Standing (1977) and Petrosky and Farshad (1998) correlations.

For dataset No.2, we show the results of the MLP regressor (which performs the best among the 4 base learners), the super learner, the original Dindoruk and Christman (2004) correlation, and the tuned Dindoruk and Christman (2004) correlation. From the cross plots shown in Fig. 5, we can observe that the MLP regressor performs similarly to the super learner, except that for one sample the MLP regressor over-predicts the bubble point pressure significantly, while the super learner still over-predicts that result but by a smaller magnitude. In Fig. 5c, the un-tuned Dindoruk and Christman (2004) correlation performs well when the measured bubble point pressure is lower than 4,000 psi. When the measured bubble point pressure is higher than 4,000 psi, the un-tuned Dindoruk and Christman (2004) correlation tends to over-predict. After tuning on the dataset not restricted to the Gulf of Mexico, the Dindoruk and Christman correlation exhibits a major improvement in its predictive capability. The sample significantly over-predicted by the machine learning models is under-predicted by the tuned Dindoruk and Christman (2004) correlation, parallel to its under-prediction within the range of 7,500 to 8,000 psi. In the error distribution plots shown in Fig. 6, the super learner has the most symmetrical error distribution, but the peak appears slightly below zero error. Comparing Figs. 6c and 6d, the un-tuned Dindoruk and Christman (2004) correlation has an error distribution highly skewed to the right, while the tuned Dindoruk and Christman (2004) correlation has a more bell-shaped error distribution, but the peak appears below zero error. The statistical performance of the machine learning and correlation methods for dataset No.2 is shown in Table 6.

In Table 6, using mean absolute error or mean absolute relative error as the predictive quality criterion, the MLP regressor performs even slightly better than the super learner. However, based on the other statistical dimensions in Table 6, the super learner still performs well. As for the Dindoruk and Christman (2004) correlation, we see that even when the dataset is global, this correlation still calculates the bubble point pressure well after re-tuning. The results in Table 6 show that the tuned Dindoruk and Christman (2004) correlation performs slightly worse than the super learner, but not by much. The Standing (1977) and Petrosky and Farshad (1998) correlations do not perform well in Table 6; in particular, the Petrosky and Farshad (1998) correlation has large standard deviations of both the relative error and the absolute relative error. The tuned Standing (1977) and Petrosky and Farshad (1998) correlations perform better, but still not as well as the tuned Dindoruk and Christman (2004) correlation or the machine learning models. Since the Dindoruk and Christman (2004) correlation performs well after tuning, we show its coefficients for predicting each fold in Table 7 below.


Figure 5—Cross plot of measured p_bp vs. predicted p_bp of dataset No.2 using: (a) Multi-layer perceptron regressor (ANN); (b) Super learner; (c) Dindoruk and Christman (2004); (d) Tuned Dindoruk and Christman.


Figure 6—Prediction error distribution plots of dataset No.2 using: (a) Multi-layer perceptron regressor (ANN); (b) Super learner; (c) Dindoruk and Christman (2004); (d) Tuned Dindoruk and Christman.

Table 6—Dataset No.2 cross validation prediction error statistical results.

| Quantity | MLP Regressor | Super Learner | Dindoruk and Christman (2004) | Tuned Dindoruk and Christman (2004) | Standing (1977) | Tuned Standing (1977) | Petrosky and Farshad (1998) | Tuned Petrosky and Farshad (1998) |
| Mean absolute error (psi) | 241.095 | 249.913 | 400.920 | 258.439 | 517.501 | 366.611 | 475.159 | 315.438 |
| Mean absolute relative error (%) | 7.824 | 7.976 | 11.216 | 8.589 | 13.351 | 11.475 | 17.525 | 11.355 |
| Absolute relative error standard deviation (%) | 9.557 | 8.219 | 9.279 | 8.481 | 10.239 | 12.173 | 31.774 | 15.895 |
| Mean relative error (%) | 1.571 | -1.055 | 0.947 | -1.332 | -1.505 | 1.983 | -3.255 | -0.148 |
| Relative error standard deviation (%) | 12.251 | 11.404 | 14.526 | 11.996 | 16.758 | 16.611 | 36.140 | 19.533 |
| Pearson correlation | 0.986 | 0.988 | 0.982 | 0.987 | 0.949 | 0.970 | 0.972 | 0.978 |

Table 7—Coefficients for tuned Dindoruk and Christman (2004) bubble point pressure correlation (Dataset 2).

| Coefficient | Predicting Fold 1 | Predicting Fold 2 | Predicting Fold 3 | Predicting Fold 4 | Predicting Fold 5 |
| a1 | 3.06352E-11 | 5.04529E-11 | 8.49694E-11 | 3.19345E-11 | 1.95178E-10 |
| a2 | 3.088707750 | 3.216612640 | 1.468152350 | 2.915895590 | 2.935542540 |
| a3 | -3.72736E-04 | -1.63649E-03 | -1.17441E-03 | -1.99699E-04 | -1.61416E-03 |
| a4 | 0.925219814 | 0.808780608 | 0.821921000 | 0.916426081 | 0.827581392 |
| a5 | 0.029445287 | 0.025502941 | 0.045409324 | 0.046460722 | 0.017901473 |
| a6 | -0.389883596 | -0.302837627 | -0.343171683 | -0.493889746 | -0.296716614 |
| a7 | -0.114736668 | -0.217082850 | -0.068532041 | 0.133725979 | -0.130922778 |
| a8 | 2.192031670 | 2.597809730 | 2.660391730 | 3.696593970 | 2.558256730 |
| a9 | 1.166294530 | 1.147039170 | 1.153314700 | 1.100811120 | 1.155862830 |
| a10 | 1.391403200 | 1.633863490 | 1.378933910 | 1.178244240 | 1.455675200 |
| a11 | 0.012585899 | -0.014439217 | 0.009685796 | 0.030784142 | -0.008529091 |

Dataset No.3

For dataset No.3, both the compositional data and the three black oil properties plus the reservoir temperature are available as predictors for the machine learning algorithms. The tuned Standing (1977), tuned Petrosky and Farshad (1998), and tuned Dindoruk and Christman (2004) correlations are still applied to this dataset, although they use only the three black oil properties and temperature. Even though machine learning uses all 18 parameters as predictors while the conventional correlations use only four, we still compare their results on this dataset to study whether the coupling of compositional data and black oil properties enhances the accuracy of bubble point pressure prediction.

In Fig. 7, cross plots of the two best base learners (XGBoost and MLP regressor), the super learner, and the tuned Dindoruk and Christman correlation are presented. In Fig. 7a, XGBoost performs well when the measured bubble point pressure is lower than 8,000 psi but has large prediction errors when the measured bubble point pressure is higher than 8,000 psi. In Fig. 7b, the MLP regressor performs well in the range of high measured bubble point pressure but over-predicts two samples in the measured bubble point pressure range of 3,500 to 4,000 psi. In Fig. 7c, the super learner adopts the merits of the base learners, and its predictions are stable over the whole range of measured bubble point pressure. In Fig. 7d, the tuned Dindoruk and Christman correlation does not perform well for high bubble point pressures and has one severely over-predicted sample. The error distribution plots shown in Fig. 8 and the prediction error statistics shown in Table 8 both confirm that the super learner performs the best over the full range of this dataset. The Dindoruk and Christman (2004) correlation coefficients for predicting each fold are shown in Table 9 below.


Figure 7—Cross plot of measured p_bp vs. predicted p_bp of dataset No.3 using: (a) XGBoost; (b) Multi-layer perceptron regressor (ANN); (c) Super learner; (d) Tuned Dindoruk and Christman.


Figure 8—Prediction error distribution plots of dataset No.3 using: (a) XGBoost; (b) Multi-layer perceptron regressor (ANN); (c) Super learner; (d) Tuned Dindoruk and Christman (2004).

Table 8—Dataset No.3 cross validation prediction error statistical results.

| Quantity | XGBoost | MLP Regressor | Super Learner | Tuned Dindoruk and Christman (2004) | Tuned Standing (1977) | Tuned Petrosky and Farshad (1998) |
| Mean absolute error (psi) | 184.725 | 154.854 | 149.902 | 246.526 | 363.936 | 314.242 |
| Mean absolute relative error (%) | 5.990 | 5.478 | 5.146 | 7.582 | 10.683 | 10.893 |
| Absolute relative error standard deviation (%) | 8.947 | 9.001 | 8.250 | 9.035 | 11.494 | 14.942 |
| Mean relative error (%) | 1.277 | 1.195 | 0.976 | -0.759 | 1.687 | 0.525 |
| Relative error standard deviation (%) | 10.691 | 10.469 | 9.674 | 13.939 | 15.601 | 18.484 |
| Pearson correlation | 0.992 | 0.995 | 0.995 | 0.986 | 0.972 | 0.980 |

Table 9—Coefficients for tuned Dindoruk and Christman (2004) bubble point pressure correlation (Dataset 3).

| Coefficient | Predicting Fold 1 | Predicting Fold 2 | Predicting Fold 3 | Predicting Fold 4 | Predicting Fold 5 |
| a1 | 3.52086E-10 | 2.23174E-10 | 1.01519E-10 | 1.12822E-10 | 1.87231E-11 |
| a2 | 1.753337320 | 2.843105670 | 2.669111780 | 3.115926560 | 2.740147770 |
| a3 | -4.63920E-05 | -3.78223E-04 | -6.97371E-04 | -1.03935E-03 | -1.24119E-03 |
| a4 | 1.542893370 | 1.073005000 | 0.630784383 | 1.008389730 | 0.933181599 |
| a5 | 0.060608797 | 0.016190912 | 0.049705270 | 0.030094812 | 0.021959094 |
| a6 | -0.385949055 | -0.327867347 | -0.515821867 | -0.281001522 | -0.288325806 |
| a7 | -0.087599777 | -0.053116706 | -0.094247445 | -0.062161234 | -0.061038264 |
| a8 | 6.658862530 | 3.556438570 | 1.166903540 | 4.627641460 | 3.153017080 |
| a9 | 0.975912444 | 1.082712230 | 1.325034880 | 1.054963340 | 1.121614120 |
| a10 | 1.285171710 | 1.420619320 | 1.387712530 | 1.372258670 | 1.421674380 |
| a11 | 0.004773802 | 0.014490032 | 0.016009203 | 0.003550068 | 0.013642342 |

Though the tuned Dindoruk and Christman (2004) bubble point pressure correlation does not perform as well as the machine learning models, it is still useful on the global dataset for quick calculations. Thus, we show the tuned coefficients, optimized using the whole dataset No.3, in Table 10. The coefficients shown in Table 10 can not only be used to calculate the bubble point pressure but can also be used for quality checking a subject dataset before feeding the data to machine learning models. A useful case example of such a quality check process is demonstrated in Appendix C.

Table 10—Suggested coefficients of the tuned Dindoruk and Christman (2004) bubble point pressure correlation.

| Coefficient | Value |
| a1 | 1.51366E-10 |
| a2 | 2.844319980 |
| a3 | -8.46072E-04 |
| a4 | 0.879019955 |
| a5 | 0.030956521 |
| a6 | -0.339185162 |
| a7 | -0.100812111 |
| a8 | 2.356760050 |
| a9 | 1.158289240 |
| a10 | 1.525553080 |
| a11 | 0.013780159 |
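For quick estimates (and for the quality check workflow of Appendix C), the tuned correlation can be evaluated directly from the Table 10 coefficients. The sketch below assumes the functional form of Eq. 4 as reconstructed earlier, with R_s in SCF/STB, temperature in °F, and p_bp in psi; the function name is illustrative.

```python
# Quick bubble point estimate from the tuned Dindoruk and Christman (2004)
# coefficients in Table 10, assuming the functional form of Eq. 4 as
# reconstructed above (Rs in SCF/STB, T in deg F, p_bp in psi).
A10 = [1.51366e-10, 2.844319980, -8.46072e-04, 0.879019955, 0.030956521,
       -0.339185162, -0.100812111, 2.356760050, 1.158289240, 1.525553080,
       0.013780159]  # a1 ... a11 from Table 10


def dindoruk_christman_pbp(Rs, gamma_g, api, T, a=A10):
    a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11 = a
    X = (a1 * T ** a2 + a3 * api ** a4) / (a5 + 2.0 * Rs ** a6 / gamma_g ** a7) ** 2
    return a8 * (Rs ** a9 / gamma_g ** a10 * 10.0 ** X + a11)


# Example with the median values of dataset No.2 (Table 3):
print(dindoruk_christman_pbp(Rs=844.49, gamma_g=0.83, api=29.6, T=181.0))
```

With these median inputs, the sketch returns roughly 3,800 psi, in line with the median measured bubble point pressure of 3,812 psi in Table 3, which provides a basic sanity check of the reconstructed form.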

Summary and Conclusions

In this study, we have prepared three datasets with three sets of predictors: compositional data, black oil properties, and the combination of both (compositional data and black oil properties). For every case, temperature was also part of the predictors, since bubble point pressure is a function of temperature. Advanced tree-based machine learning algorithms (XGBoost, LightGBM, and random forest regressor) and an MLP regressor (ANN) are applied to the three datasets. The super learner successfully enhanced machine learning model robustness and performance by stacking the prediction results from the base learners, especially on datasets No. 1 and 3. We also tuned three conventional bubble point pressure correlations (Standing 1977; Petrosky and Farshad 1998; Dindoruk and Christman 2004) to compare with the machine learning algorithms. Some important conclusions are drawn:

1. Using only compositional data and temperature, the super learner reaches a mean absolute relative error of 7.162% on p_bp prediction over the range of measured values from 66 psi to 10,335 psi.

2. Using only bulk properties (R_s, γ_g, API, and T), the MLP regressor has a mean absolute relative error of 7.824% on p_bp prediction, which is slightly better than the super learner (7.976%), but the super learner has a more balanced performance based on mean relative error, mean absolute relative error, and standard deviation.

3. Using both compositional data and bulk properties, the super learner exhibited a mean absolute relative error of 5.146% (standard deviation = 8.250%) over the wide range of bubble point pressures from 100 to 10,000 psi.

4. In all combinations of predictors considered, the super learner proved useful for combining the merits of the base machine learning algorithms and enhancing predictive robustness on p_bp.

5. The Dindoruk and Christman (2004) correlation performs well after tuning with the Nelder-Mead optimization algorithm, and yields a mean absolute relative error of 8.589% on dataset No.2, which is not as good as the machine learning algorithms. This observation shows that although the Dindoruk and Christman (2004) correlation was designed for GOM oils, it may still be suitable for global oil samples with the improved coefficients presented.

6. Using only bulk properties (R_s, γ_g, API, and T), we suggest using the MLP regressor as the algorithm for p_bp prediction, considering its performance and computational efficiency. The Dindoruk and Christman (2004) correlation is also suitable for p_bp prediction using bulk properties (R_s, γ_g, API, and T), but a proper optimization algorithm (such as the Nelder-Mead algorithm) should be applied for parameter tuning to obtain the best results.

7. When compositional data are available, we suggest using the super learner with tree-based algorithms (e.g., XGBoost, LightGBM, or random forest) and the MLP algorithm as base learners for p_bp prediction. If bulk properties are available together with compositional data, the machine learning algorithms provide stable predictions on global oil samples with measured p_bp ranging from 300 to 10,000 psi.

References Alakbari, F. S., Elkatatny, S., and Baarimah, S. O. 2016. Prediction of Bubble Point Pressure Using Artificial Intelligence AI Techniques. Presented at the SPE Middle East Artificial Lift Conference and Exhibition, 30 November-1 December, Manama, Kingdom of Bahrain. Alimadadi, F., Fakhri, A., Farooghi, D., and Sadati, H. 2011. Using a Committee Machine With Artificial Neural Networks To Predict PVT Properties of Iran Crude Oil. SPE Reservoir Evaluation & Engineering 14(01): 129-137. Al-Marhoun, M. A., & Osman, E. A. 2002. Using Artificial Neural Networks to Develop New PVT Correlations for Saudi Crude Oils. Presented at the Abu Dhabi International Petroleum Exhibition and Conference, 13-16 October, Abu Dhabi, United Arab Emirates. Breiman, L., Friedman, J., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA, USA. Breiman, L. 2001. Random Forests. Machine Learning 45(01): 5-32. Brown, G., Wyatt, J. L., and Tino, P. 2005. Managing Diversity in Regression Ensembles. Journal of Machine Learning Research 9(Nov): 2607-2633. Chen T., and Guestrin, C. 2016. XGBoost: A Scalable Tree Boosting System. KDD '16 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 785-794.

Dindoruk, B., Christman, P.G. 2004. PVT Properties and Viscosity Correlations for Gulf of Mexico Oils. SPE Reservoir Evaluation & Engineering 7(06): 427-437. Elsharkawy, A. M. 1998. Modeling the Properties of Crude Oil and Gas Systems Using RBF Network. Presented at the SPE Asia Pacific Oil and Gas Conference and Exhibition, 12-14 October, Perth, Australia. El-Sebakhy, E. A., Sheltami, T., Al-Bokhitan, S. Y., Shaaban, Y., Raharja, P. D., and Khaeruzzaman, Y. 2007. Support Vector Machines Framework for Predicting the PVT Properties of Crude Oil Systems. Presented at the SPE Middle East Oil and Gas Show and Conference, 11-14 March, Manama, Bahrain. Friedman, J.H. 2001. Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(05): 1189-1232. Friedman, J.H. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38(04): 367378. Gao, F. and Han, L. 2012. Implementing the Nelder-Mead simplex algorithm with adaptive parameters. Computational Optimization and Applications 51(01): 259-388. Geweke, J. F. 1994. Variable selection and model comparison in regression. Federal Reserve Bank of Minneapolis, Working Papers 539. Gharbi, R. B. C., and Elsharkawy, A. M. 1999. Neural Network Model for Estimating the PVT Properties of Middle East Crude Oils. SPE Reservoir Evaluation & Engineering 2(03): 255-265. Gharbi, R. B. C., Elsharkawy, A. M., and Karkoub, M. 1999. Universal Neural-Network-Based Model for Estimating the PVT Properties of Crude Oil Systems. Energy & Fuels 13: 454-458. Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9: 249-256. Goda, H. M., El-M Shokir, E. M., Fattah, K. A., and Sayyouh, M. H. 2003. Prediction of the PVT Data using Neural Network Computing Theory. Presented at the Nigeria Annual International Conference and Exhibition, 4-6 August, Abuja, Nigeria. Hinton, G. 1989. Connectionist Learning Procedures. Artificial Intelligence 40(01): 185-234. Ju, C., Combs, M., Lendle, S. D., Franklin, J. M., Wyss, R., Schneeweiss, S., and van der Laan, M. J. 2019. Propensity score prediction for electronic healthcare databases using super learner and high-dimensional propensity score methods. Journal of Applied Statistics 46(12): 2216-2236. Kampichler, C., Wieland, R., Calmé, S., Weissenberger, H., and Arriaga-Weiss, S. 2010. Classification in conservation biology: A comparison of five machine-learning methods. Ecological Informatics 5(06): 441-450. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. Khandelwal, P. 2017. Which algorithm takes the crown: Light GBM vs XGBOOST? Analytics Vidhya. https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vsxgboost/ MacKay, D. 1992. Bayesian Interpolation. Neural Computation 4(03): 415-447.

McCain, W. D. Jr. 1991. Reservoir-Fluid Property Correlations-State of the Art. SPE Reservoir Engineering 6(02): 266-272. Nelder, J.A., Mead, R. 1965. A simplex method for function minimization. The Computer Journal 7(4): 308-313 Numbere, O. G., Azuibuike, I. I., and Ikiensikimama, S. S. 2013. Bubble Point Pressure Prediction Model for Niger Delta Crude using Artificial Neural Network Approach. Presented at the SPE Nigeria Annual International Conference and Exhibition, 5-7 August, Lagos, Nigeria. Osman, E. A., Abdel-Wahhab, O. A., and Al-Marhoun, M. A. 2001. Prediction of Oil PVT Properties Using Neural Networks. Presented at the SPE Middle East Oil Show, 17-20 March, Manama, Bahrain. Osman, E.-S. A., and Al-Marhoun, M. A. 2005. Artificial Neural Networks Models for Predicting PVT Properties of Oil Field Brines. Presented at the SPE Middle East Oil and Gas Show and Conference, 12-15 March, Kingdom of Bahrain. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12(Oct): 2825-2830. Petrosky, G. E., and Farshad, F. F. 1998. Pressure-Volume-Temperature Correlations for Gulf of Mexico Crude Oils. SPE Reservoir Evaluation & Engineering 1(05): 416-420. Polley, E.C. and van der Laan, M.J. 2010. Super Learner in Prediction. U.C. Berkeley Division of Biostatistics Working Paper Series: Working Paper 266. Ramirez, A. M., Valle, G. A., Romero, F., and Jaimes, M. 2017. Prediction of PVT Properties in Crude Oil Using Machine Learning Techniques MLT. Presented at the SPE Latin America and Caribbean Petroleum Engineering Conference, 17-19 May, Buenos Aires, Argentina. Rammay, M.H., and Abdulraheem, A. 2017. PVT correlations for Pakistani crude oils using artificial neural network. Journal of Petroleum Exploration and Production Technology 7(01): 217-233. Rose, S. 2016. A Machine Learning Framework for Plan Payment Risk Adjustment. Health Services Research 51(06): 2358-2374. Sinisi, S. E., Polley, E. C., Petersen, M. L., Rhee, S., and van der Laan, M. J. 2007. Super Learning: An Application to the Prediction of HIV-1 Drug Resistance. Statistical Applications in Genetics and Molecular Biology 6(01). Standing, M.B. 1977. Volumetric and Phase Behavior of Oil Field Hydrocarbon Systems. SPE, Richardson, Texas. Swalin,

A. 2018. CatBoost vs. Light GBM vs. XGBoost. Towards Data Science. https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db

Sweeney, E. M., Vogelstein, J. T., Cuzzocreo, J. L., Calabresi, P. A., Reich, D. S., Crainiceanu, C. M., and Shinohara, R. T. 2014. A Comparison of Supervised Machine Learning Algorithms and

Feature Vectors for MS Lesion Segmentation Using Multimodal Structural MRI. PLoS ONE 9(4): e95753. Tipping, M. 2001. Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research Vol. 1: 211-244. Tso, G. K.F. and Yau K. K.W. 2007. Predicting electricity energy consumption: A comparison of regression analysis, decision tree and neural networks. Energy 32(09): 1761-1768. van der Laan, M.J., Polley, E.C., and Hubbard, A.E. 2007. Super Learner. U.C. Berkeley Division of Biostatistics Working Paper Series: Working Paper 222. Wang, H., Rose, S., and van der Laan, M.J. 2011. Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning. Statistics & Probability Letters 81(07): 792-796 Wyss, R., Schneeweiss, S., van der Laan, M.J., Lendle, S. D., Ju, C., and Franklin, J. M. 2018. Using Super Learner Prediction Modeling to Improve High-dimensional Propensity Score Estimation. Epidemiology 29(01): 99-106.

Appendix A — Introduction to the Machine Learning Algorithms Used

In the Methodology section, we introduced the machine learning algorithms used in this study:

• Three tree-based advanced machine learning algorithms: XGBoost (Chen and Guestrin 2016), LightGBM (Ke et al. 2017), and random forest regressor (Breiman 2001)

• One ANN machine learning algorithm: multi-layer perceptron regressor (Hinton 1989; Glorot and Bengio 2010)

• One novel ensemble machine learning algorithm: super learner (Polley and van der Laan 2010)

In this Appendix, we introduce the machine learning algorithms listed above, starting with the tree-based algorithms. Tree-based models date back to the classification and regression trees introduced by Breiman et al. (1984). In the simplest terms, based on the training data, a decision tree algorithm builds a "tree" with nodes that split the prediction into different outcomes according to the predictors. The leaves of the decision tree become the final predictions of the decision tree model. The optimization process for training decision trees is based on the purity or accuracy of the predictions on the training data (Breiman et al. 1984). Random forest is an ensemble algorithm based on the decision tree algorithm (Breiman 2001). Instead of using only one decision tree, the random forest algorithm builds multiple decision trees, with the number defined by the user. For each decision tree in a random forest model, the number of splitting nodes, the number of predictors to use, and the depth can be defined by the user. Though the user can choose the number of predictors for each tree in the random forest, the choice of which predictors to use is random. Thus, random forest is known for its robustness, since it is not dominated by one or two of the strongest features. Breiman (2001) argues that the random forest algorithm tends to have less of an over-fitting issue.

Building on the development of decision tree algorithms (Breiman et al. 1984), another type of tree-based algorithm, gradient boosting, was introduced by Friedman (2001; 2002). Like random forest (Breiman 2001), the gradient boosting algorithm (Friedman 2001; 2002) is also an ensemble algorithm. However, instead of building many independent trees, gradient boosting grows the ensemble gradually by adding trees one at a time, with the number of boosting iterations controlled by the user. Based on the development of the gradient boosting algorithm, two advanced machine learning algorithms have been developed: extreme gradient boosting (XGBoost) by Chen and Guestrin (2016) and the light gradient boosting machine (LightGBM) by Ke et al. (2017). Though XGBoost and LightGBM are advanced and complex machine learning algorithms, their novelty and robustness are mainly based on the splitting technique at each node. XGBoost uses pre-sorted and histogram-based algorithms for node splitting (Swalin 2018), whereas LightGBM uses gradient-based one-side sampling (GOSS) as the splitting scheme for every node (Ke et al. 2017). Moreover, LightGBM incorporates another technique, exclusive feature bundling (EFB), for better computational efficiency (Ke et al. 2017). With the combination of GOSS and EFB, LightGBM is able to achieve high efficiency and fast training speed (Ke et al. 2017; Khandelwal 2017).

Artificial neural networks (ANN) are a broad class of machine learning algorithms that imitate the human neural system for regression or classification problems. The specific ANN algorithm applied in this study is the multi-layer perceptron regressor (Hinton 1989; Glorot and Bengio 2010). For clarity, an example schematic of an MLP regressor is shown in Fig. A1. In Fig. A1, the left-most layer is the input layer, and x1 and x2 are the input predictors of the model; the middle layer is called the hidden layer, and h1 and h2 are called neurons or perceptrons, which use an activation function (e.g., the sigmoid function or a rectifier) to output a value for the next hidden layer or the output layer; the right-most layer is called the output layer, with an output function (e.g., linear regression) that takes in the values from the last hidden layer and outputs the final prediction. The red numbers on the arrows in Fig. A1 are the coefficients of this example MLP regressor. For example, the input value for hidden neuron h1 equals 4.2·x1 − 0.5·x2 + 1.2, because the constant 1.0 node is the bias (compensation) term in every layer. If tanh is chosen as the activation function of neuron h1, its output becomes h1 = f(x1, x2) = tanh(4.2·x1 − 0.5·x2 + 1.2). In Fig. A1, the model has only one hidden layer. However, for an MLP regressor, the user can choose any number of hidden layers and any number of neurons in each layer. When the number of hidden layers and neurons per layer increases, the possibility of over-fitting increases significantly, because there are more coefficients in the model.

Figure A1— Example schematic of an MLP model.
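As a tiny numeric illustration of the forward pass just described, the sketch below evaluates the example hidden neuron; the weights 4.2, -0.5, and the bias 1.2 come from the text, while the second hidden neuron and the output-layer weights are hypothetical placeholders.

```python
# Forward pass of the small example MLP described above: one hidden layer with
# tanh activations and a linear output. The h1 weights (4.2, -0.5, bias 1.2)
# come from the example in the text; the h2 and output-layer weights are
# hypothetical placeholders.
import numpy as np


def example_mlp(x1, x2):
    h1 = np.tanh(4.2 * x1 - 0.5 * x2 + 1.2)    # hidden neuron from the text
    h2 = np.tanh(-1.0 * x1 + 0.8 * x2 - 0.3)   # hypothetical second neuron
    return 2.0 * h1 + 1.5 * h2 + 0.5           # hypothetical linear output layer


print(example_mlp(0.2, 0.7))
```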

The super learner algorithm developed by Polley and van der Laan (2010) has already been introduced as a set of detailed steps in the Methodology section. To show the super learner algorithm more clearly, a schematic of the algorithm (Polley and van der Laan 2010) is shown in Fig. A2 below. The super learner is different from the machine learning algorithms above because it is actually a combination of base learners. The base learners of a super learner could be any machine learning algorithms, such as XGBoost, LightGBM, MLP, and random forest. After the base learners are chosen, they are first fitted separately on the training data. In the next step, as shown in Fig. A2, the whole training dataset is predicted in the form of K-fold cross-validation. In detail, the samples in each fold or group are predicted using a model trained on the remaining data. This process is performed with each base learner, and the predictions from all base learners are stacked, which is the reason the super learner is also called a stacking ensemble. The meta learner, which is usually a linear regression model, is used to output the final prediction based on the base learners' stacked predictions. Beyond the detailed steps of the super learner introduced in the Methodology section and the schematic shown in Fig. A2, the super learner can also be interpreted in a simpler way. In a musical ensemble, different instruments are used by composers to make better music than any single instrument can, and a good conductor coordinates the orchestra for the final performance presented to the audience. In the machine learning ensemble of the super learner, the base learners serve as the musical instruments of the orchestra, and the meta learner serves as the conductor who coordinates the final result.

Figure A2— Schematic of super learner algorithm developed by Polley and van der Laan (2010).

Appendix B — Data Cleaning Procedure

A data cleaning procedure is required for every machine learning application. In this study, we start the data cleaning procedure by checking the compositional data. First, the sum of the components' mole percentages should be 100%. Though the C7+ mole percentage is grouped and provided by the reservoir fluid database (RFDBase), we found more than 30 samples with an incorrect C7+ mole percentage documented, causing the sum of the components' mole percentages to be erroneous. In our cleaning step, allowing for rounding of the components' mole percentages, oil samples with a sum of mole percentages higher than 100.5% or lower than 99.5% are checked and corrected if the correct value of the C7+ mole percentage can be re-grouped from the unprocessed PVT reports on RFDBase. Second, due to possible incorrect grouping of C7+, the MW of C7+ should also be checked during data cleaning. Since the reservoir fluid total MW is available, the C7+ MW of an oil sample can also be calculated from the other components' MW and mole percentages. The measured C7+ MW provided by the database and the calculated C7+ MW are compared; if the difference between them is larger than 1.0, we check the raw PVT reports for correction. In addition, the p_bp of the oil samples is measured at the reservoir temperature or a temperature close to it, so observations at abnormally high temperatures in the dataset are checked. When the documented temperature is higher than 350-400 °F, the raw PVT reports of such samples are manually checked. During this procedure, we found samples with incorrect temperature associations, some being due to incorrect unit conversions, typos, incorrect data associations, etc. In addition to the points above, the three black oil properties, R_s, γ_g, and API, are also checked against some of the known correlations and textbook guidelines. It is unrealistic to set a "correct" range for the data quality check and cleaning that would exclude data lying within the realistic window for the reservoir conditions and fluid combinations; in this study, we found the use of the empirical correlations to be the most practical way to quality check the data, which is covered in detail in Appendix C. Finally, we clean samples with null values in each dataset. However, due to the size limitations of the datasets, samples with missing values are not dropped directly. Instead, the raw PVT reports of samples with missing values are checked to recover the missing numbers.
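A minimal sketch of the compositional consistency checks described above is given below; the pandas column names and the (approximate) light-component molecular weights are illustrative assumptions, not the RFDBase schema.

```python
# Sketch of the compositional checks in Appendix B: mole-percent closure to
# 100 +/- 0.5 %, back-calculation of the C7+ MW from the reservoir-fluid MW,
# and flagging of abnormally high temperatures. Column names and the
# approximate component molecular weights are illustrative assumptions.
import pandas as pd

MW = {"N2": 28.01, "CO2": 44.01, "H2S": 34.08, "C1": 16.04, "C2": 30.07,
      "C3": 44.10, "iC4": 58.12, "nC4": 58.12, "iC5": 72.15, "nC5": 72.15,
      "C6": 86.18}  # approximate molecular weights of the light ends
COMPONENTS = list(MW) + ["C7plus"]


def flag_suspect_samples(df: pd.DataFrame) -> pd.DataFrame:
    total = df[COMPONENTS].sum(axis=1)            # mole-% closure check
    bad_sum = (total > 100.5) | (total < 99.5)

    # Back-calculate the C7+ MW from the total fluid MW and the light ends.
    light_mw = sum(df[c] / 100.0 * MW[c] for c in MW)
    c7plus_mw_calc = (df["MW_fluid"] - light_mw) / (df["C7plus"] / 100.0)
    bad_c7plus_mw = (df["MW_C7plus"] - c7plus_mw_calc).abs() > 1.0

    bad_temperature = df["T_res"] > 350.0         # abnormally high T (deg F)
    return df[bad_sum | bad_c7plus_mw | bad_temperature]
```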

Appendix C — Quality Check on Dataset using Correlation

While tuning the Dindoruk and Christman (2004) bubble point pressure correlation for dataset No.3, we found an outlier sample that has a large prediction error even after tuning. As shown in Fig. C1, there is a sample towards the lower right corner (highlighted within a circle) that is predicted to have a p_bp of nearly 12,000 psi, while the measured p_bp is 3,085 psi, with the corresponding reported associated values of T = 131 °F, API = 24.7°, single stage GOR = 1931.4 SCF/STB, and single stage vapor gravity = 0.58. Such an excessive deviation between the measured and predicted bubble point pressures signals that the subject data point needs to be examined more closely. After a thorough investigation, we noticed that there is a clear mismatch between the reported GOR and the bubble point pressure. Going through the original report and the reported composition, we noticed that there was a unit conversion error and that the real GOR of the system is supposed to be 5.6146 (the STB to ft3 conversion factor) times lower, which corresponds to a GOR of 343.9 SCF/STB. Interestingly, we realized that the alternative black-oil-property based data-driven methods did not indicate any particular issue with this data point. Using the suggested tuned coefficients of the Dindoruk and Christman (2004) correlation (Table 10), such a quality check of a dataset can be performed to make sure the data to be utilized with data-driven methods do not violate basic physical and experimental constraints before any training or testing of machine learning models.

Figure C1—Cross plot of measured p_bp vs. predicted p_bp of dataset No.3 using tuned Dindoruk and Christman as a part of a quick quality check.


This research was done at Shell International E&P Inc. No government funding was used.