Mario R. Eden, Marianthi Ierapetritou and Gavin P. Towler (Editors)
Proceedings of the 13th International Symposium on Process Systems Engineering – PSE 2018
July 1-5, 2018, San Diego, California, USA
© 2018 Elsevier B.V. All rights reserved
https://doi.org/10.1016/B978-0-444-64241-7.50036-7

Novel Symbolic Regression-Mathematical Programming based Predictions of the Molecular Cetane Number with Small Sampling Data
Jiawen Wei, Zhihong Yuan*
Department of Chemical Engineering, Tsinghua University, 100084 Beijing, China
[email protected]
Abstract
To enable high-throughput screening of molecules and thereby speed up the exploration of potential new fuels with desirable properties, such as a high Cetane Number (CN), one of the most important indicators for assessing fuels combusted in engines, a novel machine learning method, a hybrid Symbolic Regression-Mathematical Programming approach for extracting surrogate models from data sets, is proposed. We formulate the Symbolic Regression Tree (SRT) as a Generalized Disjunctive Programming (GDP) model. Via the Big-M reformulation, the GDP model is transformed into a mixed-integer nonlinear programming (MINLP) model, which is then solved with BARON. The model established in this paper demonstrates clear advantages in training efficiency, model representation, and prediction accuracy.
1. Introduction
Concerns about diminishing fossil fuel reserves, along with climate change and national security, have prompted considerable research on alternative, environmentally friendly processes for producing liquid transportation fuels. During the past decades, industry and academia have worked to identify suitable biofuels, with desirable combustion properties and strong competitiveness, to partially or completely replace traditional petroleum-derived gasoline and diesel. The CN (Cetane Number) is one of the most important indicators for evaluating fuels combusted in an engine; it is a correlation based on the ignition delay from the start of fuel injection. In general, the higher a fuel's cetane number, the shorter its ignition delay period. The Cooperative Fuel Research (CFR) engine and the Ignition Quality Tester (IQT) are the two main methods used to determine CN. The American Society for Testing and Materials (ASTM) Standard D613 (ASTM D613, 2015) uses the single-cylinder CFR engine for determination of CN, while ASTM Standard D7170 (ASTM D7170-16, 2016) and ASTM Standard D6890 (ASTM D6890, 2015) use the IQT. All of these methods provide accurate CN measurements; the CFR engine best reflects a fuel's actual combustion behaviour in the engine, whereas the IQT offers a faster measurement process with lower sample volume requirements. Clearly, if a series of new biofuel molecules were synthesized one by one to test their CN experimentally, a huge amount of time and money would be spent inefficiently. A suitable experimental-data-driven predictive model that approximately represents the structure-CN relationships of molecules, and can thus help chemists screen molecules and speed up the exploration of potential new fuels with good performance, is therefore desirable.
Motivated by the tremendous advances in machine learning, predicting molecules' properties from their structures is an obvious approach and has been extensively investigated. Yang used backpropagation neural networks to predict the CN of iso-paraffins and diesel fuels based on quantitative structure-property relationships (QSPR) (Yang H, 2001). Taylor used the QSARIS software to calculate more than 100 molecular descriptors for 275 compounds that could be used to build a predictive model of the CN; the software then determined which descriptors are most relevant for modelling the CN using a genetic algorithm and built a predictive model (RMSE = 9.1 CN units) (Taylor J, 2004). Although this model is not accurate enough, Taylor's work provided QSPR inputs for later research on CN prediction and indicated which molecular descriptors are crucial. For example, Kessler retained 15 molecular descriptors to build a backpropagation neural network based prediction model of the cetane number for furanic biofuel additives, with a total RMSE of 5.97 for a core data set of 284 molecules (Kessler T, 2017). Other types of models, such as an inverse function method (Smolenskii E, 2008) and the genetic function approximation (GFA) (Creton B, 2010), have also been used to predict CN. These methods make it possible to describe the relationship between molecular descriptors and properties accurately within a single chemical family of similar chemical properties and within the tested range. Although the aforementioned methods are quite robust and can fit the relationship between CNs and molecular descriptors accurately, it is easy to overlook the human bias involved in choosing which molecular descriptors are included in a predictive model of CN. To some extent, such choices may limit the performance of the predictive model, since they are made according to human knowledge. AlphaGo Zero, which beat AlphaGo by removing the constraints of human knowledge, is an illustrative example. Under such circumstances, it is important and interesting to formulate a surrogate model building method that can automatically select the molecular descriptors while guaranteeing the prediction accuracy of the molecular CN. Considering the attractive characteristics of Symbolic Regression and Mathematical Programming, the objective of this paper is to propose a novel machine learning method that explicitly correlates CNs with molecular descriptors.
2. Method Description
2.1. Symbolic Regression
Existing literature has demonstrated that symbolic regression can produce concise analytical expressions without prior human knowledge, so it may help scientists discover the laws behind a phenomenon of interest (Schmidt M, 2009). An expression tree such as the one shown in Figure 1(a) illustrates how symbolic regression works. Any explicit function can be represented by an expression tree. A symbolic regression algorithm combines pre-set operators such as {+, −, ×, ÷, ...} with operands (variables and parameters) to form an explicit function that represents the mathematical link between the input and output data sets. In the absence of any information except the basic data set, symbolic regression aims to determine suitable functional forms and calculate the corresponding parameters simultaneously. Conventionally, most symbolic regression algorithms are based on evolutionary computation, approached as an application of Genetic Programming (GP). Genetic operations such as crossover and mutation generate new candidate expressions, some of which are redundant. Moreover, the resulting expressions may not be concise and may not even be locally optimal due to GP's stochasticity. To our knowledge, there is no rigorous global optimization algorithm for solving the symbolic regression problem.
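As a small illustration (our own sketch, not part of the proposed method), the following Python snippet shows how an expression tree combines operators and operands into an explicit, computable function; the tree below encodes the made-up example f(x) = (x + 2.5)·x.

# Minimal illustration (not from the paper) of how an expression tree combines
# operators and operands into an explicit function.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    op: str                          # '+', '-', '*', '/', 'var', 'const'
    value: float = 0.0               # used only when op == 'const'
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def evaluate(node: Node, x: float) -> float:
    """Recursively evaluate the expression tree at a given input x."""
    if node.op == 'const':
        return node.value
    if node.op == 'var':
        return x
    a, b = evaluate(node.left, x), evaluate(node.right, x)
    return {'+': a + b, '-': a - b, '*': a * b, '/': a / b}[node.op]

# f(x) = (x + 2.5) * x, an arbitrary example expression
tree = Node('*', left=Node('+', left=Node('var'), right=Node('const', 2.5)),
            right=Node('var'))
print(evaluate(tree, 3.0))           # -> 16.5

What symbolic regression has to decide, for every node of such a tree, is which operator or operand the node takes and which numerical values the constants take, so that the resulting function best fits the data.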
Figure 1. Description of the Symbolic Regression Tree: (a) an instance of an expression tree; (b) description of an expression tree.

2.2. The mathematical programming formulation of symbolic regression
To describe an expression tree, several sets and parameters are introduced. As shown in Figure 1(b), the set N = {1, ..., 15} numbers the nodes of the tree in sequence. The set O is the collection of operators and operands. Since some nodes may be absent, an operator "null" is introduced to represent them. For convenience, several subsets are also defined: OE is the collection of operators and operands except "null", L contains the operands and "null", U contains the unary operators, and B contains the binary operators. The nodes in the last level of the tree, which cannot have child nodes, constitute the subset term of terminal nodes. Furthermore, a variable v_{n,i} is assigned to each node n and data sample i, and a link function f_o is introduced to make the tree computable. For convenience, v_{l(n),i} denotes the value of the left child of node n and v_{r(n),i} the value of the right child. A constant j_n is introduced for each node. For some operators and operands, a constraint g_o ≥ 0 is enforced to avoid mathematical errors or meaningless expressions, such as division by zero. The binary variable y_{n,o} indicates which operator or operand node n takes, and logical constraints, such as each node taking exactly one operator or operand, are added. The set I indexes the data samples. The fitness of the expression tree is measured by the error between the root value v_{1,i} and the sampled dependent variable z_i. Therefore, a symbolic regression tree can be formulated as the following Generalized Disjunctive Programming (GDP) model:

\min \; s = \sum_{i \in I} (z_i - v_{1,i})^2 \quad (1)
s.t.
\bigvee_{o \in OE} \left[ y_{n,o};\; f_o(v_{l(n),i}, v_{r(n),i}, x_i, j_n, v_{n,i}) = 0;\; g_o(v_{l(n),i}, v_{r(n),i}, x_i, j_n, v_{n,i}) \ge 0,\; \forall i \in I \right] \vee \left[ y_{n,\mathrm{null}};\; v_{n,i} = 0,\; \forall i \in I \right], \quad n \notin \mathrm{term} \quad (2)

\bigvee_{o \in L \setminus \{\mathrm{null}\}} \left[ y_{n,o};\; f_o(x_i, j_n, v_{n,i}) = 0,\; \forall i \in I;\; g_{\mathrm{cst}} = \varepsilon - |j_n| \ge 0 \right] \vee \left[ y_{n,\mathrm{null}};\; v_{n,i} = 0,\; \forall i \in I \right], \quad n \in \mathrm{term} \quad (3)

\sum_{o \in O} y_{n,o} = 1, \quad n \notin \mathrm{term} \quad (4)
\sum_{o \in L} y_{n,o} = 1, \quad n \in \mathrm{term} \quad (5)
\sum_{o \in U \cup L} y_{n,o} = y_{r(n),\mathrm{null}}, \quad n \notin \mathrm{term} \quad (6)
\sum_{o \in L} y_{n,o} = y_{l(n),\mathrm{null}}, \quad n \notin \mathrm{term} \quad (7)
y_{n,o} \in \{0,1\} \quad (8)
\underline{V} \le v_{n,i} \le \overline{V} \quad (9)
\underline{J} \le j_n \le \overline{J} \quad (10)

Constraints (11)-(15), enforced for n ∉ term, are logic cuts that exclude redundant expression trees, for example forbidding both child nodes of a node from taking the constant operand, or forbidding the operators "÷" and "−" from taking a constant as their right operand, since such sub-expressions can be absorbed into a single constant.
If the equations and constraints in the disjunctions (2) and (3) are rewritten as Big-M constraints, the GDP model can easily be transformed into an MINLP model.
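To illustrate this modelling and transformation step, the sketch below (our own minimal example, not the authors' implementation) writes the operator choice at a single node as a Pyomo GDP disjunction and then applies the Big-M reformulation; the toy data, variable bounds, and Big-M value are assumptions made only for this example.

# Minimal sketch (not the authors' code): the operator choice at one node of a
# symbolic regression tree written as a Pyomo GDP disjunction, followed by the
# Big-M reformulation into an MINLP.
import pyomo.environ as pyo
from pyomo.gdp import Disjunct, Disjunction

data = {1: (0.5, 1.2), 2: (1.0, 2.1), 3: (1.5, 3.4)}   # i -> (x_i, z_i), toy samples
I = list(data)

m = pyo.ConcreteModel()
m.v = pyo.Var(I, bounds=(-100, 100))     # value of the node for each sample i
m.j = pyo.Var(bounds=(-10, 10))          # constant operand attached to the node

# One disjunct per candidate operator: the node combines the variable operand
# x_i (left child) with the constant j (right child).
m.d_add = Disjunct()
m.d_add.link = pyo.Constraint(I, rule=lambda d, i: m.v[i] == data[i][0] + m.j)
m.d_sub = Disjunct()
m.d_sub.link = pyo.Constraint(I, rule=lambda d, i: m.v[i] == data[i][0] - m.j)
m.d_mul = Disjunct()
m.d_mul.link = pyo.Constraint(I, rule=lambda d, i: m.v[i] == data[i][0] * m.j)

# Exactly one operator is selected for this node
m.choose_op = Disjunction(expr=[m.d_add, m.d_sub, m.d_mul])

# Least-squares fitness, in the spirit of objective (1)
m.obj = pyo.Objective(expr=sum((data[i][1] - m.v[i]) ** 2 for i in I))

# Big-M reformulation: the GDP becomes an MINLP that a global solver such as
# BARON could then handle, e.g. pyo.SolverFactory('baron').solve(m)
pyo.TransformationFactory('gdp.bigm').apply_to(m, bigM=1000)

In the full formulation, one such disjunction is written for every node and every admissible operator or operand, with the child values themselves defined by the disjunctions of the child nodes.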
3. Results and Discussions
To intuitively demonstrate the efficiency of the approach, 13 n-alkanes (C3~C15) are chosen as the training data set and 5 n-alkanes (C16~C20) are then predicted. The six descriptors listed in Table 1 are fed to the above GDP model. The descriptors that finally appear in the explicit expression are decided optimally by solving the GDP model with BARON (Tawarmalani M, 2004).

Table 1. Definitions of the chosen molecular descriptors (Kessler T, 2017).
Descriptor   Definition
Mor32e       3D-MoRSE - signal 32 / weighted by atomic Sanderson electronegativities
EEig08x      Eigenvalue 08 from edge adjacency matrix weighted by edge degrees
CIC1         Complementary information content
RDF090m      Radial Distribution Function - 9.0 / weighted by atomic masses
RDF020p      Radial Distribution Function - 2.0 / weighted by atomic polarizabilities
L/Bw         Length-to-breadth ratio by WHIM
In addition to the GDP formulation, a Mixed Integer Nonlinear Programming (MINLP) model is also built directly to represent the symbolic regression tree. An Artificial Neural Network (ANN) with six inputs (the six descriptors), one output (CN), and two hidden layers of 32 neurons each, i.e., the same architecture as in Kessler's work (Kessler T, 2017), is also set up. The MINLP and ANN models are built for comparison with the GDP model in order to demonstrate its efficiency. The surrogate model provided by an ANN is inherently implicit. Equations (16) and (17) give the explicit expressions between CN and the descriptors obtained from the GDP and MINLP models, respectively.
\mathrm{GDP:}\quad CN = (\mathrm{Mor32e} + 0.125) \times \left( 2 \cdot \mathrm{CIC1} - 4.94 \times \frac{\mathrm{CIC1} - 2.419}{\mathrm{CIC1} - 2.19} \right) + 0.59 \cdot \mathrm{Mor32e} + \mathrm{RDF090m} - 2.8 \cdot \mathrm{EEig08x} - 21.595 \quad (16)

\mathrm{MINLP:}\quad CN = \frac{(\mathrm{RDF020p})^2}{\mathrm{CIC1}} + \frac{(\mathrm{RDF090m})^2}{\mathrm{L/Bw}} + 19.032 \cdot \mathrm{CIC1} + 284.34 \quad (17)
Clearly, the GDP model optimally chose Mor32e, CIC1, RDF090m, and EEig08x as independent variables, while the MINLP model chose RDF020p, RDF090m, CIC1, and L/Bw. The overall results are shown in Figure 2. The MINLP model shows the worst fit on the training set, while the ANN shows the worst prediction performance. The average relative prediction errors of the three models are 20.32% (ANN), 8.20% (MINLP), and 1.11% (GDP). The GDP model gives the best CN predictions for new molecules that are not included in the training set.
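Here the average relative error over the prediction set is assumed to follow the standard definition:

\bar{e} = \frac{100\%}{|I_{\mathrm{pred}}|} \sum_{i \in I_{\mathrm{pred}}} \frac{\left| CN_i^{\mathrm{pred}} - CN_i^{\mathrm{exp}} \right|}{CN_i^{\mathrm{exp}}}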
Figure 2. The predicted and original CNs of n-alkanes (C3~C20).
A 31-node tree with the operator set {+, −, ×, ÷, (·)²} is used in both the GDP and the MINLP model. The MINLP model has 15,504 equations, 6,210 continuous variables and 372 binary variables. The GDP model is smaller, with 4,653 equations, 838 continuous variables and 403 binary variables. The main factor behind the different results of the GDP model and the direct MINLP model is computational complexity. When the MINLP problem is solved, every constraint, even the inactive ones, must be considered; this can lead to very long computation times and even to wrong results. In contrast, the GDP formulation only enforces the constraints selected by the logical variables y_{n,o}, so the GDP model can find a better result. Obviously, even better training and prediction capacities may be achieved if more operators and nodes are introduced. Although the prediction capacity of an ANN can also be improved by adding more layers and more nodes (deep learning), very large data sets are then required. In other words, the accuracy of deep learning with ANNs may not be high enough when facing small sampling data. Our proposed method, in contrast, can handle small sampling data more efficiently than an ANN by automatically and optimally selecting the equation structure and parameters. The mechanism of the proposed method is thus fundamentally different from the trial-and-error approach of an ANN.
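For reference, an ANN baseline of the kind described above (six descriptor inputs, two hidden layers of 32 neurons, one CN output) could be set up as in the following sketch; this uses scikit-learn rather than the authors' implementation, and the arrays are placeholders for the descriptor matrix and the measured CNs.

# Illustrative sketch (not the authors' code) of the ANN baseline described in
# the text: six descriptor inputs, two hidden layers of 32 neurons, one CN output.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X_train = np.random.rand(13, 6)       # 13 training alkanes x 6 descriptors (placeholder)
y_train = np.random.rand(13) * 100    # corresponding CNs (placeholder)

ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 32), activation='relu',
                 solver='lbfgs', max_iter=5000, random_state=0),
)
ann.fit(X_train, y_train)

X_test = np.random.rand(5, 6)         # 5 alkanes to predict (placeholder)
print(ann.predict(X_test))

With only 13 training samples, such a network has far more parameters than data points, which is consistent with the poor extrapolation to C16~C20 reported above.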
The properties of liquid fuels such as CN should be considered during the process design stage under the pressure of clean energy manufacturing. Computational intractability may arise in the optimal design/synthesis of liquid fuel production processes if a complex, implicit model for CN representation is introduced as a constraint in the overall optimization framework. Clearly, an explicit model can accelerate process design for quality-oriented liquid fuel production. Our computational experiments (including tests on larger training/testing data sets) suggest that the computational burden of the proposed method grows rapidly with problem size. Therefore, a tailored algorithm for the proposed method is being developed.
Conclusion
We established a novel Symbolic Regression-Mathematical Programming based method to predict the CNs of n-alkanes. The descriptors are selected automatically to form the explicit expressions. The proposed method demonstrated better prediction performance than an ANN with multiple hidden layers, especially when facing small sampling data. If more molecules are included in the training set, more operators and nodes are needed to form the expression tree, and the computational burden becomes significantly heavier. Therefore, a tailored solution algorithm is under development. Optimal design/synthesis and operation can benefit from the proposed method owing to its capacity to generate explicit surrogate models.
References
ASTM D613, 2015, Standard Test Method for Cetane Number of Diesel Fuel Oil, West Conshohocken, PA: ASTM International.
ASTM D7170-16, 2016, Standard Test Method for Determination of Derived Cetane Number (DCN) of Diesel Fuel Oils—Fixed Range Injection Period, Constant Volume Combustion Chamber Method, West Conshohocken, PA: ASTM International.
ASTM D6890, 2015, Standard Test Method for Determination of Ignition Delay and Derived Cetane Number (DCN) of Diesel Fuel Oils by Combustion in a Constant Volume Chamber, West Conshohocken, PA: ASTM International.
Yang H, Fairbridge C, Ring Z, 2001, Neural network prediction of cetane number for iso-paraffins and diesel fuel, Pet Sci Technol, 19:573-586.
Taylor J, McCormick R, Clark W, 2004, Report on the relationship between molecular structure and compression ignition fuels, NREL Technical Report.
Kessler T, et al., 2017, Artificial neural network based predictions of cetane number for furanic biofuel additives, Fuel, 206:171-179.
Smolenskii EA, et al., 2008, Cetane numbers of hydrocarbons: calculations using optimal topological indices, Russ Chem Bull, 57:461-467.
Creton B, et al., 2010, Prediction of the cetane number of diesel compounds using the quantitative structure property relationship, Energy Fuels, 24:5396-5403.
Schmidt M, Lipson H, 2009, Distilling free-form natural laws from experimental data, Science, 324:81-85.
Tawarmalani M, Sahinidis NV, 2004, Global optimization of mixed-integer nonlinear programs: A theoretical and computational study, Mathematical Programming, 99:563-591.