Mario R. Eden, Marianthi Ierapetritou and Gavin P. Towler (Editors) Proceedings of the 13th International Symposium on Process Systems Engineering – PSE 2018 July 1-5, 2018, San Diego, California, USA © 2018 Elsevier B.V. All rights reserved. https://doi.org/10.1016/B978-0-444-64241-7.50272-X
Teaching data-analytics through process design
Fani Boukouvala*, Jianyuan Zhai, Sun Hye Kim, Farida Jariwala
Chemical & Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Dr., Atlanta, GA, 30332, USA
*[email protected]
Abstract
Data-analytics has become an influential tool for decision-making in both industry and academia. Incorporating data-driven concepts into the core chemical engineering curriculum would greatly benefit our graduates, making them competitive in today's market. The main objective of this work is to develop tools and case studies that: (a) increase students' understanding and appreciation of data for decision-making, (b) introduce students to basic data-analytics and machine learning methods, (c) introduce students to basic data visualization tools, and (d) demonstrate that interactions between rigorous simulations and data can lead to improved solutions. This work shows that process design is an appropriate course for introducing this material: students are at the senior level and have obtained the necessary mathematical background, and the development of process flowsheets using commercial simulators creates opportunities for inexpensive data generation for many case studies.
Keywords: data-analytics, data-driven modelling, process design, simulation, optimization, machine-learning.
1. Introduction
Undoubtedly, we have entered an era in which data are everywhere and data-driven decision-making is being increasingly adopted in many fields of engineering (Beck et al., 2016). The terms "data science", "data-analytics" and "Big-Data" have started to appear in an increasing number of research papers and conference presentations from both academia and industry in chemical engineering (Venkatasubramanian, 2009; Qin, 2014). Data scientists combine statistics, mathematics and computer programming to capture data trends and find emerging patterns via data pre-processing, visualization and analysis. Data-analytics is the science of applying algorithmic methods to raw data in order to draw useful conclusions, which oftentimes verify or disprove an existing insight. In this work, we show how to incorporate data-analytics into the process design course; our main aim is to teach students how to use information in the form of input-output data to validate existing insight, or to gain new insight that would be hard to recognize without data-analytics. Because data sets are often extremely large, with high dimensionality and numerous measurements from advanced sensor technology, the term "Big-Data" has also been widely used to capture the volume, variety, velocity and veracity of data and data generation (Qin, 2014). The concepts described in this work do not require terabytes of data (which is unlikely to be the case in most chemical engineering applications), but rather aim to introduce students to systematic data-driven decision-making tools that can be applied to a wide range of data-set sizes.
This work is motivated by the vision that chemical engineers with a data-analytics background can make a significant impact by knowing how to use data in tandem with their first-principles knowledge to design better systems. A recent sample survey of data-related education showed that statistics and data-analytics do not currently have a big role in most chemical engineering programs (Braatz, 2015). This work shows how process design can be an ideal course in the undergraduate curriculum through which to introduce data-driven methods and tools, via simple case studies that can be formulated and solved using any simulation software or process design project. Three modules for teaching different aspects of data-analysis have been developed and are introduced in this paper. The first module introduces students to multivariate analysis techniques for analysing noisy, unstructured data and for identifying the true dimensionality of data sets. The second module introduces students to the development of reduced-order regression models, using data from their simulations, in order to represent correlations of multiple inputs to outputs of interest. The final module allows students to experiment with data-driven optimization concepts, ranging from simple grid-search to more sophisticated trust-region optimization. The modules are developed in Matlab and do not rely on a specific simulation package, as long as the simulation can communicate with Matlab via generation of text files. These modules are currently being taught and evaluated at the School of Chemical & Biomolecular Engineering at the Georgia Institute of Technology, in the course Process Design and Economics I.
2. Integration of Data-Analytics through the course of Process Design
Despite the acknowledged need to increase our students' understanding of statistics, data-handling, and data-analytics, it is often hard to create new courses within the core curriculum, due to total credit-hour limitations. As a result, a very effective and efficient approach is to incorporate data-analytics within existing courses throughout the undergraduate curriculum. There are many opportunities for this seamless integration, since the concepts of empirical models, analytical methods and parameter estimation are already embedded within our core undergraduate courses. One very timely opportunity for integrating instruction of data-analytics is the course of process design and economics. During this course, students are introduced to the complexity of decision-making when open-ended design problems need to be solved, using their accumulated knowledge in chemical engineering fundamentals and computer simulations. This course has the objective of teaching students how to develop rigorous simulations of their conceptual designs, generate alternatives, and finally follow a trial-and-error approach to optimize their design with objectives that range from energy minimization to maximization of economic profitability measures, such as net present value (Seider et al., 2016). The most commonly used flowsheet simulators in process design courses include ASPEN, gPROMS, ProMax, HYSYS, COCO and Unisim (Adams, 2018). The three modules that have been implemented within this course are designed to follow the structure of the course and to supplement the learning objectives of the existing syllabus (Figure 1). In the first component, a data-analytics approach is introduced that helps students identify the true dimensionality of a data set and visualize the correlations in the data (Module 1). In the second case study, students are introduced to best practices and state-of-the-art methods in machine learning for fitting
multivariate regression models (Module 2) to represent a complex output as a function of a set of inputs. Finally, students are introduced to basic optimization formulations and solvers, and are asked to build their own adaptive optimization solver to identify optimal solutions of their rigorous design simulation (Module 3). Each module is described in more detail in the next subsections.
Figure 1. Summary of three modules for data-analytics in process design:
Module 1 (identifying important variables/correlations): create a large data set from the simulation and hide the labels; introduce Principal Component Analysis to analyze the data, find correlations and the true dimensionality of the data; reveal the variables and verify conclusions using first-principles knowledge and simulation.
Module 2 (creating reduced-order models via regression): design experiments and collect simulation data (simulation = "experiment"); multivariate regression modeling, its advantages and limitations; comparison of regression models with data from the rigorous simulation; statistical metrics to quantify error and validation.
Module 3 (using data for adaptive search and optimization): introduction to the basics of optimization (identifying variables, formulating the optimization problem and finding a descent direction); optimization of the regression models and comparison of the result with the simulation; implementation of an adaptive-search optimization solver; comparison of results.
2.1. Multivariate analysis for identification of correlations and dimensionality reduction
One of the first challenges that students face when they have completed a rigorous simulation of a fully integrated process containing multiple units (i.e., reactors, separators, heat exchangers, pumps, compressors, etc.) is identifying which variables to tune in order to achieve improved operation. Teaching process design is challenging because students may fall into the trap of blindly relying on the simulator, without attempting to use their first-principles insight for making informed perturbations to the system. Preventing this is our first goal, and for this reason students are first expected to explore their design via trial-and-error and sensitivity analysis tools, in order to gain a better understanding of the performance of the design, as well as to learn how to apply chemical engineering fundamentals to explain perturbation effects on the system. However, even when students reach the point of understanding the effect of a single (or a few) key variables within a single unit on the outputs of that unit, it becomes more challenging to understand and quantify how changes in several variables across units may propagate and affect the operation of other units within a fully integrated design with recycle streams. In addition, many of our students may in the future be faced with the challenge of trying to understand, analyse and predict the performance of a system using large historical data sets. Through the steps of Module 1 (Figure 2), we take advantage of the existence of a large integrated simulation for the given project statement in order to create a large data set, which aims to mimic a historical data set of a process design. More specifically, we select a large number (M1) of manipulated design variables, feed conditions, and operating conditions from every unit in the design and identify a reasonable range of operation for these variables. In addition to the set of manipulated variables, we also select a set of
monitored variables (M2) that may or may not be highly sensitive to the manipulated variables. We then randomly perturb the M1 manipulated variables in order to create a set of N simulated case studies of the design, and generate a data matrix X of size N × (M1 + M2). Various sampling techniques can be used to generate the samples, such as Monte Carlo sampling or space-filling designs (i.e., Latin Hypercube sampling, orthogonal arrays) (Wang et al., 2006). Some of the columns (dimensions) of this data set are correlated (by design), but this is not revealed to the students. The module also has an option of adding noise to the data set X, in order to better mimic noisy historical data sets. After students are introduced to Principal Component Analysis (PCA) in a lecture (Jolliffe, 2002), they are provided with the data set and are asked to use PCA to identify correlated columns, as well as the number of dimensions that truly describes the variability in the data. Students are asked to complete exercises in which they generate score plots and identify correlated dimensions. Once these exercises are completed, the real labels of the data set are revealed, and students find that their conclusions based solely on data often agree with their intuition. Through this exercise, it is truly fascinating to observe that this reinforcement of intuition helps students trust the data-driven approach. Other times, the data analysis points students towards a correlation that they did not identify through their trial-and-error approach, which they are always asked to explain using their engineering insight. Although our aim is to introduce data-analytics, the most powerful learning outcome of this module is the realization that rigorous simulation based on chemical engineering fundamentals, coupled with data-analytics, can lead to better analysis and understanding of complex systems.
Figure 2. Outline of steps of Module 1
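As an illustration of these steps, the short Matlab sketch below generates a small synthetic data set with hidden correlations and analyzes it with PCA. It is a minimal stand-in for the actual module, in which the data columns come from the flowsheet simulator: the variable ranges, the constructed monitored columns and all variable names are illustrative, and lhsdesign, zscore and pca require the Statistics and Machine Learning Toolbox.

% Minimal sketch of Module 1: build a "historical" data set with hidden
% correlations, add noise, and analyze it with PCA.
rng(0);                             % reproducibility
N  = 200;                           % number of simulated case studies
M1 = 5;                             % number of manipulated variables
lb = [300 1 0.1 50 0.5];            % illustrative lower bounds of the ranges
ub = [400 5 0.9 150 2.0];           % illustrative upper bounds

% Space-filling (Latin Hypercube) samples scaled to the operating ranges
Xman = lb + lhsdesign(N, M1).*(ub - lb);

% Two "monitored" columns constructed to correlate with the inputs;
% in the real module these would come from the flowsheet simulator
Xmon = [2*Xman(:,1) - 0.5*Xman(:,3), Xman(:,2).*Xman(:,4)];
X = [Xman, Xmon];
X = X + 0.01*std(X).*randn(size(X));   % optional measurement noise

% PCA on the standardized data; 'explained' reveals the true dimensionality
[coeff, score, ~, ~, explained] = pca(zscore(X));
disp(cumsum(explained))             % cumulative % variance per component

% Score plot used by the students to spot correlated dimensions
scatter(score(:,1), score(:,2)); xlabel('PC 1'); ylabel('PC 2');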
2.2. Regression and interpolation using simulation data
Although chemical engineering students oftentimes use empirical correlations, their understanding of regression often does not go beyond linear or polynomial regression in one dimension. Throughout the progression of a process design course, students can benefit from using multidimensional regression models for analysis, visualization and optimization. One of the most commonly encountered examples of the need to use regression functions (also called parametric models, surrogate models or metamodels) to represent phenomena is the incorporation of a rigorous model from a flowsheet simulator within a synthesis optimization formulation (Caballero and Grossmann, 2008). Through this module, students are taught about computer experimental design (Sacks et al., 1989), and subsequently they learn how to use the collected input-output data to minimize the least-squares error of prediction between the data and a variety of regression functions: linear, quadratic, polynomial (Bhosekar and Ierapetritou, 2018) and more advanced state-of-the-art machine-learning inspired functions, such as
support vector regression (Smola and Schölkopf, 2004). The importance of avoiding over-fitting and of validation is conveyed to the students through a series of examples and comparisons of fitted models. The final assignment for this module is the collection of data from the actual simulation using a designed experiment; the fitting of a variety of regression models (Figure 3); the selection of the best model using statistical measures; the plotting of the selected function in surface plots for visualization; and finally the validation of these fitted models on a set of data points that was not used to train them.
Figure 3. Concept of fitting parametric models (regression, interpolation) to data and finally using these functions for optimization.
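The sketch below illustrates this workflow in Matlab under simplifying assumptions: a cheap analytical function stands in for the designed simulation runs, the quadratic basis and the train/validation split are illustrative choices, and fitrsvm (support vector regression) requires the Statistics and Machine Learning Toolbox.

% Minimal sketch of Module 2: fit two regression models to input-output
% data and validate them on points not used for training.
rng(1);
f = @(x) 3 + 2*x(:,1) - x(:,2) + 0.5*x(:,1).^2 + 0.1*randn(size(x,1),1);
X = rand(60,2);  y = f(X);          % stand-in for designed simulation runs
itr = 1:45;  ival = 46:60;          % training / validation split

% Quadratic model by linear least squares: y ~ [1 x1 x2 x1^2 x2^2 x1*x2]
basis = @(x) [ones(size(x,1),1), x, x.^2, x(:,1).*x(:,2)];
beta  = basis(X(itr,:)) \ y(itr);   % least-squares coefficient estimates
yq    = basis(X(ival,:))*beta;      % quadratic predictions

% Support vector regression with a Gaussian kernel
svr = fitrsvm(X(itr,:), y(itr), 'KernelFunction', 'gaussian', ...
              'Standardize', true);
ys  = predict(svr, X(ival,:));

% Validation RMSE quantifies prediction error on unseen data
rmse = @(yp) sqrt(mean((y(ival) - yp).^2));
fprintf('Quadratic RMSE: %.3f  SVR RMSE: %.3f\n', rmse(yq), rmse(ys));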
2.3. Data-driven flowsheet optimization
During the final lectures of the process design course, students learn about optimization fundamentals, such as identifying the degrees of freedom, formulating the objective function, formulating the constraints, linear versus nonlinear optimization, the use and complexity of discrete variables to represent logical constraints, and the formulation of synthesis problems. Students quickly identify the gap between being able to solve rigorous optimization problems and the use of rigorous simulations. Several simulation packages contain built-in optimization solvers, through which a flowsheet can be optimized directly (i.e., the ASPEN optimization tool). However, the goal of this work is to develop modules that are independent of a simulation platform, and also to teach students the fundamentals of optimization by allowing them to develop their own simplified solver. During the first stage of this module, students are asked to use their results from Module 1 and Module 2 for the purpose of optimization. From Module 1, they have identified the most important and influential variables, while in Module 2 they have developed regression models. First, students use their fitted regression model to perform local optimization in Matlab (fmincon) (Figure 3). Subsequently, students simulate the identified optimum within their simulation software in order to check the accuracy of the result. Oftentimes, the predicted optimum is not identical to the optimal value obtained by the rigorous simulation. Thus, students quickly comprehend the concept of adaptively sampling their rigorous simulator in 'promising' locations, in an effort to identify better solutions. This teaching material starts with the simple concept of grid-search and ends with a more advanced trust-region search algorithm that follows descent directions (Boukouvala et al., 2016; Kolda et al., 2003). The students are asked to implement this algorithm, which iteratively exchanges data with their simulation in order to search for optimal locations, as sketched below. The advantage of implementing a customized solver is the ability to account for simulation instances that fail to converge due to numerical issues, which may hinder the applicability of certain solvers. For simplicity, this analysis is performed for a relatively low number of optimization variables (~2-3), which have been previously identified to be significant. Finally, the students compare their results with built-in optimization solvers, if those are available.
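The following Matlab sketch illustrates the two stages under simplifying assumptions: the functions sim and surr are cheap stand-ins for the rigorous simulation and the fitted regression model, fmincon requires the Optimization Toolbox, and the adaptive search is implemented as a simple compass (pattern) search whose step size shrinks when no polled point improves, in the spirit of the trust-region search described above.

% Minimal sketch of Module 3, with a cheap stand-in function in place of
% the rigorous simulation (in the module, each sim() call would exchange
% text files with the flowsheet simulator).
sim  = @(x) (x(1)-0.6)^2 + 2*(x(2)-0.3)^2;     % "rigorous simulation"
surr = @(x) (x(1)-0.5)^2 + 2*(x(2)-0.25)^2;    % imperfect regression model
lb = [0 0];  ub = [1 1];

% Stage 1: local optimum of the fitted surrogate
x = fmincon(surr, [0.5 0.5], [], [], [], [], lb, ub);
fbest = sim(x);                                % verify in the "simulator"

% Stage 2: adaptive compass/pattern search around the surrogate optimum;
% the step shrinks when no polled point improves (Kolda et al., 2003)
step = 0.1;
while step > 1e-4
    improved = false;
    for d = [eye(2); -eye(2)]'                 % poll +/- each coordinate
        xt = min(max(x + step*d', lb), ub);    % candidate within bounds
        ft = sim(xt);                          % expensive evaluation
        if ft < fbest, x = xt; fbest = ft; improved = true; end
    end
    if ~improved, step = step/2; end           % shrink the search region
end
fprintf('x* = [%.3f %.3f], f* = %.5f\n', x, fbest)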
Regardless of the result, through this exercise students truly understand fundamental optimization concepts, as well as the difference between local and global optimization.
3. Conclusions
Three case studies have been presented, through which various data-driven analysis, modelling and optimization concepts are introduced to chemical engineering seniors through the course of process design. The goal of these modules is to prepare graduates to enter the workforce better equipped and more competitive in today's Big-Data era, by introducing them to data-analytics tools that have gained significant applicability in industry and academia. The modules teach students how to use multivariate input-output data to improve their decision-making process, while never forgetting their chemical engineering fundamentals. These modules will continue to be tested and enhanced through additional case studies, and will soon be made available to the community.
References
T. Adams II, 2018, Learn Aspen Plus in 24 Hours, McGraw-Hill.
D.A.C. Beck, J.M. Carothers, V.R. Subramanian and J. Pfaendtner, 2016, Data science: Accelerating innovation and discovery in chemical engineering, AIChE Journal, 62(5), p. 1402-1416.
A. Bhosekar and M. Ierapetritou, 2018, Advances in surrogate based modeling, feasibility analysis, and optimization: A review, Computers & Chemical Engineering, 108, p. 250-267.
F. Boukouvala, R. Misener and C.A. Floudas, 2016, Global optimization advances in Mixed-Integer Nonlinear Programming, MINLP, and Constrained Derivative-Free Optimization, CDFO, European Journal of Operational Research, 252(3), p. 701-727.
R. Braatz, 2015, Sampling of Data Education in ChE Curricula. Available from: cache.org/files/Sampling-of-Data-Education-in-ChE-V3.0.pptx
J.A. Caballero and I.E. Grossmann, 2008, An algorithm for the use of surrogate models in modular flowsheet optimization, AIChE Journal, 54(10), p. 2633-2650.
I.T. Jolliffe, 2002, Principal Component Analysis, Springer Series in Statistics, 2nd Edition, Springer, NY.
T.G. Kolda, R.M. Lewis and V. Torczon, 2003, Optimization by direct search: New perspectives on some classical and modern methods, SIAM Review, 45(3), p. 385-482.
S.J. Qin, 2014, Process data analytics in the era of big data, AIChE Journal, 60(9), p. 3092-3100.
J. Sacks, W.J. Welch, T.J. Mitchell and H.P. Wynn, 1989, Design and analysis of computer experiments, Statistical Science, 4(4), p. 409-423.
W. Seider, D.R. Lewin, S. Widagdo, R. Gani and K.M. Ng, 2016, Product and Process Design Principles: Synthesis, Analysis and Evaluation, 4th Edition, Wiley.
A.J. Smola and B. Schölkopf, 2004, A tutorial on support vector regression, Statistics and Computing, 14(3), p. 199-222.
V. Venkatasubramanian, 2009, Drowning in data: Informatics and modeling challenges in a data-rich networked world, AIChE Journal, 55(1), p. 2-8.
G. Wang and S. Shan, 2006, Review of metamodeling techniques in support of engineering design optimization, ASME Journal of Mechanical Design, 129(4), p. 370-380.