New tools for the design and manufacturing of new products based on Latent Variable Model Inversion

New tools for the design and manufacturing of new products based on Latent Variable Model Inversion

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848 Contents lists available at ScienceDirect Chemometrics and Intelligent Laboratory ...

2MB Sizes 0 Downloads 29 Views

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

Contents lists available at ScienceDirect

Chemometrics and Intelligent Laboratory Systems journal homepage: www.elsevier.com/locate/chemometrics

New tools for the design and manufacturing of new products based on Latent Variable Model Inversion pez a, *, Pierantonio Facco b, **, Massimiliano Barolo b, Alberto Ferrer a Daniel Palací-Lo a

Multivariate Statistical Engineering Group, Department of Applied Statistics and Operational Research, and Quality, Universitat Politecnica de Valencia, Camino de Vera s/n, 7A, 46022, Valencia, Spain CAPE-Lab – Computer-Aided Process Engineering Laboratory, Department of Industrial Engineering, University of Padova, via Marzolo 9, 35131 Padova PD, Italy

b

A R T I C L E I N F O

A B S T R A C T

Keywords: Partial least-squares (PLS) Latent variable model inversion Quality by design (QbD) Design space

Latent Variable Regression Model (LVRM) inversion can be an efficient tool to find the so-called Design Space (DS), i.e. the different combinations of inputs (e.g. process conditions, raw materials properties …) that lead to the desired outputs (e.g. product quality, benefits …). This is especially critical when first-principles models cannot be resorted to, running experimental designs is unfeasible and only data from daily production (i.e. historical data) are available. Since data-driven methods are not free of uncertainty, different approaches have been proposed in the literature to delimit a subspace that is expected to contain the DS of a product. However, some of these methods are computationally costly or depend on the existence of at least one combination of inputs that provides, according to the model, the desired values for all output variables simultaneously. Furthermore, no approach to date offers an analytical expression for the confidence region limits for this subspace. In this paper a new way to find the DS is proposed, so the above limitations are overcome. To this end, the analytical definition of the estimation of the DS, and its confidence region limits, as well as a way to transfer restrictions on the original space to the latent space are suggested. An extension of these methods to quality attributes defined as linear combinations of outputs is also provided. The proposed methodology is illustrated using three simulated case studies.

1. Introduction In the pharmaceutical sector the Quality-by-Design (QbD) [1] initiative promotes the implementation of science-based methodologies that enable the delivery of products that meet the desired specifications by means of increasing the process flexibility and robustness, understood as the capability to tolerate changes in the materials involved and processing conditions (i.e. the inputs) without negatively affecting the quality of the outputs. The design space (DS), defined as “the multidimensional combination and interaction of input variables (e.g., material attributes) and process parameters that have been demonstrated to provide assurance of quality” [2], becomes a key to this end. This is not limited to just the pharmaceutical sector, but extends to others such as the petrochemical or chemical one, food industry, etc., and requires process optimization.

Optimizing a production process requires building a causal model that explains how changes in input variables (e.g. materials and their properties, processing conditions …) relate to changes in the outputs (e.g. amount of product obtained, its quality, purity, value, generated pollutants …). To this purpose, deterministic (i.e. first principles) models are always desirable. However, the lack of knowledge and the generally ample need of resources required to properly construct such models makes their use unfeasible in a large number of cases. This is why datadriven models are often resorted to Refs. [3,4]. To guarantee causality when using data-driven approaches, however, independent variation in the input variables is required [5]. But, even if nowadays large amounts of data are available in most production processes, the variation in the inputs is commonly not independent (i.e., data are not obtained from a Design of Experiments (DOE) that guarantees this independent variation in the inputs). Therefore, classical linear

* Corresponding author. Departamento de Estadística e Investigaci on Operativa Aplicadas y Calidad. Universitat Politecnica de Valencia, Cno. De Vera s/n, Edificio 7A, 46022, Valencia, Spain. ** Corresponding author. Computer-Aided Process Engineering Laboratory, Department of Industrial Engineering, University of Padova, via Marzolo 9, 35131 Padova PD, Italy. E-mail addresses: [email protected], [email protected] (D. Palací-L opez), [email protected] (P. Facco), [email protected] (A. Ferrer). https://doi.org/10.1016/j.chemolab.2019.103848 Received 29 March 2019; Received in revised form 26 August 2019; Accepted 15 September 2019 Available online 17 September 2019 0169-7439/© 2019 Published by Elsevier B.V.

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

procedure (Section 4.3) and; iv) an extension of the two first contributions to linear combinations of output variables (Section 4.4.). These contributions allow: i) using the limits of the confidence region for the NSs as hard constraints in the bracketing of the DS and for optimization purposes, and ii) bracketing the DS into a smaller subspace of the KS, thus increasing the accuracy in the definition of different combinations of inputs to obtain the desired outputs. A critical comparison with other methods proposed in literature is illustrated by means of two simulated case studies in Section 5: a problem with two output variables (Section 5.1) and a quality attribute defined as a linear combination of the outputs (Section 5.2). Conclusions are presented in Section 6.

regression and machine learning models cannot be used for process optimization. Methods based on projection to latent structures, such as Partial Least Squares (PLS) regression-based techniques, allow the analysis of large datasets with highly correlated data. Since they assume that the input (X) space and the output (Y) space are not of full statistical rank, they not only model the relationship between X and Y (as classical linear regression and machine learning models do), but also provide models for both the X and Y spaces. This fact gives them a very nice property: uniqueness and causality in the reduced latent space no matter if the data come either from a DOE or daily production process (historical data) [6–8]. This property makes them suitable for finding the combinations of input variables that guarantee appropriate outputs (if possible), such that the desired values for the quality attributes of interest for the final product are met. Furthermore, since the initial number of variables involved is reduced to a smaller number of uncorrelated latent variables (LVs), the computational cost of any optimization problem in the latent space will decrease in comparison to the equivalent problem in the original space. However, there are some points worth addressing when using PLS regression-based techniques for the estimation of the DS. Namely: i) the effect of the uncertainty associated to the data and the model on the delimitation of the subspace most likely to contain the DS; ii) the projection of restrictions on the original variables onto the latent space, and; iii) the formulation of the optimization problem itself. Assuming that at least one feasible combination of inputs exists that leads to the desired outputs, some approaches based on Latent Variable Regression Model Inversion (LVRMI) have been proposed in the literature to define the so-called Null Space (NS), i.e. the projection onto the latent space of all the combinations of inputs theoretically guarantying the same desired outputs [9,10]. To avoid extrapolation, most of these approaches delimit this estimation to the domain within the sets of historical products that have already been developed, also referred to as knowledge space (KS) [11,12]. These data-driven methods for defining the NS, however, present several drawbacks:

2. Partial Least Squares model fitting and prediction uncertainty Partial Least Squares (PLS) regression Ref. [16] is a latent variable-based approach used to model the inner relationships between a matrix of inputs X [N  M] and a matrix of output variables Y [N  L]. PLS is usually resorted to in order to predict Y from the A-dimensional subspace associated to X such that its covariance with Y is maximised. The PLS regression model structure can be expressed as follows: X ¼ T ⋅ PT þ E bT þ F Y¼T⋅Q 1 T T ¼ X ⋅ W ⋅ ðP ⋅ WÞ ¼ X ⋅ W

(1)

being T [N  A], P [M  A] and E [N  M] the X scores, loadings and b [L  A] and F [N  L] the respective Y residuals matrices, respectively; Q

loadings1 and residuals matrix, and W [M  A] the weighting matrix, such that A  rX , rX being the rank of X. In order to evaluate the model performance when projecting the n-th observation onto it, the Hotelling T 2n and the squared prediction error SPExn are used: Tn2 ¼ τTn ⋅ Λ1 ⋅ τn

(2)

SPExn ¼ ðxn  P ⋅ τn ÞT ðxn  P ⋅ τn Þ ¼ eTn ⋅ en

with Λ1 defined as the [A  A] diagonal matrix containing the inverse of

1. While the effect of the uncertainty, associated to both the data and the estimation of the model parameters, on the DS estimation has been addressed in the literature [13,14], there is no standardized approach to explicitly take it into account either when defining the NS or when solving the optimization problem. 2. Although some of them provide an analytical expression for the NS [9, 10], it cannot be used directly for obtaining the analytical expression for the limits of the confidence region associated to it, therefore making it impossible to use them as constraints in e.g. the optimization approach as formulated in Ref. [15]. Furthermore, the NS as formulated in Refs. [9,10] would not meet its own definition unless at least one combination of inputs exists that guarantees the desired values for all the outputs, therefore making the existence of its analytical expression dependant on if such set of inputs exists or not. 3. They are limited to the case in which the quality attributes of interest coincide with some of/all the output variables considered when fitting the PLS-regression model. As a consequence, there is no standardized procedure for the optimization of quality attributes that can or should be expressed as linear combinations of process outputs.

T

the A variances of the scores associated to the LVs, and τn ¼ W* ⋅ xn the [A  1] vector of scores corresponding to the projection of the n-th observation xn onto the latent subspace of the PLS model. Upper confidence limits, T 2lim and SPElim , can be calculated as in Ref. [17]. Once the PLS-regression model has been fitted, it can be used directly to obtain the [L  1] prediction vector of the conditional means of the output variables, b y obs , given the [M  1] vector corresponding to a particular observation, xobs , fulfilling that T 2xobs  T 2lim and SPExobs  SPElim , as b b ⋅ τobs ¼ Q b ⋅ W T ⋅ xobs y obs ¼ Q

(3)

This prediction is not free from uncertainty. The estimation of prediction uncertainty is done by using Ordinary Least Squares (OLS) type expressions, as suggested in the approach proposed by Faber and Kowalski [18]. This approach assumes that the matrix of scores T obtained from fitting the PLS model is independent from Y, which is clearly not true and makes interpreting the estimation of the uncertainty not straightforward [19]. However, this approximation was observed to yield similarly good results when compared with those achieved by means of other approaches such as linearization or re-sampling based methods, or the so-called U-deviation method used in the software the Unscrambler, while being easier to implement [13]. The 100 ⋅ ð1  αÞ% confidence interval (CI) of the prediction of the lth output variable b y obs;l given an observation xobs is calculated as:

This paper aims at solving some of the aforementioned issues. To do so, a brief introduction to PLS-regression model fitting and prediction uncertainty is given in Section 2. Section 3 introduces the approaches presented in the literature regarding PLS model inversion and the estimation of the DS. In Section 4 the main contributions of this work are presented, regarding: i) a new analytical expression of the NS associated to the desired value/s of the output/s (Section 4.1); ii) the analytical expression of the confidence region limits for the NS, also accounting for the differences in leverage along the NS (Section 4.2); iii) how to transfer restrictions from the original to the latent space, based on the same

1 b instead of Q is used here to keep it consistent with the The notation Q Supplementary Material.

2

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

CIyobs;l ¼ by obs;l  tNA;α=2 ⋅ sby

dimensional subspace usually can be found in the third scenario in which, according to the model, yDES will be obtained for different inputs combinations. This subspace is usually referred to as Null Space (NS), since, according to the PLS model, no change in the outputs is expected when moving between any two points inside it.

obs;l

rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 1 þ hobs þ N vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2ffi uX  u N u b  y y n;l t n¼1 n;l

sby ¼ sfl ⋅ obs;l

sfl ¼

(4) 3.2. Definition of the Null Space dependant on the direct inversion

N  df 1

hobs ¼ τTobs ⋅ ðTT ⋅ TÞ

As mentioned in Ref. [9], when the ðA  rY Þ- dimensional NS associated to yDES exists, any set of inputs xNS in it can be obtained as:

⋅ τobs

xNS ¼ P ⋅ G ⋅ r

where tNA;α=2 is the 100ð1  α=2Þ percentile of a Student’s t-distribution with (N-A) degrees of freedom; sby is the estimated standard deviation

where G is an [A  ðA  rY Þ] matrix whose columns are the left singular b associated with its ðA  rY Þ zero singular values, and r is a vectors of Q

obs;l

of the prediction of the l-th output variable for a particular observation yobs ;hobs is the observation leverage; and sfl the standard error of calibration corresponding to the l-th output, obtained as suggested in Ref. [13], yn;l and b y n;l being the n-th measured and predicted value for the l-th output of the model calibration dataset, respectively, and df the degrees of freedom consumed by the model. It must be noted that, since CIyobs;l depends on the leverage of the observation, the amplitude of the confidence interval is expected to be lower for observations close to the centre of projection (smaller leverage) than for those far away from it (greater leverage).

[ðA  rY Þ  1] vector of random real values used to obtain arbitrary points on the NS. Which of these combinations of inputs are feasible or practical will ultimately depend on the constraints imposed on them [9]. A slightly different approach has also been used in the literature [10], where the term ‘combined pseudo null-space’ is used to refer to the subspace in which points obtained by using Eq. (5) and Eq. (6) are obtained, and different NS are defined for each output variable. This alternative definition is based on the fact that, if a displacement ΔτNEW from the direct inversion solution to any other point on the latent space is b ⋅ ΔτNEW ¼ 0, then: produced, such that ΔyNEW ¼ Q

3. Partial Least Squares model inversion and design space estimation

b q 1;1 ⋅ ΔτNEW;1 þ b q 1;2 ⋅ ΔτNEW;2 þ … þ b q 1;A ⋅ ΔτNEW;A ¼ 0 b q 2;1 ⋅ ΔτNEW;1 þ b q 2;2 ⋅ ΔτNEW;2 þ … þ b q 2;A ⋅ ΔτNEW;A ¼ 0 ⋮ b q L;1 ⋅ ΔτNEW;1 þ b q L;2 ⋅ ΔτNEW;2 þ … þ b q L;A ⋅ ΔτNEW;A ¼ 0

3.1. Direct inversion

3.3. Accounting for the uncertainty in the estimation of the design space As mentioned in Section 2, neither prediction is free from uncertainty, nor is any result achieved by model inversion. Several distinct approaches have been proposed in the literature in order to account for this [20–26]. In particular, Tomba et al. [15] propose a jackknife approach to determine the confidence regions for both the direct inversion and the NS estimation. On the other hand, a Bayesian-based procedure is suggested by Bano et al. [27] for the probabilistic definition of the DS in the latent space. Both of these approaches, however, present the drawback of being computationally costly as dataset size and/or the dimensionality of the latent space increases. Alternatively, Facco et al. [28] present a methodology to account for the back-propagation of the prediction uncertainty in model inversion to bracket the design space, which can be used whenever a single output is considered. This methodology resorts to the calculation of the 100ð1  αÞ% CI associated to the prediction of the output variable from τDI given yDES , as in Eq. (4). The NS associated to the limits of the CI are then determined through Eq. (6), and used (together with the Hotelling T 2 confidence hyper-ellipsoid) to delimit a subspace within which the true DS is expected to most likely lie. This approach does not consider, however, the difference in amplitude in the confidence region due to the leverage of different sets of scores along the NS.

a. rY >A: the most likely case is that no solution provides the desired values for all of the outputs; b. rY ¼ A: a single solution exists that provides all yDES ; c. rY


T

1 ⋅ yDES

(7)

where the directions associated to the ‘combined pseudo null-space’ correspond to those of the hyper-plane that best approximates all of the L equalities in Eq. (7), while each equality is related to the NS of each one of the outputs.

Although a PLS model is more commonly used in its direct form, that is, as a predictive tool for the output variables given some input data, its inverse form can be used to suggest a combination of inputs xNEW [M  1] needed to achieve a set of desired values yDES for the outputs, yNEW ¼ yDES . In theory, this will be valid as long as yDES conforms to the correlation structure from the calibration dataset used to fit the PLS-regression model. Another concern when resorting to this inversion, as mentioned in Ref. [15], is the fact that the PLS model must then present good performance not just in the prediction of the outputs (Y) but also in the prediction of the inputs (X). As pointed out in Ref. [9], since a reduction in the dimensionality of the space has taken place (from operating in the space of the original variables to doing so in the latent space), three possible scenarios must be taken into account with respect to the rank of Y (rY ) and the dimensionality of the latent space (A):

b ⋅ Q b ⋅Q b τDI ¼ Q

(6)

(5)

b x DI ¼ P ⋅ τDI This solution is sometimes also referred to as the direct inversion of the PLS model, and constitutes the only solution in the second scenario, and provides a single set of scores τDI corresponding to a set of inputs (i.e., the combination of initial settings, process conditions, raw material properties, etc …) theoretically leading to yDES , as well as the inputs themselves, b x DI . In this scenario, however, the direct inversion is only one of many solutions theoretically leading to yDES . In particular, an ðA  rY Þ-

4. New tools for defining the design space based on LVRMI 4.1. Proposed definition of the Null Space Note that all of the approaches presented in Section 3.2 for the definition of the NS are implicitly limited by the fact that they require the 3

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

solution of the direct inversion to provide the desired values for all of the outputs. Otherwise, neither the ‘combined pseudo null-space’ nor some/ any of the NSs associated to each of the outputs, as defined in Ref. [10], will satisfy their own definition. This limitation can be easily bypassed, however, by taking into account the relationship between the output variables and the latent variables. Consider, as in Ref. [10], the l-th NS to be defined as the subspace constituted by all combinations of inputs that theoretically guarantee the desired value for the l-th (l ¼ 1, 2, …, L) output variable, but not necessarily the rest. Then, given a desired value yDES;l for the l-th output variable, a [L  1] vector b y NSl corresponding to a set of scores τNSl on the l-th NS can be found such that b ⋅ τNS þ mY Þ ¼ yDES;l y NSl ¼ oTl ⋅ ðDSY ⋅ Q oTl ⋅ b l

Table 1 Illustrative example: dataset.

A X

When applied to all L output variables:

2

v1;0

6v 6 v0 ¼ 6 2;0 4 ⋮ vL;0

3 7 7 7 ¼ mY  yDES 5

2

3 T 6 v1 7 6 7 6 T7 6 v2 7 b ; V¼6 7 ¼ DSY ⋅ Q 6 ⋮ 7 6 7 4 5 vTL

x4 ðx22 Þ

x5 ðx1 ⋅ x2 Þ

y

1 2 3 4 5 6

5.43 5.43 99.23 99.23 52.33 52.33

7.54 15.97 7.54 15.97 11.76 11.76

125.64 126.20 9893.38 9765.16 2787.64 2849.95

58.51 258.48 59.29 254.11 139.21 135.67

50.49 74.44 737.15 1576.28 583.76 630.73

61.85 278.99 307.89 436.40 266.08 260.52

One common limitation of all of the approaches mentioned in Section 3.3, when it comes to defining subspace of the process within which the DS is expected to be located, is that they fail to provide an analytical expression for the limits of the confidence region of the NS, which can therefore not be applied as hard constraints in the optimization problem as formulated in Ref. [15]. Similarly, none of the restrictions imposed in this optimization framework are taken into account in the approach to bracket the DS in Ref. [28], other than that on the Hotelling T 2 limit. In order to tackle the first of these two limitations, consider Eq. (4) for the 100 ⋅ ð1  αÞ% CI of the prediction of the l-th output variable b y obs;l given an observation xobs . Then, the confidence region associated to the NS for the l-th output, whose equation can be obtained through Eq. (9), can be defined by back-propagating this uncertainty when inverting the model. It can then be demonstrated (see Supplementary Material) that any xNEW with projection τNEW onto the latent space within the confidence limits for the estimation of the l-th NS must meet Eq. (11):

b T ⋅ DS ⋅ ol vl ¼ Q Y

v0 þ V ⋅ τNS ¼ 0

x3 ðx21 Þ

4.2. Analytical definition of the confidence region of the Null Space

(9)

vl;0 ¼ oTl ⋅ mY  yDES;l

x2

(8)

vl;a ⋅ τNSl ;a ¼ 0 ; vl ¼ ½vl;1 ; vl;2 ; …; vl;A T

a¼1

x1

independent vectors parallel to it.

with ol being the [L  1] vector whose l-th element is equal to one, and the rest are zeros; mY and DSY being the [L  1] column vector of centring factors and the [L  L] diagonal matrix with the scaling factors applied to the L output variables before fitting the PLS-regression model, respectively.2 Eq. (8) can be reorganized to obtain the equation of a hyper-plane of the form: vl;0 þ

Observation (n)

(10) tNA;α=2 

v0;l þ vTl ⋅ τNEW  tNA;α=2 s~y NEW;l

s~y

yDES being the [L  1] column vector with yDES;l as its l-th element, and τNS a set of scores in the intersection of the L NS (if it exists). It must be noted that the [L (Aþ1)] matrix ½v0 V contains in each row the (Aþ1) parameters of the equation of the NS associated to the desired values for the l-th quality attribute. All but the first element in the l-th row correspond to vl , the vector signalling the direction of maximum variability in the latent space for the expected value of the l-th output. This means that a displacement of module one in the same direction, in the latent space, will produce an increase in the expected value of the l-th output equal to the module of vl . Furthermore, as long as rank rX >1 (the rank corresponding to one output variable) it will be possible to define a NS corresponding to the desired value of one of the outputs. The intersection of all these NS, if it exists, will be an (A-L)-dimensional subspace and may be referred to as ‘combined NS’, and is coincident with the ‘combined pseudo-NS’ presented in Ref. [9] only if A > L. Furthermore, the set of scores closest to the centre of projection and on such intersection is exactly the result from the direct inversion (see Supplementary Material). This subspace can also be expressed as function of the inputs instead of the scores by simply considering that, according to Eq. (1), τNS ¼ W ⋅ xNS . Furthermore, note that this definition of the NS only requires its position (vl;0 ) and a single vector orthogonal to the NS (vl ), unlike the definition in Ref. [10], which requires a point in the NS and A-1 linearly

NEW;l

rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 1 ¼ sfl ⋅ 1 þ ~τTNEW ⋅ ðTT ⋅ TÞ ⋅ ~τNEW þ N ~τNEW ¼ τNEW 

sy~

NEW ;l

(11)

v0;l þ vTl ⋅ τNEW ⋅ vl vTl ⋅ vl

being the estimated standard deviation of the l-th output for a

hypothetical observation ~xNEW with projection onto the latent space τ~NEW such that τ~NEW is the set of scores located in the l-th NS closest to τNEW . This suggested formulation of the confidence interval can be used to bracket the DS as in Ref. [28], but slightly differs from such bracketing in that it takes into account the leverage of the points along the NS, leading to non-linear confidence limits. To illustrate this, consider the following polynomial model defining a hypothetical input-output causal relationship from Ref. [28]: y ¼ 21 þ 4:3 ⋅ x1 þ 0:022 ⋅ x2  0:0064 ⋅ x3 þ 1:1 ⋅ x4  0:12 ⋅ x5 s:t: x3 ¼ x21 ; x4 ¼ x22 ; x5 ¼ x1 ⋅ x2

(12)

where y represents the ‘quality attribute of interest’ expressed as a function of the input variables x1 , x2 , x3 , x4 and x5 . Notice that only two of the five inputs (x1 and x2 ) are independent. Six observations are generated according to the model in Eq. (12), according to a two-level full factorial design on x1 and x2 , with a replicate centre point. Random normal noise with 5% of the standard deviation of each variable in the dataset is then added to x3 , x4 , x5 and y. This results in the dataset in Table 1. Fig. 1 serves to illustrate the different results using the proposed approach and the one in Ref. [28] to define the subspace where the true DS is expected to lie with an assigned confidence level (from now on

2 Equations in Sections up to 3.2.1 were presented assuming that matrices X and Y had been pre-treated already (e.g. mean-centring and scaling to unit variance).

4

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

Fig. 1. Illustrative example: yDES ¼ 204:86. Delimitation of the 95% confidence region for the NS (grey area showing the experiment space) with a) the proposed approach and; b) the approach in Ref. [28].

referred to as the experiment space).3 In this example, a PLS regression model is fitted with 2 LVs, and yDES ¼ 204:86. The grey area in Fig. 1a corresponds to the confidence region for the NS associated to y DES ¼ 204:86 for the PLS-regression model fitted with 2 LVs; in Fig. 1b, it corresponds to the delimitation of the DS as proposed in Ref. [28]. Other than the curvature of the limits, the most relevant difference is that no analytical expressions for the boundaries of the NS confidence region can be provided nor used as constraints by following the procedure in Ref. [28]. It is also important to note that, in practice, the DS will usually encompass a range of acceptable values for the outputs/quality attributes of interest, and not a specific single value. Therefore, a more realistic definition of this region will account for this. To do so, a similar procedure to that followed in Ref. [29] is resorted to. Consider the Lower Specification Limit (LSL), yLSL;l , and the Upper Specification Limit (USL), yUSL;l , such that a new product will be considered to be within specifications regarding the l-th output if yLSL;l  yNEW;l  yUSL;l . Two NS may be defined associated to the LSL and USL, to which hyperplanes of the form shown in Eq. (9) correspond with intercepts vLSL;l;0 ¼ oTl ⋅ mY  yLSL;l and vUSL;l;0 ¼ oTl ⋅ mY  yUSL;l , respectively. Then, any new sample xNEW with projection τNEW onto the latent space within this experiment space must meet: tNA;α=2 

Ax ⋅ b x NEW ¼ Ax ⋅ ðDSX ⋅ P ⋅ τNEW þ mX Þ  db

x NEW

b ⋅ τNEW þ mY Þ  d Ay ⋅ b y NEW ¼ Ay ⋅ ðDSY ⋅ Q b

y NEW

s~y

s~y

rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 1 ¼ sfl ⋅ 1 þ ~τTNEW;USL;l ⋅ ðTT ⋅ TÞ ⋅ ~τNEW;USL;l þ N ~τNEW;LSL;l ¼ τNEW 

vLSL;l;0 þ vTl ⋅ τNEW ⋅ vl vTl ⋅ vl

~τNEW;USL;l ¼ τNEW 

vUSL;l;0 þ vTl ⋅ τNEW ⋅ vl vTl ⋅ vl

(14)

Fx ⋅ b x NEW ¼ Fx ⋅ ðDSX ⋅ P ⋅ τNEW þ mX Þ ¼ fb

x NEW

b ⋅ τNEW þ mY Þ ¼ f Fy ⋅ b y NEW ¼ Fy ⋅ ðDSY ⋅ Q b

NEW;USL;l

rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 1 ¼ sfl ⋅ 1 þ ~τTNEW;LSL;l ⋅ ðTT ⋅ TÞ ⋅ ~τNEW;LSL;l þ N

NEW;USL;l

As can be seen in Fig. 1, the proposed definition of the experiment space presented in Section 4.2 results in a subspace of the latent space most of which seems ‘far away’ from the envelope defined by observations in the calibration dataset. This is because the Hotelling T 2 confidence hyper-ellipsoid has been implicitly assumed to be a good approximation for the KS, but this is not necessarily true. A more sensible approach to bracket the DS may account also for, at least, univariate restrictions on the input variables corresponding to their historical minimum and maximum observed values, and their projection onto the latent space. Once restrictions on the inputs and outputs have been defined, these constraints can be transferred to the latent space4 in the same way the explicit definition of the NS was achieved through Eq. (8) to Eq. (10). Doing this allows comparing restrictions in the latent space among each other, and with e.g. those corresponding to the Hotelling T 2 confidence limit. Consider a set of constraints on the input and output variables to be expressed as:

vLSL;l;0 þ vTl ⋅ τNEW vUSL;l;0 þ vTl ⋅ τNEW ;  tNA;α=2 s~y s~y NEW;LSL;l

NEW;LSL;l

4.3. Projecting constraints onto the latent space

y NEW

Ax and Ay being [I1  M] and [J1  M] matrices whose i-th and j-th rows, respectively, contain the coefficients of the i-th linear combination of inputs and j-th linear combination of outputs; dbx and dby being [I1  NEW

(13)

NEW

1] and [J1  1] column vectors whose i-th and j-th elements are, respectively, the values that must not be surpassed by the corresponding linear combination of inputs and outputs; mX and DSX being the [M  1] column vector of centring factors and the [M  M] diagonal matrix with the scaling factors applied to the M input variables before fitting the PLSregression model, respectively; and Fx , Fy , fbx and fby equivalent to NEW

NEW

Ax , Ay , dbx and dby , respectively, but for equalities constraints. NEW NEW Reorganizing terms in Eq. (14), the new restrictions on the latent space can be defined as:

Fig. 2 shows the equivalent to Fig. 1 when a range of desired values has been defined for the quality attribute of interest. In this case, yLSL ¼ 195 and yUSL ¼ 215.

4

While the projection of these restrictions onto the LV subspace has been a standard part of some commercial software focused on data analysis through latent variable-based methods (e.g. Aspen ProMV), the analytical formulation of the expressions corresponding to such projections has not been provided in the currently available literature, as far as the authors are aware of.

3 Although there are apparently only five samples in Fig. 1, these are actually six observations. The two centre points simply overlap almost perfectly.

5

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

Fig. 2. Illustrative example: delimitation of the experiment space (grey area) within, respectively, the lower (continuous red line) and upper (dashed red line) 95% confidence region limits for the NS associated to yLSL ¼ 195 and yUSL ¼ 215 with the proposed approach. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Fig. 3. Illustrative example: yDES ¼ 204:86. Delimitation of the experiment space (grey area) with a) the proposed approach and; b) the optimization approach in Ref. [15].

Aτ ⋅ τNEW  dτNEW Fτ ⋅ τNEW ¼ f τNEW d   Ax ⋅ DSX ⋅ P bx NEW  Ax ⋅ mX Aτ ¼ ¼ ; d τNEW b db  Ay ⋅ mY Ay ⋅ DSY ⋅ Q y   f NEW  A ⋅ m   x X Fx ⋅ DSX ⋅ P b ; f τNEW ¼ f x NEW  A ⋅ m Fτ ¼ b y Y Fy ⋅ DSY ⋅ Q b

allows their use as hard constraints for the optimization problem. This is why they have not been represented in Fig. 3b. In Figs. 1 and 3 the bracketing of the DS for a given desired value for the output, yDES , is in this case more restrictive than both the bracketing as suggested in Ref. [28] and the subspace of acceptable solutions for the optimization formulated as in Ref. [15]. Another point to note is that, in spite of there being initially 10 hard constraints imposed on the original variables (5 lower limits and 5 upper limits corresponding to the minimum and maximum observed values for each input variables in Table 1), only 5 of them (i.e. the five sides of the resulting polygon) remain once projected onto the latent space.5 Furthermore, because of the uncertainty associated to the PLS regression model, the projection of two of the samples in the initial dataset fall outside of the area within the constraints on the inputs (dashed straight lines), as opposed to what would have been intuitively expected (see Fig. 3). If the restrictions x3 , x4 and x5 are not applied, or are defined by taking into account their relationship with x1 and x2 (i.e. x3 ¼ x21 , x4 ¼ x22 and x5 ¼ x1 ⋅ x2 ), the projection of both observations would be within the region enclosed by the projected constraints. The same would occur if 4 LV were used to fit the model, but the graphical representation would become more complex and less intuitive (results not shown). On the other hand, the sample on the right side is outside of the boundaries as a consequence of being the worst predicted sample due to the noise introduced.



(15)

y NEW

Each equality restriction imposed implies the definition of a hyperplane of dimensionality (A-1) inside the latent space of the model fitted with A LVs, within which any τNEW must be located. Therefore, imposing two equality constraints requires τNEW to fall within the intersection of two (A-1)-dimensional hyper-planes, that is, within an (A-1)dimensional subspace (if both restrictions are equivalent) or an (A-2)dimensional subspace, if such intersection exists. This means that, at most, A compatible (i.e. an intersection exists), non-redundant equality constraints can be imposed before meeting all restrictions becomes impossible, and even then there is no guarantee that the resulting subspace falls within the subspace defined by any other inequality constraints that may have been imposed beforehand. Consider now the same dataset as in Table 1, and the maximum and minimum observed values (i.e. the ‘historical limits’) in it for all input variables as upper and lower constraints (i.e. univariate inequality constraints). Fig. 3 illustrates the result from transferring these restrictions to the latent space (as would be done when solving the optimization problem as formulated in Ref. [15]), as well as combining these restrictions with the constraints associated to the NS confidence region. Note that, although a way to obtain the confidence limits for the NS is provided in Ref. [15], no analytical expression is provided for them that

5 The identification of active and redundant constraints was performed by means of home-coded Matlab routines.

6

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

0

4.4. New tools to define the design space for a linear combination of outputs

d covðf1 ; f2 Þ s2f1 B B B B B d s2f2 B covðf2 ; f1 Þ Sf ¼ B B B B ⋮ B B @ d covðfL ; f2 Þ covðfL ; f1 Þ d

Up until now the quality attributes of interest coincide with one or more of the outputs used to fit the PLS-regression model, and it is assumed that such model is able to predict them with enough accuracy. However, this may not be enough in some situations, as will be illustrated in Section 5.2. Because of this, the methods presented in Sections 4.1 and 4.2 will be extended for linear combinations of output variables in Sections 4.4.1 and 4.4.2, respectively.

s2fl ¼

4.4.1. Definition of the null space for a linear combination of outputs Consider R quality attributes to be defined as a linear combination of the L outputs, and the r-th quality attribute of interest to be expressed as dr ¼ aTr ⋅ y, with ar being the [L  1] vector that contains the coefficients that relate the L output variables with the r-th quality attribute. Let the rth NS be the subspace constituted by all combinations of inputs that theoretically guarantee the desired value for the r-th quality attribute of interest, but not necessarily for the rest. Then, given a desired value dDES;r for the r-th quality attribute, a [L  1] vector b y NSr corresponding to any set of values for the outputs associated to a set of scores τNSr on the r-th NS can be found such that: b ⋅ τNS þ mY Þ ¼ dDES;r y NSr ¼ aTr ⋅ ðDSY ⋅ Q aTr ⋅ b r

d covðfl ; fl’ Þ ¼

2

v1;0

6v 6 v0 ¼ 6 2;0 4 ⋮ vR;0

3 7 7 7 ¼ Ar ⋅ mY  dDES 5



(19)

2 XN  y y n;l  b n;l n¼1 N  df

!  XN  ’  b y y y ⋅ y ’ n;l  b n;l n;l n;l n¼1 N  df

y NEW;l

(16)

mated standard deviation of the l-th output, by sb , the estimated d obs;r standard deviation of the r-th quality attribute. 5. Case studies

2

3 T 6 v1 7 6 7 6 T7 6 v2 7 b ; V¼6 7 ¼ Ar ⋅ DSY ⋅ Q 6 ⋮ 7 6 7 4 5 vTR

5.1. Case study 1: simulated data with two outputs (17)

This section illustrates a scenario in which more than one quality attribute is considered, and for which the DS estimation could not be analytically obtained by means of the available methodologies proposed in the literature. This is because two quality attributes are considered simultaneously, as opposed to the cases illustrated in Refs. [28,29]. Furthermore, the approach in Ref. [15] lacks the use of the limits of the confidence regions for the two NS as hard constraints. The data for this example were simulated following the procedure explained in Refs. [30,31] to generate multivariate normal data. Fifty observations with six variables (four considered as inputs (xi ) and two as outputs (yi )) following the correlation structure shown in Table 2 were simulated. Table 3 shows the mean and standard deviation for the generated dataset, for all xi and yi variables. The lower and upper limits for each of the input variables are also shown. These will be used as hard constraints,

Ar being a [R  L] matrix with aTr as its r-th row, dDES a [R  1] column vector with dDES;r as its r-th element, and τNS a set of scores in the intersection of the R NSs (if it exists). Again, as long as rX is higher than one it will be possible to define a NS corresponding to the desired value of one of the quality attributes of interest. It should be noted that Eq. (10) is a particular case of Eq. (17) where Ar ¼ IL , IL being the [L  L] identity matrix. 4.4.2. Confidence region of the null space for a linear combination of outputs The adaptation of the expressions in Eq. (11) and Eq. (13) for the confidence region of the NS and the experiment space corresponding to the r-th linear combination of outputs is almost trivial, but estimating the standard deviation of the r-th quality attribute of interest, expressed as a linear combination of outputs, is required. To do this, consider the r-th quality attribute dr for a given observation xobs with outputs yobs to be expressed as dobs;r ¼ aTr ⋅ yobs . The 100 ⋅ ð1  αÞ% CI of the prediction of dobs;r given an observation xobs is calculated as: CIb ¼ aTr ⋅ b y obs  tNA;α=2 ⋅ sb d obs;r d obs;r sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi   1 sb ¼ aTr ⋅ Sf ⋅ ar ⋅ 1 þ hobs þ d obs;r N



as illustrated in the Supplementary Material. Eq. (4) constitutes a particular case of Eq. (18) and Eq. (19), when ar is substituted by ol . The analytical expressions for the confidence region of the NS and the experiment space corresponding to the r-th linear combination of outputs are equivalent to Eq. (11) and Eq. (13), substituting e.g. sb , the esti-

Eq. (16) is a more general expression of Eq. (8), where ol has been substituted by ar , and therefore the same steps as in Section 4.1 can be done so that, in the end: v0 þ V ⋅ τNS ¼ 0



1 d covðf1 ; fL Þ C C C C d covðf2 ; fL Þ C C C C C C ⋮ C C A 2 sfL

Table 2 Case Study 1: Correlation matrix used so construct the dataset.

x2 x3 x4 y1 y2

x1

x2

x3

x4

y1

0.45 0.54 0.30 0.50 0.30

0.50 0.55 0.10 0.34

0.70 0.15 0.50

0.20 0.15

0.70

(18) Table 3 Case Study 1: Characterization of the input and output calibration datasets.

sb being the estimated standard deviation of the r-th linear combinad obs;r tion of outputs for an observation xobs , and Sf the variance-covariance matrix of prediction errors:

7

Variable

Mean

Std. dev.

Lower limit

Upper limit

x1 x2 x3 x4 y1 y2

0 0 0 0 0 0

1 1 1 1 1 1

2.39 1.91 1.87 2.22 – –

2.92 1.79 1.87 1.87 – –

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

Fig. 4. Case Study 1: yTDES ¼ ½1:2375; 0:4024. Delimitation of the experiment space (grey area) with a) the proposed approach and; b) the optimization approach in Ref. [15].

rate of points in the experiment space for which the 95% confidence interval of the prediction of each output variable contains the (previously defined) desired values for those outputs, and AHIL represents the average value of the half interval length of such confidence intervals. Both the ECR and the AHIL were obtained by averaging the results obtained for 100 different datasets per each number of initial samples in the corresponding datasets. The same indexes were also obtained for the experiment space as would be obtained according to the methodology in Ref. [28], and can also be seen in red in Fig. 5, so that the performance of both approaches can be compared to each other.6 As can be seen in Fig. 5, the ECR associated to the second output variable is almost 1 (i.e. 100%) for any of the considered numbers of observations in the calibration dataset, while for the first output variable is never below 0.9 and greater than 0.95 once the calibration dataset contains 15 observations or more, when using the proposed approach. For the definition of the experiment space as delimited in Ref. [28], the ECR is always smaller than for the proposed approach for both output variables. This is because the limits of the confidence region are calculated using the leverage of the direct inversion solution, which is in this case higher than that of the set of scores on each NS with lowest leverage. In this sense, the proposed approach can be seen to provide a better estimation of the experiment space as a subspace where the true DS is expected to lie with at least a 95% confidence level. Another important point to make is that, although the proposed approach may seem to be too conservative (i.e. it provides values for the ECR that are higher than the chosen confidence level, and higher than the ECR for the most similar method in the literature), its aim is precisely defining a sensibly restricted experiment space where the true DS is expected to fall. Therefore, ECR values closer to one (not the chosen confidence level) are more desirable. Regarding the AHIL, for the second output it is almost twice than for the first output variable for both the proposed approach and that in Ref. [28]. This is somewhat expected as the uncertainty in the second output is higher than for the first one (see width of the experiment space for both outputs in Fig. 4).

and will be projected onto the latent space as done in Section 4.3. Fig. 4 shows the experiment space applied to this example, with yTDES ¼ ½1:2375;  0:4024. To differentiate between the two output variables considered, the NS and corresponding confidence limits for y1;DES are represented with continuous lines, and dashed lines are used for y2;DES . Two LVs were chosen to fit the PLS-regression model in order to allow easier graphical representation of the results, although four LVs would have been better option to explain more of the variability of X (R2 (X,2LV) ¼ 0.642; R2 (X,4LV) ¼ 0.87), and to better predict Y according to the leave-one-out cross validation method (Q2(2LV) ¼ 0.858; Q2 (4LV) ¼ 0.997). The proposed methodology could be applied with a higher-dimensional latent space, although visualizing the experiment space becomes more difficult in such case. Although no results are represented for the application of the approach in Ref. [28] in this case, the resulting plot (if it were to be applied to each output variable separately) would have been very similar to that in Fig. 4a, since the limits of the confidence regions for both NS are almost (but not quite) straight lines, even accounting for the leverage in their definition. In this case, only the projection of one of the observations falls outside of the area defined by the restrictions on the inputs. Furthermore, assuming the Hotelling T 2 ellipse to be a good estimation of the KS would have been a much better approximation than in the illustrative example shown in Fig. 3, although a small fraction of it would still be left out by the restrictions on the inputs. In this case, only half of the eight restrictions imposed on the inputs remain in the latent space, with the other four ones being redundant. The shadowed area in Fig. 4a corresponds to the intersection between the corresponding individual experiment spaces for y1;DES and y2;DES obtained using the proposed approach. The intersection of both subspaces is considered to define the overall experiment space, and not the union, because only combinations of inputs leading to the desired values for both outputs simultaneously would be accepted. In this case such intersection exists and at least some part of it is located within the subspace defined by the Hotelling T 2 ellipse and all the other restrictions on the inputs. Nevertheless, this may not be the case in more complex examples or if different values for y1;DES and y2;DES are specified. The shadowed area in Fig. 4b, on the other hand, is the subspace of acceptable solutions as would be defined in the optimization approach described in Ref. [15]. Note that such subspace is very different from the experiment space obtained with the proposed approach. In order to assess the performance of the proposed method for the delimitation of the experiment space, two performance indexes: the empirical coverage ratio (ECR) and the average half interval length (AHIL) are obtained when datasets of different sizes (from 6 initial observations to 50 initial observations) are used to fit the PLS-regression model. These results are shown in black in Fig. 5, where ECR represents the

5.2. Case study 2: Vinyl-chloride monomer manufacturing simulation This case study serves to illustrate the scenario when the quality attribute of interest is a linear combination of several outputs, as presented in Section 4.4, and corresponds to the synthesis of 1,2-

6

The computational cost of the proposed approach is far lower than for the methods in Refs. [15,27], while the only practical distinction with the approach in Ref. [28] in a 2-dimensional latent space is that the differences in leverage among the points along the NS is accounted for. This is why only this comparison is made. 8

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

Fig. 5. Case Study 1: yTDES ¼ ½1:2375; 0:4024. As a function of the size of the calibration dataset, for the proposed approach (black lines) and the one in Ref. [28] (red lines): a) empirical coverage ratio (ECR) and; b) average half interval length (AHIL). (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

variables, while variables x6 to x8 are measured ones; z1 is a mole fraction of the predominant reactant in stream TOP3, and y1 and y2 are the relevant outputs in this case study. An initial dataset with 20 simulated samples uniformly spanning the

dichloroethane (EDC) in a vinyl-chloride monomer (VCM) production process plant, simulated by using the software PRO/II and in accordance with the guidelines provided in Ref. [32]. In order to introduce some variability in this process without loss of validity of any assumption made during the simulation, manipulated variables were allowed to vary a maximum of 5% around the assigned values in Ref. [32]. Fig. 6 illustrates a simplified block diagram corresponding to such production process. In this case study, focus is directed towards the purity of EDC obtained after saturation during the EDC purification step (stream EDC4). The top section (blue section in Fig. 6) corresponds to the direct chlorination section of the plant, to which two streams with pure chlorine and ethylene, respectively, are fed. The bottom section (yellow section in Fig. 6) corresponds to the oxy-chlorination section, to which air and pure ethylene are fed together with a recycled stream constituted mostly by hydrochloric acid. Both streams are mixed and subjected to saturation (green section in Fig. 6) as the first step during the EDC purification. Table 4 shows the name and the meaning of the variables involved in this case study. Process variables x1 to x5 are manipulated input

Table 4 Case Study 2: Definition of the variables involved. Name

Meaning

x1 x2 x3 x4 x5 x6 x7 x8 z1 y1 y2

Flow (lb-mol/h) of CL2F (Cl2) Flow (lb-mol/h) of C2F1 (Ethylene) Flow (lb-mol/h) of AIR (O2þN2) Flow (lb-mol/h) of C2F2 (Ethylene) Flow (lb-mol/h) of WTR1 (Water) Temperature (ºC) of TOP3 Pressure (Psig) of TOP3 Molar flow (lb-mol/h) of TOP3 Molar fraction of Hydrochloric Acid (HCl) in TOP3 Total molar flow rate (lb-mol/h) of stream EDC4 Molar fraction of EDC in stream EDC4

Fig. 6. Case Study 2: Simulated EDC production process as an intermediate product to obtain VCM. 9

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

defined at the start of Section 5.2. The 95% confidence limits for the NS (experiment space) and the restrictions on the inputs transferred onto the latent space are also represented. In this case, due to the uncertainty in the DS estimate, the hard restrictions associated to the 95% NS confidence limits are redundant, as is the Hotelling T 2 hard constraint. Fig. 7b represents a zoomed in version of Fig. 7a to better visualize the experiment space. In this case, the experiment space resulting from applying the methodology in Ref. [28] would include the whole interior of the Hotelling T 2 (green and grey areas), while both the optimization approach in Ref. [15] and the proposed approach would further restrict the experiment space (grey area only).

hyperrectangle defined by the upper and lower limits on variables x1 to x8 and z1 is used for this example. Table 5 shows the mean and standard deviation for each variable in this dataset. The lower and upper limit constraints imposed on the xi variables and z1 , which bound the experiment space, are also indicated. These restrictions are defined in order to simulate the univariate limits usually imposed on the input variables in practice to avoid extrapolation outside of the historical, normal operating conditions. Additionally, the constraint x2  0:995 ⋅ x1 is imposed to guarantee that at least 99.5% of chlorine can be converted during the direct chlorination step of the process, so that it meets the requirements for its proper simulation. In this example, the initial fitting of a PLS regression model relating the input variables x1 to x8 and z1 with the output variables y1 and y2 explains and predicts poorly y2 (R2(X) < 0.95, R2 (y2)<0.46 and Q2(y2)≪0.01 even with 5 LVs), which is the quality attribute of interest in this case. However, if input variable z1 , which represents a molar fraction of a given species in a specific stream, is substituted by the new variable x9 , representing the molar flow rate (i.e. x9 ¼ z1 ⋅ x8 ) of the same species in that flow, and likewise y2 is substituted by y3 ¼ y1 ⋅ y2 , the PLS regression models fitted with these ‘new’ variables provide much better explanatory and predictive capabilities (for example, R2(Y) > 0.953 and R2(X) > 0.563 when using 2 LVs, and R2(Y) > 0.965 and R2(X) > 0.863 when using 5 LVs). Note that the new variables x9 and y3 are physically meaningful, as they represent molar flows of two species. Furthermore, since z1 and y2 are both molar fractions they are bounded between 0 and 1 for whatever values of the other inputs (or outputs). Therefore, removing z1 and y2 (and including x9 and y3 instead) also removes much of the nonlinearity responsible for the bad performance of the initially fitted model. However, the quality attribute of interest in this case is y2 , and the new model fitted does not include it anymore. Instead, y1 and y3 are used as model outputs. But, since y3 ¼ y1 ⋅ y2 , it can also be said that y2 ¼ y3 = y1 , and given that y1 > 0, if a desired value y2;DES for y2 is specified, then any combination of values for the input variables in the DS shall satisfy the equality: y2;DES ⋅ y1  y3 ¼ 0

6. Conclusions The proposed methodology for the DS estimation combines the already proposed constraints in the literature for optimization [15] with the idea of bracketing the DS [28]. This methodology has been shown to provide robust results in the delimitation of the experiment space, in the sense that this subspace is shown to be at least as reduced as the one resulting from the most restrictive approach between those in Refs. [15, 28], as illustrated in the first and second case studies, if not more restricted, as seen in the illustrative example. Furthermore, as illustrated with the first case study, the methodology presented provides values for the empirical coverage ratio (ECR) closer to one that the approach in Ref. [28], i.e. it provides a good definition of the subspace expected to contain the DS of a product while being more restrictive than previous approaches. To achieve this goal, a new way to analytically define the NS has been provided, which also allows the analytical definition of the limits of its confidence region that accounts for the changes in leverage along the different points in the NS. This new definition presents three main advantages: It allows the use of the confidence region limits for each NS as hard constraints in the definition of the experiment space, which was not possible in the optimization problem defined in Ref. [15]. The non-linearity of the limits of the confidence region of each NS, due to the effect of the leverage on the prediction uncertainty, can also be accounted for in their analytical expression, which could not be explicitly considered in the bracketing of the DS proposed in Ref. [28]. An extension of the definition of the NS for quality attributes defined as linear combinations of outputs is made possible. This is useful to approach problems that could not be addressed otherwise, as illustrated in the second case study.

(20)

This case study is interesting because neither the methodology to bracket the DS in Ref. [28], nor the optimization formulation in Ref. [15] allow approaching a problem like this. When the concepts illustrated in Section 4.4. are applied to this case study, the results illustrated in Fig. 7 are obtained for the NS associated to Eq. (20) for y2;DES ¼ 0:9792, and the model fitted with data as described at the start of Section 5.2, together with the corresponding confidence interval. Although 5 LVs would provide a better fit, and the methodology proposed in this paper could be applied nonetheless, a PLS-regression model with only 2 LVs is fitted here in order to better illustrate the results achieved. In Fig. 7a the 95% Hotelling T 2 confidence ellipse when a PLSregression model with 2 LVs was fitted, the initial dataset being as

In a similar way, a methodology to transfer restrictions on the inputs and outputs to the latent space has been presented. This is important, because constraints imposed on the input variables, on the outputs and on the latent variables are not directly comparable among each other, that is, one cannot determine which of them, if any, may be redundant when they are defined as affecting variables in different spaces. On the other hand, the projection of the restrictions on the inputs and/or outputs represent restrictions on the latent variables, so that they can be compared among themselves as well as with constraints originally defined in the latent space, such as those regarding the Hotelling T 2 . Furthermore, the computational cost for the bracketing of the DS is negligible7 compared to the approaches in Ref. [15] or [27], and the expressions in Sections 4.2 and 4.4.2 relative to the confidence regions for the NS can be extended to inputs and linear combinations of inputs. This may be of interest e.g. whenever considering the uncertainty if

Table 5 Case Study 2: Characterization of the input and output calibration datasets. Variable

Mean

Std. dev.

Lower limit

Upper limit

x1 x2 x3 x4 x5 x6 x7 x8 z1 y1 y2

142.19 157.21 312.24 118.02 54.53 30.93 135.23 241.76 0.99 240.52 0.98

13.36 16.70 32.17 13.99 4.59 0.53 2.55 24.98 4.443⋅105 19.43 7.6⋅104

121.22 132.45 278.21 105.70 45.73 31.87 130.51 228.50 0.99 – –

161.47 179.71 344.08 136.83 67.28 29.91 141.02 252.56 0.99 – –

7 Less tan 1 s on a MacBook Pro equipped with a 2.9 GHz Intel Core i5 and 8 GB 1867 MHz DDR3 RAM.

10

D. Palací-L opez et al.

Chemometrics and Intelligent Laboratory Systems 194 (2019) 103848

Fig. 7. Case Study 2: y2;DES ¼ 0:9792. a) Representation of all restrictions in the latent space, and b) delimitation of the experiment space with either the proposed approach or the optimization approach in Ref. [15] (grey area) and with the approach in Ref. [28] (grey area plus green area). (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

restrictions are imposed on inputs that cannot be freely manipulated. This paper provides a new approach to take into account the uncertainty in the estimation of the DS, but does not deal with how such uncertainty may be reduced to improve the accuracy in the estimation of the DS. In addition, the definition of the DS used in this paper assumes that the desired quality attributes are defined such that they can be obtained from products similar to the already produced ones in the past. In practice, the goal may be finding the most appropriate operating conditions to achieve new different products from past ones. These two issues will be addressed in future works.

[10] S. García-Mu~ noz, T. Kourti, J.F. MacGregor, F. Apruzzese, M. Champagne, Optimization of batch operating policies. Part I. Handling multiple solutions, Ind. Eng. Chem. Res. 45 (2006) 7856–7866. [11] J.F. MacGregor, M.-J. Bruwer, A framework for the development of design and control spaces, J. Pharm. Innov. 3 (2008) 15–22. [12] C. Jaeckle, J. Macgregor, Product design through multivariate statistical analysis of process data, Comput. Chem. Eng. 20 (1996) S1047–S1052. [13] L. Zhang, S. Garcia-Munoz, A comparison of different methods to estimate prediction uncertainty using Partial Least Squares (PLS): a practitioner’s perspective, Chemometr. Intell. Lab. Syst. 97 (2009) 152–158. [14] C.C. Pantelides, N. Shah, C. Adjiman, Design space, models, and model uncertainty, comprehensive quality by design in pharmaceutical development and manufacture, Am. Inst. Chem. Eng. (2009) 97. [15] E. Tomba, M. Barolo, S. García-Mu~ noz, General framework for latent variable model inversion for the design and manufacturing of new products, Ind. Eng. Chem. Res. 51 (2012) 12886–12900. [16] S. Wold, M. Sjostrom, PLS-Regression: a basic tool of chemometrics, Chemometr. Intell. Lab. Syst. 58 (2001) 109–130. [17] A. Ferrer, Multivariate statistical process control based on principal component analysis (MSPC-PCA): some reflections and a case study in an autobody assembly process, Qual. Eng. 19 (2007) 311–325. [18] K. Faber, B.R. Kowalski, Prediction error in least squares regression: further critique on the deviation used in the Unscrambler, Chemometr. Intell. Lab. Syst. 34 (1996) 283–292. [19] A.J. Burnham, J.F. MacGregor, R. Viveros, Interpretation of regression coefficients under a latent variable regression model, J. Chemom. 15 (2001) 265–284. [20] Y. Wang, Y. Yang, J. Jiao, Z. Wu, M. Yang, Support vector regression approach to predict the design space for the extraction process of pueraria lobata, Molecules 23 (2018). [21] L. Peltonen, Design space and QbD approach for production of drug nanocrystals by wet media milling techniques, Pharmaceutics 10 (2018) 104. [22] D. Laky, S. Xu, J. Rodriguez, S. Vaidyaraman, S. García Mu~ noz, C. Laird, An optimization-based framework to define the probabilistic design space of pharmaceutical processes with model uncertainty, Processes 7 (2019) 96. [23] J.J. Peterson, A Bayesian approach to the ICH Q8 definition of design space, J. Biopharm. Stat. 18 (2008) 959–975. [24] C. Pantelides, M. Pinto, S. Bermingham, Optimization-based design space characterization using first-principles models. Comprehensive Quality by Design in pharmaceutical development and manufacture, in: Proc. AIChE Annu. Meet. Salt Lake City, UT, USA, 2010, pp. 7–12. [25] K.A. Chatzizacharia, D.T. Hatziavramidis, Design space approach for pharmaceutical tablet development, Ind. Eng. Chem. Res. 53 (2014) 12003–12009. [26] E.J. Close, J.R. Salm, D.G. Bracewell, E. Sorensen, A model based approach for identifying robust operating conditions for industrial chromatography with process variability, Chem. Eng. Sci. 116 (2014) 284–295. [27] G. Bano, P. Facco, F. Bezzo, M. Barolo, Probabilistic Design space determination in pharmaceutical product development: a Bayesian/latent variable approach, AIChE J. 64 (2018) 2438–2449. [28] P. Facco, F. Dal Pastro, N. Meneghetti, F. Bezzo, M. Barolo, Bracketing the design space within the knowledge space in pharmaceutical product development, Ind. Eng. Chem. Res. 54 (2015) 5128–5138. [29] G. Bano, P. Facco, N. Meneghetti, F. Bezzo, M. Barolo, Uncertainty backpropagation in PLS model inversion for design space determination in pharmaceutical product development, Comput. Chem. Eng. 101 (2017) 110–124. [30] F. Arteaga, A. Ferrer, Building covariance matrices with the desired structure, Chemometr. Intell. Lab. Syst. 127 (2013) 80–88. [31] F. Arteaga, A. Ferrer, How to simulate normal data sets with the desired correlation structure, Chemometr. Intell. Lab. Syst. 101 (2010) 38–42. [32] PRO/II casebook #1 vinyl chloride monomer plant, in: PRO/II Caseb. #1, Simulation Sciences INC, 1992.

Acknowledgements This study resulted from the collaboration between the Computer Aided Process Engineering (CAPE-Lab) group at the Universit a degli Studi di Padova and the Multivariate Statistical Engineering Group (MSEG) at the Universitat Politecnica de Valencia (UPV). Financial support was granted by the UPV as a part of the ErasmusþProgramme: Key Action 1 – Mobility for learners and staff – Higher Education Student and Staff Mobility, and also by the Spanish Ministry of Economy and Competitiveness under the project DPI2017-82896-C2-1-R MINECO/AEI/FEDER, UE. Appendix A. Supplementary data Supplementary data to this article can be found online at https:// doi.org/10.1016/j.chemolab.2019.103848. References [1] FDA, Pharmaceutical CGMPs for the 21s Century - A Risk-Based Approach, 2004. [2] ICH, Pharmaceutical development Q8, ICH harmon, Tripart. Guidel. 8 (2009) 1–28. [3] J.J. Liu, J.F. MacGregor, Modeling and optimization of product appearance: application to injection-molded plastic panels, Ind. Eng. Chem. Res. 44 (2005) 4687–4696. [4] D. Bonvin, C. Georgakis, C.C. Pantelides, M. Barolo, M.A. Grover, D. Rodrigues, R. Schneider, D. Dochain, Linking models and experiments, Ind. Eng. Chem. Res. 55 (2016) 6891–6903. [5] G.E.P. Box, J.S. Hunter, W.G. Hunter, Statistics for Experimenters: Design, Discovery, and Innovation, Second, John Wiley & Sons, INC., 2005. [6] J.F. MacGregor, Empirical models for analyzing “big” data-whats the difference, in: Spring AIChE Conf., Orlando, Florida, USA, 2018. [7] Z. Liu, M.-J. Bruwer, J.F. MacGregor, S. Rathore, D.E. Reed, M.J. Champagne, Modeling and optimization of a tablet manufacturing line, J. Pharm. Innov. 6 (2011) 170–180. [8] J.F. MacGregor, M.J. Bruwer, I. Miletic, M. Cardin, Z. Liu, Latent variable models and big data in the process industries, IFAC-PapersOnLine 28 (2015) 520–524. [9] C.M. Jaeckle, J.F. MacGregor, Industrial applications of product design through the inversion of latent variable models, Chemometr. Intell. Lab. Syst. 50 (2000) 199–210.

11