A multi-thread parallel computation method for dynamic simulation of molecular weight distribution of multisite polymerization



Computers and Chemical Engineering 82 (2015) 55–67


Jinzu Weng a, Xi Chen a,∗, Lorenz T. Biegler b

a State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, PR China
b Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA

∗ Corresponding author. Tel.: +86 57187953966. E-mail address: [email protected] (X. Chen).

Article history: Received 2 April 2014; Received in revised form 16 April 2015; Accepted 30 May 2015; Available online 22 June 2015

Keywords: Multi-thread parallel computation; Multisite polymerization; Molecular weight distribution; Dynamic simulation

Abstract

Molecular weight distribution (MWD) is an important quality index of polymer products. Many methods have been proposed to dynamically simulate the MWD of polymerization, but these methods are normally designed for serial computation. In this paper, a multi-thread parallel computation method is proposed for multisite free-radical polymerization. Analysis of the relationships among the different subtasks leads to a combined parallel strategy that fully exploits the parallel features of the process, and good acceleration of the dynamic simulation of MWD based on the Flory method is obtained. The speedup ratio (SR) and parallel efficiency (PE) are analyzed theoretically, and the results show a good match between the software algorithm and the hardware configuration. The efficiency of the proposed parallel method is demonstrated on industrial slurry processes producing high-density polyethylene (HDPE). © 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Polymers are widely used in the production of automobiles, bottles, films, and electric appliances, and each usage requires different polymer specifications. One key property that directly influences the end-use quality of polymers is the molecular weight distribution (MWD), which characterizes the molecular features of the product. For example, MWD affects the processing characteristics of the polymer through the melting point and the flow properties of the melted polymer, and it determines the mechanical properties of the processed product, such as strength and impact resistance (Ellis et al., 1988). The calculation of MWD, however, remains challenging because of its large-scale features. Several methods have been proposed to compute MWD, including the molecular weight moment method (Timothy and Kyu, 1997), the lumping method (Arno et al., 1978), the continuous variable methods (Stanley and Gerald, 1967), the numerical Monte Carlo method (Al-Harthi et al., 2006, 2007; Yao et al., 2012), the transformed domains method (Miller et al., 1996; Mills, 1986), the discrete Galerkin method (Deuflhard and Wulkow, 1989; Wulkow, 1996), the reduced stiffness via quasi-steady-state approximation method (Enrique et al., 2010; Zapata-Gonzales et al., 2012), the serial variable decoupling method (Chen et al., 2009), and


the Flory method (Flory, 1953; Soares, 2001; Soares and McKenna, 2012). Timothy and Kyu (1997) proposed a novel method that utilizes polymerization kinetic equations and molecular weight moment equations in conjunction with a function that describes the weight fraction of the polymer in a finite chain length interval; in this method, the entire range of molecular weight is divided into a finite number of intervals, and the weight fractions of the polymers in these intervals are calculated. Stanley and Gerald (1967) developed equations that temporally track the moments of the polymer size distribution and the conversion for polymerization models; exact and approximate solutions for the leading moments of the size distribution are determined for polymerization with initiation, propagation, and termination by combination. Al-Harthi et al. (2006, 2007) developed a dynamic Monte Carlo model to simulate atom-transfer radical polymerization (ATRP); the algorithm included activation, deactivation, propagation, chain transfer, and termination by combination and disproportionation. Reaction probabilities were calculated from the polymerization kinetic parameters and reactor conditions, and the model predicts monomer conversion, average molecular weight, polydispersity, and the complete molecular weight distribution at any polymerization time or monomer conversion. Yao et al. (2012) used the Monte Carlo method to simulate the time evolution of MWD and copolymer composition during the exchange reaction between polyamides with AA and BC structures, analyzing the influences of the initial composition and molecular weight. Miller et al. (1996) performed a continuous variable approximation


in chain length; the ordinary differential equations (ODEs) were converted to a finite set of partial differential equations, which were solved by taking the Laplace transform with respect to the chain length, and the procedure yielded ordinary differential equations in time. Deuflhard and Wulkow (1989) and Wulkow (1996) described the development of a comprehensive solver for the differential equations arising from models of MWDs in polyreactions; the algorithm is based on the discrete Galerkin h-p method, which allows the efficient treatment of many types of polymerization reactions. The quasi-steady-state approximation was applied to the fast-dynamics species of the polymerization reactions in order to reduce the stiffness, and the remaining equations were solved by computationally inexpensive, explicit algorithms (Enrique et al., 2010; Zapata-Gonzales et al., 2012); specific features of the methodology were applied to academically and industrially relevant systems of controlled radical polymerization and coordination catalysis polymerization. Chen et al. (2009) proposed a serial variable decoupling method to rigorously calculate the MWD of batch free-radical polymerization; the originally large differential-algebraic equation (DAE) model was decomposed into a series of small-scale ODEs to calculate the MWD of a polymer with large chain length. Under the steady-state or pseudo-steady-state assumption (PSSA), Flory (1953) proposed the conventional Flory–Schulz distribution that describes the MWD of a polymer, in which the reactions between functional groups are treated as probability events based on the kinetic mechanism; it is also called the instantaneous Flory distribution because of the steady-state assumption. For continuous polyethylene production, Soares (2001) reviewed a cumulative Flory method that integrates an approximated Flory distribution over time and radical location in the polymer particle to recover the MWD of the cumulative polymer; the convolution of the instantaneous distribution and the polymerization parameters is calculated. The Flory distribution remains practical and useful in polymerization, although it is invalid when chain transfer to polymer occurs or when the dynamic evolution of free radicals with various chain lengths must be simulated. Because this approach relies on analytical solutions of the population balances, the Flory distribution can properly quantify the microstructure of polyolefins (Soares and McKenna, 2012).

While the previously mentioned methods are effective in calculating MWD, they are normally designed for serial computation; the parallel features of the MWD computation are not considered. Parallel computing requires multiple tasks to be performed simultaneously (Almasi and Gottlieb, 1989). The algorithm design for parallel computing takes more effort and time than that for serial computing, but the acceleration effect can be substantial. As the improvement of single-processor computing has slowed down, researchers have realized the importance of parallel computing, which has been widely applied in chemical engineering (Guido and Flavio, 2010; Ge et al., 2011; Nguyen and Bagajewicz, 2011; Laird et al., 2011; Zhu and Laird, 2008). However, only a few papers have reported on calculating the MWD of polymerization in parallel. Chen et al. (2013) analyzed the MWD model and concluded that the traditional calculation was conducted stepwise along time and chain length; after analyzing the parallelism in this procedure, they developed coarse-grained and fine-grained parallel methods along time and chain length to accelerate the simulation of MWD. That study focused on polymerization at a single active site. A large number of polymerizations involving multisite catalyst mechanisms remain to be considered; these introduce new challenges for parallel computation. Here we design a novel parallel strategy to compute the MWD of multisite free-radical polymerization. The primary serial computation is divided and reconstructed using multiple threads by analyzing the relationships and communications among the moment method and the MWD calculation of each active site. A parallel strategy is proposed to accelerate the computation with a maximum speedup ratio (SR), and this calculation is analyzed to select a proper number of parallel cores. Finally, we dynamically simulate the grade transition of an industrial slurry process producing high-density polyethylene (HDPE).

Table 1
Kinetic mechanism of homopolymerization.

Reaction type             Description
Activation                Cp(j) + A   --kaA(j)-->  P0(j)
Initiation                P0(j) + M   --ki(j)-->   P1(j)
Propagation               Pn(j) + M   --kp(j)-->   Pn+1(j)
Transfer to monomer       Pn(j) + M   --ktM(j)-->  P1(j) + Dn(j)
Transfer to hydrogen      Pn(j) + H2  --ktH(j)-->  P0(j) + Dn(j)
Transfer to co-catalyst   Pn(j) + A   --ktA(j)-->  P0(j) + Dn(j)
Transfer of β-hydride     Pn(j)       --kt(j)-->   P0(j) + Dn(j)
Deactivation              Pn(j)       --kd(j)-->   Cd(j) + Dn(j)
                          P0(j)       --kd(j)-->   Cd(j)


2. Flory distribution modeling of multisite polymerization

A multisite catalyst mechanism exists in a large number of polymerizations. A catalyst is assumed to have two or more site types if it produces polymers with broad microstructural distributions. Multisite catalysts, such as heterogeneous Ziegler–Natta and Phillips catalysts, pose several challenges to parameter estimation; these catalysts, however, are not essentially harder to model than single-site catalysts. Each active site type is governed by a distinct set of polymerization rate constants and behaves as a single-site catalyst. Although the multisite mechanism adds computational complexity, it allows MWD computation methods designed for single-site catalysts to be applied to their multisite counterparts; for example, the Flory method still applies to multisite polymerizations. The relative independence of the different sites gives the MWD computation for multisite polymerization its parallel feature, so a parallel computation strategy can be designed. We use multiple threads and demonstrate the approach on an industrial HDPE slurry process. Continuous slurry polymerization with multisite Ziegler–Natta catalysts is a typical and effective process to manufacture HDPE. Table 1 summarizes the kinetic mechanism, where Cp is a potential active site of the catalyst TiCl4, A is the co-catalyst Al(C2H5)3, M is the monomer, H2 is hydrogen, P0 is an active site, Cd is a deactivated site, Dn is a dead polymer of chain length n, Pn is a living polymer of chain length n, and j is the index of the active sites; kaA, ki, kp, and kd are the kinetic constants of the activation, initiation, propagation, and deactivation reactions, respectively; ktM, ktH, ktA, and kt are the kinetic constants of chain transfer to monomer, hydrogen, co-catalyst, and β-hydride, respectively. In particular, the chain transfer to hydrogen reaction is considered to be of order 0.5 (Kim et al., 1991). The conventional method for the dynamic simulation of MWD comprises three modules: the moment method, the instantaneous Flory method, and the cumulative Flory method. Here w is the instantaneous weight, W is the cumulative weight, f represents the instantaneous distribution, and F represents the cumulative distribution. The detailed methods are elaborated as follows.
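The per-site rate constants of Table 1 appear repeatedly in the equations below. For the illustrative code sketches in this paper they can be grouped in a small container; the struct and field names below are assumptions introduced only for illustration and are not taken from the paper's implementation.

// Illustrative container for the kinetic constants of one active site
// (names are assumptions, not the paper's code).
struct SiteKinetics {
    double kaA;  // activation
    double ki;   // initiation
    double kp;   // propagation
    double ktM;  // transfer to monomer
    double ktH;  // transfer to hydrogen (reaction order 0.5 in [H2])
    double ktA;  // transfer to co-catalyst
    double kt;   // beta-hydride transfer
    double kd;   // deactivation
};

With Ns site types, an Ns-element array of such records would hold the distinct rate constants of each site.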

2.1. The moment method

The moment method solves the leading moments of the chain length distribution (CLD) with little computational effort. The respective mth moments of the living and dead polymers at the jth active site are

\[ Y^m(j) = \sum_{n=1}^{\infty} n^m [P_n(j)] \quad (1) \]

\[ X^m(j) = \sum_{n=2}^{\infty} n^m [D_n(j)] \quad (2) \]

The mechanism in Table 1 defines the following equation for the overall kinetic constant of transfer and deactivation:

\[ K_T(j) + K_D(j) = k_{tM}(j)[M] + k_{tH}(j)[H_2]^{0.5} + k_{tA}(j)[A] + k_t(j) + k_d(j), \quad \text{for } j = 1:N_s \quad (3) \]

where j represents the index of the active sites and Ns represents the total number of active sites. The zeroth moment of the CLD of the living polymer chains can be derived by substituting the corresponding population balance into the moment expression:

\[ \frac{d(Y^0(j)V)}{V\,dt} = k_i(j)[P_0(j)][M] - (K_T(j) + K_D(j))Y^0(j) + k_{tM}(j)[M]Y^0(j), \quad \text{for } j = 1:N_s \quad (4) \]

Similarly, the zeroth moment of the CLD of the dead polymer chains takes the form

\[ \frac{d(X^0(j)V)}{V\,dt} = (K_T(j) + K_D(j))Y^0(j) - k_{tM}(j)[M][P_1(j)], \quad \text{for } j = 1:N_s \quad (5) \]

The first moments of the CLD of the living and dead polymer chains are given as

\[ \frac{d(Y^1(j)V)}{V\,dt} = k_i(j)[P_0(j)][M] + k_p(j)[M]Y^0(j) - (K_T(j) + K_D(j))Y^1(j) + k_{tM}(j)[M]Y^0(j), \quad \text{for } j = 1:N_s \quad (6) \]

\[ \frac{d(X^1(j)V)}{V\,dt} = (K_T(j) + K_D(j))Y^1(j) - k_{tM}(j)[M][P_1(j)], \quad \text{for } j = 1:N_s \quad (7) \]

The equation-oriented (EO) model of the process comprises equations that describe the thermodynamic properties, phase equilibrium, and heat and mass balances. A detailed description can be found in the literature (e.g., Zhang et al., 2013).

2.2. Instantaneous Flory method

The conventional Flory–Schulz distribution gives the MWD of a polymer under a pseudo-steady-state assumption (PSSA); it is also called the instantaneous Flory distribution. The weight distribution of the jth active site is derived as

\[ f(j,n) = (1 - \phi(j))^2\, n\, (\phi(j))^{n-1} \quad (8) \]

where n represents the chain length and φ(j) is the distribution parameter that represents the propagation probability of the active chain at the jth active site. For a continuous process, φ(j) is the ratio of the monomer propagation frequency to the sum of the monomer propagation, chain transfer, and deactivation frequencies at the jth active site:

\[ \phi(j) = \frac{k_p(j)[M]}{k_p(j)[M] + k_{tM}(j)[M] + k_{tH}(j)[H_2]^{0.5} + k_{tA}(j)[A] + k_t(j) + k_d(j)} \quad (9) \]

Following the calculation of the instantaneous Flory distributions for all active sites, the instantaneous distribution of the polymer product is obtained as the weighted average of the distributions of the individual sites:

\[ f(n) = \sum_{j=1}^{N_s} f(j,n)\, mf_j \quad (10) \]

where Ns is the total number of active sites and mf_j is the weight (mass) fraction of the polymer made by site type j, defined as

\[ mf_j = \frac{Y^1(j) + X^1(j)}{\sum_{j=1}^{N_s} \left( Y^1(j) + X^1(j) \right)} \quad (11) \]

where Y^1(j) and X^1(j) are the first moments of the living and dead polymers, respectively, at the jth active site. The newly produced instantaneous total weight of the jth active site at time t, or more precisely over a time interval Δt around t, is calculated as

\[ w(j,t) = \left\{ k_{tM}(j,t)[M(t)] + k_{tH}(j,t)[H_2(t)]^{0.5} + k_{tA}(j,t)[A(t)] + k_t(j,t) + k_d(j,t) \right\} Y^1(j,t)\, V(t)\, \Delta t \quad (12) \]

The component for each chain length is given as

\[ w(j,t,n) = w(j,t)\, f(j,t,n) = w(j,t)\, n\, (1 - \phi(j,t))^2\, (\phi(j,t))^{n-1} \quad (13) \]

where n represents the chain length and f(j,t,n) is the time-dependent extension of f(j,n) in Eq. (8), with the time-varying distribution parameter φ(j,t) evaluated for the jth active site at time instant t.

The instantaneous Flory method has significant advantages in terms of computation speed. The steady-state assumption or PSSA, however, limits the application of the distribution to cases in which the reactor is at steady state; when the PSSA hypothesis is unsuitable (e.g., during a grade transition), the cumulative Flory method can be introduced to calculate the time evolution of the MWD of a dynamic process.

2.3. Cumulative Flory method

In a dynamic process, the cumulative Flory distribution method can be used to compute the MWD. Inflow and outflow must be considered for a continuous polymerization; in the present case, only the outflow material needs to be considered because the feed contains no polymer. The increment of the cumulative weight for each chain length at each time interval can be calculated as

\[ \frac{dW(t,n)}{dt} = \sum_{j=1}^{N_s} w(j,t,n) - W(t,n)\, \mathrm{Flowout}(t)/V(t) \quad (14) \]

where Flowout represents the volumetric outflow rate and V represents the liquid volume of the reactor. The cumulative weight can then be obtained through integration over time. The cumulative Flory distribution is given as

\[ F(t,n) = \frac{W(t,n)}{\sum_{n=1}^{NMAX} W(t,n)} \quad (15) \]

where NMAX represents the maximum chain length. Overall, the MWD model comprises Eqs. (3)–(15); this model is a DAE system that can be computed by time-dependent discretization.
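As a concrete illustration of Eqs. (8), (9), and (12)–(14), the sketch below evaluates the instantaneous Flory weights of one site over a time interval and then performs an explicit-Euler update of the cumulative weight. It reuses the SiteKinetics container sketched after Table 1; all other names, and the explicit-Euler treatment of Eqs. (12)–(14), are illustrative assumptions rather than the paper's implementation.

#include <cmath>
#include <cstddef>
#include <vector>

// Instantaneous Flory weights of one active site over one time interval dt,
// Eqs. (8)-(9) and (12)-(13). Returns w(j,t,n) for n = 1..n_max.
std::vector<double> instantaneous_weights(const SiteKinetics& kin,
                                          double M, double H2, double A,
                                          double Y1, double V, double dt,
                                          std::size_t n_max)
{
    double hyd   = std::sqrt(H2);                               // [H2]^0.5 (order 0.5)
    double denom = kin.kp*M + kin.ktM*M + kin.ktH*hyd
                 + kin.ktA*A + kin.kt + kin.kd;
    double phi   = kin.kp*M / denom;                            // Eq. (9)
    double w_tot = (kin.ktM*M + kin.ktH*hyd + kin.ktA*A
                    + kin.kt + kin.kd) * Y1 * V * dt;           // Eq. (12)

    std::vector<double> w(n_max);
    for (std::size_t i = 0; i < n_max; ++i) {
        double n = static_cast<double>(i + 1);
        double f = (1.0 - phi) * (1.0 - phi) * n * std::pow(phi, n - 1.0);  // Eq. (8)
        w[i] = w_tot * f;                                                    // Eq. (13)
    }
    return w;
}

// One explicit-Euler step of the cumulative weight, Eq. (14), after the
// per-site weights of all Ns sites have been summed into w_sum.
void cumulative_step(std::vector<double>& W, const std::vector<double>& w_sum,
                     double flow_out, double V, double dt)
{
    for (std::size_t i = 0; i < W.size(); ++i)
        W[i] += w_sum[i] - W[i] * (flow_out / V) * dt;
}

The cumulative distribution of Eq. (15) is then simply W normalized by its sum over all chain lengths.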


Fig. 1. Conventional serial computation strategy (Start → Moment calculation → Active site 1–Ns calculation → Cumulative MWD calculation → End).

3. Multiple-thread parallel strategy

The conventional method mainly comprises three computations that are implemented serially (Fig. 1); this approach is simple but time consuming. First, the moments and the other state variables are calculated to obtain their time evolution using Eqs. (3)–(7). Second, the instantaneous Flory distribution of each active site is determined one by one using Eqs. (8), (9), (12), and (13). Lastly, the cumulative Flory distribution is obtained using Eqs. (10), (11), (14), and (15) by integrating the information obtained for each site at each time instant.

3.1. Parallel strategy

Considering the serial structure (Fig. 1) and the features of multisite polymerization, the computation can be parallelized in two ways. As discussed in Section 2, each site type follows the same polymerization mechanism as a single-site catalyst and is governed by its own set of polymerization rate constants, so it behaves essentially as a single-site catalyst. It is therefore feasible to use different threads to compute the different active sites of the instantaneous Flory distribution step in parallel (Fig. 2).

Fig. 2. Simple parallel strategy (Start → Moment calculation → Ns parallel threads, one for each active site → Cumulative MWD calculation → End).

Following the calculation of the moments, multiple threads are created to calculate the instantaneous Flory distribution, each dealing with one active site in parallel. The remaining steps are the same as in the conventional method. Hereafter, we refer to the parallel strategy in Fig. 2 as the simple parallel strategy. With this strategy, the computation time of the instantaneous Flory step can be shortened to at best 1/Ns of the original time. However, because this step accounts for only part of the whole process, the overall SR of the simple parallel strategy is limited.

The second strategy parallelizes the different modules in Fig. 1. A combined parallel calculation structure is designed to accelerate the whole computation process (Fig. 3). For a polymerization with Ns active sites, a total of Ns + 2 threads are created. Threads 0 and Ns + 1 are responsible for the moment calculation, Eqs. (3)–(7), and the cumulative MWD calculation, Eqs. (10), (11), (14), and (15), respectively; threads 1 to Ns are responsible for the instantaneous MWD calculation, Eqs. (8), (9), (12), and (13), of each active site. Two line types are used in the figure: the solid lines represent the basic operation of each thread, and the dash-dotted lines represent the communications among threads. All threads are created at the onset of the process. Thread 0 performs the moment calculation in the conventional way. Threads 1 to Ns are activated once the relevant state variables used in the instantaneous Flory method have been calculated and stored by thread 0; continuous communication with thread 0 is therefore required. Thread Ns + 1 is driven by signals indicating whether the instantaneous weights of all active sites have been calculated and stored, which likewise requires consistent communication with threads 1 to Ns. After all threads terminate, the program ends. We refer to the parallel strategy in Fig. 3 as the combined parallel strategy.

Comparison of Figs. 2 and 3 shows that the combined parallel strategy can accelerate the calculation only if the activation of each thread does not have to wait for the termination of the thread in the previous level. This is implemented by dividing the time evolution into small segments along time and keeping communication among the threads. Fig. 4 illustrates the implementation of this strategy. The moment calculation thread solves the DAE to obtain the time evolution of the moments and the other state variables. At each time instant, the results are output for further use by the threads of the next level, the instantaneous MWD calculation threads. These threads continuously monitor the progress of thread 0 to judge whether their own calculation is ready for the next time interval. Meanwhile, the third-level thread, thread Ns + 1, monitors the progress of the second-level threads to determine its own activation. Here ti (i = 1, ..., NI) represents a calculated time instant; NI represents the total number of divided time intervals; s represents the state variables used in the instantaneous Flory method, including the temperature, the concentrations of the monomer, transfer agent, and co-catalyst, the volume of the reaction liquid, the volume outflow rate, and the first moment of the living chains; and w represents the instantaneous weights of the Ns active sites. The figure illustrates the communication between the different levels of threads, which comprises two parts. The first part includes the signals from the moment calculation thread to the instantaneous weight calculation threads of active sites 1 to Ns. For each ti, a conditional statement judges whether the relevant variables have been calculated and stored; if so, a signal is sent to trigger the instantaneous weight calculation threads to conduct the computation for the available time interval. The second part conveys signals from the instantaneous weight calculation threads to the cumulative MWD calculation thread. Similarly, a conditional statement judges whether the instantaneous weights of all active sites have been calculated and stored at each ti; if so, a signal is sent to trigger the cumulative MWD calculation thread to conduct the computation for the available time interval. By implementing the combined parallel strategy, the calculation can be greatly accelerated.
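As an illustration of how the (2 + Ns) threads of the combined strategy might be launched with the Pthread API, the following sketch creates one moment thread, Ns site threads, and one cumulative thread and joins them. The worker routines and argument handling are assumptions introduced for illustration only; the paper's own synchronization snippets are shown later in Figs. 8 and 9.

#include <pthread.h>
#include <vector>

// Stub workers for the three thread levels of Fig. 3; the real bodies would
// contain the per-time-interval loops sketched in Figs. 8 and 9.
void* moment_worker(void*)     { /* Eqs. (3)-(7) for each ti */ return nullptr; }
void* site_worker(void* arg)   { (void)arg; /* Eqs. (8),(9),(12),(13) for one site */ return nullptr; }
void* cumulative_worker(void*) { /* Eqs. (10),(11),(14),(15) */ return nullptr; }

// Launch the (2 + Ns) threads of the combined parallel strategy and wait for them.
void run_combined_strategy(int Ns)
{
    std::vector<pthread_t> threads(Ns + 2);
    std::vector<int> site_id(Ns);

    pthread_create(&threads[0], nullptr, moment_worker, nullptr);          // thread 0
    for (int j = 0; j < Ns; ++j) {                                         // threads 1..Ns
        site_id[j] = j;
        pthread_create(&threads[j + 1], nullptr, site_worker, &site_id[j]);
    }
    pthread_create(&threads[Ns + 1], nullptr, cumulative_worker, nullptr); // thread Ns+1

    for (pthread_t& t : threads)
        pthread_join(t, nullptr);
}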

Fig. 3. Combined parallel strategy (thread 0: moment calculation; threads 1 to Ns: instantaneous calculations of active sites 1 to Ns; thread Ns + 1: cumulative MWD calculation).

Fig. 4. Communication between each thread of combined parallel strategy (for each time instant ti, the active-site threads wait until s(ti) has been saved by thread 0, and the cumulative MWD thread waits until w(ti, j) has been saved for all active sites).

3.2. Analysis of parallel performance

So far, we have considered only the number of created threads. A straightforward way to conduct the simulation is to use the same number of CPU cores as threads. This is easy to implement in many cases, such as the HDPE example demonstrated in this work. However, for cases with a large number of active sites or a limited number of CPU cores, this straightforward approach is not always applicable; using a proper number of CPU cores to achieve a proper SR then becomes meaningful. Another metric, the parallel efficiency (PE) (Vegeais and Stadtherr, 1992), can be defined as the SR divided by the number of processors; it reveals the matching performance between the software algorithm and the hardware configuration. The following theoretical analysis helps in understanding the parallel performance of the proposed method. In the simulation process, the whole simulation time in the serial mode is denoted as ts; the time required by the pretreatment of variables, such as allocating space and presetting the initial value of each variable, is denoted as tv; the calculation time required by the moment calculation thread is denoted as tm; the calculation time required by the instantaneous weight calculation thread of active site j is denoted as tj (j = 1, ..., Ns); the maximum and minimum calculation times among all active sites are denoted as tjmax and tjmin, respectively; the calculation time required by the cumulative MWD calculation thread is denoted as tc; and the maximum time cost among tm, tj, and tc is denoted as tpmax. Since the total number of time intervals, NI, can be large, the computation time of the last time interval shown in Fig. 4 is neglected in the following analysis. Similarly, the communication time is neglected because it is much shorter than the calculation time.

Theorem 1. The theoretical maximal SR (SRmax) of the combined parallel strategy is

\[ SR_{\max} = \frac{t_v + t_m + \sum_{j=1}^{N_s} t_j + t_c}{t_v + t_{p\max}} \]

Proof. By definition, the SR of the parallel method over the serial method is

\[ SR = \frac{t_s}{t_p} \]

where ts and tp represent the serial and parallel simulation times, respectively. In the serial mode, the whole process is divided into the three steps shown in Fig. 1, and the time cost is the summation of these steps:

\[ t_s = t_v + t_m + \sum_{j=1}^{N_s} t_j + t_c \]

In the combined parallel strategy, different threads are designed for different tasks. To achieve SRmax, the best situation uses the same number of CPU cores as threads, 2 + Ns, which includes one moment calculation thread, Ns instantaneous weight calculation threads, and one cumulative MWD calculation thread. Here, tv is not exploited in the parallelization; all calculation threads are conducted after tv. Since the communication time and the computation time of the last time interval shown in Fig. 4 can be ignored, the ideal time cost of the parallel computation is the summation of tv and the maximal time among all used cores, which equals the longest time of all created threads, tpmax. Therefore,

\[ SR_{\max} = \frac{t_v + t_m + \sum_{j=1}^{N_s} t_j + t_c}{t_v + t_{p\max}} \]

□

Remark. Theorem 1 gives the theoretical SRmax under the ideal number of CPU cores. When the available number of CPU cores is smaller than in the ideal case, another interesting question arises: can the parallel strategy be conducted with fewer CPU cores than the ideal situation while keeping SRmax? In other words, do we really need a configuration of 2 + Ns cores for an Ns-site polymerization? The answer is elaborated in the following analysis. To illustrate the problem, Figs. 5–7 are presented for an example in which Ns equals 5, tm is the longest time among tm, tj, and tc, and tv is omitted. Fig. 5 illustrates the computation tasks of the conventional serial strategy; all tasks are arranged serially on one core, and the total time cost is the summation of all subtasks. Fig. 6 shows the computation of the combined parallel strategy with sufficiently many cores, equal to the thread number, i.e., 2 + Ns = 7. In the figure, each box shows the calculation time of one core; the shaded area represents the total running time of the tasks, and the blank area illustrates the total waiting time. Since tm is the longest time of all threads, these cores are idle most of the time, waiting for task activation. To reduce the idle time, the computations can be allocated in another way, as illustrated in Fig. 7, in which two threads are merged to run on one core. In this way, one can make full use of the resources and minimize the number of CPU cores without sacrificing SR.

Fig. 5. Calculation time of conventional serial strategy with a single core.

Fig. 6. Calculation time of combined parallel strategy with enough cores.

Fig. 7. Calculation time of combined parallel strategy with the minimum number of cores.
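The quantities in Theorem 1 and the parallel efficiency defined above can be evaluated directly from measured task times. The small helper below is an illustrative sketch under assumed names; it is not the paper's code.

#include <algorithm>
#include <numeric>
#include <vector>

// Theoretical maximal speedup ratio of Theorem 1 from measured task times.
// tj must contain at least one site time.
double sr_max(double tv, double tm, const std::vector<double>& tj, double tc)
{
    double sum_sites = std::accumulate(tj.begin(), tj.end(), 0.0);
    double tp_max    = std::max({tm, tc, *std::max_element(tj.begin(), tj.end())});
    return (tv + tm + sum_sites + tc) / (tv + tp_max);
}

// Parallel efficiency: speedup ratio divided by the number of cores used.
double parallel_efficiency(double sr, int cores) { return sr / cores; }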

Let tps denote the second-longest time among tm, tj, and tc; by definition, tps ≤ tpmax.

Theorem 2. The minimum core number C̄_SR used to achieve SRmax satisfies

\[ \left\lceil \left( t_m + \sum_{j=1}^{N_s} t_j + t_c \right) \Big/ t_{p\max} \right\rceil \;\le\; \bar{C}_{SR} \;\le\; 1 + \left\lceil \frac{1 + N_s}{\lfloor t_{p\max}/t_{ps} \rfloor} \right\rceil \]

where ⌈·⌉ and ⌊·⌋ round up and down, respectively, to the nearest integer.

Proof. To achieve SRmax (Theorem 1), the maximal time among all used cores must equal the longest time of all created threads, tpmax.

First, consider the best situation to derive the lower bound. The summation of the time costs of all tasks except tv amounts to (tm + Σ_{j=1}^{Ns} tj + tc)/tpmax times tpmax. Hence the ideal situation is to allocate the computation of the whole simulation onto (tm + Σ_{j=1}^{Ns} tj + tc)/tpmax CPU cores. Since the core number must be a positive integer, it is rounded up to ⌈(tm + Σ_{j=1}^{Ns} tj + tc)/tpmax⌉. Because this is the ideal situation, the minimum core number used to achieve SRmax satisfies

\[ \bar{C}_{SR} \;\ge\; \left\lceil \left( t_m + \sum_{j=1}^{N_s} t_j + t_c \right) \Big/ t_{p\max} \right\rceil = \bar{C}^{L}_{SR} \]

Next, consider the worst situations to derive the upper bound. If 0.5 tpmax < tps ≤ tpmax, the worst situation is that all threads cost more than half of tpmax; then no two threads can be completed on one core within tpmax, and the number of CPU cores must be the same as in the combined parallel way, 2 + Ns. The upper bound also holds in this situation: 0.5 tpmax < tps ≤ tpmax means 1 ≤ tpmax/tps < 2 and ⌊tpmax/tps⌋ = 1, hence

\[ 1 + \left\lceil \frac{1 + N_s}{\lfloor t_{p\max}/t_{ps} \rfloor} \right\rceil = 1 + \left\lceil \frac{1 + N_s}{1} \right\rceil = 1 + 1 + N_s = 2 + N_s \]

If tps ≤ 0.5 tpmax, two or more threads can be completed within tpmax. Considering the worst situation in which tm, tj, and tc (except tpmax) are all equal to tps, it is feasible to run ⌊tpmax/tps⌋ threads on one core while keeping SRmax. Hence, apart from the thread whose calculation time is tpmax, the other threads can share cores, and the remaining core number decreases from 1 + Ns to (1 + Ns)/⌊tpmax/tps⌋. Since the core number must also be a positive integer, ⌈(1 + Ns)/⌊tpmax/tps⌋⌉ is applied. Therefore, in this situation, the total number of CPU cores is 1 + ⌈(1 + Ns)/⌊tpmax/tps⌋⌉. Since only the worst situation is considered here, the minimum core number used to achieve SRmax satisfies

\[ \bar{C}_{SR} \;\le\; 1 + \left\lceil \frac{1 + N_s}{\lfloor t_{p\max}/t_{ps} \rfloor} \right\rceil = \bar{C}^{U}_{SR} \]

In general, the real situation lies between the worst and best situations; thus the minimum core number used to achieve SRmax satisfies

\[ \left\lceil \left( t_m + \sum_{j=1}^{N_s} t_j + t_c \right) \Big/ t_{p\max} \right\rceil \;\le\; \bar{C}_{SR} \;\le\; 1 + \left\lceil \frac{1 + N_s}{\lfloor t_{p\max}/t_{ps} \rfloor} \right\rceil \]

□

Remark. Theorem 2 gives the lower and upper bounds for the theoretical minimum number of cores needed to obtain SRmax. It gives the direction for decomposing the computation and fitting the resulting tasks onto as few CPU cores as possible.

3.3. Algorithm development

The decomposition of the simulation process into parallel computation is achieved by using different threads. Deploying them on different cores and keeping the communication among the threads are the keys to the algorithm development and are elaborated as follows. According to the previous analysis of the parallel strategy and Theorems 1 and 2, the deployment of the different threads on the hardware cores is essential for this application: it decomposes the computation into parallel threads on the minimum number of cores without sacrificing the speedup ratio. The detailed steps are:

(1) Pretest simulation. A conventional serial simulation over a very short physical simulation time is first conducted to obtain the computation times of the different tasks, including tv, tm, tj (j = 1, ..., Ns), and tc. Then tm, tj (j = 1, ..., Ns), and tc are sorted with the bubble sort algorithm, which repeatedly steps through the list, compares each pair of adjacent items, and swaps them if they are in the wrong order; the pass through the list is repeated until no swaps are needed, which indicates that the list is sorted. Thus, the longest time of all created threads, tpmax, and the second-longest time, tps, are obtained.
(2) Evaluation of the minimum core number. Calculate SRmax, C̄_SR^L, and C̄_SR^U from the obtained computation times. If C̄_SR^U and C̄_SR^L are equal to (2 + Ns), then to achieve the theoretical maximum speedup ratio each task should be deployed on a separate hardware core; the minimum core number, Ncore, is set to (2 + Ns) and each thread is deployed on its own core. Otherwise, go to Step 3.
(3) Hardware decomposition. In this step, the threads are deployed on the minimum number of cores without sacrificing the speedup ratio; a code sketch of this greedy packing is given after the steps. The following substeps are included.
(3.1) Initialization. Initialize Ncore as 1. Deploy the thread with the longest computation time, tpmax, on this core. Remove this thread from the waiting list.
(3.2) Combination. Pick the task with the maximal time among the remaining tasks and deploy it on the next hardware core. Add to it as many of the remaining task times, taken from large to small, as possible while ensuring that the summation does not exceed tpmax·(1 + ε), where ε is a small positive tolerance. The chosen tasks are removed from the waiting list and deployed on this core. Ncore is increased by 1. Then go to Step 3.3.
(3.3) Termination. If any tasks remain in the waiting list, return to Step 3.2. Otherwise, the hardware deployment ends.
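Steps (1)–(3) amount to sorting the measured task times and greedily packing them onto cores within the tpmax·(1 + ε) budget. The sketch below is one possible implementation under assumed inputs and names; it uses std::sort for brevity where Step (1) mentions bubble sort, and it is not the paper's code.

#include <algorithm>
#include <vector>

// Greedy deployment of task times onto cores (Section 3.3, Steps 3.1-3.3).
// "times" holds tm, t1..tNs, and tc measured in the pretest (tv is excluded).
// Returns one vector of task indices per core.
std::vector<std::vector<int>> deploy(std::vector<double> times, double eps = 0.02)
{
    std::vector<int> order(times.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return times[a] > times[b]; });  // decreasing time

    double tp_max = times[order.front()];
    double budget = tp_max * (1.0 + eps);

    std::vector<std::vector<int>> cores;
    std::vector<bool> placed(times.size(), false);

    cores.push_back({order.front()});               // Step 3.1: longest task gets its own core
    placed[order.front()] = true;

    for (int idx : order) {                         // Step 3.2: next-largest remaining task opens a core
        if (placed[idx]) continue;
        std::vector<int> core{idx};
        double load = times[idx];
        placed[idx] = true;
        for (int other : order)                     // fill from large to small within the budget
            if (!placed[other] && load + times[other] <= budget) {
                core.push_back(other);
                load += times[other];
                placed[other] = true;
            }
        cores.push_back(core);                      // Step 3.3: repeat until the waiting list is empty
    }
    return cores;
}

With the Case 1 timings of Section 4, this sketch reproduces a six-core deployment, and with the Case 2 timings a four-core deployment, consistent with the bounds of Theorem 2 (the exact grouping may differ from the one chosen in the paper).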

After deploying the different threads on different hardware cores, keeping the communication among the threads becomes essential for the algorithm implementation. Fig. 8 presents the pseudocode of the combined parallel strategy with (2 + Ns) threads, which are created in the operating system with Pthread functions. The threads are numbered from 0 to Ns + 1, each thread executing the computation task of one module in Fig. 3. The key step in the pseudocode, "Wait for s(ti)/w(j,ti) to be evaluated", is implemented with the mutex type and the condition type of Pthreads. A mutex object and a condition object are allocated for the computation of each thread at each time instant in Fig. 4. All variables are initialized to a physically meaningless value, −1.0. When a thread reaches the step "Wait for s(ti)/w(j,ti) to be evaluated", the value of s(ti)/w(j,ti) is checked. Since all variables have physical meanings, such as temperature, concentration, or volume, their calculated values should be non-negative; if the check reports that the variable has not yet been calculated, the thread is suspended, waiting for the condition object to be triggered. Fig. 9 shows snippets of the code that handle this situation, where mutex_s[i]/mutex_w[j][i] and condition_s[i]/condition_w[j][i] stand for the mutex and condition objects of the different threads, respectively. Fig. 9(a) shows the code that waits for a variable to be calculated; Fig. 9(b) shows the code that signals that a variable has been calculated.

Fig. 8. Pseudocode of combined parallel strategy:

Allocate (2+Ns) threads
In thread 1->Ns {
    For i = 1->NI
        Wait for s(ti) to be evaluated.
        Evaluate w(j,ti).
    End
}
In thread Ns+1 {
    For i = 1->NI
        Wait for w(j,ti) to be evaluated.
        Evaluate W(ti).
    End
}

Fig. 9. Snippets of the code implementation of combined parallel strategy on Pthread.

(a) Wait for the variable to be calculated:

// Lock the mutex object of s(ti)/w(j,ti)
pthread_mutex_lock(mutex_s[i]/mutex_w[j][i]);
// Poll the value of s(ti)/w(j,ti)
while (1) {
    if (s(ti)/w(j,ti) >= 0) {  // Check the value.
        // s(ti)/w(j,ti) has been calculated; unlock
        // the mutex object and break out of the loop.
        pthread_mutex_unlock(mutex_s[i]/mutex_w[j][i]);
        break;
    }
    // s(ti)/w(j,ti) hasn't been calculated.
    // Wait for the condition object to be triggered
    // (the mutex is released while waiting and re-acquired on wake-up).
    pthread_cond_wait(condition_s[i]/condition_w[j][i],
                      mutex_s[i]/mutex_w[j][i]);
}

(b) Inform that the variable has been calculated:

// Lock mutex object of s(ti)/w(j,ti)
pthread_mutex_lock(mutex_s[i]/mutex_w[j][i]);
// Calculate the value of s(ti)/w(j,ti)
......
// Trigger the condition object
pthread_cond_signal(condition_s[i]/condition_w[j][i]);
// Unlock the mutex object
pthread_mutex_unlock(mutex_s[i]/mutex_w[j][i]);

4. Results and discussion

We dynamically simulated the grade transition of the HDPE slurry process (Section 2) to demonstrate the performance of the proposed method. Two cases with different model details are presented to elucidate the acceleration effects. The first case (Soares and McKenna, 2012) includes only the kinetic model, with six types of active sites; the second case (Zhang et al., 2013) exhibits a more complicated process structure, with both the thermodynamic and kinetic models included and five active sites. In this work we focus on the parallel computation method; the rationale of the models is beyond the scope of the discussion. The maximal chain length was 50 000, and the simulated polymerization time was 50 000 s. We performed the computation with the DASPK solver on a MAXIMUS™ SCW4000 PC with an Intel® Xeon® E5-2620 CPU (2.00 GHz) and 16 GB of ECC DDR3 memory (1600 MHz); 12 CPU cores were available. The operating system was Linux CentOS 5.1 (Linux 2.6.18-371.1.2.el5), and the compiler was g++ 4.1.2 with the options -O2 and -pthread. The serial and parallel modes were conducted and compared; each method was executed ten times to obtain statistical results. Two parallel indexes, SR and PE, are used to analyze the acceleration effects of the proposed method. SR is the ratio of the serial calculation time (SCT) to the parallel calculation time (PCT); PE is defined as the SR divided by the number of processors and represents the efficiency of the parallel computation. Note that PE ≤ 1, and PE = 1 implies a perfect match between the software algorithm and the hardware configuration.

\[ SR = \frac{SCT}{PCT} \]

\[ PE = \frac{SR}{\text{Core number}} \]

Fig. 10. SR for different NI in Case 1 (NI = 1, 5, 10, 20, 50, 100, 150, 200, 500, 750, 1000, 1250, 1500, 2500).

4.1. Case 1

Only the kinetic model is used in this case, with six active sites (Ns = 6). Following the combined parallel strategy, eight threads can be designed for the simulation. The results are shown as follows.

4.1.1. Segment effect and analysis of the accuracy in parallel strategy

We first used eight cores to implement the parallel computation of the dynamic MWD simulation with eight threads. NI represents the number of simulation time segments used for the parallel computation. Fig. 10 shows the SR results for varying NI; the minimum tested value is unity and the maximum is 2500. NI = 1 indicates that the simulation is conventional, i.e., without any parallel computation. SR increases as NI grows from 1 to 5 and then reaches a relatively steady level with further variation of NI; the peak is located at NI = 1000 (Fig. 10). From the parallel structure (Fig. 4), the unexploited time depends on the calculation time of the last time interval. Increasing NI decreases the calculation time of the last time interval, but more conditional statements must be executed in the program, which increases the overhead (Fig. 4). A tradeoff therefore exists between accelerating the task computation and increasing the communication time; this tradeoff indicates that it is unnecessary to discretize the computation task as finely as possible. The best segment number, 1000, was set for the subsequent calculations in this case.

We further examined the accuracy of the simulation results; the parallel computation should not affect the results of the MWD calculations. Fig. 11 shows the MWD curves obtained with the conventional serial and combined parallel strategies at 500, 5000, and 50 000 s. The curves of the serial and parallel strategies overlap very well, which verifies the accuracy of the proposed parallel method.

Fig. 11. MWD accuracy comparison between serial and parallel strategy of Case 1.

4.1.2. Comparison between serial and parallel strategy

Figs. 2 and 3 present the two parallel strategies: one parallelizes only the intermediate level using Ns threads (Ns = 6 in this case); the other parallelizes all levels using Ns + 2 threads. The strategies can be implemented with six and eight cores, respectively. Both were performed and compared with the serial mode. Table 2 tabulates the computation time of each strategy, and Table 3 shows the time of each subtask as well as their summation. The dissimilar calculation times of the active sites are attributed to the different parameter values associated with the different sites. The summation time was slightly less than the serial result because of the communication cost and the computation used to free variable space. Table 2 shows that the simple parallel strategy reduces the average simulation time from 123.168 s to 31.099 s. The calculation time of active site 2 (21.772 s) was the longest among the six active sites (Table 3). Further analysis of the time costs reveals that the simple parallel result is close to the total calculation time of the pretreatment, active site 2, and the moment and cumulative MWD calculations. This is reasonable because the simple parallel strategy can only reduce the instantaneous MWD calculation by parallelizing the multisite computation. The combined parallel strategy, however, further accelerates the computation by fully exploiting the times of the moment calculation and cumulative MWD threads rather than purely waiting for the end of the previous thread as in the serial mode. The average time cost was reduced to 27.022 s, which is close to the summation of the pretreatment time and the longest thread time. The total average time cost of the moment calculation and cumulative MWD threads (Table 3) was 4.146 s, which is close to the difference between the simple and combined parallel results with eight cores (Table 2). Both results show that the combined parallel strategy accelerates the calculation in all three modules of Fig. 1 better than the simple strategy, so the time cost of the intermediate module is reduced. Table 4 shows the acceleration effects of the different parallel strategies. The first two rows show that the SR of the combined parallel strategy with eight cores was 4.558, greater than that of the simple parallel strategy (3.961).

Table 2
Time costs (s) of different strategies of Case 1.

Index    Serial strategy (1 core)   Simple parallel strategy (6 cores)   Combined parallel strategy (8 cores)   Combined parallel strategy (6 cores)
1        122.983                    31.056                               27.029                                  27.087
2        123.201                    31.052                               27.001                                  27.048
3        123.048                    31.117                               26.983                                  27.050
4        123.314                    31.118                               27.007                                  27.137
5        123.431                    31.095                               27.029                                  27.121
6        123.475                    31.094                               27.058                                  27.155
7        122.917                    31.112                               27.008                                  27.099
8        122.926                    31.137                               27.025                                  27.069
9        123.058                    31.116                               27.035                                  27.061
10       123.333                    31.092                               27.042                                  27.125
Average  123.168                    31.099                               27.022                                  27.095

Table 3
Time costs (s) of all subtasks of Case 1.

Index    Pretreatment   Moment calculation   Active site 1   Active site 2   Active site 3   Active site 4   Active site 5   Active site 6   Cumulative MWD   Sum
1        4.778          1.492                20.892          21.771          20.693          17.301          15.171          17.283          2.645            122.026
2        4.829          1.491                20.891          21.791          20.690          17.347          15.210          17.346          2.646            122.241
3        4.793          1.494                20.890          21.733          20.760          17.306          15.176          17.289          2.645            122.086
4        4.762          1.494                20.888          21.737          20.693          17.427          15.241          17.347          2.723            122.312
5        4.776          1.501                20.982          21.821          20.736          17.339          15.227          17.328          2.649            122.359
6        4.766          1.494                21.169          21.843          20.734          17.351          15.175          17.334          2.646            122.512
7        4.741          1.492                20.825          21.762          20.725          17.309          15.173          17.284          2.642            121.953
8        4.742          1.495                20.887          21.733          20.713          17.294          15.171          17.286          2.643            121.964
9        4.762          1.493                20.891          21.736          20.697          17.312          15.276          17.285          2.642            122.094
10       4.737          1.495                21.039          21.794          20.772          17.348          15.159          17.377          2.644            122.365
Average  4.769          1.494                20.935          21.772          20.721          17.334          15.198          17.316          2.652            122.191

Theorem 1 states that the theoretical SRmax of the combined parallel strategy is given as

\[ SR_{\max} = \frac{t_v + t_m + \sum_{j=1}^{N_s} t_j + t_c}{t_v + t_{p\max}} \]

In this case, tv + tm + Σ tj + tc = 123.168 s and tv + tpmax = 26.541 s, which yields SRmax = 4.641, close to the numerical result. The slight difference is attributed to the extra communication and the memory bandwidth bottleneck. Because of the additional cores used by the combined strategy, its PE was 0.570, lower than that of the simple strategy (0.660); this indicates that the combined strategy with eight cores was inefficient in terms of core usage.

4.1.3. Parallel strategy for minimum cores with maximum SR

Theorem 2 reveals that N CPU cores may not be needed for an N-thread parallel computation; a combined parallel strategy with fewer cores may achieve a similar acceleration effect. In this case, tpmax = t2 = 21.772 s, and

tm = 1.494 s, t1 = 20.935 s, t3 = 20.721 s, t4 = 17.334 s, t5 = 15.198 s, t6 = 17.316 s, tc = 2.652 s

Thus, tps = t1 = 20.935 s. The upper bound on the fewest cores that yield SRmax is

\[ \bar{C}^{U}_{SR} = 1 + \left\lceil \frac{1 + N_s}{\lfloor t_{p\max}/t_{ps} \rfloor} \right\rceil = 1 + \left\lceil \frac{1 + 6}{\lfloor 1.040 \rfloor} \right\rceil = 1 + 7 = 8 \]

Theorem 2 also gives the lower bound on the fewest cores for SRmax:

\[ \bar{C}^{L}_{SR} = \left\lceil \left( t_m + \sum_{j=1}^{N_s} t_j + t_c \right) \Big/ t_{p\max} \right\rceil = \lceil 5.393 \rceil = 6 \]

Table 4
Acceleration effects of different parallel strategies of Case 1.

Strategy                                SR      PE
Simple parallel strategy (6 cores)      3.961   0.660
Combined parallel strategy (8 cores)    4.558   0.570
Combined parallel strategy (6 cores)    4.546   0.758

These bounds indicate that a combined strategy with fewer than eight cores can achieve similar acceleration effects, which can be implemented by allocating the subtasks well. According to the deployment algorithm, we have

21.772 s > 1.494 s + 17.316 s = 18.810 s
21.772 s > 15.198 s + 2.652 s = 17.850 s

which implies that

t2 > tm + t6
t2 > t5 + tc

We can use four cores to conduct the active site threads 1 to 4, respectively, and two more cores to conduct the remaining calculations: one of these two cores runs the moment calculation thread and the active site 6 thread, and the other runs the active site 5 and cumulative MWD threads. In this way, only six cores are needed to obtain an accelerating effect similar to that obtained with eight cores. The last column in Table 2 shows the results with six cores. The average time cost was 27.095 s, very close to the result for eight cores (27.022 s). Table 4 shows that the SR of the combined parallel strategy with six cores is 4.546, slightly less than that of the combined parallel strategy with eight cores (4.558). PE, however, increased to 0.758, which implies a better match between the software algorithm and the hardware configuration.

4.2. Case 2

In this case, a more complicated model is used to describe the HDPE slurry process. Both the kinetic and thermodynamic models are included, and five active sites exist (Ns = 5). The model was described previously (Zhang et al., 2013).

4.2.1. Segment effect and accuracy analysis in parallel strategy

In a similar way, we first used Ns + 2 = 7 cores to implement the parallel computation of the dynamic MWD simulation with seven threads. Fig. 12 shows the SR results for varying NI. Similar to Case 1, the results exhibit a peak because of the tradeoff between the acceleration of the task computation and the increase in communication time: SR increases as NI grows from 1 to 5 and then reaches a relatively steady stage; the difference from Case 1 is that the peak is located at NI = 150. The best segment number, 150, was set for the subsequent calculations in this case. The accuracy of the simulation results was also examined. A comparison of the MWD curves calculated with the conventional serial strategy and the combined parallel strategy is shown in Fig. 13. Results at four time instants, 2000, 10 000, 15 000, and

50 000 s were presented. The curves of the serial strategy and the parallel strategy overlap very well, which verifies the accuracy of the proposed parallel method.

Fig. 12. SR for different NI in Case 2 (NI = 1, 5, 10, 20, 50, 100, 150, 200, 500, 1000, 2500).

Fig. 13. MWD accuracy comparison between serial and parallel strategy of Case 2.

4.2.2. Comparison between serial and parallel strategy

Table 5
Time costs (s) of different strategies of Case 2.

Index    Serial strategy (1 core)   Simple parallel strategy (5 cores)   Combined parallel strategy (7 cores)   Combined parallel strategy (4 cores)   Combined parallel strategy (3 cores)
1        126.963                    70.030                               45.797                                  45.845                                  48.988
2        126.658                    70.314                               45.755                                  46.179                                  48.943
3        126.633                    70.034                               45.659                                  46.153                                  48.937
4        126.434                    69.935                               45.653                                  45.975                                  48.970
5        126.369                    70.052                               45.537                                  46.177                                  48.980
6        126.427                    70.100                               45.688                                  45.934                                  48.903
7        126.390                    69.339                               45.661                                  46.147                                  48.917
8        126.742                    69.373                               45.659                                  46.055                                  48.906
9        126.751                    70.120                               45.607                                  46.112                                  48.888
10       126.750                    70.279                               45.699                                  45.976                                  48.913
Average  126.612                    69.958                               45.672                                  46.055                                  48.935

Table 6
Time costs (s) of all subtasks of Case 2.

Index    Pretreatment   Moment calculation   Active site 1   Active site 2   Active site 3   Active site 4   Active site 5   Cumulative MWD   Sum
1        3.929          40.691               20.353          15.647          15.207          13.606          13.581          3.226            126.240
2        3.947          40.673               20.356          15.602          15.182          13.587          13.563          3.223            126.133
3        3.920          40.700               20.419          15.576          15.176          13.587          13.556          3.221            126.155
4        3.918          40.702               20.423          15.586          15.168          13.575          13.552          3.217            126.141
5        3.931          40.689               20.305          15.586          15.171          13.580          13.557          3.214            126.033
6        3.908          40.712               20.319          15.582          15.168          13.576          13.552          3.215            126.032
7        3.911          40.709               20.319          15.586          15.173          13.581          13.551          3.216            126.046
8        3.907          40.713               20.324          15.586          15.171          13.578          13.553          3.217            126.049
9        3.935          40.685               20.316          15.570          15.170          13.581          13.559          3.213            126.029
10       3.940          40.680               20.308          15.589          15.175          13.580          13.552          3.228            126.052
Average  3.925          40.695               20.344          15.591          15.176          13.583          13.558          3.219            126.091

Table 7
Acceleration effects of different parallel strategies of Case 2.

Strategy                                SR      PE
Simple parallel strategy (5 cores)      1.810   0.362
Combined parallel strategy (7 cores)    2.772   0.396
Combined parallel strategy (4 cores)    2.749   0.687
Combined parallel strategy (3 cores)    2.587   0.862

To compare the parallel performance, the time costs of each strategy are listed in Table 5, and the detailed time costs of each subtask are listed in Table 6. Different from Case 1, the time cost of the moment calculation thread in Case 2 is greatly increased because of the increased model complexity; this thread takes more CPU time than any other thread. Table 5 shows that the simple parallel strategy reduces the average simulation time from 126.612 s to 69.958 s. The calculation time of active site 1 (20.344 s) was the longest among the five active sites (Table 6), and the time cost of the cumulative MWD calculation thread was 3.219 s. As in Case 1, the simple parallel result is close to the summation of the calculation times of the pretreatment, the moment calculation, active site 1, and the cumulative MWD calculation. In this case, because the time cost of the moment calculation thread is larger than those of the other six threads, parallelizing only the intermediate level has a limited effect; SR and PE were 1.810 and 0.362, respectively (Table 7). By contrast, the combined parallel strategy reduced the average time cost to 45.672 s, a much better result than the simple parallel strategy. Table 6 shows that the average total time cost of the pretreatment and the moment calculation was 44.620 s, so the remaining time cost was reduced to about 1 s using the combined parallel strategy.


parallel strategy. This result was similar to that of case 1 with the conclusion that the combined parallel strategy can reduce the time cost to summation of the pretreatment time cost and longest thread time cost. The only difference between two cases is that the longest threads are different due to the model difference. In case 1, the longest thread is for instantaneous MWD calculation of an active site; in case 2, it is the moment calculation due to model complexity. Table 7 shows the acceleration effects of different parallel strategies. The first two rows in the table show that the combined parallel strategy with seven cores has consistently better SR than the simple parallel strategy. The SR of combined strategy was 2.772, and was much better than the result of simple parallel strategy (1.810). The advantage in SR was insignificant because the moment method cannot be parallelized at this stage. Theorem 1 states that the theoretical SRmax of the combined parallel strategy satisfies

 Ns

tv + tm +

In

case,

tv + tm +

Ns 

tj + tc = 126.612 s,

tv + tp max =

j=1

44.620 s and yielded SRmax = 2.838, which was very close to the achieved numerical result. However, the combined parallel strategy used seven cores; thus the PE was 0.396, indicating that the software algorithm and hardware configuration were not ideally matched. 4.2.3. Parallel strategy with minimum cores for maximum SR In this case, tpmax = tm = 40.695 s, and t1 = 20.344 s,

t2 = 15.591 s,

t5 = 13.558 s,

t3 = 15.176 s,

t4 = 13.583 s,

tc = 3.219 s

Thus, tps = t1 = 20.344 s. Theorem 2 states that the upper and lower bounds of the minimum core number are given as



U C¯ SR

= 1+

= 1+



tp max /tps

1 + 5

⎡ L =⎢ C¯ SR ⎢(tm +



1 + Ns

2 Ns  j=1

tj + tc )/tp max = 3.002, which is close to 3.

j=1

A four-core configuration can easily be implemented by merging two arbitrary threads onto one core, because the time costs of all threads except the moment calculation are smaller than half of tm. The first core performs the moment calculation thread; the second core calculates the active site threads 1 and 2; the third core calculates the active site threads 3 and 4; and the last core handles the active site 5 thread and the cumulative calculation thread. The fifth column of Table 5 tabulates the results of the combined parallel strategy using four cores. The average calculation time is 46.055 s, which is very close to the result using seven cores (45.672 s). Table 7 lists the acceleration indexes using four cores; the SR is sufficiently close to the result using seven cores. Reducing the number of cores from seven to four increased the PE from 0.396 to 0.687, which implies a better match between the software algorithm and the hardware configuration.

Moreover, the ratio $(t_m + \sum_{j=1}^{N_s} t_j + t_c)/t_{p\,\max} = 3.002$ is close to 3, so a three-core configuration can also be designed with only a small sacrifice in speed, provided the subtasks are well allocated. Using the thread time costs listed above, the deployment algorithm gives

40.695 s > 20.344 s + 15.591 s + 3.219 s = 39.154 s
40.695 s ≈ 15.176 s + 13.583 s + 13.558 s = 42.317 s

which implies that tm > t1 + t2 + tc and tm ≈ t3 + t4 + t5. We therefore use one core for the moment calculation thread and two more cores for the remaining calculations: one of the two cores calculates the active site threads 1 and 2 together with the cumulative calculation thread, and the other calculates the active site threads 3 to 5. In this manner, only three cores are needed to obtain an accelerating effect similar to that obtained with seven cores. The last column in Table 5 shows the results using three cores. The average time cost was 48.935 s, slightly longer than that using seven cores (45.672 s); in this case, t3 + t4 + t5 = 42.317 s, which exceeds tm (40.695 s), and this difference, together with the auxiliary communication costs, produces a small overhead in the total time cost. Table 7 lists the acceleration indexes using three cores for comparison. The combined parallel strategy with three cores had an SR of 2.587, somewhat smaller than that with seven or four cores. The PE, however, further increased to 0.862, which is very close to the ideal value of unity.
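To illustrate how such a deployment can be checked, the short helper below (not part of the paper; the thread times are hard-coded from this case study) computes the per-core loads of the three-core assignment described above. The parallel step time is then approximately the pretreatment time plus the largest per-core load.

// Helper (illustrative only): per-core loads of the three-core deployment.
#include <algorithm>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Measured thread time costs in seconds (case 2): moment, sites 1-5, cumulative.
    const double tm = 40.695, t1 = 20.344, t2 = 15.591,
                 t3 = 15.176, t4 = 13.583, t5 = 13.558, tc = 3.219;

    // Three-core assignment discussed in the text.
    const std::vector<std::pair<std::string, std::vector<double>>> cores = {
        {"core 1: moment thread",          {tm}},
        {"core 2: sites 1-2 + cumulative", {t1, t2, tc}},
        {"core 3: sites 3-5",              {t3, t4, t5}},
    };

    double longest = 0.0;
    for (const auto& core : cores) {
        double load = 0.0;
        for (double t : core.second) load += t;   // total work placed on this core
        longest = std::max(longest, load);
        std::printf("%-32s %7.3f s\n", core.first.c_str(), load);
    }
    // Prints 40.695, 39.154 and 42.317 s; the parallel step time is then
    // roughly the pretreatment time plus 42.317 s.
    std::printf("longest per-core load: %.3f s\n", longest);
    return 0;
}

The same check applied to the four-core assignment above, or to any other candidate deployment, only requires editing the assignment table; any C++11 compiler suffices to run it.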

5. Conclusions

Efficient calculation of MWDs is a key step in the simulation and optimization of polymer processes. Based on the multisite polymerization mechanism, we proposed a novel parallel strategy that combines parallel computation of the conventional calculation tasks with parallel computation over the active sites. This method can greatly reduce the computation time of the MWD by decomposing the conventional serial computation into several parallel threads. Further analysis yields the theoretical SR of the proposed method and shows that the software algorithm and hardware configuration can be well matched. The proposed method exhibited good performance in dynamic simulations of two multisite HDPE slurry processes. Furthermore, the method is not limited to HDPE processes and could be applied to other multisite polymerizations. Prospectively, the parallel computation method could also be extended to quality indices other than MWD, such as the chemical composition distribution and the sequence length distribution.

Acknowledgments

We gratefully acknowledge the financial support of the 973 Program (No. 2012CB720503) and the National Natural Science Foundation of China (No. 61374205).
