Information Sciences 484 (2019) 367–386
Stochastic configuration networks with block increments for data modeling in process industries

Wei Dai a,b,∗, Depeng Li a, Ping Zhou b, Tianyou Chai b

a School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
b State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
Article info
Article history: Received 9 June 2018; Revised 13 January 2019; Accepted 24 January 2019; Available online 31 January 2019
Keywords: Stochastic configuration networks; Process industries; Randomized learner model; Block incremental approach; Simulated annealing algorithm
Abstract
Stochastic configuration networks (SCNs), which employ a supervisory mechanism to automatically and rapidly construct universal approximators, can achieve promising performance in resolving regression problems. This paper develops an extension of the original SCNs with block increments to enhance learning efficiency, an issue that has received considerable attention in industrial process modeling. This extension allows the learner model to add multiple hidden nodes (termed a hidden node block) simultaneously to the network during the construction process. To meet industrial demands, two block incremental implementations of SCNs are presented, adopting different strategies for setting the block size. Specifically, the first adds hidden node blocks with a fixed block size, which accelerates the convergence rate at the cost of model compactness; the second automatically sets the block size by incorporating a simulated annealing algorithm, achieving a good balance between efficiency and complexity. The two algorithms are suitable for industrial data modeling with distinct requirements on modeling speed and memory space. The improved methods for building SCNs are evaluated on two function approximations, four benchmark datasets and two real-world applications in process industries. Experimental results with comparisons indicate that the proposed schemes perform favorably. © 2019 Elsevier Inc. All rights reserved.
1. Introduction

Today, modern industry requires high product quality, which inspires quality-related optimization and control technologies [5,6,31]. In fact, neither optimization nor control of practical process industries is easy to achieve. This is because the optimal operation of a process largely depends on good measurements of quality indices, yet real-time measurement of quality indices is difficult or even impossible due to economic or technical limitations [4]. Fortunately, the quality indices can be estimated by using a virtual sensing technique, namely the soft sensor, which is based on appropriate models. The traditional modeling approach is the first-principles approach, which always depends on prior mechanistic knowledge [15,19,28]. The main bottleneck is that industrial processes are commonly too complicated to analyze, making such mechanistic knowledge rather hard-won. Besides, first-principles models are often established on the basis of hypotheses, which may cause model biases. Therefore, the first-principles approach is often impractical for soft sensors in the process industries. Their data-driven counterparts, alternatively, give empirical models without prior mechanistic knowledge. Over decades of research,
This paper belongs to the special issue “RANN” edited by Prof. W. Pedrycz.
∗ Corresponding author. E-mail address: [email protected] (W. Dai).
https://doi.org/10.1016/j.ins.2019.01.062
data-driven soft sensors have attracted increasing attention and have been applied in many fields [8,13,21,29]. Applying data-driven soft sensors to solve the problem of quality index measurement in the process industries is therefore a practical way forward [24,27,32]. Recently, a wide variety of machine learning techniques have been employed in data-driven soft sensors, among which representative examples are artificial neural networks (ANNs) [14,30] and support vector machines (SVMs) [9,12]. Although an increasing number of works aim at further improving the performance of learning algorithms, some problems remain when applying them to process industries, such as a time-consuming training phase, inferior accuracy and poor generalization.

In the 1990s, a special single-hidden layer feed-forward network (SLFN), namely the random vector functional link network (RVFLN), was proposed [11,20]. As a randomized learner model, the RVFLN is characterized by a two-step training paradigm, that is, randomly assigning the hidden-node parameters (input weights and biases) and evaluating the output weights by solving a linear problem. As a result, RVFLNs offer advantages in learning speed and simplicity of implementation [23]. Nevertheless, two crucial issues make RVFLNs less practical in modeling tasks: (i) how to determine the network structure (the required number of hidden nodes) during training; and (ii) how to find an appropriate range for assigning the hidden-node parameters and/or a proper way of random assignment. Concretely, it is risky to define the architecture prior to training, because networks that are too simple or too complex will both degrade the model quality. That is to say, a learner model with too few hidden nodes cannot ensure modeling accuracy, while excessive hidden nodes may cause over-fitting that leads to poor generalization. Therefore, designing a learner model that automatically matches the network complexity is necessary and important. One attempt is to build RVFLNs on the basis of an incremental implementation, where the learner model starts with a small network and then incrementally generates hidden nodes one by one until an acceptable error tolerance is obtained. However, a recent work reported in [18] reveals the infeasibility of this incremental RVFLN (IRVFLN) with random selection of the hidden-node parameters from a fixed scope. This implies that the universal approximation property is conditional for these RVFLN-based models.

Wang and Li [26] proposed an advanced randomized learner model called stochastic configuration networks (SCNs), which can work successfully in building a universal approximator. SCNs randomly assign the hidden-node parameters in the light of a supervisory mechanism and adaptively select their scopes, which brings remarkable merits in the scope adaptation of hidden-node parameters, less human intervention in the network size setting, and sound generalization. Those essential distinctions in the construction process keep SCNs from being regarded as a specific implementation of RVFLNs. The original SCNs, however, add hidden nodes incrementally by using a point incremental approach, which may lead to an enormous number of iterations in the construction process for large-scale applications. This paper extends the original SCNs with block increments, which can simultaneously add multiple hidden nodes (a hidden node block) each time and evaluate the output weights after each addition by solving a global optimization problem.
Using reasonable settings for block sizes, SCNs can not only effectively accelerate the construction process, but also keep the model complexity at an acceptable level. The main contributions of this paper are summarized as follows.

• Different from the original SCNs with point increments, this paper presents a block incremental approach for SCNs by establishing a block form of the supervisory mechanism. That is, the proposed approach enables batch assignment of random parameters as well as block increments of hidden nodes.
• Within the block incremental framework, two algorithmic implementations are developed, based on a fixed block size strategy and a varied block size strategy, respectively. Concretely, the first incrementally generates hidden node blocks whose sizes remain unchanged during the construction process, showing superiority in fast learning; the second incrementally generates hidden node blocks with adjustable sizes by incorporating a simulated annealing (SA) method, constructing a relatively compact network.
• The impact of block size on learning performance is investigated in depth, and reasonable recommended settings of the model parameters are fully discussed.
The remainder of this paper is organized as follows: Section 2 briefly reviews two randomized learner models, covering both the deficiencies of RVFLNs and the highlights of SCNs. Section 3 details the block incremental approach for SCNs, consisting of theoretical analysis and algorithmic description. Investigations on the impact of block size on learning performance are reported in Section 4. In Section 5, case studies are presented focusing on two real-world applications, and Section 6 draws our concluding remarks.

2. Brief review of randomized learning techniques

This section first presents RVFLNs and their drawbacks, followed by the detailed technical essence of SCNs. The following notation is used throughout this paper. Let Γ := {g1, g2, g3, ...} denote a set of real-valued functions and span(Γ) represent the function space spanned by Γ. Let L2(D) stand for the space of all Lebesgue measurable functions f = {f1, f2, ..., fm}: R^d → R^m on a set D ⊂ R^d, with the L2 norm defined as
$$\|f\| := \left( \sum_{q=1}^{m} \int_{D} |f_q(x)|^2 \, dx \right)^{1/2} < \infty \tag{1}$$
The inner product of φ = [φ1 , φ2 , . . . , φm ] : Rd → Rm and f is defined as
$$\langle f, \phi \rangle := \sum_{q=1}^{m} \langle f_q, \phi_q \rangle = \sum_{q=1}^{m} \int_{D} f_q(x)\, \phi_q(x) \, dx \tag{2}$$
Note that the above definition becomes the trivial case when m = 1, which corresponds to a real-valued function defined on a compact set.

2.1. RVFLNs
For a given training dataset (xi, ti), let the sampled inputs be X = {x1, x2, ..., xN}, xi = [xi,1, ..., xi,d] ∈ R^d, and the corresponding outputs be T = {t1, t2, ..., tN}, ti = [ti,1, ..., ti,m] ∈ R^m, where i = 1, 2, ..., N and N represents the number of training samples. An RVFLN can be defined as a randomized version of SLFNs with L hidden nodes, i.e.,
$$f_L(X) = \sum_{j=1}^{L} \beta_j g_j(v_j, b_j, X) \tag{3}$$
where gj(·) represents the activation function of the jth hidden node; the hidden-node parameters (vj and bj) are randomly assigned from [−λ, λ]^d and [−λ, λ], respectively; βj = [βj,1, ..., βj,q, ..., βj,m]^T denotes the output weights between the jth hidden node and the output nodes; and fL denotes the output function of the current network. This learner model can be trained by solving a linear optimization problem, that is
$$\min_{\beta_1, \ldots, \beta_L} \sum_{i=1}^{N} \left\| \sum_{j=1}^{L} \beta_j g_j(v_j, b_j, x_i) - t_i \right\|^2 \tag{4}$$
The above equation can be further written in the following matrix form and solved with quadratic optimization techniques:

$$\beta^* = \arg\min_{\beta} \|H\beta - T\|^2 = H^{\dagger} T \tag{5}$$
where

$$H = \begin{bmatrix} g(v_1, b_1, x_1) & \cdots & g(v_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ g(v_1, b_1, x_N) & \cdots & g(v_L, b_L, x_N) \end{bmatrix}_{N \times L} \tag{6}$$
is called the hidden layer output matrix and H† is its Moore–Penrose generalized inverse [17]. As one extension of RVFLNs, IRVFLNs can be regarded as an attempt to determine the network structure during the learning process. In IRVFLNs, when the Lth hidden node is added, the new network fL can be stated as a combination of the previous network fL−1 and the newly added hidden node gL (with parameters vL and bL), i.e.,
$$f_L(x) = f_{L-1}(x) + \beta_L g_L(x) \tag{7}$$

where

$$\beta_L = \langle e_{L-1}, g_L \rangle / \|g_L\|^2 \tag{8}$$

is the output weight evaluated constructively, and

$$e_{L-1} = f - f_{L-1} = [e_{L-1,1}, \ldots, e_{L-1,q}, \ldots, e_{L-1,m}]^T \tag{9}$$
represents the residual error of the network fL−1. Although IRVFLNs can be deemed to build networks dynamically during the training phase, Li and Wang [18] prove that the universal approximation property of IRVFLNs cannot be guaranteed if the freely acquired hidden-node parameters are assigned in a fixed scope, and that the convergence rate is subject to certain conditions. Besides, Gorban et al. [10] have demonstrated that, in the absence of certain additional conditions, one may observe an exponential growth of the number of terms needed to approximate a nonlinear map, which reveals the slow convergence rate of IRVFLNs. Therefore, the selection of hidden-node parameters should be constrained and data dependent rather than randomly assigned within a fixed scope, which is closely related to the approximation capability.

2.2. SCNs

Different from RVFLNs, SCNs were proposed in [26] as a class of randomized learner models. SCNs stochastically configure hidden-node parameters based on a supervisory (data-dependent) mechanism in the incremental construction process, thereby ensuring the universal approximation property. Under the SCN framework, three algorithmic implementations, namely SC-I, SC-II, and SC-III, are presented in [26]. In terms of both learning efficiency and generalization, SC-III outperforms the others. This is because the output weights in SC-III are evaluated all together by solving a global optimization problem,
which makes the output weights contribute greatly to the construction of a universal approximator. In view of this, SCN refers to the SCN with the SC-III algorithm throughout the remainder of this paper. A brief summary of SCN is given in the following Theorem 1.

Theorem 1 ([26]). Suppose that span(Γ) is dense in L2 space and, for all g ∈ Γ, 0 < ‖g‖ < bg for some bg ∈ R+. Given 0 < r < 1 and a nonnegative real number sequence {μL}, with limL→∞ μL = 0 and μL ≤ (1 − r), for L = 1, 2, ..., denote
$$\delta_{L,q}^* = (1 - r - \mu_L)\, \|e_{L-1,q}^*\|^2, \quad q = 1, 2, \ldots, m. \tag{10}$$
The new hidden-node parameters vL and bL are randomly selected from the adjustable scopes [−λ, λ]^d and [−λ, λ], respectively, where λ is automatically assigned from a given scope control set Υ = {λmin : λ : λmax}. If the random basis function gL is generated to satisfy the following inequalities, namely the supervisory mechanism:
$$\langle e_{L-1,q}^*, g_L \rangle^2 \ge b_g^2\, \delta_{L,q}^*, \quad q = 1, 2, \ldots, m, \tag{11}$$
and the output weights are calculated by

$$[\beta_1^*, \ldots, \beta_j^*, \ldots, \beta_L^*] = \arg\min_{\beta} \left\| f - \sum_{j=1}^{L} \beta_j g_j \right\| \tag{12}$$

Then, we have limL→∞ ‖f − fL*‖ = 0, where fL* = Σ_{j=1}^{L} βj* gj and βj* = [βj,1*, ..., βj,q*, ..., βj,m*]^T ∈ R^m. The optimal residual error and the optimal output weights β* can be denoted as

$$e_L^* = f - f_L^* = [e_{L,1}^*, \ldots, e_{L,q}^*, \ldots, e_{L,m}^*]^T$$

$$\beta^* = \begin{bmatrix} \beta_1^{*T} \\ \vdots \\ \beta_j^{*T} \\ \vdots \\ \beta_L^{*T} \end{bmatrix} = \begin{bmatrix} \beta_{1,1}^* & \cdots & \beta_{1,q}^* & \cdots & \beta_{1,m}^* \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \beta_{j,1}^* & \cdots & \beta_{j,q}^* & \cdots & \beta_{j,m}^* \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ \beta_{L,1}^* & \cdots & \beta_{L,q}^* & \cdots & \beta_{L,m}^* \end{bmatrix}_{L \times m} \tag{13}$$
The detailed implementation of the SC algorithm can be found in Wang and Li [26], to which readers may refer for more details. It is worth mentioning that some extensions of SCNs have been proposed and successfully employed in data analytics; e.g., Wang and Li [25] extended SCNs to a deep version (DeepSCNs) with both theoretical analysis and algorithmic implementation. Compared with existing deep learning algorithms, DeepSCNs can be constructed efficiently and exhibit appealing merits, such as learning representations and a consistency property between learning and generalization.
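To make the construction procedure above concrete, the following is a minimal sketch (in Python/NumPy) of a point-incremental, SC-III-style build. It is not the authors' reference implementation: the function names, the choice of μL, and the scope set are illustrative assumptions. Candidates are screened with a supervisory-style check in the spirit of Eq. (11) (written in the equivalent ξ form used later in Eq. (24)), and the output weights are re-solved globally as in Eq. (12).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scn_fit(X, T, L_max=50, T_max=10, lambdas=(1, 5, 10), r=0.999, tol=0.05):
    """Point-incremental SCN-style construction: add one hidden node per
    iteration if it passes a supervisory-style check (cf. Eq. (11)), then
    re-solve all output weights by a global least-squares problem (Eq. (12))."""
    N, d = X.shape
    T = T.reshape(N, -1)
    H = np.empty((N, 0))                 # hidden layer output matrix
    beta = np.zeros((0, T.shape[1]))
    e = T.copy()                         # residual, starting from f_0 = 0
    W, B = [], []
    for L in range(1, L_max + 1):
        mu_L = (1.0 - r) / (L + 1)       # nonnegative sequence tending to 0 (assumption)
        best, best_xi = None, -np.inf
        for lam in lambdas:              # adaptive scope search
            for _ in range(T_max):       # T_max random configurations per scope
                v = np.random.uniform(-lam, lam, size=d)
                b = np.random.uniform(-lam, lam)
                g = sigmoid(X @ v + b)   # candidate hidden-node output
                xi_q = (e.T @ g) ** 2 / (g @ g) - (1 - r - mu_L) * np.sum(e ** 2, axis=0)
                if np.all(xi_q >= 0) and xi_q.sum() > best_xi:
                    best, best_xi = (v, b, g), xi_q.sum()
        if best is None:
            break                        # no admissible candidate found
        v, b, g = best
        W.append(v); B.append(b)
        H = np.column_stack([H, g])
        beta = np.linalg.lstsq(H, T, rcond=None)[0]   # global solve, SC-III style
        e = T - H @ beta
        if np.sqrt(np.mean(e ** 2)) < tol:
            break
    return np.array(W), np.array(B), beta

# usage sketch: W, B, beta = scn_fit(X_train, T_train) with X_train of shape (N, d)
```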
In terms of hidden node addition, however, SCNs employ a point incremental approach to construct the network, which is restricted to adding only one new hidden node at each network growth. One potential problem is that the whole network must be remodeled (the output weights re-optimized) for each newly added hidden node, so the construction becomes relatively complicated and time-consuming for large networks. Furthermore, adding only a single hidden node at each network growth to reduce the residual error may not be very efficient, as the approximation capability of a single hidden node is clearly very limited. In other words, adding multiple hidden nodes together (termed a hidden node block) is potentially much more capable of decreasing the residual error than adding hidden nodes one by one. In fact, it is well known that for SLFNs the ability to approximate complex nonlinear mappings directly from the input samples largely depends on the nonlinear feature mapping in the hidden layer, which maps low-dimensional input data to a high-dimensional space. The hidden nodes can be seen as features in the high-dimensional space, which indicates that adding more random hidden nodes may gain more features. This inspires us to use a block growth mechanism for the hidden nodes to accelerate the building process of SCN models.
Fig. 1. Schematic diagram of incremental approaches.
3. Stochastic configuration networks with block increments

In this paper, a block incremental approach that can add multiple hidden nodes simultaneously is applied to SCNs. The details are as follows.

3.1. Block incremental approach

The mechanisms of the point and block incremental approaches are depicted in Fig. 1, where the discrepancies are clearly visible. More precisely, the point incremental approach allows the addition of only a single hidden node (see Fig. 1(a)), while the block incremental approach can grow additional hidden nodes batch by batch, as can be observed from Fig. 1(b). In SCNs with block increments, the newly growing hidden nodes are referred to as a hidden node block. In order to facilitate the algorithmic description, the output function is redefined as fL = HLβ = fL−1 + hLβL^T, where HL = [h1, ..., hL]N×L, β has the same format as Eq. (13), and hL = gL^T. For simplicity, each addition of a hidden node block is denoted as an iteration. Accordingly, Δk denotes the block size of the newly growing hidden nodes in the kth iteration.

3.2. Universal approximation property

The theoretical analysis on the universal approximation property provided here is an extension of that given in [26]. For the block incremental approach, the output weights can be calculated as follows:
$$\begin{aligned}
J_\beta &= \|f - f_L\|^2 \\
&= \left\| f - f_{L-\Delta k} - \left( h_{L-\Delta k+1}\beta_{L-\Delta k+1}^T + \cdots + h_L \beta_L^T \right) \right\|^2 \\
&= \left\| e_{L-\Delta k} - [h_{L-\Delta k+1}, \cdots, h_L]\,[\beta_{L-\Delta k+1}, \cdots, \beta_L]^T \right\|^2 \\
&= \left\| e_{L-\Delta k} - H_{\Delta k}\beta_{\Delta k} \right\|^2 \\
&= \sum_{q=1}^{m} \left( e_{L-\Delta k,q} - H_{\Delta k}\beta_{\Delta k,q} \right)^T \left( e_{L-\Delta k,q} - H_{\Delta k}\beta_{\Delta k,q} \right) \\
&= \sum_{q=1}^{m} \left( \|e_{L-\Delta k,q}\|^2 - 2 e_{L-\Delta k,q}^T H_{\Delta k}\beta_{\Delta k,q} + (H_{\Delta k}\beta_{\Delta k,q})^T H_{\Delta k}\beta_{\Delta k,q} \right) \\
&= \|e_{L-\Delta k}\|^2 - \sum_{q=1}^{m} \left( 2 e_{L-\Delta k,q}^T H_{\Delta k}\beta_{\Delta k,q} - (H_{\Delta k}\beta_{\Delta k,q})^T H_{\Delta k}\beta_{\Delta k,q} \right)
\end{aligned} \tag{14}$$

where HΔk = [hL−Δk+1, ..., hL]N×Δk, βΔk = [βL−Δk+1, ..., βL]^T of size Δk × m and eL−Δk = f − fL−Δk represent the hidden output block, the output weight block and the residual error block, respectively, in the kth iteration. Taking the derivative of Eq. (14) with respect to βΔk,q, we have

$$\frac{\partial J_\beta}{\partial \beta_{\Delta k,q}} = -2 H_{\Delta k}^T e_{L-\Delta k,q} + 2 H_{\Delta k}^T H_{\Delta k}\beta_{\Delta k,q} \tag{15}$$
Setting Eq. (15) to zero, we obtain H_{Δk}^T e_{L−Δk,q} = H_{Δk}^T H_{Δk} β_{Δk,q}, and β_{Δk,q} can be further expressed as

$$\beta_{\Delta k,q} = \left( H_{\Delta k}^T H_{\Delta k} \right)^{\dagger} H_{\Delta k}^T e_{L-\Delta k,q} \tag{16}$$

where (H_{Δk}^T H_{Δk})† is the Moore–Penrose generalized inverse of the matrix H_{Δk}^T H_{Δk} [19]. It is not difficult to conclude that βL,q is a scalar while βΔk,q is a vector. According to Eq. (16), HΔk = hL, eL−Δk,q = eL−1,q and βΔk,q = βL,q = ⟨eL−1,q, gL⟩/‖gL‖² in the case of Δk = 1. That is to say, the point incremental approach can be regarded as a special case of the block incremental approach. More details referring to the original SCNs are given as follows.

Define β̃Δk = (H_{Δk}^T H_{Δk})† H_{Δk}^T e*_{L−Δk} as intermediate values (β̃Δk,q = (H_{Δk}^T H_{Δk})† H_{Δk}^T e*_{L−Δk,q}, q = 1, 2, ..., m), and then calculate the corresponding ẽL = e*_{L−Δk} − HΔk β̃Δk.
k
Theorem 2. Suppose that span ( ) is dense in L2 space. Given 0 < r < 1 and a nonnegative real number sequence {μL }, with limL→∞ μL = 0 and μL ≤ (1 − r ). For L = 1, 2, . . . , k ∈ {L} and k = 1, 2, . . . , denoted by ∗ δL,q = (1 − r − μL )e∗L−k ,q 2 , q = 1, 2, . . . , m.
(17)
If the hidden output block Hk is generated with the following inequalities, namely the block form of supervisory mechanism:
∗ e∗L−k ,q , Hk β˜k ,q ≥ δL,q , q = 1, 2, . . . , m .
(18)
and the output weights are calculated by
∗
β1∗ , . . . , β ∗j , . . . , βL
L = arg min f − β jg j β j=1
Then, we have limL→∞ f − fL∗ = 0, where fL∗ = e∗L− = f − fL∗ and e∗0 = f.
L
j=1
(19)
β ∗j g j , β ∗j = [β ∗j,1 , . . . , β ∗j,q , . . . , β ∗j,m ]T , the optimal residual error block
k
Proof. For the sequence {‖eL*‖²}, we have

$$\|e_L^*\|^2 \le \|\tilde{e}_L\|^2 = \langle e_{L-\Delta k}^* - H_{\Delta k}\tilde{\beta}_{\Delta k},\; e_{L-\Delta k}^* - H_{\Delta k}\tilde{\beta}_{\Delta k} \rangle = \|e_{L-\Delta k}^* - H_{\Delta k}\tilde{\beta}_{\Delta k}\|^2 \le \|e_{L-\Delta k}^*\|^2 \tag{20}$$

Combining the above, one can easily get that ‖eL*‖² ≤ ‖ẽL‖² = ‖e*_{L−Δk} − HΔk β̃Δk‖² ≤ ‖e*_{L−Δk}‖² ≤ ‖ẽ_{L−Δk}‖², so {‖eL*‖²} is monotonically decreasing. Hence, we can further obtain

$$\begin{aligned}
\|e_L^*\|^2 - (r + \mu_L)\|e_{L-\Delta k}^*\|^2
&\le \|\tilde{e}_L\|^2 - (r + \mu_L)\|e_{L-\Delta k}^*\|^2 \\
&= \sum_{q=1}^{m} \left[ \langle e_{L-\Delta k,q}^* - H_{\Delta k}\tilde{\beta}_{\Delta k,q},\; e_{L-\Delta k,q}^* - H_{\Delta k}\tilde{\beta}_{\Delta k,q} \rangle - (r + \mu_L)\langle e_{L-\Delta k,q}^*, e_{L-\Delta k,q}^* \rangle \right] \\
&= \sum_{q=1}^{m} \left[ (1 - r - \mu_L)\langle e_{L-\Delta k,q}^*, e_{L-\Delta k,q}^* \rangle - 2\langle e_{L-\Delta k,q}^*, H_{\Delta k}\tilde{\beta}_{\Delta k,q} \rangle + \langle H_{\Delta k}\tilde{\beta}_{\Delta k,q}, H_{\Delta k}\tilde{\beta}_{\Delta k,q} \rangle \right] \\
&= \sum_{q=1}^{m} \left[ \delta_{L,q}^* - e_{L-\Delta k,q}^{*T} H_{\Delta k}\left( H_{\Delta k}^T H_{\Delta k} \right)^{\dagger} H_{\Delta k}^T e_{L-\Delta k,q}^* \right] \\
&= \sum_{q=1}^{m} \left[ \delta_{L,q}^* - e_{L-\Delta k,q}^{*T} H_{\Delta k}\tilde{\beta}_{\Delta k,q} \right] \\
&= \sum_{q=1}^{m} \left[ \delta_{L,q}^* - \langle e_{L-\Delta k,q}^*, H_{\Delta k}\tilde{\beta}_{\Delta k,q} \rangle \right] \le 0
\end{aligned} \tag{21}$$

Therefore, ‖eL*‖² − (r + μL)‖e*_{L−Δk}‖² ≤ 0, which can be further written as

$$\|e_L^*\|^2 \le r\|e_{L-\Delta k}^*\|^2 + \mu_L\|e_{L-\Delta k}^*\|^2 \tag{22}$$
Table 1. Related variables and description for the block incremental approach.

Variable | Description
Δk | Size of the newly growing hidden node block in the kth iteration
e_{L−Δk} | Residual error block before the newly growing Δk hidden nodes
v_{Δk} | Input weights generated randomly with dimension Δk × d in the kth iteration
b_{Δk} | Biases generated randomly with dimension Δk × 1 in the kth iteration
H_{Δk} | Newly growing hidden output block in the kth iteration
β̃_{Δk,q} | Intermediate values of output weights in the kth iteration
Note that limL→∞ μL‖e*_{L−Δk}‖² = 0, since limL→∞ μL = 0. Based on Eq. (22), we obtain limL→∞ ‖eL*‖² = 0, that is, limL→∞ eL* = 0. This completes the proof.

3.3. Algorithmic description
In the block incremental approach, the block size plays an important role in the construction process. Correspondingly, fixed and varied block size strategies for hidden node blocks are used, giving rise to two algorithms, namely BSC-I and BSC-II, respectively. The two algorithms adopt the same supervisory mechanism and output weight optimization method, but different policies for setting the hidden node block sizes during the construction process. In order to describe the algorithms clearly, we uniformly list the variables in block form (see Table 1); the details are as follows.

1. BSC-I: stochastic configuration algorithm with fixed block increments

Firstly, the specific formulas used in the construction process are presented. The residual error block is expressed as eL−Δk(X) = f − fL−Δk(X) = [eL−Δk,1(X), ..., eL−Δk,q(X), ..., eL−Δk,m(X)] ∈ R^{N×m}, where fL−Δk represents the output function of BSC-I and eL−Δk,q(X) = [eL−Δk,q(x1), ..., eL−Δk,q(xi), ..., eL−Δk,q(xN)]^T ∈ R^N. Let the hidden output block be

$$H_{\Delta k}(X) = [h_{L-\Delta k+1}(X), \cdots, h_L(X)]_{N \times \Delta k} = [H_{\Delta k}(v_{\Delta k}, b_{\Delta k}, x_1), \cdots, H_{\Delta k}(v_{\Delta k}, b_{\Delta k}, x_i), \cdots, H_{\Delta k}(v_{\Delta k}, b_{\Delta k}, x_N)]^T \tag{23}$$

where HΔk(vΔk, bΔk, xi) = 1/(1 + exp(−vΔk xi^T − bΔk)) stands for the nonlinear mapping of the input xi based on the activation function with respect to the newly added Δk hidden nodes. For the sake of simplicity, the instrumental variable ξL,q is introduced. Let ξL = Σ_{q=1}^{m} ξL,q, where

$$\xi_{L,q} = \langle e_{L-\Delta k,q}^*(X), H_{\Delta k}(X)\tilde{\beta}_{\Delta k,q} \rangle - (1 - r - \mu_L)\, \|e_{L-\Delta k,q}^*(X)\|^2, \quad q = 1, 2, \ldots, m \tag{24}$$

and the intermediate values are β̃Δk,q = (HΔk(X)^T HΔk(X))† HΔk(X)^T e*_{L−Δk,q}(X), q = 1, 2, ..., m. It is known from Theorem 2 that the hidden-node parameters can be stochastically configured by choosing, among multiple candidates, the one with the maximum ξL subject to ξL,q ≥ 0, q = 1, ..., m, i.e., the combination of Eqs. (17) and (18). Following the original SCNs, we summarize the stochastic configuration (SC) algorithm with the fixed block size strategy in Algorithm BSC-I. It should be pointed out that the learning condition L ≤ Lmax can also be replaced by k ≤ kmax. The hidden node block tends to gain more features, which helps accelerate the convergence rate. That is to say, the larger the block size, the higher the residual error decreasing rate, and the faster the construction process. Therefore, BSC-I is a promising algorithm that can improve learning speed and is a preferable choice for large-scale modeling tasks.
2. BSC-II: stochastic configuration algorithm with varied block increments

Although a larger block size can achieve a higher residual error decreasing rate, it makes it difficult to construct a compact model. Because the residual error decreases monotonically, the number of required features also decreases along the construction process. BSC-I, however, adopts a fixed block size in each iteration, which inevitably introduces redundant features, since an oversized block may be used near the end of the construction process. Because an overly complex structure leads to over-fitting and thus poor performance, it is necessary to adopt a proper block size for each network growth to keep the model complexity as low as possible. This motivates the BSC-II algorithm, which employs a varied block size (Δk) strategy for constructing the network. This strategy aims to achieve a good balance between fast convergence and compact structure by means of a problem-dependent block size. Since the change in the number of required features resembles the annealing process of metal smelting [16], the BSC-II algorithm employs a simulated annealing (SA) based varied block size strategy, defined by

$$\Delta k = 1 + \mathrm{round}\left( (\Delta_1 - 1) \times (1 - p) \right) \tag{25}$$
where round(·) is the rounding-to-integer function and Δ1 represents the initial block size. Correspondingly, the Boltzmann probability p can be expressed as p(dE) = exp(dE/η), where dE = ‖eL‖ − ‖eL−Δk‖ represents the residual error difference between the networks fL and fL−Δk (corresponding to the current state and the previous state of annealing in metallurgy), and η is an adjustable parameter chosen according to the characteristics of the given modeling task. The main idea of the SA-based varied block size strategy is to automatically select the block size from the scope [1, Δ1], where Δk can be determined directly in light of the residual error difference. Based on the above description, the corresponding pseudo code of the SC algorithm with the varied block size strategy is summarized in Algorithm BSC-II. Comprehensive studies on our proposed algorithms are described in the next section.
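A minimal sketch of the SA-based block size update of Eq. (25); measuring dE by the change in the residual norm is an assumption made here for illustration:

```python
import numpy as np

def sa_block_size(err_prev, err_curr, delta1, eta):
    """Varied block size for the next BSC-II iteration, cf. Eq. (25).

    err_prev, err_curr: residual error norms before/after the last block addition
    delta1: initial (maximum) block size; eta: annealing parameter."""
    dE = err_curr - err_prev                 # negative when the error decreased
    p = np.exp(dE / eta)                     # Boltzmann probability
    return 1 + int(round((delta1 - 1) * (1.0 - p)))

# usage sketch: a large error drop drives the block size toward delta1,
# while a negligible drop shrinks the block toward a single node.
# next_dk = sa_block_size(err_prev=0.30, err_curr=0.12, delta1=10, eta=1.0)
```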
Remark 1. It should be pointed out that incorporating an optimization strategy into randomized learner models may potentially slow down the learning speed. Fortunately, the SA method used in BSC-II only involves simple calculations and its computing load is almost negligible. Therefore, BSC-II makes a good trade-off between learning time and network complexity.

Remark 2. It is believed that the SCN family is not limited to the adopted sigmoid activation function; other activation functions, such as the radial basis function, are also applicable. Likewise, BSC-II is not limited to the SA method used here; readers may also achieve good results with other appropriate methods.
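As a small illustration of Remark 2, two interchangeable hidden-node activations are sketched below; the Gaussian form and its use of b as a width-like parameter are assumptions for illustration, not the paper's setting.

```python
import numpy as np

def sigmoid_node(x, v, b):
    # sigmoid hidden node used in BSC-I / BSC-II, cf. Eq. (23)
    return 1.0 / (1.0 + np.exp(-(np.dot(v, x) + b)))

def rbf_node(x, v, b):
    # Gaussian radial-basis alternative mentioned in Remark 2; here |b| acts as
    # an impact width and v as a centre (an assumption, not the paper's setting)
    return np.exp(-abs(b) * np.sum((x - v) ** 2))
```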
4. Impact of block size on learning performance

In this section, we investigate the effect of block increments of hidden nodes. Comparisons among our proposed algorithms, SCN and IRVFLN are made through two function approximations and four real-world regression cases. The experiments are conducted in the MATLAB 2016a environment running on a PC equipped with a Core i5 3.4 GHz CPU and 8 GB RAM. In the data preprocessing, both the input and output data are normalized into [−1, 1], and the estimated outputs of the models on the testing samples are inversely normalized to the raw data space for observational convenience. The parameter r is generated from the scope (0, 1). Settings of the other parameters will be given in the following specific experiments. The Mean and Standard Deviation (Dev) of the Root Mean Square Error (RMSE) are obtained from fifty independent trials.

4.1. Function approximations

This subsection presents two function approximations, which are well-received benchmarks for regression problems. The first is to approximate the function y1, defined as follows:
$$y_1(x) = \begin{cases} \sin(x)/x, & x \neq 0 \\ 1, & x = 0 \end{cases}, \quad x \in [-10, 10] \tag{26}$$
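A hedged sketch of how such a benchmark dataset can be generated and split in Python; the 400-sample size and the 300/100 split follow the setup described below, while the uniform sampling and the fixed seed are assumptions.

```python
import numpy as np

def y1(x):
    # sin(x)/x with y1(0) = 1, cf. Eq. (26); np.sinc(z) = sin(pi z)/(pi z)
    return np.sinc(x / np.pi)

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=400)              # 400 distinct samples on [-10, 10]
t = y1(x)
idx = rng.permutation(400)
x_train, t_train = x[idx[:300]], t[idx[:300]]   # 300 training samples
x_test,  t_test  = x[idx[300:]], t[idx[300:]]   # 100 testing samples
```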
It is worth mentioning that if the learning process is not constrained, over-fitting can easily occur. To avoid this, training dataset expanding and early stopping techniques are employed. The randomly selected training dataset is no less than 75% of the whole sample in our experiments, so as to cover the characteristics of the raw sample space as far as possible (the training dataset expanding technique). Meanwhile, the expected error tolerance ε and the maximum number of hidden nodes Lmax are used as termination criteria (the early stopping technique). According to the above techniques, 400 distinct samples generated from the above function are divided into 300 training samples and 100 testing samples; ε, measured by the RMSE of the residual error, is set as 0.05. The other parameters are obtained with the cross-validation method (see Table 2). In Table 2, Υ = {λmin : λ : λmax} is a given scope set, where λ represents the bound of the hidden-node parameters; e.g., v and b in IRVFLN are randomly assigned from [−1, 1] (λ = 1). Tmax is the maximum number of random configuration attempts. In addition, the adjustable parameter η in the SA-based update strategy of BSC-II is set as 10. The Mean and Dev of the training RMSE are reported in Table 3, where IRVFLN fails to meet the expected error tolerance within an acceptable training time, as revealed in [18]. The failed experiments of IRVFLN are marked with '–'. By contrast,
Table 2. Partial optimization parameters on the function y1.

Algorithm | Lmax | Υ | Tmax
IRVFLN | 50 | {1} | 1
SCN | 50 | {1:1:10} | 10
BSC-I | 50 | {1:1:10} | 10
BSC-II | 50 | {1:1:10} | 10
Table 3. Performance comparisons among IRVFLN, SCN, BSC-I and BSC-II on the function y1 (η = 10 in BSC-II). Training process (ε = 0.05).

Algorithm | Δk | (Mean, Dev) | t(s) | k | L
IRVFLN | 1 | – | – | – | –
SCN | 1 | 0.0276, 0.0109 | 0.1197 | 9.2 | 9.2
BSC-I | 1 | 0.0293, 0.0139 | 0.1208 | 9.4 | 9.4
BSC-I | 2 | 0.0265, 0.0153 | 0.1169 | 5.2 | 10.4
BSC-I | 3 | 0.0118, 0.0123 | 0.1055 | 3.9 | 11.8
BSC-I | 5 | 0.0044, 0.0084 | 0.0971 | 2.8 | 14.0
BSC-I | 10 | 0.0012, 0.0065 | 0.0920 | 1.9 | 19.8
BSC-II | [1, 5] | 0.0225, 0.0135 | 0.1025 | 3.3 | 10.4
BSC-II | [1, 10] | 0.0208, 0.0147 | 0.0998 | 2.0 | 16.3
Fig. 2. Convergence performance of IRVFLN and SCN on the function y1 .
Fig. 3. Convergence performance of BSC-I on the function y1 .
SCN, BSC-I and BSC-II exhibit fairly good performance. It can be observed from Table 3 that the performance of SCN is equivalent to BSC-I with Δk = 1. As for BSC-I, with the fixed value of Δk increasing (e.g., five independent block incremental networks are generated with Δk = 1, 2, 3, 5 and 10, respectively), the training time is effectively reduced by means of fewer iterations. For instance, BSC-I with Δk = 10 can achieve the desired accuracy in only two iterations, thereby leading to the fastest learning speed; unfortunately, its number of hidden nodes increases noticeably. Unlike BSC-I, BSC-II narrows down the number of hidden nodes but slows down the learning speed slightly. This can be seen from the comparisons between BSC-I with Δk = 5 and BSC-II with Δk ∈ [1, 5], as well as between BSC-I with Δk = 10 and BSC-II with Δk ∈ [1, 10]. To present the learning performance of the different algorithms more intuitively, extra experimental results (ε = 0.01) based on the training RMSE are also plotted in Figs. 2–4. It should be pointed out that the training RMSE is also shown when the number of iterations is 0, that is, the case of f0 = 0 (e0 = T). As observed from Fig. 2, the residual error decreasing rate of SCN is much higher than that of IRVFLN, which is aligned with the missing values of IRVFLN in Table 3. Figs. 3 and 4 show that the larger Δk, the fewer the iterations and the higher the residual error decreasing rate. Figs. 2–4 demonstrate that both BSC-I and BSC-II share the universal approximation property and outperform the original SCN and IRVFLN in convergence speed.
Fig. 4. Convergence performance of BSC-II on the function y1.

Table 4. Partial optimization parameters on the function y2.

Algorithm | Lmax | Υ | Tmax
IRVFLN | 100 | {150} | 1
SCN | 100 | {150:10:200} | 10
BSC-I | 100 | {150:10:200} | 10
BSC-II | 100 | {150:10:200} | 10
Table 5. Performance comparisons among IRVFLN, SCN, BSC-I and BSC-II on the function y2 (η = 1 in BSC-II). Training process (ε = 0.05).

Algorithm | Δk | (Mean, Dev) | t(s) | k | L
IRVFLN | 1 | – | – | – | –
SCN | 1 | 0.0408, 0.0079 | 0.1682 | 25.4 | 25.4
BSC-I | 1 | 0.0417, 0.0087 | 0.1678 | 25.1 | 25.1
BSC-I | 2 | 0.0393, 0.0094 | 0.1613 | 17.4 | 34.8
BSC-I | 5 | 0.0369, 0.0099 | 0.1407 | 10.2 | 50.8
BSC-I | 10 | 0.0374, 0.0098 | 0.1282 | 6.3 | 63.4
BSC-I | 15 | 0.0374, 0.0125 | 0.1181 | 4.8 | 72.0
BSC-II | [1, 10] | 0.0410, 0.0091 | 0.1385 | 10.1 | 46.1
BSC-II | [1, 15] | 0.0401, 0.0095 | 0.1323 | 7.9 | 54.4
Remark 3. Although the SCNs with block increments are superior in training time, the improvement is limited. This is because the block form of the supervisory mechanism (see Eq. (18)) involves the computation of a Moore–Penrose generalized inverse, which is time-consuming and absent from the supervisory mechanism of the original SCNs. Therefore, it counteracts the superiority of the proposed extension to a certain extent. Simplifying this complex operation is an interesting direction for future work.

Remark 4. For the original SCNs and IRVFLNs, the growing hidden nodes are easy to count and plot, as the number of iterations k is equal to the number of hidden nodes L. This is difficult for BSC-II, because the number of growing hidden nodes in BSC-II is not fixed per iteration. Besides, considering that the modeling performance of constructive methods depends on each iteration, k is used on the x-axis instead of L in our figures (e.g., Figs. 2–4).

To further verify the effectiveness of the proposed extension, the second regression problem is to approximate the following highly nonlinear compound function y2:
$$y_2(x) = 0.2e^{-(10x-4)^2} + 0.5e^{-(80x-40)^2} + 0.3e^{-(80x-20)^2}, \quad x \in [0, 1] \tag{27}$$
1000 points are randomly sampled from the above function; 80% of the samples are randomly selected as the training set, while the test set consists of the remaining 20%. The optimal parameters reported in Table 4 are used in the following comparison experiment. The error tolerance ε is set as 0.05, and the adjustable parameter η in BSC-II is set as 1. The Mean and Dev of the training RMSE are reported in Table 5, and extra experimental results (ε = 0.01) based on the training RMSE are plotted in Figs. 5 and 6. Since the results are similar to those on the function y1, the corresponding analysis and conclusions are omitted, and the following focuses on the testing results (see Fig. 7). Owing to the fact that BSC-I with Δk = 1 is equivalent to SCN and the testing results of our proposed algorithms with different Δk are similar, which is aligned with the results reported in Table 5, only BSC-I with Δk = 10 and BSC-II with Δk ∈ [1, 10] are shown in Fig. 7. The testing results show that the prediction performance of IRVFLN is poor, even though the number of hidden nodes (namely iterations) is up to 200. By contrast, SCN performs very well with a much more compact structure. As for our proposed algorithms, they not only achieve the
Fig. 5. Convergence performance of IRVFLN and SCN on the function y2 .
Fig. 6. Convergence performance of BSC-I and BSC-II on the function y2 .
Fig. 7. Approximation performance of IRVFLN, SCN, BSC-I and BSC-II on the function y2 .
Table 6. Specification of benchmark datasets.

Dataset | Input variables | Output variables | Samples | Default task
Yacht | 6 | 1 | 308 | Regression
Energy Efficiency | 8 | 1 | 768 | Regression
Stock | 9 | 1 | 950 | Regression
AutoMPG8 | 7 | 1 | 392 | Regression
Table 7. Partial optimization parameters on benchmark datasets.

Dataset | IRVFLN (Lmax, Υ, Tmax) | SCN (Lmax, Υ, Tmax) | BSC-I (Lmax, Υ, Tmax) | BSC-II (Lmax, Υ, Tmax, η)
Yacht | 70, {1}, 1 | 70, {1:1:10}, 10 | 80, {1:1:10}, 10 | 80, {1:1:10}, 10, 0.5
Energy Efficiency | 80, {1}, 1 | 80, {0.1:0.1:1}, 10 | 100, {0.1:0.1:1}, 10 | 100, {0.1:0.1:1}, 10, 0.1
Stock | 100, {10}, 1 | 100, {10:2:20}, 10 | 150, {10:2:20}, 10 | 150, {10:2:20}, 10, 0.5
AutoMPG8 | 200, {20}, 1 | 200, {20:5:50}, 10 | 250, {20:5:50}, 10 | 250, {20:5:50}, 10, 0.05
Fig. 8. Mean and Dev among SCN, BSC-I and BSC-II.
same accuracy as SCN, but also effectively reduce the number of iterations, thereby simplifying the operations of constructing incremental networks. Details about the real-world modeling tasks and discussions on the related parameters are given below.
4.2. Benchmark datasets

The benchmark datasets (see Table 6), obtained from the Knowledge Extraction based on Evolutionary Learning Dataset Repository [1] and the UCI Machine Learning Repository [2], are Yacht, Energy Efficiency (heating), Stock and AutoMPG8. For each benchmark dataset, six different block sizes (Δk = 1, 3, 5, 10 and Δk ∈ [1,5], [1,10]) are considered with five specified settings of the iteration number (k = 1, 5, 10, 20, 30), aiming to highlight the effect of adding hidden node blocks. The parameter settings and the statistical results of 50 independent trials are reported in Tables 7 and 8, respectively. Table 8 uses boldface type for the Means that achieve the error tolerance and the corresponding Devs.

Table 8 shows that the errors of our algorithms are relatively small at the same iteration, and that a larger block size leads to a faster decreasing rate of the residual error sequence. For instance, the training RMSE of BSC-I with Δk = 5 after the 1st iteration (k = 1) is almost identical to that of SCN after the 5th iteration (k = 5) in the Yacht modeling task, which is still far superior to IRVFLN after the 30th iteration (k = 30). To present those results more intuitively, we use a stacked bar chart (see Fig. 8) to show the Mean and Dev of the training RMSE at the 5th iteration on the four datasets. As stated earlier, BSC-I with Δk = 10 gets the smallest Mean and Dev at the same iteration.

In the case of achieving the given error tolerance, the training time of our proposed algorithms is effectively reduced. In fact, the block increment of hidden nodes simplifies the construction process with fewer iterations at a slight cost of network compactness. For instance, in the Energy Efficiency modeling task, although BSC-I increases the number of hidden nodes by about one-tenth compared with SCN, the training time is reduced by almost half. Meanwhile, BSC-II makes a good compromise between modeling speed and network compactness. Interestingly, BSC-II realizes the improvement of modeling
Table 8. Performance comparisons among IRVFLN, SCN, BSC-I and BSC-II on benchmark datasets (error tolerance ε = 0.05). Columns give the training (Mean, Dev) at iterations k = 1, 5, 10, 20, 30, followed by t(s), k and L at the error tolerance.

Dataset | Algorithm | Δk | k = 1 | k = 5 | k = 10 | k = 20 | k = 30 | t(s) | k | L
Yacht | IRVFLN | 1 | 0.5531, 0.0661 | 0.5158, 0.0557 | 0.4364, 0.0565 | 0.4222, 0.0489 | 0.3844, 0.0373 | – | – | –
Yacht | SCN | 1 | 0.4316, 0.0241 | 0.2930, 0.0133 | 0.2411, 0.0180 | 0.1632, 0.0162 | 0.1266, 0.0095 | 0.2665 | 57.6 | 57.6
Yacht | BSC-I | 1 | 0.4364, 0.0293 | 0.2916, 0.0121 | 0.2467, 0.0159 | 0.1688, 0.0210 | 0.1277, 0.0066 | 0.2675 | 57.4 | 57.4
Yacht | BSC-I | 3 | 0.3491, 0.0292 | 0.2203, 0.0247 | 0.1399, 0.0125 | 0.0469, 0.0051 | 0.0160, 0.0028 | 0.1817 | 20.2 | 60.5
Yacht | BSC-I | 5 | 0.2926, 0.0096 | 0.1760, 0.0216 | 0.0764, 0.0111 | 0.0110, 0.0015 | 0.0032, 3.5e-04 | 0.1520 | 12.6 | 63.1
Yacht | BSC-I | 10 | 0.2547, 0.0117 | 0.0729, 0.0117 | 0.0104, 0.0013 | 8.5e-04, 7.7e-05 | 2.9e-05, 4.5e-05 | 0.1285 | 9.8 | 65.8
Yacht | BSC-II | [1,5] | 0.2866, 0.0077 | 0.1706, 0.0144 | 0.0843, 0.0124 | 0.0276, 0.0057 | 0.0114, 0.0017 | 0.1728 | 13.4 | 60.7
Yacht | BSC-II | [1,10] | 0.2465, 0.0150 | 0.1047, 0.0142 | 0.0523, 0.0094 | 0.0205, 0.0055 | 0.0109, 0.0016 | 0.1457 | 10.5 | 61.2
Energy Efficiency | IRVFLN | 1 | 0.5312, 0.0467 | 0.3735, 0.0779 | 0.3171, 0.0502 | 0.2707, 0.0307 | 0.2531, 0.0295 | – | – | –
Energy Efficiency | SCN | 1 | 0.5306, 0.0029 | 0.1698, 0.0066 | 0.1519, 0.0021 | 0.1393, 0.0032 | 0.1248, 0.0048 | 0.3173 | 70.6 | 70.6
Energy Efficiency | BSC-I | 1 | 0.5302, 0.0028 | 0.1727, 0.0134 | 0.1520, 0.0026 | 0.1377, 0.0041 | 0.1227, 0.0050 | 0.3136 | 69.8 | 69.8
Energy Efficiency | BSC-I | 3 | 0.2250, 0.0321 | 0.1477, 0.0028 | 0.1312, 0.0047 | 0.0842, 0.0080 | 0.0343, 0.0030 | 0.2166 | 25.2 | 75.5
Energy Efficiency | BSC-I | 5 | 0.1758, 0.0146 | 0.1406, 0.0029 | 0.1038, 0.0065 | 0.0313, 0.0029 | 0.0193, 4.8e-04 | 0.1791 | 15.6 | 78.2
Energy Efficiency | BSC-I | 10 | 0.1520, 0.0021 | 0.1081, 0.0061 | 0.0276, 0.0027 | 0.0169, 2.8e-04 | 0.0125, 2.5e-04 | 0.1319 | 8.0 | 79.8
Energy Efficiency | BSC-II | [1,5] | 0.1748, 0.0119 | 0.1424, 0.0042 | 0.1209, 0.0069 | 0.0487, 0.0102 | 0.0287, 0.0035 | 0.1886 | 19.8 | 76.7
Energy Efficiency | BSC-II | [1,10] | 0.1521, 0.0017 | 0.1177, 0.0068 | 0.0310, 0.0042 | 0.0214, 0.0011 | 0.0195, 4.0e-04 | 0.1377 | 22.6 | 77.2
Stock | IRVFLN | 1 | 0.4596, 0.0281 | 0.3835, 0.0398 | 0.3336, 0.0481 | 0.2755, 0.0300 | 0.2377, 0.0265 | – | – | –
Stock | SCN | 1 | 0.3981, 0.0341 | 0.2227, 0.0283 | 0.1575, 0.0176 | 0.1080, 0.0095 | 0.0859, 0.0058 | 0.4175 | 90.7 | 90.7
Stock | BSC-I | 1 | 0.4052, 0.0333 | 0.2295, 0.0293 | 0.1571, 0.0184 | 0.1067, 0.0077 | 0.0873, 0.0052 | 0.4154 | 90.4 | 90.4
Stock | BSC-I | 3 | 0.3080, 0.0331 | 0.1453, 0.0170 | 0.0986, 0.0075 | 0.0681, 0.0036 | 0.0554, 0.0019 | 0.3208 | 35.2 | 105.1
Stock | BSC-I | 5 | 0.2581, 0.0334 | 0.1124, 0.0099 | 0.0781, 0.0052 | 0.0539, 0.0019 | 0.0368, 9.3e-04 | 0.2293 | 22.3 | 111.6
Stock | BSC-I | 10 | 0.1851, 0.0220 | 0.0803, 0.0043 | 0.0555, 0.0024 | 0.0377, 0.0012 | 0.0287, 7.5e-04 | 0.1773 | 12.2 | 122.2
Stock | BSC-II | [1,5] | 0.2578, 0.0279 | 0.1121, 0.0081 | 0.0798, 0.0049 | 0.0628, 0.0029 | 0.0551, 0.0024 | 0.2601 | 36.2 | 95.2
Stock | BSC-II | [1,10] | 0.1815, 0.0227 | 0.0816, 0.0045 | 0.0589, 0.0027 | 0.0490, 0.0021 | 0.0440, 0.0018 | 0.1988 | 17.6 | 102.0
AutoMPG8 | IRVFLN | 1 | 0.4194, 0.0458 | 0.3265, 0.0429 | 0.2805, 0.0324 | 0.2588, 0.0248 | 0.2403, 0.0197 | – | – | –
AutoMPG8 | SCN | 1 | 0.3342, 0.0380 | 0.2283, 0.0172 | 0.1888, 0.0098 | 0.1579, 0.0066 | 0.1414, 0.0054 | 0.9310 | 186.3 | 186.3
AutoMPG8 | BSC-I | 1 | 0.3284, 0.0278 | 0.2233, 0.0147 | 0.1912, 0.0098 | 0.1582, 0.0074 | 0.1413, 0.0042 | 0.9377 | 187.1 | 187.1
AutoMPG8 | BSC-I | 3 | 0.2696, 0.0170 | 0.1809, 0.0082 | 0.1508, 0.0054 | 0.1228, 0.0040 | 0.1041, 0.0037 | 0.5418 | 70.9 | 212.8
AutoMPG8 | BSC-I | 5 | 0.2415, 0.0135 | 0.1604, 0.0077 | 0.1336, 0.0046 | 0.1012, 0.0030 | 0.0784, 0.0036 | 0.4777 | 44.1 | 220.5
AutoMPG8 | BSC-I | 10 | 0.2065, 0.0108 | 0.1372, 0.0049 | 0.1053, 0.0050 | 0.0614, 0.0031 | 1.0e-13, 9.5e-14 | 0.2630 | 23.1 | 230.8
AutoMPG8 | BSC-II | [1,5] | 0.2399, 0.0120 | 0.1723, 0.0080 | 0.1485, 0.0052 | 0.1249, 0.0044 | 0.0865, 0.0044 | 0.5129 | 69.8 | 192.7
AutoMPG8 | BSC-II | [1,10] | 0.1911, 0.0059 | 0.1474, 0.0040 | 0.1287, 0.0048 | 0.1062, 0.0041 | 0.0679, 0.0040 | 0.4083 | 43.5 | 196.2
Fig. 9. Performance comparisons among SCN, BSC-I and BSC-II.
speed by adding slightly more hidden nodes compared with the original SCN, which demonstrates sound generalization performance.

To make a comprehensive assessment of each learner model with respect to time consumption and network compactness, which stand for the efficiency and generalization performances respectively, a criterion is adopted here. Firstly, the training time and the number of hidden nodes recorded in Table 8 are respectively normalized into [0, 1] when the learner models achieve the expected error tolerance (ε = 0.05). Then, the normalized values are used to calculate scores for the two performances; for these scores, higher is worse. For instance, in the Yacht modeling task, the L of BSC-I with Δk = 1 is the largest among all the algorithms (see Table 8), so its generalization-performance score is set to the normalized value 1. Finally, the comprehensive performance is assessed by means of the sum of scores over the four datasets. Fig. 9 shows the detailed scores of each learner model for the single and comprehensive performances, which are marked inside and to the right side of the chart, respectively. It should be noted that IRVFLN has not been assessed, as it fails to achieve the expected error tolerance (ε = 0.05) within an acceptable training time. This fact is clearly illustrated in Table 8, where we use '–' to represent the fruitless results of IRVFLN. From Fig. 9, we can find that the point incremental approaches (either SCN or BSC-I with Δk = 1) get the worst score 1 on all datasets in terms of time consumption, and the block incremental approach with the largest fixed size (BSC-I with Δk = 10) obtains the worst score 1 with respect to network compactness. Compared with SCN and BSC-I with Δk = 10, BSC-II with Δk ∈ [1,10] achieves the best comprehensive performance. A similar result can be noticed from the scores among SCN, BSC-I with Δk = 5 and BSC-II with Δk ∈ [1,5]. Therefore, BSC-II can strike a good balance between time consumption and network compactness, which means that the varied block size strategy is beneficial for learning. From Figs. 8 and 9, we can also assess the modeling robustness with respect to the block size. All these results suggest that our algorithms perform favorably in regression problems. In a nutshell, if an actual system makes a rigorous demand on network compactness but places no strict limitation on modeling time, the original SCN is a desirable and preferred choice. The BSC-I algorithm is suitable for systems requiring extremely fast learning, and the BSC-II algorithm is a compromise choice due to its superior comprehensive performance in terms of time consumption and network compactness.

4.3. Discussion on parameter selection

In our proposed algorithms, the hyper-parameters mainly consist of Lmax, Υ, r, Tmax, Δk and η. For Lmax, Υ, r and Tmax, the reasonable recommended settings are almost the same as for the original SCNs. A brief description is given here; for more details refer to Wang and Li [26]. Lmax and Υ tend to require larger settings for more complex problems. The parameter r is crucial to the residual error decreasing rate, and an increasing real-valued sequence from 0.9 approaching 1 is recommended, because the closer r is to 1, the more easily the supervisory mechanism is satisfied, which facilitates the configuration. Tmax determines the number of attempts at mining candidate hidden nodes that meet the supervisory mechanism, and thus inevitably involves computing load.
A Tmax selected as an excessively high value makes the operation time-consuming, while a too small Tmax may give rise to learning instability due to a larger possibility of failure in finding candidates. The following discussion focuses on the selection of Δk and η.
• In BSC-I, the fixed block size Δk is positively correlated with the residual error decreasing rate but negatively correlated with model compactness. In order to see the effect of Δk on modeling performance, different settings, including Δk = 5, 10, 20 and 40, are used on the Stock dataset. The comparative results are plotted in Fig. 10, where the labeled points represent two states: achieved tolerance error (ε = 0.04) (marked with '•') and critical over-fitting (marked with '∗'). In Fig. 10(a)
Fig. 10. The effect of different Δk on modeling performance.
and (b), the construction processes of BSC-I with Δk = 5 and 10 have both been stopped early before over-fitting. By contrast, BSC-I with Δk = 20 in Fig. 10(c) achieves the tolerance error at k = 10 and critical over-fitting at k = 11. Unfortunately, it can be seen from Fig. 10(d) that the network is already over-fitting before achieving the tolerance error. This implies that an excessive Δk is more likely to lead to over-fitting. Therefore, an appropriate block size set {3, 5, ..., 10} with ε = 0.05 is recommended in this case. For BSC-II, the varied block size Δk mainly depends on the initial block size Δ1 and the SA algorithm parameter η discussed below. According to our earlier analysis, BSC-II can make a good compromise between learning speed-up and network compactness. Therefore, users who pay more attention to fast learning can set a larger Δ1, and BSC-II then automatically selects an appropriate block size per iteration from the scope [1, Δ1].
• As a crucial parameter in BSC-II, η is directly associated with the varied block size Δk in each iteration. In fact, we attempted to vary η over the candidate set {0.01, 0.05, 0.1, ..., 10} based on the two function approximations and the four real-world cases. Through these experiments, we find that BSC-II with a too small or too large η may degenerate into BSC-I. It is meaningful to investigate how to select η adaptively in the future (an illustrative settings sketch follows this list).
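A hedged sketch collecting the recommended settings discussed above into a plain configuration dictionary; the concrete numbers are examples within the ranges used in Sections 4.1 and 4.2, not universal defaults.

```python
# Illustrative hyper-parameter choices for BSC-I / BSC-II on a mid-sized task;
# the values follow the ranges discussed above and in Tables 2, 4 and 7.
bsc_config = {
    "L_max": 100,                               # maximum number of hidden nodes
    "scope_set": [1, 5, 10, 50, 100],           # candidate lambda values (Upsilon)
    "r_sequence": [0.9, 0.99, 0.999, 0.9999],   # increasing r approaching 1
    "T_max": 10,                                # random configuration attempts per scope
    "tolerance": 0.05,                          # expected training RMSE
    "block_size": 5,                            # BSC-I: fixed block size in {3, ..., 10}
    "initial_block_size": 10,                   # BSC-II: Delta_1 (block drawn from [1, Delta_1])
    "eta": 1.0,                                 # BSC-II: SA parameter from {0.01, ..., 10}
}
```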
5. Applications in industrial data modeling

The above benchmark regression problems have demonstrated the merits of SCNs with block increments. In this section, we take two industrial processes, namely a mineral grinding process and a coal dense medium separation (CDMS) process, as examples to demonstrate the potential application in data-driven modeling of process industries. Process description and modeling task are first detailed for each case study. Then experimental setup, results and discussions are presented together for the sake of brevity.

5.1. Case 1: particle size estimation of grinding process

The classical mineral grinding process studied in this paper operates in a closed loop as shown in Fig. 11. To begin with, fresh ore is continuously fed into a cylindrical mill together with a certain amount of mill water. Meanwhile, heavy
Fig. 11. Flow diagram of a typical mineral grinding process.
metallic balls are loaded and tumbled in the mill to crush the coarse ore to finer sizes. After that, the mixed slurry is continuously discharged from the mill into the spiral classifier, where the slurry is separated into two streams, namely the overflow slurry and the underflow slurry. The dilution water is used to adjust the classification concentration so as to maintain a proper proportion of finer particles in the overflow slurry. The underflow slurry with coarser particles is recycled back to the mill for regrinding. The overflow slurry is the product transported to the subsequent procedure.

In the mineral grinding process, the particle size (PS) (%) is a significant quality index, which determines the grinding efficiency. Unfortunately, direct online measurement of this quality index is always difficult, and high-priced PS analyzers are not suitable for all grinding processes, such as the hematite grinding process. Therefore, a PS soft sensor should be realized; this industrial demand fits the purpose of research on learner models. During grinding operation, the PS is easily affected by the operating parameters (namely, α1, α2 and α3) [22] and the ore properties (namely, size distribution B1 and grindability B2). Therefore, α1, α2, α3, B1 and B2 are five influencing factors and can be used for data modeling of the PS. In fact, however, the ore properties cannot be measured online. Fortunately, their variations can be identified from the measurable equipment currents (namely, c1 and c2) [3,7]. Therefore, c1 and c2 are utilized to replace B1 and B2 to realize the following nonlinear mapping:
PS = φ(α1, α2, α3, c1, c2)   (28)
where φ(·) is an unknown nonlinear function. The motivation of quality index estimation in this study is to approximate the above mapping, which is a tricky problem in the mineral processing industry.

5.2. Case 2: ash estimation of coal dense medium separation process

The CDMS is one of the most direct and effective coal cleaning procedures. A classical CDMS process, as shown in Fig. 12, mainly includes a mix sump, a dense medium cyclone (DMC), a magnetite silo, drain and rinse screens, a magnetic separator, a corrected medium sump, and several actuators and instruments. Raw coal, after being deslimed and dewatered, is first sent to the mix sump and mixed with dense medium, and is then fed into the DMC. Under the action of gravity and centrifugal force, the mixed slurry in the DMC is separated into overflow and underflow slurries. The minerals lighter than the medium float and the denser ones sink, exiting at the top and bottom of the DMC, respectively. The overflow and underflow slurries are both taken to the drain and rinse screens, from which they are carried to the recovery circuit. The recovered medium flows into the magnetic separator, is extracted by screening, and is then sent to the corrected medium sump, where it is mixed with concentrated medium and water to maintain the dense medium density within a certain range. The valve and screw conveyor are used to adjust the water and magnetite, respectively.

It is well known that the ash content (AC) (%) is a principal criterion for evaluating coal quality, which is influenced by the mass feed rate of the ore a1, the dense medium density a2, and the feed pressure a3. The data model should be constructed to achieve the following nonlinear mapping:
AC = ψ(a1, a2, a3)   (29)
where ψ(·) is an unknown nonlinear function.

5.3. Model evaluation

1) Experimental setup: In Case 1, to establish the PS estimation model of the grinding process, a dataset containing 400 distinct samples is collected in the industrial field. The samples are divided into 300 training samples and 100 testing
Fig. 12. Flow diagram of a typical CDMS process.

Table 9. Performance comparisons among IRVFLN, SCN, BSC-I and BSC-II on the two cases. Training process (ε = 0.05).

Algorithm | Δk | PS estimation: t(s) | k | L | AC estimation: t(s) | k | L
IRVFLN | 1 | – | – | – | – | – | –
SCN | 1 | 0.1403 | 16.6 | 16.6 | 0.1521 | 20.8 | 20.8
BSC-I | 1 | 0.1407 | 16.9 | 16.9 | 0.1506 | 20.5 | 20.5
BSC-I | 2 | 0.1328 | 9.6 | 19.1 | 0.1394 | 10.8 | 21.5
BSC-I | 3 | 0.1207 | 6.6 | 19.8 | 0.1266 | 7.8 | 23.4
BSC-I | 5 | 0.1097 | 4.2 | 21.0 | 0.1134 | 4.8 | 24.0
BSC-I | 10 | 0.0934 | 2.2 | 22.0 | 0.1006 | 3.0 | 30.0
BSC-II | [1,3] | 0.1362 | 11.4 | 11.4 | 0.1412 | 8.5 | 22.9
BSC-II | [1,5] | 0.1354 | 6.9 | 17.7 | 0.1315 | 5.3 | 23.2
Fig. 13. Estimation performance of IRVFLN, SCN, BSC-I and BSC-II on the grinding process.
samples. It should be pointed out that the input weights v and biases b in IRVFLN are randomly assigned from [−1, 1]^5 and [−1, 1], while we take Lmax = 50, Tmax = 10, Υ = {1 : 5 : 201} and r = 0.999 in the original and extended SCNs. For readability, only partial testing results are shown, and the similar learning curves of other trials are omitted. In Case 2, 500 samples are also collected from the industrial field, and we randomly choose 80% of the samples as the training dataset while the remainder makes up the test dataset. With a similar data preprocessing strategy, the parameter settings are the same as in Case 1.

2) Results and discussions: Performance comparisons among the four algorithms on the two cases are recorded in Table 9, where the results are essentially in agreement with the two function approximations and the four real-world regression cases. The discussions are therefore omitted here, and readers can refer to the analysis of Tables 3, 5 and 8 for more details. Fig. 13 depicts the real and estimated values of the particle size. Aligning with our previous regression
Fig. 14. Estimation performance of IRVFLN, SCN, BSC-I and BSC-II on the CDMS process.
Fig. 15. Performance comparisons with different setting of k .
tasks, the estimated values obtained by the two proposed algorithms as well as the original SCN compactly surround the real values, much better than IRVFLN. A similar finding can be observed from the test results in Case 2 (Fig. 14). To further demonstrate the advantages of the block incremental approach, we present statistical results in Fig. 15 based on Table 9. It can be clearly seen that both the learning time and the network compactness vary with Δk for the two cases. In Fig. 15(a), the larger Δk (or its scope), the less the training time; on the contrary, Fig. 15(b) shows the opposite trend. Interestingly, one can easily find that BSC-I with Δk = 10 achieves the given error tolerance with a stable (nearly fixed) network size. The reason may be twofold. From the perspective of feature learning, the addition of hidden nodes can be regarded as trials of gaining features; moreover, adding hidden node blocks in each iteration will gain more features with a greater probability. In the two cases, BSC-I with Δk = 10 can always finish the modeling task within three iterations. Comparing Fig. 15(a) and (b), we can find that BSC-II improves the modeling time by adding slightly more hidden nodes, thereby achieving a comprehensive performance between SCN and BSC-I.

From the above results and analysis, the two proposed algorithms, each with its own superiority, can both achieve satisfying estimation accuracy. The choice can be based on engineering requirements. For instance, if an industrial control system has a limited memory space, the BSC-II algorithm will be the best choice; but if the model often needs to be rebuilt in a time-varying industrial process, the BSC-I algorithm is better by virtue of its fast learning performance. Before ending this work, we offer some recommendations on the selection of the block size (Δk), which is a crucial factor in our proposed block incremental approach. In the process industries, end-users who would like a much more compact network can set a smaller value (e.g., Δk = 1, 2, 3) or employ the original SCN. Those who pay more attention to modeling time can select a larger value (e.g., Δk = 5, 10), but over-fitting then needs to be taken seriously.

6. Conclusions

Fast remodeling is required in many process industries; it is thus significant to develop advanced learner models to improve learning efficiency. This paper proposes a new extension of stochastic configuration networks (SCNs) with block increments for problem solving. This version can speed up building SCN models while maintaining high accuracy. Two algorithmic implementations are presented with different policies for setting the hidden node block size, which gives each algorithm its distinctive merit. Using a fixed block size strategy, the first algorithm shows superiority in fast learning and is a preferable choice for large-scale modeling tasks. Incorporating a simulated annealing algorithm into a varied block size strategy, the second algorithm achieves a better comprehensive performance in efficiency and generalization. The impact of block size on learning performance has been fully investigated based on various benchmark tasks. The two practical industrial applications illustrate that the proposed approaches can effectively address data-driven quality index modeling problems in the process industries.
The extension is proposed to better address regression problems; its counterpart for recognition problems is not considered here and will be left for future work. Besides, it would be meaningful and interesting to further enhance SCNs by deducing a distinct block form of the supervisory mechanism with a lower computational load, improving the problem-dependent strategy for updating the hidden node block size with less human intervention, or developing advanced extensions with block increments that construct a more compact network.
Acknowledgements
The authors would like to thank the editors and reviewers for their valuable comments, which substantially improved the quality of this paper. In particular, the authors thank the Managing Guest Editor, Associate Professor Dianhui Wang, for suggesting that we reorganize the paper structure and focus on investigating the impact of block size on learning performance. This work was supported by the National Natural Science Foundation of China (grant numbers 61603393, 61873272), the Natural Science Foundation of Jiangsu Province (grant number BK20160275), the Postdoctoral Science Foundation of China (grant numbers 2015M581885, 2018T110571), and the Open Project Foundation of the State Key Laboratory of Synthetical Automation for Process Industries (grant number PAL-N201706).