Chemical Engineering Science 60 (2005) 399–412
www.elsevier.com/locate/ces

Fuzzy classification with an artificial chemical process

Roberto Irizarry*
DuPont Electronics, Microcircuit Industries, Ltd., P.O. Box 30200, Manati, PR 00674-8501, USA

Received 29 October 2003; received in revised form 3 May 2004; accepted 27 July 2004. Available online 22 October 2004.
Abstract

In this work, a new algorithm to extract a compact set of if/then rules from data for classification problems is presented. The premise part is extracted directly using LARES as the learning tool; LARES is a new global optimization procedure based on the recently introduced paradigm called the artificial chemical process. The conclusion part is determined using soft computing techniques. In the learning phase, the objective function minimizes the number of misclassified patterns from the training data and reduces the conflicts between the rules that generate the pattern partition. The proposed method has many potential applications in industrial processes. Several examples are presented, including fault detection and the operation of reactions with unstable regimes.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Fuzzy logic; Pattern classification; Linguistic model; Artificial chemical process; Global optimization
1. Introduction

1.1. Fuzzy and neuro-fuzzy systems

A fuzzy inference system consists of a set of rules described as if/then statements, which together determine the action (output) for a given situation (input). This capacity for explaining responses on the basis of human-like reasoning has proven to be a very powerful tool for industrial applications. Inference systems have been used in designing feedback controllers (De Carli et al., 1994) and extracting control rules for robotic applications (Zhang and Ferch, 2003). Combined with neural networks, fuzzy inference has been applied to the monitoring of nuclear reactors to estimate the departure from nucleate boiling protection limit (Na, 1999). Other applications to industrial processes that exhibit complex operational behavior include sintering processes (Er et al., 2000), polymer molding processes (Li et al., 2002), and the generation of activated sludge (Du et al., 1999), among others. In
addition to industrial applications and robotics, the development of inference systems also has important applications in medicine, specifically for medical diagnosis (Sanchez, 1998; Sanchez and Bartolin, 1990). This approach can prove very important to the production of high-performance materials in the electronics and pharmaceutical industries, which involves complex interactions between raw material properties, processing conditions and end-product properties.

Given its practical importance, automatic generation of inference systems from data is an important area of research. Algorithms have been developed that combine genetic algorithms (GAs) with fuzzy logic (Karr, 1991; Sugeno and Yasukawa, 1993; Wang and Mendel, 1992; Ishibuchi et al., 1992, 1995). It has been demonstrated that, for some neural networks, there is an equivalence between the neural network and a fuzzy inference system (Jang and Sun, 1993). Many algorithms exploit this concept by combining GAs, fuzzy logic and neural networks, where the rules are extracted from the trained network (De Carli et al., 1994; Zhang and Ferch, 2003; Na, 1999; Er et al., 2000; Li et al., 2002; Du et al., 1999; Simpson, 1992; Tsoukalas and Uhrig, 1997; Russo, 1998, 2000; Mastorocostas and Theocharis, 2000; Su and Chang, 2000; Chung et al., 2000). Most of the algorithms
consist of a five-layer structure, with the training process consisting of two phases. In the first phase, a GA is used to overcome the problem of getting trapped in local minima; in the second phase, back-propagation is used to improve output precision. Other methods use singular value decomposition to solve an over-specified linear system. Less attention has been paid to developing specialized algorithms for the classification problem. The algorithm developed by Ishibuchi et al. (1992, 1995) is one of the few methodologies developed specifically for classification. It consists of two phases: (1) fuzzy partition of a pattern space and (2) identification of the fuzzy rule for each subspace determined by the partition. The partition of the pattern space is based on a "multi-grid" partition. This coarse-to-fine partition generates high classification power, but it also generates a very large number of if/then rules. To alleviate this problem, a GA was utilized to reduce the number of rules while preserving high classification power (Ishibuchi et al., 1995). This method outperformed many other methods, such as fuzzy k-nearest neighbor, fuzzy c-means, and fuzzy integral with perceptron, in classifying the Iris data (Russo, 1998).
1.2. Optimization algorithms

The learning phase of fuzzy or neuro-fuzzy systems consists of solving a complex multi-modal optimization problem. When the structure of the model is also part of the solution, the learning phase becomes a complex multi-modal combinatorial optimization problem. As mentioned in the previous section, most of the works in the literature rely on GAs (Holland, 1975; Goldberg, 1989) for the training phase, due to their ability to escape from local minima and their coding capability for representing real and combinatorial variables. Other optimization algorithms, such as gradient-based methods, are not suitable for this task, since they get trapped in local minima. Evolution strategies (Schwefel, 1995), developed for real-parameter optimization, cannot be used when the structure of the problem is part of the solution. Simulated annealing is too slow, and its performance depends strongly on the cooling schedule.

In this article, LARES, a new optimization algorithm introduced by Irizarry (2004a), is used in the learning phase. This algorithm is based on a new paradigm called the artificial chemical process. LARES has the property of escaping from local minima and is also very flexible in encoding generic variables, both needed for the learning task. Its performance is very robust in terms of tuning parameters, and its operators are very simple and do not need to be modified. These properties allow LARES to be used as a black box without tuning by an expert, a desired property for application engineers. At the same time, the structure of the operators is simple enough for an expert to tailor the algorithm to a particular application. See Irizarry (2004a) for more details on the properties of the algorithm.
The main objectives of this article are (a) to develop a simple algorithm to generate a compact set of fuzzy rules for the classification problem and (b) to demonstrate that LARES' synergy with fuzzy systems can also be an efficient tool in the learning phase. This paper is organized as follows: in Section 2, the LARES algorithm is described; in Section 3, the new algorithm for pattern classification, LFC, is presented; in Section 4, the results are presented; and the conclusions are given in Section 5.
2. LARES artificial chemical process

The optimization problem: Consider the optimization problem in its most general form:

min F(θ), θ ∈ Ω,    (1)

where θ is the vector of decision variables to be determined, Ω is the feasible space of the decision variables, and F is the performance index to be minimized. No restriction is assumed on the type of variables (continuous, integer, combinatorial, non-numerical or symbolic, etc.), and there is no restriction on the type of function F. The application of LARES to solve the optimization problem consists of:

• Encoding the decision variables, θ, into a new set of variables called molecules.
• Evaluating the performance index at each iteration.

Representation of the decision vector with molecules: The possible solutions to the problem (decision variables) are encoded into a finite set of variables (x_j, j = 1, ..., V) called molecules (θ = H(x)). These molecule variables are defined over a small range of discrete values. Let Φ_j be the set of possible values that the variable x_j can take, Φ_j = {φ_1^j, ..., φ_{M_j}^j}, where M_j is the total number of possible values for molecule j. The discrete value assigned to the molecule is called its state. If the optimization includes real parameters as part of the decision vector, binary encoding similar to GA can be used, but in this case each bit represents a molecule.

Reaction or activated state: The state (value assigned to the molecule variable) of the best solution found so far is called the ground state, x_j^g. During each iteration, some of the molecules change states systematically to generate new trial vectors. This "reaction" to generate an activated state is a powerful concept, different from GA or other population-based algorithms. As will be seen in the description of the algorithm, all of the operators act using this concept, taking one molecule at a time, thus making each molecule independent of the others. This concept is very flexible and can be advantageous for certain types of problems: (1) Each molecule can have a different range of states that can be visited (Φ_j), which allows multiple encoding to be a natural part of the algorithm. This property is not shared with other
algorithms like genetic algorithms. (2) Constraints or bias (provided by problem-specific information) can be added to the states that a given molecule can visit, in order to satisfy constraints or to add knowledge-based information for the problem at hand. (3) Since the molecule variables are independent of each other, the molecule vector can be indexed in any desired way (x_i, x_ijk, etc.).

Underlying principles of the algorithm: This is an iterative improvement methodology, which considers one solution at a time. Let Z be the set of all V molecules in the system and x_j^g the state of molecule j in the best solution found so far, x^g = (x_1^g, ..., x_V^g). If this solution can be improved, then there exists A ⊆ Z in which each molecule has a new state, x_j^t = x_j^a ≠ x_j^g ∀ j ∈ A, while all other molecules remain in the state of the best solution found so far (x_j^t = x_j^g ∀ j ∈ Z\A), generating a new vector x^t such that F(x^t) < F(x^g). To find this set, the following strategy, called the artificial chemical process, is applied. The first step is to perturb the system by selecting a random set AR and assigning a new random state to each of its elements, x_j^a ≠ x_j^g ∀ j ∈ AR. If this is the desired set (AR = A), the trial state vector is accepted as the new best value found so far. If not, the following hypothesis is postulated to improve the perturbation with the hope of finding A:

H1: ∃ A ⊆ AR, in which each molecule in A has a new state, generating a new vector x^t such that F(x^t) < F(x^g).

The hypothesis H1 is tested using the following iterative procedure. In each iteration, select E molecules from AR and return their states to those of the best value found so far (x_j^t = x_j^g ∀ j ∈ E). If the new perturbation improves (by some definition of improvement) on the previous one, then AR = AR\E. Otherwise, all molecules in E are returned to AR in a new random state. This process is continued until A is found or until a predetermined termination criterion for AR is achieved. In the second case it is assumed that H1 is false, so a brand new AR is generated and H1 is tested again. Iterations continue until an algorithm termination criterion is achieved. The elements of the algorithm are:

1. Selection of the perturbation set AR by a probability distribution function (PDF) and selection of new states for each molecule in AR.
2. Selection of the extraction set E by a PDF.
3. The improvement criterion for deciding whether molecules in E are returned to AR in a new state or are separated from AR and returned to the state of the best solution found so far.

2.1. Algorithm outline

Let L, AR, E and S be four disjoint sets whose elements are the molecule variables. Let F be the objective function to be minimized, expressed in terms of the vector of molecule variables, F = F(x). Let the value of the best solution found so far be x^g = (x_1^g, ..., x_V^g). The algorithm consists of redistributing the molecule variables among the sets to generate new trial vectors, using the following rule: whenever a molecule variable x_j becomes a member of the set AR, a new value different from the best value found is assigned to it. Let x_j^a ≠ x_j^g be the new value, which will not change until the variable is moved out of the set AR. For each iteration, the trial vector x^t = (x_1^t, ..., x_V^t) is then constructed using the following formula:

x_j^t = x_j^g if x_j ∉ AR;  x_j^t = x_j^a if x_j ∈ AR.    (2)

With these definitions, the algorithm is described next.

Initialization:
1. The algorithm starts by initializing x^g randomly and placing all variables in L (set AR = E = S = ∅, L = {x_1, ..., x_V}).

Outer loop: Perturbation to form AR.
2. Select a random number |T_rx| (|T_rx| ≤ |L|) from a uniform PDF:

|T_rx| = min(int([ξ V c_o] + 1), |L|),    (3)

where ξ is uniformly distributed in (0,1) and c_o is an adjustable parameter used to select the average fraction of elements to be selected from V, the total number of molecules representing the possible solutions to the problem.
3. Select |T_rx| elements randomly from L to form the subset T_rx (T_rx ⊆ L).
4. Transfer the subset to AR: L = L\T_rx; AR = AR ∪ T_rx.
5. Select a random new value for each molecule variable in T_rx: x_j^a ≠ x_j^g ∀ j ∈ T_rx.
6. The new trial vector, x^t, is generated using Eq. (2).
7. If F(x^t) < F(x^g), the trial vector is accepted as the new best solution found, x^g = x^t. In this case, all of the elements in AR are sent to the set S: S = S ∪ AR; AR = ∅. If the algorithm termination criterion is achieved, exit the algorithm and return x^g as the solution to the optimization problem.
8. If a better solution was found in step 7, skip the inner loop and perform another outer-loop iteration: go to step 17. Otherwise, continue with step 9.
9. Initialize the parameter RP: RP = F(x^t). This parameter is used and modified in the "goodness" test in the inner loop. Also set |AR|_0 = |AR|, the initial number of molecules in AR before starting the next inner loop.

Inner loop: Iterative improvement of AR.
10. Select a random number |E| (|E| ≤ |AR|) from a prescribed PDF. The formula used is

|E| = min(int([ξ |AR|_0 c_i] + 1), |AR|),    (4)
where |AR|_0 is defined in step 9 and c_i is another adjustable parameter.
11. Select |E| elements randomly from AR to form the subset E.
12. Extract the subset E from AR: AR = AR\E.
13. The new trial vector, x^t, is generated using Eq. (2).
14. If F(x^t) < F(x^g), the trial vector is accepted as the new best solution found, x^g = x^t. All of the elements in AR are sent to the set S: S = S ∪ AR; AR = ∅. If the algorithm termination criterion is achieved, exit the algorithm and return x^g as the solution to the optimization problem.
15. Improvement criterion for AR: If F(x^t) ≤ RP, the hypothesis is that there is a high probability that most elements in E prefer to stay in their ground state (x_j = x_j^g) to generate better solutions. In this case, the elements in E are transferred to S: S = S ∪ E; E = ∅, and the metric RP is updated, RP = F(x^t). When F(x^t) > RP, the hypothesis is that there is a high probability that most elements in E will induce a better solution if they are in a state different from their ground state. In this case, a new activated state is generated for all elements in E (x_j = x_j^a ≠ x_j^g ∀ j ∈ E), and all of the elements in E are transferred back to AR (AR = AR ∪ E; E = ∅).
Check conditions to exit or continue the inner loop:
16. If one of the following conditions is satisfied, the algorithm exits the inner loop (go to step 17):
• A better solution was found in step 14.
• The number of elements in AR is less than or equal to one, |AR| ≤ 1. For some particular problems this restriction could be generalized to a generic threshold parameter (|AR| < c), but this case is not considered in this work.
• A large recycle ratio (RR), defined as

RR = n_rec / |AR|_0 ≥ RR_T,    (5)

where n_rec is a counter of the number of times that E is sent back to AR in the current inner loop and |AR|_0, defined in step 9, is the initial number of molecules in AR generated in the outer loop. The parameter RR_T is adjustable.

Otherwise, a new inner-loop iteration is started: go to step 10. Note that most of the inner loops should terminate naturally, either by a better solution or by |AR| ≤ 1. RR_T is set large enough to let most of the inner loops terminate naturally, but it should not be too large, because then many iterations would be spent on an undesired inner loop, affecting the algorithm's performance.
Check the number of elements in L and AR, and the algorithm termination criterion:
17. Check the number of elements in the set L. If the number of elements in L is below a prescribed value, LT, all of the elements in the set S are transferred: L = L ∪ S; S = ∅. If the number of elements in L is still too low (|L| ≤ LT), then all of the elements in AR are transferred to L: L = L ∪ AR; AR = ∅.
18. Start an outer-loop iteration, returning to step 2.

The following parameter values are used: RR_T = 1.0, c_o = 0.3, c_i = 0.25, LT = V/2. A flow chart of the algorithm is shown in Fig. 1.

2.2. The algorithm viewed as an artificial chemical process

This procedure has many analogies with a real chemical process. The molecule variables can be viewed as molecules that are transformed from one state to another (variable value) by a "reversible chemical reaction". First the Activation Reactor, AR, is loaded from the Load unit, L. A reaction is performed, and the undesired byproducts are separated from the product by separation processes. In this separation process, some molecules are extracted from the activation reactor and sent to the Extraction unit, E. If the Reactor Performance, RP, is improved, the extracted material is sent to the Separation unit, S. Otherwise, the material is recycled back into the activation reactor to be re-processed. These outer-inner iterations (reaction-purification) continue until the product cannot be improved any more. A schematic representation of this process is shown in Fig. 2.

Fig. 1. Flow chart of the LARES algorithm.

Fig. 2. Schematic representation of LARES.
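To make the set manipulations above concrete, the following is a minimal sketch of LARES in Python for binary molecules. The function names, the uniform random choices, and the toy usage are illustrative assumptions of this sketch; only the numbered steps and Eqs. (2)-(5) come from the outline above.

```python
import random

def lares(objective, V, iters=20000, co=0.3, ci=0.25, rrt=1.0):
    """Minimal LARES sketch for binary molecules (states {0, 1})."""
    xg = [random.randint(0, 1) for _ in range(V)]   # best solution so far
    fg = objective(xg)
    L, AR, S = list(range(V)), [], []               # molecule index sets
    xa = xg[:]                                      # activated states
    for _ in range(iters):
        # Step 17: housekeeping of L with LT = V/2.
        if len(L) < V // 2:
            L, S = L + S, []
            if len(L) < V // 2:
                L, AR = L + AR, []
        # Steps 2-6: load AR from L and activate the selected molecules.
        n = min(int(random.random() * V * co) + 1, len(L))      # Eq. (3)
        random.shuffle(L)
        Trx, L = L[:n], L[n:]
        AR += Trx
        for j in Trx:
            xa[j] = 1 - xg[j]   # new state (binary: flip the ground state)
        xt = [xa[j] if j in AR else xg[j] for j in range(V)]    # Eq. (2)
        ft = objective(xt)
        if ft < fg:             # step 7: accept and empty AR into S
            xg, fg, S, AR = xt, ft, S + AR, []
            continue            # step 8: skip the inner loop
        rp, ar0, rec = ft, len(AR), 0                           # step 9
        # Steps 10-16: inner loop, iterative improvement of AR.
        while len(AR) > 1 and rec < rrt * ar0:
            ne = min(int(random.random() * ar0 * ci) + 1, len(AR))  # Eq. (4)
            random.shuffle(AR)
            E, AR = AR[:ne], AR[ne:]   # deactivate E (back to ground state)
            xt = [xa[j] if j in AR else xg[j] for j in range(V)]
            ft = objective(xt)
            if ft < fg:                # step 14: accept, exit inner loop
                xg, fg, S, AR = xt, ft, S + AR, []
                break
            if ft <= rp:               # step 15: improvement, keep E out
                S, E, rp = S + E, [], ft
            else:                      # recycle E back to AR, reactivated
                for j in E:
                    xa[j] = 1 - xg[j]
                AR, E, rec = AR + E, [], rec + 1
    return xg, fg
```

For example, lares(sum, V=32) drives a 32-bit string toward the all-zero optimum, since the objective simply counts the bits set to one.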
3. LFC algorithm

The following algorithm, called LARES fuzzy classification (LFC), generates a fuzzy inference system for automatic pattern classification from data, using LARES as the learning tool. The hyperstructure of the algorithm is shown in Fig. 3. At each LARES iteration, the premise part is generated from the new trial vector, and the pattern data are then used to generate the conclusion part using soft computing techniques. The resulting fuzzy inference system is tested in terms of classification power on the same data to determine the objective function to be minimized by LARES. This method is an alternative approach, which generates fuzzy rules to classify patterns without the use of a grid partition, SVD or a fixed neural-network structure. The algorithm is described in detail in the following sections. In Section 3.1, the steps to generate the premise part are described. The premise part of this structure is encoded into a set of molecule variables as described in Section 3.1.1. In Section 3.1.2, a new transformation called the re-assignment transformation is discussed for handling more flexible membership functions.
Fig. 3. Hyperstructure of the LFC algorithm.

Fig. 4. Trapezoid membership function defined in Eq. (10). The parameters are the node locations, which must satisfy the constraint in Eq. (11). The RT is used to handle this type of membership function.
The conclusion part is generated from the training data as described in Section 3.2. This "trial" fuzzy algorithm is tested using the inference engine on the same pattern data set. The objective function is then constructed to maximize classification power, as described in Section 3.3. Remarks on algorithm implementation are presented in Section 3.4.

3.1. Generating the premise part using LARES

The rule-base system considered in this algorithm consists of R rules of the following generic structure:

Rule r: IF x_1 is A_1r AND x_2 is A_2r AND ... AND x_N is A_Nr THEN x belongs to C_r,    (6)
where x_i is the input for variable i to the fuzzy inference system and A_ir is the fuzzy set associated with variable i and rule r. The class C_r is the consequence of rule r. Each fuzzy set A_ir is described by a one-dimensional membership function, μ_ir(x_i). The model includes a combinatorial variable δ_ir that defines the existence of the predicate corresponding to variable i and rule r: when δ_ir = 1, variable i forms part of the premise of rule r, that is, "x_i is A_ir" is part of the premise; otherwise δ_ir is zero. There are several alternatives for the membership functions used to describe the fuzzy set. In this section, it is assumed that all variables are normalized. The membership function is of the form

μ_ir(x_i) = μ(x_i; p_ir),    (7)

where x_i is defined in the interval [0,1] and p_ir is the vector of parameters that define the membership function. The triangular membership function is

μ(x_i; p_ir) = max{1 − |x_i − a_ir|/b_ir, 0}.    (8)

The grade of membership defined by this equation is 1 when x_i = a_ir and positive in the open interval (a_ir − b_ir, a_ir + b_ir). A Gaussian membership function is another two-parameter model that can be used,

μ(x_i; p_ir) = g(c_ir, a_ir) ≡ exp(−(x_i − c_ir)²/a_ir²),    (9)

where c_ir is the center of the distribution and a_ir is the width. A more generic membership function is the trapezoid function defined by four parameters:

μ(y; p_ir) = 0 if y ≤ a_ir,
μ(y; p_ir) = (y − a_ir)/(b_ir − a_ir) if a_ir < y < b_ir,
μ(y; p_ir) = 1 if b_ir ≤ y ≤ c_ir,
μ(y; p_ir) = (d_ir − y)/(d_ir − c_ir) if c_ir < y < d_ir,
μ(y; p_ir) = 0 if y ≥ d_ir,    (10)

where

0 ≤ a_ir ≤ b_ir ≤ c_ir ≤ d_ir ≤ 1.    (11)

The four parameters, a_ir, b_ir, c_ir and d_ir, locate the positions of the four nodes as shown in Fig. 4. In this case, the grade of membership is equal to 1 in the closed interval [b_ir, c_ir] and positive in the open intervals (a_ir, b_ir) and (c_ir, d_ir). This membership function is very flexible, since the non-fuzzy part is explicitly represented and the fuzzy part is not limited to being symmetric. For some applications, the relative importance of the variables must be weighted (Sanchez, 1998). In this model, the membership function is modified by α-cut functions,

μ̃(x_i; p_ir) = max(α_ir, μ(x_i; p_ir)),    (12)

where α_ir is defined between zero and one. A value of one indicates that the variable does not contribute to the premise.
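For illustration, the membership functions of Eqs. (8)-(10) and the α-cut of Eq. (12) can be coded directly. This is a sketch; the function names are assumptions.

```python
import math

def triangular(x, a, b):
    # Eq. (8): grade 1 at x = a, positive on the open interval (a - b, a + b).
    return max(1.0 - abs(x - a) / b, 0.0)

def gaussian(x, c, a):
    # Eq. (9): center c, width a.
    return math.exp(-((x - c) ** 2) / a ** 2)

def trapezoid(x, a, b, c, d):
    # Eq. (10): grade 1 on [b, c], linear flanks on (a, b) and (c, d).
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

def alpha_cut(mu, alpha):
    # Eq. (12): a variable with alpha = 1 drops out of the premise.
    return max(alpha, mu)
```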
For a given input pattern vector x, the fire strength of the precedent of a fuzzy rule r is given by

μ_r(x) = ∏_{i=1..N, δ_ir=1} μ(x_i; p_ir),    (13)

when the multiplication operator is utilized. Alternatively, when the min operator is utilized, the fire strength of rule r is given by

μ_r(x) = min_{i=1..N, δ_ir=1} μ(x_i; p_ir).    (14)

Notice that the combinatorial variable δ_ir determines which variables form part of the premise.

3.1.1. Encoding the premise using molecule variables

The decision variables in LARES are the parameters p_ir, which define the membership functions μ_ir(x_i) = μ(x_i; p_ir), and the combinatorial variables δ_ir. There are P·N·R parameters and N·R pointer variables, where P is the number of adjustable parameters for the membership function utilized. For each real parameter, K molecule variables are used, each of which can have two possible state values (0 and 1). Binary encoding is then used to map the states of the molecules to the value of the real parameter. For each pointer δ_ir, a molecule variable is used whose value equals the value of the pointer, where the molecule is the one assigned to represent the value of δ_ir.
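A sketch of the binary decoding for one real parameter, assuming a K-bit uniform quantization of [0, 1] (the paper states only that GA-style binary encoding is used):

```python
def decode_real(bits):
    # Map the states of K binary molecules to a real parameter in [0, 1].
    k = len(bits)
    value = sum(b << i for i, b in enumerate(bits))
    return value / (2 ** k - 1)

# Example: 8 molecules per parameter give a resolution of 1/255.
print(decode_real([1, 0, 1, 1, 0, 0, 1, 0]))  # -> 0.3019...
```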
3.1.2. Encoding non-symmetrical membership functions

When the trapezoid membership function is utilized, the constraint in Eq. (11) adds an additional difficulty in using LARES, as it would be violated frequently if the parameters (a_ir, b_ir, c_ir, d_ir) were used directly as decision variables in the optimization problem. To avoid the complexity of this constraint, a two-step encoding strategy (Irizarry, 2004b) is utilized so that the constraint is satisfied in all iterations. To encode this type of membership function, instead of using the parameters (a_ir, b_ir, c_ir, d_ir) as part of the decision variables, a set of "computational" parameters (z_ir,1, z_ir,2, z_ir,3, z_ir,4), also defined in [0,1] × [0,1] × [0,1] × [0,1], is used as decision variables. LARES samples in the computational domain, and a transformation then determines the original problem parameters. Mapping from the computational space to the parameter space is accomplished by first sorting the computational parameters in ascending order (Bean, 1994):

z_ir,s(1) ≤ z_ir,s(2) ≤ z_ir,s(3) ≤ z_ir,s(4),    (15)

where s(n) is the variable of the computational space with the nth rank in the sorted list. The parameters of the membership function are then set to the parameters of the computational space with the corresponding index:

a_ir = z_ir,s(1), b_ir = z_ir,s(2), c_ir = z_ir,s(3), d_ir = z_ir,s(4).    (16)

This transformation is called the re-assignment transformation (RT).
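The RT thus reduces to a single sort; for instance:

```python
def reassignment_transform(z):
    # Eqs. (15)-(16): sorting the computational parameters guarantees the
    # node ordering a <= b <= c <= d required by Eq. (11).
    a, b, c, d = sorted(z)
    return a, b, c, d

print(reassignment_transform([0.9, 0.2, 0.5, 0.7]))  # -> (0.2, 0.5, 0.7, 0.9)
```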
3.2. Procedure to determine the consequent part for classification

Given the antecedents of the fuzzy rule system from the LARES trial vector, the consequent part is generated using soft computing techniques (Ishibuchi et al., 1992). This is a very effective and rapid method for calculating the conclusion for the classification problem. The training data consist of T patterns. Each pattern belongs to one of the M possible classes. Each pattern p in the training set consists of N variables, x^p = (x_1^p, ..., x_N^p), and the corresponding class is C_p. It is assumed that all variables are normalized. The procedure for determining the conclusion of a rule is as follows: for each rule r, calculate the cumulative firing power of all patterns for each class C_i,

β_Ci = Σ_{p ∈ Ci} μ_r(x^p).    (17)

The class with the largest firing strength is the consequence of the rule:

C_r = C_b such that β_Cb = max{β_C1, ..., β_CM}.    (18)

If there is more than one class with the largest firing strength, the consequence cannot be resolved and the rule is eliminated from the inference engine. Similarly, if β_Cb is zero, there are no data in the partition covered by the premise and the rule is also eliminated from the inference engine. Let the segregation parameter be defined by

S_r ≡ β_Cb / Σ_{i=1..M} β_Ci.    (19)

Although this parameter does not form part of the fuzzy inference system, it will be utilized in the objective function to further manipulate the fuzzy partition.

Inference mechanism: Given the rule set, to classify a new pattern x^t, the inference mechanism consists of finding the rule with maximum fire strength:

μ_r̄(x^t) = max{μ_r(x^t) | rule r exists}.    (20)

Then the classification of x^t is the consequence of rule r̄, C_r̄. If there is more than one class with the maximum μ, the pattern is unclassified.
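The following sketch assembles Eqs. (13)-(20). The rule representation, one membership callable per variable with None standing for δ_ir = 0, is an assumption of this sketch; it composes directly with the membership functions sketched in Section 3.1.

```python
import math

MULT = math.prod   # multiplication operator of Eq. (13); min gives Eq. (14)

def fire_strength(rule, x, op=min):
    # A rule is a list with one membership callable per variable,
    # with None standing for delta_ir = 0 (variable absent from premise).
    grades = [mf(xi) for mf, xi in zip(rule, x) if mf is not None]
    return op(grades) if grades else 0.0

def rule_consequents(rules, patterns, labels, n_classes, op=min):
    # Eqs. (17)-(19): cumulative firing power per class fixes each rule's
    # conclusion C_r; a tie or an empty partition eliminates the rule.
    out = []
    for rule in rules:
        beta = [0.0] * n_classes
        for x, c in zip(patterns, labels):
            beta[c] += fire_strength(rule, x, op)
        top = max(beta)
        if top == 0.0 or beta.count(top) > 1:
            out.append((None, 0.0))                         # rule eliminated
        else:
            out.append((beta.index(top), top / sum(beta)))  # (C_r, S_r)
    return out

def classify(rules, consequents, x, op=min):
    # Eq. (20): the surviving rule with maximum fire strength decides;
    # a tie between different classes leaves the pattern unclassified.
    scored = [(fire_strength(rule, x, op), c)
              for rule, (c, _) in zip(rules, consequents) if c is not None]
    if not scored:
        return None
    best = max(s for s, _ in scored)
    winners = {c for s, c in scored if s == best}
    return winners.pop() if len(winners) == 1 else None
```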
3.3. Objective function for the classification problem

To evaluate the performance of the trial inference system at each LARES iteration, the inference system is applied to the training data. Let N_fail be the number of patterns that were misclassified or not classified. The objective function to be minimized by LARES is of the form

F = N_fail + w Σ_{r=1..R} (1 − S_r)²,    (21)

where w is a weight factor. The second term is used to reduce conflict within a given rule so as to improve the extracted knowledge. This second term is very important when it is desirable to determine which regions of the input space belong to one class with 100% accuracy.
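Eq. (21) is then a one-line computation once N_fail and the S_r values are available (a minimal sketch; the experiments of Section 4 use w = 1):

```python
def objective_value(n_fail, segregations, w=1.0):
    # Eq. (21): misclassified/unclassified count plus rule-conflict penalty.
    return n_fail + w * sum((1.0 - s) ** 2 for s in segregations)
```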
3.4. Remarks on algorithm implementation

To implement the algorithm, the only information needed is the training patterns, x^p = (x_1^p, ..., x_N^p; C_p), p = 1, ..., N_p. The user-specified parameters are (a) the number of rules, R, (b) the type of structure identification, and (c) the type of membership function and operator. The number of rules is the most important parameter of the algorithm. Some remarks on these tuning parameters follow.

(a) Number of rules: This is the most important parameter of the algorithm. The optimal number is found by starting with a low value of R and generating the inference system using LFC. The procedure is repeated with an increased number of rules, R = R + ΔR (where ΔR is a small number; a value of one is recommended), until there is no further improvement in classification rate from increasing the number of rules; a sketch of this continuation loop is given at the end of this section. All other user-specified parameters are held constant during this continuation process. This approach is efficient in finding a compact set of rules, whereas starting with a large value of R risks over-specifying the system. As an example, the Iris data discussed later give the following performance: (R, classification rate) = (2, 50%), (3, 98%), (4, 98%), (5, 99.3%) when the triangular membership function with the min operator is used. When the trapezoid membership function is utilized, the classification power increases to greater than 99% with four rules and 100% with five rules. In this case an inference system of 4-5 rules is selected.

(b) Type of structure identification: The default is to have the combinatorial variables δ_ir active (part of the solution of the optimization problem), in order to identify the structure of the fuzzy rules. When Eq. (13) is used, the combinatorial variables can be eliminated from the set of decision variables, since the α-cut will eliminate the terms that do not contribute to the rule. When both are disabled, it is implied that all variables are
important in each rule, which is not recommended for systems with many variables.

(c) Membership function and operator: As a default, the triangular membership function with the min operator can be utilized to simplify interpretation of the fuzzy system. The trapezoid membership function can be selected if the user wishes to separate the non-fuzzy parts of the domain from the fuzzy parts. This second alternative makes the learning phase slower, due to the twofold increase in parameters for a given problem and the additional nonlinearity introduced by the transformation of Section 3.1.2.

As discussed earlier, LARES is very robust, and its optimization parameters are fixed with no further manipulation. The weight factor in Eq. (21) is set to zero by default; if the user desires each rule to be as independent from the others as possible, this parameter can be set to unity.
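A sketch of the continuation loop on R described in remark (a); train_lfc is a hypothetical callable that runs LFC with R rules and returns the trained system and its classification rate:

```python
def find_compact_rule_base(train_lfc, r_start=2, dr=1):
    # Grow R until the classification rate stops improving, then return
    # the last system that did improve (the compact rule base).
    best_rate, best_system, r = -1.0, None, r_start
    while True:
        system, rate = train_lfc(r)    # one full LFC run with R = r rules
        if rate <= best_rate:
            return best_system, best_rate
        best_system, best_rate, r = system, rate, r + dr
```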
4. Test results

The performance of the LFC procedure is tested on classification benchmark problems. The Iris data have been used extensively to test new algorithms; see Wang and Mendel (1992) and references therein. The second set of benchmark problems consists of a set of non-linear functions used by Ishibuchi et al. (1992) to test their algorithm. The algorithm is also applied to potential applications in chemical engineering: the first application considers a fault detection problem in a chemical reactor, and the second involves classification of areas of stable operation vs. unstable operation of a reactor.

4.1. Iris data

The Iris data (Fisher, 1936) consist of four inputs describing three possible classes: Iris setosa, virginica, and versicolor. The input variables are sepal length in cm (in the range [4.3, 7.9]), sepal width in cm (in the range [2.0, 4.4]), petal length in cm (in the range [1.0, 6.9]), and petal width in cm (in the range [0.1, 2.5]). The best results found for this type of problem are those of Ishibuchi et al. (1995), with high classification power using 13 rules, and Russo (1998, 2000), with high classification power using five rules. The test results using LFC are summarized in Table 1.

Table 1. Pattern classification for the Iris data

Type of operator   Membership function   Number of rules   % classified
Mult.              Trapezoid             5                 100
Mult.              Triangular            5                 100
Mult.              Gaussian              5                 99.3
Min                Trapezoid             5                 99.3
Min                Trapezoid             4                 99.3
Fig. 5. Fuzzy rules for the Iris problem, using five rules with the triangular membership function and the multiplication operator.
Fig. 6. Inference system for the Iris data, using four rules with the trapezoid membership function and the min operator.
In all simulations w was set to unity, and the maximum number of iterations was 300,000, although in most cases the classification was above 99% within 3000-30,000 iterations. With only five rules, the algorithm classifies the Iris data very effectively, and with four rules 99.3% of the data are classified correctly. Fig. 5 shows the inference system using five rules, with the triangular membership function and the multiplication operator. Fig. 6 shows the resulting inference system with only four rules, using the trapezoid membership function and the min operator. This system is very compact and easy to analyze; it is probably one of the most compact inference systems developed in the literature for the Iris data.

The effect of adding the second factor to the objective function (w = 1) is shown in Fig. 7. When w = 1, conflict between rules is minimized while high classification power is preserved. Rules with S_r = 1, in addition to forming part of the inference system, can be examined separately, since they define regions in the pattern space that are not fuzzy and represent only one class.

The effect of different parameters on the performance of LFC for the Iris data is shown in Fig. 8. The algorithm is robust in terms of parameter selection. At the end of the run (300,000 iterations), all cases reach a classification power of 99.3% or higher. For the triangular membership function the classification power is above 99% within the first 10,000 iterations, while for the trapezoid membership function it is above 98% within the first 20,000 iterations.
Fig. 7. The effect of S_r in the objective function. When w = 1, three out of five rules represent a given class with no other class included in the partition described by the rule. These rules can be analyzed in an isolated manner, in addition to forming part of the inference system. (Axes: segregation coefficient S_r vs. rule number.)

Fig. 8. LFC best-so-far curves as a function of different parameters for the Iris classification problem (first 30,000 iterations with R = 5): (a) mf = triangular, operator = min, opt = combinatorial; (b) mf = triangular, operator = mult, opt = combinatorial; (c) mf = trapezoid, operator = min, opt = combinatorial; (d) mf = trapezoid, operator = min, opt = combinatorial; (e) mf = triangular, operator = mult, opt = combinatorial disabled; (f) mf = triangular, operator = mult, opt = α-cut variable included.
4.2. Pattern classification of non-linear functions

The following functions are used to generate a partition of the space [0,1] × [0,1] into two classes:

f_1(x) = −sin(2πx_1)/4 + x_2 − 0.5,    (22a)
f_2(x) = −sin(2πx_1)/3 + x_2 − 0.5,    (22b)
f_3(x) = −sin(2πx_1 − π/2)/3 + x_2 − 0.5,    (22c)
f_4(x) = −|−2x_1 + 1| + x_2,    (22d)
f_5(x) = (x_1 + x_2 − 1)(−x_1 + x_2),    (22e)
f_6(x) = −(x_1 − 0.5)²/0.4² + (x_2 − 0.5)²/0.3² + 1,    (22f)
f_7(x) = −(x_1 − 0.5)²/0.15² + (x_2 − 0.5)²/0.2² + 1.    (22g)
Classes 1 and 2 are defined by y = sign(f(x)). These functions were used by Ishibuchi et al. (1992) to test their algorithm. For each function, 10 problems are generated; for each problem, the training data consist of 50 random patterns per class and the test data consist of 50 random patterns per class. Ishibuchi et al. (1992) solved this problem using the multi-grid method. With four rules, the average classification power over the 70 instances was 68.4%; it increased to 91% with 29 rules, to 94.6% with 203 rules, and to 94.2% with 1014 rules.

The results using LFC are presented in Table 2, which gives the average results over the 70 instances and shows the effect of the number of rules and of different parameters. These results demonstrate very good classification power for training and test data with a very compact set of rules. With only two rules, the training data can be classified with 98% accuracy and the test data with 94.3% accuracy; with five rules, these figures improve to 99.5% and 96%, respectively. In these simulations, the combinatorial variables δ_ir were part of the decision variables to be determined by LARES, with a termination criterion of 20,000 iterations. In many cases, the optimal solution was found in fewer than 6000 iterations. The objective function included the segregation term with w = 1.0.

Table 2. Pattern classification for the non-linear function benchmark problem (average over the 70 instances)

Type of operator   Membership function   Number of rules   Training data (%)   Test data (%)
Mult.              Gaussian              2                 98.0                94.3
Mult.              Gaussian              3                 99.1                95.2
Mult.              Gaussian              4                 99.1                95.3
Mult.              Gaussian              5                 99.5                96.0
Min                Gaussian              5                 99.4                94.5
Mult.              Triangular            5                 99.2                94.7
Min                Triangular            5                 99.3                94.5
Fig. 9. Fault detection problem rules.
4.3. Fault detection in a chemical reactor
Fault detection has been identified as a very important area of research for control engineers. The cost of abnormal event management (AEM) in the process industries is on the order of billions of dollars per year, as discussed in Venkatasubramanian et al. (2003) and references therein. The first step in AEM is the diagnosis step, which can be viewed as a classification problem: the input patterns are the process variables, and the classification is the type of fault. The reactor studied in Venkatasubramanian et al. (1990) is used as a prototype fault detection problem. This problem was also analyzed in Agrawal et al. (2003) using support vector machines; it is described in detail in Venkatasubramanian et al. (1990) and Agrawal et al. (2003) and is not repeated here. The pattern space consists of six measured variables, and the case study consists of finding an inference system that can differentiate between a single fault and a double fault.

Fig. 9 shows the resulting inference system when the trapezoid membership function is utilized with the min operator. The combinatorial variables were included in the decision vector, and the objective function included the segregation term with w set to unity. The data were classified 100% correctly with five rules. Table 3 shows different combinations of operators and membership functions; for some combinations, all of the data can be classified with four rules.

Table 3. Pattern classification for the fault detection problem

Type of operator   Membership function   Number of rules   % classified
Min                Trapezoid             4                 92
Min                Trapezoid             5                 100
Mult               Trapezoid             4                 100
Min                Triangular            4                 100
Mult               Triangular            4                 100
4.4. Chemical processes with complex dynamics

Chemical processes can have regions of unstable operation that can generate runaway reactions with large fluctuations in temperature and concentration. In some applications, avoiding those conditions is very important from a quality and safety point of view. This problem can also be viewed as a classification problem in which we want to distinguish between conditions for unstable operation and conditions for stable operation. To simulate such chemical processes, consider the CSTR with a first-order irreversible reaction studied by Uppal et al. (1974, 1976). This system may present steady-state, multiple steady-state, oscillatory, or chaotic solutions, depending on the operating conditions. The system is modeled by two differential equations for the mass and energy balances.
Fig. 10. Example of (a) steady-state behavior and (b) oscillatory behavior for the CSTR problem.
Mass balance:

dy/dτ = −y + Da(1 − y) exp[θ/(1 + θ/γ)].    (23)

Energy balance:

dθ/dτ = −θ + Da·B(1 − y) exp[θ/(1 + θ/γ)] − β(θ − θ_c),    (24)

where y is the reactant conversion, θ is the dimensionless temperature, Da is the Damköhler number, defined as the ratio of the residence time to the characteristic reaction time, γ is the dimensionless activation energy, B is the dimensionless heat of reaction, β is the dimensionless heat transfer coefficient and θ_c is the dimensionless coolant temperature. The numerical solutions can be classified into steady-state solutions as class one and unstable solutions as class two. Simulated process data are generated by sampling the parameter space uniformly and integrating Eqs. (23) and (24) until τ = 100.
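A sketch of this data-generation step, assuming SciPy's solve_ivp, the initial condition (y, θ) = (0, 0), γ fixed at 20, and a simple tail-amplitude test for class two; none of these details are reported in the text:

```python
import numpy as np
from scipy.integrate import solve_ivp

def cstr_rhs(t, s, Da, B, beta, gamma=20.0, theta_c=0.0):
    # Eqs. (23)-(24): dimensionless mass and energy balances.
    y, theta = s
    rate = Da * (1.0 - y) * np.exp(theta / (1.0 + theta / gamma))
    return [-y + rate, -theta + B * rate - beta * (theta - theta_c)]

def is_class_two(Da, B, beta, t_end=100.0, tol=1e-3):
    # Class two (unstable) when the conversion still fluctuates near the
    # end of the run; otherwise the trajectory has settled (class one).
    sol = solve_ivp(cstr_rhs, (0.0, t_end), [0.0, 0.0],
                    args=(Da, B, beta), dense_output=True, rtol=1e-8)
    tail = sol.sol(np.linspace(0.9 * t_end, t_end, 200))[0]
    return tail.max() - tail.min() > tol
```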
Examples of class-one and class-two operation are shown in Fig. 10. The pattern space consists of (Da, β, B): Da varies in the interval [0, 0.2], B in the interval [15, 25], and β in the interval [1.5, 3.0]. The parameter θ_c is kept constant and equal to zero in all simulations. 100 samples were generated for the training data and 100 samples for the test data.

Table 4 shows the results for four study cases. In the first experiment, the combinatorial variables were not included as part of the decision variables (all terms form part of the premise), while in the last three experiments the combinatorial variables formed part of the decision variables. Using 10 rules the algorithm was able to classify the training
data with 99-100% accuracy and the test data with 91-95% accuracy.

Table 4. Pattern classification for the non-linear reactor inference system

Type of operator   Membership function   Number of rules   Training data (%)   Test data (%)   Combinatorial variables
Mult.              Gaussian              10                100                 92              No
Mult.              Gaussian              12                100                 95              Yes
Mult.              Gaussian              14                99                  91              Yes
Mult.              Triangular            12                100                 94              Yes

4.5. Synergy of LARES with fuzzy logic

This work demonstrates that there is good synergy between fuzzy logic and LARES in finding a compact set of rules, making the approach an alternative to more classical GA approaches, which have also been shown to be robust for this application. LARES' particular properties can be very important in certain applications. For example, if the feature space has attribute variables in addition to real parameters, LARES' multiple-coding capability can be used without any need to tailor the algorithm, which is not the case with GA. Another advantage of LARES is that it reaches the neighborhood of the global optimum rapidly, which can be very efficient when combined with a hill-climbing algorithm in a second phase (a back-propagation type of algorithm) to find the global optimum.

To study LARES' performance on other fuzzy models, consider the classification model of Ishibuchi et al. (1992). This model produces a large number of fuzzy rules, which increases exponentially with the number of variables. In Ishibuchi et al. (1995), a combinatorial optimization problem was postulated to reduce the number of rules while keeping the high classification power of the original model. A GA was used to solve this optimization problem with the following parameters: population size = 5, 10, 50; one-point mutation with mutation rate = 0.1, 0.01, 0.001, 0.0001; fitness-proportional selection; one-point crossover; and elitism, where the best individual passes to the next generation. The number of rules for the original algorithm applied to the Iris data with no optimization was 683. Depending on the mutation rate and population size selected, the optimal number of reduced rules varied from 18 (N_p = 5, P = 0.001) to 281 (N_p = 50, P = 0.1). The GA was improved by introducing a specialized mutation operation tailored to this particular problem, and the optimal number of rules found was 13, with classification rates over 99%. The same problem was solved using LARES, reducing the system to 11 rules in 13,000 iterations (see Fig. 11), better than the best result found with the tailored GA in Ishibuchi et al. (1995). Notice that when the LFC algorithm was used, four rules were enough to describe 99.3% of the data.

Fig. 11. Decrease in the number of rules when using LARES to solve the Ishibuchi et al. (1995) combinatorial optimization problem applied to the Iris data. After 13,000 iterations the system was described with 11 rules.

5. Conclusion

LFC is a new algorithm developed to generate a compact set of fuzzy classification rules from data. Numerical experiments demonstrate that the algorithm is direct, flexible, robust and efficient. LARES, used here for the first time to train fuzzy systems, demonstrated its utility for this learning task. LARES has the capability to escape from local minima, reaching a near-global solution
effectively. Also, the coding flexibility of LARES helps to define the premise in terms of combinatorial and real variables in order to achieve a better solution. For partitions where flexible membership functions are needed, the generalized trapezoid function with the re-assignment transformation has been introduced. This transformation allows the use of these types of functions in LFC without any tailoring of LARES, and the approach can also be used for more complex membership functions that have collocation points as part of the parameters to be determined. As shown in the numerical experiments, LFC presents a good balance between input-output mapping precision and linguistic interpretation. This algorithm will have applications in areas of the chemical process industries that need to classify patterns, such as diagnosis of possible process faults, detection of unstable operation, and allocation of raw-material types and process conditions for product quality. The algorithm eliminates several problems presented by other algorithms: (1) exponential increase in the number of rules with the number of variables; (2) exponential increase in computational load with the number of rules; (3) the requirement that all variables form part of every rule; and (4) the need for two-phase learning and/or tailoring of the optimization algorithm.

References

Agrawal, M., Jade, A.M., Jayaraman, V.K., Kulkarni, B.D., 2003. Support vector machines: a useful tool for process engineering applications. Chemical Engineering Progress, January, pp. 57-62.
Bean, J.C., 1994. Genetic algorithms and random keys for sequencing and optimization. ORSA Journal on Computing 6 (2), 154-160.
Chung, I.F., Lin, C.J., Lin, C.T., 2000. A GA-based fuzzy adaptive learning control network. Fuzzy Sets and Systems 112, 65-84.
De Carli, A., Liguori, P., Marroni, A., 1994. A fuzzy-PI control strategy. Control Engineering Practice 2, 147-153.
Du, Y.D., Tyagi, R.D., Bhamidimarri, R., 1999. Use of fuzzy neural-net model for rule generation of activated sludge process. Process Biochemistry 35, 77-83.
Er, M.J., Liao, J., Lin, J., 2000. Fuzzy neural networks-based quality prediction system for sintering process. IEEE Transactions on Fuzzy Systems 8, 314–324. Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–188. Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA. Holland, J.H., 1975. Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI. Irizarry, R., 2004a. LARES: an artificial chemical process approach for optimization. Evolutionary Computation Journal 12 (4). Irizarry, R., 2004b. Solution of optimal control problems using LARES: an artificial chemical process for optimization. Chemical Engineering Science, submitted for publication. Ishibuchi, H., Nozaki, K., Tanaka, H., 1992. Distributed representation of fuzzy rules and its application to pattern classification. Fuzzy Sets and Systems 52, 21–32. Ishibuchi, H., Nozaki, K., Yamamoto, N., Tanaka, H., 1995. Selecting fuzzy if–then rules for classification problem using genetic algorithms. IEEE Transactions on Fuzzy Systems 3, 260–270. Jang, J.S., Sun, C.T., 1993. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks 4, 156–159. Karr, C., 1991. Applying genetics to fuzzy logic. AI Expert 3, 39–43. Li, E., Li, J., Yu, J., 2002. A genetic neural fuzzy system-based quality prediction model for injection process. Computers and Chemical Engineering 26, 1253–1263. Mastorocostas, P., Theocharis, J., 2000. FUNCOM: a constrained learning algorithm for fuzzy neural networks. Fuzzy Sets and Systems 112, 1–26. Na, M.G., 1999. Application of a genetic neuro-fuzzy logic to departure from nucleate boiling protection limit estimation. Nuclear Technology 128 (3), 327–340. Russo, M., 1998. FuGeNeSys—a fuzzy genetic neural system for fuzzy modeling. IEEE Transactions on Fuzzy Systems 6, 373–388. Russo, M., 2000. Genetic fuzzy learning. IEEE Transactions on Evolutionary Computation 4, 259–273.
Sanchez, E., 1998. Fuzzy logic and inflammatory protein variations. Clinica Chimica Acta 270 (1), 31-42.
Sanchez, E., Bartolin, R., 1990. Fuzzy inference and medical diagnosis, a case study. Journal of Biometric Fuzzy System Assessment 1, 4-21.
Schwefel, H.P., 1995. Evolution and Optimum Seeking. Wiley, New York.
Simpson, P.K., 1992. Fuzzy min-max neural networks. Part 1: classification. IEEE Transactions on Neural Networks 3, 776-786.
Su, M.C., Chang, H.T., 2000. Application of neural networks incorporated with real-valued genetic algorithms in knowledge acquisition. Fuzzy Sets and Systems 112, 85-97.
Sugeno, M., Yasukawa, T., 1993. A fuzzy-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems 1, 7-31.
Tsoukalas, L.H., Uhrig, R.E., 1997. Fuzzy and Neural Approaches in Engineering. Wiley, New York.
Uppal, A., Ray, W.H., Poore, A.B., 1974. On the dynamic behavior of continuous stirred tank reactors. Chemical Engineering Science 29, 967-985.
Uppal, A., Ray, W.H., Poore, A.B., 1976. The classification of the dynamic behavior of continuous stirred tank reactors: influence of reactor residence time. Chemical Engineering Science 31, 205-214.
Venkatasubramanian, V., et al., 1990. Process fault detection and diagnosis using neural networks. 1. Steady-state processes. Computers and Chemical Engineering 14, 699-712.
Venkatasubramanian, V., Rengaswamy, R., Yin, K., Kavuri, S.N., 2003. A review of process fault detection and diagnosis. Part I: quantitative model-based methods. Computers and Chemical Engineering 27, 293-311.
Wang, L.X., Mendel, J.M., 1992. Generating fuzzy rules by learning from examples. IEEE Transactions on Systems, Man and Cybernetics 22, 1414-1427.
Zhang, J., Ferch, M., 2003. Extraction and transfer of fuzzy control rules for sensor-based robotic operations. Fuzzy Sets and Systems 134, 147-167.