Discovering generalized design knowledge using a multi-objective evolutionary algorithm with generalization operators


Expert Systems With Applications 143 (2020) 113025


Hyunseung Bang a, Daniel Selva b,∗

a Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, New York 14853, United States
b Department of Aerospace Engineering, Texas A&M University, College Station, Texas 77843, United States
∗ Corresponding author. E-mail addresses: [email protected] (H. Bang), [email protected] (D. Selva).

https://doi.org/10.1016/j.eswa.2019.113025

Article history: Received 24 June 2019; Revised 12 September 2019; Accepted 13 October 2019; Available online 14 October 2019

Keywords: Design space exploration; Knowledge discovery; Feature extraction; Data mining; Multi-objective evolutionary algorithm; Adaptive operator selection

Abstract

The early-phase design of complex systems is a challenging task, as a decision maker has to take into account the intricate relationships among different design variables. A popular way to help decision makers easily identify important design features is to use data mining. However, many of the existing algorithms output design features that are too complex (e.g., conjunctions of many literals with unrelated predicates), making it difficult for a user to understand, remember, and apply these features to find better designs. In this paper, we introduce a new data mining method that extracts compact design features through knowledge generalization. The proposed method performs a search over the space of features using a multi-objective evolutionary algorithm that contains a set of generalization operators in addition to conventional evolutionary operators. Both variables and feature types are generalized by using an ontology defining a set of domain-specific concepts and relationships. Generalization leads to more compact and insightful features, as generalized knowledge encompasses wider concepts. A comparative experiment is conducted on a real-world system architecting problem to demonstrate the gain in compactness of the extracted features without significant reductions in predictive power.

© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

“Design by shopping” (Balling, 1999) is a popular approach to tackling early-phase engineering design problems (i.e., system architecting and conceptual design). Under this paradigm, designers inspect various design alternatives as they try to gain insights into the structure of the design problem at hand through a process called design space exploration (a.k.a. tradespace exploration, tradespace analysis, etc.) (Ross & Hastings, 2005). Design space exploration can be seen as an information-gaining process, in which designers discover new knowledge (i.e., they learn) about the design problem. Beyond finding sets of good (e.g., non-dominated) designs to explore in more depth, examples of the types of knowledge that can be gained through design space exploration include the main trade-offs among conflicting criteria, sensitivities of design criteria to design decisions and model parameters, couplings between design decisions, design features that are common among good designs, and the presence of clusters or families of similar designs that simplify the decision-making process. As designers learn such information, they can make a more informed decision to either select one or more designs for further study or modify the problem formulation or the model to find solutions that better reflect the physics or economics of the problem (Woodruff, Reed, & Simpson, 2013).


To support designers in this knowledge discovery process, various software tools have been suggested in the field of engineering design. One popular approach is to extract design knowledge explicitly in the form of logical IF-THEN rules, which resemble how human experts reason (Newell & Simon, 1972). Logical rules are easy to understand, as they can easily be used for building mental models of the knowledge they convey. Popular data mining methods to extract design rules include decision trees (Graening et al., 2008; Jahr, Calborean, Vintan, & Ungerer, 2012; Yan, Qiao, Simpson, Li, & Zhang, 2012), association rule mining (ARM) (Bandaru, Ng, & Deb, 2017; Ng, Bandaru, & Frantzén, 2016; Watanabe, Chiba, & Kanazaki, 2014), and evolutionary algorithms (EA) (Bandaru & Deb, 2010; Russo, Bernardino, & Barbosa, 2018). These methods can be used to identify design features that are common among good (e.g., non-dominated) designs. For example, Watanabe et al. applied ARM to a hybrid rocket engine design problem and found 1171 rules, one of which indicates that 80 percent of all non-dominated solutions have a similar initial port radius (Watanabe et al., 2014). Regardless of the method used to extract rules, learning design knowledge encoded in logical rules may still be difficult.


For instance, the total number of extracted rules may be large, and/or each rule may contain a large number of conditions (literals). Over the years, various attempts have been made to improve the comprehensibility of extracted rules by incorporating domain-specific knowledge into the knowledge discovery process. These methods use domain knowledge to: (1) highlight relevant and interesting rules or filter out redundant ones (Idoudi, Ettabaa, Solaiman, & Hamrouni, 2016; Kuo, Lonie, Sonenberg, & Paizis, 2007; Mansingh, Osei-Bryson, & Reichgelt, 2011; Natarajan & Shekar, 2005); (2) generalize the knowledge encoded in rules to make them more compact (Bellandi, Furletti, Grossi, & Romei, 2007; Domingues & Rezende, 2011; Psaila & Lanzi, 2000; Srikant & Agrawal, 1995; Tseng, Lin, & Jeng, 2007; Zhou & Geller, 2008); or both (Marinica & Guillet, 2010; Marinica, Guillet, & Briand, 2008).

In this paper, we focus on generalization as a way to make rules more compact, and thus easier to understand. Here, the term generalization refers to Michalski’s constructive generalization (Michalski, 1983), which can be described as replacing a concept F1 in a rule with another concept F2 when F1⇒F2. Such an operation can make rules more compact, as F2 represents a broader concept. For example, if we have two rules F1⇒c and F3⇒c, and we know from our domain knowledge that F1∨F3⇒F2 holds, then we can hypothesize that F2⇒c is also a valid rule. If the hypothesis is true, we are able to combine the two rules into a more compact one. (Note that the replacement above is not a sound inference: take for example another concept F4 which is neither F1 nor F3, and satisfies F4⇒F2 and F4⇒¬c. Then F2⇒c cannot be true.)

Previous efforts to simplify rules through generalization focused either on introducing generalized attributes at the beginning of a data mining search (Srikant & Agrawal, 1995; Tseng et al., 2007; Zhou & Geller, 2008), or on applying generalization as a post-processing step (Domingues & Rezende, 2011; Marinica & Guillet, 2010; Marinica et al., 2008). In both approaches, the data mining and generalization steps are completely separated, and a conventional data mining algorithm is used without significant modification. This reduces the efficiency and the effectiveness of the search in finding compact rules. Simply introducing generalized attributes at the beginning of the search expands the search space too much, making the search prohibitively slow for complex problems. On the other hand, applying generalization as a post-processing step limits the space to be searched, as the rules have to be derived directly from those obtained from a conventional algorithm. To date, fully integrating generalization and data mining remains a challenging task.

In this paper, we introduce a method to combine generalization and data mining search by EA, with the goal of extracting design rules that are compact and easy to understand, while having good predictive power. The algorithm applies generalizations during the search, exploring the generalization hypothesis space as it constructs design rules. The proposed method introduces generalization operators, in addition to conventional evolutionary operators (e.g., mutation and crossover operators), in order to generalize the knowledge represented in each rule. We use two types of generalization operators: variable-generalization operators and feature-generalization operators. Each type of generalization operator uses a different mechanism to generalize knowledge.
We use an adaptive operator selection (AOS) strategy (Hitomi & Selva, 2017) to control the extent to which the generalization operators are applied at a specific stage of the search. The algorithm adaptively allocates computational resources to different operators according to their performance in finding good solutions. The method is applied to a real-world system architecting problem to demonstrate that it can extract logical rules that are more compact than those found by more traditional approaches such as ARM and a conventional EA, while maintaining good predictive power. In this work, we use compactness as a proxy for the difficulty of interpreting rules.

The remainder of this paper is organized as follows. In Section 2, a review of related work is presented. Section 3 introduces the example design problem that is used to provide concrete examples. Section 4 explains the proposed method in detail. Section 5 describes the experimental setup of a comparative study to demonstrate the capability of the proposed tool in finding compact, useful design features. The results and discussion of the experiment are presented in Sections 6 and 7, respectively. Section 8 presents the conclusions of the paper.

2. Related work

2.1. Mining logical rules from data

One of the most popular methods to extract knowledge in the form of logical IF-THEN rules is Association Rule Mining (ARM) (Agrawal, Imieliński, & Swami, 1993). Formally, let T be a database containing all observations (also called transactions in the original formulation). Each observation t ∈ T is represented as a binary vector, with t[k] = 1 if t satisfies a binary attribute Fk ∈ F, where F = {F1, F2, ..., Fm} is the set of binary attributes (also called items) considered in a given problem. Then an association rule is an implication of the form X⇒Fj, where X ⊂ F, Fj ∈ F, and Fj ∉ X. As there can be an unmanageably large number of such rules, two importance measures called support and confidence are used to identify rules that are useful. The support and confidence of a rule X⇒Fj are defined as follows (Aggarwal, 2015):

$$\mathrm{supp}(X \Rightarrow F_j) = \mathrm{supp}(X \cup F_j) \tag{1}$$

$$\mathrm{conf}(X \Rightarrow F_j) = \frac{\mathrm{supp}(X \cup F_j)}{\mathrm{supp}(X)} \tag{2}$$
where the support of an attribute set X is defined as the fraction of the observations in T that satisfy all binary attributes in X. Mining association rules first requires identifying frequent attribute sets. An attribute set X is called frequent if its support is above a certain threshold minSupport. The two most popular algorithms for mining frequent attribute sets are the Apriori algorithm (Agrawal & Srikant, 1994) and the FP-growth method (Han, Pei, Yin, & Mao, 2004). After mining frequent attribute sets, different rules of the form X⇒Fj can be generated. Only the rules whose confidence is above a certain threshold minConfidence are accepted and returned in the final ruleset. High support for an attribute set indicates that there exist many observations that satisfy all the attributes in the set; therefore, support is related to the statistical significance of a rule. High confidence for a rule X⇒Fj indicates that most of the observations that satisfy all the attributes of the set X also satisfy Fj; therefore, confidence can be considered a measure of the strength of a rule, or the degree of correlation between X and Fj.

ARM can be extended to extract classification rules by adding a class label as an additional binary attribute (Liu, Hsu, & Ma, 1998). Then, only rules of the form X⇒c are mined, where c is a binary attribute corresponding to a class label. In this work, an observation t corresponds to a certain design, F is a set of binary design attributes, and c is a class label that determines the quality of the design (e.g., being non-dominated). A rule X⇒c can then represent an associative relationship between a set of design attributes and the quality of a design. We refer to the antecedent of a rule X⇒c as a feature, since it corresponds to a certain design feature.

ARM requires the user to set thresholds for support and confidence to filter out uninteresting rules. While the selection of threshold values for support and confidence has a strong influence on the rules generated (Coenen & Leng, 2007), there is no simple way to find the appropriate threshold values.
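To make Eqs. (1) and (2) concrete, the sketch below computes support and confidence over a binary observation matrix. This is our own minimal illustration; the toy data and function names are not from the paper:

```python
import numpy as np

def support(T: np.ndarray, attrs: list) -> float:
    """Fraction of observations (rows of T) satisfying all attributes (columns) in attrs."""
    return T[:, attrs].all(axis=1).mean()

def confidence(T: np.ndarray, X: list, j: int) -> float:
    """conf(X => Fj) = supp(X u {Fj}) / supp(X), per Eq. (2)."""
    supp_X = support(T, X)
    return support(T, X + [j]) / supp_X if supp_X > 0 else 0.0

# Toy database: 6 observations over 4 binary attributes F0..F3.
T = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
print(support(T, [0, 1]))          # supp({F0, F1}) = 0.5
print(confidence(T, [0, 1], 3))    # conf({F0, F1} => F3) = 1.0
```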


In addition, in real-world applications, many of the rules obtained from these frequent attribute sets are repetitive, as they share a very similar composition of attributes in the antecedent. Such rules add very little information while slowing down the search significantly. This happens especially when there exist many frequent attribute sets and the confidence threshold is small (Kabir, Xu, Kang, & Zhao, 2017). The large number of rules also makes it difficult for a decision maker to identify useful information.

2.2. Evolutionary algorithms (EA) for rule learning

An alternative approach to ARM is to extract association rules using evolutionary algorithms (EA). In this approach, the extraction of association rules is framed as an optimization problem, and EAs are used to find the set of rules that best explains the observations in the database in terms of various importance measures (De Falco, Della Cioppa, & Tarantino, 2002; Dehuri, Patnaik, Ghosh, & Mall, 2008; Fidelis, Lopes, & Freitas, 2000; Ghosh & Nath, 2004; Martin, Rosete, Alcala-Fdez, & Herrera, 2014; Noda, Freitas, & Lopes, 1999; Qodmanan, Nasiri, & Minaei-Bidgoli, 2011; Yan, Zhang, & Zhang, 2009; Zhou, Xiao, Tirpak, & Nelson, 2003). The major distinctions among different EAs for rule learning lie in the formulation of the objective functions and in how individual rules are encoded in chromosomes.

The desired characteristics of extracted rules that can be used to define the fitness function include high statistical significance, low complexity, and a high level of “interestingness”, which may be defined using various metrics (Lenca, Vaillant, Meyer, & Lallich, 2007; Martínez-Ballesteros, Martínez-Álvarez, Troncoso, & Riquelme, 2014; Sokolova & Lapalme, 2009; Tan, Kumar, & Srivastava, 2004). In classical ARM, support and confidence are typically used as measures of statistical significance and rule strength, respectively (Agrawal et al., 1993; Han et al., 2004). “Comprehensibility” of the extracted rules is also considered, mostly through the number of literals in the rule (Martin et al., 2014; Qodmanan et al., 2011). These metrics often have trade-off relationships with one another. Therefore, it is often desirable to incorporate multiple metrics into the fitness function. Early work on the application of EAs to rule learning focused on defining a single-objective fitness function that aggregates multiple, conflicting terms (Dehuri et al., 2008; Noda et al., 1999). The main challenge in this approach is determining the weights used to aggregate the different metrics: an arbitrary selection of the weights may put an undesirably large or small emphasis on one specific metric of a rule. More recently, multi-objective problem formulations have gained popularity for EAs used for rule learning. By framing the rule learning task as a multi-objective problem, an EA can jointly optimize rules for multiple, incommensurate objectives that reflect different characteristics of a rule, such as its statistical significance, interestingness, and comprehensibility (Ghosh & Nath, 2004; Martin et al., 2014; Qodmanan et al., 2011). Multi-Objective Evolutionary Algorithms (MOEA) mine sets of Pareto optimal solutions that have varying degrees of performance in each measure, instead of mining rules that perform very similarly with respect to a single fitness function. This allows the user to inspect alternative explanations of the data.
In classical ARM, all rules follow the same structure X⇒c, where X is equivalent to a logical conjunction of the satisfaction of all its member attributes. Thus, the antecedent of a rule in an EA is often encoded as an array of genes, with each gene representing the satisfaction of a single binary attribute (Dehuri et al., 2008; Martin et al., 2014; Qodmanan et al., 2011). Encoding rules in a rigid structure (a conjunction of literals) reduces the search space significantly, compared to the more general case where both conjunctions and disjunctions are allowed.


However, this approach usually requires a higher number of rules to express complex concepts (Brachman, Levesque, & Reiter, 1992). An alternative is to extend EAs using Genetic Programming (GP), where rules are represented as trees (Koza, 1994). A tree in GP consists of function nodes and terminal nodes. Function nodes may contain mathematical operators (arithmetic, Boolean, conditional, etc.), while terminal nodes contain variables or constants appropriate for each problem domain. Thanks to its flexibility, GP has been used for various purposes, such as learning to construct features for pattern recognition (Krawiec, 2002; Zhang & Rockett, 2011), building classifiers (De Falco et al., 2002; Espejo, Ventura, & Herrera, 2010; Zhou et al., 2003), and extracting design principles expressed as mathematical equations (Bandaru & Deb, 2013; Russo et al., 2018; Tatsukawa, Nonomura, Oyama, & Fujii, 2013). By restricting the function nodes to contain only logical connectives (AND and OR), GP can also be made to search the space of logical expressions (Wong & Leung, 1995). The tree structure lends itself to the hierarchical nature of logical expressions that contain both logical conjunctions and disjunctions.

2.3. Leveraging domain knowledge to enhance comprehensibility

Various methods utilize domain-specific knowledge to help the user make sense of the extracted rules (Idoudi et al., 2016; Kuo et al., 2007; Mansingh et al., 2011; Natarajan & Shekar, 2005). For example, Mansingh et al. suggest a method that uses an ontology containing domain-specific information to categorize rules into different groups such as “novel”, “known”, “missing”, and “contradictory” rules (Mansingh et al., 2011). The domain experts who participated in that study found categories such as “novel” and “missing” rules to be useful. Domain knowledge may also be used to define a new interestingness measure (Idoudi et al., 2016; Natarajan & Shekar, 2005). For example, Idoudi et al. use an ontology containing domain knowledge to define a semantic similarity measure among different concepts. A rule is then considered interesting if it contains dissimilar items, as such a rule is less likely to be trivial.

A more direct approach to help decision makers gain new insights is to improve the comprehensibility of extracted rules by making them more compact. The overall structure of the output rules can be made compact by generalizing the knowledge represented in each rule. For example, hasFootwear⇒c is a generalization of the rule hasBoots⇒c, since hasBoots⇒hasFootwear. From the perspective of comprehensibility, extracting hasFootwear⇒c is more desirable than having separate rules for each type of footwear, such as (hasBoots⇒c)∧(hasSneakers⇒c)∧(hasSandals⇒c)∧(hasLoafers⇒c), since the former requires a smaller number of attributes to express. As generalization is not a sound inference, it can be considered a search problem (Mitchell, 1982), where a hypothesis is first generated using the subsumption relationships between attributes (e.g., hasBoots⇒hasFootwear), and then tested against the data.

Over the years, different ways of incorporating generalization into data mining have been suggested. In an early effort, Srikant and Agrawal proposed using taxonomies to introduce generalized attributes at the beginning of the search process (Srikant & Agrawal, 1995). These attributes are the generalizations (items at higher levels of the taxonomy) of the initial set of binary attributes (items at the lowest level of the taxonomy). As the data mining uses both sets of attributes as building blocks, the extracted rules may contain any combination of the generalized attributes as well as the initial attributes. Similar suggestions have been made since (Tseng et al., 2007; Zhou & Geller, 2008), with some algorithmic improvements to make the data mining more efficient. However, the search space usually expands significantly when the taxonomy has a complex structure, making these methods impractical for large-scale and complex problems.
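The generate-and-test view of generalization can be sketched as follows. The taxonomy, data encoding, and threshold here are hypothetical illustrations of the idea, not any cited system:

```python
# Hypothetical taxonomy: child attribute -> parent (more general) attribute.
IS_A = {"hasBoots": "hasFootwear", "hasSneakers": "hasFootwear",
        "hasSandals": "hasFootwear", "hasLoafers": "hasFootwear"}

def satisfies(t: dict, attr: str) -> bool:
    """An observation satisfies a general concept if it satisfies the
    concept itself or any of its children (subsumption)."""
    children = [a for a, p in IS_A.items() if p == attr]
    return t.get(attr, False) or any(satisfies(t, a) for a in children)

def test_generalization(data: list, antecedent: str, c: str, min_conf: float = 0.9) -> bool:
    """Test the hypothesis 'antecedent => c' against the data; accept it
    only if its confidence clears a threshold."""
    covered = [t for t in data if satisfies(t, antecedent)]
    if not covered:
        return False
    conf = sum(t.get(c, False) for t in covered) / len(covered)
    return conf >= min_conf

data = [{"hasBoots": True, "c": True}, {"hasSneakers": True, "c": True},
        {"hasSandals": True, "c": False}]
# hasBoots => c holds, but the broader hasFootwear => c fails here (conf = 2/3),
# illustrating why generalization hypotheses must be validated against the data.
print(test_generalization(data, "hasBoots", "c"), test_generalization(data, "hasFootwear", "c"))
```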


To reduce the size of the search space, Bellandi et al. suggest applying generalization selectively, only to rules whose structure conforms to a specific rule schema (Bellandi et al., 2007). A query language is developed to allow users to interactively request rules that contain attributes generalized to a given level of the hierarchy. As a result, the search space is limited to the rules that follow a certain structure. Similarly, Marinica and Guillet introduce a post-processing method that allows the user to specify a desired rule schema, which may contain generalized concepts (Marinica & Guillet, 2010). However, this approach requires the user to have some idea of which attributes may yield interesting rules when generalized, making it unsuitable for exploratory knowledge discovery. In more recent work, Domingues and Rezende suggest applying generalization to the set of rules obtained from running data mining, without specifying a rule schema (Domingues & Rezende, 2011). In this case, a significant increase in the search space is avoided by separating the generalization step from the data mining search. However, the search cannot take full advantage of the generalized attributes due to this separation: the space available for generalization is restricted by the underlying data mining run.

To the best of our knowledge, there has been no previous effort to apply generalizations during the data mining search without putting restrictions on the structure of the rules. Combining generalization and data mining enables fully exploiting the capability of generalization to represent knowledge compactly. However, it also introduces new challenges, as the data mining can become significantly slower due to the large number of potential generalizations that can be made. In this work, we incorporate generalization into an EA and allow generalizations to be applied during the search. We adaptively control the number of generalization hypotheses that are tested, in order to balance the computational resources spent on the data mining search and on generalization.

3. Design problem

We now introduce an example design problem to provide a concrete context for the description of the method in the remainder of this paper. This example is taken from a real-world system architecting problem previously studied in (Selva, 2014). The design task is to architect a constellation of satellites that will make climate-related measurements such as atmospheric temperature, humidity, or precipitation rate. The objective is to simultaneously maximize scientific benefit and minimize lifecycle cost. Scientific benefit is calculated based on an architecture’s satisfaction of 371 climate-related measurement objectives, generated based on the World Meteorological Organization’s OSCAR (Observing Systems Capability Analysis and Review Tool) database. The specifics of how science benefit and lifecycle cost are calculated can be found in (Selva, 2014).

3.1. Design decision variables

The problem is formulated as an assignment problem between a set of candidate instruments (space-based sensors such as radars and infrared sounders) and a set of candidate orbits (defined by orbital parameters such as altitude and inclination). Given a set I of candidate instruments and a set O of candidate orbits, the design space is defined as the set of all binary relations from I to O. Each instrument in I can be assigned to any subset of the orbits in O, including the empty set.
In this problem, we consider 12 candidate instruments and 5 candidate orbits, for a total of $2^{60}$ possible designs. The candidate orbits and candidate instruments are listed in Tables 1 and 2, respectively.

Table 1
Candidate orbits (LEO = Low Earth Orbit, SSO = Sun-Synchronous Orbit, AM = morning, PM = afternoon, DD = dawn-dusk, LTAN = Local Time of the Ascending Node).

Orbit           Description
LEO-600-polar   LEO with polar inclination at 600 km altitude
SSO-600-AM      SSO with morning LTAN at 600 km altitude
SSO-600-DD      SSO with dawn-dusk LTAN at 600 km altitude
SSO-800-PM      SSO with afternoon LTAN at 800 km altitude
SSO-800-DD      SSO with dawn-dusk LTAN at 800 km altitude

Table 2
Candidate instruments (SAR = Synthetic Aperture Radar, UV = Ultra Violet, VIS = VISible, SWIR = Short Wave InfraRed, TIR = Thermal InfraRed, IR = InfraRed).

Instrument      Description
OCE_SPEC        Ocean color spectrometer
AERO_POL        Aerosol polarimeter
AERO_LID        Differential absorption lidar
HYP_ERB         Short-wave / long-wave radiation budget
CPR_RAD         Cloud and precipitation radar
VEG_INSAR       Polarimetric L-band SAR
VEG_LID         Vegetation/ice green lidar
CHEM_UVSPEC     UV/VIS limb spectrometer
CHEM_SWIRSPEC   SWIR nadir spectrometer
HYP_IMAG        SWIR-TIR hyperspectral imager
HIRES_SOUND     IR atmospheric sounder
SAR_ALTIM       Wide-swath radar altimeter

To enable generalization of decision variables, we define an ontology with several concepts, which are abstractions of the candidate instruments and candidate orbits. The newly added concepts and the existing variables together form concept hierarchies of orbits and instruments, as shown in Figs. 1 and 2, respectively. Here, nodes represent different concepts, and edges represent is-a relations between those concepts. The nodes colored in gray are leaf nodes, which represent the specific instruments and orbits considered in this problem. We allow each leaf node to have multiple parents, as some of the abstract concepts are not mutually exclusive (e.g., Altitude600Orbit and PolarOrbit). The concept hierarchies are encoded using the Web Ontology Language (OWL).

Fig. 1. Orbit concept hierarchy.

Fig. 2. Instrument concept hierarchy.

3.2. Design features for data mining

To run data mining, a set of input binary features needs to be defined. The data mining algorithm uses these binary features as basic building blocks to construct more sophisticated Boolean concepts as the search progresses. We refer to these input features as base features. In its simplest form, a base feature can be a single binary design decision: each literal in an extracted rule would then be a Boolean variable indicating whether a specific instrument Ii is assigned to an orbit Oj. In this work, however, we introduce more complex base features in order to bias the search towards more promising regions of the search space. Each base feature is defined using the predicates shown in Table 3, by assigning specific instruments and orbits as arguments to these predicates. The base features then represent different attributes a design may have. For instance, present(SAR_ALTIM) indicates that the instrument SAR_ALTIM is used in at least one of the orbits, and inOrbit(SSO-800-PM, HIRES_SOUND, CHEM_UVSPEC) indicates that instruments HIRES_SOUND and CHEM_UVSPEC are assigned to the orbit SSO-800-PM.

4. Method

The proposed method extracts design knowledge in a compact form by extending an EA with a GP-like solution representation, and incorporating generalization into the search using Adaptive Operator Selection (AOS). Similar to GP, trees are used to represent logical rules. The new method also introduces generalization operators in addition to conventional evolutionary operators such as crossover and mutation. The generalization operators replace either the predicates or their arguments in a logical expression (i.e., instruments and/or orbits) with more general concepts (cf. Figs. 1 and 2). The generalized concepts help express knowledge in a more compact way, as they can combine multiple lower-level concepts.

4.1. Baseline multi-objective evolutionary algorithm

We use ε-MOEA (Deb, Mohan, & Mishra, 2003) as the base EA due to its desirable algorithmic properties. ε-MOEA coevolves a population of solutions as well as an external archive, which stores the best solutions found so far.


Table 3
Base features used for the given example problem of architecting a climate monitoring satellite constellation.

Name         Arguments           Description
present      Ii                  Instrument Ii is present in at least one of the orbits
absent       Ii                  Instrument Ii is absent in all the orbits
inOrbit      Oi, Ij (, Ik, Il)   Instrument Ij (and Ik, Il) is/are present in orbit Oi
notInOrbit   Oi, Ij (, Ik, Il)   Instrument Ij (and Ik, Il) is/are not present in orbit Oi
together     Ii, Ij (, Ik)       Instruments Ii, Ij (and Ik) are present together in any orbit
separate     Ii, Ij (, Ik)       Instruments Ii, Ij (and Ik) are not present together in any orbit
emptyOrbit   Oi                  No instrument is present in orbit Oi
numOrbits    n                   The number of orbits that have at least one instrument assigned is n
Fig. 3. Representation of a feature.

The algorithm employs the concept of ε-dominance to prevent degradation of solution quality while ensuring diversity of solutions and theoretical convergence (Laumanns, Thiele, Deb, & Zitzler, 2002). ε-MOEA is a steady-state EA, which means that it updates the population sequentially, one solution at a time. This helps the algorithm react quickly and modify its strategy for applying generalizations, if needed; this point is elaborated in Section 4.3, where we introduce AOS in more detail.

Each solution in ε-MOEA represents a rule of the structure X⇒c, where X is a binary feature that describes a design, and c is a binary class label indicating the quality of a design. In this work, we restrict the class label c to represent the designs that are in a user-defined region of interest in the objective space, such as the Pareto front. Therefore, only the antecedent of a rule (the feature) is used to define a solution. Similar to how computer programs are represented in GP (Koza, 1994), each feature is encoded as a tree, as shown in Fig. 3. Function nodes contain only the logical connectives AND and OR, and leaf nodes contain base features.

To handle solutions represented as trees in an EA, we use special forms of the crossover and mutation operators. While various crossover operators can be defined for trees (Lang, 1995; Zhang, Gao, & Lou, 2007), we use the simple crossover operator introduced by Koza (1994). Here, the crossover points are selected randomly from the two parent trees, and the subtrees under those crossover points are exchanged to create two offspring. As for mutation, we select a leaf node from a tree and replace it with a random base feature with random input arguments. The application of mutation operators to function nodes is restricted, as it may lead to overly large changes in the performance of a feature.

Three objectives are used to define the goodness of a rule X⇒c: maximizing conf(X⇒c), maximizing conf(c⇒X), and minimizing the number of literals in the antecedent of the rule.

only the class c. When the class label c represents a binary class, as in our case, this metric is equivalent to precision. On the other hand, conf(c⇒X) represents how complete the feature X is in terms of the fraction of the class c that exhibits the feature X. Under a binary classification problem, conf(c⇒X) is equivalent to recall. They are included as separate objectives because they exhibit a trade-off relationship (Buckland & Gey, 1994), and it is in general impossible to define objective preferences among them a priori. Moreover, having both as objectives can better capture the predictive power of rules when the data is imbalanced (cf. accuracy). Minimizing the number of literals in a feature corresponds to reducing the complexity of a rule, hence improving the comprehensibility (Huysmans, Dejaeger, Mues, Vanthienen, & Baesens, 2011). Depending on the structure of a feature, maximizing conf(X⇒c) or conf(c⇒X) may be in a direct trade-off relationship with minimizing the number of literals. 4.2. Generalization strategy We define two different types of generalization operators: (1) feature-generalization (FG) operators; and (2) variablegeneralization (VG) operators. Similar to conventional operators used in EA, generalization operators take parent solutions as inputs and generate offspring using different heuristics. 4.2.1. Feature-generalization (FG) operators FG operators are applied when there exists an opportunity to generalize knowledge by modifying the predicate of a base feature. Such generalization is possible as there are semantic generalization/specialization relationships between different base features. Five FG operators were used in this work and are described below:

• InOrbit2Present: If there exists a leaf node inOrbit(Oi, I0, ..., IN), then replace it with the subtree present(Ij) ∧ inOrbit(Oi, I0, ..., Ij−1, Ij+1, ..., IN), where 0 ≤ j ≤ N and j is selected randomly.
• InOrbit2Together: If there exists a leaf node inOrbit(Oi, I0, ..., IN), then replace it with the subtree together(Ij, Ik) ∧ inOrbit(Oi, I0, ..., Ij−1, Ij+1, ..., Ik−1, Ik+1, ..., IN), where 0 ≤ j < k ≤ N and j and k are selected randomly.
• NotInOrbit2Absent: If there exists a leaf node notInOrbit(Oi, I0, ..., IN), then replace it with the subtree absent(Ij) ∧ notInOrbit(Oi, I0, ..., Ij−1, Ij+1, ..., IN), where 0 ≤ j ≤ N and j is selected randomly.
• NotInOrbit2EmptyOrbit: If there exists a leaf node notInOrbit(Oi, I0, ..., IN), then replace it with the node emptyOrbit(Oi).
• Separate2Absent: If there exists a leaf node separate(I0, ..., IN), then replace it with the subtree absent(Ij) ∧ separate(I0, ..., Ij−1, Ij+1, ..., IN), where 0 ≤ j ≤ N and j is selected randomly.

Each FG operator is encoded in the form of an IF-THEN rule. The IF portion of an operator specifies the condition that indicates an opportunity to generalize the knowledge represented in a feature. For example, if there exists a leaf node inOrbit(SSO-800-PM, CHEM_UVSPEC, AERO_POL), then the InOrbit2Together generalization operator may replace it with together(CHEM_UVSPEC, AERO_POL). In this operation, the notion “CHEM_UVSPEC and AERO_POL are assigned to SSO-800-PM” is generalized to “CHEM_UVSPEC and AERO_POL are assigned together in at least one of the orbits considered in this problem”. Note that inOrbit(SSO-800-PM, CHEM_UVSPEC, AERO_POL) having good predictive power does not warrant the usefulness of together(CHEM_UVSPEC, AERO_POL) in explaining the data; the IF portion of the operator simply captures an opportunity to explore and test a new hypothesis. A more conservative approach would be to use more than one base feature in the conditional part of each operator in order to generate stronger hypotheses. For example, inOrbit(SSO-800-PM, CHEM_UVSPEC, AERO_POL) ∨ inOrbit(SSO-600-AM, CHEM_UVSPEC, AERO_POL) would be a stronger signal that together(CHEM_UVSPEC, AERO_POL) is a valid feature. However, such conditions would be more difficult to satisfy, and thus the operators would be triggered much more rarely during the search. In order to explore the generalization hypothesis space extensively, we made the conditions in the IF portions of the FG operators easy to satisfy.

Another point to note is that most of the FG operators (with the exception of NotInOrbit2EmptyOrbit) actually increase the number of literals in a feature, rather than decreasing it. Therefore, in a syntactic sense, FG operators are not expected to make rules more compact. However, as we show in Section 6, FG operators contribute to extracting compact rules thanks to the semantic generalization of features.

4.2.2. Variable-generalization (VG) operators

VG operators generalize features by replacing their argument variables (instruments and/or orbits) with variables that are higher up in a concept hierarchy (superclasses of the initial argument variables). In this work, we refer to these variables as higher-level variables. The information about is-a relationships across different variables is obtained from the concept hierarchies stored in OWL. For example, Fig. 1 shows that the SSO-800-DD orbit is a dawn-dusk orbit, as well as an altitude-800km-class orbit. The VG operators used in this paper are:


• OrbitGeneralizer: If there exist leaf nodes whose features include orbits as arguments (e.g., inOrbit, notInOrbit), then randomly select one of those nodes and replace the orbit variable Oi with a higher-level variable Oj, i.e., Oi⇒Oj.
• InstrumentGeneralizer: If there exist leaf nodes whose features include instruments as arguments (e.g., inOrbit, notInOrbit, together, separate), then randomly select one of those nodes and replace the instrument variable Ii with a higher-level variable Ij, i.e., Ii⇒Ij.

For example, inOrbit(SSO-800-PM, CHEM_UVSPEC) can be generalized by replacing SSO-800-PM with the PMOrbit class. The feature then becomes inOrbit(PMOrbit, CHEM_UVSPEC), which indicates that the instrument CHEM_UVSPEC can be assigned to any of the orbits in the PM orbit class.
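The following sketch shows the mechanics of one FG operator (InOrbit2Together) and of the OrbitGeneralizer VG operator. The tuple-based leaf encoding and the hierarchy fragment are our own simplification of the OWL concept hierarchies in Figs. 1 and 2, not the authors' data structures:

```python
import random

# Fragment of the orbit concept hierarchy (cf. Fig. 1): concept -> parent concepts.
ORBIT_IS_A = {
    "SSO-800-DD": ["DawnDuskOrbit", "Altitude800Orbit"],
    "SSO-600-DD": ["DawnDuskOrbit", "Altitude600Orbit"],
    "SSO-800-PM": ["PMOrbit", "Altitude800Orbit"],
}

# A leaf node is encoded as (predicate, orbit, instruments);
# function nodes as (connective, children).
def in_orbit_to_together(leaf):
    """FG operator InOrbit2Together: inOrbit(O, I0..IN) becomes
    together(Ij, Ik) AND inOrbit(O, remaining instruments)."""
    name, orbit, instruments = leaf
    assert name == "inOrbit" and len(instruments) >= 2
    j, k = random.sample(range(len(instruments)), 2)
    rest = [x for idx, x in enumerate(instruments) if idx not in (j, k)]
    together = ("together", None, [instruments[j], instruments[k]])
    return ("AND", [together, ("inOrbit", orbit, rest)]) if rest else together

def generalize_orbit(leaf):
    """VG operator OrbitGeneralizer: replace the orbit argument with one
    of its parent concepts from the hierarchy."""
    name, orbit, instruments = leaf
    parents = ORBIT_IS_A.get(orbit, [])
    return (name, random.choice(parents), instruments) if parents else leaf

leaf = ("inOrbit", "SSO-800-PM", ["CHEM_UVSPEC", "AERO_POL"])
print(in_orbit_to_together(leaf))  # -> together(CHEM_UVSPEC, AERO_POL)
print(generalize_orbit(leaf))      # e.g. inOrbit(PMOrbit, [CHEM_UVSPEC, AERO_POL])
```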

4.3. Adaptive operator selection

As highlighted by Michalski (1983), generalization is done in two main steps: generating plausible hypotheses and validating them. In this work, the EA uses generalization operators as a way of creating new hypotheses, which are then tested against the data when objectives such as conf(X⇒c) and conf(c⇒X) are computed. However, there are some issues that need to be addressed when applying generalization operators. First, some generalization operators may perform better than others in generating plausible hypotheses. Depending on the structure of the solutions in the archive, as well as those in the population at a certain point during the search, different generalization strategies may be useful in improving the solutions. Second, many of the hypotheses created through generalization are likely to be too general, resulting in high conf(c⇒X) but low conf(X⇒c). Therefore, over-reliance on generalization operators may bias the search to favor only one of the objectives, or cause premature convergence on local optima. These issues are similar to those outlined by Hitomi and Selva (2018), who describe some of the challenges of incorporating knowledge into an EA through the use of knowledge-dependent operators.

To address these issues, we adaptively control the types of generalization hypotheses that are tested at a specific stage of the search, using a method based on the multi-armed bandit literature called Adaptive Operator Selection (AOS) (DaCosta, Fialho, Schoenauer, & Sebag, 2008; Fialho, Schoenauer, & Sebag, 2010; Maturana, Lardeux, & Saubion, 2010). In the past, AOS has been used as a method to incorporate domain knowledge into an EA and speed up the search by biasing it towards more promising regions of the search space (Hitomi & Selva, 2018). AOS evaluates the performance of different operators with a scalar metric and assigns credits to those operators. Different operator selection strategies may then be used to select high-performing operators more frequently than poorly performing ones. Still, to foster exploration, all operators should be given a chance, and operators showing poor performance should still be selected occasionally, as the performance of operators may change over time (Fialho et al., 2010).

In this work, we use AOS to balance the selection of conventional evolutionary operators and of the generalization operators creating generalization hypotheses. We adopt one of the credit assignment strategies and one of the operator selection methods recommended in (Hitomi & Selva, 2017). An operator oi is rewarded one credit at iteration t, ci,t = 1, if it creates a solution that enters the archive. Then, Adaptive Pursuit (AP) (Thierens, 2005) is used to select an operator according to Eqs. (3)–(6).

$$q_{i,t+1} = (1 - \alpha) \cdot q_{i,t} + \alpha \cdot c_{i,t} \tag{3}$$

$$o^{*} = \operatorname*{argmax}_{o_i \in O} \; q_{i,t} \tag{4}$$

$$p_{\max} = 1 - (|O| - 1) \cdot p_{\min} \tag{5}$$

$$p_{i,t+1} = \begin{cases} p_{i,t} + \beta \cdot (p_{\max} - p_{i,t}), & \text{if } o_i = o^{*} \\ p_{i,t} + \beta \cdot (p_{\min} - p_{i,t}), & \text{otherwise} \end{cases} \tag{6}$$

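A minimal sketch of the Adaptive Pursuit update in Eqs. (3)–(6) follows. The class name is ours, and the value of the adaptation rate α, which is not reported in this text, is a placeholder:

```python
import random

class AdaptivePursuit:
    """Adaptive Pursuit operator selection following Eqs. (3)-(6)."""
    def __init__(self, operators, p_min=0.05, alpha=0.5, beta=0.8):  # alpha is an assumption
        self.ops = list(operators)
        n = len(self.ops)
        self.p_min = p_min
        self.p_max = 1.0 - (n - 1) * p_min            # Eq. (5)
        self.alpha, self.beta = alpha, beta
        self.q = {o: 0.0 for o in self.ops}           # quality estimates q_{i,t}
        self.p = {o: 1.0 / n for o in self.ops}       # selection probabilities p_{i,t}

    def select(self):
        """Sample one operator according to the current probabilities."""
        return random.choices(self.ops, weights=[self.p[o] for o in self.ops])[0]

    def update(self, op, credit):
        """credit is 1 if the offspring entered the archive, else 0."""
        self.q[op] = (1 - self.alpha) * self.q[op] + self.alpha * credit  # Eq. (3)
        best = max(self.ops, key=lambda o: self.q[o])                     # Eq. (4)
        for o in self.ops:                                                # Eq. (6)
            target = self.p_max if o == best else self.p_min
            self.p[o] += self.beta * (target - self.p[o])

aos = AdaptivePursuit(["crossover", "mutation", "InOrbit2Together", "OrbitGeneralizer"])
op = aos.select()
aos.update(op, credit=1)   # reward the operator whose offspring entered the archive
```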
pmin represents the minimum selection probability, which is used to ensure that unexplored or poorly performing operators still get selected occasionally for exploration purposes. qi,t is the estimate of the quality of operator i at time t. AP identifies the operator with the highest quality metric qi,t and updates the selection probabilities pi,t based on the learning rate β ∈ [0, 1] and pmin, so that the probability of selecting the best operator is increased by a quantity controlled by the learning rate until it reaches pmax, and the probabilities of selecting the other operators are decreased in the same way, with a lower bound of pmin. When the selected operator cannot be applied to the parent solutions (the IF portion of the operator is not satisfied), AOS proceeds to re-select a different operator until an applicable one is found. Therefore, the actual selection frequency of an operator i is not necessarily proportional to pi,t: it is influenced by both the selection probability pi,t and the probability of a solution satisfying the IF portion of the operator.

5. Experiment

In this paper, we empirically test the utility of introducing generalization operators within an EA for extracting compact and interesting design knowledge, as defined by the two confidence metrics and the complexity. The main goal of the experiment is to compare the performance of EAs that utilize generalization operators against the performance of a baseline EA (no generalization). We hypothesize that generalization operators help find features that are more compact, while having good conf(X⇒c) and conf(c⇒X). We are also interested in identifying the effects of different generalization strategies on the search behavior and search performance. Specifically, five different EAs are tested in this work: (1) baseline ε-MOEA; (2) ε-MOEA with FG operators selected using AOS; (3) ε-MOEA with VG operators selected using AOS; (4) ε-MOEA with both FG and VG operators selected using AOS; (5) ε-MOEA with both FG and VG operators selected randomly.

The implementation of all EAs tested in this study is based on an open-source Java library called MOEAFramework (Hadka, Herman, Reed, & Keller, 2015). The crossover operator is applied with probability 1. The mutation operator is applied with probability 0.9 at the chromosome level; in other words, each leaf node in a feature F has a probability of 0.9 · (1/|F|) of being modified by the mutation operator, where |F| is the number of literals in F. The ε values used to store solutions in the archive are set to 0.035, 0.035, and 1, each corresponding to one of the three objectives: maximizing conf(X⇒c), maximizing conf(c⇒X), and minimizing the number of literals. All EAs have a population size of 400 and are set to run up to 100,000 function evaluations. The adaptive pursuit algorithm used for selecting operators has a minimum selection probability pmin of 0.05 and a learning rate β of 0.8.

The dataset used in this experiment consists of 2204 alternative architectures of the Earth-observing satellite system described earlier. The fuzzy Pareto front of these architectures is selected and labeled as the target designs, with a class label c = 1. Specifically, the 358 samples with a Pareto ranking of 7 or less are labeled as c = 1.

Then, the different algorithms listed above are used to find the design features that best explain the target designs.

The performance of each algorithm is measured using hypervolume (HV). HV represents the volume in the objective space spanned by the union of the hypercubes defined using each of the non-dominated solutions paired with a single reference point. A high HV value implies that the approximate Pareto front PF has good convergence and diversity. To eliminate scaling issues, we normalize the objectives and define a reference point at [1.01, 1.01, 1.01]. The reference point is selected slightly off the nadir point ([1, 1, 1]) to reduce the bias of HV favoring the solutions in the center of the PF (Ishibuchi, Imada, Setoguchi, & Nojima, 2017). To compare the performances of the different EAs, we run 30 independent trials for each method. Then, we use the Wilcoxon rank sum test with a significance level of 0.05 to test the statistical significance of the HV differences. As it is a nonparametric test, the Wilcoxon rank sum test does not impose any assumptions on the sample distribution.

6. Results

Fig. 4 shows the change in HV as a function of the number of function evaluations (NFE) for each algorithm tested. One function evaluation corresponds to a one-time calculation of all three objectives for a feature: conf(X⇒c), conf(c⇒X), and the number of literals in the feature. The time it takes for a single evaluation of the objectives depends on the number of samples as well as the size of the feature (number of literals). It should be noted that evaluating the confidences of a feature is relatively cheap compared to the computational time generally needed to evaluate designs. The left chart of Fig. 4 shows the HV for the whole range of NFE (up to 100,000), whereas the right chart of Fig. 4 only shows the HV after the first 5000 NFE in order to highlight the differences across the different EAs. Table 4 shows the medians and standard deviations of the HV for each algorithm at every 20,000 NFE.

The results show that all EAs that utilize any type of generalization operator outperformed the baseline ε-MOEA with statistically significant differences in HV. Among the EAs with generalization, AOS-FG&VG performed the best, with the highest median and mean HV throughout all stages of the search. AOS-FG&VG starts to statistically outperform AOS-VG and AOS-FG around 28,000 NFE and 30,000 NFE, respectively. It also achieves a higher mean HV than Random-FG&VG with statistical significance, even though the difference is not large. Random-FG&VG attains the second-best median HV value at 10,000 NFE. While it starts to statistically outperform AOS-FG around 50,000 NFE, it remains statistically equivalent to AOS-VG. The HV values of AOS-FG and AOS-VG remain at very similar levels throughout the search.

Fig. 5 shows the history of HV values for fixed levels of feature complexity (number of literals in each feature). Here, the HV values are calculated assuming the 2D objective space of the two confidences, with a reference point of [1.01, 1.01]. From Fig. 5, it can be seen that the range over which AOS-FG&VG statistically outperforms ε-MOEA becomes smaller as the number of literals increases. AOS-VG performs well when the number of literals is small (between 1 and 3), but its performance starts to degrade as features get more complex. In contrast, AOS-FG underperforms at small feature lengths, but starts to perform better at moderate feature lengths (between 5 and 11).

7. Discussion
7.1. Effect of generalization on search performance

The results show that all EAs with generalization operators achieve higher HV compared to the baseline EA with no generalization.


Fig. 4. HV as a function of the number of function evaluations (NFE). Each solid line represents the mean HV of 30 independent runs of each method. Left: HV values for the whole range of NFE. Right: HV values after the first 5000 NFE, highlighting the differences at later stages of the search.

Fig. 5. History of HV for different levels of feature complexity. The numbers shown in the bottom right corners represent the number of literals in each feature. The solid lines represent the mean HV of 30 independent runs of each algorithm. The bold portions of the lines indicate statistically significant differences in HV compared to ε-MOEA.

Table 4
Median (standard deviation) of HV values achieved by each method at different points during the search. † indicates a significantly higher mean HV compared to ε-MOEA. ‡ indicates a significantly lower mean HV compared to AOS-FG&VG. Boldface entries indicate the best median value for a given NFE.

Algorithm      20,000 NFE         40,000 NFE         60,000 NFE         80,000 NFE         100,000 NFE
ε-MOEA         0.8058‡ (0.0303)   0.8352‡ (0.0341)   0.8447‡ (0.0364)   0.8510‡ (0.0382)   0.8545‡ (0.0395)
AOS-FG         0.8312†‡ (0.0175)  0.8626†‡ (0.0202)  0.8797†‡ (0.0224)  0.8872†‡ (0.0229)  0.8906†‡ (0.0237)
AOS-VG         0.8319† (0.0312)   0.8650†‡ (0.0338)  0.8825†‡ (0.0369)  0.8880†‡ (0.0377)  0.8912†‡ (0.0376)
AOS-FG&VG      0.8412† (0.0139)   0.8806† (0.0125)   0.9009† (0.0123)   0.9108† (0.0125)   0.9169† (0.0132)
Random-FG&VG   0.8310†‡ (0.0163)  0.8744†‡ (0.0150)  0.8947†‡ (0.0126)  0.9027†‡ (0.0127)  0.9082†‡ (0.0130)


Fig. 6. Average number of features found for a fixed number of literals. The number of features is averaged over 30 independent trials. The total number of features for each algorithm is 100,000. Only numbers of literals between 1 and 15 are shown.

The higher HV implies that an algorithm is able to find more diverse features that are close to the true Pareto front. As the objective space is defined by the three objectives of maximizing conf(X⇒c) and conf(c⇒X) and minimizing the number of literals, this supports our hypothesis that generalization operators help extract high-quality features that are compact.

Among the EAs with generalization operators, AOS-FG&VG statistically outperforms both AOS-FG and AOS-VG. This suggests that FG and VG operators each contribute to the search performance, and that there is added value in applying both types of generalization operators at the same time. Fig. 5 provides a supplementary view highlighting the different effects of the two types of operators on performance. Based on the observations that can be made from Fig. 5, AOS-VG performs well in finding high-quality features that have relatively few literals, while AOS-FG does a better job of finding features with a higher number of literals.

VG operators are useful in finding features with a smaller number of literals, as they extend the capability of individual base features by replacing the argument variables with their corresponding higher-level concept variables. This enables generating features that encompass wider concepts without increasing the number of literals. However, the utility of VG operators is reduced when we only consider the features that have a higher number of literals: AOS-VG’s performance is statistically equivalent to ε-MOEA when the number of literals is 9 or more. While identifying the exact cause of this behavior requires further study, a partial explanation may be drawn from Fig. 6, which shows the number of features evaluated by each algorithm for fixed numbers of literals (between 1 and 15). It is interesting to see that the distributions of the number of features are similar in shape between ε-MOEA and AOS-VG, and between AOS-FG and AOS-FG&VG. For ε-MOEA and AOS-VG, the search is biased towards features with a small number of literals. For example, AOS-VG allotted more than half (about 51,000 out of 100,000) of its total function evaluations to testing features that have either 1 or 2 literals. As HV monotonically increases with the number of solutions explored, there is a higher chance for ε-MOEA and AOS-VG to perform well when we only consider features with 1 or 2 literals. The difference in performance between ε-MOEA and AOS-VG can be attributed to the availability of the base features extended with generalized variables.

The strong emphasis on features with a small number of literals means that algorithms such as ε-MOEA and AOS-VG cannot sufficiently explore the features with a larger number of literals.

In contrast to AOS-VG, AOS-FG does not add much value in finding features with a smaller number of literals. Rather, it performs well in finding high-quality features with a larger number of literals. Recall that most of the FG operators used in this paper increase the number of literals rather than decreasing it. Due to this behavior, AOS-FG ends up testing more features with a relatively large number of literals, which is reflected in Fig. 6. Overall, AOS-FG&VG seems to exhibit the advantages of both AOS-VG and AOS-FG. It performs well in finding good features with 1 or 2 literals, thanks to the presence of generalized variables. It also performs well in finding features of higher length due to the high number of features explored in that region.

7.2. Effect of adaptive operator selection strategy on applying generalization

In this work, we used AOS to control the type and the number of generalization hypotheses generated and tested at different stages of the search. In order to test the utility of using AOS, we compared the performance of AOS-FG&VG with that of Random-FG&VG, where all operators are selected randomly. Table 4 shows that AOS-FG&VG attains a higher mean HV than Random-FG&VG with statistical significance. However, the performance of Random-FG&VG closely resembles that of AOS-FG&VG throughout the search, and the actual differences in HV values are relatively small.

Fig. 7 shows how AOS-FG&VG and Random-FG&VG each rewarded and selected operators. Fig. 7b shows that AOS-FG&VG applied the crossover operator more frequently than any other operator throughout the search. Among the generalization operators, VG operators were selected more often than FG operators. Fig. 7d shows that Random-FG&VG relied less on the crossover operator and used the generalization operators more. However, there is no significant change in the selection frequencies of the generalization operators compared to AOS-FG&VG. Moreover, the overall ordering of the selection frequencies of the different operators is very similar between AOS-FG&VG and Random-FG&VG. This is somewhat unexpected, as one would expect a much more evenly distributed selection of operators in Random-FG&VG.


Fig. 7. (a) Credits earned by each operator in AOS-FG&VG averaged over the 30 trials. (b) Selection frequency of each operator by AOS-FG&VG averaged over the 30 trials. (c) Credits earned by each operator in Random-FG&VG averaged over the 30 trials. (d) Selection frequency of each operator by Random-FG&VG averaged over the 30 trials.

The fact that the ordering of the selection frequencies is similar between AOS-FG&VG and Random-FG&VG suggests that the selection of generalization operators may have largely been influenced by the applicability of each operator, possibly more than by the quality of each operator tracked by AOS. For example, the selection frequency of Separate2Absent may be low not necessarily because it does not do well in creating plausible generalization hypotheses, but simply because there are not many features in the population containing separate. Therefore, the mere existence of generalization operators seems to have a larger impact on performance than the operator selection strategy alone. This result is similar to the findings of Hitomi and Selva (2017), who suggest that the existence of a diverse set of operators increases the exploration capabilities of the EA, contributing to improved performance despite occasionally selecting poor operators.

7.3. Qualitative evaluation of the features

In addition to comparing search performance, it is also important to discuss the utility of the features in helping system architects learn useful information. However, investigating how different features directly influence users’ learning requires carefully designing and running user studies, and is beyond the scope of this paper. In this work, we pick a few features from each algorithm and make subjective assessments of their potential usefulness. In order to make a fair comparison, we use the following steps to select features out of the 100,000 features evaluated in each run: (1) Apply constraints on both conf(X⇒c) and conf(c⇒X) to retain features with similar levels of predictive power;

(1) Apply constraints on both conf(X⇒c) and conf(c⇒X) to retain features with similar levels of predictive power; (2) For each algorithm, keep only the features with the lowest level of complexity (number of literals); (3) Sort the features based on their distance to the utopia point (both confidences being 1.0); (4) Select the three features with the shortest distance to the utopia point (a minimal code sketch of these steps is given below). Using these steps, three features are selected from each of ε-MOEA and AOS-FG&VG. The selected features are shown in Fig. 8. We compare only the features from these two algorithms, as the difference in the structure of the features is the most pronounced between them.

One of the most apparent differences is the number of arguments used in each base feature. For example, if we compare the features in Fig. 8c, the total number of arguments used in defining all three base features from ε-MOEA is 21, while the number of arguments used in the feature from AOS-FG&VG is 8. Note that the measure of complexity used in this work (the number of literals) does not take into account the number of arguments used in each base feature. However, the reduction in the number of arguments alone can make the knowledge represented in a feature much simpler, as can be observed in Fig. 8. When the user processes the information represented in a feature, higher-level variables such as ActiveInstrument or Sun-synchronousOrbit act as chunks of information, in the sense introduced by Miller (1956). The presence of chunks may be important for improving the comprehensibility of a complex feature, since recoding individual pieces of information into chunks increases the overall amount of information that can be processed.
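The selection steps (1)–(4) above can be sketched as follows. The dictionary keys (conf_fc for conf(X⇒c), conf_cf for conf(c⇒X), and literals) and the threshold value are illustrative assumptions rather than our actual data representation.

```python
from math import hypot

def select_features(features, min_conf=0.5, k=3):
    """Steps (1)-(4) in miniature; each feature is a dict with assumed
    keys 'conf_fc' = conf(X=>c), 'conf_cf' = conf(c=>X), and
    'literals' = number of literals."""
    # (1) keep features whose two confidences clear the threshold,
    # so the retained features have similar levels of predictive power
    kept = [f for f in features
            if f["conf_fc"] >= min_conf and f["conf_cf"] >= min_conf]
    if not kept:
        return []
    # (2) keep only the lowest complexity level present
    min_lits = min(f["literals"] for f in kept)
    kept = [f for f in kept if f["literals"] == min_lits]
    # (3) sort by Euclidean distance to the utopia point (1.0, 1.0)
    kept.sort(key=lambda f: hypot(1.0 - f["conf_fc"], 1.0 - f["conf_cf"]))
    # (4) return the k features closest to the utopia point
    return kept[:k]
```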

Fig. 8. Features selected from ε-MOEA (left) and AOS-FG&VG (right). The features in each row are obtained by applying the same minimum confidence thresholds, keeping only the features with the lowest level of complexity for each respective algorithm, and then selecting the features with the shortest distance to the utopia point (both confidences being 1.0).
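Because features are represented as trees, both quantities discussed above, the number of literals and the number of arguments, can be computed by a simple traversal. The tuple-based tree encoding below is an assumed simplification of our actual representation, and the instance name is hypothetical.

```python
def count_literals_and_arguments(feature):
    """Count base features (literals) and their arguments in a feature
    tree of the assumed form: a leaf is (predicate, args_tuple); an
    internal node is ('AND' | 'OR', [children])."""
    if feature[0] in ("AND", "OR"):
        counts = [count_literals_and_arguments(c) for c in feature[1]]
        lits = sum(c[0] for c in counts)
        args = sum(c[1] for c in counts)
        return lits, args
    return 1, len(feature[1])  # leaf: one literal with its arguments

feature = ("AND", [("notInOrbit", ("Dawn-DuskOrbit", "PassiveInstrument")),
                   ("present", ("OCE_SPEC",))])  # hypothetical instrument
print(count_literals_and_arguments(feature))     # (2, 3)
```

Under an encoding like this, the feature from AOS-FG&VG in Fig. 8c would report far fewer arguments than the corresponding ε-MOEA features, even though the literal-count complexity measure used in this work treats them similarly.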


Another advantage of the features obtained from AOS-FG&VG is that the knowledge encoded in those features can be more easily checked against the user's prior knowledge. For example, one of the base features obtained from AOS-FG&VG is notInOrbit(Dawn-DuskOrbit, PassiveInstrument). A system architect may know from his or her prior knowledge that passive instruments operating in the visible or near-infrared spectra require good illumination conditions, and thus should avoid being flown in dawn-dusk orbits, where there is little light available. While this prior knowledge cannot fully explain notInOrbit(Dawn-DuskOrbit, PassiveInstrument) (only 5 out of the 7 passive instruments considered in this problem operate in the visible or near-infrared spectral region), the system architect can determine that the feature is not in conflict with his or her intuition. The process of checking new knowledge against prior knowledge is made easier by the presence of higher-level variables, as our domain knowledge is likely to involve concepts rather than specific individual instance variables.
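The concept-hierarchy lookup that underpins both the VG operators and this kind of prior-knowledge check can be sketched as follows; the instance names and the flat parent map below are hypothetical stand-ins for the problem ontology.

```python
# Toy concept hierarchy (instance -> parent class). The instance names
# are hypothetical; the class names mirror those discussed above.
PARENT = {
    "OCE_SPEC": "PassiveInstrument",       # hypothetical instrument
    "AERO_POL": "PassiveInstrument",       # hypothetical instrument
    "CPR_RAD": "ActiveInstrument",         # hypothetical instrument
    "LEO-600-DD": "Dawn-DuskOrbit",        # hypothetical orbit
    "LEO-800-SSO": "Sun-synchronousOrbit", # hypothetical orbit
}

def generalize_argument(literal, position):
    """Variable generalization: replace one argument of a literal
    (predicate, args) with its parent class, if one is defined."""
    pred, args = literal
    arg = args[position]
    if arg not in PARENT:
        return None  # no parent known: already a class-level variable
    new_args = args[:position] + (PARENT[arg],) + args[position + 1:]
    return (pred, new_args)

ground = ("notInOrbit", ("LEO-600-DD", "OCE_SPEC"))
step1 = generalize_argument(ground, 0)
# ('notInOrbit', ('Dawn-DuskOrbit', 'OCE_SPEC'))
step2 = generalize_argument(step1, 1)
# ('notInOrbit', ('Dawn-DuskOrbit', 'PassiveInstrument'))
```

Checking a mined feature against prior knowledge then amounts to asking whether some chain of such replacements maps the feature onto a rule the architect already believes.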
8. Conclusion

This paper introduced a new method to extract design knowledge in the form of logical rules, using an EA extended with generalization operators. The proposed method uses trees to represent logical expressions, enabling knowledge to be represented in a flexible and compact structure. The EA is extended by introducing two different types of generalization operators, feature-generalization (FG) and variable-generalization (VG) operators, in addition to the conventional evolutionary operators. FG operators create new generalization hypotheses by modifying the predicates used to form features. VG operators, on the other hand, modify the arguments of those predicates: they use concept hierarchies to replace a variable with a higher-level class variable. We hypothesized that generalization would help the EA mine more compact logical rules. The application of generalization operators is controlled using an adaptive operator selection (AOS) strategy, which adjusts the selection frequency of different operators based on their previous performance in finding good solutions.

A comparative experiment was run in order to test the efficacy of generalization operators in mining compact sets of design rules, and to compare the effects of different strategies for applying generalizations. The experiment compared the performance of five different EAs: (1) baseline ε-MOEA with no generalization; (2) ε-MOEA with FG operators applied using AOS (AOS-FG); (3) ε-MOEA with VG operators applied using AOS (AOS-VG); (4) ε-MOEA with both FG and VG operators applied using AOS (AOS-FG&VG); and (5) ε-MOEA with random application of FG and VG operators (Random-FG&VG). The results from the comparative experiment support our hypothesis that generalization helps find design rules that are more compact while achieving good predictive power. They also show that different types of generalization operators, such as FG and VG operators, may have complementary effects on improving the search performance. When we directly compare the features obtained from AOS-FG&VG to those obtained from ε-MOEA, the features from AOS-FG&VG contain a smaller number of arguments, thanks to the higher-level variables. The presence of higher-level variables also makes it easier for a system architect to process the features obtained from AOS-FG&VG and to check the encoded knowledge against his or her prior domain knowledge.

One major limitation of this research lies in the way we model the comprehensibility of logical rules. We used the number of literals in a rule as a proxy for comprehensibility. While this approach has been used in previous works (Martin et al., 2014; Qodmanan et al., 2011), the number of literals alone may not be sufficient to fully characterize comprehensibility, especially when rules are allowed to contain both conjunctions and disjunctions. For example, previous findings in cognitive science suggest that conjunctions and disjunctions have different effects on the comprehensibility of a rule (Haygood & Bourne Jr, 1965; Medin, Wattenmaker, & Michalski, 1987). Moreover, the current model of comprehensibility does not account for the number of arguments used to define each base feature, even though it can be argued that a feature with fewer arguments is easier to understand. Despite such limitations, we used the number of literals to represent the complexity of logical rules due to the lack of operationalizable modeling techniques that can account for all these factors.

Another limitation of the current approach is that the search in the feature space is purely data-driven, and does not take into account the domain knowledge of a system architect. The proposed algorithm only favors features having good predictive power and low complexity. This may prevent the algorithm from finding features with higher-level variables that are more "interesting" to the user in a subjective sense, but have slightly lower performance compared to other features.

The findings in this paper are also limited by the fact that some of the generalization operators depend on the structure of the problem. FG operators such as InOrbit2Present or Separate2Absent are very specific to the formulation structure of the design decisions, and thus may not be applicable to other design problems. Therefore, when a new problem is introduced, some effort is likely necessary to define new generalization operators that have effects similar to those of the operators introduced in this work (FG and VG operators). Such effort can be saved if generalization operators are reused for similar types of problems. For example, Selva, Cameron, and Crawley (2016) argued that there exist certain formulations that appear often in real-world system architecting problems, thus providing an opportunity for the reuse of knowledge.

Future work will focus on addressing some of the major limitations of this paper. A significant effort should be made to develop more accurate models of the comprehensibility of logical rules with different structures. This would enable the proposed method to mine design rules that are actually easier to understand and learn, rather than simply minimizing the number of literals. Testing the proposed method on other types of design problems and in other domains would also be needed to generalize the findings of this paper.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Hyunseung Bang: Conceptualization, Methodology, Software, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Visualization, Project administration. Daniel Selva: Conceptualization, Resources, Writing - review & editing, Supervision, Funding acquisition.

Acknowledgements

This work was partially supported by National Science Foundation Grant number CMMI-1635253.

References

Aggarwal, C. C. (2015). Data mining: The textbook. Springer.
Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Acm sigmod record: 22 (pp. 207–216). ACM.
Agrawal, R., Srikant, R., et al. (1994). Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, vldb: 1215 (pp. 487–499).
Balling, R. (1999). Design by shopping: A new paradigm? In Proceedings of the third world congress of structural and multidisciplinary optimization (wcsmo-3): 1 (pp. 295–297). International Soc. for Structural and Multidisciplinary Optimization, Berlin.



Bandaru, S., & Deb, K. (2010). Automated discovery of vital knowledge from pareto-optimal solutions: First results from engineering design. In Ieee congress on evolutionary computation (pp. 1–8). IEEE.
Bandaru, S., & Deb, K. (2013). A dimensionally-aware genetic programming architecture for automated innovization. In International conference on evolutionary multi-criterion optimization (pp. 513–527). Springer.
Bandaru, S., Ng, A. H., & Deb, K. (2017). Data mining methods for knowledge discovery in multi-objective optimization: Part b-new developments and applications. Expert Systems with Applications, 70, 119–138.
Bellandi, A., Furletti, B., Grossi, V., & Romei, A. (2007). Ontology-driven association rule extraction: A case study. Contexts and Ontologies Representation and Reasoning, 10.
Brachman, R. J., Levesque, H. J., & Reiter, R. (1992). Knowledge representation. MIT Press.
Buckland, M., & Gey, F. (1994). The relationship between recall and precision. Journal of the American Society for Information Science, 45(1), 12–19.
Coenen, F., & Leng, P. (2007). The effect of threshold values on association rule based classification accuracy. Data & Knowledge Engineering, 60(2), 345–360.
DaCosta, L., Fialho, A., Schoenauer, M., & Sebag, M. (2008). Adaptive operator selection with dynamic multi-armed bandits. In Proceedings of the 10th annual conference on genetic and evolutionary computation (pp. 913–920). ACM.
De Falco, I., Della Cioppa, A., & Tarantino, E. (2002). Discovering interesting classification rules with genetic programming. Applied Soft Computing, 1(4), 257–269.
Deb, K., Mohan, M., & Mishra, S. (2003). A fast multi-objective evolutionary algorithm for finding well-spread pareto-optimal solutions. KanGAL Report, 2003002, 1–18.
Dehuri, S., Patnaik, S., Ghosh, A., & Mall, R. (2008). Application of elitist multi-objective genetic algorithm for classification rule generation. Applied Soft Computing, 8(1), 477–487.
Domingues, M. A., & Rezende, S. O. (2011). Using taxonomies to facilitate the analysis of the association rules. arXiv: 1112.1734.
Espejo, P. G., Ventura, S., & Herrera, F. (2010). A survey on the application of genetic programming to classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(2), 121–144.
Fialho, Á., Schoenauer, M., & Sebag, M. (2010). Toward comparison-based adaptive operator selection. In Proceedings of the 12th annual conference on genetic and evolutionary computation (pp. 767–774). ACM.
Fidelis, M. V., Lopes, H. S., & Freitas, A. A. (2000). Discovering comprehensible classification rules with a genetic algorithm. In Evolutionary computation, 2000. proceedings of the 2000 congress on: 1 (pp. 805–810). IEEE.
Ghosh, A., & Nath, B. (2004). Multi-objective rule mining using genetic algorithms. Information Sciences, 163(1–3), 123–133.
Graening, L., Menzel, S., Hasenjäger, M., Bihrer, T., Olhofer, M., & Sendhoff, B. (2008). Knowledge extraction from aerodynamic design data and its application to 3d turbine blade geometries. Journal of Mathematical Modelling and Algorithms, 7(4), 329.
Hadka, D., Herman, J., Reed, P., & Keller, K. (2015). An open source framework for many-objective robust decision making. Environmental Modelling & Software, 74, 114–129.
Han, J., Pei, J., Yin, Y., & Mao, R. (2004). Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1), 53–87.
Haygood, R. C., & Bourne Jr, L. E. (1965). Attribute- and rule-learning aspects of conceptual behavior. Psychological Review, 72(3), 175.
Hitomi, N., & Selva, D. (2017). A classification and comparison of credit assignment strategies in multiobjective adaptive operator selection. IEEE Transactions on Evolutionary Computation, 21(2), 294–314.
Hitomi, N., & Selva, D. (2018). Incorporating expert knowledge into evolutionary algorithms with operators and constraints to design satellite systems. Applied Soft Computing, 66, 330–345.
Huysmans, J., Dejaeger, K., Mues, C., Vanthienen, J., & Baesens, B. (2011). An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems, 51(1), 141–154.
Idoudi, R., Ettabaa, K. S., Solaiman, B., & Hamrouni, K. (2016). Ontology knowledge mining based association rules ranking. Procedia Computer Science, 96, 345–354.
Ishibuchi, H., Imada, R., Setoguchi, Y., & Nojima, Y. (2017). Reference point specification in hypervolume calculation for fair comparison and efficient search. In Proceedings of the genetic and evolutionary computation conference (pp. 585–592). ACM.
Jahr, R., Calborean, H., Vintan, L., & Ungerer, T. (2012). Boosting design space explorations with existing or automatically learned knowledge. In International gi/itg conference on measurement, modelling, and evaluation of computing systems and dependability and fault tolerance (pp. 221–235). Springer.
Kabir, M. M. J., Xu, S., Kang, B. H., & Zhao, Z. (2017). A new multiple seeds based genetic algorithm for discovering a set of interesting boolean association rules. Expert Systems with Applications, 74, 55–69.
Koza, J. R. (1994). Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2), 87–112.
Krawiec, K. (2002). Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines, 3(4), 329–343.
Kuo, Y.-T., Lonie, A., Sonenberg, L., & Paizis, K. (2007). Domain ontology driven data mining: A medical case study. In Proceedings of the 2007 international workshop on domain driven data mining (pp. 11–17). ACM.
Lang, K. J. (1995). Hill climbing beats genetic search on a boolean circuit synthesis problem of koza's. In Machine learning proceedings 1995 (pp. 340–343). Elsevier.

Laumanns, M., Thiele, L., Deb, K., & Zitzler, E. (2002). Combining convergence and diversity in evolutionary multiobjective optimization. Evolutionary Computation, 10(3), 263–282.
Lenca, P., Vaillant, B., Meyer, P., & Lallich, S. (2007). Association rule interestingness measures: Experimental and theoretical studies. In Quality measures in data mining (pp. 51–76). Springer.
Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. In Proceedings of the fourth international conference on knowledge discovery and data mining.
Mansingh, G., Osei-Bryson, K.-M., & Reichgelt, H. (2011). Using ontologies to facilitate post-processing of association rules by domain experts. Information Sciences, 181(3), 419–434.
Marinica, C., & Guillet, F. (2010). Knowledge-based interactive postmining of association rules using ontologies. IEEE Transactions on Knowledge and Data Engineering, 22(6), 784–797.
Marinica, C., Guillet, F., & Briand, H. (2008). Post-processing of discovered association rules using ontologies. In Data mining workshops, 2008. icdmw'08. ieee international conference on (pp. 126–133). IEEE.
Martin, D., Rosete, A., Alcala-Fdez, J., & Herrera, F. (2014). A new multiobjective evolutionary algorithm for mining a reduced set of interesting positive and negative quantitative association rules. IEEE Transactions on Evolutionary Computation, 18(1), 54–69.
Martínez-Ballesteros, M., Martínez-Álvarez, F., Troncoso, A., & Riquelme, J. C. (2014). Selecting the best measures to discover quantitative association rules. Neurocomputing, 126, 3–14.
Maturana, J., Lardeux, F., & Saubion, F. (2010). Autonomous operator management for evolutionary algorithms. Journal of Heuristics, 16(6), 881–909.
Medin, D. L., Wattenmaker, W. D., & Michalski, R. S. (1987). Constraints and preferences in inductive learning: An experimental study of human and machine performance. Cognitive Science, 11(3), 299–339.
Michalski, R. S. (1983). A theory and methodology of inductive learning. In Machine learning (pp. 83–134). Springer.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81.
Mitchell, T. M. (1982). Generalization as search. Artificial Intelligence, 18(2), 203–226.
Natarajan, R., & Shekar, B. (2005). A relatedness-based data-driven approach to determination of interestingness of association rules. In Proceedings of the 2005 acm symposium on applied computing (pp. 551–552). ACM.
Newell, A., Simon, H. A., et al. (1972). Human problem solving: 104. Prentice-Hall, Englewood Cliffs, NJ.
Ng, A. H., Bandaru, S., & Frantzén, M. (2016). Innovative design and analysis of production systems by multi-objective optimization and data mining. In Procedia cirp: 50 (pp. 665–671). Elsevier.
Noda, E., Freitas, A. A., & Lopes, H. S. (1999). Discovering interesting prediction rules with a genetic algorithm. In Proceedings of the 1999 congress on evolutionary computation-cec99 (cat. no. 99th8406): 2 (pp. 1322–1329). IEEE.
Psaila, G., & Lanzi, P. L. (2000). Hierarchy-based mining of association rules in data warehouses. In Proceedings of the 2000 acm symposium on applied computing-volume 1 (pp. 307–312). ACM.
Qodmanan, H. R., Nasiri, M., & Minaei-Bidgoli, B. (2011). Multi objective association rule mining with genetic algorithm without specifying minimum support and minimum confidence. Expert Systems with Applications, 38(1), 288–298.
Ross, A. M., & Hastings, D. E. (2005). The tradespace exploration paradigm. In Incose international symposium: 15 (pp. 1706–1718). Wiley Online Library.
Russo, I. L., Bernardino, H. S., & Barbosa, H. J. (2018). Knowledge discovery in multiobjective optimization problems in engineering via genetic programming. Expert Systems with Applications, 99, 93–102.
Selva, D. (2014). Knowledge-intensive global optimization of earth observing system architectures: A climate-centric case study. In Sensors, systems, and next-generation satellites xviii: 9241 (p. 92411S). International Society for Optics and Photonics.
Selva, D., Cameron, B., & Crawley, E. (2016). Patterns in system architecture decisions. Systems Engineering, 19(6), 477–497.
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437.
Srikant, R., & Agrawal, R. (1995). Mining generalized association rules.
Tan, P.-N., Kumar, V., & Srivastava, J. (2004). Selecting the right objective measure for association analysis. Information Systems, 29(4), 293–313.
Tatsukawa, T., Nonomura, T., Oyama, A., & Fujii, K. (2013). A new multiobjective genetic programming for extraction of design information from non-dominated solutions. In International conference on evolutionary multi-criterion optimization (pp. 528–542). Springer.
Thierens, D. (2005). An adaptive pursuit strategy for allocating operator probabilities. In Proceedings of the 7th annual conference on genetic and evolutionary computation (pp. 1539–1546). ACM.
Tseng, M.-C., Lin, W.-Y., & Jeng, R. (2007). Mining association rules with ontological information. In Innovative computing, information and control, 2007. icicic'07. second international conference on (p. 300). IEEE.
Watanabe, S., Chiba, Y., & Kanazaki, M. (2014). A proposal on analysis support system based on association rule analysis for non-dominated solutions. In Evolutionary computation (cec), 2014 ieee congress on (pp. 880–887). IEEE.
Wong, M. L., & Leung, K. S. (1995). Inducing logic programs with genetic algorithms: The genetic logic programming system. IEEE Intelligent Systems, (5), 68–76.
Woodruff, M. J., Reed, P. M., & Simpson, T. W. (2013). Many objective visual analytics: Rethinking the design of complex engineered systems. Structural and Multidisciplinary Optimization, 48(1), 201–219.

Yan, X., Qiao, M., Simpson, T., Li, J., & Zhang, X. (2012). Work-centered visual analytics to support multidisciplinary design analysis and optimization. In 12th aiaa aviation technology, integration, and operations (atio) conference and 14th aiaa/issmo multidisciplinary analysis and optimization conference (p. 5662).
Yan, X., Zhang, C., & Zhang, S. (2009). Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support. Expert Systems with Applications, 36(2), 3066–3076.
Zhang, M., Gao, X., & Lou, W. (2007). A new crossover operator in genetic programming for object classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(5), 1332–1343.


Zhang, Y., & Rockett, P. I. (2011). A generic optimising feature extraction method using multiobjective genetic programming. Applied Soft Computing, 11(1), 1087–1097.
Zhou, C., Xiao, W., Tirpak, T. M., & Nelson, P. C. (2003). Evolving accurate and compact classification rules with gene expression programming. IEEE Transactions on Evolutionary Computation, 7(6), 519–531.
Zhou, X., & Geller, J. (2008). Raising, to enhance rule mining in web marketing with the use of an ontology. In Data mining with ontologies: Implementations, findings, and frameworks (pp. 18–36). IGI Global.