0-1 multilinear programming as a unifying theory for LAD pattern generation✩

Kedong Yan^a, Hong Seo Ryoo^b,∗

^a Department of Information Management Engineering, Graduate School of Information Management & Security, Korea University, 145 Anam-Ro, Seongbuk-Gu, Seoul, 02841, Republic of Korea
^b School of Industrial Management Engineering, College of Engineering, Korea University, 145 Anam-Ro, Seongbuk-Gu, Seoul, 02841, Republic of Korea
Article history: Received 9 July 2015; Received in revised form 30 November 2015; Accepted 9 August 2016.

Keywords: Boolean logic; Logical analysis of data; Pattern generation; Multilinear programming; 0-1 linearization
Abstract

This paper revisits the Boolean logical requirement of a pattern and develops 0-1 multilinear programming (MP) models for (Pareto-)optimal patterns in logical analysis of data (LAD). We show that all existing, and also new, pattern generation models can naturally be obtained from the MP models via linearization techniques for 0-1 multilinear functions. Furthermore, 0-1 MP provides insight into how different, independently developed models for a particular type of pattern are inter-related. Together, these results show that 0-1 MP presents a unifying theory for pattern generation in LAD.
✩ This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (Grant Number: NRF-2013R1A1A2011784).
∗ Corresponding author. E-mail addresses: [email protected] (K. Yan), [email protected] (H.S. Ryoo).

1. Introduction

LAD is a Boolean logic-based data analytics methodology [14,22,35] that has aroused much interest in the optimization and data mining communities during the past two decades and has proven useful in practical decision making [1,2,4,5,24,29,31]. For the separation of a finite set of two different types of observations/data, a critical step in LAD successively discovers pieces of structural information, each of which distinguishes one or more data of one type from all data of the other type. Such a piece of knowledge is called a pattern; it corresponds to a conjunction of literals in Boolean logic, where a literal refers to a 0-1 feature or its negation. Patterns are the building blocks of a LAD classification theory, and the construction of a pattern with respect to a desired pattern selection criterion forms the key stage in data analytics via LAD. The difficulty is that pattern generation is a combinatorial optimization problem that forms the bottleneck stage in the application of LAD.

Deferring the presentation of background material until Section 2, pattern generation approaches in the literature can be classified as either enumeration-based [12,23,26] or optimization-based [10,20,27,35]. Mindful of the numerical difficulties associated with pattern generation, term-enumeration methods were developed first; they rely on simple rules for constructing terms that merely satisfy certain requirements imposed on patterns. As heuristic approaches, however, they have limitations in identifying a pattern that is optimal or Pareto-optimal with respect to one or more pattern preferences. As a result, these methods ironically suffer from poor average run-time complexity, for the number of LAD patterns made up of $d$ of the total of $n$ literals can be as many as $\binom{n}{d} 2^d$. To alleviate the limitations of term-enumeration methods in finding useful patterns, tools have also been developed for transforming an existing pattern into a Pareto-optimal one in time polynomial in the number of terms of $d$ literals for $d \in \{1,\ldots,n\}$ [3,6,26], a number which is, however, exponentially large.

Unlike enumeration methods, optimization-based approaches can generate optimal and Pareto-optimal patterns, handle patterns of different degrees with an equal amount of effort, and suffer from the numerical complexity of the associated optimization problem only in the worst case. In brief, for identifying a pattern with maximum coverage among those that distinguish a reference datum $A_\ell$ from all data of the other type, [23] first presented a polynomial set covering formulation and used the best linear approximation of pseudo-Boolean functions from [25] to identify a pattern that is almost an '$A_\ell$-maximum' pattern. Subsequently, the authors linearized the same polynomial set covering model to obtain an optimization-based approach for generating $A_\ell$-maximum patterns in [10]. On a related note, [9] considers a combined pattern generation and theory formation problem as a way to give rise to a 'large margin' theory that supposedly performs well in testing. The same formulation appears in [27] within a column generation scheme for large-scale integer programming, in search of 'near/fuzzy patterns' that perform well in testing. Although optimization-flavored, these two methods are neither designed nor guaranteed to generate patterns, hence can be classified as heuristic methods. For optimization-based pattern generation, [35] gave the first full treatment and presented a set of mixed integer and linear programming (MILP) models for constructing various optimal and Pareto-optimal LAD patterns. Through extensive experiments, [35] also demonstrated advantages of optimization-based approaches over enumeration-based methods. In [20], a set of MILP models in a much smaller number of 0-1 decision variables was presented for more efficient generation of useful LAD patterns via optimization principles. Equipped with new models that are much more easily solved, [20] also demonstrated the different utilities of strong prime and strong spanned patterns in classifying new data.

Motivated by a surge of interest in more efficient optimization-based LAD pattern generation, this paper revisits the Boolean logical definition of a LAD pattern and presents a unifying theory of LAD pattern generation in 0-1 MP. Specifically, we use the Boolean logical definition of a LAD pattern to first develop 0-1 MP models for well-studied (Pareto-)optimal LAD patterns. Next, we demonstrate that all existing pattern generation models from the literature, and also new ones, can be obtained from these MP models via 0-1 linearization techniques for multilinear functions. Furthermore, we demonstrate that 0-1 MP provides insight into how different, independently developed pattern generation models for a particular type of LAD pattern are inter-related. In summary, these results show that 0-1 MP holds a unifying theory for pattern generation in LAD (refer to Fig. 1).

As for the organization of this paper, Section 2 provides a brief background on LAD patterns, and Section 3 develops 0-1 MP models for optimal and Pareto-optimal LAD patterns that are well-studied in the literature.
In Section 4, we use McCormick's envelopes for multilinear functions, standard probing techniques from integer programming, and also new valid inequalities for 0-1 multilinear functions to demonstrate how the 0-1 MILP/IP pattern generation models from the literature, as well as new models, can be obtained from the MP models of Section 3. As a bonus treatment, Section 5 illustrates the construction of the different pattern generation models on a small dataset and demonstrates their relative efficiency in pattern generation on six machine learning benchmark datasets. Results in this section seem to indicate that the efficiency in pattern generation of the 0-1 linear models is proportionally related to the advancedness of the linearization techniques used for their derivation. This invites more attention and effort to be directed toward the development of more advanced 0-1 linearization techniques and stronger valid inequalities for 0-1 multilinear functions. Finally, concluding remarks are provided in Section 6.

2. Background on LAD pattern

Without loss of generality, let us consider the separation of a finite number of + and − observations/data. For $\bullet \in \{+,-\}$, denote by $S^\bullet$ the index set of the $m^\bullet$ observations of $\bullet$ type. We assume that the dataset is contradiction and duplicate free, so that $S^+ \cap S^- = \emptyset$, and let $S = S^+ \cup S^-$. Denote by $a_1,\ldots,a_n$ the $n$ binary features that uniquely describe each datum $A_i$, $i \in S$, and let $a_{ij}$ denote the binary value of the $j$th feature of $A_i$ for $i \in S$. In Boolean algebra, a term refers to a conjunction of one or more literals, where a literal is either a feature $a_j$ or its complement $\neg a_j$ (or $\bar{a}_j$) for $j \in N := \{1,\ldots,n\}$. Therefore, a term takes the form
$$T := \bigwedge_{j \in N_1} a_j \wedge \bigwedge_{j \in N_2} (\neg a_j),$$
where $N_1, N_2 \subseteq N$ with $N_1 \cap N_2 = \emptyset$. We say that a term $T$ covers $A_i$ if
$$T(A_i) := \bigwedge_{j \in N_1} a_{ij} \wedge \bigwedge_{j \in N_2} (\neg a_{ij}) = 1.$$
In LAD, a term is called a + (−) pattern if it distinguishes at least one + (−) observation from all − (+) observations; that is, if it covers one or more + (−) data and none of the − (+) data. (Given the symmetry in the definitions of + and − patterns, we deal with the generation of + patterns in this paper for convenience of presentation.) Thus, a term $T$ is a pattern if and only if $T(A_i) = 1$ for some $i \in S^+$ and $T(A_i) = 0$ for all $i \in S^-$.

Let us introduce $n$ additional 0-1 features $a_{n+j}$ that negate $a_j$, namely $a_{n+j} := \neg a_j = 1 - a_j$ for $j \in N$. Let $N' := \{n+1,\ldots,2n\}$ and $\mathcal{N} := N \cup N'$. Now, each $A_i$, $i \in S$, is uniquely described by $2n$ 0-1 features/attributes $a_j$, $j \in \mathcal{N}$.

Fig. 1. Hierarchy of MP and MILP/IP pattern generation models for LAD. (Note: Models newly developed in this paper are inside highlighted boxes.)

For constructing patterns, we need indicator variables for the literals. Hence, we introduce $2n$ 0-1 variables $x_j$, $j \in \mathcal{N}$, and define
$$x_j = \begin{cases} 1, & \text{if literal } a_j \text{ is included in a term;} \\ 0, & \text{otherwise.} \end{cases}$$
Given a 0-1 vector $x$, then, by letting $N_T := \{j \in \mathcal{N} : x_j = 1\}$, we can uniquely determine the Boolean term associated with the solution as
$$T = \bigwedge_{j \in N_T} a_j = \bigwedge_{j \in N_T \cap N} a_j \wedge \bigwedge_{j \in N_T \cap N'} (\neg a_{j-n}).$$
Now, let $J_i := \{j \in \mathcal{N} \mid a_{ij} = 0\}$ for $A_i$, $i \in S$ (note that $|J_i| = n$), and note that
$$T(A_i) = \begin{cases} 1, & \text{if } N_T \cap J_i = \emptyset; \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$
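The bookkeeping behind (1) is simple to mechanize. The following is a minimal Python sketch (ours, not the authors' code; 0-indexed, whereas the paper indexes features from 1) that extends an observation by its negated features, builds $J_i$, and tests the covering condition $N_T \cap J_i = \emptyset$:

```python
def extend(a):
    """Append negated features: a_{n+j} = 1 - a_j for j = 1, ..., n."""
    return a + [1 - v for v in a]

def J(a):
    """J_i = {j : a_ij = 0} over the 2n extended features (0-indexed)."""
    return {j for j, v in enumerate(extend(a)) if v == 0}

def covers(N_T, a):
    """Condition (1): the term on N_T covers A_i iff N_T and J_i are disjoint."""
    return N_T.isdisjoint(J(a))

# Example with n = 3: the term a_1 AND (NOT a_3) corresponds to
# N_T = {0, 5} (0-indexed: feature 1 and the negation of feature 3).
A1 = [1, 0, 0]
print(covers({0, 5}, A1))  # True: a_11 = 1 and a_13 = 0
```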
This yields that $T$ is a LAD pattern if and only if $N_T \cap J_i = \emptyset$ for some $i \in S^+$ and $N_T \cap J_i \neq \emptyset$ for every $i \in S^-$; equivalently,
$$\bigvee_{j \in J_i} x_j = 0 \ \text{ for some } i \in S^+ \iff \bigwedge_{i \in S^+} \bigvee_{j \in J_i} x_j = 0 \iff \neg\Big(\bigwedge_{i \in S^+} \bigvee_{j \in J_i} x_j\Big) = \neg 0$$
and
$$\bigvee_{j \in J_i} x_j = 1, \ \forall i \in S^- \iff \bigwedge_{i \in S^-} \bigvee_{j \in J_i} x_j = 1 \iff \neg\Big(\bigwedge_{i \in S^-} \bigvee_{j \in J_i} x_j\Big) = \neg 1.$$
Applying De Morgan's laws to the above, we can describe a LAD pattern by a Boolean logical formula in disjunctive normal form as follows:

Definition 1 (Pattern). Given $x \in \{0,1\}^{2n}$, $T$ formed on $x$ via (1) is a LAD pattern if and only if
$$\bigvee_{i \in S^+} \bigwedge_{j \in J_i} (1 - x_j) = 1 \quad \text{and} \quad \bigvee_{i \in S^-} \bigwedge_{j \in J_i} (1 - x_j) = 0.$$
Degree and coverage are two important characteristics of a pattern. The degree of a pattern $P$ gives the number of literals in it, while the coverage of a pattern, $Cov(P)$, is the set (or, with abuse of notation, the number) of observations covered by $P$. LAD patterns are defined in terms of degree and coverage, more specifically with respect to coverage (or evidence), simplicity and specificity preferences (e.g., [3,6,20,23,26]). As a pattern is a trait that is shared only by a set of homogeneous data, a (Pareto-)optimal pattern with respect to the coverage/evidence criterion is most desired [3,26]. We begin with evidentially-preferred patterns.

Definition 2 (Strong Pattern [10,20,23,26]). A pattern optimal with respect to the coverage (or evidential) preference for LAD patterns is called strong. A strong pattern thus has the maximum coverage among all patterns. A pattern is called $C$-maximum if it has the largest coverage among a maximal set $\mathcal{P}$ of patterns that cover a set of reference observations $A_i$, $i \in C \subseteq S^+$. When $C = \{\ell\}$ for some $\ell \in S^+$, a $C$-maximum pattern is called an $A_\ell$-maximum pattern. As seen, $A_\ell$-maximum and $C$-maximum patterns are Pareto-optimal with respect to the coverage preference for LAD patterns, in reference to $\mathcal{P}$.

For clarification, we pause to remark that the concept of an evidentially-preferred pattern has evolved during the last decade. When first introduced in [26], a pattern $P$ was broadly termed a strong pattern if and only if there is no pattern $P'$ such that $Cov(P) \subset Cov(P')$. Then, [6] called a strong pattern maximal to clearly indicate that ''its set inclusion is maximal''. [10,23] dealt with a special case and called a pattern $\omega$-maximum if it has the maximum coverage among all patterns that cover a reference observation $\omega$. Finally, [20] furthered this line of refinements with the notion of $C$-maximum patterns and distinguished between patterns that are Pareto-optimal and optimal with respect to coverage. Definition 2 reflects this progress of LAD research and clearly conveys the subtle yet different concepts and utilities among the three types of evidentially-preferred patterns.

Definition 3 (Prime & Spanned Pattern [3,6,12,26]). A prime pattern is optimal with respect to the simplicity preference for LAD patterns, thus becomes a non-pattern if any of its literals is removed. For a maximal set $\mathcal{P}$ of patterns with the same coverage on a set of observations, a spanned pattern is Pareto-optimal with respect to the coverage (that is, with respect to $\mathcal{P}$) and selectivity preferences for LAD patterns. In plain terms, a spanned pattern has the maximum degree among all the patterns in $\mathcal{P}$.

Simplicity and selectivity criteria prove particularly useful when they supplement evidence/coverage as a secondary preference.

Definition 4 (Strong Prime (Spanned) Pattern [3,6,10,20,23,35]). A strong prime (spanned) pattern is optimal with respect to the coverage and simplicity (selectivity) preferences for LAD patterns; in plain terms, it has the smallest (largest) degree among all strong patterns. A $C$-maximum prime (spanned) pattern is Pareto-optimal with respect to the coverage and simplicity (selectivity) preferences, in reference to a maximal set $\mathcal{P}$ of patterns that cover the reference observations in $C$; in plain terms, we call a pattern $C$-maximum prime (spanned) if it has the smallest (largest) degree among all $C$-maximum patterns. When $C = \{\ell\}$, $C$-maximum prime (spanned) patterns are called $A_\ell$-maximum prime (spanned) patterns.
Abiding by the notation and definitions above, we proceed to the two main sections of this paper. First, the following section develops 0-1 MP models for strong and strong prime (spanned) patterns and directly obtains from them optimization models for $C$-maximum and $C$-maximum prime (spanned) patterns, respectively, via a simple mathematical restriction to cover the observation(s) in $C$.
3. 0-1 MP models for LAD patterns

A disjunctive normal formula in Boolean logic can be translated into an equivalent statement in a sum of products of literals by substituting the sum ($\sum$) for the disjunction ($\vee$) and the product ($\prod$) for the conjunction ($\wedge$). Doing this to the definition of a pattern in Definition 1, we obtain an equivalent definition of a LAD pattern in terms of two 0-1 multilinear inequalities. Specifically, given a 0-1 vector $x$, we can determine whether the solution builds a LAD pattern by the following necessary and sufficient conditions:
$$\sum_{i \in S^+} \prod_{j \in J_i} (1 - x_j) > 0 \qquad (2)$$
and
$$\sum_{i \in S^-} \prod_{j \in J_i} (1 - x_j) = 0,$$
where the latter can be replaced by
$$\sum_{i \in S^-} \prod_{j \in J_i} (1 - x_j) \le 0 \qquad (3)$$
as $\sum_{i \in S^-} \prod_{j \in J_i} (1 - x_j) \ge 0$ is always satisfied from $x_j \le 1$ for all $j \in \mathcal{N}$. Let
$$F := \big\{ x \in \{0,1\}^{2n} : (3) \big\}.$$
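For intuition, the two multilinear conditions can be checked directly. Below is a small sketch (ours; it reuses the helper `J(.)` from the earlier snippet) that evaluates (2) and (3) for a candidate $x \in \{0,1\}^{2n}$:

```python
def monomial(x, J_i):
    """prod_{j in J_i} (1 - x_j): equals 1 iff x_j = 0 for all j in J_i."""
    p = 1
    for j in J_i:
        p *= 1 - x[j]
    return p

def is_pattern(x, J_pos, J_neg):
    """x builds a LAD pattern iff (2) holds (> 0) and (3) holds (<= 0)."""
    lhs2 = sum(monomial(x, J_i) for J_i in J_pos)  # left-hand side of (2)
    lhs3 = sum(monomial(x, J_i) for J_i in J_neg)  # left-hand side of (3)
    return lhs2 > 0 and lhs3 <= 0
```

Note that the left-hand side of (2) equals the number of + observations covered, a fact exploited by $(M_s)$ below.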
LAD is a supervised learning methodology, and the main goal of supervised learning is to discover knowledge on training/past data that generalizes well to future/unseen data in a manner consistent with the training data. This suggests that a pattern with the maximum coverage is desired, and such a pattern is called strong, as reviewed in the previous section. For generating strong patterns, consider:
$$(M_s): \ c_s = \max_{x \in F} \ \sum_{i \in S^+} \prod_{j \in J_i} (1 - x_j).$$
Proposition 1. $(M_s)$ admits a feasible solution that builds a pattern.

Proof. Consider any $\ell \in S^+$ and let $N_T := \mathcal{N} \setminus J_\ell$. Let
$$x_j = \begin{cases} 0, & \text{for } j \in J_\ell; \\ 1, & \text{otherwise.} \end{cases}$$
This 0-1 vector $x$ satisfies $\prod_{j \in J_\ell} (1 - x_j) = 1$ in (2) for $\ell$ and satisfies the constraint in (3), as we have $N_T \cap J_i \neq \emptyset$ for each $i \in S^-$. □
Theorem 1. An optimal solution to $(M_s)$ builds a strong pattern with coverage $c_s$.

Proof. Consider any $x$ feasible to $(M_s)$ and let $N_T := \{ j \in \mathcal{N} : x_j = 1 \}$. Then, for each $i \in S^+$, we have
$$\prod_{j \in J_i} (1 - x_j) = \begin{cases} 1, & \text{if } J_i \cap N_T = \emptyset; \\ 0, & \text{otherwise,} \end{cases}$$
or, equivalently,
$$\prod_{j \in J_i} (1 - x_j) = \begin{cases} 1, & \text{if } A_i \text{ is covered;} \\ 0, & \text{otherwise.} \end{cases}$$
This shows that the objective function of $(M_s)$ calculates the coverage of a feasible solution, and thus that the pattern built on a maximizer $x^*$ of $(M_s)$ via (1) is a strong pattern with coverage $c_s$. □

Note that the degree of a pattern/term is simply given by $d = \sum_{j \in \mathcal{N}} x_j \in \{1,\ldots,n\}$. Referring to Definition 4, a pattern is called strong prime if it has the minimum degree among all strong patterns. For identifying strong prime patterns, we first choose a real parameter $\alpha \in \big(0, \tfrac{1}{n-1}\big)$ and modify $(M_s)$ into the 0-1 MP model below.
$$(M_{sp}): \ c_{sp} = \max_{x \in F} \ \sum_{i \in S^+} \prod_{j \in J_i} (1 - x_j) - \alpha \sum_{j \in \mathcal{N}} x_j.$$
Theorem 2. An optimal solution to $(M_{sp})$ builds a strong prime pattern with coverage $c_{sp}$.

Proof. For notational simplicity, we let $c(x) = \sum_{i \in S^+} \prod_{j \in J_i} (1 - x_j)$ and $d(x) = \sum_{j \in \mathcal{N}} x_j$ and write the objective function of $(M_{sp})$ as $c(x) - \alpha d(x)$. Now, consider any two feasible solutions $x^1$ and $x^2$ to $(M_{sp})$ and assume without loss of generality that $c(x^1) > c(x^2)$; that is,
$$c(x^1) \ge c(x^2) + 1.$$
Then, $d(x) \in \{1,\ldots,n\}$ and $\alpha \in \big(0, \tfrac{1}{n-1}\big)$, respectively, yield
$$\alpha d(x^1) - \alpha d(x^2) = \alpha\big(d(x^1) - d(x^2)\big) \le \alpha (n-1) < 1,$$
and we obtain
$$\big(c(x^1) - \alpha d(x^1)\big) - \big(c(x^2) - \alpha d(x^2)\big) = c(x^1) - c(x^2) + \alpha d(x^2) - \alpha d(x^1) \ge 1 - \big(\alpha d(x^1) - \alpha d(x^2)\big) > 0.$$
This shows that an optimal solution to $(M_{sp})$ corresponds to a strong pattern. Now, consider two feasible solutions $x^1$ and $x^2$ to $(M_{sp})$ with the same coverage $c$ but with different degrees. Assume without loss of generality that $d(x^1) < d(x^2)$ to see that
$$\big(c - \alpha d(x^1)\big) - \big(c - \alpha d(x^2)\big) = \alpha\big(d(x^2) - d(x^1)\big) > 0.$$
Combining the two results above shows that an optimal solution to $(M_{sp})$ corresponds to a strong prime pattern with coverage $c_{sp}$. □

Referring to Definition 4, an optimal solution to
$$(M_{ss}): \ c_{ss} = \max_{x \in F} \ \sum_{i \in S^+} \prod_{j \in J_i} (1 - x_j) + \beta \sum_{j \in \mathcal{N}} x_j$$
generates a strong spanned pattern, where $\beta$ is any real number from $\big(0, \tfrac{1}{n}\big)$.

Corollary 1. An optimal solution to $(M_{ss})$ builds a strong spanned pattern with coverage $c_{ss}$.

Proof. Immediate. □
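To make the roles of $\alpha$ and $\beta$ concrete, the sketch below (ours, for intuition only; realistic instances call for the MILP/IP models of Section 4) solves $(M_s)$, $(M_{sp})$ and $(M_{ss})$ on a tiny dataset by brute-force enumeration: `weight = 0` gives $(M_s)$, `weight = -alpha` with $\alpha \in (0, \tfrac{1}{n-1})$ gives $(M_{sp})$, and `weight = beta` with $\beta \in (0, \tfrac{1}{n})$ gives $(M_{ss})$.

```python
from itertools import product

def solve(J_pos, J_neg, n, weight=0.0):
    """Maximize coverage + weight * degree over F by enumerating x."""
    best, best_x = None, None
    for x in product((0, 1), repeat=2 * n):
        if sum(x) == 0:
            continue  # a term has at least one literal
        if any(all(x[j] == 0 for j in J_i) for J_i in J_neg):
            continue  # violates (3): some negative observation is covered
        cov = sum(all(x[j] == 0 for j in J_i) for J_i in J_pos)
        if cov == 0:
            continue  # not a pattern: (2) fails
        val = cov + weight * sum(x)
        if best is None or val > best:
            best, best_x = val, x
    return best_x
```

Skipping solutions with zero coverage does not change the maximizer whenever a pattern exists (Proposition 1).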
To develop a 0-1 MP model for $C$-maximum patterns, we first let
$$N_1^C := \{ j \in N : a_{ij} = 1, \ \forall i \in C \} \quad \text{and} \quad N_0^C := \{ j \in N : a_{ij} = 0, \ \forall i \in C \}$$
and $J_C := N_1^C \cup N_0^C$. Next, for $\ell \in C$, we identify a non-empty set
$$J_i^C := \{ j \in J_C : a_{ij} \oplus a_{\ell j} = 1 \}, \quad i \in S \setminus C,$$
where $\oplus$ is the Boolean 'exclusive-or' operator, defined by $a_{ij} \oplus a_{\ell j} = 1$ if and only if $a_{ij} = \neg a_{\ell j}$.

Proposition 2. For $C \subseteq S^+$, a $C$-maximum pattern exists if and only if $J_C \neq \emptyset$ and $J_i^C \neq \emptyset$, $\forall i \in S^-$.

Proof. Necessity is immediate, and sufficiency is shown in the proof of Theorem 3 just below. □

Now, for $C \subseteq S^+$ that satisfies the condition in Proposition 2, we let
$$F_C := \Big\{ x \in \{0,1\}^{|J_C|} : \sum_{i \in S^-} \prod_{j \in J_i^C} (1 - x_j) = 0 \Big\}$$
and consider:
$$(M_m): \ c_m = \max_{x \in F_C} \ \sum_{i \in S^+ \setminus C} \ \prod_{j \in J_i^C} (1 - x_j).$$
Theorem 3. For $C \subseteq S^+$ such that $F_C \neq \emptyset$, let $x^*$ be an optimal solution to $(M_m)$ and let $N_T = \{ j \in J_C : x_j^* = 1 \}$. Then, the term built on $x^*$ via
$$T = \bigwedge_{j \in N_T \cap N_1^C} a_j \wedge \bigwedge_{j \in N_T \cap N_0^C} (\neg a_j) \qquad (4)$$
is a $C$-maximum pattern with coverage $c_m + |C|$.
Proof. Since $J_i^C \neq \emptyset$, $\forall i \in S^-$, any solution with $x_j = 1$, $\forall j \in J_C$, satisfies the single 0-1 multilinear constraint of $F_C$, and the pattern built on it via (4) covers all $A_i$, $i \in C$. This shows that $(M_m)$ admits a feasible solution with coverage at least $|C|$. Note now that
$$\prod_{j \in J_i^C} (1 - x_j) = \begin{cases} 1, & \text{if } J_i^C \cap N_T = \emptyset; \\ 0, & \text{otherwise,} \end{cases}$$
for $i \in S^+ \setminus C$. The maximization principle of $(M_m)$ next yields that the coverage of the patterns covering $C$ is maximized by the pattern associated with $x^*$, and the rest follows naturally. □

When $C$ is a singleton with $C = \{\ell\}$ for $\ell \in S^+$, a $C$-maximum pattern reduces to an $A_\ell$-maximum pattern from [10], and the existence of an $A_\ell$-maximum pattern is guaranteed for a contradiction-free dataset. For generating $A_\ell$-maximum patterns, we let
$$J_i^\ell := \{ j \in N : a_{ij} \oplus a_{\ell j} = 1 \}, \quad i \in S \setminus \{\ell\},$$
and
$$F_\ell := \Big\{ x \in \{0,1\}^n : \sum_{i \in S^-} \prod_{j \in J_i^\ell} (1 - x_j) = 0 \Big\}.$$
Note that $F_\ell \neq \emptyset$ for any $\ell \in S^+$, as the $A_i$, $i \in S$, all have unique 0-1 fingerprints. Now, consider the 0-1 MP formulation below:
$$(M_\ell): \ c_\ell = \max_{x \in F_\ell} \ \sum_{i \in S^+ \setminus \{\ell\}} \ \prod_{j \in J_i^\ell} (1 - x_j).$$
Corollary 2. Let $x^*$ be an optimal solution to $(M_\ell)$ and let $N_T = \{ j \in N : x_j^* = 1 \}$. Then, the term built on $x^*$ via
$$T = \bigwedge_{j \in N_T \cap N_1^\ell} a_j \wedge \bigwedge_{j \in N_T \cap N_0^\ell} (\neg a_j)$$
is an $A_\ell$-maximum pattern with coverage $c_\ell + 1$, where $N_1^\ell := \{ j \in N : a_{\ell j} = 1 \}$ and $N_0^\ell := \{ j \in N : a_{\ell j} = 0 \}$.
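The index sets used above reduce to simple elementwise comparisons against the reference observation; a short sketch (ours; names are illustrative):

```python
def J_i_ell(A, i, ell):
    """J_i^ell = {j : a_ij XOR a_ell,j = 1}: features where A_i and A_ell differ."""
    return {j for j, (u, v) in enumerate(zip(A[i], A[ell])) if u != v}

def split_by_reference(A, ell):
    """N_1^ell and N_0^ell: features at value 1 resp. 0 in A_ell."""
    N1 = {j for j, v in enumerate(A[ell]) if v == 1}
    N0 = {j for j, v in enumerate(A[ell]) if v == 0}
    return N1, N0
```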
As an acute reader will have noted, $(M_m)$ is a restricted version of $(M_s)$ that, in addition to requiring that no negative (−) observation be covered, requires all observations of the other (+) type in $C$ to be covered by the generated pattern, by means of fixing some of the decision variables to value 0. While (rather) immediate, we next establish this result for a reason that will soon become apparent. Toward this end, note for $J_C$ for any $C \subset S^+$ that there exists $J_C'$ such that $j \in J_C$ if and only if $n + j \in J_C'$, and vice versa. Given $C$, thus, we let for $\ell \in C$
$$\bar{J}_C := \{ j \in J_C \cup J_C' : a_{\ell j} = 0 \}$$
and let
$$\bar{F} := \big\{ x \in \{0,1\}^{2n} : (3), \ x_j = 0, \ \forall j \in \bar{J}_C \cup [\mathcal{N} \setminus (J_C \cup J_C')] \big\}.$$
Now, consider the following 0-1 MP:
$$(M_s'): \ c_s' = \max_{x \in \bar{F}} \ \sum_{i \in S^+} \prod_{j \in J_i} (1 - x_j).$$

Theorem 4. For each $C \subseteq S^+$, if a $C$-maximum pattern exists, then $(M_s') \equiv (M_m)$.

Proof. It suffices to show that $(M_s')$ generates a $C$-maximum pattern. To show this, we first note that the existence of a $C$-maximum pattern implies via Proposition 2 that $J_C \neq \emptyset$ and $J_i^C \neq \emptyset$, $\forall i \in S^-$. Moreover, when $J_C \neq \emptyset$, we have $\mathcal{N} \setminus (J_C \cup J_C') \neq \mathcal{N}$. Thus, by setting $x_j = 0$, $j \in \mathcal{N} \setminus (J_C \cup J_C')$, we can restrict our attention to the variables $x_j$, $j \in J_C \cup J_C'$. For $i \in S$, let
$$\tilde{J}_i := J_i \setminus [\mathcal{N} \setminus (J_C \cup J_C')] = J_i \cap (J_C \cup J_C').$$
Noting that $\bar{J}_C = J_\ell \cap (J_C \cup J_C')$ for $\ell \in C$, we have
$$\tilde{J}_i \setminus \bar{J}_C = [J_i \cap (J_C \cup J_C')] \setminus [J_\ell \cap (J_C \cup J_C')] = (J_i \setminus J_\ell) \cap (J_C \cup J_C').$$
Since $J_i^C \neq \emptyset$ for all $i \in S^-$, $J_i \setminus J_\ell$ must have elements in $J_C \cup J_C'$; that is, $\tilde{J}_i \setminus \bar{J}_C \neq \emptyset$. Along with $x_j = 0$ for all $j \in \bar{J}_C$ (which means all observations in $C$ are covered), $\tilde{J}_i \setminus \bar{J}_C \neq \emptyset$ shows that $(M_s')$ admits a feasible solution. This establishes the result that an optimal solution of $(M_s')$ generates a $C$-maximum pattern. □
Recall that $(M_{sp})$ is obtained from $(M_s)$ by subtracting $\alpha \sum_{j \in \mathcal{N}} x_j$ from the objective function of $(M_s)$, while $(M_{ss})$ is obtained from $(M_s)$ by adding $\beta \sum_{j \in \mathcal{N}} x_j$ to its objective function, where $\alpha$ and $\beta$ satisfy certain requirements. Similarly, 0-1 MP models for $C$-maximum prime and $C$-maximum spanned patterns result when the objective function of $(M_m)$ is appropriately modified to include certain extra terms, and the MP models for $A_\ell$-maximum prime and $A_\ell$-maximum spanned patterns are directly available from their $C$-maximum counterparts by selecting $C = \{\ell\}$ for $\ell \in S^+$. These observations show that all 0-1 MP models for optimal and Pareto-optimal LAD patterns feature one 0-1 multilinear constraint and differ only slightly in their objective functions, so the essence of LAD pattern generation can be summed up as
$$\text{optimize} \quad \text{objective function of } (M_s) \text{ or its modification} \quad \text{s.t.} \quad F \text{ or } F_C,$$
where $F_C$ subsumes $F_\ell$ and is a restriction of $F$ with respect to $C \subset S^+$. In what follows, we thus consider only $(M_s)$ and $(M_m)$ for demonstrating that 0-1 MP presents a unifying theory for LAD pattern generation and derive from them all existing models from the literature and a few new models for strong and $C$-maximum patterns. Interested readers may choose a different 0-1 MP model and simply follow our steps to obtain its 0-1 linear derivatives; we omit this illustration for reasons of space.

4. Derivation of 0-1 linear models for LAD patterns

As a unifying theory, 0-1 MP allows for the derivation of various 0-1 linear pattern generation models by means of 0-1 linearization techniques for 0-1 multilinear functions and their refinements. This section extensively demonstrates this utility. Briefly, Section 4.1 uses standard McCormick envelopes for monomial functions, and Section 4.2 exploits probing techniques from integer programming to obtain all 0-1 MILP/IP pattern generation models from the literature, and also new models, from the MP models of the previous section. Last, Section 4.3 employs recent results on strong valid inequalities for (3) to put together 'compact' pattern generation models that are defined in terms of a small number of stronger inequalities.

4.1. McCormick envelopes

For a monomial function of the form
$$\phi(x) = \prod_{j \in L} x_j,$$
where $x_j \in [0,1]$, $j \in L$, McCormick's convex and concave envelopes are
$$\operatorname{conv} \phi(x) = \max\Big\{ 0, \ \sum_{j \in L} x_j - (|L| - 1) \Big\} \qquad (5)$$
and
$$\operatorname{conc} \phi(x) = \min\{ x_j : j \in L \}, \qquad (6)$$
respectively [13,33,34,37].
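For reference, a direct transcription (ours) of (5) and (6); at 0-1 points both envelopes coincide with the monomial itself, which is what makes the linearizations below exact on $\{0,1\}^{2n}$:

```python
from math import prod

def conv_phi(xs):
    """Convex envelope (5): max{0, sum_j x_j - (|L| - 1)}."""
    return max(0.0, sum(xs) - (len(xs) - 1))

def conc_phi(xs):
    """Concave envelope (6): min_j x_j."""
    return min(xs)

xs = [1, 0, 1]
assert conv_phi(xs) <= prod(xs) <= conc_phi(xs)  # tight at 0-1 points
```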
Note that the 0-1 MP model for generating strong patterns can be rewritten as
$$(M_s): \ \max \Big\{ \sum_{i \in S^+} y_i^+ \ : \ \sum_{i \in S^-} y_i^- = 0 \Big\},$$
where
$$y_i^\bullet = \prod_{j \in J_i} (1 - x_j), \quad \text{for } i \in S^\bullet, \ \bullet \in \{+,-\},$$
and the monomials in $(M_s)$ admit linearization via McCormick's envelopes. Specifically, applying (6) to $y_i^-$ for $i \in S^-$, we have
$$(0 =) \ y_i^- = \prod_{j \in J_i} (1 - x_j) \le 1 - x_j \ \Longrightarrow \ x_j \le 1, \quad j \in J_i,$$
but this merely confirms the upper bounds on $x$. Applying (5) to $y_i^-$, however, yields the useful information that
$$(0 =) \ y_i^- = \prod_{j \in J_i} (1 - x_j) \ge \sum_{j \in J_i} (1 - x_j) - (|J_i| - 1) = 1 - \sum_{j \in J_i} x_j,$$
and we can use this to linearly underestimate
$$\prod_{j \in J_i} (1 - x_j) = 0, \quad i \in S^-,$$
as
$$\sum_{j \in J_i} x_j \ge 1, \quad i \in S^-. \qquad (7)$$
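Generating the constraints (7) therefore amounts to listing each negative observation's $J_i$; a one-line sketch (ours; `J(.)` is the helper from Section 2's snippet):

```python
def cover_inequalities(A_neg):
    """One inequality sum_{j in J_i} x_j >= 1, as in (7), per negative datum."""
    return [sorted(J(a)) for a in A_neg]

# For a negative observation (0, 1, 0, 0) with n = 4, this yields the index
# set [0, 2, 3, 5] (0-indexed), i.e. x_1 + x_3 + x_4 + x_6 >= 1 in the
# paper's 1-indexed notation.
```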
As for $y_i^+$, $i \in S^+$, they appear in the objective function to be maximized, hence only need to be linearly overestimated. By means of (6), we have
$$y_i^+ \le 1 - x_j \quad \text{for } j \in J_i, \ i \in S^+. \qquad (8)$$
Let
$$\Xi_s := \big\{ x \in \{0,1\}^{2n}, \ y \in [0,1]^{m^+} : (7), (8) \big\},$$
where the integrality on $y$ is relaxed, owing to $x \in \{0,1\}^{2n}$, (7) and (8). These yield the 0-1 MILP model below for generating strong patterns, which appeared in [20]:
$$(R_s^1): \ \max_{(x,y) \in \Xi_s} \ \sum_{i \in S^+} y_i^+.$$
Now, note from (8) that
$$y_i^+ = \prod_{j \in J_i} (1 - x_j) \le \min\{ 1 - x_j : j \in J_i \}, \quad i \in S^+$$
$$\iff \ |J_i|\, y_i^+ \le |J_i| \min\{ 1 - x_j : j \in J_i \}, \quad i \in S^+$$
$$\Longrightarrow \ n y_i^+ \le \sum_{j \in J_i} (1 - x_j) = n - \sum_{j \in J_i} x_j, \quad i \in S^+$$
$$\iff \ n y_i^+ + \sum_{j \in J_i} x_j \le n, \quad i \in S^+. \qquad (9)$$
By letting
$$\Xi_s' := \big\{ x \in \{0,1\}^{2n}, \ y \in \{0,1\}^{m^+} : (7), (9) \big\},$$
we thus obtain a different 0-1 linear programming model for strong patterns as follows:
$$(R_s^2): \ \max_{(x,y) \in \Xi_s'} \ \sum_{i \in S^+} y_i^+.$$
Unlike in $\Xi_s$, the $y$ in $\Xi_s'$ cannot be relaxed to be continuous. To see this, suppose $\sum_{j \in J_i} x_j = a$ for some $i \in S^+$. Then $y_i^+ \le 1 - a/n$, hence takes the value $1 - a/n$ in an optimal solution of $(R_s^2)$. (Note that $1 - a/n$ is fractional for any $a \in (0, n)$.)

We pause for two remarks.
Remark 1. Let us recall the 0-1 MILP model for strong patterns from [35]:
$$\min \ \sum_{i \in S^+} w_i^+$$
$$\text{s.t.} \ \sum_{j=1}^{2n} a_{ij} x_j + n w_i^+ \ge d, \quad i \in S^+ \qquad (10)$$
$$\sum_{j=1}^{2n} a_{ij} x_j \le d - 1, \quad i \in S^- \qquad (11)$$
$$x_j + x_{n+j} \le 1, \quad j \in N \qquad (12)$$
$$\sum_{j=1}^{2n} x_j = d \qquad (13)$$
$$1 \le d \le n$$
$$x \in \{0,1\}^{2n}, \quad w \in \{0,1\}^{m^+}.$$
Note in the model above that its objective is to minimize the number of uncovered + data. Hence, by letting $w_i^+ = 1 - y_i^+$ for $i \in S^+$, we have
$$\min \ \sum_{i \in S^+} (1 - y_i^+) \iff \max \ \sum_{i \in S^+} y_i^+.$$
Using (13) and $w_i^+$ in (10) yields
$$\sum_{j=1}^{2n} a_{ij} x_j + n(1 - y_i^+) \ge \sum_{j=1}^{2n} x_j, \ i \in S^+ \iff n y_i^+ + \sum_{j=1}^{2n} (1 - a_{ij}) x_j \le n, \ i \in S^+ \iff (9).$$
Likewise, using (13) for $d$ in (11) gives
$$\sum_{j=1}^{2n} a_{ij} x_j \le \sum_{j=1}^{2n} x_j - 1, \ i \in S^- \iff \sum_{j=1}^{2n} (1 - a_{ij}) x_j \ge 1, \ i \in S^- \iff (7).$$
Last, the complementarity inequalities in (12) are automatically satisfied by an optimal solution (see [20] for a proof), hence can simply be dropped from the model formulation. Altogether, the above transforms the 0-1 MILP model from [35] into $(R_s^2)$ and shows that the latter, derived in a top-down fashion from $(M_s)$, is a 'refined version' of the former that is rid of $n + 2$ redundant constraints and 1 variable. Furthermore, after understanding that $(R_s^1)$ and $(R_s^2)$ are derived from the same 0-1 MP model $(M_s)$, one can verify that the constraints in (9) can be obtained by aggregating the constraints in (8) with respect to $j \in J_i$ for each $i \in S^+$. In summary, this remark illustrates how a 0-1 MP model naturally yields different, independently developed optimization models from the literature via standard linearization techniques and/or simple variable transformations and provides an account of how they are inter-related. This demonstrates that 0-1 MP is a unified as well as unifying theory for LAD pattern generation.

Remark 2. Let us recall a general 0-1 multilinear inequality
$$\sum_{i \in M} c_i \prod_{j \in N_i} x_j \le b, \quad x \in \{0,1\}^n, \ N_i \subseteq \{1,\ldots,n\}, \qquad (14)$$
where $c_i > 0$, $i \in M$, that subsumes the one in $F$. A 'cover' for a multilinear inequality is known to be a set $\Gamma \subseteq M$ such that $\sum_{i \in \Gamma} c_i > b$. A cover is called 'minimal' if none of its subsets is a cover (e.g., [7,19]), and the complete set of minimal cover inequalities for a 0-1 multilinear inequality is well-known to be equivalent to the original multilinear inequality in terms of the 0-1 integer solutions that they admit as feasible [19]. With this, one can see that McCormick's term-wise relaxation of (3) in this subsection gives rise to its 0-1 linear relaxation in terms of the minimal cover inequalities of the multilinear inequality. Furthermore, $\Gamma = \{i\}$ for each $i \in S^-$ is a cover for (3), and $S^-$ gives the complete set of minimal covers for (3). Thus, (7) is equivalent to $F$ in terms of feasible 0-1 solutions.

Continuing with the derivation of 0-1 linear pattern generation models, let us consider $(M_m)$ and use (6) to obtain a concave overestimate of its objective function. This yields
$$(R_m): \ \max \ \sum_{i \in S^+ \setminus C} y_i^+$$
$$\text{s.t.} \ y_i^+ \le 1 - x_j, \quad i \in S^+ \setminus C, \ j \in J_i^C$$
$$\sum_{j \in J_i^C} x_j \ge 1, \quad i \in S^-$$
$$x \in \{0,1\}^{|J_C|}, \quad y \in [0,1]^{m^+ - |C|},$$
which is the 0-1 MILP model that appeared in [20] for $C$-maximum patterns. Alternatively, recall from Theorem 4 that $(M_s)$ reduces to $(M_m)$ when restricted to cover every $i \in C$. Hence, by restricting $(R_s^2)$ to cover the observations in $C$, we obtain a different optimization model for $C$-maximum patterns. Of particular interest is the case when $C = \{\ell\}$. In this case, $(M_m)$ reduces to $(M_\ell)$, and the 0-1 linearization of the single multilinear constraint in $F_\ell$ of $(M_\ell)$ via (5) yields:
$$(M_\ell'): \ \max \Big\{ \sum_{i \in S^+ \setminus \{\ell\}} \prod_{j \in J_i^\ell} (1 - x_j) \ : \ \sum_{j \in J_i^\ell} x_j \ge 1, \ i \in S^-, \ x_j \in \{0,1\}, \ j \in N \Big\}.$$
We note that $(M_\ell')$ above is the 0-1 polynomial set covering model that appeared in [23] for $A_\ell$-maximum patterns.
Finally, let us use (6) to obtain a concave overestimate of the monomials in the objective function of $(M_\ell')$ as
$$y_i^+ = \prod_{j \in J_i^\ell} (1 - x_j) \le \min\{ 1 - x_j : j \in J_i^\ell \}$$
$$\Longrightarrow \ |J_i^\ell|\, y_i^+ \le |J_i^\ell| \min\{ 1 - x_j : j \in J_i^\ell \} \ \Longrightarrow \ |J_i^\ell|\, y_i^+ \le \sum_{j \in J_i^\ell} (1 - x_j) \ \iff \ |J_i^\ell|\, y_i^+ + \sum_{j \in J_i^\ell} x_j \le |J_i^\ell|.$$
This allows us to obtain the 0-1 linear programming model
$$(R_\ell): \ \max \ \sum_{i \in S^+ \setminus \{\ell\}} y_i^+$$
$$\text{s.t.} \ |J_i^\ell|\, y_i^+ + \sum_{j \in J_i^\ell} x_j \le |J_i^\ell|, \quad i \in S^+ \setminus \{\ell\}$$
$$\sum_{j \in J_i^\ell} x_j \ge 1, \quad i \in S^-$$
$$x \in \{0,1\}^n, \quad y \in \{0,1\}^{m^+ - 1}.$$
We note that $(R_\ell)$ is the model that appeared in [10] for generating $A_\ell$-maximum patterns. Alternatively, $(R_\ell)$ can also be obtained from $(R_s^2)$ by restricting the latter to cover $A_\ell$.

4.2. Probing & logical conditions

Standard probing techniques and logical implications in integer programming (e.g., [15–17]) can provide important insights for obtaining additional 0-1 linear models for pattern generation. The resulting models are particularly interesting in that they correspond to 'multi-term relaxations' of the 0-1 MP models of Section 3. As a warm-up exercise, we consider a 0-1 multilinear function
$$f(x) = x_1 x_2 x_3 + x_2 x_4 + x_1 x_2 x_4,$$
where $x_j \in \{0,1\}$ for $j = 1,2,3,4$. In order to linearize $f(x)$, we let $f(y) = y_1 + y_2 + y_3$, where
$$y_1 = x_1 x_2 x_3, \quad y_2 = x_2 x_4 \quad \text{and} \quad y_3 = x_1 x_2 x_4.$$
Now, examining the relation between the $y_i$'s and the original variables that appear in them reveals an interesting logical condition. For instance, since $x_1$ appears in the first and third monomials/terms of $f$, $x_1 = 0$ implies $y_1 = y_3 = 0$. In turn, this and $y_i \ge 0$ for $i = 1,2,3$ yield
$$x_1 = 0 \ \Longrightarrow \ y_1 + y_3 \le 0.$$
Probing at $x_1 = 1$, on the other hand, reveals that $y_1 + y_3 \le 2$, where the 2 in the right-hand side is the number of monomials of $f$ in which $x_1$ appears. Hence, we can sum up the logical relation among $x_1$, $y_1$ and $y_3$ in one inequality as
$$y_1 + y_3 \le 2 x_1.$$
Similarly, we can use
$$y_1 + y_2 + y_3 \le 3 x_2, \quad y_1 \le x_3, \quad y_2 + y_3 \le 2 x_4$$
for linearizing $f$ in $x$ and $y$. In order to systematize this method for linearizing 0-1 MP models, we use $(M_s)$ for illustration and let
$$I_j^+ := \{ i \in S^+ : j \in J_i \}$$
for each $j \in \mathcal{N}$ denote the monomials of the objective function of $(M_s)$ in which $x_j$ appears. From this, we obtain
$$\sum_{i \in I_j^+} y_i^+ \le |I_j^+| (1 - x_j), \quad j \in \mathcal{N}. \qquad (15)$$
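The aggregation in (15) is mechanical: group the linearization variables by the original variable they share. The sketch below (ours) reproduces the warm-up inequalities for $f(x) = x_1x_2x_3 + x_2x_4 + x_1x_2x_4$:

```python
monomials = [{1, 2, 3}, {2, 4}, {1, 2, 4}]  # variable index sets N_i

def probing_inequalities(monomials):
    """For each variable x_j, collect the monomials containing it (I_j)."""
    I = {}
    for i, term in enumerate(monomials):
        for j in term:
            I.setdefault(j, []).append(i)
    return I

for j, terms in sorted(probing_inequalities(monomials).items()):
    lhs = " + ".join(f"y{i + 1}" for i in terms)
    print(f"{lhs} <= {len(terms)}*x{j}")
# y1 + y3 <= 2*x1,  y1 + y2 + y3 <= 3*x2,  y1 <= 1*x3,  y2 + y3 <= 2*x4
```

For $(M_s)$ the same grouping applies with $1 - x_j$ in place of $x_j$, yielding (15).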
Using (15), along with the minimal cover inequalities (7) and, optionally, the complementarity inequalities (12), we obtain a new 0-1 linear model from $(M_s)$ for generating strong patterns. Specifically, let
$$\Xi_s^\dagger := \big\{ x \in \{0,1\}^{2n}, \ y \in [0,1]^{m^+} : (7), (12) \text{ (optional)}, (15) \big\}$$
and consider:
$$(R_s^3): \ \max_{(x,y) \in \Xi_s^\dagger} \ \sum_{i \in S^+} y_i^+.$$

Theorem 5. Let $(x^*, y^*)$ be an optimal solution to $(R_s^3)$ and let $N_T := \{ j \in \mathcal{N} : x_j^* = 1 \}$. Then, the term built on $x^*$ via (1) is a strong pattern with coverage $\sum_{i \in S^+} y_i^*$.
Proof. We first show that $y^*$ is integral. For this, suppose the contrary and let $\Delta \subseteq S^+$ denote the index set of the $y_i^*$'s that are fractional. Now, if $x_j^* = 1$, then, by (15) and $y_i^* \ge 0$, $\forall i \in S^+$, we have $\sum_{i \in I_j^+} y_i^* = 0$, and this yields that $y_i^* = 0$, $\forall i \in I_j^+$. For $x_j^* = 0$, we have $\sum_{i \in I_j^+} y_i^* \le |I_j^+|$, and we consider two sub-cases.

Case 1. Suppose $\sum_{i \in I_j^+} y_i^* = |I_j^+|$. If there is a fractional $y_i^*$, $i \in I_j^+ \cap \Delta$, then there must exist $y_l^*$ for $l \in I_j^+$ such that $y_l^* > 1$. This is impossible, as it violates the upper bound restriction on $y_l^*$. Therefore, $y_i^* = 1$ for all $i \in I_j^+$.

Case 2. Suppose now that $\sum_{i \in I_j^+} y_i^* < |I_j^+|$ and that $I_j^+ \cap \Delta \neq \emptyset$. Then there exists at least one $y_l^*$, $l \in I_j^+ \cap \Delta$, and every such $y_l^*$ is associated only with $x_k$'s for which $\sum_{i \in I_k^+} y_i^* < |I_k^+|$. This, in turn, implies that $y_l^*$ can be increased by a small amount $\delta > 0$ without violating any constraint in $\Xi_s^\dagger$, and doing so will improve the objective function value of $(R_s^3)$ by the same amount $\delta$. This, however, contradicts that $(x^*, y^*)$ is an optimal solution to $(R_s^3)$.

The remainder of the proof, that $(x^*, y^*)$ corresponds to a strong pattern with coverage $\sum_{i \in S^+} y_i^*$, is immediate, in reference to the proof of Theorem 1. □
In [25], the authors used average values and first derivative information to discover that the best linear approximation of the pseudo-Boolean function on the left-hand side of (14) is
$$L(x) = \sum_{i \in M} c_i \left( -\frac{|N_i| - 1}{2^{|N_i|}} + \frac{1}{2^{|N_i|-1}} \sum_{j \in N_i} x_j \right). \qquad (16)$$
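Computing $L(x)$ from (16) is straightforward; a short sketch (ours), applied to the warm-up function of Section 4.2:

```python
def best_linear_approx(terms):
    """Return (constant, {j: coefficient}) of L(x) in (16), where terms is
    a list of (c_i, N_i) pairs."""
    const, coeff = 0.0, {}
    for c, N_i in terms:
        k = len(N_i)
        const -= c * (k - 1) / 2 ** k
        for j in N_i:
            coeff[j] = coeff.get(j, 0.0) + c / 2 ** (k - 1)
    return const, coeff

# f(x) = x1*x2*x3 + x2*x4 + x1*x2*x4 from the warm-up:
print(best_linear_approx([(1, {1, 2, 3}), (1, {2, 4}), (1, {1, 2, 4})]))
# (-0.75, {1: 0.5, 2: 1.0, 3: 0.25, 4: 0.75})
```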
In [10], the authors used this linear approximation function for $(M_\ell')$ and presented the following model for generating $A_\ell$ patterns:
$$(L_\ell'): \ \min \Big\{ \sum_{j \in N} \Big( \sum_{i \in I_j^\ell} \frac{1}{2^{|J_i^\ell|-1}} \Big) x_j \ : \ \sum_{j \in J_i^\ell} x_j \ge 1, \ i \in S^-, \ x_j \in \{0,1\}, \ j \in N \Big\},$$
where $I_j^\ell := \{ i \in S^+ \setminus \{\ell\} : j \in J_i^\ell \}$ for $j \in N$. For the objective function of $(M_s)$, note that the approximation function in (16) becomes
$$\sum_{i \in S^+} \left( -\frac{n-1}{2^n} + \frac{1}{2^{n-1}} \sum_{j \in J_i} (1 - x_j) \right).$$
Hence, a 0-1 solution that maximizes this linear approximation function subject to the constraint set of $(M_s)$ is also optimal to the program that maximizes
$$\sum_{i \in S^+} \sum_{j \in J_i} (1 - x_j)$$
over the same set. Now, from (15), we obtain
$$\sum_{i \in I_j^+} y_i^+ \le |I_j^+| (1 - x_j), \quad j \in \mathcal{N}$$
$$\Longrightarrow \ \sum_{j \in \mathcal{N}} \sum_{i \in I_j^+} y_i^+ \le \sum_{j \in \mathcal{N}} |I_j^+| (1 - x_j) \ \Longrightarrow \ n \sum_{i \in S^+} y_i^+ \le \sum_{j \in \mathcal{N}} |I_j^+| (1 - x_j) \ \Longrightarrow \ \sum_{i \in S^+} y_i^+ \le \frac{1}{n} \sum_{j \in \mathcal{N}} |I_j^+| (1 - x_j)$$
and use them to obtain a new strong pattern generation model from $(R_s^3)$ (but not vice versa) as
$$(L_s): \ \max_{x \in \Xi_s^\S} \ \sum_{j \in \mathcal{N}} |I_j^+| (1 - x_j) \quad \text{or} \quad \min_{x \in \Xi_s^\S} \ \sum_{j \in \mathcal{N}} |I_j^+|\, x_j,$$
where
$\Xi_s^\S := \{ x \in \{0,1\}^{2n} : (7), (12) \}$. When restricted to cover a reference observation $A_\ell$ for $\ell \in S^+$, $(L_s)$, furthermore, yields a model $(L_\ell)$ that is equivalent to $(L_\ell')$ from [10].

Before closing this subsection, we recall that $(R_s^1)$ and $(R_s^3)$ are both obtained from $(M_s)$. This knowledge helps us more easily see that $(R_s^3)$ can be obtained from the other via a simple aggregation of the constraints in (8) with respect to $i \in I_j^+$ for each $j \in \mathcal{N}$; this makes sense, as $(R_s^1)$ was derived from $(M_s)$ via a term-wise linearization method while $(R_s^3)$ was obtained from its MP root via a multi-term relaxation approach. As a unifying theory for pattern generation, 0-1 MP reveals how $(L_\ell)$ and $(L_\ell')$ above are related, and also the relations among other models that are based on the same LAD pattern generation preference criterion. This is left as an exercise for interested readers.

4.3. New linearization techniques for (3)

As a general theory, the 0-1 MP models of Section 3 can generate additional 0-1 linear pattern generation models when different 0-1 linearization techniques are applied to the 0-1 multilinear functions in (2) and (3). This section demonstrates this utility of the 0-1 MP models with newly discovered valid inequalities for (3) from [43]. Some definitions are in order.

Definition 5 (Hypercube [41]). The $\kappa$-dimensional (hyper-)cube $C_\kappa$ is the simple graph whose vertices are the $\kappa$-tuples with entries in $\{0,1\}$ and whose edges are the pairs of $\kappa$-tuples that differ in exactly one position. A $j$-dimensional subcube of $C_\kappa$ is a subgraph of $C_\kappa$ isomorphic to $C_j$. We call a cube maximal if it is not a face of a different cube.

Definition 6 (Face & Facet of Hypercube [41]). A face of $C_\kappa$ is a subcube of $C_\kappa$. A facet of $C_\kappa$ is a $(\kappa-1)$-dimensional subcube of $C_\kappa$.

Definition 7 (Dominance [42]). If $\pi x \ge \pi_0$ and $\mu x \ge \mu_0$ are two valid inequalities for a polytope, $\pi x \ge \pi_0$ dominates $\mu x \ge \mu_0$ if there exists $u > 0$ such that $\mu \ge u\pi$ and $\mu_0 \le u\pi_0$, and $(\mu, \mu_0) \neq (u\pi, u\pi_0)$.

Definition 8 (c-Dominance [7]). For two 0-1 inequalities A and B, A is said to c-dominate B if every 0-1 point $x$ satisfying A also satisfies B. Furthermore, A is said to strictly c-dominate B if A c-dominates B and there exists some 0-1 point satisfying B but not A.

In order to make this paper self-contained, we summarize two main results of [43], which are best explained when the data under analysis are examined in a graph for certain neighborhood properties. Thus, given a set of data for analysis, we first represent each 0-1 binary datum $A_i$, $i \in S^-$ (obtained via a standard binarization process [11], if needed), as a node in a graph with a unique 0-1 label of length $n$. Next, we compare the 0-1 labels of the nodes and introduce an arc/edge between each pair whose 0-1 labels differ in one position; that is, if they have Hamming distance 1. The first main result of [43] states that each hypercube in the graph obtained as above yields a single valid inequality (henceforth referred to as a hypercube inequality) for $F$ that is stronger than any individual or combination/collection of the valid inequalities that can be generated from the vertex nodes of the hypercube via the conventional method [19]. This result is summarized below without proof for reasons of space.

Theorem 6 (Theorem 1 in [43]). Consider $2^\kappa$ observations that form a $C_\kappa$ ($1 \le \kappa < n$). Denote by $V_\kappa$ the index set of the $2^\kappa$ observations of $C_\kappa$ and let $J_\kappa^\cap = \bigcap_{i \in V_\kappa} J_i$. Then,
$$\sum_{j \in J_\kappa^\cap} x_j \ge 1 \qquad (17)$$
is valid for $F$. Furthermore, (17) dominates the $2^\kappa$ minimal cover inequalities
$$\sum_{j \in J_i} x_j \ge 1, \quad i \in V_\kappa,$$
and all valid inequalities that can be obtained from any subgraph of $C_\kappa$.

Furthermore, [43] discovered that a single inequality can be generated from a set of neighboring hypercubes that is stronger than any individual or combination/collection of the hypercube inequalities that can be generated from the hypercubes of the set. The material from [43] for generating this so-called extended hypercube inequality is provided below, without proof for reasons of space.

Definition 9 (Neighboring Hypercubes, Definition 8 in [43]). We call two $\kappa_1$- and $\kappa_2$-cubes ($\kappa_1 \le \kappa_2$, without loss of generality) 'neighbors' if a facet of $C_{\kappa_1}$ is a face of $C_{\kappa_2}$.

Theorem 7 (Simplified Version of Theorem 4 in [43]). Consider a set of $q$ ($\ge 2$) hypercubes $C_{\kappa_1}, \ldots, C_{\kappa_q}$ with dimensions $\kappa_1 \le \cdots \le \kappa_q$, where each pair of hypercubes are neighbors. Consider the hypercube inequalities
$$\sum_{j \in J_{\kappa_i}^\cap} x_j \ge 1, \quad i \in Q := \{1, \ldots, q\},$$
and let $J_\cap = \bigcap_{i \in Q} J_{\kappa_i}^\cap$ and $J_{\kappa_1 - 1}^\cap$ denote the variable index set of $C_{\kappa_1 - 1}$, which is a facet of $C_{\kappa_1}$ and a face of $C_{\kappa_i}$, $i = 2, \ldots, q$. Then, for $\eta_k > \sum_{i=1}^{k-1} (\kappa_k - \kappa_i) \eta_i$, $k \in Q$, the 0-1 inequality
$$\Big( \sum_{i \in Q} \eta_i \Big) \sum_{j \in J_\cap} x_j + \sum_{j \in J_{\kappa_1 - 1}^\cap \setminus J_\cap} \Big( \sum_{i \in Q : j \in J_{\kappa_i}^\cap} \eta_i \Big) x_j \ \ge \ \sum_{i \in Q} \eta_i \qquad (18)$$
is valid for $F$ and strictly c-dominates each of the $q$ hypercube inequalities above and also any valid inequality from a strict subset of $Q$.

As for the coefficients of the inequality in (18), we can set $\eta_1 = 1$ and determine the remaining values recursively by
$$\eta_k = \sum_{i=1}^{k-1} (\kappa_k - \kappa_i) \eta_i + 1 \quad \text{for } k = 2, \ldots, q.$$
(Refer to Remark 3 in [43].)

Summarizing, after representing the data under analysis in a graph, one can identify maximal hypercubes in the graph (e.g., [39]), reduce each hypercube into a single hyper-node, and introduce an edge between each pair of neighboring hypercube nodes. In the resulting graph $\tilde{G}$, one can use or implement a heuristic algorithm, including those for approximate clique covering (e.g., [8,18,28,30]), to successively find a maximal $Q$ for generating an extended hypercube inequality via (18). As an acute reader will note, each 1-clique in $\tilde{G}$ corresponds to a single hypercube, for which an inequality is generated via (17).

For a new pattern generation model, we first let $\tilde{S}^-$ denote the index set of a set of maximal cliques that cover all nodes of $\tilde{G}$ and let
$$\Xi_s^\star := \big\{ x \in \{0,1\}^{2n}, \ y \in [0,1]^{m^+} : (12) \text{ (optional)}; \ (15); \ (17) \text{ or } (18) \text{ for } i \in \tilde{S}^- \big\}.$$
As $n \ll m^+$ in LAD applications, with $n$ equal to the number of 'support features' found via set covering (e.g., [11,35]) to describe the data under analysis without contradiction with a minimal subset of all features, we see that (15) may be best suited for the purpose of linearizing the objective function of the 0-1 MP models. Put together, the material above yields a new 0-1 linear model for strong patterns:
$$(R_s^4): \ \max_{(x,y) \in \Xi_s^\star} \ \sum_{i \in S^+} y_i^+.$$
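The graph pre-processing above is easy to prototype. The sketch below (ours; `J(.)` is the helper from Section 2's snippet, and the search for maximal hypercubes/cliques is left to the heuristics cited above) builds the Hamming-distance-1 graph on the negative observations and emits the hypercube inequality (17) for a given vertex set $V_\kappa$:

```python
def hamming_edges(A_neg):
    """Edges between negative observations whose 0-1 labels differ in one bit."""
    m = len(A_neg)
    return [(u, v) for u in range(m) for v in range(u + 1, m)
            if sum(a != b for a, b in zip(A_neg[u], A_neg[v])) == 1]

def hypercube_inequality(A_neg, V_kappa):
    """(17): sum of x_j over J_kappa^cap, the intersection of J_i, i in V_kappa."""
    J_cap = set.intersection(*(J(A_neg[i]) for i in V_kappa))
    return sorted(J_cap)  # index set of the left-hand side of (17)
```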
Like the other 0-1 linear models presented in the two previous subsections, this model can be modified to yield new models for $C$-maximum and $A_\ell$-maximum patterns and other Pareto-optimal patterns, such as strong prime patterns. For reasons of space and triviality, we leave this as an exercise for interested readers.

5. Comparison of MILP/IP pattern generation models

As a bonus treatment for interested readers, this section provides an overview of the differences in construction and performance of the pattern generation models of the two previous sections. More specifically, we first use a small dataset to generate instances of the MP model as well as the four 0-1 linear programming models for strong patterns in Section 5.1. Next, we use the same five pattern generation models to analyze six well-studied machine learning datasets from [32] in Section 5.2, as a way to provide a glimpse of how they compare in efficiency.
Table 1
Small dataset for illustration.

Data type   Observation   a1   a2   a3   a4
+           A1             1    0    0    1
+           A2             1    1    1    1
+           A3             0    1    1    1
+           A4             0    1    1    0
−           A5             0    1    0    0
−           A6             1    1    0    0
−           A7             0    0    0    0
−           A8             1    0    0    0
−           A9             0    0    1    0
−           A10            0    0    0    1
−           A11            1    0    1    1
5.1. Construction of pattern generation models

Consider the dataset provided in Table 1, with $m^+ = 4$ and $m^- = 7$. Referring the reader back to Sections 3 and 4.1–4.3, we construct the instances of the MP model and the four 0-1 MILP/IP models for a strong pattern, for better understanding and comparison. For this purpose, we first let
$$f(y) = y_1^+ + y_2^+ + y_3^+ + y_4^+,$$
where
$$y_1^+ = (1 - x_2)(1 - x_3)(1 - x_5)(1 - x_8), \quad y_2^+ = (1 - x_5)(1 - x_6)(1 - x_7)(1 - x_8),$$
$$y_3^+ = (1 - x_1)(1 - x_6)(1 - x_7)(1 - x_8), \quad y_4^+ = (1 - x_1)(1 - x_4)(1 - x_6)(1 - x_7),$$
and the instance of $(M_s)$ (refer to Section 3) is straightforwardly obtained as follows:
$$\max \ f(y)$$
$$\text{s.t.} \ (1 - x_1)(1 - x_3)(1 - x_4)(1 - x_6) + (1 - x_3)(1 - x_4)(1 - x_5)(1 - x_6) + (1 - x_1)(1 - x_2)(1 - x_3)(1 - x_4)$$
$$\quad + (1 - x_2)(1 - x_3)(1 - x_4)(1 - x_5) + (1 - x_1)(1 - x_2)(1 - x_4)(1 - x_7) + (1 - x_1)(1 - x_2)(1 - x_3)(1 - x_8)$$
$$\quad + (1 - x_2)(1 - x_5)(1 - x_7)(1 - x_8) = 0$$
$$x \in \{0,1\}^8.$$
As based on McCormick envelopes, the instance of $(R_s^1)$ from [20] (refer to Section 4.1) takes the form below, with $m^+ \times n$ constraints plus $m^-$ minimal cover inequalities:
$$\max \ f(y)$$
$$\text{s.t.} \ y_1^+ \le 1 - x_2, \quad y_1^+ \le 1 - x_3, \quad y_1^+ \le 1 - x_5, \quad y_1^+ \le 1 - x_8,$$
$$y_2^+ \le 1 - x_5, \quad y_2^+ \le 1 - x_6, \quad y_2^+ \le 1 - x_7, \quad y_2^+ \le 1 - x_8,$$
$$y_3^+ \le 1 - x_1, \quad y_3^+ \le 1 - x_6, \quad y_3^+ \le 1 - x_7, \quad y_3^+ \le 1 - x_8,$$
$$y_4^+ \le 1 - x_1, \quad y_4^+ \le 1 - x_4, \quad y_4^+ \le 1 - x_6, \quad y_4^+ \le 1 - x_7,$$
$$x_1 + x_3 + x_4 + x_6 \ge 1, \quad x_3 + x_4 + x_5 + x_6 \ge 1,$$
$$x_1 + x_2 + x_3 + x_4 \ge 1, \quad x_2 + x_3 + x_4 + x_5 \ge 1,$$
$$x_1 + x_2 + x_4 + x_7 \ge 1, \quad x_1 + x_2 + x_3 + x_8 \ge 1,$$
$$x_2 + x_5 + x_7 + x_8 \ge 1,$$
$$x \in \{0,1\}^8, \quad y \in [0,1]^4.$$
$(R_s^2)$ simply aggregates the concave envelopes for the objective monomial terms (refer to Section 4.1 and the MILP model from [35]), hence features a smaller number, $m^+ + m^-$, of constraints. It requires all variables to be pure integer, however. The instance of $(R_s^2)$ for the illustrative dataset is:
$$\max \ f(y)$$
$$\text{s.t.} \ 4 y_1^+ + x_2 + x_3 + x_5 + x_8 \le 4, \quad 4 y_2^+ + x_5 + x_6 + x_7 + x_8 \le 4,$$
$$4 y_3^+ + x_1 + x_6 + x_7 + x_8 \le 4, \quad 4 y_4^+ + x_1 + x_4 + x_6 + x_7 \le 4,$$
$$x_1 + x_3 + x_4 + x_6 \ge 1, \quad x_3 + x_4 + x_5 + x_6 \ge 1,$$
$$x_1 + x_2 + x_3 + x_4 \ge 1, \quad x_2 + x_3 + x_4 + x_5 \ge 1,$$
$$x_1 + x_2 + x_4 + x_7 \ge 1, \quad x_1 + x_2 + x_3 + x_8 \ge 1,$$
$$x_2 + x_5 + x_7 + x_8 \ge 1,$$
$$x \in \{0,1\}^8, \quad y \in \{0,1\}^4.$$
$(R_s^3)$ (refer to Section 4.2) multi-term relaxes the objective multilinear function of $(M_s)$. In general, this model involves a smaller number of constraints for $y$ than $(R_s^2)$, although this is not well demonstrated by this dataset. For the dataset in Table 1, we have $I_1^+ = \{3,4\}$, $I_2^+ = \{1\}$, $I_3^+ = \{1\}$, $I_4^+ = \{4\}$, $I_5^+ = \{1,2\}$, $I_6^+ = \{2,3,4\}$, $I_7^+ = \{2,3,4\}$ and $I_8^+ = \{1,2,3\}$. Hence, letting
$$F^+ := \big\{ x \in \{0,1\}^8, \ y \in [0,1]^4 : \ y_3^+ + y_4^+ \le 2(1 - x_1), \ y_1^+ \le 1 - x_2, \ y_1^+ \le 1 - x_3, \ y_4^+ \le 1 - x_4,$$
$$\qquad y_1^+ + y_2^+ \le 2(1 - x_5), \ y_2^+ + y_3^+ + y_4^+ \le 3(1 - x_6), \ y_2^+ + y_3^+ + y_4^+ \le 3(1 - x_7), \ y_1^+ + y_2^+ + y_3^+ \le 3(1 - x_8) \big\},$$
we obtain the following instance of $(R_s^3)$:
$$\max \ f(y)$$
$$\text{s.t.} \ x_1 + x_3 + x_4 + x_6 \ge 1, \quad x_3 + x_4 + x_5 + x_6 \ge 1,$$
$$x_1 + x_2 + x_3 + x_4 \ge 1, \quad x_2 + x_3 + x_4 + x_5 \ge 1,$$
$$x_1 + x_2 + x_4 + x_7 \ge 1, \quad x_1 + x_2 + x_3 + x_8 \ge 1,$$
$$x_2 + x_5 + x_7 + x_8 \ge 1,$$
$$(x, y) \in F^+.$$

Fig. 2. Neighborhood analysis of − data in Table 1. (a) Hypercubes identified; (b) Maximal cliques.

Last, to obtain the instance of $(R_s^4)$ (refer to Section 4.3), we first represent the 7 − observations in a graph and introduce an arc between each pair of nodes whose fingerprints differ in only one bit, to obtain the graph shown in Fig. 2(a) with 4 maximal hypercubes; namely, two $C_1$'s ($C_1^1 = [A_7 A_9]$, $C_1^2 = [A_7 A_{10}]$), one $C_2$ ($C_2^1 = [A_5 A_6 A_7 A_8]$), and one $C_0$ ($A_{11}$). Based upon the analysis of neighboring maximal hypercubes in Fig. 2(b), we next find a clique cover of the graph with 2 maximal cliques (namely, a 3-clique formed by $Q_1 = \{C_2^1, C_1^1, C_1^2\}$ and an isolated 1-clique of $Q_2 = \{A_{11}\}$). Hence, by letting $\tilde{S}^- = \{Q_1, Q_2\}$, we generate two inequalities via (18) and (17), respectively, and obtain the instance of $(R_s^4)$ as follows:
$$\max \ f(y)$$
$$\text{s.t.} \ 4(x_3 + x_4) + 2(x_1 + x_2) \ge 5,$$
$$x_2 + x_5 + x_7 + x_8 \ge 1,$$
$$(x, y) \in F^+.$$
In comparison to $(R_s^3)$, we note that $(R_s^4)$ involves only 2 cover inequalities which are, as stated in Theorems 6 and 7, stronger than the 7 minimal cover inequalities of the former in terms of the 0-1 solutions that they admit as feasible.

5.2. Performance on benchmark datasets

As a bonus treatment for those interested, we demonstrate the relative efficiency of the different strong pattern generation models on six benchmark datasets from [32]. First, we summarize information on the datasets used in Table 2.
Table 2
Six datasets for pattern generation experiments.

Dataset (abbreviation)             No. of obs. (+)   No. of obs. (−)   0-1 Features^a
BUPA liver disorder (bupa)         145               200               20
Cleveland heart disease (clev)     160               137               12
Credit card scoring (cred)         296               357               17
Pima Indian diabetes (diab)        268               500               22
Boston housing (hous)              257               249               17
Wisconsin breast cancer (wisc)     444               239               12

^a Obtained from the original attributes via a standard binarization and feature selection process [11].
For each dataset in Table 2, we generated a complete set of + and − strong patterns with each of the five pattern generation models tested (namely, $(M_s)$ and $(R_s^i)$ for $i \in \{1,\ldots,4\}$) through a standard successive procedure like those from [35], condensed in procedure PattGen below for reference. We recorded the CPU time required for this task by each model, including the time for finding maximal hypercubes and neighboring hypercubes for $(R_s^4)$. All MILP/IP instances generated were solved by Gurobi 6.0.4 [21], while the nonlinear and nonconvex instances of $(M_s)$ were solved by BARON 15.2.0 [36,38], obtained from http://www.minlp.com.

procedure PattGen
1:  input: dataset, model ∈ {(Ms), (Rsi) for i = 1, . . . , 4}
2:  output: a set of + and − patterns
3:  for • ∈ {+, −} do
4:      let •̄ ∈ {+, −} \ {•}.
5:      set S• to be the index set of all • data.
6:      while S• ≠ ∅ do
7:          generate and solve an instance of model to obtain x∗.
8:          generate a pattern on x∗ via (1).
9:          S• ←− S• \ {i ∈ S• | ∏_{j∈Ji} (1 − x∗j) = 1}.
10:     end while
11: end for
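A runnable rendering (ours) of procedure PattGen for the + side, with the brute-force `solve(.)` from Section 3's sketch standing in for a MILP/IP solver; `J(.)` and `covers(.)` are the helpers from Section 2's snippet, and swapping the two classes gives the − side:

```python
def patt_gen(A_pos, A_neg, n):
    """Successively generate strong patterns until all + data are covered."""
    patterns, remaining = [], list(range(len(A_pos)))
    J_neg = [J(a) for a in A_neg]
    while remaining:
        J_pos = [J(A_pos[i]) for i in remaining]
        x = solve(J_pos, J_neg, n)       # one strong pattern on the rest
        if x is None:
            break                        # no pattern covers what remains
        N_T = {j for j, v in enumerate(x) if v == 1}
        patterns.append(N_T)
        remaining = [i for i in remaining if not covers(N_T, A_pos[i])]
    return patterns
```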
Experimental results are summarized in Table 3. Since all models tested generated the same set of strong patterns, Table 3 compares only the CPU time each model required for generating a complete set of patterns. When patterns are successively generated as in procedure PattGen, the first instance of a pattern generation model is always the largest of all instances, hence the most demanding to solve. In view of this, we also collected the CPU times required for generating the first strong pattern by each of the five models compared and present them in Table 4. In brief, the experimental results in these two tables indicate that the MILP/IP models are superior to their MP root in efficiency; the difference is about two orders of magnitude. The numerical results also seem to indicate that the more advanced the linearization techniques used, the more efficient the resulting MILP/IP model is in generating patterns. For evidence, we first recall that $(R_s^3)$ and $(R_s^4)$ are multi-term relaxation-based models newly developed in this paper, while the other two, from [20,35], are based on term-wise relaxation techniques. Now, note that (most of) the best results are listed in the last two columns of Tables 3 and 4.
Table 3
CPU seconds for generating a complete set of strong patterns; the (Rs·) columns are the 0-1 MILP/IP models.

Dataset   Pattern type   (Ms)        (Rs1)     (Rs2)^a    (Rs3)     (Rs4)^b
bupa      +                 426.02     12.73      4.40       4.42      4.85
          −                 574.24     23.68     10.55       5.47      8.14
clev      +                  46.99      2.35      0.77       0.77      0.68
          −                  64.83      2.76      1.12       0.89      0.80
cred      +                 922.39     62.95     53.52       8.66      8.59
          −                1108.69     57.14     39.73      11.07      9.36
diab      +               10068.80    451.48    109.12     112.10    123.95
          −                7850.31    818.23    245.61     102.94     95.02
hous      +                 134.13      3.32      1.30       1.15      1.07
          −                 132.63      5.15      3.44       1.32      1.17
wisc      +                  13.59      0.59      0.26       0.23      0.20
          −                  14.86      0.40      0.09       0.22      0.17

^a Pure 0-1 model.
^b Includes CPU seconds for finding hypercubes and cliques.
Table 4
CPU seconds for generating the first strong pattern; the (Rs·) columns are the 0-1 MILP/IP models.

Dataset   Pattern type   (Ms)       (Rs1)     (Rs2)^a    (Rs3)     (Rs4)^b
bupa      +                26.850     0.823     0.506      0.087     0.103
          −                44.509     2.037     2.066      0.189     0.153
clev      +                 4.759     0.277     0.078      0.043     0.034
          −                 7.075     0.278     0.200      0.036     0.032
cred      +                83.280     7.321    24.656      0.208     0.170
          −                66.480     3.513    13.070      0.202     0.175
diab      +               376.593    22.538    10.511      0.581     0.584
          −               149.872    23.923    41.720      0.390     0.304
hous      +                15.303     0.564     0.135      0.074     0.053
          −                17.929     0.713     1.091      0.052     0.066
wisc      +                 4.494     0.119     0.054      0.0184    0.0183
          −                 2.184     0.043     0.019      0.019     0.017

^a Pure 0-1 model.
^b Includes CPU seconds for finding hypercubes and cliques.
6. Conclusions

Motivated by recent interest in optimization-based LAD pattern generation, this paper revisited the Boolean logical definition of a LAD pattern to develop 0-1 MP models for well-studied (Pareto-)optimal LAD patterns and presented a unifying theory of LAD pattern generation in 0-1 MP. For this purpose, we demonstrated that the 0-1 MP models naturally yield all existing pattern generation models from the literature, and also new ones, by means of standard and new 0-1 linearization techniques for 0-1 multilinear functions, and that 0-1 MP provides knowledge about how seemingly different, independently developed pattern generation models for a particular type of LAD pattern are inter-related. Altogether, these results show that 0-1 MP holds a unifying theory for pattern generation in LAD.

As a bonus treatment, we compared the relative efficiency of the different pattern generation models on six machine learning benchmark datasets from [32]. Results from these experiments seem to indicate that the efficiency of a MILP/IP pattern generation model may be proportionally related to the advancedness of the linearization techniques used. This invites more attention and effort to be directed toward the development of more advanced 0-1 linearization techniques and stronger valid inequalities for (2) (and its variants), for (3), and for 0-1 polynomial functions at large. On a related note, with the advent of more efficient mathematical data analytics models, it becomes an empirically interesting study to investigate how these sound optimization methods scale with problem size and can analyze large-scale datasets, particularly in comparison with the popular heuristic methods available nowadays (e.g., [40]). We plan to delve further into these research subjects in the future.

References

[1] G. Alexe, S. Alexe, D.E. Axelrod, T. Bonates, I.I. Lozina, M. Reiss, P.L. Hammer, Breast cancer prognosis by combinatorial analysis of gene expression data, Breast Cancer Research 8 (2006) R41.
[2] G. Alexe, S. Alexe, D. Axelrod, P. Hammer, D. Weissmann, Logical analysis of diffuse large B-cell lymphomas, Artif. Intell. Med. 34 (2005) 235–267.
[3] G. Alexe, S. Alexe, P. Hammer, A. Kogan, Comprehensive vs. comprehensible classifiers in logical analysis of data, Discrete Appl. Math. 156 (6) (2008) 870–882.
[4] G. Alexe, S. Alexe, P. Hammer, B. Vizvari, Pattern-based feature selections in genomics and proteomics, Ann. Oper. Res. 148 (1) (2006) 189–201.
[5] S. Alexe, E. Blackstone, P.L. Hammer, H. Ishwaran, M.S. Lauer, C.E.P. Snader, Coronary risk prediction by logical analysis of data, Ann. Oper. Res. 119 (2003) 15–42.
[6] G. Alexe, P. Hammer, Spanned patterns for the logical analysis of data, Discrete Appl. Math. 154 (7) (2006) 1039–1049.
[7] E. Balas, J.B. Mazzola, Nonlinear 0-1 programming: I. Linearization techniques, Math. Program. 30 (1984) 1–21.
[8] M. Behrisch, A. Taraz, Efficiently covering complex networks with cliques of similar vertices, Theoret. Comput. Sci. 355 (2006) 37–47.
[9] T.O. Bonates, Large Margin Rule-Based Classifiers, John Wiley & Sons, Inc., 2010, http://dx.doi.org/10.1002/9780470400531.eorms0450.
[10] T. Bonates, P. Hammer, A. Kogan, Maximum patterns in datasets, Discrete Appl. Math. 156 (6) (2008) 846–861.
[11] E. Boros, P. Hammer, T. Ibaraki, A. Kogan, Logical analysis of numerical data, Math. Program. 79 (1997) 163–190.
[12] E. Boros, P. Hammer, T. Ibaraki, A. Kogan, E. Mayoraz, I. Muchnik, An implementation of logical analysis of data, IEEE Trans. Knowl. Data Eng. 12 (2000) 292–306.
[13] Y. Crama, Concave extensions for nonlinear 0-1 maximization problems, Math. Program. 61 (1993) 53–60.
[14] Y. Crama, P. Hammer, T. Ibaraki, Cause-effect relationships and partially defined Boolean functions, Ann. Oper. Res. 16 (1988) 299–326.
[15] R. Fortet, L'algèbre de Boole et ses applications en recherche opérationnelle, Cah. Cent. Étud. Rech. Opér. 1 (4) (1959) 5–36.
[16] R. Fortet, Applications de l'algèbre de Boole en recherche opérationnelle, Rev. Fr. Inform. Rech. Opér. 4 (14) (1960) 17–25.
[17] F. Glover, E. Woolsey, Converting the 0-1 polynomial programming problem to a 0-1 linear program, Oper. Res. 22 (1) (1974) 180–182.
[18] J. Gramm, J. Guo, F. Hüffner, R. Niedermeier, Data reduction and exact algorithms for clique cover, ACM J. Exp. Algorithmics 13 (2008) 2:2.2–2:2.15.
[19] F. Granot, P. Hammer, On the use of Boolean functions in 0-1 programming, Methods Oper. Res. 12 (1971) 154–184.
[20] C. Guo, H. Ryoo, Compact MILP models for optimal & Pareto-optimal LAD patterns, Discrete Appl. Math. 160 (2012) 2339–2348.
[21] Gurobi Optimization, Inc., Gurobi optimizer reference manual, 2015. URL http://www.gurobi.com.
[22] P. Hammer, Partially defined Boolean functions and cause-effect relationships, in: Proceedings of the International Conference on Multi-Attribute Decision Making Via OR-Based Expert Systems, University of Passau, Germany, 1986.
[23] P. Hammer, T. Bonates, Logical analysis of data - an overview: From combinatorial optimization to medical applications, Ann. Oper. Res. 148 (2006) 203–225.
[24] A. Hammer, P. Hammer, I. Muchnik, Logical analysis of Chinese labor productivity patterns, Ann. Oper. Res. 87 (1999) 165–176.
[25] P. Hammer, R. Holzman, Approximation of pseudo-Boolean functions; applications to game theory, Methods Models Oper. Res. 36 (1992) 3–21.
[26] P. Hammer, A. Kogan, B. Simeone, S. Szedmak, Pareto-optimal patterns in logical analysis of data, Discrete Appl. Math. 144 (2004) 79–102.
[27] P. Hansen, C. Meyer, A new column generation algorithm for logical analysis of data, Ann. Oper. Res. 188 (2011) 215–249.
[28] E. Kellerman, Determination of keyword conflict, IBM Tech. Discl. Bull. 16 (1973) 544–546.
[29] K. Kim, H. Ryoo, A LAD-based method for selecting short oligo probes for genotyping applications, OR Spectrum 30 (2) (2008) 249–268.
[30] L.T. Kou, L.J. Stockmeyer, C.K. Wong, Covering edges by cliques with regard to keyword conflicts and intersection graphs, Commun. ACM 21 (1978) 135–139.
[31] L.-P. Kronek, A. Reddy, Logical analysis of survival data: Prognostic survival models by detecting high-degree interactions in right-censored data, Bioinformatics 24 (2008) i248–i253.
[32] M. Lichman, UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[33] G. McCormick, Computability of global solutions to factorable nonconvex programs: Part I - Convex underestimating problems, Math. Program. 10 (1976) 147–175.
[34] A. Rikun, A convex envelope formula for multilinear functions, J. Global Optim. 10 (1997) 425–437.
[35] H. Ryoo, I.-Y. Jang, MILP approach to pattern generation in logical analysis of data, Discrete Appl. Math. 157 (4) (2009) 749–761.
[36] H. Ryoo, N. Sahinidis, A branch and reduce approach to global optimization, J. Global Optim. 8 (2) (1996) 107–138.
[37] H. Ryoo, N. Sahinidis, Analysis of bounds for multilinear functions, J. Global Optim. 19 (4) (2001) 403–424.
[38] N.V. Sahinidis, BARON 15.2.0: Global Optimization of Mixed-Integer Nonlinear Programs, User's Manual, 2015. URL http://www.minlp.com.
[39] M.A. Sridhar, C.S. Raghavendra, Computing large subcubes in residual hypercubes, J. Parallel Distrib. Comput. 24 (1995) 213–217.
[40] WEKA: The University of Waikato, Weka 3: Data mining software in Java. URL http://www.cs.waikato.ac.nz/ml/weka.
[41] D.B. West, Introduction to Graph Theory, Pearson Education, Inc., New Jersey, 2001.
[42] L.A. Wolsey, Faces for a linear inequality in 0-1 variables, Math. Program. 8 (1975) 165–178.
[43] K. Yan, H.S. Ryoo, Strong valid inequalities for Boolean logical pattern generation (submitted for publication).