A combinatorial optimization problem arising from text classification


Sandro Bosio ∗, Giovanni Righini
Dipartimento di Tecnologie dell’Informazione, Università degli Studi di Milano

Abstract

We study a combinatorial optimization problem related to the automatic classification of texts. The problem consists of covering a given text using strings from a given set, where a cost is incurred for each type of string used. We give a 0-1 linear programming formulation and report on computational experience with very large instances, using two different Lagrangean relaxations and heuristic algorithms based on simulated annealing and threshold accepting.

Key words: Text classification, simulated annealing, threshold accepting, Lagrangean relaxation

1 Introduction

The large number of unclassified digital documents calls for automatic tools for text classification [1]. Text classification usually requires a preliminary step, called segmentation, which divides a text into strings. A different approach, recently investigated in computational linguistics, represents the text by a reduced suffix tree containing only a small subset of the strings, selected so as to describe the text itself “in the best way” [2]. The following combinatorial optimization problem can be derived from this approach. Given an alphabet Σ, a text W that is a sequence of characters of Σ, a subset S of the strings of W, and the set O of the occurrences of the elements of S in W, find a subset Y ⊆ S of strings and a subset X ⊆ O of their occurrences such that the occurrences in X do not overlap, a coverage function f(X) is maximized and a cost function g(Y) is minimized, where both f() and g() are linear. The coverage function f(X) is the number of covered characters, that is, the sum of the lengths of the selected occurrences; the cost function g(Y) is the sum of the costs of the selected strings. Since long strings are preferable to short ones for classification purposes, the cost of each string is set equal to the inverse of its length.

∗ Ph.D. student at Dipartimento di Matematica, Politecnico di Milano.
Email address: [email protected] (Giovanni Righini).

Preprint submitted to Elsevier Science, 3 March 2003

We considered the single-objective problem obtained by combining f(X) and g(Y) into a single objective function through a parametric weight:

max z = αf(X) − (1 − α)g(Y)

where α ∈ [0, 1] must be tuned according to classification results (in our tests we set α = 0.1). The problem is then formulated as the following 0-1 linear programming problem (TCSS).

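\[
\begin{aligned}
\max\ z = {} & \alpha \sum_{j=1}^{|O|} l_{u(j)}\, x_j \;-\; (1-\alpha) \sum_{i=1}^{|S|} \frac{1}{l_i}\, y_i \\
\text{s.t.}\quad & \sum_{j=1}^{|O|} a_{tj}\, x_j \le 1 && t = 1, \dots, |W| && (1) \\
& x_j - y_{u(j)} \le 0 && j = 1, \dots, |O| && (2) \\
& x_j \in \{0, 1\} && j = 1, \dots, |O| && (3) \\
& y_i \in \{0, 1\} && i = 1, \dots, |S| && (4)
\end{aligned}
\]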

where:
• x_j is a binary variable corresponding to occurrence j = 1, ..., |O|,
• y_i is a binary variable corresponding to string i = 1, ..., |S|,
• l_i is the length of string i = 1, ..., |S|,
• u(j) is the string corresponding to occurrence j = 1, ..., |O|,
• a_tj is an element of a binary matrix that indicates whether character t = 1, ..., |W| can be covered by occurrence j = 1, ..., |O|.
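To make the notation concrete, the following minimal Python sketch builds the occurrence set O for a small text and evaluates z for a candidate pair (x, y), checking constraints (1) and (2) along the way. The function names `find_occurrences` and `objective`, the representation of an occurrence as a (start position, string index) pair, and the assumption that all strings are non-empty are illustrative choices, not part of the original formulation.

```python
from typing import List, Optional, Tuple


def find_occurrences(text: str, strings: List[str]) -> List[Tuple[int, int]]:
    """Return O as (start, string_index) pairs, overlapping matches included.

    Assumes every string in `strings` is non-empty.
    """
    occ = []
    for i, s in enumerate(strings):
        start = text.find(s)
        while start != -1:
            occ.append((start, i))
            # Restart one character later so overlapping matches are found.
            start = text.find(s, start + 1)
    return occ


def objective(text: str, strings: List[str], occ: List[Tuple[int, int]],
              x: List[int], y: List[int], alpha: float = 0.1) -> Optional[float]:
    """Return z = alpha*f(X) - (1-alpha)*g(Y), or None if (x, y) is infeasible."""
    covered = [0] * len(text)
    for (start, i), xj in zip(occ, x):
        if xj:
            if not y[i]:
                return None  # violates the variable upper bound constraint (2)
            for t in range(start, start + len(strings[i])):
                covered[t] += 1
    if any(c > 1 for c in covered):
        return None  # violates a packing constraint (1)
    f = sum(covered)  # number of covered characters
    g = sum(1.0 / len(s) for s, yi in zip(strings, y) if yi)  # cost 1/l_i each
    return alpha * f - (1 - alpha) * g


if __name__ == "__main__":
    text, strings = "abcabc", ["abc", "bc"]
    occ = find_occurrences(text, strings)
    print(occ, objective(text, strings, occ, [1, 1, 0, 0], [1, 0]))
```

With α = 0.1, selecting only the string "abc" and both of its occurrences covers all six characters at a cost of 1/3.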

Packing constraints (1) state that the selected occurrences cannot overlap. Variable upper bound constraints (2) state that an occurrence can be selected only if the corresponding string is selected. The formulations obtained by relaxing the integrality conditions (3) or (4) are equivalent to TCSS, as their optimal solution is binary, and have been solved with CPLEX 6.5.

For every choice of the y variables, the resulting problem is to obtain the best coverage using only occurrences of the selected strings. The text W can be seen as a graph in which every character corresponds to a node and every occurrence corresponds to an arc from the first covered character to the character following the last covered one. By adding a fictitious node after the last character, and an arc from every character i to the following one (which represents the choice of leaving i uncovered), we obtain a directed acyclic graph on which the problem can be solved in polynomial time as a maximum path problem (MPP) from the first node to the fictitious destination node.
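The graph construction above can be sketched as a dynamic program over the nodes 0, ..., |W| taken in their natural (topological) order. This is a minimal illustration under stated assumptions: occurrence arcs are weighted by the number of characters they cover, "leave uncovered" arcs have weight 0, and the name `best_cover` and the (start, string index) encoding of occurrences are hypothetical. The source does not specify the arc weights.

```python
from typing import List, Tuple


def best_cover(text_len: int, occ: List[Tuple[int, int]], lengths: List[int],
               y: List[int]) -> Tuple[int, List[int]]:
    """Longest path from node 0 to the fictitious node text_len.

    occ[j] = (start, i) is occurrence j of string i; only occurrences of
    strings with y[i] == 1 contribute arcs. Returns the maximum number of
    covered characters and the indices of the occurrences on the path.
    """
    # arcs_into[v]: occurrence arcs (u, weight, j) ending at node v.
    arcs_into = [[] for _ in range(text_len + 1)]
    for j, (start, i) in enumerate(occ):
        if y[i]:
            arcs_into[start + lengths[i]].append((start, lengths[i], j))

    best = [0] * (text_len + 1)
    pred = [None] * (text_len + 1)  # (previous node, occurrence index or None)
    for v in range(1, text_len + 1):
        # Zero-weight arc from v-1: character v-1 is left uncovered.
        best[v], pred[v] = best[v - 1], (v - 1, None)
        for u, w, j in arcs_into[v]:
            if best[u] + w > best[v]:
                best[v], pred[v] = best[u] + w, (u, j)

    # Walk the predecessor links back from the destination node.
    chosen, v = [], text_len
    while v > 0:
        u, j = pred[v]
        if j is not None:
            chosen.append(j)
        v = u
    chosen.reverse()
    return best[text_len], chosen


if __name__ == "__main__":
    # Text "abcabc" with strings "abc" (index 0) and "bc" (index 1).
    occ = [(0, 0), (3, 0), (1, 1), (4, 1)]
    print(best_cover(6, occ, [3, 2], [1, 1]))
```

Each node and each arc is visited once, so the running time is linear in |W| + |O|, which is what makes it affordable to re-solve the MPP for many different choices of y.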

2 Lagrangean relaxations

Because of the very large size of the problem instances of interest for practical applications, standard solvers based on linear relaxation and branch and bound are not viable. We compared two different Lagrangean relaxations, obtained by relaxing either the packing constraints or the variable upper bound constraints. In the former case the Lagrangean subproblem is solved by inspection, whereas in the latter it separates into a polynomial MPP and a trivial subproblem. We used the dual bounds provided by these relaxations, computed via subgradient optimization, in a branch and bound algorithm; we also compared different branching strategies and different Lagrangean heuristics to obtain primal bounds. The search tree is explored according to a best-first strategy. The Lagrangean-based approach yields tight upper bounds but does not provide good feasible solutions, especially for large instances, unless heuristic algorithms based on local search are incorporated.
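As an illustration of the first relaxation, the sketch below evaluates the Lagrangean bound obtained by dualizing the packing constraints (1) with multipliers λ_t ≥ 0. The inspection rule used here is the standard one for this structure and is a reconstruction, not taken from the source: each occurrence gets a reduced profit c_j = α l_u(j) − Σ_t a_tj λ_t, a string is opened when the positive reduced profits of its occurrences exceed its cost (1 − α)/l_i, and an occurrence is selected when its string is open and c_j > 0. The function name `lagrangean_bound` and the data layout are hypothetical.

```python
from typing import List, Tuple


def lagrangean_bound(occ: List[Tuple[int, int]], lengths: List[int],
                     lam: List[float], alpha: float = 0.1
                     ) -> Tuple[float, List[int], List[int]]:
    """Upper bound on z for multipliers lam[t] >= 0 on constraints (1).

    occ[j] = (start, i) is occurrence j of string i, covering the
    characters start, ..., start + lengths[i] - 1.
    """
    n = len(lengths)
    gain = [0.0] * n  # sum of positive reduced profits per string
    reduced = []
    for start, i in occ:
        # Reduced profit: alpha * length minus the multipliers it covers.
        c = alpha * lengths[i] - sum(lam[start:start + lengths[i]])
        reduced.append(c)
        if c > 0:
            gain[i] += c

    cost = [(1 - alpha) / lengths[i] for i in range(n)]
    # Open a string only if its occurrences pay for its cost.
    y = [1 if gain[i] > cost[i] else 0 for i in range(n)]
    x = [1 if c > 0 and y[i] else 0 for c, (_, i) in zip(reduced, occ)]
    bound = sum(lam) + sum(gain[i] - cost[i] for i in range(n) if y[i])
    return bound, x, y


if __name__ == "__main__":
    # Text "abcabc" with strings "abc" (index 0) and "bc" (index 1).
    occ = [(0, 0), (3, 0), (1, 1), (4, 1)]
    print(lagrangean_bound(occ, [3, 2], [0.0] * 6))
```

In a subgradient scheme, the multipliers would then be updated along the violation Σ_j a_tj x_j − 1 of each packing constraint and projected back onto λ_t ≥ 0; step-size rules are a tuning choice that this sketch leaves out.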

3 Local search heuristics

We studied two different local search algorithms, based on threshold accepting and simulated annealing [4]. These methods allow worsening moves whose acceptance, deterministic in the former case and probabilistic in the latter, is governed by parameters subject to a suitably tuned cooling schedule. The algorithms we tested work on the y variables, and for each value of the y variables a complete solution is computed by solving an MPP. Given a solution s, its neighborhood N(s) is the set of solutions that differ from s in the value of one variable y_i; therefore the neighborhood size is O(|S|), which is usually very large. Thus only a subset of N(s) is explored, by repeatedly selecting Q neighbors until a solution is accepted or some ending criterion is satisfied, such as a prescribed maximum number of trials or a time-out. The selection of the Q neighbors can be done in a deterministic or in a non-deterministic way. We performed extensive computational tests with different values of Q and with both deterministic and non-deterministic policies. The value of Q makes it possible to tune the trade-off between solution quality and computing time. We also report on experiments with different cooling schedules.
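The scheme described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation tested in the paper: the geometric threshold schedule, the random choice of Q single-flip neighbors, the rule of moving to the best of the Q sampled neighbors when it passes the threshold test, and all names and default parameter values are assumptions made for the example.

```python
import random
from typing import List, Tuple


def coverage(text_len: int, occ: List[Tuple[int, int]], lengths: List[int],
             y: List[int]) -> int:
    """Maximum number of characters covered using only selected strings (MPP)."""
    starts_ending_at = [[] for _ in range(text_len + 1)]
    for start, i in occ:
        if y[i]:
            starts_ending_at[start + lengths[i]].append(start)
    best = [0] * (text_len + 1)
    for v in range(1, text_len + 1):
        best[v] = best[v - 1]  # leave character v-1 uncovered
        for u in starts_ending_at[v]:
            best[v] = max(best[v], best[u] + v - u)
    return best[text_len]


def evaluate(text_len, occ, lengths, y, alpha):
    """Objective z for a given y, with x obtained by solving the MPP."""
    cost = sum(1.0 / lengths[i] for i, yi in enumerate(y) if yi)
    return alpha * coverage(text_len, occ, lengths, y) - (1 - alpha) * cost


def threshold_accepting(text_len, occ, lengths, alpha=0.1, q=2,
                        threshold=0.2, decay=0.9, iters=200, seed=0):
    """Flip-one-y_i local search that accepts bounded worsening moves."""
    rng = random.Random(seed)
    n = len(lengths)
    y = [0] * n
    cur = evaluate(text_len, occ, lengths, y, alpha)
    best_y, best_z = y[:], cur
    for _ in range(iters):
        # Sample q neighbors (single flips) and score each one.
        scored = []
        for i in rng.sample(range(n), min(q, n)):
            y[i] ^= 1
            scored.append((evaluate(text_len, occ, lengths, y, alpha), i))
            y[i] ^= 1  # undo the trial flip
        z, i = max(scored)
        # Deterministic acceptance: worsening allowed up to the threshold.
        if z >= cur - threshold:
            y[i] ^= 1
            cur = z
            if cur > best_z:
                best_y, best_z = y[:], cur
        threshold *= decay  # "cooling": tolerate less worsening over time
    return best_y, best_z


if __name__ == "__main__":
    occ = [(0, 0), (3, 0), (1, 1), (4, 1)]  # text "abcabc", strings "abc", "bc"
    print(threshold_accepting(6, occ, [3, 2]))
```

Simulated annealing would differ only in the acceptance test, replacing the deterministic comparison against the threshold with a probabilistic one that depends on a temperature parameter.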

4 Conclusions

Algorithms based on Lagrangean relaxation provide a good upper bound and a feasible solution very quickly, but they improve these values very slowly. Relaxing the variable upper bound constraints yields better dual bounds than relaxing the packing constraints. Running local search algorithms alone gives very good solutions in a reasonable time, and this can be done starting from different solutions. The best results have been obtained by threshold accepting with a deterministic neighborhood search. Details on the algorithms and computational results can be found in [3]. The features of both Lagrangean relaxation and local search can be exploited simultaneously when local search heuristics are invoked at each node of the branching tree, starting from a feasible solution provided by a simple Lagrangean heuristic. The gap between the upper bounds provided by Lagrangean relaxation and the lower bounds provided by local search falls below 0.6% in a few minutes for problem instances with 370000 occurrences (and an equal number of variable upper bound constraints), 10000 strings and 220000 packing constraints.

References

[1] Y. Yang, X. Liu: “A re-examination of Text Categorization Methods”, 22nd Annual International SIGIR, pp. 42-49, Berkeley, August 1999
[2] J.P. Vert: “Text categorization using adaptive context trees”, Proceedings of the CICLing-2001 conference, pp. 423-436, Springer Verlag, 2001
[3] S. Bosio: “Algoritmi di programmazione matematica per un problema di classificazione di testi” (in Italian), degree thesis, Dipartimento di Tecnologie dell’Informazione, Università degli Studi di Milano, December 2002 (available at http://sansone.crema.unimi.it/~righini/Papers/Bosio.pdf)
[4] E.H.L. Aarts, J.K. Lenstra: “Local search in Combinatorial Optimization”, John Wiley & Sons, 1997
