Information Processing Letters 51 (1994) 245-250
Halfspace learning, linear programming, and nonmalicious distributions

Philip M. Long ¹
Computer Science Department, Duke University, P.O. Box 90129, Durham, NC 27708, USA

Communicated by M.J. Atallah; received 6 August 1993; revised 11 April 1994

¹ Supported by Air Force Office of Scientific Research grant F49620-92-J-0515 and a Lise Meitner Fellowship from the Fonds zur Förderung der wissenschaftlichen Forschung (Austria).

Abstract

We study the application of a linear programming algorithm due to Vaidya to the problem of learning halfspaces in Baum's nonmalicious distribution model. We prove that, in n dimensions, this algorithm learns to accuracy ε with probability 1 - ε in O((n²/ε) log²(n/ε) + n^3.38 log(n/ε)) time. For many combinations of the parameters, this compares favorably with the best known bound for the perceptron algorithm, which is O(n²/ε⁵).

Key words: Computational learning theory; Analysis of algorithms

1. Introduction

In part due to their interpretation as artificial neurons [13,15], a great deal has been written about the analysis of halfspaces in different computational models of learning. In this paper, we analyze the use of one of Vaidya's linear programming algorithms [18] to learn halfspaces in a variant of the PAC model [19] due to Baum [2], called the non-malicious distribution (NMD) model.

In the NMD model, a halfspace H ⊆ R^n is chosen randomly. There is a straightforward reduction from the problem of learning arbitrary halfspaces to learning homogeneous halfspaces,


i.e., those that go through the origin (see, e.g., [5]). We will therefore restrict our attention to homogeneous halfspaces. The halfspace H to be learned is then chosen by picking its normal vector uniformly from the unit sphere. (If one works through the reduction from general halfspaces, this gives rise to the distribution on general halfspaces obtained by choosing (w_1, ..., w_n, b) uniformly from the unit ball in R^{n+1}, and setting the halfspace to be learned to be {x | w · x ≥ b}.) It is further assumed that there is a probability distribution D over R^n that is unknown to the learner, but that the learner may (in unit time) sample points x_1, ..., x_m independently at random according to D. Furthermore, the learner can find out whether each of these points is in the hidden halfspace H. The goal of the learner is to output an approximation Ĥ of H. The accuracy of Ĥ is measured by the probability that another point x drawn from D will be in the


symmetric difference of Ĥ and H, i.e., that Ĥ is "wrong" about x. The learner is given an input ε > 0, and is required to ensure that, with probability at least 1 - ε with respect to the random choice of H and x_1, ..., x_m, the accuracy of Ĥ is better than ε.

This model takes a Bayesian viewpoint, as has been done in [9,14] and elsewhere. In Baum's original formulation, there were two separate parameters measuring the accuracy and the probability that the given accuracy was achieved. However, combining the two simplifies our bounds greatly, and it seems that in practice the two should at least be close. Both being extremely sure of having a mediocre hypothesis and being fairly sure of having an outstanding hypothesis seem unusual goals. ²

Baum [2] showed that the perceptron algorithm [15] accomplishes the above in O(n²/ε⁵) time. In this paper, we show that an algorithm based on linear programming [18] ³ requires

O( (n²/ε) log²(n/ε) + n^3.38 log(n/ε) )

time. For many values of n and ε, this represents a substantial improvement. For example, when n and 1/ε are within a constant factor, the linear programming bound of this paper grows like O(n^3.38 log n), which compares favorably with the O(n⁷) bound in this case for the perceptron algorithm.

² The proof of results for the PAC model [8] that show that the dependence of the requirements on the inverse of the "confidence" can always be brought down to being logarithmic cannot be easily modified for this model, since, loosely speaking, here the probability of failure depends on the random choice of the target as well as the random sample.

³ The idea of using linear programming for recovering a "separating" halfspace is quite old (see [5]), and its use to generate polynomial time learning algorithms originates with Blumer et al. [3].
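As a point of reference, the following is a minimal sketch of the classical perceptron update rule for homogeneous halfspaces, i.e., the algorithm of [15] whose NMD analysis is due to Baum [2]. It is not part of this paper's construction; the fixed pass budget, the synthetic Gaussian data and the NumPy dependency are illustrative choices only.

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Classical perceptron for a homogeneous halfspace: whenever an example
    (x_i, y_i) with y_i in {+1, -1} is misclassified, i.e. y_i * (w . x_i) <= 0,
    update w <- w + y_i * x_i.  Stops after a full pass with no mistakes, or
    after max_passes passes (an illustrative cap, not part of any analysis)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w = w + y_i * x_i
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Toy usage on linearly separable synthetic data.
rng = np.random.default_rng(1)
w_star = rng.standard_normal(5)
X = rng.standard_normal((200, 5))
y = np.where(X @ w_star >= 0, 1.0, -1.0)
w = perceptron(X, y)
print(np.mean((X @ w >= 0) == (y > 0)))   # fraction of the sample classified correctly
```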

We make use of O(n^2.38)-time matrix multiplication [4] as a subroutine. Since the constants in the analysis of this matrix multiplication algorithm, as well as those for other o(n³) algorithms, are quite large, some claim that in practice the time required for multiplying n × n matrices grows like O(n³). However, replacing the n^2.38 in our learning bounds with n³ only changes the n^3.38 in our bound to an n⁴. An O(n⁴ log n) bound would still be a significant improvement over O(n⁷).

The most natural application of linear programming techniques to the problem of learning halfspaces is to draw a sample x_1, ..., x_m, and then find a halfspace Ĥ for which each of x_1, ..., x_m is in Ĥ if and only if it is in the halfspace H to be learned, since this requirement amounts to m linear constraints on the normal vector of Ĥ [3]. However, in order to simply plug in to the time bounds for solving linear inequalities (with m inequalities and n unknowns), one must make assumptions on the precision of the x_j's. (It is open whether there is a polynomial time algorithm for the unit-cost model for linear programming. Such an algorithm would imply unit-cost polynomial time algorithms for learning in the NMD model, as well as Valiant's PAC model [19]; see [3].) In the NMD model, we have found that a different approach does not require any assumptions on the precision of the x_j's, and yields significantly improved time bounds. ⁴

⁴ In fact, for many combinations of the parameters, this approach yields improved time bounds for Valiant's PAC model, but the improvement is less dramatic.

The iterations in Vaidya's linear programming algorithm can be viewed as what are called queries in another model of learning (due to Angluin [1]), often called the equivalence query model [12]. Implicit in Vaidya's analysis of this algorithm are bounds on the number of iterations made by this algorithm in terms of the volume of the solution set of the system of linear inequalities [18]. These bounds can therefore also be viewed as equivalence query bounds. Using conversions from this model to the PAC model [1,10,11], one can see that, for arbitrary ε > 0, if the volume of the set of normal vectors defining halfspaces that "agree" with the target halfspace H on x_1, ..., x_m is not unreasonably small, then, with high probability, a randomized variant of Vaidya's algorithm quickly finds a setting of weights whose associated halfspace Ĥ satisfies that the fraction of elements of x_1, ..., x_m which lie in the symmetric difference of Ĥ and H is at most ε/2. Applying results of Blumer et al., for m = O((n/ε) log(1/ε)), this is good enough. Finally, we show that, with high probability, the volume of the set of normal vectors defining halfspaces which "agree" with the halfspace to be learned on x_1, ..., x_m is not too small.
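To make the "most natural" linear-programming formulation described above concrete, here is a minimal sketch, not taken from the paper: draw a labeled sample in an NMD-style setting and recover a consistent normal vector by solving a feasibility linear program. The choice of a standard Gaussian for D, the margin-1 constraints (which presuppose strict separability, true with probability 1 here), and the use of NumPy/SciPy's general-purpose solver are all assumptions for illustration; they stand in for the specialized algorithms discussed in the text.

```python
import numpy as np
from scipy.optimize import linprog

def nmd_sample(n, m, rng):
    """Mimic the NMD setting: a hidden normal vector w* drawn uniformly from the
    unit sphere, and m points drawn from a distribution D that is unknown to the
    learner (here a standard Gaussian, purely for illustration).  A label is +1
    exactly when w* . x >= 0."""
    w_star = rng.standard_normal(n)
    w_star /= np.linalg.norm(w_star)              # uniform direction on the sphere
    X = rng.standard_normal((m, n))
    y = np.where(X @ w_star >= 0, 1.0, -1.0)
    return w_star, X, y

def consistent_halfspace(X, y):
    """Find w with y_i (w . x_i) >= 1 for all i via a feasibility LP; this imposes
    the m linear constraints mentioned in the text (with a margin of 1, which
    assumes strict separability of the sample)."""
    m, n = X.shape
    res = linprog(c=np.zeros(n),
                  A_ub=-(y[:, None] * X),         # -y_i (x_i . w) <= -1
                  b_ub=-np.ones(m),
                  bounds=[(None, None)] * n,
                  method="highs")
    if not res.success:
        raise ValueError("no consistent homogeneous halfspace found")
    return res.x / np.linalg.norm(res.x)

rng = np.random.default_rng(0)
w_star, X, y = nmd_sample(n=10, m=500, rng=rng)
w_hat = consistent_halfspace(X, y)
# Estimate the error, i.e. the probability of landing in the symmetric difference.
X_test = rng.standard_normal((20000, 10))
print(np.mean((X_test @ w_star >= 0) != (X_test @ w_hat >= 0)))
```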

2. Preliminaries

Let N denote the positive integers and R the reals. For sets S_1 and S_2, let S_1 Δ S_2 denote the symmetric difference of S_1 and S_2. We use the unit cost RAM model of computation (see, e.g., [17]), where it is assumed that the standard operations on real numbers can be done in unit time.

Choose X = ∪_n X_n, and for all n, let 𝒞_n be a class of subsets of X_n, and 𝒞 = ∪_n 𝒞_n. Such a 𝒞 is called a concept class. An example of C ∈ 𝒞_n is a pair (x, χ_C(x)) ∈ X_n × {0, 1}, where χ_C is C's characteristic function. A subset C' of X_n is consistent with (x, χ_C(x)) if χ_C(x) = χ_{C'}(x). A sample is a finite sequence of examples. We say C' ⊆ X_n is consistent with a sample if it is consistent with each example in the sample.

A batch learning algorithm A takes as inputs an n ∈ N, a desired accuracy ε > 0, and a sequence of elements of ⁵ X_n × {0, 1}. A then outputs an algorithm for computing a function h : X_n → {0, 1}. Usually, the algorithm can be understood from the description of h, and we will refer to the two interchangeably in the rest of this paper.

⁵ In this paper, an encoding for the elements of X will always be obvious.

We begin with the PAC model [19]. For a function φ : R × N → N ∪ {∞}, we say A PAC-learns 𝒞 in time φ if and only if there is a function ψ : R × N → N ∪ {∞} such that for all input parameters n and ε, if m = ψ(1/ε, n):
• m ≤ φ(1/ε, n),
• whenever A receives n, ε, and examples (x_1, y_1), ..., (x_m, y_m), A halts within time φ(1/ε, n) and outputs a hypothesis h, whose running time (on inputs in X_n) is also at most φ(1/ε, n),
• for all probability distributions D on X_n, and for all C ∈ 𝒞_n, if A is given m examples of C generated independently according to D, the probability that the output h of A satisfies Pr_{u∼D}(h(u) ≠ χ_C(u)) ≤ ε ("has accuracy better than ε") is at least 1 - ε.

Now we turn to the NMD model [2]. Suppose, for each n ∈ N, 𝒟_n is a probability distribution on 𝒞_n, and 𝒟 = (𝒟_n | n ≥ 1). In this case, we say A PAC-learns 𝒞 with respect to 𝒟 in the NMD model in time φ if and only if there is a function ψ : R × N → N such that for all input parameters n and ε, if m = ψ(1/ε, n):
• m ≤ φ(1/ε, n),
• for any x_1, ..., x_m, the expectation, over a randomly chosen C according to 𝒟_n, of the running time of A on inputs n, ε, and (x_1, χ_C(x_1)), ..., (x_m, χ_C(x_m)) is at most φ(1/ε, n), and the running time of the resulting hypotheses h (on inputs in X_n) is at most φ(1/ε, n),
• for all probability distributions D on X_n, if (x_1, ..., x_m) and C are chosen independently at random from D^m and 𝒟_n, and (x_1, χ_C(x_1)), ..., (x_m, χ_C(x_m)) is given to A, the probability that the output h of A satisfies Pr_{u∼D}(h(u) ≠ χ_C(u)) ≤ ε is at least 1 - ε.

We will also discuss concept classes not indexed by n. It should be obvious how to adjust the above definitions for this case.

Finally, we define the equivalence-query model [1]. In the equivalence-query model, when the algorithm is learning a class 𝒞, it sequentially produces queries which are (representations of) elements of 𝒞. When the algorithm correctly guesses the function C to be learned, the learning process is over. Otherwise, if the algorithm has queried C', it receives x ∈ C Δ C'. A learning algorithm for this model is called a query algorithm.

For each n, let HALF_n be the set of all homogeneous halfspaces in R^n, i.e.,

HALF_n = { {x ∈ R^n | w · x ≥ 0} | w ∈ R^n }.

Let HALF = ∪_n HALF_n.
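As a concrete instance of these definitions, consider the class HALF_n: a concept is determined by a normal vector w, its characteristic function is x ↦ 1 if w · x ≥ 0 and 0 otherwise, and a hypothesis is consistent with a sample exactly when it reproduces every label. The following small sketch (not from the paper; the helper names and the NumPy dependency are illustrative) spells this out.

```python
import numpy as np

def chi(w, x):
    """Characteristic function of the homogeneous halfspace with normal vector w."""
    return 1 if np.dot(w, x) >= 0 else 0

def make_sample(w, points):
    """A sample: a finite sequence of examples (x, chi_C(x)) for the concept with normal vector w."""
    return [(x, chi(w, x)) for x in points]

def consistent(w_prime, sample):
    """The halfspace with normal vector w' is consistent with the sample iff it agrees with every example."""
    return all(chi(w_prime, x) == label for x, label in sample)

rng = np.random.default_rng(2)
w_target = rng.standard_normal(3)
sample = make_sample(w_target, rng.standard_normal((10, 3)))
print(consistent(w_target, sample))    # True: the target is consistent with its own sample
print(consistent(-w_target, sample))   # False (with probability 1): the opposite halfspace disagrees
```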


3. The analysis

We will make use of the following lemma, which is implicit in the analysis of Littlestone.

Lemma 1 [11]. Choose X and 𝒞 ⊆ 2^X. Suppose that for any x ∈ X and C ∈ 𝒞, the question of whether x ∈ C can be computed in β time. If there is a query algorithm B for 𝒞 for which the time taken by B between queries is at most α, and the algorithm makes at most γ queries, then there is a PAC learning algorithm for 𝒞 which outputs hypotheses from 𝒞, whose time requirement is

O( γ ( α + (β/ε) log(γ/ε) ) )

and which uses

O( (1/ε)( γ + log(1/ε) ) )

random examples.

The following notation will be useful. Fix some n ∈ N. For each w ∈ R^n, let

H_w = { x ∈ R^n | w · x ≥ 0 }.

For a finite set S ⊆ R^n and some w ∈ R^n, let H_{w,S} = H_w ∩ S. Also, define

E_{S,w} = { z ∈ R^n | H_{z,S} = H_{w,S}, ||z||_2 ≤ 1 }.

Informally, E_{S,w} represents the set of all halfspaces that are indistinguishable from H_w on the basis of S. Finally, for a finite set S ⊆ R^n and v > 0, let

HALF_{S,v} = { H_{w,S} | w ∈ R^n, volume(E_{S,w}) > v }.

Now we are ready for another lemma, which is implicit in the analysis of Vaidya [18], if his linear programming algorithm is viewed as a query algorithm in the manner of Maass and Turan [12].

Lemma 2 [18,12]. For any finite S ⊆ R^n, there is a query algorithm MTV such that, for all v > 0, MTV learns HALF_{S,v} while using at most O(n^2.38) time between queries, and makes at most

O( n log n + log(1/v) )

queries.
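To make this notation concrete, the following sketch (not from the paper; NumPy and the Monte Carlo scheme are assumptions for illustration) computes the labeling H_{w,S} that a weight vector w induces on a finite set S, so that two weight vectors lie in the same cell E_{S,w} exactly when these labelings coincide, and crudely estimates volume(E_{S,w}), the quantity whose size determines membership in HALF_{S,v}. The query algorithm MTV itself is not implemented here.

```python
import numpy as np

def labeling(w, S):
    """H_{w,S}: which points of the finite set S lie in the halfspace H_w."""
    return tuple(bool(np.dot(w, x) >= 0) for x in S)

def same_cell(w1, w2, S):
    """w1 and w2 lie in the same cell E_{S,w} iff they label S identically."""
    return labeling(w1, S) == labeling(w2, S)

def volume_estimate(w, S, trials=200_000, rng=None):
    """Monte Carlo estimate of volume(E_{S,w}): the volume of the set of vectors z
    in the unit ball with H_{z,S} = H_{w,S}.  Purely illustrative."""
    rng = rng or np.random.default_rng(0)
    n = len(w)
    Z = rng.uniform(-1.0, 1.0, size=(trials, n))           # uniform in the cube [-1, 1]^n
    in_ball = np.einsum('ij,ij->i', Z, Z) <= 1.0
    target = np.array(labeling(w, S))
    same = np.all((Z[in_ball] @ np.asarray(S).T >= 0) == target, axis=1)
    return (2.0 ** n) * np.count_nonzero(same) / trials    # fraction of the cube, rescaled

rng = np.random.default_rng(3)
S = rng.standard_normal((6, 3))
w = rng.standard_normal(3)
print(same_cell(w, 2.0 * w, S))         # True: scaling w does not change the labeling
print(volume_estimate(w, S, rng=rng))
```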

We also make use of a special case of a result of Blumer, Ehrenfeucht, Haussler, and Warmuth.

Lemma 3 [3]. There is a constant c such that, for all ε < 1/4, if a (randomized) batch learning algorithm for learning HALF_n draws

m ≥ (cn/ε) log(1/ε)

random examples, and with probability at least 1 - ε/2 outputs a halfspace Ĥ such that the fraction of examples in the symmetric difference of Ĥ and the halfspace H to be learned is at most ε/2, then Ĥ has accuracy better than ε with probability at least 1 - ε.

The following bound will also be useful.

Lemma 4 (see [16,6,3]). Choose m, n ∈ N and S = {x_1, ..., x_m} ⊆ R^n. Then

|{ T ⊆ S | ∃H ∈ HALF_n, H ∩ S = T }| ≤ (em/n)^n.

Finally, we record a trivial lemma.

Lemma 5. Choose a set X and a probability distribution D over X. Suppose E_1, ..., E_k form a partition of X, and for each x ∈ X, i_x is the index of the element of the partition containing x. Then for all c > 0,

Pr_{x∼D}( D(E_{i_x}) ≤ c ) ≤ kc.
Let m be the least power of 2 not less than (cn/ε) log(1/ε), where c is as in Lemma 3;
draw x_1, ..., x_m ∈ R^n independently at random from D;
S := {x_1, ..., x_m};
v := (ε/4)(2/(em))^n;
convert algorithm MTV (from Lemma 2) for learning HALF_{S,v} into a PAC algorithm A' using the transformation of Lemma 1;
simulate A', using the uniform distribution on {x_1, ..., x_m}, to find, with probability at least 1 - ε/4, a halfspace Ĥ such that (1/m)|{ j | x_j ∈ Ĥ Δ H }| ≤ ε/2;
output Ĥ.

Fig. 1. Algorithm A from Theorem 6.
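The following is a loose, runnable sketch of the outer structure of Algorithm A, under assumptions the paper does not make: in place of the Lemma 1 conversion of the query algorithm MTV (and the setting of v), it simply finds a halfspace consistent with the whole sample by a feasibility LP, which drives the sample error to zero; the constant of Lemma 3, the data distribution and the NumPy/SciPy dependencies are placeholders chosen for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def algorithm_A_sketch(sample_oracle, n, eps, c3=4.0):
    """Loose sketch of the outer structure of Algorithm A (Fig. 1).
    sample_oracle(m) must return (X, y): m points drawn from D with labels in
    {+1, -1} according to the hidden halfspace.  The LP below, which finds a
    halfspace consistent with the whole sample, stands in for the Lemma 1
    conversion of MTV; the v-setting step of Fig. 1 is omitted.  c3 is a
    placeholder for the constant c of Lemma 3."""
    m = 1                                          # least power of 2 not less than (c n / eps) log(1/eps)
    while m < (c3 * n / eps) * np.log(1.0 / eps):
        m *= 2
    X, y = sample_oracle(m)
    res = linprog(c=np.zeros(n),
                  A_ub=-(y[:, None] * X),          # enforce y_i (w . x_i) >= 1
                  b_ub=-np.ones(m),
                  bounds=[(None, None)] * n,
                  method="highs")
    if not res.success:
        raise ValueError("sample not strictly separable by a homogeneous halfspace")
    return res.x                                   # normal vector of the output halfspace

# Illustrative oracle: D is a standard Gaussian, target direction uniform on the sphere.
rng = np.random.default_rng(4)
n = 8
w_star = rng.standard_normal(n)
w_star /= np.linalg.norm(w_star)

def oracle(m):
    X = rng.standard_normal((m, n))
    return X, np.where(X @ w_star >= 0, 1.0, -1.0)

w_hat = algorithm_A_sketch(oracle, n=n, eps=0.05)
```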


We put these together to prove the main result of this paper.

Theorem 6. If, for each integer n ≥ 2, 𝒟_n is the distribution over HALF_n obtained by choosing the normal vector w uniformly from the unit ball, then there is an algorithm A that PAC learns HALF with respect to 𝒟 = (𝒟_n) in the NMD model in time

O( (n²/ε) log²(n/ε) + n^3.38 log(n/ε) ).

Proof. Choose n. Consider the algorithm A described in Fig. 1. Fix S for the moment. Then, since the volume of the unit ball is more than 2^n/n!,

Pr_w( volume(E_{S,w}) ≤ v ) ≤ Pr_w( 𝒟_n(E_{S,w}) ≤ (n!/2^n) v ).

Combining Lemmas 4 and 5 with the above, we get

Pr_w( volume(E_{S,w}) ≤ v ) ≤ (em/n)^n (n!/2^n) v.

By the choice of v, Pr_w(volume(E_{S,w}) ≤ v) ≤ ε/4, since (em/n)^n (n!/2^n) (ε/4)(2/(em))^n = (ε/4) n!/n^n ≤ ε/4. By Fubini's Theorem (see [7, Vol. 2, p. 120]), since this holds for arbitrary S, it holds for randomly chosen S as well. Thus, the probability, with respect to the random choice of x_1, ..., x_m and of the target H, that H ∩ {x_1, ..., x_m} is not in HALF_{{x_1,...,x_m},v} is at most ε/4. By construction, this implies that the probability that Ĥ does not satisfy (1/m)|{ j | x_j ∈ Ĥ Δ H }| ≤ ε/2 is at most ε/2. Applying Lemma 3 completes the proof of the correctness of the algorithm.

To analyze the time used by the algorithm, it is tempting to simply plug into the time bounds of Lemma 1. However, in that lemma it is assumed that the algorithm may obtain random examples in unit time, whereas when we simulate A', we must generate random examples for it. Nevertheless, one may trivially sample from the uniform distribution on {1, ..., m} in O(log m) time, and therefore the total time spent sampling is at most log m times the sample size bound of Lemma 1, together with Lemma 2. Substituting and simplifying yields that this time is

O( (n/ε) log²(n/ε) ).    (1)

To bound the time spent otherwise, we plug into Lemma 1 together with Lemma 2, getting

O( ( n log n + log(1/v) ) ( n^2.38 + (n/ε) ln( (n log n + log(1/v)) / ε ) ) ).

Expanding the definitions of v and m, simplifying, and observing that the resulting quantity dominates (1), completes the proof. □

Acknowledgement

We especially thank Nicolò Cesa-Bianchi, Yoav Freund, Hans Hagauer, David Haussler and Manfred Opper.

References

[1] D. Angluin, Queries and concept learning, Machine Learning 2 (1988) 319-342.
[2] E. Baum, The perceptron algorithm is fast for nonmalicious distributions, Neural Comput. 2 (1990) 248-260.
[3] A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability and the Vapnik-Chervonenkis dimension, J. ACM 36 (4) (1989) 929-965.
[4] D. Coppersmith and S. Winograd, Matrix multiplication via arithmetic progressions, in: Proc. 19th ACM Symp. on the Theory of Computation, 1987.
[5] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973).
[6] H. Edelsbrunner, Algorithms in Combinatorial Geometry (Springer, Berlin, 1984).
[7] W. Feller, An Introduction to Probability Theory and its Applications, Vol. 1 (John Wiley and Sons, New York, 3rd ed., 1968).


[8] D. Haussler, M. Kearns, N. Littlestone and M. Warmuth, Equivalence of models for polynomial learnability, Inform. and Comput. 95 (1991) 129-161.
[9] D. Haussler, M. Kearns and R. Schapire, Bounds on the sample complexity of Bayesian learning using information theory and the VC-dimension, in: Proc. 1991 Workshop on Computational Learning Theory (1991) 61-74.
[10] M. Kearns, M. Li, L. Pitt and L. Valiant, Recent results on Boolean concept learning, in: P. Langley, ed., Proc. 4th Internat. Workshop on Machine Learning (Morgan Kaufmann, Los Altos, CA, 1987) 337-352.
[11] N. Littlestone, From on-line to batch learning, in: Proc. 1989 Workshop on Computational Learning Theory (1989) 269-284.
[12] W. Maass and G. Turan, How fast can a threshold gate learn?, Tech. Rept. 321, Graz Technical University, 1991.
[13] W.S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophysics 5 (1943) 115-133.
[14] M. Opper and D. Haussler, Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise, in: Computational Learning Theory: Proc. 4th Ann. Workshop (Morgan Kaufmann, Los Altos, CA, 1991) 75-87.
[15] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (Spartan Books, Washington, DC, 1962).
[16] N. Sauer, On the density of families of sets, J. Combin. Theory Ser. A 13 (1972) 145-147.
[17] R. Tarjan, Data Structures and Network Algorithms (SIAM, Philadelphia, PA, 1983).
[18] P. Vaidya, A new algorithm for minimizing convex functions over convex sets, in: Proc. 30th Ann. Symp. on the Foundations of Computer Science (1989) 338-343.
[19] L. Valiant, A theory of the learnable, Comm. ACM 27 (11) (1984) 1134-1142.

[14] M. Opper and D. Haussler, Calculation of the learning curve of Bayes optimal classification algorithm for learning a perception with noise, in: Computational Learning Theory: Proc. 4th Ann. Workshop (Morgan Kaufmann, Los Altos, CA, 1994) 75-87. [15] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (Spartan Books, Washington, DC, 1962). [16] N. Sauer, On the density of families of sets, .I. Corn&n. Theory Ser. A 13 (1972) 145-147. [17] R. Tarjan, Data Structures and Network Algorithms (SIAM, Philadelphia, PA, 1983). [18] P. Vaidya, A new algorithm for minimizing convex functions over convex sets, in: Proc. 30th Ann. Symp. on the Foundations of Computer Science (1989) 338-343. [19] L. Valiant, A theory of the learnable, Comm. ACM 27 (11) (1984) 1134-1142.