Approximating Matrix Multiplication for Pattern Recognition Tasks*




Journal of Algorithms 30, 211-252 (1999). Article ID jagm.1998.0989, available online at http://www.idealibrary.com

Edith Cohen† and David D. Lewis†, AT&T Labs-Research, Florham Park, New Jersey 07932. Received January 10, 1997; revised February 24, 1998.

Many pattern recognition tasks, including estimation, classification, and the finding of similar objects, make use of linear models. The fundamental operation in such tasks is the computation of the dot product between a query vector and a large database of instance vectors. Often we are interested primarily in those instance vectors which have high dot products with the query. We present a random sampling based algorithm that enables us to identify, for any given query vector, those instance vectors which have large dot products, while avoiding explicit computation of all dot products. We provide experimental results that demonstrate considerable speedups for text retrieval tasks. Our approximate matrix multiplication algorithm is applicable to products of k ≥ 2 matrices and is of independent interest. Our theoretical and experimental analysis demonstrates that in many scenarios, our method dominates standard matrix multiplication. © 1999 Academic Press

* A preliminary version of this paper appeared in the Proceedings of the Eighth ACM-SIAM Symposium on Discrete Algorithms, 1997.
† E-mail: edith,[email protected].

1. INTRODUCTION

We propose and analyze a random-sampling based algorithm that estimates all entries of a matrix product of k ≥ 2 matrices. The estimates are unbiased. The distribution of the relative error depends only on the relative size of an entry (in its respective row or column), and decreases for larger entries. Our algorithm is suited for objectives such as identifying all high-valued entries in nonnegative matrix products, and exhibits polynomial asymptotic improvements over the "exact" sparse multiplication algorithm. The savings are achieved by spending proportionately less time on low-valued entries in the product. Our technique is well suited for pattern


recognition applications. One such application is text retrieval, and we provide experimental results for several text retrieval tasks. For many tasks, both on sparse and dense matrices, our algorithm outperforms (asymptotically and in practice) matrix multiplication. A similar technique for estimating matrix-vector products (k = 2) was independently discovered by Clarkson [7, 8]. Clarkson is investigating the use of this method to speed up Gaussian elimination, the solving of linear systems of equations, and other mathematical programming applications.

In pattern recognition tasks, a database of instances to be processed (images, signals, documents, ...) is commonly represented as a set of length-d vectors x_1, ..., x_n of numeric feature values. Examples of feature values include the number of times a given word occurs in a document, the coordinates of the center of mass of black pixels in an image, or the set of spectral coefficients output by a speech analyzer. The desired processing of instances is often implemented using linear models [13, 21]. The knowledge required to carry out the processing is captured in a vector of numeric weights, q = (q_1, q_2, ..., q_d). Each instance is processed by taking the dot product of the weight vector q with each of the vectors x_1, ..., x_n of feature values. The weights in q may be set by hand or by supervised learning from training data, in order that the products q · x_i satisfy some desired goal. In classification problems, the goal is that values of q · x_i are high when x_i is "related" to or is a member of a particular class represented by q. In regression problems the goal is that q · x_i approximate some numeric value associated with instance x_i. In some cases, computing the proximity of one instance to another can be cast in linear form. For instance, if q and x are normalized to have a Euclidean norm of 1.0 and are viewed as unit-length vectors, then q · x is the cosine of the angle between them, a commonly used proximity measure. (In this case, the roles of instance vector and weight vector are interchangeable.)

These pattern recognition techniques are applied, for instance, in statistical information retrieval systems. Documents and queries are represented as vectors of terms or features. The user query can be treated as an example document, whose similarity to database documents must be found [23, Chap. 4], or it can be used as evidence toward setting the parameters of a probabilistic classification function, which is then applied to database documents [10, 22]. In both cases, linear models are widely used and continue to be developed [20, 29].

We can view all the preceding techniques in terms of matrix multiplication. Applying a weight vector q to a database of instance vectors means computing the vector-by-matrix product q^T A, where A is a matrix whose columns {x_1, ..., x_n} are the instance vectors. The value of the ith entry of q^T A corresponds to the dot product of q and x_i. In proximity problems, we


often need to compute dot products between all pairs of a set of instances. This is equivalent to computing the matrix product A^T A.

The scale of these matrix multiplications can be considerable. In text retrieval, for example, the number of instances (documents) can be 10^4 to 10^7 and the number of features (words or phrases) may be 10^5 or more. The resulting computation can be expensive, even when utilizing sparse matrix and indexing techniques [28]. Dense instances with hundreds of features are also common, for example with factor-analytic text representations (Section 5.1) or in image retrieval, and sparse matrix techniques are not applicable in these cases.

Our sampling algorithm identifies high-valued entries of nonnegative matrix products, while avoiding a costly exact computation that needlessly computes "uninteresting" low-valued entries. The algorithm assigns scores to each entry of q^T A. In contrast to existing approximate scoring techniques (see [26]), the expected value of the scores is equal to the true value that would be obtained with the full computation. Furthermore, the variance of the score divided by the true value is independent of the weight distribution of the instance vectors; it depends only on the (relative) value of the entry and decreases for higher-valued entries. Our algorithm may be extended to handle matrices that include negative entries, but provides weaker worst-case bounds on the estimate quality. On the real-valued data sets we have tested, however, our algorithm achieves significant improvements over multiplication.

In many pattern recognition applications, the exact value of q · x is not critical, and our approximate dot products can be used instead of true ones. For instance, the feature values in vectors are often set by coarse heuristics, with the result that small differences in dot products between vectors have little consequence. Furthermore, many tasks inherently do not require exact values. In a classification task, we may only need to know whether q · x exceeds some threshold value for class membership. In an information retrieval system, we may want to find, for instance, the documents with the 10 highest scores and display them to the user, without needing exact knowledge of the score of any document. However, our algorithm is also applicable when exact top values are needed, as when the goal is to identify exactly the 10 largest dot products. In such tasks, the scores produced by our algorithm can be used to identify a small set of instances that is likely to contain the desired set of instances (e.g., those with large dot products). The exact dot-product evaluation is carried out only for these selected instances.

The running time of our algorithm depends on the accuracy required for the task at hand and on the distribution of the scored values. Our algorithm is particularly suited for problems where a typical query q has a small number of relevant instances (high dot-product values) and the bulk


of instances have significantly smaller values. As a rough "asymptotic" comparison, the exact computation m = q^T A = (m_1, ..., m_n) of all dot products takes O(dn) operations (for dense instances). Our proposed algorithm, for the task of isolating the highest dot product (possibly with some others which are very close in value) for nonnegative matrices, performs a one-time O(dn) (for dense instances) preprocessing step, and then takes

$$O\!\left(d + \frac{\sum_{i=1}^{n} m_i}{\max_{i \in \{1,\dots,n\}} m_i}\,\log n\right)$$

(at most O(n log n), but usually less) operations per query. Not only does the sampling algorithm perform fewer operations, but the operations are simpler: a memory access and a counter increment¹ versus a floating-point multiply-add. In experiments with large information retrieval datasets we demonstrate speedups of 5- to 15-fold without significantly compromising the accuracy of the results.

Matrix product estimation. We state the time bounds and estimation quality of our matrix-product estimation. Let A_1, ..., A_k (k ≥ 1) be nonnegative matrices. These matrices can be viewed as a "fixed" database. Let A^{n_1 × n_2} = A_1 ⋯ A_k be their product. Our sampling algorithm preprocesses the matrices A_1, ..., A_k in time linear in Σ_{i=1}^k nnz(A_i), the total number of their nonzero entries. Subsequently, consider the estimation of a row of A (the result of one query); denote that row by m = (m_1, ..., m_{n_2}), let M = Σ_i m_i, and let m̂ be the estimate our algorithm produces. For parameters p and ε, using a number of samples proportional to p^{-1} ε^{-2} log n_2 (quantified precisely in Section 2), with high confidence every entry satisfying

$$\frac{m_i}{M} \ge p$$

is estimated to within relative error ε, that is, |m̂_i − m_i| ≤ ε m_i,

1 Random numbers are generated, but as described in the implementation section, this cost may be amortised across queries by storage and reuse.

APPROXIMATE MM FOR PATTERN RECOGNITION

215

and moreover (with the same confidence),

$$\forall i \in \{1, \ldots, n_2\}, \qquad \frac{m_i}{M} < p\,(1 - \epsilon) \ \text{ implies } \ \frac{\hat m_i}{M} < p.$$

In particular, observe that all entries in a row larger than the mean of the entries of that row (including, of course, the largest entry) are such that m_i/M ≥ 1/n_2. Therefore, for any ε and with the previously stated confidence, all entries greater than the mean are identified and estimated to within relative error ε using O(n_2 ε^{-2} log n_2) samples. Additive error bounds are also available (see Section 2): with an appropriate number of samples and confidence 1 − 1/poly(n_2),

$$\forall i \in \{1, \ldots, n_2\}, \qquad |\hat m_i - m_i| \le \epsilon.$$

Real-valued matrix products. Our basic sampling algorithm applies directly to nonnegative vectors and matrices. Nonnegativity is common but not universal in pattern recognition tasks. Negative values occur moderately often in query vectors, particularly when the query vector is actually a linear classifier or regression function. Negative values in instance vectors are less common, but occur in signal processing tasks and occasionally occur even in text processing (see Section 5.1). Our algorithm extends to real-valued matrices, but weaker bounds apply. We estimate real-valued matrix products by representing each product as the difference of two positive quantities, which we call the positive and the negative contributions of the product. Each contribution can be estimated using a modification of our algorithm for nonnegative products. Consider real-valued matrices A_1, ..., A_k and a query q. Let A, m = q^T A, m̂, and M be defined as in the nonnegative case earlier. As is the case for nonnegative data, the preprocessing step takes time linear in Σ_{i=1}^k nnz(A_i).


(The vector m' is the sum of the positive and the negative contributions of m.) In time proportional to (M'/ε)² log n_2, where M' = Σ_i m'_i (see Proposition 3.6), and with confidence 1 − 1/poly(n_2),

$$\forall i \in \{1, \ldots, n_2\}, \qquad |\hat m_i - m_i| \le \epsilon.$$

The relative error of the estimate depends on the relative magnitudes of the positive and negative contributions, within their respective rows, and with respect to each other (bounds are provided in Section 3). Hence, it could be arbitrarily bad in the worst case. Empirically, however, for the real-valued data we had available, entries with large values typically had large positive contributions and small negative contributions (the contributions were inversely correlated), which allows for estimates with a small relative error.

Relation to Euclidean proximity problems. A variety of tasks involve finding instances in close proximity to other instances. This may be a goal in itself, or may be a subtask in performing classification [11], regression [18], and other tasks. Frequently seen proximity problems are closest pair, nearest neighbors, and bichromatic nearest neighbors. A variety of proximity measures can be used in such tasks. As mentioned earlier, the cosine correlation can be implemented directly as a dot product. Another common measure is the Euclidean distance, which is related to the dot product as follows:

$$\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\, u \cdot v.$$

When vectors have a Euclidean norm of 1.0, the dot product determines the Euclidean distance exactly. Simple reductions can handle vectors whose Euclidean norm is not 1.0. The caveat in applying our algorithm to Euclidean nearest neighbors is that a small relative error in estimating a dot product translates to an additive error which is a fraction of the Euclidean norm. Additive errors on dot-product estimation, however, do result in similar additive errors on the corresponding Euclidean distances. The Euclidean proximity problems of computing all-pairs nearest neighbors (likewise, computing all dot products) are trivially solvable in O(dn²) time (for dense data; for sparse data the time depends on the nonzero structure). Some known algorithms perform better when the dimension d can be treated as a constant (or is very small). Bentley, Weide, and Yao [3] showed better expected asymptotic performance for instances where the dimension is constant and the vectors are randomly sampled from certain distributions. Arya et al. [1] established better worst-case performance for constant dimension, by allowing approximation. They presented an O(n log n)-time algorithm that computes (1 + ε)-approximate nearest neighbors (i.e., neighbors with distance at most (1 + ε) times the distance of the closest neighbor), for any fixed constants ε and d.


The running time of these algorithms is near-linear in n, but exhibits exponential dependence on d. For high dimensions, even for the average-case or approximate problems, the O(dn²) bound is the best known (short of using fast matrix multiplication). Our algorithm exhibits significantly faster running times for realistic distributions. It is best suited for datasets where the ε-neighborhoods do not contain many points, or where most of the near neighbors are relevant and are far from the bulk of irrelevant points. It is easy to see that this is not the case for data generated uniformly at random; it is typical, however, for "real" datasets. On such real distributions, the running time for all nearest neighbors could be O(n² log n).

In Section 2 we present and analyze our sampling algorithm for nonnegative matrix products. In Section 3 we generalize the sampling algorithm to real-valued matrices. In Section 4 we discuss the implementation. Our experimental results are presented in Section 5.

2. THE SAMPLING METHOD FOR NONNEGATIVE PRODUCTS

Consider a product M^{n_1 × n_{k+1}} = A_1^{n_1 × n_2} ⋯ A_k^{n_k × n_{k+1}} of k ≥ 2 nonnegative matrices. We present our sampling algorithm for estimating the entries of the product M. The algorithm inputs the matrices A_1, ..., A_k. In an on-line setting, the matrices A_2, ..., A_k are fixed and provided as input at preprocessing time, and vectors q ∈ R^{n_2} are presented on-line (this can be viewed as supplying the rows of the matrix A_1 on-line). We first overview the structure (input, output, running time) of the algorithm.

• Preprocessing stage. Inputs the matrices A_2, ..., A_k and creates appropriate data structures. Preprocessing requires time linear in the number of nonzero entries in A_2, ..., A_k.

• Scoring the ith row of M. Inputs: q ≡ [A_1]_{i·} (the ith row of A_1, where i ∈ {1, ..., n_1}) and the number of samples S. (S can be obtained as input or determined on-line based on intermediate results.)

1. The row [A_1]_{i·} is processed in time linear in the number of nonzeros in [A_1]_{i·}. As a byproduct, M_i = Σ_{h=1}^{n_{k+1}} [M]_{ih} is computed.


2. Sample S times: Each sample is an index j ∈ {1, ..., n_{k+1}}, where j is selected with probability

$$\frac{[\mathbf{M}]_{ij}}{\sum_{h=1}^{n_{k+1}} [\mathbf{M}]_{ih}} = \frac{m_j}{M_i},$$

where m ≡ [M]_{i·} = q^T A_2 ⋯ A_k = (m_1, ..., m_{n_{k+1}}). The data structure created previously allows us to obtain samples with this property in O(k) time. The sampling stage returns a score vector (S_1, ..., S_{n_{k+1}}), where S_j, the score of j ∈ {1, ..., n_{k+1}}, is the number of times j appears in the sample. (We have S = Σ_{j=1}^{n_{k+1}} S_j.) Because the expected value of S_j equals m_j (S/M_i), the vector

$$\frac{M_i}{S}\,(S_1, \ldots, S_{n_{k+1}})$$

is an unbiased estimate of [M]_{i·}. The sampling algorithm provides information about the weight distribution and values of m in a fraction of the time needed to compute m explicitly. A more elaborate analysis of the dependence of the estimate accuracy on the number of samples S is provided in Subsection 2.4.

2.1. The Algorithm

The preprocessing stage constructs a (k+1)-layer directed graph, where edges are directed from layer j to layer j+1 and all edges are assigned transition probabilities. A sample for row i of A_1 is obtained by a random walk from the ith node in the first layer to the (k+1)st layer. (The index of the last node reached constitutes the returned value.)

Graph representation of matrices and matrix products. A matrix A of dimension n_1 × n_2 is represented by a weighted directed bipartite graph G(A), with n_1 nodes {v_1, ..., v_{n_1}} at the lower level and n_2 nodes {v'_1, ..., v'_{n_2}} at the upper level. For 1 ≤ i ≤ n_1 and 1 ≤ j ≤ n_2, there is a directed edge from v_i to v'_j if and only if [A]_{ij} ≠ 0. The weight w(e) of the edge e = (v_i, v'_j) is [A]_{ij}. It is easy to see that the outdegree of v_i (resp., indegree of v'_j) is the number of nonzeros in the ith row (resp., jth column) of A. The sum of the weights of edges incident to v_i is the weight of the ith row (that is, the sum of the weights of all entries in the ith row).

The matrix product M = A_1 ⋯ A_k (where A_i is of dimension n_i × n_{i+1}) is represented as a layered graph with k+1 layers. For 1 ≤ i ≤ k+1, the ith layer has n_i nodes and the edges between the ith and the


(i+1)st layers are as in the bipartite graph G(A_i). For 1 ≤ i ≤ k+1, let V_i = {v_1^i, v_2^i, ..., v_{n_i}^i} be the set of nodes at layer i.

Preprocessing stage. The following algorithm assigns transition probabilities p: E → [0, 1] to the edges of the product graph such that random walks from a node v_i^1 ∈ V_1 to layer V_{k+1}, according to the edge probabilities, produce the desired samples. First, for every subproduct A_i ⋯ A_k (1 ≤ i ≤ k) and for every 1 ≤ j ≤ n_i, we compute the sum W_j^i = Σ_{h=1}^{n_{k+1}} [A_i ⋯ A_k]_{jh} (the weight of the jth row of A_i ⋯ A_k), and we associate it with the node v_j^i. We denote W_j^i ≡ W(v_j^i) and M_j ≡ W_j^1.

ALGORITHM 2.1 [Compute transition probabilities].

• Compute W_j^i:
  1. Layer k+1 nodes: For 1 ≤ j ≤ n_{k+1}, set W_j^{k+1} ← 1.
  2. For i = k down to 1 (ith layer nodes): For 1 ≤ j ≤ n_i, set W_j^i ← Σ_{h=1}^{n_{i+1}} [A_i]_{jh} W_h^{i+1}. (For every node v in layer i, set W(v) to the sum of the products w(v, u) W(u), over all u in the (i+1)st layer such that (v, u) is an edge.)

• Assign probabilities: For every pair of nodes {v_j^i, v_l^{i+1}} such that the edge e = (v_j^i, v_l^{i+1}) exists, set

$$p(e) \equiv p_{jl}^{\,i} = \frac{w(e)\, W_l^{\,i+1}}{W_j^{\,i}}.$$

Scoring. To sample S times from the ith row of M: trace S random walks from v_i^1 to layer k+1. For 1 ≤ j ≤ n_{k+1}, let S_j, the score of j, be the number of random walks that terminated in v_j^{k+1}. The quantity M_i (S_j/S) has expected value [M]_{ij}.

Remark. Algorithm 2.1 can be viewed as an algebraic reduction of the product A_1 ⋯ A_k to the form D_1 P_1 ⋯ P_k, where D_1^{n_1 × n_1} is a diagonal matrix with entries M_1, ..., M_{n_1} and, for i = 1, ..., k, the matrices P_i^{n_i × n_{i+1}} are Markovian with entries [P_i]_{jl} ≡ p_{jl}^i. In the ith iteration, the matrix P_i and the weights W_j^i (j = 1, ..., n_i) are computed from A_i and W_j^{i+1} (j = 1, ..., n_{i+1}). In essence, the algorithm finds a diagonal matrix D_i and a Markovian matrix P_i such that D_i P_i = A_i D_{i+1} (where D_i is a diagonal matrix with entries W_1^i, ..., W_{n_i}^i). The distribution of termination nodes of a random walk is given by the Markovian matrix P = P_1 ⋯ P_k (the expected value of S_j/S equals [P]_{ij}).
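The construction can be summarized in a short program. The following Python sketch is my own minimal illustration (not the authors' implementation; function names and parameter choices are assumptions): it computes the node weights W_j^i and transition probabilities of Algorithm 2.1 and then scores one row of the product by random walks, using the two matrices of the example in Subsection 2.2 below.

```python
import numpy as np

def preprocess(matrices):
    """Algorithm 2.1: compute row weights W^i and transition matrices P_i.

    matrices: list [A_1, ..., A_k] of nonnegative 2-D arrays with compatible
    shapes.  Returns (W, P), where W[i][j] is the weight of row j of the
    subproduct A_{i+1} ... A_k (0-based) and P[i] holds the transition
    probabilities between layers i and i+1.
    """
    k = len(matrices)
    W = [None] * (k + 1)
    W[k] = np.ones(matrices[-1].shape[1])            # layer k+1: all weights 1
    P = [None] * k
    for i in range(k - 1, -1, -1):                   # i = k-1 down to 0
        A = matrices[i]
        W[i] = A @ W[i + 1]                          # W_j^i = sum_h A[j,h] * W_h^{i+1}
        # p(e) = w(e) * W_l^{i+1} / W_j^i  (rows of weight 0 get no outgoing edges)
        with np.errstate(divide="ignore", invalid="ignore"):
            P[i] = np.where(W[i][:, None] > 0,
                            A * W[i + 1][None, :] / W[i][:, None], 0.0)
    return W, P

def score_row(i, W, P, S, rng):
    """Estimate row i of the product with S random walks; returns (scores, M_i)."""
    scores = np.zeros(P[-1].shape[1])
    for _ in range(S):
        node = i
        for Pi in P:                                 # one step per layer
            node = rng.choice(Pi.shape[1], p=Pi[node])
        scores[node] += 1
    return scores, W[0][i]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.array([[4., 2., 0.], [1., 2., 5.], [0., 3., 6.]])
    B = np.array([[2., 3., 0.], [5., 3., 2.], [0., 1., 4.]])
    W, P = preprocess([A, B])
    scores, M0 = score_row(0, W, P, S=20000, rng=rng)
    print("exact row 0 of AB:", (A @ B)[0])
    print("sampling estimate:", M0 * scores / scores.sum())
```

Each estimate is unbiased (its expectation equals the exact row), and its accuracy improves with the number of walks S, as analyzed in Subsection 2.4.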


2.2. Example

Consider the 3 × 3 matrices A and B and their product AB:

$$A = \begin{pmatrix} 4 & 2 & 0 \\ 1 & 2 & 5 \\ 0 & 3 & 6 \end{pmatrix}, \qquad
B = \begin{pmatrix} 2 & 3 & 0 \\ 5 & 3 & 2 \\ 0 & 1 & 4 \end{pmatrix}, \qquad
AB = \begin{pmatrix} 18 & 18 & 4 \\ 12 & 14 & 24 \\ 15 & 15 & 30 \end{pmatrix}.$$

Figure 1 shows the graph representation of the product AB. On the left, edges are labeled by their weights. On the right, nodes are labeled by the values W_i^1 and W_i^2 (i = 1, ..., 3) and edges are labeled by the transition probabilities p(e).

2.3. Correctness and Running Time

The preprocessing stage involves constructing the graph and computing the transition probabilities (Algorithm 2.1), and it is easy to see that it can be implemented in O(Σ_{i=1}^k nnz(A_i)) time,³ where nnz(A_i) denotes the number of nonzero entries of A_i. Each of the k steps of a random walk nominally

³ Assuming matrices are provided in a compact sparse representation which lists only the location and contents of nonzero entries (and does not allocate space for empty rows and columns), and the graph representation does not contain nodes corresponding to empty rows and columns.

FIG. 1. On the left we show the graph representation of the example matrix product. On the right we show the resulting node weights (i.e., row weights for subproducts), the products of these node weights with incident edge weights, and the resulting probability distributions on edges.


takes O(1) time units. The actual behavior, however, depends on the computational model and the implementation. In the bit model of computation, each step of the random walk requires time proportional to the entropy of the distribution at the current node. In the common unit-cost model, each step requires at most log deg(v) = O(log max_i n_i) time. In our implementation (see Section 4) we use presampling; that is, we perform the sampling in the preprocessing stage and we produce and store samples ahead of time. These samples are retrieved later for query evaluations and are reused for different queries. This makes the actual cost per sample only O(1) (as opposed to O(k) without presampling).

Correctness is established by the following propositions.

PROPOSITION 2.2. For i = 1, ..., k and 1 ≤ j ≤ n_i, W_j^i = Σ_{h=1}^{n_{k+1}} [A_i ⋯ A_k]_{jh}.

Proof. By downward induction on i = k, k−1, ..., 1. Immediate for i = k. Assume the proposition holds for i > l. We have

$$W_j^{\,l} = \sum_{r=1}^{n_{l+1}} w\!\left(v_j^{\,l}, v_r^{\,l+1}\right) W_r^{\,l+1}
= \sum_{r=1}^{n_{l+1}} [A_l]_{jr} \sum_{h=1}^{n_{k+1}} [A_{l+1} \cdots A_k]_{rh}
= \sum_{h=1}^{n_{k+1}} \sum_{r=1}^{n_{l+1}} [A_l]_{jr}\, [A_{l+1} \cdots A_k]_{rh}
= \sum_{h=1}^{n_{k+1}} [A_l \cdots A_k]_{jh}.$$

PROPOSITION 2.3. The probability that a random walk from node v_j^l at layer l (1 ≤ l ≤ k) terminates at the node v_i^{k+1} is [A_l ⋯ A_k]_{ji} / Σ_{h=1}^{n_{k+1}} [A_l ⋯ A_k]_{jh}.

Proof. By downward induction on l. Immediate for l = k: the probability assigned to an edge (v_j^k, v_i^{k+1}) equals [A_k]_{ji} / Σ_{h=1}^{n_{k+1}} [A_k]_{jh}. Assume the proposition holds for l > r. The probability that a random walk from v_j^r ends at v_i^{k+1} equals

$$\sum_{q=1}^{n_{r+1}} p_{jq}^{\,r}\; \frac{[A_{r+1} \cdots A_k]_{qi}}{\sum_{h} [A_{r+1} \cdots A_k]_{qh}}.$$

Proposition 2.2 asserts that

$$p_{jq}^{\,r} = [A_r]_{jq}\, \frac{W_q^{\,r+1}}{W_j^{\,r}}
= [A_r]_{jq}\, \frac{\sum_{h=1}^{n_{k+1}} [A_{r+1} \cdots A_k]_{qh}}{\sum_{h=1}^{n_{k+1}} [A_r \cdots A_k]_{jh}}.$$


Hence, the probability that the walk terminates at v_i^{k+1} is

$$\frac{\sum_{q=1}^{n_{r+1}} [A_r]_{jq}\, [A_{r+1} \cdots A_k]_{qi}}{\sum_{h=1}^{n_{k+1}} [A_r \cdots A_k]_{jh}}
= \frac{[A_r \cdots A_k]_{ji}}{\sum_{h=1}^{n_{k+1}} [A_r \cdots A_k]_{jh}}.$$

This concludes the proof.

2.4. Accuracy and Variance of Scores

Consider some row i ∈ {1, ..., n_1} of M. Let

$$p_j = \frac{[\mathbf{M}]_{ij}}{\sum_{h=1}^{n_{k+1}} [\mathbf{M}]_{ih}} = \frac{[\mathbf{M}]_{ij}}{M_i}$$

denote the probability that a single sample is j. S_j, the number of independent random walks (out of S walks) terminating at the jth node, has a binomial distribution with parameters S and p_j. The expected value of S_j is S p_j. The probability that S_j = t (j is sampled t times) is

$$\binom{S}{t}\, p_j^{\,t}\, (1 - p_j)^{S - t}$$

(see, e.g., Feller [15] for background). Approximating the binomial by the normal distribution,

$$\mathrm{Prob}\{S_j \ge T\} \approx 1 - \Phi\!\left(\frac{T - 0.5 - S p_j}{\sqrt{S p_j (1 - p_j)}}\right), \qquad
\mathrm{Prob}\{S_j < T\} \approx \Phi\!\left(\frac{T + 0.5 - S p_j}{\sqrt{S p_j (1 - p_j)}}\right). \tag{1}$$
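For reference, Eq. (1) is easy to evaluate numerically. The following small sketch is my own illustration (the parameter values are made up, and Φ is implemented from the error function):

```python
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_score_at_least(T: int, S: int, p: float) -> float:
    """Normal approximation (Eq. (1)) to Prob{S_j >= T} when a single
    sample hits column j with probability p and S samples are drawn."""
    sd = sqrt(S * p * (1.0 - p))
    return 1.0 - phi((T - 0.5 - S * p) / sd)

# Example: with S = 2000 samples, how likely is an entry holding 1% of the
# row weight to reach a score of 30?
print(prob_score_at_least(T=30, S=2000, p=0.01))
```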

The asymptotic tail behavior, as the number of samples grows, is given by additive (absolute error) and multiplicative (relative error) forms of the Chernoff bounds [6]:

Additive:
$$\mathrm{Prob}\left\{\frac{S_j}{S} > p_j + \epsilon\right\} \le e^{-2 S \epsilon^2}, \qquad
\mathrm{Prob}\left\{\frac{S_j}{S} < p_j - \epsilon\right\} \le e^{-2 S \epsilon^2}. \tag{2}$$

Multiplicative:
$$\mathrm{Prob}\left\{\frac{S_j}{S} > p_j (1 + \epsilon)\right\} \le e^{-S p_j \epsilon^2 / 3}, \qquad
\mathrm{Prob}\left\{\frac{S_j}{S} < p_j (1 - \epsilon)\right\} \le e^{-S p_j \epsilon^2 / 2}. \tag{3}$$

Bounds on additive error. It follows from (2) that for each column j, the error of the estimate (S_j/S) M_i for [M]_{ij} is bounded by

$$\mathrm{Prob}\left\{\left|\frac{S_j}{S} M_i - [\mathbf{M}]_{ij}\right| \ge \epsilon M_i\right\} \le 2\, e^{-2 S \epsilon^2}. \tag{4}$$

Hence,

$$S = \frac{1}{2}\left(\frac{M_i}{\alpha}\right)^{2} \log \frac{2 n_{k+1}}{\delta} \tag{5}$$

samples suffice for estimating all entries in the ith row of M to accuracy ±α with confidence at least 1 − δ. Therefore,

$$\frac{\sum_{i=1}^{n_1} (M_i)^2}{2 \alpha^2}\, \log \frac{2 n_1 n_{k+1}}{\delta} \tag{6}$$

samples suffice to estimate, with confidence 1 − δ, all entries of the matrix M to within an additive error of ±α. We obtain the following:

PROPOSITION 2.4. Let A_1, ..., A_k be k nonnegative matrices. Let A^{n_1 × n_2} = A_1 ⋯ A_k. Denote by N the number of nonzero entries in the matrices A_1, ..., A_k. For α > 0 and a polynomial poly(·, ·), in time

$$O\!\left(N + k\, \frac{\sum_{i=1}^{n_1} (M_i)^2}{\alpha^2}\, \log(n_1 + n_2)\right),$$

with confidence 1 − 1/poly(n_1, n_2), our sampling algorithm produces a matrix Â such that

$$\forall (i, j) \in \{1, \ldots, n_1\} \times \{1, \ldots, n_2\}, \qquad \bigl|[A]_{ij} - [\hat A]_{ij}\bigr| \le \alpha.$$

Bounds on relative error. From (3) we obtain

$$\mathrm{Prob}\left\{\left|\frac{S_j}{S} M_i - [\mathbf{M}]_{ij}\right| \ge \epsilon\, [\mathbf{M}]_{ij}\right\} \le 2\, e^{-S p_j \epsilon^2 / 3}. \tag{7}$$


Note that the relative error depends on the magnitude of the entry and decreases for larger entries. From Eq. (7) we obtain:

PROPOSITION 2.5. For any 1 ≥ p > 0 and δ > 0, use of S = 3 p^{-1} ε^{-2} log(2 n_{k+1}/δ) samples suffices to guarantee that, with confidence at least 1 − δ, all columns j in the ith row are estimated within the following relative error:

$$\forall j, \qquad \frac{|\hat m_j - m_j|}{m_j} \le \epsilon \sqrt{\frac{p}{p_j}},$$

or, equivalently, within the following additive error:

$$\forall j, \qquad |\hat m_j - m_j| \le \epsilon\, M_i \sqrt{p\, p_j}.$$
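As a quick numerical illustration of Proposition 2.5 (and of Corollary 2.6 below), the following sketch evaluates the stated sample bound for concrete parameters. It is my own illustration; the parameter values are made up and the logarithm is taken to be the natural logarithm.

```python
from math import ceil, log

def samples_needed(p: float, eps: float, delta: float, n_cols: int) -> int:
    """S = 3 * p**-1 * eps**-2 * log(2 * n_cols / delta), rounded up."""
    return ceil(3.0 / (p * eps * eps) * log(2.0 * n_cols / delta))

# e.g. isolate entries holding at least 1% of the row weight, within 20%
# relative error, with failure probability 1%, over n_{k+1} = 100,000 columns:
print(samples_needed(p=0.01, eps=0.2, delta=0.01, n_cols=100_000))
```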

The objective in many applications is to isolate and estimate all entries whose value exceeds a certain threshold, e.g., entries j for which p_j > p, or equivalently, m_j > pM. The following corollary considers a threshold pM. All entries with value above pM are estimated with relative error at most ε. All entries smaller than pM by at least an ε fraction are estimated to be less than pM.

COROLLARY 2.6. For any 1 ≥ p > 0 and δ > 0, use of S = 3 p^{-1} ε^{-2} log(2 n_{k+1}/δ) samples guarantees that, with confidence at least 1 − δ, all columns j in the ith row are estimated as follows:

$$\forall j \ \text{such that}\ p_j \ge p, \qquad \left|\frac{S_j}{S} M_i - [\mathbf{M}]_{ij}\right| \le \epsilon\, [\mathbf{M}]_{ij}.$$

Moreover,

$$\forall j, \qquad p_j < (1 - \epsilon)\, p \ \text{ implies } \ \frac{S_j}{S} < p.$$

(Note that for any polynomial poly(·), we can have confidence 1 − 1/poly(n_{k+1}) with O(p^{-1} ε^{-2} log n_{k+1}) samples.) The largest entry in the ith row must have p_j ≥ 1/n_{k+1}. Putting p = 1/n_{k+1}, it follows that

COROLLARY 2.7. O(n_{k+1} ε^{-2} log n_{k+1}) samples guarantee that, with confidence at least 1 − 1/poly(n_{k+1}), all entries larger than the mean (in particular, the largest entry) in the row are estimated with relative error at most ε.

In the text retrieval datasets we experimented with, the largest entries in a row often had a much larger fraction of the total row weight (p_j ≫ 1/n).


Hence, even fewer samples suffice to guarantee relative error ε for the largest entries.

2.5. An Alternative Approach

We comment on a different and conceptually simpler approximate multiplication algorithm. The algorithm randomly samples d' of the d coordinates and performs the dot-product computations on these coordinates only, yielding approximate results. This algorithm is faster than the exact computation by a factor of d/d'. The actual dot products can be approximated by multiplying the corresponding values of the abbreviated computation by d/d'. The expected value of these estimates equals the actual values, as in our method. However, the variance is typically much higher, in particular when most of the contribution to a high dot product comes from a few coordinates. In this method, the variance of the ratio of the estimate to the true value depends on the weight distribution of the instance vectors and is therefore unpredictable (dot products with the same exact value could have estimates with very different variances). (A minimal code sketch of this alternative estimator is given at the end of this section.)

2.6. Dynamically Changing Database

So far, we viewed the matrices A_2, ..., A_k as a static database, and we allowed only the rows of A_1 to be provided on-line. Here we consider the handling of a stream of queries when rows, columns, or single entries of the matrices A_2, ..., A_k can be updated. The problem boils down to the cost of maintaining an updated version of the transition graph. If the matrices A_1, ..., A_i are updated, the cost of updating the transition graph is upper bounded by the total number of nonzero entries of A_1, ..., A_i. A lower bound on the cost of updating the graph is linear time in the number of transition probabilities that need to be modified. This lower bound is always met, but the required number of modifications is sometimes much larger than the number of entries changed. If an entry [A_i]_{hj} is updated, we need to update the weights at all nodes in layers 1, ..., i that can reach v_h^i, and then the transition probabilities at all edges leaving these nodes. The cost is linear in the number of outgoing edges incident to the reachability set of v_h^i in the reversed graph. Note that the cost of updating a whole row [A_i]_{h·} is the same as updating a single entry in the row. As is typical in dynamic algorithms, we can obtain a tradeoff between the cost of maintaining an updated transition graph and the query cost: update cost can be decreased by not performing the full explicit update, but then the query time increases.
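As referenced in Subsection 2.5, here is a minimal sketch of the coordinate-sampling alternative. It is my own illustration (names and data are made up): the estimator is unbiased, but its variance depends on how the weight is spread over coordinates, unlike the instance-sampling method of this paper.

```python
import numpy as np

def subsampled_dot(q, x, d_prime, rng):
    """Estimate q . x from d' randomly chosen coordinates, scaled by d/d'."""
    d = len(q)
    idx = rng.choice(d, size=d_prime, replace=False)
    return (d / d_prime) * float(q[idx] @ x[idx])

rng = np.random.default_rng(0)
d = 1000
q, x = rng.random(d), rng.random(d)
estimates = [subsampled_dot(q, x, d_prime=100, rng=rng) for _ in range(50)]
print("exact:", float(q @ x), " mean of 50 estimates:", float(np.mean(estimates)))
```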

226

COHEN AND LEWIS

3. SAMPLING FOR REAL-VALUED MATRIX PRODUCTS We generalize the sampling algorithm of Section 2 to products of real-valued matrices. 3.1. Positi¨ e and Negati¨ e Contributions of Matrix Products The product of real-valued matrices can be presented as the difference of two quantities, the positi¨ e and the negati¨ e contributions of the product. Each contribution can be estimated using a modification of our sampling algorithm. DEFINITION 3.1. Let B1n1=n 2 , . . . , B in i=n iq 1 , . . . , B kn k=n kq 1 be matrices with q " !#

real entries. The positi¨ e contribution B1 ??? B k and the negati¨ e contribuy " !# tion B1 ??? B k of the product B1 ??? B k are matrices of dimension n1 = n kq1 and are defined inductively as follows. If k s 1 then q " !#

w

B1

xij s

y " !#

w

B1

xij s

½ ½

w B1 x i j ,

if w B1 x i j ) 0,

0,

otherwise,

y w B1 x i j ,

if w B1 x i j - 0,

0,

otherwise.

For k ) 1, q " !#

q " ! # q " !#

n2

w B1 ??? B k x i j s Ý w hs1

n2

y " !#

q " ! # y " !#

Ýw

q

x i h w B 2 ??? B k x h j

B1

B1

x i h w B 2 ??? B k x h j ,

hs1

q " ! # y " !#

n2

w B1 ??? B k x i j s Ý w hs1

n2

q

x i h w B 2 ??? B k x h j

B1

y " ! # q " !#

Ýw

B1

x i h w B 2 ??? B k x h j .

hs1

Each entry of a matrix product of k matrices can be viewed as a sum of products of k scalars. The positive contribution is the sum of all positive scalar products and the negative contribution is the Žnegated. sum of all negative scalar products.

227

APPROXIMATE MM FOR PATTERN RECOGNITION

PROPOSITION 3.2.

For any k matrices B1 , . . . , B k we ha¨ e q " ! # y " !#

B1 ??? B k s B1 ??? B k y B1 ??? B k ,

!

q #

" !

y #

"

B1 ??? B k q B1 ??? B k s

q " !#

ž

B1

y " !#

q

B1

???

Ž 8.

q " !#

/ ž

Bk

y " !#

q

Bk

/

.

Ž 9. Proof. By induction on k. 3.2. The Modified Sampling Algorithm Consider a product, M n1=n kq 1 s An11=n 2 ??? Ankk=n kq 1 , of k G 2 real-valued matrices. The sampling algorithm preprocesses these matrices in linear time in the number of nonzero entries. It generates a k q 1 layer directed graph whose edges are labeled with transition probabilities. The first layer has 2 n1 nodes and the Ž k q 1.st layer has n kq1 nodes. Random walks from the first layer to the Ž k q 1.st layer corresponds to sampling in the positive or negative contributions of the product. The layered graph is constructed as follows. v

Nodes. For 1 F i F k, the ith layer has 2 n i nodes. Let i Viq s  ¨ 1iq , ¨ 2iq , . . . , ¨ qn 4, i

Viy s  ¨ 1iy , ¨ 2iy , . . . , ¨ niy 4. i Let Vi s Viq j Viy be the set of nodes at layer i. The k q 1st layer 4. consists of n kq 1 nodes Vkq1 s  ¨ 1kq 1 , . . . , ¨ nkq1 kq 1 Edges. For i - k, the edges between the ith and the Ž i q 1.st layers are as follows: q y }Edges in Viq = Viq1 and edges in Viy = Viq1 are as in the q !# " bipartite graph GŽ A i .. v

y q }Edges in Viq = Viq1 and edges in Viy = Viq1 are as in the y !# " bipartite graph GŽ A i .. For i s k, q " !# }edges in Vkq = Vkq1 are as in GŽ A k . and y " !# y }edges in Vk = Vkq1 are as in GŽ A k ..

228

COHEN AND LEWIS

The preprocessing step assigns transition probabilities p: E to the edges of the graph such that a random walk from ¨ i1q to the Ž k q 1.st layer corresponds to a sample from row i of the positive contribution of the product. The index of the last node reached is j with probability, q " !#

w A 1 ??? A k x i j

!

q #

.

"

kq 1 w A 1 ??? A k x i h Ý hs1

n

Similarly, a random walk from ¨ i1y corresponds to a sample from row i of the negative contribution of the product. The index of the last node reached is j with probability, y " !#

w A 1 ??? A k x i j

!

y #

.

"

kq 1 w A 1 ??? A k x i h Ý hs1

n

The motivation for doubling of the nodes sets Žuse of Viy and Viq . in layers i g  1, . . . , k 4 is to keep track of the sign of the scalar product corresponding to the current random walk. The transition probabilities are computed by first computing, for every 1 F i F k and 1 F j F n i , the sums, n kq1

Wj

iq

hs1 n kq1

Wj

iy

q " !#

s Ý w A i ??? A k x jh , s

y " !#

Ý w Ai

??? A k x jh ,

hs1

q " !#

y " !#

Žthe weight of the ith rows of A j ??? A k and A j ??? A k , respectively.. We associate these sums with the corresponding nodes and we denote Wj iq ' W Ž ¨ jiq . and Wj iy ' W Ž ¨ jiy .. The following is a modification of Algorithm 2.1. ALGORITHM 3.3 Compute transition probabilities x. Compute Wj iq , Wj iy : 1. For all ¨ g Vkq 1 , set W Ž ¨ . ¤ 1 Žlayer k q 1 nodes., v

229

APPROXIMATE MM FOR PATTERN RECOGNITION

2. Compute W: Vk Žlayer k nodes.: For 1 F j F n k , set q " n kq1 !# Wj kq ¤

Ýw

hs1 n kq1

Wj

ky

¤

Ak

x jh ,

y " !#

Ýw

Ak

x jh ,

hs1

Žthe sum of the weights of outgoing edges from ¨ jkq or ¨ jky to V kq1 .. 3. For i s k y 1 down to 1: Žcompute W: Vi .. For every node ¨ g Vi , set W Ž ¨ . to the sum of the products w Ž ¨ , u.W Ž u., for all u g V iq1 such that Ž ¨ , u. is an edge. Assign transition probabilities: For every edge e s Ž ¨ , u., set v

pŽ e. s

w Ž e . W Ž u. WŽ¨ .

.

Interestingly, the estimation of the positive and negative contributions is intertwined. The graph cannot be split into two separate graphs, each with n i nodes at later i Ž i s 1, . . . , k q 1. for estimating the positive and negative contributions. 3.3. Example As an example, consider the 2 = 2 matrices A, B, and C, As

y4 y2

q 10 , q5

Bs

q3 0

y1 , y2

Cs

q2 y4

y1 . q2

The value, positive, and negative contribution of the subproducts AB, BC, and ABC are as follows, q y !#" !#" 10 y 5 10 0 0 5 BC s , BC s , BC s , 8 y4 8 0 0 4 q y !#" !#" y12 y 16 0 4 12 20 AB s , AB s , AB s , y6 y8 0 2 6 10 q y !#" !#" q40 y 20 80 20 40 40 ABC s , ABC s , ABC s . q20 y10 40 10 20 20 Figure 2 illustrates from left to right, the product graph with edge weights, the graph with the values of Wj iq and Wj iy on the nodes and

FIG. 2. The product graph and probability assignment.

230 COHEN AND LEWIS

APPROXIMATE MM FOR PATTERN RECOGNITION

231

w(v, u) W(u) on each edge (u, v), and the graph with transition probabilities assigned to the edges.

3.4. Correctness

The following propositions assert correctness. Their proofs are similar to, but notationally more involved than, the proofs of Propositions 2.2 and 2.3. They are left as an exercise for the reader.

PROPOSITION 3.4.

For i s 1, . . . , k and 1 F j F n i , n kq1

Wj

iq

s

Ý w Ai

hs1 n kq1

Wj

iy

s

q " !#

??? A k x jh ,

y " !#

Ý w Ai

??? A k x jh .

hs1

PROPOSITION 3.5. The probability that a random walk from node ¨ j1q Ž resp., ¨ j1y . terminates at the node ¨ ikq 1 is q " !#

w A 1 ??? A k x ji

,

q " !#

kq 1 w A 1 ??? A k x jh Ý hs1

n

Ž respecti¨ ely ., y " !#

w A 1 ??? A k x ji

!

y #

.

"

kq 1 w A 1 ??? A k x jh Ý hs1

n

3.5. Estimation Accuracy The analysis in Subsection 2.4 applies for estimates of either the positive or the negative contributions of the product. Our relative error bounds diverge when the two quantities are close in value, because the actual dot-product values are the difference of these quantities.

232

COHEN AND LEWIS

Denote MiXq

n2

s

Ý w A1 hs1

MiXy

n2

s

q " !#

??? A k x i h ,

y " !#

Ý w A1

??? A k x i h ,

hs1

q " !#

pq j s

w A 1 ??? A k x i j

!

py j s

MiXq y " #

,

MiXy

.

w A 1 ??? A k x i j

The following additive bound is a corollary of Proposition 2.4. PROPOSITION 3.6. Let A 1 , . . . , A k be k real-¨ alued matrices. Let An1=n 2 s A 1 ??? A k . For i s 1, . . . , n1 , let MiX s MiXq q MiXy . Denote by N the number of nonzero entries in the matrices A 1 , . . . , A k . For any a ) 0, in time,

ž ž

O Nqk

1 Ý nis1 Ž MiX .

a2

2

log Ž n1 q n 2 .

//

,

ˆ such that the modified sampling algorithm produces a matrix A ; Ž i , j . g  1, . . . , n1 4 =  1, . . . , n 2 4 ,

ˆx i j F a . wAx i j y wA

Proof. It is sufficient to obtain additive error ar2 on both the positive and negative contributions of the product. It is easy to see that Xq 2 1 ŽM . ra 2 logŽ n1 q n 2 ... samples suffice for the positive contriO ŽŽÝ nis1 i Xy 2 1 ŽM . ra 2 logŽ n1 q n 2 ... samples suffice for the negbution and O ŽŽÝ nis1 i ative contribution. The following is a corollary of Proposition 2.5. PROPOSITION 3.7. For any 1 G p ) 0, using O Ž py1ey2 log n kq 1 . samples for estimating the negati¨ e and positi¨ e contributions yields confidence of

233

APPROXIMATE MM FOR PATTERN RECOGNITION

at least Ž1 y 1rpolyŽ n kq 1 .. for

¡< mˆ y m < j

~

j

mj

;j

Fe p

'

ž

Xy M Xq pq py j qM j

'

'

M Xq pq j

¢< mˆ y m < F e'p ž M 'p Xq

j

q j

j

y

/

M Xy py j

,

q M Xy py j

Ž relati¨ e error . , Ž additi¨ e error . .

' /

3.6. Another View and Extensions The product graph in the real case is actually a product graph of nonnegative matrices. We relate the real-matrix product B s B1 ??? B k to a product of nonnegative matrices as follows. First note that

ž

1 yB s ky 1 B 2

B yB

/

k

Ł

is1

yB i . Bi

Bi yB i

ž

/

Each multiplicand can be replaced using the equation, q y !#" !#"

ž

Bi yB i

Bi s Bi

/

Bi y !#"



Bi q !#"

Bi

Bi

0

ž

I yI

yI , I

/

where the block matrices consisting of the identity matrices have the appropriate dimensions. Using the commutativity of symmetric block matrices with commuting blocks, we obtain

k

Ł

is1

ž

Bi yB i

yB i I s 2 ky 1 Bi yI

/

yI I

ž

k

/

q y !#" !#"



Bi

Bi

Bi

Bi

Ł y q is1 !#" !#"

Hence,

ž

B yB

yB I s B yI

/ ž

yI I

k

/

q y !#" !#"



Bi

Bi

Bi

Bi

Ł y q is1 !#" !#"

0

.

0

.

234

COHEN AND LEWIS

We multiply both sides of the previous equation: by Ž 0I . on the right, and by 1 Ž . 2 I, yI on the left. Thus, ky1

B s Ž I, yI .

q y q " !#" !#" !#



Bi

Bi

Bi

Bi

Bk

0 0

Ł y q y is1 !#" !#" !# "

.

Bk

Note that the graph and transition probabilities constructed earlier correspond to the nonnegative matrix product on the right-hand side. The preceding constitutes a direct proof that a real matrix product can be estimated using an estimation algorithm for a corresponding nonnegative product ŽSection 2.. 3.7. Complex Matrix Products A similar reduction applies to complex-valued matrix products. A complex matrix C can be decomposed to four nonnegative matrices: the q y !#" !#" positive and negative real contributions Ždenoted C and C . and the positive and negative imaginary contributions, defined similarly, and qi yi q y qi !#" !#" !#" !#" !#" denoted C and C . Hence, C s C y C q iŽ C yi !#" y C .. The complex matrix product C q C 1 ??? C k can be expressed as the following product, C s Ž I, yI, yiI, iI .

q " ¡!#

ky1

=Ł is1

Ci y !# " Ci qi " !# Ci yi !# "

¢

Ci

y " !# qi " !# yi "¦ !# !# ¡ q "¦ Ci q !# "

Ci yi !# "

Ci qi !# "

Ck y " !#

Ci qi !# "

Ci q !# "

Ci y !# "

Ck qi " !#

Ci yi " !#

Ci

Ci y " !#

Ci

Ci q " !#

Ci

Ck yi !# " .

§¢

Ck

§

The nonnegative matrix product in the right-hand side Žwithout the first multiplicand. can be estimated using the algorithm for nonnegative matrix products described in Section 2.

APPROXIMATE MM FOR PATTERN RECOGNITION

235

4. IMPLEMENTATION ISSUES We implemented the sampling algorithm for products of k s 2 nonnegative matrices. The input consists of a ‘‘database’’ matrix Ad= n, where n is the number of instances and d is the number of features. Query vectors of dimension d are either provided beforehand or presented on-line. For each input query q g R d and a parameter S Žnumber of samples., the algorithm computes M s Ý nis1wqTAx i and a Žpossibly sparse. integral score ¨ ector Ž S1 , . . . , S n ., where S i is the number of times i was sampled. 4.1. Implementing the Sampling Algorithm We provided efficient implementations for both dense and sparse matrices. Sparsity is fully exploited in the sense that we exhibit considerable speedups over sparse matrix multiplication algorithms. Presampling. A useful implementation strategy is to precompute and to store samples based on the matrix product graph, rather than storing the graph itself. The samples are generated in the preprocessing of the database matrix A. We generated lists L1 , . . . , L d of indices  1, . . . , n4 of sampled instances. For each input query q, the algorithm processes q and determines how many samples are needed from each of the d lists L1 , . . . , L d . It then retrieves the appropriate number of samples from each stored list and counts how many times each of the indices  1, . . . , n4 appears. Presampling improves performance because the same samples are used for different queries. Hence, while evaluating a query, each sampled instance requires only a memory lookup instead of the generation of a random number. Furthermore, presampling is effective from a memory management standpoint, because instances in the sample are grouped together on disk or in memory Žrather than dispersed throughout a large database.. In the preprocessing stage we select the list size l i s < L i < Ž i s 1, . . . , n. to be the maximum number of samples from L i we expect to be required by any query. In our experiments we selected the l i s to be the maximum required for the desired accuracy when the database instances themselves are used as queries. For our data, the sample-lists required considerably less storage than the original database matrix Žor its graph representation .. We should note that presampling introduces correlations between results of independent queries Ži.e., certain instances systematically have higher or lower scores for similar queries.. In particular, the same instance always has the same score for the same query, which is in fact desirable in some applications. However, if desired, the correlation of scores across queries can be reduced by using larger presamples and by using randomization in the selection within each list.

236

COHEN AND LEWIS

The use of presampling calls for a different implementation strategy than the straightforward implementation of the algorithm of Section 2. In the following text we sketch the modified preprocessing and query processing stages. The particulars of the implementation vary for handling sparse or dense data efficiently, and should be obvious to the reader from the later description. Preprocessing: Generating the sample lists. For each row 1 F i F d of A: Compute the sum a i ¤ Ý h wAx i h . Initialize L i to the empty list. Repeat for l i times Ždesired number of samples.: Draw an instance j g  1, . . . , n4 at random, where j has probability wAx i jra i , and append it to the list L i . v

v

Process a query ¨ ector q Ž with S samples.. 1. For every 1 F i F d such that qi / 0: let qXi ¤ qi a i ; let M ¤ Ý dis1 qXi Žs Ý nhs1 wqTAx h .. 2. For each 1 F i F d such that qi / 0, determine the number of samples c i needed from the list L i . Ž S s Ý i c i .. We choose c i such that it has expected value SqXirÝ djs1 qXj ,

¡

ci s

~

SqXi Ý djs1 qXj

¢Ý

SqXi X d js1 q j

with probability with probability

SqXi Ý djs1 qXj SqXi Ý djs1 qXj

y

y

SqXi Ý j qXj

,

SqXi Ý djs1 qXj

.

3. Scoring: For i s 1, . . . , d: Repeat c i times: Let s g  1, . . . , n4 be the next sample in L i . If s is encountered for the first time: create a counter for s with value 1. Otherwise, increment the counter of s. 4. For j g  1, . . . , n4 , set Si Žthe score of j . to the value of the counter Žif initialized., and 0 otherwise. v v

4.2. Chain Products and Real-Valued Data Our implementation ideas extend to k ) 2 matrices and real-valued data. For real-valued data Žsee Section 3., we generate two sets of sample lists, one for the positive contribution and one for the negative contribution. In each query evaluation, we need to decide how many samples are taken for estimating each contribution. Issues in selecting the sample size are discussed later in this section for nonnegative data, and are similar for real-valued data.

APPROXIMATE MM FOR PATTERN RECOGNITION

237

When our ‘‘database matrix’’ A is represented as a product A s A 2 ??? A k Ž k ) 2., the only implementation difference is in generating the samples in the presampling stage. The sample lists have the same form as in the k s 2 case, but are constructed with respect to the product A Žusing the graph method outlined in Section 2.. The implementation of query processing and scoring is the same as the two-matrix case. 4.3. Using Only Scores to Identify High Matches Consider a query q and the corresponding match vector m s qTA s Ž m1 , . . . , m n .. The expected value of Si equals Sm irM Žsee Section 2.. M s Ý njs1 m j is computed by the scoring algorithm. The expected value of m ˆ i s Si MrS equals m i . Hence, high scores can be used to identify high matches. The variance of m ˆ irm i decreases with the number of samples S and with the match value m i Žsee Eq. Ž1... We typically choose S according to tradeoffs between time and accuracy. When query vectors and columns of A are nonnegative and normalized to have a Euclidean norm of 1.0, the m i s are cosines in the range w0, 1x. To identify instances where m i G a we use a slightly lower threshold t s a Ž1 y e . and we consider instances where the estimate m ˆ i ) t. For queries with larger M, we need to use an increased number of samples S in order to achieve the same variance in estimating entries with value exceeding a . We use a parameter H Žusually in the range  20, . . . , 1004. across all queries, where for a query with a given M we use S s HM samples. This assures that the variance in estimating a particular entry of the product matrix depends only on the value of the entry Žand not on the particular row.. We use a threshold T s tH for S j . Note that T ) Si if and only if the estimate m ˆ i ) t. The values of H and the threshold T determine the variance of the estimate of an entry with a certain value m i , and hence, its probability of scoring above or below T. By Eq. Ž1., Prob  Si G T 4 , 1 y F

Prob  Si - T 4 , F

T y 0.5 y Hm i

ž'

Hm i Ž 1 y m i .

T q 0.5 y Hm i

ž'

Hm i Ž 1 y m i .

/

/

,

Ž 10 .

.

Hence, appropriate values can be predetermined for objectives such as ‘‘Find a subset of  1, . . . , n4 that Žwith high probability. contains at least 90% of all matches of value a and above and at most 5% of all matches of value a y 0.1 and below.’’ Note that predetermining the values of T and

238

COHEN AND LEWIS

H assures that the desired property holds with respect to the ‘‘worst case’’ distribution of the unknown m i s, where all the m i equal either a or a y 0.1. 4.4. Determining S Adapti¨ ely For general distributions of the m i values, some knowledge of the distribution allows us to make a better choice of H and T. Such knowledge can be obtained by increasing S adaptively and looking at the resulting intermediate distributions of scores, in order to estimate properties of the actual distribution. For example, values closer to a are more likely to introduce noise. Hence, if most values are much smaller or much larger than a , a smaller number of samples suffice. Typically, there are many instances with low match values and a small fraction of them, which could still be a significant number, have Žnoisy. high scores. The amount of such noise decreases with S, but it is also dependent on the distribution of the m i s. Predetermining S necessitates making worst case assumptions on the distributions which would often result in using many more samples than an adaptive algorithm. The adaptive approach is also preferable for identifying the k nearest neighbors of an instance. The distribution of m i s as a whole, and in particular the distribution of the largest few m i s Žthose for the nearest neighbors., will vary considerably from query instance to query instance. A natural relaxation on the k nearest neighbors problem is that of identifying k Ž1 y e .-nearest neighbors, where the ith neighbor identified needs to have a dot product of value at least Ž1 y e . times the dot product with the ith exact nearest neighbor. First note that if the dot product of the kth nearest neighbor is Žapproximately. known, then Corollary 2.6 can be applied directly to obtain an appropriate number of samples. Otherwise, we sample adaptively and we continue sampling until there are at least k entries j, for which S j ) 3 ey2 logŽ2 n kq1rd ., Žwhere 1 y d is the desired confidence.. These k entries Žin order of decreasing S j s. are outputted as the k Ž1 y e . nearest neighbors. The correctness follows by putting p s 3Sey2 logŽ2 n kq 1rd . in Corollary 2.6. Another task for which it is beneficial to determine S adaptively is in identifying a set of instances I of a certain size L such that a high fraction of the instances in I have matches over some value a . An adaptive algorithm can continue the sampling Žincrease S . while maintaining the set I of L highest scoring items. The sampling stops when a test shows that a satisfactory fraction of the instances in I are high matches. The test can either be based on the distribution of scores, or, more accurately, the fraction of high matches in I can be estimated by taking several uniform random samples from I and computing exact dot-product values for the

APPROXIMATE MM FOR PATTERN RECOGNITION

239

selected instances Žwhile counting how many of these instances are high matches.. Such a test can be performed each time the number of samples is increased by some factor g Žfor example, g s 2, each time S doubles in value.. Hence, the test is performed O Žlog g S . times. The number of random samples from I needed for each test can be computed as a function of the desired fraction of high matches in I. This algorithm assures that the final number of samples S is at most g times the ‘‘necessary’’ number. Such a guarantee cannot be provided by an algorithm that predetermines S. 4.5. Using a Combined Algorithm In many applications it is desirable to first compute Žimplicitly or explicitly. an approximate dot product value for all instances, followed by exactly computing dot products for a selected subset of the instances. One such scenario was given earlier, where exact computations were used to test the quality of our intermediary result, and to decide whether more sampling is needed. More generally, it is often desirable to efficiently identify the set of instances with high dot products, and then to compute exact dot products only for those instances. To accomplish this, we first use sampling to filter out instances that are extremely unlikely to be high matches. Then we compute the dot products of q with the remaining instances, which typically constitute a much smaller set. This approach for obtaining exact high matches yields considerable speedups, particularly for dense data, over fully computing the product qTA Žsee Section 5.. Combined algorithm. Input. A query q and parameters T Ž threshold . and H Ž hit factor .. v

Apply the sampling algorithm with S s HM samples to obtain scores Ž S1 , . . . , Sn .. Let I ;  1, . . . , n4 be the subset of instances with scores above threshold Ž Si G T .. v

v

Compute the dot-product qT x i for all i g I and identify the highest matches. v

The choice of the parameters H and T determines a tradeoff between the size of the set I Žthe smaller the better., the number of high matches that are filtered out, and the cost of sampling Žsee Eq. Ž10... To obtain a small size set I, the threshold T should be such that TrH is larger than most values of  m1 , . . . , m n4 . On the other hand, T and H should be such that TrH is below the magnitude of the lowest match which is of interest. The likelihood that matches above TrH q e are filtered out, or values

240

COHEN AND LEWIS

below TrH y e are in the set I, decreases with S Žsee analysis in Section 2.. 4.6. A Heuristic for Real-Valued Data In Section 3 we proposed an algorithm to estimate a real-valued matrix product by estimating the positive and negative contributions of the product. The actual dot-product value equals the value of the positive contribution minus the negative contribution. Here we propose a heuristic to identify large dot-product values for real-valued matrices by estimating only positi¨ e contributions. This heuristic was motivated by our observation that on the datasets we examined, high matches typically had high positive contribution and low negative contribution, and vice versa, high positive contribution implies high dot products. This relation holds when features are not independent, that is, similarity of some features of two instances, makes them more likely to be similar on other features. An intuitive explanation is that when two vectors are correlated, larger entries in corresponding columns are likely to have the same sign. The likelihood drops toward 50% when the two vectors are independent. We conjecture that this dependence is common in many other pattern recognition tasks. Our experiments ŽSection 5.1. demonstrate considerable speedups with negligible loss of accuracy for this heuristic. Our heuristic is a ‘‘combined’’ algorithm Žin the sense of Subsection 4.5.. It first transforms vectors to a nonnegative form such that the dot-product value for a pair of transformed vectors equals the positive contributions of the original dot product. Following that, the sampling algorithm identifies vectors with high ‘‘transformed’’ dot products, and computes exact original dot products only for Žthe small set of. vectors with high transformed scores. The following transformation is applied to the document and query vectors. For every such vector q we define ˜ q such that q˜2 i s max  0, qi 4 , q˜2 iq1 s ymin  0, qi 4 .

Ž 11 .

Note that the vectors ˜ q are nonnegative. It is easy to see that the product ˜q ? ˜x i is the positi¨ e contribution of the product q ? x i Žsee Section 3.. The ˜n= 2 d sampling algorithm is applied to the nonnegative database matrix A and nonnegative query vectors ˜ q. It identifies Žand scores. the objects with highest positive contributions. The exact Žoriginal. dot products are then explicitly computed for documents with high scores.


5. EXPERIMENTS

As a test of our sampling method, we chose the task of estimating the largest entries in each row of A^T A, where each of the columns {x_1, ..., x_n} of A is the representation of a textual document as a set of numeric feature values. This finds, for each document, its set of nearest neighbors under a dot-product based proximity function. In information retrieval, such sets of nearest neighbors can be used to support hypertext browsing [25] or document clustering [27]. Finding nearest neighbors has computational characteristics similar to using example documents as queries (an increasingly common text retrieval strategy) and to processing of queries to which query expansion has been applied [17]. Different methods of representing texts will give document matrices with very different properties. We first describe the challenging case, where the representation yields a dense matrix of real-valued document vectors. Following that, we describe results for a sparse nonnegative representation. All experiments were run under similar local conditions using one processor of an SGI Challenge with 1 GB RAM.

5.1. Dense Real-Valued Data

Several information retrieval approaches have been proposed which model the observed document-term matrix as having a simpler underlying structure based on document clusters, word clusters, or latent factors. The most popular such approach is latent semantic indexing (LSI) [12]. LSI is based on the singular value decomposition (see [2] for background). The sparse document vectors x_i ∈ R_+^t are replaced by dense d-dimensional vectors x̂_i ∈ R^d, where d ≪ t (typically, d is of order 10^2). The coordinates of x̂_i are linear combinations of the coordinates of x_i; that is, the decomposition provides a transformation matrix T ∈ R^{t×d} such that x̂_i = x_i^T T and hence Â = T^T A. The SVD guarantees that Â^T Â is the best least-squares approximation to A^T A that can be provided by a rank-d matrix Â ∈ R^{d×n}. Values of d of a few hundred typically give the best retrieval effectiveness, indicating that a limited-rank approximation can be a better content representation than the original matrix. When using LSI for text retrieval, a user query is treated as a vector q of length t, and then transformed to a length-d latent vector q̂ (= q^T T). The goal then is to find the set of objects I ⊆ {1, ..., n} such that q̂ · x̂_i is large. This again amounts to identifying the largest entries in the product q̂^T Â. (A brief sketch of this projection step is given below.)
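A compact sketch of the LSI projection, assuming a plain truncated SVD; this is the standard construction matching the formulas above, not necessarily the exact procedure used to produce the dataset described next, and for a large sparse A one would substitute a sparse truncated-SVD routine:

    import numpy as np

    def lsi_transform(A, d):
        """A: t x n term-document matrix. Returns the t x d transformation T
        (top-d left singular vectors) and the d x n latent matrix A_hat = T^T A."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        T = U[:, :d]
        return T, T.T @ A

    def lsi_query(q, T):
        """Transform a length-t query q to its length-d latent form q_hat = q^T T."""
        return q @ T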


For our experiments we used a collection of 5000 encyclopedia articles, originally represented as a 5000 by 75,714 sparse nonnegative matrix of term weights similar to those used in Section 5.2. The LSI procedure was applied to this matrix, producing a dense 320-factor by 5000-document matrix Â, which was then normalized so that the Euclidean norm of each document vector was 1.0. (The LSI ency320.5000 data was provided by Susan Dumais of Bellcore.) We tested our sampling procedure by attempting to find all large document-document matches, in this case all large values in Â^T Â.

5.1.1. Dealing with Negative Entries

Even when (as is usual) the raw document-term matrix A is nonnegative, it is possible that T, and thus Â, contain negative entries. We applied the heuristic proposed in Subsection 4.6 to efficiently identify high dot products. The underlying assumption for this heuristic is that there is a significant correlation between high dot-product values and high positive contributions. We can show this correlation graphically. The relation between dot-product values and the corresponding positive contributions for the ency320.5000 dataset is plotted in Figs. 3 and 4. Figure 3 plots the positive contribution vs dot-product value for the roughly 106,000 document pairs with dot product above 0.55 in the dataset ency320.5000. Figure 4 plots positive contribution vs dot product for all pairs of a (randomly selected) subset of 500 of the documents. In Fig. 5 we plot this dependence for all pairs of 156 vectors from the Isolet data set (isolated letter speech recognition), available from directory /pub/machine-learning-databases/isolet on ftp.ics.uci.edu. We used normalized vectors with 617 attributes, where each vector corresponds to a voice sample of a spoken letter.

FIG. 3. Plots of (x̂_i · x̂_j, x̃_i · x̃_j) for all document pairs such that x̂_i · x̂_j > 0.55 (106K pairs).

FIG. 4. Plots of (x̂_i · x̂_j, x̃_i · x̃_j) for all pairs of 500 documents.

FIG. 5. Plot of positive vs original dot products for all pairs of 156 normalized vectors from the Isolet data set.

5.1.2. Results

We applied a combined sampling and partial dot-product computation (as described in Section 4.6) to the dataset ency320.5000. We first applied the sampling algorithm with parameters H and T (hit factor and threshold, as in Sections 4.3 and 4.5) to the positive matrix Ã. That is, for a "query" q̃ (a column of Ã), we used S = HM samples, where M = Σ_{i=1}^n q̃ · x̃_i. For all instances with score above T, we computed the exact dot-product value (q · x_i). Table 1 shows the results of this test. The first row in each box (algorithm Full) shows results for a naive algorithm that computes all dot products. Recall that the number of instances and the number of queries is 5000. Hence, Full computes 25 × 10^6 dot products. We compare this to our combined sampling algorithm. The sampling algorithm has a one-time cost to prepare the sampled representation ("index" time), and the size of this representation ("presamples computed") is proportional to the parameter H. The sampled representation is used to score all instances for each query ("samples drawn" is the total number of samples used at query time). Dot products are computed explicitly for instances with scores of at least T, so smaller values of T (for constant H) lead to more dot products being computed explicitly ("dot products computed"). Each dot-product computation amounts to 320 floating-point multiply-adds, whereas each sampling step amounts to a memory lookup and a counter increment (using presampling). The result is that the sampling algorithm can use much less compute time than Full while still identifying almost all high dot products. For example, Algorithm R1 uses only 11% of the CPU time of Full (9% if preprocessing is disregarded) and identifies 99.8, 99.3, 98.4, 96.7, and 94.2% of the matches above 0.85, 0.80, 0.75, 0.70, and 0.65, respectively. Algorithm R5 uses 20% of the CPU time of Full (18% if preprocessing is disregarded) and identifies 99% of the matches over 0.55.
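As a rough consistency check of these counts (our reading of Table 1, under the crude assumption that a sampling step costs no more than one multiply-add):

    # Full: 25e6 dot products x 320 multiply-adds each
    full_ops = 25_000_000 * 320                 # ~8.0e9 multiply-adds
    # R1: 450,181,950 sampling steps plus 657,121 exact dot products
    r1_ops = 450_181_950 + 657_121 * 320        # ~6.6e8 operations
    print(r1_ops / full_ops)                    # ~0.08, in line with R1's ~9% query time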


TABLE 1
Box B Shows Number of Pairs of Similar LSI Document Vectors Found by Various Sampling Methods^a

Box A
    Alg    H, T    Index time (s)   Query time (s)   Presamples computed   Samples drawn   Dot products computed
    Full   0, 0          –               3708                  0                     0            25,000,000
    R1    50, 30         85                338            738,801           450,181,950               657,121
    R2    50, 25         85                634            738,801           450,181,950             2,625,280
    R3    60, 30         86                629            886,563           540,218,340             2,284,815
    R4    60, 25         86               1332            886,563           540,218,340             6,934,072
    R5    80, 40         88                667          1,182,037           720,291,120             1,879,159

Box B: Number of pairs with dot product of at least
    Alg    H, T      0.55      0.60      0.65      0.70      0.75      0.80     0.85
    Full   0, 0    106,368    68,872    44,412    28,320    17,688    11,256    7568
    R1    50, 30    90,620    62,216    41,842    27,403    17,410    11,184    7549
    R2    50, 25   103,306    67,784    44,061    28,217    17,667    11,256    7568
    R3    60, 30   104,339    68,223    44,218    28,259    17,674    11,255    7568
    R4    60, 25   106,201    68,829    44,402    28,320    17,688    11,256    7568
    R5    80, 40   105,153    68,554    44,343    28,305    17,686    11,256    7568

^a For each method, box A gives both indexing time (time to build data structures prior to querying) and querying time (time to execute all queries), along with the number of presamples drawn and samples used at query time (controlled by parameter H) and the number of exact dot products computed (controlled by parameter T). Method Full is equivalent to using H = 0 and T = 0. It computes the dot product between all pairs of vectors, so the counts listed for it are the total number of dot products at or over each threshold value.

We observed that most matches were in [0, 1] and that the very few negative matches had small magnitudes. This is not surprising, because dot products of the LSI vectors should approximate the dot products of the (nonnegative) original vectors. Note that even though the scoring is with respect to the positive contributions of the dot products, the final dot products and the results in Table 1 are with respect to the actual values. The numbers of samples in the sample matrices (computed in the preprocessing) are 738,801 for R1 and R2, 886,563 for R3 and R4, and 1,182,037 for R5. Recall that each sample is an identifier of a document (that is, an integer in the range 1, ..., n). Hence, the precomputed samples require little additional storage. These sizes support the use of presampling (see Section 4).

5.2. Sparse Nonnegative Data

Our data here was a collection of 22,173 Reuters newswire articles. The matrix form of the data has 22,173 columns and 106,113 rows, with each row representing a word.


Each matrix entry was a function of the word's frequency in the document, computed using the Cornell "ltc" weighting [5]. This weighting formula has been shown to be effective in making proximity of document vectors correspond more closely to similarity of meaning. The resulting matrix is sparse, with about 2 × 10^6 nonzeros. The vector for each document is normalized to have a Euclidean norm of 1.0, so that the dot product between two document vectors is their cosine correlation. Information retrieval systems typically compute the dot product of a query vector with a collection of sparse document vectors by making use of an inverted file [28] of document vectors. An inverted file is also the appropriate storage for efficient sparse matrix multiplication (see [16]). For each indexing term, a list of the ids of all documents with a nonzero weight for that term, plus the weights themselves, is stored. The dot product of a query vector with every document vector can then be computed by examining only the inverted lists for those terms where the query vector has a nonzero value (a schematic sketch is given below). We compared our sampling approach to computing the dot products with a publicly available inverted file-based IR system, SMART (available from directory /pub/smart on ftp.cs.cornell.edu).
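A schematic, in-memory version of this inverted-file scoring (SMART and our own Full implementation are, of course, far more engineered; the names below are ours):

    from collections import defaultdict

    def build_inverted_file(docs):
        """docs: dict doc_id -> {term: weight} sparse vectors.
        Returns term -> list of (doc_id, weight)."""
        index = defaultdict(list)
        for doc_id, vec in docs.items():
            for term, w in vec.items():
                index[term].append((doc_id, w))
        return index

    def score_query(query, index):
        """Exact dot products of a sparse query {term: weight} with every
        document, touching only the inverted lists of the query's terms."""
        scores = defaultdict(float)
        for term, qw in query.items():
            for doc_id, w in index.get(term, []):
                scores[doc_id] += qw * w
        return scores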

5.2.1. Results

We experimented with the sampling algorithm by itself and in conjunction with dot-product evaluation of high-scoring matches. The use of sampling alone has the advantage of reduced storage requirements, because after preprocessing, only the (typically small) sample matrix is needed to resolve queries. The combined algorithm, however, yielded more accurate results. In the previous section, we showed our combined algorithm to be highly effective for dense data, where eliminating even a portion of the expensive explicit dot-product computations gives a considerable speedup. For sparse data, however, sparse matrix multiplication techniques compute the full matrix product much faster than computing each dot product individually would require. For the Reuters dataset, the time required for sparse multiplication was comparable to evaluating 10% of the dot products individually. Hence, our combined algorithm is effective only if sampling can eliminate a very large fraction of the dot products that would normally need to be computed. This was indeed the case in our experiments on the Reuters dataset. We tested the algorithm's performance with respect to the goals of identifying all k nearest neighbors and identifying all high matches. For these goals, sparse matrix multiplication achieves perfect results.


The sampling-based algorithms are faster but incur some loss of accuracy in the results. We examine these tradeoffs.

Identifying all high similarities by a combined algorithm.  Table 2 compares the efficiency of sparse matrix multiplication with that of our combined algorithm when the goal is identifying and computing all dot products over a specified threshold. Two implementations of sparse matrix multiplication, our own (Full) and SMART, were used. SMART did not write the high dot products to a file, so its query time in Table 2 slightly underestimates its time to complete the task. Also note that the time to build the inverted file is shown as index time for SMART, but is included in query time for Full. As in the previous experiment, H controls the size of the sampled representation, and hence the accuracy of our combined algorithm, while T controls how many instances have explicit dot products computed.

TABLE 2
Box B Shows Number of Pairs of Similar Sparse Document Vectors Found by Various Sampling Methods^a

Box A
    Alg     H, T    Index time (s)   Query time (s)   Presamples computed   Samples drawn   Dot products computed
    SMART     –           67               3503                 –                     –           491,641,929
    Full      –            –               2266                 –                     –           491,641,929
    R1      60, 10        185                382          2,266,396           392,812,060             2,043,813
    R2      60, 7         185                529          2,266,396           392,812,060             5,603,726
    R3     100, 15        203                419          3,307,053           498,441,200             1,317,467
    R4     100, 11        203                517          3,307,053           498,441,200             3,295,974
    R5      40, 8         181                288          1,526,426           273,542,120             1,486,735

Box B: Instances with dot product of at least
    Alg     H, T        0.15       0.20       0.25       0.30      0.35      0.40      0.45      0.50
    Full      –     1,755,221    550,871    243,193    142,323    97,035    71,969    54,921    43,559
    R1      60, 10      0.641      0.885      0.975      0.995     0.999     1.000     1.000     1.000
    R2      60, 7       0.891      0.979      0.997      1.000     1.000     1.000     1.000     1.000
    R3     100, 15      0.480      0.749      0.906      0.970     0.992     0.998     0.999     1.000
    R4     100, 11      0.763      0.919      0.978      0.995     0.999     1.000     1.000     1.000
    R5      40, 8       0.494      0.763      0.918      0.976     0.994     0.998     0.999     1.000

^a For each method, box A gives indexing and querying time, along with the number of samples drawn and the number of exact dot products computed. Both method Full and the text retrieval system SMART use sparse matrix multiplication techniques to compute the dot products between all pairs of vectors. We show under Full in box B the total number of dot products at or over each threshold value. For the sampling algorithms, we instead show what proportion of those over-threshold dot products were found.


Table 2 shows that the combined algorithm uses substantially less time than either of the sparse matrix multiplication implementations, while still finding a high proportion of the large dot products.

Identifying k nearest neighbors by a combined algorithm.  We also investigated identifying all k nearest neighbors using a combined algorithm with parameters H and k'. The sampling algorithm is applied first with parameter H. The k' highest scores are identified, and the exact dot products for these instances are computed. The algorithm returns these k' values sorted according to dot-product value (a code sketch follows this discussion). The performance measure is the fraction of query documents whose k nearest neighbors are precisely identified (equivalently, are all present in this set of k' items), for various k ≤ k'. In the algorithms used here, we used the same parameter H for all queries. This means that we "lost" many nearest neighbors with low match values. We suspect that the performance can be improved considerably by determining H individually and adaptively for each query: larger values of H would be used when the nearest neighbors have low matches (see the discussion in Section 4.4). In Table 3 we list results for several choices of H and k'. For each run, we list the index and query CPU times, the size of the sampled representation, the total number of samples used for queries, and the fraction of queries for which at least k nearest neighbors were correctly identified (for k = 1, ..., 9). The trivial nearest neighbor, the instance itself, is always excluded. Table 3 includes the running times of Full (our implementation) and SMART for computing all dot products and outputting the 10 nearest neighbors. The last two rows give the performance for the ranked union of the lists of the top 10 documents from two experiments; we list results for the union of R2 and R4 and for the union of R5 and R6. Algorithm R5 + R6 used 38% of the query processing time of Full and found 99% of the nearest neighbors and 99% of the fourth nearest neighbors. Algorithm R4 used 19% of the time and found 90% of the nearest neighbors and 50% of the fourth nearest neighbors. We observed that when the k nearest neighbors are not identified correctly, it is often because the kth nearest neighbor has a low match and is close in value to a large set of other instances. In most applications it would not be important to distinguish the kth nearest neighbor from other instances with similar dot products.
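A minimal sketch of this procedure. The scorer is passed in as a parameter (for instance, the illustrative sample_scores helper from the Section 4.5 sketch), and the sketch is shown for a nonnegative matrix A; in the LSI experiments the scores come from the transformed matrix Ã while the exact products use the original vectors, and in practice the query document itself would be excluded.

    import numpy as np

    def knn_combined(q, A, H, k_prime, score_fn, rng=None):
        """Sampling-based scores via score_fn, then exact dot products for the
        k' highest-scoring instances, returned sorted by exact dot-product value."""
        rng = rng or np.random.default_rng(0)
        M = float(q @ A.sum(axis=1))            # total weight, so S = H * M samples
        scores = score_fn(q, A, int(round(H * M)), rng)
        top = np.argsort(scores)[-k_prime:]     # k' best sampling scores
        exact = [(j, float(q @ A[:, j])) for j in top]
        return sorted(exact, key=lambda p: -p[1])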


TABLE 3
Proportion of the k Nearest Neighbors Identified for Reuters Data, for Various Values of k and Sampling Algorithms

Box A
    Alg       H, k'    Index time (s)   Query time (s)   Presamples computed   Samples drawn
    SMART       –            67               3628                  0                     0
    Full        –             –               2591                  0                     0
    R1        30, 40         177                240          1,147,270           205,282,680
    R4        40, 50         179                318          1,526,426           273,542,120
    R2        50, 50         182                374          1,902,252           337,826,300
    R5        50, 100        182                454          1,902,252           337,826,300
    R6       100, 100        183                516          3,307,053           498,441,200
    R8       140, 50         190                462          3,816,159           517,037,360
    R2 + R4     –             –                690          3,428,678           611,368,420
    R5 + R6     –             –                972          5,209,305           836,267,500

Box B: Fraction of queries where all k nearest neighbors are found
    Alg       H, k'     1NN    2NN    3NN    4NN    5NN    6NN    7NN    8NN    9NN
    SMART       –      1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
    Full        –      1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
    R1        30, 40   0.836  0.642  0.478  0.355  0.264  0.200  0.154  0.199  0.030
    R4        40, 50   0.901  0.764  0.625  0.507  0.409  0.332  0.271  0.221  0.181
    R2        50, 50   0.929  0.817  0.700  0.593  0.492  0.410  0.343  0.288  0.240
    R5        50, 100  0.956  0.881  0.795  0.709  0.624  0.548  0.481  0.423  0.371
    R6       100, 100  0.970  0.916  0.854  0.785  0.720  0.659  0.599  0.546  0.498
    R8       140, 50   0.956  0.886  0.805  0.725  0.649  0.580  0.518  0.458  0.402
    R2 + R4     –      0.972  0.922  0.859  0.793  0.722  0.657  0.592  0.531  0.479
    R5 + R6     –      0.992  0.976  0.951  0.922  0.889  0.851  0.813  0.775  0.738

Identifying nearest neighbors by sampling only.  In addition to the foregoing experiments with the combined algorithm, we used a pure sampling algorithm to find the single nearest neighbor of each of the 22,173 Reuters document vectors. (The trivial nearest neighbor, the instance itself, was excluded.) Two pure sampling runs were done, one taking 337,826,300 samples (index time of 141 s and query time of 299 s), and one taking 136,777,240 samples (index time of 135 s and query time of 140 s). For the first run we used hit factor H = 50, and it required 1,902,250 presamples. For the second run we used H = 20, and it required 768,923 presamples. The times here are similar to or better than those of the combined algorithm, and, like those of the combined algorithm, much better than those of both Full and SMART. However, the accuracy of the pure sampling runs was considerably inferior to that of the combined algorithm. The slower (H = 50) sampling run correctly found the first nearest neighbor of a document 9352 times (a proportion of 0.422), while the faster run (H = 20) found the first nearest neighbor only 6311 times (0.285). The nearest neighbor was in the top 10 documents (breaking ties arbitrarily) 16,596 times (0.748) for the first run, and 12,558 times (0.566) in the second. It should be noted that in many cases the nearest neighbor is one of a substantial number of documents with very close similarities to the query document, and the preceding statistics penalize the sampling method for finding those very similar documents. Nevertheless, the pure sampling method seems considerably inferior to the combined one.

6. k-MATRIX PRODUCTS IN TEXT RETRIEVAL

Our implementation and experiments were performed for applications where the database is a single matrix A. Text retrieval applications exist where the database is represented as a product of several matrices, and our implementation ideas extend to this case (see Section 4.2). In these applications, retrieval from a sparse document-term matrix is enhanced by using a manually or automatically generated thesaurus [19, 24]. The original database matrix A_{d×n} is a sparse term-by-document matrix. Each query vector q has dimension d (the number of terms). The thesaurus is a sparse d by d matrix T. The vector q^T T semantically corresponds to smoothing q by the thesaurus. More meaningful retrieval results can sometimes be obtained from the enhanced products q^T T A or q^T T T^T A (rather than q^T A). It is generally not beneficial, storage- or time-wise, to explicitly compute M = TA and apply our single-matrix implementation. An efficient alternative is often to maintain T and A separately and to use our k-matrix implementation. One reason is that computing M could be computationally expensive, and may need to be partially or fully repeated when the multiplicands are modified. A more compelling reason not to explicitly compute M is that M is typically denser, and hence requires much more storage space, than the product form (a sketch of working with the product form appears below). A similar strategy could be applied to LSI. As discussed in Section 5.1, retrieval using LSI involves precomputing and storing a transformed document matrix Â = T^T A. A query q is similarly transformed to q̂ = q^T T, and the largest entries of q̂^T Â are sought. An alternative would be to seek directly the largest entries of the four-matrix product q^T T T^T A. This approach would save space when documents are relatively short, and would save preprocessing time when additions and deletions to the document collection are frequent.
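An illustrative sketch of working with the product form (hypothetical shapes and densities; the only point is that q^T (TA) is evaluated as (q^T T)A, so the denser matrix M = TA is never materialized):

    import numpy as np
    from scipy import sparse

    d, n = 1000, 5000                                                         # hypothetical sizes: terms, documents
    A = sparse.random(d, n, density=0.01, random_state=0, format="csr")       # term-document matrix
    T = sparse.random(d, d, density=0.005, random_state=1, format="csr")      # thesaurus
    q = np.zeros(d)
    q[[3, 17, 256]] = 1.0                                                     # a (sparse) query, held densely here

    smoothed_q = T.T @ q          # (q^T T)^T: the query smoothed by the thesaurus
    scores = A.T @ smoothed_q     # (q^T T A)^T: enhanced dot product with every document
    # versus M = T @ A, which is typically much denser and must be recomputed
    # whenever T or A changes.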

7. DISCUSSION

We have shown how a sampling-based method for approximating matrix multiplication can increase the efficiency of dot-product computations for pattern recognition. On dense (i.e., nonsparse) data sets, our approach is particularly effective. While there has been considerable progress in speeding up the production of LSI document representations [4], there has been little progress in speeding up retrieval with LSI document representations, beyond the obvious expedients of using parallel hardware or using fewer latent dimensions [14]. To achieve our speedup of 5- to 10-fold by dropping dimensions would require using only 32-64 dimensions rather than 320, which would substantially impact effectiveness.

Our results on sparse data should be evaluated cautiously, because a wide variety of methods for speeding up inverted file based scoring are already known, particularly when only the top-scoring instances are needed and exact results are not necessary [26, 28]. These methods include compressing inverted files (to reduce I/O), using lower precision arithmetic, ignoring some inverted lists or parts of inverted lists, and limiting the number of document score accumulators maintained. Such optimizations are particularly effective with very short queries, which were not a focus of our experiments. Our 5- to 15-fold speedup over inverted file based matching therefore must be considered in the context of these alternative optimizations. Our method does have the advantage that the expected values of our scores are equal to the true scores, and the variance of our estimates is predictable and well behaved.

ACKNOWLEDGMENTS

The authors thank Yoav Freund for stimulating discussions and helpful comments. We are also indebted to the anonymous referee for many helpful comments and observations. In particular, the discussion in Subsections 3.6 and 3.7 is attributed to the referee.

REFERENCES

1. S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu, An optimal algorithm for approximate nearest neighbor searching, in "Proceedings of the Fifth ACM-SIAM Symposium on Discrete Algorithms," ACM-SIAM, 1994, pp. 573-582.
2. K. E. Atkinson, "Numerical Analysis," 2nd ed., Wiley, New York, 1989.
3. J. L. Bentley, B. W. Weide, and A. C. Yao, Optimal expected-time algorithms for closest point problems, ACM Trans. Math. Software 6 (1980), 563-580.
4. M. W. Berry, S. T. Dumais, and G. W. O'Brien, Using linear algebra for intelligent information retrieval, SIAM Rev. 37(4) (1995), 573-595.
5. C. Buckley, G. Salton, and J. Allan, The effect of adding relevance information in a relevance feedback environment, in "SIGIR 94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval" (W. B. Croft and C. J. van Rijsbergen, Eds.), pp. 292-300, Springer-Verlag, London, 1994.
6. H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations, Ann. Math. Statist. 23 (1952), 493-509.
7. K. L. Clarkson, personal communication, Dec. 1996.
8. K. L. Clarkson, Rapid estimation of matrix-vector products and applications, manuscript, Oct. 1996.


9. D. Coppersmith and S. Winograd, Matrix multiplication via arithmetic progressions, in "Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing," Assoc. Comput. Mach., New York, 1987, pp. 1-6.
10. W. B. Croft and D. J. Harper, Using probabilistic models of document retrieval without relevance feedback, J. Documentation 35(4) (1979), 285-295.
11. B. V. Dasarathy, Ed., "Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques," IEEE Comput. Soc., Los Alamitos, CA, 1991.
12. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by latent semantic analysis, J. Amer. Soc. Inform. Sci. 41(6) (Sept. 1990), 391-407.
13. R. O. Duda and P. E. Hart, "Pattern Classification and Scene Analysis," Wiley-Interscience, New York, 1973.
14. S. T. Dumais, Latent semantic indexing (LSI): TREC-3 report, in "The Third Text Retrieval Conference (TREC-3)" (D. K. Harman, Ed.), U.S. Dept. of Commerce, National Institute of Standards and Technology, Gaithersburg, MD, 1995, pp. 219-230.
15. W. Feller, "An Introduction to Probability Theory and Its Applications," Vol. 1, Wiley, New York, 1968.
16. G. Golub and C. Van Loan, "Matrix Computations," Johns Hopkins Press, Baltimore, MD, 1989.
17. D. Harman, Relevance feedback and other query modification techniques, in "Information Retrieval: Data Structures and Algorithms" (W. B. Frakes and R. Baeza-Yates, Eds.), pp. 241-263, Prentice-Hall, Englewood Cliffs, NJ, 1992.
18. T. J. Hastie and R. J. Tibshirani, "Generalized Additive Models," Monographs on Statistics and Applied Probability, Vol. 43, Chapman & Hall, London, 1990.
19. D. D. Lewis, An evaluation of phrasal and clustered representations on a text categorization task, in "Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval," 1992, pp. 37-50.
20. D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka, Training algorithms for linear text classifiers, in "SIGIR '96: Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval," 1996, pp. 298-306.
21. D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds., "Machine Learning, Neural and Statistical Classification," Ellis Horwood, New York, 1994.
22. S. E. Robertson and K. S. Jones, Relevance weighting of search terms, J. Amer. Soc. Inform. Sci., May-June 1976, 129-146.
23. G. Salton and M. J. McGill, "Introduction to Modern Information Retrieval," McGraw-Hill, New York, 1983.
24. K. S. Jones, "Automatic Keyword Classification for Information Retrieval," Butterworths, London, 1971.
25. R. H. Thompson and W. B. Croft, Support for browsing in an intelligent text retrieval system, Internat. J. Man-Machine Stud. 30 (1989), 639-668.
26. H. Turtle and J. Flood, Query evaluation: Strategies and optimizations, Inform. Process. Manag. 31(6) (1995), 831-850.
27. P. Willett, Recent trends in hierarchic document clustering: A critical review, Inform. Process. Manag. 24(5) (1988), 577-598.
28. I. H. Witten, A. Moffat, and T. C. Bell, "Managing Gigabytes: Compressing and Indexing Documents and Images," Van Nostrand-Reinhold, New York, 1994.
29. Y. Yang and C. G. Chute, An example-based mapping method for text categorization and retrieval, ACM Trans. Inform. Syst. 12(3) (July 1994), 252-277.