An efficient algorithm for generalized random sampling


Pattern Recognition Letters 12 (1991) 683-686 North-Holland

November 1991

Amihood Amir*

Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA

Doron Mintz**

Center for Automation Research, University of Maryland, College Park, MD 20742, USA

Received 10 September 1990 Revised 18 January 1991

Abstract

Amir, A. and D. Mintz, An efficient algorithm for generalized random sampling, Pattern Recognition Letters 12 (1991) 683-686.

An efficient algorithm for selecting s different m-element subsets of {1, ..., k} is presented. The lower bound for this problem is Ω(sm). The two previously known solutions have time complexity O(sm^2 log k) and O(s(k + m)). Our method ranks the m-element subsets in lexicographic order and then generates an s-element random sample of ranks. The time complexity is O(sm^2).

Introduction

When writing heuristics or randomized algorithms one needs efficient sources of randomness. We distinguish between three possible levels:
(1) Given an integer k, generate a random integer r, 1 ≤ r ≤ k.
(2) Given integers k, m, generate a set of m different integers chosen randomly between 1 and k.
(3) Given integers k, s, m, generate a set whose elements are s different m-element subsets of {1, ..., k}.

* Partially supported by NSF grant CCR-880364 and a University of Maryland General Research Board Full Year Award.
** The support of NSF grant DCR-86-03723 is gratefully acknowledged.

We are interested in an efficient solution to problem 3. We were motivated by the need for an efficient and deterministic algorithm for random tuple selection for surface fitting [9, 5, 12, 13]. Our solution combines efficient methods for generating random sets [1, 2, 4] with set ranking algorithms [10, 8]. Our main contribution is an efficient new method for unranking m-element subsets of {1, ..., k}. The previously known unranking algorithms have time complexities O(m^2 log k) [10, 8] or O(k + m) [7]. Our algorithm unranks in time O(m^2).
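Levels (1) and (2) are directly supported by most standard libraries; only level (3) needs the machinery developed below. As a point of reference, here is a minimal Python sketch (ours, not part of the paper) of the first two levels:

```python
import random

k, m = 1000, 7

# Level (1): one random integer r with 1 <= r <= k.
r = random.randint(1, k)

# Level (2): m different integers chosen at random from {1, ..., k}.
subset = random.sample(range(1, k + 1), m)

# Level (3), s different m-element subsets of {1, ..., k}, has no such
# one-liner; it is the subject of this paper.
```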

Problem definition

Let k be a positive integer. Choose a set of s different random subsets of {1, ..., k}, each subset having exactly m elements.



Discussion

Several algorithms for finding a set of distinct random numbers (1-element sets) exist in the literature [1, 2, 4]. Rajan, Ghosh and Gupta [11] gave a parallel algorithm. These algorithms can also be used to solve a special case of our problem: by successive application, mutually disjoint random m-element subsets can be chosen. This paper presents a solution to the more general problem where the sets are different but not necessarily disjoint.

A straightforward way of solving the problem is to repeatedly generate random numbers in the range {1, ..., k} and add them to our sets as long as there is no conflict (i.e., the added number did not previously appear in the set, nor does it complete a previously existing set). In the best case, such an algorithm generates sm random numbers; but as the size of the sample s grows, the probability that a previously selected subset is chosen again grows nonlinearly, and rejections become increasingly frequent. This solution also requires somewhat more sophisticated data structures (e.g., hash tables, heaps, balanced binary trees) to implement the conflict checks efficiently (see [2, 3] for the corresponding problems with 1-element sets). In order to avoid these complications one must use algorithms such as those of [1, 2, 4]. A sketch of this straightforward rejection approach is given below.
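As an illustration (ours, not from the paper), the following Python sketch implements a slightly simplified variant of the rejection strategy, drawing whole m-subsets with a level-(2) routine and rejecting duplicates via a hash table; the function name naive_sample_subsets is ours:

```python
import random

def naive_sample_subsets(k, s, m, seed=0):
    """Draw s different m-element subsets of {1, ..., k} by rejection.

    Assumes s <= C(k, m). Simple, but as s approaches C(k, m) the probability
    of re-drawing an already chosen subset grows, so the expected number of
    draws is no longer proportional to s.
    """
    rng = random.Random(seed)
    chosen = set()                        # hash table of subsets for conflict checks
    while len(chosen) < s:
        candidate = frozenset(rng.sample(range(1, k + 1), m))
        chosen.add(candidate)             # a repeated subset is silently rejected
    return [sorted(c) for c in chosen]

print(naive_sample_subsets(k=10, s=4, m=3))
```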

Our solution

We consider the numerical analysis model where arithmetic operations (+, -, ×, /, exp, log) and random number generation are done in unit cost for numbers of all sizes. Let P_m^k be the set of m-element subsets of {1, ..., k}. Assume we have a ranking of P_m^k (a bijection from P_m^k onto {1, ..., \binom{k}{m}}). Our algorithm is:

Algorithm.
1. Find s different random integers d_1, ..., d_s in the range {1, ..., \binom{k}{m}}.
2. For i = 1 to s
      Construct the unique tuple (b_{i,1}, ..., b_{i,m}) that maps to d_i
   End
End

Time. Step 1 can be computed in time O(s) by any of the algorithms [1, 2, 4]. The remainder of this paper is devoted to showing that a tuple can be reconstructed from a number in time O(m^2), making the total algorithm's time O(sm^2).

There is a well-known combinatorial algorithm for ranking all m-element subsets of {1, ..., k}. It is based on the following two observations:
(1) P_m^k is equivalent to the set of lexicographically ordered m-element tuples of numbers in the range {0, ..., k-1}.
(2) Every non-negative integer d is uniquely represented by the sum

    d = \binom{b_1}{1} + \binom{b_2}{2} + \cdots + \binom{b_m}{m},

where 0 \le b_1 < b_2 < \cdots < b_m, and b_r is the greatest number such that

    \binom{b_r}{r} \le d - \sum_{j=r+1}^{m} \binom{b_j}{j}.

A sketch of this correspondence, with each b_r found by binary search, is given below.

Time. At most m multiplications are needed to compute the value of a binomial coefficient \binom{b}{r}. A binary search over the range {0, ..., k-1} enables us to find each b_r in time O(m log k). The total time for all m elements is O(m^2 log k). Knott [7] gave a different unranking algorithm; his algorithm requires both time and space O(k + m).


Knott's algorithm is superior for small k (i.e., k ≤ m^2) but is inefficient for large k. We show a method for deriving b_1, ..., b_m in time O(m^2). Consider the degree-m polynomial

    q(n) = \frac{1}{m!} n(n-1)(n-2) \cdots (n-m+1),

that is, q(n) = \binom{n}{m}. This polynomial has the m roots {0, 1, ..., m-1} and is monotonically increasing for n > m-1. Our problem is to find, for given d and r, the natural number b such that

    \binom{b}{r} \le d < \binom{b+1}{r}.

Without loss of generality we may ignore the constant factor (replacing d by r!·d), and for given d, r, find the natural number b such that

    p(b) \le d < p(b+1),

where p(x) = x(x-1) \cdots (x-r+1). Clearly \lfloor d^{1/r} \rfloor \le b, since

    d \ge \lfloor d^{1/r} \rfloor^r > p(\lfloor d^{1/r} \rfloor).

Also, \lfloor d^{1/r} \rfloor + r + 1 > b, since

    p(\lfloor d^{1/r} \rfloor + r + 1) = \prod_{i=1}^{r} \bigl( \lfloor d^{1/r} \rfloor + i + 1 \bigr) > \bigl( d^{1/r} \bigr)^r \ge d.

All we need to do is find \lfloor d^{1/r} \rfloor, which can be done in constant time using logarithms, and then check the r+1 numbers \lfloor d^{1/r} \rfloor, ..., \lfloor d^{1/r} \rfloor + r to find b (we are guaranteed that b is among them). Let y = \lfloor d^{1/r} \rfloor and compute p(y) in time O(r). Subsequently,

    p(y+1) = (y+1) \, p(y) / (y+1-r),

so each further candidate costs only a constant number of operations. Thus, b is found in O(r) multiplications.

Time. O(m) for each element of the m-tuple (since r ≤ m). The total time per set is thus O(m^2). Total algorithm time: O(sm^2).
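The following Python sketch (ours; the names find_b and unrank_fast are ours, and ranks are again 0-based) mirrors this idea: it estimates b from an r-th root computed in floating point, then scans a small candidate window, updating the binomial coefficient incrementally so that each element costs O(r) multiplications. The small downward correction loop is a practical guard against floating-point rounding that the paper's unit-cost exp/log model does not require:

```python
from math import comb, factorial

def find_b(d, r):
    """Greatest natural number b with C(b, r) <= d < C(b+1, r)."""
    if d == 0:
        return r - 1                     # C(b, r) = 0 for all b < r, and C(r, r) = 1 > 0
    # C(b, r) = p(b)/r! with p(b) ~ b^r, so b is near (r! * d)^(1/r).
    lo = max(int((factorial(r) * d) ** (1.0 / r)) - 1, r)
    while lo > r and comb(lo, r) > d:    # guard against float overestimation (rarely runs)
        lo -= 1
    c = comb(lo, r)                      # O(r) multiplications for the first candidate
    b = lo
    while True:
        c_next = c * (b + 1) // (b + 1 - r)   # C(b+1, r) from C(b, r); b >= r, so no /0
        if c_next > d:
            return b
        b, c = b + 1, c_next

def unrank_fast(d, m):
    """Reconstruct (b_1, ..., b_m) from d = sum_r C(b_r, r) in O(m^2) total time."""
    # For 0 <= d < C(k, m) the resulting b_m is at most k - 1.
    b = [0] * (m + 1)
    for r in range(m, 0, -1):
        b[r] = find_b(d, r)
        d -= comb(b[r], r)
    return b[1:]

# Sanity check against the defining sum.
k, m = 20, 5
for d in (0, 1, 7, comb(k, m) - 1):
    b = unrank_fast(d, m)
    assert sum(comb(b_r, r) for r, b_r in enumerate(b, start=1)) == d
```

Combined with a distinct-sampling routine for step 1 (e.g., the methods of [1, 2, 4]), unrank_fast turns each of the s sampled ranks into its m-element subset, giving the O(sm^2) total claimed above.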

Conclusion

We have presented an O(sm^2) algorithm for sampling data sets. The previously best methods accomplished this in time O(sm^2 log k) or O(s(k + m)). Our algorithm is deterministic and uses a constant amount of memory. The algorithm was embedded in a Least Median of Squares parallel fitting algorithm [9, 6] implemented on a Connection Machine and achieved a considerable speedup over the other algorithms mentioned. Note that the lower bound for this problem is Ω(sm), so it is possible that a more efficient solution exists.

Acknowledgements

The authors thank two anonymous referees whose suggestions greatly simplified this paper's exposition.

References

[1] Amir, A. (1988). A pearl diver deals a poker hand. UMIACS-TR-88-9.
[2] Bentley, J. and R. Floyd (1987). Programming pearls - a sample of brilliance. Commun. ACM, 754-757.
[3] Bentley, J. and D. Gries (1987). Programming pearls - abstract data types. Commun. ACM, 284-290.
[4] Chrobak, M. and R. Harter (1988). A note on random sampling. Inform. Process. Lett. 29, 255-256.
[5] Fischler, M.A. and R.C. Bolles (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, June 1981.
[6] Meer, P., D. Mintz, D.Y. Kim and A. Rosenfeld. Robust regression methods for computer vision: a review. To appear in Intl. J. Computer Vision.
[7] Knott, G.D. (1974). A numbering system for combinations. Commun. ACM, 45-46.
[8] Lehmer, D.H. (1964). The machine tools of combinatorics. In: Applied Combinatorial Mathematics. Wiley, New York.
[9] Meer, P., D. Mintz and A. Rosenfeld (1990). Least median of squares based robust analysis of image structure. In: Proc. Image Understanding Workshop, Pittsburgh, PA, Sept. 1990.
[10] Nijenhuis, A. and H. Wilf (1978). Combinatorial Algorithms. Academic Press, New York, 2nd edition.


[11] Rajan, V., R.K. Ghosh and P. Gupta (1989). An efficient parallel algorithm for random sampling. Inform. Process. Lett. 30, 265-268.
[12] Rousseeuw, P.J. and A.M. Leroy (1987). Robust Regression and Outlier Detection. Wiley, New York.


[13] Tirumalai, A. and B.G. Schunck (1989). Robust surface approximation using least median squares regression. Tech. Rep. CSE-TR-13-89, University of Michigan.