Pattern Recognition Letters 12 (1991) 683-686 North-Holland
November 1991
An efficient algorithm for generalized random sampling
Amihood Amir*
Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
Doron Mintz**
Center for Automation Research, University of Maryland, College Park, MD 20742, USA
Received 10 September 1990 Revised 18 January 1991
Abstract
Amir, A. and D. Mintz, An efficient algorithm for generalized random sampling, Pattern Recognition Letters 12 (1991) 683-686.
An efficient algorithm for selecting $s$ different $m$-element subsets of $\{1, \ldots, k\}$ is presented. The lower bound for this problem is $\Omega(sm)$. The two previously known solutions have time complexity $O(sm^2 \log k)$ and $O(s(k+m))$. Our method ranks the $m$-tuples in lexicographic order and then generates an $s$-element random sample. The time complexity is $O(sm^2)$.
Introduction
When writing heuristics or randomized algorithms one needs efficient sources of randomness. We distinguish between three possible levels:
(1) Given an integer $k$, generate a random integer $r$, $1 \le r \le k$.
(2) Given integers $k$, $m$, generate a set of $m$ different integers chosen randomly between 1 and $k$.
(3) Given integers $k$, $s$, $m$, generate a set whose elements are $s$ different $m$-element subsets of $\{1, \ldots, k\}$.
* Partially supported by NSF grant CCR-880364 and a University of Maryland General Research Board Full Year Award.
** The support of NSF grant DCR-86-03723 is gratefully acknowledged.
We are interested in an efficient solution to problem 3. We were motivated by the need for an efficient and deterministic algorithm for random tuple selection for surface fitting [9, 5, 12, 13]. Our solution combines efficient methods for generating random sets [1, 2, 4] with set ranking algorithms [10, 8]. Our main contribution is an efficient new method for unranking $m$-element subsets of $\{1, \ldots, k\}$. The previously known unranking algorithms have time complexities $O(m^2 \log k)$ [10, 8] or $O(k+m)$ [7]. Our algorithm unranks in time $O(m^2)$.
0167-8655/91/$03.50 © 1991 Elsevier Science Publishers B.V. All rights reserved
Volume 12, Number 11, PATTERN RECOGNITION LETTERS
Problem definition
Let $k$ be a positive integer. Choose a set of $s$ different random subsets of $\{1, \ldots, k\}$, each subset having exactly $m$ elements.
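To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the method of this paper: it enumerates all $\binom{k}{m}$ subsets and is therefore only usable for very small $k$. The function name `sample_subsets_bruteforce` and the `rng` parameter are illustrative assumptions.

```python
import itertools
import random

def sample_subsets_bruteforce(k, m, s, rng=random):
    """Return s different m-element subsets of {1, ..., k}.

    Reference implementation only: it materializes every subset,
    so its cost grows with C(k, m) rather than with s * m.
    """
    all_subsets = list(itertools.combinations(range(1, k + 1), m))
    if s > len(all_subsets):
        raise ValueError("s exceeds the number of m-element subsets")
    # random.sample draws without replacement, so the subsets are different.
    return rng.sample(all_subsets, s)
```

Such a reference version is mainly useful for checking a faster sampler on small inputs.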
Discussion
Several algorithms for finding a set of distinct random numbers (1-element sets) exist in the literature [1, 2, 4]. Rajan, Ghosh and Gupta [11] gave a parallel algorithm. These algorithms can also be used to solve a special case of the problem: by successive application of the algorithm, mutually disjoint random $m$-element subsets can be chosen. This paper presents a solution to the more general problem where the sets are different but not necessarily disjoint.
A straightforward way of solving the problem is to repeatedly generate random numbers in the range $\{1, \ldots, k\}$ and add them to our sets as long as there is no conflict (i.e., the added number did not previously appear in the set, nor does it complete a previously existing set). In the best case, such an algorithm generates $sm$ random numbers. But asymptotically, as the size of the sample ($s$) grows, the probability that a previously selected subset will be chosen again grows nonlinearly. The above solution also requires slightly more sophisticated methods (e.g., hash tables, heaps, balanced binary trees) to implement the comparisons efficiently (see [2, 3] for the problems with 1-element sets). In order to avoid these complications one must use algorithms such as [1, 2, 4].
Our solution
We consider the numerical analysis model, where arithmetic operations ($+$, $-$, $\times$, $/$, exp, log) and random number generation are done at unit cost for numbers of all sizes. Let $P_m^k$ be the set of $m$-element subsets of $\{1, \ldots, k\}$. Assume we have a ranking of $P_m^k$ (a bijection from $P_m^k$ onto $\{1, \ldots, \binom{k}{m}\}$). Our algorithm is:
Algorithm
1. Find $s$ different random integers $d_1, \ldots, d_s$ in the range $\{1, \ldots, \binom{k}{m}\}$.
2. For $i = 1$ to $s$
      Construct the unique tuple $(b_{d_i,1}, \ldots, b_{d_i,m})$ that maps to $d_i$
   End
End
Time. Step 1 can be computed in time $O(s)$ by any of the algorithms [1, 2, 4]. The remainder of this paper is devoted to showing that a tuple can be reconstructed from a number in time $O(m^2)$, making the total algorithm's time $O(sm^2)$.
There is a well-known combinatorial algorithm for ranking all $m$-element subsets of $\{1, \ldots, k\}$. It is based on the following two observations:
(1) $P_m^k$ is equivalent to the set of lexicographically ordered $m$-element tuples of numbers in the range $\{0, \ldots, k-1\}$.
(2) Every non-negative integer $d$ is uniquely represented by the sum
$$d = \binom{b_1}{1} + \binom{b_2}{2} + \cdots + \binom{b_m}{m},$$
where $0 \le b_1 < b_2 < \cdots < b_m$. For $r = m, m-1, \ldots, 1$, $b_r$ is the greatest number such that
$$\binom{b_r}{r} \le d - \sum_{j=r+1}^{m} \binom{b_j}{j}.$$
Time. At most $m$ multiplications are needed to compute the value of $\binom{b_r}{r}$. A binary search over the range $\{0, \ldots, k-1\}$ enables us to find each $b_r$ in time $O(m \log k)$. The total time for all $m$ elements is $O(m^2 \log k)$. Knott [7] gave a different unranking algorithm. His algorithm requires both time and space $O(k+m)$.
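The classical ranking and binary-search unranking just described can be sketched in Python as follows. This is an illustration under assumptions, not code from the paper: ranks are taken to be 0-based (that is, $0 \le d < \binom{k}{m}$), the function names `rank` and `unrank_binary_search` are hypothetical, and `math.comb` (Python 3.8 or later) stands in for the $O(m)$ evaluation of a binomial coefficient.

```python
from math import comb

def rank(b):
    """Rank of a strictly increasing tuple b = (b_1, ..., b_m) with b_1 >= 0.

    Uses the sum d = C(b_1, 1) + C(b_2, 2) + ... + C(b_m, m).
    """
    return sum(comb(b_r, r) for r, b_r in enumerate(b, start=1))

def unrank_binary_search(d, m, k):
    """Inverse of rank() for 0 <= d < C(k, m); elements lie in {0, ..., k-1}."""
    if not 0 <= d < comb(k, m):
        raise ValueError("d out of range")
    b = [0] * m
    hi = k - 1
    for r in range(m, 0, -1):
        # Binary search for the greatest x in [r-1, hi] with C(x, r) <= d.
        # C(r-1, r) == 0, so the lower end always satisfies the condition.
        lo, top = r - 1, hi
        while lo < top:
            mid = (lo + top + 1) // 2
            if comb(mid, r) <= d:
                lo = mid
            else:
                top = mid - 1
        b[r - 1] = lo
        d -= comb(lo, r)
        hi = lo - 1  # the remaining elements are strictly smaller
    return tuple(b)
```

For example, with $m = 2$ and $k = 4$, rank 5 unranks to the tuple $(2, 3)$, since $\binom{2}{1} + \binom{3}{2} = 5$.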
This algorithm is superior for small $k$ (i.e., $k \le m^2$) but is inefficient for large $k$. We show a method for deriving $b_1, \ldots, b_m$ in time $O(m^2)$. Consider the $m$-degree polynomial
$$q(n) = \frac{1}{m!}\, n(n-1)(n-2)\cdots(n-m+1).$$
This polynomial has $m$ roots $\{0, 1, \ldots, m-1\}$ and is monotonically increasing for $n > m-1$. Our problem is to find, for given $d$ and $r$, the natural number $b$ such that
$$\binom{b}{r} \le d < \binom{b+1}{r}.$$
Without loss of generality we may ignore the constant, and for given $d$, $r$, find the natural number $b$ such that
$$p(b) \le d < p(b+1),$$
where $p(x) = x(x-1)\cdots(x-r+1)$. Clearly, $\lfloor \sqrt[r]{d} \rfloor \le b$ since
$$d \ge \left(\lfloor \sqrt[r]{d} \rfloor\right)^r \ge p\left(\lfloor \sqrt[r]{d} \rfloor\right).$$
Also, $\lfloor \sqrt[r]{d} \rfloor + r + 1 > b$ since
$$p\left(\lfloor \sqrt[r]{d} \rfloor + r + 1\right) = \prod_{i=1}^{r} \left(\lfloor \sqrt[r]{d} \rfloor + i + 1\right) > \left(\lceil \sqrt[r]{d} \rceil\right)^r \ge d.$$
All we need to do is find $\lfloor \sqrt[r]{d} \rfloor$, which can be done in constant time using logarithms, and then check the $r+1$ numbers $\lfloor \sqrt[r]{d} \rfloor, \ldots, \lfloor \sqrt[r]{d} \rfloor + r$ to find $b$ (we are guaranteed that $b$ is among them). Let $y = \lfloor \sqrt[r]{d} \rfloor$. Compute $p(y)$ in time $O(r)$. Subsequently,
$$p(y+1) = (y+1)\, p(y) / (y+1-r).$$
Thus, $b$ is found in $O(r)$ multiplications.
Time. $O(m)$ for each element of the $m$-tuple (since $r \le m$). The total time per set is thus $O(m^2)$.
Total Algorithm Time. $O(sm^2)$.
Conclusion
We have presented an $O(sm^2)$ algorithm for sampling data sets. The previously best methods accomplished this in time $O(sm^2 \log k)$ or $O(s(k+m))$. Our algorithm is deterministic and uses a constant memory size. The algorithm was embedded in a Least Median of Squares parallel fitting algorithm [9, 6] implemented on a Connection Machine and achieved a considerable speedup over the other algorithms mentioned. Note that the lower bound for this problem is $\Omega(sm)$, so it is possible that a more efficient solution exists.
Acknowledgements
The authors thank two anonymous referees whose suggestions greatly simplified this paper's exposition.
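The root-based search for $b$ and the two-step sampling algorithm of this paper can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation. Assumed details: ranks are 0-based; the names `falling`, `iroot`, `unrank_fast` and `sample_subsets` are hypothetical; `random.sample` stands in for the distinct-number generators of [1, 2, 4]; and the floating-point root estimate assumes that $r!\,d$ fits in a float. The search starts at $\max(\lfloor \sqrt[r]{r!\,d} \rfloor, r-1)$ so that the update rule never divides by zero.

```python
import random
from math import comb, factorial

def falling(x, r):
    """p(x) = x (x-1) ... (x-r+1), a product of r factors."""
    result = 1
    for i in range(r):
        result *= x - i
    return result

def iroot(d, r):
    """Floor of the r-th root of a non-negative integer d (float estimate, then fix-up)."""
    y = int(round(d ** (1.0 / r)))
    while y ** r > d:
        y -= 1
    while (y + 1) ** r <= d:
        y += 1
    return y

def unrank_fast(d, m):
    """Tuple (b_1, ..., b_m) with d = C(b_1, 1) + ... + C(b_m, m), 0 <= b_1 < ... < b_m."""
    b = [0] * m
    fact = factorial(m)                      # holds r! for the current r
    for r in range(m, 0, -1):
        target = d * fact                    # "ignore the constant": compare p(b) with r! * d
        y = max(iroot(target, r), r - 1)     # p(y) <= target holds here
        cur, nxt = falling(y, r), falling(y + 1, r)
        while nxt <= target:                 # a bounded number of upward steps
            y += 1
            cur, nxt = nxt, nxt * (y + 1) // (y + 1 - r)   # p(y+1) from p(y)
        b[r - 1] = y
        d -= cur // fact                     # subtract C(y, r)
        fact //= r
    return tuple(b)

def sample_subsets(k, m, s, rng=random):
    """s different m-element subsets of {1, ..., k} via distinct random ranks."""
    ranks = rng.sample(range(comb(k, m)), s)                          # step 1
    return [tuple(x + 1 for x in unrank_fast(d, m)) for d in ranks]   # step 2
```

Because distinct ranks map to distinct tuples, the subsets returned by `sample_subsets` are different without any pairwise comparison, which is the point of sampling ranks rather than elements.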
References
[1] Amir, A. (1988). A pearl diver deals a poker hand. UMIACS-TR-88-9.
[2] Bentley, J. and R. Floyd (1987). Programming pearls - a sample of brilliance. Commun. ACM, 754-757.
[3] Bentley, J. and D. Gries (1987). Programming pearls - abstract data types. Commun. ACM, 284-290.
[4] Chrobak, M. and R. Harter (1988). A note on random sampling. Inform. Process. Lett. 29, 255-256.
[5] Fischler, M.A. and R.C. Bolles (1981). Random sampling consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, June 1981.
[6] Meer, P., D. Mintz, D.Y. Kim and A. Rosenfeld. Robust regression methods for computer vision: a review. To appear in Intl. J. Computer Vision.
[7] Knott, G.M. (1974). A numbering system for combinations. Commun. ACM, 45-46.
[8] Lehmer, D.H. (1964). The machine tool of combinatorics. In: Applied Combinatorial Mathematics. Wiley, New York.
[9] Meer, P., D. Mintz and A. Rosenfeld (1990). Least median of squares based robust analysis of image structure. In: Proc. Image Understanding Workshop, Pittsburgh, PA, Sept. 1990.
[10] Nijenhuis, A. and H. Wilf (1978). Combinatorial Algorithms, Academic Press, New York, 2nd edition.
[11] Rajan, V., R.K. Ghosh and P. Gupta (1989). An efficient parallel algorithm for random sampling. Inform. Process. Lett. 30, 265-268.
[12] Rousseeuw, P.J. and A.M. Leroy (1987). Robust Regression and Outlier Detection. Wiley, New York.
[13] Tirumalai, A. and B.G. Schunck (1989). Robust surface approximation using least median squares regression. Tech. Rep. CSE-TR-13-89, University of Michigan.