Computation of the maximum rank correlation estimator

Economics Letters 62 (1999) 279–285

Jason Abrevaya*
The University of Chicago, Graduate School of Business, 1101 East 58th Street, Chicago, IL 60637, USA
Received 15 July 1998; accepted 23 October 1998

*Tel.: +1-773-834-0721; fax: +1-773-702-0458. E-mail address: [email protected] (J. Abrevaya)

Abstract

This paper shows that the objective function for the maximum rank correlation estimator can be evaluated in $O(n \log n)$ calculations (where $n$ is the sample size). Previously, $O(n^2)$ calculations were thought necessary for computation of the objective function. © 1999 Elsevier Science S.A. All rights reserved.

Keywords: Binary search tree; Rank estimation
JEL classification: C14

1. Introduction

The maximum rank correlation (MRC) estimator was developed by Han (1987) in order to estimate $\beta$ up-to-scale in the generalized regression model,

$$y_i = D \circ F(x_i'\beta, \epsilon_i) \qquad (i = 1, \ldots, n), \tag{1}$$

where $F : \mathbb{R}^2 \to \mathbb{R}$ is strictly increasing in both arguments and $D : \mathbb{R} \to \mathbb{R}$ is weakly increasing and non-degenerate. The binary-choice model, censored model, and proportional hazards model are just a few examples of the generalized regression model; the binary-choice model, for instance, has $F(x_i'\beta, \epsilon_i) = x_i'\beta + \epsilon_i$ and $D(v) = 1(v > 0)$. See Han (1987) for more discussion of the model. The MRC estimator maximizes the objective function

$$S_n(\beta) = \sum_{i=1}^{n} \sum_{j \neq i} 1(x_i'\beta > x_j'\beta) \cdot 1(y_i > y_j) \tag{2}$$

over some parameter space $B$.[1] The MRC objective function is a double summation over the $n(n-1)$ 'observation-pairs.' Brute-force computation calculates $1(x_i'\beta > x_j'\beta) \cdot 1(y_i > y_j)$ for each observation-pair, resulting in $O(n^2)$ total calculations. Since $S_n(\beta)$ is a non-smooth function of $\beta$, a non-gradient search method (e.g., the Nelder–Mead simplex algorithm) is used to maximize $S_n(\beta)$. This maximization requires many evaluations of the objective function, and the $O(n^2)$ speed makes the MRC estimator computationally unattractive for large sample sizes.

[1] Since $\beta$ is only identified up-to-scale (and without a location parameter), some normalization of the parameter vector is necessary – e.g. $B = \{\beta : \|\beta\| = 1\}$ or $B = \{\beta : |\beta_1| = 1\}$ (where $\beta_1$ is the first component of $\beta$).

For the binary-choice model, however, $S_n(\beta)$ can be evaluated in $O(n \log n)$ calculations using a clever shortcut (Windmeijer, 1993; Cavanagh and Sherman, 1998). The shortcut algorithm can be thought of as a two-step process. In the first step, the index values $x_i'\beta$ are sorted. If the subscripts of the data are re-labeled so that $x_1'\beta < x_2'\beta < \cdots < x_n'\beta$, note that

$$S_n(\beta) = \sum_{i=1}^{n} 1(y_i = 1) \left( \sum_{j < i} 1(y_j = 0) \right). \tag{3}$$

In the second step, this function is evaluated by looping through the data once and keeping a running tally of zero values. The first and second steps require $O(n \log n)$ and $O(n)$ calculations, respectively, meaning that the whole algorithm requires $O(n \log n)$ calculations.

For other models, it is widely believed that evaluation of the MRC objective function requires the $O(n^2)$ calculations of the brute-force method (see, e.g., Sherman, 1993; Cavanagh and Sherman, 1998; Chay and Honoré, 1998). This slow computation speed was one motivation for the monotone rank estimators of Cavanagh and Sherman (1998), whose objective functions can be evaluated in $O(n \log n)$ calculations. In this paper, we show that the MRC objective function can also be evaluated in $O(n \log n)$ calculations by an algorithm that uses binary search trees.
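Before turning to the general algorithm, the two-step binary-choice shortcut of Eq. (3) is worth making concrete. The sketch below is ours, in Python rather than the paper's GAUSS; the function name and array conventions are assumptions, and ties in the index values are assumed away:

```python
import numpy as np

def binary_choice_objective(x, y, beta):
    """Evaluate S_n(beta) of Eq. (3) in O(n log n): sort the index values,
    then loop through the data once, tallying zeros seen so far.

    x: (n, k) array of regressors; y: (n,) array of 0/1 outcomes.
    Assumes no ties among the index values x_i' beta.
    """
    index = x @ beta
    order = np.argsort(index, kind="stable")   # step 1: sort, O(n log n)
    y_sorted = y[order]                        # re-label so indices increase

    total = 0
    zeros_so_far = 0                           # running tally of y_j = 0, j < i
    for y_i in y_sorted:                       # step 2: single pass, O(n)
        if y_i == 1:
            total += zeros_so_far              # each earlier zero is a concordant pair
        else:
            zeros_so_far += 1
    return total
```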

2. The algorithm

Assume that there are no ties in the index values, so that $x_1'\beta < x_2'\beta < \cdots < x_n'\beta$ after sorting the index values and re-labeling the subscripts. (Appendix A explains how to deal with ties in the index values.) Then, the MRC objective function can be written

$$S_n(\beta) = \sum_{i=1}^{n} \left( \sum_{j < i} 1(y_i > y_j) \right) = \sum_{i=1}^{n} S_i, \tag{4}$$

where $S_i \equiv \sum_{j < i} 1(y_i > y_j)$. The key to the algorithm is being able to evaluate $S_i$ without having to repeatedly loop through all $j < i$ for every $i$, since this double loop requires $O(n^2)$ calculations.

The first step in the algorithm is to form a binary search tree of the unique dependent-variable values. A binary search tree is a data structure commonly used in computer programming to quickly search for a value from a sorted list of values.[2] Fig. 1 shows an example of a binary search tree for the values $\{14, 19, 22, 31, 34, 48\}$.

[2] See Aho et al. (1974) and Horowitz and Sahni (1994), for instance.

Fig. 1. Binary search tree.

Each value in the sorted list is represented by a node in the tree. Each node can have a left child and a right child. The value of a left child must be less than the value of its parent, and the value of a right child must be greater than the value of its parent. The root is the top node in the tree. The height of the tree is the number of levels in the tree. The tree in Fig. 1 has a height of three.

The binary search tree can dramatically improve the speed of searching for a value. For instance, consider searching for the value 31 using the binary search tree in Fig. 1. First, the value is compared

to the root's value. Since 31 is greater than the root (node 22), the right child (node 34) is then checked. Since 31 is less than 34, the left child (node 31) is then checked and the value is found. The height of the binary search tree is the maximum number of nodes that need to be checked before finding a value.

Of course, many different binary search trees correspond to a given set of values. For instance, Fig. 2 is another valid binary search tree for $\{14, 19, 22, 31, 34, 48\}$. This tree, however, has a height of six and would not speed up search time.

Fig. 2. Skewed binary search tree.

Let $N$ be the number of unique dependent-variable values (i.e., the size of the set $\{y_1, \ldots, y_n\}$ without duplicates). To minimize the time needed to access tree nodes (from the root), it is optimal to have a tree with minimal height. Each level $\ell$ of the tree (where level 1 corresponds to the root) can have a maximum of $2^{\ell - 1}$ nodes, meaning that the minimal height for a tree with $N$ nodes is $\lceil \log_2(N + 1) \rceil$. Conveniently, one can always form a binary search tree with the minimal height. First, sort the list of unique values; let $\{z_1, \ldots, z_N\}$ be the sorted list of values. Then, the following recursive algorithm (FORMTREE) forms a tree of minimal height from a sorted list of values:

FORMTREE(sorted list of size $k$, $\{v_1, \ldots, v_k\}$)
• If $k = 1$, return tree with single node $v_1$.
• If $k = 2$, return tree with root $v_1$ and right child $v_2$.
• If $k > 2$, return tree with root $v_{\lfloor (k+1)/2 \rfloor}$, left subtree $=$ FORMTREE($\{v_1, \ldots, v_{\lfloor (k+1)/2 \rfloor - 1}\}$), and right subtree $=$ FORMTREE($\{v_{\lfloor (k+1)/2 \rfloor + 1}, \ldots, v_k\}$).
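A direct transcription of FORMTREE into Python might look as follows. This is a sketch, not the paper's code (the paper's implementation is in GAUSS), and the `Node` class is our assumption:

```python
class Node:
    """One node of a binary search tree."""
    def __init__(self, value):
        self.value = value
        self.left = None     # child subtree holding smaller values
        self.right = None    # child subtree holding larger values

def formtree(values):
    """Build a minimal-height binary search tree from a sorted, duplicate-free list."""
    k = len(values)
    if k == 0:
        return None
    if k == 1:
        return Node(values[0])
    if k == 2:
        root = Node(values[0])
        root.right = Node(values[1])
        return root
    mid = (k + 1) // 2 - 1                   # zero-based index of v_{(k+1)/2}
    root = Node(values[mid])
    root.left = formtree(values[:mid])       # {v_1, ..., v_{(k+1)/2 - 1}}
    root.right = formtree(values[mid + 1:])  # {v_{(k+1)/2 + 1}, ..., v_k}
    return root
```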

The reader can check that FORMTREE($\{14, 19, 22, 31, 34, 48\}$) yields the binary search tree in Fig. 1. The binary search tree containing the unique dependent-variable values only has to be constructed once.[3]

[3] Although $S_n(\beta)$ will be evaluated for multiple values of $\beta$, the value of $\beta$ does not change the dependent-variable values, so the same binary search tree can be used. Only the order (determined by $x_1'\beta < \cdots < x_n'\beta$) in which the dependent variables are processed will be affected by $\beta$.

In order to compute the objective function in Eq. (4), the dependent-variable values are processed one-by-one (starting with $y_1$ and ending with $y_n$) in the binary search tree. For each $i$, we need to count the number of $y_j$ with $j < i$ that are less than $y_i$. In order to do this counting efficiently, two counters are associated with each node in the tree. The first counter keeps track of the number of values processed so far that are equal to the node. The second counter keeps track of the number of times that the node's left child has been visited (i.e., the number of values processed so far that are less than the node). For each $i$, a search for the value $y_i$ in the binary search tree is done. During the search, these counters allow one to infer how many values processed so far are less than $y_i$. If $y_i$ is greater than a node, both counters are recorded, since $y_i$ is larger than any values at the node or in the node's left subtree. If $y_i$ is equal to a node (i.e., $y_i$ has been found), the second counter is recorded, since $y_i$ is larger than any values in the node's left subtree.

An example illustrates most clearly how this algorithm works. Assume that the index values have been sorted and the dependent variables are given by

$$y_1 = 19, \quad y_2 = 34, \quad y_3 = 14, \quad y_4 = 22, \quad y_5 = 31, \quad y_6 = 48,$$

corresponding to the binary search tree in Fig. 1. Fig. 3 shows the six searches used to evaluate Eq. (4). For each search, the search path is given in bold. The first and second node counters described above are called 'HIT' and 'LCV' (for 'left child visited'), respectively. The updated values of the node counters and the value of $S_i$ are given in each step. The node counters used to calculate $S_i$ are shown in bold. In the first step, for $y_1 = 19$, the algorithm proceeds as follows:

• $y_1 < 22$: increase LCV counter of node 22 by one; visit left child (node 14)
• $y_1 > 14$: record HIT and LCV counters of node 14; visit right child (node 19)
• $y_1 = 19$: record LCV counter of node 19; increase HIT counter of node 19 by one

The other steps are similar. Each time a left child is visited, the LCV counter is incremented. Each time a right child is visited, the two counters of the parent node are recorded (to be added into $S_i$). Each time a value is found, the LCV counter is recorded (to be added into $S_i$) and the HIT counter is incremented.

Fig. 3. Illustration of the MRC algorithm.
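Putting the pieces together, the sketch below (ours, not the paper's GAUSS code) attaches the HIT and LCV counters to the nodes built by `formtree` above and evaluates Eq. (4) in a single search per observation; it assumes no ties among the index values:

```python
import numpy as np

def mrc_objective(x, y, beta):
    """Evaluate the MRC objective S_n(beta) of Eq. (4) in O(n log n).

    Reuses Node and formtree from the sketch above.  x and y are numpy
    arrays; assumes no ties among the index values (see Appendix A).
    """
    index = x @ beta
    order = np.argsort(index, kind="stable")
    y_sorted = y[order]                     # process in increasing index order

    root = formtree(sorted(set(y.tolist())))

    def reset(node):                        # zero both counters on every node
        if node is not None:
            node.hit = 0                    # values processed so far equal to node
            node.lcv = 0                    # times the node's left child was visited
            reset(node.left)
            reset(node.right)
    reset(root)

    total = 0
    for y_i in y_sorted:
        s_i = 0                             # S_i = #{j < i : y_j < y_i}
        node = root
        while True:
            if y_i < node.value:
                node.lcv += 1               # left child visited: bump LCV
                node = node.left
            elif y_i > node.value:
                s_i += node.hit + node.lcv  # record both counters, go right
                node = node.right
            else:                           # y_i found: record LCV, bump HIT
                s_i += node.lcv
                node.hit += 1
                break
        total += s_i
    return total
```

On the six-observation example above, the successive $S_i$ values are 0, 1, 0, 2, 3, 5, so $S_n(\beta) = 11$.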


How many computations does this algorithm require? The one-time construction of the binary search tree requires $O(n \log n)$ calculations. The sort of the index values requires $O(n \log n)$ calculations. The processing of each $y_i$ requires $O(\log N)$ calculations, since the height of the binary search tree is $\lceil \log_2(N + 1) \rceil$. Thus, the processing of all the dependent variables requires $O(n \log N)$ operations.[4] The overall computation time is $O(n \log n)$.

To assess the performance of the new algorithm, we simply drew index values and dependent variables from normal distributions and then computed the MRC objective function. The algorithms were programmed in GAUSS.[5] Fig. 4 shows the computation speed of the brute-force and binary-search-tree algorithms for various sample sizes. Note that time is plotted with a logarithmic scale (on the y-axis). For $n = 5000$, the new algorithm is about 100 times faster than the brute-force algorithm; for $n = 10{,}000$, the new algorithm is more than 150 times faster than the brute-force algorithm.
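For comparison, a brute-force evaluation of Eq. (2) is immediate, and it provides a correctness check on the fast routine. The sketch below is ours and is not intended to reproduce the paper's GAUSS benchmarks:

```python
import numpy as np

def mrc_objective_brute(x, y, beta):
    """O(n^2) evaluation of Eq. (2), as a check on the O(n log n) routine."""
    index = x @ beta
    n = len(y)
    return sum(
        1
        for i in range(n)
        for j in range(n)
        if j != i and index[i] > index[j] and y[i] > y[j]
    )

# Example check against the tree-based routine defined earlier:
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
beta = np.array([1.0, -0.5, 0.25])
y = (x @ beta + rng.normal(size=500) > 0).astype(int)
assert mrc_objective(x, y, beta) == mrc_objective_brute(x, y, beta)
```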

[4] If $N$ is fixed (e.g., binary-choice or ordered-choice models), the computation speed of the processing step is $O(n)$.

[5] The simulations were run on a PC with a Pentium Pro processor (200 MHz) and 128 megabytes of memory. The built-in GAUSS function sortc() was used for sorting values.


Fig. 4. Computation time for MRC objective function.

3. Conclusion

The proposed algorithm makes the MRC estimator more attractive computationally, especially for larger sample sizes. The improved computation speed is important since the objective function must be evaluated many times in order to maximize it.[6] The monotone rank estimators of Cavanagh and Sherman (1998) still have computational advantages over the MRC estimator, since they do not require the tree-traversal step of the algorithm described in Section 2. We have shown, however, that the order of the computation time for the MRC objective function is the same as for the monotone rank estimators.[7]

[6] Note that the proposed algorithm can also be used to calculate Kendall's (1938) tau statistic, upon which the MRC estimator is based. This result is of less practical interest since the tau statistic is usually just computed once.

[7] For all sample sizes, the monotone rank estimators were 10–20 times faster than the new algorithm. This comparison is a bit unfair to the new MRC algorithm, however, since the sorting algorithm for the monotone rank estimators is 'hard-coded' for GAUSS whereas the tree-traversal algorithm is not.


Acknowledgements

Financial support from the National Science Foundation (SBR-9730155) and the University of Chicago Graduate School of Business is gratefully acknowledged. GAUSS code for the proposed algorithm is available from the author's website (http://gsbwww.uchicago.edu/fac/jason.abrevaya/).

Appendix A. Dealing with ties in index values

In addition to the LCV and HIT counters, keep track of the maximum index value that has contributed to the LCV counter of a node, the maximum index value that has contributed to the HIT counter of a node, as well as the number of times that the maximum index value has contributed to each of the counters. This information is sufficient to calculate $\sum_{j \neq i} 1(x_i'\beta > x_j'\beta) \cdot 1(y_i > y_j)$ for each $i$ (since the counter contributions from observations having index values equal to $x_i'\beta$ can be ignored).

References

Aho, A.V., Hopcroft, J.E., Ullman, J.D., 1974. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA.
Cavanagh, C., Sherman, R.P., 1998. Rank estimators for monotonic index models. Journal of Econometrics 84, 351–381.
Chay, K.Y., Honoré, B.E., 1998. Estimation of semiparametric censored regression models: an application to changes in black–white earnings inequality during the 1960s. Journal of Human Resources 33, 4–38.
Han, A.K., 1987. Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator. Journal of Econometrics 35, 303–316.
Horowitz, E., Sahni, S., 1994. Fundamentals of Data Structures in Pascal. Computer Science Press, New York.
Kendall, M.G., 1938. A new measure of rank correlation. Biometrika 30, 81–93.
Sherman, R.P., 1993. The limiting distribution of the maximum rank correlation estimator. Econometrica 61, 123–137.
Windmeijer, F.A.G., 1993. The maximum rank correlation estimator and the rank estimator in binary choice models. Econometric Theory 9, 313 ('Problems and Solutions' section).