Data & Knowledge Engineering 61 (2007) 554–562 www.elsevier.com/locate/datak
Regression analysis for massive datasets

Tsai-Hung Fan a,*, Dennis K.J. Lin b, Kuang-Fu Cheng a

a Graduate Institute of Statistics, National Central University, 300 Jhongda Road, Jhongli 320, Taiwan, ROC
b Department of Supply Chain and Information Systems, The Pennsylvania State University, PA, United States

Received 27 June 2006; accepted 27 June 2006
Available online 24 July 2006
Abstract

In the past decades, we have witnessed a revolution in information technology. Routine collection of systematically generated data is now commonplace. Databases with hundreds of fields (variables) and billions of records (observations) are not unusual. This presents a difficulty for classical data analysis methods, mainly due to the limitation of computer memory and computational costs (in time, for example). In this paper, we propose an intelligent regression analysis methodology which is suitable for modeling massive datasets. The basic idea is to split the entire dataset into several blocks, apply classical regression techniques to the data in each block, and finally combine these regression results via weighted averages. Theoretical justification of the goodness of the proposed method is given, and empirical performance based on an extensive simulation study is discussed.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Best linear unbiased estimator; Minimum variance; Optimal weight
1. Introduction

In the past decade, we have witnessed a revolution in information technology. Consequently, routine collection of systematically generated data is now commonplace. Databases with hundreds of fields, billions of records and terabytes of information are not unusual. For example, Barclaycard (UK) carries out 350 million transactions a year, Wal-Mart makes over 7 billion transactions a year, and AT&T carries over 70 billion long distance calls annually (see Hand et al. [8]). It becomes very challenging to extract useful features from a large dataset because many statistics are difficult to compute by standard algorithms or statistical packages when the dataset is too large to be stored in primary memory.

Unlike observations resulting from designed experiments, large datasets sometimes become available without predefined purposes – sometimes with rather vague purposes. Typically, it is desirable to find some interesting features in the datasets that will provide valuable information to support decision making. Primary tasks in analyzing large datasets include: data processing, classification, detection of abnormal patterns,
* Corresponding author. E-mail address: [email protected] (T.-H. Fan).
summarization, visualization, and association and correlation analysis. Conventional regression analysis is a popular tool for serving these purposes in general. However, regression analysis may not be straightforward for massive datasets. In fact, even evaluating basic descriptive statistics may not be a trivial task for a large dataset.

Consider the problem of percentile estimation as a simple illustration. Given independent observations x1, x2, . . ., xn from an unknown distribution function F, suppose we want to find its 100α-percentile, that is, the number ξα such that F(ξα) = α. This is closely related to the problem of finding the kth smallest of n observations: an estimate of the 100αth population percentile is provided by the [αn]th order statistic. This seems to be a straightforward problem once all observations have been sorted. A major difficulty arises, however, when the available computer memory is much smaller than n. Sorting x1, x2, . . ., xn then becomes impossible unless space is traded off for time; see Pagter and Rauhe [14]. An alternative method for percentile estimation is thus needed.

The memory space in some computing environments can be as large as several terabytes. However, the number of observations that can be stored in primary memory is often restricted. The available memory, though large, is finite. Many computing environments also limit the maximum array size allowed, and this limit can be much smaller than, and even independent of, the available memory. Large datasets present some obvious difficulties, such as complex relationships among variables and a large physical memory requirement. In general, a simple job for a small dataset may be a major difficulty for a large dataset. This is particularly true for statistical modeling, such as regression analysis.

This work was motivated by our experience in analyzing several large real-life datasets. Other work discussing the statistical perspective in analyzing massive datasets, particularly via the Bayesian approach, can be found in Elder and Pregibon [6], Glymour et al. [7], Ridgeway and Madigan [10] and Balakrishnan and Madigan [1]. In this paper, we focus on developing computational and inferential regression analysis for large datasets.

The paper is organized as follows. Section 2 provides the basic idea of point estimation from a large dataset. Section 3 gives the general regression setting. Our main results are then presented in Section 4: the optimal weight concept and its theoretical support. Section 5 provides a simulation study to illustrate the appropriateness of the proposed approach. Final conclusions are given in Section 6.

2. Statistical inference on large datasets

It is a challenge to compute even basic statistics, such as sample quartiles, from a large dataset because of the limitation of primary memory and the maximum array size in computing environments. To estimate a parameter h(F) of a population F, such as a quantile or the density of the population, it is frequently necessary to store the entire dataset in primary memory in order to obtain an efficient estimate. For example, calculation of the quartiles requires the data first to be sorted and counted. One way to overcome this memory difficulty is to use a sub-sampling technique. This approach is useful for preliminary analysis, but the resulting estimator is less efficient as it uses only the information in part of the data.
For higher efficiency, an estimator should be derived from the entire dataset rather than from parts of it. However, calculation based on the entire dataset may not be feasible for large datasets. Intuitively, we may sequentially read the data into primary memory, block by block, and analyze the data in each block separately. As long as the block size is small, one can easily implement this estimation procedure within each block under various computing environments. A question that arises here is how to draw a final conclusion based on the results obtained from each block.

Suppose that there is an independent and identically distributed sample with large sample size n, and we are interested in finding the population median. Manku et al. [13] provide a technique to find an approximate median. To find the sample median exactly, one needs at least n storage elements; see also Hurley and Modarres [9] and the references therein. When n is large, such as 10 000 000, standard algorithms for computing the sample median may exceed the available memory and fail. However, it is easy to compute the sample median of 10 000 samples in many statistical packages, such as S-plus and SAS. We may sequentially read in the data block by block, each block having, say, 10 000 samples, and then compute the sample median of each block, which yields an independent and identically distributed set of sample medians. It has been shown under some mild conditions that these sample medians are independent and asymptotically normally distributed with mean
equal to the population median. Thus a natural estimate of the population median is the average of these 1000 sample medians.

In summary, to estimate a parameter h(F) based on a large dataset, we may employ a two-stage procedure: first read in the whole dataset sequentially, block by block, each block having a manageable sample size, and compute an estimate of h(F) within each block; then take the average of the resulting estimates from all blocks as an estimate of h(F). Note that the second stage can be updated as soon as a new block has been processed by the first stage, and hence does not require additional memory.

Suppose that w1, . . ., wn is an independent and identically distributed sample from a population F, where wi can be either a random variable or a random vector. We are interested in estimating a parameter h(F) of the population. To formulate our estimation procedure, we rewrite the sample as

$$
\begin{array}{cccc}
w_{1,1}, & w_{1,2}, & \ldots, & w_{1,a}\\
w_{2,1}, & w_{2,2}, & \ldots, & w_{2,a}\\
\vdots & \vdots & & \vdots\\
w_{g,1}, & w_{g,2}, & \ldots, & w_{g,a},
\end{array}
$$

where $w_{j,l} = w_{(j-1)a+l}$, for $j = 1, \ldots, g$ and $l = 1, \ldots, a$; a is the block size and g is the total number of blocks. Note that n = ag. The block size a is chosen so that the estimation of h can be easily handled within a block. It is shown (Li et al. [11]) that the resulting estimate is robust to the choice of a. Using the same estimator for each block, and denoting by $\hat{h}_{jn}$ the resulting estimate based on the sub-sample in the jth block $w_{j,1}, \ldots, w_{j,a}$, h(F) is estimated by the average of the $\hat{h}_{jn}$'s, i.e.,

$$\bar{h} = \frac{1}{g}\sum_{j=1}^{g} \hat{h}_{jn}.$$
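The two-stage procedure above is easy to implement. The following Python fragment is an illustrative sketch rather than the authors' code: the function name blocked_estimate, the use of NumPy, and the simulated data are our own choices. It reads the observations block by block, applies a chosen estimator (here the median) within each block, and averages the per-block estimates.

```python
import numpy as np

def blocked_estimate(stream, block_size, estimator=np.median):
    """Two-stage estimate of a parameter h(F): apply `estimator` within each
    block read from `stream`, then average the per-block estimates.
    Only one block is held in primary memory at a time."""
    block, block_estimates = [], []
    for w in stream:
        block.append(w)
        if len(block) == block_size:                 # block is full: estimate, then discard it
            block_estimates.append(estimator(np.asarray(block)))
            block = []
    if block:                                        # leftover partial block, if any
        block_estimates.append(estimator(np.asarray(block)))
    return np.mean(block_estimates)

# Example: median of 1 000 000 simulated observations with block size a = 10 000.
# The generator stands in for a sequential file or database reader.
rng = np.random.default_rng(0)
data = (x for x in rng.standard_normal(1_000_000))
print(blocked_estimate(data, block_size=10_000))     # close to the true median, 0
```

The second stage could equally be kept as a running mean, so that storage never grows beyond the g block estimates, which is exactly the property exploited in the remainder of the paper.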
Li et al. [11] propose a general approach to deal with statistical inference on datasets with a large sample size. Their approach is widely applicable and capable of making statistical inference for any parameter h(F) of a population F. It is shown that, under some mild conditions, the resulting estimator is strongly consistent and has a normal limiting distribution. Furthermore, it is shown that in many situations the resulting estimate is as efficient as if all data were used simultaneously to compute the estimate. Their method can be used for point estimation as well as density estimation. The sampling properties of the estimate $\bar{h}$ are summarized in the following propositions (see Li et al. [11]).

Proposition 1. For any positive integer values of a and g,

(a) if $\hat{h}_{jn}$ is affine equivariant, then so is $\bar{h}$;
(b) if $\hat{h}_{jn}$ is an unbiased estimator of h, then so is $\bar{h}$.
Proposition 2. Suppose that w1, . . ., wn are independent and identically distributed, and $a \to \infty$ and $g \to \infty$ as $n \to \infty$.

(a) If the $\hat{h}_{jn}$'s converge weakly to the true value h, then so does $\bar{h}$.
(b) If the $\hat{h}_{jn}$'s converge in $L_2$ to the true value h, then so does $\bar{h}$.
(c) If the $\hat{h}_{jn}$'s converge strongly to the true value h, then so does $\bar{h}$.

3. Regression analysis for massive datasets

The general regression model is

$$y_t = \beta_0 + \beta_1 x_{t,1} + \cdots + \beta_{p-1} x_{t,p-1} + \epsilon_t, \qquad t = 1, \ldots, N,$$

where $y_t$ is the response variable, $x_{t,1}, \ldots, x_{t,p-1}$ are the predictor variables, $\beta_0, \ldots, \beta_{p-1}$ are the parameters to be estimated, and $\epsilon_t$ is the error with $E(\epsilon_t) = 0$ and $\mathrm{Var}(\epsilon_t) = \sigma^2$. In matrix form, the model can be represented as
$$Y = X\beta + \epsilon,$$

where

$$
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}_{N\times 1},\quad
X = \begin{bmatrix}
1 & x_{1,1} & \cdots & x_{1,p-1}\\
1 & x_{2,1} & \cdots & x_{2,p-1}\\
\vdots & \vdots & \ddots & \vdots\\
1 & x_{N,1} & \cdots & x_{N,p-1}
\end{bmatrix}_{N\times p},\quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}_{p\times 1},\quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix}_{N\times 1}.
$$
As previously mentioned, when the sample size N is too large, direct regression analysis can be infeasible, mainly because of computing memory or computing time. Zen et al. [15] adapted the idea of Li et al. [11] to perform regression analysis as follows. First, the original dataset is split into k blocks, each consisting of n samples. In other words,

$$
Y_j = \begin{bmatrix} y_{1,j} \\ y_{2,j} \\ \vdots \\ y_{n,j} \end{bmatrix}_{n\times 1},\qquad
X_j = \begin{bmatrix}
1 & x_{j,1,1} & \cdots & x_{j,1,p-1}\\
1 & x_{j,2,1} & \cdots & x_{j,2,p-1}\\
\vdots & \vdots & \ddots & \vdots\\
1 & x_{j,n,1} & \cdots & x_{j,n,p-1}
\end{bmatrix}_{n\times p},
\qquad (3.1)
$$

where $y_{l,j} = y_{l+n(j-1)}$ is the lth observation in the jth block, $x_{j,l,i} = x_{l+n(j-1),i}$ is the ith variable value of the lth sample in the jth block, and $j = 1, \ldots, k$, $l = 1, \ldots, n$, $i = 0, \ldots, p-1$. Eq. (3.1) can now be represented as

$$
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{bmatrix}_{N\times 1},\qquad
X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{bmatrix}_{N\times p},
$$

where $X'X = \sum_{j=1}^{k} X_j'X_j$ and $X'Y = \sum_{j=1}^{k} X_j'Y_j$. Such a partition algorithm resolves the main difficulty of the computing memory/time for the calculation of the matrices $X'X$ and $X'Y$. The final best linear unbiased estimate (Draper and Smith [5]) of $\beta$ is then obtained as

$$
\tilde{\beta} = \left(\sum_{j=1}^{k} X_j'X_j\right)^{-1}\left(\sum_{j=1}^{k} X_j'Y_j\right).
$$
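Because $X'X$ and $X'Y$ decompose into sums of per-block cross-products, the pooled least squares estimate can be accumulated block by block with only O(p²) working memory. The following Python sketch is our own illustration, not Zen et al.'s code; the function name blocked_ols and the toy data generator are assumptions. It accumulates $X_j'X_j$ and $X_j'Y_j$ and then solves the normal equations once.

```python
import numpy as np

def blocked_ols(blocks):
    """Pooled least squares via X'X = sum_j Xj'Xj and X'Y = sum_j Xj'Yj.
    `blocks` yields (Xj, yj) pairs, with the column of ones already in Xj;
    only one block is held in memory at a time."""
    XtX, Xty = None, None
    for Xj, yj in blocks:
        if XtX is None:                      # initialize p x p and p-vector accumulators
            p = Xj.shape[1]
            XtX, Xty = np.zeros((p, p)), np.zeros(p)
        XtX += Xj.T @ Xj
        Xty += Xj.T @ yj
    return np.linalg.solve(XtX, Xty)         # identical to the full-data OLS solution

# Toy example with two hypothetical blocks (intercept plus two predictors).
rng = np.random.default_rng(1)
def make_block(n):
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, 2))])
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(n)
    return X, y

print(blocked_ols(make_block(1000) for _ in range(2)))   # close to (2, -1, 0.5)
```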
Zen et al. [15] further study the appropriateness of such an estimation. They use Bartlett's method (Bartlett [2]) to test for the homogeneity of the block variances and provide a regression procedure for the equal-variance situation only. A case study was reported to demonstrate the usefulness of their approach, which would not have been possible using the classical regression analysis procedure.

4. The proposed weighted average procedure

We next propose the weighted average method for the resulting estimation. The final estimate is a weighted average of the best linear unbiased estimates obtained from each block. Specifically, the optimal weights can be obtained by minimizing the variance of the ultimate estimate. This is summarized in the following theorem.

Theorem 1. Let $\sigma_{ij}^2 = a_{ij}^{-1}\sigma_j^2$ and $r_{ij} = (\sigma_{ij}^2)^{-1}$, for $i = 0,1,\ldots,p-1$ and $j = 1,\ldots,k$, where $a_{ij}^{-1}$ is the ith diagonal element of $(X_j'X_j)^{-1}$ and $\hat{\beta}_{ij}$ is the least squares estimate of $\beta_i$ in the jth block. The optimal weights that minimize $\mathrm{var}(\tilde{\beta}_i)$, where $\tilde{\beta}_i = \sum_{j=1}^{k} w_{i,j}\hat{\beta}_{ij}$, are $w_{i,j} = r_{ij}\big/\sum_{j=1}^{k} r_{ij}$, with minimum variance $\left(\sum_{j=1}^{k} r_{ij}\right)^{-1}$.
Proof. Recall that for fixed $i = 0,1,\ldots,p-1$ and $j = 1,\ldots,k$, the variance of $\hat{\beta}_{ij}$ is $\sigma_{ij}^2$; thus to minimize

$$\mathrm{var}(\tilde{\beta}_i) = \sum_{j=1}^{k} w_{i,j}^2\,\sigma_{ij}^2, \qquad (4.1)$$

we must choose $0 \le w_{i,j} \le 1$ with $\sum_{j=1}^{k} w_{i,j} = 1$ such that (4.1) attains its minimum value. Using the Lagrange multiplier method, let

$$L(w,\lambda) = \sum_{j=1}^{k} w_{i,j}^2\,\sigma_{ij}^2 - \lambda\left(\sum_{j=1}^{k} w_{i,j} - 1\right),$$

where $\lambda$ is the Lagrange multiplier. Solving $\frac{\partial L}{\partial w_{i,j}} = 2 w_{i,j}\sigma_{ij}^2 - \lambda = 0$ and $\frac{\partial L}{\partial \lambda} = \sum_{j} w_{i,j} - 1 = 0$, we obtain the optimal weights

$$w_{i,j} = \sigma_{ij}^{-2}\bigg/\sum_{j=1}^{k}\sigma_{ij}^{-2} = r_{ij}\bigg/\sum_{j=1}^{k} r_{ij},$$

and the resulting minimum variance is

$$\mathrm{var}\left(\sum_{j=1}^{k} w_{i,j}\hat{\beta}_{ij}\right) = \sum_{j=1}^{k} w_{i,j}^2\,\sigma_{ij}^2 = \sum_{j=1}^{k}\frac{r_{ij}^2}{\left(\sum_{j=1}^{k} r_{ij}\right)^2}\,\sigma_{ij}^2 = \left(\sum_{j=1}^{k} r_{ij}\right)^{-1}.$$

This completes the proof. □
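A minimal numerical sketch of Theorem 1, assuming hypothetical block variances: the precision weights $r_{ij}/\sum_j r_{ij}$ concentrate on the low-variance blocks, and the variance of the weighted average equals $(\sum_j r_{ij})^{-1}$, which is never larger than what equal weighting gives.

```python
import numpy as np

# Hypothetical variances sigma_ij^2 of the estimates of one coefficient in k = 4 blocks
sigma2 = np.array([1.0, 10.0, 100.0, 1000.0])
r = 1.0 / sigma2                              # precisions r_ij = 1 / sigma_ij^2

w_opt = r / r.sum()                           # optimal weights of Theorem 1
w_eq = np.full(r.size, 1.0 / r.size)          # equal weights, for comparison

var_opt = np.sum(w_opt**2 * sigma2)           # variance of the optimally weighted average
var_eq = np.sum(w_eq**2 * sigma2)             # variance under equal weighting

print(w_opt)                                  # weights favour the precise blocks
print(var_opt, 1.0 / r.sum())                 # identical, as Theorem 1 states
print(var_eq / var_opt)                       # variance inflation from equal weighting
```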
Under the common assumptions of normality and equal variance, the distribution of the resulting estimate is a t-distribution, as given in the theorem below.

Theorem 2. When $\sigma_j^2 = \sigma^2$ for all $j = 1,\ldots,k$, then $\tilde{\beta}_i^H = \sum_{j=1}^{k} a_{ij}\hat{\beta}_{ij}\big/\sum_{j=1}^{k} a_{ij}$, for $i = 0,1,\ldots,p-1$, and $\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2}\left(\tilde{\beta}_i^H - \beta_i\right)$ has a t-distribution with $k(n-p)$ degrees of freedom, where $\hat{r}_{ij} = \left(a_{ij}^{-1}\hat{\sigma}^2\right)^{-1}$, $\hat{\sigma}^2 = \sum_{j=1}^{k}\hat{\sigma}_j^2\big/k$, and $\hat{\sigma}_j^2 = \sum_{l=1}^{n}\left(y_{l,j} - x_{l,j}\hat{\beta}_j\right)^2/(n-p)$ is the variance estimate in block j, with $x_{l,j} = (1, x_{j,l,1},\ldots,x_{j,l,p-1})$.

Proof. If $\sigma_j^2 = \sigma^2$ for all $j$, then $\sigma_{ij}^2 = a_{ij}^{-1}\sigma^2$ and $r_{ij} = a_{ij}/\sigma^2$. The optimal weights in Theorem 1 then reduce to $w_{i,j} = a_{ij}\big/\sum_{j=1}^{k} a_{ij}$, which are constants. Hence the best weighted estimates of $\beta_i$ are $\tilde{\beta}_i^H = \sum_{j=1}^{k} w_{i,j}\hat{\beta}_{ij}$, which are normally distributed with mean $E(\tilde{\beta}_i^H) = \beta_i$ and $\mathrm{var}(\tilde{\beta}_i^H) = \left(\sum_{j=1}^{k} r_{ij}\right)^{-1}$, i.e.,

$$\left(\sum_{j=1}^{k} r_{ij}\right)^{1/2}\left(\tilde{\beta}_i^H - \beta_i\right) \sim N(0,1).$$

Note that $(n-p)\hat{\sigma}_j^2/\sigma^2$ has a Chi-square distribution with $n-p$ degrees of freedom for $j = 1,\ldots,k$, and $\hat{\sigma}_1^2,\ldots,\hat{\sigma}_k^2$ are independent. It follows that $k(n-p)\hat{\sigma}^2/\sigma^2 = (n-p)\sum_{j=1}^{k}\hat{\sigma}_j^2/\sigma^2$ has a Chi-square distribution with $k(n-p)$ degrees of freedom. Thus

$$\left(\sum_{j=1}^{k} r_{ij}\right)^{1/2}\left(\tilde{\beta}_i^H - \beta_i\right)\bigg/\left(\frac{\hat{\sigma}^2}{\sigma^2}\right)^{1/2} \sim T_{k(n-p)},$$

due to the independence of $\tilde{\beta}_i^H$ and $\hat{\sigma}^2$, which holds because the $\hat{\beta}_{ij}$ are independent of $\hat{\sigma}_j^2$ within each block, and the blocks are independent of one another. Finally,

$$\left(\sum_{j=1}^{k} r_{ij}\right)^{1/2}\left(\tilde{\beta}_i^H - \beta_i\right)\bigg/\left(\frac{\hat{\sigma}^2}{\sigma^2}\right)^{1/2} = \left(\sum_{j} a_{ij}\right)^{1/2}\frac{\tilde{\beta}_i^H - \beta_i}{\hat{\sigma}} = \left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2}\left(\tilde{\beta}_i^H - \beta_i\right),$$

which completes the proof. □
Furthermore, without the assumption of equal variance, the asymptotic distribution of the resulting estimate under normality can be derived, as given in Theorem 3.

Theorem 3. When the $\sigma_j^2$ are unequal for $j = 1,\ldots,k$, let

$$\tilde{\beta}_i^H = \sum_{j=1}^{k}\hat{w}_{i,j}\hat{\beta}_{ij}, \qquad \hat{w}_{i,j} = \frac{\hat{r}_{ij}}{\sum_{j=1}^{k}\hat{r}_{ij}},$$

with $\hat{r}_{ij} = \left(a_{ij}^{-1}\hat{\sigma}_j^2\right)^{-1}$. Then

$$\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2}\left(\tilde{\beta}_i^H - \beta_i\right)\ \xrightarrow{d}\ N(0,1), \quad \text{as } n \to \infty.$$

Proof. Since, for all $i = 0,1,\ldots,p-1$, the $\hat{\beta}_{ij} \sim N(\beta_i, \sigma_{ij}^2)$ are independent for $j = 1,\ldots,k$, we have $\tilde{\beta}_i = \sum_{j} w_{i,j}\hat{\beta}_{ij} \sim N\!\left(\beta_i, \left(\sum_{j=1}^{k} r_{ij}\right)^{-1}\right)$, or equivalently

$$\left(\sum_{j=1}^{k} r_{ij}\right)^{1/2}\left(\tilde{\beta}_i - \beta_i\right) \sim N(0,1). \qquad (4.2)$$

Note that for each block $j = 1,\ldots,k$, as $n \to \infty$, $\hat{\sigma}_j^2 \xrightarrow{p} \sigma_j^2$, so $\hat{r}_{ij} = \left(a_{ij}^{-1}\hat{\sigma}_j^2\right)^{-1} \xrightarrow{p} r_{ij} = \left(a_{ij}^{-1}\sigma_j^2\right)^{-1}$ and $\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2} \xrightarrow{p} \left(\sum_{j=1}^{k} r_{ij}\right)^{1/2}$ for fixed k, i.e.,

$$\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2}\bigg/\left(\sum_{j=1}^{k} r_{ij}\right)^{1/2} \xrightarrow{p} 1, \quad \text{as } n \to \infty. \qquad (4.3)$$

Therefore, (4.2) and (4.3) together yield

$$\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2}\left(\tilde{\beta}_i - \beta_i\right) = \left[\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2}\bigg/\left(\sum_{j=1}^{k} r_{ij}\right)^{1/2}\right]\left(\sum_{j=1}^{k} r_{ij}\right)^{1/2}\left(\tilde{\beta}_i - \beta_i\right)\ \xrightarrow{d}\ N(0,1), \quad \text{as } n \to \infty,$$

by Slutsky's Theorem (Casella and Berger [4]). So we have

$$\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2}\left(\tilde{\beta}_i - \beta_i\right)\ \xrightarrow{d}\ N(0,1), \quad \text{as } n \to \infty.$$

Moreover, since for each $j = 1,\ldots,k$ and $i = 0,1,\ldots,p-1$, as $n \to \infty$, $\hat{\beta}_{ij} \xrightarrow{p} \beta_i$ and $\hat{r}_{ij} \xrightarrow{p} r_{ij}$, we have $\tilde{\beta}_i \xrightarrow{p} \beta_i$ and $\tilde{\beta}_i^H \xrightarrow{p} \beta_i$. Thus $\tilde{\beta}_i^H - \tilde{\beta}_i \xrightarrow{p} 0$ as $n \to \infty$, for $i = 0,1,\ldots,p-1$. Again by Slutsky's Theorem, it follows that, as $n \to \infty$,

$$\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2}\left(\tilde{\beta}_i^H - \beta_i\right) = \left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{1/2}\left(\tilde{\beta}_i^H - \tilde{\beta}_i + \tilde{\beta}_i - \beta_i\right)\ \xrightarrow{d}\ N(0,1). \qquad \square$$
Since the sample size within each block is large, Theorem 3 provides theoretical support for our proposed estimate. In summary, our procedure can be formulated as follows:

• Step 1. Divide the original N observations into k blocks, each consisting of n samples, i.e., N = k × n. The value n is chosen as large as the computer memory allows for performing the regression analysis.
• Step 2. Perform a standard regression analysis for each block.
• Step 3. Combine the resulting estimates using the optimal weights, as given in Theorem 1.
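The three steps can be assembled into a short routine. The sketch below is our own illustration of the procedure under the stated assumptions (normal errors, per-block variance estimates, and the estimated optimal weights of Theorem 3); the function name block_weighted_regression, the SciPy dependency, and the toy data generator are not from the paper.

```python
import numpy as np
from scipy.stats import norm

def block_weighted_regression(blocks, alpha=0.05):
    """Step 1 is assumed done: `blocks` yields (Xj, yj), one block at a time,
    with the intercept column already in Xj.  Step 2 fits OLS within each
    block; Step 3 combines the estimates with the estimated optimal weights
    r_ij = (a_ij^{-1} * sigma_j^2)^{-1} and returns approximate normal
    (1 - alpha) confidence intervals based on Theorem 3."""
    betas, precisions = [], []
    for Xj, yj in blocks:
        n, p = Xj.shape
        XtX_inv = np.linalg.inv(Xj.T @ Xj)
        bj = XtX_inv @ (Xj.T @ yj)                 # per-block least squares estimate
        resid = yj - Xj @ bj
        s2j = resid @ resid / (n - p)              # block variance estimate
        a_inv = np.diag(XtX_inv)                   # a_ij^{-1}: diagonal of (Xj'Xj)^{-1}
        betas.append(bj)
        precisions.append(1.0 / (a_inv * s2j))     # estimated precisions r_ij
    B, R = np.array(betas), np.array(precisions)   # both k x p
    w = R / R.sum(axis=0)                          # estimated optimal weights per coefficient
    beta_hat = np.sum(w * B, axis=0)
    se = 1.0 / np.sqrt(R.sum(axis=0))              # (sum_j r_ij)^{-1/2}
    z = norm.ppf(1 - alpha / 2)
    return beta_hat, np.column_stack([beta_hat - z * se, beta_hat + z * se])

# Hypothetical use: three blocks with unequal error variances.
rng = np.random.default_rng(2)
def gen_block(n, sigma):
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, 2))])
    y = X @ np.array([3.0, -2.0, 1.0]) + sigma * rng.standard_normal(n)
    return X, y

est, ci = block_weighted_regression(gen_block(2000, s) for s in (1.0, 5.0, 10.0))
print(est)   # close to (3, -2, 1); each row of `ci` is an approximate 95% interval
```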
5. Simulation study and example

In this section, we perform a simulation study to illustrate the proposed method. All computation was done on a Pentium IV PC. The regression coefficients (β0, β1, . . ., β5) were randomly generated from N(0, 100), and the values of the independent variables xt,i, t = 1, . . ., N and i = 1, . . ., 5, were generated from U(0, 10). All these coefficients and independent variables were then kept fixed throughout the simulation. Datasets of size N = 1 000 000 were generated from the model

$$y_t = 29.223 - 8.367x_{t,1} + 6.567x_{t,2} + 9.167x_{t,3} - 6.000x_{t,4} + 2.766x_{t,5} + \epsilon_t,$$

where the $\epsilon_t$'s are i.i.d. N(0, 1). Interval estimates based on Theorem 2 with various numbers of blocks, k = 10, 100, 1000 and 10 000, were computed. We study the coverage probability of the interval estimate, i.e.,

$$p = \Pr\left(\tilde{\beta}_i^H - z_{\alpha/2}\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{-1/2} \le \beta_i \le \tilde{\beta}_i^H + z_{\alpha/2}\left(\sum_{j=1}^{k}\hat{r}_{ij}\right)^{-1/2}\right),$$

where $\tilde{\beta}_i^H$ and $\hat{r}_{ij}$ are defined in Theorem 2 and $z_{\alpha/2}$ is the $100(1-\alpha/2)$-percentile of the standard normal distribution. Theoretically, p is approximately equal to 1 - α. For a given nominal confidence level 1 - α, the length of the confidence interval depends on the standard error of the estimate $\tilde{\beta}_i^H$, so a shorter interval is preferred. Since the results are similar for all βi's, only the results for β1 are reported here.

Table 1 lists the coverage probabilities for β1 as well as the CPU time taken, over 100 simulation runs, for various combinations of k and n. It can be seen that the coverage probabilities are far below the nominal level (95%) for large n (small k). This is especially true for the case of k = 1 (no blocking at all). Similar problems are encountered in hypothesis testing due to the so-called Lindley paradox (Lindley [12]; Berger [3]). Intuitively, the standard error of an estimate usually decreases at the rate $1/\sqrt{n}$ as the sample size n increases. Therefore, an interval estimate, whose length depends on the standard error of the point estimate, becomes too narrow to cover the true parameter when the sample size n is huge. In such cases, computational roundoff errors result in an interval too short to cover the true parameter. Thus, a huge sample size can result in lower coverage probability of an interval estimate. That is the motivation of this paper. As the sample size employed in each block is reduced, many desirable properties become clear: the coverage probabilities tend to the nominal level, the computation is rather faster, and, more importantly, much less memory space is needed. These gains are reversed as the block size increases.

Table 1
Simulation results of the coverage probabilities (C.P.) of the 95% interval estimates of β1 for different block settings (N = k · n = 1 000 000)

k        n          C.P.   CPU (s)
1        1 000 000  .32    319.43
10       100 000    .89    313.93
100      10 000     .96    311.85
1000     1000       .95    311.74
10 000   100        .98    348.77

Indeed, for larger N, for example N = 10 000 000, the PC is not able to execute the program without breaking up the data. Table 2 shows the results for N = 10 000 000 from the same regression model as described previously, along with the averages of the point estimates as well as the interval estimates. Similarly, the estimation results for too large an n are not satisfactory. For moderate n, both the point estimates and the interval estimates are quite accurate. This shows that the proposed method provides a good estimation procedure when the classical regression method cannot be handled by a regular computer.

Table 2
Simulation results of $\tilde{\beta}_1^H$ and the coverage probabilities (C.P.) of the 95% interval estimates of β1 for different block settings for N = k · n = 10 000 000

k        n          C.P.   $\tilde{\beta}_1^H$   Interval estimate
10       1 000 000  .36    -8.371                (-8.374, -8.367)
100      100 000    .93    -8.368                (-8.371, -8.364)
1000     10 000     .96    -8.367                (-8.371, -8.363)
10 000   1000       .97    -8.367                (-8.372, -8.363)

To see the accuracy of the estimates, we also compare the average lengths of the confidence intervals between the proposed method with the (estimated) optimal weights (Theorem 3) and the method with equal weights when the variances are unequal among blocks. In this study, we used the same regression coefficients and predictor variables as those in Table 1 with N = 1 000 000, and the variances of the blocks were randomly selected from {1, 10, 100, 1000}. The ratios of the average lengths produced by the estimates of β1 using the equal weights to those using the optimal weights, as well as the coverage probabilities and averaged interval estimates, are shown in Table 3 for various allocations.

Table 3
Comparison of the interval estimates of β1 between equal weights and (estimated) optimal weights for different block settings (N = k · n = 1 000 000)

k      n         CPequal (interval)        CPopt (interval)          Ratio of lengths
10     100 000   .93 (-8.377, -8.356)      .91 (-8.369, -8.366)      7.49
100    10 000    .91 (-8.376, -8.354)      .96 (-8.368, -8.365)      8.60
1000   1000      .94 (-8.378, -8.355)      .97 (-8.368, -8.365)      8.77
We see that the proposed estimates resulted in much shorter interval estimates, although both methods closely attained the nominal coverage probability. This confirms that the proposed method gives more accurate parameter estimation. We also tried different distributions for xt,j and βk, and the results were all similar.

Finally, a real dataset of credit card information was analyzed. The data were taken from a database collected by the Marketing Survey Center at Fu-Jen Catholic University in Taiwan. There were N = 2 000 000 observations, each containing nine attributes, namely, y = monthly credit card bill, x1 = frequency of using the credit card, x2 = gender, x3 = education level, x4 = number of persons in the family, x5 = age, x6 = monthly income, x7 = monthly expense and x8 = monthly family income of the card holder. We applied the proposed block weighted least squares method with k = 1000 and n = 2000 to obtain the fitted model:

$$\hat{y} = 59701.77 - 453.87x_1 - 3804.66x_2 - 1170.48x_3 + 794.79x_4 + 10.13x_5 + 0.09x_6 - 0.08x_7 + 0.16x_8.$$

Note that the fitted models do not change much for k × n equal to 200 × 10 000, 500 × 4000, or 1000 × 2000, while our computer cannot handle the job for the whole dataset without blocking unless we clear out the entire memory space.

6. Discussion and conclusion

In the coming information era, massive datasets are becoming commonplace. While many statistical methods are available for data mining, they are typically not applicable to massive datasets. In this paper, we propose an optimal estimation procedure for the parameters when the regression analysis is run by blocks on a large dataset. The proposed procedure effectively reduces the required amount of computing memory without loss of efficiency. It is readily applicable to many real-life applications. Asymptotic properties of the resulting estimators have been studied and asymptotic normality has been established. The proposed procedure is shown, theoretically and empirically, to be optimal in the sense of minimizing the variance of the estimates. Simulation studies have also indicated the usefulness of the proposed approach. A real example successfully confirms that the proposed method is valid and somewhat robust if appropriate blocking is used such that the sub-sample size is moderate. The constant variance assumption may not be realistic in practice, but blocking indeed mixes up the variances, and our approach is suitable for unequal variances between blocks.

An alternative procedure is to use only a subset of the dataset, a sampling survey approach. The sampling scheme then becomes crucial and more sophisticated, which deserves further study. It is also true that transformation of some variables might be needed in practice, but that is not the major concern of this paper. Here we have addressed only the computational issue for large datasets in the regression model. The problems of variable transformation and/or variable selection deserve further study.

Many well-established statistical methods have proved to be powerful for data analysis, but they may not be applicable to data mining, mainly because of the size of the datasets. Further research is needed to apply these techniques to massive datasets. In this paper, we have focused on the popular regression analysis. Another important topic to be studied is clustering analysis, whose standard algorithms are of order O(n²) – clearly not applicable to massive datasets.
Acknowledgements

The authors would like to thank the referees for their constructive comments. This work was supported by the MOE program for promoting academic excellence of universities under grant number 91-H-FA07-1-4.
References

[1] S. Balakrishnan, D. Madigan, A one-pass sequential Monte Carlo method for Bayesian analysis of massive datasets, Bayesian Analysis 1 (2006) 345–362.
[2] M.S. Bartlett, The use of transformations, Biometrics 3 (1947) 39–52.
[3] J.O. Berger, Statistical Decision Theory and Bayesian Analysis, second ed., Springer-Verlag, New York, 1985.
[4] G. Casella, R.L. Berger, Statistical Inference, Wadsworth, Belmont, California, 1990.
[5] N.R. Draper, H. Smith, Applied Regression Analysis, third ed., Wiley, New York, 1998.
[6] J. Elder, D. Pregibon, A statistical perspective on knowledge discovery in databases, in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996 (Chapter 4).
[7] C.D. Glymour, D. Madigan, D. Pregibon, P. Smyth, Statistical themes and lessons for data mining, Data Mining and Knowledge Discovery 1 (1) (1997) 11–28.
[8] D.J. Hand, G. Blunt, M.G. Kelly, N.M. Adams, Data mining for fun and profit, Statistical Science 15 (2000) 111–131.
[9] C. Hurley, R. Modarres, Low-storage quantile estimation, Computational Statistics 10 (1995) 311–325.
[10] G. Ridgeway, D. Madigan, A sequential Monte Carlo method for Bayesian analysis of massive datasets, Data Mining and Knowledge Discovery 7 (2003) 301–319.
[11] R. Li, D.K.J. Lin, B. Li, Statistical inference on large data sets, Knowledge Discovery, in press.
[12] D.V. Lindley, A statistical paradox, Biometrika 44 (1957) 187–192.
[13] G.S. Manku, S. Rajagopalan, B.G. Lindsay, Approximate medians and other quantiles in one pass and with limited memory, in: Proc. ACM SIGMOD International Conference on Management of Data, June 1998, pp. 426–435.
[14] J. Pagter, T. Rauhe, Optimal time-space trade-offs for sorting, in: Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS), November 1998, pp. 264–268.
[15] M.M. Zen, Y.H. Lin, D.K. Lin, Simple linear regression for large data sets, Journal of Chinese Statistical Association 41 (2003) 401–420.

Professor T.H. Fan is full professor and director at the Graduate Institute of Statistics at National Central University, Taiwan. She received her Ph.D. degree in Statistics from Purdue University. Her research interests include Bayesian statistics, regression analysis, data mining and statistical applications.
Dr. Dennis Lin is a University Distinguished Professor of Supply Chain and Statistics at Penn State University. His research interests are quality engineering, industrial statistics, data mining and response surface methodology. He has published over 100 papers in a wide variety of journals. He serves (or has served) as associate editor for Statistica Sinica, The American Statistician, Journal of Data Science, Quality Technology & Quantitative Management, Journal of Quality Technology, and Taiwan Outlook, and has chaired the Membership Committee as well as SPES (Section on Physical and Engineering Sciences) of the American Statistical Association. Dr. Lin is an elected fellow of the American Statistical Association (ASA), an elected member of the International Statistical Institute (ISI), an elected fellow of the American Society for Quality (ASQ), a lifetime member of the International Chinese Statistical Association (ICSA) and a fellow of the Royal Statistical Society, and has received the Most Outstanding Presentation Award from SPES, ASA. He is an honorary chair professor at various universities, including National Chengchi University (Taiwan), Renmin University of China, Fudan University and the Xi'an Statistical Institute.
Professor K.F. Cheng is a distinguished professor of statistics at National Central University, Taiwan (ROC). His research interests include regression analysis, statistical genetics, and quantitative methods for finance. He is an elected member of the International Statistical Institute and editor of the Journal of Chinese Statistical Association.