Big Data in context and robustness against heterogeneity

J.S. Marron

Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599-3260, USA
Article history: Received 22 March 2016; Revised 5 June 2016; Accepted 7 June 2016

Keywords: Big data; Robustness against heterogeneity
Abstract

The phrase Big Data has generated substantial current discussion within and outside of the field of statistics. Some personal observations about this phenomenon are discussed. One contribution is to put this set of ideas into a larger historical context. Another is to point out the related important concept of robustness against data heterogeneity, to describe some earlier methods which have that property, and to discuss a number of interesting open problems motivated by this concept.
1. Introduction and organization

Everyone in the quite large community of people involved with data analysis has recently been hearing the phrase Big Data repeatedly. This community includes both classically trained statisticians and also a host of others, some of whom have been doing data analysis for many years and others who are newcomers to the topic. The newcomers have generated some new viewpoints about Big Data, and have also given new life to some earlier ideas and viewpoints. An interesting example in the popular literature is Mayer-Schönberger and Cukier (2014). That monograph is filled with interesting anecdotes, and provides a useful dichotomy of data analysis approaches into Causal and Correlational methods.

The goal of Causal data analysis methods is to understand the underlying causes and drivers behind a phenomenon being scientifically studied. This is central to a body of thought that goes back to the ancient Greeks and is called the Scientific Method. For more than 100 years, the large community of statisticians has been engaged in the development of a well defined set of protocols for this approach. The simplest version involves the following deliberately ordered steps:

1. Formulate a scientific hypothesis.
2. Collect a data set, ideally in a carefully designed fashion (typically needed to investigate causation).
3. Analyze the data with the goal of understanding how well it verifies the hypothesis.

Correlational data analysis methods have rather different goals. Instead of searching for underlying causes, this type of analysis focuses merely on discovering connections. While this concept may be alarming to scientists, because non-reproducible spurious correlations are frequently found, there have been some great successes of this type. A canonical example is recognition software for speech (and other data types such as handwriting). This is currently of sufficient quality that it is now widely deployed, for example in the context of telephone call centers. Note that such software does not use any insights about how the sound waves that constitute speech are turned into neuronal impulses which are interpreted inside the brain, but instead is simply a black box that has been trained on a large database of previously classified sounds.
These issues, and the comparison between Causal and Correlational analyses, are put into an historical context in Section 2. Section 3 discusses new conceptual models that are relevant for some types of Big Data, including the notion of data heterogeneity, which motivates the need to find methods that are robust against that challenge.

A reviewer correctly pointed out that the view of Big Data discussed here is quite personal, and that others would discuss completely different aspects of the area. My views are driven by the data I currently work with, which come mostly from cancer and other types of biomedical research. In those domains, the main Big Data challenges are about very high dimensionality, so that is what is addressed here. Other important Big Data challenges include high volume data, in some cases so large that even simple statistical summaries cannot be exactly computed. See e.g. Bousquet and Bottou (2008) for good discussion. Related challenges have motivated Divide and Conquer approaches and some interesting new mathematical statistics, see Zhang et al. (2013), Chen and Xie (2014), Zhao et al. (2016) and Shang and Cheng (2015).

While the current interest in Big Data has been useful in attracting more than usual attention, and perhaps research funding, to statistics and other types of data analysis, it is important to keep in mind that a less well ballyhooed but in fact larger challenge is Complex Data. While typical data sets are indeed getting bigger, usually driven by rapidly improving data collection capabilities, they are also getting much more complex in many different ways. A useful framework for thinking about this is Object Oriented Data Analysis, using terminology coined in Wang and Marron (2007). As noted in Lu et al. (2014) and Marron and Alonso (2014), this terminology is quite useful in complex data analysis settings, because it provides a way of thinking that facilitates essential interdisciplinary discussion about how best to extract needed information from complex data sets.

2. Big Data and some history

Naive readers of Mayer-Schönberger and Cukier (2014) might come away with the impression that correlational ideas are very new. But this is incorrect, as can be seen by reviewing some history. For over 100 years, going back at least to Pearson (1900), but really to a collection of researchers of that era who formalized the use of data in the scientific method, causal analysis methods have been the foundation of the large and continually evolving academic field now known as statistics. While causal methods have traditionally been the domain of the statistics community, for quite some time there have been data analysts, working outside the statistical community, who have frequently invented and explored methods of a more correlational nature. A relevant issue is that these apparently peripheral fields have typically been masters at marketing and obtaining research funding, with an often larger research budget at the US National Science Foundation than all of statistics. Here are some historical high points of these areas:

• Statistical pattern recognition. This was a popular research area in the 1960s, as seen in the seminal work by Duda et al. (1973), which developed the precursors of modern voice recognition software.
• Artificial intelligence. See McCorduck et al. (1977) for an insider's view of important aspects of this area. Much of the wide publicity focused on some rather over-hyped approaches to the challenge of medical diagnosis.
At one point there were extravagant claims that computers would replace physicians by giving improved medical diagnoses. Of course that task proved to be far more challenging than was originally realized. While work continues in that direction, and there are constant improvements, such computational tools are now more sensibly marketed as intelligent assistants.
• Neural networks. See Cochocki and Unbehauen (1993) for a good overview of this field. The big appeal here was a set of algorithms which attempt to duplicate, at some level, the architecture of neuronal structure that has been so successful in the human brain. There were some big successes early on, such as major progress in voice recognition. As with many over-advertised fields, this lost substantial popularity after it was realized that it would not solve all data analytic problems. However, in the age of Big Data it is interesting to note that this methodology has experienced a new burst of popularity, under the new name of Deep Learning. An important lesson may be that while neural nets will not provide good solutions to all classification problems, they are very competitive in situations where the training data set is sufficiently large. In particular, these methods appear to have automatic adaptation properties which give very big benefits in large volume settings, but result in overfitting for smaller data sets.
• Data Mining. A high profile description of this area can be found in Fayyad et al. (1996). At the time, the name itself seemed to have been deliberately provocative, as it had previously been used in the statistical community as a pejorative term for an analysis that did not follow the scientific method protocol enumerated above. The poster child success in this area was the discovery, in supermarket scanner data, of a clear correlation between purchases of baby diapers (called nappies in some cultures) and beer, although current internet sentiment calls into question whether this was an actual data discovery.
• Machine learning. This area can be viewed as a collection of methods for doing various types of data analysis that are based on optimization ideas, instead of the more classical statistical methods based on probability distributions. A major landmark in this direction was the elegant Support Vector Machine of Vapnik (1982), which spawned a large number of useful variations; see Cristianini and Shawe-Taylor (2000) or Schölkopf and Smola (2001) for good overviews.

Thinking about all of these data analytic approaches together, note that there are some differences between them. Also, all of them still have journals and an active research community. As noted above, each has had its times of major hype, and for a while received a perhaps inordinately large amount of research funding. The timing between them is roughly in intervals of 5–10 years, so from this perspective it may not be surprising that Big Data has come along when it did.
Now some of these were driven by truly breakthrough ideas, perhaps most notably neural networks and machine learning. Others, such as Data Mining and Big Data, seem much less impressive in terms of really exciting new ideas, but instead seem to focus more on the correlational versus causal issue. Mayer-Schönberger and Cukier (2014) offer some interesting views on the latter topic. An important one is that they correctly point out that modern statistics courses and textbooks are completely focused on causal methodologies, and discuss correlational methods only very briefly, if at all. They go on to predict that future research will actually be dominated by correlational approaches, with causal ideas soon to be relegated to the past. But an intermediate view seems preferable here. For example, the whole reason that the purely correlational discovery of diapers and beer is so appealing is that, at the end of the day, we intuitively understand it from a causal perspective. Human intuition is such that scientifically valid knowledge only comes from eventually understanding phenomena at a causal level. For this reason the prediction of the demise of causal inference seems overblown. However, the suggestion of Mayer-Schönberger and Cukier (2014) that statistical education should change to incorporate more instruction on correlational methods is on target. Such techniques have given us in the past (and will continue to give us in the future) very important discoveries. An example is diapers and beer, where the apparent discovery was made via pure correlation. Note that this suggests a causal driver through the lurking variable of fatherhood, which could be further studied. As illustrated here, both approaches, and in fact their interaction, should be important players. It seems likely that the most successful future scientists will be well trained in both.

3. Data heterogeneity

An interesting issue is how classically trained statisticians should get engaged in the current research efforts in the world of Big Data. There are a number of opportunities here. An important point is that the statistics community has a very large and diverse tool box to bring to the table. Often within statistics there have been arguments of various types, such as parametric versus nonparametric, Bayesian versus frequentist, likelihood versus robust, etc. But the totality of all of these should be regarded as an impressively strong, diverse and well developed set of ideas, methods and approaches that can and should be brought to bear upon Big Data challenges. Another potentially important contribution is that, over the long time that classical statistical methods have been under development, we have come to recognize and point out many pitfalls which may not be immediately obvious to beginners, such as multiple comparisons and lurking variable issues. Finally, statistical engagement in Big Data is important to avoid wasting research funding on the re-invention of ideas that are already well known.

Another important Big Data arena where statisticians should play an active role is the parallelization of statistical software. While lots of statistical software is under active development, maybe most notably in the R Project (Gentleman et al., 2003), much of it tends to be structured around single processor computation. However, the demands of Big Data are such that multi-processor platforms are where the future will lie. Hence there is a strong need for the development of software of that type, which has motivated most major software packages to start providing multi-processor options.
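To make the Divide and Conquer idea mentioned above concrete, here is a minimal sketch, assuming Python with numpy and the standard multiprocessing module. It illustrates only the general split, estimate and combine pattern, not an implementation from any of the cited papers, and all names and numbers are illustrative.

```python
# Hedged sketch of a divide-and-conquer estimator: split the data into
# chunks, estimate on each chunk (in parallel), then average the results.
# This is only an illustration of the general idea, not code from any of
# the papers cited above.
import numpy as np
from multiprocessing import Pool

def ols_on_chunk(chunk):
    """Ordinary least squares fit on one chunk of (X, y)."""
    X, y = chunk
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def divide_and_conquer_ols(X, y, n_chunks=8):
    """Average the per-chunk OLS estimates (a simple combination rule)."""
    chunks = list(zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks)))
    with Pool() as pool:
        betas = pool.map(ols_on_chunk, chunks)
    return np.mean(betas, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 100_000, 5
    X = rng.standard_normal((n, d))
    beta_true = np.arange(1, d + 1, dtype=float)
    y = X @ beta_true + rng.standard_normal(n)
    print(divide_and_conquer_ols(X, y))   # close to beta_true
```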
Finally, the statistical community is well positioned to bring useful conceptual models to the Big Data effort. A fundamental concept is data heterogeneity, which has been considered in various contexts at least since the early 1990s, see Kim and Seo (1991). Data heterogeneity has recently been nicely set in the Big Data context by Meinshausen and Bühlmann (2014). Their important idea is to replace the conventional statistical thought model of the Gaussian probability distribution, usually written as

\[
N_d(\mu, \Sigma), \tag{1}
\]

where $d$ is the dimension, $\mu$ is a $d \times 1$ mean vector and $\Sigma$ is a $d \times d$ covariance matrix, with instead a Gaussian mixture distribution of the form

\[
\sum_{j=1}^{J} w_j \, N_d(\mu_j, \Sigma_j). \tag{2}
\]
The Gaussian model has been a major workhorse in the development of classical statistics, and indeed has become a time proven model, essentially because of the Central Limit Theorem. While the latter nominally applies to averages, it also indicates why the model (1) has been so useful in many data contexts: many types of data, such as those involving measurement error or even counts, can be thought of as representing the sum of many small independent random shocks, and thus are well modeled using (1). However, there is a critical assumption, namely that the data have been gathered under the same experimental conditions. This assumption tends to be very easily satisfied for classically designed experiments, such as those described in Section 1. However, Big Data often presents a stark contrast, because frequently such data sets are composed by concatenating several smaller data sets, often involving differing experimental conditions and design choices. Now for each individual experiment the Gaussian model (1) is usually appropriate, but typically the means and covariances will vary across experiments, so the mixture model (2) becomes more appropriate, where J is the number of experiments being combined. Large collaborative studies such as The Cancer Genome Atlas (Weinstein et al., 2013), in which most of these experimental design issues have been carefully controlled through painstaking multi-center negotiation of methods and protocols, have yielded a few Big Data sets, but such efforts remain a rarity. Moreover, even when the design is that careful, additional unknown and uncontrollable biological factors are always to be expected.
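As a minimal sketch of how concatenation produces data of the form (2), the following Python code (assuming only numpy; all numbers are illustrative) simulates J experiments that each individually follow model (1) with their own mean, and compares per-experiment summaries with pooled ones.

```python
# Minimal sketch of the heterogeneity model (2): J "experiments", each
# individually well modeled by a Gaussian (1), are concatenated into one
# Big Data set.  The pooled data follow a Gaussian mixture, not a single
# Gaussian, so pooled summaries can be misleading.
import numpy as np

rng = np.random.default_rng(1)
d, J = 2, 3                                   # dimension and number of experiments
n_j = [200, 500, 300]                         # per-experiment sample sizes
mu_j = [np.zeros(d), np.array([4.0, 0.0]), np.array([0.0, 6.0])]  # shifted means

experiments = [rng.multivariate_normal(mu, np.eye(d), size=n)
               for mu, n in zip(mu_j, n_j)]
pooled = np.vstack(experiments)               # the concatenated Big Data set

# Per-experiment covariances are near the identity ...
print([np.round(np.cov(x.T), 1) for x in experiments])
# ... but the pooled covariance is inflated by the between-experiment
# mean differences, a signature of heterogeneity of the form (2).
print(np.round(np.cov(pooled.T), 1))
```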
Fig. 1. PCA of Breast Cancer microarray data. Symbols indicate different cancer types, showing clear contrast in the PC2 and PC3 scatterplots. Colors show a chip fabrication artifact, which dominates the rest of the analysis, even in the first PC component. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
Note that while heterogeneity is a common artifact of combining information across studies, the issues here are different from the statistical challenge called meta-analysis, see e.g. DerSimonian and Laird (1986). In that situation, much sparser information, such as just p-values and sample sizes, or perhaps coarse summaries such as means and standard deviations, is used in attempts to pool information across studies. The context of the present discussion is different because here all of the original data are available and concatenated into one Big Data set.

While the development of statistical methodologies for dealing with heterogeneity appears to be still in its infancy, some preliminary ideas are already present. These include visualization of heterogeneity, which is studied in Section 3.1, robustness against heterogeneity, which is discussed in Section 3.2, and explicit heterogeneity adjustment, see Section 3.3. Additional discussion, together with the posing of some open research problems, can be found in Section 4.

3.1. Visualization of heterogeneity

Visualization of data is an important preliminary to solid data analysis, which is done less frequently than it should be. Fig. 1 shows an important example of this task, in the context of the Breast Cancer study of Perou et al. (2000). This was an early investigation using microarray data, a set of measurements of gene expression, i.e. biological activity of a set of genes. The data set includes n = 103 patients, and the data objects are d = 5961 dimensional vectors of gene expression values measured for each patient, represented as symbols in Fig. 1. This gives a good impression of how the patients relate to each other, using the graphical device called a Principal Component Analysis (PCA) scatterplot matrix.

As well described in Jolliffe (2005), PCA can be thought of as taking a high dimensional point cloud and projecting it onto a set of (orthogonal) direction vectors. These often give excellent structural insights into the data set because they are chosen to maximize the visual variation. The projection coefficients, often called scores, are essentially the coordinates used in the visualization. The plots on the diagonal of the matrix can be thought of as one dimensional projections of the data, and show the distributions of each set of scores in two ways. First, they are shown as points, where the horizontal axis shows the scores, and the vertical axis provides visual separation of the points. A random ordering would be used in a classical jitter plot, but here the original ordering of the points in the raw data is used instead (as this sometimes provides additional insights). Second, the distribution of the scores is summarized by the black curves, which are smooth histograms, or more precisely kernel density estimates, see Wang and Marron (2007) for a good introduction.
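As a minimal sketch of the PCA scatterplot matrix construction just described (this is not the code behind Fig. 1; it assumes numpy, scikit-learn and matplotlib, and the data and all names are illustrative), one can project onto the first three PC directions and plot the resulting scores:

```python
# Minimal sketch of a PCA scatterplot matrix as described above.  Not the
# code behind Fig. 1; data and names are illustrative.  A kernel density
# estimate could be added on the diagonal (e.g. via scipy.stats.gaussian_kde);
# here the diagonal simply shows the scores against the original case
# ordering, as described in the text.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.standard_normal((103, 5961))          # stand-in for the gene expression matrix
scores = PCA(n_components=3).fit_transform(X) # PC scores: coordinates for the plot

fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for i in range(3):
    for j in range(3):
        ax = axes[i, j]
        if i == j:
            # Diagonal: scores on the horizontal axis, case order vertically.
            ax.plot(scores[:, i], np.arange(len(scores)), ".")
        else:
            # Off-diagonal: 2-d projection onto a pair of PC directions.
            ax.plot(scores[:, j], scores[:, i], ".")
        ax.set_xlabel(f"PC {j + 1} direction")
        ax.set_ylabel(f"PC {i + 1} direction")
fig.tight_layout()
plt.show()
```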
The off diagonal panels of Fig. 1 are scatterplots showing the joint distributions of pairs of scores, i.e. of 2-d projections of the data onto pairs of PC directions. This type of data view can turn up perhaps unexpected aspects, for example the two distinct clusters shown in the row 1, column 2 panel of Fig. 1. The cause of these clusters is explained by the color scheme, which highlights a chip fabrication artifact. The seriousness of this artifact is clear from the fact that the red–blue contrast shows up clearly in the first PC (upper right panel), which is the direction of maximal variation in the data. Yet there is biologically important information in these data, as shown by the symbols, which represent different cancer classes. Note that the row 2, column 3 panel shows that different symbols tend to lie in different regions. These discovered cancer classes led to a clinically important discovery, as detailed in Perou et al. (2000). That discovery was made by clustering methods, which can easily lead to false discoveries, see Liu et al. (2012), Hennig (2015) and Huang et al. (2015). More relevant to this paper is that a major hurdle in that discovery process was the strong data heterogeneity shown using colors in Fig. 1. This motivates devoting substantial thought to handling heterogeneity in Big Data sets.

While data visualization, such as that done in Section 3.1, might reveal heterogeneity in the form of clusters, it cannot be expected to always do so. In particular, the differing subpopulations may not be so separated that they appear as clusters, and yet may still be sufficiently separated to impact a statistical analysis. An approach to this challenge, discussed in Section 3.2, is to develop statistical methods that are robust against heterogeneity in data. Such approaches are especially useful when tackling unknown heterogeneity. When the heterogeneity is known (e.g. in the case of data set concatenation), it is also tempting to consider explicit heterogeneity adjustment, which is discussed in Section 3.3. A perhaps surprising part of that discussion is that even in that context, heterogeneity robustness of the adjustment itself once again appears as a key concept.

3.2. Robustness against heterogeneity

Robust statistics was a very popular research area during the 1970s and 1980s (as popular as sparsity and functional data analysis currently are). Good discussion of various aspects of this can be found in Hampel et al. (2011), Huber (2011), and Staudte and Sheather (2011). A reviewer has attributed the following relevant quote to Brian Ripley: "(robustness) had tremendous growth for about two decades from 1964, but failed to win over the mainstream". Nevertheless, a major contribution of that work was the development of a host of methods that are not seriously impacted by outliers, and of ways of thinking about that challenge. For example, the sample mean is not robust, in the sense of breakdown, because if only one data point is moved to infinity, the mean also goes to infinity. In contrast, the sample median is very robust, i.e. has good breakdown properties, because up to half the data can go to infinity without the median following. See Riani et al. (2014) for some interesting new ideas in this direction.
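The breakdown contrast between the mean and the median can be seen in a few lines. This is a hedged toy illustration with arbitrary numbers, not a formal breakdown-point calculation.

```python
# Hedged toy illustration of breakdown: moving a single observation toward
# infinity drags the sample mean with it, while the sample median barely
# moves.  Names and values here are purely illustrative.
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(100)

for bad_value in (10.0, 1e3, 1e6, 1e9):
    x_contaminated = x.copy()
    x_contaminated[0] = bad_value          # one gross outlier
    print(f"outlier={bad_value:>10.0e}  "
          f"mean={x_contaminated.mean():>12.3e}  "
          f"median={np.median(x_contaminated):>8.3f}")
```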
While robustness against outliers received the most attention, the fundamental ideas actually applied generally to all of the assumptions made in statistical analyses. For example, robustness against the standard sampling assumption of independence was studied by Beran (1991). From this perspective, robustness against heterogeneity can be viewed as a new variation on the original robustness theme.

While this terminology has not always been used, the challenges of data heterogeneity have been apparent for some time, and methods that target this problem have been developed. An important example is Surrogate Variable Analysis (SVA), proposed by Leek and Storey (2007) in the context of linear model regression analysis. The idea there is to model data heterogeneity using unmodeled factors, which could be known, or else can be estimated from the data. While SVA has a goal that is essentially robustness against heterogeneity, serious concerns about its actual robustness in this sense are raised in Section 3.3. A much different approach to robustness against heterogeneity has been proposed by Meinshausen and Bühlmann (2014) and Bühlmann and Meinshausen (2016). This method appears to have much better robustness properties than SVA.

3.3. Heterogeneity adjustment

In the case of known heterogeneity, such as the chip fabrication problem shown using colors in Fig. 1, or when heterogeneity is graphically discovered using data visualization, it is natural to attempt to adjust the data by removing the obvious biases. There are typically two major approaches to this. First, in the context of a specific hypothesis to test, it is natural to take an ANOVA type approach where one controls for the bias as a modeled factor in the analysis. This basic idea underlies the popular COMBAT method of Johnson et al. (2007) and the SVA approach of Leek and Storey (2007). See Leek et al. (2010) for a good overall discussion of this type of approach. Second, one can try to eliminate batch effects for the data set as a whole, without a specific hypothesis in mind. The latter is more challenging, but is essential for many important data analytic tasks, such as exploratory analysis.

A useful approach of the latter type was proposed by Benito et al. (2004), which is based on the Distance Weighted Discrimination (DWD) classification method proposed in Marron et al. (2007). The good result of a DWD adjustment of the Breast Cancer data in Fig. 1, trained on the red and blue chip fabrication artifact shown there, is seen in Fig. 2. This is again a PCA scatterplot matrix, after DWD adjustment. The same cancer class symbols are used, and this time the cancer classes are further highlighted using colors as well. Unlike the unadjusted data, where the first PC component (the direction of maximal variation in the data) was strongly driven by the fabrication artifact, the first 3 PC components now strongly reflect the important underlying biology represented by the cancer classes.
Fig. 2. DWD adjusted Breast Cancer data. Colors and symbols both indicate Cancer Class, which is now the dominant factor in the visualization. This shows the effectiveness of the batch effect adjustment. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
While DWD has proven its effectiveness in a large number of applications, it is natural to wonder if it is worth the complication (it involves serious optimization). For example, could one just as well adjust for the artifacts in Fig. 1 by simply subtracting the means of the red and blue classes (sometimes called the centroid method, which is much simpler to compute than DWD)? Practical experience showed that mean subtraction was much less effective than DWD for this task, but it was not clear for some time why this was the case. That mystery was solved by Liu et al. (2009), who pointed out a problem that was termed at the time unbalanced unknown subtypes.

The main ideas are graphically understood using the toy example in Fig. 3. For illustration purposes, a two dimensional data set is shown in the left panel, with colors representing a phenomenon of interest such as two cancer types, and symbols representing a spurious artifact such as a fabrication effect. Note that in the batch indicated with the plus signs (top part of the plot), there happen to be more blue cases, while in the other batch (shown as circles) there are more red cases. If the respective means of the fabrication batches, shown as the green x symbols, are used to adjust the data, the point clouds will then move (until they overlap each other) in the direction shown as the green dashed line. The impact of this adjustment is shown in the lower right panel. That result is clearly unsatisfactory, because neither the corresponding circles nor plus signs have come together, and, more important, the red circles have begun to overlap the blue plus signs, meaning the critical red–blue difference is now much harder to discern. The DWD adjustment is quite different: the data are shifted along the more vertical magenta dashed line on the left, to give the adjusted data shown in the upper right panel. Note that now the corresponding colors have appropriately come together, and the resulting reds and blues are well separated, so further study based on this adjustment will be useful and insightful.

The challenge presented here, in the terminology of this paper, is one type of data heterogeneity. In particular, there are unbalanced Gaussian mixture components in both the red and the blue biological classes. The fact that DWD is so much more effective in this example shows that it is much more robust to this type of heterogeneity than the centroid approach. The reason for this comes from how DWD works. While the class centroids (green x symbols) strongly feel the imbalance in subsample sizes, DWD is instead driven by the inverse distances from each point to the optimal separating plane, shown as the solid magenta line. That criterion is less influenced by the imbalance of subsample sizes, and thus gives a better adjustment.
Fig. 3. Simple 2-d toy example of batch adjustment for heterogeneous data. Colors show biological classes being studied. Symbols show batch effects. The respective adjustments are shifts in the direction of the dashed lines. This shows that DWD batch adjustment is much more robust to data heterogeneity than Centroid batch adjustment. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
Note that in the example shown in Fig. 3, data heterogeneity actually plays two roles. First, there is the known heterogeneity represented using the symbols. Second, adjusting for these is complicated because of the unknown heterogeneity represented by the colors. While the latter could be mitigated in this particular case by using centroids that are appropriately weighted using the known subsample sizes, it is straightforward to construct examples with additional unknown and unmodeled factors where this will not be possible. While the centroid method attempts to handle the first notion of heterogeneity, it fails to do so in a way that is robust against the second type of heterogeneity, while DWD is much more robust in this latter sense.
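The following sketch mimics the spirit of the toy example in Fig. 3, under assumed (not the original) parameter values. DWD itself is not implemented here; a linear Support Vector Machine direction serves only as a stand-in for adjustment along a batch-separating direction (Section 4 notes that the SVM direction would likely also be more robust than the centroid shift). It assumes numpy and scikit-learn, and all names and numbers are illustrative.

```python
# Hedged sketch in the spirit of the Fig. 3 toy example: two batches with
# unbalanced, unknown subtypes.  Centroid adjustment shifts each batch by
# its own mean; the direction-based adjustment shifts batches only along a
# batch-separating direction (a linear SVM is used here as a stand-in for
# DWD, which is not implemented in this sketch).  Numbers are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)

def make_batch(n_red, n_blue, shift):
    """One batch: an unbalanced mixture of a 'red' and a 'blue' subtype."""
    red = rng.normal([0.0, 0.0], 0.5, size=(n_red, 2))
    blue = rng.normal([3.0, 0.0], 0.5, size=(n_blue, 2))
    return np.vstack([red, blue]) + shift

# Batch 0 (circles) has more reds; batch 1 (plus signs) has more blues.
batch0 = make_batch(n_red=80, n_blue=20, shift=np.array([0.0, 0.0]))
batch1 = make_batch(n_red=20, n_blue=80, shift=np.array([0.0, 4.0]))
X = np.vstack([batch0, batch1])
batch = np.repeat([0, 1], [len(batch0), len(batch1)])

# Centroid adjustment: subtract each batch's own mean.  The class imbalance
# pulls the two batch means apart horizontally, so the shift is tilted.
X_centroid = X - np.array([X[batch == b].mean(axis=0) for b in batch])

# Direction-based adjustment: shift each batch only along the normal of a
# batch-separating hyperplane, so the projections onto it come together.
w = LinearSVC(C=1.0).fit(X, batch).coef_[0]
w = w / np.linalg.norm(w)
proj = X @ w
X_direction = X.copy()
for b in (0, 1):
    X_direction[batch == b] -= (proj[batch == b].mean() - proj.mean()) * w

print("centroid-adjusted batch means:\n",
      np.round([X_centroid[batch == b].mean(axis=0) for b in (0, 1)], 2))
print("direction-adjusted batch means:\n",
      np.round([X_direction[batch == b].mean(axis=0) for b in (0, 1)], 2))
```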
4. Discussion and open problems

The example presented in Fig. 3 raises a number of interesting questions, including the following.

How robust in this sense are the popular methods, such as SVA and COMBAT, even in settings where the goal is just testing of a specific hypothesis? They will clearly work well when addressing Gaussian examples, but it will be interesting to see, from the viewpoints of both simulation and theoretical analysis, how these methods fare when subpopulations (e.g. the error terms in ANOVA models) are modeled by Gaussian mixtures, especially those with varying means and covariances. In particular, studies are needed in which both types of heterogeneity raised in Section 3.3 are brought into play. A reviewer has pointed out that it will also be interesting to study the impact of mis-labeled heterogeneity on the subsequent statistical inference.

Can classification methods besides DWD be considered, and assessed for their robustness against heterogeneity? For example, looking at the example in Fig. 3, the Support Vector Machine will have a separating hyperplane (the analog of the magenta solid line) which will likely be more horizontal, resulting in a more robust adjustment. As noted in Marron et al. (2007), the original DWD was designed as a classification method, with the goal of providing improved generalizability in high dimensions over the conventional Support Vector Machine. Its usefulness for batch adjustment was shown by Benito et al. (2004). It is natural to wonder if improvements are available from modifications of DWD that are specifically aimed at enhanced robustness properties. Preliminary work is under way with Kim Chuan Toh on taking this approach. The Big Data applicability of DWD has until recently been limited by reliance upon Interior Point optimization methods, which scale poorly with the number of data points. Upcoming new work with Xin Yee Lam, Defeng Sun and Kim Chuan Toh, based on semi-proximal alternating direction method of multipliers ideas, has provided a breakthrough in this area. The resulting methods are very competitive with the best SVM implementations.

How much impact will ideas about robustness against heterogeneity ultimately have on actual data analysis? A personal prediction here is that research on this topic will take some intuitively unexpected twists and turns. This is based on observing, over time, a number of surprises in the closely related theoretical domain of very high dimensional data, including Hall et al. (2005), Aoshima and Yata (2014) and Shen et al. (2013).
Acknowledgment

Many of the ideas presented here were developed in conversation with colleagues in the Department of Statistics and Applied Probability, while the author held the position of Saw Swee Hock Visiting Professor at the National University of Singapore.

References

Aoshima, M., Yata, K., 2014. A distance-based, misclassification rate adjusted classifier for multiclass, high-dimensional data. Ann. Inst. Stat. Math. 66 (5), 983–1010.
Benito, M., Parker, J., Du, Q., Wu, J., Xiang, D., Perou, C.M., Marron, J.S., 2004. Adjustment of systematic microarray data biases. Bioinformatics 20 (1), 105–114.
Beran, J., 1991. M estimators of location for Gaussian and related processes with slowly decaying serial correlations. J. Am. Stat. Assoc. 86 (415), 704–708.
Bousquet, O., Bottou, L., 2008. The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems, pp. 161–168.
Bühlmann, P., Meinshausen, N., 2016. Magging: maximin aggregation for inhomogeneous large-scale data. Proceedings of the IEEE 104 (1), 126–135.
Chen, X., Xie, M., 2014. A split-and-conquer approach for analysis of extraordinarily large data. Stat. Sin. 24, 1655–1684.
Cochocki, A., Unbehauen, R., 1993. Neural Networks for Optimization and Signal Processing. John Wiley & Sons, Inc.
Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
DerSimonian, R., Laird, N., 1986. Meta-analysis in clinical trials. Control. Clin. Trials 7 (3), 177–188.
Duda, R.O., Hart, P.E., et al., 1973. Pattern Classification and Scene Analysis, vol. 3. Wiley, New York.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., 1996. Advances in Knowledge Discovery and Data Mining.
Gentleman, R., Ihaka, R., et al., 2003. The R Project for Statistical Computing.
Hall, P., Marron, J., Neeman, A., 2005. Geometric representation of high dimension, low sample size data. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67 (3), 427–444.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A., 2011. Robust Statistics: The Approach Based on Influence Functions, vol. 114. John Wiley & Sons.
Hennig, C., 2015. What are the true clusters? Pattern Recognit. Lett. 64, 53–62.
Huang, H., Liu, Y., Yuan, M., Marron, J.S., 2015. Statistical significance of clustering using soft thresholding. J. Comput. Graph. Stat. 24 (4), 975–993.
Huber, P.J., 2011. Robust Statistics. Springer.
Johnson, W.E., Li, C., Rabinovic, A., 2007. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8 (1), 118–127.
Jolliffe, I., 2005. Principal Component Analysis. Wiley Online Library.
Kim, W., Seo, J., 1991. Classifying schematic and data heterogeneity in multidatabase systems. Computer 24 (12), 12–18.
Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., Irizarry, R.A., 2010. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11 (10), 733–739.
Leek, J.T., Storey, J.D., 2007. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3 (9), 1724–1735.
Liu, X., Parker, J., Fan, C., Perou, C.M., Marron, J., 2009. Visualization of cross-platform microarray normalization. In: Batch Effects and Noise in Microarray Experiments: Sources and Solutions, pp. 167–181.
Liu, Y., Hayes, D.N., Nobel, A., Marron, J.S., 2012. Statistical significance of clustering for high-dimension, low-sample size data. J. Am. Stat. Assoc. 103, 1281–1293. doi:10.1198/016214508000000454.
Lu, X., Marron, J., Haaland, P., 2014. Object-oriented data analysis of cell images. J. Am. Stat. Assoc. 109 (506), 548–559.
Marron, J., Todd, M.J., Ahn, J., 2007. Distance-weighted discrimination. J. Am. Stat. Assoc. 102 (480), 1267–1271.
Marron, J.S., Alonso, A.M., 2014. Overview of object oriented data analysis. Biom. J. 56 (5), 732–753.
Mayer-Schönberger, V., Cukier, K., 2014. Learning with Big Data: The Future of Education. Houghton Mifflin Harcourt.
McCorduck, P., Minsky, M., Selfridge, O.G., Simon, H.A., 1977. History of artificial intelligence. In: Proceedings of the 1977 International Joint Conference on Artificial Intelligence (IJCAI), pp. 951–954.
Meinshausen, N., Bühlmann, P., 2014. Maximin effects in inhomogeneous large-scale data. arXiv preprint arXiv:1406.0596.
Pearson, K., 1900. The Grammar of Science. Black, London.
Perou, C.M., Sørlie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S., Rees, C.A., Pollack, J.R., Ross, D.T., Johnsen, H., Akslen, L.A., et al., 2000. Molecular portraits of human breast tumours. Nature 406 (6797), 747–752.
Riani, M., Cerioli, A., Atkinson, A.C., Perrotta, D., et al., 2014. Monitoring robust regression. Electron. J. Stat. 8 (1), 646–677.
Schölkopf, B., Smola, A.J., 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Shang, Z., Cheng, G., 2015. A Bayesian splitotic theory for nonparametric models. arXiv preprint arXiv:1508.04175.
Shen, D., Shen, H., Zhu, H., Marron, J., 2013. Surprising asymptotic conical structure in critical sample eigen-directions. arXiv preprint arXiv:1303.6171.
Staudte, R.G., Sheather, S.J., 2011. Robust Estimation and Testing, vol. 918. John Wiley & Sons.
Vapnik, V., 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, NY.
Wang, H., Marron, J., 2007. Object oriented data analysis: sets of trees. Ann. Stat. 35 (5), 1849–1873.
Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R.M., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C., Stuart, J.M., Cancer Genome Atlas Research Network, et al., 2013. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45 (10), 1113–1120.
Zhang, Y., Duchi, J., Wainwright, M., 2013. Divide and conquer kernel ridge regression. In: Conference on Learning Theory, pp. 592–617.
Zhao, T., Cheng, G., Liu, H., 2016. A partially linear framework for massive heterogeneous data. Ann. Stat., to appear.