5 Statistical analysis tools

5.1 Introduction

Statistics involves the collection, analysis, and interpretation of data. Often, it also involves the study of population characteristics by inference from sampling. In many fields of research, it is of interest to better understand some property of a large population (e.g., the average income of all residents of a state). The population value is called a parameter, and it typically remains unknown since it is difficult to collect data on all individuals in the population. Data are then collected on a smaller subset of the population, called a sample, and the population parameter is estimated by calculating a statistic from the sample. A statistical inference is a conclusion that patterns learned from sample data can be extended to a broader context, such as a population, through a probability model (Ramsey & Schafer, 2012).

There are many considerations to make when designing a study that affect the inferences that can be made. Two main types of statistical inferences that are commonly of interest are population inferences and causal inferences (Ramsey & Schafer, 2012). Population inferences involve drawing conclusions about a population parameter from a sample, whereas causal inferences involve trying to establish a cause-and-effect relationship between variables. The focus of the discussion here will be on population inferences since causal inferences typically involve a designed experiment. Designed experiments are very important in biomedical studies when answering hypothesis-driven questions. For example, randomized controlled trials are foundational in clinical research (Sibbald & Roland, 1998). However, the focus here is on connections to cluster analysis in biomedical studies, which is exploratory and hypothesis generating in nature.

To make population inferences, it is important to clearly define the population of interest and obtain a sample that is representative of the population. To obtain a representative sample, random sampling should be conducted, where a chance mechanism is used to select subjects. This helps prevent bias, which could result in over- or underestimated values (Ramsey & Schafer, 2012). It is also important to determine how many people should be included in the study in order to draw generalizations about the entire population.

There are different types of population inferences that are often of interest. A point estimate is a single statistic that is calculated from the data to estimate the population parameter. For example, the point estimate for the population average of the verbal IQ among 10–12 year old boys with autism spectrum disorder (ASD) would be the sample average. Frequently, instead of making inferences based on a single value, a confidence
interval is constructed that provides a range of plausible values for the population parameter with a certain level of confidence. For example, an interval could be obtained such that, with 95% confidence, the population average verbal IQ among 10–12 year old boys with ASD lies between the lower bound and the upper bound of the interval. Finally, another common type of inference is to conduct a hypothesis test, which determines whether a population parameter is equal to some pre-specified or theoretical value. For example, a hypothesis test could be conducted to determine whether the difference in the population average verbal IQ between 10–12 year old boys diagnosed and not diagnosed with ASD is equal to zero or not (i.e., is there a difference in the population averages between boys with and without ASD).
5.2 Tools for determining an appropriate analysis

There are many types of statistical analyses that can be used to make statistical inferences. The appropriate statistical model should be determined by the study design and the questions of primary interest to the researcher. In this section, some basic statistical analysis methods are introduced, and a discussion is provided about how to select the appropriate tool. Asking a few questions about the variables (or features) being studied and how the data were collected can often lead to an appropriate method. Once the method is applied, further diagnostics can help evaluate whether the model assumptions hold or if an alternative analysis is needed.

First, it is best to distinguish between a few different types of variables. The first distinction is between independent variables (IVs) and dependent variables (DVs). Independent or explanatory variables are variables that are thought to "explain" something in another variable. In an experiment, the IV is under the control of the researcher (e.g., randomly assigning patients to different treatments being studied). In an observational study, by contrast, the IV is not under the experimenter's control but has to be observed (e.g., demographic information such as age, ethnicity, or education level). Causal inferences are more feasibly made from randomized experiments than observational studies, which have the drawback of potential confounding variables. The presence of confounding variables may create an apparent association between two variables that is actually driven by their relationship to a third variable not included in the analysis (the confounder) (Ramsey & Schafer, 2012).

It is also important to know whether a variable is quantitative (Q) or categorical (C) when determining the appropriate statistical analysis. Categorical variables involve classifying individuals into groups or categories, whereas quantitative variables are numerical quantities. Both types of variables can be further categorized in a way that could affect the choice of statistical analysis. Quantitative variables can be discrete or continuous in nature. A discrete variable is one that takes on a finite or countably infinite number of values, whereas a continuous variable is one that takes on an uncountably infinite number of values (Devore, 2015). For example, the number of patients arriving at
an emergency room during a 1-h period could be {0, 1, 2, …}, which would be a discrete random variable since it takes on a countably infinite number of values. However, the weight of a person could be any real number greater than zero and would be a continuous random variable. For categorical variables, one primary distinction is whether or not the categories have any inherent ordering. Categorical variables with no inherent ordering are called nominal variables, and those with ordering are called ordinal variables. For example, ethnicity would be nominal, whereas a Likert scale rating would be ordinal.

In addition to knowing what types of variables are being studied, it is also helpful to distinguish the number of variables being investigated. Univariate data involve only one variable (or an analysis is conducted one variable at a time). Bivariate data involve two variables, and multivariate data involve more than two variables.

The following questions provide a starting point to help guide a researcher to an appropriate statistical analysis:
- What is the main research question(s)?
- What variables are being investigated to answer this question(s)?
- How many variables are there? What relationships are being investigated?
- What is the independent variable(s)? What is the dependent variable(s)?
- What type of variables (Q or C) are the IVs and DVs?
For the purposes of this chapter and introducing some common types of statistical methods that may be of use to biomedical researchers, these will be the main questions that are addressed. However, there are many other questions that may also help determine the type of analysis and conclusions that can be made. The following questions are just a few examples:
- Was there any random sampling or randomization into groups?
- Are data collected over time or via some other type of sequential ordering?
- Are there any variables for which individuals in the study are measured more than once?

Fig. 5.1 illustrates a flowchart of common types of statistical analyses that are selected based on the type of data. In this chart, only continuous quantitative variables and nominal categorical variables are considered. Specific details for some of these analyses used in clustering applications will be provided in later subsections of this chapter, but a description of these methods can be found in many statistical textbooks (Bremer & Doerge, 2009; Devore, 2015; Samuels & Witmer, 2015). Each of these analyses has a certain set of model assumptions that need to be checked before doing statistical inference. The models may be robust to certain assumptions and not robust to others, so it is important to know how the analyses are affected if an assumption is not met. If assumptions are not met, a remedy may be needed or an alternative analysis applied.
FIG. 5.1 Statistical analysis flowchart for different parametric statistical methods. The flowchart maps the type of data to an analysis as follows:

Bivariate:
- 1 (Q) DV, 1 (Q) IV: simple linear regression/correlation analysis
- 1 (Q) DV, 1 (C) IV, #C = 2: 2-sample t-test
- 1 (Q) DV, 1 (C) IV, #C > 2: one-way ANOVA
- 1 (C) DV, 1 (C) IV: chi-square or Fisher's Exact Test
- 1 (C) DV, 1 (Q) IV, #C = 2: logistic regression
- 1 (C) DV, 1 (Q) IV, #C > 2: multinomial regression

Multivariate:
- 1 (Q) DV, >1 (Q) IV or mix of (Q/C) IVs: multiple linear regression or ANCOVA
- 1 (Q) DV, >1 (C) IV: multi-factor ANOVA
- >1 (Q) DV, (C) IVs: MANOVA
- >1 (Q) DV, (Q) IVs or mix of (Q/C) IVs: multivariate multiple linear regression or MANOVA
- 1 (C) DV, >1 (Q) IV or mix of (Q/C) IVs: linear discriminant analysis or logistic regression
For example, most of the analyses listed in Fig. 5.1 are parametric, in that the response variables or error terms of the model are assumed to follow a specific probability distribution. However, there are nonparametric methods that do not require this distributional assumption and that may be more appropriate if the parametric model assumptions are found not to hold. When the parametric assumptions hold, the nonparametric methods are often not as statistically powerful, but they are a good alternative when the parametric assumptions do not hold. For a review of nonparametric statistical methods, see (Conover, 1999; Pett, 2015).
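As a concrete illustration, the following sketch shows how two of the nonparametric tests discussed later in this chapter might be run in Python with SciPy. The data here are hypothetical stand-ins for a skewed feature measured in two or three groups, where parametric assumptions are doubtful.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=30)  # skewed data, so normality is doubtful
y = rng.exponential(scale=1.5, size=25)
z = rng.exponential(scale=0.8, size=28)

# Mann-Whitney test: nonparametric alternative to the 2-sample t-test
u_stat, u_p = stats.mannwhitneyu(x, y, alternative="two-sided")

# Kruskal-Wallis test: nonparametric alternative to the one-way ANOVA
h_stat, h_p = stats.kruskal(x, y, z)

print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.4f}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {h_p:.4f}")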
5.3 Statistical applications in cluster analysis

There are many ways that statistical analysis can be used to aid in clustering applications. For example, statistical methods could be used to compare the performances of internal cluster validation indices, which are summary metrics that quantify the separation and compactness of clusters generated from different clustering settings (Arbelaitz, Gurrutxaga, Muguerza, Pérez, & Perona, 2013). Statistical ideas also underlie many cross-validation and data imputation methods (James, Witten, Hastie, & Tibshirani, 2013; van Buuren, 2018). In this chapter, the focus will be placed on one way statistics can be incorporated into a cluster analysis workflow (Fig. 5.2). As described in Chapter 2, during data pre-processing and prior to clustering, there are often many features available in the data that may not all be needed. Correlation analysis can aid in identifying these redundant features for removal prior to performing clustering [Chapter 2]. After an appropriate clustering method is selected and performed for a particular application, it is important to evaluate the results and better understand the cluster composition. One way to investigate this is to understand the importance of different features in the final clustering results. Features can be analyzed individually to determine if there are statistically
significant differences between clusters, or a multivariate analysis can be conducted, which incorporates all features into a single analysis. Further details are provided below for the goal of using statistical methods for cluster evaluation.

FIG. 5.2 An example of statistical applications in the clustering workflow: identify redundant features, enhance cluster evaluation, and better understand feature importance. See (Al-Jabery et al., 2016) for a detailed example of this workflow used with a specific type of clustering.
5.3.1 Cluster evaluation tools: analyzing individual features
One approach to cluster evaluation is to better understand the differences in cluster composition by investigating how individual features differ between clusters. This can help determine the importance of individual features and help subject matter experts better interpret the meaning of different clusters. To illustrate how the flowchart in Fig. 5.1 can be used to identify an appropriate analysis to accomplish this goal, consider the subset of analyses corresponding to bivariate data. In this case, the cluster label is the categorical independent variable, and an individual feature is the dependent variable. If the feature is quantitative and there are only two clusters [1 (Q) DV, 1 (C) IV, #C = 2], a two-sample t-test can be conducted to test for significant differences in the mean value of the feature between clusters. If the feature is quantitative and there are more than two clusters [1 (Q) DV, 1 (C) IV, #C > 2], a one-way analysis of variance (ANOVA) can be conducted to test for a significant difference in means among the clusters and determine which clusters have statistically different means for the feature. However, if the feature is categorical [1 (C) DV, 1 (C) IV], then a χ² test or a Fisher's Exact Test (FET) can be used to investigate whether there is an association between the feature and the cluster label. Each of these analyses is briefly introduced below, along with several hypothesis testing concepts that are fundamental to all of the methods.
5.3.1.1 Hypothesis testing and the 2-sample t-test

Consider the case when there are k = 2 clusters (i.e., the number of categories #C = 2). Individuals in the two clusters are thought to be samples drawn from two different populations. It is of interest to test whether the population mean of a quantitative feature differs significantly between clusters. A two-sample t-test (Ramsey & Schafer, 2012; Rosner, 2015) can be employed to accomplish this goal.

Defining the Hypotheses. A hypothesis test will be conducted that will result in a decision between two competing hypotheses: the null hypothesis (H0) and the alternative hypothesis (Ha). The null hypothesis is usually designed to represent the status quo or to signify that nothing of interest is happening. The alternative hypothesis, however, typically represents the researcher's theory that something noteworthy and different from the status quo is occurring. For the 2-sample t-test, the null and alternative hypotheses are:

$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_a: \mu_1 \neq \mu_2,$$

where $\mu_i$ represents the population mean of the feature in cluster $i$ for $i = 1, 2$. Under the alternative hypothesis, the population mean of the feature differs between the clusters, indicating that the feature is useful for providing insight into the differences in cluster composition.

One of the main ideas behind hypothesis tests is to assume the status quo (H0) unless there is enough evidence in the data collected to indicate that H0 is untrue and thus Ha should be concluded. Since the data are obtained from samples of the population, it is very rare to observe the means being exactly equal in the sample. So, it is necessary to determine how different these sample means can be, while considering variation in the data, and still be compatible with the null hypothesis. If the sample means are different enough while accounting for the inherent variation, it would suggest the null hypothesis may be untrue and would provide evidence for the alternative. Making a decision between these hypotheses requires setting a criterion that can be compared to a threshold and tied to a probability model, so that uncertainty can be quantified (Ramsey & Schafer, 2012).

Test Statistic. To accomplish this, data are collected by obtaining samples from the underlying populations, and a summary statistic, called a test statistic, is calculated. The test statistic is a single value calculated from the data designed to gauge the plausibility of the alternative compared to the null hypothesis. For the 2-sample t-test, the test statistic is as follows:

$$t_{uv} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$

where $\bar{x}_i$, $s_i^2$, and $n_i$ are the feature sample mean, sample variance, and sample size of cluster $i = 1, 2$, respectively. This test statistic is often referred to as Welch's t-test, and it is designed for cases with two populations that may have different variances. There is an
alternative version of the test statistic that is designed for when the two population variances are equal, but many researchers have recommended Welch's version since the equal variance assumption can be difficult to assess and inference issues can arise when it is not met (Delacre, Lakens, & Leys, 2017; Ruxton, 2006).

Defining a Decision Rule. Since the test statistic is a random variable whose value will likely change if a different sample is collected, the distribution of the test statistic, called the sampling distribution, can be determined. Specifically, when the null hypothesis is true, the sampling distribution of the test statistic can be specified exactly or approximately, so that it is known how the test statistic will behave under H0. The sampling distribution of the test statistic when H0 is true can be used to establish a threshold for whether to reject H0 or not, using either the rejection region approach or by calculating a p-value.

The rejection region approach involves establishing a threshold (called a critical value) for the test statistic that would lead to a decision to reject H0 or not. The critical value is based on the sampling distribution of the test statistic under the null distribution. If the test statistic value is located in one of the tails of the null distribution (far away from where the distribution is centered), then it is unlikely that the data were obtained under the assumption that the null hypothesis is true. Therefore, H0 can be rejected. The critical value establishes the threshold for how far away the test statistic should be from where the sampling distribution under the null is centered in order to reject H0.

The decision to reject H0 or not can equivalently be made by calculating a p-value and setting a threshold for it. The p-value for a hypothesis test is the probability that the test statistic is at least as extreme (contradictory to H0) as the one observed, given the null hypothesis is true. It is a measure of the compatibility between the data and the null hypothesis, with smaller p-values indicating greater evidence against H0. Since it is a probability, it will be a value between 0 and 1. The closer the p-value is to zero, the less likely it is to obtain the observed results just due to chance alone when the null is actually true. Making the testing decision based on the p-value requires establishing a threshold, called the significance level (usually denoted as α), such that p-values smaller than α lead to rejecting H0 and concluding evidence for the alternative. For p-values greater than or equal to α, H0 is not rejected. The critical value and significance level can be chosen such that both the rejection region and p-value approaches lead to the same decision. The advantage of the p-value approach is that it provides quantitative information about the strength of evidence against the null, as compared to the rejection region approach, which just yields a yes or no decision of whether to reject H0 or not. For the 2-sample t-test, the sampling distributions and rejection rules are given in Table 5.1.
Table 5.1 Sampling distribution and rejection rules for the 2-sample t-test. Note that $t(v)$ represents a t-distribution with $v$ degrees of freedom and $t_p(v)$ is the pth percentile of the distribution.

Type of 2-sample t-test: unequal variance (Welch's t-test).

Sampling distribution under H0: $t_{uv}$ approximately follows $t(df)$, where

$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$

Critical value and rejection rule: reject H0 if $|t_{uv}| > t_{1-\alpha/2}(df)$; else fail to reject H0.

p-value and rejection rule: p-value $= 2P(T \geq |t_{uv}|)$, where $T \sim t(df)$; reject H0 if the p-value is less than α, else fail to reject H0.

Errors in Testing and Statistical Power. Since the data represent a sample from two larger populations, the underlying truth about which of the two hypotheses is true is unknown. Thus, the decision that is made could be correct or incorrect. There are two possible errors that can be made in a hypothesis test. The first is called a type I error, or a
false positive. This error occurs when the null hypothesis is rejected even though it is true. It is called a false positive since the researcher has claimed something beyond the status quo was occurring (a "positive") even though it was not. In other words, it is claiming an effect when there is none. For the clustering scenario, this would mean saying the mean values of a particular feature differ between clusters when they really do not in the larger population of individuals. The probability of a false positive is bounded above by the significance level α, which is typically why α is set to be a small value such as 0.01, 0.05, or 0.10.

The second type of error is called a type II error, or a false negative. This error occurs when the null hypothesis is not rejected even though it is false. It is called a false negative since it would mean that an existing effect is overlooked. For the clustering scenario, this would indicate the mean values of a particular feature differ between clusters, but the test indicated there was no difference. The probability of a false negative is denoted by β.

There are also two ways a correct decision could be made, and it is sometimes useful to think about the probabilities associated with the correct decisions. The probability of correctly rejecting a false null hypothesis (true positive) is called the power and is calculated as 1 − β. The probability of correctly failing to reject a true null hypothesis (true negative) is called the confidence level and is calculated as 1 − α. Table 5.2 illustrates the different types of outcomes in a hypothesis test.

Table 5.2 Different outcomes from a hypothesis test.

Decision \ Truth | H0 true | H0 false
Reject H0 | Type I error (false positive, probability ≤ α) | Correct (true positive, power = 1 − β)
Fail to reject H0 | Correct (true negative, confidence = 1 − α) | Type II error (false negative, probability β)

While it would be ideal to keep both α and β small, they are inversely related. In fact, there are four components that may be specified by the experimenter that are all related: the sample size (number of individuals per group), a practical effect size (magnitude of the study effect relative to the variation that would be practically meaningful), the significance level, and the power. If three of these components are specified, the fourth can be calculated. If planning can be done in advance, the sample size needed to achieve a low false positive and false negative rate for a meaningful effect size can be calculated. For cluster evaluation, this is not usually feasible, since the sample size of each cluster is
unknown beforehand. However, the overall sample size could be controlled. Other texts provide a more detailed discussion of sample size and power calculations (e.g., Ryan, 2013).

Assumptions. The 2-sample t-test has a set of assumptions that are important to understand and check. First, it is assumed that the individuals in the sample are independent, both within a group and between groups. This means that the value of the response variable for one individual does not depend on and is not correlated with the response value of any other individual in the study. It is very important that this assumption is met. Otherwise, the stated type I error rate that is set by the significance level may not be accurate (Lissitz & Chardos, 1975). However, independence is difficult to check graphically or with a test. Rather, thought must be given to how the data were collected to determine whether there may be an inherent dependency between individual data points. For example, if multiple data points are collected on the same individuals at different time points, these data points would not be independent. The second assumption is that the data are random samples from two normally distributed populations. Normality can be checked by creating histograms or normal probability plots of the quantitative response variable within each of the two groups. The test is fairly robust to departures from normality, especially for large sample sizes. Robust means that certain properties of the test, such as the stated type I error rate, are still reasonably accurate even if the assumption is not met. Note that the t-test is sensitive to outliers. If influential outliers are present, a nonparametric test such as the Mann-Whitney test (Mann & Whitney, 1947) may be more useful since it is more resistant to being heavily affected by outliers. While the Mann-Whitney test does not require normality, it still relies on the independence assumption. Thus, a more sophisticated analysis is needed for either the parametric or nonparametric approaches when independence is not met.
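To make the testing procedure concrete, the following sketch computes Welch's test statistic, the degrees of freedom, the critical value, and the two-sided p-value from Table 5.1, and checks them against SciPy's built-in implementation. The per-cluster feature values are hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
feat_c1 = rng.normal(loc=0.55, scale=0.20, size=50)  # hypothetical feature, cluster 1
feat_c2 = rng.normal(loc=0.85, scale=0.15, size=15)  # hypothetical feature, cluster 2

# Manual Welch statistic and degrees of freedom
m1, m2 = feat_c1.mean(), feat_c2.mean()
v1, v2 = feat_c1.var(ddof=1), feat_c2.var(ddof=1)
n1, n2 = len(feat_c1), len(feat_c2)
t_uv = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)
df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

alpha = 0.05
crit = stats.t.ppf(1 - alpha / 2, df)   # rejection-region threshold
p_val = 2 * stats.t.sf(abs(t_uv), df)   # two-sided p-value

# SciPy's built-in version (equal_var=False gives Welch's test)
t_scipy, p_scipy = stats.ttest_ind(feat_c1, feat_c2, equal_var=False)

print(f"t = {t_uv:.3f}, df = {df:.1f}, critical value = {crit:.3f}, p = {p_val:.4f}")
print(f"SciPy check: t = {t_scipy:.3f}, p = {p_scipy:.4f}")
print("Reject H0" if p_val < alpha else "Fail to reject H0")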
5.3.1.2 Summary of hypothesis testing steps and application to clustering

In summary, the basic steps for conducting a hypothesis test are as follows:

1. Formulate the question of interest and translate it into null (H0) and alternative (Ha) hypotheses.
2. Decide on a data collection protocol that will address the question of interest. This includes determining an appropriate statistical test, setting the significance level α, and doing a sample size calculation to also control β, if possible.
3. Once data have been collected, check the assumptions.
4. Calculate the test statistic.
5. Make the testing decision (reject H0 or not) using either the rejection region or p-value approach.
6. Write conclusions for publication in the context of the question of interest (including reporting the p-value).

In conclusion, the 2-sample t-test can be used to test whether or not there is a statistically significant difference in the true means of an individual feature between two clusters. Features that differ significantly between clusters may be an important factor in helping distinguish the groups and can be examined for clinical relevance in a biomedical setting. However, if there are many features and multiple 2-sample t-tests are conducted, it will be important to control the familywise false positive rate across the set of tests, as illustrated in the sketch below. While not discussed in detail here, there are many methods that can be used to do this, such as the Bonferroni correction (Bland & Altman, 1995) and the False Discovery Rate approach (Benjamini & Hochberg, 1995). More details about methods that enable controlling the overall false positive rate can be found in (Bender & Lange, 2001; Rice, Schork, & Rao, 2008).
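The sketch below shows one way the Bonferroni and Benjamini-Hochberg corrections mentioned above might be applied in Python with statsmodels; the per-feature p-values are hypothetical.

from statsmodels.stats.multitest import multipletests

p_values = [0.0004, 0.021, 0.048, 0.13, 0.62]  # hypothetical per-feature t-test p-values

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p, rb, rf in zip(p_values, reject_bonf, reject_fdr):
    print(f"raw p = {p:.4f}  Bonferroni reject: {rb}  BH-FDR reject: {rf}")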
5.3.1.3 One-way ANOVA

Consider the case where there are k ≥ 3 clusters (i.e., the number of categories #C ≥ 3). It is of interest to test whether the population mean of a quantitative feature differs significantly between clusters and, if so, which clusters have significantly different means. A one-way analysis of variance (ANOVA) can be used to address these questions (Kutner, Nachtsheim, Neter, & Li, 2004; Ramsey & Schafer, 2012; Samuels & Witmer, 2015). The idea behind the one-way ANOVA is that it compares variation between groups to variation within groups (Fig. 5.3). If "enough" variation is occurring between groups relative to the within-group variation, then there is a difference somewhere among the k population means. Determining how much is "enough" involves establishing the test statistic and decision rule as described below.

ANOVA Model and Assumptions. Suppose there are k populations with means $\mu_1, \mu_2, \ldots, \mu_k$. Consider the model:

$$y_{ij} = \mu_i + \varepsilon_{ij},$$

where $y_{ij}$ is the value of the feature for the jth individual within the ith cluster. The model breaks these values of $y_{ij}$ into two components: the population mean of the ith cluster, $\mu_i$, and an error term $\varepsilon_{ij}$ that represents random deviations of individuals from the cluster mean. The assumptions of the model are that individuals in the study are independent and that individuals within each group represent a random sample from a normally distributed population with mean $\mu_i$ and variance $\sigma^2$. This inherently implies that the population variances are the same for all clusters.
FIG. 5.3 The y-axis represents a quantitative feature plotted against the cluster number (k = 3). The red boxes represent the cluster means. ANOVA testing utilizes the ratio of variation between the cluster means to within-cluster variation.

Thus, the main assumptions are
(1) independence, (2) normality within groups, and (3) constant variance. The Kruskal-Wallis test (Kruskal & Wallis, 1952) can be used as a nonparametric alternative analysis.

Global Test. There are often two types of hypothesis tests that are of interest for a one-way ANOVA, and these tests are typically performed sequentially. The first test performed is called a global test, which determines whether there is a difference anywhere in the means among a set of k populations. The hypotheses for the global test are:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \quad \text{versus} \quad H_a: \text{not all of the } \mu_i\text{'s are equal},$$

where $\mu_i$ represents the population mean of the feature in cluster $i$ for $i = 1, 2, \ldots, k$. The test statistic is called an F-test and is as follows:

$$F^* = \frac{\text{between-group variation}}{\text{within-group variation}} = \frac{\dfrac{1}{k-1}\displaystyle\sum_{i=1}^{k} n_i(\bar{y}_i - \bar{y})^2}{\dfrac{1}{n-k}\displaystyle\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2},$$
where $y_{ij}$ is the value of the feature for the jth individual within the ith cluster, $\bar{y}_i$ is the sample mean of cluster i, $\bar{y}$ is the overall mean of the feature across all individuals, $n_i$ is the sample size of cluster i, and n is the total number of individuals in the study.

The numerator of the test statistic represents the between-group variation of individual cluster means from the overall mean. The numerator is also often referred to as the mean square between groups, MS(between), since it is calculated by dividing the between-group sum of squares, SS(between), by its degrees of freedom (k − 1). That is:

$$MS(\text{between}) = \frac{SS(\text{between})}{df(\text{between})} = \frac{\sum_{i=1}^{k} n_i(\bar{y}_i - \bar{y})^2}{k-1}.$$
The denominator of the test statistic represents the within-group variation of individual values from the group mean. Similarly, it is often referred to as the mean square within groups, MS(within), or more simply the mean square error (MSE), since it represents the variation that is unexplained by group differences. It is calculated by dividing the within-group error sum of squares, SS(within) or SSE, by its degrees of freedom (n − k). That is:

$$MSE = \frac{SSE}{df_E} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2}{n-k}.$$
If the between-group variation is larger than the within-group variation, the F-test statistic will be large, and this will indicate that a difference exists among the group means (Ha). The sampling distribution of $F^*$ when the null is true is an F-distribution with k − 1 numerator and n − k denominator degrees of freedom, that is, $F(k-1, n-k)$. The p-value, critical value, and rejection rule are as follows:

$$\text{p-value} = P(F > F^*), \text{ where } F \sim F(k-1, n-k);$$
$$\text{reject } H_0 \text{ if } F^* > F_{1-\alpha}(k-1, n-k) \text{ or p-value} < \alpha,$$

where $F_{1-\alpha}(k-1, n-k)$ is the $1-\alpha$ percentile of the F-distribution.
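As an illustration, the following sketch computes the global F-test manually from the MS(between) and MSE definitions above and checks the result against SciPy's f_oneway; the per-cluster feature values are hypothetical.

import numpy as np
from scipy import stats

groups = [
    np.array([0.52, 0.61, 0.55, 0.58, 0.49, 0.57]),  # cluster 1 (hypothetical)
    np.array([0.78, 0.74, 0.81, 0.77]),              # cluster 2 (hypothetical)
    np.array([0.88, 0.83, 0.86]),                    # cluster 3 (hypothetical)
]
k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (ss_between / (k - 1)) / (ss_within / (n - k))  # MS(between) / MSE
p_val = stats.f.sf(F, k - 1, n - k)                 # p-value from F(k-1, n-k)

F_scipy, p_scipy = stats.f_oneway(*groups)
print(f"F = {F:.3f}, p = {p_val:.5f} (SciPy: F = {F_scipy:.3f}, p = {p_scipy:.5f})")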
Tukey Pairwise Comparisons. When the global test indicates there is a difference among the cluster means, it is sometimes useful to determine which clusters have statistically different means. This involves conducting a hypothesis test for a difference in means between all k(k − 1)/2 pairs of clusters. There are several different methods that can be used to compare group means or some combination of them (Bender & Lange, 2001; Kutner et al., 2004), but Tukey's method is preferred when it is of interest to conduct all of the pairwise comparisons. Tukey's method (Braun, 1994; Kutner et al., 2004) is a multiple testing method that controls the false positive rate across all the tests. The method is referred to as the Tukey-Kramer method (Braun, 1994; Kramer, 1956) when the number of individuals in each group (cluster) is not the same (unbalanced). In addition to controlling the familywise error rate, the Tukey procedure differs from conducting individual t-tests between the groups since it utilizes the MSE as a pooled variance that estimates the common variance across the k populations. The testing procedure is given in Table 5.3.

Table 5.3 Testing procedure for Tukey's pairwise comparisons method, where $q(k, n-k)$ represents the studentized range distribution with k and n − k degrees of freedom and $q_{1-\alpha}(k, n-k)$ is the $1-\alpha$ percentile of the distribution.

Hypotheses: $H_0: \mu_i = \mu_j$ vs. $H_a: \mu_i \neq \mu_j$, where $i \neq j$.

Test statistic and sampling distribution under H0:

$$q^* = \frac{\sqrt{2}\,\left|\bar{y}_i - \bar{y}_j\right|}{\sqrt{MSE\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}} \sim q(k, n-k).$$

Critical value and rejection rule: reject H0 if $|q^*| > q_{1-\alpha}(k, n-k)$; else fail to reject H0.
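A short sketch of how Tukey's pairwise comparisons might be run in Python with statsmodels is shown below; pairwise_tukeyhsd applies the Tukey-Kramer adjustment for unbalanced groups. The feature values and cluster labels are hypothetical.

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = np.array([0.52, 0.61, 0.55, 0.58, 0.49, 0.57,   # cluster 1 (hypothetical)
                   0.78, 0.74, 0.81, 0.77,               # cluster 2 (hypothetical)
                   0.88, 0.83, 0.86])                    # cluster 3 (hypothetical)
clusters = np.array([1] * 6 + [2] * 4 + [3] * 3)

result = pairwise_tukeyhsd(endog=values, groups=clusters, alpha=0.05)
print(result)  # table of pairwise mean differences, adjusted p-values, and CIs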
5.3.1.4 χ² test for independence

Thus far, only quantitative features have been considered, but some features may be categorical in nature. For categorical features, a χ² test of independence (Bremer & Doerge, 2009; Conover, 1999; Samuels & Witmer, 2015) can be conducted to determine whether or not there is an association between the feature categories and the cluster labels. Data are typically organized into a contingency table where the rows and columns represent different categories for the two categorical variables. Counts for the number of occurrences of each category combination are given inside the table, with totals on the final row and column. In clustering applications, the rows of the contingency table would be the different clusters and the columns would be different values of the categorical feature. Table 5.4 illustrates a contingency table for a data set with k = 3 clusters and a categorical feature with 3 groups.

Data in a contingency table can be visualized with a mosaic plot (Fig. 5.4). The width of the bars is relative to the size of each cluster, so it is easy to see that cluster 3 is the largest while clusters 1 and 2 are similar in size. The height of the bars represents the percentage of observations that fall in the three groups within each cluster (i.e., row percentages). For the data in Table 5.4, cluster 1 has a high percentage of observations in group 1, whereas clusters 2 and 3 have a high percentage in group 3.

Table 5.4 Example of contingency table for k = 3 clusters and a categorical feature with 3 groups.

Contingency table | Group 1 | Group 2 | Group 3 | Total
Cluster 1 | 28 | 13 | 2 | 43
Cluster 2 | 1 | 2 | 46 | 49
Cluster 3 | 15 | 24 | 77 | 116
Total | 44 | 39 | 125 | 208
FIG. 5.4 Mosaic plot of the data from the contingency table.
Table 5.5 Testing procedure for the χ² independence test, where r = number of rows, c = number of columns, $\chi^2((r-1)(c-1))$ represents the χ² distribution with (r − 1)(c − 1) degrees of freedom, and $\chi^2_{1-\alpha}((r-1)(c-1))$ is the $1-\alpha$ percentile of the distribution.

Hypotheses: H0: the two categorical variables are independent (no association) vs. Ha: the two categorical variables are not independent (there is an association).

Test statistic and sampling distribution under H0:

$$\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} \sim \chi^2((r-1)(c-1)),$$

where

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{overall total}}.$$

Critical value and rejection rule: reject H0 if $\chi^2 > \chi^2_{1-\alpha}((r-1)(c-1))$; else fail to reject H0.
The χ² test of independence offers a formal way to test whether there is a statistically significant association between the cluster label and the feature categories. The testing procedure is given in Table 5.5. The test statistic is calculated by finding the expected count under independence for each "cell" in the contingency table. There are a total of r × c "cells," or combinations of categories, where r is the number of row categories and c is the number of column categories. Within each cell, the difference between the expected and observed counts is squared and then scaled by the expected count. These values are added together across all cells to obtain the test statistic value. Large values of the test statistic are indicative of a deviation from independence, and the formal rejection rule to establish significance is given in Table 5.5. The p-value can be calculated using a χ² distribution. However, it should be noted that the use of the χ² distribution is an approximation based on large sample theory (Conover, 1999). For small samples, an alternative test called Fisher's Exact Test (FET) (Bremer & Doerge, 2009; Conover, 1999; Samuels & Witmer, 2015) can be used, which utilizes the exact distribution. The rule of thumb is to use the χ² test when less than 20% of the expected cell counts are less than 5; otherwise, the FET is more appropriate (Kim, 2017).
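The following sketch applies the χ² test of independence to the counts in Table 5.4 using SciPy; chi2_contingency also returns the expected counts, which can be used to check the 20% rule of thumb. Note that SciPy's fisher_exact handles only 2 × 2 tables, so an exact test for larger tables would require other software (e.g., R's fisher.test).

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[28, 13,  2],    # cluster 1 counts for groups 1-3 (Table 5.4)
                  [ 1,  2, 46],    # cluster 2
                  [15, 24, 77]])   # cluster 3

chi2, p_val, dof, expected = chi2_contingency(table)
small_frac = (expected < 5).mean()  # fraction of cells with expected count below 5

print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p_val:.2e}")
print(f"{100 * small_frac:.0f}% of expected cell counts are below 5")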
5.3.2 Cluster evaluation tools: multivariate analysis of features
The previous section described ways that different types of traditional statistical analyses can be used to evaluate differences in individual features between clusters. This can help subject matter experts better understand the feature characteristics of different clusters. However, cluster evaluation can be enhanced by utilizing a multivariate analysis that includes all the features, rather than considering them separately. The details of these multivariate analysis methods are not covered here, but some examples of specific cluster evaluation questions that could be addressed with multivariate analysis are described briefly. For more information about multivariate statistical analyses see (Johnson & Wichern, 2008).
One question that may be of interest is to determine which features best discriminate the clusters. To answer this question, the relationship between a categorical dependent variable (cluster membership) and more than one quantitative independent variable (feature) should be investigated. A descriptive linear discriminant analysis (LDA) is one option (Fig. 5.1) for accomplishing this goal. LDA finds linear combinations of features that maximize group separation, called canonical discriminant functions. The importance of individual features in distinguishing clusters can be evaluated by investigating how strongly each feature correlates with the canonical discriminant functions (see the sketch at the end of this section). This can aid subject-matter researchers in better understanding which features are most important in cluster separation.

Another question that could be addressed with multivariate statistical analyses is whether there is a significant difference in the feature means among the C clusters after accounting for a covariate feature not included in the clustering (e.g., age). An analysis of covariance (ANCOVA) could be used to address this question (feature is Q DV, cluster is C IV, covariate is Q IV). This could be important if there were a particular variable (covariate) that is not of direct interest for clustering but could be related to the features used in clustering. This type of analysis would allow the researcher to test for differences in the feature means between clusters after adjusting for the covariate.

There are many other cluster evaluation questions that could potentially be addressed with multivariate statistical analyses, but it is important for the subject-matter expert to be involved in framing these questions according to their research goals. It is also important to keep in mind that additional assumptions are often needed for multivariate analyses, and these should be verified for their appropriateness. Although this chapter does not focus on the details of multivariate analyses, this section illustrates some potential applications in cluster evaluation.
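As a sketch of the descriptive LDA idea above, the following Python example fits an LDA with scikit-learn using cluster labels as the response and then correlates each feature with the canonical discriminant scores. The feature matrix and labels are hypothetical stand-ins for real clustering output.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = rng.normal(size=(208, 5))          # hypothetical: 208 subjects, 5 features
labels = rng.integers(1, 4, size=208)  # hypothetical cluster labels 1-3

lda = LinearDiscriminantAnalysis(n_components=2)
scores = lda.fit_transform(X, labels)  # canonical discriminant scores

# Correlate each feature with each discriminant function as a rough measure
# of how strongly the feature drives cluster separation.
for j in range(X.shape[1]):
    r1 = np.corrcoef(X[:, j], scores[:, 0])[0, 1]
    r2 = np.corrcoef(X[:, j], scores[:, 1])[0, 1]
    print(f"feature {j + 1}: corr with LD1 = {r1:.2f}, LD2 = {r2:.2f}")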
5.4 Software tools and examples

5.4.1 Statistical software tools
There are many different statistical software packages available for implementing statistical methods. Commercial software includes SAS, JMP, SPSS, Stata, and Minitab, among others. R is a popular open-source, open-development language, and Bioconductor contains many statistical methods for bioinformatics applications that are implemented in R. There are many advantages and disadvantages to these different packages that will not be discussed here. For the purposes of illustrating how the analyses described above can be used to evaluate differences in features between clusters, JMP software (JMP®, Version 13, SAS Institute Inc.) will be used in the example below. Some advantages of JMP are that it is a powerful data exploration tool and has many nice visualization features. JMP has a GUI interface, but scripting is also an option. Some limitations of JMP are that it is commercial and not open source; thus, there is a cost to use it, and it is not designed so researchers can easily implement new methods.
5.4.1.1 Example: clustering autism spectrum disorder phenotypes

Autism spectrum disorder (ASD) is a complex disease characterized by high variation in many phenotypes, including behavioral, clinical, physiologic, and pathological factors (Georgiades, Szatmari, & Boyle, 2013). Cluster analysis is a useful approach for sorting out the phenotypic heterogeneity with the goal of identifying clinically relevant subgroups that may benefit from different diagnoses and treatments (Al-Jabery et al., 2016). In this example, data were obtained from 208 ASD patients through the Simons Simplex Collection (Fischbach & Lord, 2010) site at the University of Missouri–Thompson Center for Autism and Neurodevelopmental Disorders. Simplex indicates that only one child (called a proband) is on the ASD spectrum, and neither the biological parents nor the siblings have the disorder. A set of 27 phenotypic features (25 quantitative and 2 categorical) are available that provide information about different characteristics, including ASD-specific symptoms, cognitive and adaptive functioning, language and communication skills, and behavioral problems. Table 5.6 provides a list of these variables with a brief definition.

Table 5.6 List of variable labels, definitions, and types for the ASD data example. Q = quantitative, C = categorical. Nine of these variables were discarded prior to clustering (see Section 5.4.1.3).

Label | Definition | Type
Var1 | Overall verbal IQ | Q
Var2 | Overall nonverbal IQ | Q
Var3 | Full scale IQ | Q
Var4 | Module of ADOS administered | C
Var5 | ADI-R B nonverbal communication total | Q
Var6 | ADOS communication social interaction total | Q
Var7 | ADI-R A total abnormalities in reciprocal social interaction | Q
Var8 | ADOS social affect total | Q
Var9 | ADI-R C total restricted repetitive & stereotypical patterns of behavior | Q
Var10 | ADOS restricted and repetitive behavior (RRB) total | Q
Var11 | Repetitive behavior scale-revised (RBS-R) overall score | Q
Var12 | Aberrant behavior checklist (ABC) total score | Q
Var13 | Regression | C
Var14 | Vineland II composite standard score | Q
Var15 | Vineland II daily living skills standard score | Q
Var16 | Vineland II communication standard score | Q
Var17 | Peabody picture vocabulary test (PPVT4A) standard score | Q
Var18 | Social responsiveness scale (SRS) parent-awareness raw score | Q
Var19 | SRS parent cognition raw score | Q
Var20 | SRS parent communication raw score | Q
Var21 | SRS parent mannerisms raw score | Q
Var22 | SRS parent motivation raw score | Q
Var23 | SRS parent total raw score | Q
Var24 | Vineland II socialization standard score | Q
Var25 | RBS-R subscale V sameness behavior | Q
Var26 | Child behavior checklist (CBCL) internalizing problems total | Q
Var27 | CBCL externalizing problems total | Q

(Al-Jabery et al., 2016) applied a novel ensemble subspace clustering model to identify meaningful subgroups and aid in better understanding the phenotypic complexity present in ASD. In the following example, the statistical methods described previously are used to evaluate feature importance and understand the clinical relevance of clusters identified using the clustering method applied in (Al-Jabery et al., 2016).
5.4.1.2 Correlation analysis

As discussed previously [Chapter 2], correlation analysis can be utilized to check the strength of relationships between pairs of quantitative variables. This enables the detection of highly correlated features that can be selected for removal prior to clustering, since they contain redundant information. For example, taking the first 5 of the quantitative phenotypic features in the ASD dataset for illustration purposes, the Pearson (Fig. 5.5B) and Spearman (Fig. 5.5C) correlation values can be obtained, as well as a visualization of the pairwise relationships in a scatterplot matrix (Fig. 5.5A). Note that Var4 is categorical and is not included in this analysis. The following JMP menu commands are used to generate these results.
JMP Commands: Correlation Analysis

Pearson pairwise correlations and scatterplot matrix:
Analyze >> Multivariate Methods >> Multivariate >> Y-columns [enter Q variables]

The following optional analyses can be obtained by selecting further options within the result output of the previous command.

Spearman pairwise correlations:
Multivariate >> Nonparametric Correlations >> Spearman's ρ

Adding histograms to scatterplot matrix:
Scatterplot Matrix >> Show Histogram >> Horizontal
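For readers working in open-source tools, an equivalent correlation screen might look like the following pandas sketch; the DataFrame contents are hypothetical stand-ins for the ASD variables.

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
base = rng.normal(size=208)
df = pd.DataFrame({
    "Var1": base + rng.normal(scale=0.4, size=208),  # IQ-like, mutually correlated
    "Var2": base + rng.normal(scale=0.4, size=208),
    "Var3": base + rng.normal(scale=0.3, size=208),
    "Var5": rng.normal(size=208),
    "Var6": rng.normal(size=208),
})

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Flag pairs above a chosen redundancy threshold (e.g., |r| > 0.8)
high = (pearson.abs() > 0.8) & (pearson.abs() < 1.0)
pairs = [(a, b) for a in high.index for b in high.columns if a < b and high.loc[a, b]]
print(pairs)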
The scatterplot matrix provides a visualization for each pair of quantitative variables. The plots above and below the diagonal are identical, with the axes switched; thus, one only needs to look at half of the plots (e.g., the upper diagonal). In Fig. 5.5A, it is apparent that Vars 1–3 have strong linear pairwise relationships. All three of these variables are related to IQ (see Table 5.6). The histogram for each of the quantitative variables is also provided as an additional option to visualize the shape of the variable distributions and reveal obvious outliers (none are seen here).
FIG. 5.5 JMP correlation results for five variables in the ASD dataset. A. Scatterplot matrix with histograms of individual variables on the diagonal. B. Pearson correlation results. C. Spearman correlation results.
The Pearson correlation results (Fig. 5.5B) are given in the same format as the scatterplot matrix. The diagonal values are all 1, since they represent the correlation of each variable with itself, and only half of the values (e.g., the upper diagonal) are needed. For example, the Pearson correlation between the verbal (Var1) and nonverbal IQ (Var2) is 0.8566, and the visualization of that relationship can be seen in the same position of the scatterplot matrix. Note that the values are colored according to their strength, with dark blue being high positive correlations and dark red being high negative correlations. The Spearman correlation results (Fig. 5.5C) are presented in a different way. Each pairwise correlation is listed in a set of two columns that include all possible pairwise combinations. The correlation is given along with a p-value for testing whether the Spearman correlation is zero or not and a bar graph that represents the correlation value. The Spearman correlation for the verbal (Var1) and nonverbal IQ (Var2) is 0.8485, which is significantly different from zero. It can be seen from the bar chart that the IQ variables have the highest Spearman correlation, which aligns with the Pearson results and the scatterplot matrix. Note that all $\binom{25}{2} = 300$ pairwise correlations can be calculated for the 25 quantitative variables and utilized as one method for removing redundant variables prior to cluster analysis.
5.4.1.3 Cluster evaluation of individual features

The subspace clustering approach described in (Al-Jabery et al., 2016) is applied to the ASD data, and the top three clustering configurations selected by majority voting of three validation indices (Davies-Bouldin, Silhouette, and Calinski-Harabasz) are further evaluated to better understand how variables differ between clusters. The subspace clustering method has a uni-dimensional clustering step that offers an alternative way of removing non-discriminating features, rather than checking redundancy through the pairwise correlations. A total of 9 variables were removed as part of this phase of the method (Table 5.6). Note that JMP also has the ability to perform some types of clustering (e.g., k-means and hierarchical) through the menu options Analyze >> Clustering, but those are not explored here since the focus in this chapter is on using statistical methods for cluster evaluation rather than cluster methodology.

Two cluster results. One of the top three clustering results identified two clusters. To analyze the individual features from these results, consider choosing the appropriate analysis based on Fig. 5.1. The independent variable will be the cluster identifier, which will be categorical with #C = 2 categories. The dependent variable will be one of the variables listed in Table 5.6. For the 16 quantitative dependent variables, the appropriate analysis is the 2-sample t-test. For the 2 categorical dependent variables, a χ² test or Fisher's Exact Test should be performed.
2-sample t-test. As an illustration, consider quantitative Var6, the ADOS Communication Social Interaction Total. The following JMP menu commands are used to generate the t-test results.
JMP Commands: 2-sample t-test

Two-sample t-test:
Analyze >> Fit Y by X >> Y-Response [enter Q variable], X-Factor [enter Cluster ID variable]

Select further options within the result output of the previous command:
Means and Std Dev
t-Test (Welch's t-test, assumes unequal variances)
First, it is helpful to look at the means and standard deviations for each cluster (Fig. 5.6A). The results give the sample sizes for the two clusters (n1 = 189, n2 = 19), as well as the means (M1 = 0.5583, M2 = 0.8535), standard deviations (SD1 = 0.1971, SD2 = 0.1469), standard errors of the mean (SEM1 = 0.0143, SEM2 = 0.0337), and a 95% confidence interval for the true cluster means (CI for cluster 1 mean [0.5300, 0.5866], CI for cluster 2 mean [0.7827, 0.9243]).

The Welch's t-test results are given in Fig. 5.6B. In the left column of Fig. 5.6B, notice that the "Difference" indicates the difference in sample means between cluster 2 and cluster 1. The standard error and a 95% confidence interval for the true difference are also given. Notice that zero is not in this interval, indicating there is a significant difference in the two cluster means. The right column of Fig. 5.6B gives the t Ratio, which is the value of the t-test statistic, along with DF, the degrees of freedom. The entries labeled "Prob" represent three different p-values, which correspond to different options for the alternative hypothesis. Since it is of interest here to determine if there is any difference in cluster means, the two-sided p-value labeled "Prob > |t|" will be used. The other two p-values correspond to alternative hypotheses that specify the direction in terms of whether the cluster 1 mean is larger or smaller than the cluster 2 mean (called one-sided tests). The two-sided test checks for a change in either direction. The graph to the right in Fig. 5.6B shows the t-distribution with DF = 25.026 when the null hypothesis is true, and the red line indicates the t Ratio.

In conclusion, the mean ADOS Communication Social Interaction Total differs significantly between clusters 1 and 2 (p-value < 0.0001 using a 2-sample, 2-sided Welch's t-test). The mean of cluster 2 is estimated to be between 0.2197 and 0.3706 units higher than that of cluster 1 (95% CI for the true mean difference).

FIG. 5.6 JMP 2-sample t-test results for the ADOS Communication Social Interaction Total variable in the ASD dataset. A. Summary statistics for each cluster, including a 95% confidence interval for the mean. B. Welch's t-test results.

χ² test and FET. To illustrate the evaluation of a categorical variable, consider Var4, the module of ADOS administered. There are three different types of modules (labeled 0, 0.5, 1). The following JMP menu commands are used to generate the χ² and FET results:
JMP Commands: χ² test and Fisher's Exact Test (FET)

χ² test:
Analyze >> Fit Y by X >> Y-Response [enter C variable], X-Factor [enter Cluster ID variable]

Select further options within the result output of the previous command:
Exact Test >> Fisher's Exact Test
The contingency table and mosaic plot of the data are given in Fig. 5.7A. It can be seen that within each cluster, there are similar proportions of individuals that are administered the ADOS modules labeled "0" and "0.5". However, cluster 1 has a high proportion of individuals administered the ADOS module labeled "1," whereas that proportion is small in cluster 2.

The test results are given in Fig. 5.7B. Summary information is provided in the first row, such as the overall sample size (N = 208), the degrees of freedom ((2 − 1) × (3 − 1) = 2), the negative log-likelihood, and R². Below the summary information, two different methods (Likelihood Ratio, Pearson) for calculating the χ² test statistic and p-value (Prob > ChiSq) are given. The method described previously corresponds to the Pearson method, which has a test statistic of 21.425 and a p-value of <0.0001. However, JMP will produce a warning flag if more than 20% of the expected cell counts are less than 5. In this case, the Fisher's Exact Test (FET) is more appropriate and can be requested. The p-value for the FET is provided as the two-sided Prob ≤ P.

FIG. 5.7 JMP χ² test and FET results for the ADOS Module variable in the ASD dataset. A. Contingency table (clusters 1 and 2 have 31, 35, 123 and 8, 9, 2 individuals administered modules 0, 0.5, and 1, respectively) and mosaic plot. B. Results for the Pearson χ² test and Fisher's Exact Test with a warning message and summary information.
For the ADOS module variable, the FET results are used, since the expected cell count flag was produced, and the resulting p-value is <0.0001. Thus, there is sufficient evidence to conclude that the ADOS module and cluster ID are not independent and that there is an association between the two variables.

Three cluster results. A different one of the top three clustering results identified three clusters. To analyze the individual features from these results, again consider choosing the appropriate analysis based on Fig. 5.1. The independent variable will be the cluster identifier, which will be categorical with #C = 3 categories. The dependent variable will be one of the variables listed in Table 5.6. For the 16 quantitative dependent variables, the appropriate analysis to perform is the one-way ANOVA. For the 2 categorical dependent variables, a χ² test or Fisher's Exact Test is still the appropriate analysis. Only the ANOVA results are shown, since the χ² test is performed as previously described.

One-way ANOVA. As an illustration, consider quantitative Var6, the ADOS Communication Social Interaction Total. The following JMP menu commands are used to generate the one-way ANOVA results.
Three cluster results. Another of the top three clustering results identified three clusters. To analyze the individual features from this result, again consider choosing the appropriate analysis based on Fig. 5.1. The independent variable is the cluster identifier, which is categorical with C = 3 categories. The dependent variable is one of the variables listed in Table 5.6. For the 16 quantitative dependent variables, the appropriate analysis is the one-way ANOVA. For the 2 categorical dependent variables, a χ² test or Fisher's exact test remains the appropriate analysis. Only the ANOVA results are shown, since the χ² test is performed as previously described.

One-way ANOVA. As an illustration, consider the quantitative variable Var6, the ADOS Communication Social Interaction Total. The following JMP menu commands are used to generate the one-way ANOVA results.
JMP Commands: One-way ANOVA
• One-way ANOVA: Analyze >> Fit Y by X >> Y-Response [enter Q variable]; X-Factor [enter Cluster ID variable]. Then, within the result output, select Means/Anova.
• Tukey HSD: within the result output, select Compare Means >> All Pairs, Tukey HSD.
First, the sample sizes and means for each cluster are given in Fig. 5.8A, along with 95% confidence intervals for the true means.
(A) Means for Oneway Anova (Std Error uses a pooled estimate of error variance):

Level   Number   Mean       Std Error   Lower 95%   Upper 95%
1       181      0.554202   0.01453     0.52556     0.5828
2       22       0.781508   0.04168     0.69934     0.8637
3       5        0.846212   0.08742     0.67386     1.0186

(B) Analysis of Variance:

Source     DF    Sum of Squares   Mean Square   F Ratio   Prob > F
Cluster    2     1.3623642        0.681182      17.8269   <.0001*
Error      205   7.8332278        0.038211
C. Total   207   9.1955920

(C) Connecting Letters Report (levels not connected by the same letter are significantly different):

A    Level 3   0.84621212
A    Level 2   0.78150826
B    Level 1   0.55420224

Ordered Differences Report:

Level − Level   Difference   Std Err Dif   Lower CL    Upper CL    p-Value
3 − 1           0.2920099    0.0886188     0.082788    0.5012315   0.0033*
2 − 1           0.2273060    0.0441358     0.123105    0.3315070   <.0001*
3 − 2           0.0647039    0.0968454     −0.163940   0.2933479   0.7823
FIG. 5.8 JMP one-way ANOVA results for the ADOS Communication Social Interaction Total variable in the ASD dataset. A. Summary statistics for each cluster, including a 95% confidence interval for the mean. B. Global F-test results. C. Tukey pairwise comparison results.
Observe that cluster 1 has the largest sample size but the smallest sample mean, while cluster 3 has the smallest sample size and the largest sample mean. Fig. 5.8B provides the results of the global F-test in the form of an ANOVA table. The degrees of freedom (DF), sum of squares, and mean square are provided for the between-group (Cluster) and within-group (Error) sources of variation. The F-test statistic is given by the F Ratio, and the p-value is provided in the Prob > F column. Thus, for the ADOS Communication Social Interaction Total variable, there is a significant difference among the cluster means (p-value < 0.0001). However, this result does not indicate which of the cluster means differ, so Tukey pairwise comparisons can be conducted if knowledge of the specific cluster differences is desired. Fig. 5.8C gives two different ways to view the Tukey results. The connecting letters report provides the mean for each cluster along with a set of letters; levels not connected by the same letter are significantly different, indicating in these results that cluster 1 differs significantly from clusters 2 and 3, but the latter two do not differ significantly from each other. The ordered differences report provides more detail about each of the three comparisons. For example, the first row compares clusters 3 and 1: the difference in means between the two clusters (0.292) and its standard error (0.0886) are given, along with the Tukey p-value testing for a difference between the two groups (p-value = 0.0033). A 95% confidence interval for the true mean difference is also provided, indicating here that the mean of cluster 3 is estimated to be between 0.0828 and 0.5012 units higher than that of cluster 1. From these results, it is concluded that, for the ADOS Communication Social Interaction Total variable, the mean of cluster 1 differs significantly from the means of cluster 2 (p-value < 0.0001) and cluster 3 (p-value = 0.0033), with cluster 1 having the smallest mean. There is no significant difference between the means of clusters 2 and 3 (p-value = 0.7823).
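The same global F-test and Tukey-Kramer comparisons can be sketched outside JMP. In the example below, only the group sizes (181, 22, 5) are taken from Fig. 5.8A; the values themselves are simulated placeholders, so the printed numbers will not match the figure.

```python
# Sketch of the one-way ANOVA and Tukey HSD workflow; data are simulated.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
groups = {1: rng.normal(0.55, 0.20, 181),
          2: rng.normal(0.78, 0.20, 22),
          3: rng.normal(0.85, 0.20, 5)}

# Global F-test: does at least one cluster mean differ?
f_stat, p_val = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.3f}, p = {p_val:.4g}")

# Tukey-Kramer pairwise comparisons (handles unequal group sizes)
values = np.concatenate(list(groups.values()))
labels = np.concatenate([[k] * len(v) for k, v in groups.items()])
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```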
5.4.1.4 Summary of results
The analysis methods described above can be used to evaluate all the variables used in a particular clustering result. In this example, the top three clustering results based on the validation indices were selected for further evaluation. For each clustering result and feature, an appropriate statistical test was applied to test for differences between clusters. Table 5.7 gives the results of these analyses, along with a summary of variables with high pairwise correlations. Note that an appropriate next step would be to apply a multiple testing correction method to control the familywise false positive rate across the tests conducted. The results for significant variables can then be further interpreted to determine which clusters have significantly higher or lower means, helping to establish the clinical relevance of the subgroups.
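To make the multiple-testing step concrete, the sketch below applies two standard adjustments with statsmodels: Holm, which controls the familywise error rate, and Benjamini-Hochberg, which controls the false discovery rate. The p-values here are placeholders, not the values reported in Table 5.7.

```python
# Multiple testing adjustment over a vector of per-variable p-values.
from statsmodels.stats.multitest import multipletests

pvals = [0.0001, 0.0033, 0.021, 0.34, 0.78]  # illustrative only

for method in ("holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, [round(p, 4) for p in p_adj], list(reject))
```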
Table 5.7 Summary of p-values for the top three clustering results (Outputs #1-#3) for all 18 individual variables included in the clustering.
P-values highlighted in red are statistically significant at the α = 0.05 level. The analysis used is indicated by the box color (FET = green, t-test = blue, one-way ANOVA = orange). *Variables with pairwise Pearson correlations >0.7 are also reported.
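The correlation screen in the footnote above can be reproduced with pandas. The sketch below assumes the clustering variables sit in a DataFrame (the placeholder data here are random, so few pairs, if any, will be flagged).

```python
# Flag variable pairs with |Pearson r| > 0.7; 'df' is a placeholder DataFrame.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(208, 5)),
                  columns=[f"Var{i}" for i in range(1, 6)])  # illustrative data

corr = df.corr(method="pearson")
pairs = [(a, b, round(corr.loc[a, b], 3))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if abs(corr.loc[a, b]) > 0.7]
print(pairs)
```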
5.5 Summary
This chapter provides an introduction to statistical analysis concepts and connects them to a workflow for cluster evaluation. The focus of the chapter is on describing the ideas behind statistical inference and explaining details of some of the common statistical methods that can be used in cluster evaluation. The methods and examples presented are specifically for identifying which individual variables differ significantly between clusters and for understanding how they differ. This type of analysis can help subject-matter researchers identify characteristics of the subgroups, which may be helpful for understanding clinical relevance. There are many other ways that statistical methods could be used to enhance cluster evaluation (e.g., linear discriminant analysis) or to connect to other research questions related to clustering. This chapter provides an example of one application while illustrating fundamental statistical concepts. An analysis workflow is also provided to aid in selecting an appropriate statistical analysis for different types of research questions.
References

Al-Jabery, K., Obafemi-Ajayi, T., Olbricht, G. R., Takahashi, T. N., Kanne, S., & Wunsch, D. (2016). Ensemble statistical and subspace clustering model for analysis of autism spectrum disorder phenotypes. Conference Proceedings IEEE Engineering in Medicine and Biology Society, 2016, 3329–3333. https://doi.org/10.1109/EMBC.2016.7591440

Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243–256. https://doi.org/10.1016/j.patcog.2012.07.021

Bender, R., & Lange, S. (2001). Adjusting for multiple testing–when and how? Journal of Clinical Epidemiology, 54(4), 343–349. https://doi.org/10.1016/S0895-4356(00)00314-0

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1), 289–300.

Bland, J. M., & Altman, D. G. (1995). Multiple significance tests: The Bonferroni method. BMJ, 310(6973), 170. https://doi.org/10.1136/bmj.310.6973.170

Braun, H. (1994). The collected works of John W. Tukey: Multiple comparisons, volume VIII (1st ed.). CRC Press.

Bremer, M., & Doerge, R. W. (2009). Statistics at the bench: A step-by-step handbook for biologists (1st ed.). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.

van Buuren, S. (2018). Flexible imputation of missing data (2nd ed.). Boca Raton, FL: CRC Press.

Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York, NY: Wiley.

Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology. https://doi.org/10.5334/irsp.82

Devore, J. L. (2015). Probability and statistics for engineering and the sciences (9th ed.). Boston, MA: Cengage Learning.

Fischbach, G. D., & Lord, C. (2010). The Simons Simplex Collection: A resource for identification of autism genetic risk factors. Neuron, 68(2), 192–195. https://doi.org/10.1016/j.neuron.2010.10.006

Georgiades, S., Szatmari, P., & Boyle, M. (2013). Importance of studying heterogeneity in autism. Neuropsychiatry, 3(2), 123–125. https://doi.org/10.2217/npy.13.8

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R (1st ed.). New York, NY: Springer-Verlag.

Johnson, R. A., & Wichern, D. W. (2008). Applied multivariate statistical analysis (6th ed.). Pearson.

Kim, H.-Y. (2017). Statistical notes for clinical researchers: Chi-squared test and Fisher's exact test. Restorative Dentistry & Endodontics. https://doi.org/10.5395/rde.2017.42.2.152

Kramer, C. Y. (1956). Extension of multiple range tests to group means with unequal numbers of replications. Biometrics, 12(3), 307–310. https://doi.org/10.2307/3001469

Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621. https://doi.org/10.1080/01621459.1952.10483441

Kutner, M., Nachtsheim, C., Neter, J., & Li, W. (2004). Applied linear statistical models (5th ed.). McGraw-Hill/Irwin.

Lissitz, R. W., & Chardos, S. (1975). A study of the effect of the violation of the assumption of independent sampling upon the type I error rate of the two-group t-test. Educational and Psychological Measurement, 35(2), 353–359.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1), 50–60. https://doi.org/10.1214/aoms/1177730491

Pett, M. A. (2015). Nonparametric statistics for health care research: Statistics for small samples and unusual distributions (2nd ed.). SAGE Publications, Inc.

Ramsey, F., & Schafer, D. (2012). The statistical sleuth: A course in methods of data analysis (3rd ed.). Cengage Learning.

Rice, T., Schork, N., & Rao, D. C. (2008). Methods for handling multiple testing. Advances in Genetics, 60, 293–308. https://doi.org/10.1016/S0065-2660(07)00412-9

Rosner, B. (2015). Fundamentals of biostatistics (8th ed.). Cengage Learning.

Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student's t-test and the Mann–Whitney U test. Behavioral Ecology, 17(4), 688–690. https://doi.org/10.1093/beheco/ark016

Ryan, T. P. (2013). Sample size determination and power (1st ed.). https://doi.org/10.1002/9781118439241

Samuels, M. L., & Witmer, J. A. (2015). Statistics for the life sciences (5th ed.). Pearson.

SAS Institute Inc. (n.d.). JMP version 13. Cary, NC: SAS Institute Inc.

Sibbald, B., & Roland, M. (1998). Understanding controlled trials: Why are randomised controlled trials important? BMJ, 316(7126), 201. https://doi.org/10.1136/bmj.316.7126.201