Journal of Business Research 58 (2005) 935 – 943
A simulation study to investigate the use of cutoff values for assessing model fit in covariance structure models
Subhash Sharma a,*, Soumen Mukherjee b, Ajith Kumar c, William R. Dillon d
a Moore School of Business, University of South Carolina, Columbia, SC 29208, USA
b MAPS Inc., Waltham, MA, USA
c Arizona State University, Tempe, AZ, USA
d Southern Methodist University, Dallas, TX, USA
Received 3 January 2002; accepted 14 October 2003
Abstract
In this paper, we used simulations to investigate the effect of sample size, number of indicators, factor loadings, and factor correlations on frequencies of the acceptance/rejection of models (true and misspecified) when selected goodness-of-fit indices were compared with prespecified cutoff values. We found the percent of true models accepted when a goodness-of-fit index was compared with a prespecified cutoff value was affected by the interaction of the sample size and the total number of indicators. In addition, for the Tucker-Lewis index (TLI) and the relative noncentrality index (RNI), model acceptance percentages were affected by the interaction of sample size and size of factor loadings. For misspecified models, model acceptance percentages were affected by the interaction of the number of indicators and the degree of model misspecification. This suggests that researchers should use caution in using cutoff values for evaluating model fit. However, the study suggests that researchers who prefer to use prespecified cutoff values should use TLI, RNI, NNCP, and root-mean-square-error-of-approximation (RMSEA) to assess model fit. The use of GFI should be discouraged. © 2004 Elsevier Inc. All rights reserved.
Keywords: Structural equation modeling; Confirmatory factor analysis; Goodness-of-fit-indices; Simulation
1. Introduction
The evaluation of covariance structure models is typically carried out in two stages: (1) an evaluation of overall model fit and (2) evaluations of specific parts/aspects of the model such as the measurement properties of indicators and/or strength of structural relationships. The chi-square test statistic was among the first set of indices proposed to evaluate overall model fit to the data in a statistical sense. As is the case with most statistical tests, the power of the chi-square test increases with sample size. Since in covariance structure analysis, the nonrejection of the model subsumed under the null hypothesis is typically the desired outcome, the rejection of the model through the chi-square test in large samples, even for trivial differences between the sample and the estimated covariance matrices, soon came to be perceived as problematic (Bentler and Bonett, 1980;
* Corresponding author. Tel.: +1-803-777-4912; fax: +1-803-777-6876. E-mail address: [email protected] (S. Sharma).
0148-2963/$ – see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jbusres.2003.10.007
Tucker and Lewis, 1973). In response to this ‘‘sample-size’’ problem of the chi-square test statistic, several alternative goodness-of-fit indices were proposed for evaluating overall model fit. In turn, a number of simulation studies evaluated the sensitivity of these indices to sample-size variations (e.g., Anderson and Gerbing, 1984; Bearden et al., 1982; Bentler, 1990; Marsh et al., 1988). In their comprehensive, integrative review of various goodness-of-fit indices, McDonald and Marsh (1990) concluded that only four indices were relatively insensitive to sample size: the noncentrality parameter (NCP) of McDonald (1989) and a normed version thereof (NNCP), the relative noncentrality index (RNI), and the Tucker-Lewis index (TLI). An index is defined to be insensitive to sample size if the expected value of its sampling distribution is not affected by sample size. However, researchers typically evaluate model fit by comparing the value of some goodness-of-fit index with some prespecified cutoff value. Based on the results of a recent simulation study, Hu and Bentler (1998, 1999) suggest that a cutoff value close to 0.95 for TLI or RNI, a cutoff value close to 0.90 for NNCP or a
S. Sharma et al. / Journal of Business Research 58 (2005) 935–943
cutoff value of 0.06 for root-mean-square-error-of-approximation (RMSEA; Steiger and Lind, 1980; Steiger, 1990) is needed before one could claim good fit of the model to the data. However, they caution that one cannot employ a specific cutoff value because the indices may be affected by such factors as sample size, estimation methods, and distribution of data. Furthermore, finding that the expected value of an index is independent of sample size does not logically imply that the percentage of index values exceeding the cutoff value is also independent of sample size. Therefore, it is quite possible that even if the expected value of an index is unaffected by sample size, the relative frequencies of model acceptance and rejection when a prespecified cutoff value is used could depend on sample size. Should this occur, the use of a universal cutoff value may be inappropriate, as replication studies of a given model using different sample sizes could lead to different conclusions regarding the acceptance/rejection of models. In addition, for a given sample size, the relative frequencies of model acceptance and rejection may vary with the number of indicators in the model, which is typically a function of the number of constructs or factors in the model. However, for a given number of constructs, the number of indicators could vary due to the use of shorter or longer versions of previously developed scales.
The objective of this paper, therefore, is to use simulation to empirically assess the effects of factors, such as sample size and number of indicators, on goodness-of-fit indices and, more importantly, on the use of prespecified cutoff values for assessing model fit. The effects will be assessed both for true and for misspecified models.
The paper is organized as follows: First, we briefly discuss goodness-of-fit indices evaluated in this study and their suggested cutoff values. Second, we present the simulation design employed. Third, we present the results of our simulations. Finally, we discuss the implications of our results for using prespecified cutoff values for acceptance/rejection decisions.
2. Goodness-of-fit indices and their cutoff values
2.1. Goodness-of-fit indices
While several goodness-of-fit indices have been proposed in the literature, this study will assess the following five indices: the NNCP, the RNI, the TLI, the RMSEA, and the goodness-of-fit index (GFI) of Joreskog and Sorbom (1982). We now discuss our rationale for including these five indices. First, in an integrative review of several goodness-of-fit indices, McDonald and Marsh (1990) concluded that among the fit indices typically used by researchers, only NCP, NNCP, RNI, and TLI were insensitive to sample size. We excluded the NCP from our analysis because we did not find it to be used frequently in substantive research for evaluating model fit, presumably because advocates of this index did not specify cutoff values for its use. Second, Marsh et al. (1988) did not include RMSEA in their simulation study, and neither did McDonald and Marsh (1990) in their integrative review. More recently, however, Browne and Cudeck (1993) suggested using this index to assess model fit. This index was included by Hu and Bentler (1998) in their simulation study and found to be quite sensitive to model misspecification. Finally, the goodness-of-fit index, although found to be sensitive to sample size in a number of simulation studies, is still being used extensively by researchers to assess model fit.
2.2. Cutoff values for assessing model fit
As mentioned earlier, researchers typically compare the computed value of some goodness-of-fit index to a prespecified cutoff value for evaluating model fit. For normed fit indices (i.e., goodness-of-fit index, NNCP, RNI, and TLI) whose values typically range between 0 and 1, with 1 indicating perfect fit, the cutoff value of 0.90 recommended by Bentler and Bonett (1980) is the most popular and widely employed by researchers to evaluate model fit. The model is considered to have an unacceptable fit if the value of the fit index is less than 0.90. We used a cutoff value of 0.90 for the NNCP even though McDonald and Marsh (1990) did not prescribe any cutoffs for this index. For the RMSEA, whose value does not range between 0 and 1, Browne and Cudeck (1993) suggested that values of 0.05 or less would indicate a "close fit", a value of 0.08 or less would indicate a "reasonable fit", and values greater than 0.10 would indicate "unacceptable fit".
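The cutoff rule described above can be sketched in code. The paper does not print the index formulas, so the definitions below are the commonly cited textbook forms and should be read as assumptions, in particular the NNCP, which is taken here as exp(-d/2) with d = (chi-square - df)/(N - 1). The function names and the chi-square values in the usage note are hypothetical; GFI is omitted because it needs the fitted and sample covariance matrices rather than chi-square values alone.

```python
import math

def rmsea(chi2, df, n):
    """RMSEA point estimate (assumed form): sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def tli(chi2, df, chi2_null, df_null):
    """Tucker-Lewis index (assumed form): based on chi-square/df ratios of null and target models."""
    ratio_null = chi2_null / df_null
    return (ratio_null - chi2 / df) / (ratio_null - 1.0)

def rni(chi2, df, chi2_null, df_null):
    """Relative noncentrality index (assumed form): 1 - (chi2 - df) / (chi2_null - df_null)."""
    return 1.0 - (chi2 - df) / (chi2_null - df_null)

def nncp(chi2, df, n):
    """Normed noncentrality parameter, assumed here to be exp(-0.5 * (chi2 - df) / (n - 1))."""
    return math.exp(-0.5 * (chi2 - df) / (n - 1))

# Cutoffs used in this study: 0.90 for NNCP, RNI, TLI, GFI; 0.05 for RMSEA.
CUTOFFS = {"NNCP": 0.90, "RNI": 0.90, "TLI": 0.90, "GFI": 0.90, "RMSEA": 0.05}

def accept(index_name, value):
    """Accept the model if the index exceeds its cutoff (falls below it for RMSEA)."""
    if index_name == "RMSEA":
        return value < CUTOFFS[index_name]
    return value > CUTOFFS[index_name]

# Hypothetical chi-square values for a two-factor, eight-indicator model
# (df = 19 for the target model, df = 28 for the independence null model).
example = {
    "RMSEA": rmsea(30.0, 19, 200),
    "TLI": tli(30.0, 19, 500.0, 28),
    "RNI": rni(30.0, 19, 500.0, 28),
    "NNCP": nncp(30.0, 19, 200),
}
decisions = {name: accept(name, value) for name, value in example.items()}
```

With these made-up inputs, RMSEA falls just above 0.05 while TLI, RNI, and NNCP all exceed 0.90, so the same fitted model would be rejected by one rule and accepted by the other three.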
3. Simulation study
Simulation studies were done to assess the effects of sample size, number of indicators, size of factor loadings, and size of factor correlations on the mean value of the selected fit indices and on the percent of models accepted using prespecified cutoff values. Two specifications of correlated two-factor, four-factor, six-factor, and eight-factor confirmatory factor models, with four indicators per factor, were used. The two-factor model has a total of eight indicators and one correlation among the two factors. The four-factor model has a total of 16 indicators and six correlations among the four factors. The six-factor model has a total of 24 indicators and 15 correlations among the six factors. The eight-factor model has a total of 32 indicators and 28 correlations among the eight factors.
In the first specification, the correct or true model was estimated. In the true or correct model, the specification of the model estimated in the sample was identical with the population model. That is, the model should have a perfect fit to the data. Any lack of fit is attributed to sampling error. In the second specification, the model was not correctly specified, in that the model estimated in the sample was not the same as the population model. Specifically, the correlations among the factors were not estimated. Misspecified models
were included in the study to assess the extent to which the use of cutoff values might result in Type II errors (i.e., the decision to accept the model specified under the null hypothesis as true when an alternative model is the correct one).
4. Simulation methodology
Four factors were systematically varied to create the simulation experimental design: (1) four sample sizes were used (100, 200, 400, and 800); (2) the number of indicators was varied from 8 to 32, in steps of 8 (i.e., 8, 16, 24, and 32); (3) three factor loadings (i.e., .3, .5, and .7) were used; and (4) three correlations among the factors were employed (.3, .5, and .7). Following prior simulation studies, a confirmatory factor analysis (CFA) model was chosen.
4.1. Data generation
The simulation design resulted in a total of 36 different population covariance matrices (3 levels of factor loadings × 3 levels of factor correlations × 4 levels of number of indicators). A total of 100,000 observations were generated from each of the 36 population covariance matrices using the GGNSM procedure (IMSL Library, 1980). From each of the 36 sets of 100,000 observations representing a given population covariance matrix, 100 replications of each sample size were randomly drawn. That is, 400 samples were drawn from each of the 36 sets of observations. This gave a total of 14,400 samples (3 levels of factor loadings × 3 levels of factor correlations × 4 levels of number of indicators × 4 levels of sample sizes × 100 replications). A sample covariance matrix was computed from each of the 14,400 samples.
4.2. Model estimated: true models
For each sample, the corresponding true model was estimated. All the parameters, including the correlations among the factors, were estimated. For a given index, the percent of true models rejected when compared with a prespecified cutoff value would give a measure of the Type I error committed by the usage of the respective index for model acceptance/rejection decisions.
4.3. Model estimated: misspecified models
As indicated earlier, another objective of our study was to investigate model acceptance/rejection frequencies when cutoff values are used to evaluate the fit of misspecified models. In general, misspecification could occur in countless ways.
However, since our main concern was to assess how the fit indices behaved for misspecified models and to keep the simulation study to manageable levels, we chose a subset that would span a wide range of misspecifications with respect to the lack of overall fit. The subset of models chosen were those that resulted from systematically not
estimating the correlations among the factors. Specifically, misspecified models were operationalized by positing orthogonal models for each of the following combinations: (1) λ = .3, φ = .3; (2) λ = .5, φ = .5; and (3) λ = .7, φ = .7, where λ and φ denote factor loadings and factor correlations, respectively. These combinations represent varying degrees of model misspecification, with the first combination resulting in the smallest amount of misspecification and the third combination resulting in the largest amount of misspecification. For each estimated model (true and misspecified), the five goodness-of-fit indices discussed earlier were computed. In addition, for each goodness-of-fit index, the percent of times the fitted models were accepted was computed for each cell of the simulation design on the basis of a prespecified cutoff value (values exceeding 0.90 for NNCP, TLI, RNI, and goodness-of-fit index and values below 0.05 for RMSEA). The percent of misspecified models accepted when compared with a prespecified cutoff value would give a measure of the Type II error committed by the usage of the respective index for model acceptance/rejection decisions.
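The data-generation step of Section 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the unique variances are assumed to be 1 - λ² so that every indicator has unit variance (the paper does not report them), samples are drawn directly from a multivariate normal distribution with NumPy instead of being subsampled from 100,000 observations generated with the IMSL GGNSM routine, and the function name `population_covariance` is hypothetical.

```python
import numpy as np

def population_covariance(n_factors, loading, factor_corr, per_factor=4):
    """Implied covariance matrix Lambda * Phi * Lambda' + Theta of a correlated-factor CFA model."""
    p = n_factors * per_factor
    lam = np.zeros((p, n_factors))
    for j in range(n_factors):
        # Each indicator loads on exactly one factor.
        lam[j * per_factor:(j + 1) * per_factor, j] = loading
    phi = np.full((n_factors, n_factors), factor_corr)
    np.fill_diagonal(phi, 1.0)
    # Assumption: unique variances chosen so that indicator variances equal 1.
    theta = np.eye(p) * (1.0 - loading ** 2)
    return lam @ phi @ lam.T + theta

# One cell of the design: two factors (eight indicators), loadings .7, factor correlation .5.
sigma = population_covariance(n_factors=2, loading=0.7, factor_corr=0.5)

# One replication at N = 200 and its sample covariance matrix.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(sigma.shape[0]), sigma, size=200)
sample_cov = np.cov(sample, rowvar=False)

# Bookkeeping for the full design described in the text.
n_population_matrices = 3 * 3 * 4                 # loadings x correlations x indicator counts
n_samples = n_population_matrices * 4 * 100       # x sample sizes x replications
n_factor_corrs = [k * (k - 1) // 2 for k in (2, 4, 6, 8)]
```

Under the unit-variance assumption, two indicators of the same factor covary by λ² and two indicators of different factors by λ²φ, which is the part of the structure that an orthogonal (misspecified) model cannot reproduce.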
5. Results
In Monte Carlo simulations of covariance structure models, some of the samples analyzed inevitably yield improper solutions, wherein one or more of the parameter estimates are inadmissible (e.g., zero or negative error variances, standardized factor loadings or interfactor correlations exceeding one, etc.). While such improper solutions would be discarded in substantive research contexts where, typically, a single-sample covariance matrix is analyzed, it is important to include them in the analysis of the Monte Carlo results because the sampling distribution that is ultimately being evaluated within each treatment of the simulation design includes all the sample covariance matrices that are generated. There were a total of 0.08% improper solutions for true models and 5.69% improper solutions for misspecified models. Consistent with the results of previous simulations, a majority of the improper solutions were for small sample sizes (N = 100 and 200). There were no improper solutions for samples of size 800.
To assess the effect of the manipulated factors, the data were analyzed using ANOVA and computing the effect size, η². The η² associated with each estimated effect represents the percent of variance in the dependent variable that is accounted for by that effect after accounting for the impact of all other effects. Because of large sample sizes, many of the effects that are practically insignificant (as measured by η²) will be statistically significant. Consequently, we present the results and the discussion only for those factors that are statistically significant and whose η² is greater than 3% (Anderson and Gerbing, 1984; Sharma et al., 1989); these effects will be referred to as significant effects.
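As an illustration of the effect-size measure, the sketch below computes η² for a balanced two-factor layout with one observation per cell. It assumes the classical definition, η² = SS(effect) / SS(total); in a balanced design the sums of squares are orthogonal, so each effect's share does not depend on the other effects. The function name and the toy data are hypothetical and do not come from the study.

```python
import numpy as np

def eta_squared_two_way(y):
    """Eta-squared for factors A (rows), B (columns), and A x B in a balanced layout.

    y has shape (a, b) with a single observation per cell, so the A x B term
    is whatever variation the two main effects leave unexplained.
    """
    y = np.asarray(y, dtype=float)
    grand = y.mean()
    ss_total = ((y - grand) ** 2).sum()
    ss_a = y.shape[1] * ((y.mean(axis=1) - grand) ** 2).sum()
    ss_b = y.shape[0] * ((y.mean(axis=0) - grand) ** 2).sum()
    ss_ab = ss_total - ss_a - ss_b
    return {"A": ss_a / ss_total, "B": ss_b / ss_total, "AxB": ss_ab / ss_total}

# Purely additive toy data: no interaction, so A and B explain everything.
additive = eta_squared_two_way([[1.0, 2.0], [3.0, 4.0]])

# Toy data in which the effect of A depends on the level of B.
interactive = eta_squared_two_way([[1.0, 2.0], [3.0, 8.0]])
```

An effect would pass the screening rule used here only if it is statistically significant and its η² exceeds 0.03.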
Table 1
Eta-squares for mean value of GFIs and percent of times models accepted for true models

                                                NNCP            RMSEA           RNI             TLI             GFI
Effect                                          (a)     (b)     (a)     (b)     (a)     (b)     (a)     (b)     (a)     (b)
Sample size (N)                                 0.227   0.459   0.284   0.603   –*      0.310   –       0.304   0.633   0.525
Number of indicators (NI)                       0.151   0.225   –       0.113   –       –       –       –       0.256   0.187
Factor loadings (L)                             –       –       –       –       –       0.386   –       0.378   –       –
Factor correlations (P)                         –       –       –       –       –       –       –       –       –       –
Sample size × Number of indicators (N × NI)     0.221   0.309   –       0.208   –       0.055   –       0.056   0.095   0.289
Sample size × Loadings (N × L)                  –       –       –       –       –       0.168   –       0.175   –       –

(a) Eta-square for goodness-of-fit indices.
(b) Eta-square for percent of times true models accepted for cutoff value of 0.90 (0.05 for RMSEA).
* Not significant at P ≤ .05.
5.1. True models
5.1.1. Goodness-of-fit indices
As indicated earlier, we performed a 3 × 3 × 4 × 4 (Factor Correlations × Factor Loadings × Sample Size × Number of Indicators) ANOVA, with each goodness-of-fit index as the dependent variable. Table 1 presents the significant results. The following conclusions can be drawn from the table: (1) the Sample Size × Number-of-Indicators interaction (N × NI interaction) is the only interaction that is significant, and this interaction is significant only for NNCP and goodness-of-fit index; (2) the size of factor loadings and the size of correlations among the factors do not affect any of the goodness-of-fit indices; (3) sample size affects NNCP, RMSEA, and goodness-of-fit index; and (4) the number of indicators affects only NNCP and goodness-of-fit index.
To gain further insights into these effects, we examine the means and standard deviations of goodness-of-fit indices for various combinations of sample sizes and number of indicators (the effects corresponding to the N × NI interaction). Table 2 presents the means and standard deviations. It can be seen that RMSEA is not substantially affected by sample size, and irrespective of the number of indicators, the effect seems to be the same for sample sizes of 200 and over. For NNCP and goodness-of-fit index, the effect of sample size becomes more prominent as the number of indicators increases. The mean values for the NNCP reveal the nature of the interaction and, also, the reason why McDonald and Marsh (1990) and Marsh et al. (1988) found this index to be insensitive to sample size. If the analysis is restricted to results for models with 8 or 16 indicators, then the NNCP would be insensitive to sample size in our study as well. The inconsistency arises as a consequence of including models with a larger number of indicators (i.e., 24 and 32 indicators) in our simulation.
While it appears from the mean values that RNI and TLI are affected by sample size for a large number of indicators, this effect is not significant, and this conclusion is consistent with previous studies. However, the nonsignificance is probably due to the fact that the standard deviations of these two indices are relatively large compared with the other three. McDonald and Marsh (1990) noted that RNI and TLI are normed in the population (that is, they assume values between 0 and 1) but not in the sample, especially for small sample sizes. Bentler (1990) noted that the range for
Table 2
Means and standard deviations of the GFI for true models

Number of indicators and sample size interaction (N × NI)

               8 indicators                16 indicators               24 indicators               32 indicators
Index          100    200    400    800    100    200    400    800    100    200    400    800    100    200    400    800
NNCP    Mean   1.00   1.00   1.00   1.00   0.97   0.99   1.00   1.00   0.88   0.97   0.99   1.00   0.72   0.93   0.98   1.00
        SD     0.04   0.02   0.01   0.01   0.08   0.04   0.02   0.01   0.11   0.06   0.03   0.02   0.13   0.08   0.04   0.02
RMSEA   Mean   0.02   0.01   0.01   0.01   0.02   0.01   0.01   0.01   0.03   0.01   0.01   0.00   0.04   0.01   0.01   0.00
        SD     0.03   0.02   0.01   0.01   0.02   0.01   0.01   0.01   0.02   0.01   0.01   0.01   0.01   0.01   0.01   0.00
RNI     Mean   1.02   1.00   1.00   1.00   1.26   1.00   1.00   1.00   0.87   0.96   0.99   1.00   0.77   0.94   0.98   1.00
        SD     2.19   1.52   0.15   0.06   10.79  0.40   0.09   0.04   0.34   0.14   0.09   0.04   0.22   0.15   0.07   0.04
TLI     Mean   1.02   1.00   1.01   1.00   1.26   1.00   1.00   1.00   0.87   0.96   0.99   1.00   0.77   0.94   0.99   1.00
        SD     1.98   1.37   0.13   0.05   10.19  0.38   0.08   0.04   0.33   0.14   0.08   0.04   0.21   0.15   0.07   0.03
GFI     Mean   0.93   0.96   0.98   0.99   0.86   0.93   0.96   0.98   0.81   0.89   0.94   0.97   0.76   0.86   0.93   0.96
        SD     0.02   0.01   0.00   0.00   0.01   0.01   0.00   0.00   0.01   0.01   0.00   0.00   0.01   0.01   0.00   0.00

Values of TLI and RNI for models whose factor loadings are .50 or .70

               8 indicators                16 indicators               24 indicators               32 indicators
Index          100    200    400    800    100    200    400    800    100    200    400    800    100    200    400    800
RNI     Mean   1.00   1.00   1.00   1.00   0.98   1.00   1.00   1.00   0.94   0.99   1.00   1.00   0.89   0.98   0.99   1.00
        SD     0.10   0.04   0.02   0.01   0.07   0.04   0.02   0.01   0.07   0.03   0.02   0.01   0.08   0.04   0.02   0.01
TLI     Mean   1.00   1.00   1.00   1.00   0.98   1.00   1.00   1.00   0.94   0.99   1.00   1.00   0.89   0.97   0.99   1.00
        SD     0.11   0.04   0.02   0.01   0.07   0.04   0.02   0.01   0.07   0.04   0.02   0.01   0.08   0.04   0.02   0.01

Column headings under each number of indicators are sample sizes. For each index, the values at the top row indicate the means, and the values at the bottom row indicate the standard deviations.
TLI is large, especially for small samples. In fact, for a sample size of 100, the range of TLI was as high as 322.78 (low value of −17.89 and a high value of 304.89) and the range of RNI was as high as 341.60 (low value of −18.99 and a high value of 322.61). These "outliers" obviously would affect the significance tests. An examination of the outliers suggests that most of them occur for cases that have small factor loadings (i.e., .3) and small sample sizes (i.e., 100). We can only speculate as to why only these two indices (out of the five) exhibit such large fluctuations. A reasonable conjecture is that these two indices, in contrast to the other three, are, essentially, ratios of two statistics derived from the null and true models. Therefore, these indices are affected by the badness of the null model as well as the goodness-of-fit of the hypothesized model. This conjecture is further supported by the fact that these two indices are undefined in the population if the null model is true, suggesting that these indices would be extremely unstable in samples if the null model is approximately true (McDonald and Marsh, 1990). This problem is obviously exacerbated in the case of small samples.
To determine if the behavior of TLI and RNI changes when factor loadings are .5 or greater, we reanalyzed the data by deleting the models whose factor loadings are .30. The results indicated that the sample size and the number of indicators, and their interaction, were significant (values of η² for the N × NI interaction are 0.085 and 0.089 for RNI and TLI, respectively; values of η² for sample size are equal to 0.124 and 0.128 for RNI and TLI, respectively; and values of η² for the number of indicators are equal to 0.058 and 0.062 for RNI and TLI, respectively). Table 2 also gives the means and standard deviations for models whose factor loadings are .50 or .70.
The behavior of RNI and TLI is similar to that of GFI and NNCP; however, these two indices do not seem to be substantially affected by sample size and number of indicators.
The results for the mean values of the indices in Table 2 can be summarized as follows: The RMSEA is the least affected index and is insensitive to sample size for sample sizes of over 200. Goodness-of-fit index and NNCP are insensitive to sample size above some threshold (sample size) value; however, this threshold value likely varies monotonically with the number of manifest indicators in the model and, furthermore, this threshold value may not be the same for all the indices. That is, for a given index, the sample size at which the index becomes insensitive (to sample size) could be a function of the number of indicators. The behavior of TLI and RNI is erratic for models with small factor loadings (i.e., .30). When these models are deleted, the behavior of TLI and RNI is similar to that of goodness-of-fit index and NNCP, in that TLI and RNI are affected by sample size, and the effect depends on the number of indicators.
The question then becomes: Are the effects of sample size, number of indicators, factor loadings, and factor correlations the same when one uses these indices to make model acceptance/rejection decisions by comparing
an index value to a prespecified cutoff value? That is, what is the impact of the manipulated factors on the Type I error, the error of rejecting the model when it is indeed true?
5.1.2. Percent of models accepted
For each of the 144 cells or conditions defined by sample size (four levels), number of indicators (four levels), factor loadings (three levels) and factor correlations (three levels), the percent of models accepted for each index was computed. The model acceptance/rejection decision was made by comparing the value of the index to a prespecified cutoff value (0.90 for NNCP, RNI, TLI, and GFI, and 0.05 for RMSEA). The percent of models accepted was the dependent variable in a 4 × 4 × 3 × 3 ANOVA. Since for each cell, there is a single observation, the fourth-order interaction was used as the error term for significance tests. Table 1 also gives the η² of the effects. The following conclusions can be drawn from the table: (1) The effect of the interaction of the sample size with the number of indicators (N × NI) is even more pronounced for the percent of times the true model is accepted compared with the mean value of the fit index; this interaction is significant for all the indices. Note that in the case of the mean value of the fit index, this interaction was not significant for RMSEA, RNI, and TLI; (2) The Sample Size × Size of Loading (N × L) interaction is significant for RNI and TLI. This interaction was not present for mean values of the indices; (3) The main effects of sample size for all the indices are significant; (4) The main effects of the number of indicators are significant for NNCP, RMSEA, and goodness-of-fit index; and (5) The main effects of factor loadings are significant for RNI and TLI. To gain further insights into these effects, we present in Table 3 the percent of times that true models are accepted for the number of indicators for the above significant effects.
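The link between the means and standard deviations in Table 2 and the acceptance percentages in Table 3 can be sketched with a normal approximation. This is an illustrative assumption, not part of the study: the sampling distribution of an index need not be normal, and the function name `share_above` is hypothetical. The inputs are the NNCP means and standard deviations reported in Table 2 for N = 200.

```python
import math

def share_above(cutoff, mean, sd):
    """P(index > cutoff) if the index were normally distributed with this mean and SD."""
    z = (mean - cutoff) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# NNCP, N = 200 (Table 2): 24 indicators -> mean 0.97, SD 0.06; 32 indicators -> mean 0.93, SD 0.08.
approx_24 = 100.0 * share_above(0.90, 0.97, 0.06)
approx_32 = 100.0 * share_above(0.90, 0.93, 0.08)
```

The two approximations come out within a few percentage points of the corresponding Table 3 entries (88.3 and 66.6). The sketch also shows why a mean that is nearly free of sample-size effects does not guarantee a stable acceptance rate: whenever the mean sits close to the cutoff, the standard deviation, which shrinks as N grows, determines how much of the distribution clears it.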
It is clear from Table 3 that the behavior of goodness-of-fit index is the most aberrant, with substantial sample size effects when the number of indicators is large, and points to the need to reconsider its continued use in model evaluation. TLI and RNI are affected by sample size, and its effect depends on the number of indicators. The behavior of these two indices is extremely good for models with factor loadings of .5 or above and with sample sizes of 200 or above. For these models, the effect of sample size and number of indicators is practically nonexistent. For the NNCP, on the other hand, sample-size effects are dependent on the number of indicators. The effect of sample size and number of indicators appears to be the least for RMSEA. The findings so far suggest that the percent of models accepted (when an index is compared with a cutoff value) is affected by the interaction of sample size with the number of indicators. In addition, the RNI and TLI are affected by the two-way interaction of sample size and size of factor loadings; however, the effects are very small for models whose factor loadings are .5 or above. When used for evaluating model fit relative to some cutoff value, RMSEA emerges as the most promising candidate, and the RNI and
TLI for models with factor loadings of .5 or above and sample size of 200 or above. A related question, though, is whether similar patterns recur when evaluating the fit of misspecified models. That is, to what extent can these indices detect model misspecification? The next section addresses this point.

Table 3
Percent of times true models are accepted for a cut-off value of 0.90 (0.05 for RMSEA)

Number of indicators and sample size interaction

          8 indicators                    16 indicators                   24 indicators                   32 indicators
Index     100     200     400     800     100     200     400     800     100     200     400     800     100     200     400     800
NNCP      98.9    100.0   100.0   100.0   80.3    99.2    100.0   100.0   41.3    88.3    99.9    100.0   8.3     66.6    98.2    100.0
RMSEA     81.3    97.2    100.0   100.0   92.4    100.0   100.0   100.0   95.0    100.0   100.0   100.0   94.4    100.0   100.0   100.0
RNI       73.7    85.0    91.0    95.9    71.3    85.4    92.6    97.4    57.0    81.6    89.8    98.7    37.9    77.7    91.4    98.8
TLI       75.1    86.2    91.6    96.6    72.0    85.8    92.8    98.0    57.9    82.0    90.2    98.8    38.7    78.2    91.7    99.1
GFI       94.6    100.0   100.0   100.0   0.3     100.0   100.0   100.0   0.0     17.8    100.0   100.0   0.0     0.0     100.0   100.0

Sample size and size of factor loadings (λ) interaction

          λ = .30                         λ = .50                         λ = .70
Index     100     200     400     800     100     200     400     800     100     200     400     800
NNCP      56.9    89.3    99.5    100.0   58.3    88.3    99.5    100.0   56.4    87.9    99.6    100.0
RMSEA     90.8    99.1    100.0   100.0   91.3    99.2    100.0   100.0   90.3    99.7    100.0   100.0
RNI       25.4    52.3    73.8    93.1    58.0    94.9    99.8    100.0   96.3    100.0   100.0   100.0
TLI       26.2    53.1    74.8    94.3    59.7    96.1    99.9    100.0   96.9    100.0   100.0   100.0
GFI       23.7    54.3    100.0   100.0   23.8    54.4    100.0   100.0   23.7    54.7    100.0   100.0

Column headings under each number of indicators and each factor loading are sample sizes.

5.2. Misspecified models
As indicated earlier, the degree of misspecification was operationalized by not estimating the factor correlations, and numerous combinations of degree of misspecification were tried. Table 4 gives the estimated value of the fitting or the discrepancy function, fk(uˆ k), for each of the misspecified models when they were fitted to their corresponding population covariance matrices. Because the estimated value of the fitting function is devoid of sampling errors, it will be equal to zero for a correctly specified model and greater than zero for misspecified models. Consequently, the value of the fitting function measures the degree of misspecification.

Table 4
Discrepancy function [fk(uˆ k)] values for misspecified models

Factor          Factor              Number of indicators
loadings (λ)    correlations (φ)    8             16              24              32
.3              .3                  0.719 (1)     3.900 (2)       8.916 (4)       15.352 (5)
.5              .5                  8.431 (3)     38.647 (7)      78.711 (8)      124.413 (9)
.7              .7                  36.526 (6)    143.694 (10)    269.738 (11)    404.738 (12)

Numbers in parentheses represent degree of misspecification, which ranges from very low (1) to very high (12).

As can be seen from Table 4, the degree of misspecification is
confounded with the number of indicators, in that the degree of misspecification increases with an increase in the number of indicators. Essentially, there are 12 levels of misspecification, as indicated by the numbers in parentheses. These 12 levels of misspecification, which range from very low (1) to very high (12), are used to present and discuss the results. For each of the 48 cells defined by degree of misspecification (12 levels) and sample size (4 levels), the percent of models accepted using a given cutoff value of 0.90 (0.05 for RMSEA) was computed for the misspecified model. Since there are only two factors (sample size and degree of misspecifications) and only 48 cells, we simply present the percent of models accepted for each cell. Table 5 gives the percent of models accepted for each cell. As can be seen from the table, all the fit indices fail to reject a substantial number of models when the degree of misspecification is less than five (less than moderate to very low levels of misspecification), which essentially corresponds to factor loadings of .3 and factor correlations of .3. For these cells, the performance of NNCP, RMSEA, and goodness-of-fit index is quite erratic and, in some cases, these fit indices do not reject any models. That is, these indices are not sensitive enough to detect less than moderate to low levels of misspecification in the models. What is also interesting to note is that these indices tend to accept more models as the sample size increases, which is not very surprising. Note that the previous results for the mean values of the indices suggested that all the indices were sensitive to sample size. Essentially, the mean values of an index for smaller samples were less than the mean values of the index for larger sample sizes. That is, for a given prespecified cutoff value of the index, the number of models accepted
Table 5
Percent of misspecified models accepted

Sample   Fit      Degree of misspecification (a)
size     index    1        2        3        4        5        6        7        8        9        10       11       12
100      NNCP     100.00   77.00    86.00    37.00    5.00     9.00     16.00    0.00     0.00     0.00     0.00     0.00
         RMSEA    78.00    83.00    42.00    80.00    84.00    0.00     25.00    3.00     5.00     0.00     0.00     0.00
         RNI      50.00    22.00    47.00    18.00    3.00     11.00    4.00     0.00     0.00     0.00     0.00     0.00
         TLI      44.00    22.00    38.00    17.00    3.00     2.00     2.00     0.00     0.00     0.00     0.00     0.00
         GFI      100.00   9.00     97.00    0.00     0.00     38.00    0.00     0.00     0.00     0.00     0.00     0.00
200      NNCP     100.00   95.00    99.00    70.00    42.00    4.00     4.00     0.00     0.00     0.00     0.00     0.00
         RMSEA    84.00    97.00    24.00    100.00   100.00   0.00     10.00    16.00    24.00    0.00     0.00     0.00
         RNI      53.00    31.00    32.00    18.00    6.00     8.00     2.00     0.00     0.00     0.00     0.00     0.00
         TLI      52.00    28.00    17.00    18.00    6.00     0.00     2.00     0.00     0.00     0.00     0.00     0.00
         GFI      100.00   100.00   100.00   46.00    0.00     85.00    23.00    0.00     0.00     0.00     0.00     0.00
400      NNCP     100.00   100.00   100.00   90.00    54.00    0.00     1.00     0.00     0.00     0.00     0.00     0.00
         RMSEA    98.00    100.00   19.00    100.00   100.00   0.00     6.00     6.00     26.00    0.00     0.00     0.00
         RNI      65.00    29.00    38.00    12.00    2.00     2.00     0.00     0.00     0.00     0.00     0.00     0.00
         TLI      52.00    26.00    13.00    12.00    2.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
         GFI      100.00   100.00   100.00   100.00   100.00   99.00    92.00    0.00     0.00     0.00     0.00     0.00
800      NNCP     100.00   100.00   100.00   100.00   82.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00
         RMSEA    100.00   100.00   4.00     100.00   100.00   0.00     0.00     3.00     25.00    0.00     0.00     0.00
         RNI      64.00    23.00    13.00    3.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
         TLI      55.00    17.00    2.00     2.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
         GFI      100.00   100.00   100.00   100.00   100.00   100.00   100.00   10.00    0.00     0.00     0.00     0.00

(a) The degree of misspecification ranges from very low (1) to very high (12).
should increase as the sample size increases. The performance of TLI and RNI is better than NNCP, RMSEA, and goodness-of-fit index. For all the other degrees of misspecifications (i.e., degree of misspecifications greater than five, which correspond to moderate to very high levels of misspecifications), the performance of RNI and TLI is excellent in that the percent of models for all cells, except one, is less than 5%. The performance of NNCP is also good but not as good as TLI and RNI. The performance of RMSEA is not as good as that of NNCP, TLI, and RNI. The performance of goodness-offit index is the worst, once again, suggesting that its use to evaluate model fit should be reevaluated.
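The tally behind each cell of Table 5 can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `percent_accepted`, the inclusive comparisons (accept when the index is at or above 0.90, or at or below 0.05 for RMSEA), and the sample values are all assumptions made for the example.

```python
from typing import Sequence


def percent_accepted(values: Sequence[float], cutoff: float,
                     higher_is_better: bool = True) -> float:
    """Percent of replications in one cell whose fit index passes the cutoff.

    Assumption: the comparison is inclusive (>= for TLI, RNI, NNCP, GFI;
    <= for RMSEA). The paper does not state which form was used.
    """
    if len(values) == 0:
        raise ValueError("at least one replication is required")
    if higher_is_better:
        accepted = sum(1 for v in values if v >= cutoff)
    else:
        accepted = sum(1 for v in values if v <= cutoff)
    return 100.0 * accepted / len(values)


# Hypothetical index values for four replications in a single cell.
tli_values = [0.95, 0.88, 0.91, 0.79]
rmsea_values = [0.03, 0.06, 0.04, 0.10]

# Two of the four hypothetical replications pass in each case.
print(percent_accepted(tli_values, cutoff=0.90))
print(percent_accepted(rmsea_values, cutoff=0.05, higher_is_better=False))
```

Because the cutoff is fixed while the sampling distribution of each index shifts with sample size, the same function applied to larger samples would be expected to return higher acceptance percentages, which is the pattern discussed above.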
6. Discussion and conclusions

The findings of this study have important implications for users of covariance structure models. First, all the goodness-of-fit indices included in the study are affected to varying degrees by variations in the models with respect to sample size and number of indicators. In addition, the magnitudes of covariances affected TLI and RNI. Relative to the other indices, TLI and RNI perform the best, followed by NNCP and RMSEA. The goodness-of-fit index (GFI) shows the most adverse effects, which led Hu and Bentler (1998, 1999) to recommend against its use. The results of our study and the ensuing recommendations are summarized in Table 6 and briefly discussed below.

Table 6
Summary and recommendations

Index: Goodness-of-fit index (GFI)
1. The mean value of the index is substantially affected by sample size; that is, its mean value decreases as sample size decreases. However, the effect of sample size is contingent on the number of indicators.
2. The percent of times the true model is rejected (Type I error) increases substantially as the sample size increases, but decreases as the number of indicators increases.
3. The index is not very sensitive to detecting misspecified models.
4. Recommendation: This index should not be used.

Index: RNI and TLI
1. The indices are sensitive to sample size; that is, their mean values decrease as sample size decreases. However, the effect of sample size depends on the number of indicators (i.e., model size).
2. The sample size by number of indicators interaction is not significant due to large variations in the index for small sample sizes. This is to be expected, as these indices are not normed (i.e., do not lie between 0 and 1) in the sample, resulting in outliers (values above 1 and below 0). The outliers were mostly for small sample size (i.e., 100) and small factor loadings (i.e., .30). The effect became significant when the outliers were deleted but is not as severe as that for GFI and NNCP.
3. The percent of times the true models (whose factor loadings are .50 or greater and sample sizes are 200 or greater) are rejected is low and appears to be independent of sample size and number of indicators.
4. Compared to RMSEA, GFI, and NNCP, RNI and TLI are more sensitive to the degree of misspecification. The percent of times misspecified models are accepted is less than 6% for models whose factor loadings are .50 or greater.
5. Recommendation: Performance of RNI and TLI is the best among the set of indices examined, and they are the recommended indices for evaluating model fit when the factor loadings are reasonably large (.5 or above).

Index: NNCP
1. The mean value of the index is affected by sample size; that is, its mean value decreases as sample size decreases. However, the effect of sample size is contingent on the number of indicators.
2. The percent of times the true model is rejected (Type I error) is affected by sample size, and this effect increases as the number of indicators increases.
3. Compared with RMSEA and GFI, NNCP is quite sensitive to the degree of misspecification. The percent of times misspecified models are accepted is less than 6% for models whose factor loadings are .50 or greater.
4. Recommendation: Use of this index is recommended in conjunction with RNI and TLI.

Index: RMSEA
1. The index is affected by sample size; that is, its mean value increases for smaller sample sizes. The effect of sample size is independent of the number of indicators.
2. The percent of times the true model is accepted is high for sample sizes of 200 or above.
3. The RMSEA is more sensitive than GFI and less sensitive than NNCP, RNI, and TLI. The percent of times misspecified models are accepted is quite low for higher degrees of misspecification. In this respect, this index performs better than GFI but not as well as NNCP, RNI, and TLI.
4. Recommendation: Performance of RMSEA is reasonable and better than that of GFI but not as good as that of TLI and RNI. However, since this index is not affected by the size of factor loadings, it is recommended that RMSEA be used in conjunction with NNCP, TLI, and RNI.

First, the performance of GFI is the worst, both with respect to how it is affected by sample size and number of indicators and with respect to detecting model misspecification. It is suggested that this index should not be used to evaluate model fit. Compared with the other indices, RNI and TLI perform the best as long as the size of the factor loadings is .5 or greater and the sample size is not less than 200. Overall, for those preferring to use prespecified cutoff values, it is recommended that RNI and TLI be used to evaluate model fit. The performance of NNCP and RMSEA is not as good as that of TLI and RNI. Since RMSEA is not affected by the size of factor loadings and since NNCP performs reasonably well, we recommend their use in conjunction with TLI and RNI. However, an alternative course of action is to introduce some flexibility into the model evaluation procedure by allowing cutoff values to vary somewhat with the modeling context. For example, in small samples, a more reasonable cutoff value for RMSEA would be 0.07 to 0.08 (cf. Browne and Cudeck, 1993); for smaller sample sizes and larger models, a cutoff value of less than 0.90 for TLI, RNI, and NNCP should be used.

Second, replication studies using different sample sizes may lead to different conclusions if model fit is evaluated by comparing the fit index to a prespecified cutoff value, at least when the sample sizes fall below some threshold. In addition, studies assessing the same model, but with different numbers of indicators, might reach different conclusions.

Third, the results suggest that as the number of indicators increases, a larger sample size is needed before the index becomes insensitive to sample size, suggesting that researchers need a larger sample size as the number of indicators in the model increases. Alternatively, for data sets with a large number of indicators (i.e., more than 24) and smaller sample sizes (around 200), it becomes necessary to use more liberal cutoff values for normed indices (e.g., 0.80) to ensure that frequencies of model acceptance/rejection remain approximately similar. For example, 86.2% of true models were accepted when TLI was used for assessing model fit with a cutoff value of 0.90, a sample size of 200, and eight indicators. To achieve the same 86.2% acceptance rate for a sample size of 200 and 32 indicators would require a cutoff value of 0.82. Once again, this makes a strong case for adjusting the fit indices for the effects of sample size and number of indicators before comparing them with arbitrary cutoff values. We feel that this issue presents further research opportunities for investigating the nature of the adjustments needed for various fit indices to account for the effects of model parameters.

Finally, we would like to acknowledge some of the limitations of this study. The total number of indicators in the model was manipulated by keeping the number of indicators per factor constant at four and increasing the number of factors in the model. Whether the results of this study also hold when the total number of indicators is manipulated by keeping the number of factors constant and varying the number of indicators per factor cannot be inferred from this study. However, we do not expect the findings to be different, as the underlying issue is the number of indicators and not the number of factors or the number of indicators per factor. The study is also limited by the heuristic used to obtain misspecified models. Among the countless misspecified models that could be generated, we systematically selected only 12, which ranged from the least misspecified to the most misspecified.
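The cutoff adjustment illustrated by the 86.2% example amounts to reading a cutoff off the empirical distribution of an index across replications of the true model. The sketch below shows one way to do that for indices where larger values indicate better fit. The function name, the inclusive acceptance rule, and the five index values are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence


def cutoff_for_acceptance(values: Sequence[float],
                          target_percent: float) -> float:
    """Largest cutoff c such that at least target_percent of the
    replications satisfy value >= c (assumed acceptance rule)."""
    if len(values) == 0:
        raise ValueError("at least one replication is required")
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    ordered = sorted(values, reverse=True)
    # Smallest number of replications that must be accepted to reach the target.
    k = math.ceil(target_percent * len(ordered) / 100)
    # The k-th largest value is the strictest cutoff that still accepts k of them.
    return ordered[k - 1]


# Hypothetical TLI values from five replications of a true model.
tli_values = [0.97, 0.93, 0.91, 0.86, 0.80]

# Accepting 80% of these replications means accepting four of the five,
# so the cutoff is the fourth-largest value.
print(cutoff_for_acceptance(tli_values, target_percent=80))
```

With real simulation output, the same calculation would presumably be run separately for each combination of sample size and number of indicators, which is how context-dependent cutoffs such as 0.90 in one condition and 0.82 in another would arise.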
References

Anderson JC, Gerbing DW. The effect of sampling error on convergence, improper solutions, and goodness-of-fit indices for maximum likelihood confirmatory factor analysis. Psychometrika 1984;49(June):155–73.
Bearden WO, Sharma S, Teel JR. Sample size effects on chi-square and other statistics used in evaluating causal models. J Market Res 1982;19(November):425–30.
Bentler PM. Comparative fit indexes in structural models. Psychol Bull 1990;107(March):238–46.
Bentler PM, Bonett DG. Significance tests and goodness of fit in the analysis of covariance structures. Psychol Bull 1980;88(November):588–606.
Browne MW, Cudeck R. Alternative ways of assessing model fit. In: Bollen KA, Long JS, editors. Testing structural equation models. Newbury Park (CA): Sage Publications; 1993. p. 136–62.
Hu L, Bentler PM. Fit indices in covariance structure modeling: sensitivity to underparameterized model misspecification. Psychol Methods 1998;3(December):424–53.
Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Modeling 1999;6:1–55.
IMSL Library. IMSL Edition 8.0. Houston (TX): Visual Numerics.
Jöreskog KG, Sörbom D. Recent developments in structural equation modeling. J Market Res 1982;19(November):404–16.
Marsh HW, Balla JR, McDonald RP. Goodness-of-fit indices in confirmatory factor analysis: the effect of sample size. Psychol Bull 1988;103(May):391–410.
McDonald RP. An index of goodness-of-fit based on noncentrality. J Classif 1989;6(March):97–103.
McDonald RP, Marsh HW. Choosing a multivariate model: noncentrality and goodness of fit. Psychol Bull 1990;107(March):247–55.
Sharma S, Durvasula S, Dillon WR. Some results on the behavior of alternate covariance structure estimation procedures in the presence of nonnormal data. J Market Res 1989;26(May):214–21.
Steiger JH. Structural model evaluation and modification: an interval estimation approach. Multivariate Behav Res 1990;25:173–80.
Steiger JH, Lind JC. Statistically based tests for the number of common factors. Paper presented at the Annual Meeting of the Psychometric Society, Iowa City, IA; 1980.
Tucker LR, Lewis C. A reliability coefficient for maximum likelihood factor analysis. Psychometrika 1973;38(March):1–10.