Complexity, Metacognition, and Fluid Intelligence Lazar Stankov
The University of Sydney, Sydney, Australia
Of the five tests of fluid intelligence employed in this study, two (Swaps and Triplet Numbers) were designed to investigate increases in complexity and difficulty. This was accomplished by manipulating the number of steps needed to reach a solution. An increase in task difficulty is reflected in changes in the overall performance levels, captured by the arithmetic means. An increase in task complexity is reflected in an increase in correlation with measures of fluid intelligence, or in an increase in factor loadings on a fluid intelligence factor. Both tendencies are present in the results of this study. A metacognitive process of self-confidence was assessed by asking participants to indicate how confident they were that the item they had just answered was correctly solved. A metacognitive process of self-evaluation was assessed by having participants estimate the number of correctly solved items at the end of each test. The analyses of overall performance also indicate that an "easy/difficult" distinction provides a reasonable account of the calibration data that show over- and underconfidence. Exploratory and confirmatory analyses indicate the presence of a relatively strong self-confidence factor. Confirmatory analysis also indicates the presence of a self-evaluation factor.
The aims of this study are twofold: the examination of the role of cognitive complexity in fluid intelligence, and the identification of metacognitive factors in test performance. Cognitive complexity refers to the amount of common variance that a given test shares with a broadly defined cognitive ability (i.e., fluid intelligence). The expectation is that more complex tasks will have higher loadings on a broad factor. Two types of metacognitive processes are of interest: self-confidence judgments (captured by measures of subjective probability that the answer provided to a test item is correct) and self-evaluation judgments (measured by a post-test estimate of the number of correctly solved items). Both self-confidence and self-evaluation can be compared to the actual performance measures (i.e., percentage correct scores), yielding an assessment of the accuracy of these two types of judgment. Three outcomes are possible with respect to self-confidence: good calibration (i.e., close correspondence between judgment and performance) or two subdivisions of poor calibration that can be conveniently classified as over- or underconfidence.
Direct all correspondence to: Dr. Lazar Stankov, Department of Psychology, The University of Sydney, Sydney, NSW 2006, Australia. E-mail: [email protected]
Self-evaluation can also provide three outcomes: good self-evaluation, over-evaluation, and under-evaluation. An issue of theoretical importance is whether self-confidence and self-evaluation are different at an empirical level. This can be resolved using two kinds of evidence: first, whether different degrees of realism are obtained from the measures of evaluation and confidence at the overall level of analysis; second, whether factor analysis can identify two separate factors corresponding to self-confidence and self-evaluation.

Another issue of theoretical importance suggests a link between the notion of complexity and the metacognitive process tapped by self-confidence judgments. In particular, an important current debate in studies of probabilistic decision making concerns the presumed causes of miscalibration of confidence judgments. One view holds that difficult tasks tend to show overconfidence and easy tasks tend to show good calibration or even underconfidence. This is the "easy/difficult" position. The other view holds that the main distinction is not in terms of task difficulty but rather with respect to the nature of the task. The claim is that perceptual tasks tend to show underconfidence and general knowledge tasks tend to show overconfidence. This is the "different processes" position. The present study is not designed to compare these two positions directly, since it employs only fluid intelligence tests rather than perceptual or general knowledge tasks. It will, however, provide a check of the related assumption that there is no change in the nature (i.e., factorial structure) of the task as one moves from easy to difficult versions of the same test. This is relevant to both positions. If easy versions of the tasks turn out not to measure the same process as the difficult versions, the "different processes" position will be strengthened. If they all measure the same process, and there is a systematic reduction in overconfidence as the tasks become easy, the "easy/difficult" position will gain support.

The Roots

The present approach derives from the psychometric tradition in psychology. Contemporary representatives of this tradition, such as Carroll (1993), Cattell (1971), Horn (1997), Jensen (1998), and Roberts and Goff (1997), continue the work of many predecessors, with Spearman (1927) and Thurstone (1938) having the greatest influence. Our own work over the past decade deals with both methodological and substantive issues in the study of complexity (see Fogarty & Stankov, 1988; Myors, Stankov, & Oliphant, 1989; Raykov & Stankov, 1993; Roberts, Beh, & Stankov, 1988; Spilsbury, Stankov, & Roberts, 1990; Stankov, 1988a,b, 1989, 1994; Stankov & Crawford, 1993; Stankov & Cregan, 1993; Stankov & Myors, 1990; Stankov & Raykov, 1995; Stankov, Fogarty, & Watt, 1989; Stankov, Boyle, & Cattell, 1995) and metacognition (Crawford & Stankov, 1996, 1997; Kleitman & Stankov, in press; Stankov, 1998a,b; Stankov & Crawford, 1996a,b, 1997). Each term in the title of this article has a rich background that needs to be spelled out in more detail.

Intelligence

Intelligence here refers to fluid intelligence (Gf) or fluid reasoning ability. "This ability is measured in tasks requiring inductive, deductive, conjunctive, and disjunctive reasoning to arrive at understanding relations among stimuli, comprehend implications, and draw conclusions" (Horn, 1997).
The five distinct tests employed in this study are known markers of Gf. I did not include measures of any other factors from the theory of fluid and crystallized intelligence, since the notion of complexity is primarily of direct relevance to Gf (see Stankov, 1994; Stankov & Crawford, 1993; Stankov & Cregan, 1993; Stankov & Raykov, 1995). Two of the five tests of Gf included in this study contain four levels of experimental manipulation and are thus useful adjuncts when examining the concept of complexity.

Cognitive Complexity

This concept is based on the following two main assumptions that will be examined in this article (see Stankov, 1994; Stankov & Raykov, 1995):

1. Cognitive complexity is directly related to a test's loading on a broad cognitive (e.g., fluid intelligence) factor.
2. It is possible to manipulate task demands in a way that affects tests' loadings, and therefore to evaluate empirically the effectiveness of such manipulations.
This approach deliberately avoids a definition of complexity in terms of preconceived notions that derive from other sciences (e.g., logic, computer science). This is because broad psychometric factors encapsulate a number of different processes, as can be seen in the above definition of Gf, and experimental manipulations that show systematic increases in loadings on that factor are more complex in the sense that they tap a wider range of elementary cognitive processes. It is thus possible to make a clear distinction between complexity, as defined in point 1 above, and task difficulty. Difficulty, in the present context, refers to the overall performance on the test by a group of people, and it is captured by the arithmetic mean. Therefore, tests may not differ in difficulty yet simultaneously show an increase in complexity. Conversely, of course, psychological tests may differ in difficulty while showing no change in complexity (see Spilsbury et al., 1990).

This approach advocates the use of empirical procedures to study the nature of tasks that are complex (and difficult) for humans. The emphasis here, of course, is on individual differences. It is therefore important to keep in mind that the proposed manipulations imply that the increased complexity of a task will lead to a systematic increase in the difference between high Gf-scoring individuals and low Gf-scoring individuals. Equivalently, an increase in task complexity leads to higher correlations with other tests of intelligence. Thus, the features that are embodied in such experimental manipulations are relevant for the understanding of fluid intelligence. In principle, the experimental manipulation may be anything whatsoever, since the empirical evidence is assumed to be the final arbiter. Traditionally, however, the main focus has been on the study of processes that embody the notion of capacity, that is, processing (or "attentional") resources and/or working memory. Although both these constructs have been discussed in the literature on intelligence at least since the late 1970s (see Hunt, 1980) and early 1980s (see Stankov, 1983a,b), recent work has emphasized the role of working memory (see Kyllonen & Christal, 1990).
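To make the complexity/difficulty distinction concrete, the following Python sketch (the data, loadings, and noise values are invented for illustration; only the sample size echoes this study) simulates two task levels with the same difficulty (arithmetic mean) but different complexity (correlation with a Gf score):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 339                               # sample size, as in the present study
    gf = rng.normal(size=n)               # standardized latent Gf scores

    # Two hypothetical levels with the SAME difficulty (mean of about 75%
    # correct) but DIFFERENT complexity: level_b draws more heavily on Gf.
    level_a = 75 + 5 * (0.3 * gf + rng.normal(scale=0.95, size=n))
    level_b = 75 + 5 * (0.7 * gf + rng.normal(scale=0.71, size=n))

    for name, score in (("level_a", level_a), ("level_b", level_b)):
        print(name,
              "difficulty (mean): %.1f" % score.mean(),
              "complexity (r with Gf): %.2f" % np.corrcoef(score, gf)[0, 1])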
However, with respect to the processing resources construct (and some measures of working memory), empirical tests of a link with intelligence have been equivocal (see Stankov et al., 1995). For this reason, Stankov et al. (1995) argued that the need to link intelligence with experimental cognitive psychology's theories of capacity is perhaps overstated, and that the traditional concept of complexity in relation to intelligence provides a sufficient theoretical account of the available empirical evidence.

The two experimental tasks of the present study, the Swaps and Triplet Numbers Tests with four levels each, represent manipulations of complexity, since they show the desired increase in correlation with Gf (see Stankov, 1994; Stankov & Raykov, 1995). Schweizer (1996) also reports findings with the "Exchange" task (similar to the Swaps task of the present study) that are in essential agreement with our previous results. Theoretical interpretations of the observed effects of the experimental manipulations using these two tasks have been in terms of both increases in capacity demands and lapses of attention. I shall return to this issue in the last section of the present article.

In summary, the aim is to find out whether a broad fluid intelligence factor can be identified from the accuracy scores on the 11 measures of fluid intelligence employed in the present study. These 11 variables can be divided into three traditional tests of fluid intelligence (i.e., Counting Letters Test, Letter Series Test, and Raven's Progressive Matrices Test), four levels of the Swaps Test, and four levels of the Triplet Numbers Test. The four levels of the latter two tests are ordered with respect to cognitive complexity. Consequently, it is expected that the loadings of these four levels on the fluid intelligence factor will increase in magnitude.

Metacognition

Our research into metacognitive processes derives from a tradition in cognitive psychology that seeks a link between cognitive performance and self-confidence. In essence, the question is: how good are we at monitoring our own activities and at evaluating our success on a just-finished task? Cognitive test items are convenient instruments for studying this aspect of metacognition. The original impetus for our research in this area came from studies within experimental psychology that suggested a difference in self-monitoring between "general knowledge" and "perceptual" tasks. In particular, the relationship between accuracy scores and confidence ratings is markedly different for these two types of tasks. For general knowledge tasks, people tend to be overconfident; that is, they frequently act as if they believe that their answer is correct even though the answer sometimes turns out to be wrong. For perceptual tasks, on the other hand, people tend to be underconfident; that is, they are not quite sure whether a correct answer is really correct (Juslin & Olsson, 1997). Our own early work has supported this distinction. Thus, the Vocabulary Test and the general knowledge Geography Test tend to show overconfidence, while the Line Length Test shows underconfidence (see Kleitman & Stankov, in press; Stankov, 1998a,b; Stankov & Crawford, 1996a,b, 1997). Juslin and Olsson (1997) interpret these data as an indication that different types of tasks tap different cognitive processes. While the overall performance in our studies supported the distinction made by Juslin and Olsson (1997), in all our work to date the correlation between confidence rating scores from perceptual and general knowledge tasks was high, indicating that a separate self-confidence trait was being tapped.1
This latter finding, in fact, is in agreement with an alternative interpretation of overall performance in self-monitoring, one that postulates the existence of one rather than two distinct processes for perceptual and general knowledge tasks (see Baranski & Petrusic, 1994; Ferrell, 1995). The underlying psychological effect, according to this alternative view, is related to variations in task difficulty rather than to differences in the psychological processes captured by the tasks. Thus, easy tasks are characterized by good calibration (or occasional underconfidence), difficult tasks by overconfidence, and the nature of the tasks in question is irrelevant. The "easy/difficult" interpretation has received further support in a recent study that employed a series of eight perceptual tasks from different modalities (Stankov, in press).

The complexity manipulations with the Swaps and Triplet Numbers Tests of this battery permit an examination of issues relevant to the "easy/difficult" versus "different processes" comparison. In particular, since manipulations of the same task (either Swaps or Triplet Numbers) also involve a reduction (or increase) in difficulty, the "easy/difficult" explanation would predict overconfidence for the most difficult versions of the tasks and good calibration for the easier ones. This interpretation would be strengthened, and the alternative (i.e., "different processes") interpretation weakened, if it can be shown that the complexity manipulation does not implicate different processes for easy and difficult levels of the task. In other words, if the same factors are tapped by all versions of the test and a gradual decrease in the amount of overconfidence is observed, the "easy/difficult" interpretation would appear more plausible than the "different processes" interpretation.

Another issue of interest to those studying metacognition is based on the derivation of two different scores reflecting distinct views about the nature of probability (see Crawford & Stankov, 1996, 1997). For all items in a test, the participants were asked to provide an answer and then indicate how sure they were that their answer was correct. The average of the confidence ratings over all items will be called the "self-confidence" score in this article, and it is interpreted as a measure of a metacognitive trait of self-monitoring. In some of our previous studies, we also asked for an estimate of the number of items correctly solved at the end of working on the whole test, the "post-test performance estimate" (PTPE). This estimate was obtained for the tests used in the present study (Stankov & Crawford, 1996a,b). I shall refer to it as the "self-evaluation" score. Gigerenzer, Hoffrage, and Kleinbolting (1991) postulate that self-confidence scores (or their derivatives) and self-evaluation scores should tap different processes. The reason for this theoretical distinction derives from differing views about the nature of probability. Confidence ratings given after each item belong to the class of "subjective" or Bayesian assessments: in such assessments a person attaches a probability to a single event (i.e., an answer to an item). Self-evaluation, on the other hand, is "frequentist" in nature, since the participant makes a judgment about the percentage of correctly answered items in the just-finished test.
It is assumed that these two judgments require distinct cognitive processes. One way to examine whether the same or different processes are utilized requires subtracting the percentage of correctly answered items (i.e., the traditional accuracy score) from the self-confidence score and also from the self-evaluation score. Either of these two difference scores can be positive (indicating overconfidence or over-evaluation) or negative (indicating underconfidence or under-evaluation).
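As a minimal worked example of these two difference scores (the numbers are hypothetical, not data from this study): a participant who answers 7 of 10 items correctly, holds an average item confidence of 81%, and estimates 6 correct items at the end of the test is simultaneously overconfident and under-evaluating:

    import numpy as np

    correct = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])          # item correctness
    conf = np.array([90, 80, 70, 100, 60, 80, 90, 70, 80, 90])  # item confidence (%)
    ptpe = 100 * 6 / 10       # post-test performance estimate, as a percentage

    accuracy = 100 * correct.mean()               # 70.0 (percent correct)
    self_confidence = conf.mean()                 # 81.0 (average confidence)

    confidence_bias = self_confidence - accuracy  # +11.0 -> overconfidence
    evaluation_bias = ptpe - accuracy             # -10.0 -> under-evaluation
    print(confidence_bias, evaluation_bias)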
Different processes would be implied if, for example, empirical data produce overconfidence for single items and under-evaluation for the entire test; otherwise, the same process is implied. Another way to examine whether the same processes are involved is through correlational evidence, by carrying out factor analysis of the self-confidence and self-evaluation scores together with the traditional accuracy and speed scores. These four kinds of scores are available for all five tests of this battery: 20 measures altogether for the examination of the status of metacognitive processes. The aim is to find out whether (a) self-confidence and self-evaluation scores define different factors; and (b) either a self-confidence factor or a self-evaluation factor (or both) is independent of the factors defined by the accuracy (i.e., percentage correct) and speed scores. In our previous work, a self-confidence factor has been identified, but evidence for the self-evaluation factor has been equivocal (Kleitman & Stankov, in press; Stankov & Crawford, 1996a,b, 1997).

In summary, the aims with respect to metacognitive processes are to: (a) examine whether different levels of the complexity manipulations display a systematic reduction in the size of overconfidence (if so, the "easy/difficult" account of the relationship between confidence ratings and accuracy would gain support); (b) establish whether overconfidence is accompanied by under-evaluation (if so, different types of judgmental processes are implied); and (c) establish whether separate self-confidence and self-evaluation factors exist independently of speed and accuracy factors.

Method

Participants

The participants in this study were 352 first-year psychology students at the Departments of Psychology of the Universities of Melbourne and Sydney. Thirteen participants who had missing values on at least one of the variables of interest to this article were excluded, and the analyses were based on N = 339.

Tests

All tests employed in this study have been used previously in our work. The tests were programmed in the Microsoft BASIC programming language and administered on Macintosh Classic computers. Instructions were presented on the computer screen, and participants gave their responses by typing on a standard keyboard. For all tests except the Triplet Numbers Test there was no time limit, so all participants could complete all items. For each test, participants were instructed to offer an answer to every test item, no matter how unsure they were that their answer was correct. They were then required to indicate how confident they were that each answer was correct. This was expressed as a percentage; participants were allowed to type any percentage they wished but, in fact, mostly used rounded numbers (i.e., 60%, 40%, etc.). For all tests, the lowest allowed confidence rating (i.e., "just guessing") was determined by the number of alternatives (e.g., for a five-alternative item, it was 20%). After completing each test, participants were asked to estimate the number of items they thought they had answered correctly. For each item the following variables were recorded: the correctness of the answer, the confidence rating, and the time (in milliseconds) taken to answer the item.
1. Counting Letters. (12 items) In this task, the person counts the number of times each of the three letters R, S, and T appears successively on the computer screen. The string length was always either 10 or 12 letters, with participants required to type their numerical response on the keyboard.
2. Letter Series. This is a 15-item version of the typical Thurstonian series completion test.
3. Raven's Progressive Matrices. (40 items) This is a computerized version of this well-known test of fluid intelligence. We employed 20 items from the "Standard" and 20 items from the "Advanced" forms of the Raven's Test.
Swaps Tests

The stimulus material for all versions of the Swaps Test consisted of a set of three letters (J, K, and L) presented simultaneously on the computer screen, though not necessarily in that order. The instructions were to mentally interchange, or "swap," the positions of two of the letters. The four versions of the task differed in the number of such instructions. There were four blocks of 12 items, each block with an equal number of swaps. For items consisting of two or more swaps, the participant had to keep track of the sequence of interchanges. These items were randomly mixed to form a 48-item test, so the participants did not know how many swaps would be required on any given trial. To preclude the possibility of memory for instructions limiting performance on this task, the required swap instructions were kept visible throughout the participants' work. The answer consisted of typing the three letters in the order resulting from all the swaps. Examples of the four increasingly involved sub-tasks, starting from the stimulus J K L, are as follows.

4. Swap1. "Swap 2 and 3." (Ans.: J L K.)
5. Swap2. "Swap 2 and 3," "Swap 1 and 3." (Ans.: K L J.)
6. Swap3. "Swap 2 and 3," "Swap 1 and 3," "Swap 1 and 2." (Ans.: L K J.)
7. Swap4. "Swap 2 and 3," "Swap 1 and 3," "Swap 1 and 2," "Swap 1 and 3." (Ans.: J K L.)
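The mental operation required by the Swaps items is easy to state precisely. A short Python sketch (the function name and representation are mine, not from the original test software, which was written in BASIC) reproduces the four example answers:

    def solve_swaps(letters, instructions):
        """Apply a sequence of 'swap positions i and j' instructions (1-based)."""
        s = list(letters)
        for i, j in instructions:
            s[i - 1], s[j - 1] = s[j - 1], s[i - 1]
        return "".join(s)

    # The four example items above, starting from the stimulus J K L:
    assert solve_swaps("JKL", [(2, 3)]) == "JLK"                          # Swap1
    assert solve_swaps("JKL", [(2, 3), (1, 3)]) == "KLJ"                  # Swap2
    assert solve_swaps("JKL", [(2, 3), (1, 3), (1, 2)]) == "LKJ"          # Swap3
    assert solve_swaps("JKL", [(2, 3), (1, 3), (1, 2), (1, 3)]) == "JKL"  # Swap4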
The Triplet Numbers Tests

The stimulus material for all versions of the Triplet Numbers Test employed in this study consisted of a randomly chosen set of three different digits presented simultaneously on the computer screen. These digits changed after each response. The four versions of this test differ with respect to the instructions given to the participants and the time limits.2 Instructions for the increasingly complex versions are as follows (a code sketch of the four decision rules appears below, after the caption of Fig. 1).

8. Triplet1. Press the "Yes" key if a particular number (e.g., 3) is present within the triplet. Otherwise, press the "No" key. Time: 2 min.
9. Triplet2. Press the "Yes" key if the second digit is the largest within the triplet. Otherwise, press the "No" key. Time: 3 min.
10. Triplet3. Press the "Yes" key if the second digit is the largest and the third digit is the smallest. Otherwise, press the "No" key. Time: 5 min.
11. Triplet4. Press the "Yes" key if the first digit is the largest and the second digit is the smallest, or if the third digit is the largest and the first digit is the smallest. Otherwise, press the "No" key. Time: 6 min.

Figure 1. Structural model for the examination of complexity effects. These effects should be reflected in the monotonically increasing loadings of SW1 to SW4 and TR1 to TR4 on the Gf factor. Table 2 provides the relevant solution. Note: Test uniquenesses (Es) are not indicated in the diagram.
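The four decision rules can likewise be written out as predicates over a triplet of distinct digits; the following sketch (function names and example triplets are mine) makes the stepwise growth in the number of elementary comparisons explicit:

    def triplet1(t, target=3):
        # "Yes" if a particular digit (e.g., 3) is present within the triplet
        return target in t

    def triplet2(t):
        # "Yes" if the second digit is the largest
        return t[1] == max(t)

    def triplet3(t):
        # "Yes" if the second digit is the largest AND the third is the smallest
        return t[1] == max(t) and t[2] == min(t)

    def triplet4(t):
        # "Yes" if (first largest and second smallest) or
        #          (third largest and first smallest)
        return ((t[0] == max(t) and t[1] == min(t)) or
                (t[2] == max(t) and t[0] == min(t)))

    print(triplet1((4, 3, 7)), triplet2((2, 9, 5)),
          triplet3((2, 9, 1)), triplet4((1, 5, 9)))   # True True True True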
Scoring

All accuracy (number-correct) scores in this study were converted to percentages by dividing the total score by the number of items in the test and multiplying by 100. Both the self-confidence score (i.e., the average confidence rating over all items in the test) and the self-evaluation score were also expressed as percentages. The speed score was the average time taken to answer the items of a particular test.
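A compact sketch of the scoring just described (the function and variable names are assumptions; the per-item records are those listed under Tests above):

    import numpy as np

    def score_test(correct, conf, times_ms, ptpe_items, n_items):
        """Derive the four scores for one test from its item-level records."""
        return {
            "accuracy": 100 * np.sum(correct) / n_items,    # percent correct
            "self_confidence": float(np.mean(conf)),        # mean confidence (%)
            "speed": float(np.mean(times_ms)) / 1000.0,     # mean time per item (s)
            "self_evaluation": 100 * ptpe_items / n_items,  # PTPE as a percentage
        }

    print(score_test(correct=[1, 0, 1], conf=[80, 60, 90],
                     times_ms=[5300, 7100, 6200], ptpe_items=2, n_items=3))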
Results

Complexity

A convenient method for addressing the issue of complexity is through the use of linear structural equation modeling procedures embedded in programs such as RAMONA (Browne & Mels, 1994) and LISREL (Joreskog & Sorbom, 1993a,b). The model I propose to test is depicted in Fig. 1 (see also Stankov & Raykov, 1995). In essence, it is assumed that the relationships among the 11 manifest variables of this study can be accounted for by three latent variables. These are the fluid intelligence factor (Gf, with loadings on all 11 variables), the SWAPS factor (with loadings on the four levels of the Swaps Test), and the TRIPLETS factor (with loadings on the four levels of the Triplet Numbers Test). A solution allowing for correlations among all three factors would be unidentifiable. It is therefore assumed that the Gf factor is not correlated with either the SWAPS or the TRIPLETS factor, but that the latter two are correlated with each other. The check on the effectiveness of the complexity manipulations consists in establishing whether there is a monotonic increase in the size of the loadings on the Gf factor as we move from Swap1 to Swap4 and from Triplet1 to Triplet4.
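The structure of the Fig. 1 model can be summarized as Sigma = Lambda Phi Lambda' + Theta. The numpy sketch below encodes that structure; the loading and uniqueness values are placeholders, not the fitted estimates (only the SWAPS-TRIPLETS correlation of 0.29, reported in Table 2 below, is taken from the results):

    import numpy as np

    # Rows: the 11 accuracy variables; columns: Gf, SWAPS, TRIPLETS.
    L = np.zeros((11, 3))
    L[:, 0] = 0.5          # every variable loads on Gf (placeholder values)
    L[3:7, 1] = 0.4        # SWAP1-SWAP4 load on SWAPS
    L[7:11, 2] = 0.3       # TRIPLET1-TRIPLET4 load on TRIPLETS

    Phi = np.eye(3)
    Phi[1, 2] = Phi[2, 1] = 0.29   # SWAPS-TRIPLETS correlation; Gf kept orthogonal

    Theta = np.diag(np.full(11, 0.5))   # test uniquenesses (the Es of Fig. 1)

    Sigma = L @ Phi @ L.T + Theta       # model-implied covariance matrix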
Table 1. Descriptive Statistics (Means, Standard Deviations, and Correlations) for the Study of Complexity

Variable                          1      2      3      4      5      6      7      8      9      10     11
1. Counting Letters              1.000
2. Letter Series                 0.337  1.000
3. Raven's Progressive Matrices  0.362  0.546  1.000
4. SWAP1                         0.142  0.271  0.248  1.000
5. SWAP2                         0.271  0.416  0.401  0.486  1.000
6. SWAP3                         0.302  0.433  0.429  0.498  0.634  1.000
7. SWAP4                         0.304  0.397  0.414  0.435  0.613  0.757  1.000
8. TRIPLET1                      0.076  0.017  0.082  0.271  0.139  0.128  0.148  1.000
9. TRIPLET2                      0.091  0.270  0.210  0.198  0.257  0.218  0.257  0.252  1.000
10. TRIPLET3                     0.152  0.196  0.185  0.263  0.272  0.237  0.160  0.263  0.333  1.000
11. TRIPLET4                     0.283  0.403  0.351  0.283  0.378  0.403  0.379  0.201  0.264  0.374  1.000
Means                            49.09  78.01  66.80  89.50  81.72  74.48  66.95  98.61  95.83  94.69  87.21
SDs                              24.42  13.02  19.01  17.28  20.53  24.52  27.30   4.17   7.81   8.90  14.80
The descriptive statistics for the study of complexity are presented in Table 1. It can be seen in this table that the arithmetic means for the four levels of the Swaps and Triplet Numbers Tests systematically decrease, indicating that these levels become increasingly difficult. Overall, the Swaps Test is more difficult than the Triplet Numbers Test; the most difficult level of the Triplet Numbers Test is almost equal in difficulty to the easiest version of the Swaps Test. It is also important to note that, as expected, the decrease in difficulty leads to performance at the ceiling level, particularly for the Triplet Numbers Test. This causes a reduction of variance. Stankov (1994) and Stankov and Raykov (1995) report the same findings with essentially the same test battery on a different sample of participants. This restriction in range may affect the size of covariances and correlations, and consequently the size of factor loadings, and there is no easy way to disentangle the effect of the restriction in range from a genuine increase in complexity. (For a critical stand on this issue, see Footnote 4.)

The correlation matrix of Table 1 was used as input for the RAMONA (Browne & Mels, 1994) analyses, and the corresponding covariance matrix was employed for the LISREL (Joreskog & Sorbom, 1993a,b) analyses. Both programs were used in order to check the invariance properties of the covariance-based and correlation-based solutions. The advantage of analyzing the correlation matrix rather than the covariance matrix is that the former allows for the examination of complexity in a uniform standardized metric: complexity is a feature of a task unrelated to the observed variables' variances, and the correlation matrix provides the uniform metric that is desirable in studies of complexity. In addition, the advantage of using RAMONA derives from the fact that it is known to produce accurate standard errors for correlation-based solutions. A further reason for employing the LISREL program in the present study is the possibility of using its adjunct, the PRELIS package, to correct for the restriction of range.

The first fitted model was the unconstrained model depicted in Fig. 1. This produced a set of satisfactory goodness-of-fit indices (90% confidence intervals follow the index where appropriate): chi-square = 61.79, df = 35, p = 0.003; root mean square error of approximation (RMSEA) = 0.048 (0.027, 0.067). Two modifications were suggested by this basic run: (a) the loading of Triplet1 on the Gf factor was not significantly different from zero; and (b) the loadings of a couple of adjacent levels of both the Triplets and Swaps Tests were very similar. It was therefore decided to impose the following three constraints on the Gf loadings: (a) Triplet1 = 0; (b) Swap3 = Swap4; and (c) Triplet2 = Triplet3. This solution also produced satisfactory goodness-of-fit indices: chi-square = 64.35, df = 38, p = 0.005; RMSEA = 0.045 (0.025, 0.064).3 This constrained solution is "nested" within the unconstrained solution and, since the difference in chi-square values (2.56 for a difference in df of 3) is not significant, the imposed constraints can be considered acceptable.

Fig. 1 displays all statistically significant parameters of the latter solution. It is apparent from Fig. 1 that the Gf factor is well defined: it has significant loadings from all but one variable (Triplet1). Also, there is a monotonically increasing relationship between the manipulations of task complexity and the loadings on the Gf factor for both the Swaps and Triplet Numbers Tests. The same trend is not apparent with the loadings on the SWAPS and TRIPLETS factors. Thus, the present findings replicate the results of Stankov (1994) and Stankov and Raykov (1995).
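The two fit summaries used above are easy to reproduce. The sketch below recomputes the RMSEA point estimate from the standard formula sqrt((chi-square - df) / (df (N - 1))) and carries out the nested-model chi-square difference test (scipy is assumed to be available):

    import numpy as np
    from scipy.stats import chi2

    def rmsea(chisq, df, n):
        """Point estimate of the root mean square error of approximation."""
        return np.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

    print(round(rmsea(61.79, 35, 339), 3))      # 0.048, as reported above

    # Constrained vs. unconstrained solution: difference in chi-square and df.
    d_chi, d_df = 64.35 - 61.79, 38 - 35
    print(round(1 - chi2.cdf(d_chi, d_df), 2))  # p ~ 0.46: constraints acceptable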
Table 2. Confirmatory Factor Analytic Solution (LISREL) Based on Covariances Calculated from Table 1

Variable                          Gf      SWAPS   TRIPLETS    Es
1. Counting Letters               11.66                       447.92
2. Letter Series                   9.19                        66.36
3. Raven's Progressive Matrices   13.30                       164.32
4. SWAP1                           7.54    9.42               301.40
5. SWAP2                          13.31   11.42               253.42
6. SWAP3                          16.25   18.07               162.28
7. SWAP4                          16.74   18.48               260.94
8. TRIPLET1                        0.40            2.32        19.30
9. TRIPLET2                        2.26            3.47        49.85
10. TRIPLET3                       2.47            4.81        43.36
11. TRIPLET4                       7.90            5.39       129.61

Correlation between SWAPS and TRIPLETS factors = 0.29.
LISREL was also used with the present data to carry out analyses based on covariances, similar to those reported by Stankov and Raykov (1995). These analyses were preceded by the use of the "censored-from-above" option in PRELIS (Joreskog & Sorbom, 1993a), which allows one to estimate the covariance matrix after accounting for the fact that the data are censored (i.e., restricted in range due to the ceiling effects). The covariance-based LISREL analysis produced a set of satisfactory goodness-of-fit indices: chi-square = 61.54, df = 35, p = 0.0037; RMSEA = 0.047 (0.027, 0.067). Like the first RAMONA solution mentioned above (note the similarity between the goodness-of-fit indices), the LISREL solution presented in Table 2 does not contain additional constraints. The results are essentially the same as those produced by the correlation-based RAMONA solution, supporting the invariance of the solution presented in Fig. 1. There are, however, a couple of differences between the solutions that are worth noting. First, there is a clear increase in the size of the factor loadings for the four levels of the Swaps and Triplet Numbers Tests; it will be recalled that the RAMONA solution presented in Fig. 1 contains two sets of equality constraints. Second, unlike the previous RAMONA solution, there is an increase in the size of the factor loadings on the SWAPS and TRIPLETS factors.4

Metacognition

The Overall Levels of Performance

Two substantive issues are of interest: (i) Is the departure from perfect calibration (i.e., the overconfidence/underconfidence bias) a function of task difficulty (Baranski & Petrusic, 1994; Ferrell, 1995), or is it likely to depend on the nature of the task, as claimed by Juslin and Olsson (1997)? and (ii) Is self-confidence different from self-evaluation, as Gigerenzer et al. (1991) assumed? These two questions can be answered with reference to Table 3, which contains arithmetic means for the percentage correct scores, average confidence rating scores, and self-evaluation scores. (Note: Self-evaluation scores were not available for the levels of the Swaps Test because of the mixed block structure of this test.) The first question can be answered by examining the differences between the average confidence ratings and the percentage correct scores. These are presented in the next-to-last column of Table 3.
Table 3. Arithmetic Means for Scores on Different Levels of Swaps and Triplet Numbers Tests

                                 Average     Average     Estimated Percentage  Difference    Difference
Variable                         % Correct   Confidence  Correct (Eval.)       Conf. - Corr. Eval. - Corr.
1. Counting Letters              49.61       69.27       46.08                 19.66         -3.53
2. Letter Series                 78.35       83.78       70.56                  5.43         -7.79
3. Raven's Progressive Matrices  67.31       76.65       64.55                  9.34         -2.76
SWAP1                            89.50       92.92                              3.42
SWAP2                            81.72       89.57                              7.85
SWAP3                            74.48       84.71                             10.33
SWAP4                            66.95       79.54                             12.59
TRIPLET1                         98.61       98.28       96.26                 -0.33         -2.35
TRIPLET2                         95.83       97.09       93.18                  1.26         -2.65
TRIPLET3                         94.69       96.26       91.22                  1.57         -3.47
TRIPLET4                         87.21       92.86       85.54                  5.65         -1.67

Note: The last two columns present differences between percent correct scores and the other two types of scores.
A positive sign of the difference indicates overconfidence. In our work up to now, we have interpreted a difference of up to 10 percent as mild under- or overconfidence; anything larger than 10 percent represents a pronounced bias. As mentioned in the Introduction, Juslin and Olsson (1997) claim that, in typical calibration curve research, perceptual tasks tend to show underconfidence and general knowledge tasks tend to show overconfidence. Strictly speaking, the tasks employed in this study, fluid intelligence tasks, belong to neither category. Much of our own work has used only one fluid intelligence task (the Raven's Progressive Matrices Test). A consistent finding with this test (see Kleitman & Stankov, in press; Stankov, 1998a,b; Stankov & Crawford, 1996a,b, 1997) is a slight overconfidence but, in general, relatively good calibration. In the present study, the Letter Series Test shows reasonably good calibration, but the Raven's Test indicates somewhat stronger overconfidence than that observed in the past. Clearly, the Counting Letters Test shows very strong overconfidence. Since the sign of the observed difference remains positive (or negative but very close to zero) at all levels of the complexity manipulations, the Swaps and Triplet Numbers Tests also show overconfidence. However, on easy versions of these tasks people tend to behave in a well-calibrated way; they do not show underconfidence. The results are in accordance with the "easy/difficult" view. This position is further strengthened by the findings presented in the preceding section: the factor structure of the different levels of the Swaps and Triplet Numbers Tests is the same. Overall, the present data simply do not support the "different processes" position. Recently, Stankov (in press) found further support for the "easy/difficult" account with a battery of eight perceptual tests: apart from one visual perceptual test, most perceptual tests did not show the underconfidence predicted by Juslin and Olsson (1997).

The second question, whether self-confidence differs from self-evaluation, can be partly answered by comparing the values in the next-to-last column of Table 3 with those in the last column of the same table. For all tests, the difference between the self-evaluation of performance at the end of testing and the accuracy score is negative at all complexity levels.
Table 4. Arithmetic Means and Standard Deviations for the Variables Included in the Analyses of Metacognitive Performance

Measure                           Mean     SD
Level/Accuracy
1. Counting Letters               49.614   24.163
2. Letter Series                  78.351   12.280
3. Raven's Progressive Matrices   67.307   18.469
4. Swaps                          78.847   18.318
5. Triplets                       94.385    5.880
Speed of Test Taking (s)
6. Counting Letters                5.353    2.346
7. Letter Series                  19.792    7.047
8. Raven's Progressive Matrices   26.674   11.971
9. Swaps                          14.788    4.541
10. Triplets                       1.650    0.439
Confidence
11. Counting Letters              69.268   15.332
12. Letter Series                 83.779   11.159
13. Raven's Progressive Matrices  76.649   13.225
14. Swaps                         86.988   14.631
15. Triplets                      96.313    6.263
Evaluation
16. Counting Letters              46.080   24.829
17. Letter Series                 70.558   18.494
18. Raven's Progressive Matrices  64.546   21.153
19. Swaps                         73.296   22.339
20. Triplets                      91.869   11.213
This finding is in agreement with the results reported by Stankov and Crawford (1996a,b). Thus, people tend to under-evaluate their accuracy. Since, as we have seen above, they also tend to show overconfidence, the present findings support the claim of Gigerenzer et al. (1991) that the judgmental processes that generate statements about "subjective" probabilities are different from the processes that lead to "frequentist" statements. As mentioned earlier, correlational data, which are considered below, can also provide information relevant to the second question.

Correlational Data

In order to examine the evidence for the existence of separate metacognitive traits of self-confidence and self-evaluation, the four levels of complexity of the Swaps and Triplet Numbers Tests were combined (i.e., added and divided by four) to obtain single scores. In this way, each of the five distinct tests of the present battery provided four measures: a percent correct score, an average confidence rating score, an average time to answer an item, and a self-evaluation score. Table 4 presents means and standard deviations for all 20 variables, and Table 5 shows the correlations among them. As can be seen in Table 4, the lowest average score is for the Counting Letters Test and the highest average score is for the Triplet Numbers Test; the Swaps and Letter Series Tests are of about the same level of difficulty. All other indices also point to the Triplet Numbers Test as the easiest test in the battery.
Table 5. Correlations among the Variables Included in the Analyses of Metacognitive Performance

[Table 5 is a 20 x 20 lower-triangular correlation matrix among the variables of Table 4: 1.-5. the accuracy scores, 6.-10. the speed scores (s), 11.-15. the confidence scores, and 16.-20. the evaluation scores for the Counting Letters, Letter Series, Raven's Progressive Matrices, Swaps, and Triplet Numbers Tests.]

Note: Decimal points omitted.
In agreement with our previous results, Raven's Progressive Matrices Test takes longer to complete than most other tests in the battery, and in this sense it is the hardest. To attain the goals of this investigation, the correlation matrix in Table 5 was analyzed using both exploratory and confirmatory procedures. The use of confirmatory procedures may be particularly illuminating when exploratory analyses provide equivocal findings. As we shall see shortly, such is the case with the self-evaluation factor, relative to the self-confidence factor, in the present study.

Originally, four factors were expected on theoretical grounds. Since all five tests are markers of fluid intelligence, the accuracy scores would be expected to define a separate Gf factor. Furthermore, since measures of speed of mental operations have tended to cluster together and define a separate mental speed factor in our previous work (see Roberts & Stankov, in press), we expected a speed factor with loadings from the timed measures of all five tests. Of major interest to us in this study was the possibility of identifying metacognitive factors. Two such factors were expected to emerge: one corresponding to the self-confidence scores and another defined by the self-evaluation scores. Our previous studies have identified a self-confidence factor (Kleitman & Stankov, in press; Stankov & Crawford, 1997), but no data on the existence of a self-evaluation factor are available at this stage. Nevertheless, Stankov and Crawford (1996a,b) suggest that such a possibility should be entertained. The four factors mentioned above can be labeled "method" factors in this study, since they correspond to the four different types of scores that are derived from each test.

Exploratory Factor Analysis

The results of the exploratory factor analysis (maximum likelihood solution with oblimin rotation) are presented in Table 6. In accordance with the expectations, we extracted four factors, even though the root-one criterion suggests six factors. As can be ascertained through inspection of Table 6, the first two factors corresponded to expectations: these are the Gf factor and the Mental Speed (or Speed of Test-Taking) factor. The remaining two factors appear to capture metacognitive aspects of performance. There is, however, a tentative flavor to this interpretation, particularly with respect to the fourth factor.

The third factor in Table 6 can be interpreted either as a broad metacognitive trait or, perhaps more realistically, as a self-confidence trait. This factor has loadings from both confidence ratings and self-evaluation scores, although the loadings of the self-confidence measures are higher. Notice also that various scores from the Swaps Test tend to have somewhat high loadings on this factor. This opens the possibility of yet another interpretation: a Swaps trait factor. Given previous results with confidence ratings, Self-confidence is the preferred label for this factor. Although the fourth factor, too, may have different interpretations, it appears that the most appropriate label for this factor is in terms of Raven's Progressive Matrices Test: four scores from that test have the highest loadings on this factor, and all other loadings appear somewhat incidental. The appearance of a factor defined by a single test also corresponds to our previous results (see Kleitman & Stankov, in press; Stankov & Crawford, 1997).
These former studies employed exploratory and confirmatory factor analysis to show that different scores from the same test (e.g., accuracy, time, and confidence) may produce both "trait" and "method" factors.
Table 6. Exploratory Factor Analysis (Maximum Likelihood Followed by Oblimin Rotation) of the Correlations Presented in Table 5

Variable                                   Gf       Mental Speed  Confidence  Raven's
1. Counting Letters                        0.439*   -0.379*        0.065       0.053
2. Letter Series                           0.602*   -0.073         0.015       0.160
3. Raven's Standard Progressive Matrices   0.672*    0.091        -0.137       0.521*
4. Swaps                                   0.662*    0.024         0.449*     -0.263*
5. Triplets                                0.482*    0.112         0.150      -0.090
6. Counting Letters Time                  -0.159     0.534*        0.063       0.082
7. Letter Series Time                      0.103     0.732*        0.002       0.085
8. Raven's Matrices Time                   0.269*    0.688*       -0.049       0.362*
9. Swaps Time                              0.227*    0.543*        0.283*     -0.206
10. Triplets Time                         -0.113     0.418*        0.050      -0.201
11. Counting Letters Conf.                -0.067    -0.314*        0.413*      0.232*
12. Letter Series Conf.                   -0.029    -0.029         0.517*      0.402*
13. Raven's Matrices Conf.                 0.006     0.039         0.317*      0.786*
14. Swaps Conf.                            0.134     0.017         0.849*     -0.077
15. Triplets Conf.                        -0.401*    0.104         0.513*      0.074
16. Counting Letters Eval.                 0.030    -0.279*        0.260*      0.098
17. Letter Series Eval.                    0.136     0.000         0.366*      0.293*
18. Raven's Matrices Eval.                 0.095     0.029         0.167       0.640*
19. Swaps Eval.                            0.108     0.050         0.754*     -0.083
20. Triplets Eval.                         0.060     0.034         0.447*      0.068

Factor intercorrelations
Gf            1.00
Mental speed  0.031   1.00
Confidence    0.394   0.021   1.00
Raven's       0.237  -0.122   0.302   1.00

Note: * indicates salient loadings.
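For readers who wish to replicate this kind of analysis, the sketch below runs a maximum likelihood extraction with oblimin rotation using the third-party factor_analyzer package (the package choice and the synthetic stand-in data are assumptions; the real input would be the 339 x 20 score matrix underlying Table 5):

    import numpy as np
    import pandas as pd
    from factor_analyzer import FactorAnalyzer

    # Stand-in data: 339 cases x 20 variables with some shared variance.
    rng = np.random.default_rng(1)
    common = rng.normal(size=(339, 4)) @ rng.normal(scale=0.5, size=(4, 20))
    data = pd.DataFrame(common + rng.normal(size=(339, 20)))

    fa = FactorAnalyzer(n_factors=4, rotation="oblimin", method="ml")
    fa.fit(data)

    loadings = pd.DataFrame(fa.loadings_.round(3))  # cf. the pattern matrix above
    phi = fa.phi_                                   # factor intercorrelations
    print(loadings, phi, sep="\n")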
The first three factors of the present study are examples of "method" factors; the last factor is a "trait" factor. The correlations among the factors suggest that a broad accuracy/self-confidence/self-evaluation factor may appear at the second order of analysis.

Confirmatory Analyses

Fig. 2 depicts the generic model. This model is based on the rationale for the four expected "method" factors mentioned above and on our previous work with similar tasks on different samples of participants (e.g., Kleitman & Stankov, in press; Stankov & Crawford, 1997). The four method factors are: (a) a broad fluid intelligence factor expected to emerge from the accuracy scores (GF); (b) a broad test-taking speed factor expected to emerge from the time scores (SPEED); (c) a broad self-confidence factor (CONF); and (d) a broad self-evaluation factor (EVAL). Given the results of our previous analyses, this last factor is perhaps the most tentative of the "method" factors. In addition, our previous work with similar variables on other samples also suggests the existence of "trait" factors (e.g., Kleitman & Stankov, in press; Stankov & Crawford, 1997).
Figure 2. Structural model for the examination of metacognitive processes. Five tests (Letter Counting, Letter Series, Raven's, Swaps, and Triplet Numbers) provided four scores each (A = accuracy; S = speed; C = confidence; and E = evaluation). The main focus is on the "trait" factors, and a solution supporting the existence of the CONF and EVAL factors would suggest a role for metacognitive processes in the structure of cognition. Table 7 provides the relevant solution. Note: Test uniquenesses (Es) are not indicated in the diagram.
These are the factors that emerge from the different scores derived from the same test. Thus, the model assumes that there will be separate factors corresponding to the Counting Letters, Letter Series, Raven's Progressive Matrices, Swaps, and Triplet Numbers Tests. (Note that the labels SWAP [SWAPS] and TRIP [TRIPLETS] refer to different factors in Figs. 1 and 2 and in Tables 2 and 7.) The model of Fig. 2 also assumes that all method factors are correlated and that all trait factors are correlated, but that there is no correlation between the two types of factors. The solution based on the model in Fig. 2 produced acceptable RAMONA (Browne & Mels, 1994) goodness-of-fit indices: chi-square = 293.00, df = 140, p = 0.000, with a 90% confidence interval for the RMSEA of (0.048, 0.066). After fixing some of the non-salient loadings to zero, the solution reported in Table 7 was obtained.
Table 7. Confirmatory Factor Analysis (RAMONA Solution)

Variable                          GF      SPEED   CONF    EVAL    COUNT    SERIES   RAVEN   SWAP    TRIP     Es
1. Counting Letters               0.373                            0.686                                     0.390
2. Letter Series                  0.516                                     0.647                            0.315
3. Raven's Progressive Matrices   0.481                                              0.732                   0.233
4. Swaps                          0.484                                                      0.731           0.231
5. Triplets                       0.541                                                              0.481   0.475
6. Counting Letters Time                  0.396                   -0.236                                     0.788
7. Letter Series Time                     0.773                             0.098                            0.392
8. Raven's Matrices Time                  0.704                                      0.465                   0.288
9. Swaps Time                             0.574                                              0.400           0.511
10. Triplets Time                         0.406                                                      0.115   0.822
11. Counting Letters Conf.                        0.484            0.755                                     0.195
12. Letter Series Conf.                           0.570                     0.738                            0.131
13. Raven's Matrices Conf.                        0.544                              0.815                   0.040
14. Swaps Conf.                                   0.312                                      0.913           0.069
15. Triplets Conf.                                0.315                                              0.783   0.288
16. Counting Letters Eval.                                0.255    0.685                                     0.466
17. Letter Series Eval.                                   0.557             0.691                            0.212
18. Raven's Matrices Eval.                                0.271                      0.721                   0.406
19. Swaps Eval.                                           0.397                              0.691           0.364
20. Triplets Eval.                                        0.518                                      0.633   0.331

Factor intercorrelations
GF      1.000
SPEED   0.141   1.000
CONF    0.205   0.000   1.000
EVAL    0.000   0.000   0.587   1.000
COUNT   1.000
SERIES  0.180   1.000
RAVEN   0.296   0.552   1.000
SWAP    0.276   0.462   0.398   1.000
TRIP    0.164   0.332   0.246   0.481   1.000

Note: Based on the correlations presented in Table 5.
For that latter solution, a chi-square value of 279.21 with df = 137 (p = 0.000) and an acceptable RMSEA = 0.055 (0.046, 0.065) were obtained. All parameter values reported in Table 7 are significant at the 0.10 level.

Two additional analyses were run in order to check whether an acceptable solution could be obtained with a smaller number of "method" factors. First, we removed the EVAL factor from the model presented in Table 7. This produced a chi-square of 332.98 with df = 144, p = 0.000; RMSEA = 0.062 (0.054, 0.071). Second, we removed both the EVAL and CONF factors. This gave a chi-square value of 487.75 with df = 149, p = 0.000; RMSEA = 0.082 (0.074, 0.090). These results indicate unacceptable fit for both models. Furthermore, it is apparent that the deterioration is much more pronounced with the removal of the CONF factor.

Clearly, the only discrepancy between the model of Fig. 2 and the solution reported in Table 7 is in terms of the factor intercorrelations for the method factors. Out of the six possible correlations between the method factors, three are equal to zero, two are significant but relatively small (0.141 and 0.205), and only one is rather high (0.587): the correlation between the CONF and EVAL factors. Note that these two factors could not be clearly separated in the exploratory analyses of the present data. Therefore, it is possible that a separate metacognitive factor spanning self-confidence and self-evaluation may exist in reality, but the present evidence is somewhat weak.

Correlations among the trait factors in Table 7 are reasonably high, suggesting that a "general" factor with loadings on all 20 variables might be justified. We therefore attempted a five-factor solution with the four method factors and a single all-traits factor. This produced a chi-square = 1261.25 with df = 147, p = 0.000, RMSEA = 0.150 (0.142, 0.157). These goodness-of-fit indices are unsatisfactory.

It can be concluded that the analyses support the existence of a fluid intelligence factor (Gf) based on the accuracy scores and a test-taking mental speed factor (SPEED) based on the measures of time. Importantly, a separate self-confidence (CONF) factor is also supported. Although a distinct self-evaluation (EVAL) factor is present in our confirmatory solution, it is relatively poorly defined in some of the attempted solutions (e.g., the exploratory analysis).

Discussion

The manipulations of complexity employed in this study are relevant for our understanding of fluid intelligence. However, it is important to keep in mind that a comprehensive understanding can emerge only by looking at the whole range of complexity manipulations and by considering more than a single common factor. Similar conclusions may be derived from the examination of the metacognitive processes: not one, but at least four factors are needed to understand the relationships between the five tests and the four measures derived from each test.

Theoretical Accounts of Complexity

As mentioned in the Introduction, the empirical evidence is not in complete agreement with capacity theories. The prediction of these theories often fails at the boundary, that is, when capacity is being stretched to the limit. The approach adopted here is eclectic: any manipulation that changes factor loadings in a systematic way will suffice.
What are the salient features of the levels of the Swaps and Triplet Numbers Tests that can be offered as plausible explanations for the critical aspects of the psychological processes that lead to changes in complexity? Several explanations may exist and, to the extent that capacity limits have not been reached for the two tasks, even an explanation in terms of working memory may be acceptable. However, while memory load increases with the more complex levels of the Triplet Numbers task, the increase in memory load is minimal on the Swaps task. It will be recalled that the instructions for the Swaps Test remained visible on the computer screen throughout work on an item, and therefore the load on working memory was minimized, though not eliminated. It is also useful to note that reliance on memory does not by itself make a test more difficult: the Swaps Test is in fact more difficult than the memory-dependent Triplet Numbers Test.

A more plausible explanation, not conceptually completely different from the capacity explanation, is in terms of the failure of an attentional mechanism. In order to appreciate this point, it is necessary to observe that an essential aspect of the complexity manipulations of both the Swaps and Triplet Numbers Tests is the need to carry out operations involving an increasing number of mental steps. These mental steps are relatively simple in themselves; the increase is in the total number of steps. Thus, it appears that a plausible explanation for the increase in factor loadings may reside in lapses of attention as the series of mental steps is carried out in succession. People with lower scores on fluid intelligence tests are prone to making a larger number of such errors than people with high scores. Stankov (1983a,b, 1988a,b) describes several other measures of attention that show remarkably similar properties. Schweizer (1998) also defines complexity as a "high number of cognitive operations in information processing."

Theoretical Issues Related to Self-Confidence

Two theoretical issues of relevance for metacognitive research can be addressed with the present data. These are: (a) the account of overconfidence/underconfidence findings in terms of the "easy/difficult" or the "different processes" distinction; and (b) the existence of a separate metacognitive trait vis-a-vis the well-established traits based on accuracy and speed measures.

Changes in overconfidence are due to "easy/difficult" effects. Although the majority of tests used in the present study do not show the strong overconfidence that is typical of general knowledge (or crystallized intelligence) tasks, the manipulations of complexity for both the Swaps and Triplet Numbers Tests indicate that easy (or the least complex) versions of the tests lead to good calibration rather than underconfidence. The results of the factor analyses support this interpretation, since they indicate that no change in factor structure takes place under the complexity manipulations. The results also indicate that self-confidence and self-evaluation of performance tap different processes, since the former shows overconfidence and the latter under-evaluation. This is in accordance with the theory proposed by Gigerenzer et al. (1991).
Self-confidence is a separate psychological trait. Factor analysis of the four scores (accuracy, time, self-confidence, and self-evaluation) obtained from the five tests of the present battery indicates the existence of a self-confidence factor that is distinct from the accuracy (Gf) and time (SPEED) factors. This is in accordance with the findings of Kleitman and Stankov (in press), Stankov (1998a,b), Stankov (in press), and Stankov and
Crawford (1996a,b, 1997). The present results also suggest that it may be possible to separate a self-evaluation factor, albeit with some difficulty, from the other factors. In our work, self-confidence is seen as an aspect of a metacognitive process of self-monitoring that should be thought of as residing somewhere on the borderline between personality and intelligence. Clearly, there are reliable individual differences in self-confidence. Although it is easy to think of potential criteria that may be predicted by the self-confidence measures (e.g., jobs requiring decisions under conditions of uncertainty, including risk-taking) large-scale predictive validity studies are lacking at present. Stankov and Dolph (1999) report that self-confidence measures may be predictive of successful performance in jobs with low cognitive demand that may require some contact with public. The metacognitive process of self-evaluation is conceptually different from self-confidence. Its possible existence as a process distinguishable from that of self-confidence is also supportive of Gigerenzer at al.'s (1991) distinction. The factor of self-evaluation has been less strongly supported by our data than has been self-confidence. It is therefore impossible to predict if a higher-stratum metacognitive factor, encompassing these two lower-order factors will emerge in the future. The study of the structure of human cognitive abilities will inevitably move away from the reliance on exploratory factor analyses (e.g., the procedures used by Carroll, 1993; Jensen, 1998) and towards the use of confirmatory analyses (e.g., Gustaffson, 1997; Stankov & Roberts, 1997). Eventually, the latter are likely to prevail. The present study, however, illustrates the usefulness of combining the two in order to gain a better understanding of the data at hand. Acknowledgments: This research was supported by the Australian Research Council Small Grant ``Complexity and difficulty in fluid intelligence tasks'' to T. Raykov and L. Stankov. We are grateful to N. Karadimas for programming the tasks and to L. Kennedy and F. Buffier for assistance in data collection. Notes 1. This finding was also supported by factor analysis of scores derived from the calibration curves (i.e., bias scores). 2. Our experiences with this test indicate that Triplet1, Triplet2, and to some extent Triplet3 are so easy that participants start experiencing boredom and frustration if a 6-min time limit (employed with Triplet4) is imposed. In our previous work, shorter versions of these tests showed satisfactory psychometric properties. 3. In order to check if the monotonically increasing trend on factor loadings is statistically discernible, four additional solutions were obtained. In each of these, a set of four factor loadings was constrained to be equal. The aim was to establish whether an equally satisfactory solution as the one reported in Table 2 could be obtained. First, two RAMONA runs employed equality constraints on the Gf factor loadings for the Swaps Test (Chi-square = 81.34, df = 38) and the Triplet Numbers Test (Chi-square = 101.96, df = 38). Both solutions are clearly inferior to the solution reported in the main text and in Fig. 1. These findings are in contrast to what happens if the SWAPS (Chi-square = 74.60, df = 38) or TRIPLETS (Chi-square = 67.00, df = 38) factor loadings are instead constrained to be equal. 
The change in the goodness-of-fit indices is much smaller in the latter case, suggesting that the differences in the size of factor loadings on the Gf factor are more pronounced than those on the group factors; the chi-square difference logic behind such comparisons is sketched after these notes.

4. One of the reviewers of the present paper (P. Kyllonen) expressed serious concerns about the impact of the restriction of range on the conclusions about complexity. This is what he has to say with respect to the
RAMONA solution of Fig. 1: ``For both Swaps and Triplets, when percentage correct is above 90% (i.e., it's a simple test), the Gf loading is below .35; but when percentage correct is below 90% the loading is above .50. I don't think any further gradient in loading can be seen from the data. Therefore, I do not think any substantive conclusion is necessary to account for this phenomenon. Rather, the conclusion ought to be that if variance is restricted due to a ceiling effect, then don't expect high factor loadings.'' I find this to be an unduly conservative and one-sided position. Using the same cut-off point (90% accuracy), an alternative interpretation may be that the Swaps Test shows complexity effects while the Triplet Numbers Test does not. (Note, for example, that the range of standard deviations for the Swaps levels is approximately the same as it is for the first three Gf markers.) Also, there is a clear increase in factor loadings for the three most difficult levels, where the restriction in range is smaller. Furthermore, if 90% accuracy by itself implies a severe restriction in range, why are the loadings of Triplet2 and Triplet3 significant at all? Finally, corrections for the restriction of range (the standard form of which is sketched after these notes) produced a stronger trend in the Gf factor loadings in the LISREL solution of Table 2 than was observed in the uncorrected RAMONA solution presented in Fig. 1. If restriction of range were the major cause of the increase in factor loadings, the effect of its removal would have to be the equalization of factor loadings. This did not happen.
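Note 1 refers to bias scores derived from calibration curves. The following is a minimal sketch of how such a score is conventionally computed in the calibration literature (mean confidence minus proportion correct, with positive values indicating overconfidence); the data and variable names below are illustrative, not taken from this study.

```python
import numpy as np

def bias_score(confidence, correct):
    """Calibration bias: mean confidence minus proportion correct.

    confidence: per-item confidence judgments on a 0-1 scale
    (e.g., 0.75 for a "75% sure" response).
    correct: per-item accuracy coded 0/1.
    Positive values indicate overconfidence; negative, underconfidence.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return confidence.mean() - correct.mean()

# Illustrative responses for one participant on a 10-item test.
conf = [0.9, 0.8, 1.0, 0.7, 0.9, 0.6, 0.8, 1.0, 0.9, 0.7]
acc = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
print(f"bias = {bias_score(conf, acc):+.2f}")  # bias = +0.23 (overconfidence)
```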
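Note 3 compares equality-constrained solutions against the unconstrained one. A sketch of the underlying nested-model (chi-square difference) test is given below. The unconstrained baseline values (Chi-square = 60.0, df = 35) are hypothetical stand-ins, since only the constrained fits are reported in the note; the assumption of three extra degrees of freedom follows from setting four loadings equal to one another.

```python
from scipy.stats import chi2

def chi_square_difference(chisq_constrained, df_constrained, chisq_free, df_free):
    """Likelihood-ratio test comparing two nested structural models.

    A significant difference indicates that the equality constraints
    produce a reliably worse fit than the unconstrained solution.
    """
    d_chisq = chisq_constrained - chisq_free
    d_df = df_constrained - df_free
    return d_chisq, d_df, chi2.sf(d_chisq, d_df)

# Constrained chi-squares as reported in Note 3 (each with df = 38);
# the unconstrained baseline (60.0 with df = 35) is hypothetical.
for label, chisq_c in [("Gf loadings, Swaps", 81.34),
                       ("Gf loadings, Triplets", 101.96),
                       ("SWAPS loadings", 74.60),
                       ("TRIPLETS loadings", 67.00)]:
    d, ddf, p = chi_square_difference(chisq_c, 38, 60.0, 35)
    print(f"{label}: delta chi-square = {d:.2f}, delta df = {ddf}, p = {p:.4f}")
```

On these hypothetical baseline values, the constraints on the Gf loadings would be rejected far more decisively than those on the group factors, which is the pattern the note describes.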
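Note 4 turns on corrections for restriction of range. The sketch below implements the standard univariate (Thorndike Case 2) correction for direct range restriction; whether this particular formula matches the correction applied to the LISREL solution of Table 2 is an assumption, and the input values are illustrative only.

```python
import math

def correct_for_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction for direct range restriction.

    r_restricted: correlation observed in the range-restricted sample.
    The ratio u of unrestricted to restricted standard deviation
    exceeds 1 when variance has been curtailed (e.g., by a ceiling
    effect), and the corrected correlation grows accordingly.
    """
    u = sd_unrestricted / sd_restricted
    return (u * r_restricted) / math.sqrt(1 + (u**2 - 1) * r_restricted**2)

# Illustrative values: a correlation of .35 observed where a ceiling
# effect has halved the standard deviation corrects to about .60.
print(round(correct_for_restriction(0.35, 2.0, 1.0), 2))  # 0.6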
References

Baranski, J. V., & Petrusic, W. M. (1994). The calibration and resolution of confidence in perceptual judgements. Perception and Psychophysics, 55, 412–428.
Browne, M. W., & Mels, G. (1994). RAMONA PC: User's guide. Columbus, OH: Department of Psychology, Ohio State University.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York: Cambridge University Press.
Cattell, R. B. (1971). Abilities: Their structure, growth and action. Boston: Houghton Mifflin.
Crawford, J., & Stankov, L. (1996). Age differences in the realism of confidence judgments: A calibration study using tests of fluid and crystallized intelligence. Learning and Individual Differences, 6, 84–103.
Crawford, J., & Stankov, L. (1997). Individual differences in the realism of confidence judgments: Overconfidence in measures of fluid and crystallized intelligence. Australian Journal of Psychology, 8, 83–103.
Ferrell, W. R. (1995). A model for realism of confidence judgments: Implications for underconfidence in sensory discrimination. Perception and Psychophysics, 57, 246–254.
Fogarty, G., & Stankov, L. (1988). Abilities involved in performance on competing tasks. Personality and Individual Differences, 9(1), 35–49.
Gigerenzer, G., Hoffrage, U., & Kleinbölting, H. (1991). Probabilistic mental models: A Brunswikian theory of confidence. Psychological Review, 98, 506–528.
Gustafsson, J.-E. (1997). Measuring and understanding G: Experimental and correlational approaches. Paper presented at the conference ``Future of Learning and Individual Differences Research: Processes, Traits, and Content,'' University of Minnesota, Oct. 1997.
Horn, J. L. (1997). A basis for research on age differences in cognitive capabilities. In J. J. McArdle & R. W. Woodcock (Eds.), Human cognitive abilities in theory and practice. Chicago, IL: The Riverside Publishing Company.
Hunt, E. (1980). Intelligence as an information processing concept. British Journal of Psychology, 71, 449–476.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.
Jöreskog, K. G., & Sörbom, D. (1993a). PRELIS 2: User's guide. Chicago, IL: Scientific Software International.
Jöreskog, K. G., & Sörbom, D. (1993b). LISREL 8: User's guide. Chicago, IL: Scientific Software International.
Juslin, P., & Olsson, H. (1997). Thurstonian and Brunswikian origins of uncertainty in judgment: A sampling model of confidence in sensory discrimination. Psychological Review, 104, 344–366.
Kleitman, S., & Stankov, L. (in press). Ecological and person-driven aspects of metacognitive processes in test-taking. Applied Cognitive Psychology.
Kyllonen, P. C., & Christal, R. E. (1990). Reasoning ability is (little more than) working memory capacity? Intelligence, 14, 389–433.
Myors, B., Stankov, L., & Oliphant, G. W. (1989). Competing tasks, working memory and intelligence. Australian Journal of Psychology, 41(1), 1–16.
Raykov, T., & Stankov, L. (1993). On task complexity and ``simplex'' correlation matrices. Australian Journal of Psychology, 45, 81–87.
Roberts, R. D., Beh, H. C., & Stankov, L. (1988). Hick's law, competing tasks, and intelligence. Intelligence, 12(2), 111–131.
Roberts, R. D., & Goff, G. N. (1997). ASVAB: Little more than Gc? Paper presented at the International Military Testing Association meeting in Sydney, Australia, Oct. 15, 1997.
Roberts, R., & Stankov, L. (in press). Individual differences in speed of mental processing and human cognitive abilities: Towards a taxonomic model. Learning and Individual Differences.
Schweizer, K. (1996). The speed–accuracy transition due to task complexity. Intelligence, 22, 115–128.
Schweizer, K. (1998). Complexity of information processing and the speed–ability relationship. The Journal of General Psychology, 125, 89–102.
Spearman, C. (1927). The abilities of man: Their nature and measurement. New York: Macmillan.
Spilsbury, G., Stankov, L., & Roberts, R. (1990). The effects of a test's difficulty on its correlation with intelligence. Personality and Individual Differences, 11(10), 1069–1077.
Stankov, L. (1983a). The role of competition in human abilities revealed through auditory tests. Multivariate Behavioral Research Monographs, 83-1, 63 and VII.
Stankov, L. (1983b). Attention and intelligence. Journal of Educational Psychology, 75(4), 471–490.
Stankov, L. (1988a). Single tests, competing tasks, and their relationship to the broad factors of intelligence. Personality and Individual Differences, 9(1), 25–33.
Stankov, L. (1988b). Aging, intelligence and attention. Psychology and Aging, 3(2), 59–74.
Stankov, L. (1989). Attentional resources and intelligence: A disappearing link. Personality and Individual Differences, 10(9), 957–968.
Stankov, L. (1994). The complexity effect phenomenon is an epiphenomenon of age-related fluid intelligence decline. Personality and Individual Differences, 16(2), 265–288.
Stankov, L. (1998a). Calibration curves, scatterplots and the distinction between general knowledge and perceptual tests. Learning and Individual Differences, 8, 28–51.
Stankov, L. (1998b). Self-confidence and constructs tapped by accuracy and speed measures. Manuscript under review, Australian Journal of Psychology.
Stankov, L. (in press). Accuracy, confidence and speed in perceptual tasks: Calibration effects and individual differences. Journal of Experimental Psychology: Human Perception and Performance.
Stankov, L., Boyle, G., & Cattell, R. B. (1995). Models and paradigms in personality and intelligence research. In D. Saklofske & M. Zeidner (Eds.), International handbook of personality and intelligence (Chap. 2, pp. 15–43). Sydney: Plenum Publishing Company.
Stankov, L., & Crawford, J. D. (1993). Ingredients of complexity in fluid intelligence. Learning and Individual Differences, 5(2), 73–111.
Stankov, L., & Crawford, J. (1996a). Confidence judgments in studies of individual differences support the ``confidence/frequency effect''. In C. Latimer & J. Michell (Eds.), At once scientific and philosophic: A festschrift in honour of J. P. Sutcliffe (pp. 215–239). Sydney: The University of Sydney Press.
Stankov, L., & Crawford, J. D. (1996b). Confidence judgments in studies of individual differences. Personality and Individual Differences, 21(6), 971–986.
Stankov, L., & Crawford, J. (1997). Self-confidence and performance on cognitive tests. Intelligence, 25, 93–109.
Stankov, L., & Cregan, A. (1993). Quantitative and qualitative properties of an intelligence test: Series completion. Learning and Individual Differences, 5(2), 137–169.
Stankov, L., & Dolph, B. (1999). Some experiences with Stankov's tests of cognitive abilities (STOCA). Paper presented at the Industrial and Organizational Psychology Conference in Brisbane.
Stankov, L., Fogarty, G., & Watt, C. (1989). Competing tasks: Predictors of managerial potential. Personality and Individual Differences, 9, 295–302.
Stankov, L., & Myors, B. (1990). The relationship between working memory and intelligence: Regression and COSAN analyses. Personality and Individual Differences, 11(10), 1059–1068.
Stankov, L., & Raykov, T. (1995). Modeling complexity and difficulty in measures of fluid intelligence. Structural Equation Modeling, 2(4), 335–366.
Stankov, L., & Roberts, R. D. (1997). Mental speed is not the ``basic'' process of intelligence. Personality and Individual Differences, 22(1), 69–84.
Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs (Vol. 1). Chicago: University of Chicago Press.