JOURNAL OF APPLIED DEVELOPMENTAL PSYCHOLOGY 13, 153-171 (1992)
The Fagan Test of Infant Intelligence: A Critical Review APRIL ANN BENASICH Rutgers University
ISAAC I. BEJAR Educational Testing Service
Convincing evidence that early visual recognition memory (VRM) is a potential index of later cognitive function is rapidly accumulating. A standardized test of infant VRM to be used as a screening device to identify those infants at high risk for later cognitive delays appears to be a logical next step. The Fagan Test of Infant Intelligence (FTII) is the first such screening measure that has been commercially marketed. The FTII is well grounded empirically and as a research tool is well thought out and systemized. However, when considered as a standardized clinical test some major deficits emerge, including inadequate statistical documentation, a nominal standardization sample, and insufficient reliability and validity information. The FTII is critiqued, and methodological, statistical, and theoretical issues pertaining to this test and to dissemination of such instruments are discussed.
Editor's Note. The authors were invited to review the Fagan Test of Infant Intelligence because of its significance for infant assessment. Joseph Fagan and Douglas Detterman accepted the invitation to prepare an extensive technical reply, which follows this article--Irving E. Sigel, Ed.
We wish to thank numerous colleagues for sharing their laboratory experiences with us and for their helpful comments on a previous draft of this review. Correspondence and requests for reprints should be sent to April A. Benasich, Center for Molecular and Behavioral Neuroscience, Rutgers University, 197 University Avenue, Newark, NJ 07102.

The nature of early infant competence, its assessment, and its relation to later cognitive functioning have historically captured the interest of developmental investigators. Theories which assumed basic underlying continuities in cognitive abilities predicted that there would be systematic relationships between early infant competence and later cognitive performance. However, failure to discover these types of continuities, as evidenced by low cross-age correlations between tests in infancy and IQ tests in early childhood, cast doubt on such interpretations and led to the view that the nature of intelligence during infancy is essentially discontinuous, with no corollary of later intellectual ability identifiable (Eysenck & Kamin, 1981; Kopp & McCall, 1982; Lewis, 1983; McCall, 1979, 1981; McCall, Hogarty, & Hurlburt, 1972; Piaget, 1950; Vernon, 1980; Wohlwill, 1980). For example, the median Pearson correlation between conventional infant
BENASICH AND BEJAR
tests given to 5- to 7-month-olds and standardized IQ scores is r = .25 at 3 years, r = .20 at 4 to 5 years, and only r = .06 at 6 years and older (Fagan & Singer, 1983). Nevertheless, evidence for continuity in cognitive development and thus predictability of intellectual competence from the first year of life to later childhood has continued to be eagerly sought. There has been an urgent and increasing need to identify those infants whose developmental outcome may be compromised for one reason or another, that is, infants at higher risk of physical, emotional, social, and cognitive deficits or delays: preterm, low-birth-weight, or physically handicapped infants and those born of drug-dependent, malnourished, or low-IQ mothers. As infant intervention techniques prove effective in ameliorating cognitive delays and enhancing maternal as well as child functioning, early identification becomes paramount (Benasich, Brooks-Gunn, & Clewell, in press; Clewell, Brooks-Gunn, & Benasich, 1989).

A number of investigators have suggested that the lack of correspondence between the commonly used infancy tests and later assessments of child cognitive competence may be a function of the nature and content of the tests themselves (e.g., Bornstein & Sigman, 1986; Fagan & Singer, 1983; S. Rose, Feldman, & Wallace, 1988). The items that comprise global tests of infant intelligence are strongly dependent on sensorimotor capabilities such as reaching, grasping, and hand-eye coordination, skills which apparently do not relate to differences in cognitive ability later in life. As tests in later childhood add items which tap discrimination, memory, categorization, and abstraction, the correlations with cognitive competence later in life rise substantially. Therefore, the key to demonstrating continuity and thus predictability of cognitive ability is to tap conceptual processes similar to those exhibited in psychometric tests of IQ.
Recent literature suggests that promising candidates for an infant analogue for later information processing may be habituation and recognition memory (Berg & Sternberg, 1985; Bornstein & Sigman, 1986; Fagan & Singer, 1983; Fagan, Singer, Montie, & Shepherd, 1986; D. Rose, Slater, & Perry, 1986; S. Rose, Feldman, McCarton, & Wolfson, 1988; S. Rose, Feldman, Wallace, & Cohen, 1991; S. Rose, Feldman, Wallace, & McCarton, 1989; Sternberg, 1985). For the most part, these measures have utilized the infants' distribution of attention to visual targets, in particular, visual recognition memory. Visual recognition memory (VRM) takes advantage of the infant's propensity to differentially attend to novel, as compared to familiar, stimuli. The outcome of this paradigm, frequently referred to as "novelty preference", has been interpreted as reflecting the infant's information-processing skills. Measures of VRM have been found to discriminate among groups of infants expected to differ in intelligence later in life, including infants with Down syndrome (Cohen, 1981 ; Miranda & Fantz, 1974), low-birth-weight, preterm infants (Caron & Caron, 1981; S. Rose, 1980, 1981, 1983; S. Rose, Gottfried, & Bridger, 1979; S. Rose & Wallace, 1985a, 1985b; Sigman, 1983; Sigman & Parmelee, 1974), and
CRITICAL REVIEW OF FAGAN TEST OF INTELLIGENCE
failure-to-thrive infants (Fagan & Singer, 1983; Singer, Drotar, Fagan, Devost, & Lake, 1983), as well as infants of highly intelligent parents (Fantz & Nevis, 1967; see Fagan & Singer, 1983). Reasonable predictive validity has been suggested by moderate correlations (rs = .33 - .66) between VRM in infancy and cognitive achievement in childhood (Caron, Caron, & Glass, 1983; Fagan & McGrath, 1981; Fagan & Singer, 1983; Fagan et al., 1986; Lewis & Brooks-Gunn, 1981; D. Rose et al., 1986; S. Rose & Wallace, 1985a; S. Rose et al., 1991; Thompson, Fagan, & Fulker, 1991). Certainly these median correlations, with an average value of .44 (SD = .09) for VRM tests, are significantly greater than the average values for infant sensorimotor tests: .11 for normal samples and .18 for compromised samples tested in a comparable age range (Fagan & Singer, 1983).

Given the promising evidence that early VRM is a potential index of later cognitive function, developing a standardized test of infant VRM which could be used as a screening device to identify those infants at high risk for later cognitive delays seems a natural next step. Fagan and colleagues have taken that step by bringing to bear the years of experience the Fagan lab has had in using the VRM paradigm for the study of infant visual perception and information processing. The Fagan Test of Infant Intelligence (Fagan & Shepherd, 1987) is well grounded empirically and as a standardized procedure/research tool is well thought out and systematized; however, as we will see, when considered as a standardized test some major deficits emerge. These deficits include inadequate statistical documentation, lack of rationale regarding test development, a nominal standardization sample, and insufficient reliability and validity information. First, the FTII will be described, to be followed by a critique and a more general discussion.
THE FAGAN TEST OF INFANT INTELLIGENCE

The FTII is a standardized test of VRM to be utilized as a selective screening device for the early detection of later cognitive delays. It is based upon the well-known tendency of infants to differentially fixate novel, as compared to previously seen, visual stimuli. Novelty preference is tested by presenting the infant with one picture (or two identical pictures) to study for a preset accumulated looking time and then pairing the now familiar picture with a new or novel picture. A novelty preference score is computed for each test item by dividing the time spent looking at the novel picture during the test trial by the total amount of looking at both stimuli during that time. Infants from 6.5 to 12 months old are presented with a series of 10 such "novelty problems" at each of four ages: 67, 69, 79, and 92 weeks postconceptual age (i.e., for term infants at 27, 29, 39, and 52 weeks postnatal age). Study times and form of novelty problems vary as a function of the infant's postconceptual age. The outcome at each age is a mean novelty preference score computed across problems.
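The scoring rule described above can be illustrated with a minimal Python sketch. The function names and the looking times are hypothetical and are not taken from the FTII software; the sketch only assumes that each problem yields a looking time to the novel picture and to the familiar picture, and that scores are expressed as percentages.

```python
from statistics import mean


def novelty_preference(novel_s, familiar_s):
    """Percentage of test-trial looking time spent on the novel picture."""
    total = novel_s + familiar_s
    if total <= 0:
        raise ValueError("infant did not look at either picture")
    return 100.0 * novel_s / total


def mean_novelty_preference(problems):
    """Mean novelty preference across (novel_s, familiar_s) pairs."""
    return mean(novelty_preference(n, f) for n, f in problems)


# Hypothetical looking times in seconds for three novelty problems.
looks = [(6.0, 4.0), (5.0, 5.0), (7.0, 3.0)]
score = mean_novelty_preference(looks)
print(f"mean novelty preference: {score:.1f}%")
```

A score of 50% would indicate no preference between the novel and familiar pictures; scores above 50% indicate a preference for novelty.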
Equipment

The FTII is designed to be a complete self-contained testing system that can be reliably and easily administered. A comprehensive training manual and a video training tape are included with the FTII system. The basic system includes: computerized testing and scoring system (Commodore computer with monitor and disc drive, printer, and computer cart); testing and scoring software; viewing stage; test pictures/stimuli; seating system; and training video, manual, and qualifying tests. Infantest Corporation also includes training for one test administrator at the time the system is installed and provisions for additional administrators to be trained.
Procedure

Detailed instructions are provided for testing, including setting up the system, running the computer program, conducting the screening test, and establishing and maintaining rapport with the parents and infant. Briefly, the parent holds the infant on his or her lap during the procedure. The basic procedure consists of a fixed-interval, cumulative, familiarization episode (to either a pair of stimuli or a single stimulus depending on age level) followed by two 10-s visual recognition tests in which the familiarized stimulus is paired with a novel stimulus. Some novelty problems at some ages include two test trials and no familiarization, whereas others have two familiarizations each followed by two test trials. Decisions regarding choice of testing sequence are computer guided with an "override" feature enabling the researcher to use an individualized testing sequence if desired. Normally, a standardized sequence of pictures is presented; all stimulus sets consist of pictures of adult or infant faces. Each test at a particular age is composed of 10 novelty problems. The administrator judges infant looking via the corneal reflection technique, utilizing a peephole in the viewing stage. When the corneal reflection of the target is superimposed over the infant's pupil, the administrator presses a key that inputs the infant-looking data directly into the timer and memory of the computer. The computer signals when familiarization times are complete, times the test trials, and scores the test.
Reliability and Validity

Only predictive validity is reported in the FTII manual and is taken in large part from the previously cited research reporting the relationship between tests of infant VRM and later intelligence (Caron et al., 1983; Fagan, 1981, 1984; Fagan & McGrath, 1981; Fagan & Shepherd, 1987; Fagan & Singer, 1983; Lewis & Brooks-Gunn, 1981; Rose & Wallace, 1985a, 1985b). Predictive validity for the present version of the FTII is presented based on three samples of infants (N = 124), between 3 and 7 months old, suspected to be at risk for later cognitive deficits (Fagan & Montie, 1986; Fagan & Shepherd, 1987; Fagan, Shepherd, & Montie, 1987; Fagan et al., 1986). Infants included those born preterm/low-birth-weight (< 1500 g), growth retarded in utero,
treated for hypothyroidism, failure-to-thrive, and babies of diabetic mothers (Fagan & Shepherd, 1987). Infants received a series of novelty problems at postconceptual ages of 52 (3 problems), 56 (2 problems), 62 (4 problems), and 69 weeks (3 problems), with the second sample receiving three additional problems at 69 weeks. A mean novelty preference was computed across problems with a minimum number of 7 novelty problems per infant and maximum of 12 in the first sample and 15 in the second sample. All children's cognitive development was assessed at three years by the Stanford-Binet (Terman & Merrill, 1973) and the Peabody Picture Vocabulary Test-Form L (PPVT; Dunn & Dunn, 1981).¹ An IQ score was generated for each child by computing a mean IQ score across the two tests (see Fagan & Montie, 1986).

In the first sample (n = 62), mean novelty preference was 59.5% (SD = 8.1) and mean IQ was 96.3 (SD = 23.1); prevalence of retardation in this sample was 13%. Sensitivity to retardation was 75%, with novelty preference scores correctly identifying 6 of 8 retarded children. Specificity to normality was 91% (49 of 54 children correctly identified). Thus, validity of the FTII for retardation (the percentage of infants with VRM ≤ 53% whose IQ scores were ≤ 70) in this sample was 55% and validity for predicting normality (the percentage of infants with VRM > 53% whose IQ scores were > 70) was 96% (Fagan et al., 1986). In the second sample (n = 62), mean novelty preference was 59.7% (SD = 5.4), mean IQ at 3 years was 99.3 (SD = 16.2), and the prevalence of cognitive delay was 8%. Sensitivity to retardation was computed to be 80% (4 of 5 children), and specificity to normality was 91% (52 of 57 children). Validity for retardation was 50% and validity for predicting normality was 97% (Fagan & Montie, 1986). An additional sample of 20 failure-to-thrive infants (Fagan, Singer, & Montie, 1985) is included in the overall estimates of validity, sensitivity, and specificity.
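To make the relation among these indices explicit, the following Python sketch recomputes them from the four cells of a 2 × 2 table. The cell counts for the first sample are inferred from the figures reported above (6 of 8 delayed children identified; 49 of 54 normal children identified), and the function and key names are illustrative rather than drawn from the FTII documentation.

```python
def screening_indices(tp, fn, fp, tn):
    """Screening indices from the four cells of a 2 x 2 table.

    tp: low VRM and low IQ      fn: high VRM but low IQ
    fp: low VRM but normal IQ   tn: high VRM and normal IQ
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "validity_for_retardation": tp / (tp + fp),
        "validity_for_normality": tn / (tn + fn),
        "prevalence": (tp + fn) / (tp + fn + fp + tn),
    }


# First sample (n = 62), counts inferred from the reported 6/8 and 49/54.
first_sample = screening_indices(tp=6, fn=2, fp=5, tn=49)
for name, value in first_sample.items():
    print(f"{name}: {value:.1%}")
```

With these counts the sketch gives values that round to the 75%, 91%, 55%, 96%, and 13% reported for the first sample. Note that the two "validity" indices depend on prevalence, whereas sensitivity and specificity do not.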
¹ However, for three of the children, IQ was estimated from the Bayley Scales of Mental Development.

Comparison of the FTII with the Bayley MDI is presented for 27 infants, with the FTII showing much higher sensitivity, specificity, and validity (cf. Greenfield, 1989). An age-appropriate sample (6.5- to 12-month-olds) tested on a revised 40-item FTII is described in an abstract as 119 normal infants and 22 infants at risk for later mental retardation (Fagan et al., 1987). Significant differences in novelty preferences were seen between normal and at-risk infants on 78% of items administered at four age points (binomial test, z = 10.64, p < .0003). Split-half reliability is reported as .58 and the false positive rate as 3.6%. However, no further data or sample information is provided. Fagan (personal communication, April 24, 1991) reports that data are being prepared on an additional 128 normal, middle-class infants, of whom 78, to date, have been followed to 3 years. According to Fagan, infants in this sample were tested a maximum of four times from 6 to 12 months. The mean novelty preference over tests at four ages was
61% (SD = 4.3); mean IQ score (Stanford-Binet and PPVT) for this sample at 3 years was 107 (SD = 15). Correlations between VRM from 6 to 12 months and 3-year IQ ranged from .47 to .50. Fagan also reports that interobserver reliability was assessed on a group of 90 testers with educational backgrounds ranging from high school through graduate school; reliability averaged 91% (see Fagan & Detterman, 1992).
Studies Comparing FTII and Bayley Scales

Two independent studies have attempted to compare the FTII and the Bayley Scales of Infant Development (Bayley, 1969) as measures of cognitive functioning. Greenfield (1989) awarded pass/fail scores to each infant in a sample of 122 three-month-olds based on their performance on the FTII, Bayley scales Mental Development Index (MDI), and Bayley scales Psychomotor Development Index (PDI). Because this was a concurrent analysis, no validity statistics are available; however, the Bayley scales lacked sufficient sensitivity to identify infants performing at low levels. There was no difference on the major birth variables (birth weight, gestational age, head circumference) between those infants failing and passing the Bayley PDI. Infants failing the MDI had somewhat lower values on these birth variables; however, the differences were not statistically significant. The infants failing the FTII did have significantly lower mean birth weight, gestational age, head circumference, and crown-heel length than infants who passed the FTII. Discriminant analyses based on these variables could not reliably predict FTII outcome. Although the FTII was a more sensitive indicator of current developmental status than the Bayley MDI, the small number of FTII test items at any given age renders the stability of infant scores from any single administration of the FTII suspect. In sum, multiple administrations of the FTII are more likely to provide increased power for prediction of developmental delay.

In another study comparing the FTII and the Bayley scales as a measure of cognitive functioning, 77 infants at risk for later cognitive delays were tested at 67 weeks postconceptual age (6 months), and 21 (so far) of these infants were retested at 92 weeks (12 months) (Swales, Claussen, Greenfield, & Bauer, 1989).
Using Fagan's novelty percent cutoff score (≤ 53%), the 6-month FTII predicted that 15.6% of these infants were at risk for cognitive delay, whereas the 6-month Bayley MDI and PDI produced risk predictions of 2.6% and 6.5%, respectively. (Sensitivity and specificity values will not be available until further follow-up is completed.) No risk factor was uniquely associated with membership in the predicted risk subset. FTII scores at 6 months were significantly related to 12-month Bayley MDI (r = .68, p = .005) and Bayley PDI (r = .54, p = .04). Twelve-month FTII scores were not significantly related to Bayley scores either at 6 or 12 months. As noted in a number of other studies (e.g., DiLalla et al., 1990; Fagan, 1984), FTII scores at 6 and 12 months were not significantly correlated (r = .16), leading the authors to suggest that different factors might underlie performance at these two ages. Given the small sample
size, this interesting hypothesis cannot be explored. Interitem correlations and split-half reliabilities for the 6-month FTII across the 10 paired-comparison trials were low and nonsignificant, a consistent finding across studies which does not seem to significantly attenuate the predictive validity of VRM.

A recently published study of seven measures of infant cognitive development, including the FTII, utilizing a sample of 208 twin pairs, also found a lack of stability for the FTII between 7 and 9 months (DiLalla et al., 1990). Although a composite FTII score predicted mean twin Stanford-Binet IQ at 3 years as well as mean parent Wechsler Adult Intelligence Scale-Revised (WAIS-R; Wechsler, 1958) IQ, short-term test-retest correlations (after approximately 30 min) were r = .15 at 7 months and r = .35 at 9 months; test-retest reliability from 7 to 9 months was r = -.10 for the first presentation at each age and r = .14 for the second or retest presentation (DiLalla et al., 1990, pp. 761, 764). DiLalla et al. comment that this result suggests that the FTII may be measuring different skills at 7 and 9 months; however, evidence for this hypothesis awaits further longitudinal testing.
Independent Studies Assessing the Reliability of VRM

Psychometric soundness (i.e., reliability, stability, and internal consistency) of VRM is just beginning to be assessed; given that these measures assess individual differences, the issues are critical ones. As S. Rose, Feldman, and Wallace (1988) observe, reliability of measurement becomes a critical factor when examining individual differences as reliability sets an upper limit to an instrument's ability to correlate with other measures. Two recent studies using measures other than the FTII suggest that measures of infant VRM show good reliability and reasonable sensitivity and specificity when averaged across multiple tasks. However, internal consistency and task-to-task reliability were found to be consistently low.

Columbo, Mitchell, and Horowitz (1988) assessed the short-term reliability and longitudinal stability of VRM using the standard paired-comparison paradigm and multiple discrimination tasks on two samples of 4- to 7-month-old infants. One sample, comprising 4-month-olds (N = 35) and 7-month-olds (N = 40), was tested with a series of 10 paired-comparison or novelty problems presented twice within a 2- to 3-week span, 5 problems per session. The second sample (N = 52) was followed longitudinally with the same tasks that were administered to the cross-sectional sample, presented at 4 and 7 months. Novelty preference showed good short-term stability at both ages (r = .40, p < .05 at 4 months, and r = .51, p < .01 at 7 months), but only moderate longitudinal stability from 4 to 7 months (r = .34, p < .05). Internal consistency of infant novelty percentages was variable and quite low, with mean task-to-task correlations of .20 at 4 months and .24 at 7 months. Analysis of reliability based on the number of tasks completed per session revealed a large effect of number of tasks within the assessment as well as an effect of number of assessments. Columbo et
al. (1988) suggest "that a single assessment is probably inappropriate for any study of individual differences during infancy" (p. 1207).

A second study examined issues of reliability, stability, and predictive validity in a sample of full-term and high-risk preterm infants at three ages (corrected for prematurity): 6, 7, and/or 8 months old (S. Rose, Feldman, & Wallace, 1988). A battery of 11 paired-comparison problems (abstract patterns, faces, geometric forms, and cross-modal transfer) was administered at each age. Aggregate scores of 6 to 11 problems across categories significantly predicted 3-year Stanford-Binet IQ with correlations ranging from r = .37 to .63; scores clustered between .50 and .60. Abstract patterns and faces most strongly predicted 3-year IQ. Combining data across age increased predictability with multiple R = .72. Cross-age correlations (stability) were moderate, ranging from r = .30 to .50; however, internal consistency as indexed by alpha coefficients was surprisingly low. Average interitem correlations ranged from -.16 to .13. Composite score alphas were also quite low, with the median alpha for nine scores (three at each of the three ages) equal to .18. Utilizing cut points that would maximize sensitivity and specificity (54% for IQ scores dichotomized at 85; 50% for IQ scores dichotomized at 70), prediction of mental deficit was moderate. Sensitivity to retardation (percentage of children with low IQ having low VRM scores in infancy) was 74%, with 23 of 31 children with IQs below 85 identified by their infant VRM scores, and 73%, with 8 of 11 children with IQs below 70 correctly identified. Specificity to normality (percentage of children with normal IQ having high VRM in infancy) was 83%, with 30 of 36 children scoring above 85 IQ at age 3 having had high infant VRM, and 75%, with 42 of 56 children scoring above 70 IQ at age 3 having had high infant VRM.
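The gain from aggregating tasks that both studies describe is in line with what the standard Spearman-Brown formula would lead one to expect. The Python sketch below applies that formula to the mean task-to-task correlations of .20 and .24 reported by Columbo et al. (1988). It is an illustration only: neither study is described here as using this formula, and it assumes the tasks are roughly parallel measures of the same attribute, which the low internal consistency figures call into question.

```python
def spearman_brown(mean_r, k):
    """Projected reliability of a composite of k parallel tasks whose
    average intercorrelation is mean_r (Spearman-Brown formula)."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Mean task-to-task correlations reported at 4 and 7 months.
for age, mean_r in (("4 months", 0.20), ("7 months", 0.24)):
    for k in (1, 5, 10):
        projected = spearman_brown(mean_r, k)
        print(f"{age}, {k:2d} tasks: projected reliability = {projected:.2f}")
```

Under these assumptions a single task is a poor measure of individual differences, while a composite of 10 tasks would be projected to reach a reliability in the low .70s, which is consistent in spirit with the recommendation against relying on a single assessment.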
Using Fagan's (Fagan et al., 1986) indices, the predictive validity for normality (percentage of infants with high VRM scores who had IQ scores above the designated cut point at age 3) was 79% for IQ ≥ 85, and 93% for IQ ≥ 70. Predictive validity for delay (the percentage of infants with low VRM whose IQ scores were below the designated cut point at age 3) was moderate for milder delay with 79% of children with IQ below 85 identified; however, predictive validity was poor for more severe delay, identifying only 36% of children with subsequent IQs below 70. Although full-term infants' scores tended to be higher on average than preterm infants' scores, sometimes significantly so, similar relations were obtained for both preterm and full-term infants.

These data certainly suggest that information processing in infancy (as measured by VRM) is related to early childhood IQ. Furthermore, it is clear that aggregation of VRM tasks within and across ages improves predictive validity substantially. When looking at the ability of VRM to identify individual variation rather than group performance, however, the low success rate in predicting more severe delays (IQ < 70) raises a red flag. Even the more respectable percentages for predictive validity fail to accurately categorize about 40% of children (21% miscategorized as high risk for low IQ and 21% miscategorized as low risk, when in reality they were at high risk for a below normal IQ score).

CRITIQUE OF THE FAGAN TEST OF INFANT INTELLIGENCE AS A PSYCHOMETRIC INSTRUMENT

The foregoing suggests that a psychometric instrument based on VRM might be feasible. In this section, we evaluate the FTII from that perspective. The American Psychological Association's (APA) Standards for Educational and Psychological Testing (APA, 1985) discuss the characteristics of a test in five broad areas: (a) validity; (b) reliability and error of measurement; (c) test development; (d) scaling, norming, and equating; and (e) documentation.
Validity

A number of major problems arise when the FTII validity statistics presented are closely examined. First, although all versions to date of the FTII are intended to test infants from 6.5 through 12 months old, the age range included in the published studies as well as the test manual is 3 to 7 months. Second, only 62 of the aforementioned sample infants were tested with problems from the current version of the test; moreover, the format and grouping of the novelty problems were not the same as the "standardized" version. Third, discussion regarding clinically relevant cut points is not provided in the FTII documentation. Fourth, no long-term prospective studies were done before this test was put on the market; thus, predictive validity must be construed from a patchwork of research done in different labs with varying procedures and stimuli.

The results of the studies discussed are presented as 2 × 2 tables with one variable being IQ and the other mean novelty score. IQ was dichotomized into retarded and normal, using IQ ≤ 70 as a cut point. Mean novelty score was dichotomized at ≤ 53% / > 53%. Standardized IQ scales are well established and dichotomized cutoff scores are often used. On the other hand, dichotomization of the mean novelty score requires closer scrutiny as infants will be designated "at risk of retardation" based on this somewhat arbitrary score. Although discussion regarding the identification of an appropriate cut point is not contained in the documentation, a manuscript for an in-press chapter defines this cut point as < 53%, the score that maximized sensitivity and specificity of the test (Montie, Shepherd, & Fagan, in press). This cutoff, however, was generated using the samples previously described; infants were 3 to 7 months old and were given a much earlier version of the FTII. Moreover, as for the standardization sample, no demographic and/or medical risk information is provided for the clinical sample.
Presumably, this score took into account data across multiple testing sessions and, therefore, its validity as a cut point for a testing session at one or two points remains to be assessed. This is of particular importance as this
cut score has already begun to appear in the literature as a reasonable and valid cut point (citing Fagan's work, e.g., Fagan, 1984; Fagan & Shepherd, 1987; Fagan & Singer, 1983). Such conventions may quickly be adopted and the validity of this dichotomization needs to be substantiated.

The sensitivity and specificity information presented in these 2 × 2 tables follows the biostatistical literature in order to provide clinically oriented guidelines and, thus, is not couched in psychometric terms. However, sensitivity and specificity are frequently used as indices to characterize the performance of screening devices. Specificity is a measure indicating the extent to which a diagnosis of "not at risk of retardation" corresponds to eventual normal intelligence. Sensitivity is a measure indicating the extent to which a diagnosis of "at risk of retardation" (based on FTII cut scores) corresponds to later below-normal IQ.² These indices do not take into account error of measurement or the base rate of the occurrence of below-normal (≤ 70) IQ in the population studied. (About 15% in the normal term population; this percentage is higher in at-risk populations, depending on the demographics and risk characteristics of the sample.) In addition, valid use of these indices as a guideline to future predictability assumes the use of a random sample. Unfortunately, little information is provided on the characteristics of the FTII samples.

A more useful analysis, which incorporates sensitivity and specificity, is an analysis of false positive and negative results. Taking the specificity and sensitivity values reported for the test, which are .91 for specificity and .77 for sensitivity, and assuming a 15% rate of retardation in the population, standard formulas (Fleiss, 1981) may be used to compute these values.³ Thus, the estimated false-positive rate would be .39 and the false-negative rate would be .04.
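The calculation just described can be reproduced with a short Python sketch. It implements the false-positive and false-negative formulas given in Footnote 3, using the reported sensitivity (.77), specificity (.91), and assumed 15% base rate; the function and variable names are illustrative.

```python
def false_positive_rate(sens, spec, prev):
    """Probability that an infant flagged as at risk is not at risk."""
    # Denominator: overall probability of an "at risk" test result.
    p_positive = (1 - spec) + (sens - (1 - spec)) * prev
    return (1 - spec) * (1 - prev) / p_positive


def false_negative_rate(sens, spec, prev):
    """Probability that an infant passed as not at risk is at risk."""
    # Denominator: overall probability of a "not at risk" test result.
    p_negative = 1 - (1 - spec) - prev * (sens - (1 - spec))
    return (1 - sens) * prev / p_negative


pf_pos = false_positive_rate(sens=0.77, spec=0.91, prev=0.15)
pf_neg = false_negative_rate(sens=0.77, spec=0.91, prev=0.15)
print(f"false-positive rate: {pf_pos:.3f}")
print(f"false-negative rate: {pf_neg:.3f}")
```

With these inputs the sketch gives a false-positive rate just under .40 and a false-negative rate of about .04, in line with the values quoted above. Lowering the assumed base rate raises the false-positive rate further, which is why the prevalence assumption matters for a screening instrument.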
If the data had been obtained from a random sample of the population, these figures could be interpreted as probabilities. Because the sample cannot be considered a random one, these indices have to be interpreted cautiously. Nevertheless, this 2In order to compute these indices a table tabulating performance on the Fagan test and the outcome of the later IQ test is necessary. The table looks as follows:
Fagan Test Result WlSC Score
(A)
(,4)
IQ ~ 70 (B) IQ > 70(B)
NAB NAa
NBA
NAB + Nf~4
NAB
NAa + NAB
NAB + NAB
NBA + NAB
Given this table, specificity and sensitivity can be computed as follows: Specificity = NAn / (NAB + NAB). Test indicates not at risk when IQ is normal. Sensitivity = NAn / (NAB + NBA). Test indicates at risk when IQ is abnormal. 3A measure of the probability of a false positive, that is, indicating the infant is at risk when it is not, is given by the formula: (1 - specificity) (1 - Probability (retardation)) PF+ =
(l -- specificity) + (sensitivity -- (1 -- specificity))Prob (retardation)
CRITICAL REVIEWOF FAGAN TESTOF INTELLIGENCE
163
instrument seems to minimize the false-negative responses, an expected outcome of a valid screening test. The potential for a false positive appears to be much higher, thus, raising the issue of labeling and its consequences. Construct Validity and Construct Representation
Confidence in the scores achieved on a test is further enhanced if, in addition to reliability and the prediction of a significant criterion, we can account for outcomes on the basis o f a theoretical analysis of the test and the criteria of interest. Thus, any theory that postulates continuity of intellectual functioning from infancy to adulthood would be a candidate framework for this test. This instrument, however, shares with most commercial assessment devices the lack o f a clearly articulated theory for its design and validation. Its viability rests entirely on the empirical relationships between performance on the test and criterion performance, in this case, performance on an IQ test several years after the administration of the FTII. This approach, to be sure, has a long tradition. In fact, most current tests are based on the same actuarial principle. For example, college admissions tests are often validated exclusively in terms of the prediction of grade point average. However, the justification of a test's existence solely in terms of its statistical characteristics is becoming untenable as the concept of construct validity evolves (Messick, 1989a, 1989b) and becomes more accepted. A theoretical framework from which to derive predictions about the nature of correlation with other variables, what Embretson (1983) called the "nomothetic span" o f the test, is consistent with construct validity, but goes beyond it by requiring that the nomothetic relationships be derived from a well-specified theory o f performance in the test in question. The need for such theoretical precision becomes evident when the second aspect of validity, namely "construct representation," is examined. Whereas nomothetic span looks outward, that is, refers to relations of the measure with external information, construct representation looks inward to the test itself, specifically, to the items that make up the test. 
The examination of individual items is not part of the traditional approach to validation, which focuses on nomothetic matters and which treats individual items as easily replaceable entities. This does not mean that items are not recognized as entities, because most psychometric models can accommodate the specific difficulty and discrimination of an item. However, merely estimating the statistical characteristics of items is not the same as explaining their statistical characteristics. Such an accounting of an item's statistical characteristics must be based on theories of the underlying process, also an important aspect of validation, which Embretson calls "construct representation." Thus, from Embretson's (1983) point of view, the validation of test scores requires an ability to anticipate or at least retrospectively account for the covariation of a measure with other measures (i.e., nomothetic span). Further, validation requires that performance on each item be explainable from a performance model that takes into account the attributes of the item (i.e., construct representation). As applied to the FTII, the test developer must account for the correlations of the test with other measures (see S. Rose, Feldman, & Wallace, 1988) and the statistical characteristics of each novelty problem in terms of its perceptual attributes. For example, an item's mean novelty score may be accounted for as a function of the difference in the perceptual attributes of the stimuli and possibly their similarity. Unfortunately, the nomothetic span and the construct representation of the FTII are impoverished or nonexistent.

A measure of the probability of a false negative, that is, indicating the infant is not at risk when, in fact, it is at risk, is given by the formula:

$$P_{F-} = \frac{(1 - \text{sensitivity}) \cdot \text{Prob(retardation)}}{1 - (1 - \text{specificity}) - \text{Prob(retardation)} \cdot [\text{sensitivity} - (1 - \text{specificity})]}$$
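The false-negative formula given in the footnote can be checked numerically. The following is a minimal sketch, not part of the FTII documentation: the function names and the sensitivity, specificity, and prevalence values are hypothetical, chosen only for illustration. It computes the footnote's expression and compares it with the equivalent direct application of Bayes' rule.

```python
def false_negative_probability(sensitivity, specificity, prevalence):
    """P(infant is at risk | test says not at risk), written as in the footnote.

    The denominator is 1 minus the overall probability of a positive test.
    """
    numerator = (1 - sensitivity) * prevalence
    denominator = (
        1 - (1 - specificity) - prevalence * (sensitivity - (1 - specificity))
    )
    return numerator / denominator


def false_negative_probability_bayes(sensitivity, specificity, prevalence):
    """Same quantity via Bayes' rule: missed cases over all negative results."""
    missed_cases = (1 - sensitivity) * prevalence
    correctly_cleared = specificity * (1 - prevalence)
    return missed_cases / (missed_cases + correctly_cleared)


# Hypothetical values (assumptions, not figures reported for the FTII).
sens, spec, prev = 0.80, 0.90, 0.10

p_footnote = false_negative_probability(sens, spec, prev)
p_bayes = false_negative_probability_bayes(sens, spec, prev)
print(f"footnote form: {p_footnote:.4f}, Bayes form: {p_bayes:.4f}")
```

Because the result depends on the base rate of the condition, the same sensitivity and specificity can imply quite different error probabilities in a high-risk sample and in the general population.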
Reliability and Error of Measurement

Estimation reliability is a general concept which attempts to quantify the degree to which responses to a test are a function of the ability or attribute we are interested in measuring. It can be assessed in many ways, such as correlating scores on halves of a test (split-half reliability) or on the same test on two occasions (short-term reliability or stability). Interitem correlations examine the internal consistency of a measure in order to ascertain whether the items are measuring the same ability or attribute. We have indications from studies done in other labs (cf. Columbo et al., 1988; S. Rose, Feldman, & Wallace, 1988; Swales et al., 1989) that split-half reliabilities and interitem correlations on VRM tasks tend to be quite low, although short-term reliability falls in the moderate range. Possibly such critical reliability information was not included in the documentation because the predictive validity appeared so impressive, or was perceived to be more important, perhaps even to subsume reliability. In reality, however, reliability is a prerequisite for validity. Validity, as measured by the correlation of a measure with some criterion, could result from a variety of artifacts. Furthermore, an upper limit is set on a measure's ability to correlate with other measures as a function of its reliability. This issue is particularly critical when individual differences are assessed (see S. Rose, Feldman, & Wallace, 1988), as is the case with the FTII. First, although no published data are available on split-half reliability and interitem correlations for the FTII, Fagan's lab is collecting these data and their publication should be expedited. Second, although interobserver reliability appears high from Fagan's comments, details as to procedure should be published and these data should be swiftly incorporated into the manual.
For example, information as to length of training of testers, time over which reliability was sampled, age range of the infants, use of a random sample, data collection (videotape, simultaneous scoring into the computer, etc.), and guidance on maintaining high reliability over time should be provided. Third, error of measurement of the test is reported in a nonstandard way (as the consequence of the specificity and sensitivity data). Examined in this way, it is difficult for a test user who is statistically unsophisticated (not an unlikely circumstance given the marketing) to obtain a realistic estimate of the error rate. Thus, a more standard error of measurement statistic should be computed and included in the documentation.
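The reliability statistics discussed above can be illustrated with classical test theory formulas. This is a sketch under stated assumptions: the Spearman-Brown correction, the standard error of measurement, and the attenuation ceiling are textbook formulas rather than procedures taken from the FTII manual, and every number below (half-test correlation, score standard deviation, criterion reliability) is hypothetical.

```python
import math


def spearman_brown(half_test_correlation):
    """Step a split-half correlation up to an estimate of full-test reliability."""
    return 2 * half_test_correlation / (1 + half_test_correlation)


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: expected spread of observed scores around a true score."""
    return score_sd * math.sqrt(1 - reliability)


def validity_ceiling(test_reliability, criterion_reliability):
    """Upper limit on the test-criterion correlation implied by the two reliabilities."""
    return math.sqrt(test_reliability * criterion_reliability)


# Hypothetical inputs (assumptions for illustration only).
r_half = 0.30          # correlation between two halves of the novelty problems
score_sd = 0.07        # SD of mean novelty scores, in proportion units
criterion_rel = 0.90   # reliability of the later IQ measure

r_full = spearman_brown(r_half)
sem = standard_error_of_measurement(score_sd, r_full)
ceiling = validity_ceiling(r_full, criterion_rel)

print(f"estimated full-test reliability: {r_full:.2f}")
print(f"standard error of measurement:   {sem:.3f}")
print(f"approximate 95% band around a score: +/- {1.96 * sem:.3f}")
print(f"ceiling on predictive validity:  {ceiling:.2f}")
```

A statistic of this kind, reported in the units of the test score, gives a test user a direct sense of how far an individual infant's observed score might lie from the true score.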
Scaling, Norming, and Equating

No demographic data are presented for the FTII "standardization sample." In effect, this is not a standardization sample for the version of the FTII in use at present because the age group tested is not the same as the age range stated on the test. Moreover, information on socioeconomic status, geographical distribution, race, mean birth weight, and categories of risk conditions is critical for interpretation of the results of this screening test. A second and related point is that age-appropriate norms have not been compiled to facilitate interpretation of FTII scores. This is a vital and necessary step if this test is to be widely used. Some preliminary work has been done comparing the FTII with the Bayley Scales of Infant Development, and a number of studies are underway in various labs. Such comparisons are also necessary and important to the development of this instrument.
Test Construction

Most test development follows a very actuarial procedure. A large number of items are prepared and tested on a sample typical of the population for whom the test is intended. These items are then subjected to statistical analysis to determine their difficulty and ability to discriminate. The item in the FTII is a novelty problem, that is, the presentation of a novel stimulus along with a stimulus previously presented. The raw responses are the amounts of time the infant looks at each stimulus. The operational response for scoring purposes, however, is a proportion: the time devoted to the novel stimulus relative to the total time spent looking at both stimuli. A continuous response of this sort is problematic because there is little psychometric theory for this class of data. Cognitive psychologists have used the so-called slope and intercept methodology for analyzing continuous reaction time responses by plotting the mean response time as a function of stimulus complexity. The slope of that plot is interpreted as efficiency of processing. This method has been criticized recently on statistical grounds (Dunlap, Kennedy, Harbeson, & Fowlkes, 1989). Unfortunately, psychometricians have not yet provided widely agreed upon methods for dealing with continuous responses in a practical setting. Each test at each age consists of 10 items/novelty problems, but we are not told how those 10 items were chosen. The entire process is summarized thus: "As noted in the preface, many years have been spent, and tests of over 2,000 infants have been made, in deciding, empirically, which novelty items to include in the clinical version of the FTII." (Fagan & Shepherd, 1987, p. 4). Moreover, the specific problems are treated as a set. Appendix B of the FTII manual states: "It is imperative that the exact order of testing be followed. Any deviation invalidates the test" (Fagan & Shepherd, 1987). This caution suggests that there are dependencies among the items (i.e., across novelty problems) and, therefore, the items do not function as independent entities. Such dependencies would preclude the application of standard psychometric methods which assume independence among items.
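The scoring rule described above, the proportion of looking time devoted to the novel stimulus, can be sketched as follows. This is an illustration only: the function names, the looking times, the averaging across problems, and the handling of problems on which the infant looked at neither stimulus are assumptions, not details taken from the FTII manual.

```python
from statistics import mean


def novelty_score(novel_seconds, familiar_seconds):
    """Proportion of total looking time spent on the novel stimulus.

    Returns None when the infant looked at neither stimulus (an assumed
    convention for unscorable problems).
    """
    total = novel_seconds + familiar_seconds
    if total <= 0:
        return None
    return novel_seconds / total


def mean_novelty_score(looking_times):
    """Average the novelty scores over all scorable problems."""
    scores = [novelty_score(novel, familiar) for novel, familiar in looking_times]
    scorable = [s for s in scores if s is not None]
    if not scorable:
        raise ValueError("no scorable novelty problems")
    return mean(scorable)


# Hypothetical (novel, familiar) looking times in seconds for three problems.
looking_times = [(6.0, 4.0), (3.5, 3.5), (2.0, 6.0)]
for novel, familiar in looking_times:
    print(novel, familiar, novelty_score(novel, familiar))
print("mean novelty score:", mean_novelty_score(looking_times))
```

Each score is bounded between 0 and 1 and rests on only a few seconds of looking, which is part of why the psychometric treatment of such continuous responses is not straightforward.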
Documentation

An instrument of this sort which employs an elaborate apparatus requires documentation for its use (i.e., computer setup, stimuli presentation, judging of infant looking, etc.), as well as statistical documentation in order that professionals may evaluate the test. The manual included with the test emphasizes the former, that is, administration of the test, and does a fairly thorough job of explaining procedural issues. However, the detail devoted to the operation of the test is at the expense of the technical documentation. The section detailing research background comprises 10 pages of a total 45 and lacks many of the details one would expect to find in a test that complies with APA Standards (APA, 1985). Moreover, this section apparently is not meant to fulfill the role of technical documentation given some of the statements made. For example, the authors state:

While the predictions from the infant tests have only been followed to three years of age thus far, it is highly likely that the distinctions between normality and retardation made at three years will still be valid at school age. (Fagan & Shepherd, 1987, p. 6)

The authors' optimism underscores rather painfully one implication of a test which attempts to identify children at risk of retardation, namely the danger of categorizing children with such unwarranted confidence that inevitably some of them are likely to be inappropriately treated as retarded. The whole issue of labeling is a sensitive and difficult one; nevertheless, it must be confronted and dealt with in a realistic way. Sweeping claims which go beyond the data available should be tempered, particularly when targeting audiences outside of research who lack the background to correctly interpret such comments and who would like to see this test as a panacea.

DISCUSSION

The FTII was developed in an attempt to fill a gap in the infant assessment battery.
Years of research into infant visual attention have been put to use in the construction of this screening test. Instructions to the naive individual being trained to run this screening test are comprehensive and clear. The videotape provided is particularly well done. Fagan's claims within the test documentation are careful and responsible, emphasizing the importance of multiple assessments of VRM and use of a battery of tests for assessment. Fagan specifically states in the accompanying manual that this instrument should not be used as a general screening device nor by the average clinician but rather for populations at risk for cognitive deficits. However, the press coverage received by the FTII has raised concern regarding the ultimate use of tests of infant "intelligence." For example, it has been suggested that the FTII could be used to identify low-SES, high-IQ babies for enrichment, a strategy that would, in effect, skim the cream and leave the rest behind (Kolata, 1989). Statements like this raise serious questions about philosophical issues such as labeling and the nature of intelligence. The issue of labeling children (i.e., false positives on this test or even true positives) at a very early age and thus setting in motion a "self-fulfilling prophecy" is difficult to resolve. On the one hand, if these children are not identified and remediated, their future is compromised; on the other hand, if these children are earmarked as learning disabled or "slow" (rightly or wrongly), the danger is that they will be consigned to a "special education" setting and not encouraged to succeed. Thus, the debate must, of necessity, move into the public policy arena. Similarly, the controversy which surrounds the issue of IQ and the nature of intelligence emerges yet again. Exactly what is intelligence? What is one measuring when IQ is assessed? Furthermore, what is not being measured that is key to highly accurate prediction of cognitive outcome and life success? Most importantly, what is one measuring when VRM is assessed? It is unclear at this point exactly what processes are involved in early VRM or what their relationship is with IQ in later childhood.
The opposing views of continuity versus discontinuity provide one-dimensional and superficial theoretical grounds on which to construct a model of infant recognition memory/attentional processes and cognitive development. Although this approach has proven useful in the past, models that are more theoretically and perhaps biologically driven must be assembled in order to advance our understanding of exactly what we are measuring. Further research is clearly indicated given S. Rose, Feldman, and Wallace's (1988) recent data suggesting that the ability of VRM measures to consistently identify children with more severe deficits is low. Fagan states that the FTII is effective for identifying all levels of cognitive deficits; however, this statement is based on two samples of children (N = 128) tested on two different versions of the test at 3 to 7 months old. Another key issue has been raised by Rose and colleagues (S. Rose, Feldman, Wallace, & McCarton, 1989). They have experimented with an alternative binary measure for indexing VRM, and their results indicate that not only the presence or absence of a novelty preference is important, but that the magnitude of the preference is also a critical component. This illustrates well the urgent need for more sophisticated analyses of the statistical properties of measures of VRM. Moreover, issues raised earlier regarding single session testing versus multiple test sessions and
further complicated by developmental trends in infant preference over the first year (i.e., familiarity preferences versus novelty preferences) underscore the need for more psychometric work in this area. To summarize, the most serious methodological problems with this screening test involve the "standardization sample," the lack of sufficient testing of the actual instrument as it is offered, the possible problem of very large confidence intervals for individual infants' scores, poor short-term reliability and interitem reliability, and lack of long-term prospective studies. In order for this screening test to be used with confidence, a legitimate age-appropriate standardization sample must be collected. Ideally, this should be a national, randomized sample heterogeneous with respect to SES, race, gender, and other demographic variables. The high-risk samples must also be clearly defined with respect to both demographic attributes and biological/medical risk factors. Fagan has reportedly taken the first steps toward improving reliability by conducting item analyses and is in the process of revising the test (see Fagan & Detterman, 1992). These data should be incorporated into the test documentation as swiftly as possible, with sufficient attention to psychometric detail. Moreover, still further attention is necessary in this area, as short-term reliability places limits on the levels of long-term predictability that may be achieved. If this screening test is to become a useful clinical tool, individual predictions of deficits must be made as dependable as the group predictions. Although it is true that running a full-scale standardization sample is quite difficult for an individual lab, it could be accomplished via national collaboration using an agreed-upon protocol. This may be an easier task than it appears given the well-documented procedure developed by Fagan and colleagues.
In addition, documentation and marketing materials for the FTII should more strongly emphasize the need for multiple testing (cf. Columbo et al., 1988; S. Rose, Feldman, & Wallace, 1988) and use of supporting measures when this test is used clinically. The documentation for the FTII is misleading because it does not point out that little or no psychometric information is available for the indexed age group. The reliability and validity shortcomings should be clearly stated in the test manual. Reading these materials, it is easy to mistake the offered validity statistics for the set that should appear here. Fagan and Detterman (1992), in response to this review, present data on the initial standardization sample for the currently marketed version of the FTII, as well as statistics on interobserver reliability. We welcome these critical data. However, the fact remains that many of these substantial issues have not yet been adequately addressed. In particular, further exploration of the psychometric properties of this test for individual infants and revision of the FTII documentation literature should have high priority. In conclusion, measures of infant attention/information processing including VRM, habituation, and cross-modal transfer appear to be the measures of the future as far as cognitive prediction from early infancy to later childhood is concerned. VRM ability has, so far, proved to be the most amenable to use as a
screening measure. The question is not whether we should identify and use these types of measures, but rather how best to apply our current knowledge to the development of standardized, reliable instruments.
REFERENCES

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1985). Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
Bayley, N. (1969). The Bayley Scales of Infant Development. New York: Psychological Corporation.
Benasich, A.A., Brooks-Gunn, J., & Clewell, B.C. (in press). How do mothers benefit from early intervention programs? Journal of Applied Developmental Psychology.
Berg, C.A., & Sternberg, R.J. (1985). Response to novelty: Continuity versus discontinuity in the developmental course of intelligence. In H.W. Reese (Ed.), Child development and behavior (Vol. 19, pp. 1-47). New York: Academic.
Bornstein, M.H., & Sigman, M.D. (1986). Continuity in mental development from infancy. Child Development, 57, 251-274.
Caron, A.J., & Caron, R.F. (1981). Processing of relational information as an index of infant risk. In S.L. Friedman & M. Sigman (Eds.), Preterm birth and psychological development (pp. 219-240). New York: Academic.
Caron, A.J., Caron, R.F., & Glass, P. (1983). Responsiveness to relational information as a measure of cognitive functioning in nonsuspect infants. In T. Field & A. Sostek (Eds.), Infants born at risk: Physiological, perceptual, and cognitive processes (pp. 181-209). New York: Grune & Stratton.
Clewell, B.C., Brooks-Gunn, J., & Benasich, A.A. (1989). Evaluating child-related outcomes of teenage parenting programs. Family Relations, 38, 201-209.
Cohen, L.B. (1981). Examination of habituation as a measure of aberrant infant development. In S.L. Friedman & M. Sigman (Eds.), Preterm birth and psychological development (pp. 241-253). New York: Academic.
Columbo, J., Mitchell, D.W., & Horowitz, F.D. (1988). Infant visual attention in the paired-comparison paradigm: Test-retest and attention-performance relations. Child Development, 59, 1198-1210.
DiLalla, L.F., Thompson, L.A., Plomin, R., Phillips, K., Fagan, J.F., Haith, M.M., Cyphers, L.H., & Fulker, D.W. (1990). Infant predictors of preschool and adult IQ: A study of infant twins and their parents. Developmental Psychology, 26, 759-769.
Dunlap, W.P., Kennedy, R.S., Harbeson, M.M., & Fowlkes, J.E. (1989). Problems with individual difference measures based on some componential cognitive paradigms. Applied Psychological Measurement, 13, 9-17.
Dunn, L.M., & Dunn, L.M. (1981). Peabody Picture Vocabulary Test-Revised: Manual for forms L and M. Circle Pines, MN: American Guidance Service.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 175-197.
Eysenck, H.J., & Kamin, L. (1981). Intelligence: The battle for the mind. London: Pan.
Fagan, J.F. (1981, April). Infant memory and the prediction of intelligence. Paper presented at the meeting of the Society for Research in Child Development, Boston, MA.
Fagan, J.F. (1984). The relationship of novelty preferences during infancy to later intelligence and later recognition memory. Intelligence, 8, 339-346.
Fagan, J.F., & Detterman, D.K. (1992). The Fagan Test of Infant Intelligence: A technical summary. Journal of Applied Developmental Psychology, 13, 173-193.
Fagan, J.F., & McGrath, S.K. (1981). Infant recognition memory and later intelligence. Intelligence, 5, 121-130.
Fagan, J.F., & Montie, J.E. (1986). Identifying infants at risk for mental retardation: A cross validation study. Journal of Developmental and Behavioral Pediatrics, 7, 199-200.
Fagan, J.F., & Shepherd, P.A. (1987). Fagan Test of Infant Intelligence: Training Manual. Cleveland, OH: Infantest Corporation.
Fagan, J.F., Shepherd, P.A., & Montie, J.E. (1987). A screening test for infants at risk for mental retardation. Journal of Developmental and Behavioral Pediatrics, 8, 118.
Fagan, J.F., & Singer, L.T. (1983). Infant recognition memory as a measure of intelligence. In L.P. Lipsitt (Ed.), Advances in infancy research (Vol. 2, pp. 31-78). Norwood, NJ: Ablex.
Fagan, J.F., Singer, L.T., & Montie, J.E. (1985). An experimental selective screening device for the early detection of intellectual deficit in at-risk infants. In W.K. Frankenburg, R.N. Emde, & J.E. Sullivan (Eds.), Early identification of children at risk: An international perspective (pp. 257-266). New York: Plenum.
Fagan, J.F., Singer, L.T., Montie, J.E., & Shepherd, P.A. (1986). Selective screening device for the early detection of normal or delayed cognitive development in infants at risk for later mental retardation. Pediatrics, 78, 1021-1026.
Fantz, R.L., & Nevis, S. (1967). The predictive value of changes in visual preferences in early infancy. In J. Hellmuth (Ed.), The exceptional infant (Vol. 1, pp. 349-414). Seattle, WA: Special Child Publications.
Fleiss, J.L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
Greenfield, D.B. (1989, April). Screening infants at risk for mental retardation: A comparison of the Bayley Scales and Fagan Test of Infant Intelligence. Paper presented at the meeting of the Society for Research in Child Development, Kansas City, MO.
Kolata, G. (1989, April 4). IQ tests in infancy are found to predict scores in school. New York Times, p. C1.
Kopp, C.B., & McCall, R.B. (1982). Predicting later mental performance for normal, at-risk, and handicapped infants. In P.B. Baltes & O.G. Brim (Eds.), Life-span development and behavior (Vol. 4, pp. 33-61). New York: Academic.
Lewis, M. (Ed.). (1983). The origins of intelligence. New York: Plenum.
Lewis, M., & Brooks-Gunn, J. (1981). Visual attention at 3 months as a predictor of cognitive functioning at 2 years of age. Intelligence, 5, 131-140.
McCall, R.B. (1979). The development of intellectual functioning in infancy and the prediction of later IQ. In J.D. Osofsky (Ed.), Handbook of infant development (pp. 707-741). New York: Wiley.
McCall, R.B. (1981). Early predictors of later IQ: The search continues. Intelligence, 5, 141-147.
McCall, R.B., Hogarty, P.S., & Hurlburt, N. (1972). Transitions in infant sensorimotor development and the prediction of childhood IQ. American Psychologist, 27, 728-748.
Messick, S. (1989a). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18, 5-11.
Messick, S. (1989b). Validity. In R.L. Linn (Ed.), Educational measurement (pp. 21-34). New York: Macmillan.
Miranda, S.B., & Fantz, R.L. (1974). Recognition memory in Down's syndrome and normal infants. Child Development, 45, 651-660.
Montie, J.E., Shepherd, P.A., & Fagan, J.F. (in press). The Fagan Test of Infant Intelligence: Applications to the field of mental retardation. In J.A. Mulick & R.F. Antonak (Eds.), Transitions in mental retardation (Vol. 5). Norwood, NJ: Ablex.
Piaget, J. (1950). The psychology of intelligence. New York: Harcourt.
Rose, D.H., Slater, A., & Perry, H. (1986). Prediction of childhood intelligence from habituation in early infancy. Intelligence, 10, 251-263.
Rose, S.A. (1980). Enhancing visual recognition memory in preterm infants. Developmental Psychology, 16, 85-92.
Rose, S.A. (1981). Lags in the cognitive competence of prematurely born infants. In S.L. Friedman & M. Sigman (Eds.), Preterm birth and psychological development (pp. 255-269). New York: Academic.
Rose, S.A. (1983). Differential rates of visual information processing in fullterm and preterm infants. Child Development, 54, 1189-1198.
Rose, S.A., Feldman, J.F., McCarton, C.M., & Wolfson, J. (1988). Information processing in seven-month-old infants as a function of risk status. Child Development, 59, 589-603.
Rose, S.A., Feldman, J.F., & Wallace, I.F. (1988). Individual differences in infant information processing: Reliability, stability, and prediction. Child Development, 59, 1177-1197.
Rose, S.A., Feldman, J.F., Wallace, I.F., & Cohen, P. (1991). Language: A partial link between infant attention and later intelligence. Developmental Psychology, 27, 798-805.
Rose, S.A., Feldman, J.F., Wallace, I.F., & McCarton, C. (1989). Infant visual attention: Relation to birth status and developmental outcome during the first 5 years. Developmental Psychology, 25, 560-576.
Rose, S.A., Gottfried, A.W., & Bridger, W.H. (1979). Effects of haptic cues on visual recognition memory in fullterms and preterms. Infant Behavior and Development, 2, 55-67.
Rose, S.A., & Wallace, I.F. (1985a). Visual memory: A predictor of later cognitive functioning in preterms. Child Development, 56, 843-852.
Rose, S.A., & Wallace, I.F. (1985b). Cross-modal and intramodal transfer as predictors of mental development in fullterm and preterm infants. Developmental Psychology, 21, 949-962.
Sigman, M.D. (1983). Individual differences in infant attention: Relations to birth status and intelligence at five years. In T. Field & A. Sostek (Eds.), Infants born at risk: Physiological, perceptual, and cognitive processes (pp. 295-315). New York: Grune & Stratton.
Sigman, M.D., & Parmelee, A.H. (1974). Visual preferences of four month old premature and fullterm infants. Child Development, 45, 959-965.
Singer, L.T., Drotar, D., Fagan, J.F., Devost, L., & Lake, R. (1983). The cognitive development of failure to thrive infants: Methodological issues and new approaches. In T. Field & A. Sostek (Eds.), Infants born at risk: Physiological, perceptual, and cognitive processes (pp. 222-242). New York: Grune & Stratton.
Sternberg, R.J. (1985). Beyond IQ: A triarchic theory of human intelligence. New York: Cambridge University Press.
Swales, T.P., Claussen, M.S., Greenfield, D.B., & Bauer, C. (1989, April). The validity of the Fagan Test of Infant Intelligence as a measure of cognitive functioning at six months and one year of age. Paper presented at the meeting of the Society for Research in Child Development, Kansas City, MO.
Terman, L.M., & Merrill, M.A. (1973). Stanford-Binet Intelligence Scale: 1973 norms edition. Boston, MA: Houghton-Mifflin.
Thompson, L.A., Fagan, J.F., & Fulker, D.W. (1991). Longitudinal prediction of specific cognitive abilities from infant novelty preference. Child Development, 62, 530-538.
Vernon, P.E. (1980). Intelligence: Heredity and environment. San Francisco: W.H. Freeman.
Wechsler, D. (1958). The measurement and appraisal of adult intelligence (4th ed.). Baltimore, MD: Williams & Wilkins.
Wohlwill, J.F. (1980). Cognitive development in childhood. In O.G. Brim & J. Kagan (Eds.), Constancy and change in human development (pp. 359-444). Cambridge, MA: Harvard University Press.