Ten Psychometric Reasons Why Similar Tests Produce Dissimilar Results
Bruce A. Bracken
Memphis State University
Significantly different results frequently exist between two or more tests that purport to measure the same skill when the same child is tested on both instruments. The reasons for these discrepancies may be related to the examinee, examiner, examinee-examiner interactions, environment, or psychometric characteristics of the tests employed. Since the more human-related reasons for test performance instability receive considerable treatment in assessment training and in the literature, this article cites 10 major psychometric reasons why similar tests may produce disparate scores when a single child is tested.
There and
are times present
when
what
abilities.
Several
general
academic,
presenting team, may their
of the
their
test
lead
two or more
should
and
that
using
produce
the tests
the
of group
have been
tests
rather
administered
than
may earn
even when
strong
evidence
An operational to measure
to or more
some
have
find
of skill
that
they
various
the
meet,
the child’s
area
or ability.
with
a child,
assessed
skill
the
same
and, are
Such
upon
not,
as a
situations
members
of the members
that than
if two
purport,
minimal
coming
are too limited
in
use of psychoeducational
met
than
mean this
score
general produce
it should
is established
deviation,
that
offered deviate
that difference
Received December 1, 1986; final revision received April 23, 1987. Address correspondence and reprint requests to Bruce A. Bracken, PhD, Department of Psychology, Memphis State University, Memphis, TN 38152.
through
the
examinees
instrument deviate
who
measur-
significantly,
exists.
difference
skill has been
be
if it is
be remembered
Thus, that
the
should
Even
validity
measures
to assess
tests
standard,
concordance
scores
used, the
differences).
one psychoeducational
of a significant
one standard
are
(i.e.,
performances.
on those
of group
tests
and
validity
concurrent
scores
definition the same
that
concurrent
individual
more
skills
suggested
meeting,
M-teams,
tests
have
that
ing similar
port
may cognitive
to be permitted
demonstrate
correlated
demonstrated study
or
level
that
assess
regarding
instruments.
skills
by those
within
knowledge
Theoretically, highly
at the staff
believing
(M-teams)
information
members
the child’s
debates
psychometric
team
perceptual,
results
the meeting
assessment same
various
about
to heated
teams
to be conflicting
language,
in agreement
away from
multidisciplinary
appears
between
tests
by Sattler by an
that
amount
is significant.
PhD,
pur-
(1982)
Department
who equal
The
cri-
of Psy-
Journal
terion
of one standard
tests
are
highly
deviation
reliable
t=M). The
aim of this article
ancies
may
general
exist
skills.
result
While
ferences
(e.g.,
have
variables
racial
this
article
tests
scores
(e.g.,
why discrep-
assess
similar
the same
tests
may
be a
examiner/examinee differences
changes
on psychometric
when
measurement
reasons
among
examiner
differences
of
purportedly
health),
differences),
especially
errors
common
that
motivation,
will focus
in most cases,
standard
the most
in test
(e.g.,
environmental
distracters),
adequate
two or more
differences
Psychology
small
is to consider
rapport,
competence),
seems
and
between
of student
of School
(e.g.,
in physical
reasons
diflevels
of
comfort,
for disparate
test
scores. The
previously
mentioned
not be discussed al assessment-related explanations reasons
Thus, why
measure
not
this article
and assessment
provides
may
floor
of a test
is the lower
when
If a test
is to be capable
items
the examinee
from
sufficient
those
number
allow
are
children
and those floor
subtest
floor that
have
sets
other
sets
that
are
instruments & Sattler,
administered that
are
usually
by the test.
even
of subtests have
contains
to only the younger to
older
are
ages (e.g., individuals
a
easy
to very
or retarded
of little
the lowest
value
in
administered span
that Read
are
Absurdities, only
(e.g.,
ages.
age Also,
the entire
IV
age
(Thorndike,
administered
Memory),
tests instru-
to one
of other
that
age levels
multistage
Multistage
the Stanford-Binet subtests
average
newer
to children
Vocabulary,
have
initial
delayed
age levels.
a set of subtests some
from
among some
that
administered
it must Those
children.
noted
at upper
can
correctly.
or handicapped
is obviously
by instruments,
of age (e.g.,
administered
range between
abilities
As an example,
1986a)
regardless
that
frequently
problems
those
Hagen,
to
an instrument
success.
of low-functioning
are most
are
examinees,
purport
or no test items disabled,
low-level
of abilities
that are assessed
and served
to assess
assessment
group level
psychometric
that
disadvantaged,
of differentiating
ments
multistage
common tests
scores
delayed,
similarly
with better-developed
problems
levels
of standard
only a few items
not
items
incapable
the psychoeducational
evidence
range
answers
for the differentiation
instrument
While
between
psychometric human-related
EFFECTS
of distinguishing
who
of easy
low. An
and ability
The
skills.
produce children
exist
tests will
well in tradition-
as the more
a list of the ten most
differences
across
fairly
textbooks.
attention
FLOOR The
differences
they are covered
to get the same
significant
similar
for significant
because
coursework
seem
variables.
reasons
in this article
some
Copying), Verbal
to all that are
and some Relations,
Matrices). Whether
an instrument
is a single-stage
test or is multi-stage
(i.e.,
all chil-
dren
are administered
well as the total floor.
should
Psychometrically
between
which
floors
can
Kaufman
Kaufman,
can
cause
be artifacts
Assessment
1983)
is one
the K-ABC
limited
floors served
This
Battery
of many
includes
each
subtest,
the robustness
significant
of one
1985).
deceiving
differences
or both
of the
as
of its to occur
instruments’
especially
faulty
subtests One
because
subtest
between
Since
the K-ABC
would
seem
children
that any 7-year-old candidate
referred
not
pass
a single
a subtest
Reading
standard
of 15 for the Reading
is within
the average
only
item
range
earns
or no success a child
not.
this
Mastery
should
tests
can
be
member
Test;
be examined; resolved
can be more
1973),
and
accurately
the
discrepancies purport
between
to assess
floors,
of 87
credit
of 89.
on
Thus,
that fail to differen-
intellectual
by the abilities?
subtest children
is
(even
abilities.
considered
floor
the
appear
as assessed
child’s
of this sort,
with
a
the Woodcock
reading the floors
between
educational
above
(e.g.,
disability of the two
the two reading
and
programmatic
skill.
EFFECTS
tests with poor
two or more
the same
earns
determined.
CEILING As in the case of limited
ample
child’s
standard
functioning
reading
the child then
(i.e.,
score
Understanding
in this way the differences
easily,
and
reading-disabled
In cases
subtest
abilities.
Reading
a more
detected.
scores
his or her
with average
this
to
subsequently
score
intellectual
had assessed
Woodcock,
accurately
who
it
appear
zero credit 100
comprehension, with
might
a standard
a child
earns
between
test with
of
6 months.
12½ years,
correctly)
a mean
of average
K-ABC
and children
diagnostic
have been
tests
have
a weak
is appropri-
However,
standard
subtest
reading
the
problems
subtest,
to low-average
case,
of 2½ and
item
With
those
to differentiate
M-team
reading
needs
In
children)
If a second
might
In
of the age
12 years
who obtains
Similarly,
commensurately
sensitive
nonreading
Reading
that
with
which
and
battery.
Understanding
his or her
is developed
insufficiently
of 87.
from
average
when
the ages
7-year-old
on this K-ABC
with
0 months
for reading
of scores.
children
reading-disabled
Understanding,
Understanding
a Reading
low-functioning
second
subtest
&
floors.
floor lies in the middle
Understanding
score
deviation
Obviously
Kaufman
limited
of age levels
K-ABC
for the Kaufman
does
K-ABC,
its weak
between
is so weak that a low-functioning
Does
severely
at a variety
particular
is Reading
floor
minimal
(K-ABC;
with
the ages of 7 years
serves
be an appropriate
one
Children
by the instrument.
ate for children
obtains
for
instruments
several
(Bracken,
is somewhat
range
tiate
scales),
to determine
floors.
The
floor
as on the Wechsler
be examined
weak
two tests,
limited
fact,
all subtests,
test,
similar A limited
ceilings
can result
instruments, ceiling
exists
in significant
even though when
both tests
an instrument
Journal of School
does not have a sufficient very
able
child
assessed.
and
Ceilings,
typically
based
average
number
a child
floors,
differences
a second
instrument
to assess
younger
persons
designed
for older
individuals
example,
girl
(BBCS;
that
The
through
1. The
Test-Revised for
a child
test standard the score child
higher answered
to 153 (still
that
strong
PPVT-R,
to identify
designed
differentiating
the
gifted
to assess
older
degree
of giftedness
that
of two or more
It is obvious (all other
variables
abilities.
being
tests used in the previous
example, between
a test,
to demonstrate
across
content
it purports
material
used
to assess.
in a given all difficulty
validity,
content levels.
range).
The
two standard items
with
reduces
range).
tests,
at the
7-year
level
not be the test of choice.
does
a much
better
job
of
the BBCS.
the one
should
be used
with
the sounder
to assess
of the two receptive
the examiners
earns
at this age level)
a ceiling
children,
would
be able
a child’s
vocabulary to resolve
the
test scores.
GRADIENTS
to describe
and it is a reflection
score
the two confusing
ITEM
(nearly
five PPVT-R
at this age level than available
that
fail as few as five
in the gifted
it should
the ceilings
discrepancies
Item gradient is a term
has
constant)
By comparing
that existed
BBCS
children,
years.
on the BBCS
of 160
of only
BBCS
of
The
40
results-results
that child
well into the gifted
although
1981).
through
correctly
score
failure
& Dunn,
disparate
should
attainable
and
Scale
the child with the Peabo-
to 117 (no longer
the highest
Concept
of the two instruments.
all 258 items
PPVT-R
a 7½-
of receptive
in the age ranges
Dunn
find
test
As an
screened
Basic
2½ years
tables
However,
test
correctly;
minimally
sufficiently
of 136.
a total
than
It is apparent
they
the
ceiling.
have
inwith
test designed
measures
Bracken
from
the norm
precipitously
earn
deviations that score
confer
one
ceiling;
a sounder
for children
(PPVT-R;
who answers
score
drops
can
all items
that
are
nearly
The
has assessed
individuals
by examining
At 7½ years,
the
when
a limited
members
used
examiner
the two diagnosticians
can be explained
within
and
more
population.
two different
is appropriate
second
is appropriate
superior
scores than
in age range
possess
M-team using
has
which
Vocabulary
PPVT-R
ceiling
extreme
instruments
exhibit
will typically
examiner
1984),
7-l
dy Picture
The
the areas
but overlaps
an older
frequently
placement,
first
Bracken,
to serve
will
between
population,
two different
for gifted
vocabulary.
items,
the most
occur
for a younger designed
assume
year-old
same
represent
between
in the skill
and tend to be less accurate
frequently
is designed
a total
to distinguish
or high-average
test scores.
strument
When
items
is average
on extrapolation,
Significant
2-6
of difficult
who
as with
Psychology
how
steeply
of a test’s content it must
adequately
sample
If a test
is to adequately
domain,
it must
A good measure
test
items
validity.
sample
of a given
are
arranged
In order
for a test
the content
assess
the
the full range skill area
domain
universe
must
of
of content possess
a
comprehensive difficulty, gaps
series
in the
sensitive
skill
Since
steep
as a result
tables Svinicki,
ages
to 95 months
Thus,
first percentile
through rank
gradient
Battelle
items
span
that
child’s
assessed
child can
sometimes
Thus,
before
across
that minimal becoming
Significant
very
low functioning
teristics
of the
The
McCarthy
1972) the .75
reports to .89
instrument, Scales stability range
about
coefficients
a gap
gradients.
span
item
first
gradients
this
determine
score
average, steep
the
has four
(i.e.,
a the
regardless item
of
gradients
to differentiate need
have on changes
to exam-
in standard
differences
among
tests.
IN NORM TABLE LAYOUT between but
two different significant
of the same scale even
that
Practitioners
significant
the
the
can easily
with
at
from
reflect
item
of an instrument
in raw scores
when
of Children’s (with
failure
changes
can exist
is ranked
subdomain
and
tests
continuum.
characteristics,
two administrations
between
to 6,
of a
her or his earned
the ability
unique
between
than
the
rank
it is frequently
steep
With
of 0
the quality
items
percentile.
that
raises
range
that while
of measurement
more
differences
of their
subdomain
three
level
DIFFERENCES fact
of 28, which
only
the 69th
too alarmed
point
raw score
floor,
items
a raw score
the percentile
and
at ages 84
or four
score
and the Dressing
error
by the
raw
also evidence
has
differences
be explained
skill levels
ine the effects scores
between
earned).
three
&
steep.
Responsibility,
and the last two items
and floors
through
Domain
raises
of its ceiling
percentile,
are extremely
subarea,
also be noted
subdomain
performance
only
in the Eating
percentile,
standard
is somewhere
between
ceilings
the first
the instrument’s
item
the maximum
is independent
steep,
the score
raw score
It should
the 53rd
having
Guidubaldi,
Personal
of 1, an additional
the 53rd
Toileting
through
Dressing,
of an
the norm
For example,
Wnek,
that
In the Eating
in standard
scores.
Stock,
the Adaptive
(Eating,
four items
points.
case that tests with poor The
less those
steepness
by inspecting
age levels
subareas
more
item earns
percentile
at some
the
standard
(Newborg,
assessed.
rank
to 3, one
percentile.
item
are than
differences
scores,
simply
on the Battelle,
of the four
of 47 percentile test’s
many
of abilities
and an additional the 53rd
levels
of
glaring
obviously
ability
major
in the reported
gradients
a percentile rank
by level
as to evidence
gradients
in raw
can be determined
four subareas
three
span the full range to 25 earns
item
produce
Inventory
among
reports
and Toileting),
hierarchically
in children’s
gradients
for gaps
has item
As one example
steep
fluctuations
Developmental 1984)
percentile
item
gradient
at various
with
differences
of minor
item
the Battelle
arranged
not be so steep
gradients.
with
instrument’s
are
should
Tests
or moderate
gradual
tests
that
of which
assessed.
to small
with more scores
of items
the increments
for each
the exception
as a result
it possesses
Abilities
instruments differences
test-retest manual
of the five McCarthy
of the Motor
also
of the unique
high
examiner’s
as an artican
Scale
exist
charac-
reliability. (McCarthy, subscales
at ages
in
7½-8½,
where
it is .69).
norm that
tables each
successive
identical
In these
circumstances
those
scales
Verbal, On
a child
For example, day.
system,
with
on the following
the above-mentioned roughly
intellectual
Additionally,
the
Kaufman,
General
1977),
standard
skills,
standard would
is relatively
of 50 and standard
considerably, score
by the McCarthy,
Cognitive
Index,
decreased
had
on
mean
that
decreased
of an additional IQ
his extra
would
day
equivalent
112 to 101,
and
respectively,
magnitude
or
from
Motor 49,
of 10, his
and
points,
as assessed
have
and 50, 48,
deviation
of this
for a child to lose 11 IQ points
stable
(with
a reported
total
of some
traumatic,
disabling
on the
McCarthy
are
gradient
and not a result
tioning.
Similarly,
WISC-R,
which
by
of life.
(Kaufman
&
or two-thirds
of a
radical
While given
skill,
resulting
knowingly
it does
two separate
anniversary
dates scores.
have norm
tables
need
differ The
slightly
(1981)
measures).
in such
in the
and
steep
item
intellectual
func-
instruments
from
age
Scales,
produce
as the
level
6 years
in their selection
within
has pointed
out,
grade
characteristics
found
norm
3
age-span
layout.
tables
a
so that in the
for example, Hence,
and
practi-
consider
the
tables.
FOR COMPARISONS
and age equivalents scores that
test two
to assess
differences
of instruments
decisions
of standard
It is frequently
their
use,
and K-ABC,
the norm
or placement
on the same may
dramatic
WISC-R,
OR AGE EQUIVALENTS
diagnostic
child
that may stagger
in their
to be tested
the same diagnosticians
and thus
McCarthy
USE OF GRADE
the psychometric
seen
tables
overall
transition
that
that do not coincide
of children
As Reynolds
be
table
that
of .90)
so, but the differences
norm
in the child’s
evaluate
happen
instruments
to be careful
used for making
Obviously
to the test’s can
a norm
coefficient
4 months.
no one would
consecutively,
changes
in one day on an instrument
test stability
event?
due
of instability
evidences
to 6 years
days
interval
be 50,
as a result
to
on the five Mc-
Memory,
deviation
7
identical
of 13, 8, 3, 5, and 8 on
would
dropped
as mea-
at age 2 years
raw scores
Differences
absence
positions
days
overnight,
scores
have
attained
tioners
on two successive
deviation!
Is it possible
months
when
be tested
scores
5, 4, 4, 5, and 5 scale
scales.
one-half
a mean
even
Quantitative, standard
in the
they find
scores
performance
and included
day would
cost him
may
day, earning
If his raw score
his respective
lower
be tested
a child
Perceptual-Performance,
scores
are contrasted.
less intelligent,
on the next
average
day of life would the child’s
can literally
standard
age level to another,
markedly
considerably
was roughly
a T-score
scores
produces
the previous
respectively,
examine
one 3-month
of the five subscales
to have become
earned
Scales 49.
table
15 days and retested,
Carthy
diagnosticians
from
on each
by the McCarthy.
months,
the
when
norm
raw scores
and be found sured
However,
of the McCarthy
because (i.e.,
a child’s
should
not be
they do not possess they
are not ratio
raw scores
may
or
earn
RELIABILITY Tests
with
larger
low reliability
standard
sequently,
errors
tests
rounding
the
consider
(Wechsler, of 3.41
1974)
examples,
resulting
from
The
WISC-R
Full Scale
levels an
MAT-Short
of reliability, Matrix
internal Form
to the Raven’s instruments,
test,
many Test
form
Matrices,
reliabilities
Con-
intervals
that
and,
IQ reliability
sur-
of .95 and an test
like most
matrices
the
achieve
the Short
for
6½ age
abbreviated
compare
do not
1985a),
at the
SEM
(While
In contrast,
Naglieri,
of a progressive
do not
differmanual
sample.
commercial
of .70
reliability
examiner’s
of reasons.)
(MAT,
coefficient
is an abbreviated
Progressive
other
for a wide variety
Analogies consistency
it has
example, level.
The
test similar or short-form
favorably
with
those
of
measures.
If diagnosticians
were
to compare
his or her performance
be found
screening
of discrepancy
in the standardization
of the
might
confidence
as a result,
cohorts.
score.”
following
reliable
and,
reliable
large
age group
Form
with
produce
error
more
an average
similar
IQ
their
reports
is a very
full-scale
“true
measurement than
for the 6½-year
WISC-R
reports
more
low reliabilities
possibility the
DIFFERENCES
of measurement
the examinee’s
To explore ences,
with
produce
test.
because Obviously,
a 6½-year-old
child’s
WISC-R
on the MAT-Short
Form,
sizeable
of the potentially if diagnosticians
less accurate should
want
score more
Full
obtained than
Scale
differences on the
screening
information
on
Expanded
the
Form
MAT,
Psychoeducational ity coefficients reliability
partially
If
mean
two
= 100,
are
error
than
standard
produce
each
range
of scores
adequately.
arouse
discussion assess
function
the
more
reliable
assess
There
by chance scores”
the same
several
8 points that
low
in mind
of the tests
system
(say, of .70) the
since
(e.g.,
each
would
(slightly
more
two tests
could
at a 68%
be expected
ACROSS
skill area,
skills
confi-
to be found
in a
meetings
tests,
because
skills
ways
assessed
or theoretical
do not
that frequently
of the diverse in
significant-
by the two tests
for example,
differences sampling
TESTS
yet produce
assessed
reading
in content
keeping
deviation.
global
These
whether
score
alone,
could
ASSESSED
the specific
skill.
standard
of approximately
are many
global
scores,
It is conceivable
“true
the reliabil-
determine
one or with
same
a full standard
because
of differences
and compare
and
low reliabilities
differences
in M-team
this
examine in obtained
the
DIFFERENCES
in title,
to use
with only
on
deviation).
score
scores
overlap they
based
that exceed
Two tests can, ly different
the differences
of the tests’
SKILL
advised
discrepant,
of measurements
significant level
are
15) and have equally
have standard
be
should
that
may be associated
SD=
dence
tests
explains
tests
one-half
would
1985b).
diagnosticians
for the
that low reliability used.
they
(Naglieri,
in which may
orientations
be
a
of the
instruments. The
Wide
Range
demic
screening
rubric
of “reading.”
Kaufman, second
1983)
subtest,
Along
complete
due,
when there
three
in part,
reading
tests. Reading
assessed. Reading
Decoding
standard
phonetic
WRAT
the WRAT mance
on
effort
subtest rules
includes the
two
assessment
to include words
decoding rendered
measures
reading a more
to obtain with
words also
instruthat
on these
while
similar
vary vary
words
of the K-ABC
not be decoded sword,
in its sight
are
three to the
of nonphonetic
uncle,
subtest.
these
scores
the development
measures may
or comprehen-
is assessed
that could
by the K-ABC
comprehension.
Test,
subtest,
such as gnat,
decoding
as a
comprehensive
in the number
few nonphonetic
reading
as well
a child
reading
words
&
Decoding,
in the reported
during
general
(Kaufman
and abilities.
assess
Decoding
was made
(e.g.,
very
and K-ABC
skills
is an acathe
Scale
reading
Mastery
differences
differs
under
only decoding
a more
reading
Reading
in format,
A conscious
to use
ways in which
K-ABC
Test
that assesses that assess
diagnosticians
be considerable
The
Reading
Reading
of a child’s
to the different
WRAT
thorough
Woodcock
1984)
vocabulary
Achievement
subtest,
choose
M-team
& Wilkinson,
word
K-ABC
of reading
might
as the
may
the
Understanding,
measures
understanding
Thus, ments,
sight
a comparable
Reading
such
Test (Jastak
assesses
Likewise,
a diagnostician
measure,
The
that
has
with these
sion,
Achievement
test
solely
ache,
word list.
and a child’s because
by
recipe).
of the
Thus, performore
Along
a similar
Woodcock hension ABC
“act
a child’s
Mastery
as assessed Reading
and
line,
Reading
the
content
of the
on the Woodcock
passage
and answer
questions
better
the
assessed three
about
different
measures
different
and
the previous
sample
scores
the
because
purport
the content
to assess
universe items
may attribute
numbers
only,
be carried
while
By examining
the arithmetic
assess
other
tests
multiply,
and
weights
assigned
sampling metic time
divide.
may
tests
few tests
others
result
sampled
test
in scores produce
is of the essence
assesses type
that
the content
that would
is frequently
Each
to a particular
domain
significant
It is well known, magnitude
but
often
as a function
1979;
Sattler,
scores
that
1982).
are higher
Tests than
the size of the score
between
the publication
forgotten,
of their
that were those
dates.
but
add, the
tests.
there
content
would
of psychoeducational
likely
be
Because
measures,
are insufficiently
it
sampled.
DATES produce
scores
dates
(Bracken,
some
time
related
count,
subtract,
If all arith-
well,
normed
of
differential
due to this factor.
tests
in the
dozen, twice,
sufficiently
that
are
can
across
tests that have more
differences
hag
differences
publication
to
incidence
of inconsistent
significantly
IN PUBLICATION
test
to whole
to rote and place
because
domains
One
the others
a higher
mathematics,
in the administration
the case that the content
DIFFERENCES
and
differ
test
psychoeducational
the examinee
of item
the
the
operations
(e.g.,
ability
that
and percentages.
from
of concepts
whether
applied
used
include
sample but
four-function
differs
the examinee’s
assess
skill area
domain.
functions fractions,
tests
knowledge
assess
immediately
Some
of the
different
may
of the content
measure
is
Thus,
Two tests
concepts,
of frequently
each
subskills.
the examinee’s
more than), whereas
and
decimals,
subtests
see that
global
significantly
sufficiently.
test also requires
is
skill
TESTS the same
to the four basic
that include
of mathematical
that
and
the second
one can easily
sampling
sampling
weight
out on numbers
measures, items
in their
considerable
procedure reading because
for example,
functions,
as
skill.
produce
do not overlap
principles,
not be parallel
yet
K-
read a
in test scores.
that global
in mathematics,
of mathematical
may
differences
can assess
domain,
competence
this
not be in agreement
two tests
samples
neither
way
ACROSS
The
understand,
that the child
While
in the
DIFFERENCES
content
read,
comprehension,
requires
the passage.
may
by the compre-
subtest.
a child
Reading
Test,
used to assess
condition,
same
read.
differences
of reading
CONTENT As with
that
for significant
procedures
as assessed
her or his reading
Understanding
Mastery
other,
may well be responsible
markedly
from
requires
passage
Reading
than
comprehension, vary
Reading
subtest
assessed inherently
may
by the K-ABC
Understanding
out”
reading
Test,
differ
in
Kaufman,
ago routinely
recent
directly
that
1981;
produce
publication
dates,
to the length
of time
1960 than 1972
version
(Sattler,
approximately
1982)
Binet lower
IV
(Doll,
the
1984)
and
the
lower
1984;
Dunn
Dunn,
most
recent
suspect,
dates
1965)
produces
that
Adaptive
Peabody
these
Large
tests
Test
one-half
average
(PPVT,
Dunn,
to two-thirds
(Bracken,
intervals
on
(Sparrow,
Vocabulary
PPVT
of publication
are
Behavior
Picture
than 1981).
Prasse,
separated
stan-
& McCallum,
the initial
last two instruments,
larger
1 to
a variety
that are approximately
that are approximately
deviation
for
(Kaufman,
Vineland
produces
norms
Technical
revision
Scale
from
Binet
produces
its
lower
the Binet
the
1949
1959)
and
lower
norms
the norms
score
differences
might
between
their
two
versions
than only
do significant
and its revised
version,
differences
typically
but differences
between
also
across
1974 than
1972
Binet,
but higher
their newer
instruments
assessing other tude
produces
the K-ABC
scores
Binet
IV.
Examin-
more will likely
find that the scores
with
will differ
members
Unfortunately,
WISC-R
as a function
test
on the market
children
M-team
existing
than
test
tests
still
is no reasonable
be
using
rule
from more
thumb
of the
obtained
when
obtained dated
instrument.
to determine
of difference that the longer score
a
differences
of thumb,
the interval
between
score
differences
struments over
time,
samples often
but
they
of newer
that
regions.
associated are
represent
also The
& Kirk,
were
When
publication
dates,
Flynn
OF THE with
a function
such
normed a sample
the population,
NORMING
changes
of changing partly
instruments.
McCarthy,
ments
The Illinois 1968)
solely
to more
versions
representative
of commonly
the PPVT
on white in such
an unknown
are
subjects
from
of error
of in-
norming
used tests were Abilities
two examples
a way that
amount
dates
of the population
Test of Psycholinguistic
and
is drawn
in the publication characteristics
related older
SAMPLE
restricted
(ITPA; of instru-
geographic
it does
not
results
in the develop-
accurately
of the norms.
When reflect
partly
not well normed.
Kirk,
ment
are
While
has assumptions
REPRESENTATIVENESS The
between
the tests.
tests
different
are developed
and
geographic
regions,
normed
on markedly
disproportionate
different racial
samples
representation,
that or
skewed also
subsamples
produce
equated
of socioeconomic
unequal
in
any
estimates
norming
strata,
one
of a given
sense,
it
is
can
skill.
expect Since
unlikely
that
that the
the test may
tests
their
were
norms
not
will
be
equivalent. The
K-ABC
sentation
is one example
occurs
ported
on some
reduced
examinees. 1985)
mean
The
among
children
of higher
numbers
of low-SES
ABC
socioeconomic
point
difference
Score that
differences
were
among
tests
from
any
(SES)
factors
(Bracken,
and Hispanic of equal that
for approximately
differences
in tests’
the Ka two-
(Bracken,
the magnitude
overrepresentation
re-
Hispanic
exclusion
contrasts
to determine
or
and
It is estimated
accounted from
the
repre-
K-ABC
of black
and
Hispanic-Anglo
impossible under-
of several
children.
do result
The
black,
inclusion
procedures
in the black-white,
white,
a result
Hispanic
disproportionate
variable.
between
status
and
sampling
but it is usually
results
differences
was the disproportionate black
when
stratification
socioeconomic
differences
samples,
score
reduced
which
of what can result
important
1985).
normative
of difference
of particular
selection
variables. The
important
select
point
to recognize
tests that provide
tatively such
sample as the
over-
efforts
important
sumers
need
should
demand
to whom
and
to become that
more
quality
should
effort
the test
reflect
by accident,
selection
examiners
underrepresent
to accurately
ing of a test does not occur to sample
is that
that a concerted
the population
K-ABC
spite of earnest
evidence
population
selective
in their
The
choice
accurate
in norm-
even the best efforts
degree.
in all aspects
Tests
characteristics
the population.
to an ideal
be present
to represen-
is to be administered.
and sometimes
variables
systematically
was made
Psychometric
of instruments,
fail con-
and they
of test development
and
norming.
CONCLUSIONS Significant purport
differences to assess
variables,
among
as systematically Examiners
use.
that
two or more may
examiner
psychometric
instruments
be a result
differences,
differences
and supervision
to become
the examiner
gradient,
norm
table
layout
population,
variables, The
Familiarity
item
understood.
between differences
that
of student
or psychomet-
should
be considered
of psychologists
as the other,
variables.
floor,
is able
exist These
in the training need
they
requires
iner
tests.
human-related
ments
skills.
examiner-examinee
ric differences more
frequently
similar
adept with
be able
reliability,
differences
the quality
instruments’
to determine
and
and standardization
to judge
at determining the
validity,
that do occur
the quality
of the
and
between
of a test’s ceiling,
characteristics.
Once
it appropriateness instruments
instru-
characteristics
as well as the effects
sample
of a test
the limits technical
of the test’s an examfor a given
will be more
easily
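The reliability argument made in this article (tests with low reliabilities carry larger standard errors of measurement, so scores from two such tests can disagree by chance alone) can be illustrated with a short numeric sketch. It assumes the standard classical-test-theory formula SEM = SD x sqrt(1 - r), which the article does not spell out, and a standard-score metric with a mean of 100 and a standard deviation of 15. The function names are hypothetical, and the reliabilities of .70 and .95 are used only as illustrative inputs.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory SEM: SD * sqrt(1 - r_xx). Assumed formula."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def standard_error_of_difference(sem_a: float, sem_b: float) -> float:
    """Standard error of the difference between two scores on the same scale."""
    return math.sqrt(sem_a ** 2 + sem_b ** 2)


SD = 15.0  # standard-score metric assumed here (mean 100, SD 15)

sem_low = standard_error_of_measurement(SD, 0.70)   # a low-reliability test
sem_high = standard_error_of_measurement(SD, 0.95)  # a high-reliability test
se_diff_low = standard_error_of_difference(sem_low, sem_low)

print(f"SEM at r = .70: {sem_low:.2f}")
print(f"SEM at r = .95: {sem_high:.2f}")
print(f"SE of difference, two tests at r = .70: {se_diff_low:.2f}")
```

Under these assumptions the SEM is about 8.2 points at a reliability of .70 and about 3.4 points at .95, so two equally unreliable tests can easily differ by more than half a standard deviation without any real difference in the child's skill. Values printed in a test manual may differ slightly from the formula, for example when they are computed separately for each age group.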
REFERENCES
Bracken, B. A. (1981). McCarthy Scales as a learning disability diagnostic aid: A closer look. Journal of Learning Disabilities, 14, 128-130.
Bracken, B. A. (1984). Bracken Basic Concept Scale. San Antonio, TX: Psychological Corporation.
Bracken, B. A. (1985). A critical review of the Kaufman Assessment Battery for Children (K-ABC). School Psychology Review, 14, 21-36.
Bracken, B. A., Prasse, D. P., & McCallum, R. S. (1984). Peabody Picture Vocabulary Test-Revised: An appraisal and review. School Psychology Review, 13, 49-60.
Doll, E. A. (1965). Vineland Social Maturity Scale. Circle Pines, MN: American Guidance Service.
Dunn, L. M. (1959). Peabody Picture Vocabulary Test. Circle Pines, MN: American Guidance Service.
Dunn, L. M., & Dunn, L. M. (1981). Peabody Picture Vocabulary Test-Revised. Circle Pines, MN: American Guidance Service.
Flynn, J. R. (1984). The mean IQ of Americans: Massive gains 1932 to 1978. Psychological Bulletin, 95, 29-51.
Jastak, S., & Wilkinson, G. S. (1984). The Wide Range Achievement Test-Revised. Wilmington, DE: Jastak Associates.
Kaufman, A. S. (1979). Intelligent testing with the WISC-R. New York: Wiley.
Kaufman, A. S., & Kaufman, N. L. (1977). Clinical evaluation of young children with the McCarthy Scales. Orlando, FL: Grune & Stratton.
Kaufman, A. S., & Kaufman, N. L. (1983). Kaufman Assessment Battery for Children. Circle Pines, MN: American Guidance Service.
Kirk, S. A., McCarthy, J., & Kirk, W. (1968). Illinois Test of Psycholinguistic Abilities. Urbana, IL: University of Illinois Press.
McCarthy, D. (1972). McCarthy Scales of Children's Abilities. San Antonio, TX: Psychological Corporation.
Naglieri, J. A. (1985a). Matrix Analogies Test: Short Form. San Antonio, TX: Psychological Corporation.
Naglieri, J. A. (1985b). Matrix Analogies Test: Expanded Form. San Antonio, TX: Psychological Corporation.
Newborg, J., Stock, J. R., Wnek, L., Guidubaldi, J., & Svinicki, J. (1984). Battelle Developmental Inventory. Allen, TX: DLM/Teaching Resources.
Reynolds, C. R. (1981). The fallacy of "two years below grade level for age" as a diagnostic criterion for reading disorders. Journal of School Psychology, 19, 350-358.
Sattler, J. M. (1982). Assessment of children's intelligence and special abilities (2nd ed.). Boston: Allyn and Bacon.
Sparrow, S. S., Balla, D. A., & Cicchetti, D. V. (1984). Vineland Adaptive Behavior Scales. Circle Pines, MN: American Guidance Service.
Thorndike, R. L., Hagen, E. P., & Sattler, J. M. (1986a). Stanford-Binet Intelligence Scale. Fourth Edition. Chicago: Riverside.
Thorndike, R. L., Hagen, E. P., & Sattler, J. M. (1986b). Stanford-Binet Intelligence Scale. Fourth Edition Technical Manual. Chicago: Riverside.
Wechsler, D. (1974). Wechsler Intelligence Scale for Children-Revised. San Antonio, TX: Psychological Corporation.
Woodcock, R. W. (1973). Woodcock Reading Mastery Test. Circle Pines, MN: American Guidance Service.