Ten psychometric reasons why similar tests produce dissimilar results

Bruce A. Bracken
Memphis State University

Significantly different results frequently exist between two or more tests that purport to measure the same skill when the same child is tested on both instruments. The reasons for these discrepancies may be related to the examinee, examiner, examinee-examiner interactions, environment, or psychometric characteristics of the tests employed. Since the more human-related reasons for test performance instability receive considerable treatment in assessment training and in the literature, this article cites 10 major psychometric reasons why similar tests may produce disparate scores when a single child is tested.

Journal of School Psychology. Received December 1, 1986; final revision received April 23, 1987. Address correspondence and reprint requests to Bruce A. Bracken, PhD, Department of Psychology, Memphis State University, Memphis, TN 38152.

There are times when multidisciplinary teams (M-teams) meet and various team members present what appears to be conflicting information about the child's academic, perceptual, language, or cognitive abilities. Several members of the team, having assessed the same general skill area with a child and using similar instruments, may find upon presenting their results at the staff meeting that they are not in agreement regarding the child's level of skill or ability. Such situations may lead to heated debates within M-teams and, as a result, some members may come away from the meeting believing that psychoeducational tests are too limited to be of use, or that their use should be permitted only by those who have more than minimal knowledge of psychometric assessment.

Theoretically, two or more tests that purport to measure the same skills should produce highly correlated scores and should demonstrate concurrent validity. Even when tests have demonstrated strong evidence of concurrent validity, individual examinees may earn scores on those tests that deviate significantly. Thus, it should be remembered that concurrent validity is established through the study of group performances rather than individual performances (i.e., individual differences).

An operational definition of a significant difference has been offered by Sattler (1982), who suggested that if two or more scores deviate by an amount equal to or more than one standard deviation, that difference is significant. In general, if this standard is met, a significant difference exists. The criterion of one standard deviation seems adequate in most cases, especially when the tests are highly reliable and have small standard errors of measurement (SEM).
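The one-standard-deviation criterion attributed to Sattler (1982) above can be written as a short check. This is only an illustrative sketch: the function name `is_significant_difference` and its parameters are hypothetical, and it assumes that both scores are reported on the same standard score scale (the default standard deviation of 15 is an assumption chosen for illustration).

```python
def is_significant_difference(score_a, score_b, sd=15.0, criterion_sds=1.0):
    """Return True when two standard scores differ by at least
    `criterion_sds` standard deviations.

    Assumes both scores come from tests that share the same standard
    score scale; `sd` is that scale's standard deviation.
    """
    if sd <= 0:
        raise ValueError("sd must be positive")
    return abs(score_a - score_b) >= criterion_sds * sd


if __name__ == "__main__":
    # 13 points apart on an SD = 15 scale: below the criterion.
    print(is_significant_difference(100, 87))
    # 16 points apart on an SD = 15 scale: meets the criterion.
    print(is_significant_difference(112, 96))
```

The check says nothing about why two scores differ; it only flags pairs of scores that meet the criterion so that the psychometric reasons discussed in the rest of the article can be examined.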

The aim of this article is to consider the most common reasons why discrepancies may exist between two or more tests that purportedly assess the same general skills. While differences in test scores may be a result of student variables (e.g., motivation, changes in physical health), examiner differences (e.g., levels of competence), examiner/examinee differences (e.g., rapport, racial differences), and environmental variables (e.g., comfort, distracters), this article will focus on psychometric reasons for disparate test scores. The previously mentioned reasons will not be discussed in this article because they are covered fairly well in traditional assessment-related coursework and textbooks. Psychometric explanations for significant differences between similar tests do not seem to get the same attention as the more human-related variables. Thus, this article provides a list of the ten most common psychometric reasons why similar tests may not produce similar scores.

FLOOR EFFECTS

The floor of a test is the lower range of abilities that are assessed by the test. If a test is to be capable of differentiating children with better-developed skills from those of low-functioning ability, it must have a sufficient number of easy items to allow even very delayed children some initial success. An instrument that contains only a few items that low-functioning examinees can answer correctly is obviously incapable of distinguishing among levels of ability that are low. Such instruments are of little value in the psychoeducational assessment of those children who are most frequently assessed and served (e.g., disadvantaged, delayed, disabled, retarded, or handicapped children). Floor problems are common among tests that purport to assess abilities across a range of age levels, and they are usually most evident at the lowest ages.

Some newer instruments are multistage tests. As an example, the Stanford-Binet IV (Thorndike, Hagen, & Sattler, 1986a) has a set of subtests that are administered to all individuals regardless of age and that span the entire age range (e.g., Vocabulary, Bead Memory), some subtests that are administered to only the younger ages (e.g., Absurdities, Copying), and some that are administered only at upper age levels (e.g., Verbal Relations, Matrices). Whether an instrument is a single-stage test, in which all children are administered all subtests (as on the Wechsler scales), or is multistage, the robustness of each subtest floor should be examined, as well as the total test floor.

Psychometrically limited floors can cause significant differences to occur between tests, differences which can be artifacts of one or both of the instruments' faulty floors. The Kaufman Assessment Battery for Children (K-ABC; Kaufman & Kaufman, 1983) is one of many tests that have limited floors at a variety of age levels (Bracken, 1985). The K-ABC, which is appropriate for children between the ages of 2½ and 12½ years, includes several subtests that are especially deceiving because of their weak floors. One particular subtest with a weak floor is Reading Understanding, which serves children between the ages of 7 years, 0 months and 12 years, 6 months.

With a mean of 100 and standard deviation of 15 for the Reading Understanding subtest, a 7-year-old child who obtains zero credit (i.e., does not pass a single item correctly) on this K-ABC subtest earns a standard score of 87. With minimal or no success on the subtest, then, a child earns a standard score of 87 to 89, which lies in the middle of the average to low-average range of scores. Obviously, the floor of this subtest is insufficiently developed to differentiate low-functioning, nonreading, or reading-disabled children from children with average reading abilities. Since the K-ABC Reading Understanding floor is so weak that a low-functioning (even severely reading-disabled) 7-year-old is considered to be within the average range, it would seem that any 7-year-old child referred for reading problems would not be an appropriate candidate for this subtest.

In fact, if one M-team member had assessed a child's reading comprehension with the K-ABC Reading Understanding subtest, the child might appear to have reading skills that are commensurate with his or her average intellectual abilities. If a second M-team member subsequently assessed the child with a reading test with a more ample floor (e.g., the Woodcock Reading Mastery Test; Woodcock, 1973), the child's reading disability might be accurately detected. In cases of this sort, the floors of the two reading tests should be examined; in this way the differences between the two reading scores can be more easily resolved, and the child's educational and programmatic needs can be more accurately determined.

CEILING EFFECTS

As in the case of limited floors, limited ceilings can result in significant differences between two or more similar instruments, even though both tests purport to assess the same skill.

A limited ceiling exists when an instrument does not have a sufficient number of difficult items to distinguish between a child who is average or high-average and a very able child in the skill assessed. Ceilings, as with floors, are frequently based on extrapolation, and scores in the extreme ranges tend to be less accurate than scores within the average range. Significant differences between test scores will typically occur when one instrument is designed for a younger population and a second instrument is designed to serve an older population, but the two overlap in age range. Tests designed to assess younger persons frequently exhibit a limited ceiling for the oldest children they serve, whereas tests designed for older individuals allow those children to earn higher scores.

As an example, assume that a 7½-year-old girl has been screened for gifted placement by two M-team members using two different measures of receptive vocabulary. The first examiner has assessed the child with the Bracken Basic Concept Scale (BBCS; Bracken, 1984), which is appropriate for children in the age range 2-6 through 7-11. The second examiner used the Peabody Picture Vocabulary Test-Revised (PPVT-R; Dunn & Dunn, 1981), which is appropriate for individuals from 2½ years through 40 years.

At 7½ years, a child who answers all 258 items on the BBCS correctly can earn a total test standard score of only 136. The highest score attainable on the PPVT-R at this age level is 160 (nearly two standard deviations higher than available on the BBCS). A child who fails as few as five items on the BBCS drops precipitously to a score of 117 (no longer in the gifted range), whereas failure of five PPVT-R items at this age level reduces the score only to 153 (still well into the gifted range).

It is apparent that the BBCS has a limited ceiling at the 7-year level and, although it is appropriate at this age level, it should not be the test of choice to identify gifted children. The PPVT-R has a strong ceiling and does a much better job of differentiating degree of giftedness than the BBCS. By comparing the ceilings of the two instruments used in the previous example, the examiners would be able to resolve the discrepancies that existed between the two confusing test scores. It is obvious (all other variables being constant) that when two or more tests have been used, the test with the sounder ceiling should be the one used to assess the abilities of gifted children.

ITEM GRADIENTS

Item gradient is a term used to describe how steeply test items are arranged in difficulty, and it is a reflection of a test's content validity. In order for a test to demonstrate strong content validity, it must adequately sample the universe of content material it purports to assess. If a test is to adequately assess the full range of a given content domain, it must sample the content across all difficulty levels. A good measure of a given skill area must possess a comprehensive series of items that are hierarchically arranged by level of difficulty, with minimal gaps in the skill continuum.

Item gradients should not be so steep as to evidence glaring gaps in the levels of abilities assessed. Tests with steep item gradients are less sensitive to small or moderate differences in children's abilities than tests with more gradual gradients. Since tests with steep item gradients produce major fluctuations in standard scores as a result of minor differences in raw scores, significant differences between two tests can sometimes be explained by the steepness of item gradients alone. The steepness of an instrument's item gradients can be determined simply by inspecting the norm tables for gaps in the increments of standard scores earned for each raw score point.

As one example of an instrument with steep item gradients at various age levels, the Battelle Developmental Inventory (Newborg, Stock, Wnek, Guidubaldi, & Svinicki, 1984) reports four subareas in the Adaptive Domain (Eating, Dressing, Personal Responsibility, and Toileting). At ages 84 to 95 months, the item gradients in three of the four subareas are extremely steep. In the Eating subarea, a raw score of 0 to 25 earns a percentile rank of 1, an additional item earns a percentile rank of 3, one more item raises the percentile rank to 6, and an additional item (a raw score of 28) raises the child to the 53rd percentile, a difference of 47 percentile points. Thus, a child's performance is ranked somewhere between the first percentile and the 53rd percentile on the basis of only three or four items.

It should also be noted that, while the quality of an instrument's item gradients is independent of its ceiling and floor, it is frequently the case that tests with poor item gradients also evidence limited ceilings and floors. Practitioners need to examine the effects item gradients have on changes in standard scores, and to determine whether significant differences between tests reflect steep item gradients rather than differences in skill levels.

DIFFERENCES IN NORM TABLE LAYOUT

Significant differences can exist not only between two different tests, but even between two administrations of the same scale when the child's raw scores are identical, as an artifact of the unique characteristics of the instrument's norm tables. The McCarthy Scales of Children's Abilities (McCarthy, 1972) is an instrument that possesses high test-retest reliability. The examiner's manual reports stability coefficients that range from about .75 to .89 for each of the five McCarthy subscales (with the exception of the Motor Scale at ages 7½-8½, where it is .69).

For example, a child could be tested on the McCarthy at age 2 years, 7 months, 15 days and be retested on the following day, earning identical raw scores of 13, 8, 3, 5, and 8 on the Verbal, Perceptual-Performance, Quantitative, Memory, and Motor scales, respectively. In these circumstances, with the McCarthy scale score system (a T-score with a mean of 50 and standard deviation of 10), his respective standard scores on the next day would be 50, 48, 50, 49, and 49. On each of the five subscales his scores would have decreased considerably from the previous day, by 5, 4, 4, 5, and 5 scale score points, respectively, or roughly one-half standard deviation. Additionally, his General Cognitive Index, the IQ equivalent on the McCarthy (see Kaufman & Kaufman, 1977), would have dropped from 112 to 101. His extra day of life would cost him 11 IQ points, or two-thirds of a standard deviation!

Is it possible for a child to lose 11 IQ points in one day in the absence of some traumatic, disabling event? Obviously no one would knowingly consider a child to have become markedly less intelligent overnight. Changes of this magnitude on an instrument that is relatively stable (with a reported total test stability coefficient of .90) are due to the test's norm table layout and steep item gradients, and are not a result of instability in the child's intellectual functioning. A child's standard scores can literally change overnight when he or she happens to be tested on two successive days that fall in consecutive norm tables, moving from one 3-month interval to the next.

Similarly, instruments such as the WISC-R, K-ABC, and McCarthy Scales have norm tables that differ in their age-span layout and that may stagger the transition from one age level to another, so that the transition dates of two instruments do not coincide. Hence, dramatic differences may be found in standard scores when the same child is tested on two instruments, simply as a result of where the child's age happens to fall within each test's norm tables. Practitioners need to be careful in their selection of instruments, and need to consider the characteristics of the norm tables when scores on two separate measures are contrasted.

USE OF GRADE OR AGE EQUIVALENTS FOR COMPARISONS

It is frequently seen that diagnosticians use grade and age equivalents, rather than standard scores, to evaluate differences in a child's performance on two instruments, and thus for making diagnostic or placement decisions. As Reynolds (1981) has pointed out, grade and age equivalents should not be used for such comparisons because they do not possess the psychometric characteristics found in standard scores (i.e., they are not ratio measures).

RELIABILITY DIFFERENCES

Tests with low reliability produce larger standard errors of measurement than more reliable tests and, as a result, produce large confidence intervals surrounding the examinee's "true score." Consequently, tests with low reliabilities produce less accurate scores than their more reliable cohorts.

To explore the possibility of discrepancy resulting from reliability differences, consider the following examples. The WISC-R (Wechsler, 1974) is a very reliable test; the WISC-R manual reports an average Full Scale IQ reliability of .95 and, for the 6½-year age group in the standardization sample, an SEM of 3.41. In contrast, the Matrix Analogies Test-Short Form (MAT-Short Form; Naglieri, 1985a) is an abbreviated form of a progressive matrices test, similar to the Raven's Progressive Matrices, and, like most abbreviated or short-form instruments, it has lower levels of reliability. The examiner's manual reports an internal consistency coefficient of .70 at the 6½ age level. (While screening tests do not achieve high levels of reliability for a wide variety of reasons, the Short Form reliabilities compare favorably with those of many other commercial screening measures.)

If diagnosticians were to compare a 6½-year-old child's WISC-R Full Scale IQ with his or her performance on the MAT-Short Form, sizeable differences might be found because of the potentially less accurate score obtained on the screening test. Obviously, if diagnosticians want more reliable information, they would be advised to use the MAT-Expanded Form (Naglieri, 1985b).
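The link between reliability and the width of the band around an obtained score can be illustrated with the classical formula SEM = SD × sqrt(1 − reliability). The sketch below is an assumption-laden illustration rather than the procedure used in any test manual: the function names are hypothetical, a standard deviation of 15 is assumed, and the reliabilities of .95 and .70 are taken from the WISC-R and MAT-Short Form example above. A manual's published SEM (such as the 3.41 cited above) can differ slightly from the formula's result because it is computed from unrounded, age-specific reliabilities.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(obtained_score, sd, reliability, z=1.0):
    """Band of +/- z SEMs around an obtained score.

    z = 1.0 corresponds to roughly a 68% confidence level.
    """
    sem = standard_error_of_measurement(sd, reliability)
    return obtained_score - z * sem, obtained_score + z * sem


if __name__ == "__main__":
    # Hypothetical obtained score of 100 on an SD = 15 scale.
    for label, reliability in (("reliability .95", 0.95),
                               ("reliability .70", 0.70)):
        sem = standard_error_of_measurement(15, reliability)
        low, high = confidence_band(100, 15, reliability)
        print(f"{label}: SEM = {sem:.2f}, 68% band = {low:.1f} to {high:.1f}")
```

With these assumed inputs the less reliable test has an SEM well over twice that of the more reliable one, which is why a score from a screening instrument can sit far from a score on a longer, more reliable battery without any change in the child.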

If two tests use the same standard score system (e.g., mean = 100, SD = 15) and have equally low reliabilities (say, of .70), each would produce a standard error of measurement of approximately 8 points (slightly more than one-half standard deviation). Keeping in mind that these are standard errors of measurement at only a 68% confidence level, it is conceivable that obtained scores on the two tests could differ by a full standard deviation by chance alone. Differences that exceed one standard deviation could be expected to arouse discussion in M-team meetings. When scores are discrepant, diagnosticians should examine the reliability coefficients of the tests used, compare the ranges of scores surrounding each obtained score to determine whether they overlap, and determine whether the differences in obtained scores may be partially a function of low reliability.

SKILL DIFFERENCES ASSESSED ACROSS TESTS

Two tests can be similar in title, yet produce significantly different scores because of the diverse ways in which they assess the same global skill area. There are many specific skills that may be associated with a global skill such as reading, for example, and tests differ in the specific skills they assess and in the theoretical orientations on which they are based.

The Wide Range Achievement Test (WRAT; Jastak & Wilkinson, 1984) is an academic screening test that assesses only word decoding under the rubric of "reading." The K-ABC (Kaufman & Kaufman, 1983) includes two reading measures: a Reading Decoding subtest and a second subtest, Reading Understanding, that assesses reading comprehension. The Woodcock Reading Mastery Test is a more thorough and comprehensive measure that assesses decoding as well as vocabulary and comprehension. Thus, when three diagnosticians choose to use these three instruments to obtain a measure of a child's reading, there may be considerable differences in the reported scores due, in part, to the different ways in which reading is assessed.

Likewise, the WRAT Reading subtest and the K-ABC Reading Decoding subtest, while similar in format, differ in the number of nonphonetic words they include. A conscious effort was made during the development of the K-ABC Reading Decoding subtest to include words that could not be decoded solely by phonetic rules (e.g., words such as gnat, sword, uncle, ache, recipe). The WRAT has very few nonphonetic words in its sight word list. Thus, a child's performance on these two similar decoding measures may vary because one measure relies more heavily on sight vocabulary than the other.

Along a similar line, a child's reading comprehension as assessed by the K-ABC Reading Understanding subtest may differ markedly from his or her reading comprehension as assessed by the comprehension subtest on the Woodcock Reading Mastery Test, because the procedures used to assess comprehension are inherently different. The K-ABC Reading Understanding subtest requires that the child read, understand, and "act out" what was read. Other measures require the child to read a passage and answer questions about the passage. While the same global skill is assessed in each condition, the difference in procedures may well be responsible for significant differences in test scores.

CONTENT DIFFERENCES ACROSS TESTS

As with the previous reason, two tests that purport to assess the same global skill may produce significantly different scores because the tests sample different content from the skill domain. In mathematics, for example, one test may assess the examinee's ability to add, subtract, multiply, and divide whole numbers only, while a second test also requires knowledge of mathematical concepts (e.g., dozen, twice, more than) and principles, and others include the four basic functions carried out on fractions, decimals, and percentages. Some tests are limited to rote operations, whereas others assess whether the examinee can apply mathematical knowledge.

Two tests may also sample the same content but differ in the relative weights assigned to a particular type of item, and thus produce differences in scores as a function of differential item sampling. Because time is of the essence in the administration of psychoeducational measures, few tests sample the content universe sufficiently, and it is frequently the case that the content domains of two tests do not overlap sufficiently. By examining the content of the tests used, one can easily see that global measures may not be parallel, and that two tests can assess the same skill area yet not be in agreement.

DIFFERENCES IN PUBLICATION DATES

It is well known, but often forgotten, that tests that were normed some time ago routinely produce scores that are higher than those produced by tests with more recent publication dates (Bracken, 1981; Kaufman, 1979; Sattler, 1982). The size of the score differences is directly related to the length of time between the publication dates.

For example, the 1972 norms of the Binet produce scores that are lower than those of the 1960 version (Sattler, 1982), and the WISC-R, published in 1974, produces lower scores than the 1949 WISC. Likewise, the Vineland Adaptive Behavior Scales (Sparrow, Balla, & Cicchetti, 1984) produces lower scores than the initial Vineland Social Maturity Scale (Doll, 1965), and the Peabody Picture Vocabulary Test-Revised (Dunn & Dunn, 1981) produces lower scores than the PPVT (Dunn, 1959). These last two instruments, which are separated from their revisions by large intervals between publication dates, produce average scores that are approximately one-half to two-thirds standard deviation lower than the initial versions (Bracken, Prasse, & McCallum, 1984).

Not only do significant differences typically exist between a test and its revised version, but differences also exist across instruments with different publication dates. The K-ABC, for example, produces lower scores than the WISC-R and the 1972 Binet, but higher scores than the newer Binet IV. Examiners using newer instruments will likely find that the scores obtained will differ from those obtained with more dated tests still on the market. Unfortunately, there is no reasonable rule of thumb to determine the magnitude of difference that one might expect between two tests, other than that the longer the interval between publication dates, the larger the score differences will likely be. M-team members should suspect that score differences are, in part, a function of publication dates when existing test scores were obtained from an older instrument.

While score differences are partly a function of changes in the population over time (Flynn, 1984), they are also partly related to the changing characteristics of norming samples. Older tests often were normed on restricted samples, whereas newer versions are normed on samples that are more representative of the population.

REPRESENTATIVENESS OF THE NORMING SAMPLE

When a sample is drawn in such a way that it does not accurately reflect the characteristics of the population, an unknown amount of error is associated with the norms. The Illinois Test of Psycholinguistic Abilities (ITPA; Kirk, McCarthy, & Kirk, 1968) and the PPVT are two examples of commonly used tests that were not well normed; they were normed on restricted samples (e.g., white subjects from restricted geographic regions). When tests are developed and normed on markedly different samples (e.g., different geographic regions, disproportionate racial representation, or skewed socioeconomic strata), one can expect that the tests will also produce unequal estimates of a given skill. Since the tests were not equated in any sense in their norming, it is unlikely that their norms will be equivalent.

165

Bracken

skewed also

subsamples

produce

equated

of socioeconomic

unequal

in

any

estimates

norming

strata,

one

of a given

sense,

it

is

can

skill.

expect Since

unlikely

that

that the

the test may

tests

their

were

norms

not

will

be

equivalent. The

K-ABC

sentation

is one example

occurs

ported

on some

reduced

examinees. 1985)

mean

The

among

children

of higher

numbers

of low-SES

ABC

socioeconomic

point

difference

Score that

differences

were

among

tests

from

any

(SES)

factors

(Bracken,

and Hispanic of equal that

for approximately

differences

in tests’

the Ka two-

(Bracken,

the magnitude

overrepresentation

re-

Hispanic

exclusion

contrasts

to determine

or

and

It is estimated

accounted from

the

repre-

K-ABC

of black

and

Hispanic-Anglo

impossible under-

of several

children.

do result

The

black,

inclusion

procedures

in the black-white,

white,

a result

Hispanic

disproportionate

variable.

between

status

and

sampling

but it is usually

results

differences

was the disproportionate black

when

stratification

socioeconomic

differences

samples,

score

reduced

which

of what can result

important

1985).

normative

of difference

of particular

selection

variables. The

important point to recognize is that accurate norming of a test does not occur by accident, and sometimes even the best efforts to sample systematically fail to accurately reflect the population. Tests such as the K-ABC over- and underrepresent important selection variables in spite of earnest efforts [...]. Examiners should select tests that provide evidence that a concerted effort was made to representatively sample the population to whom the test is to be administered. Test consumers need to become more selective in their choice of instruments, and they should demand that quality be present to an ideal degree in all aspects of test development and norming.
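
The effect of a disproportionately stratified norming sample can be illustrated with simple arithmetic. The sketch below is not drawn from K-ABC data or from any published norms: the stratum means, the population and sample proportions, the raw score, and the norm standard deviation are all hypothetical values, and the function names and the mean-100, SD-15 score metric are assumptions made for illustration. It shows only the general mechanism: overrepresenting a higher-scoring stratum raises a subgroup's normative mean, so the same raw score converts to a lower standard score.

```python
def weighted_mean(stratum_means, proportions):
    """Mean of a group whose strata are mixed in the given proportions."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return sum(stratum_means[s] * proportions[s] for s in proportions)


def standard_score(raw, norm_mean, norm_sd, mean=100.0, sd=15.0):
    """Convert a raw score to a deviation standard score against given norms."""
    return mean + sd * (raw - norm_mean) / norm_sd


# Hypothetical raw-score means for two SES strata within one subgroup.
stratum_means = {"low_ses": 40.0, "high_ses": 52.0}

# Hypothetical stratum mix in the population versus in the norming sample.
population_mix = {"low_ses": 0.75, "high_ses": 0.25}
sample_mix = {"low_ses": 0.50, "high_ses": 0.50}  # high-SES overrepresented

population_mean = weighted_mean(stratum_means, population_mix)
sample_mean = weighted_mean(stratum_means, sample_mix)

# The same raw score looks weaker when judged against the inflated sample mean.
raw, norm_sd = 46.0, 10.0  # hypothetical raw score and norm SD
score_vs_population = standard_score(raw, population_mean, norm_sd)
score_vs_sample = standard_score(raw, sample_mean, norm_sd)

print(f"normative mean: population {population_mean}, sample {sample_mean}")
print(f"standard score: population {score_vs_population}, sample {score_vs_sample}")
```

With these made-up inputs the skewed sample's mean is three raw-score points higher than the population mean, and the child's standard score drops accordingly. The size of any real shift depends on the actual stratum means and proportions, which the test manual would have to supply.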

CONCLUSIONS

Significant differences frequently exist between two or more instruments that purport to assess similar skills. These differences may be a result of student variables, examiner variables, examiner-examinee differences, or psychometric differences between the tests. In the training and supervision of psychologists, psychometric differences should be considered as systematically as the other, more human-related variables. Examiners need to become adept at determining the quality of the instruments they use. Familiarity with instruments' characteristics requires that the examiner be able to judge the quality of a test's ceiling, floor, item gradient, norm table layout, reliability, validity, and standardization sample characteristics [...]. Once an examiner is able to determine the limits of the test's technical appropriateness for a given population, differences that do occur between instruments will be more easily understood.


Journal of School Psychology

REFERENCES

Bracken, B. A. (1981). McCarthy Scales as a learning disability diagnostic aid: A closer look. Journal of Learning Disabilities, 14, 128-130.
Bracken, B. A. (1984). Bracken Basic Concept Scale. San Antonio, TX: Psychological Corporation.
Bracken, B. A. (1985). A critical review of the Kaufman Assessment Battery for Children (K-ABC). School Psychology Review, 14, 21-36.
Bracken, B. A., Prasse, D. P., & McCallum, R. S. (1984). Peabody Picture Vocabulary Test-Revised: An appraisal and review. School Psychology Review, 13, 49-60.
Doll, E. A. (1965). Vineland Social Maturity Scale. Circle Pines, MN: American Guidance Service.
Dunn, L. M. (1959). Peabody Picture Vocabulary Test. Circle Pines, MN: American Guidance Service.
Dunn, L. M., & Dunn, L. M. (1981). Peabody Picture Vocabulary Test-Revised. Circle Pines, MN: American Guidance Service.
Flynn, J. R. (1984). The mean IQ of Americans: Massive gains 1932 to 1978. Psychological Bulletin, 95, 29-51.
Jastak, S., & Wilkinson, G. S. (1984). The Wide Range Achievement Test-Revised. Wilmington, DE: Jastak Associates.
Kaufman, A. S. (1979). Intelligent testing with the WISC-R. New York: Wiley.
Kaufman, A. S., & Kaufman, N. L. (1977). Clinical evaluation of young children with the McCarthy Scales. Orlando, FL: Grune & Stratton.
Kaufman, A. S., & Kaufman, N. L. (1983). Kaufman Assessment Battery for Children. Circle Pines, MN: American Guidance Service.
Kirk, S. A., McCarthy, J., & Kirk, W. (1968). Illinois Test of Psycholinguistic Abilities. Urbana, IL: University of Illinois Press.
McCarthy, D. (1972). McCarthy Scales of Children's Abilities. San Antonio, TX: Psychological Corporation.
Naglieri, J. A. (1985a). Matrix Analogies Test: Short Form. San Antonio, TX: Psychological Corporation.
Naglieri, J. A. (1985b). Matrix Analogies Test: Expanded Form. San Antonio, TX: Psychological Corporation.
Newborg, J., Stock, J. R., Wnek, L., Guidubaldi, J., & Svinicki, J. (1984). Battelle Developmental Inventory. Allen, TX: DLM/Teaching Resources.
Reynolds, C. R. (1981). The fallacy of "two years below grade level for age" as a diagnostic criterion for reading disorders. Journal of School Psychology, 19, 350-358.
Sattler, J. M. (1982). Assessment of children's intelligence and special abilities (2nd ed.). Boston: Allyn and Bacon.
Sparrow, S. S., Balla, D. A., & Cicchetti, D. V. (1984). Vineland Adaptive Behavior Scales. Circle Pines, MN: American Guidance Service.
Thorndike, R. L., Hagen, E. P., & Sattler, J. M. (1986a). Stanford-Binet Intelligence Scale. Fourth Edition. Chicago: Riverside.
Thorndike, R. L., Hagen, E. P., & Sattler, J. M. (1986b). Stanford-Binet Intelligence Scale. Fourth Edition Technical Manual. Chicago: Riverside.
Wechsler, D. (1974). Wechsler Intelligence Scale for Children-Revised. San Antonio, TX: Psychological Corporation.
Woodcock, R. W. (1973). Woodcock Reading Mastery Test. Circle Pines, MN: American Guidance Service.