Studies in Educational Evaluation, 1979, Vol. 5, pp. 55-75. Pergamon Press Ltd. Printed in Great Britain
STATEWIDE ASSESSMENT IN CALIFORNIA

Dale Carlson
California State Department of Education
Later to become infamous within the profession, the word "accountability" was not yet in common educational parlance in 1958 when the California Legislature established the Citizens Advisory Commission to study the public schools. The voters, though, knew what they wanted - educational standards and proof that students were meeting those standards. The Commission's final report stated:

    The Commission believes that in order properly and effectively to evaluate the education program of the public school system in California, a level of instruction must be set by the Legislature through the State Board of Education. The Commission recommends to the Legislature that mandatory statewide examinations be utilized to establish this standard. (Joint Interim Committee, 1961, p. 38)¹

The examination program which followed from this recommendation has never actually been used to establish standards in any literal or uniform sense. With its legislative origins, however, and its consistent emphasis on the basics of reading, mathematics, and language - and especially with its later requirement of public disclosure of local results - the California program nonetheless bears out the conclusion of Educational Testing Service researchers (ETS, 1973) that the concept of accountability is the heart and soul of state assessment programs.
HISTORY

The Commission's recommendations became law in 1961 and reality in 1962, when more than 1 million California pupils - all those in grades 5, 8, and 11 - were tested in the basic skills of reading, language, and mathematics and in intelligence (which was later euphemized as "scholastic aptitude" and finally dropped from the program in 1972).² School districts were allowed to select which tests to use from lists of approved instruments. The legislation mandating the program contained a provision prohibiting public release of the results.
¹A complete listing of documents pertaining to the California Assessment Program can be obtained from the author c/o the California State Department of Education, Sacramento, Calif. 95814, U.S.A.
²In 1975, other legislation made it illegal to administer group intelligence tests, except for special research purposes.
Early Changes

The first major change in this program came as part of a statewide reading improvement program. The Miller-Unruh Basic Reading Act of 1965 required the administration of uniform reading tests to all pupils in grades 1, 2, and 3 as the basis for selecting the most needy districts to receive special funds to hire reading specialists and as a means for evaluating the impact of the specialists' efforts. Simultaneously, the Legislature required the State Board of Education to adopt uniform tests at the upper grade levels, and changed those grade levels from 5, 8, and 11 to grades 6 and 10. The amendments also narrowed the scope of the content to reading and scholastic aptitude. In 1969, these upper grade levels were changed to 6 and 12 and the content areas were expanded to again include language and mathematics.
Origin of the Current Program

Never popular with school district personnel, the program ultimately created a level of dissatisfaction that led to major changes. Legislation in 1968 had removed the prohibition against public reporting of results and instead mandated an annual report of results on a district-wide basis. The resulting fears and cries of unfair comparison among districts based on results from commercial tests with questionable coverage of the skills taught in California schools became so strong by 1972 that they led to a consensus that major changes were needed. The changes came that year in legislation incorporating the detailed and specific recommendations of a legislative advisory committee on testing chaired by Prof. L. Cronbach. The Committee outlined four questions which they felt the statewide testing program should attempt to answer (Advisory Committee, 1972, p. 4):

1. To what degree are the pupils of the state and of each district mastering the fundamental skills toward which instruction is directed?

2. Which schools are attaining unusual success, and what factors appear to be responsible for that success?

3. Can any explanations be found for the failure of certain schools and districts to achieve results comparable to those elsewhere, which might be remedied by appropriate action and assistance at the state level?

4. When a new instructional activity is introduced into the schools, do subsequent changes in pupil accomplishment testify that the program is accomplishing what is intended?
Structure of the New Program

The separation of purposes for state testing and local testing was set forth in the Committee's report. It emphasized the state-level focus of the assessment program - within the Legislature's basic and non-negotiable constraint that all pupils at targeted grade levels be tested. The functions of the assessment system called for a method which would: (a) provide broad information on student performance at the state level; (b) yield reliable information at the district and school (not pupil) levels; and (c) keep the testing time to a minimum. Multiple-matrix sampling was selected as the most efficient solution to these demands.
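The logic of this design is easy to sketch. The fragment below is a simplified Python illustration; the pupil counts, form counts, and scores are invented rather than the program's actual figures. Several short forms are spiraled across the pupils of a school, and the responses are pooled to estimate a school-level percent correct.

    import random

    def spiral_forms(num_pupils, num_forms):
        """Assign forms to pupils in rotation (the spiraled class packs),
        so every form is used with roughly equal frequency in a school."""
        return [pupil % num_forms for pupil in range(num_pupils)]

    def school_percent_correct(responses, items_per_form):
        """Pool matrix-sampled responses into a school-level percent correct.

        responses: one (form_index, number_correct) pair per pupil.  Each
        pupil answers only one short form, yet the pooled forms span the
        whole item collection, so the school mean reflects the full pool
        without any pupil taking a long test."""
        total_correct = sum(score for _, score in responses)
        total_answered = len(responses) * items_per_form
        return 100.0 * total_correct / total_answered

    # Invented example: 90 sixth graders, 15 forms of 30 items each.
    random.seed(1)
    forms = spiral_forms(num_pupils=90, num_forms=15)
    responses = [(form, random.randint(15, 28)) for form in forms]
    print(round(school_percent_correct(responses, items_per_form=30), 1))

No individual pupil in such a design produces a usable score, but the school and district means can be estimated with little testing time, which is exactly the trade-off the Committee endorsed.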
The Committee dealt with a number of other technical and policy issues, one of the most important being the fair and useful reporting of local results. The use of regression techniques to compare a district's results with those of "similar" districts was endorsed and refined, not just to help the public focus on the impact of a district's instructional program, but as a basis for identifying for study unusually effective practices and programs - an approach which still receives very little serious research attention. The Committee's commitment to this method of reporting, i.e., a comparison of actual and "expected" scores, was manifested in a recommendation that the reading test in grade 1 be replaced with a short, easy test of entry level skills to be used to derive "expected" scores at subsequent grades tested.
Statewide Assessment & Proficiency Testing

School personnel often ask: "How does the assessment program relate to the California High School Proficiency Examination program or to the new legal requirements regarding minimum graduation standards?" The former is an optional, candidate-funded, early-exit examination which bears almost no relationship to the California Assessment Program other than a reliance upon its normative data as a partial basis for setting and monitoring the passing score. The relationship of the assessment program to minimum competency testing is more complex. While both focus on the general area of "basics," they differ in numerous other ways: purposes, specific types of skills assessed, level of reporting, and relationships to the instructional program. Minimum competency testing can perhaps best be thought of as part of the instructional program, the impact of which should be detectable by the assessment system. This instruction/assessment distinction seems all the more true now that competency testing, under recent California law, is to be extended down through the elementary grades (4 through 6).

In summary, the evolution of statewide testing in California over the past 15 years is reflected by the following trends:

1. From an undifferentiated focus to one which clarified the usefulness of the information at the state and local levels - which in turn led to: (a) a focus on group information rather than individual pupil scores; (b) use of multiple matrix sampling; and (c) development of broader tests focused directly on California's curriculum documents and textbooks.

2. From district-selected tests to a common instrument.

3. From a pure accountability program to one designed to provide useful information to those responsible for improving school programs.

4. From non-public information to detailed reports about pupil performance in specific local education agencies.

5. From simple reporting of bare test scores to refinement of a method for reporting results in the context of local educational resources and conditions.
DESCRIPTION OF THE CURRENT ASSESSMENT PROGRAM

The current program configuration, implemented in 1973-74 and 1974-75, is shown in Table 1. Each test is administered every year.

TABLE 1: Configuration of Current California Assessment Program

Grade   Test                               Content Areas Covered              Matrix Sampling   When Tested
1       Entry Level Test                   Readiness skills                   No                Fall
2 & 3   Reading Test                       Reading                            Yes               Spring
6       Survey of Basic Skills: Grade 6    Reading, Language, & Mathematics   Yes               Spring
12      Survey of Basic Skills: Grade 12   Reading, Language, & Mathematics   Yes               Winter
Test Development

Development of the test instruments was guided by one fundamental principle: The tests must be based on skills commonly taught in California schools. Other principles and considerations were:

a. The tests must reflect the full range of instructional objectives in each content area - that is, all those topics which are covered in most good programs, not just those which all pupils should be able to master by a certain grade level.

b. Test items would be selected primarily on the basis of instructional coverage and suitability for California pupils rather than item difficulty or item discrimination characteristics, consistent with the advice of Cronbach (1971) and Millman (1974).

c. The test administration time would be short, yet a sufficient number of items would be included in the test to allow reporting of subskill information at the lowest level of reporting (school level) to aid in program diagnosis.

d. The test items would be acceptable to classroom teachers and would be as free as possible of cultural and linguistic biases.

e. Test items would be drawn from existing standardized tests. (Statutes setting forth this requirement have since been amended to provide for more direct comparisons with national norm groups - without relying upon items from publishers' standardized tests.)

With these considerations in mind, the general test development sequence included the following steps:

1. Statewide committees of content area experts were formed and charged with translating and delineating the general goals found in state-adopted curriculum frameworks into more specific objectives appropriate for assessment.

2. These specific objectives, or test content specifications, were then reviewed by personnel in all California school districts for completeness and relevance to their instructional programs. The revised specifications served as the basic guidelines for selecting and developing pools of test items. These documents were subsequently printed and distributed to all school districts under the general title Test Content Specifications. Figure 1 displays a sample page from the specifications for mathematics (1975). These specifications show the impact of the developing literature in domain-referenced testing, rather than the widespread contemporary practice of constructing items to match lists of behavioral objectives.

3. These content specifications were sent to major test publishers, who in turn identified, from among their collections of items, those questions which matched the specifications. These items were then submitted to the Department of Education for review and consideration for possible leasing.
4. Teams of classroom teachers screened the submitted items and selected those most appropriate for California students.

5. These items were then reviewed by linguists and minority group testing experts for any subtle biases against students of different language or cultural backgrounds.

6. The final pools of items, several hundred at each grade level, were then divided into several short tests or forms - from 10 to 18 per grade level. All test forms were made equivalent in difficulty and in coverage of major skill areas (a rough sketch of this kind of form assembly appears below, after the description of the Entry Level Test).

Table 2 displays the present configuration of the tests: number of forms, total number of items, number of items per form, and titles of the skill areas for which results are reported to schools and districts. The size of the total item pool in each content area was a function of: (a) the number and breadth of the skill-area domains; (b) the need to have the number of items proportional to the importance assigned to the subcontent area; and (c) the number of skill areas selected for reporting at the school and district levels. The number of forms per grade level was a function of: (a) the amount of time designated for testing; (b) the need to obtain estimates of the variance for each school's subskill score; (c) the need (decision) to include items representing all content areas and major skill areas on each form; and (d) the lengths of passages and other space and time requirements of various items. All items were used in one form only. The final outcome - a large number of short tests at each grade level - not only worked to the practical satisfaction of classroom teachers but, as subsequently demonstrated (Pandey & Carlson, 1976), served to provide stable estimates of school means.

Future revisions will eliminate all or nearly all the items leased from publishers in favor of items developed especially for the California Assessment Program. Such revisions will not be made frequently, however, given the difficulty of making straightforward year-to-year comparisons with different sets of items and the difficulty of communicating the results of such comparisons to the public.

Development of the Entry Level Test differed from the above sequence in that it exists in a single form and was developed completely by Department staff with guidance from the Reading Assessment Advisory Committee. The test, which must be given at the beginning of grade 1, yields only a single score, although it consists of five subtests: Immediate Recall, Letter Recognition, Auditory Discrimination, Visual Discrimination, and Language Development.
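As a rough illustration of the form assembly described in step 6 above, the sketch below deals a hypothetical item pool into short forms balanced on skill area and approximately matched in difficulty. The item data, pool size, and form count are invented, and the actual assembly procedure balanced many more characteristics.

    from collections import defaultdict

    def assemble_forms(items, num_forms):
        """Deal a pool of items into short forms so that every form draws
        equally from each skill area and the forms come out close in
        average difficulty.  Within a skill area the items are sorted by
        field-test difficulty and dealt round-robin across forms."""
        forms = [[] for _ in range(num_forms)]
        by_skill = defaultdict(list)
        for item in items:
            by_skill[item["skill"]].append(item)
        for skill_items in by_skill.values():
            skill_items.sort(key=lambda it: it["p"])
            for rank, item in enumerate(skill_items):
                forms[rank % num_forms].append(item)
        return forms

    # Invented pool: 120 items spread over 4 skill areas, p-values .30-.90.
    pool = [{"id": i, "skill": f"skill_{i % 4}", "p": 0.30 + 0.005 * i}
            for i in range(120)]
    for n, form in enumerate(assemble_forms(pool, num_forms=10)):
        mean_p = sum(it["p"] for it in form) / len(form)
        print(f"form {n}: {len(form)} items, mean difficulty {mean_p:.2f}")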
[FIGURE 1: Sample page from the Test Content Specifications: Mathematics. The page shown covers Subcontent Area 8.0, Geometric Comprehension (skill area: components of geometric figures; subskill 8.1, identifying line segments or points related to polygons and circles). For each item specification - e.g., line segment in a geometric solid, altitude of a triangle, diagonal of a rectangle, diameter or chord of a circle, points inside or outside a circle - the page lists the performance mode (such as choosing a correct pictorial representation), the grade levels tested (6 and/or 12), and an illustrative sample item.]
TABLE 2: Format & Contents of Tests Administered in the California Assessment Program

Entry Level Test (Grade 1) - 1 form, 35 items, 35 items per form
  Reading: 35 items. Reporting: single total score.

Reading Test (Grades 2 & 3) - 10 forms, 250 items, 25 items per form
  Reading: 250 items. Reporting categories: word identification (phonetic analysis); vocabulary; comprehension (literal & interpretive); study-locational.

Survey of Basic Skills: Grade 6 - 16 forms, 480 items, 30 items per form
  Reading: 160 items (10 per form). Reporting categories: word identification; vocabulary; comprehension (literal, interpretive/critical); study-locational.
  Written expression: 128 items (8 per form). Reporting categories: sentence recognition, sentence manipulation, capitalization, punctuation, word forms, language choices, standard usage.
  Spelling: 64 items (4 per form). Reporting category: recognition of a misspelled word in a set of words.
  Mathematics: 128 items (8 per form). Reporting categories: arithmetic (number concepts, whole numbers, fractions, decimals); geometry; measurement & graphs; probability & statistics.

Survey of Basic Skills: Grade 12 - 18 forms, 558 items, 31 items per form
  Reading: 198 items (11 per form). Reporting categories: vocabulary; comprehension (literal, interpretive/critical); study-locational.
  Written expression: 144 items (8 per form). Reporting categories: sentence recognition, sentence manipulation, capitalization & punctuation, paragraphs, word forms, language choices.
  Spelling: 72 items (4 per form). Reporting category: recognition of a misspelled word in the context of a sentence.
  Mathematics: 144 items (8 per form). Reporting categories: arithmetic (number concepts, whole numbers, fractions, decimals); algebra; geometry; measurement; probability & statistics.
Test Production & Scoring

Test booklets for each grade are typeset, printed, and scored under contract by competitive bid. The booklets are sequentially collated into class packs by the contractor, such that all forms are administered with relatively equal frequency within each school, each district, and consequently, across the state. All pupil marking is done on the booklets, which are optically scanned. A portion of the test booklet is used to collect individual information on each pupil; this information is used only as aggregated at the group level. No individual pupil names or codes are used, although special longitudinal studies have been conducted by computer matching of pupil test protocols using sex and date of birth.
ANALYSIS & REPORTING OF RESULTS

In any type of reporting it is important to tailor the format and content of the information to the needs of its consumers. The reports of the California Assessment Program can be placed in two categories: school and district level reports, and state level reports.
School & District Level Reporting
The booklets are returned to the test contractor, who then scans the booklets, performs the statistical computations, prepares school and district level reports on pre-printed computer sheets, and distributes them to each district. Figure 2 presents a sample school report (without the luster of the three-level colored shading in the pre-printed portion). Pre-printed explanations of the data are presented and illustrated with computer-printed statements of straightforward interpretations of actual results for the school or district. The district receives a composite district-level report printed on the same form. Each report is accompanied by a separately bound Interpretive Supplement organized in terms of the following five questions:

1. "How did we do on the test itself?" The "percent correct" score is designed to answer this first question. After many years of using the raw score (number correct) as the basic scale, the Department made the switch to percent correct in 1972, following the earlier recommendation of Tyler (1970) and the precedent set by the criterion-referenced testing movement (see Buros, 1977).

2. "How did we do compared to all other schools in the state?" The percentile rank shows the relative standing of the percent correct score for a school in comparison to those of all other schools (or districts, for a district). This score, which is recomputed annually, also allows a reader to compare a school's rank or standing in test performance (in comparison to other schools) with its rank on other input characteristics such as socioeconomic status or mobility level, since all variables are reported on the same scale. The advantages and disadvantages of percentile ranks are treated more fully later.

3. "How did we do compared to schools 'like us'?" This question is the most difficult of all.
[FIGURE 2: Sample School-Level Report on the Survey of Basic Skills: Grade 6 - April 1977 (two pages). Page 1 shows, for each content area (reading, written expression, spelling, mathematics), the percent correct scores for the state, the district, and the school over two years; the school's state percentile rank and comparison score band plotted on the state percentile rank scale; a summary of background factors (occupational index, percent AFDC, percent bilingual, mobility) with their values and percentile ranks; the number of pupils tested and booklets scored; and pre-printed explanations of each score with computer-printed interpretive statements. Page 2 reports survey scores by skill area (e.g., word identification, vocabulary, comprehension, study-locational; sentence recognition through standard usage; arithmetic, geometry, measurement & graphs), each with its percent correct score and a state percentile rank band reflecting measurement error.]
The Department's approach follows that of Dyer (1966, 1970; Dyer et al., 1969) and his work with the states of New York and Pennsylvania in developing a technique whereby a school or district could compare its performance with that of a unique norm: the score predicted for a school with given characteristics, such as the socioeconomic level of the parents, entry achievement level of the pupils, and other "hard-to-change" indicators of the level of input and resources the school had to work with. Working first at the district level, and then at the school level, the Department has spent eight years refining the analysis and reporting aspects of this regression approach. Some of the problems and attempts to solve them are described below.

Background Variables. One of the first issues revolved about the predictor variables to be used in the regression equations. Early attempts included such factors as group intelligence test scores, minority group membership as a percent of total student population, particular expenditure patterns such as average teacher salary, average class size, and a variety of other information from the 1970 census. Current practice, however, relies on fewer predictors: a measure of previous achievement, one or two main socioeconomic factors - occupational level of the parent and AFDC (public assistance) count - and a few other less influential factors, such as percent of pupils who speak some language other than English and mobility level. These additional factors are included in the equation if their contribution to the prediction is statistically significant; their inclusion also enhances the credibility of the results by assuring the consumers, chiefly school persons, that no relevant factors are ignored.

Entry Level Decisions. Use of previous achievement as a baseline measure required two preliminary decisions: (a) Should the entry level measure be based on actual scores for the criterion group (or a cohort group) or for a contemporary group of younger peers - i.e., should it be longitudinal or cross-sectional? and (b) What point should be taken as the entry level or baseline? The Department's response to the first question was to use contemporary group scores, e.g., use of current grade 1 Entry Level Test (ELT) scores as the baseline for current grade 3 scores. This decision stems from the great difficulty in obtaining true longitudinal data, given pupil mobility, and the fact that schools with increasing proportions of lower socioeconomic status students are not well served by baselines built on declining populations. Post hoc data analyses support this position in that correlations are higher for cross-sectional data than for true or quasi-longitudinal data.

The second question, "What point should be considered as baseline?" is essentially a question of the focus of the accountability: the school as a whole or a segment such as the primary grades or the upper elementary grades. The Department chose to use grade 1 ELT scores as the baseline for grades 2 and 3, grade 3 scores for grade 6, and grade 6 scores for grade 12. This choice means that the grade 6 school residual (observed minus predicted score), for example, is an indicator of the quality of the program in grades 4 through 6. As an indicator of overall elementary program quality, however, it could admittedly be quite misleading.

Regression Decisions. Efforts to improve the precision of the predictions have included exploration of such topics as the use of:
(a) non-linear components (relationships were linear at grades 6 and 12; quadratic and cubic components of the ELT were included at grades 2 and 3, largely to ensure a symmetrical distribution of residuals at each level of ELT score); (b) moderator variables (non-significant impact); (c) moving averages for predictors (non-significant); (d) weighted regressions (data points, i.e., schools, are weighted by a function of pupil variance to improve stability over years); and (e) school-within-district regression (not implemented because of the many one-school districts in California and the fact that regression slopes were not uniform across districts; see Cronbach, 1976; California, 1977c).

Two other significant issues were the use of different regression weights for different content areas within a grade level and the computation of new regression weights each year. Although never considered major issues by school district personnel, these questions have never been settled with utter finality. Suffice it to say that current procedures employ new weights each year (which are very stable across years) and unique weights for each content area. Table 3 shows the factors used at each grade level, their beta weights, and the total proportion of variance accounted for. Given the magnitude of the various sources of error in the criterion variables, in the predictors, and in the prediction model, one would hope the impact of the instructional program is such that these R²'s will never be much higher. Current efforts are not aimed at increasing these figures so much as finding and controlling any biases that may exist against certain types of schools or districts.

The actual results for a school are reported in school percentile ranks in conjunction with a comparison score band (the predicted score, plus and minus .67 standard errors). This band allows schools to compare their performance with that of the middle 50% of schools like themselves. For two years, this relationship was highlighted through the printing of an "A," "B," or "W" for performance above, below, or within the band, respectively.

One early problem with this form of reporting was related to the high negative correlation between school or district size and the magnitude of the residual. (Small schools have a much larger variance of residuals due to unreliability in both the criteria and the predictors. Therefore, if all schools had the same width comparison score band, small schools would be much more likely to fall above or below their comparison score band.) The problem was solved by assuming that the variance of residuals was a linear function of two hypothetical variances - the first due to prediction error (which is assumed to be a constant value for schools of all sizes) and the second due to within-school or district sampling and testing error (which is smaller for larger districts and approaches zero for the largest districts). The advantage of this method is that the width of the comparison score band is uniquely determined for each school or district. It takes into consideration the school's size as well as any fluctuation in the score due to the testing situation.
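The band construction just described can be sketched in a few lines (the numerical values below are invented for illustration; the Department's error components come from its own analyses). The residual variance for a school is taken as a constant prediction-error component plus a within-school sampling component that shrinks with enrollment, and the band extends 0.67 standard errors on each side of the predicted score.

    import math

    def comparison_score_band(predicted, prediction_error_var, pupil_var, n_pupils):
        """Return the (low, high) comparison score band for one school.

        The residual variance is modeled as a prediction-error component,
        constant for schools of every size, plus a within-school sampling
        and testing component that shrinks as enrollment grows.  The band
        is the predicted score plus and minus 0.67 standard errors, so
        roughly the middle 50% of 'schools like this one' fall inside it."""
        residual_var = prediction_error_var + pupil_var / n_pupils
        half_width = 0.67 * math.sqrt(residual_var)
        return predicted - half_width, predicted + half_width

    # Invented figures: two schools with the same predicted percent correct.
    small = comparison_score_band(68.0, prediction_error_var=9.0,
                                  pupil_var=400.0, n_pupils=25)
    large = comparison_score_band(68.0, prediction_error_var=9.0,
                                  pupil_var=400.0, n_pupils=400)
    print([round(x, 1) for x in small])   # wider band for the small school
    print([round(x, 1) for x in large])   # narrower band for the large school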
4. "Did we do equally well in all skill areas or are there some particular strengths and weaknesses?" The report for each school provides results for a variety of skills and subskills at each grade level: 15 for reading at grades 2 and 3; six to eight skills for each content area in reading, language, and mathematics at grade 6; and from five to nine for each area at grade 12.
TABLE 3: Factors Used in the School-Level Prediction Equations, Beta Weights in Standardized Form, & Total Variance Accounted For (R²)

[Table 3 gives one regression equation per grade and content area, in standardized (beta weight) form. Grade 2 and grade 3 reading are predicted from the Entry Level Test (leading weights of .36 and .41, with quadratic and cubic components of ELT - 27.2); grade 6 reading, language, and mathematics from the grade 3 achievement index of the feeder schools (weights of .47 to .49); and grade 12 reading, language, and mathematics from the grade 6 achievement index (weights of .66 to .70). The remaining predictors entering the various equations are SES (weights of +.21 and +.27), %AFDC (-.19 to -.32), %Bil (-.10 to -.17), and %Mob (-.02 and -.03). The R² values are .56 and .73 at grades 2 and 3, .67 to .69 at grade 6, and .78 for each content area at grade 12.]

Note: ELT = Entry Level Test score. SES = Socioeconomic status based upon four occupational levels of parents. %AFDC = Percent of students whose parents were recipients of Aid to Families with Dependent Children. %Bil = Percent of students speaking a language other than English. %Mob = Percent of students not enrolled continuously since the first grade. Gr.3 Ach.Index = Achievement of students in the feeder school at the third grade. Gr.6 Ach.Index = Achievement of students in the feeder school at the sixth grade.
These numbers are limited somewhat by the need to allocate more items to certain content areas deemed worthy of greater emphasis by the content advisory committees. To discourage over-interpretation of subscore differences, each subscore mean is flanked by a band one standard error wide. Users are advised to judge the significance of the subskill differences by comparing them with the overall skill-area mean.

5. "How did we do compared to past performance?" This question, considered by many to be the most important one, can be answered in an absolute sense by comparing percent correct scores across years and in a relative sense by comparing percentile ranks across years. In grades or content areas in which most schools are improving, such as reading in the lower grades, percentile ranks allow a school to assess its progress (rate of improvement) relative to that of other schools, since the percentile ranks are recomputed each year.

The school and district reports are distributed early in September of each year, providing school personnel about two months for study and analysis of the findings before the results are made public. About a month later, district personnel receive a condensed district-level profile of all total scores and background factors, as well as information on other district characteristics of interest, such as the tax rate and expenditures per pupil. This report is also accompanied by a special guide to interpretation. School districts are prohibited from releasing local results until a state report is prepared and presented to the State Board of Education in November. At that time the media are free to report both local and state results. Districts are provided with filmstrips to help explain the results to staff, their local board, and the public. A detailed guide for reporting and using test results (California, 1976) has also been prepared specifically for use by local test directors and public information officers responsible for handling the delicate and volatile matter of media relations. The Department also exerts great effort in conducting intensive individual or small group test interpretation sessions with members of the press.
State Level Reporting

The annual report of statewide results (California, 1977b) presents all major findings at the state level. The performance of all pupils is described skill by skill on a percent correct basis, along with changes in that performance. Actual items are reproduced in the report to illustrate what pupils are able or not able to do. The test performance is reviewed by the content committees for professional interpretation of the results in terms of relative strengths and weaknesses. Their comments are printed for the reader's assistance. Overall test results are also presented by type of pupil (e.g., by sex, mobility level, English language fluency, and socioeconomic level) and by type of school (e.g., by size, location, and average socioeconomic level), in order to provide the most useful information for state policy analysis.
Equating Studies. State law requires that studies be performed to allow comparison of statewide results with national performance. Current procedures equate the California Assessment Program tests to various nationally standardized tests. This makes it possible to estimate how California students would have scored if they had all taken several national tests.
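The statutes do not dictate a particular statistical method; an equipercentile-style linkage, sketched below with invented score distributions for a common group of pupils, conveys the general idea: a score on the state test is translated into the national-test score that holds the same percentile rank.

    def percentile_rank(scores, x):
        """Percent of scores at or below x."""
        return 100.0 * sum(s <= x for s in scores) / len(scores)

    def equate(state_scores, national_scores, x):
        """Translate a state-test score x to the national-test scale by
        finding the national score whose percentile rank, in a common
        group of pupils, is closest to the percentile rank of x on the
        state test (a coarse equipercentile linkage)."""
        target = percentile_rank(state_scores, x)
        return min(national_scores,
                   key=lambda y: abs(percentile_rank(national_scores, y) - target))

    # Invented common-group score distributions on the two instruments.
    state = [42, 48, 51, 55, 58, 60, 63, 66, 70, 75, 79, 84]
    national = [30, 35, 39, 44, 47, 50, 54, 57, 61, 66, 71, 78]
    print(equate(state, national, 63))   # estimated national-test equivalent of 63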
Equating to several tests in this way substantially decreases the risk of misinterpretation that could be (and has been) caused by looking at any single test. Further, the timeliness (and hence the validity) of such comparisons is enhanced, since all new tests can be equated as soon as they become available. Future efforts will focus on collecting corollary socioeconomic status information from national norm samples so that statewide results can be more clearly related to national results using balancing or standardizing techniques, to answer the more interesting question: "How do California pupils compare with national averages, when one accounts for the non-achievement differences between California pupils and the national samples?"
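A balancing technique of the kind envisioned here could be as simple as direct standardization: reweight the national norm sample's socioeconomic strata to match California's composition before comparing means. The sketch below uses invented stratum means and proportions purely for illustration.

    def standardized_mean(stratum_means, reference_proportions):
        """Recompute a group mean as if its socioeconomic composition
        matched a reference population (direct standardization)."""
        return sum(stratum_means[s] * reference_proportions[s]
                   for s in stratum_means)

    # Invented figures: national norm means by SES stratum and two mixes.
    norm_means = {"low": 55.0, "middle": 65.0, "high": 75.0}
    national_mix = {"low": 0.25, "middle": 0.50, "high": 0.25}
    california_mix = {"low": 0.35, "middle": 0.45, "high": 0.20}

    print(standardized_mean(norm_means, national_mix))    # norm mean as published
    print(standardized_mean(norm_means, california_mix))  # reweighted to the CA mix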
IMPACT OF THE PROGRAM

During the first years of implementation of the current assessment program, the emphasis and expectation for use of the results occurred at the state level. As district-level personnel began to notice trends and patterns in relation to other information, the application of the results at the local level also increased. It is therefore reasonable to expect a gradual increase in the application of the results in coming years as a consequence of the greater specificity and credibility of test results. The following discussion of the impact and uses of the test is not limited to the latest version of the program, but occasionally draws examples from earlier versions where it is useful and warranted.
Local Impact

In the second year of the current program, the Department conducted a survey of reporting techniques employed by districts and ways in which results have been used. More than half (550) of the state's school districts responded. More than two-thirds (69%) of the districts indicated that the assessment program had revealed new information about strengths or weaknesses in the local program, and 55% claimed to have made actual program changes indicated by results of the statewide assessment program. This finding contrasts sharply with the ubiquitous (but apparently unsubstantiated) criticism that state assessment results are of no value since they do not provide individual pupil information. One is not surprised to find teachers supporting this latter view, since they are accustomed to pupil-level information. It has been much more discouraging to find that the idea and practice of using group information to evaluate and modify programs is also foreign to many administrators and curriculum specialists.

Informal observation of district use of results indicates the existence of several aspects of that phenomenon with the overworked label, needs assessment. Most districts have used the results to identify weaknesses in certain content or subskill areas, certain grade levels, certain schools, or for certain types of pupils. Changes have been noted in instructional program goals and objectives, materials used, staffing patterns employed, and teaching time allocated, among other indications of shifts in emphasis.

A study now in progress is attempting to determine the district characteristics which are associated with productive use of results for program evaluation and modification. Preliminary evidence indicates that districts
with more decentralized curricula use the results in ways quite different from those with more uniform curricula.

The Dynamics of Local Test Use. Identification of need is a necessary but insufficient condition for change. The overall climate for change must be appropriate. Newspaper publicity about test results plays a crucial role in the creation of that climate, but opposing arguments can be made about its impact. Some maintain that publicity itself spurs people to make needed changes; others argue that it leads only to cosmetic program changes designed to take the heat off the accountable person and may even retard real change. Both positions are probably true - depending on the local situation, the level of publicity, and the support of the administration, board, and community.
Undoubtedly, the reporting format and types of scores also make a difference. One of the scores (norms) used in the California program, namely the school and district percentile ranks, is a source of great controversy, although one probably endemic to California. Districts are accustomed to reporting standardized test results from their local evaluation program in terms of the national percentile rank of the median pupil in the district. The California Assessment Program reports the position of the district mean in comparison to the averages of all other districts, a practice consistent with the voices of authority (Flanagan, 1951; Davis, 1974). The impact of these different methods is as profound as one would expect, given the smaller variation among means than among pupil scores: high-scoring districts get higher percentile ranks on the state results and lower-scoring districts receive lower percentile ranks. Districts whose median pupil is at the 40th percentile rank of a distribution of pupil scores are apt to find their district mean at about the 16th percentile rank of district averages. The press and public are less than amused at the apparent contradiction.

The Department has prepared correspondence tables between the two types of scores so a user can determine if student performance is actually better on one test than another, or if the difference is merely a function of the different distributions of quite different statistics. This information is useful in making an accurate interpretation of the results but does little to erase the public's skepticism that one score is right and the other wrong. The underlying issue is twofold: (1) Which score most accurately portrays the overall performance of a district? and (2) Which score leads the public to feelings of alarm if warranted, and to feelings of pleasure and pride if warranted? Does the district percentile rank exaggerate the differences among districts or does the pupil percentile rank minimize them? The answers are largely a matter of judgment. Guidelines from the world of statistical inference or the conventional wisdom of test interpretation are singularly useless. This problem illustrates the inherent difficulty of judging the acceptability of educational performance, even when assisted by various normative types of information.
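The arithmetic behind this apparent contradiction is easy to reproduce. In the sketch below, pupil scores and district means are treated as roughly normal with the same center but very different spreads; all parameters are invented for illustration. A district whose median pupil sits at the 40th pupil percentile then lands near the 16th percentile of district averages.

    from statistics import NormalDist

    # Invented parameters: pupil scores and district means share a center,
    # but district means vary far less than individual pupil scores.
    pupils = NormalDist(mu=65.0, sigma=12.0)
    district_means = NormalDist(mu=65.0, sigma=3.0)

    # A district whose median pupil sits at the 40th pupil percentile rank:
    score = pupils.inv_cdf(0.40)

    pupil_rank = 100 * pupils.cdf(score)             # about 40, by construction
    district_rank = 100 * district_means.cdf(score)  # about 16: a much lower rank
    print(round(pupil_rank), round(district_rank))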
State Level Impact

The legislation establishing the statewide assessment program implied three purposes or functions:

Evaluating the Effectiveness of School Programs. The statewide assessment results are used increasingly in examining achievement trends of students
Assessment~California dents in specially funded programs. The basic skill components of virtually all statewide programs have been evaluated, at least in part, using California Assessment Program information. Examples are Early Childhood Education, Miller-Unruh Basic Reading Program, Educationally Disadvantaged Youth, Indian Education, Bilingual Education, Educationally Handicapped, Mentally Gifted Minors, Migrant Education, and a variety of other innovative and experimental projects and programs. Information from statewide assessment is very suitable for state-level program evaluation for several reasons. In addition to the obvious benefit of common information when searching for program impacts, a second advantage is that the results can be automatically separated for pupils in different programs, thereby requiring no additional testing. A third strength of California Assessment Program data for program evaluation is the provision for partial statistical control of pre-existing differences of project participants, for either pupil- or school-level analyses, by using the same background variable information used in calculating the comparison score bands. The most powerful analysis to date, applied to the Early Childhood Education reform effort, uses school residuals as outcome measures. The technique compares the changes in mean residuals for groups of project schools over years. This technique helps to control for some portion of the pre-existing differences between groups for which the residuals themselves fail to compensate. At the statewide level, the findings have contributed to a number of program changes. The decline in mathematics achievement scores at grade 6 a few years ago triggered a special study which identified specific problem areas. 3 These findings led to changes in the curriculum framework, to the criteria for textbook adoption, and eventually to a shift of emphasis in the adopted textbooks and the instructional program itself. The declining scores in the area of written expression several years ago (preceding the current publicity) led to a special study of the weaknesses in students' writing (California, 1977a). The Department used the information from this study and other projects to launch a writing improvement project throughout the state with the help of a grant from the National Endowment for the Humanities. A l l o c a t i n g R e s o u r c e s to Schools (& Pupils) with the Greatest E d u c a t i o n al Needs. The allocation of special state funds to the lowest achieving
districts was, in fact, one of the main reasons for the origin of one part of the statewide testing program - the primary grades reading testing required by the Miller-Unruh Basic Reading Program. This resource allocation function has been extended to other state and federal programs, most notably to the Early Childhood Education reform effort. In this program, however, the ground rules are quite different - test results serve as one basis of financial reward for greater accomplishment rather than financial compensation for greater need. Districts are given funds to extend the restructuring efforts to additional schools within the district, based in part on positive test results (residuals and changes in residuals).
³The general profile of test score trends in California has closely paralleled national trends, i.e., rising scores in the lower grades and declining scores in the upper grades.
A multitude of research studies conducted by various state agencies, federal agencies (such as the Office of Civil Rights and the U.S. Office of Education), universities, and private agencies have used statewide testing program results in studying the effectiveness of education and equality of educational opportunity for economically disadvantaged students.

Identifying Successful Practices. The 1972 advisory committee emphasized a statewide assessment program's potential for identifying those factors responsible for the unusual success of some schools, assuming such schools exist (see Klitgaard & Hall, 1973) after controlling for background variables. The Department has only begun to explore this post hoc approach to educational research, that of observing the effects of natural variation.
One of the most carefully controlled and documented efforts of the Department was a school effectiveness study (California, 1975). Many of the findings related to the role of the principal, pupil time on task, and the effects of different methods of individualizing instruction. They fit nicely into the mosaic of recent knowledge of teacher and school effects.
A more recent study focused on a contrast between very low-achieving schools which had improved and other very low-achieving schools which it seemed were degenerating even further. All schools were receiving special funds to restructure their elementary school programs. The findings present a vivid picture of the differential impact of schools which implement widescale changes for the sake of true improvement in contrast to that of schools which apparently were making cosmetic changes primarily to procure funding and please some "significant others." The findings contained strong implications for the role of a state agency in fostering true reform and lasting change.
AN ASSESSMENT OF THE PROGRAM

The program is evolving. Many changes have been made and more undoubtedly lie ahead. Very few features of the program are uncontested successes, but the process of relying upon content advisory committees in the development of the tests is nearly one, as is the use of matrix sampling. The ratio of the amount of information yielded to the amount of time taken by testing represents a bargain to most of the school personnel affected. The maximizing of the amount of subskill information reported undoubtedly contributes to this feeling, though its use is still rather limited.

Although generally superior to a "raw" score, the percent correct score itself is not without potential for misunderstanding. Not only does it become confused with percentile ranks, but it also resembles too closely the grading scale many readers came to know and loathe in their school days. Consequently it becomes very difficult for them to think of a percent correct of 65, for example, as anything but failure - even if it represents performance in an advanced or complex skill area.

The concept of the comparison score band presents philosophical and technical problems, although the idea of a unique and appropriate norm for interpreting the results for a unique school is patently sensible. The social implications for certain cultural and minority groups are another matter - one which currently surfaces only occasionally but probably holds great promise for lively controversy in the future. The most obvious technical problems of the regression approach are really data problems, primarily the
difficulty of capturing background information which is accurate for the group tested (given the mobility in today's schools) and unbiased in frequency or effect for all types of pupils and skills. A federally funded project involving six states which were using or planning to use this technique helped to uncover the common problems and to focus development and research efforts (NCEA, 1976, 1977). Nevertheless, far too little research on the biases and possible model mis-specification errors has been conducted to announce a state of euphoria.

It is in the area of reporting that one feels most uncomfortable. The task of getting an enormously large and heterogeneous public to understand the results and register the "proper" degree of concern is overwhelming. This problem, of course, is complicated by the fact that the information is inevitably viewed as political fodder, a fact upsetting only to the naive (Fitzgibbon, 1975). This fact does, however, exacerbate the poorly understood dilemma faced by those presenting technical data to the public - i.e., the Hobson's choice of presenting accurate but complex data which have a high probability of being misunderstood, or of presenting simpler and slightly less precise information, which appears easier to understand but, in fact, has a high probability of leading the public to a grossly distorted understanding.

The seemingly innate need to compare local or state results to national norms without sacrificing the tests' curricular relevance forced the Department to develop its current application of equating techniques. Except for a minority who see precise national comparisons as the basic validation of all other ways of interpreting the results (rather than as one useful piece of supplementary information), acceptance of the equating methods has been gratifying. Although the method does provide recent information on several national tests at low cost and a minimum of testing, one has the nagging feeling that there must be a better way - better for the publishers, who encounter such vigorous lack of cooperation from school personnel in norming studies, and better from the statewide assessment perspective. Discussions involving Department personnel and publishers have touched on ways in which cooperative arrangements could be formed to exploit the large amount of test and non-test information collected by statewide assessment programs in order to assemble smaller, yet more representative samples with less total testing (perhaps with the assistance of latent trait models). Possibly the recent federal attention to testing will provide the resources to allow the exploration of such arrangements.

The publicity local test results receive and the impact of this publicity on school personnel have led to a situation in which many school personnel, while eager to reaffirm their own integrity, are very worried about the rumors they hear (and perpetuate) that their colleagues in neighboring schools or districts are engaging in a variety of devious practices designed to raise test scores. These rumors usually turn out to be only rumors. Too many competing forces are involved in most districts to allow for a systematic cheating or coaching operation. The large number of test items involved at each grade level also militates against such practices.
The current practice of using the same large set of items each year does accommodate the intense interest in charting progress from year to year, but it is far from a fulfillment of true domain-referencing and matrix-sampling technology. New items should be randomly selected and used each year. Such a procedure (Shoemaker, 1975) would not only remove the coaching factor that may be involved in even small
increases for a school or the state as a whole, but also control for any gradual, unintentional teaching toward specific items rather than toward the skills themselves. Obstacles to this ultimate solution involve item development and field testing costs, typesetting costs for new tests, and the need to keep new items at the same difficulty value as existing items so that changes can be clearly interpreted. Perhaps application of latent trait theory can assist in the longitudinal comparisons yet allow for introduction of new items with some variation of difficulty values. The potential of this technique is now being explored not only to meet these comparative needs, but also to report the skill area results in a manner which enhances their usefulness to teachers and curriculum personnel. The practical test assembly problems must not be minimized, however; the sheer volume of work and time required to balance the test forms in length, content, difficulty, and a variety of other stimulus-and-response characteristics cannot be appreciated by those who have not attempted it. One can only hope that a sustained effort over the years to develop a large and sufficiently diverse item pool will allow this procedure to become a reality.
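The kind of latent trait solution alluded to above can be illustrated with the one-parameter (Rasch) model. In the sketch below (invented item difficulties; not the program's implementation), ability is estimated from calibrated item difficulties, so forms built from different items in different years still yield estimates on a common scale.

    import math

    def rasch_ability(difficulties, responses, iterations=50):
        """Estimate ability under the one-parameter (Rasch) model from
        calibrated item difficulties (in logits) and 0/1 responses, using
        Newton-Raphson on the log-likelihood.  Because the difficulties
        sit on a common scale, forms built from different items yield
        comparable ability estimates."""
        theta = 0.0
        for _ in range(iterations):
            probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
            gradient = sum(x - p for x, p in zip(responses, probs))
            curvature = -sum(p * (1.0 - p) for p in probs)
            theta -= gradient / curvature
        return theta

    # Invented difficulties for two years' item sets, on a common scale.
    items_year_1 = [-1.0, -0.5, 0.0, 0.5, 1.0]
    items_year_2 = [-0.8, -0.2, 0.3, 0.7, 1.2]   # new, somewhat harder items
    responses = [1, 1, 1, 0, 0]                  # the same raw score on each set
    print(round(rasch_ability(items_year_1, responses), 2))
    print(round(rasch_ability(items_year_2, responses), 2))  # higher estimate on the harder set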
LOOKING TOWARD THE FUTURE

Proposed legislation provides for some rearranging of the program: the deletion of testing in grade 2, the expansion of testing in grade 3 to include language and mathematics as well as reading, and the addition of grade 8 with testing in reading, language, and mathematics. These changes were proposed to increase the information's usefulness for program evaluation purposes - reading at grade 3 was too narrow, grade 2 testing was deemed not to add sufficient information to justify its cost, and grade 8 fills the large gap between grades 6 and 12 and provides a basis for evaluating a statewide school reform effort in secondary education.

Other predictable changes and shifts in emphasis will also be designed to increase the data's utility for evaluating programs, such as more use of the results for selecting sample schools for special in-depth investigations, more use of the total state pupil-level data file to follow students from grade to grade or to compare program and non-program participants while controlling for pre-existing differences, and more use of results for allocating funds on both a "need" basis and an "exemplary" program basis. Finally, more time and effort are to be devoted to analysis of the data so that the ratio of (a) time and resources allocated to collection of data to (b) analysis of results can be reduced from the 100 to 1 estimated by the 1972 Advisory Committee to some more balanced proportion. Such a change is essential before the program will fulfill its potential as a tool for improving educational programs.
REFERENCES

Advisory Committee. Report of Advisory Committee on statewide testing program. Sacramento: California Assembly Committee on Education, 1972.

BUROS, O.K. Fifty years in testing: Some reminiscences, criticisms, and suggestions. Educational Researcher, 1977, 6, 9-15.

California State Department of Education. California school effectiveness study. Sacramento: Author, 1975.
California State Department of Education. Handbook for reporting and using test results. Sacramento: Author, 1976.

California State Department of Education. An assessment of the writing performance of California high school seniors. Sacramento: Author, 1977(a).

California State Department of Education. Student achievement in California schools, 1976-77 annual report. Sacramento: Author, 1977(b).

California State Department of Education. Technical report to the California assessment program. Sacramento: Author, 1977(c).

CRONBACH, L.J. Test validation. In R.L. Thorndike (Ed.), Educational measurement. Washington, D.C.: American Council on Education, 1971.

CRONBACH, L.J. Research on classrooms and schools: Formulation of questions, design, and analysis. Stanford, Calif.: Stanford Evaluation Consortium, 1976.

DAVIS, F.B. Standards for educational and psychological tests. Washington, D.C.: American Psychological Association, 1974.

DYER, H.S. The Pennsylvania plan. Science Education, 1966, 50, 242-248.

DYER, H.S. Can we measure the performance of educational systems? NASSP Bulletin, May 1970, 96-105.

DYER, H.S., LINN, R.L., & PATTON, N.J. A comparison of four methods of obtaining discrepancy measures based on observed and predicted school system means on achievement tests. American Educational Research Journal, November 1969, 6, 591-605.

Educational Testing Service (in collaboration with the Education Commission of the States). State educational assessment programs, 1973 revision. Princeton, N.J.: ETS, 1973.

FITZGIBBON, T.J. Political use of education test results. Focus on Evaluation (Monograph No. 2). New York: Harcourt Brace Jovanovich, 1975.

FLANAGAN, J.C. Units, scores, and norms. In E.F. Lindquist (Ed.), Educational measurement. Washington, D.C.: American Council on Education, 1951.

Joint Interim Committee. Report of the Joint Interim Committee on the public education system. Sacramento: Senate of the State of California, 1961.

KLITGAARD, R.E., & HALL, G. A statistical search for unusually effective schools (R-1210-CC/RC). Santa Monica, Calif.: The Rand Corporation, 1973.

MILLMAN, J. Criterion-referenced measurement. In W.J. Popham (Ed.), Evaluation in education. Berkeley, Calif.: McCutchan Publishing, 1974.

National Council on Educational Assessment. Proceedings of the national forum for the advancement of state educational assessment programs (Vol. 1, Nos. 1, 2). Trenton, N.J.: NCEA, 1976, 1977.

PANDEY, T.N., & CARLSON, D. Assessing payoffs in the estimation of the mean using multiple matrix sampling designs. In D.N.M. de Gruijter & L.J.T. van der Kamp (Eds.), Advances in psychological and educational measurement. London: John Wiley & Sons, 1976.

SHOEMAKER, D.M. Toward a framework for achievement testing. Review of Educational Research, 1975, 45, 127-147.

TYLER, R.W. What is an ideal assessment program? Sacramento: California State Department of Education, July 1968. Reprinted by the Research Consortium for Educational Assessment, 1970.