
Studies in Educational Evaluation 30 (2004) 61-85 www.elsevier.com/stueduc

ON THE USE OF STUDENTS' SCIENCE NOTEBOOKS AS AN ASSESSMENT TOOL1

Maria Araceli Ruiz-Primo* and Min Li**
*Stanford Education Assessment Laboratory, Stanford University, USA
**Department of Educational Psychology, University of Washington, Seattle, USA

The success of science education reform relies on the quality of instruction that takes place in classrooms. It is expected that the opportunities to learn science made available to students should be appropriate, meaningful, and rich. Consistent with the National Science Education Standards (National Research Council, 1996), we believe that students should not be held accountable for achievement unless they are given adequate opportunity to learn science. Therefore, both students' performance and opportunity to learn science should be assessed. Furthermore, inferences about student learning and achievement should be based not only on achievement tests, but also on students' classroom performances and products (Pellegrino, Chudowsky, & Glaser, 2001).

This article reports some findings from a series of studies in which students' science notebooks are used as a method for examining both student performance and some aspects of opportunity to learn. In what follows we describe the framework we (Ruiz-Primo, 1998; Ruiz-Primo, Li, Ayala, & Shavelson, 1999, 2000, in press) have developed to conceptualize notebooks as an assessment tool. We then provide evidence about the technical qualities of notebooks and the information they can provide on student performance and opportunity to learn.

A Framework for Conceptualizing Students' Science Notebooks as an Assessment Tool

Science notebooks or journals are seen as a log of what students do in their science class. For example, students may describe the procedures they use, observations they make, conclusions they arrive at, and their reflections. Science notebooks can be viewed as a written account, with more or less detail, and of diverse quality, of what students do and learn in their science class. We (Ruiz-Primo, 1998; Ruiz-Primo et al., 1999, 2000, in press; Ruiz-Primo, Li, & Shavelson, 2004) have defined notebooks as a compilation of entries (or items in a log) that provide a record, at least partially, of the instructional experiences a student had in her or his classroom for a certain period of time (e.g., a unit of study). Since notebooks are generated during the process of instruction, their characteristics vary from entry to entry as they reflect the diverse set of activities in a science class.

Research has increasingly shown that writing notebooks plays an essential role in learning science. First, being able to write, record data, document activities, organize information, present findings, and communicate reflections are necessary skills involved in doing science. These skills are authentic to what scientists do in their work (e.g., Cuozzo, 1996; Lemke, 1990; Rowell, 1997; Southerland, Gess-Newsome, & Johnson, 2000). Indeed, a part of learning science is to master the language or genre of science, such as constructing written texts for specific purposes and specific audiences (e.g., Christie, 1985; Martin, 1989; NRC, 1996).

Second, writing notebooks can facilitate students' learning of science. Such a process prompts students to apply new information and integrate it with existing ideas to make sense of them and develop explanations (e.g., Northwest Regional Educational Laboratory, 1997). As asserted by Glynn and Muth (1994, p. 1065), "when students write about their observations, manipulations and findings, they examine what they have done in greater detail, they organise their thoughts better, and they sharpen their interpretations and arguments". Thus, writing science notebooks is considered a constructive, reflective process that helps students clarify scientific concepts (Aschbacher & McFee-Baker, 2003; Hannelink, 1998). Furthermore, notebooks provide a record of students' learning that they can review to see changes in their own conceptions and start to develop their metacognitive and self-assessment skills (e.g., Cuozzo, 1996; Willison, 1996).

Third, many argue that writing notebooks provides students a way of expressing enjoyment, personal meaning, and their ownership of learning, since they can write down as well as reflect on their thoughts or feelings in notebooks (e.g., Baker & McLoughlin, 1994; Dixon, 1967; Florio & Clark, 1982). Finally, and more relevant to the scope of this study, previous research has found that science notebooks can be a good source of information for teachers to examine students' understanding, analyze their misconceptions, and provide students with the necessary feedback (e.g., Audet, Hickman, & Dobrynina, 1996; Cuozzo, 1996; Dana, Lorsbach, Hook, & Briscoe, 1991; Fellows, 1994; Heinmiller, 2000; Hewitt, 1974; McColskey & O'Sullivan, 1993; Northwest Regional Educational Laboratory, 1997; Shepardson & Britsch, 1997).

We have proposed another perspective and function of science notebooks (Ruiz-Primo et al., 1999, 2000, in press; Ruiz-Primo, Li, & Shavelson, 2004). We proposed notebooks as an unobtrusive assessment tool to be used not only by teachers but also by individuals outside the classroom (e.g., school district personnel, program evaluators).
Our framework was developed around the type of information that can be collected in science notebooks.


Type of Information

What information can be collected from an outsider's perspective when using students' science notebooks as an assessment tool? The most evident answer is information on students' performance.2 In addition, we proposed (Ruiz-Primo, 1998; Ruiz-Primo et al., 1999, 2000) that students' notebooks could also be used to collect information about opportunity to learn.

Student Performance

Our framework focused on two aspects of student performance: quality of communication and understanding. The relevance of the latter aspect needs no explanation. But why do we focus on communication? We believe that constructing sound and scientifically appropriate communications helps students not only to understand scientific concepts and procedures better, but also to participate in a scientific community. Not knowing the "rules of the game" alienates students from the scientific culture and keeps them scientifically illiterate (e.g., Bybee, 1996; Lemke, 1990; Martin, 1989, 1993). Furthermore, the National Science Education Standards in the United States (NRC, 1996) consider both aspects, communication and understanding, as fundamental to a focus on both performance- and product-based assessments.

The question is, then, whether it is possible to collect information about students' communication and understanding using their science notebooks as a source of information. If we consider science notebooks as one of the possible artifacts that students produce in class, evidence about students' communication and understanding might be collected from the written/schematic/pictorial accounts one can find in their notebooks. For example, if a student's notebook entry shows a description of an experiment, we could ask whether the description is presented clearly enough to make the procedure replicable by others.

To approach student understanding we followed the idea of dimensions of scientific literacy suggested by Bybee (1996). We focused on conceptual and procedural understanding. Conceptual understanding involves the functional use of scientific words/vocabulary appropriately and adequately, as well as relating the concepts represented by those words (i.e., understanding facts, concepts, and principles as parts of conceptual schemes). Procedural understanding emphasizes scientific inquiry - the processes of science. The processes of science include not only procedural skills (e.g., observing, measuring, hypothesizing, and experimenting), but also using evidence, logic, and knowledge to generate research questions and construct explanations (Duschl, 2003). For example, students should be encouraged to gather data as well as to use the data as evidence to support their conclusions and to critique others' experiments.

Opportunity to Learn

It seems that competent teachers have the following three characteristics (Sadler, 1989, 1998): (1) they design and select ways (e.g., instructional tasks) to monitor their students' progress in terms of their understanding and performance in science; (2) they communicate to students their progress and encourage them to go further in their learning of science; and (3) they use the information gained with these instructional tasks to monitor their teaching. These characteristics, then, can be considered indicators of teaching quality, an aspect related to the opportunity students have to learn science (see National Science Education Standards/NRC, 1996).

Based on this information, we have considered three aspects of opportunity to learn that can be measured using students' notebooks as a source of information: unit implementation, quality of notebook entries, and teacher feedback to student performance. The rationale behind these aspects is the following: There is evidence that notebooks reflect with great fidelity what students do and what teachers focus on in the science class (Alonzo, 2001; Baxter, Bass, & Glaser, 2000). If science notebooks are seen as an account of what students do in their science classroom, it should be possible to map the instructional activities implemented in a science classroom when information from individual science notebooks is aggregated at the classroom level. If none of the students' notebooks from a class has any evidence that an instructional activity was carried out, it is unlikely that the activity was implemented. Furthermore, each notebook entry can also be analyzed according to the inferred demands imposed on the students. We also think that teachers should consider science notebooks as one natural strategy to monitor their students' learning progress. If teachers communicate their progress to students and encourage them to improve their learning, at least some evidence of this communication should be found in the notebooks. If teachers adjust their instructional practices based on the information gained as they monitor student progress, these adjustments should also be partially reflected in the quality of the students' notebook entries.

Science Notebooks Assessment Approach

Consistent with the framework described above, science notebooks as an assessment tool can provide information at two levels: (1) individual level - a source of evidence bearing on a student's performance over the course of instruction; and (2) classroom level - a source of evidence of the opportunities students had to learn science. The assessment approach views science notebooks as a compilation of communications with diverse characteristics. Each of these communications is considered a notebook entry. The characteristics of notebook entries vary since each entry may ask students to complete different tasks depending on the instructional activity implemented on a particular day (e.g., write a procedure or define a concept). Each notebook entry is assessed on the following aspects:



• Unit Implementation - What intended instructional activities were implemented as reflected in the students' notebooks? Were any other additional activities implemented that were appropriate to achieve the unit's goal?
• Type of Entry - What are the characteristics of the notebook entries observed in the students' science notebooks? What are the inferred demands imposed on the students in these entries?
• Student Performance - Were students' communications appropriate according to the scientific genres? Did students' communications indicate conceptual and procedural understanding of the content?
• Teacher Feedback - Did the teacher provide helpful feedback on students' performance? Did the teacher encourage students to reflect on their work?

Unit Implementation

Unit implementation is based on the idea that there is an interest in documenting the implementation of a particular curriculum. For example, an individual outside the classroom is interested in knowing what is enacted in the classroom versus what is prescribed in an intended curriculum. To answer the question about the implementation of the intended curriculum (What intended instructional activities were implemented as reflected in the students' notebooks?), we first define the instructional activities that count as evidence that the unit was implemented. The specification of these activities should be based on an analysis of the intended curriculum. An inventory of the major activities serves as a verification list for capturing the implementation of the basic instructional activities, as well as "other" activities implemented but not required by the curriculum (i.e., Were any other additional activities implemented that were appropriate to achieve the curriculum/unit goal?). The verification list should follow the organization of the curriculum or unit being analyzed (Figure 1), and it is a central characteristic of the scoring used in our approach (see below).

Evidence of the implementation of an instructional activity can be found in different forms in a student's notebook: description of a procedure, hands-on activity report, interpretation of results, and the like. Variation in these forms is expected across activities and students' notebooks. Furthermore, notebook entries may vary from one student to the next within the same classroom for a number of reasons (e.g., the student was absent when a particular instructional activity was implemented). The variety of notebook entries can be even wider when students' science notebooks are compared across different classrooms. To tap the variation in notebook entries within and between classes, entries in the notebooks are linked to the intended instructional activities specified in the verification list based on the curriculum analyzed.

In the verification list presented in Figure 1 it is possible to see that Activity 1, Separating Mixtures, is organized in four parts (P) and a section of Reflective Questions. For example, within P1, Make and Separate Mixtures, students conduct three activities (Making, Screening, and Filtering Mixtures). As part of Making Mixtures, students fill out two activity sheets provided by the curriculum developers. It can also be inferred that the definitions of mixtures, solutions, and diatomaceous earth are introduced in this part. In P2, Weigh and Separate a Salt Solution, students are involved in discussing some Review Questions (RQ) about the previous activities they conducted. Unit implementation is evaluated dichotomously based on the question, "Is there any evidence in the student's notebook that activity 'X' was implemented?" A score of 1 denotes the affirmative and 0 denotes absence.
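To illustrate the mechanics of this dichotomous scoring, the sketch below (a minimal illustration in Python, ours rather than the instrument actually used in the studies; the activity names are abbreviated from Figure 1) scores one student's notebook against a verification list and then aggregates results to the classroom level.

    # Minimal sketch of dichotomous unit-implementation scoring against a
    # verification list. Activity names are abbreviated from Figure 1.
    VERIFICATION_LIST = [
        "Making Mixtures",
        "Screening Mixtures",
        "Filtering Mixtures",
        "Defining 'Mixtures'",
        "Defining 'Solutions'",
    ]

    def score_implementation(evidence):
        """Score 1 if the notebook shows any evidence of the activity, else 0."""
        return {activity: int(activity in evidence)
                for activity in VERIFICATION_LIST}

    def classroom_implementation(student_scores):
        """An activity counts as implemented at the classroom level if at
        least one student's notebook shows evidence of it."""
        return {activity: int(any(s[activity] for s in student_scores))
                for activity in VERIFICATION_LIST}

    one_student = score_implementation({"Making Mixtures", "Defining 'Mixtures'"})
    # -> Making Mixtures: 1, Screening Mixtures: 0, Filtering Mixtures: 0, ...

The classroom-level rule mirrors the aggregation logic described earlier: if no notebook in a class shows evidence of an activity, it is unlikely the activity was implemented.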


Mixtures Unit
1. Separating Mixtures

P1. Make and separate mixtures
  Making Mixtures
  Activity Sheet: Separating Mixtures - Part 1
  Activity Sheet: Separating Mixtures - Part 3
  Screening Mixtures
  Filtering Mixtures
  Defining "Diatomaceous Earth"
  Defining "Mixtures"
  Defining "Solutions"

P2. Weigh and separate a salt solution
  Making & Weighing a Salt Solution
  Activity Sheet: Separating a Solution - Part 3
  RQ: What is a mixture?
  RQ: How can a mixture be separated?
  RQ: Can you separate a solution with a screen/a filter?
  RQ: How might you separate the ingredients in a solution?
  Defining "Evaporation"

P3. Salt crystals
  Evaporating a Solution - Saltwater Solution
  Activity Sheet: Separating a Solution - Part 6
  Activity Sheet: Separating Mixtures Review - Part 1
  Activity Sheet: Separating Mixtures Review - Part 2
  Defining "Crystal"

P4. Separate a mixture - gravel, powder, salt
  Challenge: Design a method to separate a dry mixture

Reflections on the Activity
  Integrating: How are screen & filter similar/different?
  Integrating: How to separate flour & water and sugar & water?
  Integrating: How to find out if citric acid forms a solution?
  Integrating: How is a solution different from a mixture?
  Valuing: What is your favorite mixture and solution?
  Thematic: What characteristics of solid materials allow them to be separated?
  Thematic: Why create and separate mixtures?

Figure 1: Verification List of Activity 1 of the FOSS Mixtures and Solutions Unit

Type of Entry

As mentioned before, the characteristics of notebook entries vary from day to day according to what students were asked to do (e.g., write a procedure or explain a concept). The key issue, from the assessment perspective, is to identify the notebook entries according to what students were asked to do. We identified fourteen general entry categories based on students' science notebooks from different classrooms and the type of activities that students are supposed to do in a science class (see National Science Education Standards/NRC, 1996).3 Notebook entries can be found in different forms of communication: verbal - written/text - (e.g., explanatory, descriptive, inferential statements); schematic (e.g., tables, lists, graphs showing data); or pictorial (e.g., drawing of apparatus). Moreover, some of the categories proposed also include sub-types of entries according to the form of communication. For example, a definition can be verbal or pictorial (e.g., drawing of a pendulum system); therefore, the type of entry, definition, includes both sub-types of definitions. Table 1 presents the types and sub-types that we identified. We assumed that all the entries provide information, at least partially, about the students' conceptual and procedural understanding and communication skills.

Table 1: Types of Notebook Entries

Type of Entry (Code) - Sub-Types
Defining (1, 2) - Defining, verbal; Defining, pictorial
Exemplifying (3) - No sub-type
Applying Concepts (4) - No sub-type
Predicting/Hypothesizing (5) - No sub-type
Reporting Results (6, 7) - Reporting results, verbal; Reporting results, graphic
Interpreting Data and/or Concluding (8) - No sub-type
Reporting & Interpreting Data and/or Concluding (9, 10) - Reporting & interpreting, verbal; Reporting & interpreting, graphic
Content Questions/Short Answer (17) - No sub-type
Quick Writes (18, 19, 20) - Contextualizing science; Narrative affective; Narrative reflections
Reporting Procedure (11, 12, 13) - Procedure recount; Procedure instructions; Procedure directions
Reporting a Quasi Experiment (14) - No sub-type
Reporting an Experiment (15) - No sub-type
Designing an Experiment (16) - No sub-type
Assessment (21, 22) - Simple forms (e.g., short answer); Complex forms (e.g., performance assessments)
Don't Care About Activity (23) - No sub-type

To know more about the quality of the notebook entry, each is coded at two levels. First, a code is used to identify the type of entry (e.g., an entry in which an experiment is reported is coded as "15"). Once the type of entry is identified, a set of second-level codes is used to define more precisely the characteristics of the entry. Second-level codes are of three types: (a) the characteristics of the investigations/experiments reported in the entry, if appropriate (e.g., replications of the experiments are implied, or more than one level of the independent variable is studied, or both); (b) the format of the entry (e.g., the entry does not have a formal prompt, or the format of the entry is provided by teachers or curriculum developers); and (c) general characteristics of the entry (e.g., the entry is repeated in another part of the notebook, or the entry has a supplemental picture/graph, or the content of the entry is clearly copied from the textbook). For example, an entry coded as 15 - reporting an experiment - can have a second-level code "3" (i.e., 15.3) if replications of the experiment/investigation are done. Also, that entry can have an additional code "6" (i.e., 15.3.6) if the format of the entry is provided to the students (e.g., a printed sheet for students to report the experiment).
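As a way of making the two-level coding concrete, the following sketch (ours, in Python; only the codes explicitly mentioned in the text are included, so the lookup tables are otherwise illustrative) expands a dotted code such as 15.3.6 into its components.

    # Sketch of the two-level entry coding. "15" is the first-level type
    # (reporting an experiment; see Table 1); later fields are second-level codes.
    ENTRY_TYPES = {"15": "Reporting an Experiment"}
    SECOND_LEVEL = {
        "3": "replications of the experiment/investigation are done",
        "6": "format of the entry is provided to the students",
    }

    def describe_entry(code):
        """Expand a dotted entry code such as '15.3.6' into a description."""
        first, *rest = code.split(".")
        parts = [ENTRY_TYPES.get(first, "type " + first)]
        parts += [SECOND_LEVEL.get(c, "characteristic " + c) for c in rest]
        return "; ".join(parts)

    print(describe_entry("15.3.6"))
    # Reporting an Experiment; replications of the experiment/investigation
    # are done; format of the entry is provided to the students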

Student Performance

For each notebook entry identified, students' performance can be scored as to the quality of the communication and the understanding. Both aspects are scored according to the requirements of the task. Two questions guide the scoring of performance: Did students' communication correspond to the appropriate communication genre? Did students' communication indicate conceptual and procedural understanding of the content presented?

The scoring of the quality of communication takes the perspective of genres in scientific communication (Lemke, 1990; Martin, 1989, 1993).4 This approach links types of entries with genres. We defined general characteristics of communication for each type of entry (i.e., Does the student communication have the distinctive elements of the written genre at hand? See Table 2).5 Quality of communication is evaluated on a four-point scale: 0 - Incoherent and not understandable communication (e.g., incomplete sentences); 1 - Understandable but not using the characteristics of the genre (e.g., examples are provided but the category to which the examples belong is not provided); 2 - Understandable and uses some of the basic characteristics of the genre (e.g., the category to which the examples belong is provided, but only in the form of a title, not making the logical relationship explicit); and 3 - Understandable and uses all the basic characteristics of the genre (e.g., the category to which the examples belong is provided and the logical relationship is made explicit). If a student's communication was scored "0" we did not attempt to score the student's understanding.

Conceptual and procedural understanding are evaluated on a four-point scale: (NA) Not Applicable (i.e., the instructional task does not require any conceptual or procedural understanding); 0 - No Understanding (e.g., examples or procedures described are completely incorrect); 1 - Partial Understanding (e.g., relationships between concepts or descriptions of observations are only partially accurate or incomplete); 2 - Adequate Understanding (e.g., comparisons between concepts or descriptions of an investigation plan are appropriate, accurate, and complete); and 3 - Advanced Understanding (e.g., communication focuses on justifying responses/choices/decisions based on the concepts learned or provides relevant data/evidence to formulate the interpretation).

Table 2: Examples of the Criteria Used to Score Quality of Communication

Defining, Pictorial
0 (Incoherent and not understandable communication): Incoherent, not understandable communication.
1 (Understandable but not using the characteristics of the genre): Representation can be easily identified (e.g., as a drawing of a pendulum), BUT most of the important parts are not labeled (i.e., it can have one or more labels, but not the most important ones). Representation may or may not have a title, AND may or may not have technical terms if appropriate (e.g., student uses "little pieces of glasses" instead of "crystals").
2 (Understandable and uses some of the basic characteristics of the genre): Representation can be easily identified AND most of the important parts are labeled. Representation may or may not have a title, OR may or may not have technical terms if appropriate.
3 (Understandable and uses all the basic characteristics of the genre): Representation can be easily identified AND most of the important parts of the representation are labeled, AND has a title, AND has technical terms if appropriate.

Exemplifying/Categorizing
0: Incoherent, not understandable communication.
1: Exemplifying - examples/instances are provided BUT the category to which the examples belong is not provided AND the logical relationship between them is not provided. Categorizing - the attributes that define a category (e.g., solid materials that do not go through a filter) are provided BUT not the category name OR the logical relationship between them.
2: Exemplifying - category is provided BUT only in the form of a title, not making the logical relationship explicit (e.g., A is an example of B). Categorizing - the attributes that define a category AND the category name are provided BUT the logical relationship between them is not explicit.
3: Exemplifying - student provides the category to which examples/instances belong AND makes the logical relationship explicit (e.g., A is an example of B). Categorizing - the attributes that define a category AND the category name AND the logical relationship between them are provided.

Reporting a Procedure
0: Incoherent, not understandable communication.
1: Procedure covers some of the important steps, BUT steps are not presented in a clear sequence (i.e., numbered), so it is difficult to replicate the procedure. Procedure may or may not have a title, AND/OR may or may not have technical terms when appropriate.
2: Procedure covers most of the important steps, AND steps are presented in a clear sequence (i.e., are numbered or clearly sequenced), so the procedure can be replicated. Procedure may or may not have a title, OR may or may not have technical terms if appropriate, BUT not both.
3: Procedure covers all of the important steps, AND steps are presented in a clear sequence (i.e., are numbered), so the procedure can be replicated, AND has a title, AND has technical terms if appropriate.

Teacher Feedback

Feedback is usually defined in terms of information about how successfully something is being, or has been, done. However, providing feedback is more than making a judgment about a student's work or performance. Feedback can be considered feedback only if it can lead to an improvement of the student's competence (Ramaprasad, 1983; Sadler, 1989). If the information is simply recorded, passed to a third party, or is too coded (e.g., a grade or a phrase such as "incomplete!") to lead to an appropriate action, the main purpose of the feedback is lost and it can even be counterproductive (Sadler, 1989, 1998). Effective feedback should lead students to be able to judge the quality of what they are producing and to monitor themselves during the act of production (Sadler, 1989, 1998). To assess the quality of students' work or performance, a teacher must possess a concept of quality appropriate to the task, one that allows her/him to recognize and describe a fine performance and to indicate how a poor performance can be improved (Sadler, 1989).

Example of a Student's Notebook Entry

Score
Unit Implementation: 1. This entry can be linked to the FOSS Variables unit, Activity 1, Pendulum, Experiment 1.
Type of Entry: 15.3. The first part of the code indicates that the entry is "reporting an experiment," and the second part, "3," indicates that there is evidence of replications.
Quality of Communication: 1. The communication quality of this student's report was poor. The procedure was not replicable since the description was incomplete; it is not clear how the outcome was measured, and sub-titles were missing.
Procedural Understanding: 1. The interpretation of the results is not appropriate. The student's conclusion reflects a misconception: that a variable is only a variable if it has an effect on the outcome, and that if it does not, then the variable studied is not a variable.
Teacher Feedback: -2. The student needs feedback from the teacher, not only for improving the quality of the communication of the experiment, but also for helping her/him understand the concept of variable. The teacher's feedback to this student was scored as "provided but incorrect" because the teacher rewarded the student's response despite the evidence of the student's misunderstanding.

Figure 2: An Example of a Student's Notebook Entry and the Scores Assigned

We approach the quality of teacher feedback using a six-level score: -2 - feedback provided, but incorrect (e.g., teacher provides an A+ for an incorrect notebook entry); -1 - no feedback, but it was needed (e.g., teacher should point out errors/misconceptions/inaccuracies in the student's communication); 0 - no feedback; 1 - grade or code-phrase comment only; 2 - comment that provides the student with direct, usable information about current performance against expected performance (e.g., the comment is based on tangible differences between current and hoped-for performance: "Don't forget to label your diagrams!"); and 3 - comment that provides the student with information that helps her to reflect on/construct scientific knowledge (e.g., "Why do you think it is important to know whether the material is soluble for selecting the method of separation?").

To bring home the scoring approach, Figure 2 provides an example of a student's notebook entry and the scores assigned across the different dimensions: unit implementation, type of entry, student performance, and teacher feedback. The scoring materials consist of two parts: (1) Notebook Scoring Form - a matrix that includes, as rows, the instructional activities to be considered as evidence that the unit was implemented and, as columns, the aspects to be scored; and (2) Criteria Table - a table that specifies codes, criteria, and examples to be used in scoring. The Notebook Scoring Form should follow the curriculum organization.

Types of Scores

Five types of scores are obtained with our assessment approach: unit implementation, quality of communication, conceptual understanding, procedural understanding, and teacher feedback. These five scores are the sum of the scores obtained for each notebook entry identified. In order to make the three student performance scores comparable within and across units, we created another three scores that reflect the mean performance of each student (total score divided by the number of entries identified in each student's notebook). These scores were named "mean scores" by Li, Ruiz-Primo, Ayala, and Shavelson (2000). The advantage of these mean scores is that they share the same scale (from 0 to 3), making it possible to compare easily the level of a student's performance on the different aspects.
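A minimal sketch of the mean-score computation, under the assumption that entries scored NA have already been excluded:

    # Mean score: sum of a student's entry scores divided by the number of
    # entries identified in the notebook, yielding the shared 0-3 scale.
    def mean_score(entry_scores):
        if not entry_scores:
            return float("nan")       # no scorable entries in the notebook
        return sum(entry_scores) / len(entry_scores)

    # e.g., communication scores for five notebook entries:
    mean_score([1, 2, 1, 1, 2])       # -> 1.4 on the 0-3 scale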

Application of the Assessment Approach: Evaluating Science Notebooks as an Assessment Tool

We examined the viability of our approach in different studies, one of them considered a pilot. In this section, we briefly report key results across studies. A detailed description of these studies can be found elsewhere (Li et al., 2000; Ruiz-Primo et al., 1999, in press; Ruiz-Primo & Li, 2001; Ruiz-Primo, Li, & Shavelson, 2004).

We collected students' science notebooks in medium-sized urban schools in the US San Francisco Bay Area. All the participant teachers/classrooms in our studies implemented the Full Option Science System (FOSS, 1993) as a science curriculum. In our studies, we focused on two units of this curriculum: Variables and Mixtures and Solutions (henceforth only Mixtures) (see Appendix A for a brief description of the FOSS curriculum and the two units). All teachers reported that they regularly used notebooks in their science classes. In the studies reported in this article, we did not provide any directions to teachers on how to use science notebooks or the characteristics notebooks should have.

In all the studies teachers were asked to rank students into five achievement groups - from the top 20% to the bottom 20% - according to science proficiency. We focused on notebooks from students in three of the five groups: top-, medium-, and low-proficiency students. For each student in our studies we collated information about student performance using performance assessments in a pretest-posttest design. Using the pretest and posttest scores, we calculated effect sizes by classroom and unit.

Analysis of Science-Notebook Entries

Figure 3 is an example of the notebook scoring form we used in our studies. The Notebook Scoring Form followed the description of the implementation presented in the teacher guide for the Variables and the Mixtures FOSS units. The Notebook Scoring Form follows the units' organization: one verification list for each activity and one for the assessments suggested in the guide (i.e., hands-on assessments, pictorial assessments, reflective questions). Each activity-verification list contained different Parts (P) that corresponded to the organization of the activity (see Figure 1). For each instructional activity specified on the Notebook Scoring Form, seven questions are asked according to the three aspects of the notebook we evaluated:

Unit Implementation
1. Is there any evidence that the unit-based instructional activity or that an appropriate extra-instructional activity was implemented?
2. Is the activity sheet/report complete?
3. What type of entry is identified in the evidence provided?

Student Performance
4. Is the communication appropriate according to the genre at hand?
5. Is there any evidence of conceptual understanding in the communication?
6. Is there any evidence of procedural understanding in the communication?

Teacher Feedback
7. Is there any evidence that the teacher provided feedback to the student's communication?

The shaded boxes (Figure 3) in the Notebook Scoring Form indicate that the criteria do not apply to the notebook entries in hand. For example, the criterion, "Completeness of Report," only applies to the "Activity Sheet". Activity sheets are provided by FOSS for students to fill out for each activity and they are considered an essential piece of the implementation of any unit activity. The blank lines allow the inclusion of different types of entries that can be found in a notebook as an indication that a particular activity was conducted, from a description of a procedure (e.g., how to create a mixture) to a description of observations (e.g., what was observed when a salt solution was filtered).

[Figure 3, a portion of the Notebook Scoring Form for Activity 1 of the Mixtures unit, is not reproduced here. It lists the Activity 1 instructional activities (e.g., Making Mixtures, the associated activity sheets, review questions, and an "Extra Activity" row for each part) as rows, with columns for unit implementation (0-1), type of entry (1-23), and the 0-3 performance scores.]

Figure 3: Example of a Portion of the Notebook Scoring Form for Activity 1, Separating Mixtures, of the Mixtures Unit

In all the studies two or three raters independently scored the students' notebooks. Raters were experts in the unit activities. Students' notebooks were always mixed and randomly ordered for scoring. Raters were unaware of the classroom to which a student belonged or the performance level of the student (top, middle, or low). Some of the students' notebooks have been used for scoring-training purposes. Across our studies we have scored more than 1,804 science notebook pages across the two units (Variables 961 and Mixtures 843).6

Some Findings

We focused on four main issues:

1. Provide information about the technical quality of the notebook assessment - Can different raters reliably score students' science notebooks? Can scores on quality of communication, conceptual understanding, and procedural understanding be interpreted as reflecting students' academic performance? And, do notebook scores bearing on student performance correlate positively with other measures of their performance?
2. Describe and analyze the types of entries most frequently found in students' science notebooks.
3. Describe teachers' notebook feedback practices.
4. Look for associations between types and quality of entries and teacher feedback, on the one hand, and students' learning, on the other.

Agreement and Reliability Across Raters

We evaluated inter-rater agreement in classifying notebook entries according to their type and characteristics, and inter-rater reliability for each type of score. Despite the variability in the number and type of notebook entries and the diversity of the forms of communication (written, schematic, or pictorial), we found across studies that raters consistently identified whether or not an instructional task was implemented, and consistently classified the type of entry (Table 3). Percent of agreement varied according to the unit. For identifying type of entry, agreement was always above 85%. The Unit Implementation coefficient is the highest across all the types of scores. Raters also consistently scored student performance. Coefficients varied by unit and across studies. However, coefficients were never lower than .80 (but see Table 3, footnote).

Table 3: Averaged Percent of Agreement for Type of Entry and Inter-rater Reliability Across Studies by Unit and Type of Score

Percent of Agreement, Type of Entry: Variables 83.38; Mixtures 84.98
Inter-rater Reliability (mean scores):
  Unit Implementation: Variables .96; Mixtures .99
  Quality of Communication: Variables .84; Mixtures .82
  Conceptual Understanding: Variables .88; Mixtures .87
  Procedural Understanding: Variables .83; Mixtures .83
  Teacher Feedback: Variables .88*; Mixtures .89

* In the pilot study, the inter-rater reliability for this unit was very low, .49, due to the misapplication of one scoring rule. If this number is considered, the averaged coefficient is .75.


Students' Performance Scores as Achievement Indicators

To examine whether the notebook scores can be considered achievement indicators, we correlated the notebook performance scores with the scores students obtained on the posttest performance assessments. We used three types of performance assessments, varying in the proximity of the assessment tasks to the characteristics of the curriculum. One performance assessment is considered close to the units studied - the assessment task is very close to the content and activities of the unit in hand. Another type of performance assessment is considered proximal - the assessment task was designed considering the knowledge and skills relevant to the unit, but the content was different from the one studied in the unit. Finally, we used a performance assessment considered distal - the assessment task is based on state/national standards in a particular domain (see Ruiz-Primo, Shavelson, Hamilton, & Klein, 2002). Across all of our studies we randomly assigned students within each classroom to one of two sequences of pretest and posttest: (1) close-close or (2) proximal-proximal (e.g., those students who took the close performance assessment as a pretest also took the close assessment as a posttest). All the students took the distal assessment at the end of the school year.

Before we present our findings on notebook scores as achievement indicators, we first present some descriptive findings about the notebook student performance scores. Results across studies have consistently shown that student performance is not high. Consider as a typical example the data collected in one of the studies in which students' science notebooks across six classrooms were scored (Ruiz-Primo et al., in press). Table 4 provides the mean scores across types of scores. Low student performance scores across units revealed that students' communication skills were not well developed and that students only partially understood the different topics addressed in the units.

Table 4: Means and Standard Deviations for Each Type of Score Across Units and Classrooms

Variables notebooks (Max = 3):
  Quality of Communication: n = 36, Mean 1.31, SD .37
  Conceptual Understanding: n = 20*, Mean 1.16, SD .58
  Procedural Understanding: n = 36, Mean 1.28, SD .39

Mixtures notebooks (Max = 3):
  Quality of Communication: Mean 1.10, SD .29
  Conceptual Understanding: Mean .99, SD .55
  Procedural Understanding: Mean 1.11, SD .31

* There was no evidence of entries focusing on conceptual understanding in two classes (12 students) and four students of other classes.

Students' notebook entries across the six classrooms did not provide evidence of high-quality scientific communication, no matter what type of entry was at hand. Notebook entries were understandable. They had some of the distinctive elements of the written genres for the scorers to be able to classify the entry (e.g., definitions had the concept and its meaning). However, students rarely used technical terms or the suitable grammatical characteristics according to the genre (e.g., use of the present tense in defining a concept). Nor did they use the appropriate genre structure and characteristics (e.g., in reporting an experiment they did not provide the experiment's purpose or conclusion). For example, we found that most of the data reported in the students' notebooks were a string of numbers without any organization at all (e.g., 12, 12, 12; meaning number of cycles on three trials), or a very rudimentary form of organization (e.g., 28 - 12, 25 - 13; the first number means the length of the string, and the second means the number of pendulum cycles).

Across studies, mean scores for conceptual and procedural understanding indicated the partiality of students' knowledge. The conceptual understanding mean score indicates that students' notebook entries that focused on conceptual understanding (e.g., providing examples of solutions) were only partially correct (e.g., some of the examples provided by the student were incorrect). Procedural understanding mean scores also indicated that students did not provide accurate, appropriate, and complete information in their entries.

Across studies, we found positive correlations among the three types of scores (.49-.73). We interpreted these results as an indication that the three aspects are related but still tapping somewhat different aspects of student performance. In what follows we present our findings on issues related to the validity of notebook scores.

Correlations between the notebook performance scores and the performance assessment scores have been, in general, consistent across studies (Ruiz-Primo et al., 1999, in press): all were positive, as expected, and the pattern varied according to the proximity of the assessments (Table 5). On average, the higher the student performance score obtained in the notebook, the higher the score obtained by the student on the performance assessments, independent of the proximity of the assessment to the curriculum studied. The pattern of the correlations observed is close to the expected pattern - higher correlations were observed with the more proximal assessments.

Table 5: Correlations and Partial Correlations Between Student Notebook Performance Score and Performance Assessment Scores of Different Proximities

Correlations, complete sample:
  Close: Variables .09 (n = 14); Mixtures .35 (n = 20)
  Proximal: Variables .54** (n = 22); Mixtures .49 (n = 16)
  Distal: Variables .34 (n = 29); Mixtures .26 (n = 29)

Correlations, without outliers:
  Close: Variables .89***b (n = 10); Mixtures .58**d (n = 19)
  Proximal: Variables .71***c (n = 20); Mixtures .61**d (n = 15)
  Distal: Variables .49**c (n = 27); Mixtures .43*a (n = 28)

Partial correlations,a without outliers:
  Close: Variables .83***b (n = 7); Mixtures .25d (n = 14)
  Proximal: Variables .55**c (n = 16); Mixtures .63**d (n = 16)
  Distal: Variables .50**b (n = 22); Mixtures .30d (n = 24)

*** Correlation is significant at the .005 level. ** Correlation is significant at the .01 level. * Correlation is significant at the .05 level.
a Reading scores were controlled. b Four outliers dropped. c Two outliers dropped. d One outlier dropped.


The pattern of correlations for the Variables unit was just as expected, but for the Mixtures unit, the correlation between the notebook performance score and the proximal performance assessment score was higher than the correlation with the close performance assessment score. Even when the correlations were adjusted for general ability (i.e., reading score), the pattern and the magnitude did not change dramatically for the Variables unit, but they dropped almost by half in the Mixtures unit for the close and distal assessments. We interpreted our results across studies as indicating that notebook performance scores may serve as an achievement indicator, even at a distal level, when the content of the assessment is not based on the content of the curriculum students studied in their science classes.
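The adjustment for reading score is a first-order partial correlation. The article does not spell out the computation, but a standard sketch (with invented correlation values) is:

    # First-order partial correlation: correlation between notebook score (x)
    # and performance assessment score (y), with reading score (z) partialled
    # out: r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)).
    from math import sqrt

    def partial_corr(r_xy, r_xz, r_yz):
        return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

    # Illustrative values: a .60 raw correlation shrinks once a common
    # association with reading is removed.
    partial_corr(0.60, 0.50, 0.55)    # -> about .45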

Characterizing Types of Entries

Across studies we found that most of the science notebook entries, across all the classrooms and units, were pertinent/appropriate to the learning of science. Few notebook entries were classified in the "Don't care about activity" category. However, the intellectual demands imposed on the students in the entries found were usually low. We have found that teachers in the classrooms analyzed tended to ask students to record the results of an experiment or to copy definitions. These types of tasks by themselves can hardly help students to improve their understanding.

Consider as a sample the results we obtained in our study in which ten classrooms were analyzed (Ruiz-Primo et al., 2002; Table 6). Although the profiles of types of entries varied from classroom to classroom, in all classrooms and across both units the types of entries most frequently found were reporting data (34.99%), definitions (18.98%), and content questions/short answer (15.05%). The types of entries least frequently found were designing experiments (0.12%) and reporting and interpreting data (0.78%). For the rest of the categories the percentage varied across the two units. Only in three classrooms did we find evidence that formal assessments were provided at the end of the units. All of the assessments found were classified as simple forms of assessment (e.g., short-response, matching exercises).

Table 6: Percentage of Type of Entries by Unit

Type of Entry - Variables (n = 60); Mixtures (n = 60)
Defining - 20.58; 17.38
Exemplifying - 6.90; 0.99
Applying Concepts - 1.80; 4.30
Predicting/Hypothesizing - 1.15; 0.90
Reporting Results - 32.34; 37.63
Interpreting Results/Concluding - 5.02; 2.29
Reporting and Interpreting Results/Concluding - 0.95; 0.61
Reporting Procedures - 2.78; 4.33
Reporting Experiments - 6.34; 3.07
Designing Experiments - 0.20; 0.05
Content Questions/Short Answers - 8.87; 21.29
Quick Writes - Reflections, Affective Questions - 8.91; 5.81
Assessments - 3.69; 0.98
Don't Care About Activity - 0.48; 0.33


At the subcategory level, results indicated that most of the definitions found in the students' science notebooks were verbal and very few pictorial. Pictorial definitions were mainly found in the Variables unit. Forms of reporting data, verbally or graphically, varied according to the unit. Graphical data were mainly found in the Variables unit, but the opposite was the case in the Mixtures unit. We believe this is congruent with the content of the units. In the Variables unit, the data students collected were more suitable to be organized and represented in a table or a graph (string length on the horizontal axis and number of cycles on the vertical axis) than in the Mixtures unit, in which most of the data collected were observations (e.g., a description of the salt crystals left after evaporation).

Unfortunately, most of the procedures reported across the two units were classified as "narrative recount procedures," instead of instructions or directions. We think this finding is important since research (e.g., Martin, 1989, 1993) has found that narrative descriptions of procedures in science (e.g., "Today we put two cups of water...") decrease the generalizability of the procedure. The reason is that narrative procedures are presented as a recollection of events and in the past tense; therefore, they typically express temporality. The accurate and appropriate reporting of procedures is essential to the work of scientists, as is reporting experiments. However, we found not only that students were asked to report an experiment on few occasions, but also that most of the experiments reported were incomplete. They usually lacked the purpose/objective of the experiment or data interpretation and conclusions. Students hardly used the evidence they collected to draw conclusions or provide explanations. (For detailed information on types of entries and their characteristics see Ruiz-Primo et al., 2002.)

Teacher Feedback

We already mentioned that notebook performance mean scores were, in general, low. The issue is what the teachers did to improve students' understanding and performance. Did they provide appropriate feedback to students? Was there any written evidence of this feedback? If there was no written evidence, is it possible to infer that they provided verbal feedback to students?

Across studies we have observed that helpful feedback was rarely provided. Furthermore, feedback of any type was also hard to find. We found that in six of the ten classrooms studied there was no evidence of teacher feedback in any notebook entry across the two units, despite the evidence of errors or misconceptions in the students' communications. Moreover, for those four classes in which feedback was provided, the feedback mean scores were low, indicating that the teachers' comments were not of high quality. Typically the feedback focused on students' understanding. When some feedback was provided on the quality of communication, teachers paid attention to spelling errors rather than to the quality and the characteristics of the students' written communications.

Figure 4 shows the percentage of types of feedback teachers provided across students' notebook entries for students' understanding in those classrooms in which feedback was found. The graph has two categories that we have not explained, "Inconsistent" and "Not Necessary." The former refers to those cases in which, in the same notebook entry, the teacher provided both correct and incorrect feedback (e.g., -1 and +2). The latter refers to the cases in which the teacher did not provide feedback, but it was not necessary based on the characteristics of the entry (e.g., a copied definition).


Figure 4: Percentage of Type of Feedback Focusing on Students' Understanding Used by the Four Teachers in the Variables and Mixtures Units Based on this graph, it is clear that the highest percentages are found in the 0 and 1 scores across the two units. This means that teachers either did not provide feedback ("0") or the feedback provided was reduced to a grade, a checkmark, or a code phrase ("1"). Literature on feedback (Clarke, 2000; Sadler, 1989, 1998; Tunstall & Gipps, 1996) has emphasized that providing only a grade, a checkmark, or a general comment cannot help students to reduce the gap between where they currently are and where they should be (the learning goal). In contrast to what has been recommended by the research, the teachers rarely provide helpful comments to their students (scores 2 or 3). Unfortunately, we also found that some of the feedback provided by teachers was incorrect (i.e., teacher provided an A+ to an incorrect response, a -2 score). Notice also that the percentage of type -1, i.e., teachers did not provide feedback but they should, is also considerable. On average, across both units around 1/6 of the feedback was in those two categories. Conclusions In this article we reported some findings on the use of students' science notebooks as an assessment tool for providing evidence bearing on student performance and on the opportunities students have to learn science. We examined whether students' notebooks could be considered a reliable and valid fo..,'r, of assessment and whether they could be used to explain, at least partially, between-class variation in student performance. We also used students' science notebooks to examine the nature of instructional activities they did in their

80

M.A. Ruiz-Primo, M. Li /Studies in Educational Evaluation 30 (2004) 61-85

science class, the nature of teachers' feedback, and how these two aspects of teaching related to the students' performance. Each entry of each student's science notebook was analyzed according to the characteristics of the activity, quality of student performance, and teacher feedback. Results have several implications. First, raters can consistently classify notebook entries despite the diversity of the forms of comnmnications (written, schematic, or pictorial). They can also consistently score students' quality of communication, conceptual and procedural understanding, and the quality of teachers' feedback. Second, inferences about student performance using notebooks were justified. High and positive correlations with scores obtained from the performance assessments indicated that students' notebook performance score can be considered an accurate indicator for their science achievement. Third, the intellectual demands of the tasks found in the notebooks were, in general, low. Teachers typically just asked students to record the results of an experiment or to copy definitions. These types of tasks by themselves are not challenging enough to either engage students in scientific inquiry or help students improve their understanding. Fourth, low student performance scores across studies revealed that students' communication skills and understanding were far from the maximum score and did not improve over the course of instruction during the school year. And, fifth, this latter finding may be due, in part, to the fact that teachers provided little, if any, feedback. Indeed, most of the teachers did not provide any feedback even though the errors or misconceptions were evident in their students' notebooks. If some feedback was provided, comments were reduced to a grade, checkmark, or a code phrase. Therefore, such a lack of effort to provide useful feedback is very likely to have contributed to students' inability to achieve the desired performance. We concluded that the benefits of science notebooks as a learning tool for the students and as a source of assessment information for teachers was not fully exploited in the science classrooms studied. Reflecting on the notebook entries we examined, we believe that teachers should carefully plan, design and select the entries for students to write on the basis of their understanding of the unit goals. The ongoing accurate and systematic documentation of the development of ideas, concepts, and procedures is a powerful scientific tool for replicating studies, for discussing and validating findings, and for developing models and theories; simply put: for developing scientific inquiry. Furthermore, research has demonstrated that students' learning and understanding can be improved if students are asked to write in science in an appropriate, purposeful, and relevant way (Lemke, 1990; Martin, 1989, 1993, Rivard, 1994). Across our studies, students' writing in science notebooks was largely mechanical. For nearly every instructional activity, students were asked to write down the results found or the procedures used without providing explanations or conclusions. Students' notebooks had few entries focusing on the understanding of the concepts learned that day. The only entry related to the concepts learned was definitions, mainly copied from the textbook or a dictionary. 
Students were never asked, for example, to contrast and compare concepts (e.g., mixtures and solutions), or to apply them in different contexts (to improve the transferability of their knowledge). Related to the fact that notebook writing was used without careful design, the notebook entries were not coherent either: they were mainly a set of unconnected, seemingly random activities that reflected little alignment between the instructional tasks and the unit goals. Moreover, the quality of the descriptions was, in general, poor.


For example, procedures were hardly replicable, results were almost never organized in a way that could help students find patterns, and results were almost never used as evidence in explanations or conclusions. Overall, these observations reveal that it is not instructionally useful to have students write entries that lack intellectual challenge, coherence, or connection to the unit goals.

The meaningful use of notebooks requires a shift in the learning culture of both teachers and students. That is, teachers and students must first understand well why and how to write notebooks in order to gain the maximum benefit from notebooks as learning and assessment tools. In addition, we recommend that teachers start by clarifying the unit goals, specifying the performances students are expected to demonstrate, and identifying the instructional activities that can help students achieve those goals. Once they have addressed those big questions and communicated the purposes of writing to students, they then need to carefully select the tasks students write about in their science notebooks, accompanied by clear expectations and appropriate scaffolding.

The fact that teacher feedback was not consistently found raises concerns about the teachers' classroom assessment practices. For example, how much do teachers know about effective feedback and its impact on improving student learning (e.g., Black & Wiliam, 1998; Sadler, 1989, 1998), and what skills are required for teachers to provide effective feedback? Of course, many logistical problems, such as time constraints and large class sizes, can make it difficult for teachers to provide feedback to students. To overcome those problems, we believe that teachers first need to carefully select the types of entries to work on with students; these must be tied closely to the purposes of the instructional activity, connected with other entries, and supportive of the unit goals. Second, teachers need to consider options for assisting their assessment practices and helping students move toward self-monitoring (e.g., self- and peer-assessment; Sadler, 1989). Third, the educator and research communities need to think carefully about how science notebooks can be conceptualized, implemented, and assessed in ways that most effectively reflect their main purposes. If science notebooks are to be used as an unobtrusive assessment tool, we need to make an effort to help teachers coordinate the power of purposeful recording and thoughtful reflection about students' work with helping students improve their understanding and performance of science inquiry.

Acknowledgement

We wish to thank the anonymous reviewers for their helpful comments.

Notes

1. The series of studies reported was supported by the National Science Foundation (No. SPA8751511 and TEP-9055443). The opinions expressed, however, are solely those of the authors and do not necessarily reflect those of the granting agencies. An earlier version of this article was presented at the EARLI 10th Biennial Conference, Padova, Italy.

2. We acknowledge that students' performance is multi-faceted, e.g., including cognitive and affective aspects. In this article, we selected to focus on the cognitive aspect due to the scope of our studies.

3. We acknowledge that many different schemes can be used to analyze students' notebook communications (see Audet, Hickman, & Dobrynina, 1996; Keys, 1999).

4. Lemke (1990) classifies the scientific genres into minor - short or simpler forms of communication, such as descriptions, comparisons, and definitions - and major - usually longer, more complex, and more specialized communications, such as lab reports.

5. The approach does not intend to focus on the functional analysis of the students' written communication (e.g., lexical density or characteristics of the clauses; cf. Halliday & Martin, 1993; Keys, 1999) but to use only the general characteristics of the genres as criteria for scoring the quality of the communications.

6. The number of pages scored in our first study was not counted.

References

Alonzo, A.C. (2001). Using student notebooks to assess the quality of inquiry science instruction. Paper presented at the AERA Annual Meeting. Seattle, WA.

Aschbacher, P.R., & McFee-Baker, C. (2003). Incorporating literacy into hands-on science classes: Reflections in student work. Paper presented at the AERA Annual Meeting. Chicago, IL.

Audet, R.H., Hickman, P., & Dobrynina, G. (1996). Learning logs: A classroom practice for enhancing scientific sense making. Journal of Research in Science Teaching, 33 (2), 205-222.

Baker, G., & McLoughlin, R. (1994). Teachers, writing and factual texts. Melbourne: Catholic Education Office.

Baxter, G.P., Bass, K.M., & Glaser, R. (2000). An analysis of notebook writing in elementary science classrooms. CSE Technical Report 533. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing, Graduate School of Education & Information Studies, University of California, Los Angeles.

Black, P. (1993). Formative and summative assessment by teachers. Studies in Science Education, 21, 49-97.

Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5 (1), 7-74.

Bybee, R.W. (1996). The contemporary reform of science education. In J. Rhoten & P. Bowers (Eds.), Issues in science education (pp. 1-14). Arlington, VA: National Science Teachers Association.

Christie, E. (1985). Language and schooling. In S.N. Tchudi (Ed.), Language, schooling and society (pp. 21-40). Upper Montclair, NJ: Boynton/Cook.

Clarke, S. (2000). Closing the gap through feedback in formative assessment: Effective distance marking in elementary schools in England. Paper presented at the AERA Annual Meeting. New Orleans, LA.

Cuozzo, C.C. (1996). What do lepidopterists do? Educational Leadership, 54 (4), 34-37.


Martin, J.R. (1993). Literacy in science: Learning to handle text as technology. In M.A.K. Halliday & J.R. Martin (Eds.), Writing science: Literacy and discursive power (pp. 166-202). Pittsburgh, PA: University of Pittsburgh Press.

McColskey, W., & O'Sullivan, R. (1993). How to assess student performance in science: Going beyond multiple-choice tests: A resource manual for teachers. Tallahassee, FL: Southeastern Regional Vision for Education.

National Research Council (1996). National science education standards. Washington, DC: Author.

Northwest Regional Educational Laboratory (1997). Assessment strategies to inform science and mathematics instruction: It's just good teaching. Portland, OR: Northwest Regional Educational Laboratory.

Pellegrino, J., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.

Penrose, A., & Katz, S. (1998). Writing in the sciences. New York: St. Martin's Press.

Ramaprasad, A. (1983). On the definition of feedback. Behavioral Science, 28, 4-13.

Rivard, L.P. (1994). A review of writing to learn in science: Implications for practice and research. Journal of Research in Science Teaching, 31 (9), 969-983.

Roth, E.J., Aschbacher, P.R., & Thompson, L.J. (2002). Adding value: Scaffolding students' work so science notebooks improve teaching and learning. Paper presented at the AERA Annual Meeting. New Orleans, LA.

Rowell, P.M. (1997). Learning in school science: The promises and practices of writing. Studies in Science Education, 30, 19-56.

Ruiz-Primo, M.A. (1998). On the use of students' science journals as an assessment tool: A scoring approach. Stanford, CA: Stanford University, School of Education.

Ruiz-Primo, M.A., Li, M., Ayala, C., & Shavelson, R.J. (2000). Students' science journals as an assessment tool. Paper presented at the AERA Annual Meeting. New Orleans, LA.

Ruiz-Primo, M.A., & Li, M. (2001). Exploring teachers' feedback to students' science notebooks. Paper presented at the NARST Annual Meeting. St. Louis, MO.

Ruiz-Primo, M.A., & Li, M. (2002a). Assessing some aspects of teachers' instructional practices through vignettes: An exploratory study. Paper presented at the AERA Annual Meeting. Chicago, IL.

Ruiz-Primo, M.A., & Li, M. (2002b). Vignettes as an alternative teacher evaluation instrument: An exploratory study. Paper presented at the AERA Annual Meeting. New Orleans, LA.

Ruiz-Primo, M.A., Li, M., & Shavelson, R.J. (2002). Looking into students' science notebooks: What do teachers do with them? Paper submitted for publication.


Ruiz-Primo, M.A., Li, M., & Shavelson, R.J. (2004). Looking into students' science notebooks: Exploring entries to track quality of instructional activities and teacher feedback. Paper submitted for publication.

Ruiz-Primo, M.A., Li, M., Ayala, C., & Shavelson, R.J. (1999). Student science journals and the evidence they provide: Classroom learning and opportunity to learn. Paper presented at the NARST Annual Meeting. Boston, MA.

Ruiz-Primo, M.A., Li, M., Ayala, C., & Shavelson, R.J. (in press). Evaluating students' science notebooks as an assessment tool. International Journal of Science Education.

Ruiz-Primo, M.A., Shavelson, R.J., Hamilton, L., & Klein, S. (2002). On the evaluation of systemic education reform: Searching for instructional sensitivity. Journal of Research in Science Teaching, 39 (5), 369-393.

Sadler, R.D. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18, 119-144.

Sadler, R.D. (1998). Formative assessment: Revisiting the territory. Assessment in Education, 5 (1), 77-84.

Shepardson, D.P., & Britsch, S.J. (1997). Children's science journals: Tool for teaching, learning, and assessing. Science and Children, 34 (5), 13-17, 46-47.

Southerland, S.A., Gess-Newsome, J., & Johnson, A. (2000, April). Defining science in the classroom: How scientists' views shape classroom practice. Paper presented at the NARST Annual Meeting. New Orleans, LA.

Tunstall, P., & Gipps, C. (1996). Teacher feedback to young children in formative assessment: A typology. British Educational Research Journal, 22 (4), 389-404.

Willison, A. (1996). The quality of writing. Milton Keynes: Open University.

The Authors

MARIA ARACELI RUIZ-PRIMO is a senior research scholar at the School of Education, Stanford University. Her specializations are educational assessment and alternative assessments in science.

MIN LI is an assistant professor at the College of Education, University of Washington, Seattle. Her areas of specialization are educational measurement and assessment.

Correspondence:


Appendix

Descriptions of the FOSS Curriculum and the Two Selected Units

The Full Option Science System (FOSS, 1993) states as a goal the promotion of "scientific literacy for all students and instructional efficiency for all teachers" (FOSS, 1993, p. 1). FOSS is organized in modules or units according to: (1) content domains - life science, physical science, earth science, and scientific reasoning and technology; and (2) grade levels - kindergarten, Grades 1-2, Grades 3-4, and Grades 5-6. Curriculum units include three main components: a teacher guide, an equipment kit, and a teacher preparation video. Each unit has different activities (sections), designed in a way that they can be implemented independently. The Variables unit is one of two units focused on scientific reasoning and technology, and Mixtures and Solutions is one of two units for physical science.

In the Variables unit students design and conduct experiments; describe the relationships between variables discovered through experimentation; record, graph, and interpret data; and use these data to make predictions. During the unit, students identify and control variables and conduct experiments using four multivariable systems, each corresponding to an activity: Swingers, Lifeboats, Plane Sense, and Flippers. In all four activities students construct the system to be tested. For example, in Swingers students construct a pendulum, and in Plane Sense they construct a plane using paper clips as passengers. In each system (e.g., pendulum or plane), students identify variables and manipulate and control them to observe their effects on an outcome (e.g., the number of cycles the pendulum swings in 15 seconds, or the number of winds it takes the plane to travel a given distance).

In the Mixtures and Solutions unit, students gain understanding of the concepts of mixtures and solutions, saturation, concentration, and chemical reaction. Each concept is the focus of one activity in the unit: Separating Mixtures, Reaching Saturation, Concentration, and Fizz Quiz. Students make mixtures and solutions, use different methods to separate mixtures, determine the amount of a substance required to saturate a certain volume of water, determine the relative concentrations of several solutions, and observe changes in substances by mixing solutions.
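To make the Swingers example concrete, the following minimal sketch (ours, not part of the FOSS materials) illustrates the variable-outcome relationship students explore empirically. It assumes an idealized small-angle pendulum whose period depends only on string length, T = 2*pi*sqrt(L/g); the function name and the sample lengths are our own illustrative choices.

```python
# A minimal sketch (not from the FOSS materials) of the relationship students
# investigate in the Swingers activity. It assumes an idealized small-angle
# pendulum whose period is T = 2 * pi * sqrt(L / g).

import math

G = 9.81  # gravitational acceleration, in m/s^2

def cycles_in_window(length_m: float, window_s: float = 15.0) -> float:
    """Approximate number of full swings a pendulum of the given
    length completes in the given time window."""
    period_s = 2 * math.pi * math.sqrt(length_m / G)  # seconds per cycle
    return window_s / period_s

# Vary only the manipulated variable (length), holding everything else
# constant, to see its effect on the outcome (cycles in 15 seconds):
for length in (0.2, 0.4, 0.8):
    print(f"length = {length:.1f} m -> about {cycles_in_window(length):.1f} cycles in 15 s")
```

Under this model, doubling the string length lengthens the period by a factor of sqrt(2) and so cuts the cycle count by the same factor, which mirrors the pattern students are expected to discover by manipulating one variable at a time.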