Validating placement: Local means, multiple measures


Available online at www.sciencedirect.com

Assessing Writing 12 (2007) 170–179

David A. Reinheimer ∗

English Department, MS2650, Southeast Missouri State University, Cape Girardeau, MO 63701, USA

Abstract

This study outlines a response to a composition placement review based on the understanding that demonstrating validity is an argumentative act. The response defines the validity argument through principles of sound assessment: using multiple, local methods in order to improve program performance. The study thus eschews the traditional course-grade correlation methodology and instead develops a local teacher- or expert-rater system combined with a matched-pair analysis. While the results do not quite meet the hypothesized level, the study ultimately created an effective argument as part of a successful program review.
© 2008 Elsevier Inc. All rights reserved.

Keywords: Placement; Validation; Program assessment; Program review; Writing assessment

During the 2005–2006 academic year, the provost of the researcher’s institution appointed a faculty task force to review the placement of incoming first-year students into composition courses. This review was just one of many conducted at the institution to keep institutional costs as low as possible during a time of decreasing state funding and a predicted decline in enrollment. The placement program under review uses a 50-min impromptu expository essay as the primary measure, referring as well to ACT1 English subscores and class rank in borderline cases. The review was interested in the same balance between accuracy and cost that usually defines decisions regarding placement methods; specifically, it asked whether the current method was valid enough to offset its cost to the institution or whether a less costly placement method, such as the ACT Writing exam, should be adopted. Analysis of the review’s primary concern indicated that the response could not be a convergent, this-method-is/is-not-valid approach; rather, the question required a divergent approach comparing cost and effectiveness. This question thus reflects current thinking about validation, which sees validity as relative rather than absolute (Alderson, Clapham, & Wall, 1995).


∗ Tel.: +1 573 651 5905; fax: +1 573 986 6198. E-mail address: [email protected].
1 ACT is a major college entry test in the US (see http://www.act.org).

1075-2935/$ – see front matter © 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.asw.2008.02.004


In fact, validity theory suggests that all validity studies, not just those that are part of a program review, are rhetorical acts, balancing different kinds of validity, different tests for validity, and the entire rhetorical context of the response (Alderson et al., 1995; Cronbach, 1988, 1989; Huot, 2002; Messick, 1989). As with any rhetorical act, the first step in a validity study would be to determine how to provide increasingly number-driven administrators with effective evidence to demonstrate the quality of the program within the rhetorical context of a program review (Glau, 2002). To identify this evidence, the responder is well served to go back to basic principles of sound assessment: local measures are preferred over external measures; multiple measures provide a more accurate picture than single measures; and assessment is intended to provide feedback leading to improvement.

Not only can principles of sound assessment define the parameters of a validity argument, but those same principles also describe exactly the kind of information the audience of a program review (i.e., administrators) wants. Glau (2002) and Kinkead and Simpson (2000) both suggest that administrators want clear, short, quantitative data: this means not only providing what Glau (2002) terms “nuggets” of data, but presenting nuggets that speak directly to the point and do not require an extended discussion for the administrator to make sense of them. A well-constructed local measure often fits these requirements better than an external measure, in part because local administrators may already be familiar with the local measures; thus, the local measure requires less contextualization. Further, if the administrative audience is familiar with the principles of assessment, as more and more administrators are these days, then the local measure will carry greater credibility. An audience familiar with assessment will also look for the use of multiple measures, not an onerous task, as multiple measures often generate appropriate “nuggets” of data which contribute to the sufficiency of a validity argument’s evidence, providing a more complete and accurate picture of the placement situation. Finally, assessment should not only generate data; it should also show how those numbers can lead to improvement, an argument that administrators usually find especially effective (Kinkead & Simpson, 2000).

Just as local assessment grows out of the local curriculum, the local population, and other local parameters, so too should a local validity argument grow out of the local assessment context, rather than being imposed on a context for which the argument is not appropriate. Take, for instance, the correlation (particularly the course-grade correlation), the most common method of responding to validity questions (Alderson et al., 1995). While the course-grade correlation may be an appropriate validity measure for some methods of placement, it presents theoretical obstacles when imposed on many placement contexts. A course grade is often not an appropriate benchmark because of the manifold intervening variables involved in the generation of a course grade (Alderson et al., 1995), variables that are not part of what some placement measures assess; the result of imposing the course-grade correlation on all placement contexts is a litany of “embarrassingly weak” results (Haswell, 2005).
Ever since Huddleston (1954), the results of correlational placement validation have ranged from about 0 to .5 (see Table 1). Based on a metastudy of the use of correlations, the best of these results could be called “moderately strong” (Orcher, 2005, p. 207), which may make the evidence more persuasive, but also requires the kind of excessive contextualization that can obscure the evidence for an administrative audience. It should also be noted that correlational results can contradict other validity evidence: while the correlations are low, placement almost always seems to “work” from an instructor’s perspective (Smith, 1993). That being said, course-grade correlations can be effective for some placement contexts. In order to obtain grades that have any meaning for programmatic assessment, however, a rather rigorous framework must be imposed on instructors to obtain reliable data (Diederich, 1974; Walvoord, 2004); the challenge here is to create a framework that grows out of curriculum and instruction.


Table 1
Results of validation studies

Study                                        Correlation with writing sample alone
McKendy (1992)                               .32
Breland (1987)                               .38 to .52
Drain and Manos (1986)                       .34
Guerrero and Robison (1986)                  .36
Haines (1985)                                .14
Evans, Holmes, Jantzi, and Haynes (1983)     .16 and .38
Culpepper and Ramsdell (1982)                .41
Haines (1981)                                .25
Olson and Martin (1980)                      .24 and .22
Michael and Shaffer (1979)                   .29 and .28
Michael and Shaffer (1978)                   .31 and .32
Breland (1977)                               .23 and .28
Checketts and Christensen (1974)             .26
Huddleston (1954)                            .47 and .46

Note. Table adapted from McKendy (1992).

Assuming reliable data can be generated, there are placement methods, such as diagnosis (Haswell, 1991) and directed self-placement (Royer & Gilles, 2003), that assess contextual issues similar to those involved in course grades, and for which a course-grade correlation would therefore be more appropriate. The important point is to remember the principle of locality and to examine the method being validated before deciding which validation method to use: writing assessment and placement are both highly contextual, and so should placement validation be (Williamson, 1993). Several validation studies have followed this principle and developed alternate methods of validation: teacher-rating correlations (McKendy, 1992), discriminant analysis (Bers & Smith, 1990; Hughes & Nelson, 1991), and other models (Smith, 1993).

Not only should a validity argument be based on a method appropriate to the local placement context, it must also determine which method is most appropriate to the questions being asked in a program review. There are several types of validity, and definitions have been shifting (Alderson et al., 1995; Cronbach, 1988). Alderson et al. (1995) adopt the terms rational validity, empirical validity, and construct validity:

    Rational (or ‘content’) validation depends on a logical analysis of the test’s content to see whether the test contains a representative sample of the relevant language skills. Empirical validation depends on empirical and statistical evidence as to whether students’ marks on the test are similar to the marks on other appropriate measures of their ability . . .. Construct validation refers to what the test scores actually mean. (p. 171)

A review of a new placement method may be asked to argue rational validity, which can be done through the testimony of various experts, as well as empirical validity, which will rely on quantitative evidence. An argument for construct validation can certainly build on the other two arguments, but it raises unique issues because abstract scores are being translated into real students’ experience. Once the appropriate type of validity has been identified, the review must then identify the appropriate method for generating evidence for the validity argument, a question that can often be answered with the help of the local institutional research office.


Having identified the type of argument to be made and the appropriately local methods for gathering relevant evidence, the last parameter for the argument is its ability to close the “assessment loop,” that is, to use the argument to indicate program improvement. In order to close that loop, multiple measures are often required. For instance, while a correlation can be used to demonstrate construct validity in certain contexts, it alone does not provide sufficient information for specific program improvement. A correlation provides a single number defining the overall success of a local placement method, but as an aggregate number it does not indicate which parts of that method should be improved for greater validity. To argue for program improvement, a correlation could be combined with a matched-pair analysis or a Wilcoxon signed-rank test (Galbato & Markus, 1995), for example, which can provide the disaggregated information needed to determine action plans for program improvement, the kind of argument administrators are likely to find rhetorically effective.

The placement process under review in this study consists essentially of three parts: the measure itself, an impromptu writing task; the numerical score assigned to the student’s results, a holistic score ranging from 0 to 6; and a placement decision based upon that holistic score. When the placement process was established, well over a decade ago, the rational validity of the test itself and the empirical validity of the holistic score were both clearly determined. The program review was concerned with the placement decision based on the validated holistic score, an interpretation that gives the score meaning; hence, it was determined that the program review was specifically concerned with the construct validity of the process. The review thus asked whether the meaning of the scores was valid; that is, whether the placement decision based on a student’s test score put the student into the correct class.

As the placement measure is a holistically scored, impromptu writing sample, a course-grade correlation was deemed inappropriate for the reasons discussed above; rather, it was determined that a correlation comparing the original placement decision with an expert-rater assessment of the same placement artifact would provide the most effective evidence for the validity argument. It was hypothesized that the correlation would exceed .6, which would be effective evidence for the administrative audience because it both exceeds previous results and falls into the “strong” range (Orcher, 2005, p. 207); if the results did not meet that hypothesis, a matched-pair analysis would be used to determine where changes in the program would improve performance.

1. Method

1.1. Participants

The placement program under study is at a mid-sized, four-year, comprehensive state university in the Midwest of the U.S. After students are admitted, they sign up for one of approximately 12 orientation sessions held between January and August, at which they write the placement examination: a 50-min impromptu expository writing task on a general topic. Immediately following the exam, each essay is holistically scored by at least two readers from a small cadre of 4–6 experienced scorers.
On a 6-point scale, a score below 3 places students into developmental writing and a score of 4 or greater into first-semester composition; for scores of 3 and 3.5, the student’s ACT English subscore and high school class rank are considered before placing the student. In addition, students may qualify for an equivalency exam for first-semester composition by scoring a 4.5 with the requisite ACT English subscore and high school class rank, or by scoring a 5 or above. At the beginning of each semester, students in developmental writing are retested in the classroom using a similar writing task to verify their placements.
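As a concrete illustration only, the sketch below encodes this decision rule in Python. The function name, the ACT English and class-rank cutoffs, and the way the two pieces of supporting evidence are combined for the borderline and equivalency cases are hypothetical assumptions; the article does not report those details.

```python
def place_student(holistic, act_english=None, class_rank_pct=None,
                  act_cutoff=24, rank_cutoff=50):
    """Sketch of the placement rule described above.

    holistic: score on the 0-6 placement scale.
    act_english / class_rank_pct: consulted only for borderline scores.
    act_cutoff / rank_cutoff: hypothetical thresholds; the article does not
    report the actual values used for borderline or equivalency decisions.
    """
    def supporting_evidence():
        # Assumed rule: either measure at or above its cutoff counts as support.
        return ((act_english is not None and act_english >= act_cutoff) or
                (class_rank_pct is not None and class_rank_pct >= rank_cutoff))

    if holistic >= 5.0:
        return "equivalency exam"            # a score of 5 or above
    if holistic == 4.5 and supporting_evidence():
        return "equivalency exam"            # 4.5 plus requisite ACT/class rank
    if holistic >= 4.0:
        return "first-semester composition"  # 4 or greater
    if holistic in (3.0, 3.5):               # borderline band
        return ("first-semester composition" if supporting_evidence()
                else "developmental writing")
    return "developmental writing"           # below 3


print(place_student(5.5))                  # equivalency exam
print(place_student(3.5, act_english=26))  # first-semester composition
print(place_student(2.5))                  # developmental writing
```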


A sample of 620 essays was systematically selected from the placement essays of the incoming class of 2002. That class was chosen because it was the last year in which all students sat for the placement exam; the following year, new testing procedures were adopted that, based on the student’s ACT English subscore, exempted the top 15% of each incoming class from writing the placement essay. All identifying information regarding both the student and the original scoring was removed from the exams, which were then assigned random identification numbers.

The scorers for the validation procedure were seven faculty and staff members who regularly score essays as a group. All of the scorers had undergone initial training for holistic scoring, and they had as much as a decade’s worth of experience scoring live essays in both placement and proficiency settings. While new jobs and retirement made it impossible to reconvene the original scorers, four of the seven scorers for this study did evaluate the placement essays when they were first scored. Five of the seven scorers had taught at least one of the courses in the composition sequence, and four of the seven had taught both developmental writing and first-semester composition at the institution.

1.2. Procedure

The scoring session was held on three consecutive days in the early summer. Before the scoring session began, criteria were derived from the student learning objectives of the developmental and first-semester composition courses (see Appendix A). The essays were divided into groups of 10 and placed in folders; scoring sheets with blanks for the scorer’s ID number, the essays’ ID numbers, and the essays’ scores were produced. Approximately 50 essays were pulled from the unsampled placement essays to serve as rangefinders.

At the start of the scoring session, the rangefinders were used to set the scale for the outcome criteria: essays were read, scored, and discussed until agreement had been reached on the scale. Additional range-finding sessions were held during the scoring session as needed. The essays were then scored. For the purposes of the validation, placement was coded numerically, with 1 representing developmental composition, 2 representing first-semester composition, and 3 representing qualification for the equivalency exam. Any disagreements were resolved by a third reader. The scores were recorded on the scoring sheets, which were collected at the end of the scoring session. Both readers’ ID numbers and scores were entered into a spreadsheet that already contained the following data from the original placement scoring: first and second readers’ IDs and scores, total score, ACT English subscore, high school class rank, and original placement. Statistical analysis was run on the data using the spreadsheet application.

2. Results

A validation argument relies in part on data that administrators perceive to be reliable, so the first figures computed were inter-rater reliability coefficients for both the original placement scoring sessions and the validation scoring session. As shown in Table 2, most of the coefficients for the original placement scorers range between .5 and .73; overall, the coefficient is .69, essentially equivalent to Diederich’s reliability benchmark of .7 (Diederich, 1974). The inter-rater reliability figures for the validation scoring (see Table 3), however, are not as promising.
Overall, the reliability was only .49, and the four scorers carried over from the placement scoring all had lower reliability in the validation scoring session; the effects of this reliability figure are discussed at further length below.
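The pairwise coefficients in Tables 2 and 3 are presumably computed over the essays that a given pair of readers both scored, which is why pairs sharing too few essays are reported as NA. A minimal sketch of that computation follows; the reader IDs and scores are invented, the use of Pearson's r is an assumption (the article does not name the reliability statistic), and Python stands in for the spreadsheet application the study actually used.

```python
from collections import defaultdict
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical readings: (reader_a, score_a, reader_b, score_b) for one essay each.
readings = [
    ("001", 4.0, "002", 4.5), ("001", 3.0, "002", 3.0), ("001", 4.5, "002", 4.0),
    ("001", 3.5, "002", 4.0), ("001", 2.0, "002", 2.5), ("001", 5.0, "002", 4.5),
    ("001", 2.5, "151", 3.0), ("001", 5.0, "151", 5.0), ("001", 2.0, "151", 2.0),
    ("004", 4.0, "002", 4.0), ("004", 2.0, "002", 2.5),
    ("004", 5.0, "151", 4.5), ("004", 3.5, "151", 3.5), ("004", 4.5, "151", 4.0),
]

# Group each pair's scores, keeping a consistent orientation per pair.
pair_scores = defaultdict(lambda: ([], []))
for reader_a, score_a, reader_b, score_b in readings:
    key = tuple(sorted((reader_a, reader_b)))
    xs, ys = pair_scores[key]
    if key[0] == reader_a:
        xs.append(score_a); ys.append(score_b)
    else:
        xs.append(score_b); ys.append(score_a)

MIN_N = 6  # Table 2 reports "NA" for pairs with fewer than six shared essays.
for pair, (xs, ys) in sorted(pair_scores.items()):
    if len(xs) < MIN_N:
        print(pair, "NA (N < %d)" % MIN_N)
    else:
        print(pair, round(correlation(xs, ys), 2))

# A rough overall figure: all first readers' scores against all second readers'.
print("overall:", round(correlation([r[1] for r in readings],
                                    [r[3] for r in readings]), 2))
```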


Table 2
Inter-rater reliability, placement scoring sessions

Scorer number   All    001    002    003    004    005    006    007    019    151
All             .69    .68    .73    .73    .68    .67    .60    .50    .60    .70
001             .68    –      .66    NA     .47    NA     NA     NA     NA     .78
002             .73    .66    –      NA     .85    NA     NA     NA     .72    .70
003             .73    NA     NA     –      .78    NA     NA     NA     NA     NA
004             .68    .47    .85    .78    –      NA     .76    NA     .44    .73
005             .67    NA     NA     NA     NA     –      .63    NA     .73    NA
006             .60    NA     NA     NA     .76    .63    –      NA     NA     NA
007             .50    NA     NA     NA     NA     NA     NA     –      NA     .60
019             .60    NA     .72    NA     .44    .73    NA     NA     –      .50
151             .70    .78    .70    NA     .73    NA     NA     .50    .64    –

Note. “NA” indicates N < 6.

Table 3
Inter-rater reliability, validation scoring session

Scorer number   All    003    004    019    151    182    190    200
All             .49    .52    .49    .57    .54    .50    .55    .44
003             .52    –      .88    .55    .64    .11    1.00   .40
004             .49    .88    –      .26    .61    .19    –      .48
019             .57    .55    .26    –      .64    .55    .59    .29
151             .54    .64    .61    .64    –      .35    .77    .55
182             .50    .11    .19    .55    .35    –      .81    .71
190             .55    1.00   –      .60    .80    .80    –      .33
200             .44    .40    .48    .29    .55    .71    .33    –

While this reliability data places the study’s results under question, the correlation coefficient comparing the placement and validation results is a moderately strong .55 (Orcher, 2005). These results are stronger than those of most previous studies (see Table 1) and hence provide effective evidence for a validation argument aimed at an administrative audience that is regularly interested in comparative data; however, the results are not as strong as hypothesized. Therefore, a matched-pair analysis was run on the data to pinpoint strengths and weaknesses more specifically (Table 4). As might be expected, the problems appear to lie not in the extremes but in the middle range, specifically those scores where the ACT English subscore and high school class rank were used to place the student in developmental writing. The greatest apparent disagreement between the two scoring sessions was at the 3.5- score level, where there was 94% disagreement between the original placement and the validation results. Other subpopulations in need of discussion are the 3.0- and 4.5 score levels.

3. Discussion

Using this local methodology and these multiple measures, it is easy to construct an action plan for program improvement in the program review. The validation coefficient (.55) is below the hypothesized .60, indicating a general weakness in the program. The matched-pair analysis indicates that the weakness lies mostly in the 3.5- score level, specifically those results indicating placement into developmental composition. This data indicates that a more accurate placement decision would be to place the entire 3.5 subpopulation into first-semester composition, regardless of ACT score or class rank.
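To make the aggregate figure concrete, here is a minimal sketch of how a coefficient like the .55 reported above might be computed from the coded placements (1 = developmental, 2 = first-semester, 3 = equivalency) and checked against the hypothesized .6. The data are invented, and the use of Pearson's r is an assumption; the article does not name the statistic behind the coefficient.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical coded placements for a handful of essays:
# 1 = developmental, 2 = first-semester composition, 3 = equivalency exam.
original_placement   = [2, 2, 1, 2, 3, 1, 2, 2, 1, 2, 3, 2]
validation_placement = [2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 3, 2]

r = correlation(original_placement, validation_placement)
print("validation coefficient: %.2f" % r)

HYPOTHESIZED = 0.60  # the study's target for a "strong," persuasive result
if r < HYPOTHESIZED:
    # Mirrors the study's fallback: disaggregate with a matched-pair analysis.
    print("below the hypothesized level; run the matched-pair analysis")
```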


Table 4
Matched-pair analysis, placement score and validation score

Placement score    n      Hits validation    Misses    % Hits
1.0                2      2                  0         100.0
1.5                1      1                  0         100.0
2.0                13     13                 0         100.0
2.5                9      9                  0         100.0
3.0-               89     39                 50        43.8
3.0+               20     15                 5         75.0
3.5-               82     5                  77        6.1
3.5+               30     29                 1         96.7
4.0                312    263                49        84.3
4.5                39     25                 14        64.1
5.0                17     16                 1         94.1
5.5                3      3                  0         100.0
6.0                1      1                  0         100.0

Note. “+” refers to students placed into first-semester composition based on ACT English subscore and/or high school class rank; “-” refers to students placed into developmental composition based on ACT English subscore and/or high school class rank.
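The per-level hit rates in Table 4 and the amended coefficient discussed below can be reproduced with a short spreadsheet-style computation; the article reports that the analysis was run in a spreadsheet, so the Python below is purely illustrative. The records, their field layout, and the use of Pearson's r are all assumptions.

```python
from collections import defaultdict
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical records: (placement score level, original code, validation code).
# Codes: 1 = developmental, 2 = first-semester composition, 3 = equivalency exam.
records = [
    ("3.5-", 1, 2), ("3.5-", 1, 2), ("3.5-", 1, 1), ("3.5-", 1, 2),
    ("3.5+", 2, 2), ("3.5+", 2, 2),
    ("4.0", 2, 2), ("4.0", 2, 2), ("4.0", 2, 1), ("4.0", 2, 2),
    ("4.5", 2, 3), ("4.5", 2, 2),
    ("2.5", 1, 1), ("2.5", 1, 1),
    ("5.0", 3, 3), ("5.0", 3, 2),
]

# Per-level hit rates, in the spirit of Table 4.
by_level = defaultdict(lambda: [0, 0])          # level -> [hits, n]
for level, original, validated in records:
    by_level[level][1] += 1
    if original == validated:
        by_level[level][0] += 1

for level in sorted(by_level):
    hits, n = by_level[level]
    print("%-5s  n=%2d  hits=%2d  %%hits=%5.1f" % (level, n, hits, 100.0 * hits / n))

# Overall coefficient before and after the amendment discussed in the text:
# place the whole 3.5- subpopulation into first-semester composition (code 2).
original  = [code for _, code, _ in records]
validated = [code for _, _, code in records]
amended   = [2 if level == "3.5-" else code for level, code, _ in records]

print("coefficient, original placements: %.2f" % correlation(original, validated))
print("coefficient, amended placements:  %.2f" % correlation(amended, validated))
```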

When the original placement decisions are amended to reflect this change, the coefficient increases to .61, which exceeds the study’s hypothesis and provides a strong case for adopting the action plan for program improvement. Furthermore, while this evidence can be used effectively as it stands, an analysis of the methodology indicates that the results are likely lower than they could be: the low inter-rater reliability numbers suggest that the scoring could be made more accurate with two changes to the methodology, and thus that a reiteration of the study may well result in stronger, more accurate validation.

The first change to the methodology would be to revisit the scoring scale used for the validation. Overall, the validation inter-rater reliability coefficient of .49 is well below both the .69 achieved in the placement scoring and Diederich’s .7 benchmark. One likely cause is the scale used for the validation scoring session. The learning objectives for the various composition courses were not developed to be used as a scoring scale, and thus they are not phrased in the kind of parallel language that is conducive to constructing criteria. The objectives for developmental composition, for instance, are stated very broadly, whereas the objectives for first-semester composition divide writing skills into more, and more specific, objectives. Hence, during the initial range-finding for the validation scoring, significant negotiation was necessary to transform these objectives into a viable scale, introducing some possible variance. Not only does this analysis indicate that the results of the study are less accurate than they could be, it also points to further program improvement. The lack of parallelism in the language of the two sets of objectives opens the door to discussions among faculty regarding the meaning of the course objectives and how the developmental and first-semester composition courses work together. These discussions could lead not only to a stronger curriculum but to a more definite, less variable placement scale. When constructing a validation argument, then, even the apparent shortcomings of a validation study can be used to defend a placement program.

The second change to the methodology regards the scorers themselves. First, because the scale was created for this study, the scorers were necessarily inexperienced with it, which likely affected the results. While extensive range-finding occurred at the beginning and at later points of the scoring session, it seems apparent that the range-finding process should be re-examined, with additional sessions and perhaps different artifacts.


In addition, scorers would naturally be more experienced with the scale in a reiteration of this study, increasing the reliability of the scores and the accuracy of the results. Finally, the data indicate an issue regarding scorer experience as teachers: the scorers of placement essays should be faculty with experience teaching the courses into which the students are being placed. Two of the three lowest reliability coefficients belong to scorers 182 and 200, who had taught neither the developmental nor the first-semester course, whereas the two highest coefficients (and three of the top four) belong to faculty who have taught both courses. When reiterating the study, it would be best to include only scorers who have taught the full range of the composition sequence.

Based on the results of this study, the program under review developed a response arguing that the current system was valid enough to merit the expense, and that specific changes to the program would improve the accuracy of student placement into composition courses. Specifically, the program argued to change the placement criteria so that all essays with a score of 3.5 are placed into first-semester composition. Judgment was reserved, however, for the other subpopulations where the matched-pair analysis indicated potential weakness, as the N for most of these subpopulations is small, which may have skewed the data. Judgment was reserved on the 3.0- score for the additional reason that the curriculum includes a first-week retest to confirm placement into developmental writing. These judgments would be revisited in a reiteration of the study, with adjustments as explained above.

While the statistical results did not match the hypothesis, ultimately the important result is this: the data contributed to a successful validation argument in the context of a program review. The institutional response was very positive, with the only recommended change being the adjustment suggested by the program. The understanding of validation as a rhetorical rather than a mathematical act can help a writing program administrator (WPA) respond to a program review, as it places the topics of the argument squarely in our bailiwick: WPAs are often rhetoricians who are now being asked to create a rhetorical argument. By depending on principles of sound assessment (local and multiple measures, and using assessment for the purpose of program improvement), that rhetorical argument can effectively respond to the relativity of validity, of the placement context, and of the argument’s audience. In this way, WPAs can construct responses to administrative reviews that will not only defend particular placement practices but lead to the improvement of placement in composition courses.

Appendix A

Score: 1 (The student should be placed into developmental writing.) Overall, these criteria are not met:

• The student is familiar with writing as a recursive process.
• The student is able to critically read his/her own writing.
• The student demonstrates proficient editing and proofreading skills.

Score: 2 (The student should be placed into first-semester composition.) Overall, these criteria are not met:

• The writing reflects coherent thought.
• The writing reflects effective organization.
• The writing reflects reasonable stylistic force and fluency.
• The writing reflects regularity in grammatical and mechanical conventions generally accepted in educated usage.
• The student demonstrates the ability to use critical reading of their own work as a basis of revision.

Score: 3 (The student qualifies for the first-semester composition challenge exam.) Overall, these criteria are met:

• The writing reflects coherent thought.
• The writing reflects effective organization.
• The writing reflects reasonable stylistic force and fluency.
• The writing reflects regularity in grammatical and mechanical conventions generally accepted in educated usage.
• The student demonstrates the ability to use critical reading of their own work as a basis of revision.

References

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and validation. Cambridge: Cambridge University Press.
Bers, T. H., & Smith, K. E. (1990). Assessing assessment programs: The theory and practice of examining reliability and validity of a writing placement test. Community College Review, 18(3), 17–27.
Breland, H. M. (1977). A study of College English placement and the Test of Standard Written English (College Board research and development report RDR 76-77-4). Princeton, NJ: The College Entrance Examination Board.
Breland, H. M. (1987). Assessing writing skill. New York: Educational Testing Service.
Checketts, K. T., & Christensen, M. G. (1974). The validity of awarding credit by examination in English composition. Educational and Psychological Measurement, 34(2), 357–361.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer (Ed.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cronbach, L. J. (1989). Construct validity after thirty years. In R. Linn (Ed.), Intelligence: Measurement, theory and public policy (pp. 147–171). Urbana, IL: University of Illinois Press.
Culpepper, M. M., & Ramsdell, R. (1982). A comparison of a multiple choice and an essay test of writing skills. Research in the Teaching of English, 16(3), 295–297.
Diederich, P. B. (1974). Measuring growth in English. Urbana, IL: NCTE.
Drain, S., & Manos, K. (1986). Testing the test: Mount Saint Vincent University’s English writing competency test. English Quarterly, 19(4), 267–281.
Evans, P., Holmes, M., Jantzi, D., & Haynes, S. (1983). The Ontario Test of English Achievement: The final report on its development. Ontario: Council of Ontario Universities.
Galbato, L. B., & Markus, M. (1995). A comparison study of three methods of evaluating student writing ability for student placement in introductory English classes. Journal of Applied Research in the Community College, 2(2), 153–167.
Glau, G. R. (2002). Hard work and hard data: Using statistics to help your program. In S. Brown & T. Enos (Eds.), The Writing Program Administrator’s resource: A guide to reflective institutional practice (pp. 291–302). Mahwah, NJ: Lawrence Erlbaum.
Guerrero, B. J., & Robison, R. (1986, January). Holistic evaluation of writing samples for placement in post-secondary English composition courses. Paper presented at the Annual Meeting of the Hawaii Educational Research Association, Honolulu, HI.
Haines, V. Y. (1981). Marianopolis validity study. Montreal, Quebec: Dawson College.
Haines, V. Y. (1985). Correlation study of different measures of a college student’s writing ability and the teacher’s own assessment. Montreal, Quebec: Dawson College.
Haswell, R. H. (1991). Gaining ground in college writing: Tales of development and interpretation. Dallas, TX: Southern Methodist University Press.
Haswell, R. H. (2005, March). Post-secondary entrance writing placement. Retrieved October 7, 2005, from http://comppile.tamucc.edu/placement.htm
Huddleston, E. M. (1954). Measurement of writing ability at the college-entrance level: Objective vs. subjective testing techniques. Journal of Experimental Education, 22, 165–213.
Hughes, R. E., & Nelson, C. H. (1991). Placement scores and placement practices: An empirical analysis. Community College Review, 19(1), 42–46.
Huot, B. (2002). (Re)articulating writing assessment for teaching and learning. Logan, UT: Utah State University Press.
Kinkead, J., & Simpson, J. (2000). The administrative audience: A rhetorical problem. WPA: Writing Program Administration, 23(3), 71–84.
McKendy, T. (1992). Locally developed writing tests and the validity of holistic scoring. Research in the Teaching of English, 26(2), 149–166.
Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18, 5–11.
Michael, W. B., & Shaffer, P. (1978). The comparative validity of the California State University and Colleges English Placement Test (CSUC-EPT) in the prediction of fall semester grade point average and English course grades of first-semester entering freshmen. Educational and Psychological Measurement, 38(4), 985–1001.
Michael, W. B., & Shaffer, P. (1979). A comparison of the validity of the Test of Standard Written English (TSWE) and of the California State University and Colleges English Placement Test (CSUC-EPT) in the prediction of grades in a basic English composition course and of overall freshman-year grade point average. Educational and Psychological Measurement, 39(1), 131–145.
Olson, M. A., & Martin, D. (1980, April). Assessment of entering student writing skill in the community college. Paper presented at the Annual Meeting of the American Educational Research Association, Boston, MA.
Orcher, L. T. (2005). Conducting research: Social and behavioral science methods. Glendale, CA: Pyrczak.
Royer, D., & Gilles, J. R. (2003). Directed self-placement: Principles and practices. Cresskill, NJ: Hampton.
Smith, W. L. (1993). Assessing the reliability and adequacy of using holistic scoring of essays as a college composition placement technique. In M. M. Williamson & B. Huot (Eds.), Validating holistic scoring for writing assessment: Theoretical and empirical foundations (pp. 142–205). Cresskill, NJ: Hampton.
Walvoord, B. A. (2004). Assessment clear and simple: A practical guide for institutions, departments and general education. San Francisco, CA: Jossey-Bass.
Williamson, M. M. (1993). An introduction to holistic scoring: The social, historical, and theoretical context for writing assessment. In M. M. Williamson & B. Huot (Eds.), Validating holistic scoring for writing assessment: Theoretical and empirical foundations (pp. 1–44). Cresskill, NJ: Hampton.