Currents in Pharmacy Teaching and Learning 12 (2020) 1–4
Teachable Moments Matter
Guidance for high-stakes testing within pharmacy educational assessment
Michael J. Peetersa,⁎, M. Ken Corb
a University of Toledo College of Pharmacy & Pharmaceutical Sciences, 3000 Arlington Ave, Mail Stop 1013, Toledo, OH 43614, United States
b Faculty of Pharmacy and Pharmaceutical Sciences, 3-209 Edmonton Clinic Health Academy, University of Alberta, 11405-87 Ave, Edmonton T6G 1C9, Alberta, Canada
ARTICLE INFO
Keywords: High-stakes testing; Psychometrics; Reliability; Testing standards

ABSTRACT
Background: The term "high-stakes testing" is widely used among pharmacy educators, but it often seems misused or used incompletely. This Teachable Moments Matter (TMM) focuses on the importance of scientific rigor when assessing learners' abilities. The article discusses high-stakes testing: what it is and what it is not. This TMM is not meant as an extensive review of the topic.

Impact: Because it is imperative for ethically fair high-stakes testing, we focus on defining and explaining high-stakes testing, including: evidence for validation, development of cut-scores, magnitudes of reliability coefficients, and other reliability measurement tools such as Generalizability Theory and Item-Response Theory.

Teachable moment: From our perspective as educational psychometricians, we hope this discussion will help foster scientifically rigorous use and reporting of high-stakes testing in pharmacy education and research.
1. Background

Currents in Pharmacy Teaching and Learning (CPTL) has a new Live and Learn article category. The purpose of articles in this category is to tell the story behind an investigation that did not turn out as planned due to a substantial problem or limitation. Authors are asked to reflect on their experience with the problem-tainted study and to discuss what they learned from it, even though the initially conceived aims were not realized in the final study outcomes. It is hoped that other CPTL readers might learn from these insights. Dy-Boarman and colleagues1 provided CPTL's first article of this type.

In numerous situations (professional meetings as well as other discussions with colleagues), we have repeatedly perceived misunderstandings among some pharmacy academics about what is and what is not "high-stakes testing" in pharmacy education. In their original manuscript, Dy-Boarman and colleagues1 had used the term "high-stakes testing" a number of times, but that version did not report the evidence expected to support high-stakes testing. The term was subsequently removed during peer review, before the article was published. In this short Teachable Moments Matter (TMM), we do not intend to review all issues involved with high-stakes testing. We do discuss an accepted definition of high-stakes testing and its implications for assessment practices in pharmacy education.
⁎ Corresponding author.
E-mail addresses: [email protected] (M.J. Peeters), [email protected] (M.K. Cor).
https://doi.org/10.1016/j.cptl.2019.10.001
2. Definition of high-stakes testing

Central to educational testing are The Standards for Educational and Psychological Testing (The Standards),2 jointly developed by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. The Standards provide guidance on what high-stakes testing includes.2 They have been updated periodically throughout the 20th and 21st centuries, with the most recent revision in 2014. In describing "Use and Interpretation of Educational Assessments," The Standards specify that stakes refers to "the importance of the results of testing programs for individuals, institutions, or groups. When the stakes for an individual are high, and important decisions depend substantially on test performance, the responsibility for providing evidence supporting a test's intended purposes is greater than might be expected for tests used in low-stakes settings." (p188)2

One way to determine the stakes of individual assessments in professional programs such as pharmacy is to understand how performance on any single assessment impacts decisions about whether or not students can progress (i.e., progression policies). For example, compare a progression policy that requires a minimum grade point average (GPA) of 2.4 to continue in the program with a different policy that requires students to pass all courses (C+ or higher) in order to progress. In the 2.4 minimum-GPA model, students could conceivably progress while doing poorly in one course as long as they keep their grades high enough in their other courses. In this model, the stakes of a single exam or major assignment are muted because students have opportunities to compensate for poor performance on other assessments. Comparatively, the model requiring students to pass all courses with a C+ or higher amplifies the stakes of a single assessment within each course: failing one highly weighted assessment could stop a student in their tracks because it could lead to failing an entire course. A consequential impact also comes from the weighting of assessments that culminate in each course grade. This example illustrates how the stakes of an individual assessment are connected to factors such as progression policies, as the sketch below makes concrete. Understanding the consequences for a learner's progress should give readers a basis for judging the stakes of their own assessments, and this judgment should be made in relation to how those assessments are situated in the program.

Irrespective of specific progression policies, it remains important that educators reflect on the potential stakes of individual assessments in their courses and act accordingly. When the stakes of an individual assessment are high, efforts to validate the use of those scores, and the interpretations drawn from them, should be proportional to the stakes. The higher the stakes, the more scientific evidence is needed to support the intended use as well as consequential decision-making. An important assumption with any testing termed "high-stakes testing" is the need for validation, because of the consequences of such testing in major decisions.
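To make the contrast between the two hypothetical policies concrete, the short Python sketch below (our illustration, not part of the original article) encodes each progression rule; the 2.3 grade-point value for a C+ and the equal weighting of courses are assumed conventions that vary by institution.

```python
# Illustrative sketch: how one failed, highly weighted assessment plays out
# under two hypothetical progression policies. Grade-point values and equal
# course weights are assumptions for the example.

def progresses_under_gpa_policy(course_gpas, minimum=2.4):
    """GPA model: a weak result in one course can be offset by others."""
    return sum(course_gpas) / len(course_gpas) >= minimum

def progresses_under_pass_all_policy(course_gpas, passing_grade=2.3):
    """Pass-all model (C+ or higher everywhere): one failed course halts progression."""
    return all(grade >= passing_grade for grade in course_gpas)

# A student who fails one course (1.7) but does well elsewhere:
grades = [3.7, 3.3, 3.0, 1.7]
print(progresses_under_gpa_policy(grades))       # True  -> stakes of the one course are muted
print(progresses_under_pass_all_policy(grades))  # False -> stakes of the one course are amplified
```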
3. Evidence needed for validation in high-stakes testing

As previously described by Peeters and Martin,3 validation is the process of generating evidence to support inferences for decision-making. High-stakes testing necessitates validation evidence more so than lower-stakes testing. The Standards include requirements to produce evidence supporting fairness, validity, and reliability for an intended use of test scores. Reliability is a crucial aspect of fairness; thus, The Standards concisely state, "The reliability/precision of measurement is always important." (p33)2 According to The Standards, "to the extent that scores are not consistent across replications of the testing procedure (i.e., they reflect random errors in measurement), their potential for accurate prediction […] and for wise decision-making is limited." (pp34–35, italics added)2 As such, reliability is key validation evidence for the generalization inference.3

Reliability is not an immutable property of an instrument; it is sample dependent and, as such, requires regular re-analysis based on data from your own applications with your own learners.4 In other words, reliability can be analyzed following every test administration and should be regularly reconsidered with new cohorts.
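As a minimal illustration of re-analyzing reliability after each administration, the Python sketch below computes Cronbach's alpha, one common internal-consistency coefficient. The response data are invented for the example; a real analysis would use your own learners' scores.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an examinees-by-items score matrix.

    scores: 2-D array, rows = examinees, columns = items (e.g., exam questions).
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of examinees' totals
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 0/1-scored responses from five students on four items:
responses = np.array([[1, 1, 1, 0],
                      [1, 1, 0, 0],
                      [1, 0, 1, 1],
                      [0, 0, 0, 0],
                      [1, 1, 1, 1]])
print(round(cronbach_alpha(responses), 2))  # 0.7 for this invented sample
```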
4. Alternate reliability methods

The Standards note that techniques from Generalizability Theory and Item-Response Theory can also be used to model reliability. Generalizability Theory is especially relevant for estimating the reliability of performance assessments like objective structured clinical examinations (OSCEs). These assessments can involve rubrics, multiple occasions, and multiple raters, and Generalizability Theory allows educators to model how each of these measurement characteristics impacts reliability. This allows educators to answer questions such as how many raters, stations, or questions are needed to achieve different levels of reliability (see the sketch at the end of this section). Readers are referred to an overview of the use of Generalizability Theory in pharmacy settings for more detail.5

Item-Response Theory, on the other hand, acknowledges that test questions can differ in their level of difficulty (i.e., student ability is corrected for variance in difficulty among questions). As a latent-trait model, Item-Response Theory facilitates constructing assessments more efficiently (with tighter control over reliability) as well as understanding how reliability changes depending on actual performance. However, Item-Response Theory has limited application in health sciences education because it usually requires large sample sizes; it is therefore most often applied in large-scale testing settings like licensing examinations.6
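To illustrate the kind of decision ("D") study question Generalizability Theory answers, here is a minimal Python sketch assuming a simple persons-by-raters design; the variance components are invented values, not estimates from any real OSCE.

```python
# Hedged sketch of a decision ("D") study for a persons-by-raters design.
# In practice the variance components come from a generalizability ("G")
# study of your own assessment data; these are invented for illustration.

var_person = 0.50    # true differences among students (the variance we want)
var_rater = 0.10     # rater leniency/severity; enters the absolute (phi)
                     # coefficient but not the relative one computed below
var_residual = 0.40  # person-by-rater interaction confounded with error

def relative_g_coefficient(n_raters):
    """Relative generalizability coefficient when averaging over n_raters."""
    return var_person / (var_person + var_residual / n_raters)

for n in (1, 2, 4, 8):
    print(f"{n} rater(s): G = {relative_g_coefficient(n):.2f}")
# 1 -> 0.56, 2 -> 0.71, 4 -> 0.83, 8 -> 0.91: more raters, higher reliability
```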
5. Magnitudes of reliability coefficients

One common question regarding reliability is how high the reliability coefficient should be. As The Standards note, "the need for precision increases as the consequences of decisions and interpretations grow in importance." (p33)2 In other words, it depends on the stakes, and it is a continuum.7 In low-stakes testing, where test results are aggregated with other assessments before high-stakes decisions are made (such as multiple quizzes and exams in one course), lower reliability coefficients of 0.5–0.7 could suffice for each assessment; aggregating multiple assessments increases the reliability of the final grade, as the sketch below shows. However, when stakes are higher (i.e., the assessment carries substantial weight in the final decision), a higher reliability coefficient (>0.8) should be sought. For very high-stakes testing such as professional licensure, a reliability coefficient of >0.9 is often recommended.
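As a rough illustration of why aggregation helps, the sketch below applies the Spearman-Brown prophecy formula, which predicts the reliability of a composite built from k parallel assessments. Real course assessments are rarely strictly parallel, so treat the numbers as an idealization rather than a guarantee.

```python
# Hedged sketch: Spearman-Brown prophecy formula, rho_k = k*r / (1 + (k-1)*r),
# for the reliability of a composite of k parallel assessments each with
# single-assessment reliability r. An idealization: course assessments are
# rarely strictly parallel.

def spearman_brown(single_reliability, k):
    """Predicted reliability of a composite of k parallel assessments."""
    r = single_reliability
    return k * r / (1 + (k - 1) * r)

for k in (1, 2, 4, 6):
    print(f"{k} assessment(s): composite reliability = {spearman_brown(0.6, k):.2f}")
# 1 -> 0.60, 2 -> 0.75, 4 -> 0.86, 6 -> 0.90: aggregation raises reliability
```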
6. Cut-scores

Another notable issue for decisions from high-stakes testing is cut-scores (i.e., the score above which a student "passes" and below which a student "fails" an assessment). Arriving at a cut-score for high-stakes decision-making requires a scientifically rigorous, evidence-based approach. Standard-setting procedures need to be carefully thought out, and multifaceted evidence should be used in determining each cut-off. Different approaches are discussed in the literature,8,9 and one such approach is sketched below. Each involves a process; that is, these approaches lay out systematic steps for determining cut-scores that will reliably differentiate those who meet the necessary standards from those who do not. The effort and scientific rigor devoted to setting cut-scores should be proportional to the stakes of the intended use: higher stakes should involve a more intentional and thoughtful standard-setting process.
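As one hedged illustration of a standard-setting process (the article does not prescribe any specific method), the Python sketch below computes a cut-score in the style of a modified Angoff procedure, using invented judge ratings.

```python
# Hedged sketch of a modified Angoff procedure, one common standard-setting
# approach. Each judge estimates, per item, the probability that a minimally
# competent ("borderline") student answers correctly; the cut-score is the
# sum over items of the mean estimate. Ratings below are invented.

judge_ratings = {  # item -> each judge's probability estimate
    "item1": [0.80, 0.70, 0.75],
    "item2": [0.60, 0.55, 0.65],
    "item3": [0.90, 0.85, 0.95],
    "item4": [0.50, 0.45, 0.55],
}

item_expectations = [sum(r) / len(r) for r in judge_ratings.values()]
cut_score = sum(item_expectations)  # expected raw score of a borderline student
print(f"Cut-score: {cut_score:.2f} of {len(judge_ratings)} points")  # 2.75 of 4
```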
7. Teachable moment

We highlight the concept of "stakes" as a basis for deciding the level of effort and scientific rigor that should go into validation endeavors. We have focused on reliability evidence because many educators are familiar with the concept and will have some experience generating or interpreting this type of evidence. Keep in mind that assessment in education also requires careful consideration of validity issues beyond reliability. Readers are further referred to the Validity section of The Standards2 and to a primer on validation3 for information about other important sources of evidence that can help elevate the rigor of an assessment so that it may serve high-stakes purposes. These include validity evidence from The Standards based on test content, response processes, relationships to other variables, and consequences,2 or inference evidence from Kane's Framework for Validation involving scoring, generalization, extrapolation, and implications.3

The Standards are foundational criteria used to inform the large-scale development of tests such as the Scholastic Aptitude Test (SAT), the American College Test (ACT), the Medical College Admission Test (MCAT), the United States Medical Licensing Examination (USMLE), the Pharmacy College Admission Test (PCAT), and the North American Pharmacist Licensure Examination (NAPLEX). While not all course-level assessments need to meet The Standards with the same scientific rigor as assessments used for high-stakes purposes, there are notable applications for locally developed high-stakes educational testing at colleges/schools of pharmacy. For example, when a college/school of pharmacy uses OSCE scores to make progression decisions, scientifically rigorous validation effort is needed to show evidence for validation inferences. Each college/school of pharmacy should ultimately reflect on where high-stakes assessments occur in its program and commit appropriate resources to support implementation of high-quality assessment. If a high-stakes assessment is not backed by robust validation evidence, it would be wise not to use those scores for important decisions.

In pharmacy educational research, investigators need to ensure scientifically rigorous measurement of their study outcomes.7 Whether an outcome is students' learning, perceptions of learning, or satisfaction with a course revision, these are constructs that are not directly observable and therefore need psychometric consideration. Reliability is one critical element of validation evidence for the generalization inference. An imprecise, poorly measured construct can by itself account for study findings, whether no association is found where there should be one (Type II decision error) or an association is found where there should not be one (Type I decision error). Accordingly, validation evidence (especially reliability) should be reported for rigorous educational outcome measures. Importantly, as Dy-Boarman et al1 recommend in their report, "ensuring the psychometric reliability of our performance assessments among our students is essential and needs close evaluation. […] Dependent variables need to be rigorous, objective markers." If one has measured students' abilities with an educational test (including a performance assessment), psychometric evidence (especially reliability) should be analyzed and reported. The Standards expect high-stakes testing to be scientifically rigorous (e.g., with evidence for reliability) while being implemented in a fair and ethical manner. Educational research in pharmacy should demonstrate these same expectations.

Disclosures

None.

Declaration of competing interest

None.

References

1. Dy-Boarman EA, Bottenberg MM, Diehl B, Mobley-Bukstein W, Quaerna B. Lessons learned from an investigation exploring association between grit and student performance in a pharmacy skills laboratory course. Curr Pharm Teach Learn. 2018;10(11):1443–1446. https://doi.org/10.1016/j.cptl.2018.08.014.
2. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 2014.
3. Peeters MJ, Martin BA. Validation of learning assessments: a primer. Curr Pharm Teach Learn. 2017;9(5):925–933.
4. Zibrowski EM, Myers K, Norman G, Goldszmidt MA. Relying on others' reliability: challenges in clinical teaching assessment. Teach Learn Med. 2011;23(1):21–27.
5. Cor MK, Peeters MJ. Using generalizability theory for reliable learning assessments in pharmacy education. Curr Pharm Teach Learn. 2015;7(3):332–341.
6. De Champlain AF. A primer on classical test theory and item response theory for assessments in medical education. Med Educ. 2010;44(1):109–117.
7. Peeters MJ, Beltyukova SA, Martin BA. Educational testing and validity of conclusions in the scholarship of teaching and learning. Am J Pharm Educ. 2013;77(9). https://doi.org/10.5688/ajpe779186.
8. Lane S, Raymond MR, Haladyna TM. Handbook of Test Development. 2nd ed. New York, NY: Routledge; 2016.
9. Smith EV, Stone GE. Criterion Referenced Testing: Practice Analysis to Score Reporting Using Rasch Measurement Models. Maple Grove, MN: JAM Press; 2009.