Studies in Educational Evaluation 65 (2020) 100843
Validity inquiry process: Practical guidance for examining performance assessments and building a validity argument
Cynthia A. Conn a,*, Kathy J. Bohan b,1, Suzanne L. Pieper c,2, Matteo Musumeci d,3
a Professional Education Programs, Northern Arizona University, United States
b College of Education, Northern Arizona University, United States
c Office of Curriculum, Learning Design, & Academic Assessment, Northern Arizona University, United States
d Faculty of Life Sciences & Medicine, Education Operating Services, Centre for Education, King's College London, United Kingdom
ARTICLE INFO
Keywords: Performance assessment; Examining validity; Validity inquiry instruments; Validity argument; Design-based research
ABSTRACT
Given the increased use of performance assessments (PAs) in higher education to evaluate achievement of learning outcomes, it is important to address the barriers related to ensuring quality for this type of assessment. This article presents a design-based research (DBR) study that resulted in the development of a Validity Inquiry Process (VIP). The study’s aim was to support faculty in examining the validity and reliability of the interpretation and use of results from locally developed PAs. DBR was determined to be an appropriate method because it is used to study interventions such as an instructional innovation, type of assessment, technology integration, or administrative activity (Anderson & Shattuck, 2012). The VIP provides a collection of instruments and utilizes a reflective practice approach integrating concepts of quality criteria and development of a validity argument as outlined in the literature (M.T. Kane, 2013; Linn, Baker, & Dunbar, 1991; Messick, 1994).
1. Introduction
Performance assessments (PAs) are prevalent in higher education and allow students to demonstrate their knowledge, skills, and perceptions through activities such as writing portfolios, oral presentations, and projects. The types of PAs typically encountered in professional academic programs such as education, the health professions, science, engineering, and the arts and humanities require a structured and formalized observation of the student demonstrating the skills or behaviors of the profession. PAs can be administered throughout all phases of a student's academic program, but they should (a) align to the curricular or professional learning objectives, (b) be authentic, direct assessments of a student's performance, and (c) align to the discipline's standards of what constitutes good practice or competence. When used effectively, PAs can show a person's ability to apply theoretical or conceptual knowledge or establish a person's preparedness to enter a profession. The PA results can provide individual feedback to students,
inform instructional decisions, and be a source of data for program evaluation. Initially, the study’s investigators sought to find tools or procedures for improving PA scoring guides to increase confidence in the data being collected. However, based on an initial review of the literature, the study’s aim evolved into investigating, developing, and implementing best practices to support faculty in examining the validity and reliability of the interpretation and use of results from locally developed PAs. This study utilized a design-based research (DBR) methodology (Anderson & Shattuck, 2012; Barab, 2004; Brown, 1992; McKenny & Reeves, 2012), a process in the learning sciences intended “for advancing new theory and practice” (Barab, 2014, p. 151). DBR is used to study interventions such as an instructional innovation, type of assessment, technology integration, or administrative activity (Anderson & Shattuck, 2012). The study resulted in the development of the Validity Inquiry Process (VIP).
⁎ Corresponding author at: Professional Education Programs, Northern Arizona University, 801 South Knoles Drive, P.O. Box 5774, Flagstaff, AZ 86001-5774, United States. E-mail addresses:
[email protected] (C.A. Conn),
[email protected] (K.J. Bohan),
[email protected] (S.L. Pieper),
[email protected] (M. Musumeci). 1 College of Education, Northern Arizona University, P.O. Box 5774, Flagstaff, AZ 86001-5774. 2 Office of Curriculum, Learning Design, and Academic Assessment, Northern Arizona University, P. O. Box 4091, Flagstaff, AZ 86001-4091. 3 Faculty of Life Sciences & Medicine, Education Operating Services, Centre for Education, King's College London, Henriette Raphael Building, Guy’s Campus, London SE1 1UL.
https://doi.org/10.1016/j.stueduc.2020.100843 Received 24 July 2018; Received in revised form 12 January 2020; Accepted 21 January 2020 0191-491X/ © 2020 Published by Elsevier Ltd.
2. Theoretical background
Over the decades, the classic understanding of validity as various types (e.g., content, criterion, construct) has evolved into Messick's (1994) unified framework of construct validity, subsuming the various types of validity under one umbrella. This unified approach requires obtaining evidence from different sources for the interpretation of test and assessment results (Brennan, 2013; M.T. Kane, 2013). The unified concept of validity is largely accepted in the field of educational measurement and was adopted by the 2014 Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014).
The complexity of PAs administered in authentic settings presents unique challenges for assessment developers and raters. With PAs, students demonstrate skills or produce artifacts that require rater judgments, and those judgments must be anchored to professional standards or learning outcomes. In PA development, designers must thoughtfully consider the full range of ways a student might respond. The PA developers and evaluators must also ensure rigor and fairness in interpreting how to rate a response and provide feedback. Faculty involved in developing PAs bring a variety of expertise to their role, but not necessarily the assessment or measurement knowledge needed to verify the quality of PAs, which in part led to the development of this research study.
2.1. Performance assessments in higher education
Well-designed PAs provide a number of important benefits. One benefit is that they "merge learning and assessment" (Suskie, 2009, p. 26). Classroom-based research studies have shown that students are most successful in classes where inquiry-based, active learning methods are emphasized over content coverage (McConnell, Steer, & Owens, 2003; Sundberg, Dini, & Li, 1994). McConnell et al. (2003) found that active learning methods contributed to improved student retention, deeper understanding of course material, and an increase in logical thinking skills.
Another benefit of PAs is that students receive specific feedback on their performance. Students are increasingly viewing themselves as "learning consumers," and they expect "clear, practical, and specific feedback" that will help them to develop and improve their knowledge and skills (Lombardi, 2008, p. 4). PAs are most powerful in promoting student success when they transparently state the assignment's purpose, outline the steps for successfully completing the performance task, and incorporate useful feedback through a scoring guide based on well-defined criteria (Winkelmes et al., 2015).
PAs also respond to external stakeholders' call to prepare students to succeed in work and life. A survey conducted for the Association of American Colleges and Universities found that employers overwhelmingly endorsed "practices that involve such things as collaborative problem-solving, research, senior projects, community engagement, and internships" (Hart Research Associates, 2013, p. 12). PAs answer these requests by "documenting and encouraging critical, creative, and self-reflective thought" (Moss, 1992, p. 230).
While PAs show promise in terms of improving teaching and learning, barriers to the successful design and use of results from these assessments exist. Beginning with early uses of PAs in schools, "problems [have arisen] in the areas of scoring, validity, instruction versus accountability, time constraints, costs, and teacher resistance" (Lombardi, 2008, p. 4). The most frequently cited barriers are the time it takes to develop and manage an effective PA, and the difficulty of examining the validity of the use and interpretation of the results.
2.1.1. Development of performance assessments
Following the Stages of Backward Design (Wiggins & McTighe, 2005), the initial step for developing a PA is the identification of "desired results" (p. 18) or learning outcomes to be evaluated. The second stage is to "determine acceptable evidence" (p. 18). Through this stage, if a PA is selected as an appropriate, meaningful strategy for evaluating learning, the purpose (i.e., what the students should gain from completing the PA and how the PA will provide evidence of student competence) should be articulated. To construct the PA, McMillan (2018) recommends three steps: 1) state the performance task, 2) describe the task, and 3) write the performance task prompt. After the development of the performance task and prompt, a scoring guide aligned to the identified learning outcomes should be developed. The final stage of the Backward Design model, "plan learning experiences and instruction" (p. 18), includes selection of instructional strategies, materials, and exercises to prepare learners with the knowledge, skills, and/or behaviors aligned to the desired results.

2.1.2. Performance assessments and technical issues
As PAs have become more widely used in higher education, faculty need to be able to address technical concerns about validity and reliability. Since PAs are often authentic tasks mirroring performance domains relevant to a discipline or profession, the assessments are typically perceived as having face validity (i.e., the appearance of emulating a real-life task that students might be expected to accomplish as practitioners). However, there is a need to go beyond face validity.

3. Context of study
This study was conducted at a public university in the United States with large educator preparation programs graduating approximately 1050 teacher preparation and advanced education students annually. To meet accreditation requirements, a variety of locally developed PAs across 31 programs were developed and continue to be implemented. Practical guidelines for examining validity and reliability for the interpretation of locally developed PAs that form the foundation for program evaluation efforts are needed to ensure instructional decisions are being made based on quality evidence (Baartman, Bastiaens, Kirschner, & van der Vleuten, 2006; Worrell et al., 2014).
With over 200 PAs being implemented in the institution's educator preparation programs, there were multiple examples of assessment instruments and scoring guides with varying strengths and weaknesses. This study was deemed important because there were no user-friendly university or national accreditation guidelines for systematically reviewing the quality of the PAs. Yet, new national accreditation standards required documentation of validity and reliability of the use of results from assessment instruments. This new requirement coincided with faculty interest in establishing greater confidence in the assessment data collected from locally developed instruments. Thus, an underlying aim of the project was to transform faculty engagement in institutional assessment and accreditation reporting to be more directly tied to teaching, learning, and curricular improvements.

3.1. Research questions
Multiple research questions were explored through this study. The overarching question was: what strategies or tools can support faculty in examining the validity and reliability of the interpretation and use of PA results? The sub-questions investigated through the iterative DBR methodology, which align with the research project stages, are stated below.
1. What issues and challenges are faculty identifying related to the use of results from locally developed PAs?
2. What strategies or tools for examining the validity and reliability for the interpretation and use of PAs already exist in the literature?
3. What strategies or tools can be developed and implemented for examining validity and reliability of PAs based on findings from the literature and data collected from pilot administrations?
4. How do the developed strategies or tools assist faculty in addressing issues and challenges related to the use of results from PAs?
Table 1
Iterative Stages Aligned to Research Sub-questions and Development of VIP Strategies and Tools.

Stage 1: What issues and challenges are faculty identifying related to the use of results from locally developed PAs?
Data sources: Informal interviews with faculty; expert consultation.
Participants: 2 initial investigators; teacher preparation and advanced education faculty.
Outcomes: Faculty concerned about confidence in data collected with locally developed PAs; issues with providing evidence of validity of PA data results to accreditors.

Stage 2: What strategies or tools for examining validity and reliability of PAs already exist in the literature?
Data sources: Review of the literature including, but not limited to, Linn et al. (1991), Messick (1994), Stevens & Levi (2005), and Suskie (2009).
Participants: 2 initial investigators.
Outcomes: Frameworks and resources identified (criteria for quality PAs, establishing reliability of PAs, rubric development resources, validity frameworks); challenges with finding user-friendly strategies and tools for examining PA validity and reliability.

Stage 3a: What strategies or tools can be developed and implemented for examining validity and reliability of PAs based on findings from the literature and data collected from pilot administrations?
Data sources: Findings from the literature; facilitator observation notes, informal faculty feedback, Student Survey results, and artifacts collected from faculty completing VIP instruments through the pilot with 2 programs and 14 PAs.
Participants: 2 initial investigators; 4 Educational Technology faculty; 4 History Education faculty; 223 Educational Technology graduate students.
Outcomes: Utilized frameworks and resources to develop the first draft of the Validity Inquiry Process (VIP) instruments (Content Analysis, Validity Inquiry Form, Metarubric for Examining PA Rubrics, Student Survey, Conducting Review of Reliability); VIP draft piloted with faculty and students; revisions to the initial draft of the VIP, including guiding questions, formats, and procedures, based on facilitator observation notes and informal feedback from faculty participating in VIP meetings.

Stage 3b: What strategies or tools can be developed and implemented for examining validity and reliability of PAs based on findings from the literature and data collected from pilot administrations?
Data sources: Additional review of the literature including, but not limited to, Graham et al. (2012), M. Kane (2013), and M.T. Kane (2013); facilitator observation notes, informal faculty feedback, Student Survey results, and artifacts collected from faculty completing revised VIP instruments with 13 additional programs and 40 PAs; VIP Faculty Satisfaction Survey and VIP meeting video recordings also added as new sources of data.
Participants: 2 initial investigators; 3 additional investigators; 48 program faculty from various programs (e.g., Bilingual Ed., Early Childhood Ed., Elementary Ed., Special Ed., Secondary Ed., Ed. Leadership, School Psychology); 4 student teacher supervisors using a common student teaching assessment instrument; 9 graduate students.
Outcomes: Additional literature identified and findings incorporated into VIP instruments (inter-rater agreement, validity argument); literature regarding forming a validity argument led to a focus on development of a PA purpose statement; on-going revisions to VIP guiding questions, formats, and procedures; transformation of faculty engagement and assessment practices.

Stage 4: How do the developed strategies or tools assist faculty in addressing issues and challenges related to the use of results from PAs?
Data sources: Additional review of the literature including, but not limited to, Brookhart and Nitko (2019) and Cook et al. (2015); continued feedback solicited through facilitator observation notes, informal faculty feedback, Student Survey results, and artifacts collected from faculty completing VIP instruments; feedback from attendees familiarized with the VIP through national and state conference presentations.
Participants: 2 initial investigators; 3 additional investigators; approximately 300 attendees from five conference presentations.
Outcomes: Updated literature identified and incorporated into the processes related to reviewing reliability; tested revised versions of VIP strategies and instruments (Content Analysis, Validity Inquiry Form, Metarubric for Examining PA Rubrics, Student Survey, Conducting Review of Reliability); developed and tested a new VIP instrument, the Validity Argument template; VIP used as evidence to successfully attain national accreditation; higher quality PAs and greater confidence in assessment data results for providing feedback to students and for use in improving curricula.
4. Method

4.1. Research design
To address the research questions, design-based research (DBR) was determined to be an appropriate method because it focuses on impacting practice through "the design and testing of a significant intervention" (Orngreen, 2015, p. 23). DBR typically involves a "collaborative partnership with participants in the research context" (Design-Based Research Collective, 2003, p. 7). Barab (2014) also describes DBR as "a process of reflective action" that is "grounded in actual happenings" (p. 163). The DBR research stages were developed from guidelines outlined in the literature (Anderson & Shattuck, 2012; Bakker, 2014; Barab, 2014; McKenny & Reeves, 2012; Wheaton, 2018). The research sub-questions, which align to the research stages, as well as the relevant data sources, participants, and outcomes for each of the research stages, are stated in Table 1.

4.2. Participants

4.2.1. Investigators
The five investigators involved in the study serve in university leadership roles. The investigators have over 15 years of experience working in K-12 settings and overseeing nationally accredited, state approved teacher preparation and advanced education programs. The investigators have experience developing PAs and rubrics and have led faculty development workshops on these topics. Three of the investigators have experience serving as program reviewers for national, regional, or state accreditation agencies.
In 2012, one of the initial investigators served as a faculty consultant to support the institution's transition to the new Council for the Accreditation of Educator Preparation (CAEP) standards and expectations. The investigator consulted with a measurement expert at the university and articulated the study's overarching research question. These two initial investigators led efforts in Stages 1 and 2, and initial pilot testing conducted during Stage 3a. After the initial pilot tests of the VIP strategies and tools were conducted, the use of the process expanded to include the three additional investigators who assisted with facilitation of VIP meetings conducted with faculty. Throughout the study, all investigators served as participant observers.

4.2.2. Faculty
Over the course of the study, approximately 56 faculty and 44 student teacher supervisors from undergraduate and graduate Early Childhood, Elementary, Secondary, and Special Education teacher preparation programs, as well as advanced educator preparation programs (e.g., Bilingual/Multicultural, Educational Technology, Educational Leadership, School Psychology), participated. Faculty were recruited based on their role as a lead instructor for a course with an embedded, key PA. Additionally, student teacher supervisors who had previously implemented the PA under review were also recruited. Course-embedded, key assessments from programs receiving conditions (i.e., required areas for improvement) through an accreditation review were prioritized.

4.2.3. Students
During Stage 3a, the Student Survey, a VIP instrument described below, was piloted with 223 Educational Technology graduate students who had completed a course-embedded, key PA. During Stage 3b, it was determined that surveying students about poor quality PAs would not produce useful data. Additionally, during this stage, three students participated in a facilitated meeting regarding a common assessment used by teacher preparation programs, the Teacher Candidate Work Sample. By Stage 4, the Student Survey was re-introduced with students in the School Psychology graduate program after examining and improving the quality of the PAs under review.
5. Data collection, analysis, and iterative findings
Data collection for this study included conducting a literature review, informal interviews and feedback from faculty, faculty-completed VIP instruments, facilitator observation notes, scribed facilitated meeting notes, and faculty and student survey results.
Following facilitated meetings, the investigators and a lead program faculty member communicated to review data collected, determine iterative revisions, and document next steps for the specific PA and/or the program faculty. For the DBR project, all of the investigators met at least quarterly to analyze data collected from recent pilot administrations and discuss findings about revisions needed for the process. This led to updates to the instruments and implementation procedures. A narrative describing each stage of the study, the associated research sub-question, example analyses of data collected, and related outcomes is presented next.
5.1. Stage 1: Identify issues and challenges with use of performance assessments
During Stage 1, the first research sub-question investigated faculty-perceived issues and barriers for establishing evidence of PA validity. Through informal interviews with faculty, the issues identified included:
• Limited time to develop initial PAs;
• Dissatisfaction with the quality of many PAs and thus lack of confidence in the use of data results; and
• Lack of guidance on how to efficiently review and modify PAs.
The two initial investigators collaborated to further explore the identified issues and challenges brought forth by program faculty, which led to the determination to pursue the study and recognition that an iterative research design such as DBR was appropriate.

5.2. Stage 2: Investigate potential solutions from the existing literature
The second stage included conducting a review of the literature to explore the second research sub-question: what strategies or tools for examining validity and reliability of the use and interpretation of PAs already exist in the literature? The initial investigators sought to identify existing guidelines or procedures that could be easily understood and efficiently implemented by faculty.
An initial search was completed by a university librarian. The search terms performance-based assessment, measurement validity, and education were used, and the Academic Search Complete (EBSCOhost) and JSTOR databases were searched. From these searches conducted by the librarian, eight relevant articles were identified. Using keywords from the eight identified articles to further search library databases and snowballing references from these articles, the initial investigators located an additional 27 relevant articles. The articles were categorized and analyzed to identify information and approaches that were viewed as easy to understand and feasible for faculty to implement, and that constituted a defensible process. The categories found through the resources included how to develop and evaluate rubrics, criteria for evaluating PAs, approaches to establishing consistency in data collection, and existing validity frameworks. The findings deemed most relevant based on the second research sub-question are presented next.
5.2.1. Rubric development and use
Stevens and Levi's (2005) work related to developing rubrics and, specifically, their idea of developing a metarubric to evaluate rubrics was identified as useful guidance. However, it became apparent that the clarity of instructions provided for the PA was directly tied to the fairness of the scoring guide. This led to a deeper review of the literature to locate additional overall guidance for evaluating the quality of PAs.
5.2.2. Criteria for evaluating the quality of performance assessments
Linn et al. (1991) recommended evaluating the quality of PAs using the following criteria: content coverage, content quality, cognitive complexity, meaningfulness, generalizability and transfer, costs and efficiency, and consequences. A detailed description of these criteria is provided in Table 2.

5.2.3. Reliability of performance assessments
Reliability, or the consistency and comparability of assessment results, is also essential for establishing validity. Stemler (2004) provides guidance on establishing inter-rater reliability through consensus, consistency, and/or measurement approaches. For locally developed PAs used in higher education, inter-rater agreement (i.e., a consensus approach) tends to be the most relevant for the types of decision inferences intended. Inter-rater agreement focuses on comparing the exact scores attained by multiple raters based on established criteria. Calibration is a strategy to support all raters in achieving a common understanding of the PA scoring criteria through faculty conversations during the development, implementation, or revision phases of a PA. Faculty use the scoring procedures until high agreement is established. Results should be monitored and calibration training provided systematically to ensure continued consistency in scoring procedures. For high-stakes decisions, the use of multiple raters may be important to address any sensitivity and specificity concerns.
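To make the consensus (inter-rater agreement) approach concrete, a minimal sketch follows. It is an illustration only, not one of the VIP instruments; the 4-point rubric scores, rater labels, and function names are assumptions invented for this example.

```python
# Minimal illustration of a consensus (inter-rater agreement) check.
# The rubric scores below are hypothetical 4-point ratings given by two
# raters to the same eight student artifacts.

def percent_exact_agreement(rater_a, rater_b):
    """Proportion of artifacts on which both raters gave the identical score."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def percent_adjacent_agreement(rater_a, rater_b, tolerance=1):
    """Proportion of artifacts on which scores differ by no more than `tolerance`."""
    return sum(abs(a - b) <= tolerance for a, b in zip(rater_a, rater_b)) / len(rater_a)

rater_a = [4, 3, 3, 2, 4, 1, 3, 2]  # hypothetical scores from rater A
rater_b = [4, 3, 2, 2, 4, 2, 3, 3]  # hypothetical scores from rater B

print(f"Exact agreement:    {percent_exact_agreement(rater_a, rater_b):.0%}")     # 5 of 8 artifacts
print(f"Adjacent agreement: {percent_adjacent_agreement(rater_a, rater_b):.0%}")  # 8 of 8 artifacts
```

Whether exact or adjacent agreement is the more appropriate summary depends on the stakes of the decision being made with the scores, as discussed above.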
5.2.4. Existing validity frameworks from the literature
Lai, Wei, Hall and Fulkerson (2012) published a white paper focused on establishing a validity argument for PAs. The paper provided a "validity agenda" (Lai et al., 2012, p. 6) for PAs. For each source of validity evidence, a mix of qualitative and quantitative approaches was suggested and examples were provided along with guiding questions. Kane, Crooks and Cohen's (1999) framework focused on scoring, generalization, extrapolation, and implications. The authors emphasized the importance of clearly stating the proposed interpretation and use of assessment scores before determining what and how evidence will be presented in a validity argument. However, both frameworks lack practical methods and instruments. During Stage 3, M. Kane's (2013) and M.T. Kane's (2013) conceptualizations of a validity argument were located in the literature, contributing to further development of the innovation.
5.2.5. Analysis of literature review findings
The results of the literature review allowed the initial investigators to gain a stronger grounding in the relevant concepts and available frameworks. Early in the study it became apparent that it would be critical to examine the connection between the PA prompt and scoring guide or rubric. Additionally, the publications located were judged to be difficult for faculty and practitioners to interpret and use. Baartman et al. (2006) concurred that there was a lack of practical instruments with operational definitions aligned to establishing validity criteria to assist higher education faculty with evaluating PAs. Due to the nature of the work, the Linn et al. (1991) validity or quality criteria were utilized to develop a reflective and efficient process that would be viewed as feasible to implement. Recognizing the importance of Messick's (1994) and Stevens and Levi's (2005) contributions, concepts related to construct validity and guidelines for developing scoring guides were integrated into the practical guidelines the initial investigators determined were needed.
5.3. Stage 3a: VIP development, pilot implementation, and refinements
Based on the limited findings of practical, easy-to-implement guidelines from the literature review conducted during Stage 2, the initial investigators focused efforts on research sub-question three: what strategies or tools can be developed and implemented for examining validity and reliability of PAs based on findings from the literature and data collected from pilot administrations?
Table 2
List of Validity Criteria Identified Through a Review of the Literature.

Domain Coverage: Domain Coverage is defined as the breadth and depth of content addressed (Messick, 1994). Similarly, Linn et al. (1991) refer to domain coverage in terms of how comprehensive the performance assessments are that make up a program evaluation plan, as well as showing evidence that a "broad representation of content specialists" (p. 20) are involved in developing the learning outcomes.

Content Quality: Content Quality is defined as being aligned to the "current understanding of the field" while also reflecting aspects of the discipline that are intended to "stand the test of time" (Linn et al., 1991, p. 19). Content quality is further defined as measuring tasks that "are worthy of the time and efforts of students and raters" (p. 19). To ensure content quality, Messick (1994) recommends a mix of measures including both "extended performance tasks and briefer structured exercises" (p. 15) and the assessment of both content knowledge and related skills. Also, the performance assessment should not contain "anything irrelevant…leading to minimal construct-irrelevant variance" (p. 21).

Cognitive Complexity: Cognitive Complexity is defined as emphasizing "problem solving, comprehension, critical thinking, reasoning, and metacognitive processes," and it is important to identify the "processes students are required to exercise" in order to complete the performance assessment (Linn et al., 1991, p. 19). Messick (1994) adds that the complexity of the performance task should match the construct (i.e., learning outcome) being measured as well as "the level of developing expertise of the students" (p. 21).

Meaningfulness: The value of performance assessments resides with the idea that these types of assessments "get students to deal with meaningful problems that provide worthwhile educational experiences" (Linn et al., 1991, p. 20). Messick (1994) recommends "favoring rich contextualization of problems or tasks … to engage student interest and thereby improve motivation and interest" (p. 19). Another way of defining Meaningfulness is considering the authenticity of the assessment.

Generalizability: Generalizability is concisely defined as how the response to the content and context of the performance assessment transfers to other related discipline situations (Linn et al., 1991; Messick, 1994). To provide evidence of generalizability, it is important to evaluate the problems, projects, or scenarios presented in terms of whether they address multiple and varied topics or problems. The context or problem situation should be evaluated in terms of the richness and level of detail.

Consequences: Linn et al. (1991) suggest evaluating Consequences by considering whether the performance assessment takes a reasonable amount of time to implement in relation to other course topics, the method for establishing the pass-fail or cut score, and the extent to which the benefits of implementing the assessment outweigh any unintended adverse ramifications.

Fairness: Fairness is defined as ensuring that all students have the same opportunity to gain the knowledge and skills necessary to complete the assessment. Fairness also relates to confirming all student work is evaluated using the same criteria (Linn et al., 1991).

Cost and Efficiency: Cost and Efficiency is described by Messick (1994) as the idea of "utility" or the "costs and efficiency relative to the benefits" (p. 21). Performance assessments can provide valuable insight related to learning. These benefits need to be considered in relation to the practicality of implementing the assessment. Costs need to be acceptable and sustainable by the unit responsible, and attention needs to be "given to the development of efficient data collection designs and scoring procedures" (Linn et al., 1991, p. 20).

During Stage 3a, the initial version of the VIP and related instruments was developed and piloted. The process emphasized practical guidelines that could be easily understood and implemented efficiently, and that encouraged engagement by faculty and practitioners. The VIP strategies and instruments emerged from connecting the definitions of the validity criteria and other concepts located through the literature review to identify ways to gather appropriate evidence through a reflective, theory-to-practice process. The five VIP strategies and instruments developed for examining a PA include: Content Analysis, Validity Inquiry Form, Metarubric for Examining Performance Assessment Rubrics, Student Survey, and Conducting Review of Reliability. Fig. 1 provides a representation of the VIP. Copies of the strategies and tools, as well as examples, can be accessed online at https://nau.edu/pep/quality-assurance-system/ (Validity Inquiry Process tab).

5.3.1. Content analysis
Several strategies were developed for the analysis and documentation of evidence for four of the validity criteria. These strategies include:
1. For gathering evidence related to Domain Coverage, write a narrative stating how content experts were involved in the development of the learning outcomes and/or how discipline organization standards were utilized.
2. To document evidence of Content Quality, meeting minutes or other documentation demonstrating the involvement of content experts in the design of the PA can be used.
3. Content Quality can also be demonstrated by documenting the balance between extended performance tasks and briefer structured exercises. Analyze and document the number of key PAs falling into each category for a program of study. If the PAs appear to favor one type, then options can be discussed regarding ways to balance the number of extended and briefer structured PAs.
4. Generalizability can be documented through an analysis of the assessment prompt, related instructional materials, and the scoring guide. Analysis of PAs should discuss the richness of the problem, project, or scenario in relation to the discipline and/or the inclusion of an exemplar to model a potential solution.
5. To address Cost and Efficiency, Linn et al. (1991) suggest reflecting on whether the PA is too burdensome to implement. Additionally, costs related to the instrument, scoring procedures, evaluator training, or the data collection and reporting system should also be included.

5.3.2. Validity inquiry form
The Validity Inquiry Form evolved into the primary instrument for the VIP. The first element requests that reviewers state the purpose of the PA. The Validity Inquiry Form includes reflective questions to support faculty in articulating the purpose and the interpretation and use of data resulting from the PA. The instrument also contains a series of questions to evaluate the PA. If the purpose of the PA has not previously been documented, it is critical to clearly state and agree upon the purpose so reviewers will have the necessary information and perspective to respond to the form questions. Table 3 shows the reflective questions for supporting development of the purpose statement, as well as the alignment between the validity criteria and associated reflective questions.
A question on the Validity Inquiry Form promotes an evaluation of the Cognitive Complexity of the PA. The Rigor/Relevance Framework® (Daggett, 2016) is referenced as an approach and contains two continuums: the revised Bloom's Taxonomy and a continuum of application of skills within a discipline or across disciplines. Reviewers determine where the work required for the PA falls within the two continuums, and whether the quadrant is appropriate in relation to the preparedness of the student to be successful on the PA when it is administered.
The Validity Inquiry Form should be completed by the faculty member who developed the PA and at least one other individual with appropriate content or assessment expertise. It is helpful if a principal investigator, lead faculty member, or assessment coordinator facilitates the process. The meeting with the individuals who complete the review should focus on the items that were identified as areas for improvement.
Faculty should discuss and determine appropriate revisions. A template with an example chart for documenting the Validity Inquiry Form results is provided on the website noted previously.

Fig. 1. Validity Inquiry Process (VIP), Practical Guidance for Examining Performance Assessments and Building a Validity Argument.
5.3.3. Metarubric for examining performance assessment rubrics
After faculty have used the Validity Inquiry Form to guide their review of the PA instructions, the Metarubric for Examining Performance Assessment Rubrics should be used by reviewers to evaluate the Fairness of the scoring guide. The questions on this form are drawn from the work of Pieper (2012), Messick (1994), and Stevens and Levi (2005). The completion of this instrument provides a deep review of the scoring guide by examining the learning outcomes aligned to the PA, scale, descriptions, overall qualities, and use of the rubric. The quantitative and qualitative data collected through the Metarubric for Examining Performance Assessment Rubrics should be compiled and discussed during a facilitated meeting, and agreed-upon revisions should be made to the scoring guide.

Table 3
Validity Inquiry Form Reflective Questions by Validity Criterion.

Purpose of Performance Assessment — Reflective Questions:
• What is the purpose of the assessment?
• How is the purpose communicated to candidates?
• How is the performance assessment data interpreted and used?
• What is the connection(s) between the data from this performance assessment and other data sources?

Validity Criterion — Reflective Questions:
Domain Coverage: Q1. Do the performance assessment instructions adequately address (i.e., in terms of breadth and depth) the student learning outcome(s)/standard(s) aligned to it?
Content Quality: Q2. Does the performance assessment evaluate process or application skills as well as content knowledge?
Cognitive Complexity: Q3. Analyze the performance assessment in terms of cognitive complexity. One approach is to use the Rigor/Relevance Framework (see http://leadered.com/rigor-relevance-and-relationships-frameworks/). Identify the quadrant that the assessment falls into and provide a justification for this determination.
Meaningfulness: Q4. Do you view this performance assessment as authentic (i.e., "representative of real life tasks") in terms of the problem, project, and/or scenario that is being presented to students (Gall et al., 1996, p. 268)?
Consequences: Q5. How was the "pass-fail (cut) score" established (Downing, 2003, p. 832)? Q6. How do the benefits of implementing the performance assessment outweigh any unintended adverse consequences (Downing, 2003)? Q7. "Are the consequences of the performance assessment, [in terms of percent of overall grade and/or use as a data point to determine continuation in program], reasonable?" (Gall et al., 1996, p. 268, adapted from Linn et al., 1991)
Fairness: Q8. Do all students have the same opportunity to gain the knowledge and skills necessary to complete the assessment?
Efficiency: Q9. Is the time allowed to complete the assessment reasonable? Q10. "Is the performance assessment too…cumbersome [i.e., difficult to implement or communicate expectations or instructions steps to students] to administer" (Gall et al., 1996, p. 268, adapted from Linn et al., 1991)?
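For programs that track reviewer responses electronically, the Table 3 mapping can be captured in a simple data structure so that ratings are compiled by validity criterion (as in the compiled consensus shown in Table 4). The sketch below is a hypothetical illustration only: the abbreviated question wording, dictionary layout, rating labels, and helper name are assumptions, not part of the published VIP instruments.

```python
# Hypothetical sketch: the Validity Inquiry Form as a criterion-to-questions map,
# plus a helper that groups reviewer ratings by criterion. Question wording is
# abbreviated; see Table 3 for the full reflective questions.
from collections import defaultdict

VALIDITY_INQUIRY_FORM = {
    "Domain Coverage":      ["Q1: Do the PA instructions adequately address the aligned outcomes/standards?"],
    "Content Quality":      ["Q2: Does the PA evaluate process/application skills as well as content knowledge?"],
    "Cognitive Complexity": ["Q3: Analyze the PA's cognitive complexity (e.g., Rigor/Relevance Framework quadrant)."],
    "Meaningfulness":       ["Q4: Is the PA authentic, i.e., representative of real-life tasks?"],
    "Consequences":         ["Q5: How was the pass-fail (cut) score established?",
                             "Q6: Do the benefits outweigh unintended adverse consequences?",
                             "Q7: Are the consequences of the PA reasonable?"],
    "Fairness":             ["Q8: Do all students have the same opportunity to gain the needed knowledge and skills?"],
    "Efficiency":           ["Q9: Is the time allowed to complete the assessment reasonable?",
                             "Q10: Is the PA too cumbersome to administer?"],
}

def compile_by_criterion(reviewer_ratings):
    """Group ratings (e.g., Needs Improvement / Acceptable / Effective) by criterion."""
    compiled = defaultdict(list)
    for reviewer, ratings in reviewer_ratings.items():
        for criterion, rating in ratings.items():
            compiled[criterion].append(f"{reviewer}: {rating}")
    return compiled

# Hypothetical usage with two reviewers rating one PA on two criteria.
compiled = compile_by_criterion({
    "Reviewer A": {"Meaningfulness": "Effective", "Fairness": "Acceptable"},
    "Reviewer B": {"Meaningfulness": "Acceptable", "Fairness": "Needs Improvement"},
})
for criterion in VALIDITY_INQUIRY_FORM:
    if criterion in compiled:
        print(criterion, "->", "; ".join(compiled[criterion]))
```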
Table 4
Compiled Responses from Educational Technology Faculty Completing Validity Inquiry Form for PAs.

ETC 547, Article Abstracts
Faculty consensus, strength(s): Allows students to practice research and writing skills while acquiring foundational content knowledge.
Faculty consensus, area(s) for improvement: Performance assessment needs to be more than a writing piece to address intent of standards.

ETC 567, Grant Proposal
Faculty consensus, strength(s): Assessment addresses Strategic Planning, Programs & Funding, and Innovation & Change in a comprehensive manner.
Faculty consensus, area(s) for improvement: Review indicated that Shared Vision component should be evident and assessed through the final product developed by students as well as through their self-assessment of the project.

ETC 567, Global Digital Citizenship Unit
Faculty consensus, strength(s): Assignment provides assessment of student understanding of global digital citizenship concepts and effective ways for teaching these concepts.
Faculty consensus, area(s) for improvement: Assessment could be more authentic if students were asked to pilot the instructional plan with students or share it with colleagues.

ETC 625, Unit Plan
Faculty consensus, strength(s): Students are asked to apply the theory to practice and cite research for strategies integrated into the creation of an original instructional plan.

All required ETC courses, ETC Program Portfolio
Faculty consensus, strength(s): Comprehensive performance assessment supporting monitoring of student learning throughout the program of study.
Faculty consensus, area(s) for improvement: Assignment instructions need to be revised to encourage deeper reflections related to the learning process engaged in and the application of knowledge to professional practice.
5.3.4. Student survey
Another important strategy for examining the Meaningfulness of a PA is to solicit feedback from the students completing the assessment. The work of Gall, Borg and Gall (1996) supports this strategy by suggesting that opinions regarding the authenticity of the PA should be gathered from groups "other than the experts who designed the performance assessment task and scoring criteria" (p. 268). The Student Survey collects feedback from students upon completion of a PA.

5.3.5. Pilot of VIP
During Stage 3a of the study, the VIP instruments were piloted with four Educational Technology faculty and four History Education faculty. These faculty used the Content Analysis, Validity Inquiry Form, and Metarubric for Examining Performance Assessment Rubrics instruments to examine six Educational Technology and three History Education PAs. Data were collected through completed VIP instruments, facilitator observation notes, and scribed facilitated meeting notes. The results informed iterative revisions to the PAs under review. Table 4 provides an example of compiled responses from Educational Technology faculty completing the Validity Inquiry Form for five key PAs. Results from these pilot administrations also clarified items, scales, and formats of the VIP instruments. Facilitator observations and informal faculty feedback reinforced the usefulness of the process.
The Student Survey was administered to 223 Educational Technology graduate students who had completed and received feedback on PAs embedded in three required courses. Based on responses collected in Spring 2013, several revisions were made to the Student Survey instrument, including changes to one of the response labels and separating items regarding current and future professional practice. Quantitative results were compiled documenting student perceptions of the meaningfulness and authenticity of each PA. These results, along with open-ended comments collected through the survey, were used by Educational Technology faculty to make improvements to the respective PAs. Table 5 illustrates examples of open-ended comments documenting both positive reactions and critical comments regarding the PAs.

Table 5
Student Survey Open-Ended Comments Regarding Educational Technology PAs (ETC Course, Unit Plan & Educational Technology Program Portfolio).
Positive reactions regarding PAs:
• "This assignment allowed me to create something I could use in my career outside of this course. I found the assignment very useful and helpful."
• "The organization and structure of the assignments and key information was excellent. Each assignment was relevant and aligned to the objectives."
• "Allowing us to select topics applicable to our jobs, really made the contribution to our current professional practice possible. I think this is important in allowing us to see if we can demonstrate what we have learned in a "real" setting, while also giving us an opportunity to gain additional feedback."
Critical comments regarding PAs:
• "[The Unit Plan assignment] should have focused more on creating the website and making it more interactive."
• "I felt we should be learning about how to make a successful online lesson rather than just doing a regular unit plan. While the unit plan is good, I wish [there] was more focus about how to teach it online."

5.4. Stage 3b: VIP implementation with additional program performance assessments
During Stage 3b, research sub-question three continued to be investigated. The investigative team analyzed and used data collected from faculty, student teacher supervisors, and student feedback from the initial pilots, as well as their own observation data from facilitating the VIP, to continue to refine the strategies and tools. For example, the guiding questions aligned to the validity criteria were re-worded to improve clarity, and some guiding questions were removed to avoid redundancy. The Validity Inquiry Form used ratings of Needs Improvement, Acceptable, and Effective, while the Metarubric for Examining Performance Assessment Rubrics used a Yes/No scale. In later versions of these instruments, the scales on the two forms were aligned to the former three ratings. Additionally, based on feedback collected from discipline experts through national venues, a follow-up literature review was conducted during this stage, and nine additional sources were identified. Based on the recommendations in one publication (M. Kane, 2013), guiding questions were added to both the Validity Inquiry Form and Metarubric for Examining Performance Assessment Rubrics to support faculty in explicitly documenting the purpose of the PA. M.T. Kane's (2013) work also led the research team to develop the Validity Argument form. Graham, Milanowski, Miller and Westat's (2012) guidance drove efforts for computing inter-rater agreement from evaluator scores.

5.4.1. Validity argument
Historically, Cronbach (1988) recommended test developers construct a validity argument to provide an overall statement of how the evidence from creating and implementing the assessment validated the use of a measure.
M.T. Kane (2013) developed a framework focused on building a comprehensive validity argument. The goal of a validity argument is to "[evaluate] the inferences and assumptions inherent in proposed interpretation/use" (p. 118). Kane proposes to simplify the process with a two-step approach: "1) state what is being claimed, and 2) evaluate the plausibility of these claims" (p. 118). The argument-based approach to validation requires the development of a comprehensive purpose statement, claims for the assessment instrument, and a proposed interpretation and use argument of the test results (Baldwin, Fowles, & Livingston, 2005; Cook, Brydges, Ginsburg, & Hatala, 2015; M.T. Kane, 2013; Mislevy & Haertel, 2006). First, the assessment developers make an a priori statement of the claims for the interpretation and use of the assessment results, and then collect evidence to write a statement of the plausibility of the claims supporting (or refuting) the intended use of the results.
The validity argument should begin "with a clear statement of the proposed use of the assessment scores (i.e., interpretations and decisions)" (Cook et al., 2015, p. 564). The investigators chose the term Purpose to guide this step of the process. The term avoids technical jargon and supports faculty in framing the discussion about the intended use of the PA results. The validity argument should also address the instrument's intended interpretation and use of the data collected from the PA, the quality of the PA instructions and scoring guide, and the evidence of reliability of the PA data collected (American Educational Research Association et al., 2014; Cook et al., 2015; Council for the Accreditation of Educator Preparation (CAEP), 2015, 2018; Downing, 2003; M.T. Kane, 2013). The first two components of the validity argument can be written after the VIP has been completed. The third component can be composed following the calibration training of evaluators, implementation of the PA, and analysis of results. A template for writing a comprehensive validity argument can be found on the website noted earlier in the article.
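The published Validity Argument template itself is available on the website noted above; purely as an illustration of the components named in this subsection (purpose, a priori claims, quality and reliability evidence, and a plausibility statement), such a record might be sketched as follows. The field names and example strings are assumptions, not the authors' template.

```python
# Illustrative sketch only: a minimal record mirroring the validity argument
# components described above. Field names and example values are assumptions,
# not the authors' published Validity Argument template.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ValidityArgument:
    purpose: str                                                    # intended interpretation and use of PA results
    claims: List[str]                                               # a priori claims, stated before evidence is gathered
    quality_evidence: List[str] = field(default_factory=list)      # VIP findings on instructions and scoring guide
    reliability_evidence: List[str] = field(default_factory=list)  # e.g., inter-rater agreement results
    plausibility_statement: str = ""                                # written after the evidence is reviewed

argument = ValidityArgument(
    purpose="Scores indicate candidates' readiness to plan, deliver, and assess instruction.",
    claims=["Rubric criteria align to the program's learning outcomes.",
            "Scores are consistent across calibrated raters."],
)
argument.quality_evidence.append("Validity Inquiry Form consensus: instructions and rubric judged Effective.")
argument.reliability_evidence.append("Calibration exercise: exact agreement with the expert panel met the local standard.")
argument.plausibility_statement = "Available evidence supports the intended interpretation and use of the results."
print(argument)
```

The ordering of the fields mirrors the sequencing described above: purpose and claims are stated first, and the plausibility statement is composed only after the quality and reliability evidence has been gathered.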
5.4.2. Conducting review of reliability
Gathering evidence that scores are reliable, accurate, reproducible, and consistent is critical for forming the validity argument and can be investigated through a method such as calculating inter-rater agreement. Graham et al. (2012) provided additional guidance on calculating inter-rater agreement, including procedures for computing the percentage of absolute agreement and adjacent agreement.
Conducting rater calibration training is useful for improving the consistency of scores (Graham et al., 2012). Rater calibration training gives evaluators an opportunity to work through a sample artifact and discuss how to interpret the scoring guide descriptions. The training session should address items such as: (a) providing an overview of the purpose, interpretation, and use of data from the PA; (b) reviewing fundamental strategies to avoid rater error and bias (Suskie, 2009); (c) collaborating to develop consensus among raters on performance levels and descriptors; and (d) practicing the scoring process and calculating inter-rater agreement through a calibration exercise.
Prior to the training, an expert panel of faculty can be convened to rate recently completed PA artifacts. The artifacts should be representative of the types of work and anticipated performance levels (e.g., high, medium, and low quality). A calculation of inter-rater agreement can be computed by comparing the results of the expert panel scores with participant evaluator ratings. If the initial calibration exercise scores are below standard guidelines, evaluators should be asked to score a second artifact following the training, and inter-rater agreement should be calculated again.
5.4.3. Pilot of revised VIP
The two initial investigators facilitated a training for educator preparation program administrators to familiarize them with the process. One of the administrators in attendance, who later joined the study as an investigator, agreed to implement the VIP to analyze PAs with faculty teaching for the Bilingual Multicultural Education (n = 8), Early Childhood Education (n = 6), Elementary Education (n = 10), Special Education (n = 6), Secondary Education (n = 10), Educational Leadership (n = 4), and School Psychology (n = 4) degree programs. Two other investigators with responsibilities for unit level oversight of the programs also joined the VIP facilitation and data collection team.
During this stage, two-hour VIP meetings were scheduled with selected program faculty. Faculty were asked to complete VIP instruments, the Validity Inquiry Form and Metarubric for Examining Performance Assessment Rubrics, prior to the meeting. Two investigators facilitated the meetings and staff or a graduate student scribed notes documenting the faculty discussions and consensus building. At the end of the meetings, a plan and timeline were established for making identified revisions to the PA, documenting evidence for developing a validity argument for the PA, and/or determining if other action was needed (e.g., development of a new PA, review of curriculum and assessment map, holding a calibration exercise for evidence of reliability). After each meeting, the two facilitators and the scribe reviewed and compared observation and meeting notes. These data were used not only to inform substantive revisions to many PAs, but also to improve the process addressed by the overarching research question.
For example, one common PA, a Teacher Candidate Work Sample instrument administered in the student teaching course, was intended to collect evidence regarding the learning of K-12 students through the lessons and assessments implemented by student teachers. Through a review of the instructions and rubric, it became apparent that the usefulness of the results was limited since only two overarching criteria addressing assessment and analysis were required. The descriptions for the scoring criteria were vague, and no calibration training was required to address consistency in relation to scores given. This common PA was reviewed by a group of faculty (n = 8), student teacher supervisors (n = 4), staff (n = 2), and students (n = 3), and multiple changes were identified and implemented. The changes included 1) substantive revisions to the PA prompts and rubric, including dividing the PA into four assignments that build upon each other, 2) developing and implementing calibration training and tests of inter-rater agreement, and 3) delivery of the PA rubric through a different online platform to facilitate the collection of early feedback as well as summative scores.
There were also examples of well-developed instruments or aspects of instruments. For example, a course-embedded PA administered in a Special Education teacher preparation program assessed knowledge and skills related to the development of a Functional Behavior Assessment and Behavior Intervention Plan. The rubric criteria provided rich descriptions for a criterion related to implementing strategies to reinforce positive behavior. The highest-level description stated: "Details type of reinforcement used. Includes a detailed schedule that takes into consideration prompts, cues, and pre-correction over time. Consideration is given to the beginning plan and long-term strategies for reinforcement." The VIP assisted Special Education faculty in further articulating the purpose of this PA, refining the instructions, and ultimately providing evidence for writing a validity argument supporting the use of PA results.
Investigators observed a variety of responses when faculty were prompted to articulate a PA's purpose. The majority of faculty expressed the purpose, development, and implementation of the PA as solely an accreditation requirement. With the additional guiding questions, faculty engaged in deeper reflections about the purpose and meaningfulness of the PA in relation to learning goals, and their perspectives shifted. Faculty began defining the purpose in terms of the knowledge and skills students need to be successful, recognizing the usefulness of the PA to provide formative and summative feedback to students as they progress through the program of study, and determining how the results should be interpreted and used for program improvement. They were able to determine appropriate next steps such as the need for an entirely new or different PA, or the recognition that they could move forward with the evaluation of the instrument, often resulting in a range of revisions for the PA.
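As a concrete illustration of the calibration exercise described in Section 5.4.2, the following sketch compares participant evaluators' scores on pre-scored artifacts with the expert panel's scores and flags evaluators who fall below an agreement standard. All scores, evaluator labels, and the 80% threshold are illustrative assumptions rather than figures from the study.

```python
# Hypothetical calibration check (cf. Section 5.4.2): compare each evaluator's
# scores on sample artifacts against the expert panel's scores. All values,
# including the 80% standard, are illustrative assumptions.

EXPERT_SCORES = [3, 4, 2, 4, 1, 3]   # expert panel ratings of six sample artifacts
AGREEMENT_STANDARD = 0.80            # assumed local standard for exact agreement

evaluator_scores = {
    "Evaluator 1": [3, 4, 2, 4, 2, 3],
    "Evaluator 2": [2, 3, 2, 4, 1, 2],
}

def exact_agreement(scores, reference):
    """Proportion of artifacts scored identically to the reference ratings."""
    return sum(s == r for s, r in zip(scores, reference)) / len(reference)

for name, scores in evaluator_scores.items():
    agreement = exact_agreement(scores, EXPERT_SCORES)
    decision = "meets standard" if agreement >= AGREEMENT_STANDARD else "score a second artifact after training"
    print(f"{name}: {agreement:.0%} exact agreement -> {decision}")
```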
5.4.4. Improvements to procedures
It was discovered that multiple approaches could be used to implement the process, ranging from a facilitated implementation to independent use of the instruments. For small programs with only one or two faculty members, independent use of VIP instruments appeared appropriate. However, for mid- to large-size programs with multiple faculty teaching the same course, a facilitated process was developed. The facilitated process was also intended to deepen engagement by faculty in relation to the PA. As documented in facilitator observation notes and through informal feedback, faculty expressed that they felt listened to throughout the dialogue that occurred during the facilitated meetings and that the process encouraged brainstorming ideas related to course or curricular improvements.
The investigators were able to observe that faculty morale improved. Perspectives shifted from fulfilling an accreditation reporting requirement to conducting program evaluation in meaningful ways that would result in supporting student learning and improving curricula. In part, this result is likely due to focused time devoted to this curricular work. The VIP instruments guided the discussion and kept the focus on the validity criteria, allowing for the best use of limited time and documenting evidence to support or refute the validity of the PA results.
Based on data collected from stakeholders through facilitator observation and scribed meeting notes, several improvements were made to the review process. For example, faculty requested more time be spent on articulating purpose. Faculty also recommended the establishment of timelines for completing next steps to ensure revisions to the PA were completed prior to it being implemented again, and the development of a policy regarding an appropriate timeline for the systematic, deep review of PAs utilized for program evaluation.
5.5. Stage 4: Analyze the innovation

A subsequent review of the literature was conducted during this stage, and an additional 15 articles were identified. Brookhart and Nitko's (2019) publication was identified as providing information about quality criteria and the development of a validity argument. Cook et al.'s (2015) publication from the medical education literature, discussing the application of Kane's framework, was also located. Although these articles addressed examining the validity of PAs, they did not offer user-friendly guidelines for conducting this work. In total, 58 articles were identified and reviewed through the literature reviews conducted during Stages 2, 3b, and 4, and key findings were incorporated into the VIP. Additionally, over the past seven years, the VIP has been piloted with approximately 40 PAs.

For the study's analysis stage, the VIP strategies and tools were implemented internally with faculty, supervisors, and students representing a variety of educator preparation programs. During Stages 2 and 3, participants' feedback about the process was documented in facilitator observation notes or through email correspondence. By Stage 4, a VIP Faculty Satisfaction Survey was used to obtain feedback from participants on the usefulness of the VIP, as well as to solicit suggestions for further improvement. The VIP facilitated meetings were also video recorded to provide additional data for future analysis. The VIP instruments have also been shared with external educator preparation program faculty and administrators at several national conferences.

6. Results

The overarching goal of the study was to improve confidence in and use of the data being collected from PAs. Through the iterative process of developing, piloting, and implementing the VIP instruments, the VIP offered an approach for 56 educator preparation faculty and 4 student teacher supervisors to review more than 40 PAs, make revisions, determine whether a new PA was needed, and collect evidence to support an informed validity argument. The study produced five important outcomes related to the overarching research question and sub-questions.

1. Tested instruments (sub-question three): The most essential outcome was the development of several easy-to-implement instruments to guide reviewing PAs that can provide evidence for a validity argument. Recently, other authors have identified the need for tested instruments for this purpose and have proposed solutions. For instance, Brookhart and Nitko (2019) developed "criteria for improving the validity of scores from classroom assessments" (p. 41). These authors also describe M. Kane's (2013) work regarding the construction of a validity argument and provide a "summary of the different types of validity evidence for educational assessments" (p. 47). While Brookhart and Nitko offer example validity argument evidence and guiding questions for examining the validity of assessment scores, the information is not organized in user-friendly instruments. The results of our study also address Baartman et al.'s (2006) call for the development of practical instruments to assist faculty in establishing the validity criteria of PAs. The VIP provides practitioners who have limited expertise in measurement with a reflective practice approach that draws upon multiple criteria and recommendations from the relevant literature.

2. Importance of the purpose statement (sub-question two): Kane et al. (1999) emphasized the importance of determining the purpose of an assessment to guide the formulation of the assessment's claims and to establish warrants supporting or disproving those claims. In this study, each VIP meeting began by asking faculty to write a statement of the PA's purpose. Through observation data and artifacts, it became evident that faculty struggled and needed further guidance to articulate a clear purpose statement. The decision was made to add guiding questions to the VIP instruments to better assist faculty in articulating the PA's purpose.

3. Improved PA quality (overarching research question and sub-question two): The VIP strategies and tools guided faculty to improve the quality of PAs. The instruments integrated criteria defined in the literature (Linn et al., 1991; Messick, 1994). Higher PA quality led to greater faculty confidence in the interpretation and use of data for program improvement.

4. Clearer student learning outcomes (sub-question four): Students found that authentic assessments promoted learning connected to real workplace expectations (McConnell et al., 2003; Sundberg et al., 1994; Suskie, 2009). Through the implementation of the VIP, many poorly written PA instructions and rubrics were significantly revised by faculty to better align with program goals. Consistent with the literature, the PA revisions provided more meaningful learning opportunities for students by explicitly communicating the PA's purpose as well as clear expectations and descriptive rubric elements consistent with student learning outcomes (Lombardi, 2008; Winkelmes et al., 2015).

5. Increased faculty engagement (sub-question four): The VIP addressed faculty-identified barriers, including lack of time and challenges with interpreting PA results (Buechler, 1992, in Lombardi, 2008). Faculty viewed the process as an efficient and effective opportunity for open dialogue to reach consensus on why and how the PA is used in the course and program. Greater confidence in PA results provided faculty with meaningful data to guide curricular improvements.

Since the implementation of the VIP, the investigators have also noted a positive shift in faculty attitudes in relation to improving PAs, as well as greater involvement in larger discussions around program curricula, including work required for accreditation.
6.1. Limitations

The VIP was developed through an iterative process following DBR guidelines. The process was conceived to serve the pragmatic purpose of assisting higher education faculty with systematically developing, revising, implementing, and interpreting results from PAs using theoretical and practical guidance from measurement experts. While this study is limited to describing the work of education faculty at a large public university in the United States, an intended outcome of a DBR study is that the innovation or intervention can be used or adapted for use in other, similar settings.

While direct involvement of investigators in studies is sometimes considered a limitation, DBR is centered on the development of a collaborative partnership between the investigators and participants. This approach was determined to be a good fit for the intended goal and research questions of this study. To triangulate the initial findings regarding the VIP, the investigators are now implementing two additional strategies for collecting data: video recording facilitated meetings to demonstrate the process and allow for additional review and analysis, and administering a survey instrument, developed to solicit feedback from faculty engaging in the process, at the conclusion of facilitated meetings. While the generalizability of the results of this study is limited by the nature of the iterative methodology with local participants and potential inconsistencies in implementation, DBR intentionally strives to build capacity and understanding through collaborative efforts to develop, test, and refine an intervention or innovation useful to the discipline or field.
6.2. Implications for future research

Consistent with DBR principles, the investigators are seeking ways to further build capacity and a sustainable process for this work. The VIP strategies and tools developed through the described iterative stages now require further external testing and review to determine if the validity criteria and PA guidelines incorporated in the process can assist other faculty in improving the quality of their locally developed PAs. The use of the VIP to guide PA development should also be studied to determine if the process results in higher-quality PAs created more efficiently and cost-effectively. A connected research question could also be explored: does the VIP help faculty think about curricula more deeply, resulting in better-informed instructional decisions?

Additionally, research should be conducted to determine if higher-quality PAs result in increased student skill attainment. For instance, Brückner and Pellegrino's (2016) sociocognitive approach could be employed to provide additional cognitive validity evidence. By collecting and analyzing students' cognitive processes as they think aloud while completing a PA, additional qualitative evidence could be gathered to help explain students' responses on the PA. This information, combined with quantitative Student Survey data, could strengthen a validity argument for the interpretation and use of the results from the measure.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Anderson, T., & Shattuck, J. (2012). Design-based research: A decade of progress in education research? Educational Researcher, 41(1), 16–25.
Baartman, L. K. J., Bastiaens, T. J., Kirschner, P. A., & van der Vleuten, C. P. M. (2006). The wheel of competency assessment: Presenting quality criteria for competency assessment programs. Studies in Educational Evaluation, 32, 153–170.
Bakker, A. (2014). Research questions in design-based research [PDF file]. Retrieved from http://www.fi.uu.nl/en/summerschool/docs2014/design_research_michiel/Research%20Questions%20in%20DesignBasedResearch2014-08-26.pdf.
Baldwin, D., Fowles, M., & Livingston, S. (2005). Guidelines for constructed-response and other performance assessments. Princeton, NJ: Educational Testing Service.
Barab, S. (2004). Design-based research: Putting a stake in the ground. Journal of the Learning Sciences, 13(1), 1–14.
Barab, S. (2014). Design-based research: A methodological toolkit for engineering change. In R. K. Sawyer (Ed.), Cambridge handbook of the learning sciences. Cambridge, MA: Cambridge University Press.
Brennan, R. L. (2013). Commentary on "Validating the interpretations and uses of test scores." Journal of Educational Measurement, 50(1), 74–83.
Brookhart, S. M., & Nitko, A. J. (2019). Educational assessment of students (8th ed.). New York, NY: Pearson.
Brown, A. (1992). Design experiments: Theoretical and methodological challenges in creating complex interventions in classroom settings. Journal of the Learning Sciences, 2(2), 141–178.
Brückner, S., & Pellegrino, J. W. (2016). Integrating the analysis of mental operations into multilevel models to validate an assessment of higher education students' competency in business and economics. Journal of Educational Measurement, 53, 293–312.
Cook, D. A., Brydges, R., Ginsburg, S., & Hatala, R. (2015). A contemporary approach to validity arguments: A practical guide to Kane's framework. Medical Education, 49(6), 560–575.
Council for the Accreditation of Educator Preparation (CAEP) (2015). CAEP evidence guide. Washington, DC: Author. Retrieved from http://caepnet.org/~/media/Files/caep/knowledge-center/caep-evidence-guide.pdf?la=en.
Council for the Accreditation of Educator Preparation (CAEP) (2018). CAEP handbook: Initial-level programs 2018. Washington, DC: Author. Retrieved from http://caepnet.org/~/media/Files/caep/accreditation-resources/2018-initial-handbook.pdf?la=en.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum.
Daggett, W. R. (2016). Rigor/Relevance Framework®: A guide to focusing resources to increase student performance. Retrieved from http://www.leadered.com/pdf/Rigor%20Relevance%20Framework%20White%20Paper%202016.pdf.
Design-Based Research Collective (2003). Design-based research: An emerging paradigm for educational inquiry. Educational Researcher, 32(1), 5–8.
Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education, 37(9), 830–837.
Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction (6th ed.). White Plains, NY: Longman Publishers.
Graham, M., Milanowski, A., Miller, J., & Westat (2012). Measuring and promoting inter-rater agreement of teacher and principal performance ratings. Nashville, TN: Center for Educator Compensation Reform.
Hart Research Associates (2013). It takes more than a major: Employer priorities for college learning and student success. Retrieved from http://www.aacu.org/leap/documents/2013_EmployerSurvey.pdf.
Kane, M. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448–457.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
Lai, E. R., Wei, H., Hall, E. L., & Fulkerson, D. (2012). Establishing an evidence-based validity argument for performance assessment. Retrieved from https://images.pearsonassessments.com/images/tmrs/Establishing_evidence-based_validity_argument_performance_assessment.pdf.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15–21.
Lombardi, M. (2008). Making the grade: The role of assessment in authentic learning (ELI Paper 1). Retrieved from https://library.educause.edu/resources/2008/1/making-the-grade-the-role-of-assessment-in-authentic-learning.
McConnell, D. A., Steer, D. N., & Owens, K. D. (2003). Assessment and active learning strategies for introductory geology courses. Journal of Geoscience Education, 51(2), 205–216.
McKenny, S., & Reeves, T. C. (2012). Conducting educational design research. New York, NY: Routledge.
McMillan, J. H. (2018). Classroom assessment: Principles and practice that enhance student learning and motivation (7th ed.). New York, NY: Pearson Education, Inc.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6–20.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62(3), 229–258.
Orngreen, R. (2015). Reflections on design-based research in online educational and competence development projects. International Federation for Information Processing, 468, 20–38.
Pieper, S. (2012, May 21). Evaluating descriptive rubrics checklist [Unpublished checklist]. Retrieved from https://nau.edu/wp-content/uploads/sites/105/2018/08/EvaluatingRubricsChecklist_PieperS_2012.pdf.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research, and Evaluation, 9, 1–11.
Stevens, D. D., & Levi, A. J. (2005). Introduction to rubrics: An assessment tool to save grading time, convey effective feedback and promote student learning. Sterling, VA: Stylus Publishing, LLC.
Sundberg, M. D., Dini, M. L., & Li, E. (1994). Decreasing course content improves students' comprehension of science and attitudes toward science in freshman biology. Journal of Research in Science Teaching, 31(6), 679–693.
Suskie, L. (2009). Assessing student learning: A common sense guide (2nd ed.). San Francisco, CA: Jossey-Bass.
Wheaton, D. (2018). Reflections on the US PREP design-based research project. Retrieved from https://www.coursehero.com/faculty-club/classroom-tips/redesign-a-course/.
Wiggins, G., & McTighe, J. (2005). Understanding by design (2nd ed.). Upper Saddle River, NJ: Pearson Merrill Prentice Hall.
Winkelmes, M., Copeland, D. E., Butler, J., Jorgensen, E., Sloat, A., ... Jalene, S. (2015). Benefits (some unexpected) of transparently designed assessments. The National Teaching & Learning Forum, 24(4), 4–6.
Worrell, F., Brabeck, M., Dwyer, C., Geisinger, K., Marx, R., Noell, G., & Pianta, R. (2014). Assessing and evaluating teacher preparation programs. Washington, DC: American Psychological Association.