Measuring Information and Communication Technology Literacy using a performance assessment: Validation of the Student Tool for Technology Literacy (ST2L)

Anne Corinne Huggins, Albert D. Ritzhaupt*, Kara Dawson
University of Florida, USA

Article history: Received 19 November 2013; Accepted 2 April 2014; Available online 16 April 2014

Abstract

This paper reports on the validation of scores from the Student Tool for Technology Literacy (ST2L), a performance-based assessment based on the National Educational Technology Standards for Students (NETS*S) used to measure middle grade students' Information and Communication Technology (ICT) literacy. Middle grade students (N = 5884) from school districts across the state of Florida were recruited for this study. This paper first provides an overview of various methods of measuring ICT literacy and related constructs with documented evidence of score reliability and validity. Following sound procedures based on prior research, this paper then provides validity and reliability evidence for the ST2L scores using both item response theory and testlet response theory, examining both the internal and external structure of the instrument. The ST2L, with minimal revision, was found to be a sound measure of ICT literacy for low-stakes assessment purposes. A discussion of the results is provided, with emphasis on the psychometric properties of the tool and practical insights on the populations with which the tool should be used in future research and practice. © 2014 Elsevier Ltd. All rights reserved.

Keywords: Technology literacy; Information and Communication Technology Literacy; NETS*S; Validation; Reliability

1. Introduction

A series of recent workshops convened by the National Research Council (NRC) and co-sponsored by the National Science Foundation (NSF) and the National Institutes of Health highlighted the importance of teaching and assessing 21st century skills in K-12 education (NRC, 2011). Information and Communication Technology (ICT) literacy, or the ability to use technologies to support problem solving, critical thinking, communication, collaboration and decision-making, is a critical 21st century skill (NRC, 2011; P21, 2011). The National Educational Technology Plan (USDOE, 2010) also highlights the importance of ICT literacy for student success across all content areas, for developing skills to support lifelong learning, and for providing authentic learning opportunities that prepare students to succeed in a globally competitive workforce. It is clear that students who are ICT literate are at a distinct advantage in terms of learning in increasingly digital classrooms (NSF, 2006; USDOE, 2010), competing in an increasingly digital job market (NRC, 2008), and participating in an increasingly digital democracy (Jenkins, 2006; P21, 2011). Hence, it is critical that educators have access to measures that display evidence of validity and reliability in scores representing this construct so that they can use the measures, for example, to guide instruction and address student needs in this area.

The International Society for Technology in Education (ISTE) has developed a set of national standards for ICT literacy known as the National Educational Technology Standards for Students (ISTE, 2007). These standards are designed to consider the breadth and depth of ICT literacy and to be flexible enough to adapt as new technologies emerge; the current standards are a revision of the 1998 version. NETS*S strands include knowledge and dispositions related to Creativity and Innovation; Communication and Collaboration; Research and Information Fluency; Critical Thinking, Problem Solving and Decision Making; Digital Citizenship; and Technology Operations and Concepts. NETS*S have been widely acclaimed and adopted in the U.S. and many countries around the world and are being used by schools for curriculum development, technology planning and school improvement plans.

* Corresponding author. School of Teaching and Learning, College of Education, University of Florida, 2423 Norman Hall, PO Box 117048, Gainesville, FL 32611, USA. Tel.: +1 352 273 4180; fax: +1 352 392 9193. E-mail address: [email protected] (A.D. Ritzhaupt). http://dx.doi.org/10.1016/j.compedu.2014.04.005 0360-1315/© 2014 Elsevier Ltd. All rights reserved.


Yet, measuring ICT literacy is a major challenge for educators and researchers. This point is reinforced by two chapters of the most recent Handbook of Research on Educational Communications and Technology that highlight research and methods for measuring the phenomenon (Christensen & Knezek, 2014; Tristán-López & Ylizaliturri-Salcedo, 2014). Though there is disagreement on the language used to describe the construct (e.g., digital literacy, media literacy, technological literacy, technology readiness, etc.), several scholars agree on the key facets that make up the construct, including knowledge of computer hardware and peripherals, navigation of operating systems, folder and file management, word processing, spreadsheets, databases, e-mail, web searching and much more (Tristán-López & Ylizaliturri-Salcedo, 2014). Such skills are essential for individuals in K-12, post-secondary and workplace environments.

For-profit companies have attempted to measure ICT literacy to meet the No Child Left Behind (USDOE, 2001) mandate of every child being technologically literate by 8th grade. States have employed different methods to address this mandate, with many relying on private companies. Many of these tools, such as the TechLiteracy Assessment (Learning, 2012), claim alignment with NETS*S. However, most for-profit companies provide little evidence of a rigorous design, development and validation process. The PISA (Program for International Student Assessment) indirectly measures some ICT-related items such as frequency of use and self-efficacy via self-report, but ICT literacy is not a focus of the assessment. Instead, the PISA measures the reading literacy, mathematics literacy, and science literacy of 15-year-old high school students (PISA, 2012). Many states have adopted the TAGLIT (Taking a Good Look at Instructional Technology) (Christensen & Knezek, 2014) to meet the NCLB reporting requirements. This tool includes a suite of online assessments for students, teachers, and administrators and claims to be connected to the NETS*S for the student assessment. The questions of the assessment were originally developed by the University of North Carolina Center for School Leadership Development. This is a traditional online assessment that includes a wide range of questions focusing on the knowledge, skills, and dispositions related to ICT literacy. The utility includes a reporting function for schools to use for reporting and planning purposes. However, very little research has been published on the design, development, and validation of this suite of tools for public inspection.

A promising new initiative is the first-ever National Assessment of Educational Progress (NAEP) Technology and Engineering Literacy (TEL) assessment, which is currently under development (NAEP, 2014). TEL is designed to complement other NAEP assessments in mathematics and science by focusing specifically on technology and engineering constructs. Unlike the other NAEP instruments, the TEL is completely computer-based and includes interactive scenario-based tasks in simulated software environments. The TEL is scheduled for pilot testing with 8th grade students in the Fall of 2013 and slated for release to the wider public sometime in 2014. However, a careful reading of the framework for this instrument reveals that it is not designed purely to measure ICT literacy.
Rather, the instrument focuses on three interrelated constructs: Design and Systems, Technology and Society, and Information and Communication Technology (NAEP, 2014).

This paper focuses on a performance-based instrument known as the Student Tool for Technology Literacy (ST2L) designed to measure the ICT literacy skills of middle grades students in Florida using the 2007 National Educational Technology Standards for Students (NETS*S). This is the second iteration of the ST2L, with the first iteration aligned to the original 1998 NETS*S (Hohlfeld, Ritzhaupt, & Barron, 2010). Specifically, this paper provides validity and reliability evidence for the scores using both item response theory and testlet response theory (Wainer, Bradlow, & Wang, 2007).

2. Measuring ICT literacy

The definition, description, and measurement of ICT literacy have been under investigation primarily since the advent of the World Wide Web in the early nineties. Several scholars, practitioners, and reputable organizations have attempted to carefully define ICT literacy with associated frameworks, and have attempted to design, develop, and validate reliable measures of this multidimensional construct. For instance, in Europe, the European Computer Driving License Foundation (ECDLF) provides a framework and comprehensive assessment of ICT literacy skills used to certify professionals working in the information technology industry. This particular certificate has been adopted in 148 countries around the world in 41 different languages (Christensen & Knezek, 2014). We attempt to review some of the published measures of ICT literacy and related constructs in this short literature review. We do not claim to cover all instruments of ICT literacy; rather, we cover instruments that were published and provided evidence of both validity and reliability.

Compeau and Higgins (1995) provide one of the earlier and more popular measures of computer self-efficacy and discuss its implications for the acceptance of technology systems in the context of knowledge workers. The measure is intended for use with knowledge workers. Building on the work of Bandura (1986), computer self-efficacy is defined as "a judgment of one's capability to use a computer" (Compeau & Higgins, 1995, p. 192). Their study involved more than 1000 knowledge workers in Canada and several related measures, including computer affect, anxiety, and use. They designed and tested a complex path model to examine computer self-efficacy and its relationships with the other constructs. Unsurprisingly, computer self-efficacy was significantly and negatively correlated with computer anxiety, and computer use had a significant positive correlation with computer self-efficacy. This scale has been widely adopted, and the article has been cited more than 2900 times according to Google Scholar.

Parasuraman (2000) provides a comprehensive overview of the Technology Readiness Index (TRI), a multi-item scale designed to measure technology readiness, a construct similar to ICT literacy. Parasuraman (2000) defines technology readiness as "people's propensity to embrace and use new technologies for accomplishing goals in home life and at work" (p. 308). This measure is intended to be used by adults in marketing and business contexts.
The development process included dozens of technology-related focus groups to generate the initial item pool, followed by an intensive study of the psychometric properties of the scale (including factor analysis and internal consistency reliability). Though the TRI has been used mostly in the business and marketing literature, it demonstrates that other disciplines are also struggling with this complex phenomenon.

Bunz (2004) validated an instrument to assess people's fluency with the computer, e-mail, and the Web (CEW fluency). The instrument was developed based on extensive research on information and communication technology literacies, and the research was conducted in two phases. First, the instrument was tested on 284 research participants, and a principal components factor analysis with varimax rotation resulted in 21 items in four constructs: computer fluency (α = .85), e-mail fluency (α = .89), Web navigation (α = .84), and Web editing (α = .82). The four-factor solution accounted for more than 67% of the total variance. In the second phase, Bunz's (2004) 143 participants


completed the CEW scale and several other scales to demonstrate convergent validity. The correlations were strong and significant. The measure was used with students in higher education contexts. Overall, preliminary support for the scale's reliability and validity was found.

Katz and Macklin (2007) provided a comprehensive study of the ETS ICT Literacy Assessment (renamed iSkills) with more than 4000 college students from more than 30 college campuses in the U.S. The ETS assessment of ICT literacy focuses on several dimensions of ICT literacy that are measured in a simulated software environment, including defining, accessing, managing, integrating, evaluating, creating and communicating using digital tools, communications tools, and/or networks (Katz & Macklin, 2007). The authors systematically investigated the relationships among scores on the ETS assessment and self-report measures of ICT literacy, self-sufficiency, and academic performance as measured by cumulative grade point average. The ETS assessment was found to have small to moderate statistically significant correlations with other measurements of ICT literacy, which provides evidence of convergent validity of the measurement system. ETS continues to administer the iSkills assessment to college students at select universities and provides comprehensive reporting.

Schmidt et al. (2009) developed a measure of Technological Pedagogical Content Knowledge (TPACK) for pre-service teachers based on Mishra and Koehler's (2006) discussion of the TPACK framework. Though not a pure measure of ICT literacy, the instrument includes several technology-related items that attempt to measure a pre-service teacher's knowledge, skills, and dispositions toward technology. The development of the instrument was based on an extensive review of the literature surrounding teacher use of technology and an expert review panel appraising the items generated by the research team for relevance. The researchers then conducted a principal component analysis and an internal consistency reliability analysis of the associated structure of the instrument. The instrument has been widely adopted (e.g., Abitt, 2011; Chai, Ling Koh, Tsai, & Wee Tan, 2011; Koh & Divaharan, 2011).

Hohlfeld et al. (2010) reported on the Student Tool for Technology Literacy (ST2L) development and validation process and provided evidence that the ST2L produces valid and reliable ICT literacy scores for middle grade students in Florida based on the 1998 NETS*S. The ST2L includes more than 100 items, most of which are performance assessment items in which the learner responds to tasks in a software environment (e.g., a spreadsheet) that simulates real-world application of ICT literacy skills. The strategy for developing the technology tool was as follows: 1) technology standards were identified; 2) grade-level expectations/benchmarks for these standards were developed; 3) indicators for the benchmarks were outlined; and 4) specific knowledge assessment items were written and specific performance or skill assessment items were designed and programmed. Using a merger of design-based research and classical test theory, Hohlfeld et al. (2010) demonstrated the tool to be sound for the intended purpose of low-stakes assessment of ICT literacy. It is worth noting that the ST2L has been used by more than 100,000 middle grade students in the state of Florida since its formal production release (ST2L, 2013).
Across these various studies that all address the complex topic of measuring ICT literacy, we can make a few observations. First, there is no consensus on the language used to describe this construct: computer self-efficacy, CEW fluency, ICT literacy, technology readiness, and technology proficiency are all terms that can be used to describe a similar phenomenon. Second, each article presented here built on a conceptual framework to explain ICT literacy (e.g., social cognitive theory, NETS*S, TPACK, etc.) and used sound development and validation procedures. We feel this is an important aspect of the work on ICT literacy and that it must be guided by frameworks and theories to inform our research base. Third, the instruments were developed for various populations, including pre-service teachers, middle grade students, knowledge workers, college students, and more. Special attention must be paid to the population for which an ICT literacy measure is designed. Finally, there are several different methods for measuring this complex phenomenon, ranging from traditional paper-and-pencil instruments to online assessments to fully computer-based simulated software environments. The authors feel that the future of measuring ICT literacy should embrace objective performance-based assessment in which learners respond to tasks in a simulated software environment, like the ST2L.

3. Purpose

Following the recommendations of Hohlfeld et al. (2010), this paper presents a validation of scores on the Student Tool for Technology Literacy (ST2L), a performance assessment originally based on the 1998 NETS*S and recently revised to align with the 2007 NETS*S. This tool was developed through Enhancing Education Through Technology (EETT) funding as an instrument to assess the ICT literacy of middle grade students in Florida (Hohlfeld et al., 2010). This paper provides validity and reliability evidence for scores on the modified instrument according to these new standards (ISTE, 2007), with methodology operating under both item response theory and testlet response theory (Wainer et al., 2007). Specifically, this paper addresses the following research questions: (a) Do scores on the ST2L display evidence of internal structure validity? and (b) Do scores on the ST2L display evidence of external structure validity?

Table 1. Demographic statistics.

Variable              Group     Frequency (n)   Percentage (%)
Grade                 5         5               .09
                      6         1234            20.97
                      7         1598            27.16
                      8         3035            51.58
                      9         9               .15
                      10        1               .02
                      11        2               .03
Gender                Male      2934            49.86
                      Female    2950            50.14
Race                  Asian     125             2.12
                      Black     1114            18.93
                      Hispanic  682             11.59
                      White     3626            61.62
                      Other     337             5.73
Free/Reduced lunch    Yes       3569            60.66
                      No        2315            39.34
English with family   Yes       5508            93.61
                      No        376             6.39


4. Method

4.1. Participants

Middle school teachers from 13 Florida school districts were recruited from the EETT grant program. Teachers were provided an overview of the ST2L, how to administer the tool, and how to interpret the scores. Teachers then administered the ST2L within their classes during the fall 2010 semester. Table 1 details demographic information for the sample of N = 5884 examinees. The bulk of the students (n = 5867) were in grades 6 through 8, with a wide range of diversity in gender, race and free/reduced lunch status. A small percentage (i.e., 7%) of examinees was from families that did not speak English in the home.

4.2. Measures

ST2L: The ST2L is a performance-based assessment designed to measure middle school students' ICT literacy across relevant domains based on the 2007 NETS*S: Technology Operations and Concepts, Constructing and Demonstrating Knowledge, Communication and Collaboration, Independent Learning, and Digital Citizenship. These standards are designed to consider the breadth and depth of ICT literacy and to be flexible enough to adapt as new technologies emerge. NETS*S have been widely acclaimed and adopted in the U.S. and in many countries around the world, and they are being used by schools for curriculum development, technology planning and school improvement plans. The ST2L includes 66 performance-based tasks and 40 selected-response items, for a total of 106 items. The selected-response item types include text-based multiple-choice and true/false items, as well as multiple-choice items with graphics and image map selections (see Fig. 1 for an example). The performance-based items require the examinee to complete multiple tasks nested within simulated software environments, and these sets of performance-based items were treated as testlets (i.e., groups of related items) in the analysis (see Fig. 2 for an example). The testlets ease the burden on the examinee because multiple items are associated with each prompt, and they are also more representative of examinees' technological performance outside of the assessment environment.

The original version of the ST2L was previously pilot tested on N = 1513 8th grade students (Hohlfeld et al., 2010). The purpose of the pilot test was to provide a preliminary demonstration of the overall assessment quality by considering classical test theory (CTT) item analyses, reliability, and validity. Pilot analysis results indicated that the original version of the ST2L was a sound low-stakes assessment tool. Differences between the piloted tool and the current tool reflect changes in the national standards. In the current dataset for this study, Cronbach's alpha as a measure of internal consistency of the ST2L items was estimated as α = .96.

For the ST2L assessment used in this study, there were fourteen sections of items. These fourteen sections map onto the NETS*S domains as defined and described by ISTE. The first consisted of fifteen selected-response items measuring the construct of technology concepts, which was shortened to techConcepts in the remaining text and tables.
The second consisted of four performance-based items that measured the examinee's ability to manipulate a file, which was shortened to techConceptsFileManip in the remaining text and tables. The third and fourth sections consisted of ten and three performance-based items, respectively, that measured the examinee's ability to perform research in a word processor, which was shortened to researchWP. The fifth section measured the examinee's ability to perform research with a flowchart with five performance-based items (i.e., researchFlowchart). The sixth, seventh, and eighth sections measured examinees' creative ability with technology, each with four performance-based items that focused on the use of graphics, presentations, and videos,

Fig. 1. Example multiple-selection item.


Fig. 2. Example performance-based task item.

respectively (i.e., creativityGraphics, creativityPresent, creativityVideo). The ninth, tenth, and eleventh sections consisted of eight, six, and four performance-based items, respectively, which measured examinee ability in applying technological communication through browsers (i.e., communicationBrowser) and email (i.e., communicationEmail). The twelfth and thirteenth sections measured critical thinking skills in technology with five and nine performance-based items, respectively (i.e., criticalThink). Finally, the fourteenth section measured the digital citizenship of examinees with twenty-five selected-response items, which was shortened to digitalCit.

PISA: The PISA questionnaire was included in this study as a criterion measure for assessing external validity of ST2L scores. It has been rigorously analyzed to demonstrate both reliability and validity across diverse international populations (OECD, 2003). Students were asked to provide information related to their Comfort with Technology, Attitudes towards Technology, and Frequency of Use of Technology. The three constructs employed different scales for which internal consistency in this study's dataset was α = .78, α = .89, and α = .54, respectively. The low internal consistency for the attitudes toward technology scale is expected due to the shortness of the scale (i.e., five items).

4.3. Procedures

Data were collected in the fall semester of 2010. Middle school teachers from the 13 Florida school districts were recruited from the EETT grant program. Teachers were provided an overview of the ST2L, how to administer the tool, and how to interpret the scores. Teachers then administered the ST2L within their classes during the fall 2010 semester. Teachers also had the opportunity to report any problems with the administration process.

4.4. Data analysis: internal structure validity

The testlet nature of the items corresponds with a multidimensional data structure. Each testlet item is expected to contribute to the ICT literacy dimension as well as to a second dimension representing the effect of the testlet in which the item is nested. Dimensionality assumptions of the testlet response model were assessed via confirmatory factor analysis (CFA). Fit of the model to the item data was assessed with the S–X² index (Orlando & Thissen, 2000, 2003). Data from the selected-response (i.e., multiple-choice/true-false) non-testlet items were then fit to a three-parameter logistic model (3PL; Birnbaum, 1968). The 3PL is defined as

"

# eai ðqs bi Þ Psi ðYi ¼ 1jqs ; ai ; bi ; ci Þ ¼ ½ci *ð1  ci Þ ; 1 þ eai ðqs bi Þ

(1)

where i refers to items, s refers to examinees, Y is an item response, q is ability (i.e., ICT literacy), a is item discrimination, b is item difficulty and c is item lower asymptote. Data from the performance-based (i.e., open-response) testlet items were fit to a two-parameter logistic testlet model (2PL; Bradlow, Wainer, & Wang, 1999). The 2PL testlet model is defined as

Psi



   Yi ¼ 1qs ; ai ; bi ; gsdðiÞ ¼

"

# eai ðqs bi gsdðiÞ Þ ; 1 þ eai ðqs bi gsdðiÞ Þ

(2)

where γ_{sd(i)} represents a testlet (d) effect for each examinee.
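To make the two response models concrete, the following short Python sketch (illustrative only; the actual calibration in this study was carried out in SCORIGHT) evaluates the item response functions in Equations (1) and (2) for hypothetical parameter values.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (Equation 1)."""
    logit = a * (theta - b)
    return c + (1.0 - c) / (1.0 + np.exp(-logit))

def p_2pl_testlet(theta, a, b, gamma_sd):
    """2PL testlet-model probability (Equation 2); gamma_sd is the
    person-specific random effect for the testlet containing the item."""
    logit = a * (theta - b - gamma_sd)
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical values for an examinee of average ability (theta = 0)
print(p_3pl(theta=0.0, a=1.5, b=-0.5, c=0.2))                 # ~0.74
print(p_2pl_testlet(theta=0.0, a=1.5, b=0.5, gamma_sd=0.3))   # ~0.23
```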

The testlet component (γ_{sd(i)}) is a random effect, allowing for a variance estimate of γ_{sd(i)} for each testlet. A 2PL testlet model was selected over the 3PL because the item formats did not lend themselves to meaningful chances of successful guessing. The item calibration was completed with Bayesian estimation using Markov chain Monte Carlo (MCMC) methods in the SCORIGHT statistical package (Wainer, Bradlow, & Wang, 2010; Wang, Bradlow, & Wainer, 2004). Item fit, item parameter estimates, testlet effect variance components, standard errors of measurement, information, reliability, and differential item functioning (DIF) were examined for internal structure validity evidence.

4.5. Data analysis: external structure validity

ICT literacy latent ability estimates were correlated with the three PISA measures (i.e., use of technology, general knowledge of technology, and attitudes toward technology) to assess external structure validity evidence. Positive, small to moderate correlations were expected with all three external criteria, with a literature-based hypothesis that comfort with technology would yield the strongest relative correlation with technology literacy (Hohlfeld et al., 2010; Katz & Macklin, 2007).

5. Results

Prior to addressing the research questions on internal and external structure validity evidence, items with constant response vectors and persons with missing data had to be addressed. Two items on the assessment had constant response vectors in this study's sample (i.e., researchWP21 was answered incorrectly by all examinees and communicationEmail24 was answered correctly by all examinees) and were therefore not included in the analysis.

The first stage of analysis focused on determining the nature of the missing data on the remaining 104 test items used in the analysis. Basing our missing data analysis process on Wainer et al. (2007), we began by coding the missing responses as omitted responses and calibrated the testlet model. We then coded the missing responses as wrong and recalibrated the testlet model. The theta estimates from these two calibrations correlated at r = .615 (p < .001), indicating that the choice of how to handle the missing data was non-negligible. We then identified a clear group of examinees for whom ability estimates were extremely different between the two coding methods. Specifically, they had enough missing data to result in very low ability estimates when missing responses were coded as wrong and average ability estimates with very high standard errors when missing responses were coded as omitted. We then correlated the a and b item parameter estimates from the calibration with missing responses coded as omitted with the a and b item parameter estimates from the calibration with missing responses coded as wrong, respectively. Discrimination (a) parameters were mostly larger when missing data were treated as omitted, and the correlation indicated that the differences between the calibrations were non-negligible (r = .629, p < .001). Difficulty (b) parameters were more similar across the calibrations, with a correlation of r = .931 (p < .001). The lack of overall similarity shown by these correlations indicated that coding the missing data as wrong was not a viable solution. In addition, it was clear that some individuals (n = 109) with large standard errors of ability estimates when missing data were coded as omitted had to be removed from the data set.
Ultimately, they did not answer enough items to allow for accurate ability estimation, and their inclusion would therefore compromise subsequent analyses, such as the correlations between ability and the external criteria. The majority of these 109 individuals answered only one of the 106 test items. The remaining data set of N = 5884 examinees (i.e., those discussed in the Participants section above) was examined for the nature of the missingness according to Enders (2010). For each item, we coded missing data as 1 and present data as 0. We treated these groups as independent variables in t-tests in which the dependent variable was either the total score on the frequency of technology use items or the total score on the attitudes toward technology items. All t-tests for all items were non-significant, indicating that the missingness was not related to frequency of technology use or attitudes toward technology. We were unable to perform t-tests on the total score for self-efficacy with technology use due to severe violations of distributional assumptions (i.e., a large portion of persons scored the maximum score for self-efficacy), and hence a simple mean comparison was utilized. The self-efficacy total scores ranged from 0 to 76, and the self-efficacy means of the group with missing data differed from the means of the group with non-missing data by less than three points on all items. We concluded that these differences were small and, therefore, that missingness was not related to this variable. Based on these analyses, we proceeded under the assumption that the missing data were ignorable (MAR) for the 3PL and testlet model analysis.
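The per-item missingness check described above can be sketched as follows. This is an illustrative sketch with simulated data and hypothetical variable names, not the original analysis script.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Simulated stand-ins: a person-by-item matrix of 0/1 scores with some
# omitted (NaN) responses, and a summated external criterion score.
rng = np.random.default_rng(0)
responses = pd.DataFrame(rng.integers(0, 2, size=(200, 10)).astype(float))
responses = responses.mask(rng.random((200, 10)) < 0.05)   # inject missingness
freq_use_total = pd.Series(rng.integers(0, 30, size=200))  # criterion total

for item in responses.columns:
    missing = responses[item].isna()        # True = missing, False = present
    if missing.any() and (~missing).any():
        t, p = ttest_ind(freq_use_total[missing], freq_use_total[~missing],
                         equal_var=False)
        print(f"item {item}: t = {t:.2f}, p = {p:.3f}")
```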
5.1. Internal structure validity evidence

Before fitting the 3PL and testlet models, we checked the assumption of model fit through a CFA with the hypothesis that each item would load onto the overall ability factor (theta) as well as onto a testlet factor associated with the testlet in which the item was nested. Fig. 3 shows an abbreviated diagram of the CFA model, fit in Mplus version 7 (Muthén & Muthén, 2012) with weighted least squares estimation with adjusted means and variances. All latent factors and item residuals were forced to an uncorrelated structure. The model fit the data to an acceptable degree, as indicated by the root mean square error of approximation (RMSEA = .068), the comparative fit index (CFI = .941), and the Tucker–Lewis index (TLI = .938). While the fit could have been improved slightly, these results were deemed acceptable for meeting the dimensionality assumptions of the item/testlet response models.

We then assessed item fit in IRTPro (Cai, Thissen, & du Toit, 2011) to determine whether each multiple-choice/true-false item fit the 3PL model and whether each open-response item fit the 2PL testlet model. This statistical package was used for item fit because the calculation of fit indices is built into the program; however, it was not used for final parameter estimation because it lacks the Bayesian estimation approaches preferred in this study. The S–X² item fit index (Orlando & Thissen, 2000, 2003) was used to assess fit of the model to the item data, and a significance level of α = .001 was used due to the sensitivity of the chi-square test to large sample sizes. A total of four (i.e., <4%) of the items displayed problematic misfit of the model to the data. One was a techConcepts item in which examinees scoring below a summated score of 82 on the test often displayed a frequency of observed correct responses below the expected frequency of correct responses for the item; for examinees scoring above a summated score of 82, the opposite pattern was often observed. Another item with misfit concerns was a communicationEmail item in which there was a variety of both over-predicted and under-predicted expected correct responses across the range of total test scores.

An additional communicationEmail item had misfit concerns in the middle range of total summated scores, where the expected number of correct responses was lower than the observed number of correct responses, with the opposite pattern for more extreme total summated scores. A final item with misfit concerns was a digitalCit item in which there was a consistent pattern of expected correct responses being higher than observed correct responses, except for total summated scores above 85. To determine whether the misfit associated with these items was problematic for ability estimation, two sets of ability estimates were calculated in a separate set of analyses in IRTPro: one analysis in which all items were included and another in which the misfitting items were removed. The correlation between the two sets of estimated ability parameters (using expected a posteriori estimation) was r = .99 (p < .001), indicating that the inclusion of the four misfitting items in the assessment analysis was not problematic for ability estimation.

Fig. 3. Diagram of confirmatory factor analysis used for model fit testing.

The final Bayesian testlet model was then estimated within the SCORIGHT package with five MCMC chains, 20,000 iterations, and 3000 discarded draws within each chain, based on recommendations from Sinharay (2004) and Wang et al. (2010). Acceptable model convergence was reached, as indicated by confidence interval shrink statistics near one (Wang et al., 2004). The variance estimates of the twelve testlet effects (γ_d) are shown in Table 2. All variances are significantly different from 0, indicating that their effect on item responses is non-negligible and the testlet model must be retained. In other words, the tool measures some types of abilities that are associated more with particular testlets than with general ICT literacy, and the use of the testlet model separates these components, providing a more accurate estimate of ICT literacy for each student.

Fig. 4 is a plot of the standard error of measurement (sem) of the θ estimates (i.e., ICT literacy estimates) for each examinee. Approximately 82% of the sample had estimates with sem ≤ .30, and 95.53% of the sample had estimates with sem ≤ .40. Several of the 4.47% of examinees with larger sem estimates were further examined. For example, individuals ID = 3286 and ID = 5421 (see Fig. 4) answered fewer than six items on the assessment. Using the sem estimates, there are two ways we estimated the reliability/information of the theta estimates. Under item response theory, information can be calculated from the sem and used as an indicator of reliability. The relationship between sem and test-level information is defined as

\[
I(\theta) = \left(\frac{1}{sem}\right)^{2}, \tag{3}
\]

where I(θ) represents the level of test information at a particular value of theta. Therefore, having 95.53% of the sample with sem ≤ .40 indicates that 95.53% of the sample has a test information level of I(θ) ≥ 6.25. For 82% of the sample, the test information level is I(θ) ≥ 11.11.

Table 2. Estimated variance of testlet effects.

Testlet                  Testlet#   Estimated variance of γ_d   se(Var[γ_d])
techConceptsFileManip    1          .67                          .04
researchWP1              2          .32                          .02
researchWP2              3          .23                          .03
researchFlowChart        4          .37                          .03
creativityGraphics       5          .52                          .04
creativityPresent        6          .25                          .03
creativityVideo          7          .23                          .04
communicationBrowser     8          .31                          .02
communicationEmail1      9          .35                          .02
communicationEmail2      10         .47                          .04
criticalThink1           11         .33                          .03
criticalThink2           12         .23                          .01
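A quick numeric check of Equation (3), together with the CTT-style reliability conversion given below (which uses the sample value s_θ = .97 reported there), reproduces the information and reliability figures discussed in this section; the snippet is purely illustrative.

```python
# sem-to-information (Equation 3) and sem-to-reliability (Equation 4) checks
s_theta = 0.97                              # SD of latent ability in this sample
for sem in (0.30, 0.40):
    info = (1.0 / sem) ** 2                 # I(theta) = (1 / sem)^2
    rel = 1.0 - (sem / s_theta) ** 2        # r = 1 - (sem / s_theta)^2
    print(f"sem = {sem:.2f}: information = {info:.2f}, reliability = {rel:.2f}")
# sem = 0.30: information = 11.11, reliability = 0.90
# sem = 0.40: information = 6.25, reliability = 0.83
```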


Fig. 4. Standard error of measurement of ICT literacy estimates (θ_s).

For a reliability coefficient that aligns with CTT methodology, we can compute the test-level reliability if we assume that all individuals have the same sem. Under CTT,

\[
r = 1 - \left(\frac{sem}{s_{\theta}}\right)^{2}, \tag{4}
\]

where r represents the CTT reliability coefficient and s_θ represents the standard deviation of latent ability scores. For the sample in this study, s_θ = .97. If all examinees had sem = .4, then r = .83; if all examinees had sem = .3, then r = .90. Therefore, the CTT reliability estimate for the scores in this sample is between r = .83 and r = .90 for 95.53% of examinees.

Item parameter estimates are presented in Table 3. All discrimination parameter estimates were above a_i = .39, indicating positive, moderate to large relationships between item responses and ICT literacy latent scores. Difficulty parameter estimates were distributed normally with M_b = .39 and SD_b = 1.29, with two outlying items of extreme difficulty. Specifically, one of the criticalThink items was extremely difficult (b̂ = 4.89), as was one of the researchWP items (b̂ = 4.66). Four items displayed lower asymptote estimates of c_i ≥ .49, indicating large amounts of guessing on those selected-response items; all four were digitalCit items.

DIF was examined with non-parametric tests that allow for ease of examination of DIF across a large number of items. Due to the large number of items, it was expected that some items would display DIF, so we began with an examination of differential test functioning (DTF) to determine whether any grouping variables had item-level DIF that aggregated to a problematic amount of test-level differences (i.e., DTF). Weighted s² variance estimates of DTF (Camilli & Penfield, 1997) were estimated in the DIFAS package (Penfield, 2012) and showed small, negligible DIF across groups defined by gender, race (collapsed), free/reduced lunch, and English spoken in the home. Grade (collapsed into 6th, 7th, and 8th) showed a relatively larger DTF variance estimate, specifically when comparing 6th grade to 8th grade examinees. We then estimated DIF across 8th and 6th graders in the DIFAS package (Penfield, 2012) and used Educational Testing Service's classification of A, B, and C items (Zieky, 1993) to flag items with small, moderate, and large DIF. We located three items with large DIF (researchWP13, digitalCit15, and digitalCit25) and thirteen items with moderate DIF (techConcepts8, researchWP15, researchFlowchart2, creativityPresent4, creativityVideo4, communicationBrowser5, communicationEmail11, communicationEmail21, criticalThinkSS26, criticalThinkSS28, digitalCit4, digitalCit5, and digitalCit10). We reran the DIF analysis by estimating proficiency only on the items that were not flagged as having large or moderate DIF, and found that three of the above items were no longer problematic (i.e., displayed small DIF), but the remainder were still flagged as either moderate or large DIF across groups defined by 6th grade and 8th grade classification.

5.2. External structure validity evidence

The technology literacy estimates were then correlated with three outside criterion measures from the PISA. Pearson's correlations with each of the summated scores from the three criteria are presented in Table 4. The correlations are all positive, small to moderate, and statistically significant. The strongest correlation is with the Comfort with Technology scores, followed by the Attitudes towards Technology scores and the Frequency of Use of Technology scores.
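A minimal sketch of this external-validity step is shown below. The variable names and data are hypothetical; in the study, the ability estimates came from the SCORIGHT calibration and the criterion scores were summated PISA scale scores.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
theta_hat = rng.normal(0.0, 0.97, size=500)                  # estimated ICT literacy
comfort_total = 0.4 * theta_hat + rng.normal(0.0, 1.0, 500)  # toy criterion score

r, p = pearsonr(theta_hat, comfort_total)
print(f"r = {r:.3f}, p = {p:.4g}")
```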
6. Discussion

The results of this study must be interpreted with an understanding of the limitations and delimitations of this research. This study was limited to middle grade students (N = 5884) from school districts in Florida during the fall of 2010. While the ST2L is intended to be software and operating system independent, students may not find the interface similar enough to the specific software suites that they are accustomed to using in their schools and homes; thus, the ST2L may not adequately measure the knowledge and skills of these students. The external validation process included correlating the scores of the ST2L with perceived technology ability levels (Comfort with Technology, Attitudes towards Technology, and Frequency of Use of Technology), which were based on self-report measures. Thus, students completing these self-assessments might have provided what they perceived as socially acceptable responses.


Table 3. Item parameter estimates from Bayesian testlet model estimation with MCMC methods.

Item                      Testlet#   a      se(a)   b      se(b)   c     se(c)
techConcepts1             NA         1.71   .10     .34    .04     .15   .02
techConcepts2             NA         1.31   .10     .48    .07     .24   .02
techConcepts3             NA         .87    .06     .95    .17     .21   .05
techConcepts4             NA         1.41   .13     .97    .06     .24   .02
techConcepts5             NA         1.54   .10     .72    .09     .32   .04
techConcepts6             NA         .39    .08     1.32   .90     .47   .09
techConcepts7             NA         2.66   1.39    2.24   .56     .36   .13
techConcepts8             NA         1.06   .08     .49    .08     .18   .03
techConcepts9             NA         1.53   .20     2.04   .11     .18   .01
techConcepts10            NA         1.81   .12     .98    .09     .38   .04
techConcepts11            NA         1.46   .11     1.10   .14     .41   .05
techConcepts12            NA         1.45   .09     1.56   .14     .31   .06
techConcepts13            NA         1.02   .10     .15    .16     .29   .05
techConcepts14            NA         1.69   .10     .02    .06     .22   .02
techConcepts15            NA         1.06   .07     1.03   .16     .27   .05
techConceptsFileManip1    1          1.43   .06     .51    .03     –     –
techConceptsFileManip2    1          1.46   .06     .41    .03     –     –
techConceptsFileManip3    1          1.66   .07     1.49   .05     –     –
techConceptsFileManip4    1          1.21   .05     .55    .04     –     –
researchWP11              2          .44    .04     4.66   .40     –     –
researchWP12              2          1.60   .06     .87    .03     –     –
researchWP13              2          1.72   .08     1.37   .04     –     –
researchWP14              2          2.13   .08     .93    .03     –     –
researchWP15              2          3.00   .12     .44    .02     –     –
researchWP16              2          1.80   .06     .94    .03     –     –
researchWP17              2          3.47   .15     .48    .02     –     –
researchWP18              2          2.68   .11     1.25   .03     –     –
researchWP19              2          1.63   .06     .08    .03     –     –
researchWP110             2          2.46   .09     .18    .02     –     –
researchWP22              3          2.37   .12     .50    .02     –     –
researchWP23              3          1.62   .07     1.37   .05     –     –
researchFlowchart1        4          1.16   .05     1.37   .06     –     –
researchFlowchart2        4          1.16   .05     .63    .04     –     –
researchFlowchart3        4          1.14   .05     .11    .03     –     –
researchFlowchart4        4          2.28   .11     .01    .02     –     –
researchFlowchart5        4          1.65   .07     .34    .03     –     –
creativityGraphics1       5          1.89   .09     .16    .03     –     –
creativityGraphics2       5          .66    .04     2.55   .15     –     –
creativityGraphics3       5          1.27   .06     .73    .03     –     –
creativityGraphics4       5          1.21   .05     .60    .04     –     –
creativityPresent1        6          2.05   .09     1.63   .05     –     –
creativityPresent2        6          1.16   .05     1.03   .05     –     –
creativityPresent3        6          2.44   .11     .53    .03     –     –
creativityPresent4        6          2.04   .11     2.40   .07     –     –
creativityVideo1          7          .97    .04     1.72   .07     –     –
creativityVideo2          7          1.00   .05     1.66   .07     –     –
creativityVideo3          7          1.24   .07     1.78   .07     –     –
creativityVideo4          7          1.57   .07     1.39   .05     –     –
communicationBrowser1     8          .97    .04     1.38   .06     –     –
communicationBrowser2     8          1.85   .08     1.74   .05     –     –
communicationBrowser3     8          4.73   .41     1.99   .05     –     –
communicationBrowser4     8          4.44   .31     1.62   .04     –     –
communicationBrowser5     8          3.03   .16     1.71   .05     –     –
communicationBrowser6     8          1.77   .07     .20    .03     –     –
communicationBrowser7     8          1.36   .05     .88    .04     –     –
communicationBrowser8     8          2.24   .10     1.44   .04     –     –
communicationEmail11      9          1.13   .05     .56    .04     –     –
communicationEmail12      9          1.52   .06     1.70   .05     –     –
communicationEmail13      9          1.68   .07     1.50   .05     –     –
communicationEmail14      9          2.20   .10     1.86   .05     –     –
communicationEmail15      9          2.23   .10     .93    .03     –     –
communicationEmail16      9          1.97   .10     2.42   .07     –     –
communicationEmail21      10         1.08   .05     1.08   .05     –     –
communicationEmail22      10         1.51   .07     1.09   .04     –     –
communicationEmail23      10         1.97   .09     1.23   .04     –     –
criticalThinkSS11         11         1.52   .07     1.78   .06     –     –
criticalThinkSS12         11         1.90   .09     1.92   .06     –     –
criticalThinkSS13         11         1.01   .05     .70    .04     –     –
criticalThinkSS14         11         1.91   .08     .77    .03     –     –
criticalThinkSS15         11         1.12   .05     .97    .05     –     –
criticalThinkSS21         12         1.88   .07     .77    .03     –     –
criticalThinkSS22         12         1.50   .23     4.89   .55     –     –
criticalThinkSS23         12         1.75   .07     .03    .03     –     –
criticalThinkSS24         12         1.91   .07     .44    .03     –     –
criticalThinkSS25         12         1.85   .08     1.50   .05     –     –
criticalThinkSS26         12         2.04   .08     .83    .03     –     –
criticalThinkSS27         12         1.42   .05     .63    .03     –     –
criticalThinkSS28         12         2.10   .08     .67    .03     –     –
criticalThinkSS29         12         1.68   .09     1.95   .07     –     –
digitalCit1               NA         .40    .06     2.05   .92     .54   .09
digitalCit2               NA         1.85   .13     .05    .06     .34   .03
digitalCit3               NA         .61    .12     .70    .35     .29   .07
digitalCit4               NA         2.65   .16     .60    .05     .32   .03
digitalCit5               NA         2.57   .15     .78    .05     .30   .03
digitalCit6               NA         2.52   .14     .52    .05     .29   .02
digitalCit7               NA         .50    .06     1.83   .56     .39   .09
digitalCit8               NA         1.34   .10     1.11   .16     .42   .05
digitalCit9               NA         1.55   .13     1.39   .17     .49   .05
digitalCit10              NA         1.52   .11     .92    .11     .37   .04
digitalCit11              NA         1.96   .11     .20    .05     .26   .02
digitalCit12              NA         2.22   .19     1.05   .11     .53   .04
digitalCit13              NA         1.99   .12     .20    .06     .31   .02
digitalCit14              NA         1.25   .10     1.47   .19     .39   .06
digitalCit15              NA         2.56   .19     .99    .08     .42   .04
digitalCit16              NA         .82    .14     2.19   .16     .16   .03
digitalCit17              NA         2.13   .23     .79    .14     .62   .04
digitalCit18              NA         2.25   .23     .21    .07     .55   .02
digitalCit19              NA         2.50   .17     .78    .07     .37   .03
digitalCit20              NA         .82    .08     .19    .16     .22   .04
digitalCit21              NA         1.54   .18     1.06   .06     .30   .02
digitalCit22              NA         2.83   .18     .45    .05     .29   .02
digitalCit23              NA         1.80   .10     .58    .06     .23   .03
digitalCit24              NA         2.14   .12     .14    .04     .20   .02
digitalCit25              NA         2.61   .15     .60    .05     .28   .03

In light of these constraints, the overall analysis indicated that the ST2L tool was able to produce scores for the examinees that had sufficient reliability, sound psychometric properties providing evidence of internal validity, and evidence of external validity in relation to related constructs. Specifically, CTT reliability coefficients, item response theory information indicators, and Cronbach's alpha estimates of internal consistency were all of sufficient magnitude to use the test for low-stakes purposes. Tests with stakes for students require reliability coefficients greater than .80, with a preference for coefficients of .85–.90 (Haertel, 2013); these thresholds were met by the data in this study even though the stakes associated with the test are low.

For internal validity, the items fit the theorized testlet and item response theory models to an acceptable degree, ICT literacy estimates had sufficiently low standard errors of measurement, item difficulty was largely aligned with the desired property of a test that covers a wide range of technology literacy levels, item discriminations were sufficiently high to indicate that all item scores were correlated with ICT literacy scores, and the vast majority of subgroups of examinees in the population (e.g., racial subgroups) displayed invariant measurement properties on the items of the ST2L tool. All of these internal psychometric properties indicate that the tool is operating as designed and can produce reliable test scores without wasting examinee time on a test that is too difficult or too easy for the population or that lacks the power to discriminate between persons with different levels of ICT literacy.

Table 4. Correlation of ICT literacy estimates and external criteria scores.

External criteria               Correlation with ICT literacy (r)   Statistical significance (p)
Frequency of Technology Use     .131                                <.001
Comfort with Technology         .333                                <.001
Attitude toward Technology      .212                                <.001


With respect to external validity, the low-stakes ST2L is only useful if it displays the expected relationships with other constructs related to ICT literacy. The results of the external criteria analysis showed that the ICT literacy scores on the ST2L have the small to moderate, positive relationships that were expected with the constructs of frequency of technology use, comfort with technology, and attitudes toward technology. In addition, comfort with technology displayed the strongest relationship with technology literacy, supporting the research hypothesis.

The ST2L tool was originally developed for low-stakes testing purposes, and for low-stakes applications the ST2L is more than satisfactory, as indicated by the internal and external structure analyses. Beginning with the framework of the NETS*S, an extensive, thorough process was followed for defining indicators. The assessment items were mapped to these indicators and provide measurement of the indicators in innovative, relevant, performance-based ways. Test quality criteria demonstrate reasonable item analysis, reliability, and validity results for a relatively short, criterion-referenced test. The tool may be beneficial as districts report aggregated data for NCLB purposes and teachers target technology-related curricular needs.

Examining external structure validity evidence is always difficult when both the internal and external criterion measures produce imperfect test scores. In this study, three PISA measures were used to determine the external structure of the ST2L, but the correlation estimates are most likely attenuated due to factors such as measurement error in the test scores, the small number of items per PISA subscale, and the use of observed scores for the PISA measures. Future studies may want to utilize different and more numerous external criteria.

Four conclusions drawn from the analysis are indicative of some of the limitations of the study as well as minor revisions that are needed to the ST2L before future administration. First, two of the researchWP items (i.e., using word processors) may need revision before future administration of the assessment due to extreme difficulty: one was answered incorrectly by all respondents, and the other had an estimated difficulty parameter indicating it was most appropriate for examinees more than four standard deviation units above the mean of ICT literacy for this population. Similarly, one of the critical thinking items was also very difficult and may need revision or removal. Second, while the majority of examinees completed the exam in its entirety, there was a non-negligible group of examinees who quit the assessment after several items. It was not possible to obtain accurate ability estimates for these examinees, and more incentive to complete the assessment may be needed for this small group who seemed less motivated to complete the exam. Third, some of the selected-response items had a larger amount of guessing than desired. Some guessing is expected on multiple-choice/true-false items, but item removal or revision may be called for on those with particularly large amounts of guessing. Finally, it was clear from the DIF analysis that some items measured different constructs for sixth grade students as compared to eighth grade students. This was a relatively small percentage of the total items, yet it may deserve further consideration as to whether the ST2L tool is best used with a more homogeneous population in terms of grade.
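The attenuation issue noted above can be made concrete with the classical correction for attenuation, a standard psychometric identity that was not applied in this study but indicates how much larger the true-score relationships may be:

\[
\hat{\rho}_{T_X T_Y} = \frac{r_{XY}}{\sqrt{\rho_{XX'}\,\rho_{YY'}}}.
\]

Purely as an illustration, an observed correlation of .333 combined with score reliabilities of roughly .90 (ST2L) and .78 (Comfort with Technology) would correspond to a disattenuated correlation of approximately .40.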
From the practical perspective, the ST2L is a tool available to middle grade educators throughout the state of Florida to meet the NCLB requirements of demonstrating the ICT literacy of 8th grade students in their respective school districts. Other states aside from Florida may also elect to use this tool for reporting requirements within their states by making arrangements with the Florida Department of Education (FLDOE). This tool can be used as a low-stakes assessment to provide data related to the technology literacy of middle grade students for district reporting, curriculum design, and student self-assessment. The tool, in its present form, is not suitable for use in high-stakes applications such as computing school grades or evaluating individual student performance for promotion/retention.

There is also something to be said about the development procedures for the ST2L and the associated NETS*S. As previously described, the ST2L development team followed sound development procedures for item writing and review and conducted usability analyses to ensure that the user interface and the simulated performance-based tasks were as clear and intuitive as possible (Hohlfeld et al., 2010). The development team included content teachers (e.g., mathematics), computer teachers, educational technology specialists, media specialists, programmers, college professors and several other key stakeholders to help operationalize the NETS*S at age-appropriate benchmarks. Had these rigorous procedures not been followed, the psychometric properties of the tool would likely not be acceptable.

The ST2L has the potential to be used in several research applications in the field of educational technology beyond simple reporting purposes. For instance, Holmes (2012) used the ST2L as a measure of 21st century skills in her dissertation focusing on the effects of project-based learning experiences on middle grade students. Another example comes from Ritzhaupt, Liu, Dawson, and Barron (2013), who used the ST2L to demonstrate a technology knowledge and skill gap (digital divide) between students of high and low socio-economic status, white and non-white students, and female and male students. Further, Hohlfeld, Ritzhaupt, and Barron (2013) used the tool in a comprehensive study of the relationship between gender and ICT literacy. Future researchers can use the ST2L to expand understanding of technology-enhanced teaching and learning in the 21st century while accounting for several important variables (e.g., socio-economic status) in their models.

As we have shown in this paper, there are many measurement systems that are related to ICT literacy. While each of these tools contributes to our understanding of the measurement of ICT literacy in various populations, not all of these measurement systems used innovative items (e.g., simulated software tasks) to measure the multifaceted construct. In fact, most instruments reviewed here used traditional measures of ICT literacy based on paper/pencil or online assessments via self-report measures (e.g., Bunz, 2004; Compeau & Higgins, 1995; Parasuraman, 2000; Schmidt et al., 2009). The authors believe that future instruments must use technology in their design frameworks to measure this complex construct. That is, the measurements themselves should use the types of innovative items found on the ST2L, the iSkills assessment, or the in-progress NAEP TEL assessment.
Doing so provides a more objective and authentic measure of ICT literacy beyond simple self-report and increases the generalizability of the measure's scores to real-life application of ICT literacy skills.

This paper reports a systematic and rigorous process for the validation of a measure of ICT literacy based on modern test theory techniques (i.e., item and testlet response theory). As the body of knowledge grows in this realm, we must periodically revise and refine our measures. As noted by Hohlfeld et al. (2010), "Validation of measurement instruments is an ongoing process. This is especially true when dealing with the measurement of technology literacy while using technology, because technology is perpetually changing. The capabilities of the hardware and software continue to improve and new innovations are introduced. As a result, developing valid and reliable instruments and assessment tools to measure the construct is difficult and an ongoing process" (p. 384).


The authors believe that the measurement of ICT literacy is a vital 21st century endeavor, as evidenced by the calls from NSF, NIH, and others. Because technology evolves so quickly, we must periodically update our measurement systems to reflect the newest innovations. This paper contributes to this charge by re-aligning the tool with the 2007 NETS*S standards.

References

Abitt, J. T. (2011). An investigation of the relationship between self-efficacy beliefs about technology integration and technological pedagogical content knowledge (TPACK) among preservice teachers. Journal of Digital Learning in Teacher Education, 27(4), 134–143.
Bandura, A. (1986). Social foundations of thought and action. Englewood Cliffs, NJ: Prentice Hall.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord, & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–460). Reading, MA: Addison-Wesley.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
Bunz, U. (2004). The computer-email-web (CEW) fluency scale: Development and validation. International Journal of Human–Computer Interaction, 17(4), 479–506.
Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.
Camilli, G., & Penfield, D. A. (1997). Variance estimation for differential test functioning based on Mantel–Haenszel statistics. Journal of Educational Measurement, 34, 123–139.
Chai, C. S., Ling Koh, J. H., Tsai, C., & Wee Tan, L. L. (2011). Modeling primary school pre-service teachers' technological pedagogical content knowledge (TPACK) for meaningful learning with information and communication technology (ICT). Computers & Education, 57, 1184–1193.
Christensen, R., & Knezek, G. A. (2014). Measuring technology readiness and skills. In Spector, Merrill, Elen, & Bishop (Eds.), Handbook of research on educational communications and technology (pp. 829–840). New York: Springer.
Compeau, D. R., & Higgins, C. A. (1995). Computer self-efficacy: Development of a measure and initial test. MIS Quarterly, 19(2), 189–211.
Enders, C. K. (2010). Applied missing data analysis. New York: The Guilford Press.
Haertel, E. H. (2013). Reliability and validity of inferences about teachers based on student test scores (ETS Memorial Lecture Series Reports). Princeton, NJ: Educational Testing Service.
Hohlfeld, T. N., Ritzhaupt, A. D., & Barron, A. E. (2010). Development and validation of the Student Tool for Technology Literacy (ST2L). Journal of Research on Technology in Education, 42(4), 361–389.
Hohlfeld, T., Ritzhaupt, A. D., & Barron, A. E. (2013). Are gender differences in perceived and demonstrated technology literacy significant? It depends on the model. Educational Technology Research and Development, 61(4), 639–663.
Holmes, L. M. (2012). The effects of project-based learning on 21st century skills and No Child Left Behind accountability standards. Doctoral dissertation, University of Florida.
International Society for Technology in Education. (2007). National Educational Technology Standards for Students. Retrieved from http://www.iste.org/standards/standards-for-students/nets-student-standards-2007.
Jenkins, H. (2006). Convergence culture: Where old and new media collide. New York: New York University Press.
Katz, I. R., & Macklin, A. S. (2007). Information and communication technology (ICT) literacy: Integration and assessment in higher education. Journal of Systemics, Cybernetics and Informatics, 5(4), 50–55.
Koh, J. H., & Divaharan, S. (2011). Developing pre-service teachers' technology integration expertise through the TPACK-developing instructional model. Journal of Educational Computing Research, 44(1), 35–58.
Learning. (2012). TechLiteracy Assessment. Available at http://www.learning.com/techliteracy-assessment/.
Mishra, P., & Koehler, M. (2006). Technological pedagogical content knowledge: A framework for teacher knowledge. The Teachers College Record, 108(6), 1017–1054.
Muthén, L. K., & Muthén, B. O. (1998–2012). Mplus user's guide (7th ed.). Los Angeles, CA: Muthén & Muthén.
National Assessment of Educational Progress. (2014). Technology and Engineering Literacy assessment. Retrieved from https://nces.ed.gov/nationsreportcard/tel/.
National Research Council. (2011). Assessing 21st century skills: Summary of a workshop. Washington, DC: The National Academies Press.
National Research Council. (2008). Research on future skill demands. Washington, DC: National Academies Press.
NSF. (2006). New formulas for America's workforce 2: Girls in science and engineering. Washington, DC.
Orlando, M., & Thissen, D. (2000). Likelihood-based item fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S–X2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27, 289–298.
Parasuraman, A. (2000). Technology Readiness Index (TRI): A multiple-item scale to measure readiness to embrace new technologies. Journal of Service Research, 2(4), 307–320.
Partnership for 21st Century Skills. (2011). Framework for 21st century learning. Washington, DC. Retrieved from http://www.p21.org/tools-and-resources/publications/1017educators#defining.
Penfield, R. D. (2012). DIFAS 5.0: Differential item functioning analysis system user's manual.
PISA. (2012). Program for International Student Assessment (PISA). Available at http://nces.ed.gov/surveys/pisa/.
Ritzhaupt, A. D., Liu, F., Dawson, K., & Barron, A. E. (2013). Differences in student information and communication technology literacy based on socio-economic status, ethnicity, and gender: Evidence of a digital divide in Florida schools. Journal of Research on Technology in Education, 45(4), 291–307.
Schmidt, D. A., Baran, E., Thompson, A. D., Mishra, P., Koehler, M. J., & Shin, T. S. (2009). Technological pedagogical content knowledge (TPACK): The development and validation of an assessment instrument for preservice teachers. Journal of Research on Technology in Education, 42(2), 123.
Sinharay, S. (2004). Experiences with MCMC convergence assessment in two psychometric examples. Journal of Educational and Behavioral Statistics, 29, 461–488.
ST2L. (2013). Student Tool for Technology Literacy (ST2L). Retrieved from http://st2l.flinnovates.org/index.aspx.
Tristán-López, A., & Ylizaliturri-Salcedo, M. A. (2014). Evaluation of ICT competencies. In Spector, Merrill, Elen, & Bishop (Eds.), Handbook of research on educational communications and technology (pp. 323–336). New York: Springer.
U.S. Department of Education. (2001). No Child Left Behind: Enhancing Education Through Technology Act of 2001. Retrieved from http://www.ed.gov/policy/elsec/leg/esea02/pg34.html.
U.S. Department of Education. (2010). National Educational Technology Plan 2010. Retrieved from http://www.ed.gov/technology/netp-2010.
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. New York: Cambridge University Press.
Wainer, H., Bradlow, E. T., & Wang, X. (2010). Detecting DIF: Many paths to salvation. Journal of Educational and Behavioral Statistics, 35(4), 489–493.
Wang, X., Baldwin, S., Wainer, H., Bradlow, E. T., Reeve, B. B., Smith, A. W., et al. (2010). Using testlet response theory to analyze data from a survey of attitude change among breast cancer survivors. Statistics in Medicine, 29, 2028–2044.
Wang, X., Bradlow, E. T., & Wainer, H. (2004). A user's guide for SCORIGHT (Version 3.0): A computer program built for scoring tests built of testlets including a module for covariate analysis (ETS Research Report RR-04-49). Princeton, NJ: Educational Testing Service.
Zieky, M. (1993). DIF statistics in test development. In P. W. Holland, & H. Wainer (Eds.), Differential item functioning (pp. 337–347). Hillsdale, NJ: Erlbaum.