Computers in Human Behavior 27 (2011) 2005–2010
Measurement invariance in training evaluation: Old question, new context

J. William Stoughton, Amanda Gissel, Andrew P. Clark, Thomas J. Whelan

North Carolina State University, Department of Psychology, 640 Poe Hall, Campus Box 7650, Raleigh, NC 27695-7650, United States
Keywords: Surveys; Computer-mediated surveys; Training; Training evaluation; Differential item functioning; Item response theory

Abstract

Technological advances that have been put to use by organizations have not escaped the training domain. With the shift towards computer-mediated surveys, training evaluations have been converted from traditional paper-and-pencil formats to Web-based environments. This raises the question of whether these modalities are equivalent. Accordingly, this study examined the item functioning of parallel Web-based and traditional paper-and-pencil evaluations of a training intervention. Item response theory (IRT) analyses revealed few differences in how an individual would respond to particular items (i.e., differential item functioning) regardless of the modality used to complete a training evaluation. This provides evidence for the equivalence of paper-and-pencil and computer-mediated training evaluations.

© 2011 Elsevier Ltd. All rights reserved.
1. Introduction

The implementation of training programs, whether for employee development or organizational interventions, has unequivocally become an industry unto itself. More than 90% of all private businesses have reported using some type of systematic intervention to train employees, and virtually all medium and large organizations have reported systematically training managerial employees (Goldstein, 1986; Saari, Johnson, McLaughlin, & Zimmerle, 1988). In 2000, US organizations with more than 100 employees budgeted $54 billion for formal employee training (Industry Report, 2000). In light of this large investment of time and capital in developing and implementing training interventions, it is essential that organizations also have systems in place to ensure acceptable returns on their investment, specifically through training evaluation. In addition, more and more organizations are turning to computer-based data collection formats when gathering information from large groups of people. They do so in search of reduced administration expenses and a shorter cycle from data collection to usable feedback. These advantages have encouraged an increasing number of companies to adopt computer-based data collection techniques (Thompson, Surface, Martin, & Sanders, 2003). Organizations are also increasingly investing in information technology (IT) to support all aspects of organizational work, from group work to individual teaching, training, and learning (Isakowitz, Bieber, & Vitali, 1998). The intersection of training evaluation and technology is an area
not yet thoroughly explored; there has been little evaluation of technology or technology-based methods for gathering training data. Training evaluation is defined as the systematic collection of descriptive and judgmental information necessary to make effective decisions related to the selection, adoption, value, and modification of a training intervention (Goldstein & Ford, 2002). The information gleaned from a training evaluation is then available to aid in the effective revision of a training intervention toward any number of instructional objectives. According to Goldstein and Ford (2002), training evaluation is best viewed as an information gathering technique whose goal is to capture, as well as possible, the dynamic flavor and objectives of a training program. They go on to state that training program objectives can reflect numerous goals, ranging from individual trainee progress to larger organizational-level goals, including overall profitability (Goldstein & Ford, 2002). Because training can have such a wide variety of objectives (e.g., increasing efficiency, teaching declarative knowledge, decreasing the number of workplace accidents, skill-building and development) and utilize such a wide range of modalities (e.g., traditional classroom training, self-directed workbook training, self-directed computer or Web-based training, on-the-job training, virtual reality simulation), training evaluation must be flexible enough to take a variety of different forms. Examples of these forms include skill-based tests (e.g., demonstrating the ability to properly operate a forklift), behavioral measures (e.g., utilizing newly learned leadership skills), job performance measures (e.g., effectively prioritizing work tasks), efficiency measures (e.g., how many more widgets were produced per hour after the training intervention), and utility analyses (e.g., how the training intervention affected the organization's bottom line).
Due to the great diversity of possible training objectives and modalities, it is not the current study's intention to tackle the multifarious topic of measurement equivalence across all possible types of training evaluation; rather, the study serves as a first step towards exploring measurement equivalence across survey modalities in the context of training evaluation. One of the more commonly utilized tools for evaluating training interventions is the survey (e.g., Bennett, Alliger, Eddy, & Tannenbaum, 2003; Blumenfeld & Crane, 1973; Gray, Hall, Miller, & Shasky, 1997; Lauver, 2007; Wreathall & Connelly, 1992). Trainee surveys can be effective tools for capturing data on potential training evaluation outcomes, such as acquired declarative knowledge, trainee affective reactions, and trainee perceptions of overall training utility. Trainee surveys were chosen for this first step because the survey is such a commonly used method of training evaluation and because there is an existing body of literature on this topic in other business-related contexts (e.g., performance evaluation), which serves as a solid jumping-off point for the current study. Most of the recent research on measurement equivalence between traditional and electronic (i.e., computer-based) modalities has focused on general survey research. In this body of literature there is some evidence supporting measurement equivalence across survey modalities (Cole, Bedeian, & Feild, 2006; Potosky & Bobko, 1997; Stanton, 1998), although there is not yet a clear consensus in the field (Buchanan, Johnson, & Goldberg, 2005; Fouladi, McCarthy, & Moller, 2002). Considered collectively, research on differing modes of administration has produced mixed findings and has been limited in quantity (Cole et al., 2006). Thompson and colleagues (2003) likewise concluded that research exploring measurement invariance across paper-and-pencil and Web-based surveys remains in its infancy. Moreover, past work on measurement equivalence in survey research has been prone to certain methodological limitations. The vast majority of past studies comparing electronic to traditional paper-and-pencil survey instruments have utilized undergraduate students or participants recruited from the Internet (Buchanan & Smith, 1999; Epstein, Klinkenberg, Wiley, & McKinley, 2001; King & Miles, 1995; Knapp & Kirk, 2003; Potosky & Bobko, 1997). While these populations are relatively accessible in large numbers, dependence on such convenience samples raises questions about the representativeness of results gleaned from these studies, and the non-random nature of these samples is also cause for concern regarding the generalizability of results to other settings (Kraut et al., 2004). There is thus a paucity of research on measurement equivalence in the training evaluation context, and a gap in the literature concerning measurement equivalence between traditional (i.e., paper-and-pencil) and electronic (i.e., Web-based) formats in the context of training evaluation. According to the APA, when a measure is converted for use online, it is important for scientific inference, not to mention ethical responsibility, to provide evidence of measurement equivalence (ethical principles 9.02b and 9.05; American Psychological Association, 2002). Buchanan et al.
(2005) also remarked that "it is clear that one cannot simply mount an existing [measure] on the World Wide Web and assume that it will be exactly the same instrument" (p. 125). A lack of evidence for measurement equivalence weakens any conclusions drawn from such data, because the findings remain vulnerable to alternative explanations (Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000). The attention given to this topic by the APA indicates that measurement equivalence across modalities is an important issue to consider, and to evaluate accurately, whenever a new modality of administration is introduced. However, despite continued calls
for further research on measurement invariance (Millsap & Kwok, 2004; Vandenberg, 2002), organizational researchers still tend to assume that traditional and electronic surveys are equivalent across modalities (Cole et al., 2006). As outlined above, there is a dearth of research examining measurement equivalence across modalities in training evaluation, as well as mixed results and methodological concerns surrounding the current body of literature on measurement equivalence across modalities in general survey research. Therefore, we contend that applying findings concerning measurement equivalence from general surveys directly to training evaluation would be inappropriate. Accordingly, the question of measurement equivalence needs to be re-tested in the context of training evaluation as a first step towards examining measurement invariance in training evaluation modalities.

To establish measurement equivalence, this study examines training evaluation data for differential item functioning (DIF). DIF occurs when an item functions differently for one subgroup of a population than it does for another (Camilli & Shepard, 1994; Thissen, 2001). DIF analyses are typically conducted using item response theory (IRT), a model-based theory of measurement that considers how item properties and trait levels relate to an individual's responses to items (Embretson & Reise, 2000). When an item exhibits DIF, individuals from different groups who have the same standing on the latent construct measured by the scale are not equally likely to endorse a given response option. Consequently, the primary research question this study addresses is whether the modality used to administer a training evaluation (i.e., electronic versus traditional paper-and-pencil) elicits any significant difference in ratings of affective response and perceived utility at the item level. We examine this by testing measurement equivalence between traditional paper-and-pencil and electronic formats of a classroom instructor and course evaluation.

2. Method

2.1. Participants

Participants were 322 undergraduate students from a large Southeastern university who took part in the study in exchange for course credit. The average age of participants was 18.76 years (SD = 2.04), 65% of the sample was female, and a majority of the sample was Caucasian (77%).

2.2. Procedure

As part of a larger data collection effort, this study utilized a between-groups design with one independent variable, survey format, with four levels: paper-and-pencil, Web-based with no access controls, Web-based with group access controls, and Web-based with individual access controls. Upon giving their informed consent on an HTML Web page, participants were directed to complete a survey. Participants were randomly assigned to one of the four study conditions via a JavaScript control embedded in the informed-consent page. Participants in the paper-and-pencil condition were directed to pick up a paper copy of the survey from an envelope in a hallway of the same building in which they attend classes (i.e., an easily accessible area). In each of the three Web-based conditions, participants were provided a link to the survey materials. Importantly, all survey content was the same across conditions; only the modality changed.
Questionnaires were adapted from pre-existing measures and concerned participants’ satisfaction with their course, instructor, and university climate.
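The random assignment described above was implemented as a JavaScript control embedded in the informed-consent page; that script is not reproduced in the article. The snippet below is only a minimal sketch of the assignment logic, written in Python for illustration, with hypothetical condition labels.

# Sketch of uniform random assignment to the four survey-format conditions.
# The study used a JavaScript control on the consent page; this Python version
# only illustrates the logic and is not the authors' code.
import random

CONDITIONS = [
    "paper-and-pencil",
    "web-no-access-controls",
    "web-group-access-controls",
    "web-individual-access-controls",
]

def assign_condition(rng=random):
    """Return one of the four conditions with equal probability."""
    return rng.choice(CONDITIONS)

# Example: assignments for ten consenting participants.
print([assign_condition() for _ in range(10)])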
2.2.1. Paper-and-pencil modality

After volunteering and giving consent to participate, respondents in the paper survey condition were instructed to pick up a paper copy of the survey from an envelope located in an easily accessible area (i.e., a hallway within the same building in which respondents attend classes). Participants claimed blank surveys independently and at their convenience, without interacting with a survey administrator or other individuals. The pick-up site was accessible to students on weekdays from 8:00 a.m. to 10:00 p.m. The paper survey was printed on plain paper and asked respondents to circle an option from the response scale for each item; the appropriate response scale was presented following each item. The paper survey was accompanied by a stamped envelope that the respondent used to mail the finished survey back to the researcher at a university address. Instructions accompanying the paper survey then directed the respondent to access a Web page that contained debriefing information and an electronic name form field to complete so that course credit could be assigned. Respondents in all four conditions were provided with identical debriefing information, which explained the intent of the study in detail and asked them not to discuss the study with others.

2.2.2. Web-based modality without access controls

After volunteering and giving consent to participate, respondents in this Web-based condition were given a link to the survey materials. No PIN was required to access the survey in this condition. Regardless of whether a PIN was required, the survey in all Web-based conditions was administered via a commercial online survey vendor that presented the items identically and in the same order as the paper format, with the appropriate response scale presented via radio buttons following each item. A progress bar was visible in the Web-based survey to let participants know how close they were to completing the survey, comparable to the ability of participants in the paper-and-pencil condition to see how many physical pages of the survey remained. Participants in all of the Web-based conditions were allowed to backtrack and change responses, as they could in the paper condition. After participants in the Web-based conditions completed the survey and submitted their responses, they were automatically directed to a page with debriefing information and an electronic name form field to complete so that course credit could be assigned.

2.2.3. Web-based modality with group access controls

After volunteering and giving consent to participate, respondents in this Web-based condition were given a link to a Web page that provided them with a PIN to access the survey. Respondents in this condition were informed that the PIN they were using to access the survey was shared with other students from their university and was being used to protect the integrity of their survey responses. Respondents were required to enter the PIN before beginning the survey and were not permitted to continue if they did not enter the correct PIN. The rest of the survey was identical to the instrument administered in the Web-based modality without access controls condition.

2.2.4. Web-based modality with individual access controls

After volunteering and giving consent to participate, respondents in this Web-based condition were given a link to a Web page that provided them with a PIN to access the survey.
The respondents in this condition were informed that the PIN they were using to access the survey was assigned on an individual basis and was specific to their survey. Respondents were required to enter the PIN before beginning the survey and were not permitted to continue without entering the correct PIN. The rest of the survey
was identical to the instrument administered in the other Web-based conditions described above.

Data collection continued until 90 paper-and-pencil surveys had been distributed and 90 Web-based surveys had been received in each of the three Web-based conditions. The program used to randomly assign participants to conditions was unable to track the distribution of electronic surveys, so completed surveys had to be counted; this resulted in more Web-based than paper-and-pencil surveys being available for analysis. The issue was exacerbated because 21 participants consented to complete the paper-and-pencil survey but failed to mail it back. The total number of usable surveys per condition was as follows: paper-and-pencil (N = 66), Web-based with no access controls (N = 84), Web-based with group access controls (N = 88), and Web-based with individual access controls (N = 84).

2.3. Measures

2.3.1. Instructor and course evaluation form

Instructor and course evaluation form (14 items; α = 0.92 and α = 0.89 for the instructor and course subscales, respectively). Scales concerning participants' satisfaction with undergraduate psychology instructors and perceptions of university climate (adapted from Patterson, West, & Shackleton, 2005) were used to simulate the affective and utility ratings commonly assessed in training evaluation research. Responses were given on a Likert-type scale from (1) strongly disagree to (5) strongly agree. An example item from the instructor evaluation form is, "The instructor stated the course objectives/outcomes." An example item from the course evaluation form is, "The course readings were valuable aids to learning." Responses to the university climate questions (28 items) were presented in conjunction with the instructor and course evaluation form but were not included in later analyses, as they were not directly relevant to the aims of the current study.

2.3.2. Demographic characteristics

Demographic characteristics (4 items). Demographic information concerning participants' age, gender, ethnicity, and university class standing was collected as part of the survey.

3. Results

3.1. Background analyses

Prior to conducting the analyses corresponding to the study's research questions, a confirmatory factor analysis (CFA) was run to test for invariance across the three Web-based survey format conditions, in order to determine whether those conditions could be collapsed. Using Amos 16, a multi-group CFA was run to compare the respective models. The results indicated invariance across the groups, with a chi-square difference of χ²(26, N = 256) = 17.70, p > .05. Accordingly, the three Web-based conditions were collapsed for further analyses.
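The invariance decision above rests on a chi-square difference test. As a minimal illustration (using only the values reported in this section; the multi-group CFA itself was estimated in Amos 16 and is not reproduced here), the difference statistic can be checked against the chi-square distribution as follows.

# Check of the reported chi-square difference test for multi-group invariance.
# Only the reported difference statistic is evaluated; the CFA was run in Amos 16.
from scipy.stats import chi2

delta_chisq = 17.70  # chi-square difference reported in Section 3.1
delta_df = 26        # difference in degrees of freedom between the models

p_value = chi2.sf(delta_chisq, delta_df)  # upper-tail probability
print(f"chi-square diff = {delta_chisq}, df = {delta_df}, p = {p_value:.2f}")
# p is well above .05, so the equality constraints are retained and the three
# Web-based conditions are treated as equivalent (and collapsed).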
3.2. Exploratory factor analysis (EFA)

IRT analyses assume unidimensionality, so an exploratory factor analysis (EFA) was conducted in SPSS using principal-axis factoring on the initial 14 instructor and course evaluation items, using the full sample of 322 participants. Cattell's (1966) scree test and the eigenvalues were used to determine the number of factors to retain. This initial factor analysis indicated a two-factor solution. The factor analysis was then re-run specifying two factors with promax (i.e., oblique) rotation and assessed for interpretability. Tabachnick and Fidell (2001) cite 0.32 as a good rule of thumb for the minimum loading of an item; accordingly, one item was dropped on the basis of this rule prior to the re-run. The factor loadings for the remaining 13 items are presented in Table 1. Reliability analyses were run on each of the resulting factors: Cronbach's alpha was α = 0.92 for the instructor subscale and α = 0.89 for the course subscale. No further items were discarded as a result of the reliability analyses.

Table 1
Factor loadings for the instructor and course evaluation form.

Item  Item stem                                                     Instructor loadings(a)  Course loadings(b)
1     The instructor stated the course objectives/outcomes          0.540                   0.228
2     The instructor explained difficult material well              0.479                   0.270
3     The instructor was enthusiastic about teaching the course     0.952                   0.159
4     The instructor was prepared for class                         0.897                   0.050
5     The instructor gave prompt and useful feedback                0.690                   0.030
6     The instructor effectively used instructional technology      0.625                   0.142
7     The instructor consistently treated students with respect     0.761                   0.030
8     Overall, the instructor was an effective teacher              0.693                   0.195
9     The course readings were valuable aids to learning            0.111                   0.828
10    The course assignments were valuable aids to learning         0.084                   0.864
11    This course was intellectually challenging and stimulating    0.145                   0.665
12    This course improved my knowledge of the subject              0.295                   0.583
13    Overall, this course was excellent                            0.361                   0.519

(a) The eigenvalue for this factor was 7.93, explaining 61% of the variance.
(b) The eigenvalue for this factor was 1.00, explaining 8% of the variance.
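For readers who wish to reproduce this type of analysis outside SPSS, the sketch below illustrates the scree inspection, the two-factor principal-axis solution with promax rotation, and the subscale reliability step. It is not the authors' syntax: it assumes the item responses sit in a pandas DataFrame, relies on the third-party factor_analyzer package, and uses simulated placeholder data (and a hypothetical 8/5 item split mirroring Table 1) so that it runs on its own.

# Illustrative re-creation of the Section 3.2 steps (not the authors' SPSS syntax).
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package (assumed available)

# Placeholder data: 322 respondents x 14 Likert items (1-5). Replace with real responses.
rng = np.random.default_rng(0)
items = pd.DataFrame(rng.integers(1, 6, size=(322, 14)),
                     columns=[f"item{i}" for i in range(1, 15)])

# Scree inspection: eigenvalues of the item correlation matrix, largest first.
eigenvalues = np.sort(np.linalg.eigvalsh(items.corr().to_numpy()))[::-1]
print("Eigenvalues:", np.round(eigenvalues, 2))

# Two-factor principal-axis solution with an oblique (promax) rotation.
fa = FactorAnalyzer(n_factors=2, method="principal", rotation="promax")
fa.fit(items)
loadings = pd.DataFrame(fa.loadings_, index=items.columns,
                        columns=["instructor", "course"])

# Items with no loading of at least .32 (Tabachnick & Fidell, 2001) are candidates
# for removal before re-running the analysis.
weak_items = loadings[(loadings.abs() < 0.32).all(axis=1)].index.tolist()
print("Items below the .32 rule of thumb:", weak_items)

def cronbach_alpha(subscale: pd.DataFrame) -> float:
    """Cronbach's alpha for the items forming one subscale."""
    k = subscale.shape[1]
    item_var_sum = subscale.var(axis=0, ddof=1).sum()
    total_var = subscale.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical split: first eight columns as instructor items, next five as course items.
print("Alpha (instructor subscale):", round(cronbach_alpha(items.iloc[:, :8]), 2))
print("Alpha (course subscale):", round(cronbach_alpha(items.iloc[:, 8:13]), 2))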
3.3. IRTLRDIF analyses

3.3.1. Instructor evaluation

Using the graded response model for polytomous data (Samejima, 1969, 1997), a commonly used IRT model, IRTLRDIF did not identify any instructor evaluation items as displaying DIF across the paper-and-pencil and Web-based survey modalities. As shown in Table 2, DIF was not identified for any item in the compact model when all item parameters were held equal. In other words, for persons responding to instructor evaluation surveys, the modality of administration does not affect measurement equivalence.
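For reference, Samejima's graded response model expresses the probability of responding in category k or above of item i as a two-parameter logistic boundary response function, with category probabilities obtained as differences between adjacent boundaries (this is the standard form of the model; the notation here is generic rather than taken from the article):

\[
P^{*}_{ik}(\theta) = \frac{1}{1 + \exp\!\left[-a_i\,(\theta - b_{ik})\right]},
\qquad
P(X_i = k \mid \theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta),
\]

with the conventions \(P^{*}_{i1}(\theta) = 1\) and \(P^{*}_{i,K+1}(\theta) = 0\) for an item with K ordered response categories. Here \(a_i\) is the item discrimination (slope) parameter and the \(b_{ik}\) are the category threshold parameters. In the likelihood-ratio approach implemented by IRTLRDIF, the compact model constrains these parameters to be equal across the paper-and-pencil and Web-based groups, and DIF is flagged when freeing them in an augmented model significantly improves fit.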
Table 2
Instructor IRTLRDIF results.

Item    Compact model χ²    df    p
1       2.4                 5     0.79
2       2.7                 5     0.75
3       3.7                 5     0.59
4       2.9                 5     0.72
5       5.9                 5     0.32
6       3.4                 5     0.64
7       3.5                 5     0.62
8       3.3                 5     0.65
3.3.2. Course evaluation

Again using the graded response model for polytomous data (Samejima, 1969, 1997), IRTLRDIF did not flag a single item for DIF across the paper-and-pencil and Web-based survey modalities in the compact model when all item parameters were held equal (see Table 3 for individual item results). However, for Course Evaluation Item 2, an augmented model showed DIF when the a parameters were held equal. The boundary response function (BRF) in Fig. 1 depicts Course Evaluation Item 2 (item stem, "The course assignments were valuable aids to learning"). As can be seen in the BRF, for respondents below theta = 0 some DIF occurs for the less favorable response options, whereas for individuals with theta > 1, DIF does not appear to be an issue. Stated differently, for individuals responding to course evaluation surveys, the modality of administration is invariant with this one exception. Overall, the DIF analyses revealed no substantial differences in how an individual would respond to the items regardless of the modality used to complete the survey.

Table 3
Course IRTLRDIF results.

Item    Compact model χ²    df    p
1       7.4                 5     0.19
2       9.6                 5     0.09
3       2.7                 5     0.75
4       7.3                 5     0.20
5       7.5                 5     0.19

[Fig. 1. Course IRTLRDIF results – item 2, a parameters equal.]
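To make the pattern summarized in Fig. 1 concrete, the sketch below evaluates graded-response-model boundary response functions for two groups whose a (slope) parameters are equal but whose lower thresholds differ. The parameter values are hypothetical, chosen only to reproduce the qualitative pattern described above (divergence below theta = 0, near-identical curves above theta = 1); they are not the estimates from this study.

# Hypothetical illustration of the BRF comparison in Fig. 1 (not the study's estimates).
import numpy as np

def boundary_response(theta, a, thresholds):
    """GRM boundary curves P(X >= k | theta), one column per threshold b_k."""
    theta = np.asarray(theta, dtype=float)[:, None]
    b = np.asarray(thresholds, dtype=float)[None, :]
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3.0, 3.0, 121)
a_common = 1.5                                  # equal slopes across groups
b_paper = np.array([-2.0, -1.0, 0.0, 1.2])      # hypothetical thresholds, paper group
b_web = np.array([-1.4, -0.5, 0.1, 1.2])        # lower thresholds shifted, Web group

gap = np.abs(boundary_response(theta, a_common, b_paper)
             - boundary_response(theta, a_common, b_web)).max(axis=1)

# With these values the curves diverge only below about theta = 0 and nearly
# coincide above theta = 1, mirroring the DIF pattern reported for Item 2.
print(f"largest BRF gap {gap.max():.2f} occurs at theta = {theta[gap.argmax()]:.2f}")
print(f"gap at theta = 1.5: {gap[theta >= 1.5][0]:.3f}")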
4. Discussion

As training evaluation is an oft-neglected stage in the process of program development (Goldstein & Ford, 2002), it is important that any data gleaned through such evaluation be accurate. While this lack of evaluation may be prevalent in the business domain, educators routinely perform course evaluation following course completion (Smith, 2007). The sample evaluated in this study had recently experienced the shift from paper-and-pencil to computer-based course evaluation and therefore presented an excellent opportunity to investigate differences between evaluative modalities. Though the validity of course evaluations has been questioned, the survey data they produce, much like the data produced by training program evaluation, is used for feedback and administrative decision making (Smith, 2007). As these evaluations are primarily used for course maintenance, and to a lesser degree affect the careers of the instructors evaluated, it is just as critical in the education domain as in the business domain that the data produced be as accurate as possible.

Data for this study were analyzed using a variety of methods. Factor loadings produced by the exploratory factor analysis indicated that the evaluation consisted of two factors identified by the evaluative target of their constituent items: loadings on the first factor were primarily associated with the evaluation of the course instructor, while loadings on the second factor were primarily associated with the evaluation of the course itself. Two items that did not fit cleanly into either factor were further investigated to identify possible reasons for their anomalous loadings. One of these items produced weak loadings on both factors and was therefore omitted from further IRTLRDIF analyses, as it did not seem to fit conceptually with the other items and likely assessed a factor outside the scope of this study. The discarded item (item stem, "The instructor was receptive to students outside the classroom") concerned an instructor's availability outside the classroom and was not directly related to evaluation of the instructor's performance or the course itself. The second anomaly, Item 14, loaded substantially on both factors; this cross-loading item assessed a student's overall evaluation of the course. As students likely drew on both the instructor and course utility ratings when answering Item 14, it is not unusual for cross-loading to have occurred.

Overall, the study's results indicate that paper-and-pencil evaluation methods do not produce data that differ significantly from data produced by computer-based evaluation, with one exception. As noted above, for individuals below theta = 0 responding to Course Evaluation Item 2 (item stem, "The course assignments were valuable aids to learning"), some DIF occurs for the less favorable response options. These findings suggest that computer-mediated and paper-based evaluation formats are sufficiently interchangeable, and that no detriment has been caused by the modality shift in the education or training domains. Furthermore, it may be possible for practitioners to reliably compare a mix of paper and electronic evaluations of a single intervention. This could prove useful because not all individuals have the computer access required for electronic evaluation; in such circumstances a paper-and-pencil version may be substituted.

4.1. Limitations

Although this study contributes to the extant literature and found no detriment to evaluative data caused by the recent shift in training evaluation methods, certain limitations hinder the generalizability of the results. The most readily identifiable of these is the substitution of an academic course for a training program. Though the use of a classroom setting is not unprecedented in business and military training (Barnett, 2007; Boyce, LaVoie, Streeter, Lochbaum, & Psotka, 2008; Rothwell & Kazanas, 1994), there are a number of other factors (e.g., students as trainees, a semester as a series of training sessions) that compound issues of generalizability to the workplace. Ultimately, there is not enough cross-disciplinary research in the domains of training and education to confidently endorse such a substitution; therefore, the findings of this study are at most a bridge for future research. In addition, because this study was conducted through an online university recruiting tool, no attempts were made to directly control the experimental surroundings. This was done in an effort to enhance the generalizability of the study's findings by mimicking an applied setting: in practice, respondents asked to complete Web-based surveys are unlikely to be, and often cannot be, controlled. Therefore, participants were free to access the survey materials anywhere they could reach an Internet-connected computer.

4.2. Future research
The current study focused on the evaluation of an academic course, but its findings are not limited to the field of education. It is likely that similar investigations of workplace training programs would produce equivalent results; therefore, future research should investigate these modalities in the context of employee training to ensure that electronic modalities are not compromising evaluative data. Additionally, researchers could explore other response formats (e.g., open-ended comments, ipsative items) and their performance across media. As surveys are not the only method employed in training evaluation, future research should also investigate evaluative differences between other traditional methods (e.g., on-the-job evaluation) and alternative technology-driven modalities (e.g., virtual reality). One key form of data not analyzed in this study, but commonly utilized in academic departmental decision making and in organizational 360° or end-of-year surveys, is free-response comments. Past research has shown that online courses often receive more emotional and less constructive feedback than do traditional courses (Rhea, Rovai, Ponton, Derrick, & Davis, 2007). It is possible that the heightened anonymity provided by the Internet reduces the amount of usable commentary gleaned from course evaluations. Further investigation into mode of evaluation and comment quality will be necessary to ensure that evaluative data loss is not occurring as a result of the modality shift. Training practitioners must be cautious when modifying program evaluations, as any data lost or corrupted by evaluation modalities could result in the development of faulty training programs.

References

American Psychological Association (2002). Ethical principles of psychologists and code of conduct. American Psychologist, 57, 1060–1073.
Barnett, J. (2007). How training affects soldier attitudes and behaviors toward digital command and control systems. Military Psychology, 19(1), 45–59.
Bennett, W., Alliger, G., Eddy, E., & Tannenbaum, S. (2003). Expanding the training evaluation criterion space: Cross aircraft convergence and lessons learned from evaluation of the Air Force Mission Ready Technician program. Military Psychology, 15(1), 59–76.
Blumenfeld, W., & Crane, D. (1973). Opinions of training effectiveness: How good? Training and Development Journal, 27(12), 42–51.
Boyce, L., LaVoie, N., Streeter, L., Lochbaum, K., & Psotka, J. (2008). Technology as a tool for leadership development: Effectiveness of automated web-based systems in facilitating tacit knowledge acquisition. Military Psychology, 20(4), 271–288.
Buchanan, T., Johnson, J. A., & Goldberg, L. R. (2005). Implementing a five-factor personality inventory for use on the Internet. European Journal of Psychological Assessment, 21, 115–127.
Buchanan, T., & Smith, J. L. (1999). Using the Internet for psychological research: Personality testing on the World Wide Web. British Journal of Psychology, 90, 125–144.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245–276.
Cole, M., Bedeian, A., & Feild, H. (2006). The measurement equivalence of web-based and paper-and-pencil measures of transformational leadership: A multinational test. Organizational Research Methods, 9(3), 339–368.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.
Mahwah, NJ: Lawrence Erlbaum Associates.
Epstein, J., Klinkenberg, W. D., Wiley, D., & McKinley, L. (2001). Insuring sample equivalence across Internet and paper-and-pencil assessments. Computers in Human Behavior, 17, 339–346.
Fouladi, R. T., McCarthy, C. J., & Moller, N. P. (2002). Paper-and-pencil or online? Evaluating mode effects on measures of emotional functioning and attachment. Assessment, 9, 204–215.
Goldstein, I. L. (1986). Training in organizations (2nd ed.). Pacific Grove, CA: Brooks/Cole.
Goldstein, I., & Ford, J. (2002). Training in organizations (4th ed.). Canada: Wadsworth.
Gray, G., Hall, M., Miller, M., & Shasky, C. (1997). Training practices in state government agencies. Public Personnel Management, 26(2), 187–202.
Industry Report (2000). Training, 37(10), 45–48.
Isakowitz, T., Bieber, M., & Vitali, F. (1998). Web information systems. Communications of the ACM, 41(7), 78–80.
King, W. C., & Miles, E. W. (1995). A quasi-experimental assessment of the effect of computerizing noncognitive paper-and-pencil measurements: A test of measurement equivalence. Journal of Applied Psychology, 80, 643–651.
Knapp, H., & Kirk, S. A. (2003). Using pencil and paper, Internet, and touch-tone phones for self-administered surveys: Does methodology matter? Computers in Human Behavior, 19, 117–134.
Kraut, R., Olson, J., Banaji, M., Bruckman, A., Cohen, J., & Couper, M. (2004). Psychological research online: Report of board of scientific affairs' advisory group on the conduct of research on the Internet. American Psychologist, 59, 105–117.
Lauver, K. (2007). Human resource safety practices and employee injuries. Journal of Managerial Issues, 19(3), 397–413.
Millsap, R., & Kwok, O. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9, 93–115.
Patterson, M. G., West, M. A., & Shackleton, V. J. (2005). Validating the organizational climate measure: Links to managerial practices, productivity, and innovation. Journal of Organizational Behavior, 26(4), 379–408.
Potosky, D., & Bobko, P. (1997). Computer versus paper-and-pencil administration mode and response distortion in non-cognitive selection tests. Journal of Applied Psychology, 82, 293–299.
Rhea, N., Rovai, A., Ponton, M., Derrick, G., & Davis, J. (2007). The effect of computer-mediated communication on anonymous end-of-course teaching evaluations. International Journal on E-Learning, 6(4), 581–592.
Rothwell, W., & Kazanas, H. (1994). Management development: The state of the art as perceived by HRD professionals. Performance Improvement Quarterly, 7(1), 40–59.
Saari, L., Johnson, T., McLaughlin, S., & Zimmerle, D. (1988). A survey of management training and education practices in US companies. Personnel Psychology, 41, 731–743.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17, 100–114.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of item response theory (pp. 85–100). New York: Springer-Verlag.
Smith, B. P. (2007). Student ratings of teacher effectiveness: An analysis of end-of-course faculty evaluations. College Student Journal, 41, 788–800.
Stanton, J. M. (1998). An empirical assessment of data collection using the Internet. Personnel Psychology, 51, 709–725.
Steenkamp, J. B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Needham Heights, MA: Allyn and Bacon.
Thissen, D. (2001). IRTLRDIF v2.0b: Software for the computation of statistics involved in item response theory likelihood-ratio tests for differential item functioning.
Thompson, L. F., Surface, E. A., Martin, D. L., & Sanders, M. G. (2003). From paper to pixels: Moving personnel surveys to the Web. Personnel Psychology, 56, 197–227.
Vandenberg, R. (2002). Toward a further understanding of and improvement in measurement invariance methods and procedures. Organizational Research Methods, 5, 139–158.
Vandenberg, R., & Lance, C. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70.
Wreathall, J., & Connelly, E. (1992). Using performance indicators to evaluate training effectiveness: Lessons learned. Performance Improvement Quarterly, 5(3), 35–43.