Computers in Human Behavior 24 (2008) 1216–1228
www.elsevier.com/locate/comphumbeh
Internet testing: Equivalence between proctored lab and unproctored field conditions

Klaus J. Templer a,*, Stefan R. Lange b,1

a Nanyang Technological University, Nanyang Business School, S3-B1-B65, Singapore 639798, Singapore
b Siemens AG, Learning Campus, Leadership and Development, 81617 Munich, Germany

Available online 20 June 2007
Abstract

Companies that use web-based testing do not need to invite applicants to their offices for screening purposes, and applicants are not required to travel. Given the world-wide accessibility of the Internet and the savings in travel costs, web-based testing expands the applicant pool to geographically distant regions. This advantage comes with the tangible drawback of less control over the testing situation and therefore a possible influence on the quality of scores obtained via Internet testing. This study examined the equivalence of proctored and unproctored web-based psychological testing. Results from 163 potential applicants, who participated in a combined laboratory–field and between-subject/within-subject design with two experimental conditions and two control conditions, provided no evidence that testing conditions affected test results.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Employee selection; Equivalence; Internet testing; Online assessment; Psychometrics; Screening
* Corresponding author. Tel.: +65 6790 4754; fax: +65 6792 4217. E-mail addresses: [email protected] (K.J. Templer), [email protected] (S.R. Lange).
1 Tel.: +49 636 84424; fax: +49 636 86509.

doi:10.1016/j.chb.2007.04.006

1. Introduction

Studies have established the conditions under which paper–pencil tests and computerized tests show equivalent results (Mead & Drasgow, 1993; Richman, Kiesler, Weisband, & Drasgow, 1999). With the technological advances of the World Wide Web and the
Internet, researchers have turned their interest to the question whether web-based test administration (Internet testing) also provides results equivalent to those of paper–pencil tests. Recent studies confirmed an earlier review by Lievens and Harris (2003), who saw initial evidence for measurement equivalence between web-based and traditional testing (Bartram & Brown, 2004; Bressani & Downs, 2002; Carlbring et al., 2007; Coyne, Warszta, Beadle, & Sheehan, 2005; Epstein, Klinkenberg, Wiley, & McKinley, 2001; Herrero & Meneses, 2006; Huang, 2006; Knapp & Kirk, 2003; Luce et al., 2007; Oswald, Carr, & Schmidt, 2001; Ployhart, Weekley, Holtz, & Kemp, 2003; Salgado & Moscoso, 2003; Whitaker, 2007).

Companies that use web-based testing do not need to invite applicants to their offices or to employ a test administrator when screening applicants, and applicants are not required to travel. Given the world-wide accessibility of the Internet and the savings in travel costs, Internet testing expands the applicant pool to geographically distant regions. This means that the real advantage of web-based tests is only fully realized when applicants take them under unproctored field conditions. However, Buchanan and Smith (1999) pointed to a number of potential challenges to the validity of web-based testing. These revolve around the lack of control in the testing situation and the possibility of extraneous factors (e.g. distraction, environmental cues, technical variability between different hardware and software configurations) or temporary factors (e.g. fatigue, intoxication) that may influence responses and of which the recruiter or test administrator might not even be aware.

A limitation of most studies so far is that researchers administered the web-based test versions under proctored laboratory conditions. Only a few studies examined the instruments under real-life conditions in the field. Miller et al. (2002) applied commonly used measures of alcohol use and compared a paper–pencil version with two web-based testing conditions. Data collection was conducted over two separate 48-h periods one week apart. The two web-based conditions were identical except that the second group was asked to take a break from the survey by quitting the browser at any point and reconnecting to the web site at a later time. The interruption served as a proxy for real-world interruptions (e.g., participant fatigue, lack of time to initiate or complete the survey) that may be common with self-paced home-based testing. Results showed no significant differences between the test conditions.

Oswald et al. (2001) compared supervised and unsupervised Internet testing. Multiple-group confirmatory factor analyses showed a better model fit for the personality measures in the supervised groups than in the unsupervised groups; the model fit was equal for the cognitive ability tests. Chuah, Drasgow, and Roberts (2006) found no indications of inequivalence when they compared paper-and-pencil, proctored computer lab, and unproctored Internet testing conditions for scales consisting of adjective markers of the Big-Five personality traits. Do, Shepherd, and Drasgow (2005) found negligible differences and measurement equivalence for proctored and unproctored versions of personality scales, a cognitive ability test, and a biodata instrument.
Coyne et al. (2005) found no significant differences in coefficients of equivalence and stability (CES) between groups that completed either the online version of a test battery under unsupervised conditions or the PC version under supervised and standardized conditions. The test battery included ability, interest, and personality tests.

However, a limitation of these studies was that equivalence had not been tested directly: groups of different participants had been compared, instead of testing the same
participants on two occasions under proctored and unproctored conditions. The study of Beaty, Fallon, and Shepherd (2002) is an exception, as they used a within-subject design. In their conference paper they reported negligible differences in test scores for participants who first took a cognitive ability test under unproctored conditions and later under proctored conditions.

To our knowledge, no published studies have examined the equivalence of unproctored and proctored Internet testing of non-cognitive, personality-related dimensions and cognitive abilities with a within-group design or a combined within-group/between-group design. The objective of this paper was therefore to examine the equivalence of web-based psychological measures administered under proctored lab conditions and under unproctored field conditions – both between groups and within groups over time. Specifically, we tested whether results from proctored web-based tests taken in the lab would be equivalent to results from unproctored web-based tests taken under field conditions, as reflected in similar means and variances between test conditions and over time, and in similar coefficients of equivalence and stability (CES).

2. Method

2.1. Participants

Participants were 163 potential applicants (45% male, 55% female) with an average age of 21.9 years (SD = 1.44 years) and an age range from 19 to 25 years. The sample consisted of undergraduate students from a large state university in Singapore. The study was conducted in cooperation with a large multinational company based in Germany. Besides the objective of this study – to examine the equivalence of web-based test results obtained under proctored lab conditions vs. unproctored field conditions – the multinational company wanted to collect data from potential applicants in order to establish Singapore-specific norms for its international management trainee program. The participants of this study matched the intended target group of potential applicants.

2.2. Design

We used a counterbalanced 2 × 2 between-subject/within-subject design. The participants completed the same test battery at two points in time, approximately two weeks apart. Participants were randomly allocated to one of four groups (a minimal allocation sketch follows this list):

Control group 1 (lab–lab): Participants completed both testing rounds under proctored conditions in a university computer lab.
Experimental group 1 (lab–field): Participants completed the first session under proctored conditions in the computer lab and the second session under unproctored field conditions.
Experimental group 2 (field–lab): Participants completed the first round of tests under unproctored field conditions and the second round under proctored conditions in the computer lab.
Control group 2 (field–field): Participants completed both rounds under unproctored field conditions.
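The allocation scheme can be illustrated with a short sketch. The condition labels and the roughly equal group sizes follow the description above; the function, parameter names, and participant identifiers are illustrative assumptions, not the authors' procedure.

```python
import random

# Four conditions of the counterbalanced 2 x 2 design:
# session-1 setting x session-2 setting (lab = proctored, field = unproctored).
CONDITIONS = ("lab-lab", "lab-field", "field-lab", "field-field")

def allocate(participants, seed=0):
    """Randomly assign participants to the four conditions in roughly equal numbers."""
    rng = random.Random(seed)
    # Cycle through the condition labels so group sizes stay balanced, then shuffle.
    labels = [CONDITIONS[i % len(CONDITIONS)] for i in range(len(participants))]
    rng.shuffle(labels)
    return dict(zip(participants, labels))

# Example (hypothetical identifiers): allocate([f"P{i:03d}" for i in range(171)])
```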
2.3. Tests

Six subtests from the web-based test platform PERLS were used for this study (for a description of the PERLS online test system see Hiltmann, Kirbach, Mette, & Montel, 2005; Kirbach & Montel, 2002; Kirbach, Montel, Oenning, & Wottawa, 2004). The decision for these six subtests was based on the company's competency model for its international management trainee program. The six tests measured five non-cognitive, personality-related dimensions and one cognitive ability, namely (1) Orientation towards cooperation, (2) Readiness to learn, (3) Commitment, (4) Readiness to solve problems, (5) Adopting the views of others, and (6) Creativity based on rules. The Cronbach alpha reliabilities for these scales were .89, .67, .78, .75, .87, and .84, respectively; they were comparable with the coefficients reported in the PERLS user manual (Hiltmann et al., 2005) and with coefficients that the company had obtained with German and North American samples.

Orientation towards cooperation indicates the extent to which participants prefer cooperation over competition at work. The test contains descriptions of four teamwork scenarios with differing amounts of cooperation between team members. Participants judge each scenario on 10 pairs of adjectives, e.g. "I see this situation as being rather: constructive/destructive". Additionally, participants state whether they would like to work in a team situation as described in the scenario. Low scores indicate a preference for competitive situations, whereas high scores indicate a preference for work situations based on cooperation.

Readiness to learn measures the motivation to deliberately seek new situations and requirements. The test consists of 12 items that describe behaviors associated with readiness to learn. Half of the items are negatively keyed. Participants respond to each item by moving a pointer and clicking the mouse on a scale defined by the end points "0 – Does not apply at all" and "100 – Applies completely". Participants with high scores view their adaptation to new circumstances and the assimilation of new content and material in a highly positive light.

The 16-item Commitment scale (five items negatively keyed) measures the participant's readiness to deploy resources with the ultimate aim of achieving one's career objectives. Participants respond to each item by clicking the mouse on a scale from "0 – Does not apply at all" to "100 – Applies completely". High test scores indicate that participants are particularly ambitious when it comes to accepting professional challenges, are willing to work under stress, and are prepared to utilize a sizeable amount of resources.

The 14-item scale Readiness to solve problems (three items negatively keyed) measures the willingness to solve customer problems. Participants respond to each item by clicking the mouse on a scale from "0 – Does not apply at all" to "100 – Applies completely". A high score indicates that participants are not only in a position to understand and tackle customer requirements precisely; they also take pleasure in doing their very best to meet these requirements and in knowing that the customer is satisfied with the end result.

The dimension Adopting the views of others measures the ability to evaluate the importance of information from the perspectives of different stakeholders. Participants are exposed to a text with information on a particular country.
They then rate the importance of 34 statements on the text from two differing perspectives, one from the perspective of a
tourist, the other from the perspective of a potential investor and businessperson (e.g. "From my point of view as a tourist this information is . . ."). Participants respond to each statement by clicking the mouse on a scale from "0 – completely irrelevant" to "100 – essentially important". Participants with high scores display the ability to correctly evaluate information according to the interests of different stakeholders.

Lastly, the speeded cognitive ability test Creativity based on rules measures creativity in the context of specific rules and under specific boundary conditions. In six trials of 2 min duration each, participants use mouse clicks to generate as many patterns as possible that fulfill particular geometrical requirements.

The items of the Commitment, Readiness to learn, and Readiness to solve problems scales were interspersed across the test rather than grouped by scale. This procedure, in contrast to a grouped item placement format, can reduce faking (McFarland, Ryan, & Ellis, 2002).

2.4. Procedure

Participants were recruited through email and through announcements at the beginning of lectures, with the incentive of a lucky draw for a mobile phone. The volunteers were randomly assigned to one of the four test conditions. However, those allocated to a lab condition who indicated their non-availability for the assigned lab time slot were re-assigned to the field–field condition. Participants had no prior information that the second test session would comprise the same set of tests as the first session.

For the test sessions, participants accessed a secure web site hosted in Germany by the multinational company. The web-based test administration started with an overall introduction, guidelines for answering the questions, and familiarization with the keys to be used. The participants provided their name, year of birth, gender, and an e-mail address. Upon completing the online registration, participants received a password, which was required for subsequent login. During the test, each item was presented separately on the screen. After answering each item, participants confirmed the answer by pressing the "Next" button. Participants took between 70 and 90 min to complete the test battery. The system allowed participants to take a break upon completion of any segment and to resume the test battery at a later time. The system automatically saved the data at the end of each completed segment; thus, participants were directed to the point where they had previously left off. On submission, the data were automatically saved and were no longer accessible to participants.

Participants allocated to the field-based conditions were sent e-mail reminders before the test sessions, along with instructions on how to access and complete the web-based tests. They were asked to complete the test battery within one week, at any time and location convenient to them.

Participants allocated to the lab-based conditions were asked to complete the tests in a computer lab on a designated day and time under proctored conditions. Attendance was taken and participants received instruction sheets before logging in to the website. Participants were instructed to remain silent at all times. They were allocated to alternate computers in the lab to prevent any form of communication and were instructed to raise their hands in the event of any problems. The test administrator would then attend to their queries individually with minimal interruption to the rest of the participants.
Participants were instructed to leave the lab quietly after completing the full test battery and were discouraged from taking a break in the middle of the test.
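The segment-wise autosave and resume behaviour of the test platform, as described above, can be summarized in a minimal sketch. This is not the PERLS implementation; the class, method, and attribute names are hypothetical and only mirror the flow reported in the text.

```python
from dataclasses import dataclass, field

@dataclass
class TestSession:
    """Illustrative model of the flow described above: answers are saved at the end
    of each completed segment, a participant may break off and later resume at the
    next unfinished segment, and the data are locked after the final submission."""
    segments: list                                  # ordered subtest names
    completed: list = field(default_factory=list)   # segments already saved
    locked: bool = False

    def next_segment(self):
        """Return the first segment not yet completed, or None when finished."""
        for s in self.segments:
            if s not in self.completed:
                return s
        return None

    def save_segment(self, segment):
        """Persist a finished segment; lock the session once all segments are done."""
        if self.locked:
            raise RuntimeError("Data already submitted and no longer accessible.")
        self.completed.append(segment)
        if self.next_segment() is None:
            self.locked = True
```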
2.5. Statistical analyses

Inspection of the scatterplots for the correlations between time 1 and time 2 and calculation of difference scores revealed eight outliers with difference scores of more than 2.5 standard deviations between time 1 and time 2 in at least one test. As these participants had also experienced Internet connection problems, we deleted their data from the original data set of 171 participants, which resulted in the final sample of 163 participants described above. The outliers were relatively equally distributed over the test conditions, with one outlier in the lab–lab condition, two in the field–field condition, two in the field–lab condition, and three in the lab–field condition.

The tests of equivalence were performed in four ways.

Between-subject analyses (I): With separate Levene's tests and ANOVAs, we first tested whether there were variance and mean differences in test scores between the four groups at time 1. An indicator of inequivalence would be if both experimental groups had mean scores or variances that differed from those of the control groups. We then repeated these analyses for time 2; this served as a cross-validation of the results obtained at time 1.

Between-subject analyses (II): Second, for an overall comparison between proctored and unproctored test conditions, we combined the two lab groups and used independent-samples t-tests to compare their results with those of the combined field groups. Again, the tests with the data obtained at time 2 served to cross-validate the findings from time 1.

Within-subject analyses: Third, we applied paired-samples t-tests to compare test results from time 1 with test results from time 2. Differences from time 1 to time 2 for the experimental groups would indicate inequivalence. However, should similar differences appear for the control groups, this would indicate a time effect rather than an effect attributable to the difference between proctored and unproctored conditions.

Coefficients of equivalence and stability (CES): Fourth, we calculated coefficients of equivalence and stability (CES), i.e. the cross-mode correlations for the two experimental groups. According to Schmidt and Hunter (1999), the CES delivers the most accurate reliability estimate, as it controls for all three sources of error variance: scale-specific measurement error, transient error, and random response error. In case of inequivalence between proctored and unproctored test conditions, these CES should be consistently lower than the test–retest reliability coefficients (TRR) in the control groups.
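For illustration, the four sets of analyses map onto standard SciPy routines. The sketch below assumes a long-format data frame holding one subtest's scores, with random placeholder values and assumed column names; it is not the authors' analysis code.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative data layout: one row per participant with the condition label and
# the time-1/time-2 T-scores of one subtest (values here are random placeholders).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "condition": np.repeat(["lab-lab", "lab-field", "field-lab", "field-field"], 40),
    "t1": rng.normal(50, 10, 160),
    "t2": rng.normal(50, 10, 160),
})

# (I) Between-subject: Levene's test (variances) and one-way ANOVA (means)
#     across the four groups at time 1; repeated for time 2 as cross-validation.
groups_t1 = [g["t1"].to_numpy() for _, g in df.groupby("condition")]
levene_F, levene_p = stats.levene(*groups_t1)
anova_F, anova_p = stats.f_oneway(*groups_t1)

# (II) Combined proctored (lab) vs. unproctored (field) groups at time 1.
lab_t1 = df.loc[df["condition"].isin(["lab-lab", "lab-field"]), "t1"]
field_t1 = df.loc[df["condition"].isin(["field-lab", "field-field"]), "t1"]
t_between, p_between = stats.ttest_ind(lab_t1, field_t1)

# (III) Within-subject: paired t-test of time 1 vs. time 2 within one group.
lab_field = df[df["condition"] == "lab-field"]
t_paired, p_paired = stats.ttest_rel(lab_field["t1"], lab_field["t2"])

# (IV) CES: cross-mode correlation (time 1 vs. time 2) in an experimental group;
#      the analogous intra-mode correlation in a control group is the TRR.
ces, _ = stats.pearsonr(lab_field["t1"], lab_field["t2"])

# Comparing an independent CES with a TRR (as in the Z values reported in the
# Results) can be done with the usual Fisher r-to-z transformation.
def compare_correlations(r1, n1, r2, n2):
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    z = (z1 - z2) / np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return z, 2 * stats.norm.sf(abs(z))
```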
3. Results

Table 1 shows the means and standard deviations for the six tests at time 1 and time 2 for the total sample, for the two control groups (lab–lab; field–field) and two experimental groups (lab–field; field–lab), and for the combined proctored lab and combined unproctored field groups. For comparison purposes, we report standard T-scores, based on the norms obtained for the total sample at time 1.
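A T-score expresses a raw score relative to a norm sample with mean 50 and standard deviation 10. A minimal sketch of the standardization on the time-1 total sample, with assumed function and variable names:

```python
import numpy as np

def to_t_scores(raw, norm_sample_time1):
    """Convert raw scores to T-scores (M = 50, SD = 10), using the mean and SD of
    the total sample at time 1 as the norm; by construction the time-1 totals in
    Table 1 are then 50.00 and 10.00."""
    raw = np.asarray(raw, dtype=float)
    norm = np.asarray(norm_sample_time1, dtype=float)
    return 50.0 + 10.0 * (raw - norm.mean()) / norm.std(ddof=1)
```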
Table 1
Means and standard deviations of T-scores for the subtests by experimental condition at time 1 and time 2, standardized on time 1

Subtest                          Time   Total (N = 163)   Lab–lab (N = 40)   Lab–field (N = 35)   Field–lab (N = 38)   Field–field (N = 50)   Lab combined (N = 75)   Field combined (N = 88)
                                        M      SD         M      SD          M      SD            M      SD            M      SD              M      SD               M      SD
Orientation towards cooperation   1     50.00  10.00      52.55   9.45       48.11   7.95         48.28   9.85         50.59  11.49           50.48   9.00            49.59  10.81
                                  2     49.73  10.43      53.76   8.81       46.92   9.36         47.45  10.21         50.20  11.61           50.68   9.98            48.85  10.81
Readiness to learn                1     50.00  10.00      48.54   9.58       51.04   9.31         50.39   9.22         50.14  11.43           49.71   9.47            50.25  10.48
                                  2     50.00   9.23      48.96   9.00       50.81   9.47         49.52   7.35         50.64  10.68           49.23   8.19            50.71  10.14
Commitment                        1     50.00  10.00      48.45   9.66       50.52  10.44         50.07  10.88         50.83   9.41           49.42  10.01            50.50  10.02
                                  2     49.74   9.08      48.61   8.19       50.11   9.60         50.03   9.89         50.15   8.95           49.30   9.02            50.13   9.17
Readiness to solve problems       1     50.00  10.00      50.73   9.62       48.74  13.13         48.98   6.70         51.07  10.00           49.80  11.35            50.17   8.75
                                  2     47.73  10.54      48.97  10.43       46.30  12.56         48.89   8.63         46.85  10.51           48.93   9.53            46.62  11.33
Adopting the views of others      1     50.00  10.00      52.56   6.79       52.04   7.70         46.80  12.61         48.96  10.74           52.32   7.18            48.03  11.57
                                  2     52.49   9.07      53.12   8.59       53.95   6.88         50.05  12.58         52.83   7.39           51.63  10.76            53.29   7.16
Creativity based on rules         1     50.00  10.00      48.04   8.76       49.62   9.14         49.05  12.53         52.56   9.08           48.78   8.91            51.04  10.78
                                  2     55.46  11.94      56.46  12.99       56.75  11.70         51.82  12.81         56.53  10.24           54.20  13.03            56.62  10.80

N in parentheses.
3.1. Between-subject analyses (I)

For the group comparisons at time 1, Levene's tests of homogeneity of variances showed inequality of variances for the subtests Readiness to solve problems (F = 5.87; p < .01) and Adopting the views of others (F = 3.03; p < .05), with the latter not significant after Bonferroni adjustment for six comparisons. There was no clear pattern, i.e. the experimental groups had neither consistently higher nor lower variances than the control groups. Levene's tests for time 2 did not show any significant variance differences for the six tests.

ANOVA results (Table 2) did not reveal significant mean differences except for Adopting the views of others, but only at time 1 (F = 2.94; p < .05; ηp² = .052), and for Orientation towards cooperation, but only at time 2 (F = 3.65; p < .05; ηp² = .064). The effect sizes (partial eta squared) for these differences were low. After Bonferroni adjustment for six comparisons, there were no significant differences. Taken together, the results of the Levene's tests and the ANOVAs did not indicate inequivalence of test scores obtained under proctored and unproctored Internet testing conditions.

3.2. Between-subject analyses (II)

Levene's tests for equality of variances indicated that at time 1 the combined field-condition group showed higher variances in two of the six tests: Adopting the views of others (F = 7.32; p < .01) and Readiness to solve problems (F = 5.01; p < .05; non-significant after Bonferroni adjustment). The t-test for equality of means showed a higher score in Adopting the views of others for the field-condition group (t = 2.87; p < .01; ηp² = .046). As there were no significant differences in means or variances for the six tests between the field-condition and lab-condition groups at time 2, the results from time 1 were not cross-validated, i.e. there was no indication of inequivalence of test scores from proctored and unproctored Internet testing conditions.
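For reference, the Bonferroni adjustment mentioned above simply divides the nominal alpha by the number of subtests; a minimal sketch (the p-values are the ones reported for Table 2, everything else is illustrative):

```python
# Nominal alpha divided by the number of subtests gives the per-test threshold.
alpha, n_tests = 0.05, 6
adjusted_alpha = alpha / n_tests          # approx. .0083

# None of the unadjusted p-values reported above (e.g. .04 for Adopting the views
# of others at time 1, .01 for Orientation towards cooperation at time 2) fall
# below this threshold, hence no significant differences after adjustment.
survives = [p < adjusted_alpha for p in (0.04, 0.01)]   # [False, False]
```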
Table 2
ANOVAs for between-subject effects

Subtest                          Time   df       F      p     ηp²
Orientation towards cooperation   1     3, 159   1.74   .16   .032
                                  2     3, 159   3.65   .01   .064
Readiness to learn                1     3, 159   0.43   .73   .008
                                  2     3, 159   0.37   .78   .007
Commitment                        1     3, 159   0.46   .71   .009
                                  2     3, 159   0.27   .85   .005
Readiness to solve problems       1     3, 159   0.58   .63   .011
                                  2     3, 159   0.67   .57   .012
Adopting the views of others      1     3, 159   2.94   .04   .052
                                  2     3, 159   1.31   .27   .024
Creativity based on rules         1     3, 159   1.76   .16   .032
                                  2     3, 159   1.55   .20   .028
3.3. Within-subject analyses

Paired-samples t-tests for the experimental group that had its first test session in the lab under supervised conditions and its second session under unsupervised field conditions showed higher scores under field conditions at time 2 for Creativity based on rules (t = 5.68; p < .001; ηp² = .487) and a marginally significant increase for Adopting the views of others (t = 1.98; p = .055; ηp² = .104). The experimental group with the first session under field conditions and the second session under lab conditions also showed higher scores at time 2 for Creativity based on rules (t = 2.25; p < .05; ηp² = .120) and for Adopting the views of others (t = 2.48; p < .05; ηp² = .143). These results indicated that the gains in test scores were an effect of repeated test administration rather than of the test condition. Both control groups also showed similar gains in the cognitive ability test Creativity based on rules (lab–lab group: t = 6.47; p < .001; ηp² = .518; field–field group: t = 3.48; p < .01; ηp² = .198), which points to a steep learning curve for this test under repeated administration. There were no other significant mean differences for the lab–lab group between the first and second test administration. The field–field group had lower scores at time 2 for Readiness to solve problems (t = 4.56; p < .001) and higher scores for Adopting the views of others (t = 3.00; p < .01). In summary, the results of the paired-samples t-tests showed a systematic learning effect for repeated administration of the cognitive ability test under proctored as well as unproctored conditions. They did not show any systematic effects of the proctored versus unproctored conditions on test scores.

3.4. Coefficients of equivalence and stability (CES)

Table 3 shows the CES or cross-mode correlations for the lab–field and field–lab groups and the intra-mode test–retest reliabilities (TRR) for the lab–lab and field–field groups.

Table 3
Coefficients of equivalence and stability (CES) and test–retest reliabilities (TRR)

Subtest                          Overall     Lab–lab    Lab–field   Field–lab   Field–field
                                 (N = 163)   (N = 40)   (N = 35)    (N = 38)    (N = 50)
Orientation towards cooperation   .68         .52        .61         .52         .86
Readiness to learn                .79         .79        .74         .67         .88
Commitment                        .84         .78        .83         .84         .88
Readiness to solve problems       .75         .69        .80         .69         .80
Adopting the views of others      .67         .60        .70         .79         .55
Creativity based on rules         .74         .78        .77         .82         .66
Average CES/TRR                   .75         .71        .75         .74         .80

Inspection of the coefficients did not reveal a trend towards lower CES compared to the intra-mode retest reliabilities. The average TRR for the lab–lab condition was the lowest (.71) and for the field–field condition the highest (.80), with the average CES for the two
experimental conditions lab–field and field–lab in between (.75 and .74); the CES and TRR did not differ significantly from each other.

Comparing the CES/TRR for the subtests, we found no significant differences between the four conditions for Commitment, Readiness to solve problems, and Creativity based on rules. We found significant differences between the conditions for the other three subtests. However, these differences were unsystematic and did not point towards an effect of proctored versus unproctored testing conditions. Under the field–lab condition, but not under the lab–field condition, the cross-mode correlation was lower for Readiness to learn (Z = 2.57, p < .05) and higher for Adopting the views of others (Z = 2.10, p < .05) than the corresponding intra-mode retest reliabilities under the field–field condition; there were no significant differences between these coefficients and those of the lab–lab condition. For Orientation towards cooperation the coefficient was higher for the field–field condition than for the lab–field condition (Z = 3.23, p < .05) and the field–lab condition (Z = 2.57, p < .01), but it was also higher than that of the intra-mode lab–lab condition (Z = 3.31, p < .001); the coefficients for the lab–field and field–lab conditions did not differ significantly from the corresponding coefficients of the lab–lab condition. Taken together, the comparisons of the cross-mode coefficients of equivalence and stability with the intra-mode test–retest reliabilities did not provide any indication of test score inequivalence between supervised lab administration and unsupervised field administration.

4. Discussion

In the present study we applied a combined laboratory–field and between-subject/within-subject design to examine the equivalence of proctored and unproctored Internet test administration for five non-cognitive, work-specific, personality-related measures and one speeded cognitive ability test. The overall results from variance and mean comparisons and from comparisons of cross-mode with intra-mode correlations suggest invariance between proctored and unproctored Internet testing. The few differences that we found were unsystematic; they did not indicate any trend towards lower or higher variances, or towards lower or higher means, for unproctored compared to proctored Internet testing conditions. Nor did we observe any trend towards lower or higher cross-mode correlations compared to intra-mode correlations. This indicates that the tests were robust against variations in the testing situation; participants concentrated on the test content and were not affected by the administration mode when processing the items.

The strongest effect was for the speed test measuring a cognitive ability: we found considerable increases in test scores from the first to the second test administration. However, as we observed these increases not only under the experimental but also under the control conditions, the increases from time 1 to time 2 are clearly attributable to repeated testing, i.e. to a learning effect, and not to the testing condition.

The first and major strength of the present study was its experimental design: the combined laboratory–field and between-subject/within-subject design with two experimental conditions and two control conditions allowed for testing of differences between groups at two different points in time.
It additionally allowed for testing of differences over time for the same individuals, who took the test either under the same or under a different testing condition. A second strength of this study was the choice of the sample. Whereas earlier studies on test equivalence had either US-American or European participants, this
is, to our knowledge, the first study with an Asian sample. The results of our study therefore give an indication that results from earlier studies can likely be generalized beyond the West.

As for the limitations, most studies published in this field had undergraduate students as participants, and our study is no exception. However, our participants were potential applicants: they belonged to the target population of the test battery (note that the second aim of this study was to generate country-specific norms). Additionally, participation was not anonymous, as participants were required to disclose their names and provide their email addresses, and participants were aware of the name of the company, which could well have been their future employer. We therefore expected participants to work seriously on the test battery. A second limitation of our study was that we did not record whether participants under field conditions used the opportunity to take breaks, or whether they experienced any disturbances while working on the test battery.

Future studies could therefore include real job applicants. One could also record and test the effects of breaks and disturbances during the field test condition, and one could deliberately introduce breaks and disturbances during the laboratory test condition. An additional area of future research will be establishing the prognostic validity of unproctored tests. First results are already promising: unproctored biodata and situational judgment tests predicted the job performance of entry-level call center customer service personnel and collections representatives in the financial services and telecommunications industries (Tippins et al., 2006).

Our results are encouraging and have practical implications not only for researchers who plan to use web-based tests for research purposes but also, and especially, for recruitment and selection processes in organizations: if results from proctored and unproctored Internet testing are comparable, then recruiting organizations, by using unproctored tests, will be able to increase the number of applicants for screening with ease and without major costs. The trend towards Internet recruitment advertising has already increased the number of attracted candidates, but this expansion of applicant pools has also led to an increase in the number of under-qualified applicants, which has made it more difficult to select the right people (Chapman & Webster, 2003). Unproctored Internet testing will help sift out under-qualified applicants. A resulting larger pool of qualified applicants translates into a higher payoff from the recruitment and selection system (Boudreau & Rynes, 1985).

In conclusion, we recommend the development of web-based tests, which includes establishing their psychometric properties (Naglieri et al., 2004), and their use in unproctored settings for screening potential candidates as a first step in a multiple-hurdle selection process. Unproctored Internet testing in this case serves as a tool for screening out the least qualified applicants.
For selecting in from the remaining pool of more qualified applicants, candidates should be invited for proctored testing to reconfirm the results obtained from the unproctored tests (Tippins et al., 2006), followed by further assessments of those competencies and dimensions that have not yet been tested but are also needed to perform successfully on the job and in the organization.

References

Bartram, D., & Brown, A. (2004). Online testing: Mode of administration and the stability of OPQ 32i scores. International Journal of Selection and Assessment, 12(3), 278–284.
Beaty, J. C., Fallon, J. D., & Shepherd, W. (2002). Proctored versus unproctored web-based administration of a cognitive ability test. Presented at the 17th annual SIOP conference, Toronto.
Boudreau, J. W., & Rynes, S. L. (1985). Role of recruitment in staffing utility analysis. Journal of Applied Psychology, 70, 354–366.
Bressani, R. V., & Downs, A. C. (2002). Youth independent living assessment: Testing the equivalence of web and paper/pencil versions of the Ansell-Casey Life Skills Assessment. Computers in Human Behavior, 18, 453–464.
Buchanan, T., & Smith, J. L. (1999). Using the Internet for psychological research: Personality testing on the World-Wide Web. British Journal of Psychology, 90, 125–144.
Carlbring, P., Brunt, S., Bohman, S., Austin, D., Richards, J., Öst, L.-G., et al. (2007). Internet vs. paper and pencil administration of questionnaires commonly used in panic/agoraphobia research. Computers in Human Behavior, 23, 1421–1434.
Chapman, S. D., & Webster, J. (2003). The use of technologies in the recruiting, screening, and selection process for job candidates. International Journal of Selection and Assessment, 11(2/3), 113–120.
Chuah, S. C., Drasgow, F., & Roberts, B. W. (2006). Personality assessment: Does the medium matter? No. Journal of Research in Personality, 40, 359–376.
Coyne, I., Warszta, T., Beadle, S., & Sheehan, N. (2005). The impact of mode of administration on the equivalence of a test battery: A quasi-experimental design. International Journal of Selection and Assessment, 13(3), 220–224.
Do, B.-R., Shepherd, W. J., & Drasgow, F. (2005). Measurement equivalence across proctored and unproctored administration modes of web-based measures. Presented at the 20th annual SIOP conference, Los Angeles.
Epstein, J., Klinkenberg, W. D., Wiley, D., & McKinley, L. (2001). Insuring sample equivalence across Internet and paper-and-pencil assessments. Computers in Human Behavior, 17, 339–346.
Herrero, J., & Meneses, J. (2006). Short web-based versions of the perceived stress (PSS) and Center for Epidemiological Studies-Depression (CESD) scales: A comparison to pencil and paper responses among Internet users. Computers in Human Behavior, 22, 830–846.
Hiltmann, M., Kirbach, C., Mette, C., & Montel, C. (2005). PERLS-Testhandbuch (PERLS user manual). Bochum: Eligo.
Huang, H.-M. (2006). Do print and web surveys provide the same results? Computers in Human Behavior, 22, 334–350.
Kirbach, C., & Montel, C. (2002). PERLS – Ein neues System für das Internetrecruiting und -Assessment. Wirtschaftspsychologie, 1, 39–42.
Kirbach, C., Montel, C., Oenning, S., & Wottawa, H. (2004). Recruiting und Assessment im Internet. Göttingen: Vandenhoeck & Ruprecht.
Knapp, H., & Kirk, S. A. (2003). Using pencil and paper, Internet and touch-tone phones for self-administered surveys: Does methodology matter? Computers in Human Behavior, 19, 117–134.
Lievens, F., & Harris, M. M. (2003). Research on Internet recruitment and testing: Current status and future directions. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (Vol. 18, pp. 131–165). Chichester: Wiley.
Luce, K. H., Winzelberg, A. J., Das, S., Osborne, M. I., Bryson, S. W., & Taylor, C. B. (2007). Reliability of self-report: Paper versus online administration. Computers in Human Behavior, 23, 1384–1389.
McFarland, L. A., Ryan, A. M., & Ellis, A. (2002). Item placement on a personality measure: Effects on faking behavior and test measurement properties. Journal of Personality Assessment, 78, 348–369.
Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114(3), 449–458.
Miller, E. T., Neal, D. J., Roberts, L. J., Baer, J. S., Cressler, S. O., Metrik, J., et al. (2002). Test–retest reliability of alcohol measures: Is there a difference between Internet-based assessment and traditional method? Psychology of Addictive Behaviors, 16(1), 56–63.
Naglieri, J. A., Drasgow, F., Schmit, M., Handler, L., Prifitera, A., Margolis, A., et al. (2004). Psychological testing on the Internet: New problems, old issues. American Psychologist, 59(3), 150–162.
Oswald, F. L., Carr, J. Z., & Schmidt, A. M. (2001). The medium and the message. Presented at the 16th annual SIOP conference, San Diego.
Ployhart, R. E., Weekley, J. A., Holtz, B. C., & Kemp, C. (2003). Web-based and paper-and-pencil testing of applicants in a proctored setting: Are personality, biodata, and situational judgment tests comparable? Personnel Psychology, 56(3), 733–752.
Richman, W. L., Kiesler, S., Weisband, S., & Drasgow, F. (1999). A meta-analytic study of social desirability distortion in computer-administered questionnaires, traditional questionnaires, and interviews. Journal of Applied Psychology, 84(5), 754–775.
Salgado, J. F., & Moscoso, S. (2003). Internet-based personality testing: Equivalence of measures and assessees' perceptions and reactions. International Journal of Selection and Assessment, 11(2/3), 194–205.
Schmidt, F. L., & Hunter, J. E. (1999). Theory testing and measurement error. Intelligence, 27(3), 183–198.
Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segal, D. O., et al. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225.
Whitaker, B. G. (2007). Internet-based attitude assessment: Does gender affect measurement equivalence? Computers in Human Behavior, 23, 1183–1194.