
Computers in Human Behavior 23 (2007) 32–51

Computer-based tests: the impact of test design and problem of equivalency

Petr Květon, Martin Jelínek, Dalibor Vobořil, Helena Klimusová

Institute of Psychology, The Academy of Sciences of the Czech Republic, Veveří 97, 602 00 Brno, Czech Republic

Available online 17 April 2004

Abstract

Nowadays, computerized forms of psychodiagnostic methods are often produced without appropriate psychometric characteristics being provided, or without equivalency with the conventional forms being demonstrated. Moreover, some tests exist in more than one computerized version, and these versions are mostly designed differently. Study 1 focused on the impact of test design: even a simple change of color scheme (light stimuli on a dark background vs. dark stimuli on a light background) had a significant effect on subjects' performance. Study 2 examined the equivalency of a computerized speeded test that is widely used by psychological practitioners in the Czech Republic; this form was found non-equivalent to its conventional counterpart.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Computer-based assessment; Speeded test; Equivalency; Test design; Ergonomics

1. Introduction

Computer-based testing can seem to be purely beneficial: the computer can do all the routine work of test administration; it facilitates the standardization of testing procedures; it substantially saves time and decreases the economic costs of data input; it allows for accurate scoring; and it permits a variety of measures of cognitive and perceptual performance, such as reaction time, to be quickly assessed (Butcher, 1987; de Beer & Visser, 1998; Mead & Drasgow, 1993; Rosenfeld, Booth-Kewley, & Edwards, 1993).


Regardless of such advantages, it is necessary to note that serious limitations exist which may restrict the use of computer-based administration or even make it impossible (AERA, APA, & NCME, 2001). The validity of a computerized test version remains unknown until the equivalency of both methods is empirically verified and the computer version can be considered a parallel form of the original method (Ford, Vitelli, & Stuckless, 1996). Verification of the validity of computer tests becomes crucial especially when test scores serve as a basis for a clinical diagnosis or for the selection or placement of personnel (George, Lankford, & Wilson, 1992).

The Guidelines for computer-based tests and interpretations (APA, 1986) specify three kinds of evidence of the psychometric equivalence of conventional and computer tests: (1) descriptive statistics: means, variances, distributions, and rank orders of scores; (2) construct validity; (3) reliability. Bartram (1994) offers a more specific definition of equivalency between different test forms. According to him, two test versions are equivalent if they (1) have identical reliability; (2) correlate with each other at the level that can be expected on the basis of their reliabilities; (3) have comparable correlations with other variables; and (4) have identical means and standard deviations. In order to establish equivalency between test forms, Bartram (1994) recommends an experimental plan with repeated measurements of four groups (also suggested by Mareš, 1992): two of them are control groups (computer–computer and paper–paper) and the other two have crossed conditions (computer–paper and paper–computer). Bartram (1994) further notes that, especially when testing the equivalency of performance tests, it is advisable to use all four groups in order to account for test–retest effects, rather than only the two groups with crossed conditions, as can be found in other studies.
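Bartram's criteria can be screened mechanically from summary statistics. The sketch below is our own illustration, not Bartram's procedure: it uses a crude fixed tolerance in place of proper significance tests, and it takes the classical attenuation ceiling sqrt(rel_a · rel_b) as our reading of criterion (2). Criterion (3) requires data on external variables and is omitted.

```python
import math

def screen_equivalency(rel_a, rel_b, r_ab, mean_a, mean_b, sd_a, sd_b,
                       tol=0.05):
    """Rough screening of Bartram's (1994) equivalency criteria.

    A sketch only: real analyses should replace the fixed tolerance
    `tol` with significance tests. Criterion (3), comparable
    correlations with other variables, is not checked here.
    """
    # (2): the cross-form correlation can at best approach
    # sqrt(rel_a * rel_b), the classical attenuation ceiling.
    expected_r = math.sqrt(rel_a * rel_b)
    return {
        "similar reliabilities (1)": abs(rel_a - rel_b) <= tol,
        "cross-form r near ceiling (2)": r_ab >= expected_r - tol,
        "similar means (4)": abs(mean_a - mean_b) <= tol * (sd_a + sd_b) / 2,
        "similar SDs (4)": abs(sd_a - sd_b) <= tol * (sd_a + sd_b) / 2,
    }
```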
Whereas questionnaire methods have typically been found to be independent of the medium of administration (Merten & Ruch, 1996; Pinsoneault, 1996; Watson, Thomas, & Anderson, 1992), the medium employed becomes a fundamental issue in the case of cognitive ability tests (e.g., Květon & Klimusová, 2002). Among these, power and speeded tests can be distinguished (Cohen & Swerdlik, 1999). In power tests, time is not a very important factor, but some items are so difficult that no test-taker should be able to solve them all and obtain a perfect score (e.g., the Intelligence Structure Test – Amthauer, 1993). By contrast, a speeded test comprises very easy and homogeneous items that any examinee would be able to answer if given sufficient time (e.g., Bourdon's Test – Senka, Kuruc, & Čečer, 1992, or Kučera's Test of Concentration of Attention – Kučera, 1992, both of which were used in the present studies). In general, common methods are neither purely speeded nor purely power tests.

Results of computerized cognitive ability tests can be influenced mainly by formal factors (Bartram, 1994), which relate to the ergonomics of the test's visual and control interface. According to French and Beaumont (1990), the presentation of stimuli on a computer monitor can lower a tested person's performance. Problems can appear especially when transforming conventional paper-and-pencil tests with more complex visual stimuli into computerized forms (Aspillaga, 1996; Květon, Jelínek, Vobořil, & Klimusová, 2003). Due to differences in test design, even two different computerized versions of the same test (i.e., two computerized forms created by different companies) could prove non-equivalent. Moreover, a computerized form could produce different results due to differences in the technical characteristics of display devices (monitor size, screen refresh rate, etc.).


With regard to substantial differences in computer display quality, Krantz (2000) suggests that in psychological research with demanding visual stimuli, computer monitors should be calibrated beforehand, and that if the data are published, these calibration routines should be reported in the method section for later verification. In the case of speeded tests, the nature of the computer control interface can decrease response latency (Mead & Drasgow, 1993): without a doubt, the motor activity required to press a key is considerably less than that required to mark a bubble on an answer sheet.

To conclude, it seems that cognitive ability tests, and especially speeded tests, may be influenced by the mode of administration. In spite of that, common tests transferred into computerized forms are still widely used with norms and other psychometric characteristics obtained with the conventional form of the test. This concerns especially Eastern Europe, including the Czech Republic, which only in the 1990s began catching up with Western technological standards, and where psychological assessment software has sometimes been produced and employed without any awareness of the potential threats.

There were two main goals of this study: first, to contribute to a better understanding of the causes of non-equivalency by examining two computerized forms of Bourdon's Test; second, to examine the equivalency of the conventional and computerized forms of the Test of Concentration of Attention (TCA), a speeded test originally developed and widely used in the Czech Republic.

2. Study 1: Bourdon's test

2.1. Method

2.1.1. Instruments

Bourdon's Test is a typical test of concentration, based on the principle of shape discrimination; it is widely used in organizational psychology. For our purposes, we have chosen the BoPr ČSÚP version, published by Psychodiagnostické a didaktické testy (Senka et al., 1992). Its test sheet consists of 30 lines, each with 85 symbols, 2550 in total; all symbols are printed on a single page of A3 paper. The test symbols are small squares marked in one of the four corners or on one of the four inner sides by small black-filled arcs (four + four positions). The examinee's task is to search for and cross out three of the eight variants of the symbol; the remaining five variants are to be underlined. The three symbols that are to be crossed out are presented in the upper left corner of the answer sheet. The test-taker compares each symbol with the three above, line by line; the time limit for each line of symbols is 50 s. After the time expires, the last processed symbol is marked and the tested person continues on the next line.

For the purposes of this study, a new computerized version of Bourdon's Test was developed. The visual composition of this version is nearly identical to the well-known commercial one created by Psychosoft. In addition, this new version makes it possible to change the design of the test environment, which was fundamental for our purposes. In this study, we focused on the color scheme of the test design and used two color variations: the first comprised dark symbols displayed on a light background (see Fig. 6); the second employed the inverse color scheme, light symbols on a dark background (see Fig. 7).


The color contrast was adjusted to be pleasant and not disturbing (dark – RGB #808080; light – RGB #FFFFFF). The program was designed for use on different computer configurations; the stimulus graphics were defined vectorially (as opposed to the commonly used fixed-size bitmap format, which does not allow proportional scaling), so the same design quality is preserved across graphical resolutions.

The computer screen cannot display all 2550 symbols at once the way the paper sheet does, where all 30 lines of symbols are printed on a single page; the reason is the low resolution of the computer screen. That is why the version used in the present study (just like Psychosoft's version) displays only 85 symbols at a time, which corresponds to one line on paper. Only after the time limit expires is a new set of symbols displayed. The three target symbols are constantly shown in the top left corner of the screen. At the bottom of the screen, a progress bar indicates the current position in the test, both as numerical information (current set/total sets) and as a graphical line whose length increases with every completed set. The test is controlled by the keyboard: the down-arrow key underlines and the up-arrow key crosses out the current symbol, which is marked by a cursor of a different color. After a symbol is marked, the cursor automatically moves to the next one. It is also possible to go back to already marked symbols using the left- and right-arrow keys, but only within the displayed set.

2.1.2. Sample and procedure

The study was conducted in autumn 2001. The sample consisted of 92 psychology students (66 females and 26 males) aged 18 to 27 years (M = 21.43, SD = 2.41). Subjects were randomly divided into three separate groups, which were tested independently under different conditions of administration. The first group was administered the regular-colored computerized version of Bourdon's Test, presented as dark symbols on a light background. The second group was given the inverse-colored computerized version (light symbols on a dark background). Both groups used computers of the same configuration: 17-in. monitor, 1024 × 768 graphical resolution, and 85 Hz refresh rate. The last group was administered the conventional paper-and-pencil form of the test.

Performance was analyzed within individual groups using the following scores of attention measured by all three forms of Bourdon's Test: (a) the total number of processed symbols; (b) the number of correctly processed symbols; (c) the number of incorrectly processed symbols; (d) the relative index of incorrectness (the sum of errors divided by the sum of processed symbols, multiplied by 100); (e) the relative range of performance (the number of processed symbols in one's best set minus the number in one's worst set, divided by the mean number of processed symbols per set, and multiplied by 100); and (f) the relative range of errors (computed like the previous index, only with the numbers of errors per set).
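The scoring indices translate directly into code. The following is a minimal sketch of our own (variable names are ours; per-set counts for each examinee are assumed to be available, and (b) = (a) − (c) is our assumption, though it matches the group means reported for the computerized forms):

```python
def bourdon_scores(processed, errors):
    """Scores (a)-(f) for one examinee.

    `processed` and `errors` hold per-set counts, one value for each
    of the 30 fifty-second sets. (b) = (a) - (c) is assumed.
    """
    total = sum(processed)                           # (a) processed symbols
    total_errors = sum(errors)                       # (c) incorrect symbols
    correct = total - total_errors                   # (b) correct symbols
    incorrectness = 100.0 * total_errors / total     # (d) relative index
    mean_per_set = total / len(processed)
    perf_range = 100.0 * (max(processed) - min(processed)) / mean_per_set  # (e)
    mean_err = total_errors / len(errors)
    err_range = (100.0 * (max(errors) - min(errors)) / mean_err
                 if mean_err else 0.0)               # (f) relative error range
    return {"a": total, "b": correct, "c": total_errors,
            "d": incorrectness, "e": perf_range, "f": err_range}
```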
2.1.3. Methods of data analysis

One-way Analysis of Variance, supplemented by Tukey's Post Hoc Tests, was used to compare the results obtained under the different conditions of administration. The internal consistency of the results of the three groups was also compared, using Cronbach's Alpha coefficient.
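As an illustration of this pipeline, here is a sketch with synthetic data drawn from the reported group means and SDs for score (a); the group sizes 31/31/30 are our guess, since the exact split of the 92 students is not stated:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Synthetic stand-in data; the real analysis used individual protocols.
regular = rng.normal(1815.87, 310.25, 31)
inverse = rng.normal(1594.07, 338.04, 31)
paper = rng.normal(1634.79, 294.22, 30)

f_stat, p_value = f_oneway(regular, inverse, paper)   # one-way ANOVA
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")

scores = np.concatenate([regular, inverse, paper])
groups = (["computer regular"] * 31 + ["computer inverse"] * 31
          + ["paper-and-pencil"] * 30)
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))  # post hoc pairs
```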


2.2. Results

2.2.1. Comparison of scores across different conditions of administration

The purpose of Study 1 was to find out whether and how visual design can affect test results. The six scores were compared among groups using One-Way Analysis of Variance, and differences between individual groups were identified with Tukey's Post Hoc Tests. As can be seen in Table 1, statistically significant differences between groups with different forms of administration were found in all of the six scores of Bourdon's Test but one, the relative range of errors. Significant differences were found between the regular and the inverse computer forms in the total number of processed symbols (score a) and the total number of correctly processed symbols (score b). The conventional-version group differed from both computerized groups in the total number of incorrectly processed symbols (score c), while the latter did not differ from each other. There was a statistically significant difference in the relative number of errors (score d) between the inverse computerized and the conventional test versions; the relative range of performance (score e) differed between the inverse computerized version and the other two. These results are summarized in Table 2.

The results clearly demonstrate a significant difference between the forms of administration, caused not only by the medium employed (computer vs. paper-and-pencil) but also by the visual aspects of the stimuli presented on the computer screen. Respondents made approximately the same number of errors in the regular and the inverse computerized versions, but this number was higher than in the conventional form. The results also indicate a faster tempo in the regular computer form (that is, the form whose color scheme resembles the paper-and-pencil one) than in the inverse computer form. Moreover, the examinees administered the inverse computer form showed a higher relative range of performance and also reached the highest relative index of incorrectness.

Table 1. Bourdon's Test: comparison of groups with different test forms

  Score    Test form           Mean      SD       F
  Score a  Computer regular    1815.87   310.25   4.243*
           Computer inverse    1594.07   338.04
           Paper-and-pencil    1634.79   294.22
  Score b  Computer regular    1794.50   308.37   4.219*
           Computer inverse    1571.21   336.04
           Paper-and-pencil    1619.06   295.67
  Score c  Computer regular    21.37     12.54    5.045**
           Computer inverse    22.86     11.19
           Paper-and-pencil    14.09     11.11
  Score d  Computer regular    1.20      0.75     4.358*
           Computer inverse    1.48      0.75
           Paper-and-pencil    0.90      0.82
  Score e  Computer regular    48.21     14.42    8.061**
           Computer inverse    59.37     14.45
           Paper-and-pencil    43.97     16.91
  Score f  Computer regular    536.43    260.75   1.081
           Computer inverse    605.32    396.94
           Paper-and-pencil    698.16    566.10

  * 5% level of significance. ** 1% level of significance.


Table 2. Bourdon's Test: statistically significant differences between groups with different test forms

  Score a  Computer inverse  vs.  Computer regular
  Score b  Computer inverse  vs.  Computer regular
  Score c  Paper-and-pencil  vs.  Computer inverse, Computer regular
  Score d  Computer inverse  vs.  Paper-and-pencil
  Score e  Computer inverse  vs.  Computer regular, Paper-and-pencil

Test forms in the second and third columns differ from each other at least at the 5% level of significance, separately for each row. Forms of administration not mentioned within a row do not differ significantly from the others.

2.2.2. Variability of performance over time

Bourdon's Test is used not only for examining the overall speed and quality of concentration but also for determining changes over time. The previous analyses dealt with the relative ranges of performance and errors, but these are only summary indices; here, quantitative performance variability was analyzed in relation to time. Due to the low occurrence of errors in individual sets, the variability of incorrectness was not analyzed.

As can be seen in Fig. 1, the curves of the two computerized forms are very similar in the progress of the learning effect, although they differ markedly in absolute numbers. This trend profile distinguishes both computerized forms from the paper-and-pencil one: compared to the curve of the paper-and-pencil form, the performance curves of both computerized versions rise more steeply in productivity. To describe this finding numerically, the gradual increase of the curves was expressed as the average change of performance over the entire test. Because the performance curves progress sufficiently consistently, this average change was computed simply as the sum of the differences between successive sets divided by the total number of processed sets; a sketch of the computation is given below. The last set of symbols was not included in the calculation, since the paper-and-pencil form showed a substantial decrease in the quantity of performance there that does not correspond to the general trend. This decrease in the last set of the conventional form can be explained by the fact that the examinees can clearly see that this set is the last one, and after nearly 30 min of hard work they tend to lose concentration and decrease their effort. Although the computer forms also display this information about test progress, it does not seem to be as striking.

The average change in performance over time differed among the three test versions, and this difference was statistically significant at the 5% level; the results can be seen in Table 3.
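As a concrete reading of this index (our own sketch; the authors give only the verbal formula, and the exact handling of the divisor is our assumption):

```python
def average_change(processed_per_set):
    """Average set-to-set change in the number of processed symbols.

    `processed_per_set`: counts for the successive 50-s sets of one
    examinee. The last set is dropped, as described in the text. The
    sum of successive differences telescopes to last - first.
    """
    x = processed_per_set[:-1]      # exclude the final set
    total_change = x[-1] - x[0]     # == sum of successive differences
    return total_change / len(x)
```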


Fig. 1. Bourdon’s Test: number of processed symbols for each test form.

Table 3. Bourdon's Test: the average change of performance in time

  Test form           Mean   SD     F        p
  Computer regular    0.58   0.43   3.470*   0.035
  Computer inverse    0.57   0.36
  Paper-and-pencil    0.36   0.33

  * 5% level of significance.

Tukey’s Post Hoc Tests in order to determine differences among individual test versions. Both computerized versions did not differ from each other; however, the paper-and-pencil version was found to be significantly different (at the 5% level of significance) from both computerized test versions. 2.2.3. Internal consistency of each test form Variability of performance was also analyzed using Cronbach’s Alpha coefficient. Table 4 summarizes the Alpha coefficients for the index of processing speed (that is, the total number of processed symbols) and also for the index of number of errors. Table 4 above demonstrates that all three test forms show very high and similar levels of internal consistency in terms of the processing speed. The situation differs in the case of the absolute incorrectness; the lowest internal consistency can be found in the inverse computerized test version and the highest internal consistency has been identified in the case of the paper-and-pencil version. This demonstrates there is a difference not only in the overall quality of performance of the examinees depending on the conditions of administration, but also in the consistency of performance quality in time.


Table 4. Bourdon's Test: Cronbach's Alpha coefficients for the speed and incorrectness of processing

  Test form           Alpha (speed)   Alpha (errors)
  Computer regular    0.991           0.834
  Computer inverse    0.991           0.746
  Paper-and-pencil    0.989           0.909

3. Study 2: equivalency of the Test of Concentration of Attention

3.1. Method

3.1.1. Instruments

Kučera's (1992) Test of Concentration of Attention (TCA) is a "proofreading" test in which the examinee's task is to compare symbols in neighboring columns. TCA measures psychomotor tempo, the absolute and relative correctness of psychomotor performance, and the propensity to inaccurate performance. Within a time limit of 4.5 min, the examinee should mark all those symbols in the right column that differ in any way from those in the left column; identical symbols are left unmarked. Each column comprises 375 symbols arranged in rows of 15 symbols. Of this total, there are 125 symbols for which the left and the right columns differ. The tested person always begins examining the symbols from the left side of the first row of the right column, and skipping a row is not allowed.

The test results provide information on both the absolute and the relative performance of the examinee. The absolute results comprise three values: (1) the number of differing symbols the examinee attempted to solve (that is, the number of differing symbols contained within the processed section of the test), which provides information about the speed of the realized psychomotor performance; (2) the number of correctly solved symbols (that is, the number of correctly identified differing symbols); and (3) the total number of incorrectly solved symbols (that is, the number of missed differing symbols plus the number of symbols incorrectly identified as differing). These three scores can be combined to provide relative information about the correctness of the examinee's performance, namely: (4) the ratio of the total number of incorrectly solved symbols to the number of symbols solved (score 3/score 1).

Next to the conventional version, we used the computerized version of TCA included in the Psychosoft software package. The authors of the software intended this version to be essentially comparable to the original paper-and-pencil version in visual appearance; however, it differs in some respects. In the computer version, a control line of symbols appears on the left side of the screen, and a similar sequence of symbols that are to be checked against the control line is on the right side. The symbols are exactly the same as in the paper-and-pencil version, and the time limit for the entire test is 4.5 min, as in the original TCA. As can be seen in Fig. 8 of Appendix A, TCA's basic color scheme is black and white: white symbols are presented on a black background. The examinee moves a red cursor between the symbols using the left- and right-arrow keys (as depicted at the bottom of the screen, Fig. 8) and marks symbols differing from the left control line using the spacebar; a marked symbol becomes inverse. Once the examinee reaches the end of a line of symbols, another line automatically appears in its place and the cursor moves to the beginning of the new line.


At the same time, the line number at the top of the screen increases by one. The examinee also has the option to revise and modify his/her responses, not only within the current sequence but also on previous lines; moving between individual lines can be time-consuming, however, because the cursor has to be moved through the entire line, symbol by symbol.

3.1.2. Subjects

This study was conducted in autumn 2001. The total sample comprised 105 high school students aged 16 to 18 (M = 16.73, SD = 0.66). Following Bartram's (1994) suggestion, the participants were tested twice, in repeated testing sessions two months apart. In the first session, participants were randomly divided into two groups: one group was administered the conventional (paper-and-pencil) form of the method, while the other completed the computerized version. Two months later, each group was again divided into two halves, which were once more administered either the conventional or the computerized form. Computerized tests were administered in the school's lab with the following hardware configuration: 17-in. monitor, 640 × 480 graphical resolution, and 60 Hz refresh rate (due to the limitations of the MS-DOS environment). This arrangement resulted in a four-group within-subjects design: group A (N = 27) was administered the conventional version twice; group B (N = 27) had crossed conditions of administration, initially the conventional and later the computerized version; group C (N = 25) also had crossed conditions, initially the computerized and later the conventional version; and group D (N = 26) was administered the computerized version both times.

3.1.3. Methods of statistical analysis

A Repeated Measures General Linear Model (GLM) was used to determine the overall effect of the order and form of administration on the test results. This procedure was supplemented by the Test of Between-Subjects Effects from the Multivariate GLM to identify the effect of the form of administration on the results separately for the first and the second measurement. Tukey's Post Hoc Tests were used to identify significant differences between individual groups. The stability of test scores across form and order of administration was verified using Pearson's coefficient of correlation; differences in correlation coefficients among groups were tested for statistical significance using Fisher's Z transformation.
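A present-day equivalent of this mixed repeated-measures design can be sketched as follows (our illustration, not the authors' script; `pingouin` is one library offering a mixed ANOVA, and the column names and data file are hypothetical):

```python
import pandas as pd
import pingouin as pg

# Long-format data: one row per examinee per session.
# Hypothetical columns: id, group (A/B/C/D), session (test/retest), score1.
df = pd.read_csv("tca_long.csv")  # hypothetical data file

aov = pg.mixed_anova(data=df, dv="score1", within="session",
                     subject="id", between="group")
print(aov)  # F statistics for group, session, and their interaction
```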

3.2. Results

3.2.1. The effect of the order and form of administration on TCA scores

The effect of the order and the form of administration on the scores was analyzed using the Repeated Measures General Linear Model. Table 5 presents the respective Wilks' Lambda coefficients and F statistics with the corresponding significance levels; as can be seen, all tested effects were significant at the 1% level. Table 6 then presents descriptive statistics of TCA scores for each group in the first and the second measurement session, together with the F statistics of the Test of Between-Subjects Effects from the Multivariate GLM, indicating the interaction between the form of administration and TCA scores.

Table 5. TCA: the effect of the interaction of the order and the form of administration

  TCA's score  Wilks' Lambda   F          p
  Score 1      0.643           18.705**   0.000
  Score 2      0.679           15.951**   0.000
  Score 3      0.831           6.864**    0.000
  Score 4      0.876           4.782**    0.004

  ** 1% level of significance.

For the purposes of the study's consistency, the significance of the between-groups effects presented above was computed for all four groups. Another suitable procedure would be to combine the groups with the same mode of administration of TCA (A + B; C + D) within the first measurement, as they were split randomly and thus essentially belong to the same sample. It is worth mentioning that when the means of the combined groups (A + B and C + D) in the first measurement are compared using the Univariate GLM procedure, the main effect of the form of administration reaches significance at the 1% level for all scores.

Each examined score is presented in a separate graph; statistically significant differences between groups within each measurement, as indicated by Tukey's Post Hoc Tests, are mentioned in the commentary to the relevant graph. Fig. 2 displays the total number of processed symbols. In the first measurement, groups C and D had slightly higher means on this score than groups A and B, but the differences were not statistically significant. The groups tested on a computer during the second measurement reached higher mean scores in that measurement than the groups administered the conventional version of TCA, and these differences were statistically significant. There is also an obvious tendency for the number of processed symbols to increase between the first and second measurements in every group except group C (computer–paper).

For the number of correctly solved symbols, displayed in Fig. 3, the tendencies are similar to those in Fig. 2. The groups with the conventional form of administration reached lower mean scores in both measurements. In the second measurement, the difference between modes of administration was statistically significant at the 5% level for group D versus both groups A and C, and almost reached the 5% level for group B versus groups A and C.

For score 3, the groups administered the conventional version of TCA made fewer errors in total than the other two groups in both measurements. Compared to the first measurement, group B, crossing from the conventional to the computer version, had both the highest absolute number of errors and the highest increase in errors; in the second measurement, group B, with the computer form of administration, also differed significantly from groups A and C. On the contrary, the total number of errors decreased in group C, crossing from the computer to the conventional mode of administration. Groups A and D, which kept the same form of administration in both measurements, increased their total number of errors only slightly (see Fig. 4).

Score 4 represents the relative accuracy of the examinee independent of the quantity of processed symbols. Two extremes should be mentioned: whereas group A (paper-and-pencil in both test and retest) was equally accurate in both measurements, group B (crossing from the conventional to the computer mode) deteriorated noticeably in accuracy and differed significantly from group A (and almost significantly from group C) in the second measurement (see Fig. 5).


Table 6. TCA: comparison of groups in the first and the second measurement

Score 1: Total number of processed symbols
  Test (F = 2.541, p = 0.061):
    (A) 84.67 (15.13); (B) 86.19 (16.01); (C) 94.40 (14.22); (D) 92.35 (14.82)
  Retest (F = 6.046**, p = 0.001):
    (A) 93.11 (14.75); (B) 104.33 (13.85); (C) 93.08 (14.88); (D) 105.38 (13.15)

Score 2: Number of correctly processed symbols
  Test (F = 2.401, p = 0.072):
    (A) 83.26 (14.35); (B) 84.74 (15.59); (C) 92.44 (13.98); (D) 90.46 (14.23)
  Retest (F = 5.121**, p = 0.002):
    (A) 91.67 (14.12); (B) 101.11 (13.51); (C) 91.40 (14.98); (D) 102.92 (12.43)

Score 3: Total number of errors
  Test (F = 1.657, p = 0.181):
    (A) 1.41 (1.19); (B) 1.44 (1.19); (C) 1.96 (0.98); (D) 1.88 (1.21)
  Retest (F = 5.176**, p = 0.002):
    (A) 1.44 (1.30); (B) 3.22 (2.82); (C) 1.68 (1.11); (D) 2.46 (1.53)

Score 4: Ratio of total errors to total processed
  Test (F = 1.325, p = 0.270):
    (A) 0.0158 (0.0103); (B) 0.0164 (0.0118); (C) 0.0207 (0.0102); (D) 0.0200 (0.0115)
  Retest (F = 4.253**, p = 0.007):
    (A) 0.0149 (0.0130); (B) 0.0309 (0.0256); (C) 0.0185 (0.0134); (D) 0.0229 (0.0133)

Values are means with standard deviations in parentheses; F and p refer to the between-groups difference within the given measurement. Groups: (A) paper–paper, (B) paper–computer, (C) computer–paper, (D) computer–computer; the form administered in a given measurement is the first-listed form of the group label for the Test rows and the second-listed form for the Retest rows.
** 1% level of significance.

The presented results seem to prove that the influence of the form of administration on the TCA method does exist. Our results show that the groups administered the computerized version of TCA reached higher scores in the number of processed symbols.


Fig. 2. TCA: number of processed symbols (score 1).

Fig. 3. TCA: number of correctly solved symbols (score 2).

The difference from those administered the conventional version was statistically significant in the second measurement (retest). It can thus be proposed that the computerized version of TCA may enable test-takers to reach a higher speed of test processing, which may stem from the different control interface of the computerized version. In comparison to the paper-and-pencil version, the control interface of the computerized one is so simple that it is possible to respond even without eye–hand coordination, which increases the rate of responding and thus the overall effectiveness.


Fig. 4. TCA: number of incorrectly solved symbols (score 3).

Fig. 5. TCA: ratio of the number of incorrectly solved symbols to the number of processed symbols (score 4).

Trends in the quantitative performance change between the first and the second measurement support the hypothesis of an accelerating effect of the computer control interface. The groups with the same conditions of administration in both sessions (A and D) experienced similar increases in performance between the first and the second measurement; however, group D, tested on a computer, reached higher absolute values than group A on both occasions.


The greatest improvement in performance was experienced by group B, with crossed conditions of administration (initially paper-and-pencil, then computer), in comparison to group C (initially computer, then paper-and-pencil), whose quantitative performance hardly changed. Interestingly enough, group B (crossed conditions) did not differ in performance from group D (without crossed conditions) in the second measurement. This finding supports our assumption about the simplicity of the computer control interface: the examinees in group B were already familiar with the nature of the test but had no experience with its computer control interface. Yet this unfamiliarity with the computer version did not lower their quantitative performance in the second measurement, as they performed as well as group D, which was familiar with both the test and the computer interface. In general, familiarity with the test enabled the examinees to better exploit the effectiveness of the computer control interface and thus increase their performance. This interpretation may explain why there was a significant difference between the pairs of groups with the same condition of administration in the second measurement but not in the first.

As far as the influence of the form of administration on the quality of performance (both absolute and relative incorrectness) is concerned, the computerized version increases the incorrectness of performance in general. The groups under the computer condition tended toward higher incorrectness than the groups under the paper-and-pencil condition, and this tendency was even more obvious in the second measurement. The finding suggests that the markedly higher processing tempo on the computer in the second measurement had a negative effect on the quality of decision-making: the higher number of processed symbols increased the probability of errors. The highest increase in both absolute and relative incorrectness was noted in group B (initially paper-and-pencil, then computer). It is possible that the discomfort of perceiving graphical stimuli on the computer screen, combined with the higher processing speed in an unfamiliar computer testing environment, had a negative influence on quality; however, this hypothesis needs further examination.

3.2.2. Stability of TCA scores across different order and form of administration

The relationship between the results of the repeated measurements was examined using Pearson's coefficient of correlation. An overview of the obtained correlation coefficients for individual TCA scores is presented in Table 7, divided according to the groups with different order and form of administration. For comparison, the values of the test–retest reliability of TCA (form B) obtained by Kučera (1992) with a two-month interval on a sample of 141 persons are given in the last column of the table. As expected, for groups A and D (same form of administration in both measurements) the correlation between the first and the second measurement closely approaches the results reported by the test author. Similar values of the correlation coefficients were obtained even for group C, with crossed modes of administration (initially computer, then paper-and-pencil).
On the contrary, for group B (crossed mode of administration, initially paper-and-pencil and then computer), the values of the correlation coefficients for the quantity of performance were noticeably lower. Comparisons of the statistical significance of the difference between group B and the other groups using Fisher's Z transformation show that the correlation coefficients differ at the 5% level of significance (group B vs. A, Z = 2.41; B vs. C, Z = 2.48; B vs. D, Z = 2.25).
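The Fisher comparison is easy to verify; this sketch of ours reproduces the reported B vs. A contrast for score 1 from Table 7 (r = 0.867, n = 27 for group A; r = 0.555, n = 27 for group B):

```python
from math import atanh, sqrt

def fisher_z_compare(r1, n1, r2, n2):
    """Z statistic for the difference between two independent
    correlations, via Fisher's r-to-z transformation."""
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (atanh(r1) - atanh(r2)) / se

print(round(fisher_z_compare(0.867, 27, 0.555, 27), 2))  # -> 2.41
```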


Table 7. TCA: Pearson's correlation coefficients for the relationship between the measurements

  TCA's score  (A) paper–  (B) paper–   (C) computer–  (D) computer–  Test–retest
               paper       computer     paper          computer       reliability of TCA
  Score 1      0.867**     0.555**      0.876**        0.857**        0.82
  Score 2      0.871**     0.535**      0.880**        0.864**        0.85
  Score 3      0.523**     0.337        0.257          0.419*         0.38
  Score 4      0.348       0.226        0.291          0.294          Not stated

  * 5% level of significance. ** 1% level of significance.

Another variable seems to influence the quantitative scores when crossing from the paper-and-pencil to the computer form of administration. One might ask why the lower stability of the quantity scores appeared in only one of the two groups with crossed conditions when the same effect could be expected in both. Given the considerable increase in the average speed of performance of group B between the measurements, one possible interpretation is that this increase differed among the individuals of the group: more computer-skilled individuals, already familiar with the nature of the test, could utilize their experience and reach a higher speed than less computer-skilled individuals. It can be concluded that the participants of group B were, thanks to their familiarity with the test, able to better utilize the effective computer control interface, but the lower stability of this group's scores implies that the ability to become more effective with the computer interface differs among individuals. Group C examinees, on the other hand, were not able to fully utilize the speed advantage of the computer, because they experienced the computer administration in the first measurement, when they were not yet familiar with the test itself; individual differences in this "ability" therefore did not substantially influence the scores of this group in the first measurement, and the stability of the scores between the measurements remained high. A plausible reason for the lower stability of the quantity scores when crossing from the paper-and-pencil to the computer form of administration is thus individual factors, such as computer experience and possibly computer anxiety. As far as the absolute and the relative incorrectness are concerned, group A, with the paper-and-pencil form of administration in both measurements, understandably shows the highest stability of scores.

4. Discussion

As some researchers point out (Aspillaga, 1996; Federico, 1991; French & Beaumont, 1990), the visual design of a test can affect the perception of information on the computer screen and thus influence the equivalency of the test. Study 1 therefore focused on the impact of one aspect of the visual design of computer administration on test results. For this reason, apart from the conventional version of Bourdon's Test, two computerized versions of the same test, differing only in color scheme, were used: one resembled the paper-and-pencil version, comprising dark symbols on a light background (the regular computer version), while the other employed the inverse color scheme (light symbols on a dark background, the inverse computer version). The acquired results confirm that, besides the mode of administration, the graphical presentation of the stimuli also affects the test-taker's performance.


A significant difference in the speed of performance was found among the different forms of the test. The highest speed was reached by the participants taking the computerized form with the regular color scheme (dark symbols on a light background); the lowest speed was reached by the respondents taking the inverse computerized form, and this group also had the highest score of relative incorrectness. Another obvious difference between the conventional and both computerized versions showed mainly in the number of errors: the examinees administered the conventional version produced significantly fewer errors than those taking either computer form. This result corresponds to the TCA findings (see Study 2) that enabling a higher response speed increases the number of errors.

Bourdon's Test also makes it possible to track changes in the quantity and quality of performance over time. In comparison to the conventional form, both computerized forms showed a greater increase of processing speed during the test: the increasing familiarity with the test and the improving ability to control it enabled the tested persons to gradually increase their speed of performance. A difference between the conventional and both computerized forms was also found in the internal consistency of performance quality, with the lowest level of internal consistency found for the inverse computerized form.

To conclude, in agreement with the relevant literature, subjects administered the computerized forms produced more errors. However, the relative incorrectness in the regular computerized form was lower than in the inverse computerized form (and at the same level as in the conventional one), evidently due to the higher tempo. On the basis of this finding, and also considering the alpha coefficients for incorrectness, we would consider the regular-colored computerized form (which visually resembles the paper-and-pencil color scheme) the more appropriate computerized tool for evaluating changes of attention over time.

The purpose of Study 2 was to determine whether different administration modes (paper-and-pencil versus computer) cause measurement non-equivalency in the selected speeded test, the Test of Concentration of Attention. The empirical literature often mentions that speeded tests are affected by the mode of administration (e.g., Greaud & Green, 1986; Mead & Drasgow, 1993). When comparing the two forms of administration of TCA, two main factors are worth mentioning: the speed of processing and the number of errors. In general, the groups with the computer form of administration reached a higher processing speed but also a higher incorrectness of performance, both absolute and relative. This finding is supported by previous research by van de Vijver and Harsveld (1994).

The stability of TCA scores was found to be adequately high, with the exception of the processing speed scores (scores 1 and 2) in the group that crossed from the paper-and-pencil to the computer condition of administration. It can be suggested that more computer-experienced subjects who were already familiar with the nature of the test could utilize this experience and thus reach higher processing-speed scores. It can therefore be concluded that differences in this human–computer factor could cause the lower stability within this group.
However, computer experience and other informal factors influencing performance in computerized tests (e.g., computer anxiety and computer attitudes; Bartram, 1994; Chua, Chen, & Wong, 1999; Mahar, Henderson, & Deane, 1997), as well as their predictors, in particular age, gender, and education (Farina, Arce, Sobral, & Carames, 1991; Pope-Davis & Twing, 1991), were not analyzed in either study; nevertheless, all groups can be considered comparable in these characteristics.


In conclusion, it must be stated that the results across the different modes of administration of the two speeded tests were found to be non-equivalent. The most important conclusion that can be drawn from our findings, however, is that even a relatively marginal modification of the test design (e.g., a change of color scheme) can substantially influence the tested individuals' performance and thus decrease the validity of any conclusion based on the test. Moreover, it can also be hypothesized that various technical parameters of individual computers and monitors have a certain impact on performance in the test situation. This leads us to a proposition for further research: it would be interesting to compare administration of the same computerized form of a test in contrasting groups provided with "low-tech" (i.e., a small monitor, low graphical resolution, low refresh rate) and "high-tech" (i.e., a modern LCD monitor with higher graphical resolution) computer hardware.

Acknowledgements

A preliminary report of the experimental results was presented at the 8th European Congress of Psychology, Vienna, 6–11 July 2003. The presented paper was carried out under the terms of the research plan of the Institute of Psychology, The Academy of Sciences of the Czech Republic (No. AV0Z7025918) and is a part of the Institute's key project (No. K9058117) and GACR project (No. 406/99/1052).

Appendix A

See Figs. 6–8.

Fig. 6. Regular computerized version of Bourdon’s Test.


Fig. 7. Inverse computerized version of Bourdon’s Test.

Fig. 8. Computerized version of TCA.


References

AERA, APA, & NCME (2001). Standardy pro pedagogické a psychologické testování [Standards for educational and psychological testing]. Praha: Testcentrum.
APA (1986). Guidelines for computer-based tests and interpretations. Washington, DC: Author.
Amthauer, R. (1993). IST-70. Bratislava: Psychodiagnostika, s.r.o.
Aspillaga, M. (1996). Perceptual foundations in the design of visual displays. Computers in Human Behavior, 12(4), 587–600.
Bartram, D. (1994). Computer-based assessment. In C. L. Cooper (Ed.), International review of industrial and organizational psychology (pp. 31–69). London: Wiley.
Butcher, J. N. (1987). The use of computers in psychological assessment: an overview of practices and issues. In J. N. Butcher (Ed.), Computerized psychological assessment (pp. 3–15). New York: Basic Books.
Chua, S. L., Chen, D., & Wong, A. F. L. (1999). Computer anxiety and its correlates: A meta-analysis. Computers in Human Behavior, 15, 609–623.
Cohen, R. J., & Swerdlik, M. E. (1999). Psychological assessment and testing (4th ed.). Mountain View, CA: Mayfield Publishing.
de Beer, M., & Visser, D. (1998). Comparability of the paper-and-pencil and computerized adaptive versions of the General Scholastic Aptitude Test (GSAT) Senior. South African Journal of Psychology, 28(1), 21–27.
Farina, F., Arce, R., Sobral, J., & Carames, R. (1991). Predictors of anxiety towards computers. Computers in Human Behavior, 7, 263–267.
Federico, P. A. (1991). Measuring recognition performance using computer-based and paper-based methods. Behavior Research Methods, Instruments, & Computers, 23(3), 341–347.
Ford, B. D., Vitelli, R., & Stuckless, N. (1996). The effects of computer versus paper-and-pencil administration on measures of anger and revenge with an inmate population. Computers in Human Behavior, 12, 159–166.
French, C., & Beaumont, J. G. (1990). A clinical study of the automated assessment of intelligence by the Mill Hill Vocabulary test and the Standard Progressive Matrices test. Journal of Clinical Psychology, 46, 129–140.
George, C. E., Lankford, J. S., & Wilson, S. E. (1992). The effects of computerized versus paper-and-pencil administration on measures of negative affect. Computers in Human Behavior, 8, 203–209.
Greaud, V., & Green, B. F. (1986). Equivalence of conventional and computer presentation of speed tests. Applied Psychological Measurement, 10, 23–34.
Krantz, J. H. (2000). Tell me, what did you see? The stimulus on computers. Behavior Research Methods, Instruments, & Computers, 32(2), 221–229.
Kučera, M. (1992). Test koncentrace pozornosti [Test of Concentration of Attention]. Bratislava: Psychodiagnostické a didaktické testy, n. p.
Květon, P., Jelínek, M., Vobořil, D., & Klimusová, H. (2003). Ekvivalence tradiční a počítačové formy testu IST-70 [Equivalency of traditional and computerized form of IST-70]. Československá psychologie, 47(6), 562–572.
Květon, P., & Klimusová, H. (2002). Metodologické aspekty počítačové administrace psychodiagnostických metod [Methodological aspects of computer-based administration of psychodiagnostic methods]. Československá psychologie, 46(3), 251–264.
Mahar, D., Henderson, R., & Deane, F. (1997). The effects of computer anxiety, state anxiety, and computer experience on users' performance of computer based tasks. Personality and Individual Differences, 22(5), 683–692.
Mareš, J. (1992). Psychodiagnostika podporovaná počítačem [Computer based assessment]. Praha: Ústav pro informace ve vzdělávání.
Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114(3), 449–458.
Merten, T., & Ruch, W. (1996). A comparison of computerized and conventional administration of the German versions of the Eysenck Personality Questionnaire and the Carroll Rating Scale for Depression. Personality and Individual Differences, 20(3), 281–291.
Pinsoneault, T. B. (1996). Equivalency of computer-assisted and paper-and-pencil administered versions of the Minnesota Multiphasic Personality Inventory-2. Computers in Human Behavior, 12(2), 291–300.
Pope-Davis, D. B., & Twing, J. S. (1991). The effects of age, gender, and experience on measures of attitude regarding computers. Computers in Human Behavior, 7, 333–339.


Rosenfeld, P., Booth-Kewley, S., & Edwards, J. E. (1993). Computer-administered surveys in organizational settings: Alternatives, advantages, and applications. American Behavioral Scientist, 63, 485–511.
Senka, J., Kuruc, J., & Čečer, M. (1992). Bourdonov test [Bourdon's Test]. Bratislava: Psychodiagnostika, s.r.o.
van de Vijver, F. J. R., & Harsveld, M. (1994). The incomplete equivalence of the paper-and-pencil and computerized versions of the General Aptitude Test Battery. Journal of Applied Psychology, 79, 852–859.
Watson, Ch. G., Thomas, D., & Anderson, P. E. D. (1992). Do computer administered Minnesota Multiphasic Personality Inventories underestimate booklet-based scores? Journal of Clinical Psychology, 48(6), 744–748.