Performing automatic exams


Computers & Education 31 (1998) 281–300

G. Frosini, B. Lazzerini*, F. Marcelloni

Dipartimento di Ingegneria della Informazione: Elettronica, Informatica, Telecomunicazioni, University of Pisa, 56126 Pisa, Italy

Received 18 March 1998; accepted 1 June 1998

Abstract

In this paper we describe a tool to build software systems which replace the role of the examiner during a typical Italian academic exam in technical/scientific subjects. Such systems are designed so as to exploit the advantages of self-adapted testing (SAT) for reducing the effect of anxiety and of computerised adaptive testing (CAT) for increasing the assessment efficiency. A SAT-like pre-exam determines the starting difficulty level of the following CAT. The exam can thus be dynamically adapted to suit the ability of the student, i.e. by making it more difficult or easier as required. The examiner simply needs to associate the level of difficulty with a suitable number of initial queries. After posing these queries to a sample group of students and collecting statistics, the tool can automatically associate a level of difficulty with each subsequent query by submitting it to sample groups of students. In addition, the tool can automatically assign a score to the levels and to the queries. Finally, the systems collect statistics so as to measure the easiness and selectivity of each query and to evaluate the validity and reliability of an exam. © 1998 Elsevier Science Ltd. All rights reserved.

1. Introduction

Exams aim primarily to assess the level of preparation reached by the students. They can also be used to evaluate the effectiveness of the teaching methods adopted (Barrie, 1975; Gattullo, 1988). An exam is characterised by validity and reliability. Validity expresses the ability to measure in real terms what the examiner has pre-set. Reliability expresses the non-dependence of the results on the examiner and on when the exam takes place.

* Corresponding author: Dipartimento di Ingegneria della Informazione: Elettronica, Informatica, Telecomunicazioni, University of Pisa, 56126 Pisa, Italy. Fax: +39-50-568-522; E-mail: [email protected].


To establish the level of validity and reliability, the queries which make up the exam need to be characterised in terms of easiness and selectivity. Easiness is an indication of the number of people who correctly answer the query. Selectivity is an expression of the ability of the query to discriminate between well and less well prepared candidates. A valid and reliable exam consists of queries which are all positively selective. To get a safe subdivision between the best students and the worst, queries of average difficulty should be used. However, particularly easy (difficult) queries are needed to discriminate between the worst (best) students (Barrie, 1975). The easiness and selectivity of the queries are measured by introducing indexes and by plotting histograms.

The recent trend towards increases in student numbers has made the assessment of students a time-consuming and labour-intensive task. On the other hand, the development of artificial intelligence techniques has made computers able to perform a number of tasks, typically seen as specifically human ones, even more accurately and efficiently than human beings do. In particular, software systems for computer-based assessment of students have been designed and implemented, thus freeing up human resources (Jackson, 1991, 1996; Hung, Kwok & Chan, 1993; Thoennessen & Harrison, 1996).

In the literature, a great variety of techniques have been proposed for automatic grading. They differ from each other mainly in terms of the administration procedure used in carrying out the examination. Basically, three categories can be identified: fixed-item tests (FITs), computerised adaptive tests (CATs) and self-adapted tests (SATs). In FITs, the same set of queries is submitted to all examinees in the same order. In CATs, the queries are selected according to algorithms which adapt the examination to the specific characteristics of each examinee. The algorithms are designed to maximise the efficiency, that is, to reach the maximum information on the examinee's skills with the minimum number of queries (Weiss, 1982). In SATs, the adaptation is carried out directly by examinees, who select the difficulty level of each query based on their self-perceived ability (Rocklin & O'Donnel, 1989; Rocklin, 1994).

Several papers have been devoted to comparing the three categories of test administration procedures (Vispoel, Rocklin & Wang, 1994; Ponsoda, Wise, Olea & Revuelta, 1997). In general, CATs are more efficient than FITs and SATs. Less time is needed to administer CATs than FITs and SATs since fewer queries are required to achieve acceptable accuracy in the evaluation of the examinee. Further, CATs can provide accurate marks adapted to a wide range of abilities. On the other hand, SATs reduce the effect of anxiety by allowing the examinees to have control over the difficulty of the exam. As noted in Jackson (1991), anxiety is generally negatively correlated with the performance of an assessing strategy. It has been experimentally shown that failure increases anxiety while success reduces it (Hung et al., 1993). Delegating the choice of the difficulty level to the examinee diminishes the probability of failure and, therefore, reduces the effect of anxiety. Further, success results in a high perception of ability in the specific domain, producing feelings of internal control and confidence that facilitate the successful accomplishment of the examination.
In this paper, we describe a tool to build automatic examiners which replace the role of a human examiner in a typical oral/practical Italian academic exam in technical/scientific subjects. In Italy the university examination process usually takes place in two steps: a written exam followed by an oral exam (even for scientific subjects). Generally, the oral exam consists


of a fixed number of queries (4–6) posed one after another. The examiner adapts the exam to suit the ability of the student by submitting less (more) difficult queries to less (more) prepared students. Lower scores are associated with easier queries. The maximum final mark a student can gain is 30/30; the passing mark is 18/30. The examiner may decide to interrupt the exam prematurely if he believes that the student will not be able to reach the passing mark. The examiner usually knows the quality of the students, as they followed his lessons and interacted with him during the academic year. So, he starts the exam from the most appropriate level of difficulty. Further, during the exam, he can perceive how well prepared the student is and quickly adapt the difficulty of the exam. For instance, if he starts with a relatively easy query and the student confidently answers it, he can decide to submit directly a query of the maximum level of difficulty. This complexity of evaluation is clearly based on the examiner's perception. Such perception is difficult to reproduce in an automatic examiner that poses a few (4–6) queries.

Although a CAT procedure with a fixed number of queries appeared to be the best solution, we had the problem of determining the initial level of difficulty. A good student would not be able to reach the maximum possible score in the examination if the level of the first query were different from the maximum. On the other hand, it is not possible to know whether the student is good or bad when the exam starts. To avoid penalising students, a mechanism is needed to determine the initial level appropriately. To this aim, our tool exploits a combination of SAT and CAT administration procedures. The difficulty level of the first query to submit to the examinee in the actual exam is determined by means of a SAT-like pre-exam. This approach reduces the effect of anxiety and increases the self-confidence of the examinee, without reducing the efficiency of the exam. The actual exam is a CAT consisting of a number of queries established by the examiner. These queries are related to a corresponding number of subjects which the system chooses randomly. The queries are chosen in such a way that the level of difficulty of the exam is adapted to the ability of the candidate, by raising or lowering the level as necessary on the basis of how the exam itself is going.

The tool has been designed so that it can base itself on a suitable number of initial queries with which a level has been manually associated. Then, after a phase of collecting statistics, it can associate a level with any subsequent query inserted by the examiner. This is obtained by submitting the query to sample groups of students. In addition, the system can automatically assign a score to levels (based on the number of queries in the exam and on the maximum possible mark) and also to each possible answer to individual queries (on the basis of the type of query). During the exam the marks gained in the individual queries and the final mark are collected. The statistical analysis of the easiness and the selectivity of the queries is carried out using measurement indexes. The results are utilised to evaluate the correctness and clarity of each query. At the end of the exam, the partial and final results of the exam are shown. The student's answer, the correct answer (where necessary) and the mark gained are presented for each query. These marks are also used to adjust the level of difficulty of the single queries.
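To make the two-phase procedure concrete, the following C++ sketch outlines one possible control flow under the assumptions just described: a short SAT-like pre-exam fixes the starting level, and a CAT of Q queries then adapts the difficulty. The callback and function names (ExamCallbacks, runExam, adaptLevel and so on) are ours, introduced only for illustration; they do not correspond to the tool's actual code.

#include <algorithm>
#include <functional>

// Hypothetical callbacks standing in for the tool's data base and student interface.
struct ExamCallbacks {
    std::function<int()> levelChosenByStudent;          // SAT phase: the student picks a level
    std::function<double(int level)> askRandomQueryAt;  // poses a random query of that level, returns the mark
    std::function<int(int currLevel, double mark, double maxScore)> adaptLevel;  // CAT update rule
};

// Two-phase exam: a SAT-like pre-exam fixes the starting level, then a CAT of Q queries.
// maxSc is the score of the highest level (maximum exam mark divided by Q).
double runExam(const ExamCallbacks& cb, int preQueries, int Q, int K, double maxSc) {
    // Phase 1: self-adapted pre-exam; keep the highest level answered correctly,
    // otherwise start just below the lowest level the examinee chose.
    int startLevel = 0, lowestChosen = K;
    for (int i = 0; i < preQueries; ++i) {
        int lev = cb.levelChosenByStudent();
        lowestChosen = std::min(lowestChosen, lev);
        if (cb.askRandomQueryAt(lev) > 0.0) startLevel = std::max(startLevel, lev);
    }
    if (startLevel == 0) startLevel = std::max(1, lowestChosen - 1);

    // Phase 2: computerised adaptive test with Q queries.
    double total = 0.0;
    int level = startLevel;
    for (int i = 0; i < Q; ++i) {
        double maxScore = maxSc * level / K;            // score associated with this level
        double mark = cb.askRandomQueryAt(level);
        total += mark;
        level = cb.adaptLevel(level, mark, maxScore);   // raise or lower the difficulty
    }
    return total;
}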
The tool was developed by using Microsoft Visual C++ under the Windows operating system (Gregory, 1996). We utilised the tool to build an automatic examiner, which carries out


automatic exams in relation to the programming language Pascal (Jensen & Wirth, 1985). This system is currently being used experimentally in some university courses.

2. Architecture of the tool

The basic architecture of our tool is shown in Fig. 1. The data base contains the queries inserted by the examiner and the partial and final results automatically stored by the tool itself. The computational engine is logically split into three modules. The design module handles the insertion of the queries into the data base and the determination of their levels of difficulty. The exam module selects the queries to submit to the students, and calculates and stores the marks. The analysis module analyses the results gathered during the exams and calculates particular indexes which express the easiness and selectivity of every query. Finally, the user interface has a twofold role in relation to the type of user. The student interface presents the queries during the exam and allows an easy insertion of the answer(s). The examiner interface drives the insertion of the queries and shows the statistics.

Fig. 1. Architecture of the tool.

3. Data base

The data base contains three types of queries: non-structured, structured and applicative queries.


A non-structured query may have a single or multiple answer. A single answer query consists of a question followed by a list of answers: only one of the answers is correct. A multiple answer query consists of a question followed by a list of answers: more than one answer may be correct. A structured query has two sections: the first (introductory section) outlines the subject which the queries of the second section (interrogative section) are about. The queries of the second section are called subqueries. A subquery can be a non-structured query (non-structured subquery) or it can be a question which requires as an answer the insertion of a numerical value or a predetermined sequence of words (subquery with insertion). Finally, an applicative query consists of the specification of a problem and the request to write a subprogram which solves it. The applicative query requires an accurate description and is explained in detail in the next section.

The partial and final results are collected during every exam; they are used to analyse the easiness and selectivity of each query. Examples of the various types of queries relative to the programming language Pascal are shown in Figs. 2–5.
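The three types of queries can be pictured with a small data model. The following C++ declarations are only an illustrative sketch of such a model; all type and field names are ours, not the tool's.

#include <string>
#include <variant>
#include <vector>

// Illustrative data model for the three query types (names are ours, not the tool's).

struct Alternative {            // one proposed answer of a non-structured query
    std::string text;
    bool correct;               // more than one alternative may be correct in a multiple answer query
};

struct NonStructuredQuery {     // single or multiple answer
    std::string question;
    std::vector<Alternative> alternatives;
    bool multipleAnswer;
};

struct InsertionSubquery {      // requires a numerical value or a fixed sequence of words
    std::string question;
    std::string expectedAnswer;
};

struct StructuredQuery {
    std::string introductorySection;   // outlines the subject of the subqueries
    std::vector<std::variant<NonStructuredQuery, InsertionSubquery>> subqueries;
};

struct ApplicativeQuery {       // asks for a subprogram solving a stated problem
    std::string problemSpecification;
    std::string referenceSolution;     // correct program supplied by the examiner (Section 3.1)
    std::string testProgram;           // drives the student's code on the input test data
    std::vector<std::string> inputTestData;
};

struct Query {
    int level = 0;              // difficulty level, 1..K
    std::variant<NonStructuredQuery, StructuredQuery, ApplicativeQuery> body;
};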

Fig. 2. Non-structured query: single answer query.


Fig. 3. Non-structured query: multiple answer query.

Fig. 4. Structured query.

Fig. 5. Applicative query.

3.1. Applicative query

As claimed in several papers (Hung et al., 1993; Jackson, 1996), the automatic marking of student programming exercises should be based on the evaluation of specific programming features such as correctness, style, efficiency, complexity and programming skill. For each feature, the automatic examiner collects a partial mark. The final mark is obtained as a weighted sum of the partial marks. The weighting factors are associated with each feature by the examiner: higher factors are associated with the features the examiner considers more important in assessing programming ability. As our experiments were carried out in a first year university course, we decided it would be more meaningful to test whether the students were able to write correct programs rather than high quality programs. Therefore, we associated the highest weighting factor with correctness.

The automatic assessment of each programming feature requires the examiner to provide the tool with the solution (i.e., a correct program) and a test program. The test program runs the student's program to verify its output on given input test data. In the following, we analyse the programming features in detail and explain how a mark is associated with them.

3.1.1. Correctness

A (sub)program is correct if it executes the actions for which it was written. When a student submits a subprogram, the automatic examiner links the subprogram to the test program and runs the resulting executable code using the supplied input test data. More test data sets may be provided, and for each of them the output of the program is matched against the corresponding output test data. The mark is calculated on the basis of the percentage of correct matches. In this first release of our system, we adopt a primitive character-by-character comparison. This obviously requires the examiner to specify precise formatting information for any printing of output data. Although this approach provides quite limited flexibility in how the examinees can present their output, statistics gathered on a sample of 50 examinees highlighted that they did not feel particularly constrained. In any case, we are investigating more sophisticated mechanisms such as the pattern matching approaches proposed in Jackson (1991) and Hung et al. (1993).

At the end of the examination, the solution supplied by the examiner is shown to the examinee. This real-time feedback makes the examination a means not only for assessing, but also for improving the level of knowledge of an examinee. As the automatic examiner is based on an adaptive algorithm, this teaching function is tailored to the student being examined.
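As a minimal illustration of the correctness marking just described, the sketch below assumes the student's subprogram has already been linked to the test program and run on each input test data set; the correctness mark is then proportional to the fraction of exact, character-by-character output matches, and the final applicative mark is a weighted sum of the feature marks. Function names and data layout are ours.

#include <string>
#include <vector>

// Correctness mark: fraction of test data sets whose program output matches the
// expected output exactly, character by character (the paper's first release),
// scaled by the maximum mark available for the correctness feature.
double correctnessMark(const std::vector<std::string>& producedOutputs,
                       const std::vector<std::string>& expectedOutputs,
                       double maxCorrectnessMark) {
    if (expectedOutputs.empty()) return 0.0;
    std::size_t matches = 0;
    for (std::size_t i = 0; i < expectedOutputs.size() && i < producedOutputs.size(); ++i)
        if (producedOutputs[i] == expectedOutputs[i])   // strict character-by-character match
            ++matches;
    return maxCorrectnessMark * static_cast<double>(matches) / expectedOutputs.size();
}

// Final mark for an applicative query: weighted sum of the partial feature marks
// (correctness, style, efficiency, complexity, programming skill).
double applicativeMark(const std::vector<double>& featureMarks,
                       const std::vector<double>& weights) {
    double sum = 0.0;
    for (std::size_t i = 0; i < featureMarks.size() && i < weights.size(); ++i)
        sum += weights[i] * featureMarks[i];
    return sum;
}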


3.1.2. Style

In the last two decades, several definitions of good programming style have been proposed (Marca, 1981; Rees, 1982; Berry & Meekings, 1985; Lake & Cook, 1990). Most definitions are based on measuring the following characteristics: program length, identifier length, percentage of comment lines in relation to the total number of code lines, indentation, percentage of blank lines in relation to the total number of code lines, number of characters per line, number of spaces per line, and number of `gotos'. The examiner can associate a different weighting factor with each characteristic so as to increase or decrease its contribution to the mark. For instance, the percentage of comment lines is generally associated with a high weighting factor because it is considered more important than the other characteristics.

For most style characteristics, adopting a `full or zero mark' approach is not suitable (Berry & Meekings, 1985; Jackson, 1991). Let us consider, for instance, the metric associated with the program length. If a `full or zero mark' approach were used, the examiner would have to define a crisp set, denoted mark set, of program lengths with which the full mark is associated. If the length of the examinee's subprogram does not belong to the set, then the subprogram is evaluated as zero. Let us assume that the full mark interval is [10,20]. If the length of the examinee's subprogram is 21, then the subprogram is evaluated as zero with respect to the style characteristic `program length'. But a difference of one line does not seem to be such a distinguishing feature. Actually, the transition between lengths which obtain the full mark and lengths which obtain the zero mark appears gradual rather than abrupt. To take this gradual membership into consideration, our system allows mark sets to be fuzzy sets. A fuzzy set S of a universe of discourse U is characterised by a membership function μ_S: U → [0,1] which associates with each element y of U a number μ_S(y) in the interval [0,1] representing the grade of membership of y in S (Zadeh, 1973). Fig. 6 shows an example of a fuzzy mark set. Here, the lengths between B and C will gain a full mark; the lengths between A and B, and between C and D, will result in partial marks; the lengths less than A or greater than D will gain nothing for the student. When an examinee submits the program to the automatic examiner, the latter computes the membership grade of the program length in the corresponding fuzzy mark set. The mark gained by the examinee is calculated as the product of the mark associated with the characteristic and the membership grade. For all the style metrics, the system requests the examiner to define a mark set. The examiner can decide to define crisp or fuzzy mark sets.

Fig. 6. Fuzzy mark set.
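The fuzzy mark set of Fig. 6 can be read as a trapezoidal membership function over the breakpoints A, B, C and D. The following C++ sketch shows how the mark for one style characteristic could then be computed as the product of the characteristic's mark and the membership grade; it is our own rendering, not the tool's code.

// Trapezoidal membership function in the spirit of Fig. 6: full membership between
// b and c, linear ramps on [a,b] and [c,d], zero outside [a,d].
double trapezoidalMembership(double y, double a, double b, double c, double d) {
    if (y <= a || y >= d) return 0.0;
    if (y >= b && y <= c) return 1.0;
    if (y < b) return (y - a) / (b - a);   // rising edge
    return (d - y) / (d - c);              // falling edge
}

// Mark gained for one style characteristic (e.g. program length): the product of the
// mark associated with the characteristic and the membership grade of the measured value.
double styleCharacteristicMark(double measuredValue, double characteristicMark,
                               double a, double b, double c, double d) {
    return characteristicMark * trapezoidalMembership(measuredValue, a, b, c, d);
}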


Fig. 7. An example of eciency mark set.

3.1.3. Eciency The eciency of a program measures the time spent executing it on a computer. To assess this feature, the automatic examiner might require the teacher to provide a mark set of execution times. This approach, however, would not be fair. Execution time depends strongly on the computer performance. To overcome the in¯uence of contextual factors on the reliability of the eciency measure, we adopt the following procedure. Firstly, the automatic examiner evaluates the execution times Te and Td of, respectively, the examinee's program and the solution program provided by the examiner by running them with the input test data. Then, the eciency index E, de®ned as E = (TeÿTd)/Td, is computed. Finally, the grade of membership of E to the eciency mark set is evaluated. An example of mark set for the eciency is shown in Fig. 7. Here, values less than zero get the full mark because the execution time of the examinee's program is less than or equal to the examiner's solution. 3.1.4. Complexity To measure the complexity of a program, we adopt McCabe's cyclomatic complexity metric (McCabe, 1976). Although this metric has often been criticised due to its lack of theoretical foundation (Sheppard, 1988), it has the advantage of being simple and easily computable on a computer. Further, it has already been successfully adopted by existing systems which assess programs automatically (Hung et al., 1993; Jackson, 1996). McCabe's cyclomatic metric is based on the number of edges and nodes which compose a ¯ow graph. The computation of the metric requires, therefore, the generation of the ¯ow graph corresponding to the program being evaluated. The cyclomatic complexity of the program is de®ned as V(G) = E ÿ N + 2, where E is the number of edges and N is the number of nodes composing the ¯ow graph G. As McCabe's metric measures the number of linearly independent paths in the program, it is an indication of the grade of maintainability of the program (Hung et al., 1993). 3.1.5. Programming skill An examination should measure not only the capability to produce (high quality) software, but also the e€ort spent in producing such software. The e€ort is an indication of the programming skill of the examinee. In a preliminary version of our system, we quanti®ed the e€ort as the time spent in producing the code. After testing our system in some sample examinations, we realised that this time should not be considered as a reliable metric for

290

G. Frosini et al. / Computers & Education 31 (1998) 281±300

programming skill. Indeed, the amount of time spent in producing code may depend not only on programming skill, but also on other factors such as the typing ability and the diculty in interpreting the requirements speci®cation. In Hung et al., 1993, an attempt is made to quantify the programming skill through the use of a measure known as debugging e€ort. The debugging e€ort is measured as the number of times students have to recompile their program before obtaining a working solution. This approach can penalise students for doing typing errors. We observed, however, that typing errors are quite easy to identify and correct. Consequently, the interval time spent to ®x typing errors is quite short. Based on this consideration, we adopt a metric for the programming skill which takes two factors into account, namely the number of compilations along with the time interval between two consecutive compilations. The mark gained by the examinee with respect to the programming skill is calculated as the weighted sum of the marks gained for the two factors.
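The efficiency and complexity features of Sections 3.1.3 and 3.1.4 reduce to two simple formulas, E = (Te − Td)/Td and V(G) = E − N + 2. A small C++ illustration follows; the flow-graph representation is ours, and the construction of the graph from the Pascal source is not shown.

#include <cstddef>
#include <utility>
#include <vector>

// Efficiency index of Section 3.1.3: E = (Te - Td) / Td, where Te and Td are the
// execution times of the examinee's program and of the examiner's solution.
double efficiencyIndex(double te, double td) {
    return (te - td) / td;   // negative or zero values obtain full membership (Fig. 7)
}

// Flow graph as an edge list over nodes 0..numNodes-1 (representation is ours).
struct FlowGraph {
    std::size_t numNodes;
    std::vector<std::pair<std::size_t, std::size_t>> edges;
};

// McCabe's cyclomatic complexity: V(G) = E - N + 2,
// where E is the number of edges and N the number of nodes of the flow graph.
long cyclomaticComplexity(const FlowGraph& g) {
    return static_cast<long>(g.edges.size()) - static_cast<long>(g.numNodes) + 2;
}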

4. Computational engine

The computational engine drives the insertion of the queries, associates a level with every query stored in the data base, carries out the exams and analyses the partial and final results for calculating statistics.

4.1. Design module

Using their own teaching experience, examiners can associate several levels of difficulty with the various queries. To take this into account, we associate each query with a positive integer (level) between 1 and K (K is usually of the order of 3–6), which indicates the level of difficulty, in such a way that increasing degrees of difficulty correspond to higher levels. Starting from a suitable number of initial queries to which a level has been assigned by the examiner (sample queries), our tool can associate, after collecting statistics, a level with each subsequent query that the examiner inserts. In addition, the tool can automatically assign a score to the levels and to the queries. The automatic determination of the levels and the scores is carried out in three distinct phases. In the first phase, named level–percentage association, for every level the examiner submits a number of sample queries to a group of m students (sample group) who are a significant sample of the category: the aim is to associate with every level the percentage, or rather the range of percentages, of students who correctly answer the sample queries of that level. In the second phase, called query-level association, the examiner submits each query inserted during this phase to the same sample group: the aim is to associate a level with the query based on the range of percentages of students who correctly answer the query itself. In the third phase, named score generation, the tool associates a score with each level and with the alternatives which constitute the possible answers to the queries.

4.1.1. Level–percentage association

Typically, examiners form a sample group on the basis of the students' curricula. Then, for each level, they execute a random experiment which consists of proposing a sample query to the sample group. The result of the experiment is the percentage x of the students who have correctly answered the query. We consider the random variable X which represents all the possible values of x: we can assume that X is a Gaussian variable (Gattullo, 1988). To determine the probability density f_X(x) uniquely we need to calculate the mean η_X and the variance σ²_X. Therefore, referring to n independent repetitions of the experiment, we get n values x_1, ..., x_n, which make up a sample of n values of X. The arithmetic mean x̄_n and the variance s²_n of the sample can be used to estimate the mean η_X and (n − 1)/n times the variance σ²_X, respectively (Wilks, 1962). Whenever the examiner poses a new sample query, x̄_n and s²_n are recalculated, until the variance of the values x̄_1, ..., x̄_n and the variance of the values s²_1, ..., s²_n are sufficiently small. At that point we have a reliable estimate of η_X and σ²_X.

The lower and upper extremes of the percentage ranges of students who correctly answer the queries of the various levels are obtained from the abscissas, expressed as percentages, of the intersection points between the curves which represent, for each level j, 1 ≤ j ≤ K, the functions f_X^j(x) (Fig. 8). A more precise definition of the percentage range [B_{K−j}, B_{K−j+1}] associated with level j is:

B_{K−j} = abscissa of the intersection of f_X^j(x) and f_X^{j+1}(x), for 1 ≤ j < K, with B_0 = 0% and B_K = 100%.

Fig. 8. Percentage intervals associated with levels.
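A sketch of how the level–percentage association could be implemented is given below: per-level estimates of η_X and σ²_X are updated as sample queries are posed, and the boundary between two adjacent levels is taken as the abscissa where their Gaussian densities intersect. The paper does not describe how the intersection is computed, so the numerical scan (and all the names) used here is only an assumption.

#include <cmath>
#include <vector>

// Running estimate of the Gaussian parameters of X for one difficulty level,
// updated each time a new sample query is posed to the sample group (Section 4.1.1).
struct LevelStatistics {
    std::vector<double> percentages;          // x_1, ..., x_n (fractions in [0,1])
    double mean() const {
        double s = 0.0;
        for (double x : percentages) s += x;
        return s / percentages.size();
    }
    double variance() const {                 // sample variance with divisor n
        double m = mean(), s = 0.0;
        for (double x : percentages) s += (x - m) * (x - m);
        return s / percentages.size();
    }
};

// Gaussian density with mean eta and variance sigma2.
double gaussian(double x, double eta, double sigma2) {
    const double PI = 3.14159265358979323846;
    return std::exp(-(x - eta) * (x - eta) / (2.0 * sigma2)) / std::sqrt(2.0 * PI * sigma2);
}

// Boundary B between two adjacent levels: abscissa of the intersection of the two
// densities, searched here by a simple scan between the two means (cf. B2 = 67% in Section 6).
double levelBoundary(double etaHigh, double sigma2High,   // easier level (higher mean)
                     double etaLow,  double sigma2Low) {  // harder level (lower mean)
    double best = etaLow, bestDiff = 1e9;
    for (double x = etaLow; x <= etaHigh; x += 0.001) {
        double diff = std::fabs(gaussian(x, etaHigh, sigma2High) - gaussian(x, etaLow, sigma2Low));
        if (diff < bestDiff) { bestDiff = diff; best = x; }
    }
    return best;
}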
4.1.2. Query-level association

For each query inserted, the system carries out the random experiment described above. Let x̂ be the percentage of students who have correctly answered the query. The level to associate with the query is the level j such that f_X^j(x̂) is maximum. As an example, let us assume that x̂ = 80%. Then, with reference to the density functions shown in Fig. 8, the following values are computed: f_X^3(x̂) = 0.02, f_X^2(x̂) = 0.11 and f_X^1(x̂) = 2.71. It follows that the level to associate with the query is level 1.
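The query-level association thus amounts to choosing the level whose density is largest at the observed percentage x̂. A short C++ sketch follows, reusing the gaussian() function of the previous fragment; the names are ours.

#include <cstddef>
#include <vector>

// Parameters (eta, sigma^2) of the Gaussian density f_X^j for each level j = 1..K.
struct LevelDensity { double eta; double sigma2; };

// Defined in the previous sketch.
double gaussian(double x, double eta, double sigma2);

// Query-level association (Section 4.1.2): the level whose density is largest at the
// observed percentage xHat of correct answers.
int associateLevel(double xHat, const std::vector<LevelDensity>& levels) {
    int bestLevel = 1;
    double bestDensity = -1.0;
    for (std::size_t j = 0; j < levels.size(); ++j) {
        double f = gaussian(xHat, levels[j].eta, levels[j].sigma2);
        if (f > bestDensity) { bestDensity = f; bestLevel = static_cast<int>(j) + 1; }
    }
    return bestLevel;   // e.g. xHat = 0.80 gives level 1 with the densities of Fig. 8
}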

4.1.3. Score generation

The exam consists of a number Q of queries set by the examiner. The score max_sc associated with the highest level is equal to the ratio between the maximum mark possible in the exam and the number Q. The score sc_j associated with a generic level j can either be directly defined by the examiner, or automatically calculated by the system as follows:

sc_j = max_sc · j / K

where K is the maximum difficulty level. In both cases, the association of the scores with the possible answers to the queries follows these rules.

(i) Non-structured: single answer query. The score associated with the correct alternative, i.e. the one corresponding to the correct answer, is equal to the score associated with the level which the query belongs to; the score associated with the other alternatives is zero.

(ii) Non-structured: multiple answer query. The score associated with each correct alternative is equal to the ratio between the score associated with the level which the query belongs to and the number of possible correct alternatives. The score associated with any incorrect alternative is the negative of the ratio between the score associated with the level which the query belongs to and the number of possible incorrect alternatives.

(iii) Structured query. The score associated with each subquery is equal to the ratio between the score associated with the level which the query belongs to and the number of subqueries. If the subquery is non-structured, the previous rules (i) and (ii) are applied. If the subquery is with insertion, the score associated with the correct answer is the score associated with the subquery; if the answer is not correct the score is zero.

(iv) Applicative query. The score associated with each feature is equal to the product of the score associated with the level which the query belongs to and the weighting factor associated with the feature.
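The score generation rules translate directly into a couple of small functions. The C++ sketch below follows the formula sc_j = max_sc · j/K and rules (i) and (ii); the function names are ours. For instance, with a maximum exam mark of 30 and Q = 6 queries, max_sc = 5, so a level-3 query with K = 3 is worth 5, as in Section 6.

// Score associated with level j (Section 4.1.3): sc_j = max_sc * j / K,
// where max_sc is the maximum exam mark divided by the number Q of queries.
double levelScore(double maxExamMark, int Q, int j, int K) {
    double maxSc = maxExamMark / Q;
    return maxSc * static_cast<double>(j) / K;
}

// Scores of the alternatives of a non-structured query with level score levelSc:
// single answer: the correct alternative gets levelSc, the others get zero;
// multiple answer: each correct alternative gets levelSc / numCorrect and each
// incorrect one gets -levelSc / numIncorrect.
double alternativeScore(bool correct, bool multipleAnswer, double levelSc,
                        int numCorrect, int numIncorrect) {
    if (!multipleAnswer) return correct ? levelSc : 0.0;
    return correct ?  levelSc / numCorrect
                   : -levelSc / numIncorrect;
}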

4.2. Exam module

The automatic exam consists of determining the initial level, selecting the queries and reporting how the exam has gone.


4.2.1. Determining the initial level

The initial level is determined by a pre-examination administered with a SAT procedure. First, the automatic examiner requests the examinee to choose a difficulty level. Then, it randomly selects a query within that level. The examinee answers the query, receives feedback about whether the query was answered correctly or incorrectly, and then chooses the difficulty level for the next query. Finally, when the number of queries matches a number fixed by the examiner, the automatic examiner starts the next phase. If the examinee answered at least one query correctly, then the difficulty level is fixed to the highest level among the levels associated with the correctly answered queries; otherwise, the difficulty level is fixed to the level immediately lower than (possibly equal to) the lowest difficulty level selected by the examinee.

4.2.2. Selecting the queries

After determining the initial level, a CAT procedure executes Q cycles and in each cycle poses a query. The following actions take place in each cycle:

(i) presentation of a query, called the main query;

(ii) assessment of the answer to the main query, based on the score associated with the queries (see Section 4.1.3 `Score generation') with the following specifications: (a) in the case of a multiple answer query, if the mark is negative it is assumed to be zero; (b) if the student answers ``I don't know'' (where applicable), a new query is presented, called the reserve query, on the same subject and at the same level as the main query; the assessment of the answer to the reserve query follows the above rules and the effect is the same as if the main query had not been posed, except that the maximum possible mark is equal to half the maximum score for the level (a second answer ``I don't know'' is considered an incorrect answer);

(iii) determining the level of difficulty of the next query, on the basis of the following Pascal-like algorithm:
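The Pascal-like listing referred to above is not reproduced in this extraction. Purely as an illustration of the kind of rule it describes, the following C++ fragment gives one hypothetical update policy consistent with the variables explained after it (new_lev, curr_lev, prev_mark, curr_mark, prev_sc, curr_sc); the thresholds are our guesses, not the authors' algorithm.

#include <algorithm>

// Hypothetical reconstruction of the level-update rule (the original Pascal-like
// listing is not reproduced here). K is the maximum difficulty level;
// curr_mark/curr_sc and prev_mark/prev_sc are the fractions of the maximum score
// obtained in the current and previous query.
int nextLevel(int curr_lev, double curr_mark, double curr_sc,
              double prev_mark, double prev_sc, int K) {
    double currRatio = curr_sc > 0.0 ? curr_mark / curr_sc : 0.0;
    double prevRatio = prev_sc > 0.0 ? prev_mark / prev_sc : 0.0;
    int new_lev;
    if (currRatio >= 1.0 && prevRatio >= 1.0)
        new_lev = K;                       // two consecutive full marks: jump to the top level
    else if (currRatio >= 0.5)
        new_lev = curr_lev + 1;            // good answer: raise the difficulty (guessed threshold)
    else
        new_lev = curr_lev - 1;            // poor answer: lower the difficulty
    return std::max(1, std::min(K, new_lev));
}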


where new_lev and curr_lev are the levels associated with the next and the current query, respectively; prev_mark and curr_mark are the marks gained by the student in the previous and current query, respectively; prev_sc and curr_sc are the maximum scores possible in the previous and current query, respectively.

Before posing a new query, the computational engine verifies whether the student will still be able to get the minimum mark for passing the exam, on the assumption that the remaining queries are all answered correctly; otherwise, the exam is terminated immediately. The following algorithm is used to do this:
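The original listing of this check is likewise not reproduced here. The C++ sketch below is a hypothetical reconstruction based on the variable description that follows; in particular, the best-case assumption that each remaining correct answer raises the level by one, up to the maximum level K, is our reading.

#include <algorithm>

// Hypothetical reconstruction of the early-termination check. q queries have already
// been posed out of Q; gained_v is the mark gained so far; succ_lev is the successor
// of the level of the last query; V is the maximum exam mark (V*3/5 is the passing mark);
// levelScore(j) returns the score of level j.
bool canStillPass(int q, int Q, int succ_lev, int K,
                  double gained_v, double V, double (*levelScore)(int)) {
    double v = gained_v;
    int lev = std::min(succ_lev, K);
    for (int i = q; i < Q; ++i) {
        v += levelScore(lev);              // assume all remaining queries are answered correctly
        lev = std::min(lev + 1, K);        // best case: the level keeps rising up to K
    }
    return v >= V * 3.0 / 5.0;             // otherwise the exam is terminated immediately
}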

where q is the number of queries already given, succ_lev is the successor of the level associated with the last of the q queries, v is the mark the student would gain by answering all the remaining queries correctly, gained_v is the mark gained by the student after q queries, and V is the maximum mark possible in the exam (V*3/5 is, therefore, the minimum mark needed to pass the exam).

At the end of the exam the automatic examiner shows the partial and final results. For each query given, the student's answer, the correct answer and the mark are reported. The final mark is obtained by adding up the marks for the individual queries and, possibly, rounding to the closest integer. During the exam, the marks for the individual queries and the final mark are recorded. The automatic examiner can utilise these marks to improve the reliability of the estimates of the mean η_X and the variance σ²_X of the random variable X introduced in Section 4.1.1 `Level–percentage association'. The improvement may produce a change of the difficulty level associated with the queries. Finally, the automatic examiner can show a collection of statistics at the examiner's request.

4.3. Analysis module

A query is characterised by easiness and selectivity.


Let us consider, among all the students who have taken the exam, the ns students to whom the query whose easiness and selectivity we want to determine was given. The easiness of a query can be measured using the index Ie, defined as:

Ie = t / (ns · max)

where t is the sum of the marks gained in the query by the ns students, and max is the maximum mark obtainable by correctly answering the query (Gattullo, 1988). The Ie index varies in the real range [0,1]: the lower extreme corresponds to the case in which all the students have got zero, while the higher extreme corresponds to the case in which all the students have got the maximum mark. Typically, indexes between 0.75 and 1 indicate easiness; indexes between 0.25 and 0.75 indicate average difficulty; and, finally, indexes lower than 0.25 indicate accentuated difficulty.

To measure the selectivity of a query, the ns students are sorted according to increasing values of the final mark gained in the exam (students who got the same mark may be listed in alphabetical order). Two distinct groups B (best students) and W (worst students) of an equal number of students are formed from the ns students. Typically, groups B and W consist of the last and first s = ⌊ns/2⌋ students, respectively (⌊⌋ stands for the integer part). We therefore use the Ibw index, defined as:

Ibw = (b − w) / (s · max)

where b is the sum of the marks gained by the s best students and w is the sum of the marks gained by the s worst students (Gattullo, 1988). The Ibw index varies in the real range [−1,1]: negative values obviously correspond to misplaced or mistaken queries. Typically, indexes between 0.4 and 1 indicate good selectivity; indexes between 0.2 and 0.4 indicate acceptable selectivity; and indexes between 0 and 0.2 indicate insufficient selectivity. The characteristics of easiness and selectivity are correlated: average difficulty generally corresponds to good selectivity (Barrie, 1975).

The indexes Ie and Ibw may be used by the examiner to verify the soundness of the queries contained in the data base. If a query turns out to be extremely difficult and negatively selective, then it was answered correctly only by a small number of students, mainly among the worst ones. It is reasonable to deduce that the difficulty of the query is due to its wrong formulation, which has made the query difficult to understand. The examiner may try to modify the formulation of the query and analyse the indexes Ie and Ibw again after a new examination. The combination of the values of Ie and Ibw for a query may provide the examiner with useful feedback for improving the quality of the formulation of the queries stored in the data base.

Simple qualitative and quantitative measurements of the easiness and selectivity can be obtained by using particular histograms. We divide the students into groups so that the students who have received the same final mark are in the same group. For each group, we then calculate the percentage of students who have correctly answered the query. Finally, we plot the histogram shown in Fig. 9. The histogram gives a clear qualitative measurement of the easiness and selectivity of the query. The easiness is in fact related to the area of the histogram: the easier a query is, the greater the percentage of students who answer correctly and the greater the area of the histogram.


Fig. 9. Histogram of a generic query.

The selectivity, on the other hand, is related to the difference in height between the two bars which correspond to the percentages of two contiguous groups: the greater this difference is, the greater the discrimination made by the query between the two groups.

A more selective measurement of the easiness and selectivity using a histogram can be obtained by means of the E and S indexes of easiness and selectivity, respectively. They are defined as:

E = (sum of p_i for i = m1, ..., m2) / (m2 − m1)

S = (p_b − p_w) / (m2 − m1)

where m1 and m2 are, respectively, the lower and upper marks of the interval of marks considered, p_i indicates the percentage of students who correctly answered the query and got the final mark i, and p_b and p_w indicate the percentage of students who got the best and the worst final mark, respectively, in the interval [m1, m2] (Barrie, 1975). The automatic examiner allows the human examiner to choose the interval [m1, m2] and verify the easiness and selectivity of a query for several ranges of final marks. The E and S indexes may be used like the Ie and Ibw indexes to improve the quality of the queries in the data base. The human examiner can then reformulate any queries which may have been misunderstood.
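The Ie and Ibw indexes translate directly into code; the E and S indexes are computed analogously over the chosen interval [m1, m2]. The following C++ helpers are a straightforward rendering with a data layout of our own choosing; with the data of Table 2 and max = 5 they reproduce Ie = 0.35 and Ibw = 0.3.

#include <algorithm>
#include <cstddef>
#include <vector>

// One examinee's result on a given query.
struct QueryResult {
    double markInQuery;   // mark gained in the query
    int finalMark;        // final mark gained in the whole exam
};

// Easiness index Ie = t / (ns * max): t is the sum of the marks gained in the query
// by the ns students, max is the maximum mark obtainable on the query.
double easinessIndex(const std::vector<QueryResult>& results, double maxMark) {
    if (results.empty()) return 0.0;
    double t = 0.0;
    for (const auto& r : results) t += r.markInQuery;
    return t / (results.size() * maxMark);
}

// Selectivity index Ibw = (b - w) / (s * max): the results are sorted by final mark,
// b and w are the sums of the query marks of the s = floor(ns/2) best and worst students.
// Ties among equal final marks may be broken arbitrarily (the paper suggests alphabetical order).
double selectivityIndex(std::vector<QueryResult> results, double maxMark) {
    std::size_t s = results.size() / 2;
    if (s == 0) return 0.0;
    std::sort(results.begin(), results.end(),
              [](const QueryResult& x, const QueryResult& y) { return x.finalMark < y.finalMark; });
    double w = 0.0, b = 0.0;
    for (std::size_t i = 0; i < s; ++i) w += results[i].markInQuery;                                // worst s
    for (std::size_t i = results.size() - s; i < results.size(); ++i) b += results[i].markInQuery;  // best s
    return (b - w) / (s * maxMark);
}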


5. User interface

The user interface enables communication between the tool and two different types of users: the examiner and the student.

5.1. Examiner interface

The examiner interacts with the tool during the insertion of the queries, the determination of the levels associated with the queries, and the analysis of the partial and final results of the exams. The insertion of the queries is driven by query patterns which help the input of the three available types of queries: non-structured, structured and applicative queries. Determining the levels associated with the queries requires a continuous interaction between the examiner and the tool: the examiner must choose a new query of the pertinent level until the level–percentage association is completed (see Section 4.1.1 `Level–percentage association'). The analysis of the partial and final results of the exams uses a graphical environment to show the histograms described in Section 4.3 `Analysis module'.

5.2. Student interface

During the exam, the queries are presented to the student in a form which depends on the type of query. When a non-structured query is posed, all the alternatives are visualised simultaneously and the cursor is placed on the first alternative. The student can use either the mouse or the directional arrows to move between the alternatives, the TAB key to choose the selected answer, and the return key to confirm the choice(s). Once confirmation has been given, the system presents the next query.

When a structured query is posed, the introductory section and the first subquery of the interrogative section are visualised. If the subquery is non-structured, the above procedure is followed. In the case of a subquery with insertion, students have two possibilities:

(i) insert the numerical value or the sequence of words which they think are correct, and then confirm with the return key;

(ii) hit the return key if they do not know the answer.

In both cases, the system presents the next subquery. The introductory section stays visualised during the presentation of all the subqueries.

When an applicative query is posed, the specification of the problem to solve is visualised. Students interact with an integrated environment, which allows them to write, compile, execute and debug the program. Students have two buttons available: one allows them to indicate that the solution is ready; the other allows them to skip the current query and pass on to the next one.


6. An automatic examiner

We built an automatic examiner to carry out automatic exams in relation to the programming language Pascal. We formed the sample groups for the automatic determination of the levels and the scores by selecting the students on the basis of the average mark gained in their previous exams. We split the range between the maximum and the minimum possible average into six subranges and formed each sample group by taking a student from every subrange.

We used three levels of difficulty and fixed the thresholds of the variance σ²_x̄ of the values x̄_1, ..., x̄_n and of the variance σ²_s² of the values s²_1, ..., s²_n to 0.002 and 0.0002, respectively. Indeed, taking a sample with n values of X is equivalent to choosing a value x̄_n (s²_n) which, in most cases, lies in an interval centred around η_X (σ²_X) and about 2–3 σ_x̄ (2–3 σ_s²) wide. Therefore, these threshold values seem to be a reasonable compromise between the search for a reliable estimate and the need to limit the number of experiments. Moreover, to avoid considering meaningless values of η_X and σ²_X, we carried out at least four repetitions of the experiment before verifying σ²_x̄ and σ²_s². We needed eight queries for the first level, six for the second level and six for the third level. The average values and the variances for every level are summarised in Table 1.

Table 1
Average values and variances for the three levels

          Average value   Variance
Level 1   0.85            0.017
Level 2   0.51            0.012
Level 3   0.23            0.012

The percentage intervals are shown in Fig. 8, where B0 = 0%, B1 = 37%, B2 = 67% and B3 = 100%. Finally, we decided that the exam consisted of six queries. The scores associated with the levels and the queries were automatically calculated by the system.

The automatic examiner is currently being used in some university courses. More than three hundred automatic exams have been performed. The Ie and Ibw indexes were calculated for all the queries presented in the exams, and the formulation of a few queries was modified. As an example, let us consider the structured query of level 3 in Fig. 4. This query was presented 20 times and the results are displayed in Table 2 ((*) and (**) indicate that the mark was obtained by correctly answering only the first or the second subquery, respectively). We obtained the following values for the Ie and Ibw indexes of easiness and selectivity, respectively:

Ie = 35 / (20 · 5) = 0.35    Ibw = (25 − 10) / (10 · 5) = 0.3

The histogram shown in Fig. 10 was plotted for the query. Finally, to verify the validity of the automatic exams, we asked a group of one hundred students to take two examinations, carried out by the human and the automatic examiner, respectively. In 70% of the cases the final marks obtained by the candidate in the two exams differed by one point, in 90% of the cases by less than three points, and only in 3% of the cases by more than three points (the marks were expressed out of 30).


Table 2
Results obtained by presenting the query in Fig. 4

Mark gained in the query   Final mark
0                          18
0                          18
0                          19
0                          20
0                          20
0                          20
2.5(*)                     22
0                          22
2.5(**)                    23
5                          24
0                          24
0                          25
0                          25
2.5(*)                     26
2.5(*)                     26
2.5(*)                     27
2.5(**)                    27
5                          28
5                          30
5                          30

7. Conclusions

We have presented a tool to build automatic examiners which take the place of the teacher during oral/practical Italian academic exams in technical/scientific subjects. The automatic examiners exploit the advantages of both SATs and CATs. The levels of difficulty are automatically associated with the queries on the basis of some initial statistics. These levels may then be adjusted after each exam based on the marks gained by the examinees. In addition, a score is automatically assigned to levels and queries.

The tool was developed using the C++ programming language under the Windows operating system. We utilised it to build an automatic examiner which carries out automatic exams in relation to Pascal. This system is currently being used experimentally in some university courses. More than three hundred automatic exams have been performed: their reliability and validity have been very satisfactory.

Fig. 10. Histograms of the subqueries of the query in Fig. 4.


References

Barrie, H. (1975). Introduzione alle Tecniche di Valutazione. Bologna: Zanichelli.
Berry, R. E., & Meekings, B. A. E. (1985). A style analysis of C programs. Communications of the ACM, 28(1), 80-88.
Gattullo, M. (1988). Didattica e Docimologia - Misurazione e Valutazione nella Scuola. Roma: Armando.
Gregory, K. (1996). Special Edition Using Visual C++ 4.2. Que Corporation.
Hung, S., Kwok, L., & Chan, R. (1993). Automatic programming assessment. Computers and Education, 20(2), 183-190.
Jackson, D. (1991). Using software tools to automate the assessment of student programs. Computers and Education, 17(2), 133-143.
Jackson, D. (1996). A software system for grading student computer programs. Computers and Education, 27(3/4), 171-180.
Jensen, K., & Wirth, N. (1985). Pascal User Manual and Report (3rd ed.). New York: Springer Verlag.
Lake, A., & Cook, C. (1990). STYLE: an automated program style analyser for Pascal. SIGCSE Bulletin, 22(3), 29-33.
Marca, D. (1981). Some Pascal style guidelines. SIGPLAN Notices, 16(4), 70-80.
McCabe, T. A. (1976). A complexity measure. IEEE Transactions on Software Engineering, 2(4), 308-320.
Ponsoda, V., Wise, S. L., Olea, J., & Revuelta, J. (1997). An investigation of self-adapted testing in a Spanish high school population. Educational and Psychological Measurement, 57(2), 211-221.
Rees, M. J. (1982). Automatic assessment aids for Pascal programs. SIGPLAN Notices, 17, 33-42.
Rocklin, T. R. (1994). Self-adapted testing. Applied Measurement in Education, 7(1), 3-14.
Rocklin, T. R., & O'Donnel, A. M. (1989). Self-adapted testing: a performance-improving variant of computerized adaptive testing. Journal of Educational Psychology, 79, 315-319.
Sheppard, M. (1988). A critique of cyclomatic complexity as a software metric. IEE Software Engineering Journal, 3, 30-36.
Thoennessen, M., & Harrison, M. J. (1996). Computer-assisted assignments in a large physics class. Computers and Education, 27(2), 141-147.
Vispoel, W. P., Rocklin, T. R., & Wang, T. (1994). Individual differences and test administration procedures: a comparison of fixed-item, computerized-adaptive, and self-adapted testing. Applied Measurement in Education, 7(1), 53-79.
Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.
Wilks, S. S. (1962). Mathematical Statistics. New York: John Wiley and Sons.
Zadeh, L. A. (1973). Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(1), 28-44.