Diagnosys—a knowledge-based diagnostic test of basic mathematical skills


Computers & Education, Vol. 28, No. 2, pp. 113-131, 1997
© 1997 Elsevier Science Ltd. All rights reserved. Printed in Great Britain
0360-1315/97 $17.00 + 0.00
PII: S0360-1315(97)00001-8


JOHN APPLEBYt’, PETER SAMUELS’ and TAMSIN TREASURE-JONES* ‘Department

of Engineering Mathematics, University of Newcastle upon Tyne, Newcastle upon Tyne NE1 7RU, England [e-mail:[email protected]] *Computer Based Learning Unit, University of Leeds, Leeds, England

(Received 1 July 1995; accepted 1 January 1997)

Abstract-Diagnosys is a knowledge-based computer diagnostic test of basic mathematical skills. It was initially developed for university entry level students in engineering but is widely applicable to other student groups and educational institutions. The need for diagnostic testing, the advantages of a computer-based test and the opportunity to produce one provided by the TLTP programme are described. The major development strands of the Test are identified and outlined. The actual use of the Test and the way this affected development is described. Diagnosys can also be used as a shell for producing tests in other subjects. A test of basic mechanics based on the Diagnosys shell is described. The issues involved in developing tests in other areas of cognitive skill and other extensions to the Test are discussed. © 1997 Elsevier Science Ltd

1. INTRODUCTION

1.1. The need for diagnostic testing

Recent reports from the Engineering Council [1] and the Institute of Mathematics and its Applications [2] have highlighted the increasing variability in the mathematical background of university entry engineering students and the lowering of mathematical standards, especially in the areas of algebra, trigonometry and calculus. At Newcastle University, there has been concern about increased drop-out rates from engineering degree courses. These trends have increased the need for diagnostic mathematics testing at university entry for selection, streaming and identification of "at risk" students. Diagnostic test marks can also be used as a predictor for future examination performance and help to estimate the level of unusual (e.g. rare foreign) qualifications [3]. The situation is not limited to university engineering degrees. Another recent report, jointly from the London Mathematical Society, the Institute of Mathematics and its Applications and the Royal Statistical Society [4], highlights concerns amongst those also teaching mathematics and science, and, by implication, future mathematics teachers.

The main purposes of diagnostic testing, therefore, are:

1. to provide students with immediate feedback on their ability as it relates to a course;
2. to enable instructors to identify "at risk" students or stream students according to performance; and
3. to enable instructors to identify common areas of difficulty within a group.

The results of diagnostic testing are used within the wider educational context to provide appropriate support to students. Often this takes the form of providing them with opportunities to "level up" their knowledge to that required at the beginning of a degree course.

1.2. Motivation for a computer-based test


There are three different ways to administer a diagnostic test: a hand-marked paper-based test, an optically-marked paper-based test, or a computer-based test. A comparison of these three types is shown in Table 1.

Table 1. Comparison of performance of different diagnostic test types

Feature                 Hand-marked paper-based    Optically-marked paper-based    Computer-based
Answer type             Any                        Multiple choice                 Maths interface
Order                   Any                        Any                             Facility
Marking                 Slow                       Fast                            Fast
Accuracy                Low                        High                            Exact
Inhomogeneous groups    Too long/inappropriate     Too long/inappropriate          Expert system

Hand-marked paper-based tests have the following advantages:

- they allow students to enter their own written answers;
- students may answer questions in any order.

The disadvantages of paper-based tests are:

- they are slow to mark, causing a bottleneck in giving feedback for large groups;
- statistical analysis of paper-based tests requires data entry, causing a further delay in giving instructors feedback;
- both marking questions on paper and data entry into a computer introduce the danger of making errors;
- with inhomogeneous groups, a pen-and-paper test with a fixed set of questions is likely either to be too simple for stronger students, or demoralizing to weaker students, or unacceptably long if it has a wide range of difficulty of questions.

Optically-marked paper-based tests have the advantages of minimal errors, fast processing and unrestricted answer ordering. Their disadvantages are that they allow only for multiple choice questions and have the same problems when handling inhomogeneous groups.

A computer-based test can potentially allow for a variety of answer input types, including a mathematics interface. There are no marking errors provided the users are not penalized for syntax errors and the possibility of semantic errors is minimized. The use of a knowledge-based system can overcome the problem of testing an inhomogeneous group. Test processing is fast, both for students and instructors. A facility for answering questions in any order can also be provided.

In view of these comparisons, it was felt that computer-based diagnostic tests offered a significant potential advantage over paper-based tests provided a knowledge base could be programmed successfully. It was also felt that providing a mathematics interface was a significant improvement over an optically-marked multiple choice test as long as syntax and semantic errors were minimized. Further to the advantages of a computer-based diagnostic test in speed, accuracy and appropriateness identified above, it was decided that an acceptable computer-based test should be no less efficient in terms of overall staff cost, discounting any development cost. The opportunity for developing a test was provided by the TLTP programme, as described below.

1.3. The opportunity: the TLTP programme

Phase I of the Teaching and Learning Technology Programme (TLTP) was established by the U.K. Universities Funding Council (UFC) in 1992. Its aim was to "make teaching and learning more productive and efficient by harnessing modern technology" [5]. £7.5 million was allocated to the Programme for 3 years and universities were invited to submit bids. A successful bid was made to TLTP by the five Universities in the North East of England (Durham University, Teesside University, University of Newcastle, University of Northumbria at Newcastle and Sunderland University) under the title of "co-ordinated development and evaluation of courseware for basic mathematical skills". The main deliverable of this bid was the Diagnosys system, which was made available to all Higher Education institutions under the TLTP agreement in 1995. TLTP is currently running a second phase programme and is now itself run by the Higher Education funding councils of England, Scotland, Wales and Northern Ireland. Their support is acknowledged.

2. REQUIREMENTS ANALYSIS

In order to fulfil the purpose of a computer-based diagnostic test already identified, the following initial system requirements were identified:

1. the Test should assess a student's maths knowledge accurately in less than one hour;
2. it should provide students with immediate relevant feedback on performance with a view to brushing up on weak or rusty areas;


3. it should provide tutors with quick feedback on individual students and summary data for a whole group with a view to identifying "at risk" students and general group weaknesses; and
4. it should be applicable to university entry students, i.e. students taking any maths-related discipline at foundation year or first year level.

An additional requirement identified during the development process was that, in order for the Test to be useful to other institutions, it should be adaptable by an instructor, both in terms of its mode of use (such as what "administrative" questions to ask, how many second chances to allow and what sort of profile to produce at the end) and content (such as altering the text and answer type of individual questions).

Before a system architecture was identified, further fundamental design decisions were made:

1. It was decided to use a skills approach in order to identify different areas of knowledge.
2. The skills were organized into a hierarchy in order that an expert system could make inferences on student knowledge from the answers previously given and select the most appropriate question to ask next. This reduces the number of questions asked for a variable ability group.
3. It was decided to use a maths interface and a variety of different question types in order to encourage students to think about questions and generate their own answer, rather than guess.
4. A simple initial profile of student ability was provided by assigning levels to the skills in order to target initial questions according to previous qualifications (or attempted qualification).
5. Student response data (administrative, assessed answers and actual question responses) would be kept both for improvements of the system and for educational development.

The requirements analysis and basic decisions made above led to the identification of the following main development strands:

The skills network. This consists of specifying the skills, assigning them each a level, and defining the links between them.
Question design. This consists of writing the content of and defining the presentation of questions associated with skills and selecting their answer type.
The test shell. This covers the development of the overall test management system, including the design of the interface, assessment of answers and providing feedback for students.
The expert system. This handles the generation of the initial student profile, making inferences from student answers and selecting the next question.
The maths interface. The parsing of mathematical answers and application of various evaluation criteria.
The utility programs. These generate feedback for tutors on student test performance and group performance on individual skills.
Follow-up material. The production of supporting learning material based on the content of the test as defined by the skills and questions.

Each of these strands is described in the following sections, with a systematic set of guidelines for its production. A historical account of what actually occurred in the development of the Diagnosys system is also given, where appropriate.

3. THE SKILLS NETWORK

3.1. Description

A skills network is the combination of named individual skills, a specified level for each skill and specified directed links between skills which join them together into a hierarchy. The basic idea of the skills hierarchy is that of prerequisite associations defined by Gagné [6]. Skill A is linked forwards to skill B when skill A is a prerequisite to skill B, meaning skill A must be mastered before skill B can be attempted. An example, showing part of the network from Diagnosys v2.0, is shown in Fig. 1. Skill levels are used in the initial student profile and are described in more detail in Section 6. Only a proportion of the prerequisite associations between skills are made into links in the skills network. This is because of the way the expert system works on the skills network and is described in more detail in Section 6.


[Figure: part of the skills network, showing skills 105 (Decimal places) and 106 (+ve powers - defn) at Level 1; 202 (Significant figures), 203 (-ve powers - defn) and 204 (+ve powers - rules) at Level 2; and 302 (Scientific notation) and 304 (-ve powers - rules) at Level 3, joined by directed prerequisite links.]

Fig. 1. Part of the skills network for Diagnosys v2.0.
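For readers who wish to experiment with the ideas in this section, the Fig. 1 fragment can be written down as a small data structure. The sketch below is illustrative only: the real system stores its network in plain text files (Section 5) and is written in Prolog, and only the links that can be read off the worked examples in Section 6 are encoded here.

```python
# Skill number -> (name, level); numbers and names as in Fig. 1.
SKILLS = {
    105: ("Decimal places",      1),
    106: ("+ve powers - defn",   1),
    202: ("Significant figures", 2),
    203: ("-ve powers - defn",   2),
    204: ("+ve powers - rules",  2),
    302: ("Scientific notation", 3),
    304: ("-ve powers - rules",  3),
}

# Directed prerequisite links, A -> B meaning "A must be mastered before B".
# Only the links confirmed by the worked examples in Sections 6.1-6.2 are listed;
# the full figure contains further links (e.g. into skills 204 and 304).
LINKS = {
    105: {202},
    106: {203},
    202: {302},
    203: {302},
}
```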

3.2. Guidelines

In order to produce a new skills network, the following guidelines are suggested. As in the development of the current mathematics test, these guidelines have been written bearing in mind the need to produce a working test in a reasonably short space of time, along with the sort of constraints other developers would probably face when working in an educational institution, i.e. only a limited number of pilot versions would be feasible. It is therefore assumed that the first major trial of a linked network will take place with a large group of students taking the test "for real". All the trial groups should be as inhomogeneous as possible in order to evaluate the performance of questions with a variety of difficulties. The breadth and depth of content of the test may be increased during the production process. It is advised that the most inter-related areas are tackled first (such as pre-calculus algebra for the current mathematics test).

Paper-based trial. Classify the knowledge you want to test into (maybe between five and fifteen) skill areas. Design a paper-based test with the intention of adequately testing each of the skill areas with a suitable number of questions. Use it with a trial group. The data derived can then be used in a variety of ways.

The first thing to do is to attempt to identify individual skills within each of the skill areas and assess whether each question tests knowledge on a single skill (the validity of the individual skills model is discussed in Section 10.2). This can be done only by educational judgement. The authors suggest it is better to set "natural" questions within a skill area first before attempting to identify and test individual skills. It is also advised not to pose simple "conceptual" questions, as students often fail to do these correctly but can do questions which are "based" on these concepts (implying that the former are not prerequisite to the latter).

The similarity of questions can be measured by comparing their relative frequencies. The relative frequencies of two questions A and B are defined in Table 2.

Table 2. Relative frequency table for two questions A and B (a capital letter denotes the question answered correctly, a lower-case letter incorrectly)

            A right    A wrong
B right     f_AB       f_aB
B wrong     f_Ab       f_ab

If

f_AB + f_ab >> f_Ab + f_aB

then the questions are probably too similar.

A skill area may not be adequately tested by the questions chosen. This is indicated by a poor correlation between performance on a skill area in the test with some other performance indicator for


the skill area, such as an exam. If this is the case, new questions need to be introduced to attempt to cover the relevant skill areas.
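The cross-frequencies of Table 2 are simple to compute from per-student scores on a pair of questions. The following sketch is illustrative (hypothetical helper names, with an arbitrary factor standing in for the "much greater than" comparisons used here and in the unlinked-trial guidelines below); it shows both uses: flagging questions that are probably too similar, and suggesting a possible forward link from A to B.

```python
from collections import Counter

def cross_frequencies(scores_a, scores_b):
    """Table 2 cell counts from per-student 0/1 scores on questions A and B.
    Keys: capital letter = answered correctly, lower case = answered wrongly."""
    cells = Counter()
    for a, b in zip(scores_a, scores_b):
        cells[("A" if a else "a") + ("B" if b else "b")] += 1
    return cells

def too_similar(cells, factor=5):
    """Agreement (both right or both wrong) dwarfs disagreement: f_AB + f_ab >> f_Ab + f_aB."""
    return cells["AB"] + cells["ab"] > factor * max(cells["Ab"] + cells["aB"], 1)

def forward_link_plausible(cells, factor=5):
    """Students often have A without B but hardly ever B without A (f_Ab >> f_aB),
    so A could be linked forwards to B, but certainly not backwards."""
    return cells["Ab"] > factor * max(cells["aB"], 1)
```

For example, cross_frequencies([1, 1, 1, 0], [1, 1, 0, 0]) gives f_AB = 2, f_Ab = 1, f_ab = 1 and f_aB = 0.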

Unlinked Diagnosys trial. From the results of one or more paper-based trials, derive a preliminary network of unlinked skills with one question on each skill with a mathematical answer type. Use this network with another trial group.

1. The difficulty of each skill can be estimated by the proportion of students getting the question on it wrong. The relative difficulty of skills enables an initial classification of skills into levels.
2. The student-question value array can be used in a partial correlation analysis to identify potential links. Correct responses are scored with 1 and incorrect responses with 0. Incomplete data can be "filled in", either with a fixed value (such as 0 or 0.5), or by using a statistical model, such as those described in [7]. This analysis will yield partial correlation coefficients between all the skills and can indicate which are best linked. Care should be taken when linking different skill areas as they should satisfy the prerequisite assumption defined above. The direction of the link is determined by a combination of educational judgement, the relative difficulty of the skills (a harder skill should not be linked forwards to an easier skill) and the relative values of the cross-frequencies defined in Table 2: if

f_Ab >> f_aB

then skill A could be linked forwards to skill B, but certainly not backwards.
3. The actual answers given can be analysed to generate distracters for multiple choice or list of choices questions (see Section 7).
4. Analyse the question response data for unexpectedly difficult questions in order to see whether the presentation of the question is causing some unforeseen difficulty. For example, students may be having problems with the mathematics interface (such as entering algebraic fractions) or misinterpreting the question.

Linked Diagnosys trial of full network. From the findings of one or more unlinked trials, produce a linked network with one question on each skill and each skill assigned to a level. The number of links should be chosen carefully in an attempt to trade off between speed and accuracy of the inferences. A rule of thumb is that the total number of links should be the same order of magnitude as the number of skills and no skill should be linked to more than about four other skills in either direction. Different types of question can be introduced in order to simplify and speed up interaction, based on distracters discovered with the previous versions. This test is then used in network trial mode, where students are first asked questions according to their entry level and the expert system's inferences until they complete the test, and then asked questions on all the skills previously inferred. This will produce the same data as for the unlinked test and can be evaluated in the same way. On top of this, the following further analyses can be carried out.

1. Each link can be evaluated by measuring the proportion of correct forward and backward inferences (see Section 6). If the link performs well in only one direction, it can be made into a uni-directional link, although this should be justified educationally.
2. The overall density of links can be evaluated by measuring the proportion of questions the Test asked in order to infer ability on all the skills. A low proportion of questions asked is only desirable if the average inference confidence is suitably high, as inferences have a transitive ("knock-on") effect. The number of links can then be changed by adding or removing those with a threshold partial correlation value.
3. The performance of the entry level model can be assessed for each qualification group assigned to a particular level. Their performance at that level can be measured against that of other qualification groups and their performance at higher and lower levels. This may result in changing the entry level for a particular qualification group or changing the number of skill levels. The level of individual skills should again be assessed by comparing their difficulty with the average difficulty of skills at the same


level.
4. The performance of new answer types can be compared with the performance of mathematics answer types for the same questions on previous versions to ensure that students are cognitively engaging with questions rather than just guessing.

Linked Diagnosys trial. In the light of one or more linked trials with a full network, Diagnosys can be used with a large group of "real life" students in normal mode. Several questions can now be introduced for each skill, either in order to permit second attempts at questions where an incorrect answer had been entered unintentionally, or to vary the test for different students or for re-testing. The average time to complete the test can now be measured and should be an improvement on the previous versions. The breadth and depth of content of the test could therefore be widened at this stage. It is advised that the more complex, inter-related areas are covered in earlier versions as they will require more refinement.

On completion of the test, Diagnosys will assign one of seven different values to each skill for each student (see Section 6):

yes           the student attempted a question on this skill and got it right
no            the student attempted a question on this skill and got it wrong
probably_yes  the student did not attempt a question on this skill but the expert system has inferred that they have this skill from their other answers
probably_no   the student did not attempt a question on this skill but the expert system has inferred that they do not have this skill from their other answers
possibly      the Test was not completed but this was one of the skills Diagnosys was going to test next because the system inferred it was likely that the student had this skill
unassigned    the Test was not completed and this was not one of the skills Diagnosys was planning to test next, but it was on the Test
not_on_test   this skill was not on the version of the Test selected by the student (Diagnosys can offer a variety of different test networks according to the course the student is taking)
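These seven outcomes can be thought of as a small enumeration. The sketch below (Python, for illustration only; the shell itself keeps the status in the student model file described in Section 5.6) simply restates the list above.

```python
from enum import Enum

class SkillStatus(Enum):
    """The seven values Diagnosys can assign to a skill (names follow the list above)."""
    YES = "yes"                    # attempted and answered correctly
    NO = "no"                      # attempted and answered incorrectly
    PROBABLY_YES = "probably_yes"  # not attempted; inferred to be held
    PROBABLY_NO = "probably_no"    # not attempted; inferred to be lacking
    POSSIBLY = "possibly"          # test unfinished; was queued to be asked next
    UNASSIGNED = "unassigned"      # test unfinished; not yet queued
    NOT_ON_TEST = "not_on_test"    # skill not in the network variant selected
```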

This data, and the student response data for each question, can be used to further improve the test:

1. The difficulty of each question can again be estimated by the proportion of students getting it wrong. The inferred data values probably_yes and probably_no can either be ignored, included on an equal footing with yes and no, or included with a different weighting. Diagnosys chooses different questions on the same skill at random, so the relative frequency of answers should be comparable. The relative difficulty of different questions on the same skill can be measured. Any remotely significant difference (e.g. P = 0.2) should then be investigated.
2. A partial correlation analysis can again be carried out on the student-skill answer array. This can only be achieved by first filling in the missing and inferred elements with an estimated value in order to make the data rectangular. Various approaches have been tried for this, including scoring yes values 1, probably_yes 0.85, probably_no 0.15, no 0 and all other values 0.5, which will be described separately [8]. (N.B. standard statistical techniques are not appropriate because the inferences make the data bi-typed and the positions of the values being filled in are not random.) The partial correlations are again used to suggest which skills should be linked.
3. The relative difficulty of questions can again be compared. The level of individual skills can be changed accordingly. The answer data for unexpectedly difficult questions can again be analysed in detail.
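As a concrete illustration of point 2, the sketch below fills in a student-skill array with the weights quoted above and computes partial correlations. The paper does not state which partial-correlation formulation was used (and explicitly notes that standard techniques are not strictly appropriate here); this sketch uses one common formulation, via the inverse of the correlation matrix, purely to show the shape of the calculation.

```python
import numpy as np

# Fill-in weights quoted in the text; every other status (possibly, unassigned,
# not_on_test or simply missing) is scored 0.5.
FILL = {"yes": 1.0, "probably_yes": 0.85, "probably_no": 0.15, "no": 0.0}

def score_matrix(student_records, skill_ids):
    """student_records: one dict per student mapping skill id -> status string."""
    return np.array([[FILL.get(rec.get(s), 0.5) for s in skill_ids]
                     for rec in student_records])

def partial_correlations(scores):
    """Partial correlation between every pair of skills, controlling for all the
    others (computed here from the pseudo-inverse of the correlation matrix)."""
    corr = np.corrcoef(scores, rowvar=False)
    prec = np.linalg.pinv(corr)
    norm = np.sqrt(np.outer(np.diag(prec), np.diag(prec)))
    pcorr = -prec / norm
    np.fill_diagonal(pcorr, 1.0)
    return pcorr
```

Large positive entries of the resulting matrix indicate skill pairs that are candidates for linking, subject to the educational judgement described above.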

3.3. Development history

The design of the skills network used in Diagnosys was carried out at the University of Newcastle over a number of years with various different versions of the Test. The development did not completely follow the guidelines stated above. Paper-based tests were administered to over 100 students in 1991 and over 500 students in 1992. One or two questions were asked in each of 14 different skill areas. This test performed well when compared with exam performance and was reworked as a network of 45 skills connected by 36 links in version 1.0 of Diagnosys. This was the only version of Diagnosys to have a single uni-directional link (i.e. one


particular inference could be made in only one direction). The skills in this network were also divided into 4 levels. The normal version of Diagnosys v1.0 was used with over 500 students in 1993. Its network performed reasonably well but was widened and completely reworked in Diagnosys v2.0 to one containing 81 skills connected by 79 links (all bidirected), again divided into 4 levels. Diagnosys v2.0 was also used with over 500 students at four of the project institutions in 1994. A detailed analysis of its network and question response data was carried out in order to make further modifications to questions, skills and links along the lines indicated above:

1. The overall connectedness of the network performed reasonably well, with an average of 45% of the questions being asked.
2. The validity of the inferences was tested by a small trial group of students completing the whole network. The average inference confidence was found to be 84% (with no statistically significant difference between forward and backward inferences). Several links were identified as making suspect inferences and changes were made to the network in the following version. Some poorly worded questions were also identified.
3. A partial correlation analysis revealed several suspect questions and links which led to changes in the following version.
4. The response data for several questions with an unexpected difficulty value were analysed in more detail, leading to several more changes in question wording.
5. The entry level model was analysed for all the qualification groups containing more than 10 students. One or two group entry levels were changed accordingly.

These analyses were incorporated into Diagnosys v2.2, the version publicly released in May 1995. This had a network of 91 skills connected by 91 links. This, and versions up to 2.31, have been used in approx. 20 institutions with several thousand students.

4. QUESTION DESIGN


4.1. Description

Once a skill has been defined, one or more questions need to be written to test it and their answer type chosen. An example of a question is given in Fig. 2. Diagnosys also provides the facility for displaying graphs and diagrams.

4.2. Guidelines

Mathematical input questions should generally be reserved for higher levels as it is suspected that using the maths interface successfully is linked to general ability. Variants should be supplied for multiple choice and list of choices questions in order to introduce variety for different students and discourage rote learning and surface interaction. The use of applications is an important issue. There is a significant literature on "word problems" (e.g. see [9]). Purely mathematical questions show a less applicable ability. Applications should be the most common use of a particular skill. Diagrams should not be too complex due to the limited amount of time students have to answer each question and the difficulty of mixing the interpretation of verbal and visual information. Emphasis should first be placed on making them directly informative rather than visually appealing.

204: Expanding one bracket

Expand and collect terms to leave the answer in the simplest possible form.

3x - 2x(4x - 1)

Fig. 2. Example question from Diagnosys v1.0.


The authors discourage the incorporation of extraneous information, as the ability to filter relevant information does not appear to combine well with the ability to perform particular skills. Basic conceptual skills (such as the definition of a vector) and procedural skills (such as handling negative numbers) should not be assumed for intermediate level questions as these also appear not to combine well with linked higher level abilities, leading to less valid inferences.

The skills in the test are really defined only by the questions set on them, so it is important to design questions carefully. The validity of links between skills will depend on the relationship between the questions on the skills as well as any educational prerequisite association (see Section 6.1). For example, it is educationally quite valid to link solving linear equations forward to solving quadratic equations, but some students are able to solve certain simple quadratic equations by factorization, such as

x^2 + 3x + 2 = 0

whilst not being able to solve a question involving recognizing a disguised linear equation, such as

2/x = 1 - 1/x

Also, it may not be feasible to evaluate certain types of answers, such as algebraic factorization (see Section 7).

5. THE DIAGNOSYS SHELL

5.1. Principal components

In order to provide a flexible shell, with an expert system, an interface permitting a wide range of display features and input types, and multiple modes of operation, the Diagnosys system is logically composed of a number of modules, each accessing a range of plain text files. These files contain both technical and subject specific features, text, menus, questions, and the final "model" of the student's knowledge. The main program controls the normal sequence of operation, described in Section 5.2, and other modes (see Section 5.3). The basic sequence is:

- initialize the system and the student model,
- ask questions, updating the model,
- produce output.

This requires the following functions:

Question processing: obtaining question text and graphics, setting input and evaluation parameters.
Inference engine: establishing the initial model, selection of skill to test, finding a question, updating the model and question list.
Interface: displaying text and graphics, including help facility and tutorials, obtaining input (specified character set).
Evaluation of answer: matching, value or parsing of answer string, and evaluation according to several parameters.
Utility: file handling, checking of files etc.

There are different types of text files:

- technical configuration;
- main customization file, with menus for initial model, test areas, skill names, output parameters, and features including lives, time limit etc.;
- text file, with prompts, messages, non-mathematical (non-technical) questions etc.;
- file of mathematical (or other) questions;
- skills network, with levels and links;
- library of symbols for display;
- the student model, created after name and other background information is obtained, and updated at intervals and at the end of the Test.

Diagnosys offers a variety of different answer input and evaluation types. The full list of types available

within the shell is shown in Table 3.

Table 3. Diagnosys input and evaluation types

Input type                  Evaluation type
Integer                     String match
Real                        Numeric or real with tolerance
Pair of reals               Pair of reals
Fraction                    Algebraic
Complex                     Algebraic
String                      String match
Character                   String match
Algebra (<= 2 variables)    Algebraic
Multiple-choice             String match
Hidden multiple-choice      String match
List                        List
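Table 3 amounts to a mapping from input type to evaluation routine. A toy dispatcher is sketched below; the names and the two routines shown are invented for illustration only, and the real shell reads the corresponding parameters from the question file and performs the algebraic evaluation described in Section 7.

```python
# Input type -> evaluation type, mirroring Table 3 (illustrative names).
EVALUATION_TYPE = {
    "integer": "string match",          "real": "numeric with tolerance",
    "pair of reals": "pair of reals",   "fraction": "algebraic",
    "complex": "algebraic",             "string": "string match",
    "character": "string match",        "algebra": "algebraic",
    "multiple choice": "string match",  "hidden multiple choice": "string match",
    "list": "list",
}

def evaluate(input_type, answer, target, tolerance=1e-6):
    """Dispatch a student's answer to the evaluation type named in Table 3."""
    kind = EVALUATION_TYPE[input_type]
    if kind == "string match":
        return str(answer).strip().lower() == str(target).strip().lower()
    if kind == "numeric with tolerance":
        return abs(float(answer) - float(target)) <= tolerance
    raise NotImplementedError(f"'{kind}' evaluation is not sketched here")
```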

5.2. The normal sequence of operation of Diagnosys

Normal operation, when in use by a student, is in three phases. During the initial phase, name and other information is obtained, including group and qualification information that is used to establish both the topic areas to be tested, and the initial level of questions to be set (i.e. the initial, hypothesized, student model) (see Fig. 3).

[Figure: flow from Start, through initial information and tutorials, to the skill list and initial model.]

Fig. 3. Diagnosys shell: initial phase.

The main program loop follows the sequence (see Fig. 4):

- choose the skill to be tested, and find a question to test it,
- display the question,
- obtain an answer,
- evaluate the answer,
- update the student model.

Additional links include the option (for the student) of displaying the correct answer, and of having a second chance at a skill/question. Reference is made during this process to the inference engine and list of rules, to the current student model, and to the list of questions, which includes input and evaluation criteria. The loop finishes when

either there are no skills remaining to be tested (i.e. all have either a definite or an inferred status), or the student quits. The final phase may include asking further non-technical questions and providing additional information, and then will conclude by presenting the results of the test in one of various forms, chosen by the tutor in advance. Post-processing of the results for the group will follow, and is described in Section 8.

5.3. Other modes of operation

The other modes of operation of the Diagnosys system are principally for the use of developers, with the exception of self-test mode, which simply leaves the final profile in readable text form for the student to print out themselves. The other modes are:

edit/testq      to display and test individual screens and questions
correct         to produce a list of all correct answers and parameters
diagnostic      assists in finding errors in the customized text files
evalnetwork     runs the test as normal, saves the student model including inferences, then continues testing all previously inferred skills, finally saving a model with all skills classified definitely. This mode permits an evaluation of the links in the network (see Section 3.2).

5.4. Interface design

The user interface in Diagnosys has undergone extensive design and informal evaluation. Some of the design principles of [10] have been adopted. A consistent interaction style has been used for the assessment phase of the system, with a split windowed interface as shown in Fig. 5.

[Figure: the main program loop as a cycle: choose skill; find and display question; get answer; evaluate answer; with a branch for a second chance.]

Fig. 4. Diagnosys shell: main program loop.


[Figure: the assessment screen layout, comprising the question number and skill title, the question display window, a prompt message, the answer input box, information about the required response, and system messages.]

Fig. 5. Screen design for the assessment phase in Diagnosys.

Whenever possible, the rest of the system adopts a similar screen design, and interaction is limited to pressing the space bar or entering text and pressing the Enter key. Help and quit options are available most of the time. Information on the Test, and keyboard and mathematics tutorials, are given at the beginning. These are designed to take only a few minutes to complete. At the end of the Test, the user is shown a summary of their performance (more extensive feedback is given on paper after the Test is completed). The individual profile for a sample student is shown in Fig. 6.

5.5. Customizable features

Most use of the Diagnosys system so far has utilized the existing network and file of questions, and has involved customization of a number of features of the operation and form of output. The most important is the menu of groups, which controls the topic areas to be tested for this student, and the menu of qualifications, which controls the initial student model and hence the starting level of the questions. Other features easily customized include the time limit, the number of lives permitted for second chances, and the choice of information provided in the final profile. The form of the profile can be a bar chart or a list of skills, and such a list can display any or all of: skills possessed, skills not possessed, skills probably possessed etc. In the subsequent printed form (which may be different), it can also include reference to supporting materials for subsequent study.

All text files except the skills network use a mark-up language to describe text, symbols and formatting, and there is a file of symbol descriptions that can be extended to include subject-specific symbols or small diagrams used frequently (and which may be scaled up or down). The file of mathematics (or other technical) questions uses this mark-up language to describe the question display, and also, for example, the correct answer for display. Each question also specifies the input and evaluation parameters. Questions may have variant parts, which permits several similar questions for provision of second chances, or simply varied tests. An example text for a skill with question variants from Diagnosys v2.3 is shown in Fig. 7 (cf. Fig. 2).

If questions of a very different kind from those used previously are to be implemented, the window sizes and prompts, messages etc. may be changed. Colours and font sizes may also be changed if desired. The same need might arise from translation into other languages. A particular requirement for many countries is the use of "," in place of "." as the decimal point, both in display and input; this can be changed in the test configuration.


INDIVIDUAL PROFILE
Student Name                   : Smith
Current Course (scope of test) : Foundation Year
Qualification                  : A Level Mathematics
Time spent (in minutes)        : 35
Date taken (d m yr)            : 21 3 1995

SKILLS WE THINK YOU HAVE
101 Multiplication of negative numbers
......
208 Factors of algebraic products
210 Expanding one bracket
211 Collecting terms
......

SKILLS YOU APPEAR TO HAVE PROBLEMS WITH
You should work on these skills until you feel confident about them
303 Simplification of fractional powers
.......
406 Solution of quad. by completing the square

SKILLS WE DIDN'T HAVE TIME TO TEST
You should ensure that you are confident about these areas as well.
209 Simple Factorisation
212 Solving linear equations
.......

Number of questions asked: 16

Your tutor will be able to advise you on available resources which you can use when working on the areas above.

Fig. 6. Part of a Diagnosys student profile.

$question;210;Expand one bracket
$input;algebra;$eval;algebra
$charlimit;(,0
$var;a;$correct;2x^(2)-6x^(3);$varend
$var;b;$variables;y;$correct;6y^(2)-3y;$varend
$questionbody
$newline,3.5
Expand the bracket in :
$newline,1
$var;a;2x ( x - 3x;$squared;);$varend
$var;b;3y ( 2y - 1 );$varend
$endquest

Fig. 7. Example skill question text from Diagnosys v2.3.


Other text that might need changing for foreign languages or different purposes includes the names of the skills to be tested, and the wording of headings in the final profile. Questions about name, qualifications etc. can also be changed, and new questions can be constructed. 5.6. The student model jile The file containing the student model is stored as a Prolog external database, and includes the complete list of skills to be tested with their status (e.g. tested, inferred, non on test, etc.), complete response data for everyquestion, time taken number of lives used, and how many used successfully.

5.7. Technical features of the Diagnosys shell

The whole development of Diagnosys from v1.0 to v2.31 has been written in PDC Prolog for DOS.† Prolog has particular advantages for expert system design, for database management (such as question lists, where matching of requirements with availability is achieved with one instruction), and for parsing and evaluation of algebraic answers. Future versions may use mixed language programming.

DOS‡ was chosen because, at the beginning of development in 1993, there were still substantial numbers of students who had not yet met windowed operating systems, and many institutions still had 286 and 386 machines, for which Windows§-based software could be very slow. For this kind of test, rapid familiarization with the interface is essential, dedicated full screen operation is needed, and the principal form of input will inevitably be keyboard rather than mouse. However, future versions will probably be Windows based, as we may assume general familiarity and also the existence of suitable machines.

6. THE EXPERT SYSTEM

6.1. The inference mechanism

The skills network is used to make inferences about associated skills and select relevant questions to ask the user. Part of the skills network for Diagnosys v2.0 is shown in Fig. 1. The meaning of the directed links in a skills network is as follows: if A and B are two skills, then A -> B means that A is a prerequisite to B in the sense of Gagné [6]. The effect of this link is twofold:

1. If the user gets a question on skill B correct then it is inferred that he or she also has skill A.
2. If the user gets a question on skill A wrong then it is inferred that he or she also does not have skill B.

These two rules apply transitively across the network according to the partial ordering given by the links.
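A minimal sketch of these two rules, applied transitively over the prerequisite links, is given below. It is an illustrative reading of the text in Python rather than the Prolog of the actual system; the network fragment is the one confirmed by the examples in this section, the status labels follow Section 3.2, and the rule never to overwrite a directly tested skill is an assumption.

```python
# A -> B means "A is a prerequisite of B" (links confirmed by the examples in this section).
LINKS = {105: {202}, 106: {203}, 202: {302}, 203: {302}}
PREREQS = {}                               # reverse mapping: B -> its direct prerequisites
for pre, posts in LINKS.items():
    for post in posts:
        PREREQS.setdefault(post, set()).add(pre)

def _reachable(start, edges):
    """Every skill reachable from `start` by following `edges` transitively."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def record_and_infer(skill, correct, status):
    """Rule 1: a correct answer implies every (transitive) prerequisite is held.
    Rule 2: a wrong answer implies every (transitive) dependant is lacking."""
    status[skill] = "yes" if correct else "no"
    inferred = _reachable(skill, PREREQS if correct else LINKS)
    label = "probably_yes" if correct else "probably_no"
    for s in inferred:
        if status.get(s) not in ("yes", "no"):   # never overwrite a directly tested skill
            status[s] = label
    return status
```

With this fragment, record_and_infer(302, True, {}) marks skills 202, 203, 105 and 106 as probably_yes, which matches the worked example that follows.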

For example, for the network in Fig. 1, if a user gets a question on skill 302 correct then the user is inferred to have skills 202 and 203 due to direct inferences, but also 105 and 106 due to indirect transitive inferences. A uni-directed link is a special case of the above where only one of the two inferences applies. This method allows the system to reduce significantly the number of questions that are asked compared with a conventional test. For example, as already mentioned, the data from over 500 students who took Diagnosys v2.0 showed that, on average, they answered only 45% of all the questions available on the Test.

There are two possible sources of error in the inferences:

1. The system marks a question incorrectly (e.g. a student guesses correctly at a multiple choice question or has problems entering an algebraic expression).
2. The inferences between skills defined by the links are not valid. This is especially difficult as the links are between questions rather than skills: it may be educationally valid to state that one skill is a prerequisite for another but it may not be true that the ability to do a particular question on one skill is a prerequisite for being able to do a particular question on another skill. Thus both the skills and the questions on the skills need to be looked at closely in order to determine which are best linked.

† PDC Prolog for DOS is a registered trademark of the Prolog Development Center A/S.
‡ DOS is a registered trademark of the Microsoft Corporation.
§ Windows is a registered trademark of the Microsoft Corporation.


In view of this, only some of the prerequisite associations, according to Gagné [6], are made into network links.

6.2. Question selection strategy

At any time during the test, each skill in the network being tested takes one of six probability values: yes, no, probably_yes, probably_no, possibly or unassigned, as described in Section 3.2. These values change according to the inferences made. In basic terms, skills marked possibly are tested before skills marked unassigned (in fact, Diagnosys leaves skills that remain unassigned right up to the end). Roughly speaking, the system will ask the question at the highest level which is marked possibly until the user gets a question wrong. The system adapts to the user's responses by marking the nearest linked skills as possibly and putting them at the top of the stack (note: this action is only local rather than transitive across the network, in contrast to the inference mechanism). For example, for the part of the network shown in Fig. 1, if a student is first asked a question on skill 202 and gets this question right, skill 302 would be marked possibly and moved to the top of the stack. If they got 202 wrong, a question on skill 105 would be asked next. This process of asking easier or harder related questions models the way a lecturer might diagnose the exact level of ability of a student by means of an interactive question-and-answer session in which a sequence of related questions might be posed.
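The selection strategy just described can be sketched as follows. This is an illustrative reading of the text rather than the shell's actual code: the tie-breaking between equally placed skills, and exactly how the initial levels seed the possibly markings (Section 6.3), are assumptions.

```python
def next_skill(status, levels, stack):
    """Prefer the most recently promoted 'possibly' skill (top of the stack), then
    the highest-level skill still marked 'possibly', leaving 'unassigned' skills
    to the end, as described above."""
    while stack:
        s = stack.pop()
        if status.get(s) == "possibly":
            return s
    for wanted in ("possibly", "unassigned"):
        candidates = [s for s, v in status.items() if v == wanted]
        if candidates:
            return max(candidates, key=lambda s: levels[s])
    return None                                   # nothing left: the test is complete

def promote_neighbours(skill, correct, status, links, prereqs, stack):
    """Local (non-transitive) adaptation: after a correct answer promote the skills
    this one leads into; after a wrong answer promote its direct prerequisites."""
    for s in (links if correct else prereqs).get(skill, ()):
        if status.get(s) in ("unassigned", "possibly"):
            status[s] = "possibly"
            stack.append(s)                       # moved to the top of the stack
```

For the Fig. 1 fragment, answering skill 202 correctly promotes 302 and pushes it to the top of the stack; answering it wrongly promotes 105, so a question on 105 is asked next, as in the example above.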

6.3. Initial student profile

The prior qualification given by the user forms the system's initial model of the user's ability. Some examples of initial assignments for different qualifications are given in Table 4. Thus students who have taken a U.K. A-level in mathematics will be asked level 3 questions to start with, while a student who has only studied for U.K. GCSE mathematics will start by being asked level 2 questions. The initial model does not currently make use of grades (although this is a customizable feature of the system and was investigated for the original paper-based tests used [3]). The number of levels used in the Test could also be increased, although it would be better to grade skills according to student performance rather than their perceived difficulty, making the network refinement process slow.

Table 4. Skill probability assignments for some qualifications in Diagnosys v2.2

Qualification                      Level 1     Level 2      Level 3      Level 4
A-level, Foundation Year, CSYS     Possibly    Possibly     Possibly     Unassigned
GCSE, Vocational, Access           Possibly    Possibly     Unassigned   Unassigned
None                               Possibly    Unassigned   Unassigned   Unassigned

7. THE MATHEMATICS INTERFACE

Diagnosys offers a variety of different answer types and evaluation types, as shown in Table 3.

The implementation of numeric, string and multiple-choice answer input types is straightforward and widespread. It is not difficult to extend the range to list entry (viz. "which two of the following ...") and hidden multiple choice, in which answers are shown in sequence to avoid the "eliminate some then guess" strategy. The authors have tended to use these judiciously as the former can penalize a mostly correct answer and the latter may be poorly understood by the student (who may think they can review answers). Effective testing of mathematical topics also requires algebraic input and evaluation, preferably with the following features:

- making the multiplication sign (*) optional;
- handling more than one variable and allowing a choice of different names in question design;
- handling special symbols (such as π);
- providing a simple interface for entering powers and fractions; and
- restricting answer formats but not requiring a specific canonical form (e.g. expand brackets and collect terms but in any order).

Beevers et al. [11] describe the interface used in their system, CALM, which addresses most of these


points by using function evaluation to within a preset tolerance. More recent developments [12], as part of the TLTP Mathwise project [5], have added to these. Powers and functions are entered in "computer" convention, for example, x^(2)/(x + 1), but are re-displayed in mathematics format for user checking before final entry.

Diagnosys uses a similar strategy but with the following differences:

1. Powers are entered naturally, using the arrow keys to move the cursor between the normal script and superscript.
2. A tutorial on entering powers, fractions and special symbols is given at the beginning of the test.
3. For function evaluation, apart from checking the value (by ensuring it equals a target function to within a pre-set tolerance at several points within a prescribed range), two further checks can be made: the string length can be fixed to within prescribed bounds (e.g. to check for simplification); and the number of occurrences of any specified substring can be counted. Thus the number of occurrences of "(" may indicate whether factorization has been successful.
4. Answers involving complex numbers written in Cartesian form are treated as functions of i (for mathematics subjects) or j (for engineering subjects) for checking.
5. An invalid expression (e.g. x + 2 - ) may be re-entered.

The function parser/evaluator is also used for plotting graphs used in some questions. The current fractions interface uses the / character, which causes some confusion in priority for algebraic fractions having multiple-symbol numerators or denominators, such as 5/2y (as already described). A more natural fractions interface, such as that used in the Algebra Mentor software,† may be developed in a future version.

Diagnosys adds leading + signs to expressions before parsing them. For example, (x - 1)(x - 2) and (-1 + x)(-2 + x) have different lengths. They are converted to +(+x-1)(+x-2) and +(-1+x)(-2+x) respectively. This has the advantage of making the string length of simplified expressions unique for many simple problems, assisting evaluation. If an analytical check were to be made, this strategy would also give a closer correspondence in parsed structure. There is current research into analytical checks on algebraic input (e.g. [13]), including providing diagnostic messages, but it is well known that no canonical form exists for many algebraic expressions. Therefore numerical evaluation offers a far simpler, and usually effective, solution.
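The value check and the two supplementary checks in point 3 above can be illustrated as follows. The sketch treats the two expressions as Python callables; in the real system the student's string is parsed by the shell's own parser first, and the sample points, tolerance and bounds are taken from the question file.

```python
import random

def numerically_equal(candidate, target, lo=-2.0, hi=2.0, points=7, tol=1e-6):
    """True if the two single-variable expressions agree, to within `tol`,
    at several points drawn from the prescribed range [lo, hi]."""
    for _ in range(points):
        x = random.uniform(lo, hi)
        try:
            if abs(candidate(x) - target(x)) > tol:
                return False
        except ZeroDivisionError:        # a sample point outside the common domain
            return False                 # (the real system would resample instead)
    return True

def length_and_substring_checks(answer, min_len=None, max_len=None,
                                substring=None, occurrences=None):
    """String length within prescribed bounds (e.g. to detect an unsimplified answer)
    and a count of a given substring (e.g. '(' to detect whether factorization was done)."""
    s = answer.replace(" ", "")
    if min_len is not None and len(s) < min_len:
        return False
    if max_len is not None and len(s) > max_len:
        return False
    if substring is not None and s.count(substring) != occurrences:
        return False
    return True
```

For instance, numerically_equal(lambda x: (x - 1) * (x - 2), lambda x: x**2 - 3 * x + 2) returns True, while length_and_substring_checks("(x-1)(x-2)", substring="(", occurrences=2) confirms that a factorized form was entered.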

8. THE UTILITY PROGRAMS

8.1. Individual profile

When the student finishes the Test, they receive immediate feedback on their overall performance. This includes whatever "administrative" details (department, year, etc.) the tutor chooses, together with time taken, lives used etc. An example of an individual student profile is given in Fig. 6. The performance on the questions can be presented either as a bar chart of success on each area of the test (as a percentage, and including inferred and untested skills), or as a list. The list comprises any or all of: skills actually tested, skills inferred, skills not tested, and either or both of skills possessed and those not possessed in each category.

For weaker students, or if the test is for a group known to contain strata of background qualifications, there may be no purpose in listing skills not possessed. Conversely, for groups where only isolated skills are missing, or only isolated students are to be identified, there may be no point in a long list of skills possessed. This is under the control of the tutor configuring the test. The same facility exists for the printed profile, produced subsequently by the tutor, but in a different format if desired. For example, the bar chart, or a list of only those skills actually tested, might appear on screen, whereas a longer list could appear in print. The printed profile can also include cross-references to support materials, which might be textual (textual backup notes are provided for the basic mathematics test) or CAL facilities. The tutor can include such references as they wish.

† Algebra Mentor, Brooks/Cole Publishing Company, Pacific Grove, Calif., U.S.A.


8.2. Group analysis

The group processing program currently has four options:

1. mass processing of individual profiles,
2. the production of a group profile (sketched below),
3. ranked and alphabetic lists,
4. tabulated response data for each question.
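Option 2 can be pictured as the following computation over the saved student models (compare Fig. 8 below). The record structure and the category function are hypothetical; the real program reads the Prolog model files and applies the filters described next.

```python
from collections import defaultdict

def group_profile(models, category_of):
    """Percentage of students with status 'yes' or 'probably_yes' on each skill,
    broken down by category (e.g. qualification group), as in Fig. 8."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))   # skill -> category -> [have, seen]
    for model in models:
        cat = category_of(model)
        for skill, status in model.skill_status.items():
            cell = counts[skill][cat]
            cell[0] += status in ("yes", "probably_yes")
            cell[1] += 1
    return {skill: {cat: round(100 * have / seen) for cat, (have, seen) in cats.items()}
            for skill, cats in counts.items()}
```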

All four options use a text file to specify a filter for processing subsets of files. For example, if several groups have used the system, each may be post-processed independently, or, for large groups, names "A-M" might be processed before "N-Z" etc. An example of a group profile is shown in Fig. 8. A second level of filtering allows the analysis of overall knowledge of each skill by categories. These can use any of the initial questions in the first phase of the test, such as qualifications, department, gender, age group, etc., so that admissions tutors or others can be informed about preliminary knowledge of various categories of incoming student. A further facility has had limited use: a simple table of the success on each skill and other responses for every student in a group may be downloaded for subsequent spreadsheet analysis.

GROUP PROFILE
Percent 'Yes'/'Prob Yes' by category

Skill  Name                                            1     2     3     4     5
101    Multiplication of negative numbers             88   100    77    89    85
....
208    Factors of algebraic products                 100   100    49    78    77
209    Simple Factorisation                          100   100    14    44    59
210    Expanding one bracket                         100   100    57    67    79
211    Collecting terms                              100    89    49    56    73
212    Solving linear equations                      100   100    31    56    67
....
303    Simplification of fractional powers            48    22     3    11    23
....
406    Solution of quad. by completing the square      6    11     0    11     5
....

Categories 1-5 denote qualification groups (1 = A Level; 2-4 group AS Level, Norwegian Qual, CSYS, Scottish Higher, Access/Foundation Year, BTec, GCSE and Other; 5 = all students).

Number of students in 1,2,3,4,5              = 33, 9, 35, 9, 86
Average time of students (mins)              = 68, 64, 50, 76, 61
Average number of lives used by all students = 1.9
Average no. successful 2nd attempts          = 0.6

Fig. 8. Example Diagnosys group profile.

9. FOLLOW-UP MATERIAL

The project has been developing paper-based revision notes on the individual skills in the Test. These have a consistent style (explanation, worked examples, brief examples and exercises) and relate to an individual skill or a cluster of (up to three) skills. The revision sheets are either one or two pages in length

and could be given out individually to the students according to their performance in the diagnostic test. However, the authors do not necessarily wish to promote such a behaviourist approach as a learning method: i.e. if a particular student gets a certain question wrong, encouraging him or her to read notes and do several examples on this skill may not be the best strategy for that individual to acquire or rehearse this ability. The authors would prefer to promote the Test as a useful indicator of general areas of ability and weakness, and the notes as a useful resource for students "brushing up" on material with which they are already conversant.

10. DISCUSSION AND EXTENSIONS

10.1. Other available tests

There are several computer-based assessment systems available at present in the U.K. Eighteen systems for diagnosing, monitoring, self-testing and grading student ability in mathematics were mentioned at a recent conference [14]. Some of these assessment systems are embedded within courseware, some are stand-alone systems, some are shells while others are generic authoring languages. Also, a limited survey has been carried out recently into the provision of diagnostic testing in the U.K. for non-specialist mathematicians [15], and there are groups collaborating on future provision in the U.K. Work is also underway in the U.S.A. on a computer-based test to automate part of the Graduate Record Examination [16]. This test will emphasize conceptual and mathematical reasoning ability as opposed to mathematical content knowledge. However, Diagnosys is the only computer-based diagnostic test known to the authors which uses an expert system to target and reduce the number of questions asked.

10.2. Validity of skills-based assessment

Singley and Anderson discuss the definition of cognitive skills in [9]. They identify three types: conceptual understanding, rule selection and rule application. The majority of the skills identified in the current Diagnosys test comprise a combination of these three types.

Hawkins et al. [17] have discussed the appropriate use of technology in assessment, including mathematics, from a wide perspective. They emphasize the need to assess actual performance as opposed to "inferring ability from decontextualized measures of cognitive traits". They argue that the use of paper-based tests has driven education to emphasize the abilities of factual recall and the solution of short, well-defined problems. They promote the methods of performance assessment of complex tasks and portfolio assessment, where a student collects together demonstrations of ability, similar to an art portfolio, in order to measure the rich array of student ability in an authentic context. Whilst the authors find such aspirations laudable, it seems difficult to incorporate them into the current U.K. educational context where there is such pressure on lecturers to ensure students from an increasing variety of backgrounds have the necessary mathematical skills to develop further competencies.

10.3. Using the Diagnosys shell in other subject areas

Another project at Newcastle University has produced a prototype test of basic mechanics skills using the Diagnosys shell [18]. It was tested on 100 students in February 1995 and on over 300 students in October 1996. The analysis of results from this Test has suggested some difficulties in using the hierarchical skills model for this subject. This may be due to the complex nature of the domain, which needs to be modelled carefully (such as distinguishing between the separate abilities of problem formulation, rule selection and rule application, as suggested by Singley and Anderson's research into word problems in calculus ([9], Chap. 5)). Alternatively, it may be that the inference mechanism is inappropriate (e.g. basic conceptual skills such as the difference between a vector and a scalar do not link to higher level procedural skills such as problems involving vector and scalar quantities). Possible ways round this are to use forward and backward links separately, or to assess contextualized problem-solving skills first and assess the component conceptual or procedural skills only when a student gets an answer wrong. These ideas have not yet been tried. The difficulties outlined above are probably indicative of the sort of problems that will be faced when attempting to use Diagnosys to construct tests in other areas of cognitive skill.


10.4. Future developments

The main ways in which the system will be developed further are aimed at making customization easier in order to assist the development of new topic areas and other subjects, and to increase the Test's utility to institutions other than U.K. higher education. Developments currently underway are:

1. A custom editor for all the text files and full development documentation.
2. The range of difficulty of the Test will be extended, especially at the lower end, due to demands from schools and Further Education colleges. The skills tested will also be more closely tied to external qualification syllabi (such as GCSE mathematics and National Curriculum levels).
3. Improving the default ordering of questions asked.
4. Allowing question answers to be deferred or allowing students to select the order in which topic areas are assessed.
5. Allowing students to collect their own printed output.

Other proposed developments are:

1. Providing better links to follow-up materials.
2. Providing better facilities for question designers and instructors.
3. Widening the functionality and modes of use of the Test to allow for greater variety of choice for potential users of the system.
4. A Windows version of the system.

A probabilistic system, assigning confidence levels to each skill during the diagnostic process (cf. the graphical structure probability method for medical diagnosis described in [19]), could in principle be developed, but the authors doubt that the possible gains would offset the additional complexity, not so much for the system as for the question designer. However, it may be possible to use past results to improve the algorithm for assigning the overall score. This requires an assessment of the relative difficulty of each question and the relative ability of each student (as stronger students tend to answer harder questions due to the entry level model adopted, for example). Work in this area has shown some interesting results and it is hoped to publish this separately [8].

11. CONCLUSION

The increased numbers of students entering Higher Education science and engineering courses in the U.K., together with an ever widening variety of educational and cultural backgrounds, has led many institutions to seek ways of targeting tuition and thus using staff time most efficiently. Self-study materials can contribute, but identification of both individual and group needs is a prerequisite for early intervention. The Diagnosys system has already provided, for several universities, a useful tool for students and tutors, and is currently under evaluation in more than fifty other institutions in the U.K. and abroad. Further development will increase its utility for school-level mathematics and other subject areas.

The nature of the topic areas covered initially, in basic mathematics, lends itself to a directed network of mainly procedural skills from which abilities may be inferred by asking a suitable selection of questions. Although the consequent inferences are not certainties, validation, and the predictive capability of the test, show that the information deduced is useful; formal assessment would require a higher level of confidence. Further development is envisaged in extending the coverage both to other areas and levels of mathematics, and to other subjects, such as mechanics. It is not yet clear how easy it will be to establish a suitable skills network, and it is possible that simpler subnetworks of three or four skills may be more effective.

Diagnosys has been produced to meet a specific need in Higher Education, but its development has highlighted some interesting issues in the teaching, learning and assessing of mathematics. The widespread use of such tools may assist in the accommodation of the new mass-education system in the U.K. and elsewhere.


12. AVAILABILITY

The current version of Diagnosys, v2.31, is available to all U.K. higher education establishments at a nominal cost. Others may purchase an individual copy or a site licence. The paper-based follow-up notes may also be purchased for photocopying. Please send all enquiries to: TLTP Project, Department of Engineering Mathematics, University of Newcastle upon Tyne, Newcastle upon Tyne NE1 7RU, England. Tel.: (+44) 0191 222 6286; Fax: (+44) 0191 222 5498; e-mail: [email protected]

REFERENCES

1. Sutherland, R. and Pozzi, S., The changing mathematics background of undergraduate engineers. Technical report, The Engineering Council, 10 Maltravers Street, London WC2R 3ER, 1995.
2. IMA working group, Mathematics matters in engineering. Technical report, The Institute of Mathematics and its Applications, 1995.
3. Anderson, A., Appleby, J., Dwan, M. and Fletcher, P., The assessment and development of supported self-study materials in basic mathematical skills: report on diagnostic testing. Technical report, Department of Mathematics and Statistics and Department of Engineering Mathematics, University of Newcastle upon Tyne, 1993.
4. Howson, A. G., Barnard, A. D., Crighton, D. G., Davies, N., Gardiner, A. D., Jagger, J. M., Morris, D., Robson, J. C. and Steele, N. C., Tackling the mathematics problem. Technical report, The London Mathematical Society, the Institute of Mathematics and its Applications and the Royal Statistical Society, 1995.
5. Teaching and Learning Technology Programme, Coldharbour Lane, Bristol. TLTP Catalogue, Phase I (Spring 1995), 1995.
6. Gagné, R. M., The Conditions of Learning, 3rd edn. Holt, Rinehart & Winston, New York, 1977.
7. Beale, E. M. L. and Little, R. J. A., Missing values in multivariate analysis. Journal of the Royal Statistical Society, 1975, 1, 129-145.
8. Samuels, P. C. and Appleby, J. C., Scoring a variable difficulty knowledge-based test. To be submitted.
9. Singley, M. K. and Anderson, J. R., The Transfer of Cognitive Skill. Harvard University Press, London, 1989.
10. Nielsen, J., Traditional dialogue design applied to modern user interfaces. Communications of the ACM, 1990, 33(10), 109-118.
11. Beevers, C. E., Foster, M. G., McGuire, G. R. and Renshaw, J. H., Some problems of mathematical CAL. Computers & Education, 1992, 18, 119-125.
12. Beevers, C. E., The Mathwise maths interface. Private communication.
13. Strickland, P. M., Algebraic approaches to assessment. Presented at the Brunel/CTI CMS Conference on Computer-Based Assessment in Mathematics, 1994.
14. Bishop, P., Report on assessment conference, list of assessment shells. Maths & Stats, 1994, 5(4), 26-29. A quarterly newsletter published by the CTI Centre for Mathematics and Statistics.
15. Edwards, P., Implementing diagnostic testing for non-specialist mathematicians. Technical report, The Open Learning Foundation, 1996.
16. Tucker, A., New GRE mathematical reasoning test. Notices of the AMS, 1995, 42(2), 245-247.
17. Hawkins, J., Frederiksen, J., Collins, A., Bennett, D. and Collins, E., Assessment and technology. Communications of the ACM, 1993, 36(5), 74-76.
18. Appleby, J. C., Treasure-Jones, T., Samuels, P., Site, P., Anderson, A. and Gorial, B., Managing the transition to university education: computer-based diagnostic testing of background knowledge. In 1995 International Conference of Engineering Deans and Industry Leaders, Monash University, Australia, 1995.
19. Lauritzen, S. L. and Spiegelhalter, D. J., Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Ser. B, 1988, 50(2), 157-224.