System 28 (2000) 345–354
www.elsevier.com/locate/system
Vocabulary and neural networks in the computational assessment of texts written by second-language learners

Paul Meara*, Catherine Rodgers, Gabriel Jacobs

Centre for Applied Language Studies, University of Wales Swansea, Singleton Park, Swansea SA2 8PP, UK

Received 5 October 1999; received in revised form 15 December 1999; accepted 10 January 2000

* Corresponding author. Tel.: +44-1792-295391; fax: +44-1792-295641. E-mail address: [email protected] (P. Meara).
Abstract

This paper explores the potential of a neural network in language assessment. Many examination systems rely on subjective judgements made by examiners as a way of grading the writing of non-native speakers. Some research (e.g. Engber, 1995. The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing 4(2), 139–155) has shown that these subjective judgements are influenced to a very large extent by the lexical choices made by candidates. We took Engber's basic model, but automated the evaluation of lexical content. A group of non-native speakers of French were asked to produce a short text in response to a picture stimulus. The texts were graded by French native-speaker teachers. We identified a number of words which occurred in about half the texts, and coded each text for the occurrence and non-occurrence of each word. We then trained a neural network to grade the texts on the basis of these codings. The results suggest that it might be possible to teach a neural network to mimic the judgements made by human markers. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: Neural network; Computational assessment; Vocabulary; French
Students of foreign languages in academic institutions often have to write free-form texts as part of their assessed work. Of necessity, such texts tend to be judged (to a greater or lesser extent) impressionistically, since it is all but impossible to implement a standard, rigid mark scheme as can be done with some other types of language exercise. A number of studies have shown that in grading free-form texts,
[email protected] (P. Meara). 0346-251X/00/$ - see front matter # 2000 Elsevier Science Ltd. All rights reserved. PII: S0346-251X(00)00016-6
346
P. Meara et al. / System 28 (2000) 345±354
assessors are influenced to a great extent by lexical content. Santos (1988) demonstrated that assessors were generally able to mark language and content independently of each other except in the case of lexis, since meaning is obviously closely related to choice of words. Other studies (e.g. Harley and King, 1989; McClure, 1991) have confirmed that lexical resource is a critical factor in assessors' subjective judgements of language exercises.

Work by Engber (1995) was the starting point for the experiment described in this paper. Engber examined the extent to which lexical richness and lexical errors are related to the grade assigned by assessors to timed essays produced in a second language by learners of that language (in her case, learners of English). Once again, it is clear from her study that the diversity and accuracy of the students' lexical resource had a considerable effect on the judgements of the assessors. Texts with rich, varied and error-free lexis tended to be highly rated, while those with a more restricted range of lexical choices, and a high proportion of lexical errors, tended to be rated poorly, whatever the other merits or demerits of the texts.

Taking Engber's basic model, we devised a method of determining the extent to which a desktop computer could reliably grade learners' texts. Students of French were asked to produce such texts, which were marked by three assessors. The texts were subjected to a lexical analysis and the results analysed by a computer program, the objective being to train the program, as far as possible, to mimic the assessors' judgements. With a remarkable degree of accuracy, the program was able to make the same distinction that the human assessors had made with respect to an especially important marking frontier operated in UK higher-education institutions: the one separating an upper-second-class and a lower-second-class degree.¹ In our own institution this equates to the borderline between 59 and 60 marks out of 100.

¹ In the UK, degrees are usually classified as First, Second or Third Class, with the Second Class subdivided into Upper Second and Lower Second, sometimes called II(a) and II(b) or II(i) and II(ii). The subdivision is a critical one, since an upper-second degree is considered by students, teachers and employers to be a good qualification, while a lower-second degree is considered to be merely respectable.

It is common knowledge that when UK university teachers assess a piece of degree-level work judgementally, they tend to think first in terms of the degree classification system, broadly assessing the work in terms of a class, then assigning a numerical value to the judgement. This has the effect of causing assessors, in our institution for example, to consider very carefully whether a script is worth, say, 58 or 60 marks, while the difference in their minds between, say, 53 and 57 marks (twice the gap between the two marks) may well be far less distinct. The scale, in practice, is therefore not linear and, for this reason, in many institutions averages of marks are considered to be less important than a student's profile in terms of classes awarded in individual examinations.

The computer program we used for our experiment was WinNN, an artificial neural network running on a standard PC under Windows (3.1, 95 and NT). It was developed by Y. Danon (the first version dates from 1994) and is available as shareware (Danon, 1994). Artificial neural networks (usually called simply neural networks) are pieces of software capable of `learning' to solve complex categorisation and
pattern-matching problems. They consist of a network of interlocking nodes arranged in layers (Fig. 1). The interconnections between the layers determine the way in which the nodes behave. Each node takes a value, each connection weights the value of its source, and each node in the next layer sums the values of the connections it receives from the layer above. The weightings can be either positive or negative, shown in Fig. 1 by a plus sign or a minus sign, and the result is that a node is either activated or inhibited, by varying amounts, by the nodes in the layer above. Nodes are organised into an input layer, an output layer and, very often, one or more so-called hidden layers in between. Hidden layers allow the program to solve a richer array of problems than is possible with just an input and an output layer. Through many iterations, the program `learns' the underlying relationships between inputs and outputs by induction; in other words, not by conforming to any pre-programmed rules but by continuously adjusting the weightings so that a particular input pattern comes to be reliably associated with a particular output pattern. In this way, it is possible to train a neural network to mimic certain aspects of human behaviour and thinking.
Fig. 1.
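To make the layered arithmetic concrete, the forward pass just described can be sketched in a few lines of Python. This is a minimal illustration only, not WinNN itself: the logistic squashing function, the 10-2-1 layer sizes and all the names below are our assumptions, and bias terms are omitted for brevity.

import math
import random

def sigmoid(x):
    # Logistic activation: squashes a weighted sum into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    # Each layer is a list of per-node weight vectors, one weight for each
    # node in the layer above. Positive weights activate a node, negative
    # weights inhibit it, by varying amounts.
    values = inputs
    for layer in layers:
        values = [sigmoid(sum(w * v for w, v in zip(node_weights, values)))
                  for node_weights in layer]
    return values

# A toy network with 10 input nodes (one per target word), two hidden
# nodes and a single output node; weights start random, as before training.
random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(10)] for _ in range(2)]
w_output = [[random.uniform(-1, 1) for _ in range(2)]]
print(forward([0, 1, 1, 0, 0, 1, 0, 1, 1, 0], [w_hidden, w_output]))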
It has been shown quite clearly that neural networks can be used to evaluate free-form writing produced by native speakers. The main approach here has been a methodology known as Latent Semantic Analysis (Landauer and Dumais, 1998; Landauer et al., 1998; Rehder et al., 1998). However, the neural networks used for this sort of analysis tend to be very large and complicated, typically running on mainframe computers, and involving an enormous number of connections together with a very large database of texts. We wished to see whether useful results could be obtained with a very small network and a minimal textual resource.

In order to do this, we asked 36 L1-English learners of Business French in their fourth (and final) year of study at university to write for 50 min, under examination conditions, a text in French describing and evaluating an advertisement promoting holidays in Scotland (Fig. 2) which had appeared in a French magazine.
Fig. 2.
The advertisement was selected primarily for its visual impact, including the fact that it did not provide an inordinate number of lexical cues, and for the fact that it appeared to present interesting potential points of discussion, although almost any image, indeed any topic, might have sufficed.

The resulting 36 texts, averaging roughly 300 words each, were marked completely independently by three native-speaking French assessors², each with considerable experience in examining in French at the level of the students concerned (and thus sensitive to the 59/60 borderline). They were asked to mark in their usual way for essays, and to express their overall judgement in the normal way as a score out of 100 marks, i.e. as if they were marking final-year examination scripts.

² Thanks are due to Nathalie Morello and Martine Thompson, colleagues in the Department of French, who, together with one of the authors of this paper (Rodgers), graded the students' texts.

We established a word list from the texts produced by the students, and selected a set of words that occurred in approximately 50% of them. The 50% point was chosen because we considered that such a set of words would help to discriminate between the stronger and the weaker scripts. By definition, words which appeared in all texts would not discriminate, and words which appeared in very few texts would, equally self-evidently, be poor discriminators. The words selected (Fig. 3) were in no way special or difficult, and indeed all are quite common in French.

Fig. 3. Words which appeared in approximately half the texts.

We then generated 100 sets of 10 target words by selecting at random from this list. An example target set is shown in Fig. 4.

Fig. 4. An example of a randomly selected 10-word target set.

Next, we described each of the students' texts in terms of each of the 100 10-word target sets. This process gave results of the kind shown in Fig. 5, where a 0 represents the absence of a word, and a 1 its presence. It can be seen that in, say, Text 1, adresse is absent, chapeau is present, aide is present, encore is absent, peu is absent, and so forth.

Fig. 5. Binary assignments for an example target set.
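How such binary codings might be produced is easy to sketch in Python. This is a hedged reconstruction rather than the study's actual tooling: the five-word target set and the sample sentence below are invented for illustration (the real target sets contained 10 words drawn from Fig. 3).

import re

def code_text(text, target_words):
    # Return a 0/1 vector: 1 if the target word occurs anywhere in the text.
    tokens = set(re.findall(r"\w+", text.lower()))
    return [1 if word in tokens else 0 for word in target_words]

targets = ["adresse", "chapeau", "aide", "encore", "peu"]  # truncated example
text_1 = "Le chapeau rouge attire le regard et vient en aide au lecteur."
print(code_text(text_1, targets))  # [0, 1, 1, 0, 0], matching Text 1 in Fig. 5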
Fig. 6. Part of the neural-network data file.
Fig. 6 shows part of the resulting data file. Again, each line of the file represents a given target set of words, and indicates, by a 0 or a 1, the presence or absence of a word in the corresponding digitised text files on the far right. The final right-hand column of digits is the averaged assessment of the assessors as to whether the text was judged to be equal to or above 60 marks (1) or below 60 marks (0).

The data file was then passed through the neural network shown in Fig. 7. Except for the values of the input layer, the values of all the nodes and the weightings between them were initially set randomly. At each iteration of the program, the value of the output node was compared with the `correct' output, i.e. whether it corresponded to the average of the three examiners' marks at equal to or above 60 marks, or below 60 marks. If the result was incorrect, the program made a small adjustment to the weightings in the direction of the correct answer. This process was repeated over and over again, until the program could no longer improve its performance.
Fig. 7.
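The weight-adjustment cycle described above is, in essence, error backpropagation. The following sketch shows the idea under stated assumptions: a 10-input, single-output network with a small hidden layer, a logistic activation, no bias terms, and an invented learning rate and stopping rule; WinNN's own settings are not reproduced here.

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(patterns, n_hidden=3, rate=0.5, max_epochs=10000):
    # patterns: (ten 0/1 inputs, target 0 or 1) pairs, one per text.
    n_in = len(patterns[0][0])
    rng = random.Random(1)
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    for epoch in range(max_epochs):
        wrong = 0
        for x, t in patterns:
            h = [sigmoid(sum(w * v for w, v in zip(ws, x))) for ws in w_h]
            y = sigmoid(sum(w * v for w, v in zip(w_o, h)))
            wrong += (y >= 0.5) != (t == 1)
            # Nudge every weight a small step toward the correct answer.
            d_o = (t - y) * y * (1 - y)
            d_h = [d_o * w_o[j] * h[j] * (1 - h[j]) for j in range(n_hidden)]
            for j in range(n_hidden):
                w_o[j] += rate * d_o * h[j]
                for i in range(n_in):
                    w_h[j][i] += rate * d_h[j] * x[i]
        if wrong == 0:  # `100% good patterns', cf. Fig. 8
            return w_h, w_o, epoch + 1
    return w_h, w_o, max_epochs

For the study's data, `patterns' would presumably hold 36 rows, one per text: the 0/1 coding of that text against a given 10-word target set, labelled 1 (60 marks or above) or 0 (below 60).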
Fig. 8 is a screenshot of the WinNN neural network in action. At the middle top left, it can be seen that the network went through 1280 iterations before finding 100% good patterns, at which point it finished iterating. Fig. 9 shows the reduction in the error (the difference between the human markers' decision and that of the computer) as training proceeds. At zero iterations, the error is a random number; by iteration 1280, it approaches zero.
Fig. 8.

Fig. 9.
Our initial thoughts before we ran the data through the neural network were that we might be able to identify a few 10-word target sets which would be modestly good at discriminating, successful in, say, around three quarters of cases. We did not expect the rather spectacular and, in some ways, rather inexplicable results which actually emerged, and which can be seen in Fig. 10. What we were interested in was whether the 10-word target sets could be trained to discriminate between the two sets of texts, or whether some texts would fail to be classified correctly by a particular set of target words. Fig. 10 shows that 74% of the 10-word target sets could be trained to discriminate perfectly. A further 15% of the target sets misclassified only one or two texts. Only one of the target sets could be considered a serious failure.

Fig. 10.

This raises a number of important questions. Our original question concerned what features of a target set make it a successful one but, since such a high proportion of sets were successful, it probably makes more sense for us to ask what features characterise unsuccessful target sets, and what it is about the unsuccessful sets which sets them apart. It must be remembered that the neural network works by examples, not rules. The generalisations which it finds certainly underlie some form of behaviour but, as is often the case when working with neural networks, those generalisations may not be clearly discernible. Indeed, at present we have no satisfactory answer to the question of what makes a target set successful. Nor do we yet have answers to some other questions which come to mind, such as whether or not the model will correctly classify new cases, or will generalise to larger sets of texts, different student abilities, or more and/or finer-grained mark classifications. All these matters will be considered in work which we plan to carry out.
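Under the same assumptions, the headline percentages could be tallied by training one such small network per target set and counting how many texts each trained set still misclassifies. Here `train' and `sigmoid' are the illustrative helpers sketched earlier, `code_text' the coding sketch, and `texts', `grades' and `target_sets' stand in for the study's actual data.

def misclassified(patterns, w_h, w_o):
    # Count texts left on the wrong side of the 59/60 borderline.
    errs = 0
    for x, t in patterns:
        h = [sigmoid(sum(w * v for w, v in zip(ws, x))) for ws in w_h]
        y = sigmoid(sum(w * v for w, v in zip(w_o, h)))
        errs += (y >= 0.5) != (t == 1)
    return errs

def tally(texts, grades, target_sets):
    perfect, near, poor = 0, 0, 0
    for target in target_sets:
        patterns = [(code_text(tx, target), g) for tx, g in zip(texts, grades)]
        w_h, w_o, _ = train(patterns)
        errs = misclassified(patterns, w_h, w_o)
        if errs == 0:
            perfect += 1      # 74% of sets in the study
        elif errs <= 2:
            near += 1         # a further 15%
        else:
            poor += 1
    return perfect, near, poor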
We do, however, have an answer to one question which we have already been able to investigate, namely whether or not the model works better with multi-word targets as opposed to single words. We examined this by building sets of 10 two-, three- and four-word targets which, like the one-word targets described earlier, appeared in approximately 50% of the texts. The number of such targets is small, and there is some overlap between the sets of different lengths. It appears that there is a negative correlation between successful discrimination at the 59/60 mark borderline and the number of words in an item within a target set. That is, single-word targets seem to discriminate marginally better than two-word targets, which are in turn considerably better than three-word targets; the four-word target sets are poor discriminators. This result, which at first may seem counter-intuitive, is probably explicable by the fact that, as the size of the target item increases, the individual multi-word targets contain more high-frequency words which are themselves poor discriminators. In any event, very few three- and four-word sequences occurred in more than a handful of the texts at our disposal.
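The multi-word targets can be thought of as contiguous word sequences (n-grams) subjected to the same 50% criterion. A hedged sketch follows, with the tolerance band and the extraction details as our assumptions, since the paper does not spell them out.

import re
from collections import Counter

def ngrams_in_text(text, n):
    # The set of distinct n-word sequences occurring in one text.
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def half_frequency_targets(texts, n, low=0.45, high=0.55):
    # Keep n-grams that occur in roughly half of the texts.
    counts = Counter()
    for text in texts:
        counts.update(ngrams_in_text(text, n))  # at most one count per text
    return [g for g, c in counts.items() if low <= c / len(texts) <= high]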
Despite the fact that our research has generated several questions to which at present we have no answers, we believe that the results presented in this paper have some potentially far-reaching implications.

We are particularly grateful to an anonymous reviewer who drew our attention to some important issues of validity raised by the approach we describe in this article. The position we have taken here is that reliability is a necessary pre-condition of validity, but not necessarily distinct from it, and we agree with Chapelle (1999) that reliability can be seen as one type of validity evidence. However, it is unclear to us how straightforwardly ideas about validity apply to neural network models. There seems to be a particular problem with models which lack obvious face validity, and yet still `work' in the sense that they produce the desired outcome in a non-intuitive way.

On the practical level, we are not, of course, suggesting that a small-scale neural network which can be run on almost any desktop computer could replace human markers. Yet such a neural network could certainly be useful as a check on assessors' judgements. It could perhaps even automate the second marking which often takes place both in higher-education institutions and for public examinations. And since our neural network can already distinguish between marks assigned by assessors for weaker scripts (below 60 marks) and stronger ones (equal to or above 60 marks), it could conceivably be used right now for tests of discursive writing in a second language where a broad categorisation into better and worse would be a useful distinction (e.g. at the Pass/Fail level). It may be difficult, if not impossible, to convince colleagues that a computer is capable of arriving at a dependable judgement in an exercise involving so many complex interacting parameters, but if a neural network were used merely as a kind of second-marking ally, then the inevitable lack of faith in the machine (and the perceived threat, however vague, of the machine taking over from the human being) might be lessened.

References

Chapelle, C., 1999. Validity in language assessment. Annual Review of Applied Linguistics 19, 254–272.

Danon, Y., 1994. WinNN shareware. See, for example, for Windows 95 (WinNN 32): http://www.winsite.com/info/pc/win95/misc/winn3212.zip/downl.html.

Engber, C.A., 1995. The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing 4 (2), 139–155.

Harley, B., King, M.L., 1989. Verb lexis in the written compositions of young L2 learners. Studies in Second Language Acquisition 11, 415–440.

Landauer, T.K., Dumais, S.T., 1998. A solution to Plato's problem: the latent semantic analysis theory of the acquisition, induction and presentation of knowledge. Psychological Review 104, 211–240.

Landauer, T.K., Foltz, P.W., Laham, D., 1998. Introduction to latent semantic analysis. Discourse Processes 25, 259–284.
McClure, E., 1991. A comparison of lexical strategies in L1 and L2 written English narratives. Pragmatics and Language Learning 2, 141–154.

Rehder, B., Schreiner, M.E., Wolfe, M.B., Laham, D., Landauer, T.K., Kintsch, W., 1998. Using latent semantic analysis to assess knowledge: some technical considerations. Discourse Processes 25, 337–354.

Santos, T., 1988. Professors' reactions to the academic writing of non-native-speaking students. TESOL Quarterly 22, 69–90.