CUSUM: a credible method for the determination of authorship?

SCIENTIFIC & TECHNICAL

RA HARDCASTLE
Document Evidence Ltd, Gatsby Court, 172 Holliday Street, Birmingham B1 1TJ, United Kingdom

Science & Justice 1997; 37: 129-138
Received 19 April 1996; accepted 7 November 1996

Authorship attributions based upon the CUSUM (Cumulative Sum Analysis) method are still being presented in court proceedings as evidence despite the findings of several independent researchers that the method is unreliable. This paper describes the shortcomings of the method, presents further examples of its failures and summarises the findings of other researchers. The CUSUM method is seen to be discredited and should not be accepted as providing reliable evidence of authorship of either the written or the spoken word.

Key Words: Forensic science; Document examination; Text analysis; Linguistics; Stylistics; CUSUM; Qsum.

Introduction

In a previous paper [1] an assessment of the CUSUM (Cumulative Sum Analysis) method of determining the authorship of questioned texts was described. The method was found to be ill-defined and unreliable. Since then other researchers have published the results of their own independent assessments of the CUSUM method as proposed by Morton et al [2-5] and have confirmed its unreliability. Some of these researchers have investigated alternative methods of statistically analysing the same underlying data but even with such modifications reliable attributions of authorship could not be made. Despite this, the main advocates of the CUSUM method apparently still maintain that the method is viable and continue to supply reports based upon their applications of the method for use in court proceedings in this country and abroad.

In this paper more examples of failures of the CUSUM method are presented and the findings of other researchers are summarised. A remarkable instance where the method was inappropriately taken to the extreme of being applied to a single disputed sentence is also described.

The basis of the CUSUM method

The validity of the CUSUM method depends upon the assumptions that in the utterances (written and spoken) of people, words of a particular pre-defined class occur at a roughly consistent rate and that the average rate often differs significantly from one person to another. Various classes of words have been proposed but those relied upon most frequently by the proponents of the method are 'Short Words' (defined as words of two or three letters), 'Vowel Words' (defined as words beginning with a vowel) and especially 'Short + Vowel Words' (words which either consist of two or three letters or which begin with a vowel). Such classes of words are referred to by Morton [2-5] as 'habits'. This nomenclature has been followed here but it should be noted that no psychological or linguistic process has been identified which might generate such habits and therefore the term is arguably inappropriate.

In the absence of a sound theoretical basis there is no explanation as to why such a consistency of word use should occur within one person's utterances or why different people might use these words at significantly different rates. The English language has grammatical rules and conventions for phraseology and we have to construct our sentences in recognised ways so that others can understand, otherwise communication fails. It follows that the scope for personal variation in the rates of use of certain types of words is necessarily limited.

FIGURE 1 Distributions of average habit rates for three CUSUM habits among 580 authors. Panels: 'Short Words', 'Vowel Words' and 'Short and Vowel Words' habit rate (%).
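
The three habit classes defined above can be expressed as a simple word classifier. The following Python sketch is illustrative only: the function names, the tokenisation rule (runs of letters, with internal apostrophes kept inside a word) and the decision not to treat 'y' as a vowel are assumptions made here, since none of these details is specified for the CUSUM method.

```python
import re

VOWELS = "aeiou"  # assumption: 'y' is not counted as a vowel


def words_in(text):
    """Split text into words (assumed rule: runs of letters, apostrophes allowed inside)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)


def is_short(word):
    """'Short Words': words of two or three letters."""
    letters = [c for c in word if c.isalpha()]
    return len(letters) in (2, 3)


def is_vowel_word(word):
    """'Vowel Words': words beginning with a vowel."""
    return word[0].lower() in VOWELS


def is_short_or_vowel(word):
    """'Short + Vowel Words': words in either of the two classes."""
    return is_short(word) or is_vowel_word(word)


def habit_rate(text, is_habit=is_short_or_vowel):
    """Return (habit word count, total word count, habit rate in %)."""
    words = words_in(text)
    if not words:
        raise ValueError("text contains no words")
    count = sum(1 for w in words if is_habit(w))
    return count, len(words), 100.0 * count / len(words)


print(habit_rate("He was involved in nothing"))  # (4, 5, 80.0)
```

Passing `is_habit=is_short` or `is_habit=is_vowel_word` gives the rate for one of the single habits instead of the combined one. Note that a one-letter word such as 'I' is not a 'Short Word' under this definition but is still counted in the combined habit because it begins with a vowel.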

Something like a quarter of all English words used are selected from a pool of only a dozen or so [5], including the, a, to, in, etc. It is such non-context words which make up most of the 'Short Words' counted in a CUSUM analysis. Whilst people do differ in their ranges of vocabulary, the effect upon the count of 'Short Words' resulting from, say, a preference for using longer words in place of the relatively few words of two or three letters that are context related (for example, using vehicle instead of car), can only be a correspondingly small one.

The observed average rates of use of the CUSUM habit words within long texts fall within a fairly narrow range. Figure 1 shows the distributions of average rates for the three most commonly used habits among English texts by 580 different authors. These texts were single-authored books with a mean length of 79,000 words and a minimum length of 6,000 words [6]. For all three habits over 90% of the average rates fall within a 10% wide band. It could also be argued that at least some of the texts which give rise to the more extreme values may demonstrate the influence of subject matter. Such texts include 'The Book of World Horoscopes', 'Chinese Astrology', 'Science Basic Facts' (which has a dictionary-type layout) and 'Phonics They Use - Words for Reading and Writing'.

It would be unusual to find texts representing normal everyday English with habit rates outside the ranges seen in Figure 1. In contrast, the variation seen within one person's text sentence by sentence is generally much larger; it is this large variation that makes all the more remarkable the claim [2,3] that sections of text from one author as small as 5-10 sentences can be reliably distinguished from text by a second author.

The claim that the rate of use of 'Short Words' is consistent within one person's utterances despite changes of circumstance and subject matter is also contrary to expectations. For example, if a narrative concerned primarily the actions of the speaker then frequent instances of the word 'I' would be expected but these would not contribute to the counts of 'Short Words' since the word consists of only one letter. On the other hand, if the narrative concerned the actions of another person then frequent occurrences of 'he' or 'she' might well be found which would contribute to the count.

The construction and interpretation of CUSUM charts

A detailed description of the calculations required to construct a CUSUM chart is given in the previous paper [1]. Briefly, a CUSUM chart consists of two graphs superimposed; one is derived solely from the total numbers of words in each sentence (sentence lengths), the other from the counts of the 'habit' words within each sentence. The first graph represents the cumulative sum sentence by sentence calculated from the differences between the length of each sentence and the average sentence length for the whole text. At any point, if the next sentence is longer than the average then the graph rises; if it is shorter, it falls; if it is equal to the average, the graph line is horizontal. It follows that the general shape of the CUSUM plot and whether it is a smooth or jagged curve is simply a function of changes of sentence length from one sentence to the next.

The second graph is another cumulative sum plot. This one is derived from the differences between the numbers of habit words in each sentence and the average number of habit words per sentence for the whole text. If the number of habit words in each sentence is close to being a constant fraction of the total number of words in the sentence, then this second cumulative sum plot produces a graph which is similar in shape to the first one but with the vertical scale compressed. Morton applies a scaling factor to the vertical co-ordinates of the second graph and then superimposes it upon the first graph to create the so-called CUSUM chart. The procedure advanced by Morton for calculating the scaling factor results in an approximation. For a discussion of how his factor differs from the correct scaling factor, see the previous paper [1].

FIGURE 2 (a) CUSUM chart for the combined habit of Short + Vowel words in 65 sentences of specimen text of known authorship. (b) A histogram of the 5 sentence moving average habit rate for the specimen text used to produce (a).

Figure 2(a) shows an example of a CUSUM chart for a specimen text of known authorship. This text was accepted by a CUSUM proponent as showing that the author was consistent in his use of 'habit' words.

The critical part of the CUSUM method is the interpretation of such a CUSUM chart. For real, non-trivial texts of single authorship the sentence length graph and the scaled-up habit words graph are never identical - there are always differences between them due to the variations from the average habit rate sentence by sentence. Morton claims that where a significant divergence occurs between the superimposed graphs this indicates different authorship for part of the text but he has advanced no suitable objective criteria for deciding when a divergence is significant.
Furthermore he claims that even a layman can make this judgement [5].

The best that can be said for the CUSUM method is that it could be used to locate portions of text which differ in the rate of use of habit words from the average for the whole text. However, it is debatable whether the CUSUM chart is the best tool for doing even this. A separation of the graphs caused by a portion of text with a higher or lower than average habit rate can continue into the following part of the chart that relates to a subsequent section of the text. Also the two graphs are constrained to meet at the end of the chart even if it is the last part of the text that is anomalous.

Morton advocates moving a transparent overlay of one graph over the other in the vertical direction to resolve these problems by comparing the shapes of sections of the graphs individually. This is, however, an unacceptably subjective procedure on a number of counts. It requires that a portion of the chart be somehow selected for scrutiny and an anchor point between one graph and the other be chosen. Since the two graph lines will never be identical the superimposition of one portion upon another will never give an exact match but one is invited to make some sort of ad hoc assessment as to whether the match is acceptable and to decide whether a movement of the overlay to line up the graphs for one portion of text implies differing authorship for another portion.

If the CUSUM chart is abandoned then plotting the habit rates of individual sentences instead is not very informative except to demonstrate the large variations found from one sentence to another. However, a simple moving average over, say, five sentences is clearly capable of revealing changes in the local habit rate within a text. Figure 2(b) shows a histogram of the 5 sentence moving average habit rate within the specimen text used to construct the CUSUM chart shown in Figure 2(a). The histogram in Figure 2(b) shows a typical variation in habit rate from one part of the text to another of up to about 20%; this variation is of the same order of magnitude as the variation in overall average habit rates among texts from many different people as seen in Figure 1.

Furthermore, the variations in average habit rate do not diminish that much as the size of the text sample is increased. Figure 3 illustrates the extent of variation found among 50 non-overlapping sections of text, each 1,000 words long, taken from single works by 150 authors [6]. The uppermost graph line represents the maximum habit rate ('Short + Vowel Words') for any 1,000 word section of an author and the lowermost graph line represents the corresponding minimum habit rate. The author sequence has been sorted into ascending order of the mean habit rate which is represented by the middle graph line. The standard deviation for the mean habit rate of an author is typically about 2% and therefore the range of possible values within three standard deviations of the mean is typically about 12% wide.
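
The two cumulative sum plots that make up a CUSUM chart, and the five sentence moving average suggested as a simpler alternative, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not a reproduction of any practitioner's procedure. In particular, the scaling factor used here (total words divided by total habit words) is an assumption chosen because it superimposes the two graphs exactly when every sentence has the same habit rate; it is not claimed to be Morton's factor. The moving average is assumed to pool the words of each window of sentences. All function names are hypothetical.

```python
def cusum(values):
    """Cumulative sum of the deviations of each value from the mean of all values."""
    if not values:
        raise ValueError("at least one sentence is required")
    mean = sum(values) / len(values)
    running_total = 0.0
    series = []
    for value in values:
        running_total += value - mean
        series.append(running_total)
    return series


def cusum_chart(sentence_lengths, habit_counts):
    """Return the sentence length graph and the scaled habit word graph."""
    if len(sentence_lengths) != len(habit_counts):
        raise ValueError("one habit count is needed per sentence")
    length_graph = cusum(sentence_lengths)
    habit_graph = cusum(habit_counts)
    # Assumed scaling factor: the reciprocal of the overall habit rate.
    scale = sum(sentence_lengths) / sum(habit_counts)
    return length_graph, [scale * h for h in habit_graph]


def moving_habit_rate(sentence_lengths, habit_counts, window=5):
    """Habit rate (%) pooled over each run of `window` consecutive sentences."""
    rates = []
    for start in range(len(sentence_lengths) - window + 1):
        words = sum(sentence_lengths[start:start + window])
        habits = sum(habit_counts[start:start + window])
        rates.append(100.0 * habits / words)
    return rates


# Hypothetical per-sentence counts, for illustration only.
lengths = [12, 25, 8, 30, 17, 21, 9]
habits = [7, 12, 5, 16, 8, 12, 4]
length_graph, habit_graph = cusum_chart(lengths, habits)
local_rates = moving_habit_rate(lengths, habits, window=5)
```

Because each graph is a running total of deviations from its own mean, both necessarily return to zero at the last sentence, which is why the two lines are constrained to meet at the end of the chart whatever the text contains.
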
The variations found within single texts are further illustrated by Figures 4 and 5. Figure 4 shows the average habit rates ('Short + Vowel Words') for 100 successive sections of text, each 1,000 words long, taken from the works of three authors. The only criteria used to select these three texts from the 150 used for Figure 3 were that the texts were long enough and that they represented high (54%), medium (51%) and low (49%) overall average habit rates. There is no reason to suspect that they are untypical. As can be seen from Figure 4 there are large variations from one section of text to the next. Plotting the same data as frequency distributions in Figure 5 reveals that they overlap to a considerable extent. It is clear that the prospects for reliable discriminations between different authors of text samples as large as 1,000 words (usually equivalent to 50-100 sentences) are poor. For shorter texts the prospects are correspondingly worse. The claim that the CUSUM method can reliably establish that a disputed text as short as 5-10 sentences is of different authorship from some specimen text is extraordinary [3, 5].

As an example of the unreliability of the CUSUM method consider Figure 6(a) which is a CUSUM chart produced in another court case by one of the method's proponents. This chart was constructed from a composite text; sentences 1-29 were from one specimen text, sentences 30-40 were in dispute and sentences 41-70 were from a second specimen text. The separation of the two CUSUM graph lines in the centre of the chart was interpreted as demonstrating that the questioned text did not represent an utterance by the author of the specimen texts. The questioned text contained 11 sentences comprising 122 words, with an average habit rate (for the combination of short words and words beginning with a vowel) of 61%. The specimen texts had average habit rates of 56% and 53% respectively. In Figure 6(b) the habit rate within the three texts is plotted as a moving average for five sentence groups.
This shows that the habit rate within the questioned text is of a similar magnitude to that of the latter part of the first specimen text. The assertion that the habit rate in the questioned text is significantly different from that of the specimen text is clearly unjustifiable.

FIGURE 3 The range of variation in Short + Vowel habit rates among 50 blocks of 1,000 words each taken from books by 150 different authors. The upper and lower lines represent the maximum and minimum habit rates respectively for a single block. The middle line represents the average habit rate for all 50 blocks.

FIGURE 4 Variations in Short + Vowel habit rates between 100 successive blocks of text each of 1,000 words taken from books by three different authors.

FIGURE 5 Distributions of Short + Vowel habit rates among 100 blocks of text each of 1,000 words taken from books by three different authors using the same data as Figure 4.

FIGURE 6 (a) CUSUM chart for the combined habit of Short + Vowel words in a composite of three texts. Sentences 30-40 were in dispute; the remaining sentences were from specimen texts of known authorship. (b) Histograms of the 5 sentence moving average habit rate within the three texts used to produce Figure 6(a).

In another court case a CUSUM 'expert' took the CUSUM method to the astonishing extreme of applying it to a single disputed sentence. This was done despite several references within the preface to the expert's own report and in Morton [3, 5] to the effect that the method cannot discriminate single sentences but requires a minimum of five sentences and that with such a small sample only quite large differences in habit rate can be detected. The disputed sentence was "I know that you know I was involved but on the advice of my solicitor I am saying nothing and you will have to prove it all the way." This was a verbal utterance written down by a police officer. The sentence comprises 29 words of which 20 are words of two or three letters or which begin with a vowel (namely I, you, I, was, involved, but, on, the, advice, of, my, I, am, and, you, to, it, all, the, way). The CUSUM 'expert' inserted this sentence into a passage of specimen text from the alleged speaker of the words and produced a CUSUM chart equivalent to that shown in Figure 7(a). In this chart the disputed sentence has been inserted as sentence number 7 within 14 sentences of specimen text. The 'expert' concluded that the small separation of the two CUSUM lines in the chart showed that the disputed sentence was not characteristic of the defendant's utterances. Examination of the raw data reveals that the separation between the CUSUM graph lines in Figure 7(a) arises simply because 69% of the words in the disputed sentence are habit words compared to an average of only 53% of the words within the chosen specimen text. However, the separation between the CUSUM graph lines in Figure 7(a) is certainly not unusual for the person concerned as Figure 7(b), a CUSUM chart for another portion of specimen text
demonstrates. Furthermore, when the percentages of habit words within individual sentences of the specimen texts were inspected, this revealed that the proportion of habit words in the disputed sentence was by no means exceptional.

FIGURE 7 (a) CUSUM chart for the combined habit of Short + Vowel words in a composite text. The single disputed sentence has been inserted as sentence number 7 within 14 sentences of specimen text. (b) CUSUM chart for the combined habit of Short + Vowel words in another portion of specimen text from the same author as the specimen text used for Figure 7(a).

Figure 8 shows a scatterplot of the percentages of habit words found in sentences from several specimen texts. As might be expected, the range of variation of habit rate decreases somewhat as sentence length increases but the proportion of habit words in the disputed sentence is not significantly different from the proportions seen in some of the specimen sentences of a similar length. The point marked 'A' on the scatterplot represents a sentence of 30 words of which 21 were habit words (70%) but the CUSUM 'expert' claimed this sentence was 'anomalous' because it contained some reported speech and should be eliminated before CUSUM processing. Strangely, he did not identify it as anomalous in his original report and did not exclude it from the specimen text he analysed. The point marked 'B' represents a sentence of 23 words of which 20 were habit words (87%), i.e. as many habit words as the disputed sentence but in a sentence of shorter length! There were also two pairs of successive sentences within the specimen text which, taken as single samples of text, contained 32 out of 45 (71%) and 28 out of 40 (70%) habit words respectively. It follows that there can be no justification for claiming that the CUSUM method has shown that the defendant is
unlikely to ever utter 20 habit words out of 29 as is found within the disputed sentence.

FIGURE 8 Scatterplot of the proportion of habit words in each sentence from several specimen texts and in the single disputed sentence versus the total number of words in those sentences. The point marked Q represents the disputed sentence; the points marked A and B represent specimen sentences of particular interest.

Interestingly, during the court proceedings it was intimated by defence counsel that if the fifth word "know" was replaced by the word "think" then the sentence might not be disputed. This substitution, however, would have no effect whatsoever upon the CUSUM analysis since the word counts would remain exactly the same.

Attention was drawn in the previous paper [1] to the unsatisfactory way in which 'anomalies' are dealt with within the CUSUM method. It is said that sentences which include, for example, quotations of other people's words or lists or even an abrupt change of subject can produce anomalies in the CUSUM charts. It is common therefore for such anomalies to be removed on an ad hoc basis from the specimen texts analysed. Without clear rules for deciding what constitutes an anomaly this is a dubious practice. In the case referred to above involving the single disputed sentence, a reasonable argument could be made that the phrase "on the advice of my solicitor" is a stock phrase that may not be representative of the defendant's mode of utterance. If this phrase is omitted from the disputed sentence, the habit rate drops to 15 out of 23 words (65%), a rate which falls well within the range of variation exhibited by the specimen sentences of a similar length as shown by Figure 8.

In other cases where the practitioners of the CUSUM method say that the text is the utterance of more than one person but they cannot identify which parts are by whom, they appear to rely upon a general lack of consistency between the two graphs in the CUSUM chart. However, no objective measure of this inconsistency is used and no allowance seems to be made for the reasonable expectation that some people will be less consistent than others or for the greater variation normally encountered when the habit
analysed is a less frequently occurring one (e.g., words beginning with a vowel only). Indeed, reports have been issued concluding that a disputed text is of multiple authorship based upon an analysis of a single specimen text or even no specimen text at all.

Figure 9(a) shows a CUSUM chart for a disputed text which an 'expert' concluded was of multiple authorship. Figure 9(b) shows the variation of habit rate within the text by means of a moving average over five sentence groups. Figures 9(c) and 9(d) show a pair of graphs for a specimen text attributed to the purported author of the disputed text. The variation in habit rate within the disputed and specimen texts is arguably of a similar nature and extent.

FIGURE 9 (a) CUSUM chart for the combined habit of Short + Vowel words in a disputed text which an 'expert' concluded was of multiple authorship. (b) A histogram of the 5 sentence moving average habit rate in the disputed text used to produce 9(a). (c) CUSUM chart for the combined habit of Short + Vowel words in a specimen text from the purported author of the disputed text used to produce 9(a). (d) A histogram of the 5 sentence moving average habit rate in the specimen text used to produce 9(c).

A further factor that ought to be taken into consideration where the disputed text is a police interview record (in which a police officer has written a contemporaneous account of the questions put and the answers given) is the influence of the note-taking process itself. When accurate transcripts of tape-recorded interviews are studied, they reveal that the verbal utterances of most people are poorly structured with repetitions, hesitations, self-corrections and so on and it is often difficult to partition such a transcript into discrete sentences. It would not be unnatural for a police officer writing contemporaneous notes of an interview to filter out at least some of the extraneous words. This could well be done, if not subconsciously, with the best of intentions so as to obtain an unambiguous and clear account. In addition, a notetaker is a human being and not a machine so honest mistakes will be made, not least because words are generally spoken at a much faster rate than they can be written down and the notetaker may often have difficulty in keeping up. In controlled tests designed to investigate maximum speeds of writing [7] it was found that the notetakers made errors of many types including omitting words, substituting one word for another of similar meaning and even inserting additional words.

Leaving aside the legal issue of when such an interview record is admissible as evidence in court, it is not unreasonable to expect that the imperfect nature of the interview recording procedure may well lead to 'inconsistencies' within CUSUM charts or to a difference in overall habit rate between such a record and specimen texts written by the interviewee himself. The CUSUM method has been used to distinguish between persons on the grounds that their habit rates are a few percentage points apart, or to argue that differences of a few percentage points within a text demonstrate multiple authorship. Clearly, when the text is a manually recorded interview transcript some allowance at least ought to be made for the effects of the imperfect recording process and it is debatable whether this can be done satisfactorily, if at all.

The findings of other researchers

Several other research groups have independently investigated the CUSUM method and found it to be unreliable. Canter [8] points out that no aspect of human behaviour reveals such high levels of consistency as those required for the CUSUM technique to work. He also criticises the subjectivity of the interpretation of the CUSUM charts and uses a correlation coefficient statistic to objectively evaluate CUSUM charts constructed for texts of known single and multiple authorship. A correlation coefficient can be calculated directly from the original CUSUM data values avoiding the need to apply a scaling factor as required for the CUSUM chart. With this objective approach Canter found that a CUSUM analysis would mistakenly attribute multiple authorship for about half of 107 texts of a general nature where there was only one true author and the analysis would only attribute multiple authorship in about one third of 130 texts where there were actually two authors. In further testing using 52 samples of single-authored texts drawn from confessions, multiple authorship was mistakenly attributed to about a third of them. In 87 mixed texts taken from this same material only about a third were identified as being of multiple authorship. In tests using visual inspection of the CUSUM charts only, a volunteer sympathetic to the CUSUM method was not able to identify multiple authorship any more reliably than the correlation testing.
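
The idea of replacing visual inspection with a correlation coefficient can be illustrated as follows. The exact statistic and decision threshold used by Canter are not given here, so this sketch simply assumes an ordinary Pearson correlation between the two unscaled cumulative sum series; the function names are hypothetical. Because a correlation coefficient is unaffected by multiplying either series by a positive constant, no scaling factor is needed.

```python
from math import sqrt


def cusum(values):
    """Cumulative sum of the deviations of each value from the mean of all values."""
    mean = sum(values) / len(values)
    running_total = 0.0
    series = []
    for value in values:
        running_total += value - mean
        series.append(running_total)
    return series


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("two sequences of equal length (at least 2) are required")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def cusum_correlation(sentence_lengths, habit_counts):
    """Correlation between the sentence length and habit word cumulative sums."""
    return pearson(cusum(sentence_lengths), cusum(habit_counts))


# Hypothetical per-sentence counts, for illustration only.
r = cusum_correlation([12, 25, 8, 30, 17, 21, 9], [7, 12, 5, 16, 8, 12, 4])
```

A value close to 1 corresponds to two graphs that would track each other closely on a CUSUM chart. Where the dividing line between 'consistent' and 'inconsistent' should be drawn is a separate question that this sketch does not attempt to answer.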

Hilton and Holmes [9] also express serious concerns about the subjective nature of the CUSUM method. They conclude that for a passage of text by Jane Austen the interpretation of the corresponding CUSUM chart by Morton and Michaelson [2] was biased reflecting their knowledge of its authorship. Hilton and Holmes present an alternative method of analysing the data based upon weighted cumulative sums which have a firm basis in statistics and use rigorous well-established testing procedures. In tests, using weighted cumulative sums applied to written works only, they found that although this alternative method produced marginally better results than the original CUSUM technique, these results were still incorrect at least half of the time. Furthermore, they found that when the complete texts of Austen's novels were studied, the results they obtained were not stable with respect to sample size indicating that this author at least does not follow habits as rigidly as is required for the CUSUM method to determine authorship correctly. They conclude that such cumulative sum techniques do not yield consistently reliable indicators of authorship.

Canter and Chester [10] have also tested the weighted cumulative sum approach on specimen texts of single and multiple authorship originating from seven authors. They found that it does not reliably discriminate between such texts and there was not even a trend towards discrimination.

In one paper Sanford et al [11] report the results of an investigation into the reliability of the 'Short Words' habit for discriminating one author from another. Using test material of known authorship they found that within-subject variation in habit rate was not small in comparison with between-subject variation and that there was a lack of consistency in habit rate for individual authors. In another paper Sanford et al [12] applied the CUSUM method to a set of texts of known authorship representing both written and spoken utterances from 20 subjects. These texts were analysed for various habits. It was found that in general there was a close linear relationship between the numbers of habit words in each sentence and the sentence length, i.e., longer sentences did not have a significantly different habit rate from shorter sentences. However, the observed habit rates were not consistent for individuals from one text to another nor was there a significant difference between the subjects. Comparisons of written and spoken utterances of the same subjects showed statistically significant differences in accordance with other scientific comparisons of speech and writing [13]. Sanford et al conclude that the CUSUM technique is based upon assumptions which are at best of limited reliability and are most likely completely false. They further conclude that the method should not be entertained as a forensic technique.

In unpublished material Farringdon has proposed the use of a test applied to weighted cumulative sum charts as an adjunct to Morton's CUSUM method. He uses a weighted cumulative sum chart to calculate a statistical measure of significance following a method described by Bissell [14, 15]. However, Bissell states that where the weighted chart observations themselves are unlikely to be normally distributed the test should be treated with caution and should be used for general guidance rather than for precise significance levels.
He also acknowledges that there still remains a subjective element in the test since the user has to select 'turning points' and a section of text to test within the weighted CUSUM chart. Limited trials of this statistical procedure by the present author have shown that choosing slightly different parameters can alter the outcome but the reliability of Farringdon's particular implementation of the test procedure cannot be properly evaluated until it is described in more detail. Nevertheless it is obvious that if the underlying assumption behind the CUSUM method regarding the consistency of habits within the utterances of one person is false, then extending the method in this way cannot redeem it. As stated above, Hilton and Holmes [9] and Canter [lo] found that weighted cumulative sum tests did not give reliable results. De Haan and Schils [16] have also found that authors are frequently not consistent in their habit rates in specimen texts. In order to test the effectiveness of the CUSUM method they used artificial 'texts' created by sampling sentence lengths from a distribution for real specimen texts, assuming different mean habit rates for them and then superimposing Science & Justice 1997; 37(2): 129-138

RA HARDCASTLE

random errors typical of those found in real texts upon the individual sentence values. They found that when one such text of 5 sentences was inserted into 25 sentences of another text with a different mean habit rate, the point of insertion was correctly identified in only about 15 out of 100 simulations when the mean habit rates differed by the maximum amount seen in any two samples in a large corpus of specimen texts. Even with larger inserted texts of 25 sentences they found that the success rate did not exceed 20%. They point out that basic statistics predict that sample sizes of about 275 observations (sentences) are necessary to distinguish between true proportions (habit rates) with a mutual difference of about 0.10 at the 5% significance level. They conclude that the CUSUM technique is totally unreliable for detecting insertions as small as 5-25 sentences. Canter [8] points out that the possibility remains that the advocates of the CUSUM method have their own special way of interpreting their charts that is particularly sensitive to mixed authorship and does not lead to the misattributions found by others. If so, this goes against the claim that anyone can interpret CUSUM charts and such an ability needs to be demonstrated under controlled conditions unless the CUSUM 'experts' can specify more precisely how they interpret the CUSUM charts so that others can replicate their findings. The only published direct response to the criticisms of other researchers by Morton [17] does nothing - to address this fundamental issue.
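The order of magnitude of the sample-size figure quoted from de Haan and Schils can be illustrated with the textbook normal approximation for comparing two proportions. The following is only a sketch under stated assumptions: the base habit rates (0.4 and 0.5), the statistical power (80%) and the choice of a one- or two-sided test are not given in the text above and are chosen here purely for illustration, and the function name `sentences_needed` is hypothetical. It is not claimed to be the calculation those authors performed, nor is it expected to reproduce their figure of about 275 exactly.

```python
from math import ceil
from statistics import NormalDist


def sentences_needed(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Approximate number of observations needed in EACH sample to
    distinguish true proportions p1 and p2 (normal approximation).

    Assumption: each observation is treated as an independent trial,
    which is a simplification of real sentence-level habit counts.
    """
    z = NormalDist()  # standard normal distribution
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)   # critical value for the significance level
    z_beta = z.inv_cdf(power)       # quantile for the assumed power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Illustrative (assumed) habit rates differing by 0.10
n_two_sided = sentences_needed(0.4, 0.5)
n_one_sided = sentences_needed(0.4, 0.5, two_sided=False)
print(n_two_sided, n_one_sided)
```

Under any such choice of assumptions the requirement runs to several hundred sentences per sample, which is far more than the 5-25 sentence insertions considered in the simulations.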

The CUSUM method in Court

Reference has already been made to how remarkable it would be if people were consistent enough in their use of so-called habit words for the CUSUM technique to work on text samples of the size and nature of those of forensic interest. Therefore it is not unreasonable to demand that the method be independently verified before it is used in any court proceedings. The fact that in a few court cases CUSUM evidence has already been heard does not in itself lend credibility to the method.

The claim that even a layman can interpret a CUSUM chart is clearly absurd. The vast majority of non-scientists are likely to have only a limited experience of the interpretation of even simple graphs; to suggest that they can objectively assess the degree of similarity between two graphs, particularly cumulative sum graphs, is not credible. This is particularly the case where CUSUM charts are presented which have wildly different vertical scales. As previously noted [1] there is a tendency for CUSUM 'experts' to present CUSUM charts with compressed scales when a consistency is identified and expanded scales when an inconsistency is identified.

Consider the following examples taken from three actual cases examined by CUSUM practitioners. Figures 10(a), 10(b) and 10(c) show CUSUM charts for texts in these cases. These charts have been prepared from texts analysed by the 'experts' and have been plotted using the same scales. Perhaps the reader would like to decide whether they (presumably as no mere layman!) can be confident as to whether any of the charts indicates multiple authorship. The answers are given in a footnote below.

FIGURE 10 CUSUM charts for texts analysed by 'experts' in three separate cases. Panels (a), (b) and (c); horizontal axis: sentence number; legend: counts of habit words, sentence lengths.

It has already been pointed out that in a police interview where one person attempts to record contemporaneously the questions and answers spoken, there are likely to be honest errors of transcription and possibly some subconscious tidying up of the narrative. The CUSUM method, even if it were reliable, could not distinguish between replies which were not recorded verbatim and replies which were fabricated.

If the CUSUM method is applied to a typed version of an interview record there is an additional possible source of error. It is not uncommon to find a few small errors of transcription when a typed version is compared with the handwritten version, and in some cases replies are deliberately omitted from the typed version for legal or procedural reasons. In a recent case one of the CUSUM practitioners examined a typed interview record and declared it to be an unreliable record because of inconsistencies within the CUSUM chart. When the handwritten version was inspected, however, it was found to be a much abbreviated record with only the main context words written down. For example, "We have been told that we will get a hiding" in the typed version corresponded to "We told we get hiding" in the handwritten contemporaneous record. Since approximately 15% of the words in the typed version did not occur in the handwritten record, the results of the CUSUM analysis were rendered completely irrelevant.

This last case exhibited another failing. It is a supposed prerequisite of the CUSUM method that the chosen habit should be shown to be consistent within specimen texts from the subject concerned before it is used to test a disputed text. This was not done; only the disputed text was analysed. In other cases only one, or possibly two or three, specimen texts are said to have been checked for consistency of the chosen habit for the person involved.

Finally, it has to be said that the CUSUM method seems to have instilled a disciple-like faith in its followers. There is a distinct reluctance on their part even to consider other ways of looking at the same raw data, such as the scatterplot shown in Figure 8 or the moving average shown, for example, in Figure 2(b). In one case when such graphs were presented to the court by the present author they were dismissed out of hand by the CUSUM 'expert' because "they ignore the sequence and arrangement of words in sentences". This is plain nonsense; all that is required to construct a CUSUM chart is two lists of numbers, one containing the counts of habit words in each sentence and one containing the total numbers of words in each sentence; no other information about the text is needed. Indeed if the words within each sentence were mixed randomly this would have no effect whatsoever upon the resulting CUSUM chart.
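The point that a CUSUM chart depends only upon two lists of numbers can be checked directly. The sketch below assumes the usual construction, in which each chart is the running sum of deviations from the mean of the series, and uses words of two or three letters as an illustrative habit. The habit definition, the helper names (`cusum`, `sentence_counts`, `short_word`) and all of the sample sentences except the first (which is quoted from the typed interview record discussed above) are assumptions made for this example and are not taken from any case material.

```python
import random


def cusum(values):
    """Running sum of deviations of each value from the mean of the series."""
    mean = sum(values) / len(values)
    total, chart = 0.0, []
    for v in values:
        total += v - mean
        chart.append(total)
    return chart


def sentence_counts(sentences, is_habit):
    """Return (habit-word counts, sentence lengths), one entry per sentence."""
    habit = [sum(1 for w in s.split() if is_habit(w)) for s in sentences]
    length = [len(s.split()) for s in sentences]
    return habit, length


def short_word(word):
    # Assumed habit for illustration: words of two or three letters
    return 2 <= len(word.strip(".,;:!?\"'")) <= 3


sentences = [
    "We have been told that we will get a hiding",  # quoted in the text above
    "I was at home all day on the Friday",           # invented
    "He said he did not know the man",               # invented
    "Nobody asked me anything about the car",        # invented
]

habit, length = sentence_counts(sentences, short_word)

# Mix the words randomly within each sentence, keeping sentence order fixed
rng = random.Random(0)
shuffled = [" ".join(rng.sample(s.split(), len(s.split()))) for s in sentences]
habit_shuffled, length_shuffled = sentence_counts(shuffled, short_word)

# The two lists, and therefore both CUSUM series, are unchanged
print(cusum(habit) == cusum(habit_shuffled))
print(cusum(length) == cusum(length_shuffled))
```

Because the counts are taken sentence by sentence, any rearrangement of words within a sentence leaves both lists, and hence the chart, identical; only the order of the sentences themselves matters.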

Conclusions

The more independent, objective studies of the CUSUM method are done, the weaker becomes the claim that the method can provide reliable evidence of the authorship of texts of forensic interest. Even with various modifications to put it on a more statistically sound basis the method fails, and the underlying premise that people are consistent in terms of the so-called 'habits' is therefore false. The application of the CUSUM method in court cases often shows a deplorable lack of scientific rigour and objectivity and falls below the standards required of a reputable forensic technique. It is clear that forensic scientists seeking a linguistic tool for the determination of authorship must turn their attention to other methods.

As Canter [8] has emphasised, the fact that courts have accepted the unscientific approach of the CUSUM method as evidence is perhaps a more general cause for concern. Clearly courts need to be especially cautious when expert evidence is presented which is based upon a novel technique and when that technique has not been validated by others.

Footnote

The texts represented by the CUSUM charts in Figure 10 were analysed in three separate cases. One 'expert' claimed that the results of his CUSUM analyses were consistent with the text represented by Figure 10(a) being of single authorship and the text represented by Figure 10(b) being of multiple authorship. A second 'expert' claimed that his CUSUM analysis of the text represented by Figure 10(c) was consistent with it being of single authorship.

Acknowledgement

I wish to express my thanks to Tim Lane and his colleagues of the COBUILD lexical computing project at the University of Birmingham for their assistance in extracting data from the Bank of English corpus.

References

1. Hardcastle RA. Forensic linguistics: an assessment of the CUSUM method for the determination of authorship. Journal of the Forensic Science Society 1993; 33: 95-106.
2. Morton AQ and Michaelson S. The Qsum Plot. Internal Report CSR3-90. University of Edinburgh: Department of Computer Science 1990.
3. Morton AQ. Proper Words in Proper Places. Departmental Research Report 1991/R18. University of Glasgow: Department of Computing Science 1991.
4. Morton AQ. The Scientific testing of utterances. Cumulative sum analysis. Journal of the Law Society of Scotland 1991; 357-359.
5. Morton AQ and Farringdon MG. Identifying Utterance. Expert Evidence 1992; 1: 84-92.
6. Data drawn from the Bank of English corpus created by COBUILD at the University of Birmingham, Edgbaston, Birmingham B15 2TT.
7. Hardcastle RA and Matthews CJ. Speed of writing. Journal of the Forensic Science Society 1991; 31: 21-29.
8. Canter D. An Evaluation of the 'Cusum' Stylistic Analysis of Confessions. Expert Evidence 1992; 1: 93-99.
9. Hilton ML and Holmes DI. An Assessment of Cumulative Sum Charts for Authorship Attribution. Literary and Linguistic Computing 1993; 8: 73-80.
10. Canter D and Chester J. Investigation into the claim of weighted CUSUM in authorship attribution studies. Forensic Linguistics (in press).
11. Sanford AJ, Aked JP, Moxey LM and Mullin J. (in press) Discriminating one author from another through simple habits of expression: an empirical analysis. In: French P and Coulthard M, eds. Papers in Forensic Linguistics. London: Routledge.
12. Sanford AJ, Aked JP, Moxey LM and Mullin J. A critical examination of assumptions underlying the cusum technique of forensic linguistics. Forensic Linguistics 1994; 1: 151-167.
13. Halliday MAK. Spoken and Written Language. Oxford: Oxford University Press, 1990.
14. Bissell AF. Weighted Cusums - method and applications. Total Quality Management 1990; 1: 391-402.
15. BS5703 British Standard 5703, Parts 2 (1980) and 3 (1981). Guide to data analysis and quality control using Cusum techniques.
16. de Haan P and Schils E. The Qsum plot exposed. 14th International conference on English language research on computerised corpora 1994; 13: 93-105.
17. Morton AQ. Response. Forensic Linguistics 1995; 2: 230-233.

Science & Justice 1997; 37(2): 129-138