JOURNAL OF VERBAL LEARNING AND VERBAL BEHAVIOR 5, 28-34 (1966)
The Effects of Authorship, Topic, Structure, and Time of Composition on Letter Redundancy in English Texts¹

WILLIAM J. PAISLEY
Institute for Communication Research, Stanford University, California

Previous studies of letter redundancy in English texts showed differences which, because of nonsystematic sampling, could be regarded only as error variance. In this study thirty-nine 2528-character samples from English translations of 9 Greek texts were selected to permit controlled analyses of authorship, topic, structure, and time-of-composition factors. Letter redundancy was found to covary with all 4 factors. Authorship and topic differences are of idiographic interest; they may also represent control problems in information-theory-based studies of verbal behavior. The structural analysis showed that prose texts are more redundant than verse texts; this finding has implications for the study of special structural constraints (e.g., telegraph English, aircraft-control English). Translations of the same text from the 14th, 16th, and 20th centuries showed that English letter redundancy is decreasing, as Zipf's "principle of least effort" (1949) would predict.
Early estimates of uncertainty in English letter sequences were obtained by asking Ss to guess the next letter in a passage n letters long. It was assumed in these studies (Shannon, 1951) that each S would consult his learned set of letter-sequence probabilities to produce a maximum-likelihood guess. By varying the amount of preceding context, estimates of sequential constraint over 0, 1, 2, ..., n intervening letters were obtained.²

Newman and Gerstman (1952) eliminated the S (and his perhaps idiosyncratic strategy of guessing) by analyzing texts of printed English. They computed distributional constraint (U_max − U_obs) from actual frequencies of occurrence of the 26 letters plus space and period. They estimated sequential constraint (U1:2,3,4...n) from the frequencies with which pairs of the 28 characters appeared either contiguously or separated by n intervening characters. Their sample, 10,000 characters from the book of Isaiah in the King James version of the Bible, yielded values similar to those obtained by Shannon.

Newman and Waugh (1960) expanded the Newman-Gerstman study by including 20,000 characters from The Atlantic Monthly and a work by William James. They found that the three texts differed both in distributional constraint and in sequential constraint. The Newman-Waugh study marked a turning point in the application of information measures to printed texts. Whereas investigators had sought previously to estimate a single set of redundancy values for the English language, Newman and Waugh showed that samples of text differ and that single samples should not be accepted as representative.
¹ This study was improved substantially by the suggestions and criticisms of Dr. Leonard Horowitz of Stanford University.
² This discussion will adhere to the terminology of Garner (1962), since that survey of information measurement in psychology is becoming a standard reference. Note, however, that estimates of sequential constraint in this study (as in the precedent studies by Newman and Gerstman, 1952; Newman and Waugh, 1960) omit interaction terms; the sum of interaction terms has proved to be about zero in Garner's analyses (1962, pp. 235-239).
LETTER REDUNDANCY
Factors Affecting Redundancy. A demonstration of differences between the Bible, The Atlantic Monthly, and the work of William James leads to little understanding of redundancy patterns in these or other texts. Each text is the work of a different writer discussing a different topic in a different century. Which factor accounts for the difference in redundancy? Secondary analysis of the Newman-Waugh data cannot answer this question; the factors are confounded. It is necessary first to specify which factors may affect redundancy in English texts and secondly to test each factor in an experimental design in which other factors are held constant.

Granting that only vocabulary, spelling, and syntax can affect letter redundancy (that is, which words are chosen, how they are spelled, and how they are arranged in sequences), four antecedent factors may be suggested:

(1) Authorship Factor. Each writer's vocabulary is characterized both by unique content and by unique probabilities of occurrence of even the most common words. Mosteller and Wallace (1963) have published an economical proof of this.

(2) Topic Factor. Many topics require special vocabularies. That is, a travel guide contains names of places; a chemistry text contains chemical terms; the book of Isaiah (the Newman-Gerstman sample) contains names of ancient kingdoms and many future-tense verbs.

(3) Structure Factor. Two great formal structures of written English are prose and verse. These differ in spelling; there are poetic variants of many words. They differ even more in vocabulary and syntax; poets labor to find the best places for the best words.

(4) Time-of-Composition Factor. English vocabulary and spelling change constantly. Vocabulary change now keeps pace with technology and outstrips spelling change. In the 17th century it was spelling that shifted
rapidly from the arbitrary usages of the Elizabethans to the frozen forms of the dictionary makers.

METHOD

With redundancy patterns attributed to variation in four source factors, it is necessary to sample texts in such a way that only one factor is free to vary in each comparison. Thus a design testing the authorship factor should control the topic, structure, and time-of-composition factors. To provide such control a procedure has been adopted in this study which must be credited to Newman and Waugh (1960) in its germinal form. Seeking to compare redundancy levels in three languages, they established comparability by choosing translations of the same text (Isaiah 27-28) in Samoan and Russian. The value of such a procedure in the present study is obvious: by selecting English translations of the same text, executed in the same structure at about the same time by two or more translators, we hold constant all factors except authorship (or "translatorship").

Method of Analysis. The present study follows the lead of Newman-Gerstman and Newman-Waugh in coding printed texts in 28 characters including the space and period. All end-stop punctuation marks (semicolon, colon, exclamation point, question mark, ellipsis, period) were coded as period; other forms were ignored. Numerals, parentheses, and special characters were ignored. No space was inserted following a period. Texts were keypunched in a uniform format for processing on the Stanford Computation Center IBM 7090 and Burroughs B-5000.³

Each sample consisted of 2528 characters from the beginning of the excerpt cited. This sample is only one-fourth as large as the 10,000-character samples of Newman-Gerstman and Newman-Waugh. By reducing the size of each sample it was possible to process the great number of texts required for controlled comparisons. Altogether, 39 texts totalling 98,542 characters were included in the study. These texts cover 3 time periods, 18 authors, and 9 topics.
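The coding scheme, and the two information measures computed from the coded text (single-letter uncertainty, −Σ p log2 p, and the constraint between characters a fixed distance apart), can be sketched as follows. This is a minimal illustration, not the program that was run on the IBM 7090. The function names are hypothetical. The handling of cases the coding rules leave unspecified (hyphens, line breaks, runs of end-stop marks) is an assumption, and constraint at each distance is taken here to be the contingent uncertainty U(x) + U(y) − U(x, y) in the Garner terminology, without the empirical bias correction. How the per-distance values are combined into one figure is left to the caller.

```python
import math
from collections import Counter

END_STOPS = ";:!?."  # every end-stop mark is coded as a period


def encode(text):
    """Reduce raw text to the 28-character alphabet: a-z, space, period."""
    out = []
    for ch in text.lower().replace("…", "."):
        if ch in END_STOPS:
            # Assumption: a space before an end-stop is dropped and a run
            # of end-stops collapses to one period, so that space-period
            # and period-period pairs cannot occur.
            while out and out[-1] == " ":
                out.pop()
            if not out or out[-1] != ".":
                out.append(".")
        elif ch.isspace():
            # No space follows a period, and spaces are never doubled.
            if out and out[-1] not in " .":
                out.append(" ")
        elif "a" <= ch <= "z":
            out.append(ch)
        # Anything else (commas, numerals, parentheses, hyphens) is ignored.
    return "".join(out)


def uncertainty(counts):
    """U = -sum(p * log2 p) over the observed categories, in bits."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def single_letter_uncertainty(coded):
    """Uncertainty of the single-character frequency distribution."""
    return uncertainty(Counter(coded))


def constraint_by_distance(coded, max_distance=15):
    """Contingent uncertainty between characters d positions apart,
    for d = 1 .. max_distance (one bivariate matrix per distance)."""
    values = []
    for d in range(1, max_distance + 1):
        first, second = coded[:-d], coded[d:]
        joint = Counter(zip(first, second))
        values.append(uncertainty(Counter(first))
                      + uncertainty(Counter(second))
                      - uncertainty(joint))
    return values


coded = encode("Blessed are the poore in sprete, "
               "for theirs is the kyngdome of heven.")
print(coded)
print(round(single_letter_uncertainty(coded), 3))
```

A sample this short gives a badly biased estimate; the example only shows the mechanics, and a real comparison would use a full 2528-character sample for each text.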
The published procedure of Newman and Gerstman was followed with only two exceptions: sequential constraint was measured out to 15 spaces instead of 10, and an empirical correction for bias was computed. This correction, which subtracts a constant from all estimates of sequential constraint, has no effect on any of the comparisons reported below. Briefly, single-letter uncertainty was computed as −Σ p log2 p, where p is the relative frequency of occurrence of each of the 28 characters, while sequential constraint was computed from 15 bivariate matrices, one for each interletter distance, in which the frequencies of occurrence of the 780 possible character pairs were arrayed (there are actually 28² or 784 cells in each matrix, but four character pairs, namely period-period, space-space, period-space, and space-period, are prohibited by the coding scheme).

Additional Measures. Newman and Waugh suggested that letter frequency may covary with the "difficulty" of a text. This issue is extraneous to the defined problem of the present study, but two measures of "difficulty" (mean word length and number of different words in the first 350 words of each text) were computed nonetheless. They are cited as corroborative data in two of the analyses below.

³ Computer time was made available by National Science Foundation grant NSF-GP948 to the Stanford University Computation Center.

RESULTS
Authorship. Two recent verse translations of the Iliad (by Rees and Lattimore) were sampled to assess the influence of the authorship factor. The results of this analysis are shown in Table 1. It may be seen that each translator has a characteristic level of redundancy which rises and falls from book to book but holds its position relative to the level of the other translator. Over the five books the mean difference between translations is significant beyond the .01 level.⁴

TABLE 1
AN AUTHORSHIP DIFFERENCE: SINGLE-LETTER UNCERTAINTIES OF TWO TRANSLATIONS OF FIVE BOOKS OF THE Iliad

             Translator
Book      Rees     Lattimore    Difference
1         4.131    4.112        .019
6         4.145    4.107        .038
12        4.084    4.067        .017
18        4.076    4.066        .010
24        4.102    4.072        .030
Means     4.108    4.085        .023*

* t = 4.628, p < .01.

⁴ A t test of the difference between correlated means is used in each comparison involving matched translations (e.g., two texts by the same translator, two translations of the same text, etc.).
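The t test of the difference between correlated means can be retraced from the values printed in Table 1. The sketch below is illustrative only, and the function name is hypothetical. Because the tabled uncertainties are rounded to three decimals, the statistic it yields is close to, but not identical with, the reported t of 4.628.

```python
import math
from statistics import mean, stdev


def paired_t(xs, ys):
    """t for the mean of the pairwise differences, with n - 1 df."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Single-letter uncertainties for Books 1, 6, 12, 18, and 24 (Table 1).
rees = [4.131, 4.145, 4.084, 4.076, 4.102]
lattimore = [4.112, 4.107, 4.067, 4.066, 4.072]

t, df = paired_t(rees, lattimore)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

Pairing by book removes the book-to-book (topic) variation from the error term, which is why a mean difference as small as .023 bits can reach significance with only five pairs.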
Except for mean word length, other measures corroborate the finding of less redundancy in Rees's texts. In every book Rees uses more different words than does Lattimore; this difference is significant beyond the .05 level. In four of the five books less sequential constraint is found in Rees's translation than in Lattimore's, although the mean difference over the five texts is not significant.
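The two corroborative measures cited here, mean word length and the number of different words in the first 350 words, could be computed along the following lines. The tokenization rule is an assumption: the procedure does not say how hyphenated words, contractions, or inflected forms were counted, so this sketch treats each unbroken run of letters as a word and compares words without regard to case. The function names and the sample sentence are made up for illustration.

```python
import re


def words(text):
    """Split text into lowercase words (runs of the letters a-z)."""
    return re.findall(r"[a-z]+", text.lower())


def mean_word_length(text):
    """Average number of letters per word."""
    tokens = words(text)
    return sum(len(tok) for tok in tokens) / len(tokens)


def different_words(text, window=350):
    """Number of distinct words among the first `window` words."""
    return len(set(words(text)[:window]))


sample = "Sing, goddess, the anger of Achilles; sing the anger."
print(round(mean_word_length(sample), 2), different_words(sample))
```

Counting distinct words over a fixed 350-word window, not over the whole sample, keeps the measure comparable across texts whose 2528 characters contain different numbers of words.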
Authorship Differences in "Committee" Translations. Many versions of the Bible have been rendered by committees of translators. Given the interaction which usually takes place in committee efforts, it seems unlikely that any single translator on the committee could stamp an entire text with his singular style. Should we therefore expect a committee translation not to yield the consistencies observed above for an individual translation? Two recent committee translations of the New Testament were sampled to answer this question. One of them, the Basic English version, has built-in vocabulary constraints which might be expected to force uniformity on the several translators. The other, the Confraternity version, was undertaken as a conservative modernization of the Roman Catholic Bible. Both versions read smoothly, suggesting that their committees have attempted to integrate the individual contributions. Nevertheless, as Table 2 shows, authorship consistencies are not found in these committee translations of the Gospel of Matthew. The texts hold their relative positions through the 5th chapter (perhaps the work of a single translator in each case?), but reverse these positions in the 2nd and 6th chapters. At this level of analysis, therefore, although replication is in order, it appears that a committee does not have a style in the sense that a single author does.

TABLE 2
THE EFFECT OF COMMITTEE AUTHORSHIP: SINGLE-LETTER UNCERTAINTIES OF TWO TRANSLATIONS OF FOUR TEXTS FROM THE GOSPEL OF MATTHEW

                             Translation
Chapter of Matthew    Basic English    Confraternity    Difference
2a                    4.061            4.093            -.032
5a                    4.103            4.095             .008
5b                    4.076            4.038             .038
6a                    4.059            4.078            -.019
Means                 4.075            4.076            -.001

Topic. Table 1 showed that each of the five books sampled from the Iliad had a characteristic level of uncertainty in both translations. Such differences are regarded in this study as topic differences. For example, Book 6 of the Iliad opens with a list of the warriors who fell in battle. Because of this roll-call there is an average of 188 different words in the first 350 words of each translation. Book 24, which begins with a council of the gods, has an average of only 175.5 different words in the first 350 words of each translation. Vocabulary increases as exposition requires; the corresponding shift in letter redundancy is here identified as a topic difference.

A statistical test of the topic difference calls for more than two replications of each topic (book), however. Therefore two additional translations were sampled for Books 1 and 24. Single-letter uncertainties of the four translations are presented in Table 3. It may be seen that Book 1 is consistently less redundant than Book 24, the mean difference passing the .01 level of significance. Since Richards' and Rieu's are prose translations, included to balance the two verse translations, it appears that topic differences are found in both prose and verse structures.

TABLE 3
A TOPIC DIFFERENCE: SINGLE-LETTER UNCERTAINTIES OF FOUR TRANSLATIONS OF TWO BOOKS OF THE Iliad

Translator    Book 1    Book 24    Difference
Richards      4.104     4.078      .026
Lattimore     4.112     4.072      .040
Rees          4.131     4.102      .029
Rieu          4.144     4.093      .051
Means         4.123     4.086      .037*

* t = 6.67, p < .01.

Topic Differences within a Single Discourse. The 5th chapter of Matthew, four texts from which were included in Table 2, contains the first part of the "Sermon on the Mount." This is an apparently homogeneous discourse; it might be expected that topic differences would not be found within it. Nevertheless, the Basic English and Confraternity translations showed consistent shifts from Matthew 5a to Matthew 5b. Is this only a fortuitous pattern? If there is a systematic topic difference between the two halves of Matthew 5, it should appear in other translations as well. Therefore two additional translations, the Phillips and the Revised Standard Version, were sampled. Table 4 shows single-letter uncertainties of the four translations of the two halves of Matthew 5. The pattern is almost as significantly regular as that of Table 3, the mean difference having a probability between .02 and .01.

TABLE 4
A TOPIC DIFFERENCE WITHIN A SINGLE DISCOURSE: SINGLE-LETTER UNCERTAINTIES OF FOUR TRANSLATIONS OF TWO TEXTS FROM THE GOSPEL OF MATTHEW

Translation        Matthew 5a    Matthew 5b    Difference
Confraternity      4.095         4.038         .057
Basic English      4.103         4.076         .027
Rev. Stnd. Ver.    4.103         4.077         .026
Phillips           4.120         4.080         .040
Means              4.105         4.068         .037*

* t = 5.43, .02 > p > .01.

Closer inspection of the text of Matthew 5 reveals differences to which information measures are sensitive. The first half of the chapter contains the Beatitudes with their repeated form, "Blessed are the . . . , for they . . . ." The second half of the chapter contains a reinterpretation of Mosaic law with the repeated form, "You have heard it
said, 'Thou shalt not . . . ,' but I say to you . . . ." The vocabulary subtly changes, and information measures detect the change.

Structure. Since the Iliad has been rendered many times both in prose and in verse, the influence of the structure factor may be assessed by sampling several translations in each structure of the same text. Thereby topic is explicitly held constant, while authorship differences are reduced to error variance within each structure. Four prose and four verse translations of Book 1 of the Iliad were processed, as reported in Table 5.

TABLE 5
A STRUCTURE DIFFERENCE: SINGLE-LETTER UNCERTAINTY, SEQUENTIAL CONSTRAINT, MEAN WORD LENGTH, AND NUMBER OF DIFFERENT WORDS IN FOUR VERSE AND FOUR PROSE TRANSLATIONS OF BOOK 1 OF THE Iliad

                  Single-letter    15-letter sequential    Mean word    Different words
Text              uncertainty      constraint              length       in first 350
Prose
  Lang et al.     4.077            1.449                   4.107        167
  Butler          4.095            1.542                   3.996        157
  Richards        4.104            1.669                   3.938        162
  Rieu            4.144            1.492                   4.138        172
  Prose Means     4.105            1.538                   4.045        164
Verse
  Lattimore       4.112            1.395                   4.223        168
  Bryant          4.112            1.428                   4.322        177
  Rees            4.131            1.426                   4.202        172
  Ernle           4.159            1.286                   4.568        188
  Verse Means     4.128            1.384                   4.329        176

Although the two types of structure differ greatly in mean single-letter uncertainty, high variance within each type renders the mean difference nonsignificant. Yet higher uncertainty seems to be related to verse structure, and other measures strengthen this relationship: (a) there is an average of 176 different words per 350 in the verse texts, only 164 in the prose texts; (b) all prose texts have more sequential constraint than any verse text; (c) all verse texts have longer mean word lengths than any prose text. Differences (b) and (c) are each significant at the .05 level (t = 2.61 and 3.05,
respectively). Thus the influence of the structure factor appears to have been established.

Time of Composition. The fact that we call some words and some spellings "archaic" is presumptive evidence that time of composition may affect letter redundancy. This hypothesis may be tested by sampling translations of the Bible from widely separated centuries. A search of library shelves brought to light one translation from the 14th century and four translations from the 16th century. A primary criterion of selection in each case was the preservation of the original text. Table 6 presents single-letter uncertainties of the five early texts and the four modern texts previously discussed. A trend of decreasing redundancy over time is clearly visible, although variance heterogeneity prohibits a t test of the mean difference. A median split of the eight cases in the 16th and 20th centuries yields an exact-probability-test p of .014, however, since the 2 × 2 table has two zero cells.

TABLE 6
A TIME-OF-COMPOSITION DIFFERENCE: SINGLE-LETTER UNCERTAINTIES OF NINE TRANSLATIONS OF MATTHEW 5A

Translation        Single-letter uncertainty    Mean
Wycliffe (1389)    4.043

Tyndale (1525)     4.081
Geneva (1560)      4.085
Bishops (1572)     4.086
Rheims (1582)      4.081                        4.083

Confraternity      4.095
Basic English      4.103
Rev. Stnd. Ver.    4.103
Phillips           4.120                        4.105

Next it may be asked whether the time-of-composition difference is a function of vocabulary or of spelling. This question is answered by modernizing the spelling of the two oldest translations, the Wycliffe and the Tyndale. When this is done, single-letter uncertainty in the Wycliffe text shifts upward from 4.043 to 4.050. Such a minor change indicates that vocabulary rather than spelling must be the primary source of higher redundancy in this 14th century text. When Tyndale's text is respelled, however, single-letter uncertainty shifts from 4.081 to 4.111, an increase which leaves it high in the range of modern translations. Therefore the time-of-composition difference in this 16th century text is chiefly a function of spelling.

Does modern spelling result in less redundancy for 16th century texts in general? Since the respelled Tyndale text shifted in the direction of the overall mean, a regression effect may be suspected. To test this further a 16th century text is needed in which archaic spelling yields uncertainty already above the overall mean. If this text when respelled rises still further, it may be concluded (tentatively) that modern spelling is less redundant in general. Of a set of Elizabethan texts not otherwise reported here, an excerpt from Christopher Marlowe's Tamburlaine Part II had the highest single-letter uncertainty, 4.139. When respelled the text moves still farther above the overall mean, to 4.142. Therefore regression is not a sufficient explanation of decreased redundancy in the respelled texts.

There is a more adequate explanation: decreased use of the letter e. There is an average of 330 e's in each of the four 16th century Bible texts but an average of only 274 in the four corresponding 20th century texts (as an example of 16th century fondness for the letter e: "Blessed are the poore in sprete, for theirs is the kyngdome of heven."). This difference is significant beyond the .01 level (t = 4.09). More to the point, the Tyndale text when respelled contains only 290 e's, whereas it originally contained 350. Since deviation of letter frequencies either above or below the equal-probability norm of 90.3 occurrences (2528/28) increases redundancy, one source of letter redundancy in 16th century English has been identified.

DISCUSSION
The finding that redundancy covaries significantly with authorship has two implications. First, letter redundancy may be added to the repertory of measures on which writing styles are compared. Secondly, the relationship of redundancy to "difficulty" should be explored as an additional perspective on the "readability" of texts. Since difficulty is an attribute not of texts but of the transaction between readers and texts, it seems most promising to continue this line of investigation in the laboratory, focusing on readers' behavior as a function of textual redundancy. The measurement of difficulty in terms of polysyllabism, sentence length, vocabulary burden, etc., is an unnecessarily inferential procedure.

The finding that redundancy covaries significantly with topic is important not so much for the future study of topic differences as for the study of other factors in the presence of known topic differences. It is not meaningful to demonstrate that one writer is more redundant than another unless topic has been explicitly controlled.

Structure, like topic, is interesting as a control variable in future studies. Telegraph English and aircraft-control English (cf. Frick and Sumby, 1952) are examples of language constrained by structure.

The time-of-composition factor has fascinating implications for the study of linguistic evolution. The fact that letter redundancy decreases over time seems to strengthen Zipf's "least effort" theory (Zipf, 1949), especially when superfluous e's disappear from words with the passage of time. When English became increasingly a written language, the graphemes poore and poor became unequal in the effort required to transmit them. It may well be, in linguistic evolution, that the grapheme that conveys the same information with the least redundancy is most likely to survive.

In conclusion, the search for single estimates of letter redundancy in English should be discontinued. A pattern of systematic variation has been found which deprives the single estimate of significance.

REFERENCES
FRICK, F. C., AND SUMBY, W. H. Control tower language. J. Acoust. Soc. Amer., 1952, 24, 595-596.
GARNER, W. R. Uncertainty and structure as psychological concepts. New York: Wiley, 1962.
MOSTELLER, F., AND WALLACE, D. L. Inference in an authorship problem. J. Amer. Stat. Assn., 1963, 58, 275-309.
NEWMAN, E. B., AND GERSTMAN, L. J. A new method for analyzing printed English. J. Exp. Psych., 1952, 44, 114-125.
NEWMAN, E. B., AND WAUGH, N. The redundancy of texts in three languages. Information and Control, 1960, 3, 141-153.
SHANNON, C. Prediction and entropy of printed English. Bell System Tech. J., 1951, 30, 50-64.
ZIPF, G. K. Human behavior and the principle of least effort. Cambridge: Addison-Wesley, 1949.

(Received September 14, 1964)