
Entropy of English Text: Experiments with Humans and a Machine Learning System Based on Rough Sets

HAMID MORADI, JERZY W. GRZYMALA-BUSSE, and JAMES A. ROBERTS
Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, Kansas 66045, USA

ABSTRACT

The goal of this paper is to show the dependency of the entropy of English text on the subject of the experiment, the type of English text, and the methodology used to estimate the entropy. Claude Shannon first described the technique for estimating the entropy of English text by having a human subject guess the next letter after viewing a string of characters taken from actual text. We show how this result is affected by using different humans in the experiment (Shannon used only his wife) and by using different types of text material (Shannon used only a single book). We also show how the results are affected when we replace the human subjects with a machine learning system based on rough sets. Automating the play of the guessing game with this system, called LERS, gives rise to a lossless data compression scheme.

1. INTRODUCTION

The entropy of a language is a statistical parameter which measures how much information is produced on the average for each letter of a text. Equivalently, it is the average number of binary digits per letter required to encode the original text in the most efficient way. In his 1951 paper [22], Shannon proposed a new method of estimating the entropy of English text. This method exploits the fact that anyone speaking the language possesses an enormous knowledge of the statistics of the language
and depends on experimental results. In his experiment, Shannon randomly chose 100 passages, each 15 characters in length (each character could only be one of the 26 English letters or a space), from the book Jefferson the Virginian and asked his spouse to guess each character of every passage. The subject continued to guess each character until the correct guess was made, and was then asked to guess the next character with the knowledge of the previous characters already guessed. The number of guesses was recorded for each character. Shannon's results show that the prediction gradually improved, apart from some statistical fluctuation, with increasing knowledge of the past. Using the number of guesses that it took for each letter, Shannon developed formulas for the upper and lower bounds of the entropy of English text. In another experiment, Shannon randomly selected 100 passages 101 characters in length (from the same book) and asked his subject to guess the next letter in each sequence, given the previous 100 letters. He then concluded that long-range statistical effects (up to 100 letters) reduce the entropy of English text to the order of one bit per letter.

Many papers related to Shannon's paper have been written. Several provide important theoretical material. Maixner [13] clarifies the details behind the derivation of Shannon's lower bound for the Nth-order (N - 1 known letters) entropy approximations. He gives an explicit proof valid for an ideal predictor, adding a negative result for general experimental conditions. For Shannon's lower bound formula to fail, the subject would have to be exceptionally poor at predicting in a quite unusual type of language; still, in principle, this indicates potential sources of distortion, though they are much less prominent in real applications. Savchuk [21] gives necessary and sufficient conditions on the source distribution for Shannon's bound to hold with equality. Background on the limitations of entropy estimates can be found in [1, 2, 4, 15, 19].

Several papers comment on Shannon's empirical results for English text. Grignetti [8] recalculates Shannon's estimate of the average entropy of words in English text. Treisman [23] comments on contextual constraints in language. White [25] uses a dictionary encoding technique to achieve compression for printed English, while Jamison and Jamison [12] and Wanas et al. [24] discuss the entropy of other languages. Shannon's estimates are used in many wide-ranging applications. Miller and Selfridge [14] discuss the effects of meaningful patterns in English text on the entropy calculations. They conclude that when short range contextual dependencies are preserved in nonsense material, the nonsense is as readily recalled as meaningful material. From these results it is argued that contextual dependencies extending over five or six words permit positive transfer, and that it is these familiar dependencies, rather than the
meaning per se, that facilitate learning. In [3], Blackman discusses interference in communication links and its effect on the information that is lost. In [6], Cover estimates the entropy of English text by altering Shannon's technique and having the subjects place sequential bets on the next symbol of text. The amount of money that each subject places on a bet depends on the underlying probability distribution for the process. By using the subject's capital after n bets, Cover estimates that English text has an entropy of approximately 1.3 bits/symbol, which agrees well with Shannon's estimate. An important reference work on the subject is the book by Yaglom and Yaglom [26], which contains an extensive bibliography.

Since Shannon's paper in 1951, many have tried to improve his experiments to calculate the entropy of English text more accurately and definitively. Burton and Licklider [5] took Shannon's experiment a step further. As the source of test material they used ten novels instead of a single volume (Shannon's only source was Jefferson the Virginian). These ten novels were of about the same level of reading difficulty and abstraction. From each source they randomly selected ten passages of each of ten lengths (number of letters known to the subjects): 0, 1, 2, 4, 8, 16, 32, 64, 128, and approximately 10,000 letters. A different subject was used for each novel. Subjects were given the passages of different lengths and in each instance were asked to guess the next letter. The numbers of guesses were recorded and Shannon's formulas were used to calculate the upper and lower bound entropies. Based on the results of their experiment, Burton and Licklider concluded that written English does not become more and more redundant as longer and longer sequences are taken into account. In fact, their results showed that knowledge of more than 32 characters does not necessarily make for better prediction of the next letter. These results seem to dispute Shannon's conclusion that knowledge of more letters (up to 100 characters) could reduce the entropy of English text.

We took Shannon's experiment a step further than Burton and Licklider. One hundred passages 64 characters in length were chosen from the novel Scruples II by Judith Krantz. Just as in Shannon's first experiment, the subject was asked to guess every letter in each sequence of text, given the knowledge of previously guessed letters. This is different from Burton and Licklider's experiment, where the subjects were given the first 64 letters (in the case of passages that were 64 letters long) and asked to guess the next letter only. The experiment also differs from Shannon's experiment, because Shannon's passages were only 15 letters in length, and ours were 64 letters long. Shannon also did an experiment where his passages were one hundred letters long, but he did not ask his subject to guess every letter of each sequence, just the last one. Our results show that the entropy of English text decreases until about 31
letters are known to the subjects. This result agrees well with Burton and Licklider's results and further questions Shannon's conclusion that knowledge of more than 32 letters could significantly reduce the entropy of English text.

In 1964, Paisley [16] published a paper in which he studied the variation of English text entropy due to authorship, topic, structure, and time of composition. Paisley's allowed character set included the 26 letters of English text, the space, and a period; all punctuation marks were coded as periods, while numerals, parentheses, and special characters were ignored. Paisley's test material included 39 texts of 2,528-character samples (totaling 98,542 characters) from the English translations of 9 Greek texts. These texts covered 3 time periods, 18 authors, and 9 topics. Shannon's guessing scheme was not used in this experiment. Instead, Paisley used frequencies of occurrence of all possible characters and character pairs to estimate the single letter probabilities and the conditional probabilities of the letter combinations for every text. With redundancy patterns attributed to variation in four source factors, the testing scheme had to be designed so that only one factor was free to vary in each comparison. For example, when testing for authorship, Paisley used two verse translations of the same book (the Iliad) by two different authors (Rees and Lattimore) as his test material, keeping topic and structure constant; the translations were done in about the same time period (constant time of composition). Paisley concluded that letter redundancy depends on all four factors.

In our paper, Shannon's guessing scheme (not letter probabilities as in Paisley's experiment) was used to study the effects of the topic and of the subject of the experiment on the entropy of English text. Eight different subjects were used (Shannon only used his wife). In our first experiment we concluded that, using Shannon's guessing scheme, the entropy of any English text approaches its limit when it is calculated from the number of guesses made for the 32nd character of the sequence with the knowledge of the previous 31 letters (Burton and Licklider [5] reached a similar conclusion). For this experiment, we took advantage of that conclusion. One hundred sequences of text 32 letters in length were chosen from each of four different books (a total of 400 sequences). The books were selected to have entirely different topics. As before, the only acceptable characters in our sample sequences were the 26 English alphabet letters and a space. Each subject was given the first 31 characters of every sequence and was asked to guess the 32nd. From this experiment we concluded that the entropy of English text indeed depends on both the subject of Shannon's experiment and the type of text used.

Shannon proposed a thought experiment in which identical twins at either end of a communication channel could serve as a lossless data
coder/decoder. The first twin guesses the sequence of text that is to be sent, character by character. The number of guesses for each character is then coded and sent instead of the character itself. Given the number of guesses for each character, the twin at the other end is able to determine the correct letter. The case of identical human beings is of course a fantasy, but not that of identical machines. We are unaware of anyone having built Shannon's communication system using an AI system as the subject of Shannon's experiment. In this paper the idea is explored and experimented with, using LERS, a machine learning system based on rough sets. Rough set theory is a new tool to deal with uncertainty [9, 10, 11, 17, 18]. LERS can be trained to make educated guesses about the next letter in a sequence of letters. Obviously, the more examples the system sees, the better trained it will be and the better guesses it will make. Shannon's proposed communication system can practically be built to achieve text compression by using LERS as the identical twins at both ends of the communication channel.

Our next two experiments were designed to use LERS as the subject of Shannon's experiment. In the first experiment we suggested (as Burton and Licklider also did) that if 31 characters are known to the predictor, the number of guesses for the 32nd character (using Shannon's guessing scheme) can be used in Shannon's formulas to calculate an accurate estimate of the entropy of English text. This means that if identical twins could be found for the coder and the decoder and they had complete knowledge of the conditional probabilities, then we could achieve maximum compression of English text using Shannon's communication system. To use LERS as the subject of Shannon's experiment means that we would have to train the system to guess the 32nd character of a sequence, given the first 31 characters. There are 3^96 (i.e., 27^32) possible ways of arranging 32 characters in a sequence, and no computer is able to store a file with 3^96 sequences of 32 characters. In this experiment, we therefore used only a very small fraction of all possible sequences of length 32. Randomly selected sequences of text, 32 characters in length (other than those used for training purposes), were taken from the same book for testing purposes. Not surprisingly, LERS was not able to make good guesses in any of the test cases. Next, we trained LERS with examples of four characters each from the novel Scruples II by Judith Krantz. LERS used the first three characters of each sequence as known letters and the fourth letter as the decision value to learn from. The numbers of guesses made by LERS for the fourth character of each sequence were recorded and used in Shannon's formulas to calculate the estimated entropy of the novel. In this experiment LERS outperformed the human subjects.


Our research indicates that Shannon's communication system could be built using LERS as the prediction tool to achieve good compression of English text. This can be accomplished because LERS can be used both as the coder and as the decoder, thus eliminating the identical twin problem. Given more data to learn from, LERS could make even better guesses and therefore achieve even better compression rates.

2. ENTROPY OF ENGLISH TEXT FROM THE STATISTICS OF ENGLISH

The entropy of English text can be calculated from the N-gram entropy, denoted F_N, a "measure of the amount of information due to statistics extending over N adjacent letters of text" [22]. The N-gram entropy F_N is given by [22]

F u = - ~ . , p ( j , bi)log p ( j l b ~ ) i,j

-

~.,p(j, bi)logp(j,

bi) + ~ _ , p ( b i ) l o g p ( b i

i,j

),

i,j

where b_i is a sequence of N - 1 letters, an (N - 1)-gram, and j is an arbitrary letter following b_i. As N increases, F_N takes longer and longer statistics into account and approaches H, the entropy of English text,

$$H = \lim_{N \to \infty} F_N.$$

F_N can be calculated for N = 1, 2, and 3 from the standard tables of letter, digram, and trigram frequencies [20]. For a 27-letter alphabet (26 letters and a space), F_0 is, by definition, log_2 27 = 4.76 bits per letter. F_1, F_2, and F_3 can be calculated from the preceding formula, using N-gram frequencies, as 4.03, 3.33, and 3.1 bits per letter, respectively. Higher order approximations (F_4, F_5, ..., F_N) are not calculated this way, since tables of N-gram frequencies are not available. However, word frequencies have been tabulated [7] and can be used for further approximation.
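These low-order approximations can also be estimated directly from a sample of text rather than from published tables. The short Python sketch below counts N-grams in a sample file and evaluates the preceding formula for F_1, F_2, and F_3; the file name is illustrative, and a small sample will of course give noisier values than the standard tables [20].

import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # 26 letters plus the space

def normalize(text):
    # Keep only the 27 allowed characters; everything else becomes a space.
    return "".join(ch if ch in ALPHABET else " " for ch in text.lower())

def ngram_entropy(text, n):
    # F_n = -sum over (b_i, j) of p(b_i, j) * log2 p(j | b_i)
    text = normalize(text)
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(ngrams.values())
    if n == 1:
        return -sum((c / total) * math.log2(c / total) for c in ngrams.values())
    blocks = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    f_n = 0.0
    for gram, c in ngrams.items():
        p_joint = c / total                  # estimate of p(b_i, j)
        p_cond = c / blocks[gram[:-1]]       # estimate of p(j | b_i)
        f_n -= p_joint * math.log2(p_cond)
    return f_n

sample = open("sample_text.txt").read()      # hypothetical sample file
for n in (1, 2, 3):
    print(f"F_{n} estimate: {ngram_entropy(sample, n):.2f} bits per letter")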

2.1. IDEAL PREDICTOR

The best guess for the next letter, given the block of preceding N - 1 letters, is the letter with the highest conditional probability. The second best guess is the letter with the second highest conditional probability. An ideal predictor would guess the letters in order of decreasing conditional probability to ensure the best possible guess for each letter. Each letter is guessed correctly with a minimum of one guess and a maximum of 27 guesses (with a 27-letter alphabet). Using an ideal predictor thus reduces the coding process to a mapping of letters into numbers from 1 to 27: the most probable letter (based on the conditional probabilities) is mapped into 1, the second most probable letter into 2, and so on. The ideal predictor is therefore a transducer which translates the text into a sequence of numbers running from 1 to 27. The reduced text, containing numbers from 1 to 27, represents the same information as the original text. Only the reduced text is transmitted, reducing the number of bits required to transmit the information. This communication system is shown in Figure 1, taken from [22].

[Fig. 1. Communication system using reduced text.]

The system works in the following way. Suppose the letter A is to be transmitted. The letters preceding this letter are sent to the predictor and the predictor makes a guess at the letter. If A is the first letter, then no preceding letters are sent to the predictor. The predictor's guess and the actual text (the letter A) are sent to the comparator. If the guess was incorrect, the predictor is asked to guess again. This continues until the predictor guesses the letter correctly. Suppose the predictor guessed the letter A on the third try. The reduced text is then the number 3, which is transmitted over the channel to the receiver. The first stage on the receiver side is a comparator. The receiver must employ exactly the same predictor as the transmitter. The predictor at the receiver also makes guesses at the letter with the knowledge of the preceding letters, and continues guessing until it is told to guess the next letter. At the receiver, the comparator's function is to count the number of guesses the predictor makes. The third guess of the predictor will be the letter A, the original letter.
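The transducer idea can be illustrated with a short Python sketch. The predictor below is not the ideal predictor of this section; it simply ranks the 27 symbols by how often they followed the same short context in a fixed training text. Because the encoder and decoder build their predictors from the same training text, they make identical guesses, so the reduced text can be inverted exactly. The class name, the training string, and the context length are illustrative.

from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

class RankPredictor:
    """Ranks the 27 symbols by how often they followed a given context in the training text."""
    def __init__(self, training_text, context_len=2):
        self.k = context_len
        self.follow = {}                        # context -> Counter of following letters
        for i in range(len(training_text) - self.k):
            ctx = training_text[i:i + self.k]
            self.follow.setdefault(ctx, Counter())[training_text[i + self.k]] += 1
        self.default = Counter(training_text)   # unigram fallback for unseen contexts

    def ranking(self, preceding):
        counts = self.follow.get(preceding[-self.k:], self.default)
        seen = [ch for ch, _ in counts.most_common()]
        return seen + [ch for ch in ALPHABET if ch not in seen]

def encode(text, predictor):
    # Replace each letter by the number of guesses (1..27) the predictor needs.
    return [predictor.ranking(text[:i]).index(ch) + 1 for i, ch in enumerate(text)]

def decode(guess_numbers, predictor):
    # An identically built predictor turns the guess numbers back into text.
    text = ""
    for g in guess_numbers:
        text += predictor.ranking(text)[g - 1]
    return text

training = "the quick brown fox jumps over the lazy dog " * 50   # stand-in training text
p = RankPredictor(training)
reduced = encode("the lazy fox", p)
assert decode(reduced, p) == "the lazy fox"
print(reduced)   # small numbers dominate whenever the predictor guesses well

The better the predictor, the more the reduced text is dominated by small numbers, which is what makes subsequent coding of the guess numbers profitable.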
An upper and a lower bound for the N-gram entropy F_N can be calculated with the knowledge of the frequencies of the guess numbers in the reduced text. In [22], Shannon defines these bounds as follows:

$$\sum_{i=1}^{27} i\,(q_i^N - q_{i+1}^N)\,\log_2 i \;\le\; F_N \;\le\; -\sum_{i=1}^{27} q_i^N \log_2 q_i^N,$$

where q_i^N is the relative frequency of the number i in the reduced text when the preceding (N - 1)-gram is known, and q_{28}^N = 0.

3. EXPERIMENTS WITH HUMAN SUBJECTS

The first experiment carried out in this paper is an extension of Shannon's experiment.

3.1. FIRST EXPERIMENT

In this experiment, one hundred samples of text, each 64 characters long, were selected randomly from the book Scruples II by Judith Krantz. The only acceptable characters in our sample sequences were the 26 letters of the English alphabet and a space. All punctuation marks were ignored. If a sequence contained characters that could not be ignored without the text losing its meaning (for example, numerals), then that sequence was thrown away and another one was randomly selected. Our subject was intelligent, educated, and had a good understanding of the English language and its constraints. The samples were all unfamiliar to the subject. The subject was asked to guess the letters of each sample, starting with the first letter. If the subject guessed correctly, the subject was asked to guess the next letter in the sequence with the knowledge of the previous letters. If the guess was incorrect, the subject was told so and asked to guess again. This continued until all 64 letters of each sample were guessed correctly. Meanwhile, the number of guesses for each letter was kept in a database and updated with each new sample. Thus one hundred samples were obtained for which the subject had 0, 1, 2, ..., 63 preceding letters. The upper and lower bound entropies for the Scruples II novel can be calculated using the preceding formulas. Some of these calculated values are presented in Figure 2.

[Fig. 2. Calculated upper and lower bound entropies for Scruples II, plotted against the number of preceding letters known to the subject.]

Some of these numbers are higher and some are lower than the entropies obtained from the experiment done by Shannon [22]. The discrepancy is caused by using a different type of text and a different subject from those used by Shannon. In any case, this experiment shows that knowledge of more than 31 letters did not reduce the entropy of the text.

3.2. SECOND EXPERIMENT

The passages used for this experiment were taken from four different types of books: Scruples II by Judith Krantz, Digital Signal Processing by William D. Stanley, 101 Dalmatians by the Walt Disney Corporation, and the United States Federal Aviation Handbook and Manuals. The books are completely different in content: the first is a novel, the second is a technically written textbook in electrical engineering, the third is a very easy children's book, and the fourth is a government document.

In the first experiment we showed that the entropy of English text, using Shannon's experiment, levels off after 32 characters. Therefore, in this experiment, one hundred samples of text, each 32 characters in length, were taken from each of the four books (one hundred passages per book). Eight different subjects (four male and four female) were used for this experiment. Once again, we tried to use educated, intelligent subjects with a good understanding of English and its constraints. The subjects were given the first 31 letters of each passage from each book and were asked to guess the 32nd letter, continuing to guess until the correct guess was made. The number of guesses was recorded for each sequence and used in Shannon's formulas to calculate the upper bound entropy for all four books. The results are shown in Table 1.

TABLE 1
Upper Bound Entropies (bits per letter)

                                          Female subjects              Male subjects
                                          1     2     3     4          1     2     3     4
Scruples II                               2.95  2.07  1.95  2.01       2.84  2.3   2.68  2.4
Digital Signal Processing                 2.2   1.65  1.76  2.07       2.28  1.62  2.0   2.17
101 Dalmatians                            2.4   1.86  2.06  1.94       2.54  1.97  2.37  2.22
US Federal Aviation Handbook & Manuals    2.62  1.86  1.9   1.95       3.0   2.5   2.72  2.64


The data obtained from all four male subjects resulted in the lowest entropy for the Digital Signal Processing book, with the entropy ranging from 1.62 to 2.28 bits per letter, and the highest entropy for the government document, with the entropy ranging from 2.5 to 3.0 bits per letter. The children's book had the second lowest entropy and the novel the third lowest in all four cases. Based on the data obtained from the male subjects, there is a correlation between the type of English text and its entropy. The data obtained from the female subjects were not as conclusive. The entropy for the digital signal processing book was calculated to be the lowest, and that for the children's book the second lowest, in three of the four cases. These data do not show the government document having the highest entropy for any of the four female subjects, whereas the data from the male subjects gave the government document the highest entropy in all four cases.

In general, the entropy for the technically written book (Digital Signal Processing) was the lowest, and that of the children's book (101 Dalmatians) was the second lowest, in seven out of eight cases (in each case the subject of the experiment is the same). These results indicate a definite dependency of the entropy of English text on the type of text used. The results for the other books were not as conclusive: in five out of eight cases the entropy of the novel was the third lowest, and in four out of eight cases the government document had the largest entropy.

To study the effect of the subject of the experiment on the entropy estimates, it is only necessary to compare the entropies calculated for any particular book (keeping the topic constant) across all eight subjects. The lowest entropy calculated for the Digital Signal Processing book was 1.62 bits per letter and the highest was 2.28 bits per letter, a discrepancy of about 28%. It is apparent that Shannon's experiment (using a human subject to guess the next letter in a sequence) is subject dependent.

4. EXPERIMENTS WITH THE MACHINE LEARNING SYSTEM LERS

In principle, it is possible to recover the original text at the receiver in Figure 1 using an identical twin. In reality, it is impossible to find such an identical twin. To encode the text using Shannon's guessing scheme and actually recover the same text at the receiver, a method must be developed that makes educated guesses at the transmitter which can be duplicated at the receiver. In our experiments we used for that purpose a machine learning system called LERS. LERS finds a minimal description
of a concept, described by positive examples, while at the same time excluding from the description of the concept the remaining, negative examples. All that is required for a problem to be addressed using a machine learning approach is a suitable training data set. The most important condition is that the data be sufficient, i.e., that they contain enough information. Such a data set may be represented by a table, called a decision table, in which variables (attributes and a decision) represent columns and examples represent rows (see, e.g., Table 2). Table 2 represents examples of three consecutive letters taken randomly from the words bear, beast, imperfect, statesman, and entrepreneur, respectively. LERS may work with imperfect data: with missing attribute values and with inconsistent examples. Two examples are inconsistent when they are characterized by the same values of all attributes but belong to two different concepts. A decision table with at least one pair of inconsistent examples is called inconsistent. Table 2 is an example of an inconsistent decision table: the first two examples conflict, since the values of both attributes are the same yet the decision values are different. LERS handles inconsistencies using rough set theory, introduced by Z. Pawlak in 1982 (see, e.g., [17, 18]). In the rough set approach, inconsistencies are not removed from consideration [9, 10, 11]. Instead, lower and upper approximations of each concept are computed, and on the basis of these approximations LERS computes two corresponding sets of rules, certain and possible, using the algorithm LEM2 [11].

TABLE 2
Decision Table

Attribute-1    Attribute-2    Decision
e              a              r
e              a              s
p              e              r
t              e              s
e              p              r

The following rules,

(Attribute-1, p) -> (d, r) with strength 1,
(Attribute-2, p) -> (d, r) with strength 1,
(Attribute-1, t) -> (d, s) with strength 1,

are certain rules for the example of Table 2. The strength of a rule is the total number of examples correctly classified by the rule during training. Despite all possible inconsistencies, certain rules are always true. On the other hand, the following rules,

(Attribute-1, e) -> (d, r) with strength 2,
(Attribute-1, p) -> (d, r) with strength 1,
(Attribute-2, a) -> (d, s) with strength 1,
(Attribute-1, t) -> (d, s) with strength 1,

are possible rules for the same example of Table 2. These rules are sometimes correct and sometimes false. For example, the first rule is correct for the first example and false for the second example.
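The certain and possible rules above rest on the lower and upper approximations of the two concepts (decision r and decision s). The Python sketch below computes these approximations for Table 2 from the basic rough set definitions; it illustrates only the approximations, not the LEM2 rule induction algorithm itself, and the variable names are ours.

from collections import defaultdict

# The decision table of Table 2: ((Attribute-1, Attribute-2), Decision)
examples = [(("e", "a"), "r"),
            (("e", "a"), "s"),
            (("p", "e"), "r"),
            (("t", "e"), "s"),
            (("e", "p"), "r")]

# Indiscernibility classes: examples with identical attribute values cannot be told apart.
blocks = defaultdict(set)
for idx, (attrs, _) in enumerate(examples):
    blocks[attrs].add(idx)

def approximations(decision_value):
    concept = {i for i, (_, d) in enumerate(examples) if d == decision_value}
    lower = set().union(*[b for b in blocks.values() if b <= concept])   # blocks fully inside
    upper = set().union(*[b for b in blocks.values() if b & concept])    # blocks that overlap
    return lower, upper

# Example numbers are printed 1-based to match the text: certain rules come from
# the lower approximations ({3, 5} for r and {4} for s), possible rules from the
# upper approximations.
for d in ("r", "s"):
    lower, upper = approximations(d)
    print(f"concept {d}: lower approximation {sorted(i + 1 for i in lower)}, "
          f"upper approximation {sorted(i + 1 for i in upper)}")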

4.1. CLASSIFICATION OF UNSEEN EXAMPLES

After the rule set has been induced, unseen examples given to the system are checked for possible matches with the attribute-value pairs of the rules. If enough input data have been used for training, the system will find one or more matches with the rules and one or more possible decisions. The decision to which concept an example belongs is made on the basis of three factors: Strength_factor, Specificity_factor, and Support. Strength_factor is a measure of how well the rule has performed during training. Specificity_factor is a measure of the completeness of the rule: matching rules with a larger number of attribute-value pairs are more specific, and Specificity is equal to the total number of attribute-value pairs on the left-hand side of the matching rule. The third factor, Support, is related to a concept and is defined as the sum of the scores of all matching rules from the concept. The concept receiving the largest support wins the contest; this is the way classification of an example is done in LERS. Thus, the concept C for which the following expression,

$$\sum_{\text{matching rules } R \text{ describing } C} \text{Strength\_factor}(R) \cdot \text{Specificity\_factor}(R),$$
is the largest is the winner, and the example is classified as a member of C. If an example is not completely matched by any rule, LERS uses
partial matching. All partially matching rules, i.e., rules with at least one attribute-value pair matching the corresponding attribute-value pair of the example, are identified. For each partially matching rule R, an additional factor, whose default value is equal to the ratio of the number of matched attribute-value pairs of R to the total number of attribute-value pairs of R, is computed; this factor is called Matching_factor(R). Thus, in partial matching, the concept C for which the following expression,

$$\sum_{\text{partially matching rules } R \text{ describing } C} \text{Matching\_factor}(R) \cdot \text{Strength\_factor}(R) \cdot \text{Specificity\_factor}(R),$$

is the largest is the winner, and the example is classified as a member of C. For example, the first example from Table 2 is matched by the following two possible rules:

(Attribute-1, e) -> (d, r) with strength 2, and
(Attribute-2, a) -> (d, s) with strength 1.

The first rule has score 2 and the second rule has score 1. These two rules compete with each other; the first rule wins, so the decision is r. This decision is correct (compare with Table 2).
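The scoring described above can be written down compactly. The following Python sketch reproduces the worked example: complete matches score Strength_factor times Specificity_factor, and partial matching (weighted by Matching_factor) is used only when no rule matches completely. The rule representation is ours, not the internal format of LERS.

# Each rule: (conditions, decision, strength); conditions maps an attribute to its
# required value, and strength is the number of training examples the rule
# classified correctly. Specificity is the number of condition pairs.
possible_rules = [
    ({"Attribute-1": "e"}, "r", 2),
    ({"Attribute-1": "p"}, "r", 1),
    ({"Attribute-2": "a"}, "s", 1),
    ({"Attribute-1": "t"}, "s", 1),
]

def classify(example, rules):
    support = {}
    # Complete matching: every condition of the rule is satisfied by the example.
    for conds, decision, strength in rules:
        if all(example.get(a) == v for a, v in conds.items()):
            support[decision] = support.get(decision, 0) + strength * len(conds)
    if not support:
        # Partial matching: weight by the fraction of conditions that are satisfied.
        for conds, decision, strength in rules:
            matched = sum(example.get(a) == v for a, v in conds.items())
            if matched:
                support[decision] = support.get(decision, 0) + \
                    (matched / len(conds)) * strength * len(conds)
    return max(support, key=support.get) if support else None

# The first example of Table 2: Attribute-1 = e, Attribute-2 = a.
print(classify({"Attribute-1": "e", "Attribute-2": "a"}, possible_rules))   # prints r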

4.2. EXPERIMENTING WITH LERS

As mentioned earlier, using Shannon's experiment, the entropy of English text approaches its limit by the time the 32nd character is guessed. Therefore the entropy of English text can be approximated by calculating the entropy of the 32nd character. One of the experiments described earlier consisted of asking the subjects to guess the 32nd character given the previous 31 characters, and the data collected from that experiment were used to calculate an approximation of the entropy of English text for different types of books. We tried the same experiment with LERS.

FIRST EXPERIMENT. The system was trained with 10,592 examples (about 158 pages of the novel Scruples II) of 32 characters each. These data were
consistent and all examples were distinct. The first 31 characters were used as known letters (attribute values) and the 32nd was used as the decision value. LERS was run on a DEC 5000 computer and took over forty-one hours (real time) to be trained. LERS returned 5,043 rules with an average of 3.66 conditions per rule. For testing purposes, strings of 31 characters (other than those used for training) were randomly selected from the same book and used as attribute values. LERS was not able to find a match within its rule set for any of them, and therefore no decision was given. There were simply not enough input data for training; for all practical purposes, it is impossible to provide enough input data to be trained for reasonable accuracy with 31 attributes. The next step was to train LERS with a smaller window size and see how accurately it could predict the letters of English text.

SECOND EXPERIMENT. In this experiment, three characters were used as known letters (attributes) and the fourth was used as the decision. The system was trained with 86,004 examples of four characters each. Among these 86,004 examples, 79,960, i.e., 92.97%, conflicted with each other; in other words, the original decision table was inconsistent. Furthermore, many examples were duplicated; only 13,244 were distinct. LERS induced 1,286 certain rules with an average of 2.83 conditions per rule and 12,624 possible rules with an average of 2.92 conditions per rule. Only the possible rules were used for further experiments. One hundred sequences of text, each four characters long, were randomly selected from the same novel (other than the examples used to train the system) for testing purposes. The results for all one hundred examples are presented in Table 3; guess numbers with 0 occurrences (out of 100) are omitted. Using these data and Shannon's upper bound entropy formula, the entropy for the novel was calculated to be H(x) = 2.46 bits per letter. This is lower than the entropy calculated from the data obtained with a human as the subject of Shannon's experiment (2.62 bits/letter) for the same experiment. This means that if we use Shannon's communication system with LERS as the predictor, we should be able to code, transmit, and decode the contents of the Scruples II novel with an average of 2.46 bits per letter.

TABLE 3
Results for One Hundred Examples

Guess #                    1    2    3    4    5    6    7    8    10   13   16   17   23   25
# of times (out of 100)    51   16   9    4    7    1    2    2    1    3    1    1    1    1
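The reported value of 2.46 bits per letter follows directly from the guess-number frequencies in Table 3 and the upper bound formula of Section 2; the short Python check below reproduces it (guess numbers with zero occurrences contribute nothing to the sum).

import math

# Table 3: guess number -> number of occurrences in the 100 test sequences
table3 = {1: 51, 2: 16, 3: 9, 4: 4, 5: 7, 6: 1, 7: 2, 8: 2,
          10: 1, 13: 3, 16: 1, 17: 1, 23: 1, 25: 1}

total = sum(table3.values())                                        # 100
upper = -sum((c / total) * math.log2(c / total) for c in table3.values())
print(f"upper bound entropy = {upper:.2f} bits per letter")         # about 2.46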

5. CONCLUDING REMARKS

Increasing the size of the window (from three characters to four or more) will result in better guesses by LERS and therefore lower entropy, but it will also require more training data and therefore days or perhaps weeks of processing time. Also, the original training data were noisy, i.e., they contained characters that were not recognized by LERS as English text characters. Recall that only the 26 English letters and the space are considered in this experiment. Since a huge sample of English text was needed to train the system, a scanner was used to scan over 150 pages of the novel Scruples II into a text file. Using an editor, the text was cleaned up by eliminating all illegal characters (all punctuation, all numbers, and so on), but a few illegal characters were mistakenly left in. Eliminating all illegal characters would cause LERS to be trained with better input data and would therefore produce better results and reduce the entropy of the text.

Using LERS solves the problem of finding an identical twin at the receiver end in Shannon's compression scheme. LERS can be a good predictor at the transmitter and at the receiver, with exactly the same algorithm used for picking the next character in the sequence at both ends, a requirement for Shannon's communication system.

Considerable research remains. Since the text entropy varies with the subject, the effects of training with one set of data and then compressing another must be determined. We also need to further investigate the performance of LERS itself over a wide range of data types and lengths of character strings. The alphabet needs to be expanded to include case and punctuation. Although such a compression scheme can do no better asymptotically than the Lempel-Ziv algorithm (see, for example, [27]), there may be some advantages when compressing small amounts of text.

REFERENCES

1. G. A. Barnard, Statistical calculation of word entropies for four western languages, IRE Trans. Inform. Theory IT-1:49-53 (1955).
2. G. P. Basharin, On a statistical estimate for the entropy of a sequence of independent random variables, Theory Probab. Appl. 4:333-336 (1959).
3. N. M. Blackman, Prevarication versus redundancy, Proc. IRE 1711-1712 (1962).
4. C. R. Blyth, Note on estimating information, Technical Report 17, Dept. of Statistics, Stanford University, Stanford, CA, 1958.
5. N. G. Burton and J. C. R. Licklider, Long-range constraints in the statistical structure of printed English, Amer. J. Psych. 68:650-653 (1955).
6. T. M. Cover, A convergent gambling estimate of the entropy of English, IEEE Trans. Inform. Theory IT-24:413-421 (1978).


7. G. Dewey, Relative Frequency of English Speech Sounds, Harvard University Press, 1923.
8. M. Grignetti, A note on the entropy of words in printed English, Inform. Contr. 7:304-306 (1964).
9. J. W. Grzymala-Busse, Knowledge acquisition under uncertainty: a rough set approach, J. Intell. Robotic Syst. 1:3-16 (1988).
10. J. W. Grzymala-Busse, Managing Uncertainty in Expert Systems, Kluwer Academic Publishers, Dordrecht, 1991.
11. J. W. Grzymala-Busse, LERS: a system for learning from examples based on rough sets, in: R. Slowinski (Ed.), Intelligent Decision Support. Handbook of Applications and Advances of the Rough Set Theory, Kluwer Academic Publishers, Dordrecht, 1992, pp. 3-18.
12. D. Jamison and K. Jamison, A note on the entropy of partially known languages, Inform. Contr. 12:164-167 (1968).
13. V. Maixner, Some remarks on entropy prediction of natural language texts, Inform. Stor. Retr. 7:293-295 (1971).
14. G. A. Miller and J. A. Selfridge, Verbal context and the recall of meaningful material, Amer. J. Psych. 63:176-185 (1950).
15. T. Nemetz, On the experimental determination of the entropy, Kybernetika 10:137-139 (1972).
16. W. J. Paisley, The effects of authorship, topic, structure, and time of composition on letter redundancy in English text, J. Verbal Learn. Behav. 5:28-34 (1966).
17. Z. Pawlak, Rough sets, Int. J. Comput. Inform. Sci. 11:341-356 (1982).
18. Z. Pawlak, Rough classification, Int. J. Man-Mach. Stud. 20:469-483 (1984).
19. E. Pfaffelhuber, Error estimation for the determination of entropy and information rate from relative frequency, Kybernetika 8:50-51 (1971).
20. F. Pratt, Secret and Urgent, Blue Ribbon Books, 1942.
21. A. P. Savchuk, On the evaluation of the entropy of language using the method of Shannon, Theory Probab. Appl. 9:154-157 (1964).
22. C. E. Shannon, Prediction and entropy of printed English, Bell Syst. Tech. J. 30:50-64 (1951).
23. A. Treisman, Verbal responses and contextual constraints in language, J. Verbal Learn. Behav. 4:118-128 (1965).
24. M. A. Wanas, A. I. Zayed, M. M. Shaker, and E. H. Taha, First-, second-, and third-order entropies of Arabic text, IEEE Trans. Inform. Theory IT-13:123 (1967).
25. H. E. White, Printed English compression by dictionary encoding, Proc. IEEE 55:390-396 (1967).
26. A. M. Yaglom and J. M. Yaglom, Probability and Information, 3rd revised ed., Science House, Leningrad, 1973, Chap. IV, Part 3, pp. 236-329.
27. J. Ziv and A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Trans. Inform. Theory IT-24:530-536 (1978).

Received 28 September 1995; revised 15 October 1996