Information Sciences 148 (2002) 11–26 www.elsevier.com/locate/ins
Improvement of the LR parsing table and its application to grammatical error correction

Masami Shishibori a,*, Samuel Sangkon Lee b, Masaki Oono a, Jun-ichi Aoe a

a Department of Information Science and Intelligent Systems, Faculty of Engineering, Tokushima University, 2-1 Minami Josanjima-Cho, Tokushima-Shi 770-8506, Japan
b Department of Computer Engineering, Jeonju University, 1200, 3 Ga, Hyoja Dong, Wansan Gu, Jeonju, Jeonbuk 560-756, Republic of Korea
Received 30 December 2000; received in revised form 18 February 2002; accepted 1 June 2002
Abstract

An LR parsing table is generally used in the parsing process based on a context-free grammar for natural languages. Beyond parsing, it can also serve as an index for approximate pattern matching and error correction, because it can predict the next symbol in a sentence. A drawback of the traditional LR parsing table, however, is that when the number of sequences to be processed grows, many reduce actions are created in the table, so parsing a sentence takes a great deal of time. In this paper, we propose a method for constructing a new LR parsing table without reduce actions from a generalized context-free grammar. The new table directly gives the states reached after accepting each symbol. We then apply this table to the detection and correction of erroneous sentences containing syntax errors, unknown words and misspellings. With this table, the symbol located just after the error position can be used to select correction symbols; as a result, the number of candidates produced during correction is reduced and a fast system can be realized. Experimental results on 1050 sentences containing erroneous characters show that the proposed method corrects error points 69 times faster than the traditional method while keeping almost the same correction accuracy.
© 2002 Elsevier Science Inc. All rights reserved.
* Corresponding author. Tel.: +81-886-56-7508; fax: +81-886-56-7508.
E-mail address: [email protected] (M. Shishibori).
Keywords: LR parsing table; Reduce action; Extended right graph; Natural language processing; Error correction
1. Introduction

An LR parsing table is generally used in the parsing process based on a context-free grammar for natural languages. Beyond parsing, it can also serve as an index for approximate pattern matching and error correction, because it can predict the next symbol in a sentence. It is impossible to build a context-free grammar that accepts all natural language; generally speaking, a context-free grammar can accept only about 70% of natural-language sentences. When the target domain is limited, however, a task-specific grammar can be prepared [5], and a grammar-based method can then achieve high correction accuracy.

A drawback of the traditional LR parsing table is that when the number of sequences to be processed grows, many reduce actions are created in the table, so parsing a sentence takes a great deal of time. In this paper, we propose a method for constructing a new LR parsing table without reduce actions from a generalized context-free grammar. The new table gives the states reached after accepting each symbol: it contains only the states reached by shift moves, all reduce moves having been computed beforehand. We call it a Descendant Set table (DS table). Using the DS table, the symbols that admit a transition can be decided at once without building a new stack; as a result, the number of candidates produced during correction is reduced and a fast system can be realized. We then apply this table to the detection and correction of erroneous sentences containing syntax errors, unknown words and misspellings.

To correct erroneous sentences automatically, methods that narrow down the correct symbols using character and word n-gram models have already been proposed [4,6,13], as have methods that select candidate words by computing hash values and the Levenshtein distance between the input word and each word in the dictionary [3,12]. Both approaches, however, rely only on local statistical information rather than grammatical information, so they cannot exploit the global information in the erroneous sentence. On the other hand, Saito et al. [5,9,10] noted that context-free grammars [1] accept a wider class of languages than regular grammars and n-gram models, and applied the LR parsing method [1,14] to error correction.
The LR parsing algorithm is a table-driven shift-reduce parsing algorithm that can handle arbitrary context-free grammars and parses a sentence while looking ahead to the symbols admissible under the grammar. Using the property that only symbols admissible under the grammar are looked ahead to, their method can detect that the symbol at the position where parsing fails may be wrong, and it can correct an erroneous sentence according to the context-free grammar. By using the DS table, the symbol located just after the error position can be used to select correction symbols; as a result, the number of candidates produced during correction is reduced and a fast system can be realized. To measure the effect of this method, we carried out an experiment on 1050 sentences (28.5 Kbyte) containing 418 erroneous characters. The experimental results show that the proposed method corrects error points 69 times faster than the traditional method while keeping almost the same correction accuracy (90%).

In the following sections, the proposed fast error correction method is described in detail. Section 2 outlines the LR parsing algorithm. Section 3 first explains the traditional error correction method based on LR parsing, following Saito et al. [5,9,10], and then presents the construction algorithm of the DS table used by the proposed method. Section 4 introduces the error correction method using the DS table: the replacement process for altered errors, the deletion process for extra errors, and the insertion process for missing errors. Section 5 gives experimental results. Finally, Section 6 summarizes our results and discusses future research.
2. LR parsing algorithm

The LR parsing algorithm is a table-driven shift-reduce parsing algorithm that can handle arbitrary context-free grammars in polynomial time. The LR parser analyzes the input sequence by using the LR parsing table and a stack. The LR parsing table consists of the action table, which indicates the next action, and the goto table, which decides to which state the parser should go. This table is constructed beforehand from the grammar by the table generator. The LR parser scans the input sequence symbol by symbol from left to right, and pushes grammar symbols and state symbols onto the stack. The parser executes one of the following four kinds of moves according to the entry ACTION[s_m, a_i] of the action table, where s_m is the state symbol on the top of the stack and a_i is the next input symbol.
(1) Shift move: if ACTION[s_m, a_i] = "shift s", the input symbol a_i and the next state s are pushed onto the stack.
(2) Reduce move: if ACTION[s_m, a_i] = "reduce p", the reduce move is performed using the pth grammar rule A → β.
First, r state symbols and r grammar symbols are popped from the stack, where r is the length of the right-hand side β of the pth grammar rule. Next, the left-hand side symbol A of the rule is pushed onto the stack, and then the new state symbol s obtained from GOTO[s_{m-r}, A] is pushed onto the stack.
(3) Accept: if ACTION[s_m, a_i] = "accept", the input sequence is accepted and the parser finishes.
(4) Error: if ACTION[s_m, a_i] = "error", an error has been detected and the parser stops.
The parser repeats the above moves until "accept" or "error" is detected. While each encountered entry has only one action, parsing proceeds exactly as in the LR parsers often used in compilers for programming languages. When one entry contains multiple actions, these actions are said to conflict; in this case, all the actions are executed in parallel with the graph-structured stack [14].

An example of a context-free grammar is shown in Fig. 1, and its LR parsing table is shown in Table 1.

Fig. 1. An example of the context-free grammar.

In this grammar, lower-case symbols are terminal symbols and upper-case symbols are non-terminal symbols. In Table 1(a), the entry "sh i" in the action table (the left part of the table) denotes the action "shift one word from the input buffer onto the stack and go to state i". The entry "re i" denotes the action "reduce constituents on the stack using rule i". The entry "acc" stands for "accept", and blank entries represent "error". The goto table (Table 1(b)) decides to which state the parser should go after a reduce action.
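To make the table-driven loop above concrete, the following is a minimal Python sketch (our illustration, not the authors' C implementation). The toy grammar S → a S | b and its ACTION/GOTO tables are hypothetical stand-ins, since the entries of Table 1 are not reproduced here; the sketch also handles only unambiguous tables, whereas a generalized LR parser would run conflicting actions in parallel on a graph-structured stack.

```python
# A minimal LR shift-reduce driver, sketched for illustration only.
# The grammar and tables below are a hypothetical toy example
# (S -> a S | b), not the grammar of Fig. 1 or the entries of Table 1.

RULES = {1: ("S", ["a", "S"]),   # rule 1: S -> a S
         2: ("S", ["b"])}        # rule 2: S -> b

ACTION = {(0, "a"): ("shift", 2), (0, "b"): ("shift", 3),
          (1, "$"): ("accept", None),
          (2, "a"): ("shift", 2), (2, "b"): ("shift", 3),
          (3, "$"): ("reduce", 2),
          (4, "$"): ("reduce", 1)}

GOTO = {(0, "S"): 1, (2, "S"): 4}

def lr_parse(tokens):
    """Run the shift-reduce loop of Section 2 until 'accept' or 'error'."""
    stack = [0]                      # state symbols; grammar symbols omitted for brevity
    pos = 0
    while True:
        state, lookahead = stack[-1], tokens[pos]
        act, arg = ACTION.get((state, lookahead), ("error", None))
        if act == "shift":           # push the next state, consume the input symbol
            stack.append(arg)
            pos += 1
        elif act == "reduce":        # pop |beta| states, then push GOTO[s, A]
            lhs, rhs = RULES[arg]
            del stack[len(stack) - len(rhs):]
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return act == "accept"

print(lr_parse(["a", "a", "b", "$"]))   # True
print(lr_parse(["a", "a", "$"]))        # False (error detected)
```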
Table 1
Parsing table obtained from the grammar: (a) the action table, indexed by states 0-14 and the terminal symbols a, young, man, woman, loves and $, with entries "sh i", "re i", "acc" and blank (error); (b) the goto table, indexed by the same states and the non-terminal symbols S, NP, VP, DET, ADJ, NOUN and VERB.
3. Improvement of the LR parsing table

3.1. Descendant set table

A drawback of the traditional LR parsing table is that when the number of sequences to be processed grows, many reduce actions are created in the parsing table, so parsing a sentence takes a great deal of time. In this paper, we propose a method for constructing a new LR parsing table without reduce actions from a generalized context-free grammar. The new table gives, for each symbol and state, the next state to be reached by accepting that symbol. We call it a Descendant Set table (DS table). For shift moves, the state symbols reached by the shift are stored in the DS table. For reduce moves, the table records the state symbols reached by the shift moves that necessarily follow the execution of those reduce moves.

A DS table is built from the parsing table and a directed graph called the ERG (Extended Right Graph). The ERG is an extension of the RG
(Right Graph) proposed by Aoe et al. [2]. The RG represents the order of the grammar rules applied in successive reduce moves; the ERG, in addition, indicates the state symbols reached by those moves.

3.2. Construction algorithm for the ERG

Context-free grammars are extensively used to describe both formal and natural languages. A context-free grammar G is defined by a 4-tuple G = (V_N, V_T, P, S).
- V_N: a set of non-terminal symbols including the start symbol S, where A, B, C, ... ∈ V_N.
- V_T: a set of terminal symbols, where a, b, c, ... ∈ V_T and X, Y, Z, ... ∈ V_N ∪ V_T.
- P: a set of rewriting rules (productions) of the form A → α, where A is called the left-hand side of the rule, α is called the right-hand side of the rule, and
  1. α, β, ... ∈ (V_N ∪ V_T)^+,
  2. ᾱ, β̄, ... ∈ (V_N ∪ V_T)^*, where (V_N ∪ V_T)^* is the set of all finite-length strings of symbols in V_N ∪ V_T, and (V_N ∪ V_T)^+ is the set (V_N ∪ V_T)^* without the empty string ε.
- S: the start symbol.

The following relation ρ is introduced for the formal discussion:

A ρ X ⇔ [∃ ᾱ : A → ᾱ X].

Moreover, the RG and the ERG are represented as a binary relation (N, R), where N is a finite set of nodes and R is a relation on N. If a, b ∈ N and (a, b) ∈ R, node a and node b are connected by an arc.

[Construction Algorithm for the ERG]
Input: G = (V_N, V_T, P, S)
Output: ERG = (V, ρ) for G
Step 1: {Construction of the RG} Construct the RG corresponding to the grammar G, where V = {X | A ρ X, A ∈ V_N}.
Step 2: {Addition of state symbols} Let node A be an arbitrary non-terminal symbol in the set V. RS(A) and LS(A), obtained by the following expressions, are appended to RG = (V, ρ):

RS(A) = {p ∈ I | p = GOTO[q, A] for some q ∈ I},
LS(A) = {q ∈ I | p = GOTO[q, A] for some p ∈ I}.

This step is performed for all the non-terminal symbols in V, where I is the set of LR state symbols.
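As an illustration of Step 2, the following minimal Python sketch collects RS(A) and LS(A) from a goto table held as a dictionary keyed by (state, non-terminal). Only the four goto entries quoted in the example below are filled in; the complete goto table of Table 1 is not reproduced here.

```python
# Step 2 of the ERG construction, sketched in Python: for every non-terminal A,
# collect the states reached by GOTO on A (RS) and the states the transition
# starts from (LS).  The entries below are only the fragments quoted in the
# worked example, not the complete goto table.

GOTO = {(0, "S"): 4, (3, "VP"): 12, (0, "NP"): 3, (11, "NP"): 14}

def rs_ls(goto, nonterminals):
    rs = {A: set() for A in nonterminals}   # RS(A): destination states p = GOTO[q, A]
    ls = {A: set() for A in nonterminals}   # LS(A): source states q with GOTO[q, A] defined
    for (q, A), p in goto.items():
        rs[A].add(p)
        ls[A].add(q)
    return rs, ls

rs, ls = rs_ls(GOTO, ["S", "NP", "VP"])
print(rs["S"], ls["S"])     # {4} {0}
print(rs["VP"], ls["VP"])   # {12} {3}
print(rs["NP"], ls["NP"])   # {3, 14} {0, 11}
```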
For example, when the ERG corresponding to the grammar of Fig. 1 is constructed, first of all, since (S', S) ∈ ρ for the rule S' → S, an arc is drawn from node S' to node S. The goto table of the parsing table is then examined for the start symbol S: GOTO[0, S] = 4, so that RS(S) = {4} and LS(S) = {0}. Next, because (S, VP) ∈ ρ is admitted by S → NP VP, an arc is drawn from node S to node VP. Since GOTO[3, VP] = 12, RS(VP) = {12} and LS(VP) = {3}. Moreover, as (VP, NP) ∈ ρ is admitted by VP → VERB NP, an arc is drawn from node VP to node NP. Since GOTO[0, NP] = 3 and GOTO[11, NP] = 14, RS(NP) = {3, 14} and LS(NP) = {0, 11}. Finally, the ERG shown in Fig. 2 is obtained.

Fig. 2. An example of the ERG.

3.3. Construction algorithm for the DS table

In LR parsing, when the state on the top of the stack changes from p to q by accepting the terminal symbol a, the new state q is called a descendant of the state p. Our method stores the set of these descendants at the corresponding entry of the DS table, namely DS[p, a]. The algorithm that generates the DS table using the ERG is as follows:

[Construction Algorithm for the DS table]
Input: action table and ERG for G
Output: DS table for G
Step 1: {Classification of the move patterns} For each state r and terminal symbol a of the action table, proceed to Step 2 if ACTION[r, a] = "shift s", and proceed to Step 3 if ACTION[r, a] = "reduce i". This step is applied to all states and terminal symbols in the action table. If ACTION[r, a] = "acc", the algorithm finishes.
Step 2: {Acquisition of descendants by shift moves} Add state s to the elements of DS[r, a].
Step 3: {Acquisition of descendants by reduce moves} Let A → β be the ith grammar rule, and compose the set UPPER_NODE(A), which consists of node A and the upper nodes of A in the ERG. For each node A' ∈ UPPER_NODE(A), examine each element n of RS(A'); if ACTION[n, a] = "shift s", add state s to DS[r, a].

For example, the DS table corresponding to the parsing table of Table 1 is computed as follows. First, for ACTION[12, $] = "reduce 1", since the first rule is S → NP VP, we find RS(S) = {4} from the ERG of Fig. 2, and ACTION[4, $] = "acc"; as a result, DS[12, $] = {"acc"}. Next, for ACTION[13, $] = "reduce 3", the third rule NP → DET ADJ NOUN is applied, and UPPER_NODE(NP) = {VP, S} is composed from the ERG. For node VP, we find RS(VP) = {12} from the ERG, but ACTION[12, $] = "reduce 1", so no descendant is defined. For node S, since RS(S) = {4} and ACTION[4, $] = "acc", DS[13, $] = {"acc"}. Repeating this procedure yields the DS table shown in Table 2.

The DS table can also be composed without using the ERG. However, since the ERG describes not only the order of the rules applied in successive reduce moves but also the transition states reached after executing them, the DS table can be obtained quickly with it.
Table 2
DS table obtained from the grammar: descendant sets indexed by states 0-14 and the terminal symbols a, young, man, woman, loves and $ (entries are state symbols or "acc").
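The construction of the DS table can be sketched as follows (a Python illustration, not the authors' implementation). Only the action-table fragments, RS sets and UPPER_NODE sets quoted in the worked example above are filled in; for a real grammar these would be computed from the full parsing table and the ERG.

```python
# A sketch of the DS-table construction of Section 3.3.  Only the table
# fragments quoted in the worked example are filled in; the full action
# table, RS sets and UPPER_NODE relation of the paper are not reproduced.

ACTION = {(12, "$"): ("reduce", 1), (13, "$"): ("reduce", 3), (4, "$"): ("acc", None)}
RULES = {1: ("S", ["NP", "VP"]), 3: ("NP", ["DET", "ADJ", "NOUN"])}
RS = {"S": {4}, "VP": {12}, "NP": {3, 14}}           # destination states from the ERG
UPPER_NODE = {"S": {"S"}, "NP": {"NP", "VP", "S"}}   # node A together with its upper nodes

def build_ds(action, rules, rs, upper_node):
    ds = {}
    for (r, a), (act, arg) in action.items():
        if act == "shift":                       # Step 2: descendants via shift moves
            ds.setdefault((r, a), set()).add(arg)
        elif act == "reduce":                    # Step 3: descendants via reduce moves
            lhs, _ = rules[arg]
            for A in upper_node[lhs]:
                for n in rs.get(A, set()):
                    nxt = action.get((n, a))
                    if nxt and nxt[0] == "shift":
                        ds.setdefault((r, a), set()).add(nxt[1])
                    elif nxt and nxt[0] == "acc":      # the worked example also records "acc"
                        ds.setdefault((r, a), set()).add("acc")
    return ds

print(build_ds(ACTION, RULES, RS, UPPER_NODE))
# {(12, '$'): {'acc'}, (13, '$'): {'acc'}}
```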
4. Application to error correction

4.1. Three correction processes

This method handles three kinds of errors: altered, extra and missing errors.
1. Altered errors: correct symbols are replaced with other, wrong symbols.
2. Extra errors: extra symbols are inserted into the correct input sequence.
3. Missing errors: arbitrary symbols in the correct input sequence are missing.
To correct these errors, the method provides three correction processes: the replacement process for altered errors, the deletion process for extra errors, and the insertion process for missing errors. The parser can detect errors but cannot distinguish their kinds, so all of the above processes are called and they correct the erroneous sentence in parallel, as follows:
1. Replacement process: error symbols are replaced with correction symbols obtained by referring to the DS table.
2. Deletion process: error symbols are deleted.
3. Insertion process: correction symbols obtained from the DS table are inserted at the error position.
If several sequences are generated as correction candidates, they are called a correction candidate group, and each candidate is scored according to its numbers of replacements, deletions and insertions, in order to select the most likely sentence from the group.
4.2. Traditional error correction

The traditional error correction method uses the LR parsing table not as a syntax analyzer but as a predictor of the input sequence. We define S(p) and R(p) to be the sets of symbols that admit, respectively, shift moves and reduce moves in state p. The traditional method exploits the property that the symbols that may follow state p are limited to the terminal symbols in S(p) ∪ R(p). Using this property, we can find the correction symbols that can connect to the error position. In the example of Table 1, if the parser detects "error" in state p = 0, then since S(0) = {a, b}, R(0) = ∅ and S(p) ∪ R(p) = {a, b}, we can predict that state 0 is expecting "a" or "b" as the correction symbol. When several correction symbols are predicted, as in this example, a candidate sequence is generated for each correction symbol and parsing continues on each of them. However, if the number of grammar rules becomes large, the number of terminal symbols in S(p) ∪ R(p) also becomes large. As a result, a great number of candidates are produced during the correction process, and it takes a great deal of time to correct the input sequence.
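A minimal sketch of this prediction step: the correction symbols expected in state p are simply the terminals for which the action-table row of p defines a shift or a reduce. The action-table fragment below is a hypothetical stand-in, not the table obtained from the grammar of Fig. 1.

```python
# S(p) | R(p): the terminals the traditional method predicts as correction
# symbols in state p.  The action-table fragment here is a hypothetical
# stand-in used only to exercise the two helper functions.

ACTION = {(0, "a"): ("shift", 1), (0, "b"): ("shift", 8),
          (9, "$"): ("reduce", 7), (9, "loves"): ("shift", 10)}

def shift_symbols(action, p):
    """S(p): terminals admitting a shift move in state p."""
    return {a for (q, a), (act, _) in action.items() if q == p and act == "shift"}

def reduce_symbols(action, p):
    """R(p): terminals admitting a reduce move in state p."""
    return {a for (q, a), (act, _) in action.items() if q == p and act == "reduce"}

print(sorted(shift_symbols(ACTION, 0) | reduce_symbols(ACTION, 0)))   # ['a', 'b']
print(sorted(shift_symbols(ACTION, 9) | reduce_symbols(ACTION, 9)))   # ['$', 'loves']
```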
In order to improve the time efficiency, the proposed method pays attention to the symbol located just after the symbol to be corrected, which we call the rear symbol; that is, the proposed method checks the relation between the correction symbols and the rear symbol when correction symbols are predicted. Because the correction symbols are restricted in this way, the number of candidates produced during correction is reduced and a fast system can be realized. To check this connection relation without executing shift and reduce moves, we need the DS table, which indicates the next state to be reached for each symbol and state. In particular, using the DS table, the next transition state after successive reduce moves can be obtained immediately.

4.3. Correction processes using the DS table

The following definition of the configuration [11] of the parser is introduced for the correction algorithms:

(X_1 X_2 ... X_j p, a_k a_{k+1} ... a_n $),

where X_1 X_2 ... X_j is the content of the stack, p is the state on the top of the stack, and a_k a_{k+1} ... a_n $ is the part of the input sequence that has not yet been read. The following configuration is given when an error is detected:

(α p, ? a_k a_{k+1} ... a_n $),   (1)

where a_k is the erroneous symbol and ? marks the error position. Moreover, as stated above, we regard the symbol located just after the symbol to be corrected as the rear symbol; for example, a_{k+1} is the rear symbol in the replacement process, whereas a_k is the rear symbol in the insertion process. Suppose that a symbol a is the rear symbol in configuration (1) and that state t admits a transition on the rear symbol a. If state t is included in the descendant set determined by state p and a terminal symbol b, this method regards b as a correction symbol determined by the state p and the rear symbol a. CORRECT(p, a) is the set of these correction symbols and is defined as follows:

CORRECT(p, a) = {b ∈ V_T | a ∈ S(t) ∪ R(t), ∃ t ∈ DS[p, b]}.

This method thus checks at the same time not only the relation between the top state of the stack and the correction symbols but also the relation between the correction symbols and the rear symbol. As a result, the number of candidate sequences, as well as the number of correction symbols, becomes small.
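CORRECT(p, a) can be computed directly from the DS table and the per-state sets S(t) ∪ R(t), as in the following sketch. The DS entry and the acceptable-symbol set below are hypothetical stand-ins, chosen only so that the calls reproduce the values CORRECT(10, woman) = {a} and CORRECT(10, young) = {a} quoted in the worked example later in this section.

```python
# A sketch of CORRECT(p, a): a terminal b is a correction candidate for state p
# with rear symbol a if some descendant t in DS[p, b] can accept a, i.e.
# a is in S(t) | R(t).  The concrete values below are hypothetical stand-ins.

DS = {(10, "a"): {11}}                 # from state 10, reading "a" leads to state 11
ACCEPTABLE = {11: {"woman", "young"}}  # S(t) | R(t): symbols state t can accept next

def correct(ds, acceptable, p, rear):
    return {b for (q, b), states in ds.items()
            if q == p and any(rear in acceptable.get(t, set()) for t in states)}

print(correct(DS, ACCEPTABLE, 10, "woman"))   # {'a'}  -- cf. CORRECT(10, woman) = {a}
print(correct(DS, ACCEPTABLE, 10, "young"))   # {'a'}  -- cf. CORRECT(10, young) = {a}
```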
The algorithms for the three correction processes are as follows.

Replacement process. First, a_{k+1} is defined as the rear symbol. Next, a_k is replaced with an element b obtained from CORRECT(p, a_{k+1}), and parsing continues from the following configuration:

(α p, b a_{k+1} ... a_n $),

where the end symbol $ and symbols that have already been replaced or inserted cannot be replaced by this process.

Deletion process. First, a_k is removed from configuration (1), and the following configuration is parsed:

(α p, a_{k+1} ... a_n $),

where the end symbol $ and symbols that have already been replaced or inserted cannot be deleted by this process.

Insertion process. First, a_k is defined as the rear symbol. Next, an element b obtained from CORRECT(p, a_k) is inserted at the error position, and parsing continues from the following configuration:

(α p, b a_k a_{k+1} ... a_n $),

where the end symbol $ and symbols that have already been deleted cannot be inserted by this process.

This method corrects the erroneous sequence by the above processes. However, it is possible that the parser regards a correct symbol as the error symbol after the actual error symbol has already been accepted. Therefore, because the parser may not be able to find the exact error position, this method moves the error position back little by little and executes each process again.

For the correction candidate group generated by the above method, the reliability R of each candidate sequence is calculated by the following expression, and each candidate in the group is ranked according to R:

R = 1 - (c_r n_r + c_d n_d + c_i n_i) / N,   (2)

where n_r, n_d and n_i are the numbers of replacements, deletions and insertions, respectively, N is the number of symbols of the input sequence, and c_r, c_d and c_i are heuristic constants.

For example, when the following input sequence is parsed according to the grammar of Fig. 1:

"a man loves young woman $",

the correction proceeds as follows. First, a syntax error is detected at the word "young", as shown in Fig. 3, and the error configuration is:

(0 [a] 1 [man] 5 [loves] 10, ? young woman $).

The three correction processes are applied to this configuration. First, the replacement process is applied, as shown in the right-hand sequences of Fig. 4, where CORRECT(10, woman) = {a}.
Fig. 3. An example of the error detection.
Fig. 4. An example of the error correction.
The resulting sequences parse successfully; as a result, we obtain the following new sequence:
"a man loves a woman $",

in which the input word "young" has been replaced with the new word "a". Next, the insertion process creates the corrected configuration:

(0 [a] 1 [man] 5 [loves] 10, a young woman $),

because CORRECT(10, young) = {a}, as shown in the left-hand sequences of Fig. 4. From it we obtain another corrected sequence:

"a man loves a young woman $",

in which a new word "a" has been inserted between the input words "loves" and "young". Moreover, the deletion process, which removes the word "young" from the input sequence, is also applied; however, this process cannot be executed, since CORRECT(10, $) = ∅. The remaining configurations are accepted by the grammar, and in the end the input sequence is corrected into the two new sequences above.
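As a small numeric illustration of Eq. (2), the sketch below scores the two candidates obtained above, using the constants c_r = 3.2, c_d = 2.8, c_i = 1.2 reported in Section 5; counting N = 6 symbols (the five input words plus the end symbol $) is an assumption of this sketch.

```python
# Reliability scoring of Eq. (2), applied to the two correction candidates of
# the worked example.  The constants are those reported in Section 5; taking
# N = 6 (five words plus the end symbol $) is an assumption of this sketch.

def reliability(n_r, n_d, n_i, N, c_r=3.2, c_d=2.8, c_i=1.2):
    """R = 1 - (c_r*n_r + c_d*n_d + c_i*n_i) / N."""
    return 1.0 - (c_r * n_r + c_d * n_d + c_i * n_i) / N

N = 6
candidates = {
    "a man loves a woman $":       reliability(n_r=1, n_d=0, n_i=0, N=N),  # one replacement
    "a man loves a young woman $": reliability(n_r=0, n_d=0, n_i=1, N=N),  # one insertion
}
for sentence, r in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{r:.2f}  {sentence}")
# 0.80  a man loves a young woman $
# 0.47  a man loves a woman $
```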
5. Evaluation

The system has been implemented in the C language on a SUPER WORKSTATION AS 5000, and the grammar used is the task-specific grammar [5] for queries concerning conference registration. The details of this grammar are as follows: the number of rules is 2048, the number of grammar symbols is 797, the number of vocabulary items is 770, the number of LR states is 3465, and the target language is Japanese. The parsing table is compiled as a canonical LR(1) table, and the number of retreats [7,8] of the error position is limited to one. The method was tested with three persons. Each of them typed the same 350 sentences, so that in total 1050 sentences (29.5 Kbyte) containing 418 error points were prepared. The classification of the error patterns in this experimental data is shown in Table 3; there, "multiple error" means that altered and extra errors occur successively within one sentence. Moreover, as Table 3 shows, altered errors are the most frequent, extra errors come next, and missing errors hardly appear, so the constants in the reliability measure are set so that c_r > c_d > c_i, namely c_r = 3.2, c_d = 2.8, c_i = 1.2.
Table 3
Classifications of error patterns

  Error pattern     Number of errors
  Altered error     308 (73.7%)
  Extra error        57 (13.6%)
  Missing error      20 (4.8%)
  Multiple error     33 (7.9%)
Table 4
Experimental results of the correction time

                          Method A     Method B
  Generated sequences     1,986,917    17,510
  Correction candidates   97,548       2753
  Replacement             12,824       717
  Deletion                598          121
  Insertion               280          54
  Time                    2634 (s)     38 (s)
While the correction process is running, the reliability R of each candidate is calculated by Eq. (2), and any candidate sequence whose reliability is less than 0.5 is removed from the candidate group.

The above data were corrected using the traditional parsing table and the DS table, respectively, and the results were evaluated in terms of correction time and correction accuracy. First, Table 4 shows the experimental results for the correction time. In Table 4, "Method A" and "Method B" denote the traditional method and the proposed method, respectively. The item "Generated sequences" is the number of candidate sequences (configurations) generated during parsing, and "Correction candidates" is the number of candidate sequences remaining as correction candidates when the correction process finishes. "Replacement", "Deletion" and "Insertion" are the numbers of calls to each of the respective processes during correction. In Table 4, Method A creates about 20 times as many generated sequences as correction candidates, whereas for Method B this ratio is about six. This shows that many wasteful generated sequences can be avoided by using the DS table. From the experimental results, the proposed method is 69 times faster than the traditional method.

Next, we consider the correction accuracy. We measured the accuracy by comparing the first correction candidate, that is, the candidate with the highest reliability R, with the correct sentence. The proposed method corrects 89.6% of the inputs and the traditional method 90.6%, so almost the same accuracy is obtained. Thus, it can be concluded that the proposed method corrects error points much faster (69 times faster) than the traditional method while keeping almost the same correction accuracy (about 90%).

In this experiment, the proposed method could not correct about 8% of the input sentences to the right sentences, and could not even recover them. We picked out and checked those sentences and found that most of them contain multiple errors. Since the DS table used here consists of descendants reached after accepting only one terminal symbol, if the rear symbol itself is erroneous, the right correction symbol cannot be selected from the DS table.
Hence, this method cannot correct a multiple-error part in which not only the erroneous symbol but also the rear symbol is wrong. The traditional method, on the other hand, does not use the DS table and does not refer to the rear symbol, so there is some possibility that its many generated correction symbols include the right one even when the rear symbol is erroneous.

As stated above, the remaining problem of this method is how to correct erroneous sentences containing multiple errors. To deal with such sentences, one can use an extended DS table consisting of descendants reached after accepting two or more consecutive terminal symbols. However, since such a DS table becomes very large and sparse, we must devise a method for compressing it. Moreover, if the error correction process is executed with the extended DS table, the number of generated sequences is expected to increase explosively, so heuristics for natural language need to be introduced. To correct multiple errors, Saito [10] has also proposed a method that introduces a "dummy non-terminal symbol", obtained from the goto table, for a large erroneous part. Although the DS table used in this method does not carry the information of the goto table, the ERG holds the relation between a non-terminal symbol and the states reached after accepting it. Therefore, his technique can be combined with our method to handle multiple errors.
6. Conclusions

This paper has proposed a method that improves the time efficiency of the traditional automatic error correction method by using the DS table instead of the parsing table, and the validity of this method has been supported by empirical observations. The method corrects error points in input sequences 69 times faster than the traditional method while keeping almost the same correction accuracy. As future improvements, an efficient data structure for the DS table should be designed, and since the proposed method cannot cope with successive error points, a method for automatically correcting consecutive errors should be considered.
References

[1] A.V. Aho, J.D. Ullman, Principles of Compiler Design, Addison-Wesley, Reading, MA, 1977.
[2] J. Aoe, Y. Yamamoto, N. Harada, R. Shimada, The construction of weak precedence parsers by parsing tables, Transactions of Information Processing Society of Japan 18 (5) (1977) 438–444.
[3] M. Hatada, H. Endoh, Spelling correction method for English and Katakana in Japanese OCR text, Transactions of Information Processing Society of Japan 38 (7) (1997) 1317–1327.
[4] N. Itoh, H. Maruyama, A method of detecting and correcting errors in the results of Japanese OCR, Transactions of Information Processing Society of Japan 33 (5) (1992) 664–670.
[5] K. Kita, T. Kawabata, H. Saito, HMM continuous speech recognition using generalized LR parsing, Transactions of Information Processing Society of Japan 31 (3) (1990) 472–480.
[6] T. Kurita, T. Aizawa, A method for correcting errors on Japanese words input and its application to spoken word recognition with large vocabulary, Transactions of Information Processing Society of Japan 25 (5) (1984) 831–841.
[7] M.D. Mickunas, J.A. Modry, Automatic error recovery for LR parsers, Communications of the ACM 21 (6) (1978) 459–465.
[8] J. Rohrich, Methods for the automatic construction of error correcting parsers, Acta Informatica 13 (1980) 115–139.
[9] H. Saito, M. Tomita, Parsing noisy sentences, in: Proceedings of 12th International Conference COLING-88, 1988, pp. 561–566.
[10] H. Saito, Robust error handling in the generalized LR parsing, Transactions of Information Processing Society of Japan 37 (8) (1996) 1506–1513.
[11] M. Sasa, Programming Language Processor, The Iwanami Software Science Series, 1989.
[12] Y. Takagi, E. Tanaka, A fast garbled spelling correction method based on constituent character hashing, Transactions of the Institute of Electronics, Information and Communication Engineers (IEICE) J80-D-II (2) (1997) 579–588.
[13] K. Takeuchi, Y. Matsumoto, Japanese OCR error correction using stochastic morphological analyzer and probabilistic word N-gram model, in: Proceedings of the 18th International Conference on Computer Processing of Oriental Languages, 1999, pp. 561–566.
[14] M. Tomita, Efficient Parsing for Natural Language (A Fast Algorithm for Practical Systems), Kluwer Academic Publishers, Dordrecht, 1986.