Copyright © IFAC Xi'an, PRC, 1989
AN EXPERT SYSTEM FOR UNDERSTANDING CHINESE HOMONYMS IN SPEECH PROCESSING

Huai Jin-Peng

Department of Computer Science, Beijing University of Aeronautics and Astronautics, Beijing, PRC
Abstract. Because there are only 1,229 kinds of pronunciation for 50,000 Chinese characters, Chinese homonyms must be resolved in Chinese speech processing. The paper describes an applied expert system that discriminates Chinese homonyms. Based on Chinese morphology, syntax and the experience of Chinese linguists, the knowledge base consists of three parts: a lexical base, a linguistic base expressed with a Chinese ATN grammar, and a special base. The inference mechanism is designed with retrieval and forward chaining. The experimental results show that some 95% of Chinese homonyms can be discriminated.

Keywords. Expert system; information retrieval; speech recognition; Pin-Yin code convertor; inference processes.
INTRODUCTION
The purpose of computer speech processing is to improve computer intelligence and to realize speech input and output. Research has shown that speech processing can be improved thoroughly only when linguistic information is combined with acoustic information, not on acoustic information alone. The processing system therefore consists of two parts, as follows:

1) The acoustic processor: it performs the acoustic processing, such as feature extraction, segmentation and recognition, and produces a sentence in which each word may have more than one candidate because of noise and homonyms;

2) The linguistic processor: it processes the above sentence, using linguistic knowledge such as phonology, morphology, syntax and semantics, to obtain a sentence in which each word corresponds to only one candidate.

Chinese characters are the most exact representation of Chinese. Because there are only 1,229 kinds of pronunciation for 50,000 Chinese characters, each Chinese Pin-Yin code is likely to have more than one candidate character in the Chinese Pin-Yin code system; that is, there are a great number of Chinese homonyms. Because homonyms have the same acoustic features, it is impossible to discriminate them in the acoustic processor. The difficult homonym-discriminating problem must therefore be solved in the linguistic processor in order to convert Chinese Pin-Yin codes into characters.

Generally, when understanding Chinese homonyms, one uses basic knowledge of Chinese (lexical, morphologic, syntactic and semantic knowledge) to discriminate them in a sentence. Following the way in which one understands a sentence, the paper proposes an applied expert system for understanding Chinese homonyms, and the results show that it is very effective.
THE SYSTEM STRUCTURE AND PRINCIPLE
The expert system consists of three parts, as follows:

Data Base
It stores the initial input information and other information obtained while the system is working. When the system finishes, the Data Base contains the correct result.
Knowledge Base
The system is characterized above all by its use of large bodies of domain knowledge stored in the Knowledge Base, which has proved useful for solving the Chinese homonym-discriminating problem. The base has three parts: a lexical base built on Chinese morphology, a linguistic base built on syntax, and a special base built on context and semantics.

Inference Mechanism
The inferring strategy used here is based on predictive forward chaining with retrieval. The system works on a Chinese Pin-Yin code sentence from the acoustic processor. The mechanism first obtains from the lexical base a homonym chain for each word, where possible. The morphologic analysis then applies the morphologic rules. The syntactic analysis goes on to discriminate the homonyms by use of the Chinese ATN grammar. During the syntactic analysis the mechanism first makes a hypothesis about some state in the Chinese ATN grammar. If the hypothesis matches a current input homonym in the chain, the homonyms are discriminated; otherwise another hypothesis is made. If no hypothesis can be made that suits the grammar, a retrieval is made to correct previous hypotheses. For this purpose, whenever a hypothesis matches a homonym successfully, a retrieval stack is used to store the unmatched homonyms of the current homonym chain as a retrieval point. If there is a syntactic error in the input sentence, the mechanism points out the error and stops. Otherwise a Chinese character sentence is obtained from the data base.

The three parts mentioned above are independent of one another, so any part can be revised if necessary. The system structure is shown in Fig. 1.

The System Principle
The expert system takes a Chinese Pin-Yin code sentence as input and produces a Chinese character sentence as a result. The inference procedure is as follows:

The inference procedure (input, output)
begin
1. set a Chinese Pin-Yin code sentence into the data base;
2. build the homonym chain of each word and revise the data base;
3. set an input pointer (IP) to the first word in the sentence;
4. morphologic analysis: discriminate homonyms through morphology from the lexical base;
5. while (IP ≠ end) syntactic analysis: discriminate homonyms through the Chinese ATN grammar with predictive forward chaining;
6. special analysis: discriminate homonyms which have the same Chinese syntactic category through the special base;
7. if there are still homonyms, the process waits for an operator to choose one on the keyboard;
8. output the correct result;
9. if (continue), then call the procedure (input, output) else stop.
end.
Fig. 1. The system structure
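Step 2 of the inference procedure, building a homonym chain for each Pin-Yin code, can be sketched in Python as follows. This is only a minimal illustration and not the author's C implementation: the `LexEntry` type, the `build_chains` function and the toy lexicon with its frequency values are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexEntry:
    """One candidate word for a Pin-Yin code (hypothetical structure)."""
    pinyin: str      # Pin-Yin code with tone digit, e.g. "ta1"
    character: str   # Chinese character expression
    category: str    # syntactic category symbol, e.g. "P" for pronoun
    frequency: int   # word frequency (toy values, not from the paper)

# Toy lexicon; the entries and frequencies are illustrative only.
LEXICON = [
    LexEntry("ta1", "他", "P", 90),
    LexEntry("ta1", "她", "P", 60),
    LexEntry("shi4", "事", "N", 70),
    LexEntry("shi4", "是", "V", 95),
]

def build_chains(sentence, lexicon):
    """Return one homonym chain per input code, most frequent candidate first."""
    chains = []
    for code in sentence:
        chain = [entry for entry in lexicon if entry.pinyin == code]
        chain.sort(key=lambda entry: entry.frequency, reverse=True)
        chains.append(chain)
    return chains

chains = build_chains(["ta1", "shi4"], LEXICON)
```

Each chain here is a plain list ordered by frequency; a code that is absent from the lexicon simply yields an empty chain, which a fuller system would have to report as an error.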
LEXICAL BASE AND MORPHOLOGIC ANALYSIS
Lexical Base
The lexical base is an important part of the knowledge base and the basis of the expert system. The knowledge base needs to incorporate the knowledge required to perform the task well. A frame is used to represent a word and its morphology, as shown in Fig. 2 and Fig. 3.

Pin-Yin code frame:
    category-of:
    Chinese-character:
    word-frequency:
    homonym:
        default: (1, NIL)
        if-needed: if (number of homonyms is N and the successive homonym is pointed to by NHP), then (N, NHP).

Fig. 2. Pin-Yin code frame

In the above frame, 'category-of' means the syntactic category of the word, which is the basis of the morphologic and syntactic analysis. In view of contemporary Chinese morphology and linguists' suggestions, all Chinese words are divided into 13 groups, as follows:

noun: N; adjective: A; adverb: D; pronoun: P; verb: V; auxiliary: Z; numeral: M; classifier: Q; conjunction: C; auxiliary word: H; preposition: R; adverbial word (for example, noun of time and place): T; noun of locality: F.

'Chinese-character' means the Chinese character expression corresponding to the Pin-Yin code in the frame. Though homonyms have the same Pin-Yin code, their Chinese characters are different and give an exact expression.

'Word-frequency' indicates the order in which words are analysed: the greater the frequency, the higher the priority of the word. This improves the analysis efficiency.

In the 'homonym' slot, the 'if-needed' slot contains an attached procedure to determine the slot's value if necessary during the morphologic and syntactic analysis. The 'default' slot suggests a value unless there is contradictory evidence. N is the total number of homonyms and NHP is a pointer to the successive homonym. When NHP is NIL, there is no further homonym.

Category frame:
    prev-category:
    succ-category:

Fig. 3. Syntactic category frame

On the basis of the category frame we construct a syntactic category list, shown in Fig. 4.

A    A frame:    prev-category:        succ-category:
D    D frame:    prev-category: NIL    succ-category: A, V
N    ...

Fig. 4. A category list

In contemporary Chinese there are many restrictions on the constitution of adjacent words. For example, adjectives modify nouns, so the successive syntactic category of an adjective should match noun. The syntactic category frame represents just such modification rules. Prev-category means the permitted syntactic category of a previous adjacent word (PAW), and succ-category the permitted category of a successive adjacent word (SAW). If the PAW and SAW categories of some homonym do not meet the category frame, the homonym is deleted from the data base and the homonym slot in the code frame is revised.

Morphologic Analysis
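The adjacency check that the morphologic analysis performs with the category frames can be sketched as follows. The representation is a simplifying assumption and is not taken from the paper: each frame is a pair of sets of permitted neighbour categories, `None` stands for NIL and is read here as "no restriction", and the values in `FRAMES` are invented for illustration, apart from the adverb frame (prev-category NIL, succ-category A or V) shown in Fig. 4.

```python
# Category frames: category -> (permitted prev categories, permitted succ categories).
# None means "no restriction" in this sketch (an assumption about how NIL is used).
FRAMES = {
    "D": (None, {"A", "V"}),   # adverb: values as in Fig. 4
    "A": (None, {"N", "H"}),   # adjective: illustrative values only
    "N": (None, None),         # noun: left unrestricted here
    "V": (None, None),         # verb: left unrestricted here
}

def permitted(allowed, neighbour_categories):
    """True if the slot is unrestricted, there is no neighbour, or some neighbour category is allowed."""
    if allowed is None or neighbour_categories is None:
        return True
    return bool(allowed & neighbour_categories)

def morphologic_filter(chains):
    """chains: list of lists of (character, category) pairs. Returns the filtered chains.

    Neighbour categories are read from the unfiltered input chains,
    which is a simplification of the in-place deletion described in the text.
    """
    result = []
    for i, chain in enumerate(chains):
        prev_cats = {c for _, c in chains[i - 1]} if i > 0 else None
        succ_cats = {c for _, c in chains[i + 1]} if i + 1 < len(chains) else None
        kept = []
        for character, category in chain:
            prev_allowed, succ_allowed = FRAMES[category]
            if permitted(prev_allowed, prev_cats) and permitted(succ_allowed, succ_cats):
                kept.append((character, category))
        result.append(kept)
    return result

# Toy input: two candidates for the first word, one for the second.
example = [[("很", "D"), ("狠", "A")], [("好", "A")]]
filtered = morphologic_filter(example)
```

With these toy frames the adjective candidate for the first word is deleted, because its succ-category slot does not admit a following adjective, while the adverb candidate survives.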
The analysis discriminates homonyms by use of Chinese morphology expressed with the syntactic category frame. Because some homonyms may be discriminated during the morphologic analysis, the efficiency of the syntactic analysis, and hence of the system, is improved. It works once all homonyms have been obtained.

The procedure morphologic analysis
begin
1. while (IP ≠ end)
2. search for a word with homonyms from left to right;
3. obtain the category frame of the current word;
4. while (NHP ≠ NIL)
5. if (the PAW and SAW categories respectively match the values of the prev-category and succ-category slots)
6. then (go to its successive homonym through NHP); else (delete the homonym from its chain);
end.

A CHINESE ATN GRAMMAR (CATNG) AND SYNTACTIC ANALYSIS

In 1969, Augmented Transition Network grammars (ATNs), the grammar model used as a basis for the syntactic processor, were developed by William Woods. A transition network grammar looks like a finite state transition diagram: it is a directed graph with labeled states and labeled arcs, a distinguished start state, and a set of distinguished final states. The label on an arc indicates the input type. An ATN is produced when each arc is augmented with a test and a sequence of actions. We have chosen the ATN formalism for the grammar because it has the following advantages:

1) perspicuity;
2) power of generation;
3) effective expression;
4) ease of capturing language rules and laws;
5) effective operation.

In particular, its operation seems similar to a person's own operation when understanding natural language. ATN grammars are the most common models in syntactic analysis.

Building a Chinese Augmented Transition Network Grammar (CATNG)

1) Select the sentences to be processed: first define the domain the grammar can process and choose at least 30 sample sentences. The domain processed covers all Chinese simple sentences, while Chinese compound sentences can be processed if the domain is augmented with conjunctions compounding simple sentences;

2) Draw the most common ATN structure of the above model sentences. Add some tests and a sequence of actions to the structure;

3) Substitute subnetworks or circular arcs for frequent parts of the structure.

As described above, we have constructed a Chinese Augmented Transition Network Grammar, shown in Fig. 5, Fig. 6, Fig. 7 and Fig. 8.

Fig. 5. Noun phrase subnetwork (states NP, N1 to N4; legible arc labels: CAT M, CAT P, CAT H, JUMP, POP)

Fig. 6. Verb phrase subnetwork (states VP, V1 to V5; legible arc labels: PUSH PP, CAT H, CAT D, CAT Z, CAT A, CAT T, CAT V, JUMP, POP)

Fig. 7. Prep. phrase subnetwork (states PP, P1 to P4; legible arc labels: CAT R, PUSH VP*, CAT C*, CAT P, PUSH NP*, JUMP, POP)

In these diagrams, the states are represented as bracketed labels and the arcs as directed arrows between the states. The label on each arc indicates how the arc may be taken. The label on some arcs may call for a structure. In the system, the CATNG is represented with production rules of the form IF (condition) THEN (action). The condition part consists of two tests associated with an arc: one refers to the current state, the arc-receiving condition and the current input; the other is a structure test or a context test. The tests must be satisfied for an arc to be taken; the actions are executed as the arc is traversed, revising the current state and the input pointer (IP). In the syntactic analysis, the inference makes a hypothesis about the current state and selects an arc. So each production rule corresponds to an arc with a test and actions.
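One way to picture a production rule of this kind is as a record holding the two tests and the actions of a single arc. The sketch below is an illustration under stated assumptions, not the paper's rule format: the `Rule` type, its field names, the state names and the two sample rules are hypothetical, and the structure or context test is reduced to an optional predicate.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    """IF (condition) THEN (action) for one category-testing arc (hypothetical layout)."""
    state: str        # condition: the current state
    category: str     # condition: category the current input word must have
    next_state: str   # action: the new current state
    context_test: Optional[Callable[[list, int], bool]] = None  # optional second test

    def condition(self, state, words, ip):
        """Both tests must hold for the arc to be taken."""
        if state != self.state or ip >= len(words):
            return False
        if words[ip][1] != self.category:
            return False
        return self.context_test is None or self.context_test(words, ip)

    def action(self, ip):
        """Revise the current state and advance the input pointer."""
        return self.next_state, ip + 1

# Two sample rules; the state names are made up for the example.
RULES = [Rule("S0", "N", "S1"), Rule("S1", "V", "S2")]

def step(state, words, ip):
    """Fire the first rule whose condition is satisfied, or return None."""
    for rule in RULES:
        if rule.condition(state, words, ip):
            return rule.action(ip)
    return None

# Words are (character, category) pairs.
words = [("事", "N"), ("是", "V")]
first = step("S0", words, 0)
```

Keeping the condition and the action as separate methods mirrors the IF and THEN parts of the rule, so a rule can be tested as a hypothesis without committing to its action.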
Fig. 8. A sentence main structure

A CAT arc may be taken if the current input word is of the syntactic category specified by the arc test; the current state and IP are then revised. A JUMP arc specifies the state to which a jump transition is to be made unconditionally; only the state is revised. If both tests are satisfied, a PUSH arc is taken: the other information is saved on a stack (PS) as the correct return point while the relevant subnetwork is called. When a POP arc is taken, the stack is popped. The constituent which was popped then becomes the current input with which the inference resumes. Arcs marked with the symbol '*' involve one more test on the successive or previous words, because some words have more than one category. While the arrangement of states and arcs reflects the sentence surface structure, the actions on the arcs produce a deep structure.
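The four arc types, together with the retrieval described under Inference Mechanism, can be illustrated with a small recognizer over homonym chains. This is a sketch under several assumptions and not the author's implementation: the toy networks in `NETS` are invented and are not those of Fig. 5 to Fig. 8, tests and structure-building actions are omitted, and Python generators stand in for the explicit push-down stack (PS) and retrieval stack.

```python
# Arcs: ("CAT", category, next), ("JUMP", next), ("PUSH", subnet, next), ("POP",).
# The start state of each network carries the network's own name.
NETS = {
    "S":  {"S": [("PUSH", "NP", "S1")],
           "S1": [("PUSH", "VP", "S2")],
           "S2": [("POP",)]},
    "NP": {"NP": [("CAT", "P", "N1"), ("CAT", "N", "N1")],
           "N1": [("POP",)]},
    "VP": {"VP": [("CAT", "D", "V1"), ("JUMP", "V1")],
           "V1": [("CAT", "V", "V2")],
           "V2": [("PUSH", "NP", "V3"), ("POP",)],
           "V3": [("POP",)]},
}

def traverse(nets, net, state, chains, ip, chosen):
    """Yield (ip, chosen) for every way of reaching a POP arc of `net` from `state`."""
    for arc in nets[net][state]:
        kind = arc[0]
        if kind == "CAT":        # consume one word of the given category
            _, category, nxt = arc
            if ip < len(chains):
                for character, cat in chains[ip]:
                    if cat == category:
                        yield from traverse(nets, net, nxt, chains, ip + 1,
                                            chosen + [character])
        elif kind == "JUMP":     # change state without consuming input
            yield from traverse(nets, net, arc[1], chains, ip, chosen)
        elif kind == "PUSH":     # call a subnetwork, then continue in this one
            _, sub, nxt = arc
            for ip2, chosen2 in traverse(nets, sub, sub, chains, ip, chosen):
                yield from traverse(nets, net, nxt, chains, ip2, chosen2)
        elif kind == "POP":      # leave the current (sub)network
            yield ip, chosen

def analyse(nets, chains, top="S"):
    """Return the first character sequence that covers the whole input, else None."""
    for ip, chosen in traverse(nets, top, top, chains, 0, []):
        if ip == len(chains):
            return chosen
    return None

# Toy homonym chains for the codes "ta1 hua4 hua4"; categories are illustrative.
chains = [[("他", "P")],
          [("话", "N"), ("画", "V")],
          [("画", "N"), ("化", "V")]]
result = analyse(NETS, chains)
```

When a hypothesis leads to a dead end, the generator simply resumes at the most recent untried alternative, which plays the role of the retrieval point. The toy networks contain no JUMP cycles; a network with such cycles would need an additional guard against endless recursion.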
Syntactic Analysis
It is the main part of the expert system.

The syntactic analysis procedure
begin
1. set a state pointer (SP) to the initial state S and the input pointer (IP) to the first word in the sentence;
2. while (IP ≠ end)
3. search for a rule, that is, make a hypothesis, on the SP and IP from the linguistic base;
4. if (the condition is satisfied), then (unmatched homonyms are saved into a retrieval stack (RS) and deleted from the current homonym chain);
5. elseif (RS is not empty),
6. then (a retrieval is made by use of the top element of RS);
7. else (output an error message and stop);
end.

If there is a syntactic error in a sentence, the analysis fails. In contemporary Chinese, because many homonyms have different syntactic categories, some arcs from the current state may match homonyms unsuitable to the current sentence; that is, a production rule may have been chosen incorrectly. This can be corrected with the help of a retrieval. A retrieval pointer is therefore saved only when some homonyms are to be deleted from the chain. If no rule can be matched successfully, a retrieval starts with the top element of RS so as to correct the previous selection. But when RS is empty, the system indicates a syntactic error.

So far the morphologic and syntactic analyses have only discriminated homonyms by syntactic category, keeping those whose category matches the arc taken. How to discriminate homonyms with the same category and select only one of them is still a problem. The special base is designed to provide rules that solve this problem. For example, the production rules for the auxiliary words 'de0' (的, 地, 得) are as follows:

Rule 1: If (SAW category matches N OR NIL), then (discriminate '的');
Rule 2: If (SAW category matches Verb OR Adj.), then (discriminate '地');
Rule 3: If (PAW category matches Verb OR SAW category matches Adv.), then (discriminate '得').

But we have difficulty discriminating the pronouns (P) 'ta1' (他, 她, 它); they can be discriminated only by an operator at the keyboard, because a simple sentence carries little context information. All such cases account for about 5%. The base is therefore going to be improved in the future. After the above process finishes, the system outputs a correct result sentence in Chinese characters and stops.

CONCLUSION

The paper introduces an effective expert system for discriminating Chinese homonyms based on Chinese morphology, syntax and semantics, which also checks the input sentence with the help of the syntactic analysis. Though there is a 5% error rate, which can be corrected by operator intervention, the system is still very effective. The expert system has been built on an IBM-PC and implemented in the C language. The lexical base has more than 35,000 words and 13 rules. The linguistic base and special base have 125 production rules. The system is able to process all Chinese simple sentences and automatically transforms a Chinese Pin-Yin code sentence into a Chinese character sentence. Each part of the system is independent. The experimental results have shown that 95% of Chinese homonyms can be discriminated. The system runs in real time, and when it is combined with a Chinese speech recognition system, Chinese speech is transformed into Chinese characters. The method used is suitable for morphologic and syntactic analysis in natural language processing.

REFERENCES

W. A. Woods (1970). Transition network for language analysis. Communications of the ACM, Vol. 13, No. 10.
Hu Yu-shu. The Contemporary Chinese (revised edition), Shanghai Education Press.
M. Bates (1975). Syntactic analysis in a speech understanding system. BBN Report No. 3116, Bolt Beranek and Newman Inc., Cambridge, Ma.
M. Bates (1978). The theory and practice of ATN grammar. Natural Language Communication with Computers.
M. Bates (1975). The use of syntax in a speech understanding system. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, No. 1, pp. 112-117.