Int. J. Man-Machine Studies (1978) 10, 569-591
A text-to-speech translation system for Italian†

LEONARDO LESMO
Istituto di Scienze dell'Informazione, Università di Torino, CENS, Italy

MARCO MEZZALAMA
Istituto di Elettrotecnica Generale, Politecnico di Torino, CENS, Italy

AND PIERO TORASSO
Istituto di Scienze dell'Informazione, Università di Torino, CENS, Italy

(Received 23 January 1978 and in revised form 20 April 1978)

† This work has been carried out at the Centro per l'Elaborazione Numerale dei Segnali (Digital Signal Processing Laboratory) of the Consiglio Nazionale delle Ricerche of Italy.

This paper describes a text-to-speech translation system for the Italian language. It is based on a formal translator, which converts a graphemic text into the corresponding phonemic representation, and on a suprasegmental features descriptor. The translator is implemented according to the formal techniques developed for the lexical and syntactical analysis of artificial languages. The suprasegmental features descriptor determines pause positions, phoneme durations and pitch contour on the sole basis of the punctuation marks and the accent marks present in the text.
Introduction

The widespread diffusion of computers and computerized machines lends ever increasing interest to man-machine communication techniques which are easy to use and to learn for non-professional computer users. For this reason a great deal of work has been devoted to the development of many sophisticated high-level languages, which can be considered the first step towards the final aim of communicating with computers via quasi-natural languages. Even more useful would be the introduction of communication by speech with computerized machines, since the user may have difficulty in managing the usual input-output devices (e.g. when attending to control tasks requiring continuous attention); other advantages of communicating by speech are the increased communication speed and the possibility of using low-cost, easily available media such as telephone lines. An overview of the applications of speech communication is reported in Flanagan (1976). While speech recognition systems (the input side of man-machine communication) work only under very strong constraints and research in this field has given only partial solutions (Reddy, 1976), computer speech output studies have already produced practical results (Rabiner & Schafer, 1976). A rough classification of the
techniques used will better clarify the field. This classification will be made taking into account two sets of parameters: (a) the size of the components used to produce the speech (i.e. phrases, words, syllables, dyads, phonemes); (b) the coding technique used to represent them. These techniques can be roughly subdivided into two classes depending on the domain (time or frequency) in which the basic units used to produce speech are codified. In the first case signal waveform samples are usually represented by means of PCM (Pulse Code Modulation) or ADPCM (Adaptive Differential Pulse Code Modulation), which allow a data rate of approximately 24 kbit/s. In the second case a frequency description is given in terms of the maxima of the spectrum (formants) or by means of the coefficients of a linear model representing the human phonatory apparatus (LPC: Linear Prediction Coding) (Flanagan, 1976). Table 1 sketches the relationship holding between the size of the components and the coding technique.

TABLE 1
Relationship between the size of the components used to produce the speech and the coding technique used to represent them

                                        Phrases     Words       Morphs      Dyads       Phonemes
PCM or ADPCM                            suitable    suitable    possible    impossible  impossible
LPC                                     suitable    suitable    suitable    possible    impossible
Formants or articulatory parameters     possible    possible    possible    suitable    suitable
From the top left-hand to the bottom right-hand corner of the table an increase in flexibility and complexity can be found. The synthesis-by-rule approach based on a parallel formant synthesizer (corresponding to the bottom right-hand corner of the table) has been chosen (Mezzalama & Rusconi, 1975), because a satisfactory text-to-speech system must be able to produce any word and control the prosodic features (i.e. intonation, stress, etc.). Synthesis-by-rule systems have already been developed for different languages and many studies have been carried out to estimate the values of the parameters used in speech synthesis, namely formant frequency, bandwidth and amplitude (Mattingly, 1966; Rabiner, 1967; Klatt, 1976; Ainsworth, 1974; Vaissiere, 1971; Rothenberg, Carlson, Granstroem & Lundqvist-Gauffin, 1975; Ferrero, Magno-Caldognetto, Vagges & Lavagnoli, 1975a, b). All these studies have shown that suprasegmental (prosodic) features cannot be ignored if output speech of reasonable quality is desired. The time evolution of pitch and the stress levels (that affect phoneme durations and speech signal amplitude) are the most important physical correlates of suprasegmental features in the acoustic domain (Lehiste, 1970). The problem of describing prosodic features can be faced according to two different approaches. The first is based on the introduction into the text of explicit information related to stress levels and pitch contour (fundamental period of the speech signal), so
that the behaviour of these elements can be easily determined. In this case the text is normally given in phonemic form, so that there is a direct correspondence between the phoneme labels and the acoustic parameter values used as input by the synthesizer to generate the output speech. The second approach is related to a much more general problem, since its goal is the translation of a normally written text into the corresponding vocal output. This requires an accurate analysis of the input text in order to determine the implicit structure of the text using knowledge at different levels (phonetics, phonology, syntax and semantics). It is worth noticing that this step can be avoided in a vocal output system for a limited protocol, because a direct storage of the phonetic representation of the vocal output may be convenient (Witten & Madams, 1977).

As stated above, the translation of a written text into speech involves different levels. The whole system can be schematically represented as a sequence of independent processes, as shown in Fig. 1. The first step is the translation of the graphemic text into the corresponding phonemic representation. This operation can be performed in different ways, depending on the language the system is built for. For languages such as English, where the reading of a symbol is not, in general, uniquely determined, the use of a dictionary in which the phonemic representations of words or morphs are recorded is needed if a correct transcription of all words is required (Allen, 1976; Elovitz, Johnson, McHugh & Shore, 1976). For other languages, such as French or Italian, presenting a stronger correlation between graphemic symbols and pronunciation, a simpler solution is possible; it is based on a set of rules that directly convert groups of graphemes into the corresponding phonemes (Divay & Guyomard, 1977; Francini, Debiasi & Spinabelli, 1968).

FIG. 1. Block diagram of the text-to-speech translation system (grapheme-to-phoneme translator followed by the suprasegmental features descriptor).
The second step consists of the determination of the prosodic structure; this problem is still being investigated and it is no simple matter, since there is little or no evidence in the text of the syntactic and semantic structures affecting the acoustic realization. In recent years many efforts have been made to design syntactic and semantic analyzers to determine the internal structure of sentences (grammatical categories, word functions), since only by taking these factors into account can a set of rules be defined to place breath-group boundaries correctly, to assign durations and stress levels and to control pitch evolution (Allen, 1976; Umeda, 1976). Moreover, it is only known how to control suprasegmental features (in particular pitch contour) accurately for very simple sentences: declarative and interrogative, with no subordinate clauses (Olive, 1975; Mezzalama, Rusconi & Torasso, 1976). Approximate solutions have been suggested for unrestricted texts; they are generally based on information embedded in the written text (e.g. punctuation marks), thus avoiding the problem of the syntactic-semantic analysis (Lienard, Teil, Choppy, Renard & Sapaly, 1977; Maeda, 1974).

The third step, i.e. the generation of the output speech stream starting from the phonetic representation, is closely connected to the characteristics of the hardware synthesizer and requires a correct choice of the parameters representing the phonemes.

This paper deals with a text-to-speech translation system for the Italian language. The system design is based on the following main simplifying assumptions: (a) no word dictionary is necessary, since a set of simple rules can be singled out to convert a word from graphemic to phonemic representation; (b) suprasegmental features (pitch contour, phoneme duration, pause position and duration) can be adequately predicted by taking into account just stress positions and punctuation marks.

In normal written Italian text, the accent mark on words in which the last vowel is stressed, and the punctuation marks, are explicit. In all the other words the accent mark is not present and in most words the stress falls on the penultimate vowel; in order to disambiguate these cases and yet avoid the use of a word dictionary, we only need to mark the word stress when it does not fall on the penultimate vowel (for example, a word such as TAVOLO, stressed on the antepenultimate vowel, would be entered as TA'VOLO). When a word has to be emphasized or when the primary stress of the sentence has to be pointed out, a suitable mark can be inserted in the text to denote the stress position.

The output generated after the graphemic-to-phonemic translation and the suprasegmental features extraction processes have been carried out is an abstract representation of the acoustic message in terms of phoneme label sequences and of sets of phoneme attributes. This representation has the advantage of being sufficiently detailed to control a synthesis-by-rule system directly, without constraints associated with the characteristics of the hardware synthesizer or with the parameters used to represent the phonemes.
The synthesis-by-rule system

Before we go on to describe the translation from written text to phonemic representation, we present a brief outline of the implementation of the synthesis-by-rule system. The input to the synthesizer consists of a command string constituted by four different components, denoted by P, A, D, I, where:
P is the list of the symbols corresponding to the input text in terms of labels of the system's elemental phonetic units;
A is the list of the modifications of the target values of the phonetic units; each entry of A is an index pointing to a table in which the coefficients of the linear transformations modifying the parameter steady-state values are stored;
D is the list of the durations of the phonetic units appearing in P;
I is a suitable representation of the pitch contour.

The parameters used to represent the steady-state values of the phonetic units are the voiced/unvoiced ratio and the coefficients of a digital filter H(z) modelling the vocal tract. The filter is implemented in a parallel form according to:

H(z) = Σ (k = 1 to 4) [ (αk + βk z^-1) / (1 - γk z^-1 + δk z^-2) ] + K.

The relation between γk, δk, the centre frequency Fk and the bandwidth Bk of the kth second-order resonator is given by:

γk = 2 e^(-Bk T) cos(2π Fk T),   δk = e^(-2 Bk T)

where T is the sampling period. The parameters αk and βk are related to the gain of the kth resonator and allow the introduction of zeros in the global transfer function.
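To make the resonator relation concrete, a minimal sketch follows (in Python, not part of the original system); the function name, the chosen formant values and the bandwidth convention are illustrative assumptions.

```python
import math

def resonator_coefficients(f_k, b_k, t):
    """Denominator coefficients of one second-order resonator, following the
    relation given above.
    f_k: formant centre frequency (Hz); b_k: bandwidth term, used exactly as it
    appears in the paper's formula (a factor pi may be folded into it depending
    on the convention); t: sampling period (s)."""
    gamma_k = 2.0 * math.exp(-b_k * t) * math.cos(2.0 * math.pi * f_k * t)
    delta_k = math.exp(-2.0 * b_k * t)
    return gamma_k, delta_k

# Example: a first-formant resonator around 700 Hz at the 10 kHz sampling rate.
print(resonator_coefficients(700.0, 300.0, 1.0 / 10000.0))
```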
The grapheme-to-phoneme translation

The grapheme-to-phoneme translation process is sketched in Fig. 2.

FIG. 2. Sketch of the grapheme-to-phoneme translation process.

The formal correctness verification consists of the following tests: (a) legality of consonant clusters; (b) legality of punctuation mark sequences; (c) correctness of accent mark position and insertion of subsumed accents. The translation process receives as input a correct graphemic text and converts it into the corresponding phonemic one; the suprasegmental features detection problem is not tackled in this step.
It is worth noticing the similarities existing between the text conversion problem and the generation of machine code from a high-level language program. In both cases the fundamental steps are the lexical correctness test and the translation based on syntactical rules and on an exploitation of the information implicit in the structure of the language. Taking into account such similarities, formal methods (well studied for compiler construction) have been preferred to heuristic ones for the sake of clarity and generality.

CORRECTNESS
It is obvious that not all sequences of graphemes and words belong to the Italian language, so that an algorithm is necessary to reject the incorrect strings. We have to reject, for example, the successions of consonantal symbols which are not pronounceable, in spite of the fact that each of them has a corresponding phonemic transcription. For instance, the phonemes corresponding to the graphemes M, N, S, T are defined, but it is clear that the string "MNST" is not pronounceable. As it is not known how to define the concept of pronounceability formally, we have chosen to reject the grapheme sequences not appearing in at least one word of the Italian language. Correct sequences of numeric symbols are also accepted; they are translated according to a procedure which will be presented in the following. We also reject texts in which illegal sequences of punctuation marks appear (e.g. "; : ?") or in which we detect more than one stress mark in the same word. In this step we put the stress mark on the penultimate vowel of those words which do not have an explicit stress mark.

The lexical analysis is performed by means of a finite automaton M, which can be formally defined, according to Aho & Ullman (1973), as:

M = (QM, ΣM, δM, q0, FM)

where
QM = {q0, q1, ..., qN} is a finite set of states;
ΣM = {A, B, C, D, E, F, G, H, I, L, M, N, O, P, Q, R, S, T, U, V, Z} ∪ {., ?, !, ;, :, (, )} ∪ {', ", the comma} ∪ {0, 1, ..., 9} is the input alphabet (notice that in Italian there are two types of stress marks: " for open stressed vowels and ' for closed stressed vowels);
q0 is the start state;
FM = {qN} is the set of final states;
δM : QM × ΣM → QM is the transition function.

The complete definition of the transition function δM is reported in Garello & Rabbia (1977); the part of the automaton M concerning the grapheme "g" is reported as an example in Fig. 3. From every state qi (where i ≤ 60, that is, the stress mark has not yet been scanned), the input of the grapheme "a" leads to the state q1 and then, if a stress mark and "g" are scanned, the automaton reaches the state q80. In Italian a single occurrence of "g" after a vowel (corresponding to q80) can be followed by another vowel, a nasal ("m", "n"), a liquid ("l", "r") or a grapheme belonging to the set {"h", "d", "g"}. A double occurrence of "g" after a vowel (state q91) can be followed by a vowel, a liquid or by the grapheme "h". An occurrence of "h" after "g" (q106) can be followed either by "e" or by "i". Just the arcs leaving the states q80, q91 and q106 have been drawn in Fig. 3 almost exhaustively (in fact, the space and the punctuation marks have not been taken into account).
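A minimal sketch of how such a deterministic acceptor can be driven is given below (in Python, not part of the original system); the transition table is a hypothetical toy excerpt covering only the path of the word RA'GGIO discussed next, not the full δM of Garello & Rabbia (1977), and the intermediate state name q61 is invented.

```python
# Toy excerpt of the transition table; the real deltaM has many more entries.
DELTA = {
    ("q0", "R"): "q11",
    ("q11", "A"): "q1",
    ("q1", "'"): "q61",     # invented name for the state reached after the stress mark
    ("q61", "G"): "q80",
    ("q80", "G"): "q91",
    ("q91", "I"): "q63",
    ("q63", "O"): "q64",
}

def scan(word, delta=DELTA, start="q0"):
    """Scan the word symbol by symbol; an undefined transition interrupts the scan,
    which in the full system triggers the error-recovery procedures.
    The full automaton would also check that the final state belongs to FM."""
    state = start
    for symbol in word:
        key = (state, symbol)
        if key not in delta:
            return None, state, symbol    # interruption point handed to error recovery
        state = delta[key]
    return state, None, None

print(scan("RA'GGIO"))   # ends in q64 with this toy table
```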
FIG. 3. Part of the correctness verifier concerning the grapheme G.

For example, the Italian word RA'GGIO (ray) makes the automaton pass through the sequence of states q0, q11, q1, the state reached after the stress mark, and then q80, q91, q63, q64; for A'GHI (needles) we obtain the sequence q0, q1, the stress-mark state, q80, q106, q63; the incorrect word A'GHO (the sequence GHO is not allowed in Italian) makes the acceptor issue an error message, because the transition from the state q106 is not defined for the input symbol O. The automaton has been automatically trained by a modified version of a grammatical inference technique (Torasso, 1977), so that it is deterministic and has the minimum number of states. The use of an automaton allows a very effective lexical analysis, since recognition or rejection time is proportional to the length of the input string (Aho & Ullman, 1973).

The ability to insert the subsumed accent mark is achieved by means of an error recovery mechanism. This consists of a set of procedures activated when the transition function δM is not defined for the current pair (state, input symbol). The activated procedure determines whether the scanning interruption is due to an error, to the absence of an accent mark in the last scanned word or to the presence in the text of a number expressed in numeric form. In the first case the cause of the error is notified to the operator; in the second case an accent mark is inserted after the penultimate vowel of the word and the lexical analysis is resumed from the interruption point; however, a set of function words (i.e. articles, conjunctions, prepositions) are left unstressed, since the spectral analysis of non-emphatic speech has shown that in these words no vowel is affected in duration and/or amplitude. This set of words is stored in a table which is scanned before the subsumed accent is inserted. Finally, when the interruption is due to the occurrence of a number in the text, a procedure is activated that translates the number into its corresponding graphemic representation.

TRANSLATION

The output of the translation step is a string formed by the concatenation of elements of the form C(N) (which will be called from now on "phonemic parts"), where C is the phoneme label and N is an index that, if not equal to zero, specifies some modifications
of the standard values of the phoneme acoustic parameters (allophonic variations). In the output string the punctuation marks are maintained, because the analysis of suprasegmental features, which depends on them, is left to the following steps.

In general there is no one-to-one correspondence between graphemes and phonemes; moreover, in some cases, the grapheme clusters must be translated into different phonemic parts, depending on the context in which they appear. For example, the grapheme "g" corresponds to the phoneme /dʒ/ (as in JOY) when it is followed by "e" or "i" and to the phoneme /g/ (as in GAS) when it is followed by any other grapheme (except for "l" and "n", whose occurrences after "g" produce different sounds which are not described for the sake of simplicity). In order to obtain the phoneme /g/ before an "e" or an "i" it is necessary to insert after "g" the grapheme "h". Analogously, in order to obtain /dʒ/ before a hard vowel ("a", "o", "u") it is necessary to insert an "i" between "g" and the vowel; in this case the "i" is not translated unless it is stressed (e.g. "ghi" → /gi/; "gio" → /dʒo/). The same rules hold for the geminate "g".

However, the main difficulties arise from the characteristics of the "stop" phonemes. In fact, we have accounted for the acoustic realization of the stop phonemes by means of three different phonemic parts (PhL1, PhL2, PhL3), corresponding to three different configurations of the human vocal tract. In the case of unvoiced stops ("p", "t", "k") PhL1 refers to the silence tract, PhL2 to the burst and PhL3 to the transition towards the following phoneme; PhL2 and PhL3 depend on the following phoneme, while PhL1 is independent of the context. For voiced stops ("b", "d", "g") PhL1 refers to the transition from the preceding phoneme and depends on it, PhL2 refers to the silence and is independent of the adjacent phonemes, and PhL3 refers to the burst and depends on the following phoneme. The dependence of the first phonemic part of /g/ on the preceding vowel prevents us from having a single state corresponding to the occurrence of "g", because the output of that phonemic part must be delayed until the following grapheme has been scanned to disambiguate between /dʒ/ and /g/.

The most important allophonic variations of the phonemes are related to the four following points.

(a) Stress. The stress level is indicated by an allophonic variation. The index N is equal to 1 if the phrase stress falls on the vowel, equal to 2 if the word stress falls on the vowel, and equal to 3 if the stressed vowel is the last of the word (refer to the section "Phoneme duration computation").

(b) Gemination. If the same consonantal grapheme occurs twice in adjacent positions, then the consonant is "geminate"; in this case the index N takes the value 5. For example, the pair "LL" in the word "PALLONE" (ball) is represented as L(5).

(c) Elision. When a string of the type "MARIA AVEVA" (Mary had) appears in the text, the characteristic phenomenon known as "elision" occurs; that is, two unstressed occurrences of the same vowel separated by the space are not pronounced as a concatenation of two instances of the vowel, but as a single occurrence of the same vowel with a lengthened duration. This phenomenon is accounted for by the introduction into the output string of a single phonemic part, with the modification index 4.

(d) Dependency on the context for stops.
As stated above, some of the phonemic parts into which the "stop" consonants are translated depend on the characteristics of the adjacent phonemes. This fact is accounted for by introducing allophonic variations which apply to the context-dependent phonemic parts. The meanings of all the allophonic variation indices are given in Table 2.

TABLE 2
Meaning of the allophonic variation indices

0    No modification
1    Emphatic stress
2    Word stress (the stressed vowel is not the last phoneme of the word)
3    Word stress (the stressed vowel is the last phoneme of the word)
4    Elision
5    Gemination
6    Stop phonemic part influenced by A
7    Stop phonemic part influenced by E
8    Stop phonemic part influenced by I
9    Stop phonemic part influenced by O
10   Stop phonemic part influenced by U

The formal device used to perform the translation is a finite transducer, which is obtained by taking a finite automaton and permitting the machine to emit a string of output symbols on each move (Aho & Ullman, 1973). It is formally defined as:

T = (QT, ΣT, ΔT, fT, s0, FT)

where
QT = {s0, s1, ..., sK} is the finite set of states;
ΣT = ΣM is the input alphabet;
ΔT = {C(N)}, where C is a phonemic label and N is an allophonic variation index, is the output alphabet;
fT : QT × ΣT → QT × ΔT* is a mapping defining the next state and the output symbol sequence as a function of the current state and input symbol;
s0 is the initial state;
FT = {sK} is the set of final states.

The fT mapping (reported in Garello & Rabbia, 1977) has been defined in such a way that the transducer T is deterministic and the translation of the input string is uniquely determined. In Fig. 4 the part of the translator T concerning the occurrence of "g" after "a" is reported. The labels associated with the arcs of the figure are of the form a/β, where a ∈ ΣT is the input symbol which causes the transition from the state si (from which the arc leaves) to the state sj, and β (the output string) is either the empty string (ε) or a sequence β1 β2 ... βn, where each βk ∈ ΔT for k = 1, 2, ..., n. In Fig. 5 some examples of the behaviour of the transducer during the translation of Italian words are reported.
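The following is a minimal sketch of how a deterministic transducer of this kind can be run (in Python, not part of the original system); the three-entry mapping and the emitted phonemic parts are hypothetical stand-ins for the fT of Garello & Rabbia (1977), chosen only to show the (state, symbol) → (state, output string) mechanics.

```python
# Toy excerpt of the fT mapping: (state, input symbol) -> (next state, emitted parts).
# States and entries are hypothetical; outputs follow the C(N) notation of the text.
F_T = {
    ("s1", "G"): ("s51", ["A(2)"]),              # the stressed "a" is emitted once "g" is seen
    ("s51", "H"): ("s103", ["G1(6)", "G2(0)"]),  # "gh": the velar /g/ is now unambiguous
    ("s103", "I"): ("s104", ["G3(8)"]),          # burst part influenced by the following "i"
}

def transduce(symbols, mapping, start):
    """Run the transducer, concatenating the phonemic parts emitted on each move."""
    state, output = start, []
    for symbol in symbols:
        state, emitted = mapping[(state, symbol)]  # deterministic: exactly one entry per pair
        output.extend(emitted)
    return state, output

# Toy run over the tail of LA'GHI once the stressed "a" has been read:
print(transduce(["G", "H", "I"], F_T, "s1"))
```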
[Fig. 4 (not reproduced): state-transition diagram of the part of the translator T concerning the occurrence of "g" after "a".]

FIG. 5. Some examples of translation of Italian words: Example 1, LA'GO (lake); Example 2, LA'GHI (lakes); Example 3, RA'GGI (rays); Example 4, PA'GINA (page); Example 5, RA'GGIO (ray); Example 6, SA'GGE (wise). IS is the input symbol and S the state after the transition. Notice that the last vowel of the word is not output until the first grapheme of the successive word has been received, unless a punctuation mark separates the two words (elision).

The description of suprasegmental features

Many studies in acoustic phonetics have shown that the behaviour of prosodic features can be more easily described if it is correlated to a hierarchical segmentation of speech (Lehiste, 1970; Umeda, 1976). In fact a human speaker must insert pauses in the speech stream when he is reading aloud, because of the physiological characteristics of the human phonatory apparatus (limited capacity of the lungs); these pauses can be seen as boundaries of speech segments usually called "breath groups". The pause insertion
process is accomplished by the speaker taking into account the syntactic and semantic structure of the text. As stated above, in our system no syntactic or semantic analysis is performed to determine such a structure; however, a rough segmentation can be obtained directly from the text using the punctuation marks. Such a solution has been shown to be effective for other languages (Lienard et al., 1977), so that under this assumption the segmentation process is feasible and the suprasegmental features descriptor can be split as shown in Fig. 6.

FIG. 6. Sketch of the suprasegmental features description process (text segmentation, phoneme duration computation, pitch contour evaluation).

TEXT SEGMENTATION
In our system the atomic elements of the segmentation process are the sequences of words delimited by two successive punctuation marks; these sequences are called, for obvious reasons, "pseudo-breath-groups" (PBG). The text is segmented in a hierarchical way, since it is considered as a sequence of sentences, each of which is composed of a sequence of PBGs. Every segment (text, sentence, PBG) is uniquely determined by its right delimiter (i.e. by the punctuation mark which closes it), according to the rule shown in Table 3. When the information regarding the syntactic structure of the text is conveyed by elements other than punctuation marks (e.g. conjunctions or prepositions), it may
happen that two consecutive punctuation marks are very far away from each other. If the strategy described above is applied without modifications, then the synthetic speech generated by the system may sound unnatural, since in some cases the PBG may be too long (bearing in mind the relation to the lung capacity). The system is able to notice this situation by evaluating the length of the PBGs by means of an accent counter. The choice of an accent counter as an estimate of the PBG length was suggested by the consideration that the accent number roughly corresponds to the number of content words. When the system detects that the accent counter (zeroed at the beginning of the PBG) exceeds a predetermined threshold, the left-to-right text analysis is interrupted and a backward search is performed to determine the last scanned conjunction or preposition. If one is found, then a special symbol (#) is introduced before it as a PBG delimiter. The left-to-right analysis is resumed from the interruption point with the accent counter set to a value equal to the number of the accent marks scanned during the backward search.

TABLE 3
Punctuation marks as delimiters of segments

&            end-of-text mark
. ? !        end-of-sentence marks
, : ; ( )    end-of-PBG marks
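A minimal sketch of this segmentation strategy is shown below (in Python, not part of the original system); the threshold value, the list of linking words and all identifiers are illustrative assumptions and do not reproduce the system's actual tables.

```python
# Hypothetical sketch of the accent-counter segmentation described above.
ACCENT_THRESHOLD = 4                          # assumed value, not taken from the paper
LINKING_WORDS = {"e", "o", "ma", "di", "a", "da", "in", "con", "su", "per"}

def insert_pbg_delimiters(words):
    """words: tokens of one pseudo-breath-group, with accents marked by an apostrophe."""
    out, accents = [], 0
    for word in words:
        out.append(word)
        if "'" in word:
            accents += 1
        if accents > ACCENT_THRESHOLD:
            # backward search for the last scanned conjunction or preposition
            for i in range(len(out) - 1, -1, -1):
                if out[i].lower() in LINKING_WORDS:
                    out.insert(i, "#")        # extra PBG delimiter
                    # the counter keeps only the accents scanned after the delimiter
                    accents = sum("'" in w for w in out[i + 1:])
                    break
    return out
```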
The heuristic considerations which have suggested this method do not guarantee correct placement of the PBG delimiter in all cases; for example, when no conjunction or preposition is found in the part of the PBG already scanned, no delimiter is inserted at all.

Since the acoustic correlates of the breath-groups are the pause durations, the system forces the insertion of a special phoneme (whose parameters produce a silence in the synthesized speech) when an end-of-PBG mark (# included) is encountered. The durations of the pauses inserted in the text are expressed by means of the allophonic variation indices of the special phoneme. The indices (and the durations) depend on the punctuation mark encountered in the text, according to Table 4.

TABLE 4
Pause index and duration as a function of the punctuation mark

Punctuation mark    Allophonic variation index    Pause duration (ms)
. ? !               11                            200
; :                 12                            100
( )                 13                            50
,                   14                            25

PHONEME DURATION COMPUTATION

The variations of the phoneme durations are, together with pitch motion, the main acoustic realization of the prosodic structure of the text (Umeda, 1976). The contextual parameters which affect the intrinsic phoneme duration have been investigated for different languages (Umeda, 1976; Cinguino, Comoglio, Lesmo, Mezzalama, Rusconi & Torasso, 1975); the results of these studies have shown that the most important of these parameters are: (a) the stress level; (b) the position of the phoneme in the word and of the word in the sentence or in the breath-group; (c) the influence of the characteristics of the adjacent phonemes; (d) the sentence type.

For Italian the differences in duration between stressed and unstressed syllables are mainly due to variations in the vowel duration, while the duration of a given consonantal phoneme can be roughly considered as a constant. Moreover, the variations due to context are more significant for stressed vowels, according to the rule:

DS = V1 (K1 DI + K2 - K3 DC)

where
DS = stressed vowel duration;
DI = intrinsic duration of the vowel;
DC = duration of the consonantal cluster following the vowel;
K1, K2, K3 are appropriate constants;
V1 is a variable whose value depends on the stress level.

If the stress falls on the last vowel of the word the duration is computed according to

DS = K4 DI.

The vowel duration is also modified when it occurs in prepausal position; the rule is:

DP = V2 DI
where
DP = prepausal vowel duration;
DI = as above;
V2 is a variable whose value depends on the duration of the pause.

The current values of the parameters introduced above are listed in Table 5.
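The three duration rules can be transcribed as in the following sketch (a plain Python rendering of the formulae, using the constants listed in Table 5, not the system's FORTRAN code).

```python
# Stressed-vowel duration rules, transcribed from the formulae and the constants of Table 5.
K1, K2, K3, K4 = 2.5, 30.0, 0.3, 1.3          # K2 expressed in milliseconds

def stressed_vowel_duration(d_i, d_c, v1, last_vowel_of_word=False):
    """d_i: intrinsic vowel duration (ms); d_c: duration of the following consonantal
    cluster (ms); v1: 1 for emphatic stress, 0.7 for word stress (Table 5)."""
    if last_vowel_of_word:
        return K4 * d_i
    return v1 * (K1 * d_i + K2 - K3 * d_c)

def prepausal_vowel_duration(d_i, pause_ms):
    """Prepausal lengthening: V2 depends on the duration of the following pause (Table 5)."""
    if pause_ms > 200:
        v2 = 1.5
    elif pause_ms >= 100:
        v2 = 1.2
    else:
        v2 = 1.0
    return v2 * d_i

# A word-stressed vowel (80 ms) followed by a 115 ms stop cluster:
# 0.7 * (2.5*80 + 30 - 0.3*115) = 136.85 ms, close to the 135 ms assigned
# to the stressed A of STA'BILE in the Appendix.
print(stressed_vowel_duration(80.0, 115.0, v1=0.7))
```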
TABLE 5
Parameter values for the computation of stressed vowel duration

DI    80 ms     intrinsic vowel duration
V1    1         emphatic stress
      0.7       word stress
V2    1.5       pause duration > 200 ms
      1.2       pause duration in [100, 200] ms
      1         pause duration < 100 ms
K1 = 2.5;  K2 = 30 (ms);  K3 = 0.3;  K4 = 1.3

A global modification of the phoneme durations is determined by the sentence type. In particular, all phoneme durations in an interrogative PBG are shortened using a multiplicative constant. Finally, some of the allophonic variations previously introduced affect phoneme duration, for example the allophonic variations related to geminate consonants and to elision between vowels.

PITCH CONTOUR EVALUATION

Since the pitch is the major correlate of high-level prosodic structure (syntax and semantics), the process for determining the pitch contour must take into account the hierarchical structure obtained by the segmentation process. The global characteristics of the pitch contour are related to the sentence type (interrogative, exclamative or declarative); the sentences of the text are handled independently of each other, according to the behaviour of the human speaker, who generally introduces long pauses between sentences in order to reset the phonatory system.

The pitch contour of each sentence can be schematically described as a sequence of movements starting from a reference level called "declination line" (Maeda, 1974). The studies carried out for isolated sentences have shown that movement placings are related to stress positions and movement amplitudes depend on stress levels. The relative position of the movements gives the global shape of the pitch contour, which is the most important cue to distinguish the different PBG types. Notice that the pitch description can be given in terms of fundamental frequency or fundamental period; the latter description will be used from now on.

In declarative PBGs the pitch curve has relative minima (maxima in the frequency domain) in correspondence to accent positions, except for the last accent in the PBG, where the curve presents a rise (fall in the frequency domain) towards the absolute maximum (minimum). The interrogative PBGs differ from the declarative ones mainly in their larger movement amplitudes, sharper transitions and in the characteristic rising-falling (falling-rising) contour at the end of the PBG. Suspensive PBGs are characterized by a relative minimum (frequency maximum) in correspondence to the last accent of the PBG, after which they present a small rising-falling-rising (falling-rising-falling) movement. This very particular shape is obtained by introducing a pseudo-accent on the very last vowel of the PBG. For emphatic PBGs, the pitch contour is quite similar to that characterizing declarative PBGs, but exhibits larger movements, as reported in Mezzalama & Rusconi (1975b). In particular, a relevant rise (fall in the frequency domain) is associated with the end of the PBG.
TABLE 6
PBG type as a function of the punctuation mark which closes the PBG

.            Declarative PBG
?            Interrogative PBG
, : ; ( )    Suspensive PBG
!            Emphatic PBG
PBG type is determined by the punctuation mark which closes the PBG, according to Table 6.

The global process of the pitch contour description can be outlined as follows (remember that the pitch contour is given in the time domain).

Step 1. Each sentence is assigned a declination line, which is a straight line segment characterized by its starting value VI and by its angular coefficient aDL. These parameters depend on the sentence type and are reported in Table 7.

TABLE 7
Starting value (VI) and angular coefficient (aDL) of the declination line as a function of the sentence type

                          VI (ms)    aDL
Declarative sentence      9.5        0.1
Interrogative sentence    9.0        0.0
Emphatic sentence         9.0        0.1
Step 2. For each PBG, the movement values are determined in correspondence to the suprasegmental mark (SM) positions (the term "suprasegmental mark" is used to refer both to accent marks and to punctuation marks). Then the pitch value at those points is computed by means of the following formula:

VSM = VDL + ΔVSM
where VSM is the pitch value in correspondence to the SM position, VDL is the value of the declination line in the same position, and ΔVSM is the movement value; it depends on the corresponding SM, its position in the PBG and the PBG type, as shown in Table 8. VSM is maintained for a period of time (stability period) equal to the duration of the stressed vowel if the SM is an accent mark, or equal to zero if the SM is a punctuation mark.

Step 3. The evaluation of the pitch contour is completed by appropriate interpolations between the pairs of pitch values corresponding to consecutive SMs. The interpolation algorithm is based on the following formulae:

VS1(t) = VDL(t) + ΔVS1 e^(-(t - tS1)/τ1)

VS2(t) = VDL(t) + ΔVS2 e^(-(tS2 - t)/τ2)
TABLE 8
The movement value (ΔVSM) and the time constants (τ1, τ2) as a function of the PBG type and of the position of the suprasegmental mark in the PBG

                                           ΔVSM     τ1     τ2
Declarative PBG
  Begin of PBG                             +1.0     0      0
  Not last emphatic stress                 -3.5     6      6
  Not last word stress                     -2.0     3      3
  Last stress of PBG                       +2.0     0      0
  End of PBG                               +2.0     0      0

Suspensive PBG
  Begin of PBG                             +1.0     0      0
  Not last emphatic stress                 -3.5     6      6
  Not last word stress                     -2.0     3      3
  Last stress of PBG (if emphatic)         -3.5     6      3
  Last stress of PBG (if not emphatic)     -2.0     3      3
  Pseudo-accent                            -1.0     3      0
  End of PBG                                0.0     0      0

Interrogative PBG
  Begin of PBG                             -1.0     0      0
  Not last emphatic stress                 -4.5     4.5    4.5
  Not last word stress                     -2.5     3      3
  Last stress of PBG                       +3.0     0      0
  End of PBG                               -3.0     0      0

Emphatic PBG
  Begin of PBG                             +1.0     0      0
  Not last emphatic stress                 -4.0     6      6
  Not last word stress                     -2.5     3      3
  Last stress of PBG                       +2.5     0      0
  End of PBG                               +2.5     0      0
where
S1 and S2 are two consecutive SMs;
ΔVS1 and ΔVS2 are the movement values in correspondence to S1 and S2;
tS1 is the end-point of the pitch stability period of S1;
tS2 is the starting-point of the pitch stability period of S2;
t, tS1 and tS2 are measured in milliseconds (t is set to zero at the beginning of a sentence);
VDL(t) is the value of the declination line at time t;
τ1 and τ2 are appropriate time constants (see Table 8).

If the functions VS1(t) and VS2(t) intersect at a point tR internal to the interval [tS1, tS2], then the pitch value is:

VP(t) = VS1(t)   if tS1 < t ≤ tR
VP(t) = VS2(t)   if tR < t ≤ tS2

Otherwise VP(t) is the linear interpolation between VS1(tS1) and VS2(tS2). A graphical example of the use of the parameters in the process of pitch generation is shown in Fig. 7.

FIG. 7. A graphical example of the use of the parameters in the process of pitch generation (pitch period vs. time).

In Fig. 8 an example of natural and computed pitch curves (in terms of period) is given for the sentence: "Elezioni americane. Concluse le elezioni primarie, sono ora i congressi dei partiti, le cosiddette convenzioni, che dovranno designare i candidati alla Casa Bianca." (The American elections. Now that the primary elections are over, the party congresses, the so-called conventions, will have to designate the candidates for the White House.) For the sake of clarity the motion of the amplitude is also shown. In this case the computed contour matches the natural one very well, except in the final part of the sentence. A good approximation is obtained in correspondence to the stresses, since the emphatic accents have been positioned on those vowels which have been emphasized by the speaker.
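As a worked illustration of Steps 1-3, the sketch below (in Python, not the authors' implementation) evaluates the pitch period between two consecutive suprasegmental marks of a declarative sentence, using the declination line of Table 7 and movement values and time constants in the style of Table 8; the handling of zero time constants and the time unit of the declination slope are assumptions.

```python
import math

def declination(t_ms, v_i=9.5, a_dl=0.1):
    """Declination line of a declarative sentence (Table 7); pitch period in ms.
    The time unit over which the slope a_dl applies is an assumption (per second here)."""
    return v_i + a_dl * (t_ms / 1000.0)

def pitch_between_marks(t, t_s1, t_s2, dv_s1, dv_s2, tau1, tau2, steps=200):
    """Pitch period at time t between the stability periods of marks S1 and S2 (Step 3).
    Zero time constants (Table 8) would need a special case and are not handled here."""
    v_s1 = lambda x: declination(x) + dv_s1 * math.exp(-(x - t_s1) / tau1)
    v_s2 = lambda x: declination(x) + dv_s2 * math.exp(-(t_s2 - x) / tau2)
    # look for an intersection point t_r inside [t_s1, t_s2]
    t_r, prev = None, v_s1(t_s1) - v_s2(t_s1)
    for i in range(1, steps + 1):
        x = t_s1 + (t_s2 - t_s1) * i / steps
        cur = v_s1(x) - v_s2(x)
        if cur == 0.0 or (prev > 0) != (cur > 0):
            t_r = x
            break
        prev = cur
    if t_r is not None:
        return v_s1(t) if t <= t_r else v_s2(t)
    # no intersection: linear interpolation between the two stability values
    frac = (t - t_s1) / (t_s2 - t_s1)
    return v_s1(t_s1) + frac * (v_s2(t_s2) - v_s1(t_s1))

# Two consecutive word stresses (neither last) of a declarative PBG: dV = -2.0, tau = 3.
print(pitch_between_marks(t=320.0, t_s1=300.0, t_s2=400.0,
                          dv_s1=-2.0, dv_s2=-2.0, tau1=3.0, tau2=3.0))
```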
Conclusions

Some perceptual experiments have been carried out in order to test the performance of the text-to-speech translation system. Synthetic speech has been judged to be intelligible and to have a high degree of naturalness, especially because of the accurate control of the pitch motion. Even though the system includes neither a syntactic nor a semantic analyser, only in very few cases does the intonation produced by the machine fail to reflect the underlying semantic structure of the sentence.

Moreover, in spite of the fact that the system has been designed for Italian, the text-to-speech methodology is quite general and can be used for other languages, because the main characteristic of the system is the sharp distinction between the knowledge, represented by means of formal techniques (transducers, automata, etc.), and the procedural part of the system implementing the strategy. The knowledge sources consist of the correctness rules, the grapheme-to-phoneme translation rules, the values of the parameters used to describe the pitch curve and, in general, the values of the parameters introduced to synthesize a specific language. Obviously, these knowledge sources must be updated if a different language is to be synthesized. On the other hand, the strategy of the system is highly independent of the specific language, because its procedural part is able to implement any finite automaton and transducer. In this way it is possible to study the phonetic and acoustic characteristics of a different language and to single out a suitable set of rules which may be input to the system to obtain the desired behaviour, without changing the set of procedures which have been built to implement the strategy. Analogously, if more sophisticated performance is required, or the system has to be tailored to a variety of special applications, it is not necessary to rewrite the whole system: it is sufficient to update the related knowledge sources.
The use of rules for the graphemic-to-phonemic translation and the deduction of the suprasegmental features behaviour from the position of punctuation marks and stresses allow us to reduce memory size requirements and execution time considerably. These are fundamental constraints if a text-to-speech facility is to be added to a computer system, since more elaborate approaches, embodying large pronunciation dictionaries or a syntactic and semantic analysis, require too much of the available computational resources.

The system has been implemented in FORTRAN on a general purpose minicomputer (HP21MX) using the segment programming facilities supported by the RTE-III operating system. The segment strategy proves effective because the process is strongly sequential and each module communicates with the subsequent one via a small amount of data. The overall system requires 12 kilowords for data and 10 kilowords for code. The overlay of data and code makes it possible to subdivide the program into three segments. The first one (the formal correctness verifier), the second one (the grapheme-to-phoneme translator) and the third one (the descriptor of suprasegmental features) require 4, 7 and 1 kilowords of data and 2, 4 and 4 kilowords of code respectively. An independent module, written in assembly code, controls the hardware speech synthesizer (De Mori, Rivoira & Serra, 1975), which outputs the synthetic speech in real time at a frequency of 10 kHz.

The authors are indebted to Professors R. De Mori and A. R. Meo and to Dr E. Rusconi for many useful suggestions and discussions. The work has been carried out at the Centro per l'Elaborazione Numerale dei Segnali (Digital Signal Processing Laboratory) of the Consiglio Nazionale delle Ricerche of Italy.
References

AHO, A. V. & ULLMAN, J. D. (1973). The Theory of Parsing, Translation and Compiling. Prentice-Hall.
AINSWORTH, W. A. (1974). Performance of a speech synthesis system. International Journal of Man-Machine Studies, 6, 493.
ALLEN, J. (1976). Synthesis of speech from unrestricted text. Proceedings of the IEEE, 64, 433.
CINGUINO, M., COMOGLIO, G., LESMO, L., MEZZALAMA, M., RUSCONI, E. & TORASSO, P. (1975). Phonetic rules for the synthesis of Italian language (in Italian). In FERRERO, F., Ed., Proceedings of the Symposium Sintesi della Parola, Padova, pp. 212-237.
DE MORI, R., RIVOIRA, S. & SERRA, A. (1975). A special purpose computer for digital signal processing. IEEE Transactions on Computers, C-24, 1202.
DIVAY, M. & GUYOMARD, M. (1977). Grapheme to phoneme transcription for French. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hartford, pp. 575-578.
ELOVITZ, H. S., JOHNSON, R., McHUGH, A. & SHORE, J. E. (1976). Letter-to-sound rules for automatic translation of English text to phonetics. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-24, 446.
FERRERO, F., MAGNO-CALDOGNETTO, E., VAGGES, K. & LAVAGNOLI, C. (1975a). Some acoustic and perceptual characteristics of the Italian vowels. Proceedings of the 8th International Congress of Phonetic Sciences, 91c.
FERRERO, F., MAGNO-CALDOGNETTO, E., VAGGES, K. & LAVAGNOLI, C. (1975b). Some acoustic characteristics of Italian consonants. Proceedings of the 8th International Congress of Phonetic Sciences, 296c.
FLANAGAN, J. L. (1976). Computers that talk and listen: man-machine communication by voice. Proceedings of the IEEE, 64, 405.
FRANCINI, G. L., DEBIASI, G. B. & SPINABELLI, R. D. (1968). Study of a system of minimal speech reproducing units for Italian speech. Journal of the Acoustical Society of America, 43, 1282.
GARELLO, M. & RABBIA, E. (1977). Natural language analysis: grapheme-to-phoneme translation (in Italian). Thesis, University of Turin.
KLATT, D. H. (1976). Structure of a phonological rule component for a synthesis-by-rule program. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-24, 391.
LEHISTE, I. (1970). Suprasegmentals. Cambridge: MIT Press.
LIENARD, S. C., TEIL, D., CHOPPY, C., RENARD, G. & SAPALY, J. (1977). Diphone synthesis of French: vocal response unit and automatic prosody from the text. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hartford, pp. 560-563.
MAEDA, S. (1974). A characterization of fundamental frequency contours of speech. Quarterly Progress Report, 114. Cambridge, Mass.: MIT.
MATTINGLY, I. G. (1966). Synthesis by rule of prosodic features. Language and Speech, 9, 1-13.
MEZZALAMA, M. & RUSCONI, E. (1975). A general system for synthesizing speech. In FANT, G., Ed., Speech Communication. Uppsala: Almqvist and Wiksell, pp. 307-314.
MEZZALAMA, M. & RUSCONI, E. (1975b). Intonation in speech synthesis: a preliminary study for the Italian language. In FANT, G., Ed., Speech Communication. Uppsala: Almqvist and Wiksell, pp. 315-325.
MEZZALAMA, M., RUSCONI, E. & TORASSO, P. (1976). Automatic generation of pitch contour for speech synthesis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, pp. 82-82c.
OLIVE, J. P. (1975). Fundamental frequency rules for the synthesis of simple declarative English sentences. Journal of the Acoustical Society of America, 57, 476.
RABINER, L. R. (1967). Speech synthesis by rule: an acoustic domain approach. Ph.D. Dissertation, MIT, Cambridge, Mass.
RABINER, L. R. & SCHAFER, R. W. (1976). Digital techniques for computer voice response: implementations and applications. Proceedings of the IEEE, 64, 416.
REDDY, D. R. (1976). Speech recognition by machine: a review. Proceedings of the IEEE, 64, 501.
ROTHENBERG, M., CARLSON, R., GRANSTROEM, B. & LUNDQVIST-GAUFFIN, J. (1975). A three parameter voice source for speech synthesis. In FANT, G., Ed., Speech Communication. Uppsala: Almqvist and Wiksell.
TORASSO, P. (1977). Grammatical inference of fuzzy languages and syntactic pattern recognition. Proc. Informatica 77, Bled (YU), 1.113.
UMEDA, N. (1976). Linguistic rules for text-to-speech synthesis. Proceedings of the IEEE, 64, 443.
VAISSIERE, J. (1971). Contribution à la synthèse par règles du Français. Thesis, Université de Grenoble.
WITTEN, I. H. & MADAMS, P. H. C. (1977). The telephone enquiry service: a man-machine system using synthetic speech. International Journal of Man-Machine Studies, 9, 449.
Appendix

An example is reported here in detail to illustrate the behaviour of the text-to-speech system described above. The graphemic text given as input is:

ESPLOSIONE IN UN DEPO'SITO: LO STA'BILE E" STATO DEMOLITO. NESSUNA VI'TTIMA. &

(Explosion in a store: the building has been destroyed. No-one killed.) The correctness verifier accepts the string and inserts the subsumed accents, producing the following representation:

ESPLOSIO'NE IN UN DEPO'SITO: LO STA'BILE E" STA'TO DEMOLI'TO. NESSU'NA VI'TTIMA. &
Notice that, as stated above, the function words IN (in), UN (a), LO (the) are left unstressed. The output of the translation step is the following phonemic representation:

E(0) SN(0) P1(0) P2(0) P3(0) L(0) O(0) SV(0) I(0) OL(2) N(0) E(0) I(0) N(0) U(0) N(0) D1(0) D2(0) D3(7) E(0) P1(0) P2(9) P3(9) OL(2) SV(0) I(0) T1(0) T2(9) T3(9) O(0) : L(0) O(0) SN(0) T1(0) T2(6) T3(6) A(2) B1(6) B2(0) B3(8) I(0) L(0) E(0) EL(3) SN(0) T1(0) T2(6) T3(6) A(2) T1(0) T2(9) T3(9) O(0) D1(9) D2(0) D3(7) E(0) M(0) O(0) L(0) I(2) T1(0) T2(9) T3(9) O(0) . N(0) E(0) SN(5) U(2) N(0) A(0) V(0) I(2) T1(5) T2(8) T3(8) I(0) M(0) A(0) . &

The phonemic labels SN and SV refer respectively to the non-vocalized and vocalized S ("s", "z"). EL, ES and E refer respectively to the open stressed E, the closed stressed E and the unstressed E; analogously for OL, OS and O. Remember that each "stop" phoneme is translated into three different phonemic parts. The meanings of the allophonic variation indices are given in Table 2.

The segmentation algorithm subdivides the text into two declarative sentences; the first one is: ESPLOSIONE IN UN DEPOSITO: LO STABILE E" STATO DEMOLITO. The second one is: NESSUNA VITTIMA. Notice that the two sentences have been given in graphemic form only for the sake of clearness; the algorithm actually operates on the phonemic text generated by the previous step. The first sentence is composed of a suspensive PBG (ESPLOSIONE IN UN DEPOSITO:) and of a declarative PBG (LO STABILE E" STATO DEMOLITO.). The step of text segmentation inserts pauses of suitable duration corresponding to end-of-PBG marks.

The next step evaluates the phoneme durations, obtaining the following information (the allophonic variation indices are actually maintained, but they are omitted for the sake of brevity):

E/80/ SN/135/ P1/110/ P2/20/ P3/25/ L/90/ O/80/ SV/135/ I/80/ OL/140/ N/90/ E/80/ I/80/ N/90/ U/80/ N/90/ D1/35/ D2/75/ D3/20/ E/80/ P1/110/ P2/20/ P3/25/ OL/135/ SV/135/ I/80/ T1/125/ T2/20/ T3/25/ O/80/ Q/100/ : L/90/ O/80/ SN/135/ T1/125/ T2/20/ T3/25/ A/135/ B1/35/ B2/60/ B3/20/ I/80/ L/90/ E/80/ EL/105/ SN/135/ T1/125/ T2/20/ T3/25/ A/125/ T1/125/ T2/20/ T3/25/ O/80/ D1/35/ D2/75/ D3/20/ E/80/ M/90/ O/80/ L/90/ I/125/ T1/125/ T2/20/ T3/25/ O/80/ Q/200/ . N/90/ E/80/ SN/200/ U/140/ N/90/ A/80/ V/85/ I/110/ T1/200/ T2/20/ T3/25/ I/80/ M/90/ A/80/ .

The numbers between slashes are the durations of the phonemes expressed in milliseconds.

The last step is concerned with the determination of the pitch contour. The application of the rules given in the paper generates the pitch curve reported in Fig. 9. However, this curve only gives the global shape of the pitch contour; in fact not all the phonemes are voiced and, to take this fact into account, at the synthesis level (third block of Fig. 1) the pitch curve is combined with the excitation curve that gives the information about the vocalization of the phonemes. In Fig. 9 the motion of the Ks parameter is also shown. Ks is related to the excitation by the following formula:

E(n) = (1 - Ks) P(n) + Ks N(n)
where
E(n) is the input signal to the synthesizer;
P(n) is the glottal pulse;
N(n) is the noise excitation.

Finally, we can describe the four components (denoted in the paper by P, A, D, I) which form the command string input to the synthesis-by-rule system.

P = E, SN, P1, P2, P3, L, O, SV, I, OL, N, E, I, N, U, N, D1, D2, D3, E, P1, P2, P3, OL, SV, I, T1, T2, T3, O, Q, L, O, SN, T1, T2, T3, A, B1, B2, B3, I, L, E, EL, SN, T1, T2, T3, A, T1, T2, T3, O, D1, D2, D3, E, M, O, L, I, T1, T2, T3, O, Q, N, E, SN, U, N, A, V, I, T1, T2, T3, I, M, A

A = the modifications of the target values of the phonetic units, pointed to by integers unrelated to the previous allophonic variation indices. For example, the "OL" phonemes of the list P are stressed, and the frequency and amplitude values of the standard (unstressed) "O" phoneme are modified according to a linear transformation whose coefficients are stored in the 12th entry of the transformation table. In this case the values of the formants' amplitude are multiplied by a constant equal to 1.5 and the formants' frequencies of the unstressed "O" (F1 = 500, F2 = 850, F3 = 2200) are replaced by F1 = 570, F2 = 870, F3 = 2140.

D = 80, 135, 110, 20, 25, 90, 80, 135, 80, 140, 90, 80, 80, 90, 80, 90, 35, 75, 20, 80, 110, 20, 25, 135, 135, 80, 125, 20, 25, 80, 100, 90, 80, 135, 125, 20, 25, 135, 35, 60, 20, 80, 90, 80, 105, 135, 125, 20, 25, 125, 125, 20, 25, 80, 35, 75, 20, 80, 90, 80, 90, 125, 125, 20, 25, 80, 200, 90, 80, 200, 140, 90, 80, 85, 110, 200, 20, 25, 80, 90, 80

I = list I is composed of samples of the pitch curve described above, taken at a rate of one sample every five milliseconds.
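As a closing illustration, the excitation formula above can be read as in the following sketch (in Python, not the hardware synthesizer's actual sources); the pulse and noise generators are simple placeholders.

```python
import random

def excitation_sample(glottal_pulse, ks):
    """E(n) = (1 - Ks) * P(n) + Ks * N(n): Ks = 0 gives purely voiced excitation,
    Ks = 1 purely noise (e.g. for unvoiced fricatives)."""
    noise = random.uniform(-1.0, 1.0)     # placeholder noise source N(n)
    return (1.0 - ks) * glottal_pulse + ks * noise

print(excitation_sample(glottal_pulse=0.8, ks=0.2))
```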