
DIGITAL SIGNAL PROCESSING 1, 107-110 (1991)

Voice Personality Transformation

Michael Savic and Il-Hyun Nam

Electrical, Computer, and Systems Engineering Department, Rensselaer Polytechnic Institute, Troy, New York 12180-3590

I. INTRODUCTION

Some work has been published in the area of speaker adaptation; however, very little has been published in the related area of voice personality transformation. Speaker adaptation is a process of minimizing the degradation of a speech-processing system operating in a speaker-independent mode. When a speech-processing system is trained for a particular speaker (reference speaker) and then used for a new speaker (input speaker), the system usually performs worse for the new speaker than for the reference speaker. Speaker adaptation transforms the spectral parameters of the input speaker's voice into spectral parameters of the reference speaker's voice, thus minimizing the degradation in performance of the speech-processing system. The spectral parameters can be linear predictive coding (LPC) coefficients, reflection coefficients, cepstrum coefficients, partial correlation (PARCOR) coefficients, energies within frequency subbands, etc.

Voice personality transformation is a process of making one person's voice (the "source") sound like another person's voice (the "target"). The voice personality of a person is characterized by features such as pitch, vocal tract response, speaking rate, regional accent, inflection, and word choice. The pitch and the vocal tract response are acoustic features, while the speaking rate, regional accent, and inflection characterize the speaking style. In order to accomplish adequate voice character transformation, it is necessary to transform the acoustic features of the source into acoustic features of the target, and to simulate the speaking style of the target.

Generation of speech is often modeled by a digital filter which simulates the vocal tract, and by an excitation source which drives the filter, as shown in Fig. 1. The excitation signal is a periodic pulse train for voiced speech (switch in position "voiced") or white noise for unvoiced speech (switch in position "unvoiced"). The vocal tract is modeled using the LPC technique. Coefficients of the digital filter are obtained by linear prediction analysis and are termed LPC coefficients.

FIG. 1. Generation of speech: functional block diagram.
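To make the model concrete, the following Python sketch (our illustration, not part of the original paper) drives an all-pole LPC filter with either a pulse train or white noise; the filter order and coefficient values are arbitrary assumptions.

```python
# Illustrative source-filter model of Fig. 1 (not from the paper): an all-pole
# LPC filter driven by a pulse train (voiced) or white noise (unvoiced).
import numpy as np
from scipy.signal import lfilter

FS = 10000        # sampling rate used in the paper (10 kHz)
FRAME = 100       # 10-ms frame

def excitation(voiced, pitch_period, n=FRAME):
    """Periodic pulse train for voiced frames, white noise for unvoiced."""
    if voiced:
        e = np.zeros(n)
        e[::pitch_period] = 1.0          # one pulse per pitch period
        return e
    return np.random.randn(n)

def synthesize_frame(a, gain, voiced, pitch_period):
    """Drive the vocal-tract filter G / (1 - sum_k a_k z^-k)."""
    e = excitation(voiced, pitch_period)
    return lfilter([gain], np.concatenate(([1.0], -np.asarray(a))), e)

# Example: a voiced frame with an 80-sample (125-Hz) pitch period;
# the second-order coefficients are arbitrary toy values.
frame = synthesize_frame([0.5, -0.3], 1.0, voiced=True, pitch_period=80)
```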

II. VOICE PERSONALITY TRANSFORMATION

The functional block diagram of our voice character transformation system is illustrated in Fig. 2. The three basic stages are the Analyzer, the Parameter Transformer, and the Synthesizer. The technique involves three basic steps: extraction of the vocal tract parameters and the pitch period from utterances of the source and the target in the Analyzer, replacement of the vocal tract parameters and the pitch period of the source with those of the target in the Transformer, and synthesis of the target voice in the Synthesizer.

After preprocessing, each voiced frame of source speech is analyzed in the Analyzer, expressed in terms of a target LPC coefficient vector and a pitch period, and then transformed into a target frame. A neural network is used in the Transformer stage to transform the vocal tract parameters of the source into the vocal tract parameters of the target. Since LPC coefficients are a non-Euclidean space feature, and therefore not suitable for clustering, the LPC cepstrum coefficients were used instead. A three-layer neural network [1] is implemented to transform the LPC cepstrum coefficients of the source into the LPC cepstrum coefficients of the target. LPC cepstrum coefficients of the target are then converted into LPC coefficients of the target. The transformation of the pitch is accomplished using the "histogram modification technique," which is often implemented in digital image processing [2]. Each transformed frame of speech is then downloaded into a vocoder, generating the target voice. This process is repeated frame by frame for the entire utterance.

FIG. 2. Functional block diagram of the voice transformation system.
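The LPC-to-cepstrum step can be written as the standard recursion for an all-pole model H(z) = G / (1 - sum_k a_k z^-k); the paper does not spell out its conversion formula, so the sketch below is our illustration of that step.

```python
# Standard LPC -> LPC cepstrum recursion (our illustration):
#   c_m = a_m + sum_{k=1}^{m-1}   (k/m) c_k a_{m-k},   1 <= m <= p
#   c_m =       sum_{k=m-p}^{m-1} (k/m) c_k a_{m-k},   m > p
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """a: predictor coefficients a_1..a_p; returns c_1..c_{n_ceps}."""
    p = len(a)
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c
```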

III. PREPROCESSING AND THE ANALYZER

After low-pass filtering, the source speech signal is sampled at the rate of 10 kHz in an A/D converter. The frame size is 100 speech samples. Digitized speech samples are then preemphasized by a filter with a transfer function (1 - 0.95z^-1) to reduce the spectral dynamic range. After segmentation into frames, the source speech signal is windowed by a Hamming window. The length of the Hamming window is set to 200 points. Since the size of the analysis frame is 100 sample points, the window moves at a rate of 100 sample points every 10 ms, so that consecutive windows overlap. The effect of the overlapping window is to smooth the LPC analysis, avoiding abrupt changes of LPC coefficients between consecutive frames.

The Analyzer has four functions: extraction of LPC cepstrum coefficients, the voiced/unvoiced decision, pitch period detection, and computation of the input energy. The functional block diagram of the Analyzer is shown in Fig. 3. The LPC predictor coefficients are found by minimizing the least-mean-square error between the real speech sample and the predicted speech sample at time n. The LPC coefficients are then converted into LPC cepstrum coefficients. Pitch detection and the voiced/unvoiced decision are performed using the autocorrelation method with center clipping, described in [5]. This method results in fast processing of the signal, which is essential for real-time applications.

The input energy of speech is evaluated as the sum of the squared speech sample values within a frame of source speech (100 samples). The computed input energy of the source speech is brought to the Synthesizer and used to adjust the energy level of synthesized target speech at the output of the voice transformation system. In other words, the synthesized target speech frame is adjusted to have the same total energy as the incoming source speech frame.

FIG. 3. Functional block diagram of the Analyzer.
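A compact sketch of these Analyzer steps follows (our illustration; the clipping level, voicing threshold, and pitch lag range are assumptions, as the paper defers pitch-detection details to [5]).

```python
# Sketch of the Analyzer: preemphasis (1 - 0.95 z^-1), 200-point Hamming
# windows advanced 100 samples per frame, per-frame input energy, and pitch/
# voicing detection by autocorrelation with center clipping.
import numpy as np
from scipy.signal import lfilter

FS, FRAME, WIN = 10000, 100, 200

def frames_and_energy(speech):
    pre = lfilter([1.0, -0.95], [1.0], speech)           # preemphasis
    w = np.hamming(WIN)
    frames, energy = [], []
    for s in range(0, len(pre) - WIN + 1, FRAME):
        frames.append(pre[s:s + WIN] * w)
        energy.append(np.sum(speech[s:s + FRAME] ** 2))  # input energy
    return np.array(frames), np.array(energy)

def pitch_detect(frame, min_lag=25, max_lag=200, vthresh=0.3, clip=0.3):
    # center clipping suppresses formant structure before autocorrelation
    cl = clip * np.max(np.abs(frame))
    x = np.where(frame > cl, frame - cl,
                 np.where(frame < -cl, frame + cl, 0.0))
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag = min_lag + int(np.argmax(r[min_lag:max_lag + 1]))
    voiced = r[0] > 0 and r[lag] / r[0] > vthresh
    return voiced, (lag if voiced else 0)
```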

IV. THE PARAMETER TRANSFORMER AND THE SYNTHESIZER

The function of the Parameter Transformer is to substitute the acoustic features of the source with the acoustic features of the target. These features are the LPC cepstrum coefficients and the pitch period. The LPC cepstrum coefficients are parameters of the vocal tract, and the pitch period is a parameter of the glottis. A neural network is used to transform vocal tract parameters of the source into vocal tract parameters of the target. Since LPC coefficients are not a Euclidean space feature, they are not suitable for transformation in a neural net; LPC cepstrum coefficients are therefore used instead. The transformation of the pitch period is accomplished using the "histogram modification technique."

The operation of the voice transformer includes two phases, "learning" and "transformation." The relationship (mapping) between the acoustic features of the source and the target is established in the first, "learning" phase. The actual "transformation," i.e., substitution of the acoustic features of the source with the features of the target, is performed in the second, "transformation" phase.

The functional block diagram of the "learning" phase is shown in Fig. 4. "Learning" is accomplished as follows. Two speakers, the source and the target, pronounce the same set of learning sentences. Mapping between these two sets of speech data is performed one frame at a time using Dynamic Time Warping (DTW). DTW maps all frames of the first set to corresponding frames of the second set, making the duration of the first set equal to the scaled duration of the second set. In our system DTW is used to produce a natural segmentation of speech, separating voiced and unvoiced segments.

Vector quantization was implemented to model states of the target vocal tract. Each state of the vocal tract is represented by an LPC coefficient vector. The vocal tract is modeled accurately if a sufficiently large number of states is used. In terms of vector quantization, each state is a "codeword," and a set of states constitutes a "codebook." After dynamic time warping, LPC cepstrum coefficients are extracted from each frame. The source LPC cepstrum coefficient vector is brought to the input of the neural net; the desired output of the neural net is the best-matched target codeword index for voice transformation.

In order to accomplish an accurate mapping of pitch periods from source to target, the pitch period histograms of the source and the target must be known. The pitch period histogram of a person indicates how often that person uses particular pitch periods. The transformation of the pitch period from source to target is performed by modifying the pitch histogram of the source. This is accomplished using the well-known histogram equalization technique, which is often used in digital image processing for image enhancement.
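One way to realize this histogram-based pitch mapping is empirical CDF matching, as in image histogram specification; the sketch below is our reading of the technique, not the authors' code.

```python
# Pitch-period mapping by histogram modification, realized here as empirical
# CDF matching (illustrative only).
import numpy as np

def pitch_mapping(source_pitches, target_pitches):
    """Return a function mapping a source pitch period to a target pitch
    period so that mapped values follow the target pitch histogram."""
    src = np.sort(np.asarray(source_pitches, dtype=float))
    tgt = np.asarray(target_pitches, dtype=float)
    def transform(p):
        q = np.searchsorted(src, p, side="right") / len(src)  # source CDF at p
        return float(np.quantile(tgt, q))    # inverse target CDF (quantile)
    return transform

# Usage: learn from the training pitch tracks, then apply frame by frame.
# to_target = pitch_mapping(source_train_pitch, target_train_pitch)
# new_period = round(to_target(current_source_period))
```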

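The target codebook itself can be built with any standard vector quantizer; the paper does not name its training algorithm, so the k-means-style loop below is only a plausible stand-in for the codebook construction described above.

```python
# Illustrative codebook construction for the target vocal-tract states.
import numpy as np

def train_codebook(vectors, n_codewords, n_iter=20, seed=0):
    """vectors: (N, d) array of target LPC cepstrum vectors."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), n_codewords,
                                  replace=False)].copy()
    for _ in range(n_iter):
        # assign every vector to its nearest codeword (Euclidean distance)
        d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(n_codewords):
            members = vectors[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)   # recentre the codeword
    return codebook

def nearest_codeword(codebook, v):
    """Index of the best-matched codeword (the net's training label)."""
    return int(np.linalg.norm(codebook - v, axis=1).argmin())
```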

The actual transformation is performed using the mapping information obtained in the learning phase. The voice of the source is processed frame by frame. The LPC cepstrum coefficients extracted from each frame are brought to the input of the trained neural network. The output of the neural network is the best-matched target codeword index for voice transformation. The corresponding codewords (target LPC cepstrum coefficients) are then determined from the codebook and sent to the Synthesizer. Figure 5 shows the transformation of LPC cepstrum parameters. The transformation of pitch periods is based on the mapping information (histogram modification) acquired in the learning stage.

Using the transformed LPC cepstrum coefficients and the transformed pitch periods, new frames of "transformed" speech (in the target voice) are created in the Synthesizer. The block diagram of the Synthesizer is shown in Fig. 6. The type of excitation source used to drive the Synthesizer at a particular time depends on the voiced/unvoiced decision made by the Analyzer. If the frame is voiced, the excitation source is a pulse train with a transformed pitch period; if it is unvoiced, the excitation source is white noise. The gain matching procedure involves the adjustment of the energy level of synthesized speech to reflect the energy level of the original source speech.
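Gain matching amounts to rescaling each synthesized frame so that its total energy equals the input energy measured by the Analyzer for the corresponding source frame, for example:

```python
# Gain matching (our sketch): equalize the total energies of the synthesized
# target frame and the incoming source frame.
import numpy as np

def match_gain(synth_frame, source_energy, eps=1e-12):
    synth_energy = np.sum(synth_frame ** 2)
    return synth_frame * np.sqrt(source_energy / (synth_energy + eps))
```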

FIG. 4. Diagram of the learning procedure for LPC parameter transformation.

FIG. 5. Vocal tract parameter transformation.

FIG. 6. Block diagram of the Synthesizer.

V. THE NEURAL NETWORK

Neural networks are devices which map vectors from one space to another. In our particular case, a neural network maps vectors from the parameter space of the source to vectors in the parameter space of the target. In the training phase, the source and the target pronounce the same training utterances. The LPC cepstrum coefficients of the source and the target are extracted in the Analyzer and used for training. LPC cepstrum coefficients of the source are brought to the input, and LPC cepstrum coefficients of the target are used as the desired identification. Training is accomplished using the "back propagation" training algorithm.

The transformation is performed using the mapping information obtained in the learning stage. After training, LPC cepstrum coefficients of the source are brought to the input of the neural network, producing at the output of the neural net the target codeword index, which indirectly represents the LPC cepstrum coefficients of the target. A three-layer perceptron with two hidden layers was used in the system. Such a network can form any desired decision region, and it can emulate any traditional deterministic classifier.
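A minimal back-propagation network of this kind might look as follows; the layer sizes, sigmoid units, and learning rate are our assumptions, since the paper specifies only a perceptron with two hidden layers trained by back propagation.

```python
# Minimal back-propagation perceptron in the spirit of Section V: a source
# LPC cepstrum vector in, a target codeword index out.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CepstrumNet:
    def __init__(self, sizes, lr=0.1, seed=0):
        # e.g. sizes = [12, 32, 32, 64]: cepstrum dim, two hidden, codewords
        rng = np.random.default_rng(seed)
        self.W = [rng.normal(0.0, 0.1, (m, n))
                  for m, n in zip(sizes, sizes[1:])]
        self.b = [np.zeros(n) for n in sizes[1:]]
        self.lr = lr

    def _forward(self, x):
        acts = [np.asarray(x, dtype=float)]
        for W, b in zip(self.W, self.b):
            acts.append(sigmoid(acts[-1] @ W + b))
        return acts

    def train_step(self, x, codeword_index):
        acts = self._forward(x)
        y = np.zeros_like(self.b[-1])
        y[codeword_index] = 1.0                       # one-hot desired output
        delta = (acts[-1] - y) * acts[-1] * (1.0 - acts[-1])
        for i in reversed(range(len(self.W))):
            if i > 0:  # error for the previous layer, before updating W[i]
                prev = (delta @ self.W[i].T) * acts[i] * (1.0 - acts[i])
            self.W[i] -= self.lr * np.outer(acts[i], delta)
            self.b[i] -= self.lr * delta
            if i > 0:
                delta = prev

    def predict(self, x):
        """Best-matched target codeword index for a source cepstrum vector."""
        return int(self._forward(x)[-1].argmax())
```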

VI. CONCLUSIONS

The system was tested on a number of male and female speakers. One group of female and male voices (sources) was transformed into another group of voices (targets). Experimental results demonstrated that there was almost no difference between the target voice generated by the voice transformation system and the target voice output from the LPC vector quantization vocoder, which was used as a reference.

REFERENCES

1. Lippmann, R. P. An introduction to computing with neural nets. IEEE ASSP Magazine, April 1987, pp. 4-22.

2. Gonzalez, R. C., and Wintz, P. Digital Image Processing, Chap. 4. Addison-Wesley, Reading, MA, 1977.

3. Nam, I.-H. Voice Personality Transformation. RPI Internal Publication, June 1989.

4. Savic, M., and Nam, I.-H. A system for voice personality transformation. Proceedings of SPEECHTECH-90, April 1990, pp. 118-122.

5. Dubnowski, J. J., Schafer, R. W., and Rabiner, L. R. Real-time digital hardware pitch detector. IEEE Trans. Acoust. Speech Signal Process. ASSP-24, 1 (Feb. 1976), 2-8.

IL-HYUN NAM was born in Seoul, Korea, on August 19, 1957. He received his B.S. degree in electronics engineering from Yonsei University, Seoul, in 1982 and his M.S. degree in electrical engineering from Case Western Reserve University, Cleveland, in 1985. He is currently a graduate student pursuing his Ph.D. degree in the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute, Troy, New York.

MICHAEL SAVIC received the Dipl. Ing. and Eng. Sc.D. degrees in electrical engineering from the School of Electrical Engineering, Polytechnic Institute (TVS), of the University of Belgrade in 1955 and 1965. Since 1982 he has been with Rensselaer Polytechnic Institute, Troy, New York, in the Electrical, Computer, and Systems Engineering Department. Besides teaching, Dr. Savic leads research in the areas of speaker verification, speech recognition, voice character transformation, language identification, and signal recognition. Dr. Savic is a senior member of the IEEE Signal Processing Society.