Journal of Phonetics (1993) 21, 157-162
Comments on the WRAPSA model of speech perception development
Kari Suomi
Department of Logopedics and Phonetics, University of Oulu, P.O.B. 191, 90101 Oulu, Finland
Jusczyk's WRAPSA is a very attractive model of the development of speech perception from earliest infancy to the mature, language-dependent adult system. Within the bounds set by its wide empirical basis, WRAPSA offers a number of novel suggestions concerning the component processes that are likely to underlie speech perception. I am particularly sympathetic to the assumptions that the infant's knowledge of the sound structure of the native language evolves in a way which is most efficient for handling the demands of on-line word recognition from fluent speech and that, rather than being a priori goals intentionally aimed at, the language-specific properties and representations emerge as a consequence of success in pursuing various communicative goals, such as trying to learn new words. From these background assumptions, and from the findings of his own research, Jusczyk has successfully constructed an insightful and plausible account of how the process of acquiring a given language selectively tunes the listener's developing perceptual system to the structural-phonetic properties of that language. In the following, I will comment on some minor problematic areas of the model.
For Jusczyk, the central task of a developing speech perception system, viewed from the perspective of establishing an effective means of communication, is that it should allow the listener to recognize words (but he notes in passing that, in addition, the speech perception system is expected to provide the listener with information that might help in identifying the talker, the talker's dialect, his or her emotional state, and whether some unfamiliar utterance is likely to belong to the native language or a foreign one). Jusczyk claims (p. 9) that because the auditory analyzers process all types of acoustic signals, they provide a description of the input that is far richer and more detailed than what is needed for the purpose of speech perception.
The most efficient coding of the input would be one which preserves only the information critical to making meaningful distinctions in the language. In other words, from the point of view of efficiency, it pays to focus attention to those properties that are relevant for distinguishing among words in the native language.
I see two potential problems here. One concerns the question of what exactly is "critical" information, and what this implies for recognition by a mature listener. If preserving only the critical information entails excessive loss of detail present in the description provided by the auditory analyzers, then the resultant, highly reduced coding might well fulfill its task in optimally favourable communicative situations, but it is less likely to constitute the efficient and noise-resistant basis for recognition that speech communication obviously has available under adverse circumstances. Fast and efficient recognition may involve not only selecting among alternative lexical interpretations of the input on the basis of information that is sufficient to distinguish those interpretations from each other; it may also require positively identifying the incoming word candidate as a token of a particular lexical item on the basis of characteristic and abundantly available information pointing to exactly that item. That is, the listener may recognize a particular word not only because the input does not sound like any other lexical item, but also because the input sounds very much like a particular item; in view of the speeds at which words in a discourse are usually recognized, it simply cannot be the case that an exhaustive search through the lexicon is performed in every act of recognition. And is a "critical" coding sufficient to provide the listener with the additional information mentioned above (the talker's identity, dialect and emotional state, and whether an utterance is native or foreign)? For these reasons it might be preferable to maintain that it pays to focus attention on, or at least have available for eventual exploitation, all systematic, language-specific properties of words.
0095-4470/93/01157 + 06 $08.00/0 © 1993 Academic Press Limited
The second problem has to do with the developmental perspective of WRAPSA. Referring to the finding by Charles-Luce & Luce (1990) that the average seven-year-old's vocabulary contains a much lower proportion of highly confusable items than does that of an average adult, Jusczyk concludes that when the number of items in the lexicon is small, a much less than complete description of their sound properties will suffice to distinguish the items, and that only as more words are learned will the processes specifying the representation be refined.
Even if this were sufficient for the purposes of word recognition by the child under favourable circumstances (see above), WRAPSA completely overlooks another important task of a developing speech perception system, namely that it should furnish the similarly developing speech production system with detailed information on the sound pattern of the ambient language. If indeed the language-specific sound properties emerge as a consequence of the child's pursuit of various communicative goals as a listener, then acquisition of the idiomatic pronunciation of the ambient language should utilize these same activities, rather than requiring some extra mechanism not involved in communication-oriented perception (and WRAPSA makes no provision for such a mechanism). I cannot refer to any empirical data that would be directly relevant, but my informal impression is that the pronunciation of seven-year-olds is quite idiomatic in many respects. The problem is that distinctiveness among items in the lexicon of, say, a seven-year-old speaker-hearer, which is the upper limit on detailedness of perceptual representations postulated in Jusczyk's system of word recognition, seems to be too low a level of accuracy to account for the degree of detailedness in the speaker-hearer's own pronunciation, especially in view of the common assumption that production abilities lag behind those of perception. Of course, the reality of this problem also hinges on how reduced a "critical" coding is in comparison to the full auditory description.
Jusczyk's claim that perceptual representations are structured in terms of syllable-sized units is based on the results of a number of studies that he and his colleagues have carried out with infants. The evidence does not convince me. I agree with the interpretation that the results of the investigations by Jusczyk & Derrah (1987) and Bertoncini, Bijeljac-Babic, Jusczyk, Kennedy & Mehler (1988) rule out phonetic segments as the structural units.
In these experiments, infants were familiarized with a series of syllables that included a common phonetic segment, e.g.,
[bi], [ba], [bo], [bɝ]. To the set a new item was then added that either shared ([bu]) or did not share ([du]) the common segment. There was no indication that infants perceived the new item with the common segment as being more similar to the familiar items. Another study that Jusczyk reports investigated whether the presence of a common phonetic segment could enhance memory for a series of syllables. Infants were exposed to a set of syllables that either shared (e.g., [bi], [ba], [bu]) or did not share (e.g., [si], [ba], [tu]) a common phonetic segment. After thorough familiarization with the set, followed by a two-minute period of silence, testing resumed with either the original set of syllables or an altered set in which one of the original syllables had been changed for a new one (in the above sets, [da] was substituted for [ba]). Infants who had heard the sets with the common segments performed no better than those who heard the sets without common segments.
Notice that in all experiments so far reviewed the stimuli constitute syllables. In Jusczyk, Kennedy, Jusczyk, Koenig & Schomberg (in preparation) one group of infants heard bisyllabic stimuli that contained a common syllable (e.g., [ba'si], [ba'lo], [ba'mɪt]), and another group heard bisyllabic stimuli that did not contain a common syllable (e.g., [pæ'zi], [nɛ'lo], [ko'mɪt]). Only the former group detected changes to the stimulus set after a delay period following initial familiarization. Jusczyk concludes that "Evidently, the presence of the common syllable in the set facilitated their encoding of the items for later recall" (p. 18). In a further check infants were exposed to familiarization sets that contained two common phonetic segments ([b] and [a]) but in different syllables (e.g., [za'bi], [la'bo], [ma'bɪt]). This time infants did not detect changes to the familiarization set.
"The upshot of all of this is that although there is evidence to suggest that infants' representations are structured in terms of syllabic segments, a comparable case cannot be mustered for phonetic segments," Jusczyk concludes (p. 18). But the conclusion concerning syllables does not necessarily follow. Another, cheaper alternative is that infants represent stimuli in terms of auditory properties. Thus in all of those instances in the experiments under discussion in which infants failed either to perceive syllables containing the same phonetic segment as more similar than ones containing different segments, or to detect changes to a familiarization set after a delay period, all the stimuli used were auditorily dissimilar. This holds, for example, for the sets with stimuli [bi], [ba] plus [bo], [bɝ] or [bu] (which, although they share a common initial phonetic segment for the adult, have different acoustic/auditory onsets), for [si], [ba], [tu], for [pæ'zi], [nɛ'lo], [ko'mɪt] as well as for [za'bi], [la'bo], [ma'bɪt]. In contrast, in the one instance in which infants evidently extracted a commonality among the members of the familiarization set (e.g., [ba'si], [ba'lo], [ba'mɪt]), the members are likely to share auditorily identical (or at least highly similar) onsets. Perhaps it is the presence of the common auditory onset in the set that facilitated the infants' encoding of the items for later recall? To determine whether the encoding is in terms of a common initial syllable or in terms of an identical auditory onset, infants could be profitably tested on stimuli like [bat'si], [bas'lo], [bal'mɪt] that share an identical auditory onset yet do not have a common initial syllable. If Jusczyk's claim is taken strictly, to refer to syllables proper and not just any syllable-like stretches, the claim predicts that infants will fail to detect a change to such a familiarization set (because there is no common syllable in the set to facilitate their memorization).
But from the claim that the encoding is based on auditory properties only, the prediction follows that a change involving a
different onset will be detected (because the members of the familiarization set share a common onset, as do [ba'si], [ba'lo], [ba'mɪt]). For the time being, the auditory explanation of the above findings is as good as the syllabic one.
A very important task in the recognition of words from fluent speech that any model of this activity must address is detection of word boundaries in the signal to enable extraction of word-size candidates for comparison with representations stored in the lexicon. While this part of my own model has been much inspired by an earlier version of Jusczyk's model (Jusczyk, 1986), it differs from the present version, WRAPSA, in a number of respects. According to my proposal, the infant/child continually matches the input against auditory prototypes of familiar words in the lexicon. After first learning to recognize some words in isolation, the child then begins to segment fluent speech by breaking the input into known and unknown elements, and as the vocabulary grows, detection of boundaries of known words helps in establishing the boundaries of the unknown elements. Jusczyk singles out three problems associated with this scenario: (1) the acoustic characteristics of a word spoken in isolation are often very different from those it has in context, (2) some stretches of sound that form words that may be heard in isolation can also appear as syllables in larger words (can occurs in candle, cannibal, candidate, toucan, Kansas etc.), (3) there are many items that frequently occur in the input that will seldom, if ever, occur as isolated words presented to the infant (e.g., the, did, of). I will respond briefly to these criticisms. As for problem (1), I readily admit that Jusczyk's multiple trace system seems a far more plausible solution than the usual types of system (mine included) with singleton lexical representations plus, inevitably, some sort of (usually ill-defined) normalization.
Problem (2) is probably more serious for models that postulate processing of discrete perceptual-phonetic subunits of words during recognition than it is for models like mine that do not; according to my model lexical items are represented, for recognition purposes, by holistic auditory prototypes, with no internal phonetic segmentation, and thus there is no need to maintain that the same syllable is recognized in many different words. Even so, it is true that short words are often acoustically/auditorily very much like parts of larger words, and on the basis of this circumstance alone false identifications are possible. But, contrary to what Jusczyk insists, the listener need not end up misparsing the input. This is because word recognition is not just a matter of bottom-up auditory matching, but is very much under top-down contextual and cognitive control, which more or less severely constrains lexical interpretation of the input. Problem (3): function words of this kind, which have low or no semantic content, indeed hardly ever occur in isolation, and in context they are typically highly reduced phonetically. And just as my model would predict, they also appear in the speech of children much later than the so-called content words, which prompts linguists to characterize the speech of young children as telegraphic speech (see, e.g., Fromkin & Rodman, 1988, pp. 373-375).
As an alternative to the sort of word extraction system just discussed, Jusczyk proposes that infants might rely on cues present in the prosody, especially in word stress patterns, and he presents evidence that shows the sensitivity of American English infants to the strong-weak stress pattern that tends to predominate in many forms of English. But while word stress patterns and other prosodic properties may well turn out to be important cues to the location of word boundaries, they will hardly by themselves provide the full solution. I anticipate this because, after all,
many different word stress patterns do occur in English in addition to the strong-weak pattern, because word stresses (also in languages with fixed word stress) often get deleted when sequences of words form so-called prosodic words with only one peak of prominence, because function words are hardly ever stressed in context, etc. I welcome the inclusion of prosodic cues amongst the arsenal of word boundary signals, but rather than relying on a single class of cues, young listeners especially may have to resort to a catch-as-catch-can strategy. In some cases the successful recognition of the preceding familiar word may unequivocally point to the beginning of the next word candidate, in others the stress pattern may disambiguate, in still others only top-down expectations and constraints may enable the correct parsing.
Jusczyk mentions that another suggested basis for segmenting fluent speech into words is to use information about allophonic constraints; knowing the contexts in which various allophones can occur could thus provide clues to syllable and word boundaries. Jusczyk notes that, to implement such a strategy for word segmentation, one has to distinguish one allophone from the other, one has to have some means of generalizing the contexts for the allophones from regularities that hold among words already stored in the lexicon, and, for the latter to be possible, at least some of the lexical items must be segmented from speech before allophonic constraints can come into play. These are real and big problems for models of recognition that involve a stage with a segmental phonetic representation of the input; usually these models postulate a final phonemic representation for contact with similarly organized lexical representations, so that an allophones-to-phonemes conversion is also needed.
These very problems, and the learnability of such a system, were among the major incentives for my own model building, and I fully agree with Jusczyk that "allophonic constraints are not likely to be the primary means by which infants come to segment fluent speech into words" (p. 16). But this does not mean that my model does not make use of the regularities regarded, from one perspective, as contextual allophonic variation: on the contrary, the word prototypes contain all auditory properties regularly accompanying the pronunciation of a word, only the prototypes are not segmented into discrete subunits. For instance, for American English, the prototype for bitter would contain the auditory consequences of opening a bilabial occlusion followed by a brief voice lag followed by a certain kind of voiced vocalic portion followed by a voiced retroflexed alveolar flap followed by another vocalic portion, whereas, e.g., tattoo would contain, in addition to its vocalic parts, the consequences of two yet different alveolar occlusions and accompanying laryngeal gestures. But unless I have seriously misunderstood the model, WRAPSA also captures this kind of word-internal regularity, although Jusczyk does not seem to be aware of it. In the multiple trace system of WRAPSA the listener stores away representations or traces of individual exemplars of words actually encountered in recognition experience. But since idiomatically spoken individual exemplars contain the sort of "allophonic" regularities just discussed, so do the stored traces, but without the listener operating with anything like phonemes and their allophones either in storing the traces or in processing an input to be recognized.
Given the other arguments offered by Jusczyk in favour of the multiple trace system over one involving singleton prototypes, I conclude that WRAPSA seems to offer a very powerful and promising solution to the problem of how the listener manages to recognize words in spite of extensive token variability across speakers and contexts.
References
Bertoncini, J., Bijeljac-Babic, R., Jusczyk, P., Kennedy, L. & Mehler, J. (1988) An investigation of young infants' perceptual representations of speech sounds, Journal of Experimental Psychology: General, 117, 21-33.
Charles-Luce, J. & Luce, P. (1990) Similarity neighborhoods of words in young children's lexicons, Journal of Child Language, 17, 205-215.
Fromkin, V. & Rodman, R. (1988) An introduction to language, fourth edition. New York: Holt, Rinehart & Winston.
Jusczyk, P. (1986) Toward a model of the development of speech perception. In Invariance and variability in speech processes (J. Perkell & D. Klatt, editors), pp. 1-19. Hillsdale, NJ: Lawrence Erlbaum Associates.
Jusczyk, P. & Derrah, C. (1987) Representation of speech sounds by young infants, Developmental Psychology, 23, 648-654.
Jusczyk, P., Kennedy, L., Jusczyk, A., Koenig, N. & Schomberg, T. (in preparation) An investigation of the infant's representation of information in bisyllabic utterances.