Brain and Language 86 (2003) 83–98
Connectionist modelling of the separable processing of consonants and vowels

Padraic Monaghan a,* and Richard Shillcock a,b

a Institute for Adaptive and Neural Computation, Division of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK
b Department of Psychology, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK

Accepted 23 July 2002
Abstract

Caramazza, Chialant, Capasso, and Miceli (2000) describe two aphasic patients, with impaired processing of vowels and consonants, respectively. The impairments could not be captured according to the sonority hierarchy or in terms of a feature-level analysis. Caramazza et al. claim that this dissociation demonstrates separate representation of the categories of vowels and consonants in speech processing. We present two connectionist models of the management of phonological representations. The models spontaneously develop separable processing of vowels and consonants. The models have two hidden layers and are given as input vowels and consonants represented in terms of their phonological distinctive features. The first model is presented with feature bundles one at a time and the hidden layers have to combine their output to reproduce a unified copy of the feature bundle. In the second model a "fine-coded" layer receives information about feature bundles in isolation, and a "coarse-coded" layer receives information about each feature bundle in the context of the prior and subsequent feature bundle. Coarse-coding facilitated processing of vowels, and fine-coding processing of consonants. These models show that separable processing of vowels and consonants is an emergent effect of modular processors operating on feature-based representations. We argue that it is not necessary to postulate an independent level of representation for the consonant/vowel distinction, separate from phonological distinctive features.
© 2003 Elsevier Science (USA). All rights reserved.

Keywords: Aphasia; Connectionist modelling; Hemispheric processing; Dissociations
This research was supported by Wellcome Trust grant 059080. Thanks to Laurie Stowe and two anonymous referees for helpful comments on an earlier draft of this paper. Corresponding author: P. Monaghan, [email protected].

1. Introduction

In this paper, we show that in a distributed representational system, such as the cortex, categorical distinctions at the phoneme level can be accounted for by the interactions between phonological distinctive features at a lower level of representation.
Distinguishing a higher, phonemic level above the level of phonological features may be descriptively useful and convenient, but we will show in terms of cognitive processing that modularisation at the lower level of representation can result in distinctions at the higher level without requiring that higher level as an explanatory concept. We concentrate on data from aphasics, challenging the conclusion that data from such patients indicate that vowels and consonants are represented separately in speech processing. We present implemented models of the management of phonological representations; these models are intended to be minimal models of phonological processing and are ambiguous between speech production and perception. They employ only phonological features at input and output, and show that the distinction between consonants and vowels is respected in the interaction between phonological distinctive features in distributed systems.
2. Consonants and vowels in aphasic speech

Aphasic speech often contains many errors in the processing of consonants, whereas errors on vowels are rarer. Blumstein (1978) noted that, in a group of 17 Broca's, conduction, and Wernicke's aphasic patients, errors on vowels were very infrequent, whereas consonant errors were frequent. Canter, Trost, and Burns (1985) provide detailed data on the incidence of errors on consonants and vowels from 10 Broca's aphasics and 10 paraphasic patients (five Wernicke's aphasics and five conduction aphasics). Patients were presented with 130 monosyllabic words in picture-naming and repetition tasks, and speech errors were recorded. Both groups made errors on vowels and consonants, with approximately two-thirds of the errors being substitutions. For vowels, error rates were approximately 14% for the paraphasics and 10% for the Broca's aphasics. For consonant singletons, error rates were approximately 25% for both groups. For consonant clusters, error rates were 35% for paraphasics and more than 50% for Broca's aphasics. Errors were made on vowels, but were generally fewer than for consonants. Similarly, Beland, Caplan, and Nespoulous (1990) examined the speech errors in a 55-year-old conduction aphasic, RL, who made 8% errors on consonants and 3% errors on vowels when reading and repeating words.

The distinction between consonant and vowel processing is general for both production and perception tasks. Boatman, Hall, Goldstein, Lesser, and Gordon (1997) used direct cortical electrical interference from electrodes implanted in the left superior temporal gyrus of patients awaiting surgical treatment for intractable partial complex seizures. Stimulating the cortex in this way results in a temporary "lesion" to a small region around the electrode. They found that introducing interference from one electrode resulted in impaired perception of consonants; the position of the electrode varied for each patient. However, little impairment of vowels was found at any electrode position.

Monoi, Fukusako, Itoh, and Sasanuma (1983) examined the speech errors of three Broca's aphasics and three conduction aphasics, and found equally large incidences of vowel errors and consonant errors by the conduction aphasics in picture-naming and repetition tasks. Of the errors produced by the conduction aphasics, errors on vowels were in the range 37.7–45.7% and errors on consonants were in the range 50.4–59.3%. For the Broca's aphasics, most errors were on consonants (65–96%) and only a few errors were on vowels (4–9%). Most errors of both types were substitutions and many of the remainder were transpositions, in line with earlier studies of vowel substitutions in Broca's aphasics (Keller, 1978).
Monoi et al.'s study shows that vowel errors may be frequent, yet cases are few where there is greater impairment of vowels than of consonants in aphasia. Romani, Grana, and Semenza (1996) present as "unusual" the case of a patient with greater impairment of vowels than consonants. MM, a right-handed 65-year-old, suffered a left temporoparietal lesion and was diagnosed as a conduction aphasic, with good comprehension and fluent speech. Of the substitution errors she made in speech, 64% were replacements of vowels and 36% replacements of consonants. Though MM made more deletions, insertions, and transpositions of consonants than vowels, overall the number of phonological errors on vowels was greater than on consonants (56% compared with 44%).

Caramazza et al. (2000) report a patient AS with damage to the left parietal and temporal lobes, and the right parietal lobe. When repeating words, AS made many substitution errors (65.51% of the errors made), and of these substitution errors, she made three times as many errors on vowels as on consonants. In contrast, patient IFA, with damage to the left supramarginal, angular, and superior temporal gyri, made 58.37% substitution errors, of which nearly five times as many were on consonants as on vowels. Furthermore, the substitutions of AS and IFA did not relate to sonority. AS made as many errors on consonants with high sonority (such as /l/) as on consonants with low sonority (such as /t/). IFA also made as many errors on high sonority consonants as on low sonority consonants, though she made more errors across all consonant types than AS.

Caramazza et al. (2000) interpret these data to suggest that there is not a continuum of processing from consonants to vowels, but that there is a categorical distinction between the two. Caramazza et al. reject explanations in terms of phonemes being represented as bundles of features. The dorsal features high, low, and back discriminate vowels, and also discriminate certain consonants (/k/, /g/, /l/, /r/). If the representation of these features was damaged, then there ought to be more impairment of the relevant consonants. However, AS and IFA did not demonstrate greater impairment of these consonants. For AS, the substitution error rate for consonants discriminated by dorsal features was 7.9% and for other consonants the substitution error rate was 11.7%. For IFA, the substitution error rates were 27.5% and 28.3%, respectively.

The literature contains several proposed explanations for the greater incidence of consonant damage and for the dissociation between vowel and consonant processing:
1. Consonants require faster and more complex articulatory movements than vowels. Thus, consonants are more often impaired due to their greater articulatory difficulty.
2. Vowels and consonants are on different representational levels (Goldsmith, 1990), accounting for observations that vowels seem to behave as a sequence independent of the consonants that alternate with them.
3. The rhythmical movement of the jaw requires different representation of vowels and consonants (Fowler, 1983; Redford, Chen, & Miikkulainen, 2001).
4. Vowels have a greater frequency of occurrence than consonants and may have higher activation levels. When activity is reduced in the phonological processing system, consonants will be impaired over vowels, but when noise is increased in the system, the higher resting activity of vowels may be sufficient to induce substitutions in vowels over consonants.
This model accounts for the greater number of omissions of consonants than vowels by Romani et al.'s (1996) patient MM, accompanied by the greater number of substitutions of vowels than consonants by the same patient.

The key aphasia data we wish to model are:
1. The greater vulnerability of consonants to impairment.
2. The apparent double dissociation between post-lesion impairment to consonants and vowels.

However, we wish to model these data without assuming the different processing levels embodied in theories 1–3 above. If a dissociation between vowel and consonant processing emerged from a model processing only phonological features, then the additional representational level assumed in the above accounts would not be necessary. Theory 4 is incorporated into our modelling, as we utilise phoneme frequencies in our simulations. However, Theory 4 cannot wholly account for dissociations between vowel and consonant processing; it can only contribute to greater impairment of consonants over vowels. We will show that apparently modular processing of vowels and consonants emerges from a neural network learning to represent phonemes in terms of phonological distinctive features. In the next section, we review approaches to modularisation of processing as an emergent effect in neural networks.
3. Modelling emergent modularisation

If a computational system is required to solve two independent tasks, then modularity is likely to emerge (Bullinaria, 2001). Kosslyn, Chabris, Marsolek, and Koenig (1992) produced a neural network model of a spatial processing task; different resources in the model spontaneously developed a specialisation either for the processing of categorical information (the shape of the stimulus) or co-ordinate information (the position of the stimulus). Jacobs and Kosslyn (1994) adapted this approach to show that the categorical versus co-ordinate distinction emerged from processors that take information from a wide input field (coarse-coding units) or a narrow input field (fine-coding units), respectively.

An alternative approach for developing separable processing is to include a division of resources within the model. Shkuro, Glezer, and Reggia (2000) investigated neural network models of single-word reading with separate hidden layers representing the two hemispheres of the brain. They encouraged the lateralisation of language processing to the left hemisphere of the model by increasing the left hidden layer's size and its speed of learning. However, they did not investigate separable processing within these hidden layers. Lesioning the left half of the model resulted in greater impairment to phonological processing, whereas damage to the right half of the model did not greatly impair such processing.

The feature-based representations of phonemes we employed in our modelling are taken from Harm and Seidenberg's (1999) connectionist model of reading. They used an 11-dimensional vector for each phoneme in English, where each dimension represents a manner or place feature such as voiced, nasal, or labial, as developed in the SPE tradition in formal phonology (Chomsky & Halle, 1968). (Note that such features are treated as abstract entities, intermediate between perception and production.) In our modelling, we used feature bundles to represent 32 phonemes, of which 8 were vowels and 24 were consonants. The phonemes and their feature representations are shown in Table 1. We performed a two-dimensional principal components analysis on the vectors and found that vowels and consonants were linearly separable in the resulting two-dimensional space, which accounted for 64% of the variance. This analysis shows that it is possible to distinguish the processing of vowels and consonants in terms of their feature-based representations, but it remained to be seen whether a divided information processing system could capitalise on this distinction and demonstrate dissociable processing of vowels and consonants.
Table 1
Phoneme representations in terms of the 11 phonological features (from Harm & Seidenberg, 1999). Columns give, in order: sonorant, consonantal, voice, nasal, degree, labial, palatal, pharyngeal, round, tongue, radical; the final column gives an example word.

Phoneme  Son   Cons  Voice  Nasal  Degree  Labial  Palatal  Phar  Round  Tongue  Radical  Example
/p/      -1     1    -1     -1      1       1       0       -1     1      0       0       pink
/b/      -1     1     0     -1      1       1       0       -1     1      0       0       bike
/t/      -1     1    -1     -1      1      -1       1       -1    -1      1       0       tone
/d/      -1     1     0     -1      1      -1       1       -1    -1      1       0       dust
/k/      -1     1    -1     -1      1      -1      -1       -1    -1     -1       0       kid
/g/      -1     1     0     -1      1      -1      -1       -1    -1     -1       0       gale
/f/      -0.5   1    -1     -1      0      -1       1       -1     1      0       0       far
/v/      -0.5   1     0     -1      0      -1       1       -1     1      0       0       vote
/th/     -0.5   1    -1     -1      0      -1       1       -1    -1      0       0       think
/dh/     -0.5   1     0     -1      0      -1       1       -1    -1      0       0       this
/s/      -0.5   1    -1     -1      0      -1       1       -1    -1      1       0       sip
/z/      -0.5   1     0     -1      0      -1       1       -1    -1      1       0       zone
/h/      -0.5   1     0     -1      0      -1      -1        1    -1     -1      -1       hall
/sh/     -0.5   1    -1     -1      0      -1       0       -1    -1      0       0       ship
/zh/     -0.5   1     0     -1      0      -1       0       -1    -1      0       0       beige
/tch/    -0.8   1    -1     -1      1      -1       0       -1    -1      0       0       chink
/dzh/    -0.8   1     0     -1      1      -1       0       -1    -1      0       0       jug
/m/       0     0     1      1      1       1       0       -1     1      0       0       mop
/n/       0     0     1      1      1      -1       1       -1    -1      1       0       nit
/ng/      0     0     1      1      1      -1      -1       -1    -1     -1       0       ring
/r/       0.5   0     1      0     -1      -1      -1        1     1     -1      -1       rope
/l/       0.5   0     1      0     -1      -1       1       -1    -1      1       0       lair
/w/       0.8   0     1      0      0       1      -1       -1     1     -1       0       wave
/j/       0.8   0     1      0      0      -1       0       -1    -1      0       1       yolk
/i/       1    -1     1      0      0      -1       0       -1    -1      0       1       seat
/I/       1    -1     1      0      0      -1       0       -1    -1      0      -1       pit
/e/       1    -1     1      0     -1      -1       0       -1    -1     -1      -1       pen
/ae/      1    -1     1      0     -1      -1       0        1    -1     -1       1       fat
/a/       1    -1     1      0     -1      -1      -1        1    -1     -1      -1       hot
/uu/      1    -1     1      0      0      -1      -1       -1     1      0      -1       look
/u/       1    -1     1      0     -1      -1      -1       -1    -1     -1      -1       tug
/@/       1    -1     1      0     -1      -1       0        1    -1     -1       1       bird
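To illustrate the principal components analysis reported above, the following is a minimal Python/NumPy sketch (not the authors' original analysis code) that projects a feature matrix like Table 1 onto its first two principal components and checks whether vowels and consonants can be divided by a straight line in that space; FEATURES and IS_VOWEL are assumed arrays holding the 32 x 11 values and the vowel mask from the table.

import numpy as np

# Assumed inputs: FEATURES is the 32 x 11 matrix of Table 1 values (rows in table order,
# consonants first, vowels last); IS_VOWEL marks the eight vowel rows.
FEATURES = np.loadtxt("table1_features.txt")        # hypothetical file holding the values
IS_VOWEL = np.array([False] * 24 + [True] * 8)

# Two-component PCA via the singular value decomposition of the centred matrix.
centred = FEATURES - FEATURES.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ vt[:2].T                          # 32 x 2 projection onto the first two PCs
explained = (s[:2] ** 2).sum() / (s ** 2).sum()      # proportion of variance in two components
print(f"Variance accounted for by two components: {explained:.2f}")

# Linear separability check in the two-dimensional space: a simple perceptron converges
# to zero errors only if the two classes can be divided by a straight line.
w, b = np.zeros(2), 0.0
targets = np.where(IS_VOWEL, 1.0, -1.0)
for _ in range(1000):
    errors = 0
    for x, t in zip(scores, targets):
        if t * (x @ w + b) <= 0:
            w, b = w + t * x, b + t
            errors += 1
    if errors == 0:
        break
print("Vowels and consonants linearly separable in the 2-D space:", errors == 0)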
Supervised neural networks have input and output patterns specified, and then a general learning algorithm maps inputs onto outputs. Architectural constraints can be built into the models in order to assess their influence on the solution of tasks (see, for example, Shillcock & Monaghan, 2001). We implemented two separate hidden layers in the models presented below and examined the contribution of each hidden layer to learning an identity mapping between bundles of features at input and output. We hypothesised that separable processing of vowels and consonants would occur purely as a consequence of the distribution of feature bundles in the 11-dimensional feature space interacting with a parallel distributed processing model. The first simulation tested the extent to which separable processing of vowels and consonants is an emergent effect of a divided processor learning to represent bundles of features and with minimal assumptions regarding the nature of the task and the nature of the input.
4. Simulation 1

4.1. Architecture

The model consisted of an input layer at which the 11-dimensional feature bundles were presented one at a time. The model is shown in Fig. 1. The input layer was fully connected to two hidden layers each containing six nodes. Six was the smallest number of units with which the model reliably solved the task. Note that in Fig. 1 there is no architectural distinction between the two hidden layers, only an apparent one in the way that the diagram is drawn: there is full connectivity between each hidden layer and the input layer and the output layer, meaning that each and every individual hidden unit is connected to every input node and every output node, as would be the case if there was just a single hidden layer. To encourage the hidden units to separate into two populations, and to provide a minimal scope for the modularisation of processing, we gave closely related but separate tasks to the two subpopulations of hidden units.
Fig. 1. The model of phoneme to phoneme mapping. Noisy versions of the 11-dimensional feature vector of each phoneme are sent to each hidden layer. The hidden layers are fully connected to the output units and have to co-ordinate their activations in order to reproduce a cleaned-up version of the phoneme at the output.
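The following is a minimal Python/NumPy sketch of a network of the kind shown in Fig. 1, under the assumptions described in this section and the training details given in the text below (independent mean-0, variance-.5 noise to each hidden layer, six hidden units per layer, a .01 learning rate, and frequency-weighted sampling of the feature bundles). It is an illustrative reconstruction rather than the authors' implementation; FEATURES and FREQS are hypothetical arrays holding the Table 1 values and corpus-derived phoneme probabilities.

import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 11           # phonological features per bundle (Table 1)
N_HIDDEN = 6              # units in each of the two hidden layers
NOISE_SD = np.sqrt(0.5)   # noise with mean 0 and variance .5
LEARNING_RATE = 0.01
ERROR_TOLERANCE = 0.1     # no error is propagated for outputs within .1 of the target

# Each hidden layer receives its own noisy copy of the input; both project to the output.
w_in = [rng.normal(0.0, np.sqrt(0.1), (N_FEATURES, N_HIDDEN)) for _ in range(2)]
w_out = [rng.normal(0.0, np.sqrt(0.1), (N_HIDDEN, N_FEATURES)) for _ in range(2)]

def forward(bundle):
    """Present one feature bundle; each hidden layer gets an independently noised copy."""
    noisy = [bundle + rng.normal(0.0, NOISE_SD, N_FEATURES) for _ in range(2)]
    hidden = [np.tanh(noisy[i] @ w_in[i]) for i in range(2)]
    output = np.tanh(hidden[0] @ w_out[0] + hidden[1] @ w_out[1])
    return noisy, hidden, output

def train_step(bundle):
    """One backpropagation update of the identity (clean-up) mapping."""
    noisy, hidden, output = forward(bundle)
    err = bundle - output
    err[np.abs(err) < ERROR_TOLERANCE] = 0.0             # error tolerance of .1
    delta_out = err * (1.0 - output ** 2)                # derivative of the tanh output units
    for i in range(2):
        delta_hid = (delta_out @ w_out[i].T) * (1.0 - hidden[i] ** 2)
        w_out[i] += LEARNING_RATE * np.outer(hidden[i], delta_out)
        w_in[i] += LEARNING_RATE * np.outer(noisy[i], delta_hid)

# FEATURES: assumed 32 x 11 matrix of Table 1 values; FREQS: assumed phoneme probabilities
# derived from the phonologically transcribed London-Lund Corpus (uniform placeholder here).
FEATURES = np.loadtxt("table1_features.txt")
FREQS = np.full(32, 1 / 32)

for step in range(500_000):
    train_step(FEATURES[rng.choice(32, p=FREQS)])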
We added different random noise to the input of the two sets, thus encouraging coherent behaviour within each subpopulation: any two nodes allocated to the left side of the model would receive the same noise on their feature bundle inputs. The model's task was to clean up the activations of the noisily presented features and to reproduce the 11-bit feature vector at the output units. This task made the feature-to-feature mapping a more complex computational task. Noise was added to the activations of the individual features in a feature bundle with mean 0 and variance .5. Different randomisations (but with the same mean and variance) of noise were added to the feature bundles projecting to the two hidden layers. It was therefore adaptive for the model to use the input to both hidden layers in order to compute the overlap between the noisy feature bundles. The hidden layers were fully connected to the output layer, where the 11-bit feature vector had to be accurately reproduced, free from noise. Therefore, the two hidden layers processed different, but overlapping, noisy versions of each feature bundle in order to reproduce it accurately at the output.

Error tolerance was set at .1, to avoid large weights developing on the connections, and a small learning rate (.01) was used so that feature bundles presented early in the training would not have an inordinately large effect on the trajectory of learning. The connections were given small starting weights (mean 0, variance .1) and trained with the standard backpropagation algorithm. Feature bundles were presented randomly according to phoneme frequency in a corpus created by replacing the orthographic words of the London-Lund Corpus (Svartvik, 1990) with their citation form phonological transcriptions, as found in the CELEX lexical database (Baayen, Piepenbrock, & Gulikers, 1995). We tested the model after every presentation of 50,000 feature bundles and stopped training when the model reproduced all 32 feature bundles accurately. A feature bundle was judged to have been accurately reproduced when the output of the model was closer to the target feature bundle than any other feature bundle in the 11-dimensional feature space.

After training, we impaired the model in order to judge the extent to which each hidden layer contributed to accurate reproduction of the feature bundles. To simulate cognitive impairment, connections between the input layer and each hidden layer were severed in turn. We trained 10 versions of the model, varying in terms of the random starting weights, the random noise added to the feature bundle inputs, and the random order of presentation of the feature bundles.

4.2. Results

The models learned to reproduce all the feature bundles accurately after 50,000–500,000 feature bundles had been presented. The mean number of feature bundles presented before a model learned correctly was 205,000. Typically, lesioning one hidden layer resulted in more impairment of consonants and lesioning the other resulted in more impairment of vowels. We labelled the two halves of each model post hoc in terms of whether there was greater damage to vowels (the "vowel side") or not (the "non-vowel side"). Note that it was not possible to predict in advance which hidden layer would specialise in vowel processing, as the noise added to each hidden layer's representations was entirely random.
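A sketch of the lesioning and scoring procedure described in the preceding training section, reusing the hypothetical forward, w_in, FEATURES, and IS_VOWEL names from the earlier sketches: a lesion severs one hidden layer's input connections, and a reproduction counts as correct when the output lies closer to the target bundle than to any other bundle in the 11-dimensional space.

def lesion(side):
    """Sever the connections from the input layer to one hidden layer (side 0 or 1)."""
    w_in[side][:] = 0.0

def reproduction_accuracy(feature_matrix):
    """Proportion of bundles whose noisy reproduction is nearest to its own target."""
    correct = 0
    for idx, target in enumerate(feature_matrix):
        _, _, output = forward(target)
        distances = np.linalg.norm(feature_matrix - output, axis=1)
        correct += int(np.argmin(distances) == idx)
    return correct / len(feature_matrix)

# Example: lesion each side in turn and score vowels and consonants separately,
# e.g. reproduction_accuracy(FEATURES[IS_VOWEL]) versus reproduction_accuracy(FEATURES[~IS_VOWEL]).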
Table 2 shows the accurate reproductions of feature bundles by phoneme type following damage to the vowel side or the non-vowel side. By definition, vowels were reproduced less accurately overall following damage to the vowel side. However, consonants were reproduced less accurately following damage to the non-vowel side, even though performance on consonants was not used to label the two sides. This asymmetry was due to greater impairment of the plosives, fricatives, and affricates in the consonant class.
Table 2
Mean percentages of phonemes reproduced (standard deviations in parentheses) for each phoneme type, following damage to the vowel layer or the non-vowel layer in Simulation 1

Phoneme type      Vowel layer lesion    Non-vowel layer lesion
Plosives          55% (14)              37% (13)
Fricatives        52% (15)              20% (9)
Affricates        75% (26)              30% (35)
Nasals            37% (33)              60% (31)
Laterals          25% (26)              35% (41)
Approximants      25% (26)              60% (21)
All consonants    48% (6)               35% (9)
Vowels            31% (17)              66% (15)
Hence, the models tended to allocate resources in one hidden layer to consonants and in the other to vowels. Overall, there was a main effect of lesion side on the accurate reproduction of all feature bundles (F(1,9) = 10.71, p < .05), with more impairment occurring after damage to the non-vowel side. There was no main effect of phoneme type: consonants were not impaired more than vowels (F(1,9) = 2.12, p = .18). However, there was a significant interaction between the side of the model that was damaged and phoneme type (F(1,9) = 55.56, p < .001).

Caramazza et al. (2000) argue against the notion that it is individual features that are impaired in the phonological processing of their patients AS and IFA. We assessed the model's performance on each feature following damage to the vowel or non-vowel side of the model. We found that no single feature was damaged disproportionately, but rather a cluster of features was more impaired following damage to the non-vowel side. Table 3 shows the size of the error in the representation of the individual features following lesions of each side; paired t tests show the significant differences. Following damage to the non-vowel side of the model, damage to four features (sonorant, consonantal, nasal, and labial) was significantly greater, but no feature was disproportionately impaired following damage to the vowel side. Therefore, the processing in the vowel layer is not primarily in terms of individual features. The most important representations in the non-vowel side consist of several features.

Table 3
Accuracy of reproduction of each feature across the 32 phonemes following damage to the vowel or non-vowel layer, respectively, in Simulation 1
Feature        Vowel layer lesion (error)    Non-vowel layer lesion (error)    Paired t test
Sonorant       0.32                          0.46                              4.57*
Consonantal    0.22                          0.45                              5.84**
Voice          0.33                          0.4                               2.64
Nasal          0.28                          0.69                              65.41***
Degree         0.41                          0.31                              -1.97
Labial         0.06                          0.23                              4.84*
Palatal        0.39                          0.46                              1.18
Pharyngeal     0.44                          0.24                              -0.9
Round          0.21                          0.46                              1.48
Tongue         0.45                          0.38                              -0.9
Radical        0.27                          0.44                              2.83

The first column indicates the feature. The second and third columns indicate the mean level of error in reproducing the feature following damage to the vowel and non-vowel layers, respectively. The fourth column indicates the paired t test value (adjusted for multiple comparisons: *p < .05; **p < .01; ***p < .001).
Caramazza et al. (2000) argue that the sonority hierarchy plays no part in the separable processing of vowels and consonants in their patients AS and IFA. However, our feature bundles contain the features consonantal and sonority. Consonantal indicates an obstruction of the air flow and is a feature that separates consonants and vowels. Would the model still distinguish between vowels and consonants if this feature were not present in the feature bundles? Similarly, would the model distinguish between vowels and consonants if sonority were omitted? We have seen from the feature-based analysis of the model's performance that the separation of processing was not due to damage to one side of the model impairing the processing of any one feature disproportionately, but rather impairing sets of features. Nevertheless, in order to determine that the separable processing in the model was not due to these features, we reran the simulations exactly as in the original model but with the omission of consonantal from the feature bundles (Simulation 1a), the omission of sonority (Simulation 1b), and the omission of both consonantal and sonority (Simulation 1c).

Omitting these features predictably meant that the distinction between certain of the feature bundles was lost, such as the pair /i/–/j/, which are distinguished only by consonantal and sonority. This change in the input repertoire changed the task to one of cleaning up slightly fewer different feature bundles composed of fewer features, and made the mapping more difficult for the model to learn, as the distinction between the feature bundles was now smaller. In none of the runs of these simulations did the model learn to distinguish all the feature bundles. We therefore stopped the training of the model after 300,000 patterns had been presented to each model. (At this point in training, all but one of the 10 runs of the original model had learned the original task.) We then tested the model on the feature bundles that it had correctly distinguished at that stage.

Simulation 1a (no consonantal feature) solved the mapping for all of the vowels and a mean of 89% of the consonants across all 10 runs. In most cases, the model failed to distinguish the /i/–/j/ pair and the /dh/–/zh/ pair. After lesioning each side of the model, we found a replication of the original model's results, though with a greater main effect of which side was damaged. Damaging the non-vowel side resulted in the correct reproduction of 42% of consonant feature bundles and 44% of vowel feature bundles. Damaging the vowel side resulted in correct reproduction of 34% of consonant feature bundles and 22% of vowel feature bundles. There was a main effect of side damaged, with the vowel side performing better on all feature bundles (F(1,9) = 20.07, p < .005), no main effect of phoneme type (F(1,9) = 2.59, p = .14), and a significant interaction between side damaged and phoneme type impaired (F(1,9) = 6.85, p < .05).

Simulation 1b (no sonority feature) solved the mapping for all vowels and a mean of 96% of the consonants across all 10 runs. Again, the original model's results were replicated. Damaging the non-vowel side resulted in the correct reproduction of 35% of consonant feature bundles and 44% of vowel feature bundles, whereas damaging the vowel side resulted in correct reproduction of 35% of consonant feature bundles and 20% of vowel feature bundles.
As with the original simulation and Simulation 1a, there was a main effect of side damaged (F(1,9) = 9.17, p < .05), no main effect of phoneme type (F(1,9) = .85, p = .38), and a significant interaction between side damaged and phoneme type (F(1,9) = 24.75, p < .001).

Simulation 1c (no consonantal or sonority feature) produced similar results, with 100% of vowels and 90% of consonants being reproduced correctly. Damage to the non-vowel side resulted in 33% of consonants and 57% of vowels correctly reproduced. Damage to the vowel side resulted in 37% of consonants and 27% of vowels being correctly reproduced. There was a significant main effect of side damaged (F(1,9) = 18.98, p < .005), no main effect of phoneme type (F(1,9) = 2.26, p = .17), and a significant interaction between side damaged and phoneme type (F(1,9) = 24.29, p < .001).
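Under the column ordering assumed for Table 1 (sonorant first, consonantal second), the inputs for Simulations 1a–c can be derived simply by deleting the relevant columns from the hypothetical FEATURES matrix used in the earlier sketches; the network's input and output sizes shrink accordingly.

SONORANT, CONSONANTAL = 0, 1     # assumed column indices in the Table 1 ordering

features_1a = np.delete(FEATURES, CONSONANTAL, axis=1)               # Simulation 1a: 10 features
features_1b = np.delete(FEATURES, SONORANT, axis=1)                  # Simulation 1b: 10 features
features_1c = np.delete(FEATURES, [SONORANT, CONSONANTAL], axis=1)   # Simulation 1c: 9 features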
4.3. Discussion

The model performed the trivial task of simply reinstating cleaned-up versions of the inputs over the output nodes. It was trained on featural representations of speech segments appearing in isolation; these feature bundles are the product of formal phonological theory and are ambiguous between perception and production. These representations contained no information about segmental context within the syllable, about acoustic–phonetic properties of the segments, or about ease of articulation. In learning to carry out the task the model tended to employ distributed representations of the features and to respect similarities and differences between the feature bundles; any one feature tended to be defined in terms of the other features. Crucially, the model tended to distinguish the categories of vowels and consonants. These behaviours are core qualitative behaviours of connectionist models, which have been advanced as cognitive models on the grounds that they capture important aspects of the brain's style of representation and processing. The scope for distributed representation in the brain is many orders of magnitude greater than in the simple model we have explored. However, such simple models performing a trivial task cannot help but represent the systematic distinctions within the set of feature bundles, even though the task does not require the explicit representation of categories like consonants or vowels.

The division of the model into two halves was intended to provide the minimal opportunity for modularisation of processing to occur. Although this division resembles the hemispheric splitting of the brain, our simulation was not intended to model observed hemispheric differences (although such differences may well be relevant). Rather, it was intended to provide the simplest opportunity for different types of processing to be physically differentiated. Such a modularisation might equally well occur intrahemispherically, given some sort of anatomical differentiation.

An analytic explanation of the effect we have reported lies in the separation of the feature bundles in the feature space. The mean Euclidean distance between the vowels in the 11-dimensional feature space is 2.38 (SD = .69), whereas for consonants it is 3.05 (SD = 1.05). However, consonants are more distant from vowels than they are from one another: the mean Euclidean distance between vowels and consonants is 3.95 (SD = .74). These distances are observable in the principal components analysis (PCA) described above, as PCA respects initial relative distances between elements.

The first set of simulations was the minimal case, employing a symmetrical model; the two halves of the model were distinguished post hoc. In the second simulation, we introduce a difference between the two halves of the model, which we claim is psychologically valid and which tends to distinguish different regions within the brain.
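The Euclidean-distance analysis reported above can be reproduced with a few lines over the same hypothetical FEATURES matrix and IS_VOWEL mask from the earlier sketches; the exact means and standard deviations will of course depend on the feature values used.

from itertools import combinations

def mean_distance(set_a, set_b=None):
    """Mean (and SD) of Euclidean distances within one set, or between two sets, of bundles."""
    if set_b is None:
        dists = [np.linalg.norm(x - y) for x, y in combinations(set_a, 2)]
    else:
        dists = [np.linalg.norm(x - y) for x in set_a for y in set_b]
    return np.mean(dists), np.std(dists)

vowels, consonants = FEATURES[IS_VOWEL], FEATURES[~IS_VOWEL]
print("vowel-vowel:        ", mean_distance(vowels))
print("consonant-consonant:", mean_distance(consonants))
print("vowel-consonant:    ", mean_distance(vowels, consonants))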
5. Simulation 2

In the second simulation different processing styles were instantiated in the two hidden layers of the model, using "coarse" and "fine" coding in the model's right and left side, respectively. Coarse-coding in the model can be seen as an implementation of magnocellular activity in the auditory pathways (cf. Hockfield & Sur, 1990; Trussell, 1997); equally, coarse-coding can be seen as a way of capturing the right hemisphere's widely recognised preference for synthetic, global, configurational processing.
In terms of phonological processing, we interpret coarse-coding in temporal terms, meaning that the context within which a phoneme appears is available for the processing of that phoneme. In contrast, fine-coding means that the phonological stream is already segmented and contextual information is not available, or not required, for processing. These assumptions correspond to Poeppel's (2001) view that the right hemisphere (RH) carries out processing over a long time window in the speech stream and the left hemisphere (LH) carries out processing over a short time window, corresponding to fine-coding in the LH and coarse-coding in the RH (Brown & Kosslyn, 1993).

We tested the hypothesis that coarse-coding would be better suited for the encoding of vowels and fine-coding for consonants. This hypothesis was motivated by the fact that vowels occur in predictable syllabic contexts, whereas consonants often occur at the boundaries of syllables and words, in less predictable contexts. We employed the original 11-dimensional feature bundles, given that the first set of simulations had indicated that the separation of the representation and processing of vowels and consonants was similar even when the consonantal and sonority features were omitted.

5.1. Architecture

The model is shown in Fig. 2. The model had two hidden layers receiving noisy, feature-based phonological input from the continuous stream of phonemes in the transcription of the London-Lund Corpus. The right hidden layer received information about the target feature bundle and the feature bundle either side of the target in the speech stream. The left hidden layer received input only from the target feature bundle. The hidden layers contained six units each, and were each fully connected to the input layer where the feature bundles were presented. The hidden layers projected to the output layer, where the model was required to reproduce the input feature bundle. The output units were fully connected to the two hidden layers. The model therefore had to integrate a coarse- and a fine-grained representation of the noisy phonological input stream in order to reproduce the target feature bundle.
Fig. 2. The coarse-/fine-coding model of phonological processing. Continuous speech is used as input to the model. The right hidden layer receives information about three speech segments; the left hidden layer receives only one segment.
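A minimal sketch of how the two layers' inputs could be constructed from a phonemically transcribed stream, under the assumptions of this section (the coarse, right layer sees a three-segment window of 33 features, the fine, left layer sees the single 11-feature target, and each receives independent mean-0, variance-.5 noise); the names are hypothetical, and rng and N_HIDDEN are reused from the earlier sketch.

w_coarse = rng.normal(0.0, np.sqrt(0.1), (3 * 11, N_HIDDEN))   # right, coarse-coded hidden layer
w_fine = rng.normal(0.0, np.sqrt(0.1), (11, N_HIDDEN))         # left, fine-coded hidden layer

def coarse_and_fine_inputs(stream, t, noise_sd=np.sqrt(0.5)):
    """Build the two hidden layers' inputs for the segment at position t of the stream.

    stream is assumed to be a (T x 11) array of feature bundles for the transcribed corpus.
    The coarse layer receives the target bundle plus its left and right neighbours; the
    fine layer receives only the target; independent noise is added to each copy.
    """
    window = stream[t - 1 : t + 2].ravel()                  # previous, target, next segment
    coarse = window + rng.normal(0.0, noise_sd, window.size)
    fine = stream[t] + rng.normal(0.0, noise_sd, stream.shape[1])
    target = stream[t]                                       # clean bundle to be reproduced
    return coarse, fine, target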
Units in the model could take activity from -1 to 1 and connections were trained according to the backpropagation learning algorithm.

5.2. Training

The model was trained on the idealised phonemic transcription of the London-Lund Corpus, described above, with the phonemes represented by 11-dimensional feature vectors. Each feature bundle in the corpus was presented between its two context segments to the right hidden layer and singly to the left hidden layer. Noise was added to the features according to a random distribution, with mean 0 and variance .5. The model was tested for its accurate reproduction of the feature bundles on 100 occurrences of each feature bundle from a part of the corpus to which the model had not been exposed. After 800,000 feature bundles had been presented, the model reliably learned to reproduce all the feature bundles in the test set.

We lesioned the model by removing the connections from the input units to either the left or the right hidden layer, thus impairing the processing of one half of the model. Lesioning each hidden layer of the model allowed us to examine the contribution of that side of the model to the accurate reproduction of the feature bundle. After lesioning we presented the model with the test set of 100 instances of each feature bundle. We ran the model eight times, using different random starting weights and different parts of the London-Lund Corpus for training.

5.3. Results

The left part of Table 4 reports the accurate reproduction of phonemes of each type following lesioning of the coarse- or the fine-layer. Plosives, fricatives, nasals, and laterals were all impaired more following fine-layer damage than coarse-layer damage. There was no main effect of which layer was damaged (F(1,7) = .01, p > .9); however, there was a main effect of feature-bundle type (consonant/vowel) (F(1,7) = 25.76, p < .001), but there was no significant interaction between which layer was damaged and feature-bundle type (F(1,7) = .19, p > .9). General damage to the model results in greater impairment to consonants over vowels. However, there was no dissociation between impairment to vowels and consonants by layer damaged, as was found in Simulation 1, when connections were completely severed.
Table 4
Reproduction of phonemes by type following 100% and 50% lesions to the coarse- and fine-layers in Simulation 2

                  100% lesion                            50% lesion
Phoneme type      Coarse-layer       Fine-layer          Coarse-layer       Fine-layer
Plosives          33.33% (8.91)      27.52% (7.15)       99.88% (0.29)      78.75% (16.32)
Fricatives        44.44% (8.40)      41.50% (5.10)       88.93% (18.41)     64.56% (13.16)
Affricates        25.00% (26.73)     37.50% (23.15)      100% (0)           79.44% (20.65)
Nasals            100% (0)           95.67% (8.27)       100% (0)           100% (0)
Laterals          100% (0)           97.81% (5.25)       100% (0)           100% (0)
Approximants      100% (0)           100% (0)            100% (0)           100% (0)
All consonants    56.25% (4.45)      54.01% (4.13)       95.82% (6.95)      79.68% (9.35)
Vowels            65.63% (18.60)     66.91% (6.97)       98.02% (3.21)      96.39% (3.68)

Values shown are mean percentages correct across the eight simulation runs, with standard deviations in parentheses.
This model reflects the greater vulnerability of consonants to damage, but does not generate the dissociation between consonant and vowel processing found in the first set of simulations.

To investigate less extreme cases of damage to the model, we reduced the strength of connections between the input and either the coarse- or the fine-layer by 50%. This change corresponds to a 50% reduction in the activation feeding through to the hidden layer, compared with the 100% reduction resulting from complete lesioning of the connections. The right part of Table 4 shows the percentage correct for each phoneme type. Laterals, nasals, and approximants were accurately reproduced following both coarse- and fine-layer damage. However, plosives, fricatives, and affricates were less accurately reproduced following fine-coding damage compared with coarse-coding damage.

For the 50% lesion, overall there was a main effect of coarse-/fine-layer damage, with fine-layer damage causing greater impairment to feature bundle reproduction (F(1,7) = 6.26, p < .05). There was also a main effect of feature-bundle type, with consonants being impaired more than vowels (F(1,7) = 47.56, p < .001). There was also an interaction between layer damaged and feature-bundle type (F(1,7) = 12.97, p < .01). Damage to the fine-layer impaired consonant reproduction more than vowel reproduction, whereas damage to the coarse-layer caused only slight damage to both consonants and vowels.

However, this difference in damage to vowel feature bundles was slight, and we therefore explored a more sensitive measure of damage to the representation than just the number correctly reproduced. Mean squared error of the output provides a continuous measure of how closely the output approximates the target representation: high error means that the model's representations are poor and low error means that representations are close to the target. We measured mean squared error for vowels and consonants in place of number correctly reproduced in the ANOVA analyses. Consonants were represented more accurately following damage to the coarse hidden layer than the fine hidden layer (mean squared error 1.90 and 1.99, respectively) and vowels were represented less accurately following damage to the coarse hidden layer than the fine hidden layer (mean squared error 1.53 and 1.31, respectively). There was a significant interaction (F(1,7) = 5.74, p < .05).[1]

5.4. Discussion

The impairment of the model suggested that modularisation of processing had occurred in the predicted direction: the fine-coding side of the model tended to specialise in the representation and processing of consonants, and the coarse-coding side of the model tended to specialise in vowels. This consonant/vowel distinction replicated the modularisation found in the first simulation, but showed further that the systematic differences between consonants and vowels direct the processing of the two segment types to the two different processing styles, fine- and coarse-coding, respectively. In this second set of simulations, the differences in the contexts in which consonants and vowels are found were a determining factor.

[1] We also performed simulations that omitted the consonantal and/or sonorant feature. We found a qualitatively similar, but reduced, effect in these cases, due to the smaller distinctions between feature bundles when these features are not included.
The results were in the same direction, but did not reach significance. The highly significant main effects of phoneme type, however, remained, with consonants having greater impairment than vowels in each case.
6. General discussion

We have shown that even given a minimal task and a minimal architecture, a parallel distributed processing model of phonological processing will spontaneously embody systematic differences within the input and will tend to modularise its processing to take advantage of these differences. The task was minimal because only the cleaning up of the input was required; the model was not being trained to find words or syllables in the input, only to reproduce the feature bundle inputs. Nevertheless, the model appeared to respect higher-order categories such as consonant and vowel. The architecture was minimal because we only allowed a single, equal division of the computational space. Nevertheless, the model took advantage of that division to instantiate a division of labour between the two halves of the model. When the two halves of the model possessed different processing characteristics, as in the second set of simulations, the appropriate lateralisation of the processing occurred.

It is tempting to identify the two halves of the model with the two hemispheres of the brain. Our intention was only to allow the model the simplest possible scope for the division of labour, represented by a simple division of the computational resources. In the brain such a division might occur intra- and/or inter-hemispherically; an example of an intrahemispheric division of labour might be the distinction between magno- and parvo-cellular processing. Nevertheless, the division of a homogeneous processor into two partly encapsulated domains clearly has more significance in general than, for instance, the addition of an eleventh domain to a processor already divided into 10. The existence of a range of hemispheric lateralisations of function in the human brain is witness to the potential adaptiveness of the hemispheric division of the brain.

It is becoming clear that phonological processing is typically not lateralised completely to the LH, but that the RH may also play a role (e.g., Ivry & Lebby, 1998). Nevertheless, the LH, with its predisposition for analytic, discrete processing, does seem to play the greater role in expressive phonology overall, and simulations of some of the factors involved (e.g., Shkuro et al., 2000) have demonstrated how lateralisation might occur in the predictable direction (see also Hutsler, in press). The data we have presented, showing greater overall phonological impairment being caused by damage to the fine-coded processing and greater overall impairment to consonants than to vowels, correspond to the human data. In these respects the simple model we have proposed is a step towards a model of hemispheric lateralisation in phonological processing.

Compared with the simple models we have described, the brain contains many orders of magnitude more scope for division of labour in the processing of speech. We have shown that it is inconceivable that similar categorical processing differences would not emerge when the brain learns to process speech. The implication is that it is not necessarily warranted to claim that such emergent categories as phoneme, consonant, and vowel are explicit representational levels or types which play a necessary, independent role in representation and processing. We should expect to see apparent evidence for these categories in patterns of dissociation after impairment, as in Caramazza et al.'s (2000) data, but we cannot necessarily infer from these dissociations that there is processing that relies on those categories per se.
One simplification we have made is to assume that the phonological features are restricted to discrete time-slices. In terms of the mechanics of speech production, the physics of the speech signal, and formal analyses within the tradition of autosegmental phonology it is truer to say that the features have a more overlapping, less discrete nature.
7. Conclusion

We conclude, first, that the observation of patterns of impairment resembling double dissociations, which can be described using high-level categories such as consonant and vowel, should not be taken at face value. Rather, such observations may say more about the parallel distributed processing style of the brain than about necessary, independent levels of representation in speech processing. Such categories spontaneously emerge in a parallel distributed processor with even minimal scope for modularisation. Consonants and vowels may appear to have dissociable instantiations in the brain, but it cannot be assumed from such data that they have separate, necessary instantiations in any functional model of speech processing. Our second conclusion is a metatheoretical one: connectionist modelling can play an important role in the interpretation of human data and in conceptualising normal and impaired models of language processing.
References

Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). CELEX lexical database. Pennsylvania: Linguistic Data Consortium.
Beland, R., Caplan, D., & Nespoulous, J.-L. (1990). The role of abstract phonological representations in word production: Evidence from phonemic paraphasia. Journal of Neurolinguistics, 5, 125–164.
Blumstein, S. E. (1978). Segment structure and the syllable in aphasia. In A. Bell & J. B. Hooper (Eds.), Syllables and segments (pp. 189–200). Amsterdam: North-Holland.
Boatman, D., Hall, C., Goldstein, M. H., Lesser, R., & Gordon, B. (1997). Neuroperceptual differences in consonant and vowel discrimination: As revealed by direct cortical electrical interference. Cortex, 33, 83–98.
Brown, H. D., & Kosslyn, S. M. (1993). Cerebral lateralisation. Current Opinion in Neurobiology, 3, 183–186.
Bullinaria, J. A. (2001). Simulating the evolution of modular neural systems. In Proceedings of the 23rd annual conference of the cognitive science society (pp. 146–151). Mahwah, NJ: Lawrence Erlbaum.
Canter, G. J., Trost, J. E., & Burns, M. S. (1985). Contrasting speech patterns in apraxia of speech and phonemic paraphasia. Brain and Language, 24, 204–222.
Caramazza, A., Chialant, D., Capasso, D., & Miceli, G. (2000). Separable processing of consonants and vowels. Nature, 403, 428–430.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper & Row.
Fowler, C. A. (1983). Converging sources of evidence on spoken and perceived rhythms of speech: Cyclic production of vowels in monosyllabic stress feet. Journal of Experimental Psychology: General, 112, 386–412.
Goldsmith, J. A. (1990). Autosegmental and metrical phonology. Oxford: Blackwell.
Harm, M. W., & Seidenberg, M. S. (1999). Phonology, reading acquisition, and dyslexia: Insights from connectionist models. Psychological Review, 106, 491–528.
Hockfield, S., & Sur, M. (1990). Monoclonal CAT 301 identifies Y cells in cat LGN. Journal of Clinical Neurology, 300, 320–330.
Hutsler (in press, this volume). Brain and Language.
Ivry, R., & Lebby, P. C. (1998). The neurology of consonant perception: Specialized module or distributed processors? In M. Beeman & C. Chiarello (Eds.), Right hemisphere language comprehension: Perspectives from cognitive neuroscience (pp. 3–25). Mahwah, NJ: Lawrence Erlbaum.
Jacobs, R. A., & Kosslyn, S. M. (1994). Encoding shape and spatial relations: The role of receptive field size in coordinating complementary representations. Cognitive Science, 18, 361–386.
Keller, E. (1978). Parameters for vowel substitutions in Broca's aphasia. Brain and Language, 5, 265–285.
Kosslyn, S. M., Chabris, C. F., Marsolek, C. J., & Koenig, O. (1992). Categorical versus co-ordinate spatial relations: Computational analyses and computer simulations. Journal of Experimental Psychology: Human Perception and Performance, 18, 562–577.
Monoi, H., Fukusako, Y., Itoh, M., & Sasanuma, S. (1983). Speech sound errors in patients with conduction and Broca's aphasia. Brain and Language, 20, 175–194.
Poeppel, D. (2001). Pure word deafness and the bilateral processing of the speech code. Cognitive Science, 25, 679–694.
Redford, M., Chen, C. C., & Miikkulainen, R. (2001). Constrained emergence of universals and variation in syllable systems. Language and Speech, 44, 28–57.
Romani, C., Grana, A., & Semenza, C. (1996). More errors on vowels than on consonants: An unusual case of conduction aphasia. Brain and Language, 55, 144–146.
Shillcock, R. C., & Monaghan, P. (2001). The computational exploration of visual word recognition in a split model. Neural Computation, 13, 1171–1198.
Shkuro, Y., Glezer, M., & Reggia, J. A. (2000). Interhemispheric effects of simulated lesions in a neural model of single-word reading. Brain and Language, 72, 343–374.
Svartvik, J. (Ed.). (1990). The London-Lund Corpus of spoken English: Description and research. Lund studies in English 82. Lund: Lund University Press.
Trussell, L. O. (1997). Cellular mechanisms for preservation of timing in central auditory pathways. Current Opinion in Neurobiology, 7, 487–492.