Training radial basis function neural networks: effects of training set size and imbalanced training sets


Journal of Microbiological Methods 43 (2000) 33–44

Journal of Microbiological Methods, www.elsevier.com/locate/jmicmeth

Luan Al-Haddad (a), Colin W. Morris (a), Lynne Boddy (b, *)

(a) School of Computer Studies, University of Glamorgan, Pontypridd, UK
(b) Cardiff School of Biosciences, University of Cardiff, Cardiff CF1 3TL, UK

Abstract

Obtaining training data for constructing artificial neural networks (ANNs) to identify microbiological taxa is not always easy. Often, only small data sets with different numbers of observations per taxon are available. Here, the effect of both size of the training data set and of an imbalanced number of training patterns for different taxa is investigated using radial basis function ANNs to identify up to 60 species of marine microalgae. The best networks trained to discriminate 20, 40 and 60 species respectively gave overall percentage correct identification of 92, 84 and 77%. From 100 to 200 patterns per species was sufficient in networks trained to discriminate 20, 40 or 60 species. For 40 and 60 species data sets an imbalance in the number of training patterns per species always affected training success, the greater the imbalance the greater the effect. However, this could be largely compensated for by adjusting the networks using a posteriori probabilities, estimated as network output values. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Identification; Microalgae; Flow cytometry; RBF neural networks; Training neural networks

1. Introduction

1.1. General considerations

Artificial neural networks (ANNs) are increasingly being used in microbiological applications, one of the main areas being identification, including that of microalgae, bacteria and fungi (e.g. see Refs. in Morris and Boddy, 1995; Boddy and Morris, 1999). The data used have often come from modern high technology equipment (e.g. Culverhouse et al., 1996; Goodacre et al., 1996; Frankel et al., 1996; Blackburn et al., 1998; Wilkins et al., 1999; Boddy et al., 2000), though more traditional taxonomic data such as morphometric measurements of fungal spores (Morgan et al., 1998) and biochemical characteristics of bacterial colonies (Schindler et al., 1994) have also been used. Many of the published applications of ANNs in microbiology are, however, trivial, usually involving discrimination of only a few taxa, exceptions being identification of 35–70 species of microalgae from flow cytometry data (Boddy et al., 1994a, 2000; Wilkins et al., 1999). Scaling up is non-trivial, and success in discriminating a few taxa does not necessarily imply that discriminating large numbers of taxa will be possible.

The taxonomic characters used, be they those used traditionally for identification or those obtained from modern high technology equipment, must be sufficient to discriminate between the taxa. Also, selecting an appropriate ANN, optimizing network architecture and training are essential for success. Radial basis function (RBF) ANNs are appropriate for species identification, since where comparisons have been made using microbiological data, RBF ANNs have been at least as successful as other types (Wilkins et al., 1994, 1996; Morgan et al., 1998), and have the additional useful properties that they train rapidly and have greater potential for rejecting patterns of novel taxa as unknown (Morris and Boddy, 1996; Wilkins et al., 1999). Considerations for appropriate architecture include whether a single network is used or whether many networks should be used and their outputs combined (cf. Morris and Boddy, 1998), the number of hidden layer nodes (HLNs) and, in the case of RBF ANNs, the positioning and shape of the basis functions represented by the HLNs (see below).

The data used for training ANNs are also crucial for success: because ANNs are not rule-based but ‘learn’ or ‘train’ from examples presented to them, the training data must be representative of the situation that they are modelling. When training ANNs for identification it is essential that the training set covers all of the biological variation in taxonomic characters that is likely to be encountered in samples to be identified. Obtaining sufficient training and test data is a common problem in many micro- and macro-biological applications. There is a large literature on the effects of training set size, though this is largely in relation to the dimensionality of the data (e.g. Raudys and Pikelis, 1980; Fukunaga and Hayes, 1989). Equations have been developed for determining the necessary size of a data set for back propagation (multilayer perceptron; MLP) ANNs based on the number of synaptic weights, number of HLNs, and the proportion of errors tolerated (e.g. Baum and Haussler, 1989). Such equations, however, represent a worst-case scenario (Haykin, 1994), and do not consider the complexity of the data. A more pragmatic approach is therefore advocated.

The objective of this paper is to assess the effect of the size of the training data set on the accuracy of identification of microalgae from flow cytometry data. It has been suggested that whenever the ratio of training events for two taxa is unequal, error rates for the smaller category will be larger than if training sets were of equal size (Barnard and Botha, 1993). Hence, the effect of imbalance in size of training data sets for different taxa is examined, using RBF ANNs to discriminate 20, 40 and 60 microalgal species. Also, a method of compensating for this imbalance is considered.

*Corresponding author. Tel.: +44-29-20874776; fax: +44-29-20874305. E-mail address: [email protected] (L. Boddy).

0167-7012/00/$ – see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0167-7012(00)00202-5

L. Al-Haddad et al. / Journal of Microbiological Methods 43 (2000) 33–44

1.2. Radial basis function ANNs

RBF ANNs (Fig. 1) use kernels (basis functions) to represent the data (Fig. 2), and these kernels are placed in the input space using one of a variety of paradigms (see below). The kernels have a defined response to input data that varies according to the distance of the data point from the kernel centre. The global response of all kernels is then used to model the data space. The kernel generally chosen is a simple mathematical function that is Gaussian in shape (Fig. 3), with a response that is a function of distance from the kernel centre. The general form of the Gaussian is:

$$\text{Output} = \exp\left(-\frac{x^{2}}{2\sigma^{2}}\right)$$

where σ controls the spread of the function, and x is the Euclidean distance between the kernel centre and vector of interest. If, rather than the Euclidean distance, the Mahalanobis distance metric (Haykin, 1994) is used, the kernels become non-radially symmetric, elongated into ellipsoids (Fig. 2B). Since the size of the kernel is determined by the variance of the (n-dimensional) patterns, the size of the region represented by the RBF kernel is not fixed. Kernels representing large diffusely distributed populations will have larger variances and greater spatial spread (Fig. 3) than those representing more compact well defined populations.

Fig. 1. Schematic diagram of RBF ANN. Raw data are entered in the input layer (here with seven nodes representing the seven flow cytometric parameters) and pass via the hidden layer of processing nodes to an output layer (one node per taxon to be identified). The bias node has a constant output value irrespective of network input, and allows output layer nodes to add a constant offset. The network is fully connected between layers, though not all of the connections are shown here.

Fig. 2. Diagrams illustrating the way in which data are represented and how decision boundaries are formed between two groups (■, □) in two dimensions by RBF ANNs, using (A) the Euclidean, and (B) the Mahalanobis distance metric. Small circles represent kernel centres.

Fig. 3. The Gaussian kernel function with different values of σ.

Like the more commonly used MLP ANNs, RBF networks comprise three layers of nodes, but with the middle (hidden) layer being made up of Gaussian or asymmetric kernels (Fig. 1). A number of kernels are positioned in the input space using one of a number of possible placement algorithms. As in MLPs, the inputs to the network are nodes that simply pass each of the input signals to the middle layer kernels (HLNs). The outputs of the kernels are fed to the output layer, which is made up of ‘ordinary’ nodes with linear transfer functions. As in the MLP, values of the output layer nodes correspond to a posteriori probability estimators (Richard and Lippmann, 1991).

The kernel centres can be placed randomly, equally or non-equally amongst classes, and their positions adjusted using, for example, an LVQ algorithm (Kohonen, 1988a,b) or K-means (Hush and Horne, 1993; Tou and Gonzalez, 1974), which attempts to minimize the sum of squared distances between the training patterns and nearest centre. When positioning the kernel centres the aim is to ensure that their locations approximate the distribution for each class (i.e. species to be discriminated). Kernel positioning is then followed by two systematic passes through the training data set to determine the kernel widths and output layer weights (Musavi et al., 1992; Hush and Horne, 1993; Haykin, 1994).
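The difference between the two distance metrics can be illustrated with a minimal NumPy sketch of the Gaussian kernel response. The function names, the centre, the covariance matrix and the test pattern are illustrative assumptions and are not taken from the AimsNet software.

```python
import numpy as np

def gaussian_response(dist, sigma=1.0):
    """Kernel output exp(-x^2 / (2 sigma^2)) for a distance x from the centre."""
    return np.exp(-np.square(dist) / (2.0 * sigma ** 2))

def euclidean_distance(pattern, centre):
    """Straight-line distance; gives radially symmetric kernels."""
    diff = np.asarray(pattern, dtype=float) - np.asarray(centre, dtype=float)
    return float(np.linalg.norm(diff))

def mahalanobis_distance(pattern, centre, cov):
    """Distance scaled by the covariance of the patterns; gives ellipsoidal kernels."""
    diff = np.asarray(pattern, dtype=float) - np.asarray(centre, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Hypothetical two-dimensional kernel whose population is more spread along axis 0.
centre = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])
pattern = np.array([2.0, 0.0])

r_euc = gaussian_response(euclidean_distance(pattern, centre))
r_mah = gaussian_response(mahalanobis_distance(pattern, centre, cov))
print(f"Euclidean response:   {r_euc:.4f}")
print(f"Mahalanobis response: {r_mah:.4f}")
```

In this example the pattern lies along the direction of greatest spread, so the Mahalanobis kernel responds more strongly than the radially symmetric one, which is the behaviour sketched in Fig. 2B.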
In a trained network each of the kernels gives a response whose magnitude is a function of the distance between the input and the kernel centre. The output layer combines these signals and performs the identification. As with MLPs, optimizing network architecture is important. The optimum number of HLNs can be determined by experiment or automatically by writing software that successively increases the number of HLNs. Kernels that explain the largest proportion of the remaining unexplained error in classifying the training data are successively added to the network until the unexplained error is less than some specified amount, say 1% (Chen et al., 1991). Also, the shape of the Gaussian kernel can be adjusted by changing its width parameter; as the latter increases, the kernel becomes wider and less peaked around the centre. Broader kernels allow smoother interpolation between adjacent basis functions. The strategy for selecting and positioning kernels, and the distance metric (Euclidean vs. Mahalanobis), can affect success, and the most appropriate of these can be determined by experiment. Previous work on microalgal signatures from flow cytometry data has shown that the Mahalanobis distance metric gives greater success, because it better models the asymmetric data distributions (e.g. Boddy et al., 2000).
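The idea of adding kernels until the unexplained error falls below a threshold can be sketched as a plain greedy forward selection. This is a simplified stand-in that refits the output weights by least squares at every step; it is not the orthogonal least squares algorithm of Chen et al. (1991), and the function names, the synthetic data and the candidate centres are all assumptions made for illustration.

```python
import numpy as np

def design_matrix(X, centres, sigma):
    """Gaussian kernel outputs for every pattern (rows) and centre (columns)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def greedy_select(X, T, candidates, sigma=1.0, tol=0.01, max_kernels=None):
    """Add kernels one at a time, each time choosing the candidate that leaves
    the smallest unexplained error, until that error is at most `tol`."""
    Phi = design_matrix(X, candidates, sigma)
    total = float((T ** 2).sum())
    limit = max_kernels or len(candidates)
    chosen, unexplained = [], 1.0
    while len(chosen) < limit and unexplained > tol:
        best_j, best_err = None, np.inf
        for j in range(len(candidates)):
            if j in chosen:
                continue
            cols = Phi[:, chosen + [j]]
            W, *_ = np.linalg.lstsq(cols, T, rcond=None)
            err = float(((T - cols @ W) ** 2).sum()) / total
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
        unexplained = best_err
    return chosen, unexplained

# Synthetic two-class example: 50 patterns per class, one output node per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])
T = np.repeat(np.eye(2), 50, axis=0)
candidates = np.array([[-3.0, -3.0], [3.0, 3.0], [0.0, 0.0], [10.0, 10.0]])

chosen, unexplained = greedy_select(X, T, candidates, sigma=2.0, tol=0.01)
print("selected candidate indices:", chosen)
print(f"unexplained error fraction: {unexplained:.4f}")
```

The stopping tolerance of 0.01 mirrors the 1% figure mentioned in the text; the candidates placed away from the data are included only to show that the selection passes over uninformative kernels.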

2. Methods

2.1. Data set

The data set comprised flow cytometric signatures of approximately 10⁴ cells each of 60 marine microalgal species in eight orders, covering a wide range of morphologies and sizes (approx. 1–45 µm) (Table 1). Details are given in Boddy et al. (2000), which provides a full ANN analysis of the data. Briefly, cultures (from the Plymouth Culture Collection, Marine Biological Association, UK) were maintained in F/10 medium (Guillard and Ryther, 1962) at 15°C ± 1° under a 12:12 h light:dark cycle at 50 µmol quanta m⁻² s⁻¹. Cultures were analysed by flow cytometry (AFC) using a Becton Dickinson FACSort™ flow cytometer, and individual cell measurements were made for seven parameters: cellular forward light scatter, integrated and peak chlorophyll fluorescence (>650 nm), width of chlorophyll fluorescence pulse (time of flight, a measure of particle length), phycoerythrin fluorescence (585 ± 21 nm), side scatter and depolarised light scatter. These formed the inputs to the ANNs, with one parameter per input node.

2.2. Software and data pre-processing

The software used was AimsNet, which was developed during a European Union funded project, AIMS (Automated Identification of Microorganisms). It is a multivariate data analysis program incorporating RBF ANNs, running on a Pentium PC under Windows 95. Data pre-processing involved ‘gating’ the flow cytometry data, by omitting all events with low red fluorescence signals, to remove any clusters of events originating from ‘noise particles’ such as inorganic particles, bacterial contaminants, cellular debris, etc. Data were then gated on the other individual parameters to remove any outliers. No transformations were applied, except that input data were normalised by linear rescaling such that the training data had a mean of 0.0 and a standard deviation of 1.0 for each parameter.

2.3. Effect of size of training data set on identification success

The 20 most easily discriminated species (Boddy et al., 2000) were selected from the database. Nine training files were created for each of the species, containing a random selection of 10, 25, 50, 100, 200, 400, 600, 800 and 1000 events (cell flow cytometric signatures comprising values of the seven parameters measured). Each of the training files was used to train five RBF ANNs having a stated maximum of 1, 2, 3, 4 or 5 kernels per class (equal number of kernels per class). The software automatically selects the optimum number (below the stated maximum) after training using an orthogonal least squares learning algorithm (Chen et al., 1991). Kernels were placed using the LVQ method, and all networks used the Mahalanobis distance metric.

Twenty more species (the next most easily discriminable) were added to the original 20, creating training data sets with 40 species. Networks with the same topology and numbers of training patterns as in the first experiment were employed. A further 20 species were added to give a data set with 60 species, and networks were trained as previously. Three replicates were made of each network. All trained networks were tested on independent data sets (500 events per species; one test set for each of the 20, 40 and 60 species data sets) to assess performance in terms of proportion of correct identification, and to determine which species were consistently misidentified as other species.

2.4. Effect of imbalanced training sets on identification success

The same groupings of 20, 40 and 60 species (as above) were used. Networks employed three kernels per species, this number having been determined as satisfactory in the first set of experiments. For each of these groupings, 15 networks were trained using different combinations of unequal-sized (imbalanced) training data sets. Combinations used were x events per species for species 1 to ½n (i.e. the first half of each data set), and y events per species for species ½n + 1 to n (i.e. the second half of each data set), where n is the total number of species (i.e. 20, 40 or 60) and x and y are one of 400, 200, 100, 50, 25 or 10 events per species. Combinations of imbalanced sets were therefore: 400:200, 400:100, 400:50, 400:25, 400:10, 200:100, 200:50, 200:25, 200:10, 100:50, 100:25, 100:10, 50:25, 50:10, 25:10. To ensure that a preponderance of easily discriminable species in one half of the data sets was not biasing the results, the x:y ratio was reversed, giving another 15 networks. Three replicate networks were trained. All networks were tested on an independent test file consisting of equal numbers (500) of events per species.

Table 1. The 60 marine microalgal species in eight orders comprising the three data sets constructed (a)

Group | Species | Size (µm) | Data set | % correct identification
Cryptophyceae | Chroomonas sp. | 8–10 | A1 B1 C1 | 93.5
Cryptophyceae | Chroomonas salina | 5–12 | A1 B1 C1 | 94
Cryptophyceae | Cryptomonas appendiculata | 15–25 | A1 B1 C1 | 98.5
Cryptophyceae | Cryptomonas calceiformis | 10–15 | A1 B1 C1 | 89
Cryptophyceae | Cryptomonas maculata | 12–20 | A1 B1 C1 | 94
Cryptophyceae | Cryptomonas reticulata | 18–25 | A1 B1 C1 | 95
Cryptophyceae | Cryptomonas rostrella | 16–25 | A1 B1 C1 | 98.5
Cryptophyceae | Hemiselmis brunnescens | 5–8 | A1 B1 C1 | 64
Cryptophyceae | Hemiselmis rufescens | 4–9 | A1 B1 C1 | 64.5
Cryptophyceae | Hemiselmis virescens | 5–8 | A2 B2 C1 | 93
Cryptophyceae | Plagioselmis punctata | 6–9 | A2 B2 C2 | 92.5
Cryptophyceae | Rhodomonas sp. | 8–13 | A2 B2 C2 | 87.5
Prasinophyceae | Micromonas pusilla | 1–3 | A2 B2 C1 | 99.5
Prasinophyceae | Nephroselmis pyriformis | 4–7 | B2 C2 | 70
Prasinophyceae | Nephroselmis rotunda | 6–8 | B2 C2 | 55.5
Prasinophyceae | Pyramimonas grossii | 5–10 | C2 | 71
Prasinophyceae | Pyramimonas obovata | 4–8 | C2 | 68
Prasinophyceae | Tetraselmis impellucida | 11–19 | A2 B2 C2 | 94
Prasinophyceae | Tetraselmis suecica | 6–15 | B2 C2 | 88.5
Prasinophyceae | Tetraselmis verrucosa | 3–11 | C2 | 64
Prasinophyceae | Tetraselmis tetrathele | 10–16 | A2 B2 C2 | 95.5
Prasinophyceae | Tetraselmis striata | 6–8 | B2 C2 | 74.5
Chlorophyceae | Chlamydomonas reginae | 11–20 | B1 C1 | 89
Chlorophyceae | Chlorella salina | 4–8 | C1 | 54
Chlorophyceae | Dunaliella minuta | 3–12 | B1 C1 | 62.5
Chlorophyceae | Dunaliella primolecta | 5–12 | B1 C1 | 88.5
Chlorophyceae | Dunaliella tertiolecta | 6–12 | B1 C1 | 82.5
Chlorophyceae | Stichococcus bacillaris | 5–8 | C2 | 63
Rhodophyceae | Porphyridium purpureum | 4–6 | A2 B2 C2 | 96
Rhodophyceae | Rhodella maculata | 7–24 | A2 B2 C2 | 91.5
Chrysophyceae | Ochromonas sp. | 3–12 | C2 | 39.5
Chrysophyceae | Pelagococcus subviridis | 2–3 | A2 B2 C2 | 86.5
Chrysophyceae | Pseudopedinella sp. | 8–10 | C2 | 79.5
Prymnesiophyceae | Chrysochromulina camella | 6–12 | B1 C1 | 89
Prymnesiophyceae | Chrysochromulina chiton | 5–9 | C1 | 60.5
Prymnesiophyceae | Chrysochromulina cymbium | 6–10 | C1 | 32.5
Prymnesiophyceae | Chrysochromulina polylepis | 6–8 | C1 | 63.5
Prymnesiophyceae | Pleurochrysis carterae | 10–18 | B2 C2 | 90
Prymnesiophyceae | Emiliania huxleyi B11 | 5–7 | A1 B1 C1 | 99.5
Prymnesiophyceae | Emiliania huxleyi 92 | 5–6 | C1 | 78.5
Prymnesiophyceae | Ochrosphaera neopolitana | 8–10 | C2 | 45.5
Prymnesiophyceae | Pavlova lutheri | 4–6 | C2 | 72
Prymnesiophyceae | Phaeocystis pouchetii | 3–6 | C2 | 56.5
Prymnesiophyceae | Prymnesium parvum | 8–10 | C2 | 75.5
Bacillariophyceae | Chaetoceros calcitrans | 4–6 | B1 C1 | 87
Bacillariophyceae | Phaeodactylum tricornutum | 8–35 | A2 B2 C2 | 94.5
Bacillariophyceae | Skeletonema costatum | 3–5 | C2 | 80
Bacillariophyceae | Thalassiosira weissflogii | 12–20 | C2 | 93.5
Dinophyceae | Aureodinium pigmentosum | 7–12 | C1 | 86.5
Dinophyceae | Gymnodinium micrum | 8–15 | B1 C1 | 74
Dinophyceae | Gymnodinium simplex | 6–10 | C1 | 62.5
Dinophyceae | Gymnodinium veneficum | 9–16 | B1 C1 | 46.5
Dinophyceae | Gymnodinium vitiligo | 7–22 | B1 C1 | 63
Dinophyceae | Gyrodinium aureolum | 35–45 | B1 C1 | 88.5
Dinophyceae | Heterocapsa triquetra | 15–27 | B2 C1 | 79.5
Dinophyceae | Prorocentrum balticum | 9–15 | B2 C2 | 66.5
Dinophyceae | Prorocentrum micans | 30–40 | B2 C2 | 74
Dinophyceae | Prorocentrum minimum | 16–18 | B2 C2 | 58
Dinophyceae | Prorocentrum nanum | 8–10 | C2 | 63
Dinophyceae | Scrippsiella trochoidea | 30–42 | B2 C2 | 42

Species misidentified (>10%), Cryptophyceae to Prymnesiophyceae: H. rufescens (33%); H. brunnescens (30%); N. rotunda (23%); N. pyriformis (31%); P. lutheri (11%); C. polylepis (12%); P. parvum (10.5%), S. costatum (24%); C. polylepis (20%); C. polylepis (21%), C. chiton (13%), O. neopolitana (15%); C. chiton (17.5%); P. obovata (14%); C. salina (11%).

Species misidentified (>10%), Bacillariophyceae and Dinophyceae: G. vitiligo (32%); G. veneficum (21%); D. minuta (16.5%), D. tertiolecta (17.5%).

(a) 20 species set: A1 first 10 species, A2 second 10 species. 40 species set: B1 first 20 species, B2 second 20 species. 60 species set: C1 first 30 species, C2 second 30 species. Percentage correct identification is for a 60 species network trained on 200 patterns per species. Overall percentage correct identification 76.6%.
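The construction of the imbalanced training sets and the normalisation step of Section 2.2 can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' procedure in AimsNet: the helper names are hypothetical, and rescaling the test data with the training mean and standard deviation is an assumption (the text states only that the training data were rescaled to mean 0.0 and standard deviation 1.0).

```python
from itertools import combinations
import numpy as np

SIZES = [400, 200, 100, 50, 25, 10]
ratios = list(combinations(SIZES, 2))            # the 15 x:y pairs with x > y
reversed_ratios = [(y, x) for x, y in ratios]    # the same pairs with halves swapped

def events_per_species(n_species, x, y):
    """x events for each species in the first half, y for each in the second half."""
    half = n_species // 2
    return [x] * half + [y] * (n_species - half)

def draw_training_set(pools, counts, rng):
    """Randomly select counts[i] events (rows) from the pool of species i."""
    X, labels = [], []
    for label, (events, k) in enumerate(zip(pools, counts)):
        idx = rng.choice(len(events), size=k, replace=False)
        X.append(events[idx])
        labels.append(np.full(k, label))
    return np.vstack(X), np.concatenate(labels)

def standardise(train, other):
    """Rescale each parameter using the training mean and standard deviation."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (other - mean) / std

# Synthetic stand-in for flow cytometric data: 20 species, 1000 events, 7 parameters.
rng = np.random.default_rng(1)
events = [rng.normal(loc=i, scale=1.0, size=(1000, 7)) for i in range(20)]
train_pools = [e[:500] for e in events]          # training events are drawn from here
X_test = np.vstack([e[500:] for e in events])    # 500 independent test events per species

counts = events_per_species(20, *ratios[4])      # the 400:10 combination
X_train, y_train = draw_training_set(train_pools, counts, rng)
X_train_z, X_test_z = standardise(X_train, X_test)
print(len(ratios), "combinations; first:", ratios[0], "last:", ratios[-1])
print("training set shape:", X_train.shape)
```

Enumerating the pairs with `itertools.combinations` over the descending list of sizes reproduces the 15 combinations in the order given in the text.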

3. Results

3.1. Effect of size of training data set on identification success

The best networks trained to discriminate 20, 40 and 60 species respectively gave overall percentage correct identification of 92, 84 and 77% (Fig. 4). There was little or no improvement in overall percentage correct identification when increasing the size of the training data set from 50 to 1000 patterns per species, in networks trained to discriminate between 20 species. From 100 to 200 patterns per species was sufficient in networks trained to discriminate between 40 or 60 species. Three nodes per class allowed sufficient discrimination between taxa (data not shown).
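The performance measures used here and in Table 1 (overall percentage correct identification, percentage correct per species, and species mistaken for another in more than 10% of cases) can be computed from a confusion matrix. The sketch below uses a small made-up set of labels; the function names are hypothetical and every species is assumed to have at least one test pattern.

```python
import numpy as np

def confusion_matrix(true, pred, n_classes):
    """Rows are true species, columns are the species assigned by the network."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (true, pred), 1)
    return cm

def summarise(cm, threshold=10.0):
    """Overall and per-species % correct, plus confusions above `threshold` %."""
    row_pct = 100.0 * cm / cm.sum(axis=1, keepdims=True)
    per_species = np.diag(row_pct)
    overall = 100.0 * np.trace(cm) / cm.sum()
    confused = {}
    for i in range(len(cm)):
        confused[i] = [(int(j), float(row_pct[i, j]))
                       for j in np.flatnonzero(row_pct[i] > threshold) if j != i]
    return overall, per_species, confused

# Made-up identifications for three species (labels 0, 1, 2).
true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 0])

cm = confusion_matrix(true, pred, 3)
overall, per_species, confused = summarise(cm)
print("overall % correct:", overall)
print("per-species % correct:", per_species)
print("misidentified (>10%):", confused)
```

With equal numbers of test patterns per species, as in the test sets used here, the overall figure equals the mean of the per-species percentages.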


3.2. Imbalanced training sets

Overall percentage successful identification was usually higher for the 20 species data set than for the 40 and 60 species sets, the exception sometimes being the species 1–10 group with ≤50 training patterns per species (Fig. 5). With 20 species, imbalanced training sets had little effect on training success when the ratio of patterns in the training set was better than 50:400. However, for the 40 and 60 species data sets an imbalance always affected training success; the greater the imbalance the greater the effect (Fig. 5). Imbalance improved identification of some species (not necessarily those with the largest set of training data) at the expense of reduced identification success of the others (Fig. 6).

Fig. 4. Overall percentage correct identification by networks with different maximum numbers of kernels (nodes) per species: ○, 1; ●, 5 nodes (2, 3 and 4 nodes omitted for clarity; points fall between those for 1 and 5 nodes). (A) 20 species, (B) 40 species and (C) 60 species. Upper (■) and lower (▲) plots are for the species that were identified respectively with most (Micromonas pusilla) and least success: (A) Hemiselmis brunnescens; (B) Prorocentrum minimum; (C) Ochrosphaera neopolitana.

Fig. 5. Percentage correct identification for networks with unbalanced training sets. Three data sets were used with 20, 40 and 60 species in each. For each data set half of the species had a fixed number of training patterns (events) (A, 400; B, 200; C, 100; D, 50; E, 25; F, 10) and the other half had a variable number of events per species. The fixed and variable halves were then swapped. ■, 20 species (1–10 varied); □, 20 species (11–20 varied); ▲, 40 species (1–20 varied); △, 40 species (21–40 varied); ●, 60 species (1–30 varied); ○, 60 species (31–60 varied).

Fig. 6. Percentage correct identification for selected individual species when networks were trained with imbalanced event numbers per species, for the 60 species data set: ◆, Chrysochromulina chiton; ▲, Gymnodinium veneficum; ●, Micromonas pusilla; ■, Ochrosphaera neopolitana; ◇, Ochromonas sp.; △, Phaeodactylum tricornutum; □, Tetraselmis tetrathele. Training was as described for Fig. 5: (A) 400, (B) 200, (C) 100, (D) 50, (E) 25, (F) 10 events per fixed class.

4. Discussion

Approaching 80% overall identification success of 60 species by the best networks is on a par with that obtained in similar studies with large numbers of taxa (Boddy et al., 1994a, 2000; Wilkins et al., 1999). Decreased success with networks trained to discriminate between a large number of species is an inevitable consequence of overlap of flow cytometric parameter distributions when species numbers increase. The fact that increasing the number of HLNs had relatively little effect on overall identification success may suggest that ANNs are not necessary for this application and that a simple normal-based linear discriminant function would suffice. This would be true if character distributions of all species were Gaussian, but this is not the case. Multimodal distributions, for example, sometimes occur.

The importance of having sufficiently large training sets to cover biological variation is highlighted by the increase in overall identification success with 100 training patterns per taxon compared with 25 or fewer. Obtaining sufficiently large training sets in the laboratory is unlikely to pose a problem, because large populations are easy to culture, and flow cytometric analysis is rapid (in the region of 10³ cells s⁻¹). However, this may often not be the case in the field. Ideally, to make identifications of microalgae from the field, ANNs should be trained on samples from the field. This is because values of flow cytometric parameters may differ considerably, even in the laboratory when cells are grown under different conditions (Boddy et al., 1994b, 2000). Training data from the field could be obtained by using statistical or neural clustering approaches to delimit groups, and then using the flow cytometric sorting facility to physically remove cells from these clusters and to make an identification microscopically. Also, obtaining large training data sets where data capture is not automated will be less easy (e.g. morphological characteristics and morphometric measurements of fungal spores (Morgan et al., 1998)). Further, though there is considerable taxonomic information in the literature, much of it (especially for eukaryotic microorganisms) is in the form of dichotomous keys. Since keys do not provide information on characters for individuals, the information cannot be used to train ANNs.

The less common a species, the more difficult it is likely to be to obtain sufficient training data. If, in a particular situation, scientific interest centres largely around more common species, it may be satisfactory to construct training sets with a large number of patterns per species for the latter (e.g. in the microalgal example presented here 200 patterns per species) and a much smaller number of patterns per species for the rarer species. However, if rarer species are of major interest it may be prudent to have balanced training sets, even if the numbers of patterns per species are suboptimal.

Since the network outputs approximate a posteriori probabilities, it has been suggested that the outputs of the networks can be adjusted to take account of differences between proportions of taxa in training and test data sets (e.g. Richard and Lippmann, 1991). This can be achieved by dividing network outputs by the ‘class probability’ (i.e. the probability that a particular pattern in the training data comes from a particular species) and multiplying by the ‘correct class probability’ (i.e. the probability that an event comes from a particular species). For the test data sets used here with equal numbers of test patterns per species, the latter probability is estimated as 1/20, 1/40 or 1/60 for the 20, 40 and 60 species data sets respectively. The training data class probabilities are estimated as the proportion of events for each species in the training data set.

Adjusting the networks in the present study to compensate for imbalanced training data sets does dramatically improve identification success for test data: overall percentage successful identification is highest for the 20 species data set and lowest for the 60 species data set, but for all data sets is >75% (data not shown). There is a large improvement in identification success for each of the under-represented species, at the expense of a slight decrease in identification success of the well represented species (data not shown). Richard and Lippmann (1991) suggest that data on correct class probabilities are often readily available. However, in practice with biological applications, when applying ANNs to field data or samples from laboratory mixed culture experiments the correct probability that a pattern belongs to a particular species would usually be unknown and equal probabilities must be assumed. This assumption will, of course, usually be incorrect, and, indeed, often the whole point of using a trained ANN to identify individual patterns is to determine the proportions of each species in a sample! The value of obtaining ‘corrected’ classifiers for networks trained with imbalanced data sets in the way described here is therefore unclear in most biological identification problems.

On the other hand, there may be situations when adjustments to networks might be beneficial, for example when it is important to take into account the cost of misidentification. For example, when trying to detect toxic algal species, making a few false positive identifications may be tolerable, but false negatives may not be. Network outputs could be scaled accordingly.

If only one or a few species are of interest against a background of many other species, a slightly different approach to network construction may be appropriate. This involves training a network to discriminate between two categories: (1) the species of interest and (2) every other species that might be encountered (Morris and Boddy, 1998). This approach is likely to need imbalanced training sets, category 1 having far fewer, say several hundred, patterns than category 2. The latter might contain several thousand patterns. If several species are of interest then several nets of this kind (one per species of interest) can be trained and the outputs combined, by a second net or using a ‘winner takes all’ strategy (Morris and Boddy, 1998). Ongoing research is directed toward evaluating this approach for the identification of large (60 or more) numbers of species compared with identification using a single network trained to discriminate all species.

In summary, when using single large networks to discriminate taxa it is important to have sufficient training patterns per taxon to cover intraspecific biological variation, and usually balanced training sets should be employed. Clearly, however, much research is still required on how to cope with environmentally realistic numbers of taxa, and how to overcome the difficulties in obtaining sufficient training data.
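The output adjustment discussed above (dividing each network output by the class probability in the training data and multiplying by the class probability expected in the data to be identified) can be sketched in a few lines. The training counts and network outputs below are made-up values; the final renormalisation so that the adjusted outputs sum to one is an added convenience that does not change which species wins, and the outputs are assumed to be non-negative.

```python
import numpy as np

def adjust_outputs(outputs, train_priors, target_priors):
    """Rescale network outputs from training-set priors to target priors."""
    scaled = outputs / train_priors * target_priors
    return scaled / scaled.sum(axis=1, keepdims=True)

# Hypothetical four-species network trained on a 400:10 imbalanced set.
train_counts = np.array([400, 400, 10, 10])
train_priors = train_counts / train_counts.sum()
target_priors = np.full(4, 1.0 / 4.0)      # equal numbers of test patterns per species

# Made-up outputs for one test pattern, one value per output node.
outputs = np.array([[0.60, 0.25, 0.10, 0.05]])
adjusted = adjust_outputs(outputs, train_priors, target_priors)

print("identified before adjustment:", int(outputs.argmax(axis=1)[0]))
print("identified after adjustment: ", int(adjusted.argmax(axis=1)[0]))
```

In this example the raw outputs favour a well represented species, whereas the adjusted outputs favour an under-represented one, which is the kind of shift in identification reported above for imbalanced training sets.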

Acknowledgements

This work was a PRiME project funded by the Natural Environment Research Council (grant GST/02/1062). Thanks to Malcolm F. Wilkins for valuable discussion and for allowing us to use the AimsNet software, developed in a project funded by the Commission of the European Community, grant MAS3-CT97-0080. Thanks to Glen A. Tarran who made the flow cytometric measurements on the 60 microalgal species (original data first published in Boddy et al. (2000)).

References

Barnard, E., Botha, E.C., 1993. Back-propagation uses prior information efficiently. IEEE Trans. NN. 4, 794–802.
Baum, E.B., Haussler, D., 1989. What size net gives valid generalization? Neural Comp. 1, 151–160.
Blackburn, N., Hagstrom, A., Wikner, J., Cuadros-Hansson, R., Bjornsen, P.K., 1998. Rapid determination of bacterial abundance, biovolume, morphology, and growth by neural network-based image analysis. Appl. Env. Micro. 64, 3246–3255.
Boddy, L., Morris, C.W., 1999. Artificial neural networks for pattern recognition. In: Fielding, A.H. (Ed.), Machine Learning Methods for Ecological Applications. Kluwer, Boston, Dordrecht, London, pp. 37–87.
Boddy, L., Morris, C.W., Wilkins, M.F., Al-Haddad, L., Tarran, G.A., Jonker, R.R., Burkill, P.H., 2000. Identification of 72 phytoplankton species by radial basis function neural network analysis of flow cytometric data. Mar. Ecol. Prog. Ser. 195, 47–59.
Boddy, L., Morris, C.W., Wilkins, M.F., Tarran, G.A., Burkill, P.H., 1994a. Neural network analysis of flow cytometric data for 40 marine phytoplankton species. Cytometry 15, 283–293.
Boddy, L., Wilkins, M.F., Morris, C.W., Tarran, G.A., Burkill, P.H., Jonker, R.R., 1994b. Techniques for neural network identification of phytoplankton for the EUROPA flow cytometer. Proc. OCEANS ’94 OSATES Conf. Vol. 1, pp. 565–569.
Chen, S., Cowan, C.F.N., Grant, P.M., 1991. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. Neural Net. 2, 302–309.
Culverhouse, P.F., Simpson, R.G., Ellis, R., Lindley, J.A., Williams, R., Parisini, T., Reguera, B., Bravo, I., Zoppoli, R., Earnshaw, G., McCall, H., Smith, G., 1996. Automatic classification of field-collected dinoflagellates by artificial neural network. Mar. Ecol. Prog. Ser. 139, 281–287.
Frankel, D.S., Frankel, S.L., Binder, B.J., Vogt, R.F., 1996. Application of neural networks to flow cytometry data analysis and real-time cell classification. Cytometry 23, 290–302.
Fukunaga, K., Hayes, R.R., 1989. Effects of sample size in classifier design. IEEE Trans. Pattern Anal. Machine Intel. 11, 873–885.
Goodacre, R., Hiom, S.J., Cheeseman, S.L., Murdoch, D., Weightman, A.J., Wade, W.G., 1996. Identification and discrimination of oral asaccharolytic Eubacterium spp. by pyrolysis mass spectrometry and artificial neural networks. Current Microbiol. 32, 77–84.
Guillard, R.R.L., Ryther, J.H., 1962. Studies on marine planktonic diatoms. I. Cyclotella nana (Hustedt) and Detonula confervacea (Cleve) Gran. Can. J. Microbiol. 8, 229–239.
Haykin, S., 1994. Neural Networks: A Comprehensive Foundation. Maxwell MacMillan International, New York, NY.
Hush, D.R., Horne, B.B., 1993. Progress in supervised neural networks: what’s new since Lippmann? IEEE Sig. Proc. Mag. 10, 8–39.
Kohonen, T., 1988a. An introduction to neural computing. Neural Networks 1, 3–16.
Kohonen, T., 1988b. Self-Organisation and Associative Memory, 2nd Edition. Springer-Verlag, New York.
Morgan, A., Boddy, L., Morris, C.W., Mordue, J.E.M., 1998. Identification of species in the genus Pestalotiopsis from spore morphometric data: a comparison of some neural and non-neural methods. Mycol. Res. 102, 975–984.
Morris, C.W., Boddy, L., 1995. Artificial neural networks in identification and systematics of eukaryotic microorganisms. Binary 7, 70–76.
Morris, C.W., Boddy, L., 1996. Classification as unknown by RBF networks: discriminating phytoplankton taxa from flow cytometry data. In: Dagli, C.H., Akay, M., Chen, C.L.P., Fernandez, B.R., Ghosh, J. (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks, Vol. 6. ASME Press, New York, pp. 629–634.
Morris, C.W., Boddy, L., 1998. Partitioned RBF networks for identification of biological taxa: discrimination of phytoplankton from flow cytometry data. In: Dagli, C.H., Akay, M., Buczak, C.L.P., Ersoy, A.L., Fernandez, B.R. (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks, Vol. 8. ASME Press, New York, pp. 637–642.
Musavi, M.T., Ahmed, W., Chan, K.H., Faris, K.B., Hummels, D.M., 1992. On the training of radial basis function classifiers. Neural Networks 5, 595–603.
Raudys, S., Pikelis, V., 1980. On dimensionality, sample size, classification error, and complexity of classification algorithm in pattern recognition. IEEE Trans. Pattern Anal. Machine Intel. 2, 242–252.
Richard, M.D., Lippmann, R.P., 1991. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comp. 3, 461–483.
Schindler, J., Paryzek, P., Farmer, III J., 1994. Identification of bacteria by artificial neural networks. Binary 6, 191–196.
Tou, J.T., Gonzalez, R.C., 1974. Pattern Recognition Principles. Addison-Wesley, London.
Wilkins, M.F., Boddy, L., Morris, C.W., Jonker, R., 1996. A comparison of some neural and non-neural methods for identification of phytoplankton from flow cytometry data. CABIOS 12, 9–18.
Wilkins, M.F., Boddy, L., Morris, C.W., Jonker, R.R., 1999. Identification of phytoplankton from flow cytometry data by using radial basis function neural networks. Appl. Environ. Microbiol. 65, 4404–4410.
Wilkins, M.F., Morris, C.W., Boddy, L., 1994. A comparison of radial basis function and backpropagation neural networks for identification of marine phytoplankton from multivariate flow cytometry data. CABIOS 10, 285–294.
