Current Anaesthesia and Critical Care

(1998) 9, 168-173

© 1998 Harcourt Brace & Co. Ltd

Focus on: Computers and anaesthesia

Neural networks

P. C. W. Beatty

Classification of data is an increasingly important branch of statistics as applied to medicine. Neural network models are now commonly applied to medical data and have been particularly successful in problems in anaesthesia and critical care.

'Neural networks' have been among the trendiest buzzwords in the statistical analysis of medical data for some time. Their power and ease of application have led to both their appropriate and inappropriate use. This review describes the historical context of neural networks, considers their characteristics and what they can and cannot do, and then provides three examples of their application in anaesthesia and critical care.

Historical background

The concept of neural computing and artificial neural networks predates the invention of the familiar digital computer. Its roots lie in a 1943 paper in which two neurologists, Warren McCulloch and Walter Pitts,1,2 proposed that the brain's function could be described by networks of what they called 'binary decision units', to which logical calculus could be applied. They had applied to the brain some of the ideas in Alan Turing's seminal work on computable numbers,3 which laid the mathematical basis of all modern computing. Despite this early start, neural computing made little impact compared to digital computing until Frank Rosenblatt4 showed that a network of the binary decision units, called a perceptron, could be trained to recognize geometrical patterns. Rosenblatt's work prompted a great deal of interest, and inspired extravagant claims about the power of perceptron models.

However, fundamental problems existed with the two-layer perceptron design, which were dramatically exposed in 1969 when Marvin Minsky and Seymour Papert showed that two-layer perceptrons could not solve some very simple, but significant, pattern-recognition tasks.5 The result was that confidence in neural computing faded and research was limited to a small group of devotees. Gradually, these workers found solutions to the criticisms of neural networks. The turning point came in 1983, when Hinton and Sejnowski published a three-layer neural network design6 which answered the main criticism Minsky and Papert had laid against perceptrons. Since 1983, the interest in and variety of neural networks have exploded. The initial growth was characterized by the hype and extravagance of the early developments. However, 'neural network fever' has somewhat cooled, and neural networks have now taken their place as extremely interesting and powerful statistical methods of multiple non-linear regression and classification. They are of particular interest in anaesthesia and intensive care.

The principles of artificial neural networks

Structure

Figure 1 shows the basic building block of an artificial neural network (ANN) compared with its biological inspiration, the neuron. In the neuron, dendrites, connected to other neurons by synapses, bring action potentials from those neurons to the central body of the cell, the soma.

Dr P. C. W. Beatty, Physical Sciences Research Group in Anaesthesia and Intensive Care, Division of Bio-medical Engineering, Manchester University, Stopford Building, Oxford Road, Manchester M13 9PT, UK.

Fig. 1 Node structure compared to that of a neuron. The dendrites, synapses, soma and nucleus of the neuron are compared with the node's inputs, weights, summer and transfer function, which produce the output y = f(Σ wi·xi).

In the soma these signals are integrated, and if the total excitation exceeds a threshold the neuron fires and an action potential is propagated down the axon to other synaptic terminals and on to the next layer of neurons. The basic building block of the ANN, the node, has a similar function. It has a number of inputs that are connected to the other nodes in the network. These connections have what might be thought of as a gain, called a weight, which means that the signal that appears at the node from a given input is the original signal multiplied by the weight. In the node these weighted signals are summed and the result passed through a transfer function, which can be a straightforward threshold, as in the neuron, but is more likely to be a smooth function like a sigmoid. The output signal sent from the node to the next layer in the network is governed by this function.

Figure 2 shows how these nodes are built into an artificial neural network. The commonest design is a three-layer, fully-interconnected network. There is an input layer of nodes connected to one input each. There is then a second layer of hidden nodes, each of which is connected to all the nodes in the input layer. In turn, they are connected to all the nodes in the output layer, which give the possible outputs. This design is called a multi-layer perceptron (MLP). The architecture of an MLP is characterized by the number of nodes in the layers, so Figure 2 shows a 4-5-2 MLP.
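To make the node computation concrete, the short sketch below (illustrative Python/NumPy code, not part of the original article; the weights and input values are invented) evaluates y = f(Σ wi·xi) for a single node with a sigmoid transfer function, and then chains the same operation through a 4-5-2 MLP like the one in Figure 2. Bias terms are omitted for simplicity.

```python
import numpy as np

def sigmoid(z):
    # Smooth transfer function used in place of a hard threshold
    return 1.0 / (1.0 + np.exp(-z))

def node_output(inputs, weights):
    # y = f(sum_i w_i * x_i): weighted sum of the inputs passed through the transfer function
    return sigmoid(np.dot(weights, inputs))

def mlp_forward(x, w_hidden, w_output):
    # w_hidden: (5, 4) weights from the 4 inputs to the 5 hidden nodes
    # w_output: (2, 5) weights from the 5 hidden nodes to the 2 output nodes
    hidden = sigmoid(w_hidden @ x)
    return sigmoid(w_output @ hidden)

rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, 0.1, 0.5])        # e.g. shape, colour, size, weight (already scaled)
w_hidden = 0.5 * rng.normal(size=(5, 4))  # random initial weights, as before training
w_output = 0.5 * rng.normal(size=(2, 5))
print(node_output(x, w_hidden[0]))        # output of a single hidden node
print(mlp_forward(x, w_hidden, w_output)) # two outputs, e.g. 'apple' and 'pear'
```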

Fig. 2 A three-layer multi-layer perceptron: an input layer of nodes taking the inputs, a hidden layer, and an output layer of nodes giving the outputs.

Training

The neural network learns to recognize patterns in a way that closely resembles human learning. Initially, the network is built and random values are ascribed to the weights of all the connections. The commonest training method is supervised learning, in which the designer has available a data set, called a training set, where the 'correct' output answers are known for the values of a given input. So, for instance, in the simple network shown in Figure 2, the outputs might be apple or pear and the inputs could be shape, colour, size and weight. The values of the inputs for a pear would be put onto the inputs and the outputs examined in the light of knowing that the pear output ought to be 1 and the apple output 0. The difference between this correct state and the actual state would be calculated and the weights of the network adjusted according to some rule to bring the outputs nearer to the correct answer.

The rule that adjusts the weights is called a credit assignment algorithm. These come in all shapes and sizes, but the best known is that invented by Paul Werbos,7 the back-propagation algorithm. So common is the use of back-propagation in MLPs that they are sometimes incorrectly referred to as back-propagation networks. Back-propagation assumes that all the weights contribute equally to the error and corrects all of them proportionately. When the weights have been adjusted, a new example from the training set is selected and the process is repeated. This goes on until changes in the weights, or some other measure of network stability such as the mean error on a series of examples, are minimized. The network is then trained: the credit assignment algorithm is disabled and the weight values are fixed.

The alternative to supervised training is unsupervised training, in which the 'correct' output is not known. In unsupervised training, the credit assignment algorithm has to decide the differences between different input data sets on the basis of some predetermined measure internal to the algorithm. The most commonly used network of this kind is the Self-Organizing Map (SOM) invented by Kohonen.8 The SOM does what its name suggests: it takes the input data set and clusters, or maps, that data set so that similar examples are close to one another. A SOM can be very instructive in showing patterns in data that a human observer might have difficulty seeing. However, there is no causal reason for any map to cluster the inputs into a pattern that makes sense to a human.

Whichever type of learning a particular type of neural net uses, the problem of when to stop training inevitably arises. In a network such as an MLP, along with this problem goes the question of how many nodes to have in the hidden layer. To understand what is at stake in making these decisions, we have to understand how a neural network, or any other form of statistical classifier, makes a classification.
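The sketch below (hypothetical Python/NumPy code, not taken from any of the systems discussed; the apple/pear training set is invented) shows one minimal form of supervised learning with a back-propagation style weight update on a small 4-5-2 style network, assuming sigmoid transfer functions and a squared-error measure. It is a sketch of the general idea only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Invented training set: 4 features (shape, colour, size, weight) -> [apple, pear] targets
X = rng.random((40, 4))
y = np.zeros((40, 2))
y[X[:, 1] > 0.5, 0] = 1.0    # call the 'redder' fruit apples, purely for illustration
y[X[:, 1] <= 0.5, 1] = 1.0   # and the rest pears

W1 = rng.normal(scale=0.5, size=(4, 5))   # input -> hidden weights (random start)
W2 = rng.normal(scale=0.5, size=(5, 2))   # hidden -> output weights
lr = 0.5                                  # learning rate

for epoch in range(2000):
    # Forward pass through the network
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    err = y - out                          # difference from the 'correct' outputs
    # Back-propagate the error and adjust the weights (credit assignment)
    d_out = err * out * (1 - out)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 += lr * h.T @ d_out
    W1 += lr * X.T @ d_hid

print("mean absolute error after training:",
      np.abs(y - sigmoid(sigmoid(X @ W1) @ W2)).mean())
```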

Fig. 3 Feature space for a hypothetical classifier. (A) Apples (A) and pears (P) plotted by colour (red to yellow) against size (small to big), with a single straight-line decision boundary and some misclassified pears. (B) The same data separated by the more complex decision boundaries that a neural network can draw.

Figure 3 shows our hypothetical apples and pears classification problem. We have found two variables, colour and size, to act as our inputs, and Figure 3 shows what the clusters in a scatter plot of their values look like. The inputs are called features: they are the characteristics that are used to make the classification. A diagram such as Figure 3 is a diagram of feature space, since it shows the relative positions of the classes of the data mapped according to their features. Producing a good classifier is a matter of finding the decision boundary that divides the apples from the pears. In the case of Figure 3(A), a simple straight-line decision boundary provided by a logistic regression classifier will not suffice. The problem requires a more complex decision boundary and justifies the use of a neural network. As shown in Figure 3(B), the neural network is able to 'draw' more complex decision boundaries and thereby make a better classification.

The complexity of the boundary that the network can determine grows with the number of training cycles used and with the number of nodes in the hidden layer, or even with the number of hidden layers. At first sight it might appear that the greater the number of nodes and the more training, the better; this is not the case. The more fixed and complex the decision boundary, the less well the network performs on unseen data. The boundary becomes too specific to the training data used to be effective. It is overfitted: the model becomes brittle and fails to 'generalize' well, that is, to perform well on unseen data near to the decision boundaries. Essentially, good generalization is the ability to interpolate or extrapolate well.

Various techniques are adopted to prevent this effect. A common one is to stop the training early to maintain good generalization. This is usually done at the point where no more improvement in performance is perceptible. Some types of network, notably the cascade-correlator MLP, grow their hidden layer one node at a time, adding nodes and training just the weights associated with each individual new node until the addition of another hidden node shows no more improvement in performance. This reduces the chance of overfitting. Another method is to train a large fully-interconnected network and then systematically eliminate weights until acceptable performance is found; this technique is called weight pruning. Entirely different techniques have been tried to set the weights, replacing the credit assignment algorithm altogether; one such approach uses a genetic algorithm to search for the set of weights that gives the best performance.9 Whatever the technique, one property holds for all neural network methods: they are essentially data hungry, and the number of training data points determines the maximum possible accuracy of a neural network.
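A minimal sketch of early stopping, the first of the techniques mentioned above, is given below. The `network.train_one_epoch` and `network.error_on` methods and the `weights` attribute are hypothetical placeholders for whatever training and scoring routines are actually available; the code simply keeps the weights that gave the lowest error on a held-out validation set and stops once no improvement has been seen for a while.

```python
import copy

def train_with_early_stopping(network, train_set, val_set, max_epochs=500, patience=20):
    """Stop training when validation error stops improving, to preserve generalization.

    `network.train_one_epoch`, `network.error_on` and `network.weights` are hypothetical
    stand-ins for the training, scoring and weight-storage routines in use.
    """
    best_error = float("inf")
    best_weights = copy.deepcopy(network.weights)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        network.train_one_epoch(train_set)       # one pass of credit assignment
        val_error = network.error_on(val_set)    # performance on unseen data
        if val_error < best_error:
            best_error = val_error
            best_weights = copy.deepcopy(network.weights)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # no perceptible improvement

    network.weights = best_weights               # keep the best-generalizing weights
    return network
```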

Testing

Once the network has been trained, it can be used to classify unknown input data sets. When such a set of input values is applied, the trained network gives an output based on what it has learned and classifies the unknown input data set as either an apple or a pear, or places it close to one of its established clusters. It is thus implicit in neural networks that the performance of the network should be tested on unseen data, i.e. examples of data that have not been used in the training of the network. The commonest way to test the performance of the network is therefore to present it with a randomly-selected test data set, usually partitioned from the available data at the start of training. If the solution is to be deployed in a clinical environment, then a third validation set may well be required to use with the full system.

When data is scarce, which is often the case in medical applications, the alternative may be to use a hold-out method of testing. When the best design of network has been determined, all the data bar one result is used for training. The trained network is then tested against this data point, which has been held back. The data point is replaced in the training data set and a second hold-out data point is selected. A new network with the same design is trained and the test is repeated, and so on. The performance of the network design on the held-back data points gives an estimate of the performance that would be achieved by a network of that design trained on all the available data. This technique can be used with any type of classifier, not only neural networks.
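The sketch below illustrates the hold-out (leave-one-out) testing procedure described above, written in Python. `build_and_train` and `classify` are hypothetical stand-ins for the routines that would build, train and apply a network of the chosen design; the estimate returned is simply the proportion of held-back examples classified correctly.

```python
def leave_one_out_accuracy(data, labels, build_and_train, classify):
    """Estimate the performance of a network design by holding out one example at a time.

    `data` and `labels` are plain Python lists. `build_and_train(train_x, train_y)` and
    `classify(model, x)` are hypothetical stand-ins for whatever training and prediction
    routines are in use.
    """
    correct = 0
    n = len(data)
    for i in range(n):
        # Hold back example i; train a fresh network of the same design on the rest
        train_x = data[:i] + data[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = build_and_train(train_x, train_y)
        # Test on the single held-back example
        if classify(model, data[i]) == labels[i]:
            correct += 1
    return correct / n
```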

Advantages and disadvantages

The advantages and disadvantages of neural network classifiers are summarised in Table 1. When applied to an appropriate problem, the advantages of neural network models outweigh their disadvantages.

Table 1 Advantages and disadvantages of artificial neural network models

Advantages: cope with high degrees of non-linearity; able to deal with different types of data equally; simple to implement; robust to missing data; universal; reliable software available.

Disadvantages: data hungry; black-box models; easily abused; need good pre-processing of data.

The fact that they learn, and that the architecture of nodes and weights can be almost infinitely varied, means that they can cope with high degrees of non-linearity between inputs in multiple dimensions. This capacity for highly non-linear modelling allows them to cope with mixtures of dichotomous, non-parametric and parametric data more easily than many other statistical techniques. They are often said to be robust to missing data, since they can obtain information given by one data point from others in the set. This is partially true, but neural network models can be very brittle in the face of missing data if there is little redundancy of information in the input data. They are universal in the sense that they can be implemented as computer code, electrical circuits or mechanical systems. Most neural network applications are implemented as computer code, and in this form there is now reliable, user-friendly software available.

Apart from their data-hungry nature, the most significant disadvantage is the difficulty of understanding the final model. The trained network is essentially a 'black box', the internal workings of which can appear hidden. The weights contain the same information as any other classifier, but you cannot write a simple equation for the output probability of one class or the other for a neural network model as you could for a logistic regression model built on the same data. Also, since they are essentially the result of a dynamic statistical process, there is no guarantee that two examples of the same type and structure of network made from the same data will be identical, though they will be within determinable statistical limits of one another. Their implementation in safety-critical situations is therefore problematical, and setting confidence limits on accuracy is very important.

Their other two main disadvantages are the mirror images of the advantages. Because they are so powerful and relatively easy to use, they can easily be abused. Users should remember the old computing axiom GIGO: garbage in, garbage out. They should be used critically, especially when feature selection and pre-processing are being considered. In any neural network application, feature selection and pre-processing of the data account for 80-90% of the project. Since neural networks can and do find patterns in any data set, if they are to find meaningful patterns then the data used must be of the highest quality. Missing data, common in many medical problems, has to be adequately accounted for and replaced if appropriate. As with curve fitting, the features used must have some phenomenological relationship to the output. Statistics is no substitute for common sense.
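As a simple illustration of replacing missing data before training, the sketch below (hypothetical Python/NumPy code, not a recommendation from the article) fills missing entries, marked as NaN, with the mean of the observed values in the same feature column. More careful, model-based imputation may well be preferable in practice.

```python
import numpy as np

def impute_missing(X):
    """Replace missing entries (NaN) in each feature column with that column's mean."""
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        if missing.any() and not missing.all():
            col[missing] = col[~missing].mean()   # column mean from the observed values
    return X

# Hypothetical data: two laboratory variables with some values missing
X = np.array([[7.2, 140.0],
              [np.nan, 135.0],
              [6.9, np.nan]])
print(impute_missing(X))
```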

Neural networks prefer inputs where the information content of the input is distributed evenly across the values of the input. As a result, a neural network may well work much better if the input data value range is transformed in some non-linear way before input. By doing this, the user is reducing the amount of non-linearity that the network will have to learn and maximizing the use of the training data in learning the really difficult non-linearities. If all the inputs could be pre-processed perfectly in this way, then a neural network model would not be required, so practical pre-processing may be a matter of sensible compromise.

The most difficult problem is the selection of the features to be input into the network. There is no single recognized statistical method for determining which features will make the best classifier. Feature selection methods used in medical applications have included multivariate statistics, cluster analysis, Principal Component Analysis (PCA), Classification and Regression Trees (CART), genetic algorithms, the Fast Fourier Transform, wavelet transforms and the use of other neural networks, especially the Kohonen SOM, as a pre-processor. In the end, as noted by Ripley,10 there may be no substitute for trying them all systematically.
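The sketch below illustrates the kind of non-linear pre-processing described above: a heavily skewed input is log-compressed and then rescaled linearly onto the -1 to 1 range so that its information content is spread more evenly. It is illustrative Python/NumPy code only; the raw values are invented and the choice of a log transform is an assumption, not a procedure taken from the article.

```python
import numpy as np

def rescale_skewed_feature(values):
    """Spread the information content of a skewed input more evenly before training.

    Log-compress the raw values, then map them linearly onto [-1, 1].
    """
    v = np.log1p(np.asarray(values, dtype=float))   # non-linear compression of the long tail
    lo, hi = v.min(), v.max()
    return 2.0 * (v - lo) / (hi - lo) - 1.0          # linear rescaling to [-1, 1]

raw = np.array([1.0, 2.0, 3.0, 10.0, 250.0])         # hypothetical skewed laboratory values
print(rescale_skewed_feature(raw))
```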

Three neural network case studies

Anaesthesia breathing system failure detection

The provision of smart alarm systems, where data is fused from various sources to reduce false alarms, is an area where neural networks have been particularly successful. In developing the Utah Anaesthesia Workstation, Westenskow et al.11,12 produced a series of neural network based smart alarm systems for the detection of breathing system failure. The models used the capnograph, pressure and flow waveforms gathered from the patient connector of a circle breathing system. The waveforms were processed and 30 features extracted. Supervised training with a back-propagation MLP was used with the data from over 2000 simulated failures collected in the laboratory. Normal function plus 13 failure conditions were learned. On tests with simulated failures and on animal tests with dogs, the correct classification rates for the system were shown to be 95% for controlled ventilation and 86.9% for spontaneous ventilation. False alarm rates were reduced to less than 0.3% for controlled ventilation and less than 3.8% for spontaneous ventilation. Testing of response times by clinicians to simulated faults showed an increased speed of response of 62% (average time decreasing from 67 s to 17 s) for seven of the most important faults. In later versions, hierarchies of neural networks were developed to monitor compliance/resistance status, fault detection and, finally, fault diagnosis. This three-tiered structure did not give significantly better classification results, but did cope with 23 different faults.

The Utah experience shows the great flexibility of neural network approaches. The systems developed were computationally efficient and easily retrained in the light of new practices or changes in breathing system design. Feature extraction and signal pre-processing owed a lot to the expertise of the team at Salt Lake City, but used simple, robust clinical sensors. In competitive tests with logical-rule-based systems, the neural systems developed performed significantly better.
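The sketch below is only a schematic illustration of the general pattern used by such smart alarm systems: summary features are extracted from the breathing-system waveforms and passed to a trained classifier that scores normal function against each fault condition. The four features shown and the `trained_mlp` callable are hypothetical; they are not the 30 features or the networks used in the Utah system.

```python
import numpy as np

def waveform_features(co2, pressure, flow):
    """Reduce raw capnograph, airway pressure and flow traces (NumPy arrays sampled
    over one breath) to a few summary features. Illustrative features only."""
    return np.array([
        co2.max(),                 # end-tidal CO2 estimate
        pressure.max(),            # peak inspiratory pressure
        pressure.min(),            # baseline pressure estimate
        np.trapz(np.abs(flow)),    # total gas movement over the breath
    ])

def classify_breath(trained_mlp, co2, pressure, flow, labels):
    # `trained_mlp(features)` is a hypothetical stand-in for a trained network
    # returning one score per condition (normal plus each fault class).
    scores = trained_mlp(waveform_features(co2, pressure, flow))
    return labels[int(np.argmax(scores))]
```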

Prediction of ICU mortality

Several workers have attempted to use neural networks to provide improved ICU scoring systems and predictors of mortality. Dybowski et al.9 used an MLP neural network architecture to predict ICU outcome in patients presenting with inflammatory response syndrome and haemodynamic shock. The data used were from Guy's Hospital, and the training and test data set sizes were 168 and 90 respectively. Feature selection was performed using a combination of CART and stepwise logistic regression. Initially, 157 candidate features were available. CART reduced these to 11 and the logistic regression to nine; however, only two were in both sets. The final set of features used in the network is shown in Table 2.

Table 2 The features used for ICU prediction (Dybowski et al.9)

Features selected by CART: planned postoperative monitoring; blood urea concentration; primary indication for admission: neurological; total number of days with one or more organ failures; total number of days with renal failure; urine output; serum pH; acid-base balance disturbed; has respiratory failure occurred today?; age; non-cardiogenic pulmonary oedema; race; cardiothoracic surgery; admitted after elective surgery.

Features selected by logistic regression: planned postoperative monitoring; blood urea concentration; primary indication for admission: gastrointestinal; total number of days with renal failure; chronic cardiovascular disease; post cardiac or respiratory arrest; urine output; arterial pH; precipitated by acid-base balance disturbance; has respiratory failure occurred today?; age; PaCO2; non-cardiogenic pulmonary oedema; race; admitted after elective surgery.

Included in final network: planned postoperative monitoring; blood urea concentration; indication for admission: neurological; indication for admission: gastrointestinal; total number of days with one or more organ failures; chronic cardiovascular disease; post cardiac or respiratory arrest; PaO2.

Dybowski et al.'s method of arriving at a final neural network architecture was unusual. Instead of using a conventional MLP with a back-propagation approach, the group chose to use genetic algorithms to search for the best design. A large number of neural networks were created with different weights and architectures. Their characteristics were then expressed as a string of variables constituting a 'chromosome', in which every 'locus' described a substructure of the neural network. Using the principles of genetic algorithm searching, these chromosomes were copied to form a new generation of networks, in which the proportion of chromosomes contributed by one net is proportional to the performance of the original net. The generation was pruned by selecting only the fittest and the process was repeated until no further improvement in network performance was observed, usually at about the 7th generation. The best performing network was then selected.

The final network was a 17-3-7-4-1 design. This gave a correct classification rate of 86% and an area under a receiver operating characteristic (ROC) curve of 0.86 (0.5 represents random classification), compared to 0.75 for a logistic regression approach on the same data. The application demonstrates the success of the neural network in combining disparate data types into a consistent whole. However, the final architecture seems rather unwieldy to be supported by only 168 training examples. The effort required for the feature selection is typical of a well-performed study. For more conventional approaches using APACHE or PRISM type data, the correct classification rates have been comparable to logistic regression approaches for adult data (area under ROC curve: 0.83 logistic regression; 0.86 neural network) and superior for neonatal data (area under ROC curve: 0.954).13
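The sketch below illustrates the genetic-algorithm idea described above in a much simplified form: candidate architectures are encoded as 'chromosomes' (here just lists of hidden-layer sizes), scored, and the fittest are kept and mutated to form the next generation. It is not Dybowski et al.'s implementation; `build_train_and_score` is a hypothetical routine that would train a network of the given architecture and return its performance, and truncation selection is used in place of strict fitness-proportional copying.

```python
import random

def evolve_architecture(build_train_and_score, generations=7, population=20, seed=0):
    """Search over hidden-layer architectures with a simple genetic algorithm.

    Each chromosome is a list of hidden-layer sizes, e.g. [3, 7, 4].
    `build_train_and_score(chromosome)` is a hypothetical routine that builds a network
    with that architecture, trains it and returns a fitness score (higher is better),
    e.g. the correct classification rate on a test set.
    """
    rng = random.Random(seed)

    def random_chromosome():
        return [rng.randint(2, 10) for _ in range(rng.randint(1, 3))]

    def mutate(chrom):
        child = [max(1, n + rng.choice([-1, 0, 1])) for n in chrom]
        if rng.random() < 0.2:
            child.append(rng.randint(2, 10))      # occasionally grow an extra layer
        return child

    pop = [random_chromosome() for _ in range(population)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        scored = sorted(((build_train_and_score(c), c) for c in pop), reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        # Keep only the fittest, then refill the generation from those parents
        parents = [c for _, c in scored[: population // 4]]
        pop = parents + [mutate(rng.choice(parents)) for _ in range(population - len(parents))]
    return best, best_score
```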

Self-organizing maps in ICU classification

Van Gils et al.14 have recently reported the use of a self-organizing map (SOM) for the characterization of the clinical state of patients in the ICU. Using a data library gathered from hospitals across Europe, a set of oxygen delivery disorders has been established. Prediction of two of these, sepsis (evolving into a high blood-flow state) and cardiac failure, has been investigated. Twenty trend data variables were defined by a group of clinicians as being indicative of the oxygen delivery disorders. These trend variables were used as the inputs to a Kohonen SOM. The total set of trend data was obtained from 58 patients, 39 of which were used in training and the remainder for testing. Individual data records were approximately 700 data points long, making a possible 28 000 individual input vectors for training. The intention was to use the trained network in an automatic disorder detector. All the variables were pre-processed by normalizing them linearly to a scale of -1 to 1.

The SOM maps the individual vectors to areas of a grid (12 x 16 in this case). Each cell can be identified as being in a sepsis state, a cardiac failure state or indeterminate. The higher the number of cases in each category mapped to each cell, the greater the probability that an unknown vector mapped to that cell indicates that disorder. If, after training, the individual variables that make up the vectors are mapped into the trained space, then those useful in determining one or other disorder should map to the areas of the grid previously identified.

Fig. 4 Simplified Kohonen self-organizing maps for ICU patient state data. The dark area on map A shows the area where cardiac failure cases were mapped by the network. Map B, on the same grid, shows the areas of high probability for mapping of cardiac index for the same data.

The performance of the network in identifying significant contributory variables to regions within the map was compared to that of an expert panel on the same data. The SOM and the expert panel agreed that three variables (cardiac index, pulmonary capillary wedge pressure and systemic vascular resistance) were important in indicating patient state. The other three identified by the SOM as being important (peripheral temperature, arterial pH and base excess) were also given high impact scores by the experts, but not as high as some other variables, particularly core temperature and the perfusion index (PaO2/FiO2). The SOM created in this study does give intelligible results. The question remains: is a feature indicated by the SOM as being important really important? In other words, who is right, the network or the panel of experts?
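A compact sketch of Kohonen SOM training on pre-normalized input vectors is given below, using the 12 x 16 grid size quoted above. This is generic, illustrative Python/NumPy code, not Van Gils et al.'s implementation; the learning-rate and neighbourhood schedules are arbitrary choices. After training, each grid cell could be labelled according to the disorders of the training cases mapped to it, as described in the text.

```python
import numpy as np

def train_som(vectors, rows=12, cols=16, epochs=20, seed=0):
    """Fit a Kohonen self-organizing map to pre-normalized input vectors.

    Returns a (rows, cols, n_features) grid of cell weight vectors; similar inputs
    end up mapped to nearby cells.
    """
    rng = np.random.default_rng(seed)
    n, d = vectors.shape
    grid = rng.uniform(-1, 1, size=(rows, cols, d))
    # Grid coordinates, used to compute neighbourhood distances on the map
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                          # learning rate decays
        radius = max(1.0, (rows + cols) / 4 * (1 - epoch / epochs))  # neighbourhood shrinks
        for x in vectors[rng.permutation(n)]:
            # Best-matching unit: the cell whose weight vector is closest to x
            dists = np.linalg.norm(grid - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Pull the best-matching unit and its neighbours towards x
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            grid += lr * influence[..., None] * (x - grid)
    return grid

def map_vector(grid, x):
    # Cell to which an (unseen) input vector is mapped
    return np.unravel_index(np.argmin(np.linalg.norm(grid - x, axis=-1)), grid.shape[:2])
```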

Conclusions

The number of systems relying on neural network applications will rise, particularly in anaesthesia and critical care, where their ability to fuse disparate, highly non-linear data is particularly useful. For the clinician interested in using them, it is recommended that software recognized as being reliable is used, as this will reduce the possibility of misapplication. In my own research group we use NeuralWorks Predict software for preliminary investigation. Predict has a Microsoft Excel interface and uses a cascade-correlator MLP with automatic feature selection using genetic algorithms. For a variety of different networks, NeuralWorks Professional II+ can be recommended as being reliable. The SPSS range of software offers a limited neural network toolbox and, for those who want a free shareware system with a very good manual, the Stuttgart Neural Network Simulator can be recommended. Before starting on a project, it is suggested that the 1995 Lancet mini-series of articles on the subject is read.15-17

Werbos's thesis had the title Beyond Regression. He saw neural networks as an experimental tool for revealing relationships in data that would complement regression. Neural network development has reached the point where they can be used by the non-specialist for this purpose, provided due care is taken.

References

1. McCulloch W S, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 1943; 5: 115-133.
2. Hodges A. Alan Turing: the enigma. Chapter 5. Random House, London, 1983.
3. Turing A M. On computable numbers, with an application to the Entscheidungsproblem. Proc Lond Math Soc 1937; 42: 230-265.
4. Rosenblatt F. Principles of neurodynamics. Spartan, New York, 1962.
5. Minsky M, Papert S. Perceptrons. MIT Press, Boston, 1969.
6. Hinton G, Sejnowski T. Optimal perceptual inference. Proceedings of the IEEE Conference on Neural Networks III. IEEE, New York, 1983; 448-453.
7. Werbos P. Beyond regression. PhD Thesis, Harvard University, 1974.
8. Kohonen T. Self-organized formation of topologically correct feature maps. Biol Cybern 1982; 43: 59-69.
9. Dybowski R, Weller P, Chang R, Gant V. Prediction of outcome in critically ill patients using artificial neural network synthesised by genetic algorithm. Lancet 1996; 347: 1146-1150.
10. Ripley B D. Pattern recognition and neural networks. Cambridge University Press, Cambridge, 1996.
11. Orr J A, Westenskow D R. Evaluation of a breathing circuit alarm system based on neural networks. Anesthesiology 1990; 73(3A): A445.
12. Loeb R G, Brunner J X, Westenskow D R, Feldman B, Pace N L. The Utah anesthesia workstation. Anesthesiology 1989; 70: 999-1007.
13. Zernikow B, Holtmannspoetter K, Michel E et al. Artificial neural network for risk assessment in preterm neonates. Arch Dis Child Fetal Neonatal Ed 1998; 79: F129-F134.
14. Van Gils M, Jansen H, Nieman K, Summers R, Weller P R. Using artificial neural networks for classifying ICU patient states. IEEE Engineering in Medicine and Biology 1997; 16(6): 41-47.
15. Cross S S, Harrison R F, Kennedy R L. Introduction to neural networks. Lancet 1995; 346: 1075-1079.
16. Baxt W G. Application of neural networks to clinical medicine. Lancet 1995; 346: 1135-1138.
17. Dybowski R, Gant V. Artificial neural networks in pathology and medical laboratories. Lancet 1995; 346: 1203-1207.