Analytica Chimica Acta, 235 (1990) 53-63
Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands

Can an instrument learn from experiments done by itself?

J. ZUPAN
Boris Kidrič Institute of Chemistry, P.O. Box 380, YU-61001 Ljubljana (Yugoslavia)
(Received 25th July 1989)
ABSTRACT

The hardware and software requirements for intelligent analytical instruments are discussed. Existing technology in both fields allows the production of a new generation of instruments that can learn from their own operation and even transfer the acquired knowledge to machines of a later generation. The hardware needed is a processor with the power of at least the Intel 80386 processor, 4-5 Mbyte RAM, storage equivalent to a 1.6-Gbyte hard disk and a special encode/decode processor for compressing complex instrumental information. This hardware can support the transmission, transformation, storage and handling of about 100000 complex measurements (up to 3 kbyte of numbers per measurement) and a knowledge base of about the same size. In this context, the space for software can be neglected or, in the case of neural networks, the "software" can be hardwired. At present, there are two different approaches to handling the required amount of data, extracting vital information, and acquiring (in close cooperation with experts) new knowledge. The first is neural networks and the second is hierarchical clustering. It is argued that the combination of both methods forms a very powerful basis for software development in intelligent instruments.
In recent years, it has become apparent that computer-supported instruments (notably spectrometers of different kinds), which produce enormous amounts of numeric data in increasingly shorter periods of time, do not exploit the available data and data-handling capabilities in an adequate manner. The matter to be considered is not the handling of individual measurements or experiments (transformations, baseline corrections, peak detection, integration, deconvolution, etc.), but the discarding of knowledge that might be extracted from the thousands of experiments done on such an instrument during its lifetime. The questions raised are: can an instrument learn from the experiments which it does and, if it can, how should such self-learning systems be set up, monitored, guided and finally used to their full potential? During the lifetime of an analytical instrument (3-5 years), a hundred thousand or more
measurements can be made; with current computer technology, all of them can be stored on a single hard disk. This huge data collection should not be stored for no purpose or just for occasional retrieval. More sophisticated uses should be found, and valuable answers to immediate simple or complex problems might be extracted if appropriate organization, planning and design of data manipulation were implemented. It is clearly beyond the capability of most current instruments to provide facilities that recognize faulty recordings instantly, pinpoint their cause, warn the operator and suggest remedies for the problems (e.g., increase the sensitivity, preprocess the samples, use different solvents or tracers), detect interfering components in samples and subtract their effects, adjust the measurements according to previous findings, provide a quality check of the measurement (based on previous results), suggest why the sample lies outside the
tolerance regions, suggest the appropriate chemical structure or components of an unknown sample, and predict properties of the sample. There are, of course, other similar problems that modern instrumentation should be capable of handling, at least to some extent, and the endeavours of producers should be directed towards building more adaptable knowledge into the instruments. Solutions to the above-mentioned problems are at present being considered, explored, and sometimes solved (under very constrained conditions, however) in the special area of artificial intelligence called expert systems. Unfortunately, this type of research has generally few links with the developers of instrumental software, which is mainly designed for handling individual experiments rather than large amounts of complex data. Obviously, the problem of handling large amounts of data (storage, retrieval, comparison, evaluation, etc.) for purposes such as learning, recognition and prediction requires special techniques which are not encountered in standard programming practice. The extraction and accumulation of knowledge must be done through years of running the instrument on a variety of different samples with permanent checking, evaluation, and re-evaluation of stored data. Additionally, the supplementation of recorded experiments with other information (experimental conditions, sample description, properties of the sample, relations of the properties of a sample to the properties of other samples, etc.) provided by the operators and experts must be included. The expanding knowledge data base should be formatted in a standardized manner independent of the brand of instrument and thus be transferable to the computers of other instruments (at least among instruments from the same producer). It is attractive to speculate about the possibilities offered by instruments which would automatically upgrade existing knowledge and provide the user with new relationships between experimental data and the findings on structural and other properties of measured samples but, before the design of such an instrument can be contemplated, two fundamental problems must be considered: first, the capability of the present state of computer hardware technology to build such instruments
economically; second, the present knowledge about procedures for handling hundreds of thousands of pieces of complex information to yield adequate answers in real time. In principle, these problems are amenable to solution, but there are still some cost/performance issues that could hinder an immediate solution to the general problem. In the present paper, the goals of intelligent instruments are pinpointed, the basic hardware requirements are listed, and the two known methods which are capable of handling large amounts of data (neural networks and hierarchical clustering) are discussed.
THE PURPOSES OF SELF-LEARNING INSTRUMENTS
Before the hardware and software details are considered, it is necessary to take a closer look at the general aims of self-learning instruments. Once the question of what to do is agreed on, the answers on how to do it (i.e., what kind of hardware and what procedures to use) will reduce the possible choices to a few particular methods. For the sake of simplicity, this discussion will be limited to instruments that yield experimental data for a sample i in the standardized form of the measurement vector X_i [1,2]:
X_i = (x_i1, x_i2, x_i3, ..., x_im)    (1)
In the above form, the components x_ij are measured variables of the sample i. They can be the intensities at specified wavelengths j (if the instrument is a spectrometer), a two-dimensional matrix with zero-one (or grey) elements from a video camera, recordings from different origins (e.g., in multi-element chemical analysis), or responses distributed over the entire measurement range sampled at m intervals from any other instrument. It should be emphasized that the arguments and reasoning given in this paper can be applied to most instruments giving multichannel signals. For the purposes of discussion, an intelligent instrument is assumed to be already available and to have provided a measurement X_i in the above form. The expected activities of an intelligent
instrument can be roughly divided into three categories: (1) classification of measurement X into one of several prespecified classes; (2) content-dependent retrieval of measurements similar to measurement X from the entire data collection; and (3) prediction of the structure and/or properties of the sample giving measurement X, based on the recorded data and the knowledge base of the instrument.

Examples of the first category are information about the experiment itself and information about the conditions of recording. The user is mainly interested in the following features: (1) the acceptability of the recorded data in all channels within given limits (or standard expectations); (2) any indication on the recording of a possible malfunction of the instrument; (3) the acceptability of the relevant experimental properties, features and parameters (baseline, drift, range, sensitivity, etc.) for a given experiment; and (4) whether or not the recorded experiment indicates (or falls into) one of the special (predefined) cases which the user is expecting. Some other special features can also be placed in this category, e.g., an automatic association of the measured sample (as one of thousands that routinely undergo quality control) with prespecified sample(s) or pattern(s), detection of outliers on the basis of standard patterns, etc.

The second category represents information concerning the identification of the sample. Typical queries would be concerned with the presence of an identical or very similar measurement in the data collection; if there is such a measurement, then how can the best possible match and/or the list of the most relevant group of previously recorded data be retrieved for inspection; where is the best place in the existing data collection for the update with the new measurement; and how can the supplementary information provided by the user expand the knowledge base in the most efficient way.

The third category concerns the predictions that can be made about the properties of the sample on the basis of the recorded experiment and on previous similar recordings stored in the instrument. The answers are obtained from the knowledge base that is gradually built up, by using the
sophisticated data-handling methods known in statistics, chemometrics, and artificial intelligence (multifactorial analysis, principal component analysis, clustering, optimization, pattern recognition, etc.). The aims of these procedures are to associate the measured data with classes of already existing pieces of knowledge, thus yielding valuable information, and to use these data to increase the existing knowledge base (self-learning). The attainment of all the above-mentioned activities requires not only large permanent storage on hard disks and fast communication between data, but also procedures for handling large amounts of multivariate data that allow parallel processing. In the initial development of new instruments, parallel processing will be implemented via software, but software should later be replaced by hardwired parallel processing. Hence, data-handling methods that intrinsically lack the facility of parallel processing are, in the long term, of little value for intelligent instruments.
HARDWARE
The most obvious data-handling features that an intelligent instrument must have are: (1) multichannel output and a fast bus for data transfer; (2) large permanent storage capacity (disk) for storing all data produced during the lifetime of the instrument; (3) modules (hardwired or software) for parallel data processing; and (4) feed-back control from the user and knowledge base to the instrument. Figure 1 shows a possible basic scheme for the hardware configuration of an intelligent instrument that follows the above guidelines. The multichannel output typically consists of 10^3-10^4 values; a Fourier transform infrared spectrometer, for example, outputs about 3800 intensity values for a spectrum in the range 4000-200 cm^-1. The multichannel output is called a measurement vector X_i (Eqn. 1), x_ij representing the value recorded on channel j in the ith experiment. The output X_i, which is produced in parallel (or is sampled component by component in a buffer), is transferred to the data-compression module and then to the hard disk via a delay module.
Fig. 1. A scheme for an intelligent instrument. The multichannel signals are carried from the classical instrument to the compress and delay modules. In the compress module, the m-dimensional experiment is compressed to a p-dimensional vector, p being much smaller than m. In the learning block, the shorter vector is processed in order to find the best location for the experiment in the existing data collection and knowledge base. After the most appropriate location and connections have been found and the experimental information has been supplemented with other information given by the user, the complete experiment (waiting in the delay module) and the other information are stored on the disk in a content-dependent manner.
The compression transforms the m-dimensional input X_i to a p-dimensional output Y_i which is one or two orders of magnitude shorter, preserving a large proportion, say 80-90%, of the relevant information:

Y_i = (y_i1, y_i2, y_i3, ..., y_ip)    (2)

where p << m [1,2]. The compression module can use any suitable data-reduction algorithm (e.g., the fast Fourier or fast Hadamard transform with cut-off of high-frequency terms [3]) and can be built in as a specially designed processor or implemented to run on the main processor as part of the software of the system. The compression module must be able to produce differently compressed outputs from the same input if required. Instruments producing low-dimensional outputs do not necessarily need the compression module, although parallel processing on short vectors is more cost-effective than on large ones. The delay module is a simple permanent buffer where the actual measurement vector X_i is stored temporarily; in some applications, vectors from a one-day measurement backlog may be stored.
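As an illustration of the compression step, the short sketch below (Python with NumPy, written for this text rather than taken from the paper) reduces an m-dimensional measurement to p low-frequency Fourier coefficients, one of the data-reduction schemes mentioned above; the function names and the choice of p are assumptions made only for this example.

```python
import numpy as np

def compress(x, p):
    """Reduce an m-dimensional measurement x to p low-frequency Fourier
    coefficients (one possible realization of the compress module)."""
    coeffs = np.fft.rfft(np.asarray(x, dtype=float))  # frequency-domain representation
    return coeffs[:p]                                  # keep only the p lowest-frequency terms

def decompress(y, m):
    """Approximate reconstruction of the original m-point measurement,
    with the cut-off high-frequency terms set to zero."""
    full = np.zeros(m // 2 + 1, dtype=complex)
    full[:len(y)] = y
    return np.fft.irfft(full, n=m)

# Example: a 4000-point "spectrum" reduced to 200 coefficients (about 5% of m).
x = np.random.rand(4000)
y = compress(x, 200)
x_approx = decompress(y, 4000)
```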
Vector X_i waits in the delay module for the results from the learning block, which produces the decision on where and how the original X_i should be placed in relation to other experiments already stored in the data and knowledge bases. The reduced signal Y is transferred to the learning block, which consists of the processor, the knowledge base resident on a disk, the supervising and managing software and the instrument controller, and is linked to the terminal for the user's intervention. The functioning of the learning block (inputs, outputs, and interaction with the user and instrument) is discussed in the software section.

In the design of intelligent instruments, two main parameters must be considered. First, enough space should be provided for all experimental results that are recorded during the lifetime of the instrument; secondly, the processing of each experiment, i.e., the response of the instrument to the measurement, should be done in acceptable real time. Given that the life span of an instrument is about 1000 operating days (3-5 years) and that during each operating day about 200 experiments (each represented by, say, 4000 values) are recorded, about 1.6 Gbyte (1000 x 200 x 4000 x 2 bytes) of permanent memory is needed for storing all the data. This is well within the storage capacity of modern hard disks. Additionally, space is needed for the compressed data, but this requirement is at least one order of magnitude less than the space for the original data. The space required for the reduced data is important because it sets a lower limit to the space that must be allocated to the learning block where the real-time handling activities (searching, comparing, evaluation, ranking, feature extraction, learning, etc.) must be done. According to present standards, the requirement for a 100-Mbyte RAM (one order of magnitude less than the storage on the disk) in an instrument seems excessive; thus the manipulation of data has to be implemented via file manipulation. This makes the real-time processing slower, but much smaller RAMs (4-5 Mbyte) can be used. However, with the development of new parallel architectures and new processors, large RAM modules can be produced cheaply enough to achieve acceptable cost/performance ratios.
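The back-of-the-envelope sizing in the preceding paragraph can be written out explicitly; the sketch below simply restates the paper's own assumptions (1000 operating days, 200 experiments per day, 4000 two-byte values per experiment, roughly tenfold compression), and the variable names are invented for this illustration.

```python
DAYS = 1000                 # operating days over the instrument's lifetime
EXPERIMENTS_PER_DAY = 200
VALUES_PER_EXPERIMENT = 4000
BYTES_PER_VALUE = 2
COMPRESSION_FACTOR = 10     # reduced vectors are ~one order of magnitude shorter

raw_bytes = DAYS * EXPERIMENTS_PER_DAY * VALUES_PER_EXPERIMENT * BYTES_PER_VALUE
compressed_bytes = raw_bytes // COMPRESSION_FACTOR

print(f"raw data on disk:          {raw_bytes / 1e9:.1f} Gbyte")    # -> 1.6 Gbyte
print(f"compressed data to handle: {compressed_bytes / 1e6:.0f} Mbyte")  # -> 160 Mbyte
```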
The above estimate of the memory demand offers enough space for the implementation of both software approaches. A more detailed analysis of the needs will be given at the end of the paper. The second feature, which is essential for new-generation instruments, i.e., real-time handling of large numbers of experiments, is much harder to achieve. The software required depends on the data-handling approach chosen for the design of the instrument. Two approaches (neural networks and 3-distance hierarchical clustering) are considered below. It will be shown that both have specific merits and faults.
NEURAL NETWORK APPROACH

A neural network is a layered (single-, two- or three-layer) network of identical electronic circuits, each of which resembles or mimics (albeit in a primitive way) a neuron (nerve cell). Each neuron i consists of a set of m input lines, a vector of adaptable weights W_i (w_1i, w_2i, ..., w_mi), a procedure for calculating the net output Net_i on the basis of the inputs and weights, and a sigmoidal function yielding values between 0 and 1, the result of which is called the true output Out_i of the neuron. The neurons can be organized (interconnected) in many different ways: in one or more layers, their outputs can be connected to only a given neighbourhood or to all neurons, and they can receive feedback from outputs at a lower level or at the same level [4-8]. A scheme of a neuron and its sigmoidal output is shown in Fig. 2.

The neural network procedure starts with a set of N neurons (all, or only those in the top layer) obtaining an m-channel signal X (x_1, x_2, x_3, ..., x_m) simultaneously at their inputs. Each ith neuron, according to its weights w_ji, first yields a net output Net_i which is then normalized with, for example, a sigmoidal function, 1/[1 + exp(-x)], to give the final output Out_i of the neuron. In one design of neural network, the output signals can be connected to the inputs of other neurons on deeper levels, fed back to neurons on the same level, or only to neurons in a certain neighbourhood, or any combination of such interconnections. The procedure of propagating the input through the network is relatively fast, especially if done in parallel.
Fig. 2. Each neuron j consists of a number of adaptable weights w_ij which, depending on the input vector X (x_1, x_2, ..., x_m), yield an output Net_j. The same input vector X is simultaneously connected to all neurons on one layer. To resemble real neurons better, the output Net_j is normalized by using some kind of squashing function f(Net). Three possible functions f(x) are shown on the right: (a) the hard limiter; (b) the threshold logic f(x) = (1/a) max[0, min(a, x)]; (c) the sigmoidal function f(x) = 1/[1 + exp(-x)].
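A minimal sketch of the neuron just described is given below (Python with NumPy, written for illustration and not part of the original paper); it computes Net_j as the weighted sum of the inputs and squashes it with the sigmoidal function, variant (c) of Fig. 2. The input values and weights are arbitrary assumptions.

```python
import numpy as np

def sigmoid(net):
    """Squashing function (c) of Fig. 2: maps any net input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-net))

def neuron_output(x, w):
    """One neuron j: Net_j = sum_i w_ij * x_i, Out_j = f(Net_j)."""
    net = np.dot(w, x)        # weighted sum over the m input channels
    return sigmoid(net)

# Example: a neuron with m = 4 inputs and assumed weights.
x = np.array([0.2, 0.9, 0.1, 0.5])    # one (reduced) measurement vector
w = np.array([0.4, -0.7, 1.2, 0.3])   # adaptable weights w_1j ... w_mj
print(neuron_output(x, w))
```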
The result of the propagation of the input signal through the neural network is a set of output signals which is expected to be equivalent to one of the prespecified patterns. If this is not the case, the weights must be trained until the desired pattern is output. The training procedure consists of an iterative application of the input signal X to the neural network. In each iterative cycle, the weights are corrected by using one of several possible procedures [6] in which the amount of error in the output of each neuron is always taken into account. The training is repeated until all weights in the network are stabilized, i.e., until the outputs Out_i are either equal to the input or equal to some prespecified target pattern T within given tolerance limits. The way in which the weights are changed in order to achieve the desired output determines the type of neural network.

Back-propagation of error in neural networks
Classification of an object (measurement X) into one of several preselected classes is usually achieved after some supervised learning method [1,9] has been applied to a training set of objects and some kind of decision procedure has been obtained. The recently developed approach of feed-back error propagation in neural networks [10] seems to be very suitable for this purpose. Because the back-propagation algorithm is mainly parallel in structure and can easily be implemented and used in intelligent instrument design, a more detailed description is given below.

Figure 3 shows a 3-layer neural network through which the components of the reduced measurement vector Y (input) (Eqn. 2) are propagated. If the input belongs to one of the preselected categories (on which the neural network was trained), the component i of the third-layer output (Out^3_i) associated with the corresponding category should be "on" (1 or close to 1), while all others should be "off" (zero or close to zero). Typically, there should be as many Out^3 components as there are categories that the instrument ought to be able to distinguish. The training of the weights, or stabilization of the entire network, for detecting standard events should be done by the instrument producer, thus providing the user with an automatic built-in diagnostic procedure.
Fig. 3. Forward propagation of signals through a 3-layer neural network for prediction of categories. Each layer n is represented by an (m x p)-dimensional array of weights w^n_ij (p neurons, each having m inputs). Each output of a previous layer acts as an input component to the next layer of neurons. The final output is obtained after the signal propagates through three layers of neurons.
However, for the detection of user-specified events for solving particular problems, the user must have the possibility of defining the type of experiment for which the network should be trained to give signals (warnings). This definition of the type of experiment means the recording of an example or standard experiment rather than the input of a set of rules according to which the experiments should be classified. The procedure for recording the exemplary sample is exactly the same as for any other recording. The only difference from other measurements is that, after the recording has been made, the user invokes the learning procedure and tells the network that this is the pattern to be learned so that a response will be obtained on output channel i. The exemplary pattern can be the spectrum of a compound to be routinely detected in a series of experiments, an image of an uncorrupted pattern to be distinguished from faulty patterns, a complex analysis of an article with average properties, etc. The model experiment is presented to the neural network, which must learn to recognize it and react to
its occurrence by producing the required output (one signal lamp on, for example). The main underlying assumption is that the network will produce the specified output signal not only when an input identical to the model experiment is propagated through it, but also as a response to all similar inputs. This type of training is called learning by association [4,6]. The concept described requires a training procedure to adapt the network to a new pattern and either to forget previous experiments or (much better) to upgrade its knowledge base. A 3-layer neural network capable of performing supervised learning (the target output state for a certain type of input is defined in advance) is shown in Fig. 4. The algorithm for back-propagation of errors and changing of the weights is as follows [6,7]:

Repeat:
  for all neurons j on the third (output) layer:
    D^3_j = f'(Net^3_j)(T_j - Out^3_j)
    w^3_ij = w^3_ij + a D^3_j Out^2_i
  next neuron j
  for both remaining layers, n = 2 to 1:
    D^n_j = f'(Net^n_j) Σ_k D^(n+1)_k w^(n+1)_jk
    w^n_ij = w^n_ij + a D^n_j Out^(n-1)_i
  next n
until (T_j - Out^3_j) is less than a specified small value.
Here, Out^0 means the input, i.e., the reduced vector Y. Once the network has been stabilized, the output produced for inputs similar to the class representative used in the training should be identical to the target output with the prespecified component "on". The advantage of such an approach is that the user is enabled to define any experimental pattern (or a number of them) of his own choice and to train the instrument to recognize it automatically, and to notify the user (or correct the instrument or trigger other actions) when identical or similar patterns are encountered. It is evident that such an instrument can have many networks: for self-diagnostics, for diagnosis of faulty conditions, for detecting specific patterns or specific compounds, etc.
D3 zf’(Net3)(T-Outj)
Out3 T
Fig 4. Trammg (learnmg or adaptmg) the weights m a multilayer neural network with back-propagation of errors. The error that appears on the fmal output 1s expressed as the difference between the target T (t,, r2, , tp) and the actual output Out3(o,, 02, o,,) The corrections of weights W” that have produced the wrong output Our” are proportional to this error The errors D” are calculated for each layer starting with the latest one and then the corrections are propagated back through the network Weights w” on each level n are changed according to the equations shown at the r&t (a IS the learnmg rate constant)
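The following sketch (Python/NumPy, written for illustration and not taken from the paper) implements the back-propagation cycle given above for a small fully connected network of sigmoidal neurons; the layer sizes, the learning rate a, the tolerance and the single training pair are arbitrary assumptions made only for this example.

```python
import numpy as np

def f(net):                        # sigmoidal squashing function
    return 1.0 / (1.0 + np.exp(-net))

def f_prime(out):                  # derivative of the sigmoid, expressed via its output
    return out * (1.0 - out)

rng = np.random.default_rng(0)
sizes = [8, 6, 4, 3]               # input Y (Out^0) and three layers of neurons (assumed)
W = [rng.normal(0, 0.5, (sizes[n], sizes[n + 1])) for n in range(3)]
a = 0.5                            # learning-rate constant

Y = rng.random(8)                  # one reduced measurement vector (Out^0)
T = np.array([1.0, 0.0, 0.0])      # target: first output channel "on"

for cycle in range(1000):
    # forward propagation through the three layers
    outs = [Y]
    for w in W:
        outs.append(f(outs[-1] @ w))
    # back-propagation of the error
    D = f_prime(outs[3]) * (T - outs[3])          # D^3 = f'(Net^3)(T - Out^3)
    for n in (2, 1, 0):                           # weight layers 3, 2, 1 in the paper's numbering
        grad = np.outer(outs[n], D)               # D^n_j * Out^(n-1)_i
        if n > 0:
            D = f_prime(outs[n]) * (W[n] @ D)     # D^n = f'(Net^n) * sum_k D^(n+1)_k w^(n+1)_jk
        W[n] += a * grad                          # w^n_ij += a D^n_j Out^(n-1)_i
    if np.max(np.abs(T - outs[3])) < 0.05:        # stop when the output matches the target
        break
```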
The use of several networks has an advantage over using only one, because the training rate is very slow and neural networks do not behave very well when attempts are made to teach them to distinguish between too many patterns [6]. Additionally, in order to distinguish between more classes, larger weight matrices (more neurons) have to be implemented, which further prolongs training or requires more parallel built-in components.
HIERARCHICAL CLUSTERING
The hierarchical organization of a large amount of information has an advantage over other types of organization mainly because it allows access to any item stored in a data base in a number of comparisons that is proportional to log_2 N, N being the number of items stored. The larger the
collection, the more apparent this advantage is. In a collection containing 10^6 measurements, only 20 comparisons are needed on average to access any item. In real applications, where hierarchies are never perfectly balanced (i.e., they do not have the same number of leaves accessible from any node at any given level), the average number of comparisons needed for accessing the items may be much larger. However, even a path longer by an order of magnitude in a tree, say 200 comparisons, is still an excellent performance for finding the desired item in a data base containing one million objects.

Many procedures are available for the hierarchical clustering of multidimensional objects [11-13]. Unfortunately, most of them cannot cope with the amount of data produced by modern instruments. The reason for this is that all standard methods for clustering are based on the N x N distance matrix, which prevents these methods from handling efficiently more than a few thousand objects (multichannel measurements). Therefore, the 3-distance clustering method (3-DCM) must be applied. The method has been extensively explained in several publications [14-17], and will be described only briefly here.

The 3-DCM is basically the update of an existing hierarchy. The update with new items is done by letting the new multidimensional object X enter the tree at the root and deciding its path at each node by comparing all distances between the object X and the left and right descendants of the particular node. The algorithm is simple:

For level i from root to leaf, do:
  check node on level i
  If (node.eq.leaf), do:
    in the case of search: output the leaf;
    in the case of update: insert X at the leaf;
  If (X closer to left descendant): next node is the left descendant
  If (X closer to right descendant): next node is the right descendant
  If (X is an outlier), do:
    in the case of search: X not found; output the node i;
    in the case of update: insert X on level i next to the node on level i.
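A minimal sketch of this tree descent is given below (Python, written for this text; the binary node class, the Euclidean distance, the running-midpoint node representative and the exact outlier rule are simplifications of the published 3-DCM [14-17], not the authors' implementation). It shows how a compressed measurement either joins an existing leaf region or starts a new branch in the middle of the tree, the mechanism that produces the long chains discussed later.

```python
import numpy as np

class Node:
    def __init__(self, vector, left=None, right=None):
        self.vector = np.asarray(vector, dtype=float)  # representative (compressed) measurement
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def dist(a, b):
    return float(np.linalg.norm(a - b))                # Euclidean distance (an assumption)

def midpoint(a, b):
    return (np.asarray(a) + np.asarray(b)) / 2.0       # crude cluster representative

def insert(node, x):
    """Update: let x descend from `node`; return the (possibly new) subtree root."""
    if node is None:
        return Node(x)
    if node.is_leaf():
        return Node(midpoint(node.vector, x), left=node, right=Node(x))
    d_node = dist(x, node.vector)
    d_left = dist(x, node.left.vector)
    d_right = dist(x, node.right.vector)
    if d_node < min(d_left, d_right):
        # x resembles the cluster as a whole more than either branch:
        # start a new branch here, in the middle of the hierarchy
        return Node(midpoint(node.vector, x), left=node, right=Node(x))
    if d_left <= d_right:
        node.left = insert(node.left, x)
    else:
        node.right = insert(node.right, x)
    return node

def search(node, x):
    """Content-dependent retrieval: follow the closer descendant down to a leaf."""
    x = np.asarray(x, dtype=float)
    while not node.is_leaf():
        node = node.left if dist(x, node.left.vector) <= dist(x, node.right.vector) else node.right
    return node.vector

# Build a small hierarchy of 1000 compressed 20-dimensional measurements and query it.
root = None
for y in np.random.rand(1000, 20):
    root = insert(root, y)
print(search(root, np.random.rand(20)))
```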
With this algorithm, a hierarchy containing any number of objects can be built. The problem inherent in this method is that, after a number of objects has been added as updates, some of the previous entries can no longer be retrieved. Therefore, from time to time, an iterative check of all objects is required and those that cannot be retrieved must be relocated to new positions. This checking and relocation can be regarded as learning by repetition and can be described in the following manner [2,18]:

Repeat:
  for measurement i, do:
    if measurement i cannot be retrieved: relocate it
  next measurement i
until all measurements can be retrieved.

Relocation of measurements in this algorithm is done by exactly the same procedure as for the update. In intelligent instruments, such learning by repetition can be done after each set of, say, a few hundred recorded experiments, i.e., after each 8-h working shift.

Exhaustive and informative display of large hierarchies, which is an essential facility for intelligent instruments dealing with large numbers of measurements, is a problem in itself, and will not be addressed here. In order to give some idea of the problem and of what can be achieved by hierarchical clustering of multidimensional measurements, a small hierarchy based on reduced representations [2,19] of 590 infrared spectra is shown in Fig. 5. The hierarchy in Fig. 5 clearly shows long chains of objects (one of the main features of the organization obtained by the 3-DCM method). Long chains are the result of starting new branches in the middle of the hierarchy (instead of at the leaves). This property prolongs the average access and retrieval time in the hierarchy (a drawback), but it increases the possibilities of distinguishing between very similar samples and of detecting outliers (a benefit).
Fig. 5. A hierarchy of 590 infrared spectra. The long chains seen are a very clear feature of the 3-DCM. The query (compressed measurement) enters at the root and traverses the tree towards the most similar object. On its path, the query joins a number of clusters linking different experiments which share some (hopefully) relevant properties. A concise, informative and clear overview of a large hierarchy is very difficult to achieve; therefore, special software is needed so that each part can be inspected, e.g., the small branch from node A (Fig. 6).

Figure 6 shows detail of a small branch of the larger 590-member hierarchy (Fig. 5). The information given at each leaf is the Wiswesser line-formula notation (WLN) [20] describing the structure of the compound, which is added by the operator after the corresponding spectrum has been recorded.
Fig. 6. A branch of 18 spectra from node A in the larger hierarchy shown in Fig. 5. The shortest path is 2, the average path 6.4, and the longest path 10. Specially developed software is used to display any node in a large hierarchy and to link the node with supplementary information about the corresponding objects. It can be seen that the eleven central structures have steroid skeletons.
In this particular example, most of the recorded spectra are accompanied by the corresponding structures; thus the instrument can pinpoint, by searching through the hierarchy, the cluster of spectra (with information on their structures) to which the spectrum of the unknown compound is linked. On the basis of the structures in these clusters, the structural fragments of the unknown compound can be predicted. If the spectrum of the unknown compound joins the group of eleven steroid compounds in the middle (Fig. 6), it is reasonable to predict that the compound is very likely to be a steroid, if this is an expected answer; or it is reasonable to predict that the compound contains a carbonyl group, a double bond and a number of CH2 groups if its structure is completely unknown. When additional information (e.g., physical, chemical or pharmaceutical properties, biological or ecological activity of compounds, or other relevant properties) is linked to each measurement, a much larger and more informative knowledge base can be formed and explored as required. The regular update and relocation of measurements (as well as of the related information) enables new relations (clusters), and thus new knowledge, to be accumulated permanently in the hierarchy.

In training hierarchies, as in training neural networks, all objects must form part of the learning process. From the learning point of view, it seems to be more time-efficient to form several hierarchies rather than one huge tree. Whichever system is chosen, once one (or more) hierarchy(ies) have been formed and supplemented with different kinds of information input by chemical or other experts, each experiment (or at least most of them) should be able to help solve problems simply by association of the current measurement with a collection of supplementary information linked by the hierarchical organization.
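To illustrate this kind of prediction by association, the sketch below (Python, invented for this text; the fragment labels, the similarity measure and the toy data are purely hypothetical) retrieves the stored measurements most similar to an unknown one and tallies the structural fragments attached to them as supplementary information.

```python
import numpy as np
from collections import Counter

# Hypothetical knowledge base: compressed spectra with fragment labels supplied
# by the operator (WLN strings or other properties could be attached the same way).
knowledge_base = [
    (np.array([0.9, 0.1, 0.4]), {"steroid skeleton", "C=O"}),
    (np.array([0.8, 0.2, 0.5]), {"steroid skeleton", "C=C"}),
    (np.array([0.1, 0.9, 0.7]), {"aromatic ring"}),
]

def predict_fragments(y, k=2):
    """Rank stored measurements by similarity to y and report the fragments
    found among the k most similar ones (prediction by association)."""
    ranked = sorted(knowledge_base, key=lambda item: np.linalg.norm(item[0] - y))
    votes = Counter()
    for _, fragments in ranked[:k]:
        votes.update(fragments)
    return votes.most_common()

print(predict_fragments(np.array([0.85, 0.15, 0.45])))
# e.g. [('steroid skeleton', 2), ('C=O', 1), ('C=C', 1)]
```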
APPLICATION OF BOTH METHODS IN INTELLIGENT INSTRUMENTS

Some of the aims described above are better achieved with neural networks and some with hierarchical clustering.
Both methods should be implemented in an intelligent instrument. The possibilities of neural networks in intelligent instruments are best exploited in the first of the three categories outlined initially (classification of the experiment into one of several preselected classes) and partly in the third category (prediction of properties). Although even the second category of activities (content-dependent retrieval) can be handled efficiently by some neural networks [4], none of them is, unfortunately, appropriate for very large collections of data. Hierarchical clustering is very much better for content-dependent retrieval from large data bases and for prediction of properties, but is considerably less efficient in supervised learning. Hence, three general guidelines may be given for choosing neural networks or hierarchical clustering for the main activities of intelligent instruments: (1) for classification of measurements into one of several prespecified classes, use the neural-network approach (e.g., the back-propagation algorithm); (2) for content-dependent retrieval of information associated with measurements already stored and similar to the measurement X, use hierarchical clustering (the 3-DCM method); and (3) for prediction of structures, compositions and/or properties of the sample on the basis of measurement X, the stored data and the knowledge base of the instrument, use special small hierarchically ordered data collections and special-purpose neural networks.

Some special neuro-computers (e.g., Anza Plus or Odyssey [21]), which can handle up to 10^6 connections per second, already exist. These commercially available boards [22] can be built directly into intelligent instruments and trained by the producer and/or the users. Many such boards, each containing up to 10^6 processing elements (neurons), could be used in parallel for more complicated procedures. Instrument producers could also implement special networks for their own needs and of their own design in order to exploit their benefits to the maximum.

Because all weight matrices must be stabilized for all selected patterns, the learning of neural networks is rather slow. But once the matrices have been stabilized, they can be used on any instrument designed for the same type of measurement,
provided that the input data to the weight matrices are uniformly formatted on all instruments. This feature is especially attractive for self-diagnostics. Because of its more complex procedure, hierarchical clustering of multichannel data has not yet been implemented commercially. Compared to neural networks, this clustering requires slightly less computer space for learning and considerably less space for predictions and retrieval. Predictions (propagation of the measurement vector) in neural networks require that all weight matrices with all weights participate in any decision, whereas a prediction in hierarchical clustering invokes only the nodes that lie exactly on the main path and those parallel to it. This means that only 2 log_2 N nodes in the hierarchy are considered for a prediction. If the learning procedure in a hierarchy is to be fast, however, the entire hierarchy of nodes (and preferably all leaves, i.e., N reduced p-dimensional measurements) and the connection addresses must be resident in RAM (about 6 N x p bytes), whereas neural networks require only the weights and the intermediate inputs and outputs to be in RAM. Although the number of weights needed for classification of objects with a neural network can vary from application to application, it is safe to assume that about one tenth of N x p^2 bytes is needed [6]. The existence of commercially available boards for neural networks makes their use very attractive. It is also necessary to consider that, over the lifetime of most instruments, only a small percentage of the measurements obtained needs to be recognized by the instrument (standards, impurities, faults, etc.); training limited to such purposes brings the size of the weight matrices (smaller N in N x p^2) down to much lower figures. However, the problem of learning to recognize all recorded experiments (recognition of N = 10^5 objects) is at present beyond the capability of an economically feasible neural network. For such problems, a number of hierarchies (say, 10-20, each containing 10-20 thousand measurements) can easily be built, searched, updated, and trained for 100% accurate retrieval in real time [17].
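The two space estimates quoted above can be compared directly; the small sketch below just evaluates the paper's rules of thumb (about 6 N p bytes of RAM for a resident hierarchy versus roughly N p^2 / 10 bytes of network weights [6]) for an assumed N and p chosen only for this illustration.

```python
N = 100_000   # measurements recorded over the instrument's lifetime
p = 200       # dimensionality of the compressed measurement vectors (assumed)

hierarchy_ram = 6 * N * p            # nodes, leaves and connection addresses resident in RAM
network_weights = (N * p ** 2) // 10 # rough estimate of the weights needed for recognition [6]

print(f"hierarchy resident in RAM: {hierarchy_ram / 1e6:.0f} Mbyte")   # -> 120 Mbyte
print(f"neural-network weights:    {network_weights / 1e6:.0f} Mbyte") # -> 400 Mbyte
```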
The hierarchical organization is based on a scheme of binary decision trees where stabilization is needed only after a few hundred new entries; this makes it convenient for implementation on instruments which are mostly idle after an 8-h working shift. Both the weight matrices of the neural network and the nodes of the decision tree can be transferred to other instruments of the same kind. Both methods need the same space for storing the original and supplementary information.

The space needed for the knowledge base is difficult to estimate. The knowledge base consists of the information supplementing each (or most) measurement(s) and the links between these pieces of information. The links between the pieces of supplementary information are formed by association between measurements in the hierarchies (clusters of similar measurements) or by similar neural-network responses (similar target outputs). Such links make it possible to retrieve all the supplementary information on similar compounds ever recorded on the instrument. Therefore, the space needed for the knowledge base expands continuously along with the approximately steady increase of supplementary information (structures, properties, origin, experimental conditions, etc.). The number of links between these pieces of information increases much more rapidly because the number of similar compounds increases together with the increasing number of objects in the data base. Accordingly, the measure of similarity between experiments has to be very restrictive. It is, of course, necessary to recognize that a large proportion of the supplementary information may not be very significant or may even be redundant. Hence, unless the disk capacity presents no problems at all, the operator should input supplementary information according to a strict protocol.

Because the quality of knowledge bases in general, and within such instruments in particular, is limited by the ability of human experts to describe the recorded phenomena in an adequate manner, the procedures for establishing links between the individual pieces of information (clustering, grouping, selection) must be very robust and not prone to fast or abrupt changes caused by small deviations. Unfortunately, this last requirement is in direct conflict with the requirements for obtaining the best selectivity and for retaining the
possibility of detecting completely new phenomena. As always, a number of trade-offs and trial-and-error experiments will be needed before adequate solutions are found.

The financial support of the Research Community of Slovenia is gratefully acknowledged.
REFERENCES

1 K. Varmuza, Pattern Recognition in Chemistry, Springer, Berlin, 1980.
2 J. Zupan, Algorithms for Chemists, Wiley, Chichester, 1989.
3 J. Zupan, S. Bohanec, M. Razinger and M. Novič, Anal. Chim. Acta, 210 (1988) 63.
4 T. Kohonen, Self-Organization and Associative Memory, 2nd edn., Springer, Berlin, 1988.
5 T. Kohonen, Neural Networks, 1 (1988) 3.
6 R.P. Lippmann, IEEE ASSP Mag., April (1987) 4.
7 P.D. Wasserman and T. Schwartz, IEEE Expert, Spring (1988) 10.
8 R.J. McEliece, E.C. Posner, E.R. Rodemich and S.S. Venkatesh, IEEE Trans. Inf. Theory, 33 (1987) 461.
9 D.L. Massart, B.G.M. Vandeginste, S.N. Deming and Y. Michotte, Chemometrics: A Textbook, Elsevier, Amsterdam, 1988.
10 D.E. Rumelhart, G.E. Hinton and R.J. Williams, in D.E. Rumelhart and J.L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Boston, 1986.
11 P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy, Freeman, San Francisco, 1973.
12 B. Everitt, Cluster Analysis, Heinemann, London, 1977.
13 H. Spaeth, Cluster Analysis Algorithms for Data Reduction and Classification of Objects, Horwood, Chichester, 1980.
14 J. Zupan, Anal. Chim. Acta, 122 (1980) 337.
15 M.F. Delaney, Anal. Chem., 53 (1981) 2356.
16 J. Zupan, Clustering of Large Data Sets, Research Studies Press; Wiley, Chichester, 1982.
17 J. Zupan and M.E. Munk, Anal. Chem., 57 (1985) 1609.
18 J. Zupan, in A.W. Smith (Ed.), Proc. Int. Conf. General Systems Research: Hierarchy Theory as a Special Topic, Vol. 2, Intersystems, New York, 1984, p. 633.
19 J. Zupan and M. Novič, in J. Zupan (Ed.), Computer-Supported Spectroscopic Databases, Horwood, Chichester, 1986, p. 42.
20 E.G. Smith, The Wiswesser Line-Formula Chemical Notation, McGraw-Hill, New York, 1968.
21 R. Hecht-Nielsen and T. Gutschow, Hecht-Nielsen Neurocomputer Corporation, San Diego, CA.
22 A. Penz and R. Wiggins, Texas Instruments Central Research Labs.
23 R. Hecht-Nielsen, IEEE Spectrum, March (1988) 36.