Computational neural networks for predictive microbiology: I. methodology

International Journal of Food Microbiology 34 (1997) 27-49

Yacoub M. Najjar*, Imad A. Basheer, Maha N. Hajmeer

Department of Civil Engineering, Kansas State University, Manhattan, KS 66506, USA
Department of Animal Sciences and Industry, Kansas State University, Manhattan, KS 66506, USA

Received 5 February 1996; revised 18 June 1996; accepted 10 August 1996

Abstract

Artificial neural networks are mathematical tools inspired by what is known about the physical structure and mechanism of biological cognition and learning. Neural networks have attracted considerable attention owing to their efficacy in modeling a wide spectrum of challenging problems. In this paper, we present one of the most popular networks, the backpropagation network, discuss its learning algorithm, and analyze several issues necessary for designing optimal networks that can generalize after being trained on examples. As an application in the area of predictive microbiology, modeling of microorganism growth by neural networks will be presented in the second paper of this series.

Keywords: Neural networks; Backpropagation; Learning algorithm; Microbial growth; Modeling

1. Basic definition

A computational neural network (CNN), also referred to as an artificial neural network, is a highly interconnected network structure consisting of many simple processing elements (neurons) capable of performing massively parallel computation for data processing and knowledge representation. The emergence of such a computational scheme was driven by the present understanding of how the biological nervous system might function. The CNN structure is said to be loosely copied or modelled after the neural structure of the mammalian cerebral cortex, but on a drastically oversimplified scale. The CNN approach is represented by mathematical algorithms designed in an attempt to mimic the methods of information processing and knowledge acquisition of the human brain. The 'network' part of the terminology plays a more significant role than does the 'neural' part: although based on the functionality of the nervous system, the networked structure of CNNs is the cornerstone of their existence. In order to understand the networking model of CNNs and to develop a gradual perception of their operation, it is imperative to briefly review the available knowledge about the neuron, the basic building block of the nervous system.

* Corresponding author.

0168-1605/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0168-1605(96)01168-3

2. Simple motor neuron

The human nervous system consists of billions of neurons (neural cells) which vary in length and type depending on their location in the body. Regardless of its type, a neuron consists of three major functional units, namely the dendrites, the cell body, and the axon. An oversimplified structure of a neuron is shown schematically in Fig. 1. The dendrites receive signals from other neurons and send them to the cell body. The axon, which branches into collaterals, receives signals from the cell body and carries them away through the synapse to the dendrites of neighboring neurons. The synapse is a microscopic gap which regulates the movement of signals from the axon of a neuron to the dendrites of a nearby neuron. A detailed illustration of the signal transfer process between two adjacent neurons is shown in Fig. 2. An impulse (electric signal) travels within the dendrites and through the cell body until it reaches the pre-synaptic membrane of the synapse. Upon arrival at the membrane (where vesicles containing chemical substances known as neurotransmitters reside), the chemicals are released in quantities proportional to the strength of

Fig. 1. Schematic of motor neuron illustrating all parts.

Fig. 2. Mechanism of signal transfer between two neurons.

the incoming signal. The neurotransmitter diffuses within the synaptic gap towards the post-synaptic membrane, and eventually into the dendrites of the neighboring neurons. This forces the neighboring dendrites (depending on the threshold of the receiving neuron) to generate a new electrical signal. The generated signal passes through the second neuron(s) in an identical manner to that explained previously. The amount of signal that passes through a receiving neuron depends on the intensity of the signal emanating from each of the feeding neurons, their synaptic strengths, and the threshold of the receiving neuron. Since a neuron has a large number of dendrites/synapses, many signals can be received simultaneously by the neuron. Assuming each signal and each synaptic strength to be equal to s_i and w_i, respectively, the integral incoming signal to a receiving neuron can be represented by a collective effect or net input. Although in reality the formation of the collective signal is not known, crude simplifying assumptions are usually made to model the neuron: (1) the net input (Net) is a function of all signals arriving within a given time interval and of all synaptic strengths (weights, w_i); and (2) the function linking these quantities is a simple sum of products of the entering signals, s_i, and the corresponding weights, w_i:

Net = w_1 s_1 + w_2 s_2 + ... + w_m s_m    (1)

Eq. (1) represents the dot product of two vectors: the weight vector, W, containing m weights (synaptic strengths), and the vector S containing m signals. Fig. 3 illustrates three neurons with various signals of intensity s and synaptic strength w impinging onto a single neuron that only fires at a threshold of θ. The magnitude of the signal (Net) that passes through the dendrites of the receiving neuron is simply calculated as the weighted sum of all the impinging signals, which is equal to (w_1 s_1 + w_2 s_2 + w_3 s_3). If Net ≥ θ, the receiving neuron fires and the signal (Net) is allowed to pass through. This simplified mechanism of signal transfer constitutes the fundamental step of neurocomputing development.
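As a concrete illustration of Eq. (1) and the threshold firing rule, the net input of a simple neuron can be sketched in Python as follows (the signal values, weights, and threshold below are illustrative, not taken from the paper):

```python
# Net input of a simple neuron (Eq. (1)): dot product of signals and weights.
def net_input(signals, weights):
    return sum(s * w for s, w in zip(signals, weights))

def fires(net, threshold):
    # The neuron fires only if the collective signal reaches the threshold.
    return net >= threshold

signals = [0.5, 1.0, 0.2]   # s_1..s_3 (illustrative)
weights = [0.4, 0.3, 0.9]   # w_1..w_3 (illustrative)
net = net_input(signals, weights)   # 0.5*0.4 + 1.0*0.3 + 0.2*0.9 = 0.68
print(net, fires(net, 0.5))
```

With a threshold of 0.5, this neuron fires for the given signals; lowering any synaptic strength can silence it.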

3. Approach of CNNs

The previous understanding of the functioning mode of the biological neuron provides insight into the construction of artificial networks from assemblies of simple artificial neurons. A CNN performs three sequential tasks: (i) entry of data from the input-layer side of the network; (ii) processing of information within the network body; and (iii) production of output(s) at the output layer. Alternatively stated, neural networks are designed to perform a transformation of an n-dimensional input vector into an m-dimensional output vector. Typically, the structure of a CNN consists of an input layer made of a number of input nodes that are presumed to account for and explain the variability observed in the output(s) of a specific problem. The output layer is designed to contain the output nodes (variables) of the problem at hand. A set of intermediate (hidden) layers, which constitutes a vital part of the CNN architecture, contains a number of nodes that have no interaction with the external environment but are interconnected with the nodes of the adjacent layers. A schematic representation of a feedforward multilayer neural network (discussed later) with three layers is shown in Fig. 4. As can be seen from this figure, nodes from one layer are connected (using interconnections or links) to all nodes in the adjacent layer(s), but no lateral connection within one layer is permitted. This type of networking is the most popular and most representative of a widely used class of neural networks

Fig. 3. Signal integration and analogy to signal summing in artificial neuron.

Fig. 4. Schematic of three-layer feedforward neural network (input, hidden, and output layers joined by connecting weights; activation propagates forward while error propagates backward).

referred to as the multilayer feedforward backpropagation-of-error neural network, or simply the BPNN. In the present paper, emphasis is placed on this type of network. Readers interested in other types and the wider variety of network architectures are referred to Zupan and Gasteiger (1991) and Simpson (1990). The following sections further discuss the elements of BPNNs, with the aim of providing the reader with a systematic understanding of BPNNs, their architecture, properties, and the procedures of their development.

4. Activation of neuron

As mentioned previously, a neuron (biological or artificial) receives signals from many other neurons and sums them into a collective-effect signal (excitation) that is set ready for transfer to subsequent processing elements. Different modes of quantifying the output (activation) resulting from a particular excitation are available. The receiving neuron may respond equally to any excitation that exceeds its tolerance (threshold) value. Thus, a neuron may fire and allow an excitation equivalent to its threshold to pass through, or may not fire if the excitation is below that threshold. This form of activation is of a discrete type (see Fig. 5(a)). A more widely used transfer function is the continuous activation, which is designed to respond in proportion to the amount of excitation received. Among the many continuous transfer functions, the sigmoidal function (Fig. 5(b)) is widely used in modeling the activity of neurons. In a simple form, this function can be expressed as

f(x) = 1/(1 + e^(-x))    (2)

where f(x) is the amount of activation and x is the excitation. Therefore, for a neuron receiving a total excitation equivalent to Net, the above equation can be written as

a = f(Net) = 1/(1 + e^(-Net))    (3)

where a is the activation. Eq. (3) can accept any excitation in the range (-∞, +∞) and map it into the range (0, 1). However, one can easily observe that as Net reaches relatively high (about 4.0) or low (about -4.0) values, the activation stabilizes at values near 1 or 0, respectively. The sigmoidal function furnishes many advantages over other transfer functions, especially the linear and step functions. First, depending on the neural network algorithm adopted for the solution of a specific problem (e.g., backpropagation), the existence of the derivative of the sigmoidal function with respect to Net is a necessity. Second, depending on the problem and its data, continuous functions are better suited to problems where the mapping is continuous, while step functions (e.g., hard-limiters) are more appropriate for problems dealing with discrete data where the output is, for instance, expressed as 0/1, off/on, or yes/no. Additionally, the derivative of the sigmoidal function vanishes (i.e., becomes equal to zero) as Net approaches high or low values. This presents a special feature that assists considerably in achieving faster learning and generalization of the trained network (Zupan and Gasteiger, 1993). The process carried out by a neuron of summing the signals impinging onto it and the subsequent calculation of its activation is repeated for each and every

Fig. 5. Transfer functions in artificial neurons: (a) step function, and (b) sigmoidal function.

neuron in the network. This leads to activation(s) being calculated for the node(s) on the output layer. The activation at the output layer is considered as the network's solution for a given input vector. The sweep in which the transfer of signals proceeds from the input layer to the output layer is called the feedforward sweep, which terminates in determining the output variables for a specific problem.

5. Development of CNN

Neural networks learn by training on examples relevant to the given problem. For a CNN to achieve appropriate mapping, the training exemplars (input and output vectors) should preferably cover a wide range of the sampling domain. CNNs are known to be effective tools for deriving an appropriate and representative mapping between input and output vectors where such a mapping may not be feasible through physical or mathematical formulation of the observed phenomenon. A CNN is developed in two major phases: a training phase and a testing phase. In the training phase, the CNN is trained on the available exemplars in order for the network to generalize. On the other hand, testing the CNN (cross-validation) on exemplars never used in training is essential to verify its generalization capability. A CNN that is able to accurately predict the output(s) of the testing sets is said to have generalized, and can thus be employed to predict exemplars other than those used in training. In three-layer BPNNs the forward sweep begins by presenting the network with one example of the cases given in the training set. This starts at the input layer, where each input node (neuron) transmits forward the value received to each hidden node on the hidden layer. The collective effect (excitation) on each of the hidden nodes is summed up by performing the dot product of all input nodes' values and their corresponding interconnection weights. At this stage, the interconnection weights are not known and thus are given some initial guess values. Once the net excitation at one hidden node is determined, the activation at that node is calculated using a transfer function (e.g., the sigmoidal function). The amount of activation obtained represents the new signal that is transferred forward to the subsequent layer (i.e., the output layer). The same procedure of calculating the excitation is repeated for each hidden node.
Similarly, the interconnection weights between the hidden and output layers are set to some assumed initial values. The excitation(s) calculated at the output node(s) are consequently transformed into activation(s) using a transfer function (normally the same function used in the hidden nodes). The activation(s) calculated at the output node(s) represent the solution vector for the given input vector. The solution obtained may deviate considerably from the target solution (the actual solution for the exemplar used) due to the arbitrarily selected interconnection weights. In a backward sweep, the difference (i.e., error) between the obtained output vector and the target output vector is used to adjust the interconnection weights. This phase works backward, starting from the output layer by updating the weights between the output and hidden layers, and then from the hidden layer backward to the input layer, while modifying all weights

Fig. 6. Illustration of forward activation and backward propagation of error for adjustment of weights and biases.

connecting the nodes of these two layers. A learning algorithm (discussed later) is used for updating all the interconnection weights. The forward and backward sweeps of the network are performed a sufficient number of times until the forward sweep produces an output vector that agrees with the target vector within a certain prespecified tolerance. The forward activation of outputs and the backward propagation of error are repeated for all exemplars available in the training set. Correction of weights can be implemented in either a sequential scheme or a batch scheme. In the sequential scheme, the correction of weights is made immediately after detecting an error for one input pattern (i.e., example). On the other hand, batch correction is implemented after presenting all exemplars to the network. In such a case, the individual errors are accumulated, and then an average representative error for the entire set of training patterns is used for correcting the weights. Both schemes have advantages and limitations; nevertheless, the batch scheme provides immunity against problems that arise from starting with an erroneous exemplar. A computational neural network tends to learn from examples in order to eventually find the set of weights that minimizes the error between the target and the predicted outputs. The backpropagation of error (the learning phase) is based on a search technique within the error surface. This error surface is an (n+1)-dimensional surface that represents the variation of the error with all the (n) interconnection weights. Although there are a number of search techniques available in the literature, the gradient descent method will be used and explained in this context. A schematic of the forward and error backpropagation sweeps used in developing a network is shown in Fig. 6.
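To make the gradient descent idea concrete, the following Python sketch minimizes a one-variable error surface E(w) = (w - 2)^2 by repeatedly stepping against the gradient; the surface, learning rate, and step count are illustrative assumptions, not values from the paper:

```python
# Gradient descent on a toy 1-D error surface E(w) = (w - 2)**2.
def gradient(w):
    return 2.0 * (w - 2.0)   # dE/dw

def descend(w, eta, steps):
    for _ in range(steps):
        w = w - eta * gradient(w)   # move opposite to the gradient's sign
    return w

w_final = descend(w=0.0, eta=0.1, steps=100)
print(round(w_final, 4))   # approaches the minimum at w = 2
```

Starting to the left of the optimum the gradient is negative, so the update is positive, exactly as described for point a in Fig. 8; an overly large eta would instead overshoot and oscillate.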

6. General model for weight correction

For well-defined problems, mapping between the input and output vectors can be obtained by properly selecting all weights of the network links. Therefore, a method of updating the weights to improve the quality of prediction and minimize the error is crucial for network development. Learning algorithms used to update the weights for the several types of neural networks are well documented in the literature. Zupan and Gasteiger (1991) present a systematic derivation of the BPNN algorithm.


In this paper, discussion of the BPNN learning algorithm is entirely based on the analysis of Zupan and Gasteiger (1991). The derivation of the equations will not be presented; only the basic and final equations used in correcting the weights will be discussed.

6.1. The Delta-rule

In order to understand the algorithm, a symbolic representation of the various layers and the various interconnection weights is required. A layer (input, hidden, or output) is assumed to contain n neurons with m weights leaving each neuron. Hence, a layer can be said to have an (n × m) weight matrix, W. The layer will be given an index L. Using this terminology, a particular weight can be written as w_ji^L to refer to the ith weight of the jth neuron in the Lth layer. A graphical representation of this indexing scheme is shown in Fig. 7. In the present analysis, all signals are labeled 'Out', including the actual input signal that enters through the input layer, which is labeled Out^0. The indexing for outputs proceeds as Out^1 produced from the first layer, Out^2 produced from the second layer, etc. Similarly, the output layer produces an output labeled Out^last. For a general layer within the network, the indexing scheme is shown in Fig. 7.

Fig. 7. Indexing and labelling scheme used in the backpropagation algorithm.
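The indexing convention just described (w_ji^L as the ith weight of the jth neuron in layer L) maps directly onto a list of (n × m) matrices in code. The sketch below builds such a structure for illustrative layer sizes (2 inputs, 3 hidden nodes, 1 output; the sizes and the ±0.5 initialization interval are assumptions for demonstration):

```python
# W[L][j][i]: the i-th weight of the j-th neuron in layer L
# (receiving node indexed first, sending node second).
import random

random.seed(0)
layer_sizes = [2, 3, 1]   # illustrative: 2 inputs, 3 hidden, 1 output

# Each weight layer connects n_in sending nodes to n_out receiving nodes,
# giving an (n_out x n_in) matrix per layer.
W = [[[random.uniform(-0.5, 0.5) for i in range(n_in)]
      for j in range(n_out)]
     for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(len(W), len(W[0]), len(W[0][0]))   # 2 weight layers: (3x2) then (1x3)
```

Accessing W[0][2][1] then reads as "weight from input node 2 into hidden neuron 3" in the paper's notation.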

Fig. 8. Schematic of oversimplified error surface illustrating the gradient search technique (a local minimum and the optimum weight w(opt.) are marked).

The learning process is centered on updating the weights during each sweep through the network. The incremental change (positive or negative) to be applied to a current weight, w_ji, can be simply expressed as

w_ji(new) = w_ji(old) + Δw_ji    (4)

where (new) and (old) refer to the value of the weight at the current and previous steps, respectively. The incremental change Δw_ji is usually calculated from the delta-rule, stated as

Δ Parameter = η g(output error) f(input)    (5)

where g and f are arbitrary functions. This equation states that the amount of change to be applied to a 'parameter' is proportional to both the 'input' signal and the 'error' produced at the output side. The proportionality constant, η, is called the learning rate, which controls the size of the updating step between two successive iterations. In neural network symbolic representation, the delta-rule is written as

Δw_ji^L = η δ_j^L Out_i^(L-1)    (6)

where the δ term represents the error function given in Eq. (5). The term Out_i^(L-1) represents the input from the L-1 layer. Comparison between Eqs. (5) and (6) reveals that f is the identity function. The error function, δ, is calculated using the gradient descent method for searching the error surface for the set of weights that yield the minimum error. Fig. 8 is an oversimplified error surface representing the variation of the error computed for a single output with the value of a particular interconnection weight. As can be seen from Fig. 8, the goal is to find the weight that minimizes the error (point b). The change in w (i.e., Δw) can be related to the gradient of the error surface according to

Δw = w(new) - w(old) = -K (∂Error/∂w)    (7)

where K is a constant. Eq. (7) states that the search for a global minimum requires that moves be made along the negative of the gradient of the error surface. As can be seen in Fig. 8, if the current weight is at the left side of the optimum (i.e., point a), the gradient is negative and Δw is a positive quantity, thus forcing the search to move to the right (towards the optimum). Similarly, if the current w value is at point c, the gradient is positive and Δw is negative, thus requiring one to move backwards towards the optimum. With the neural network symbolic representation adopted in this paper, Eq. (7) is written as

Δw_ji^L = -K (∂Error/∂w_ji^L)    (8)

which represents the incremental change in the weight associated with layer L. For a learning algorithm, the main task is to quantify the error gradient term given in Eq. (8).

6.2. Evaluation of error gradient term

The error gradient term is calculated via use of the chain rule as follows:

∂Error/∂w_ji^L = (∂Error/∂Out_j^L) (∂Out_j^L/∂Net_j^L) (∂Net_j^L/∂w_ji^L)    (9)

To calculate the error gradient term, all the derivatives on the right-hand side of the above equation must be determined. For brevity, the derivation of these derivatives will not be presented in this paper; the reader is referred to Zupan and Gasteiger (1991) for details. The first derivative term in Eq. (9) is expressed in two different ways, depending upon whether the updating is taking place on the last layer (i.e., the output layer) or on any of the hidden layers. In the following section, the algorithm of the BPNN, utilizing the final equations necessary for updating the weights, is presented.
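As a numeric illustration of the delta-rule (Eq. (6)), the sketch below performs one weight update for an output neuron, using the sigmoid-derivative form of δ that the next section presents; the learning rate, activations, and target are illustrative values, not taken from the paper:

```python
# One delta-rule update (Eq. (6)): dw = eta * delta * out_prev.
def output_delta(target, out):
    # delta for an output node with sigmoidal activation:
    # (target - out) * out * (1 - out)
    return (target - out) * out * (1.0 - out)

eta = 0.5          # learning rate (illustrative)
out_prev = 0.8     # activation feeding this weight (illustrative)
out = 0.6          # current output of the receiving node (illustrative)
target = 1.0       # desired output (illustrative)

delta = output_delta(target, out)   # 0.4 * 0.6 * 0.4 = 0.096
dw = eta * delta * out_prev         # 0.5 * 0.096 * 0.8 = 0.0384
print(delta, dw)
```

Because the output falls short of the target, δ is positive and the weight is nudged upward; an output above the target would reverse the sign.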

7. The learning algorithm

With a sequential updating scheme, a learning phase begins with the presentation of one example (input vector) such as X(x_1, x_2, ..., x_m). The following procedure is then followed to train the network on the example:

(1) Label the input vector X as Out^0 (Out_1^0, Out_2^0, ..., Out_m^0), with m denoting the number of input variables and with the last input representing a dummy input variable (called the bias or threshold) that maintains a value of 1. This bias will be linked to each node in the subsequent hidden layer, thus implying that each hidden node has a total Net equal to the original Net plus the bias.

(2) Propagate the input vector Out^0 through the network's layers by consecutively evaluating the output vectors Out^L using the (initially assumed) weights, w_ji^L, of the L-th layer and the output Out_i^(L-1) from the previous layer:

Out_j^L = f( Σ_i w_ji^L Out_i^(L-1) )    (10)

where f is the sigmoidal function.

(3) Calculate the correction factor (δ term) for all weights in the output layer (i.e., δ_j^last) from:

δ_j^last = (y_j - Out_j^last) Out_j^last (1 - Out_j^last)    (11)

where y_j is the target value of an output parameter in the target output vector Y.

(4) Correct all weights w_ji^last on the last layer according to:

Δw_ji^last = η δ_j^last Out_i^(last-1) + μ Δw_ji^last(previous)    (12)

The above equation is a modified version of the δ-rule that contains a new constant, μ, representing a momentum rate to guide the search towards the global minimum by allowing a portion of the previous correction (magnitude and direction) to be added to the current value. This may keep the search from oscillating about the global minimum. Similarly, update the biases associated with all nodes on the last layer according to:

Δb_j^last = η δ_j^last + μ Δb_j^last(previous)    (13)

(5) Calculate consecutively, layer by layer, the correction factors δ_j^L for all hidden layers, starting from L = last - 1 down to L = 1, as follows:

δ_j^L = Out_j^L (1 - Out_j^L) Σ_k δ_k^(L+1) w_kj^(L+1)    (14)

Note that the term δ_k^(L+1) is the correction factor belonging to the subsequent (L+1) layer, which is known from the computations previously carried out in step (3).

(6) Correct all weights w_ji^L on any hidden layer L using:

Δw_ji^L = η δ_j^L Out_i^(L-1) + μ Δw_ji^L(previous)    (15)

Similarly, the biases within the hidden layer(s) are updated using:

Δb_j^L = η δ_j^L + μ Δb_j^L(previous)    (16)

(7) Repeat steps (1) through (6) for each new input-output vector presented. The new set of updated weights is now used in step (2).

(8) Repeat steps (1) through (7) for another epoch (training cycle), and repeat as required until all the predicted output vectors for all exemplars presented to the network agree with the corresponding target output vectors within a prespecified tolerance (allowable error).

For the batch type of weight updating, the algorithm is similar to the one described above. However, an error function representing all the errors produced for each input-output pattern resulting from a single forward activation is calculated. The error is subsequently backpropagated to update the weights.
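The sequential procedure of steps (1)-(8) can be condensed into a small training loop. The sketch below implements the forward sweep and the corrections of Eqs. (11)-(16), without the momentum term, for a 2-2-1 network on a single pattern; the layer sizes, data, learning rate, and epoch count are assumptions for demonstration only:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W1, b1, W2, b2):
    # forward sweep: hidden then output activations (Eq. (10), sigmoidal f)
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b)
         for row, b in zip(W2, b2)]
    return h, o

def train_pattern(x, y, W1, b1, W2, b2, eta=0.5):
    h, o = forward(x, W1, b1, W2, b2)
    # step 3: output-layer correction factors (Eq. (11))
    d_out = [(t - oj) * oj * (1 - oj) for t, oj in zip(y, o)]
    # step 5: hidden-layer correction factors (Eq. (14))
    d_hid = [hj * (1 - hj) * sum(dk * W2[k][j] for k, dk in enumerate(d_out))
             for j, hj in enumerate(h)]
    # steps 4 and 6: weight and bias corrections (Eqs. (12)-(16), no momentum)
    for j, dj in enumerate(d_out):
        for i in range(len(h)):
            W2[j][i] += eta * dj * h[i]
        b2[j] += eta * dj
    for j, dj in enumerate(d_hid):
        for i in range(len(x)):
            W1[j][i] += eta * dj * x[i]
        b1[j] += eta * dj

random.seed(1)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(1)]
b1, b2 = [0.0, 0.0], [0.0]

x, y = [0.1, 0.9], [0.8]   # one illustrative input-output exemplar
errors = []
for epoch in range(200):
    _, o = forward(x, W1, b1, W2, b2)
    errors.append((y[0] - o[0]) ** 2)
    train_pattern(x, y, W1, b1, W2, b2)
print(errors[0], errors[-1])   # the squared error shrinks as training proceeds
```

In a real sequential scheme the inner update would cycle over all exemplars per epoch; here a single pattern suffices to show the error being driven below a tolerance.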

8. Applying the algorithm: an example

Consider a three-layer network with only two nodes in each of its input, hidden, and output layers, as shown in Fig. 9. This type of network is denoted 2-2-2 to refer to the number of layers and the number of nodes in each layer. The input to (and the activation of) each node is referred to as Out_j, the weights are referred to as w_ji (the link leaving node i and impinging onto node j), the biases (for all nodes in the hidden and output layers) are named b_j, and the summation of signals at node j is denoted Net_j. The target output (desired value) for an output node j will be referred to as t_j. As mentioned previously, two passes, representing a forward activation of the inputs and the backward propagation of errors to update the weights and biases, are required to complete an epoch of training for a single input-output pattern. The learning procedure and the required computations are summarized as follows:

(1) Compute Nets. Compute Net for each node j in one layer due to signals from each node i on the previous layer. For the hidden nodes:

Net_3 = w_31 Out_1 + w_32 Out_2 + b_3
Net_4 = w_41 Out_1 + w_42 Out_2 + b_4    (17)

Similarly, for the output nodes:

Net_5 = w_53 Out_3 + w_54 Out_4 + b_5
Net_6 = w_63 Out_3 + w_64 Out_4 + b_6    (18)

Fig. 9. A 2-2-2 neural network showing all components.

(2) Compute Outs. Compute the activation, using the sigmoidal function, for each node j. For the hidden nodes:

Out_3 = 1/(1 + e^(-Net_3))
Out_4 = 1/(1 + e^(-Net_4))    (19)

Similarly, for the output nodes:

Out_5 = 1/(1 + e^(-Net_5))
Out_6 = 1/(1 + e^(-Net_6))    (20)

(3) Compute δ's. Compute the error functions. For the output (last) layer:

δ_5 = Out_5 (1 - Out_5)(t_5 - Out_5)
δ_6 = Out_6 (1 - Out_6)(t_6 - Out_6)    (21)

For the hidden layer:

δ_3 = Out_3 (1 - Out_3)[δ_5 w_53 + δ_6 w_63]
δ_4 = Out_4 (1 - Out_4)[δ_5 w_54 + δ_6 w_64]    (22)

(4) Compute Δw's and Δb's. Compute the corrections for weights and biases. For the weights (Δw's) of the input-hidden links:

Δw_31 = η δ_3 Out_1
Δw_32 = η δ_3 Out_2
Δw_41 = η δ_4 Out_1
Δw_42 = η δ_4 Out_2    (23)

Similarly, for the weights (Δw's) of the hidden-output links:

Δw_53 = η δ_5 Out_3
Δw_54 = η δ_5 Out_4
Δw_63 = η δ_6 Out_3
Δw_64 = η δ_6 Out_4    (24)

For the biases (Δb's) of the hidden layer:

Δb_3 = η δ_3
Δb_4 = η δ_4    (25)

Similarly, for the biases (Δb's) of the output layer:

Δb_5 = η δ_5
Δb_6 = η δ_6    (26)
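One forward and backward pass through the 2-2-2 network of Eqs. (17)-(26) can be traced numerically as below; the inputs, initial weights, biases, targets, and learning rate are invented for demonstration and are not values from the paper:

```python
import math

def f(net):
    # sigmoidal transfer function of Eqs. (19)-(20)
    return 1.0 / (1.0 + math.exp(-net))

# Illustrative values: inputs, weights w_ji (j = receiving node,
# i = sending node), biases, targets, and learning rate.
Out1, Out2 = 0.4, 0.7
w = {(3, 1): 0.1, (3, 2): 0.2, (4, 1): -0.1, (4, 2): 0.3,
     (5, 3): 0.2, (5, 4): 0.1, (6, 3): -0.2, (6, 4): 0.4}
b = {3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0}
t5, t6 = 1.0, 0.0
eta = 0.5

# Steps 1-2: Nets and Outs (Eqs. (17)-(20))
Out3 = f(w[3, 1] * Out1 + w[3, 2] * Out2 + b[3])
Out4 = f(w[4, 1] * Out1 + w[4, 2] * Out2 + b[4])
Out5 = f(w[5, 3] * Out3 + w[5, 4] * Out4 + b[5])
Out6 = f(w[6, 3] * Out3 + w[6, 4] * Out4 + b[6])

# Step 3: correction factors (Eqs. (21)-(22))
d5 = Out5 * (1 - Out5) * (t5 - Out5)
d6 = Out6 * (1 - Out6) * (t6 - Out6)
d3 = Out3 * (1 - Out3) * (d5 * w[5, 3] + d6 * w[6, 3])
d4 = Out4 * (1 - Out4) * (d5 * w[5, 4] + d6 * w[6, 4])

# Step 4: sample corrections (Eqs. (23)-(26))
dw31 = eta * d3 * Out1
db5 = eta * d5
print(Out5, d5, dw31, db5)
```

Since t_5 lies above Out_5 while t_6 lies below Out_6, the trace yields a positive δ_5 and a negative δ_6, pushing the two output nodes in opposite directions as the corrections of step (5) are applied.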

Fig. 10. A flowchart demonstrating use of the backpropagation algorithm.

It should be noted that if the momentum term is included, the correction equations for the weights and biases are modified to

Δw_ji(new) = η δ_j Out_i + μ Δw_ji(previous)
Δb_j(new) = η δ_j + μ Δb_j(previous)    (27)

(5) Compute new w's and b's. Update the weights and biases according to the equations:

w_ji(new) = w_ji(previous) + Δw_ji
b_j(new) = b_j(previous) + Δb_j    (28)

A flow chart that demonstrates the algorithm is illustrated in Fig. 10.

9. CNN development issues

Identifying the problem to be modeled in terms of cause and effect is at the top of the list of items to be considered by the designer of a neural network. Other issues which also necessitate consideration are tackled in more detail in the following subsections. These issues are related to:
- database size,
- initial connection weights,
- learning rate,
- momentum coefficient,
- activation function,
- convergence criteria for termination of training,
- size of hidden layer(s),
- number of iteration (or training) cycles.

9.1. Database size and data pre-processing

Models developed from data can be very much dependent on the size of the database on which they are constructed. Neural networks, like regression models, can be developed from data of any size. However, the generalization capability of the developed models will be of great concern. Since CNNs are required to generalize for prediction of previously unseen cases, prediction can be more reliable if the developed CNNs are used as interpolation tools rather than for extrapolation. Preferably, the data to be used for training should be large enough to cover all possible variability within the application domain. Unfortunately, this requirement cannot always be justified when economic issues are considered; some data are very expensive to obtain. Development of a CNN requires part of the mother database to be reserved for examining the generalization capability of the network, while a relatively larger part is used for training. Currently, there are no mathematically based rules regarding the required size of the test set which will ensure a perfect generalization of the trained network. A large testing set of input-output patterns may provide firm assurance for examining the generalization capability of the network; however, the remaining set of training patterns may not be adequate to yield any generalization. This issue is also dependent on the richness of the data. The authors' experience suggests that utilization of 20-25% of the data for testing, from a relatively large mother database, is practically sufficient for examining the prediction accuracy of the network while maintaining a wide variety of patterns for the network to learn. It is recommended that testing data sets be selected randomly, and preferably, a guided random selection of these sets be adopted.
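The random reservation of a 20-25% test set and the scaling of parameters to slightly offset limits such as 0.1-0.9, as discussed in this subsection, can be sketched as follows (the data values and the 25% split fraction are illustrative):

```python
import random

def minmax_scale(values, lo=0.1, hi=0.9):
    # scale a parameter to [lo, hi] relative to its observed extremes
    x_min, x_max = min(values), max(values)
    return [lo + (hi - lo) * (x - x_min) / (x_max - x_min) for x in values]

random.seed(42)
patterns = list(range(20))            # stand-ins for input-output exemplars
random.shuffle(patterns)              # random selection of the test set
n_test = round(0.25 * len(patterns))  # reserve ~20-25% for testing
test_set, train_set = patterns[:n_test], patterns[n_test:]

scaled = minmax_scale([2.0, 4.0, 6.0, 10.0])
print(len(train_set), len(test_set), scaled)
```

In a real application one would also check that the extreme (max-min) patterns remain in the training portion, since the network should not be asked to extrapolate during testing.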
In training the network on data that cover a large domain, patterns corresponding to extreme cases (max-min cases) must not be sampled for testing; neural networks are questionable in extrapolation to cases beyond the boundaries of the training domain. Presentation of training patterns to the network in terms of normalized values instead of their real values may also affect training. The input and output data are usually processed by scaling (normalizing) their values between the limits of the activation function used. Normalization has been found to be effective in achieving faster training by preventing larger numbers from overriding smaller ones. Each input or output parameter is normalized relative to its maximum and minimum values observed in the data. Normalization of an arbitrary parameter, X, can be carried out using

X_n = (X - X_min) / (X_max - X_min)    (29)

where X_n is the normalized value, and X_max and X_min are the maximum and minimum values of X. When using the sigmoidal function the data are normalized

Y.M. Najjrrr et al. /ht.

J. Food Microbiology

34 (1997) 27-49

43

between 0 and 1, or more preferably, between slightly offset values such as 0.1 and 0.9. Reducing the range may be advantageous to prevent the network from driving the weights to infinity, and thus slowing the learning process (Hassoun, 1995). 9.2. Connection

weights initiulization

Several methods exist for assigning the first-guess weights to the connection links. Weight initialization can be performed in a number of ways, such as: (i) using random-number generators (Pao, 1989); (ii) specifying a constant value for all weights in the network; or (iii) assuming the weights to be constant only within the layer in which they exist. Additionally, weight initialization can be performed by allocating random values drawn from an interval that depends on the number of links in the whole network or within the layer in which the assignment takes place. For example, weight initialization for a layer with n weights may be carried out by generating random values between -1/n and 1/n. The literature provides ample illustration of the methodologies that have been utilized. For the same method of weight initialization, the success of network training can, unfortunately, be application-dependent. The initial weights selected for a specific problem may or may not influence the rate of convergence of the learning process. Slow learning of the weights can occur if the network starts the search in a relatively flat region of the error surface. However, with unlimited computation time, and since the weights will be changed anyway during the learning process, the starting values may not have great significance.

9.3. Learning rate

The convergence speed of the backpropagation network is highly dependent on the learning rate, as it governs the speed at which the weights change between successive training cycles. If the learning rate is set too high, the weights can change significantly from one cycle to another. This may cause the search to be trapped in a local minimum, or may result in oscillation on the error surface that does not even guarantee a local minimum. Conversely, use of a small learning rate may lead to a search path in the direction of the global minimum, but with very slow convergence due to the large number of updating steps needed. A reasonable choice of the learning rate thus represents a compromise between slow convergence and an unstable search. Two different types of learning procedures exist with regard to the learning rate. The first methodology utilizes a constant learning rate along the entire span of the training history. Various values at which successful training of different networks was observed have been reported in the literature. Wythoff (1993) suggests learning rates between 0.1 and 10.0, while Zupan and Gasteiger (1991) recommend 0.3-0.6 as a good starting value. The authors have experienced different levels of success utilizing values between 0.1 and 0.95 for diverse applications. Explicitly stated, the designer's choice of the learning rate should only be based on testing the generalization capability of the network for the given problem (i.e., a trial-and-error approach). The second approach employs a learning rate that varies with the training cycle number. Generally, it is desirable to take larger steps when the search point is far away from a minimum, while a smaller step size becomes necessary as the search approaches a minimum. Since the distance from the search point to a minimum cannot be known a priori, various heuristics for handling the situation have been developed (Hassoun, 1995).

9.4. Momentum coefficient

The momentum term is closely related to the learning rate, and is also the subject of considerable debate in the area of network training. A momentum term is added to the weight-updating equations (Eq. (12)) to help the search escape local minima existing on the error surface. It also guides the learning rule to update the weights based on information pertinent to the size of the previous update step. The effect of including such a memory in weight updating has been found to push the solution past local minima. Constant and varying momentum coefficients can be utilized in training. Wythoff (1993) suggests a constant momentum coefficient between 0.4 and 0.9, while Hassoun (1995) suggests a normal choice between 0 and 1.0. Depending on the application problem, the success of training can be expected to vary with the selected momentum, and a selection based on trial-and-error is thus preferred. Adaptive momentum coefficients, in which the momentum varies dynamically with the training epoch, are also employed in the literature (Hassoun, 1995). These procedures are designed using information about the magnitude and sign of the current and previous gradients along the error surface.

9.5. Activation function

The activation (transfer) function is necessary to transform the weighted sum of all signals impinging onto a neuron into a state that determines the firing intensity of that neuron. Some functions indicate only whether a neuron fires or not (the discrete or step functions), regardless of the magnitude of the net excitation; such functions suppress all the possible variation in the expected net signal. Depending upon the required mapping, the output from a transfer function may be either continuous or discrete. Some transfer functions produce an output that may be either positive or negative, while others restrict the activation to the range between 0 and 1. Most applications utilizing the backpropagation algorithm employ the sigmoidal function for computing neuron activation. The sigmoidal function possesses the properties of being continuous and differentiable, characteristics that facilitate the backpropagation learning process. Other non-sigmoidal functions, e.g., tanh(x), can be of equivalent efficiency in backpropagation provided that they are differentiable. For specific applications, Moody and Yarvin (1992) have reported various levels of success with transfer functions in relation to the nonlinearity and noisiness of the data. However, the advantage of choosing a particular transfer function over others is currently not completely understood (Hassoun, 1995).

9.6. Convergence criteria

For training a particular network, a stopping criterion should be adopted. Convergence criteria are usually based on the error representing the difference between the target output(s) and the predicted output(s). Training may be allowed to proceed until the predicted output(s) for every pattern agree with the target output(s) within a prespecified tolerance. Another criterion may require that the predicted output(s) and the target output(s) agree with a prespecified coefficient of determination (while the intercept and slope of the 1:1 plot are close to 0 and 1, respectively). The most commonly used method of checking the stopping criterion employs some error measure representing the total deviation of the predicted output(s) from the corresponding target output(s) over all the input-output patterns involved in training. The error considered may be either absolute or relative. The absolute error (a dimensional quantity) can take several forms (Hanson, 1995), but the most common is the root-mean-square (RMS) error, defined as

RMS = \sqrt{ \frac{1}{n m} \sum_{p=1}^{n} \sum_{i=1}^{m} (t_{pi} - Out_{pi})^2 }

where Out_{pi} and t_{pi} are, respectively, the predicted output and the target (correct) output of the ith output node for the pth pattern, n is the number of training patterns, and m is the number of output nodes. Another measure of convergence, the relative error, can be calculated as the mean of the relative deviations of each pattern from its target value. This dimensionless error (%) may be viewed as more representative of the percentage deviation of an output from its target value. For any convergence criterion, an arbitrarily close agreement between predicted and target outputs can be achieved, undesirably, by overtraining the network over a very large number of training cycles. Therefore, these convergence criteria should only be used in connection with cross-validation sets that test the performance of the network in predicting the output(s) of unseen patterns. Overtraining can lead to networks that memorize, rather than generalize from, the training data.

9.7. Number of training cycles

The number of training cycles required to achieve proper mapping can be determined by an iterative procedure. In this method, one starts with a small number of training cycles and observes the change in error in both the training and testing data sets. Training for too long can result in a network that serves only as a look-up table, a phenomenon called overtraining, overfitting, memorization, or grandmothering. Continuous monitoring of the error evolution in the testing data set is the best way to avoid this undesirable behavior. During the training and testing phases, one can plot the variation of the error with the number of training cycles for both the training and testing sets. Fig. 11 shows that the error in the training set decreases continually as the number of training cycles increases. In contrast, an optimum point can be observed on the testing-set error curve. Any additional training beyond this optimum leads to a network that is less capable of accurately predicting unseen patterns. Error curves are therefore essential for optimal network design, as they provide a stopping criterion for training.
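The procedure just described (train, monitor the testing-set RMS error after every cycle, and note the cycle at which it is smallest) can be sketched with a deliberately tiny 1-4-1 sigmoid network trained by backpropagation. The toy mapping y = x^2, the learning rate of 0.5, the momentum of 0.6, and the [-1/n, 1/n] weight initialization are illustrative assumptions consistent with Sections 9.2-9.4, not values from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """A 1-n-1 sigmoid network trained by backpropagation with a
    learning rate (eta) and a momentum coefficient (mu)."""
    def __init__(self, n_hidden, seed=0):
        rng = random.Random(seed)
        b = 1.0 / n_hidden  # initialization interval [-1/n, 1/n] (Section 9.2)
        self.w1 = [rng.uniform(-b, b) for _ in range(n_hidden)]  # input->hidden
        self.b1 = [rng.uniform(-b, b) for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-b, b) for _ in range(n_hidden)]  # hidden->output
        self.b2 = rng.uniform(-b, b)
        # previous update steps, remembered for the momentum term
        self.pw1 = [0.0] * n_hidden
        self.pb1 = [0.0] * n_hidden
        self.pw2 = [0.0] * n_hidden
        self.pb2 = 0.0

    def forward(self, x):
        self.h = [sigmoid(w * x + b) for w, b in zip(self.w1, self.b1)]
        return sigmoid(sum(w * h for w, h in zip(self.w2, self.h)) + self.b2)

    def train_pattern(self, x, t, eta=0.5, mu=0.6):
        out = self.forward(x)
        d_out = (t - out) * out * (1.0 - out)           # output-node delta
        for j in range(len(self.h)):
            h = self.h[j]
            d_hid = d_out * self.w2[j] * h * (1.0 - h)  # hidden-node delta
            self.pw2[j] = eta * d_out * h + mu * self.pw2[j]
            self.w2[j] += self.pw2[j]
            self.pw1[j] = eta * d_hid * x + mu * self.pw1[j]
            self.w1[j] += self.pw1[j]
            self.pb1[j] = eta * d_hid + mu * self.pb1[j]
            self.b1[j] += self.pb1[j]
        self.pb2 = eta * d_out + mu * self.pb2
        self.b2 += self.pb2

def rms(net, patterns):
    """Root-mean-square error over a set of (input, target) patterns."""
    return math.sqrt(sum((t - net.forward(x)) ** 2 for x, t in patterns)
                     / len(patterns))

# Toy problem: inputs and targets already lie in the sigmoid range.
train_set = [(x / 10.0, (x / 10.0) ** 2) for x in range(10)]
test_set = [(0.15, 0.0225), (0.55, 0.3025), (0.85, 0.7225)]

net = TinyNet(n_hidden=4)
initial_error = rms(net, test_set)
test_errors = []
for cycle in range(2000):
    for x, t in train_set:
        net.train_pattern(x, t)
    test_errors.append(rms(net, test_set))  # monitor the testing set

best_cycle = 1 + test_errors.index(min(test_errors))
print("best testing RMS %.3f at cycle %d (initial %.3f)"
      % (min(test_errors), best_cycle, initial_error))
```

In a real application, training would be stopped at `best_cycle` (the optimum point on the testing-set error curve of Fig. 11), rather than allowed to run to the full cycle budget.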

9.8. Size of hidden layer(s)

Determination of the appropriate number of hidden layers and the number of hidden nodes in each layer is among the most crucial matters in neural network design. Unlike the input and output layers, one starts with no prior information about the size of the hidden layer(s). Several rules-of-thumb have been developed for approximating the required number of hidden nodes in a hidden layer from the number of nodes in the input and output layers. However, these rules are not based on solid proof, and thus may not work in many instances. Presently, the only way to determine the optimum size of a hidden layer is by trial-and-error. The required number of hidden nodes may be a function not only of the number of input and output nodes, but

Fig. 11. Criteria of termination of training and selection of optimum network architecture. (Error is plotted against the number of training cycles or hidden nodes; the optimum lies at the minimum of the testing-set error curve.)

may also depend on the available number of training and testing patterns, as well as the nonlinearity of the system under consideration. There are two iterative methods for determining the required number of hidden nodes in a hidden layer. Assuming a network with one hidden layer, the 'constructive method' starts with one hidden node and then adds nodes to the hidden layer as warranted by testing the generalization ability of the network on previously unseen patterns. The number of hidden nodes beyond which the prediction error on the testing sets (using some convergence criterion) begins to increase is taken as the optimum. Error curves similar to those developed for determining the optimum number of training cycles (Fig. 11) are also employed to optimize the network architecture. Unlike the constructive method, the 'destructive method' starts with a large network and progressively drops out hidden nodes that do not add to prediction accuracy, using a cost function (a combination of an ordinary error function and a term related to the number of connection weights in the network). This method is also called network pruning (Sietsma and Dow, 1988). Similar to the problem of overtraining caused by excessive presentation of the training data to the network (Section 9.7), having a large number of hidden nodes results in a network that is incapable of making good predictions; this is due to over-parameterization. Currently, the best way to avoid over-parameterization is to monitor the growth of the error in predicting unseen cases (cross-validation) during the training process.
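The constructive method can be sketched as a simple search loop. Because actually training a network at each size is beside the point here, `train_and_test` below is a hypothetical stand-in returning a synthetic U-shaped testing-error curve; in practice it would train a network with the given number of hidden nodes and report its testing-set error.

```python
def train_and_test(n_hidden):
    """Hypothetical stand-in for training a network with n_hidden nodes
    and returning its RMS error on the testing set. The synthetic curve
    imitates the typical behaviour: testing error falls as capacity
    grows, then rises again through over-parameterization."""
    return 0.30 / n_hidden + 0.02 * n_hidden

def constructive_search(max_hidden=12, patience=3):
    """Start with one hidden node and add nodes until the testing-set
    error has failed to improve for `patience` consecutive sizes."""
    best_n, best_err, worse = 0, float("inf"), 0
    for n in range(1, max_hidden + 1):
        err = train_and_test(n)
        if err < best_err:
            best_n, best_err, worse = n, err, 0   # new optimum found
        else:
            worse += 1                             # error is growing
            if worse >= patience:
                break
    return best_n, best_err

n_opt, err_opt = constructive_search()
print(n_opt)   # 4 (the minimum of the synthetic testing-error curve)
```

The destructive (pruning) method would run the same loop in reverse, starting large and removing nodes while the cost function keeps improving.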

10. Neural modeling of microbial growth

Neural networks, and specifically the backpropagation type, have attracted the attention of many investigators in a variety of fields in science and engineering. Applications in food science and industry include identification and control of the extrusion process (Eerikaeinen et al., 1994), prediction of sensory properties of food from related physical/chemical properties (Bardot et al., 1994), control of glucoamylase fermentation (Linko and Zhu, 1992), spectroscopy analysis (Naes et al., 1993; Goodacre and Kell, 1993), and assessment of milk shelf life (Vellejo-Cordoba et al., 1995). In the field of environmental health studies, the authors have applied neural networks to estimation of adsorption isotherm parameters of organic compounds (Basheer et al., 1995), treatment of phenol-polluted waters by adsorption (Basheer and Najjar, 1995, 1996), treatment of groundwater polluted with volatile organics (Basheer et al., 1996), and identification of the extent of subsurface groundwater contamination (Najjar and Basheer, 1995). In the second paper of this series (Hajmeer et al., 1996), the authors investigate the potential of neural networks as universal mapping models for the problem of microbial growth. In that paper, an application of the proposed method to the growth of Shigella flexneri is discussed. Moreover, the methodology for expanding the developed neural network into a universal growth prediction model is also presented. The ultimate goal of a universal model lies in its ability to predict the extent of microbial growth for a variety of microorganisms, food systems, and ambient environmental conditions.

11. Closure

In predictive microbiology, little work on the utilization of artificial neural networks has appeared. The current lack of popularity may be attributed to two reasons: (i) the development of neural networks is relatively recent (only a few years old) and not as well understood by the broad community of researchers as are statistical techniques; and (ii) the literature contains few research papers on the methodology and application of neural networks in this particular field. The present paper was initiated in order to overcome these two obstacles. Its main objective was to introduce this technology and publicize it to a level similar to that in other fields (e.g., analytical and instrumental chemistry). Both the basic definition of a neural network and the building block of its architecture (the neuron) were analyzed to provide insight into the significance of the networking scheme that accounts for the effectiveness of such computational tools. In this paper, backpropagation neural networks were discussed in relation to (i) their analogy to biological systems, (ii) the learning algorithm used in training the network, and (iii) methods of developing neural networks. Issues such as determining the optimum topology of the network, and others pertinent to selecting the critical parameters of the neural network (e.g., learning rate, momentum coefficient, number of iteration cycles), were also addressed. The backpropagation network was studied in detail, and the corresponding learning algorithm was presented and supported by an example of a 2-2-2 network. The purpose of this example was to provide more insight into how to use the algorithm, which, for many people, is often seen in books and research papers as an opaque collection of summation terms and subscripted indices. The application of CNNs to predicting the growth of S. flexneri, and the utilization of such a computational scheme for development of a universal microbial growth model, are discussed in the second paper of this series (Hajmeer et al., 1996).

Acknowledgements

The authors acknowledge Professor Mogens Jakobsen for suggesting preparation of this manuscript.

References

Bardot I., Bochereau L., Martin M. and Palagos B. (1994) Sensory-instrumental correlation by combining data analysis and neural network techniques. Food Quality and Preference 5, 159-166.


Basheer I.A. and Najjar Y.M. (1996) Predicting dynamic response of adsorption columns with neural nets. J. Comp. Civ. Eng., ASCE 10 (1), 31-39.
Basheer I.A., Najjar Y.M. and Hajmeer M.N. (1996) Neuronet modeling of VOC adsorption onto GAC. J. Env. Tech. 17, 795-806.
Basheer I.A. and Najjar Y.M. (1995) Designing and analyzing fixed-bed adsorption systems with artificial neural networks. J. Env. Sys. 24 (3), 291-312.
Basheer I.A., Najjar Y.M. and Hajmeer M.N. (1995) Adsorption isotherms of toxic organics assessed by neural networks. Proc. in Intelligent Engrg. Systems Through Artificial Neural Networks 5, 829-834.
Eerikaeinen T., Zhu Y.H. and Linko P. (1994) Neural networks in extrusion process identification and control. Food Control 5, 111-119.
Goodacre R. and Kell D.B. (1993) Rapid and quantitative analysis of bioprocesses using pyrolysis mass spectrometry and neural networks: application to indole production. Anal. Chim. Acta 279, 17-26.
Hajmeer M.N., Basheer I.A. and Najjar Y.M. (1996) Computational neural networks for predictive microbiology. II. Application to microbial growth. Int. J. Food Microbiol. 34, 51-66.
Hanson S.J. (1995) Backpropagation: some comments and variations. In: D.E. Rumelhart and C. Yves (Eds.), Backpropagation: Theory, Architecture, and Applications, pp. 237-271. Lawrence Erlbaum, NJ.
Hassoun M.H. (1995) Fundamentals of Artificial Neural Networks. MIT Press, Cambridge, MA.
Linko P. and Zhu Y.H. (1992) Neural networks in bioengineering. Kemia Kemi 19, 215-220.
Moody J. and Yarvin N. (1992) Networks with learned unit response functions. In: Moody et al. (Eds.), Advances in Neural Information Processing Systems, Vol. 4, pp. 1048-1055.
Naes T., Kvaak K., Isaksson T. and Miller C. (1993) Artificial neural networks in multivariate calibration. J. Near IR Spectrosc. 1, 1-11.
Najjar Y.M. and Basheer I.A. (1995) Spatial mapping of groundwater contamination using neuronets. Proc. in Intelligent Engrg. Systems Through Artificial Neural Networks 5, 817-822.
Pao Y.H. (1989) Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA.
Sietsma J. and Dow R.J.F. (1988) Neural net pruning: why and how. IEEE Int. Conf. Neural Networks 1, 325-333.
Simpson P.K. (1990) Artificial Neural Systems: Foundations, Paradigms, Applications, and Implementations. Pergamon, New York.
Vellejo-Cordoba B., Arteaga G.E. and Nakai S. (1995) Predicting milk shelf-life based on artificial neural networks and headspace gas chromatographic data. J. Food Sci. 60 (5), 885-888.
Wythoff B.J. (1993) Backpropagation neural networks: a tutorial. Chemomet. Intelligent Lab. Syst. 18, 115-155.
Zupan J. and Gasteiger J. (1991) Neural networks: a new method for solving chemical problems or just a passing phase? Anal. Chim. Acta 248, 1-30.
Zupan J. and Gasteiger J. (1993) Neural Networks for Chemists: An Introduction. VCH, New York.