Pattern Recognition Letters 15 (1994) 409-418
The classification capability of a dynamic threshold neural network

Cheng-Chin Chiang, Hsin-Chia Fu *

Department of Computer Science and Information Engineering, National Chiao-Tung University, Hsinchu, Taiwan 300, ROC

* Corresponding author. Email: [email protected]
Received 12 November 1992
Abstract
This paper proposes a new type of neural network called the Dynamic Threshold Neural Network (DTNN). Through theoretical analysis, we prove that the classification capability of a DTNN can be twice that of a conventional sigmoidal multilayer neural network. In other words, to successfully learn an arbitrarily given training set, a DTNN may need as little as half the number of free parameters required by a sigmoidal multilayer neural network.
1. Introduction
Recently, a number of researchers (Sontag, 1990; Baum and Haussler, 1989; Huang and Huang, 1991; Nilsson, 1965) have studied the recognition capability of multilayer neural networks. In general, the main results obtained in these studies are derivations of lower or upper bounds on the number of hidden neurons required to learn the recognition of a given training set S containing a fixed number of patterns. For example, it has been proved (Huang and Huang, 1991; Nilsson, 1965) that a committee machine requires at most k − 1 hidden neurons to dichotomize an arbitrary dichotomy defined on any training set with k patterns. Sontag (1990) also proved that if direct input-to-output connections or continuous sigmoid activation functions are used, then a network requires at most k hidden neurons to dichotomize an arbitrary dichotomy defined on any training set containing 2k patterns. Chiang and Fu (1992) proposed a new activation function called the Quadratic Sigmoidal Function (QSF) for multilayer neural networks to approximate continuous-valued functions. A QSF in R^n is defined as
QSF:   f(net_i, θ_i) = 1 / (1 + exp(net_i² − θ_i²)),   (1)
where net_i = w_i · x = w_{i,0} + Σ_{j=1}^{n} w_{i,j} x_j. The two vectors w_i = (w_{i,0}, ..., w_{i,n}) and x = (1, x_1, ..., x_n) are the weight vector and the input vector, respectively. The parameter θ_i is called the threshold because it controls the distance between the two state transition boundaries of the neuron. In comparison with conventional sigmoidal multilayer neural networks, we obtained satisfactory results with our QSF networks, such as faster learning, smaller network size, and better generalization capability. Fig. 1 shows a graphical demonstration of a QSF in R². Note that, by Eq. (1), the threshold θ_i is independent of the input x; we therefore call it a "static threshold", because the two state transition boundaries are fixed for each neuron during the retrieving phase. Thus, hereafter, we will refer to a QSF neuron as a Static Threshold Quadratic Sigmoidal neuron.
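As a concrete illustration (ours, not part of the original paper), the following NumPy sketch simply evaluates Eq. (1); the function name qsf_neuron and the sample values are assumptions made for the example.

    import numpy as np

    def qsf_neuron(x, w, theta):
        # Static Threshold Quadratic Sigmoidal neuron, Eq. (1):
        # w = (w_0, w_1, ..., w_n) with bias w_0, x is the raw input vector,
        # theta is the scalar (static) threshold.
        net = w[0] + np.dot(w[1:], x)                  # net_i = w_i . (1, x_1, ..., x_n)
        return 1.0 / (1.0 + np.exp(net**2 - theta**2))

    # The output is close to 1 between the two parallel boundaries net = -theta and net = +theta.
    x = np.array([0.3, -0.2])
    print(qsf_neuron(x, w=np.array([0.1, 1.0, 1.0]), theta=0.5))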
Fig. 1. Graphical demonstration of a QSF in R².

As shown in Fig. 1, each QSF defines two parallel linear state transition boundaries in the input space. Theoretically, a nonlinear state transition boundary should have better partitioning capability than a linear one. Therefore, in this paper we extend the QSF to another, more generalized activation function called the Extended QSF, which is defined as
Extended QSF:   f(net_i, Θ_i) = 1 / (1 + exp(net_i² − (g(Θ_i, x))²)),   (2)
where net_i = w_i · x = w_{i,0} + Σ_{j=1}^{n} w_{i,j} x_j, Θ_i = (θ_{i,0}, θ_{i,1}, ..., θ_{i,n}), and g(Θ_i, x) = θ_{i,0} + Σ_{j=1}^{n} θ_{i,j} x_j is called the thresholding function. According to Eq. (2), a Dynamic Threshold Quadratic Sigmoidal neuron i contains n weights (w_{i,j}, 1 ≤ j ≤ n), a bias w_{i,0}, and n + 1 threshold parameters (θ_{i,j}, 0 ≤ j ≤ n).
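A corresponding sketch for the dynamic threshold case may help; it is ours, not the authors', and simply evaluates Eq. (2). The parameter choice in the example reproduces panel (a) of Fig. 2, i.e. net = 2x + y and g = x + y.

    import numpy as np

    def extended_qsf_neuron(x, w, theta):
        # Dynamic Threshold Quadratic Sigmoidal neuron, Eq. (2):
        # w = (w_0, ..., w_n) are the weights (bias w_0),
        # theta = (theta_0, ..., theta_n) are the threshold parameters.
        net = w[0] + np.dot(w[1:], x)              # net_i
        g = theta[0] + np.dot(theta[1:], x)        # thresholding function g(Theta_i, x)
        return 1.0 / (1.0 + np.exp(net**2 - g**2))

    # Fig. 2(a): (1 + exp((2x+y)^2 - (x+y)^2))^-1  <=>  w = (0, 2, 1), theta = (0, 1, 1)
    print(extended_qsf_neuron(np.array([0.5, -0.5]),
                              w=np.array([0.0, 2.0, 1.0]),
                              theta=np.array([0.0, 1.0, 1.0])))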
Fig. 2. Graphical demonstrations of the Extended QSF in R²: (a) (1 + exp((2x+y)² − (x+y)²))⁻¹; (b) (1 + exp((x−y)² − (x+y)²))⁻¹; (c) (1 + exp((2.5)² − x²))⁻¹.
As shown in Fig. 2, the Extended QSF can form various types of state transition boundaries under different parameter settings. This property enables the Dynamic Threshold Quadratic Sigmoidal neuron to have a more powerful classification capability. By incorporating both Dynamic and Static Threshold Quadratic Sigmoidal neurons, we can design a more powerful multilayer neural network, called the Dynamic Threshold Neural Network, together with its learning algorithm. The proposed network architecture of a DTNN is shown in Fig. 3. In this paper, we study the classification capability of DTNNs which contain one Dynamic Threshold Quadratic Sigmoidal neuron in the output layer and several Static Threshold Quadratic Sigmoidal neurons in one hidden layer. We will prove that a single-hidden-layer DTNN can be twice as effective as a single-hidden-layer sigmoidal neural network in classification capability. In other words, to successfully learn a given training set, a DTNN may need as little as half the number of free parameters required by a sigmoidal neural network.
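The following forward-pass sketch reflects our own reading of the architecture just described (one hidden layer of Static Threshold Quadratic Sigmoidal neurons feeding a single Dynamic Threshold Quadratic Sigmoidal output neuron, with no direct input-to-output connection); the function names and parameter shapes are assumptions, not the authors' notation.

    import numpy as np

    def qsf(net, theta):                                   # Eq. (1), elementwise
        return 1.0 / (1.0 + np.exp(net**2 - theta**2))

    def extended_qsf(net, g):                              # Eq. (2)
        return 1.0 / (1.0 + np.exp(net**2 - g**2))

    def dtnn_forward(x, W_h, theta_h, w_o, Theta_o):
        # W_h:     (h, n+1) hidden weights, column 0 holds the biases
        # theta_h: (h,)     static thresholds of the hidden neurons
        # w_o:     (h+1,)   output weights, w_o[0] is the bias
        # Theta_o: (h+1,)   dynamic threshold parameters of the output neuron
        x1 = np.concatenate(([1.0], x))                    # prepend the constant input 1
        h = qsf(W_h @ x1, theta_h)                         # static threshold hidden layer
        y = np.concatenate(([1.0], h))                     # output neuron sees (1, h_1, ..., h_h)
        return extended_qsf(w_o @ y, Theta_o @ y)          # dynamic threshold output neuron

    rng = np.random.default_rng(0)
    print(dtnn_forward(rng.normal(size=3), rng.normal(size=(4, 4)),
                       rng.normal(size=4), rng.normal(size=5), rng.normal(size=5)))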
2. Classification capability of single-hidden-layer DTNNs

Before presenting our study on the capability of DTNNs, we have to introduce the Quadratic Heaviside function
Quadratic Heaviside:

H_q(w_i · x, θ_i) = 0, if θ_i² − (w_i · x)² < 0;
                    1, if θ_i² − (w_i · x)² ≥ 0.   (3)
The Quadratic Heaviside function is an extension of the conventional Heaviside function. We call a neuron which uses the Quadratic Heaviside activation function a Static Threshold Quadratic Heaviside neuron. Let H(x) denote the conventional Heaviside function, i.e., H(x) = 0 for x < 0 and H(x) = 1 for x ≥ 0. Then
H_q(w_i · x, θ_i) = H(q(w_i · x, θ_i)),   (4)
Fig. 3. Network architecture of Dynamic Threshold Neural Networks.
where q(w_i · x, θ_i) = θ_i² − (w_i · x)². By Eq. (3), it is clear that the output of a Quadratic Heaviside function remains unchanged if we scale up (or scale down) the weight vector w_i and the threshold θ_i by a nonzero factor k. Thus, the following lemma can easily be derived.
Lemma 1. For any constant k ≠ 0,

H_q(w_i · x, θ_i) = H_q(k w_i · x, k θ_i)   for all x ∈ R^n.

Let σ(x) be the sigmoid function, i.e., σ(x) = (1 + exp(−x))⁻¹. The following lemma will also be very useful in later theorem proofs regarding the capability of our DTNNs.

Lemma 2. For a given error tolerance ε > 0,

|σ(x) − H(x)| ≤ ε   for |x| ≥ |log((1 − ε)/ε)|.
Proof. For x < 0, H(x) = 0, so σ(x) must be less than or equal to ε, i.e., (1 + exp(−x))⁻¹ ≤ ε. It is easy to derive that this holds whenever x ≤ −log((1 − ε)/ε). On the other hand, for x ≥ 0, H(x) = 1, so σ(x) must be larger than or equal to 1 − ε, i.e., (1 + exp(−x))⁻¹ ≥ 1 − ε, which holds whenever x ≥ log((1 − ε)/ε). Therefore, we conclude that if |x| ≥ |log((1 − ε)/ε)|, then |σ(x) − H(x)| ≤ ε.  □
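A quick numerical check of this bound (ours; the tolerance value is an arbitrary example) can be written as follows.

    import numpy as np

    eps = 0.01
    bound = abs(np.log((1.0 - eps) / eps))                 # |log((1 - eps)/eps)|, about 4.595
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    heaviside = lambda x: 1.0 if x >= 0 else 0.0

    # For |x| >= bound, the sigmoid is within eps of the Heaviside function (Lemma 2).
    for x in (-5.0 * bound, -bound, bound, 5.0 * bound):
        assert abs(sigmoid(x) - heaviside(x)) <= eps + 1e-12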
In the same way that the QSF was extended to the Extended QSF, the Quadratic Heaviside function can be extended to the Extended Quadratic Heaviside Function (EQHF):

Ω(w_i · x, Θ_i) = 0, if (g(Θ_i, x))² − (w_i · x)² < 0;
                  1, if (g(Θ_i, x))² − (w_i · x)² ≥ 0,   (5)
where g(Θ_i, x) = θ_{i,0} + Σ_{j=1}^{n} θ_{i,j} x_j. We also call a neuron which uses the EQHF as its activation function a Dynamic Threshold Quadratic Heaviside neuron.

In the following, we will first study the capability of highway-linked feedforward (direct input-to-output connections are included) single-hidden-layer networks (see Fig. 4(a)) which contain Static Threshold Quadratic Heaviside neurons in one hidden layer and one Dynamic Threshold Quadratic Heaviside neuron in the output layer. Then we will extend our results to feedforward (no direct input-to-output connections) networks (see Fig. 4(b)) which contain Static Threshold Quadratic Sigmoidal neurons in one hidden layer and one Dynamic Threshold Quadratic Sigmoidal neuron in the output layer.

Suppose that a training set S consists of distinct vectors x_1, ..., x_p, where x_i ∈ R^n. Since the set

A = R^n − ⋃_{i≠j, 1≤i,j≤p} {s | s · (x_i − x_j) = 0, s ∈ R^n}

is not empty, we can always find a projection vector v in A such that the new training set
S_v = {y_i | y_i = v · x_i, 1 ≤ i ≤ p} contains only distinct elements. Assume that there exists a network containing h neurons in its first hidden layer that can dichotomize a dichotomy which is induced from S onto S_v. Let the weights of these h hidden neurons be w_1, w_2, ..., w_h (w_i ∈ R). Then it is obvious that the network can be transformed to dichotomize the original dichotomy on S by replacing the weights of these h hidden neurons with w_1 v, w_2 v, ..., w_h v. Without loss of generality, we can sort the elements in the new training set S_v and reindex them such that y_1 < y_2 < ... < y_p. A dichotomy {S⁺, S⁻} defined on S induces the dichotomy {S_v⁺, S_v⁻} on S_v, where

S_v⁻ = {y_i | y_i = v · x_i, x_i ∈ S⁻},

and S_v⁺ is defined analogously.
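The projection step can be sketched as follows (our code; the retry loop and the random choice of v are implementation conveniences). A randomly drawn v avoids the finitely many hyperplanes excluded from A with probability 1, and distinctness of the projected values is verified explicitly.

    import numpy as np

    def find_projection(X, tries=100, seed=0):
        # X: (p, n) array of distinct training vectors x_1, ..., x_p.
        # Returns v in A and the projected training set y_i = v . x_i.
        rng = np.random.default_rng(seed)
        for _ in range(tries):
            v = rng.normal(size=X.shape[1])
            y = X @ v
            if len(np.unique(y)) == len(y):        # all projections distinct  =>  v is in A
                return v, y
        raise RuntimeError("no suitable projection vector found")

    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
    v, y = find_projection(X)
    print(v, np.sort(y))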
Fig. 4. (a) A highway-linked feedforward single-hidden-layer network composed of Static Threshold Quadratic Heaviside neurons and Dynamic Threshold Quadratic Heaviside neurons; (b) a feedforward single-hidden-layer network composed of Static Threshold Quadratic Sigmoidal neurons and Dynamic Threshold Quadratic Sigmoidal neurons.

We shall assume that y_1 is in S_v⁺, since we can always find a vector v for this purpose. Now, we can prove the following theorem.

Theorem 1. Given a training set S = {y_1, y_2, ..., y_{4k+1} | y_i ∈ R, 1 ≤ i ≤ 4k + 1}, a highway-linked feedforward network containing at most k Quadratic Heaviside neurons in one hidden layer and one Dynamic Threshold Quadratic Heaviside neuron in the output layer can dichotomize an arbitrary dichotomy defined on S.
Proof. Let us use the notation "I_i < I_j" for intervals to mean that x < y for all x ∈ I_i and all y ∈ I_j. We also use the notation "x < I" to denote that x < y for all y ∈ I. Since the y_i's have been sorted in ascending order, we can find 4k + 1 disjoint closed subintervals I_i (1 ≤ i ≤ 4k + 1) such that y_i ∈ I_i and I_1 < I_2 < ... < I_{4k+1}. If the dichotomy is

S⁺ = {y_{2i+1} | 0 ≤ i ≤ 2k}   and   S⁻ = {y_{2i} | 1 ≤ i ≤ 2k},

then it would be the worst case for the network to dichotomize. Let I⁻ = ⋃_{i=1}^{2k} I_{2i} and I⁺ = ⋃_{i=0}^{2k} I_{2i+1}. Thus, if we can construct a network with the stated network architecture such that the network outputs "1" for x ∈ I⁺ and "0" for x ∈ I⁻, then the proof is complete. Let β_i, γ_i, γ_i′, and β_i′ (1 ≤ i ≤ k) be boundary points chosen such that

I_{4i−3} < β_i < I_{4i−2} < γ_i < I_{4i−1} < γ_i′ < I_{4i} < β_i′ < I_{4i+1}   for 1 ≤ i ≤ k,   (6)

and, in addition, choose two further boundary points β_0′ < I_1 and I_{4k+1} < β_{k+1}.
Let w_i = (w_{i,0}, w_{i,1}) denote the bias and weight of the ith Quadratic Heaviside hidden neuron, and let θ_i^(1) denote the threshold of the ith Quadratic Heaviside hidden neuron. Also let u = (u_0, u_1, ..., u_k) and Θ = (θ_0^(2), θ_1^(2), ..., θ_k^(2)) denote the hidden-to-output connection weight vector and the threshold vector of the Dynamic Threshold Quadratic Heaviside neuron, respectively. In addition, v (∈ R) is used to denote the direct input-to-output connection weight. Thus, the output of this network can be formulated as
O = f(u_0 + v · x + Σ_{i=1}^{k} u_i · h(w_{i,0} + w_{i,1} x, θ_i^(1)), Θ),   (7)

where x ∈ R is the input, f(·) denotes the Extended Quadratic Heaviside function (see Eq. (5)), and h(·) denotes the Quadratic Heaviside function (see Eq. (3)).

Fig. 5. Graphical demonstration of Theorem 1 (the target output, the individual hidden unit outputs, and the network output over the intervals I_1, ..., I_{4k+1}).

Now, let us set the parameters of this network as follows:
u_0 = 0,   v = 1,
u_i = −½(γ_i + γ_i′)   for 1 ≤ i ≤ k,
w_{i,0} = −½(β_i + β_i′),   w_{i,1} = 1,   θ_i^(1) = ½(β_i′ − β_i)   for 1 ≤ i ≤ k,
θ_0^(2) = max{|β_0′|, |β_{k+1}|},
θ_i^(2) = ½(γ_i′ − γ_i) − θ_0^(2)   for i ≠ 0.
Based on these settings, for a given input x in I_{4i−2}, I_{4i−1}, or I_{4i}, we have x ∈ (β_i, β_i′). By Eq. (3), it is easy to
prove that only the ith hidden neuron will output "1" and all other hidden neurons will output "0" (see Fig. 5). The value g(Θ, y) of the output neuron is then derived as ½(γ_i′ − γ_i). According to Eqs. (7) and (5), the output of the network becomes

O = f(x + u_i, Θ) = 1, if γ_i ≤ x ≤ γ_i′;
                    0, if β_i < x < γ_i or γ_i′ < x < β_i′.
On the other hand, for an input x not covered by any hidden neuron (all hidden outputs are "0"), the output becomes

O = f(x, Θ) = 1, if −θ_0^(2) ≤ x ≤ θ_0^(2) and x ∈ I_{4i+1} for some 0 ≤ i ≤ k;
              0, otherwise.

Therefore, for all x in the intervals I_{4i+1} (0 ≤ i ≤ k) the network outputs "1", while for all x ∈ I⁻ it outputs "0"; that is, the constructed network realizes the worst-case dichotomy, which completes the proof.  □
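The construction above can be played back in code. The sketch below is ours, not the authors': it fixes one convenient choice of the boundary points in Eq. (6) (interval midpoints) and of θ_0^(2), builds the k Quadratic Heaviside hidden neurons and the Dynamic Threshold Quadratic Heaviside output neuron with the stated settings, and checks the worst-case alternating dichotomy on a small example.

    import numpy as np

    def theorem1_network(y, k):
        # y: sorted training points y_1 < ... < y_{4k+1} (NumPy array of length 4k+1).
        # Boundary points of Eq. (6) chosen as midpoints between consecutive points.
        mid = lambda a, b: 0.5 * (a + b)
        beta, beta_p, gamma, gamma_p = [], [], [], []
        for i in range(k):                           # hidden neuron i+1 (0-based loop)
            b = 4 * i                                # 0-based index of y_{4i+1}
            beta.append(mid(y[b], y[b + 1]))         # between y_{4i+1} and y_{4i+2}
            gamma.append(mid(y[b + 1], y[b + 2]))
            gamma_p.append(mid(y[b + 2], y[b + 3]))
            beta_p.append(mid(y[b + 3], y[b + 4]))   # between y_{4i+4} and y_{4i+5}
        theta0 = max(abs(y[0]), abs(y[-1])) + 1.0    # one valid choice for theta_0^(2)
        hq = lambda net, th: 1.0 if th**2 - net**2 >= 0 else 0.0      # Eq. (3)

        def classify(x):
            # hidden neuron i fires exactly for beta_i <= x <= beta_i'
            h = [hq(x - 0.5 * (beta[i] + beta_p[i]), 0.5 * (beta_p[i] - beta[i]))
                 for i in range(k)]
            net = x + sum(-0.5 * (gamma[i] + gamma_p[i]) * h[i] for i in range(k))   # u_0 = 0, v = 1
            g = theta0 + sum((0.5 * (gamma_p[i] - gamma[i]) - theta0) * h[i] for i in range(k))
            return 1 if g**2 - net**2 >= 0 else 0    # Dynamic Threshold Quadratic Heaviside output
        return classify

    k = 3
    y = np.arange(4 * k + 1, dtype=float)                        # 13 sorted points
    labels = [1 if j % 2 == 0 else 0 for j in range(len(y))]     # worst case: y_1, y_3, ... in S+
    clf = theorem1_network(y, k)
    assert [clf(float(p)) for p in y] == labels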
In order to carry the construction of Theorem 1 over to Quadratic Sigmoidal neurons, we need the following lemma, which states that a Quadratic Heaviside function can be approximated arbitrarily closely by a scaled Quadratic Sigmoid function.

Lemma 3. Let Φ(w · y, θ) denote a Quadratic Sigmoid function, where w = (w_0, w_1) and y = (1, x). Given a compact set C ⊆ R − {(θ − w_0)/w_1, (−θ − w_0)/w_1} and an error tolerance ε > 0, then

|Φ(λ w · y, λθ) − H_q(w · y, θ)| ≤ ε   for all x ∈ C,

provided that λ ≥ √(|log((1 − ε)/ε)| / m),
where m = min({|θ² − (w_0 + w_1 x)²| | x ∈ C}).

Proof. The Quadratic Sigmoid function Φ(w · y, θ) can be regarded as a variant of the sigmoid function, i.e., Φ(w · y, θ) = σ(θ² − (w_0 + w_1 x)²), where σ(x) denotes the conventional sigmoid function. Thus, by Lemma 2, if |(λθ)² − (λw_0 + λw_1 x)²| ≥ |log((1 − ε)/ε)|, then

|σ((λθ)² − (λw_0 + λw_1 x)²) − H((λθ)² − (λw_0 + λw_1 x)²)| ≤ ε.
Let m = min({|θ² − (w_0 + w_1 x)²| | x ∈ C}). Since x cannot be (θ − w_0)/w_1 or (−θ − w_0)/w_1, we have m > 0. Therefore, the above condition can be rewritten as

|Φ(λ w · y, λθ) − H_q(λ w · y, λθ)| ≤ ε   for all x ∈ C,   if λ ≥ √(|log((1 − ε)/ε)| / m).

By Lemma 1, we conclude
|Φ(λ w · y, λθ) − H_q(w · y, θ)| ≤ ε,   if λ ≥ √(|log((1 − ε)/ε)| / m)   for all x ∈ C,

where m = min({|θ² − (w_0 + w_1 x)²| | x ∈ C}).  □
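A numerical illustration of Lemma 3 (ours; the weights, threshold, and grid points are arbitrary choices that avoid the two excluded transition points) is given below.

    import numpy as np

    def quad_sigmoid(x, w0, w1, theta):                    # Phi(w.y, theta) = sigma(theta^2 - net^2)
        return 1.0 / (1.0 + np.exp((w0 + w1 * x)**2 - theta**2))

    def quad_heaviside(x, w0, w1, theta):                  # H_q(w.y, theta), Eq. (3)
        return (theta**2 - (w0 + w1 * x)**2 >= 0).astype(float)

    w0, w1, theta, eps = -1.0, 1.0, 0.5, 1e-3
    # the transition points (theta - w0)/w1 = 1.5 and (-theta - w0)/w1 = 0.5 are excluded from the grid
    xs = np.array([0.0, 0.25, 0.75, 1.25, 1.75, 3.0])
    m = np.min(np.abs(theta**2 - (w0 + w1 * xs)**2))
    lam = np.sqrt(abs(np.log((1 - eps) / eps)) / m)        # scaling factor required by Lemma 3
    err = np.abs(quad_sigmoid(xs, lam * w0, lam * w1, lam * theta)
                 - quad_heaviside(xs, w0, w1, theta))
    assert np.all(err <= eps + 1e-12)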
Similarly, the following corollary tells us that each Extended Quadratic Heaviside function can be approximated by an Extended Quadratic Sigmoid function with an arbitrarily small error tolerance.

Corollary 1. Let Ψ(w · y, Θ) and Ω(w · y, Θ) denote an Extended Quadratic Sigmoid function and an Extended Quadratic Heaviside function, respectively, where w = (w_0, w_1, ..., w_n), Θ = (θ_0, θ_1, ..., θ_n), and y = (1, y_1, y_2, ..., y_n). Given an error tolerance ε > 0 and a compact set C on which (θ_0 + Σ_{i=1}^{n} θ_i y_i)² ≠ (w_0 + Σ_{i=1}^{n} w_i y_i)², then

|Ψ(λ w · y, λΘ) − Ω(w · y, Θ)| ≤ ε,   if λ ≥ √(|log((1 − ε)/ε)| / m)   for all y ∈ C,

where

m = min({|(θ_0 + Σ_{i=1}^{n} θ_i y_i)² − (w_0 + Σ_{i=1}^{n} w_i y_i)²| | (y_1, y_2, ..., y_n) ∈ C}).
In the following lemma, we prove that a Quadratic Sigmoid function can also be used to approximate the linear function f(x) = x.

Lemma 4. Let Φ(w · y, θ) denote a Quadratic Sigmoid function, where w = (w_0, w_1) and y = (1, x). Suppose that, for some weight vector w_c = (c, 0) and threshold θ_0,

∂Φ(w · y, θ)/∂net |_{net=c, θ=θ_0} = μ ≠ 0,

where net = w · y = w_0 + w_1 x. Let C ⊆ R be a compact domain. Then there exists a weight vector w_λ = (c − λ⁻¹c, λ⁻¹) such that

lim_{λ→∞} (λ/μ)[Φ(w_λ · y, θ_0) − Φ(w_c · y, θ_0)] + c = x   for all x ∈ C.
Proof. For convenience, let f(net, θ) denote the Quadratic Sigmoid function, where net = w · y = w_0 + w_1 x. Thus Φ(w_c · y, θ_0) = f(c, θ_0). Since ∂f(net, θ)/∂net |_{net=c, θ=θ_0} = μ ≠ 0, we have

lim_{λ→∞} [f(c + λ⁻¹(x − c), θ_0) − f(c, θ_0)] / (λ⁻¹(x − c)) = μ.
Rearranging the terms in the above equation, we obtain

lim_{λ→∞} (λ/μ)[f(c + λ⁻¹(x − c), θ_0) − f(c, θ_0)] + c = x.

In other words, there exists a weight vector w_λ = (c − λ⁻¹c, λ⁻¹) such that

lim_{λ→∞} (λ/μ)[Φ(w_λ · y, θ_0) − Φ(w_c · y, θ_0)] + c = x   for all x ∈ C.  □
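The limit in Lemma 4 is easy to check numerically; the sketch below (ours) estimates μ by a finite difference and shows the approximation error shrinking as λ grows.

    import numpy as np

    def quad_sigmoid(net, theta):                  # Phi written as a function of net and theta
        return 1.0 / (1.0 + np.exp(net**2 - theta**2))

    c, theta0 = 1.0, 2.0
    d = 1e-6                                       # finite-difference step for estimating mu
    mu = (quad_sigmoid(c + d, theta0) - quad_sigmoid(c - d, theta0)) / (2 * d)

    xs = np.linspace(-2.0, 3.0, 11)
    for lam in (1e2, 1e4):
        # w_lambda = (c - c/lam, 1/lam), so its net at input x is c + (x - c)/lam
        approx = (lam / mu) * (quad_sigmoid(c + (xs - c) / lam, theta0)
                               - quad_sigmoid(c, theta0)) + c
        print(lam, np.max(np.abs(approx - xs)))    # error decreases as lam grows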
In the proof of Lemma 4, μ, c, and Φ(w_c · y, θ_0) are all constants (independent of x); thus this lemma says that we can use one Quadratic Sigmoid function (Φ(w_λ · y, θ_0)) to approximate the linear function f(x) = x. With the above auxiliary lemmas, the following theorem can be proved.

Theorem 2. Given a training set S = {y_1, y_2, ..., y_{4k+1} | y_i ∈ R, 1 ≤ i ≤ 4k + 1}, a single-hidden-layer DTNN containing at most k + 1 Static Threshold Quadratic Sigmoidal hidden neurons and one Dynamic Threshold Quadratic Sigmoidal output neuron can dichotomize an arbitrary dichotomy defined on S.

Proof. Consider each Quadratic Heaviside hidden neuron i of the network constructed in Theorem 1. Based on the parameter settings in the proof of Theorem 1, we derive

(θ_i^(1) − w_{i,0}) / w_{i,1} = β_i′,   (−θ_i^(1) − w_{i,0}) / w_{i,1} = β_i.
We have assumed in Eq. (6) that both β_i and β_i′ are not in any interval I_j for 1 ≤ j ≤ 4k + 1; then, by Lemma 3, each Quadratic Heaviside term h(w_{i,0} + w_{i,1}x, θ_i^(1)) in Eq. (7) can be replaced by a Static Threshold Quadratic Sigmoidal neuron with activation function Φ(λ w · y, λθ) if λ is large enough. Let h_i denote the output of the ith neuron. In the proof of Theorem 1, we have seen that for any input in ⋃_{i=1}^{k} {I_{4i−2} ∪ I_{4i−1} ∪ I_{4i}}, one and only one hidden neuron will output "1". Thus, based on the parameter settings in the proof of Theorem 1, the term (g(Θ, y))² − (w · y)² for the Dynamic Threshold Quadratic Heaviside output neuron is equal to (θ_i^(2) + θ_0^(2))² − (x + u_i)² (i ≠ 0), where y = (1, x, h_1, h_2, ..., h_k) and w = (u_0, v, u_1, u_2, ..., u_k). In the settings of the proof of Theorem 1, θ_i^(2) = ½(γ_i′ − γ_i) − θ_0^(2) and u_i = −½(γ_i′ + γ_i). In addition, we have also assumed that γ_i and γ_i′ are not in any interval I_j for 1 ≤ j ≤ 4k + 1. On the other hand, for any input in an interval I_{4i+1} (0 ≤ i ≤ k), all hidden neurons output "0", so the term becomes

(g(Θ, y))² − (w · y)² = (θ_0^(2))² − x²,

where θ_0^(2) = max{|β_0′|, |β_{k+1}|}. Since both β_0′ and β_{k+1} are not in any interval I_j for 1 ≤ j ≤ 4k + 1, the quantity (g(Θ, y))² − (w · y)² is bounded away from zero at every training point. Hence, by Corollary 1, the Dynamic Threshold Quadratic Heaviside output neuron can be replaced by a Dynamic Threshold Quadratic Sigmoidal neuron within any prescribed error tolerance, provided that the scaling factor λ is large enough. Finally, by Lemma 4, the direct input-to-output connection v · x of the highway-linked network can be approximated by one additional Static Threshold Quadratic Sigmoidal hidden neuron. Therefore, at most k + 1 Static Threshold Quadratic
Sigmoidal hidden neurons are needed.  □

Let us compare Dynamic Threshold Neural Networks with conventional sigmoidal networks in terms of the number of free parameters. Suppose that the input dimension is n. Given a training set with 4k + 1 training patterns, the Dynamic Threshold Quadratic Sigmoidal network requires at most (k + 1)(n + 2) + 2(k + 1) free parameters. However, the sigmoidal network requires at most 2k(n + 1) + (2k + 1) free parameters. Thus, for problems with large input dimensions (n is large), the upper bound on the number of free parameters required for Dynamic Threshold Neural Networks is only half of the number required for sigmoidal networks. For a training set with a large number of patterns (k is large), the ratio between the upper bounds on the number of free parameters required for Dynamic Threshold Neural Networks and sigmoidal networks is (n + 4)/(2n + 4). Thus, Dynamic Threshold Neural Networks reduce the number of free parameters by a factor of between 6/5 (for n = 1) and 2 (for n → ∞).
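The two parameter counts quoted above can be tabulated directly (our sketch; both counts are taken from the text, not re-derived here).

    def dtnn_params(k, n):
        # (k+1) static QSF hidden neurons with n weights + bias + threshold each,
        # plus 2(k+1) parameters for the dynamic threshold output neuron
        return (k + 1) * (n + 2) + 2 * (k + 1)

    def sigmoidal_params(k, n):
        # 2k sigmoidal hidden neurons with n weights + bias each, plus 2k + 1 output parameters
        return 2 * k * (n + 1) + (2 * k + 1)

    k = 1000                                           # a training set with 4k + 1 patterns
    for n in (1, 10, 100):
        print(n, dtnn_params(k, n) / sigmoidal_params(k, n))   # approaches (n + 4)/(2n + 4)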
3. Concluding remarks and future work

In this paper, a new type of neural network, called the Dynamic Threshold Neural Network, with more powerful classification capability is proposed. By using a Dynamic Threshold Quadratic Sigmoidal neuron in the output layer and k + 1 Static Threshold Quadratic Sigmoidal neurons in the hidden layer, a single-hidden-layer Dynamic Threshold Neural Network can be constructed to dichotomize an arbitrary dichotomy defined on any training set containing at most 4k + 1 training patterns. Thus, in comparison with conventional sigmoidal multilayer neural networks, we claim that Dynamic Threshold Neural Networks improve the recognition capability of single-hidden-layer neural networks by a factor of 2. Based on the gradient descent method, it is very easy to develop a learning algorithm for the DTNN in a way similar to the backpropagation learning algorithm of Rumelhart et al. (1986) for conventional sigmoidal networks. In the future, research into the following two topics concerning DTNNs is suggested:
• the capabilities of more complicated architectures, such as non-feedforward networks, networks with more layers, or networks with Dynamic Threshold Quadratic Sigmoidal neurons in hidden layers;
• the design of efficient learning algorithms and practical applications for DTNNs.
References

E.B. Baum and D. Haussler (1989). What size net gives valid generalization? Neural Comput. 1, 151-160.
C.C. Chiang and H.C. Fu (1992). A variant of second-order multilayer perceptron and its application to function approximations. In: Proc. IJCNN '92, Baltimore, MD, III:887-III:892.
S.C. Huang and Y.F. Huang (1991). Bounds on the number of hidden neurons in multilayer perceptrons. IEEE Trans. Neural Networks 2 (1), 47-55.
N.J. Nilsson (1965). Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill, New York.
D.E. Rumelhart, J.L. McClelland and the PDP Research Group (1986). Parallel Distributed Processing (PDP): Explorations in the Microstructure of Cognition (Vol. 1). MIT Press, Cambridge, MA.
E.D. Sontag (1990). On the recognition capabilities of feedforward nets. Tech. Report SYCON 90-03, SYCON-Rutgers Center for Systems and Control, Department of Mathematics, Rutgers University, New Brunswick, NJ.