Neural Networks, Vol. 8, No. 6, pp. 905--913, 1995
Copyright © 1995 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0893-6080/95 $9.50 + .00
CONTRIBUTED ARTICLE
Mapping Hierarchical Neural Networks to VLSI Hardware

RALPH D. MASON¹ AND WILLIAM ROBERTSON²
¹University of Regina and ²Technical University of Nova Scotia

(Received 3 August 1993; revised and accepted 4 January 1995)

Abstract--Electronic ANNs rely heavily on the use of two-dimensional silicon and PCB substrates. Use of these substrates results in hierarchical hardware that exhibits varying levels of connectivity. Numerous approaches have been developed for generating hierarchical neural networks; however, generating hierarchical networks is only part of the problem. Equally important is the task of mapping hierarchical networks onto the hierarchical hardware. It will be shown theoretically and experimentally that, at least within a restricted domain, hierarchical networks can be mapped to hierarchical hardware more efficiently than nonhierarchical networks. The experimental hypothesis will be carried out through simulations with a number of clustering algorithms that rely on graph partitioning information. The clustering algorithms will also serve as a general purpose hierarchical hardware mapping tool. Their performance will be evaluated with both hierarchical and nonhierarchical test cases that include a MAXNET (pick maximum of inputs) application and a speech recognition task.
Keywords--Neural, Network, Hierarchy, Mapping, VLSI, Hardware, Cluster, Partition.

1. INTRODUCTION

Electronic neural parallel processing architectures can take on numerous forms (Hubbard, 1986; Garth, 1987; Murray & Smith, 1988; Kung, 1988). All of the parallel structures rely in some form or another on high-density Very Large Scale Integration (VLSI) circuits that can incorporate one or more processors on a single die or wafer. These die/wafers can then be connected together on Printed Circuit Boards (PCBs) to form a complete system. The reliance on two-dimensional silicon and PCB structures presents a problem for electronic ANNs in that the high global connectivity in many neural network models (Hopfield, 1982; Rumelhart & McClelland, 1986; Carpenter & Grossberg, 1988) cannot be directly supported (one-to-one mapping between connections and wires). Alternative interconnection technologies that could reduce connectivity problems are being explored (Gschwendtner, 1988); however, these technologies, if at all feasible, will not be available in the near term.

Examining a typical system we see that it is composed of a number of levels. At the lowest levels we have neurons with a large number of interconnections in what are referred to as clusters. These neurons, connections, and clusters are physically implemented as Processing Elements (PEs), wires, and chips, respectively. At the next level clusters are grouped together with reduced connections between clusters. These groups of clusters are equivalent to a PCB containing multiple chips. At the highest level, groups of clusters form the entire network, again with reduced connectivity between groups in relation to that within a group. This highest level corresponds to an overall system containing multiple PCBs. The structure outlined above is hierarchical in nature, with each level in the hierarchy corresponding to a different level of connectivity.¹

A number of approaches have been developed for creating hierarchically connected networks to reduce the global connectivity problem. These include systems with multiple paradigms (Oyster, 1988; Grossberg, 1988; Olson & Huang, 1989), new hierarchical models
¹ Throughout this paper when the authors refer to hierarchy they will be limiting their discussion strictly to connectivity hierarchy. For our purposes, hierarchy will be defined in terms of the physical connectivity of the hardware. The closer the ANN levels of connectivity are to the physical connectivity, the better the hardware mapping and the higher the level of hierarchy. By closer we mean near to or less than, so that in most instances whenever we can decrease the node connectivity we increase the hierarchy.
Acknowledgements: This study was supported by the Natural Science and Engineering Research Council of Canada. Requests for reprints should be sent to Ralph Mason, Faculty of Engineering, University of Regina, Regina, Saskatchewan, Canada S4S 0A2.
(Hampson & Volper, 1987; Edelman, 1987; Fukushima & Miyake, 1982), functional decomposition (Feldman & Ballard, 1982), pruning (Sietsma & Dow, 1988), reduced precision (Feldman & Ballard, 1982), coarse coding (Hinton, 1980), and feature space partitioning (Mason, 1991).

Generating hierarchically connected networks is the first part of the problem. Equally important is the ability to map these networks to the final hardware. Assuming that the final implementation medium is hierarchical in nature, we would like to show that the more hierarchical the neural network, the more efficient the mapping to silicon. A general proof is difficult given that most neural network models do not have a direct translation into a hierarchical form. Further, if we consider networks that do have a more hierarchical form, then generally there are varying numbers of neurons and connections and there is no simple way to relate these between different models (i.e. the advantages of hierarchy depend upon the problem as well as the model). It has been shown that the computational capabilities of networks are dependent on neuron function, overall network connectivity (feedback, multilayers, etc.), and the total number of connections (Lippman, 1987; Hopfield, 1982; Hartley & Szu, 1987). For our purposes we will consider hierarchy solely in terms of the number of connections (i.e. we have considered the hardware to place no limitations on the neuronal functions or types of network connectivity). Within these restricted confines, if we consider the number of connections as a measure of the degrees of freedom, then networks with equal amounts of connectivity should have similar functional capabilities. With this in mind, the authors will show both theoretically and experimentally that hierarchical networks do map more efficiently to hierarchical hardware when the average network connectivity is significantly greater than the connectivity at the different levels of hardware.
2. THEORETICAL ANALYSIS

Determining the number of clusters required for a network is highly dependent on the individual node connectivity (i.e. whether or not the connecting nodes can be placed in the same cluster). It is possible to have two networks with the same number of nodes and connections that require different numbers of clusters due to their particular internode connectivity. To draw any kind of conclusions we will have to rely on measures such as average node connectivity and average intercluster connectivity. The resulting equations we develop will therefore be general in nature; however, we will support the theoretical
conclusions with experimental results for a number of individual problems.

Our problem can be summarized as follows: given two networks NET1 and NET2 with the same number of total connections $M_t$ and average node connectivities $M_{n1}$ and $M_{n2}$, can we show that the more hierarchical network NET2 ($M_{n2} < M_{n1}$) will map better to the hierarchical hardware? Our aim is to minimize the number of clusters required. We are, however, constrained by the hardware, which has a maximum number of nodes per cluster ($S_c$) and a maximum intercluster connectivity ($M_c$), such that the average number of connections per node outside a cluster ($M_{no}$) is given by:

$$M_{no} \le \frac{M_c}{S_c} \qquad (1)$$

We also know that the average number of connections per node ($M_n$) is composed of an average number of connections inside the cluster ($M_{ni}$) as well as an average number of connections outside the cluster ($M_{no}$):

$$M_n = M_{ni} + M_{no} \qquad (2)$$

Substituting for $M_{no}$ in eqn (1) we have

$$M_n - M_{ni} \le \frac{M_c}{S_c} \qquad (3)$$
$M_c$ and $S_c$ are fixed numbers determined by the hardware constraints. $M_{ni}$, although variable, is limited to a maximum of $S_c$ and is usually smaller depending on which nodes are in the cluster. Given that we have a fixed cluster size, we try to minimize the number of clusters by utilizing as many nodes in the cluster as possible. We will assume that the average cluster node connectivities ($M_n$, $M_{ni}$, and $M_{no}$) do not vary a great deal from cluster to cluster. In the situations we are concerned with we have large networks with relatively high connectivity, where $M_n \gg M_{ni}$ and the only way to satisfy the inequality is to reduce $M_n$ by some technique like functional decomposition (Feldman & Ballard, 1982). Using functional decomposition we can break a node with $N$ connections into approximately $\log_M N$ layers of nodes with $M$ connections. This will increase the number of nodes required by $(N-1)/(M-1)$. Looking at a single node we see that it has an original connectivity $M_n$ of which $M_{ni}$ can connect within the cluster and therefore do not have to be decomposed. We can therefore substitute $M_n - M_{ni}$ for $N$ in the previous in-line equation. The final nodes after functional decomposition will have $M$ connections that are composed of $M_{ni}$ intracluster connections and $M_{no}$ intercluster connections. Making the above substitutions, the number of new nodes ($N_{new}$) per original node is given by

$$N_{new} = \frac{(M_n - M_{ni}) - 1}{(M_{no} + M_{ni}) - 1} \qquad (4)$$
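As a concrete check on the $(N-1)/(M-1)$ estimate, a node with $N$ inputs can be decomposed into a reduction tree of nodes limited to fan-in $M$. The sketch below is our own illustration (not from the original paper) and simply counts the nodes such a tree needs.

```python
import math

def tree_nodes(n_inputs, fan_in):
    """Count the nodes in a reduction tree that combines n_inputs
    values using nodes limited to fan_in connections each."""
    count = 0
    while n_inputs > 1:
        # Each group of fan_in values feeds one node in the next layer.
        n_inputs = math.ceil(n_inputs / fan_in)
        count += n_inputs
    return count

# When n_inputs is an exact power of fan_in, the count matches (N - 1)/(M - 1):
print(tree_nodes(16, 2))   # 15 == (16 - 1) / (2 - 1)
print(tree_nodes(27, 3))   # 13 == (27 - 1) / (3 - 1)
```

The number of layers in such a tree is approximately $\log_M N$, as the text states (4 and 3 layers, respectively, in the two examples above).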
The total number of nodes ($N_t$) is a function of the initial number of nodes ($N_{init}$) and can be written

$$N_t = N_{new} \cdot N_{init} \qquad (5)$$

The initial number of nodes is a function of the total number of connections ($M_t$) and the average node connectivity, so that

$$N_{init} = \frac{M_t}{M_n} \qquad (6)$$
The total number of clusters ($C_t$) is now given by

$$C_t = \frac{N_t}{S_c} \qquad (7)$$

where we assume as many nodes in a cluster are used as possible. Substituting eqns (4), (5), and (6) into eqn (7) we have

$$C_t = \frac{M_t}{M_n} \cdot \frac{(M_n - M_{ni}) - 1}{(M_{no} + M_{ni}) - 1} \cdot \frac{1}{S_c} \qquad (8)$$
Finally, for analysis' sake let us consider the optimum case where we use all the intercluster connectivity available. Equation (1) now becomes an equality, and the average intracluster node connectivity ($M_{ni}$) equals $S_c$ (each node connects to every other node within its cluster).² We can now rewrite eqn (8) as

$$C_t = \frac{M_t}{M_n} \cdot \frac{(M_n - S_c) - 1}{\dfrac{M_c}{S_c} + S_c - 1} \cdot \frac{1}{S_c} \qquad (9)$$

We see that the number of clusters required is a complex function of the total number of connections, number of nodes per cluster, number of intercluster connections, and average network node connectivity. One of the initial assumptions was that two networks with approximately the same connectivity could perform the same function; therefore, when comparing networks, we will assume that $M_t$ is a constant. Selecting networks with average node connectivities $M_{n1} = 100$ and $M_{n2} = 30$ and $M_t = 10000$ (i.e. NET1 initially has 100 nodes and NET2 has 334 nodes) we can plot the number of clusters required as a function of the cluster size and intercluster connectivity (Figure 1). It is evident from this plot (as well as other plots we have done with various values of $M_t$, $M_n$, $S_c$, and $M_c$) that as the cluster size and intercluster connectivity increase there is, as one might expect, a decrease in the total number of clusters required.

Looking at the plot in Figure 1 we can also see that for each set of cluster sizes and intercluster connectivities the clusters required are composed of two components. The black filled component is the total number of clusters that would be required with an average node connectivity of 30. The speckled component is the number of clusters that would be required with an average node connectivity of 100. For small cluster size and intercluster connectivity, the lower connectivity network ($M_n = 30$) has a smaller component and would therefore require fewer clusters. For larger cluster size and intercluster connectivity the situation is reversed. Later we will see that these same results are found when we perform a detailed mapping of individual networks to hierarchical hardware. The reason for the above results is that with large cluster size and intercluster connectivity the total number of nodes required is the dominant factor. Therefore, the more hierarchical network with its larger number of starting nodes requires a greater number of clusters.

² The first assumption is quite valid, as functional decomposition of large networks will use a high percentage of the available intercluster connectivity. The second assumption is more arbitrary, as the intracluster connectivity is dependent on the network architecture, cluster size, etc. The use of a constant value ($S_c$) between different networks is, however, valid for general comparison purposes. As mentioned, our experimental results will provide more accurate data for a number of individual cases.
FIGURE 1. Number of clusters required as a function of cluster size and intercluster connectivity for average network connectivities of $M_n$ = 30 and $M_n$ = 100.
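The crossover behaviour shown in Figure 1 can be reproduced directly from eqn (9). The sketch below is our own illustration; the `max(1, ...)` clamp is an added assumption covering the regime where no functional decomposition is needed (the raw ratio would otherwise fall below one node per original node).

```python
def clusters_required(M_t, M_n, S_c, M_c):
    """Cluster count estimate from eqn (9): M_t total connections,
    M_n average node connectivity, S_c nodes per cluster,
    M_c intercluster connections per cluster."""
    n_init = M_t / M_n                                # eqn (6)
    n_new = (M_n - S_c - 1) / (M_c / S_c + S_c - 1)   # eqn (4) at the optimum
    n_new = max(1.0, n_new)   # assumption: at least one node per original node
    return n_init * n_new / S_c                       # eqns (5) and (7)

# Small clusters favour the hierarchical network (M_n = 30) ...
print(clusters_required(10000, 30, 5, 10) < clusters_required(10000, 100, 5, 10))
# ... large clusters favour the less hierarchical one (M_n = 100).
print(clusters_required(10000, 30, 20, 100) > clusters_required(10000, 100, 20, 100))
```

Both comparisons print True, matching the two regimes described in the text.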
As we decrease cluster size and connectivity the intercluster connectivity becomes dominant and functional decomposition of the less hierarchical network results in a larger number of nodes and clusters. It is important to realize that in many real-world situations (Gschwendtner, 1988; Rumelhart & McClelland, 1986; Hecht-Nielsen, 1988) we are dealing with small cluster size and intercluster connectivity (relative to the network size and connectivity) so that hierarchical networks do appear to map better to hierarchical hardware.
3. EXPERIMENTAL IMPLEMENTATION

In the previous section we saw how a theoretical analysis showed that hierarchical networks do indeed map better to hierarchical hardware, at least under the given constraints. In the following section we will look at experimental algorithms that can be used to map individual networks to hardware. The primary intent of exploring these mapping algorithms was to lend support to the theoretical analysis. As a secondary goal the algorithms act as a starting point for developing a network mapping tool. The experimental analysis will be based on clustering algorithms, which rely on graph partitioning information. For our simulations we will use only two levels of hierarchy; however, the algorithms could be easily extended to handle multiple levels. At the lowest level (which would correspond to intrachip connections) we will assume complete connectivity. This is achievable with many of the analog and multiplexed VLSI architectures being developed.

Initially the authors looked at graph partitioning algorithms for mapping networks (Leiserson, 1983; Kernighan & Lin, 1970). Mapping networks is intimately related to minimizing graph partitions. Although numerous theorems have been developed for partitioning graphs they invariably deal with only a restricted set of graphs. Partitioning arbitrary graphs is NP-complete and as a consequence heuristic algorithms are often required when dealing with arbitrary networks. A common starting point for heuristic algorithms is the incorporation of graph connectivity information such as connected regions, Strongly Connected Regions (SCRs), Maximally Strongly Connected Regions (MSCRs), and cliques. From our efforts, two major limitations with graph connectivity information were discovered. First, due to differences in graph and hardware granularity (size of connected regions), additional partitioning techniques were required.
For example, a large fully connected network (this would correspond to a single clique) with a relatively small number of nodes per cluster would require additional
method(s) to partition the clique amongst a number of clusters. The second limitation is that determining graph connectivity information, such as the number of cliques in a graph, is extremely expensive with an order exponential in the size of the graph (Tarjan, 1973; Tiernan, 1970). Although connectivity measures are of value, the residual partitioning tasks, and in fact the entire graph partitioning problem, could be cast as a general clustering problem with constraints (McClelland & Rumelhart, 1988; Sebestyen, 1962).
3.1. Algorithm Selection and Implementation

The selection of a clustering algorithm was narrowed down to switching and adding algorithms based upon the deficiencies of the other algorithms (Hartigan, 1975). The approach we decided upon was to use both types of algorithms in a two-stage clustering procedure. The ultimate clustering would therefore take advantage of the superior qualities of both types of algorithms. In the first stage a quick adding algorithm based on graph connectivity information was used to generate an initial suboptimal partition. This partition could then be used as a good starting point for a more extensive switching algorithm. The adding algorithm selected was the popular leader clustering algorithm. The switching algorithm we used was the K-means algorithm (Sebestyen, 1962; Cox, 1957).

In most clustering algorithms the selection of an appropriate distance metric is critical (Hartigan, 1975). For our simulations we used the actual graph distance³ modified such that nodes with symmetric connections (i.e. two nodes a and b are symmetric if there is a connection from a to b and a connection from b to a) have a distance of 1/2. By definition, the distance between unconnected nodes is infinite; however, it was necessary to select an arbitrary large value to differentiate partitions with multiple disconnected nodes. The error term was simply a sum over all clusters of the distances between nodes within the cluster. The basic algorithm (which is referred to as CLUSTER1) can be seen in Figure 2. Of note is the modified K-means algorithm, which takes each node and tries switching it with all other nodes that are not in the same cluster. The switch is completed only if there is a reduction in the error function. After going through all permutations we test to see if the intercluster connectivity constraints are satisfied. If the connection constraints are satisfied

³ The distance between two nodes a and b, denoted d(a, b), is the number of edges in a path of shortest length from a to b.
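The modified graph distance described above can be sketched as follows. The adjacency representation and the particular large finite value standing in for "infinite" distance are our own illustrative choices.

```python
from collections import deque

BIG = 10**6  # arbitrary large value standing in for infinite distance

def graph_distance(adj, a, b):
    """Modified graph distance: 1/2 for symmetric connections,
    shortest directed path length otherwise, BIG if unreachable.
    adj maps each node to the set of nodes it connects to."""
    if a == b:
        return 0
    if b in adj.get(a, set()) and a in adj.get(b, set()):
        return 0.5  # symmetric connection
    # Breadth-first search over the directed connections.
    frontier = deque([(a, 0)])
    seen = {a}
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return BIG

adj = {0: {1}, 1: {0, 2}, 2: set()}
print(graph_distance(adj, 0, 1))  # 0.5 (symmetric pair)
print(graph_distance(adj, 0, 2))  # 2
print(graph_distance(adj, 2, 0))  # BIG (unconnected)
```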
FIGURE 2. Clustering algorithm flowchart.

TABLE 1
Hierarchical Connectivity and Node Requirements for a Fully Connected Network with 10 Nodes and 90 Connections

                            CLUSTER1    CLUSTER2    CLUSTER3
Intercluster  Cluster
Connectivity  Size        CON  NOD    CON  NOD    CON  NOD
     3          3         117   37    110   30    110   30
                5         100   20    100   20    100   20
                8         105   25     94   14     94   14
     4          3         104   24    101   21    101   21
                5         100   20    100   20    100   20
                8         102   22     92   12     92   12
     5          3         100   20    100   20    100   20
                5          90   10     90   10     90   10
                8          90   10     90   10     92   12
we are finished; otherwise we add collector nodes (i.e. functionally decompose the inputs) and go back to the stage where we determined the number of clusters required.

The authors also investigated two modifications to the basic algorithm. The first modification (referred to as CLUSTER2) contained a much simpler binary distance function based on the intercluster connectivity (d = 0 inside the cluster and d = 1 between clusters). This is more suitable for the K-means part of the algorithm as the intercluster connectivity is the only remaining constraint that has to be satisfied. The switching algorithm maintains cluster size, and the initial functional decomposition eliminates any intracluster connectivity considerations. This algorithm has the same basic flowchart as in Figure 2 except that it is now possible to keep a running count of the error (sum of distances) for each case, and to end whenever the error for each case is less than the intercluster connectivity (i.e. the error is now a direct measure of the intercluster connectivity). Another variation (CLUSTER3) that was experimented with was to make only a single switch for each case, through each pass of the K-means algorithm. The intent was to reduce the number of switches and thereby minimize the search space and computational complexity. This is desirable as in some situations we may switch a node a number of times with minimal error reduction only to have all those switches superseded by a later switch that has a substantial error reduction.
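A minimal sketch of the CLUSTER2-style switching stage, using the binary distance so that the error is simply the count of intercluster connections. The data layout (clusters as lists of node ids, the network as a list of directed edges) is our own illustrative choice, not taken from the original implementation.

```python
from itertools import combinations

def intercluster_error(clusters, edges):
    """Binary-distance error: number of connections crossing clusters."""
    where = {n: i for i, c in enumerate(clusters) for n in c}
    return sum(1 for a, b in edges if where[a] != where[b])

def switching_pass(clusters, edges):
    """One K-means-style pass: try swapping every pair of nodes in
    different clusters, keeping a swap only if the error drops.
    Cluster sizes are preserved, as in CLUSTER2."""
    err = intercluster_error(clusters, edges)
    for i, j in combinations(range(len(clusters)), 2):
        for ai in range(len(clusters[i])):
            for bj in range(len(clusters[j])):
                clusters[i][ai], clusters[j][bj] = clusters[j][bj], clusters[i][ai]
                new_err = intercluster_error(clusters, edges)
                if new_err < err:
                    err = new_err   # keep the improving swap
                else:               # undo a non-improving swap
                    clusters[i][ai], clusters[j][bj] = clusters[j][bj], clusters[i][ai]
    return err

# Two triangles, deliberately mis-clustered: one pass untangles them.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
clusters = [[0, 1, 3], [2, 4, 5]]
print(switching_pass(clusters, edges))  # 0
```

In practice the pass would be repeated, and the run would stop once each cluster's crossing-connection count falls within the hardware's intercluster limit.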
3.2. Test Case Mapping Results The authors tested the three algorithms on a number of test cases. The basic architectures looked at were single-layer fully connected networks, multilayered feed-forward networks with full connectivity between layers, hierarchical feed-forward layered networks developed using feature space partitioning (Mason, 1991), and finally reduced connectivity alpha networks (Bailey, 1988), which are a crude approximation of the cortical structure of the brain. The results of the 10-node fully connected network can be seen in Table 1. It should be noted that it is not possible to make a comparison to an optimum due to the factorial explosion of possible partitions. We will therefore have to rely on a relative measure between algorithms. Because our goal is to determine whether hierarchical networks map better to hierarchical hardware and we will be using the same algorithms on all networks, an absolute measure is not required. The results show the number of connections (CON) and nodes (NOD) required as a function of intercluster connectivity and number of nodes per cluster. As mentioned, the simulations were carried out in three different versions. Version one (CLUSTER1) has an integer distance metric, CLUSTER2 has a binary distance metric, and CLUSTER3 has a binary distance metric but performs only a single switch in each pass. The intercluster connectivity and nodes per cluster are necessarily limited to small values relative to the number of nodes in the network. This corresponds to the type of situation we experience when we map large networks to parallel hardware. We can see that, due to the connectivity limitations, we generally require larger numbers of nodes and connections as compared to the original network of 10 nodes and 90 connections. CLUSTER2 and CLUSTER3 with their binary distance
TABLE 2
Hierarchical Connectivity and Node Requirements for a 10-10-10 Feed-forward Network with 30 Nodes and 200 Connections

                            CLUSTER1    CLUSTER2    CLUSTER3
Intercluster  Cluster
Connectivity  Size        CON  NOD    CON  NOD    CON  NOD
     3         10         271  101    245   75    257   87
               15         268   98    225   55    241   71
               20         262   92    214   44    239   69
     4         10         231   61    213   43    213   43
               15         229   59    200   30    210   40
               20         207   37    200   30    200   30
     5         10         204   34    200   30    203   33
               15         210   40    200   30    209   39
               20         205   35    200   30    200   30

TABLE 3
Hierarchical Connectivity and Node Requirements for a Sample Hierarchical Network with 26 Nodes and 90 Connections

                            CLUSTER1    CLUSTER2    CLUSTER3
Intercluster  Cluster
Connectivity  Size        CON  NOD    CON  NOD    CON  NOD
     2          5         103   39    103   39    103   39
               10          90   26     99   35    106   42
               20          94   30     90   26     93   29
     3          5         100   36     92   28    101   37
               10          90   26     90   26     92   28
               20          90   26     90   26     90   26
     4          5          90   26     90   26     90   26
               10          90   26     90   26     90   26
               20          90   26     90   26     90   26
measure give better performance than CLUSTER1. This was found to be true in almost every simulation. CLUSTER2 and CLUSTER3 have nearly identical results with only one simulation showing any variation, even though CLUSTER3 has substantially fewer calculations. This unfortunately was not found to be a general result.

The results of clustering the 10-10-10 feed-forward network can be seen in Table 2. Once again we see that with a highly connected network we require considerably larger numbers of connections and nodes (initially 30 nodes and 200 connections). Another observation is that CLUSTER2 outperforms CLUSTER3, especially for low intercluster connectivity. By performing a single switch on each pass we effectively reduce the search space we are covering. For low intercluster connectivity we reduce the degrees of freedom and we are therefore more likely to get caught in a local minimum from which we cannot escape.

Figure 3 is a hierarchical network with 26 nodes and 90 connections. This network is typical of the structures that result from the application of the feature space partitioning techniques (Mason, 1991). The results of clustering this network can be seen in Table 3. Except when there is very low intercluster connectivity, only a marginal increase in the number of connections and nodes is observed. One would expect
this, as the starting network has low connectivity. One anomaly appears when the intercluster connectivity is 2 and the cluster size is 10. We see that CLUSTER2 and CLUSTER3 have much larger numbers of nodes and connections than CLUSTER1. This is the result of poor starting cluster selection. Using a modified algorithm with different starting clusters can improve these results.

An alpha network with 49 nodes and 129 connections can be seen in Figure 4. The results of clustering this network are shown in Table 4. Once again we see that CLUSTER2 has the best performance, and the inherent hierarchical connectivity (high local connectivity and sparse global connectivity) maps naturally to the cluster structures. For instance, if we compare Table 1 and Table 3 we see that both networks have the same number of connections although the hierarchical network has more than two and a half times the number of nodes. Assuming that both networks had the same functionality, then, as one would expect, the superiority of one or the other implementation (final number of nodes required) is a function of the cluster size and the intercluster connectivity.
FIGURE 3. A hierarchical multilayered network.

FIGURE 4. A 49-node alpha network with α = 4.
TABLE 4
Hierarchical Connectivity and Node Requirements for an Alpha Network with 49 Nodes and 129 Connections

                            CLUSTER1    CLUSTER2    CLUSTER3
Intercluster  Cluster
Connectivity  Size        CON  NOD    CON  NOD    CON  NOD
     3          5         158   78    140   60    149   69
               10         152   72    133   53    138   58
               20         130   50    129   49    130   50
     4          5         135   55    131   51    133   53
               10         133   53    129   49    136   50
               20         129   49    129   49    129   49
     5          5         131   51    129   49    136   50
               10         129   49    129   49    129   49
               20         129   49    129   49    129   49

TABLE 5
Hierarchical Connectivity and Node Requirements for a Fully Connected and a Binary Tree MAXNET

                         Fully Connected   Binary Tree
Intercluster  Cluster
Connectivity  Size        CON   NOD        CON   NOD
     2          2         464   240        240   136
                3         448   224        217   113
                4         432   208        218   114
     3          2         352   128        211   107
                3         336   128        210   106
                4         320   112        210   106
     4          2         320    96        210   106
                3         320    96        210   106
                4         304    80        210   106
An important observation is that as we decrease the ratios of cluster size to total number of nodes, and intercluster connectivity to average node connectivity, the hierarchical network becomes increasingly desirable. There appears to be no general means of addressing equivalent functionality; however, there are a number of restricted problems where models with identical functionality have variable degrees of hierarchy. In addition, the work on feature space partitioning (Mason, 1991) allows us to examine a broad category of models where we can generate networks with increased hierarchy.
3.3. Application Mapping Results

Let us first look at a restricted problem where two equivalent models exist. A network that picks the maximum of a number of inputs (MAXNET) was implemented as a fully connected network (Hopfield & Tank, 1986). The inputs are first applied to the network and then removed. The network then iterates until the output of only one node is positive. Another technique for picking the maximum uses comparator subnets (Martin, 1970). These subnetworks use threshold logic nodes to pick the maximum of two inputs and then feed this maximum value forward. Comparator subnets can be configured into approximately log2(M) layers to pick the maximum of M inputs. For a fully connected MAXNET with N inputs we require O(N) nodes and O(N²) connections. The binary tree implementation requires O(N) nodes and O(N log N) connections. Therefore, for large N the tree structures will obviously have a better mapping. A fully connected network always has a smaller number of nodes and for small N has fewer connections. The crossover point (in terms of connectivity) occurs at approximately N = 16, where the fully connected network has 32 nodes (including input nodes) and 256 connections, whereas the tree network has 106 nodes and 210 connections.
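The node and connection counts quoted above can be tabulated with a short sketch. The per-comparator costs (6 nodes and 14 connections per two-input comparator subnet) are not stated explicitly in the text; we inferred them from the quoted totals for N = 16, so they should be read as assumptions.

```python
def fully_connected_maxnet(n):
    """n input nodes plus n mutually interconnected network nodes."""
    return {"nodes": 2 * n, "connections": n * n}

def tree_maxnet(n, nodes_per_comp=6, conns_per_comp=14):
    """Binary comparator tree: n - 1 two-input comparator subnets.
    Per-comparator costs are assumptions inferred from the paper's totals."""
    comps = n - 1
    return {"nodes": n + nodes_per_comp * comps,
            "connections": conns_per_comp * comps}

# The crossover quoted in the text at N = 16:
print(fully_connected_maxnet(16))  # {'nodes': 32, 'connections': 256}
print(tree_maxnet(16))             # {'nodes': 106, 'connections': 210}
```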
The result of clustering the two different MAXNET networks with N = 16 can be seen in Table 5. The simulation uses only the CLUSTER2 algorithm given its superior performance. It is evident that with the given intercluster connectivities and cluster sizes the hierarchical binary tree implementations have more efficient mappings.

A representative example using hierarchical networks is from a speech recognition problem (Mason, 1991). This problem required a 2-40-10 conventional BP network or a 61-node hierarchical network, which were both developed for classifying similar words with different vowels. These networks perform essentially the same function, so a clustering of the two should offer a quantitative measure of hardware resource utilization. The BP network initially had 52 nodes (including input nodes) and 480 connections whereas the 61-node hierarchical network had 184 connections. The clustering results, again with only the CLUSTER2 algorithm, are presented in Table 6. Once again we see that the hierarchical implementation is superior.

TABLE 6
Hierarchical Connectivity and Node Requirements for Conventional Back Propagation and our Hierarchical Network Implementations for the Vowel Classification Problem
                          Back Prop.    Hierarchical
Intercluster  Cluster
Connectivity  Size        CON   NOD     CON   NOD
     2          2         850   422     270   125
                3         840   412     263   118
                4         830   402     250   105
     3          2         660   232     227    82
                3         660   232     229    84
                4         650   222     220    75
     4          2         600   172     217    72
                3         600   172     216    71
                4         592   164     216    71
4. CONCLUSION

The reliance on two-dimensional silicon and PCB structures presents a problem for electronic ANNs in that the high global connectivity in many neural network models cannot be directly supported (one-to-one mapping between connections and wires). Hierarchical networks offer one solution to this problem; however, generating hierarchical networks is only part of the problem as it is still necessary to map these networks to hierarchical hardware. The important question of whether or not hierarchical networks map better to hierarchical hardware can only be answered by looking at the function as well as the structure of the networks. Although intuitively the answer would appear to be yes, a general proof is difficult given the lack of a direct translation from an initial network into a more hierarchical form. If, however, we assume that network function is directly related to the number of connections, it has been shown that hierarchical networks do map more efficiently to hierarchical hardware when the average network connectivity is significantly greater than the connectivity at the different levels of hardware.

Networks have a natural representation as graphs, and as such, many of the results of graph theory can be applied. Mapping networks is intimately related to minimizing graph partitions. Although numerous theorems have been developed for partitioning graphs they invariably deal with only a restricted set of graphs. Partitioning arbitrary graphs is NP-complete and as a consequence heuristic algorithms are often required when dealing with arbitrary networks. A common starting point for heuristic algorithms is the incorporation of graph connectivity information such as connected regions, SCRs, MSCRs, and cliques.
Two major limitations with graph connectivity information are the general difference in graph and hardware granularity, which necessitates additional partitioning techniques, and the computational expense of generating such information. The network partitioning problem can be recast as a general clustering problem with constraints. The authors have implemented a number of clustering algorithms that maintain good results while relying on graph connectivity measures only during initial cluster formation. As with most clustering algorithms the selection of an appropriate distance measure is of critical importance. It was found that a simple binary distance function based on intercluster connectivity ("1" outside the cluster and "0" inside) offered the best results. This is due to the fact that the intercluster connectivity is the primary hard constraint to be satisfied. An important observation was that as there is a decrease in the ratios of cluster size to total number of nodes, and/or intercluster
connectivity to average node connectivity, the hierarchical networks have increasingly superior mappings. This directly supports the theoretical analysis.

The network mapping algorithms can also be used as a general tool for mapping neural networks to hierarchical VLSI hardware. The algorithms are functionally correct; however, the algorithm that returns the best results is computationally expensive, O(n^2), when looking at large networks. When attempts were made to reduce the search space, there was significant degradation in the results. Part of the future work will be to analyze the trade-offs between computation and mapping results and, if necessary, to develop new algorithms for mapping large networks. Another area for future study is mapping networks to specific VLSI architectures.

REFERENCES

Bailey, J. (1988). A VLSI interconnect structure for neural networks (Tech. Rep. CS/E-88027). Beaverton, OR: Dept. of Computer Science/Engineering, Oregon Graduate Centre.
Carpenter, G. A., & Grossberg, S. (1988). The ART of adaptive pattern recognition by a self-organizing neural network. Computer, 3, 77-90.
Cox, D. R. (1957). Note on grouping. Journal of the American Statistical Association, 52, 543-547.
Edelman, G. M. (1987). Neural Darwinism. New York: Basic Books.
Feldman, J. A., & Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205-254.
Fukushima, K., & Miyake, S. (1982). Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15, 445.
Garth, S. (1987). A chipset for high speed simulation of neural network systems. IEEE Conference on Neural Networks, San Diego, CA.
Grossberg, S. (1988). Image fusion. In Darpa neural network study (pp. 485-492). Fairfax, VA: AFCEA International Press.
Gschwendtner, A. B. (1988). Darpa neural network study. Fairfax, VA: AFCEA International Press.
Hampson, S. E., & Volper, D. J. (1987). Disjunctive models of Boolean category learning.
Biological Cybernetics, 53, 203-217.
Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley and Sons.
Hartley, R., & Szu, H. (1987). A comparison of the computational power of neural network models. Proceedings of the IEEE International Conference on Neural Networks, 2, 15-22.
Hecht-Nielsen, R. (1988). Neurocomputing: Picking the human brain. IEEE Spectrum, 3, 36-41.
Hinton, G. E. (1980). Draft technical report. La Jolla, CA: University of California at San Diego.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554-2558.
Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625-633.
Hubbard, W. (1986). Electronic neural networks. Proceedings of AIP Conference on Neural Networks for Computing, pp. 227-234.
Kernighan, B. W., & Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49, 291-307.
Kung, S. Y. (1988). Parallel architectures for artificial neural nets.
Proceedings of the International Conference on Neural Networks, 2, 165-172.
Leiserson, C. E. (1983). Area-efficient VLSI computation. Doctoral dissertation, Cambridge, MA: MIT Press.
Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE Acoustics, Speech, and Signal Processing Magazine, 4, 4-22.
Martin, T. (1970). Acoustic recognition of a limited vocabulary in continuous speech. Doctoral dissertation, Dept. of Electrical Engineering, University of Pennsylvania.
Mason, R. (1991). Hierarchical VLSI neural networks. Doctoral dissertation, Dept. of Electrical Engineering, Technical University of Nova Scotia.
McClelland, J. L., & Rumelhart, D. E. (1988). Explorations in parallel distributed processing (handbook of models, programs, and exercises) (pp. 200-201). Cambridge, MA: MIT Press.
Murray, A. F., & Smith, A. V. (1988). Asynchronous VLSI neural networks using pulse-stream arithmetic. IEEE Journal of Solid-State Circuits, 23, 688-697.
Olson, W. W., & Huang, Y. (1989). Toward systemic neural network modelling. International Joint Conference on Neural Networks, 2, 602.
Oyster, M. (1988). Target recognizer. In Darpa neural network study (pp. 451-455). Fairfax, VA: AFCEA International Press.
Rumelhart, D. E., McClelland, J. L., et al. (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.
Sebestyen, G. S. (1962). Decision-making processes in pattern recognition. New York: Macmillan.
Sietsma, J., & Dow, R. J. F. (1988). Neural net pruning--why and how. Proceedings of the International Conference on Neural Networks, 1, 325-332.
Tarjan, R. (1973). Enumeration of the elementary circuits of a directed graph. SIAM Journal on Computing, 2, 211-216.
Tiernan, J. C. (1970). An efficient search algorithm to find the elementary circuits of a graph. Communications of the ACM, 13, 722-726.
NOMENCLATURE
C_t     total number of clusters
M_t     total number of connections
M_n     average number of connections per node
M_n1    average node connectivity for network 1
M_n2    average node connectivity for network 2
M_c     intercluster connectivity
M_no    average number of connections outside the cluster
M_ni    average number of connections inside the cluster
N_new   number of new nodes
N_init  initial number of nodes
N_t     total number of nodes
S_c     number of nodes per cluster