System Partitioning and Its Measure

L. A. Belady and C. J. Evangelisti
IBM T. J. Watson Research Center
Program modules and data structures are interconnected by calls and references in software systems. Partitioning these entities into clusters reduces complexity. For very large systems manual clustering is impractical. A method to perform automatic clustering is described, and a metric to quantify the complexity of the resulting partition is developed.
INTRODUCTION

A system typically consists of components; any two components may or may not be related. A simple representation of such a system is a graph consisting of nodes (components or elements) and edges (relations or connections) between node pairs; another representation is a binary matrix in which the value 1 means that the corresponding row and column components are connected. Figure 1 illustrates such a graph and matrix. Only half the matrix is filled, since the other half is its mirror image. Many such systems occur in everyday life: employees of an office building whose relation is defined by an organizational hierarchy, for example, or circuit elements connected by wires. However, the focus of this effort has been on software components such as program modules and data structures and their interconnections.

Sometimes the number of components is large. This is not necessarily difficult for machine manipulation, but it is prohibitively complicated for human comprehension. For example, it is difficult to maintain and modify large systems built of several hundred modules and spanning thousands of relations. It is desirable to take a cluster of a manageable number of modules and focus at one time only on those connections that are within the cluster.
This makes the task simpler since, at least temporarily, one may ignore the rest of the system. In the meantime someone else may work on other clusters in a similarly isolated fashion, or the same person may process the clusters sequentially. The method so far looks less complex, but the task remains to verify, and if necessary properly adjust, the connections that cut across cluster boundaries. The art of such partitioning is to absorb as many relations between elements within a cluster as possible and thus leave few intercluster connections.

This article describes our work on the automatic clustering of a large number of program modules and data structures. A measure of complexity for comparing the quality of partitions of the same collection of related software components is also given.
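The graph-and-matrix representation of Figure 1 can be made concrete with a small sketch. The following Python fragment is ours, not part of the original study; the component names A-E follow the figure, but the particular relations and the use of the numpy package are illustrative assumptions only.

# Minimal sketch: a small system as a symmetric 0/1 connection matrix, as in Figure 1.
import numpy as np

components = ["A", "B", "C", "D", "E"]                                     # components, as in Figure 1
relations = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]   # hypothetical relations

index = {name: k for k, name in enumerate(components)}
M = np.zeros((len(components), len(components)), dtype=int)                # connection matrix
for a, b in relations:
    i, j = index[a], index[b]
    M[i, j] = M[j, i] = 1                                                  # each relation appears twice, mirrored

print(M)

Each relation contributes a 1 above and a 1 below the diagonal, which is why only half the matrix needs to be inspected.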
PARTITIONING

The software components used in experimental partitioning (i.e., clustering) were program modules and data structures called control blocks in the Virtual Telecommunications Access Method (VTAM), an IBM operating subsystem. The information concerning which modules reference which control blocks was obtained from a logic manual for VTAM [1]. (The same information was used for a query system [2] to obtain information about composite relations among these software components. This system is useful for answering such questions as which set of modules references which control blocks; more complicated queries can be formed using AND and OR operators.) The clustering techniques used to obtain subsets of modules and control blocks are based on work by W. E. Donath [3, 4]. The program was originally designed for clustering digital circuits when laying out chip designs. In our application the primary inputs to the program are the set of named modules and their individually referenced control blocks.
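The kind of composite query mentioned parenthetically above can be expressed as set algebra over a module-to-referenced-control-blocks map. The sketch below is ours and is not the query system of reference [2]; the module and control-block names, the refs map, and the function modules_referencing are made up for illustration.

# Illustrative only: AND/OR queries over which modules reference which control blocks.
refs = {
    "MOD1": {"CB_A", "CB_B"},
    "MOD2": {"CB_B"},
    "MOD3": {"CB_A", "CB_C"},
}

def modules_referencing(*control_blocks, op="AND"):
    """Which modules reference all (AND) or any (OR) of the given control blocks?"""
    want = set(control_blocks)
    if op == "AND":
        return {m for m, cbs in refs.items() if want <= cbs}
    return {m for m, cbs in refs.items() if want & cbs}

print(modules_referencing("CB_A", "CB_B", op="AND"))   # {'MOD1'}
print(modules_referencing("CB_A", "CB_B", op="OR"))    # all three modules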
Figure 1. A graph and its corresponding connection matrix. A-E are components.
The two main parameters for the program are the number of clusters to be generated and the maximum number of nodes (modules and control blocks) allowed in each cluster. These two parameters greatly influence the partitioning. Neither a single cluster nor a very large number of clusters is very useful; the first parameter is used to find a good partition between the two extremes. If the second parameter allows too many nodes into a cluster, then a small number of clusters will contain most of the nodes, a situation close to partitioning into one cluster. The main goal was to investigate many partitions with different numbers of clusters while keeping reasonable constraints on the maximum number of nodes allowed in any cluster.

The clustering program considers modules and control blocks simply as nodes of a graph; two connected nodes represent a module referring to a control block. The connection matrix for the graph is used to generate N (typically 5) eigenvectors. The eigenvectors are generated so that their values can be used to place each node in N-space, in such a way that connected nodes are close to each other. A node is placed in N-space as follows: if the node is, say, the ninth node, then the ninth value of each eigenvector is taken, and together these values form the vector that represents the node in N-space. In clustering nodes, the program attempts to place nodes that are close to each other in N-space into the same cluster.
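The sketch below illustrates this eigenvector placement in the spirit of the description above; it is not Donath's program. As stand-ins it uses the low-order eigenvectors of the graph Laplacian derived from the connection matrix (the original program selects its eigenvectors differently), an off-the-shelf k-means routine (scikit-learn, an assumed tool) for grouping nearby points, and it does not enforce the cap on nodes per cluster. The six-node module/control-block example at the end is made up.

# Illustrative sketch of eigenvector-based clustering (not the original Donath program).
import numpy as np
from sklearn.cluster import KMeans

def cluster_nodes(num_nodes, edges, n_dims=5, n_clusters=5, seed=0):
    """Embed each node in n_dims-space so that connected nodes land close
    together, then group nearby nodes into clusters."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:                      # symmetric 0/1 connection matrix
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian built from the connection matrix

    # Eigenvectors for the n_dims smallest eigenvalues supply the coordinates;
    # row k of coords is node k's position in n_dims-space.
    _, vecs = np.linalg.eigh(L)
    coords = vecs[:, :n_dims]

    # Nodes that sit close together in this space go into the same cluster.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(coords)

# Hypothetical graph: modules 0-2 reference control blocks 3-5.
edges = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 5)]
print(cluster_nodes(6, edges, n_dims=2, n_clusters=2))   # e.g. [0 0 1 0 0 1] (label numbering arbitrary)

On this toy input the two connected groups {0, 1, 3, 4} and {2, 5} land at distinct positions in the embedding and are recovered as the two clusters.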
RESULTS OF PARTITIONING

Figure 2. A connection matrix with three clusters. The nodes are separated into modules and control blocks.

Figure 2 shows a connection matrix for a graph that has been subdivided into four parts. An edge between two nodes is represented by two 1s in the matrix, the absence of an edge by two 0s; the 1s and 0s are symmetrical with respect to the diagonal passing through the origin. Since a node is either a module or a control block, an edge always connects two different types of nodes. Figure 2 shows that if the nodes are separated by type and numbered so that the control blocks have the smaller numbers, then the lower-left and upper-right boxes of the connection matrix contain only 0s. This illustrates the claim that nodes of the same type have no edges between them. The upper-left box contains the same connectivity information as, and is equivalent to, the lower-right box because of the symmetry of the matrix.

When a graph is clustered, the results are plotted in the same axis system as the upper-left box. The modules (and control blocks) are assigned unique numbers so that those with the low numbers form the first cluster as determined by the clustering program; the last cluster contains those modules (and control blocks) with the highest numbers. A cluster contains both types of nodes. The edges are plotted as x and y points, with the axes representing modules and control blocks. In the upper-left box the three smaller boxes illustrate three clusters; as described before, the boxes define clusters because the nodes were numbered to correspond to their cluster. The two dimensions of a small box equal the numbers of modules and of control blocks in that cluster; half the perimeter of a box equals the total number of modules and control blocks associated with the cluster.

A connection matrix sometimes illustrates the nature of the connectivity of the nodes. Figure 3 shows the connectivity of the entire graph for VTAM without clustering. The dark vertical bar shows that a small number of control blocks are referenced by a large number of modules. It would take a large cluster to include the nodes involved; the cluster would have to include virtually all the modules, and the remaining clusters would be almost empty of nodes. For this reason, and because the control blocks involved were considered global, these 22 control blocks were later eliminated when running the clustering program.
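Referring back to the box description above, the dimensions of each cluster's box can be tallied directly from a partition. The sketch below is ours; the labels, node types, and the function box_dimensions are hypothetical.

# Sketch: dimensions of each cluster's "box" in a Figure 2-style plot.
from collections import defaultdict

def box_dimensions(labels, node_type):
    """For each cluster, return [number of modules, number of control blocks]."""
    dims = defaultdict(lambda: [0, 0])
    for lab, kind in zip(labels, node_type):
        dims[lab][0 if kind == "module" else 1] += 1
    return dict(dims)

labels = [0, 0, 1, 0, 0, 1]
node_type = ["module", "module", "module", "cb", "cb", "cb"]
print(box_dimensions(labels, node_type))   # {0: [2, 2], 1: [1, 1]}; half the perimeters are 4 and 2 nodes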
Figure 3. Connectivity of all modules and control blocks (module number plotted against control block number).

Figure 4. Five clusters with the highest-frequency blocks removed.

Figure 5. Five clusters with an increase in the maximum number of nodes allowed in a cluster.

Figure 6. Twenty clusters with fewer edges in cluster boxes.
Figure 4 shows the results of obtaining five clusters from the clustering program. There are many points (representing edges) in the boxes, but, as would be expected in a complex system, the incidence of points outside the boxes is high. These outside points represent edges between nodes in different clusters. As will be seen, an increasing number of clusters causes points to migrate outside the boxes. The constraints on the maximum number of nodes allowed in a cluster keep the number of nodes placed in each cluster balanced. Figure 5 illustrates five clusters but with the constraints on the maximum number of nodes allowed in a cluster relaxed; consequently the first two clusters received most of the nodes, and most of the edges are in the first two boxes. In Figure 6 the number of clusters is 20 and the maximum number of nodes allowed in a cluster is small; therefore the boxes are small and many edges fall outside them. In the extreme case, when the number of clusters is very large, few points would be in boxes.
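The points inside versus outside the boxes can be counted directly from a partition. The following sketch is ours, not the paper's tooling; the six-node graph, its labels, and the function edge_counts are hypothetical. The per-cluster and intercluster edge counts it produces are exactly the quantities used by the measure developed in the next section.

# Sketch: classify each edge as intracluster (inside a cluster box) or intercluster (outside).
from collections import Counter

def edge_counts(edges, labels):
    """Count intracluster edges per cluster and the number of intercluster edges."""
    per_cluster = Counter()
    intercluster = 0
    for i, j in edges:
        if labels[i] == labels[j]:
            per_cluster[labels[i]] += 1   # a point inside a cluster box
        else:
            intercluster += 1             # a point outside every box
    return per_cluster, intercluster

# Hypothetical six-node partition: two clusters plus one cross-connection.
edges = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 5), (2, 3)]
labels = [0, 0, 1, 0, 0, 1]
print(edge_counts(edges, labels))         # (Counter({0: 4, 1: 1}), 1)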
A MEASURE OF COMPLEXITY
Why does a system partitioned into clusters of elements appear simpler than the original? What makes one partition better than another in spite of the fact that the connectivity of elements is always the same and only their cluster assignment changes? To answer these questions we hypothesize, first, that understanding or manipulating interconnected elements is more difficult if their number is large. Second, given the elements, complexity is proportional to the number of connections. Consider a system whose jth cluster contains n_j nodes and e_j pairwise edges, that is, connections whose endpoints both lie in that cluster. Then the complexity of this single cluster is

\[ C_j = n_j e_j . \]
Figure 7. Complexity measure (complexity plotted against the number of clusters).
With K clusters in the system, the complexity of all individual clusters is

\[ \sum_{j=1}^{K} n_j e_j . \]

Either the clusters are disjoint, and then the above expression gives the total complexity, or there are intercluster connections whose contribution to complexity must be accounted for. Indeed, for a collection of clusters to be a system, in every cluster there must be at least one node connected to a node of another cluster. We propose that the complexity of such interconnected clusters is proportional to N, the total number of nodes in the system. Given N, the more intercluster edges there are, the more complex is the system of clusters. Let E_o be the number of such intercluster edges; then the additional complexity of these outside connections is

\[ C_o = N E_o , \]

which leads to the total complexity of a clustered system:

\[ C = \sum_{j=1}^{K} n_j e_j + N E_o . \]

A normalized measure is defined as

\[ c = \sum_{j=1}^{K} \frac{n_j}{N} \, \frac{e_j}{E} + \frac{E_o}{E} , \]

where E is the count of all connections found in the system, n_j/N and e_j/E are the fractions of nodes and edges absorbed in cluster j, and E_o/E is the intercluster fraction of all edges. Of course, c is bounded by 1, as in the case of one cluster.

Figure 7 is an application of the complexity measure just presented to VTAM. The optimum (i.e., minimum) complexity can be observed around K = 5 for clusters produced by the clustering program.
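A direct transcription of these formulas, applied to a small hypothetical partition, follows; the sketch and its numbers are ours (invented for illustration, not VTAM data).

# The complexity measures above, computed for a small invented partition.
def complexity(cluster_nodes, cluster_edges, total_nodes, total_edges, intercluster_edges):
    """C = sum(n_j * e_j) + N * E_o  and  c = sum((n_j/N)(e_j/E)) + E_o/E."""
    N, E, E_o = total_nodes, total_edges, intercluster_edges
    C = sum(n * e for n, e in zip(cluster_nodes, cluster_edges)) + N * E_o
    c = sum((n / N) * (e / E) for n, e in zip(cluster_nodes, cluster_edges)) + E_o / E
    return C, c

# Invented system: N = 10 nodes, E = 12 edges, two clusters with n = (5, 5) nodes,
# e = (4, 5) intracluster edges, and E_o = 3 intercluster edges.
print(complexity([5, 5], [4, 5], total_nodes=10, total_edges=12, intercluster_edges=3))
# prints approximately (75, 0.625)

For a single cluster (n_1 = N, e_1 = E, E_o = 0) the same function returns c = 1, the bound noted above.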
Figure 8. Decreasing number of intracluster edges (intracluster edges plotted against the number of clusters).
Figure 9. Approximate complexity.
The complexity measure is based on the numbers of nodes and edges in each cluster of a partition. An approximation to this measure uses the sum of intracluster connections over all the clusters in a partition. Figure 8 shows, for a given number of clusters, the total number of connections between nodes in the same cluster. The graph shows that as the number of clusters increases, clustering (i.e., the increase in the number of intracluster connections) becomes less effective. The normalized approximation to complexity is given by

\[ \hat{c} = \frac{E_i}{K E} + \frac{E_o}{E} , \]

where E_i is the total number of intracluster connections over all clusters in a partition. The approximation assumes that the number of nodes is approximately equal in each cluster; the substitution n_j = N/K in c transforms c into \hat{c}. The number of nodes in each cluster of a partition was found to be fairly evenly distributed. Figure 9 shows a plot of the approximation to complexity.
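Using the same invented numbers as in the earlier sketch (E_i = 4 + 5 = 9 intracluster edges, K = 2, E = 12, E_o = 3), the approximation, as reconstructed here from the substitution n_j = N/K, can be computed directly; because each cluster in that example holds exactly N/K nodes, \hat{c} coincides with c.

# The normalized approximation, on the same invented partition as before.
def approx_complexity(intracluster_edges_total, n_clusters, total_edges, intercluster_edges):
    """c_hat = E_i / (K * E) + E_o / E."""
    E_i, K, E, E_o = intracluster_edges_total, n_clusters, total_edges, intercluster_edges
    return E_i / (K * E) + E_o / E

print(approx_complexity(9, 2, 12, 3))   # 0.625, matching c for this evenly sized partition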
SUMMARY
Typically, a situation is complex if we must comprehend, or predictably manipulate, interrelated objects. The relations must be explored with an initially uncertain strategy. Thus the investigator usually spreads out the totality of elements in front of him so that each can be looked at instantly, while at the same time he explores a variety of connection paths. This scene is captured in our complexity measure, in that complexity is proportional to the "spread" (the count of elements under consideration) and to the richness of connectivity (the number of connections).
ACKNOWLEDGMENT

Wilm Donath gave freely of his time and advice about his programs and their use with both text input and graphic output.
REFERENCES

1. IBM Corporation, OS/VS2 MVS VTAM Logic, SY28-0621, February 1976.
2. H. A. Ellozy, personal communication.
3. W. E. Donath and A. J. Hoffman, Lower Bounds for the Partitioning of Graphs, IBM J. Res. Devel. 17(5) (1973).
4. W. E. Donath and A. J. Hoffman, Algorithms for Partitioning of Graphs and Computer Logic Based on Eigenvectors of Connection Matrices, IBM Tech. Disclosure Bull. 15(3) (1972).